Defining Achievement
TACACS Daemon DoS
"Troubleshooting a mysterious outage doesn't always provide an opportunity to discover critical security flaws in commercial software, but sometimes it does."
When I worked at AT&T we had an inside term for difficult, mysterious, long-running support cases: jackpots. Most of the techs I knew tried hard to avoid getting involved in anything that even resembled a jackpot. On the other hand, some of us lived for a chance to solve the next riddle, especially when no one else could.
One day I overheard a peer on the night shift complain about having to constantly restart the TACACS daemon on one of our servers. At this point I was already familiar with the many shortcomings of this particular piece of software (port blocking during reverse DNS lookups, hard-coded limits on forked process counts, a serious lack of debugging information), so I assumed my colleague's need to restart the problem process was a misguided response to a known issue. A couple of things bothered me about the symptoms described in this case, though. Why had it been happening every night for the last week, and why was it affecting only one of the servers? I knew firsthand that nothing had recently changed on this server from an operational standpoint. Was this a new problem we hadn't encountered before?
After I wrote a quick script to chomp through our usage logs, it became clear that the common thread before and during the outages was a series of repeated attempts by a recently provisioned network element to perform TACACS authentication. Further research revealed that this device was sourcing its traffic from a lab segment, and that it had never attempted authentication against any of the other servers. Most importantly, the logs showed that none of its previous attempts to speak the TACACS protocol had been successful. Since it would take a day or two to get in touch with the owner of this device, applying an IP filter rule and starting a tcpdump process was the best we could do to prevent another outage and further the investigation.
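The original script is long gone, but the idea behind it can be sketched in a few lines of Python. This is only an illustration under assumptions: the log line format, the `result=FAIL` field, and the names `LINE_RE`, `parse`, and `suspects` are all invented here, and the real usage logs looked different. It counts failed authentication attempts per source address in the window leading up to each daemon restart.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical log line format, invented for illustration:
#   2004-03-12 02:14:07 src=10.20.30.40 result=FAIL
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"src=(?P<src>\S+) result=(?P<result>\w+)"
)


def parse(lines):
    """Yield (timestamp, source, result) for each line that matches."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            yield ts, m.group("src"), m.group("result")


def suspects(lines, restarts, window=timedelta(minutes=30)):
    """Count failed attempts per source in the window before each restart.

    `restarts` is a list of datetimes at which the daemon was restarted.
    Returns (source, count) pairs, most frequent first.
    """
    counts = Counter()
    for ts, src, result in parse(lines):
        if result != "FAIL":
            continue
        # Keep only failures that fall shortly before some restart.
        if any(r - window <= ts <= r for r in restarts):
            counts[src] += 1
    return counts.most_common()
```

Fed a night's worth of log lines and the restart times from the shift notes, a source that shows up at the top of the list for every outage is worth a closer look.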
With traffic from the offending device blocked at the server, it only took one day for the owner to contact us. Our new friend indicated that the device in question was a new model of a popular layer 3 switch -- one we knew no one had ever gotten to work in our environment. The tcpdump output was enough to see that this equipment was not acting normally. When we temporarily removed the filter rules and enticed the device to send more authentication attempts our way, we quickly confirmed that whatever the switch was doing wrong was also responsible for breaking the TACACS daemon. I don't know if this is how denial of service conditions are usually discovered, but it's how this one was. After searching the vendor's website in vain for a bug report or hotfix, things got really interesting.
I wish I could say that this version of the software is no longer supported, got patched, or is so old that no one would possibly use it anymore, but unfortunately that isn't the case. As such, I can't document the specific details of the DoS here. Suffice it to say, reporting a repeatable security flaw in a security application doesn't always result in a timely solution. What I can talk about is the extent to which I continued to research the issue. Not happy with using a layer 3 switch as a TACACS sniper rifle, I focused on recreating the specific attack signature using the wonderful hping command line utility. It worked well enough, but wasn't exactly something I could count on the vendor downloading and experimenting with when I reported the flaw. I changed my focus to writing a proof of concept tool in my new favorite language: Python. Lucky for me, someone had already released Python modules for crafting and capturing raw TCP packets.
This was an exceptional example of how a jackpot situation helped me learn more about TCP/IP, programming, security, and vendors than any number of common issues ever could have. I may not have saved the day for thousands of our vendor's customers, received credit in a bugtraq or CVE posting, or even convinced the vendor to immediately fix a serious security oversight, but at least I learned a thing or two about a thing or two. From a business standpoint, the most important benefit of this exercise was that I knew exactly why it was important to upgrade this piece of software as soon as a new version became available, and I already had a tool in hand to verify that the flaw had been resolved.
-ksp