finding buffer iteration bugs with google's new code search

There's been a couple of articles and blog posts about finding bugs with google's new code search. Here's some interesting queries and results around one of my favorites: buffer iteration bugs.

For those that don't know, buffer iteration bugs don't usually involve any function calls -- it's generally all loop constructs and pointer arithmetic. Often, this construct is used as the equivalent of a strcpy():

while (*src != 0)
  *dst++ = *src++;

generally, people don't re-write strcpy() verbatim like the code above. In the parsing of protocols, email addresses, URLs, file formats, and what-have-you, you see things like this:

while (*src != '@')
  *dst++ = *src++;

This is the kind of bug in MSRPC that the Blaster worm exploited, though I first realised the widespread issue after seeing this HalfLife Server Exploit , and started working on detection mechanisms. Michael Howard had a nifty blog post/article where he actually posted the snippet of vulnerable Windows source code. I was fascinated with this kind of bug in 2003 years ago, because no code analysis tools really tried to find these kinds of bugs. I spoke about it quite a bit, and implemented a simple ad-hoc algorithm (documented at the Software Security Summit in 2004, and at the class Luis and I talk at BlackHat USA 2006 ) in an IDA plug-in to detect these simple cases with very few false positives. When BugScan 2.0 came around and we ditched IDA to write our own binary parsing functionality, the ad-hoc algorithm was thrown out and we were able to come up with a very elegant way of finding these kinds of bugs generically with extremely few false positives. That new, better way was also discussed in the aforementioned BlackHat class, which we'll be giving again at BlackHat Europe 2007 assuming enough people sign up for the class.

To this day, I know of no static analysis tools or products that even attempt to do this kind of analysis. Hopefully, bugreport will get there in the next few months; it's progressing nicely with its community of developers. Even if we can't do it up "proper", maybe we can just cheat a little and get some good results anyways via this google code search. There's a couple of barriers the google tool presents; mainly, the inability to match a regular expression across multiple lines. I can understand the need to limit the search for performance and practical reasons, but I'd really like the ability to select my preference for this in the advanced search (which gets remembered). For instance, an option that says "limit search to [n] lines", where 1 <= n <= 5, and defaults to 1. The lack of this ability greatly limits the results of the search below, which is unfortunate.

The first search that I came up with, and my friend Dan Moniz optimized a bit is this one . (Some Picasa-related wine code comes up on the first page as of this writing.) Just to explain the rationale, let's take a look at the pieces:

^[^\*|//]*while.*\s(\(\(|\()\*.+[+\ ]=[\ \*].*\+\+

First, we want to make sure we aren't looking at something that's a comment. Next, we're specifically looking for the while() looping construct as that appears to be the most common thing people use in C to introduce bugs like this (including Microsoft in MSRPC -- see Michael Howard's blog post, referenced above). Because we're limited to searching on one line, we search inside the open paren (or nested open parens) for a dereferenced variable name being incremented (*dst++), and assigned to from another dereferenced variable name being incremented (*src++). Phew. That's about it. I'm sure there's more ways to tweak this, but this was about 30 minutes of playing. People who went to the s-3con talk nearly 3 years ago, or came to the BlackHat classes this year may notice that my explanation of the regexp breakdown is almost exactly the same as when I talked about detecting the same type of bugs in binary code. Let me reiterate: code is code, binary versus source doesn't matter -- neither have a real advantage or disadvantage that matters for finding real bugs with automated analysis.

In Michael Howard's blog post mentioned above, he provides a listing for a simple perl program that does a regex to try and find these kinds of bugs in source code. It turns up some cool results as well (with some minor modifications by me). It has more potential for false positives, but skirts around the multi-line regexp match limitation of the google tool.

login is your name, passwd is 'burdell'. Of course, these searches don't produce an instant list of exploitable bugs; it just gives a good starting point. To find these bugs for real, you'd need the entire codebase, inter-function value tracking, taint tracking, and advanced control flow (with loop) analysis to really get the job done. Regardless, if anyone has suggestions on how to improve these searches/regexps, please post here.

PS: I would make note of this post on Dave Aitel's dailydave list, but he appears to be consistently censoring my posts for some reason.

Discussion:

showing all 0 messages

`YaK:: WebLog #535 Topic : 2006-10-11 21.33.24 matt : finding buffer iteration bugs with google's new code search`	`[Changes] [Calendar] [Search] [Index] [PhotoTags]`
	`[Back to weblog: pretention]`

finding buffer iteration bugs with google's new code search

Discussion:

(No messages)