mysql fun fact: you don't need tolower(), and indexing slows substring criteria

On MySQL 5.1, substring matches are case-insensitive, so tolower() does nothing except slow things down. Further, if you are doing substring searches and you index the columns searched against, the query will be twice as slow.

We were on a chore to explore what would need to happen to improve the responsiveness of a web-based search UI across a 500k records in a single table. With a substring criteria across 5 or 6 columns (amongst other things), the query was taking about 5 seconds. Our target was 500 milliseconds.

The first surprise was that the hibernate case-insensitive substring Criteria was wrapping the match for each column in a tolower(). My programming partner said that we didn't need that, since string matches are case-insensitive on MySQL in the first place. We sniffed the MySQL query, measured the time of 5 query runs, removed the 5 tolower() wrappers, and ran the query another 5 times. Removing the tolower() references one at a time, the results were consistent -- each tolower() was costing 300ms. Again, this was on 500k rows; the tolower() probably wouldn't make much difference on 50k rows or less.

We modified the Java code to create a case-sensitive substring criteria, which got rid of the tolower() references in the query. (We did this in a test-driven way by asserting that the method that created the Specification contained the case-sensitive Criteria instead of the case-insensitive Criteria.) This was kind of gross, since the query would actually be case-insensitive on the MySQL target. We later extracted a MySqlCaseInsensitiveCriteria class that mostly documented our intention, but we added a small comment with an explanatory URL as well.

At this point, our query was down to 2.5 seconds, still 5x our goal of 500ms.

The next thing was wanted to try was indexing the columns we were substring searching against. When we did this, the queries were literally twice as long. Our guess was that indexing string-based columns is meant to improve performance of prefix and suffix based string searches -- not substring searches. As such, MySQL appears to try and satisfy the query criteria for the column via the index, fails to do so, and then uses the non-index way.

Weird that indexing would actually *slow down* a query in any instance. Weird to us, anyways.

To squeeze more speed out, the customer created a story for allowing users to specify what kind of string matching they want by placing a wildcard character ('%') at the beginning, end, or beginning and end of the string. (Beginning meant suffix match, end meant prefix match, both meant substring.) We then created another story to use a prefix-based search on most of the columns, that would also be indexed.

It's a little ugly, but it makes decent sense in the context of this application. They'll just have to be really specific in the training and documentation.

The prefix-based search was much faster, even without an index. With an index, it was *really* fast. We still weren't under 500ms, but were close enough to make the customer happy.

whirred.

Discussion:

showing all 0 messages

`YaK:: WebLog #535 Topic : 2007-06-25 07.26.21 matt : mysql fun fact : you don't need tolower(), and indexing slows substring criteria`	`[Changes] [Calendar] [Search] [Index] [PhotoTags]`
	`[Back to weblog: pretention]`

mysql fun fact: you don't need tolower(), and indexing slows substring criteria

Discussion:

(No messages)