I recently came across a description of an interesting data structure called a Burkhard-Keller tree (BK tree). BK trees provide efficient lookup of the set of words that lie within a certain distance of a query word. For example, this could used to suggest corrections in a spell checker.
To play around with this I’ve written up a Python implementation of the tree. For example:
>>> tree = BKTree(levenshtein, dictionary_words)
>>> tree.query("ricohet", 2)
[(1, 'ricochet'), (2, 'richer'), (2, 'riches'),
(2, 'richest'), (2, 'ricochets')]
>>>
In the above example ‘levenshtein’ is a function that implements the Levenshtein Distance. This is commonly used to determine the distance between a word and its misspellings.
How much faster is a bk-tree than the brute force approach? Here are a few example queries compared with the brute-force time:
| word, distance | BK Tree | Brute Force |
| amphibious, 2 | 3s | 15s |
| ricochet, 2 | 3s | 11s |
| the, 2 | 0.2s | 6s |
You can see the BK tree is much faster for small distances. As the query distance increases the BK tree query time approaches that of the brute force method.
Of course, there’s no free lunch. Creating this tree of 57,024 words takes 94s on my system.
The source for this module is available here. Enjoy!
Interesting article! I don’t have your data so I can’t compare, but I’m sure that if you used the Levenshtein C extension module here: http://trific.ath.cx/resources/python/levenshtein/ your tree-building time would be much shorter.
Thanks for the pointer. I’ll take a 50X speedup any day: