Spelling correction at scale: Levenshtein distance, BK-trees, and symmetric deletion
If you’ve ever built a search feature, you’ve likely started with the "naive" approach: iterate through every word in your dictionary, calculate the edit distance, and pick the one with the lowest score. It works great for a list of 500 words. But when you scale to a full English dictionary of 200,000+ entries—like the one powering a2zdictionary.com—that approach hits a wall.
Let’s look at how to move from $O(N)$ brute force to sub-millisecond lookups.
1. The Foundation: Levenshtein Distance
The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another. The classic way to compute this is using dynamic programming.
```python
def levenshtein(s1, s2):
    rows, cols = len(s1) + 1, len(s2) + 1
    dist = [[0] * cols for _ in range(rows)]
    # Base cases: transforming to or from the empty string.
    for i in range(1, rows):
        dist[i][0] = i
    for j in range(1, cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution
            )
    return dist[-1][-1]
```
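The full matrix isn't actually needed: each row depends only on the row above it. Here is a space-optimized sketch that returns the same distances in $O(\min(M, N))$ memory (`levenshtein_two_rows` is an illustrative name, not a standard API):

```python
def levenshtein_two_rows(s1, s2):
    # Keep s2 as the shorter string so the rows stay small.
    if len(s2) > len(s1):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))  # row for the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # cost of deleting the first i chars of s1
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein_two_rows("kitten", "sitting"))  # 3
```

The classic "kitten" → "sitting" pair takes three edits (k→s, e→i, insert g), which makes a handy sanity check for either version.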
This is elegant, but it’s $O(M \times N)$ for every comparison. If you have a dictionary of 200,000 words, you are performing millions of operations just to correct a single typo. This is why your a2zwordfinder.com implementation might feel sluggish when a user mistypes a query.
2. The Scaling Wall
When you run this against a 200k-word dictionary, you are performing $O(N)$ operations per request. Even with a fast language like C++ or Rust, the latency adds up. You aren't just calculating distance; you are calculating it for every single word in the corpus. We need a way to prune the search space so we don't look at words that are obviously too far away.
3. BK-Trees: Pruning the Search Space
A BK-tree (Burkhard-Keller tree) is a metric-space index. It exploits the triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$.
In a BK-tree, you pick an arbitrary root word, and every other word is added as a child along an edge labeled with its distance to the parent. When searching for matches within a maximum distance $n$ of a query, you compute $d$, the distance from the query to the current node. By the triangle inequality, any word within $n$ of the query must hang off an edge labeled between $d - n$ and $d + n$, so every subtree outside that range can be discarded without computing a single distance inside it.
Here is a simplified implementation:
```python
class BKTree:
    """Metric-space index keyed on edit distance (Burkhard-Keller tree)."""

    def __init__(self, dist_func):
        self.dist_func = dist_func
        self.tree = None  # (word, {distance: child_subtree})

    def add(self, word):
        if self.tree is None:
            self.tree = (word, {})
            return
        curr = self.tree
        while True:
            d = self.dist_func(curr[0], word)
            if d in curr[1]:
                curr = curr[1][d]  # an edge with this distance exists: walk down
            else:
                curr[1][d] = (word, {})  # attach a new leaf at distance d
                break

    def search(self, query, n):
        results = []

        def _search(node):
            d = self.dist_func(node[0], query)
            if d <= n:
                results.append(node[0])
            # Triangle inequality: only edges labeled in [d - n, d + n]
            # can lead to a match, so every other subtree is skipped.
            for dist in range(max(0, d - n), d + n + 1):
                if dist in node[1]:
                    _search(node[1][dist])

        if self.tree is not None:  # guard against searching an empty tree
            _search(self.tree)
        return results
```
By pruning, you avoid calculating the Levenshtein distance for the vast majority of the dictionary.
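To see the pruning in action, here is a runnable end-to-end sketch. It repeats compact versions of the `levenshtein` and `BKTree` definitions above so it can be pasted and run on its own, and wraps the distance function in a counter to show how few computations the search actually performs on a toy dictionary:

```python
def levenshtein(s1, s2):
    rows, cols = len(s1) + 1, len(s2) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i
    for j in range(1, cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    return dist[-1][-1]

calls = 0  # counts distance computations

def counted_levenshtein(a, b):
    global calls
    calls += 1
    return levenshtein(a, b)

class BKTree:
    def __init__(self, dist_func):
        self.dist_func = dist_func
        self.tree = None

    def add(self, word):
        if self.tree is None:
            self.tree = (word, {})
            return
        curr = self.tree
        while True:
            d = self.dist_func(curr[0], word)
            if d in curr[1]:
                curr = curr[1][d]
            else:
                curr[1][d] = (word, {})
                break

    def search(self, query, n):
        results = []
        def _search(node):
            d = self.dist_func(node[0], query)
            if d <= n:
                results.append(node[0])
            for dist in range(max(0, d - n), d + n + 1):
                if dist in node[1]:
                    _search(node[1][dist])
        if self.tree is not None:
            _search(self.tree)
        return results

words = ["book", "books", "boo", "boon", "cook", "cake", "cape", "cart"]
tree = BKTree(counted_levenshtein)
for w in words:
    tree.add(w)

calls = 0  # reset: only count the search itself
hits = sorted(tree.search("bok", 1))
print(hits)  # ['boo', 'book']
print(f"{calls} distance computations vs {len(words)} for brute force")
```

Even on this tiny dictionary, the search touches only a fraction of the words; the savings grow dramatically as the dictionary does.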
4. The SymSpell Trick: Symmetric Delete
While BK-trees are a massive improvement, they still require tree traversal. If you need absolute maximum performance, you use Symmetric Delete (SymSpell).
The core insight of SymSpell is to pre-compute all possible deletions for every word in your dictionary up to a certain edit distance (usually 2).
- Pre-computation: For every word in your dictionary, generate all variations by deleting 1 or 2 characters. Store these in a hash map where the key is the "deleted" version and the value is a list of original words that produce this deletion.
- Lookup: When a user searches for a word, you generate the deletions for the query and check if they exist in your hash map.
Because you are looking up keys in a hash map rather than traversing a tree, each probe is $O(1)$ with respect to dictionary size; a full lookup costs $O(k)$, where $k$ is the number of deletion variants of the query. You aren't computing edit distances against the corpus at runtime; you are intersecting the query's precomputed deletions with the dictionary's.
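A minimal sketch of both phases, using a plain dict-of-sets as the index and a max edit distance of 2. One subtlety the summary glosses over: two words sharing a delete variant only guarantees an edit distance of at most twice the delete depth, so production implementations verify each candidate with a final edit-distance check before returning it (omitted here):

```python
from collections import defaultdict

def delete_variants(word, max_edits=2):
    """All strings reachable from `word` by deleting up to `max_edits` chars."""
    variants = {word}
    frontier = {word}
    for _ in range(max_edits):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    return variants

def build_index(dictionary, max_edits=2):
    # Pre-computation: map every delete variant back to its source words.
    index = defaultdict(set)
    for word in dictionary:
        for v in delete_variants(word, max_edits):
            index[v].add(word)
    return index

def lookup(query, index, max_edits=2):
    # Lookup: pure hash-map hits, no edit distances computed here.
    candidates = set()
    for v in delete_variants(query, max_edits):
        candidates |= index.get(v, set())
    return candidates  # in practice, filter by true edit distance <= max_edits

index = build_index({"hello", "help", "yelp", "world"})
print(sorted(lookup("helo", index)))  # ['hello', 'help', 'yelp']
```

"hello" matches because deleting one `l` yields "helo" exactly; "help" and "yelp" match through shared deeper deletions ("hel" and "el" respectively), which is why the final distance check matters in a real implementation.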
The Performance Reality Check
To put this into perspective, here is how these approaches stack up when searching a 200,000-word dictionary:
- Naive Brute Force: ~80ms per query. This is unusable for real-time search-as-you-type interfaces.
- BK-Tree: ~2ms per query. Excellent for most applications and very memory-efficient.
- SymSpell: ~0.1ms per query. This is the gold standard for high-traffic production systems. It trades memory (to store the massive hash map of deletions) for extreme speed.
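The memory trade-off in that last point is easy to quantify: ignoring duplicates from repeated letters, a word of length $L$ produces $\binom{L}{1} + \binom{L}{2}$ delete variants at max edit distance 2. A quick back-of-envelope check (the 200,000-word figure is from the comparison above; the 8-letter average word length is an assumption for illustration):

```python
from math import comb

def variant_count(length, max_edits=2):
    # Upper bound on delete variants; repeated letters collapse some of these.
    return sum(comb(length, k) for k in range(1, max_edits + 1))

print(variant_count(8))                 # 8 + 28 = 36
print(200_000 * (1 + variant_count(8)))  # ~7.4 million index entries
```

Tens of millions of hash-map entries is the price SymSpell pays for skipping distance calculations entirely.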
If you are building a tool where users expect instant feedback, start with a BK-tree to get a feel for metric-space indexing. If you find yourself hitting a bottleneck at scale, move to the Symmetric Delete approach. Your users will thank you for the sub-millisecond response times.
This article was originally published by DEV Community and written by Dean Gilley.