Spelling correction at scale: Levenshtein distance, BK-trees, and symmetric deletion
If you’ve ever built a search feature, you’ve likely started with the "naive" approach: iterate through every word in your dictionary, calculate the edit distance, and pick the one with the lowest score. It works great for a list of 500 words. But when you scale to a full English dictionary of 200,000+ entries—like the one powering a2zdictionary.com—that approach hits a wall.
Let’s look at how to move from $O(N)$ brute force to sub-millisecond lookups.
1. The Foundation: Levenshtein Distance
The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another. The classic way to compute this is using dynamic programming.
```python
def levenshtein(s1, s2):
    rows, cols = len(s1) + 1, len(s2) + 1
    dist = [[0] * cols for _ in range(rows)]
    # Base cases: transforming to or from the empty string.
    for i in range(1, rows):
        dist[i][0] = i
    for j in range(1, cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution
            )
    return dist[-1][-1]
```
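The full matrix isn't actually needed: each row depends only on the row above it. Here is a space-optimized sketch that returns the same distances in $O(\min(M, N))$ memory (`levenshtein_two_rows` is an illustrative name, not a standard API):

```python
def levenshtein_two_rows(s1, s2):
    # Keep s2 as the shorter string so the rows stay small.
    if len(s2) > len(s1):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))  # row for the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # cost of deleting the first i chars of s1
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein_two_rows("kitten", "sitting"))  # 3
```

The classic "kitten" → "sitting" pair takes three edits (k→s, e→i, insert g), which makes a handy sanity check for either version.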
This is elegant, but it’s $O(M \times N)$ for every comparison. If you have a dictionary of 200,000 words, you are performing millions of operations just to correct a single typo. This is why your a2zwordfinder.com implementation might feel sluggish when a user mistypes a query.
2. The Scaling Wall
When you run this against a 200k-word dictionary, you are performing $O(N)$ operations per request. Even with a fast language like C++ or Rust, the latency adds up. You aren't just calculating distance; you are calculating it for every single word in the corpus. We need a way to prune the search space so we don't look at words that are obviously too far away.
3. BK-Trees: Pruning the Search Space
A BK-tree (Burkhard-Keller tree) is a metric-space index. It exploits the triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$.
In a BK-tree, you pick an arbitrary root word, and every other word is added as a child along an edge labeled with its distance to the parent. When searching for matches within a maximum distance $n$ of a query, you compute $d$, the distance from the query to the current node. By the triangle inequality, any word within $n$ of the query must hang off an edge labeled between $d - n$ and $d + n$, so every subtree outside that range can be discarded without computing a single distance inside it.
Here is a simplified implementation:
```python
class BKTree:
    """Metric-space index keyed on edit distance (Burkhard-Keller tree)."""

    def __init__(self, dist_func):
        self.dist_func = dist_func
        self.tree = None  # (word, {distance: child_subtree})

    def add(self, word):
        if self.tree is None:
            self.tree = (word, {})
            return
        curr = self.tree
        while True:
            d = self.dist_func(curr[0], word)
            if d in curr[1]:
                curr = curr[1][d]  # an edge with this distance exists: walk down
            else:
                curr[1][d] = (word, {})  # attach a new leaf at distance d
                break

    def search(self, query, n):
        results = []

        def _search(node):
            d = self.dist_func(node[0], query)
            if d <= n:
                results.append(node[0])
            # Triangle inequality: only edges labeled in [d - n, d + n]
            # can lead to a match, so every other subtree is skipped.
            for dist in range(max(0, d - n), d + n + 1):
                if dist in node[1]:
                    _search(node[1][dist])

        if self.tree is not None:  # guard against searching an empty tree
            _search(self.tree)
        return results
```
By pruning, you avoid calculating the Levenshtein distance for the vast majority of the dictionary.
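To see the pruning in action, here is a runnable end-to-end sketch. It repeats compact versions of the `levenshtein` and `BKTree` definitions above so it can be pasted and run on its own, and wraps the distance function in a counter to show how few computations the search actually performs on a toy dictionary:

```python
def levenshtein(s1, s2):
    rows, cols = len(s1) + 1, len(s2) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i
    for j in range(1, cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    return dist[-1][-1]

calls = 0  # counts distance computations

def counted_levenshtein(a, b):
    global calls
    calls += 1
    return levenshtein(a, b)

class BKTree:
    def __init__(self, dist_func):
        self.dist_func = dist_func
        self.tree = None

    def add(self, word):
        if self.tree is None:
            self.tree = (word, {})
            return
        curr = self.tree
        while True:
            d = self.dist_func(curr[0], word)
            if d in curr[1]:
                curr = curr[1][d]
            else:
                curr[1][d] = (word, {})
                break

    def search(self, query, n):
        results = []
        def _search(node):
            d = self.dist_func(node[0], query)
            if d <= n:
                results.append(node[0])
            for dist in range(max(0, d - n), d + n + 1):
                if dist in node[1]:
                    _search(node[1][dist])
        if self.tree is not None:
            _search(self.tree)
        return results

words = ["book", "books", "boo", "boon", "cook", "cake", "cape", "cart"]
tree = BKTree(counted_levenshtein)
for w in words:
    tree.add(w)

calls = 0  # reset: only count the search itself
hits = sorted(tree.search("bok", 1))
print(hits)  # ['boo', 'book']
print(f"{calls} distance computations vs {len(words)} for brute force")
```

Even on this tiny dictionary, the search touches only a fraction of the words; the savings grow dramatically as the dictionary does.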
4. The SymSpell Trick: Symmetric Delete
While BK-trees are a massive improvement, they still require tree traversal. If you need absolute maximum performance, you use Symmetric Delete (SymSpell).
The core insight of SymSpell is to pre-compute all possible deletions for every word in your dictionary up to a certain edit distance (usually 2).
- Pre-computation: For every word in your dictionary, generate all variations by deleting 1 or 2 characters. Store these in a hash map where the key is the "deleted" version and the value is a list of original words that produce this deletion.
- Lookup: When a user searches for a word, you generate the deletions for the query and check if they exist in your hash map.
Because you are looking up keys in a hash map rather than traversing a tree, each probe is $O(1)$ with respect to dictionary size; a full lookup costs $O(k)$, where $k$ is the number of deletion variants of the query. You aren't computing edit distances against the corpus at runtime; you are intersecting the query's precomputed deletions with the dictionary's.
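A minimal sketch of both phases, using a plain dict-of-sets as the index and a max edit distance of 2. One subtlety the summary glosses over: two words sharing a delete variant only guarantees an edit distance of at most twice the delete depth, so production implementations verify each candidate with a final edit-distance check before returning it (omitted here):

```python
from collections import defaultdict

def delete_variants(word, max_edits=2):
    """All strings reachable from `word` by deleting up to `max_edits` chars."""
    variants = {word}
    frontier = {word}
    for _ in range(max_edits):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    return variants

def build_index(dictionary, max_edits=2):
    # Pre-computation: map every delete variant back to its source words.
    index = defaultdict(set)
    for word in dictionary:
        for v in delete_variants(word, max_edits):
            index[v].add(word)
    return index

def lookup(query, index, max_edits=2):
    # Lookup: pure hash-map hits, no edit distances computed here.
    candidates = set()
    for v in delete_variants(query, max_edits):
        candidates |= index.get(v, set())
    return candidates  # in practice, filter by true edit distance <= max_edits

index = build_index({"hello", "help", "yelp", "world"})
print(sorted(lookup("helo", index)))  # ['hello', 'help', 'yelp']
```

"hello" matches because deleting one `l` yields "helo" exactly; "help" and "yelp" match through shared deeper deletions ("hel" and "el" respectively), which is why the final distance check matters in a real implementation.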
The Performance Reality Check
To put this into perspective, here is how these approaches stack up when searching a 200,000-word dictionary:
- Naive Brute Force: ~80ms per query. This is unusable for real-time search-as-you-type interfaces.
- BK-Tree: ~2ms per query. Excellent for most applications and very memory-efficient.
- SymSpell: ~0.1ms per query. This is the gold standard for high-traffic production systems. It trades memory (to store the massive hash map of deletions) for extreme speed.
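The memory trade-off in that last point is easy to quantify: ignoring duplicates from repeated letters, a word of length $L$ produces $\binom{L}{1} + \binom{L}{2}$ delete variants at max edit distance 2. A quick back-of-envelope check (the 200,000-word figure is from the comparison above; the 8-letter average word length is an assumption for illustration):

```python
from math import comb

def variant_count(length, max_edits=2):
    # Upper bound on delete variants; repeated letters collapse some of these.
    return sum(comb(length, k) for k in range(1, max_edits + 1))

print(variant_count(8))                 # 8 + 28 = 36
print(200_000 * (1 + variant_count(8)))  # ~7.4 million index entries
```

Tens of millions of hash-map entries is the price SymSpell pays for skipping distance calculations entirely.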
If you are building a tool where users expect instant feedback, start with a BK-tree to get a feel for metric-space indexing. If you find yourself hitting a bottleneck at scale, move to the Symmetric Delete approach. Your users will thank you for the sub-millisecond response times.
This article was originally published by DEV Community and written by Dean Gilley.