Levenshtein Distance
The Levenshtein distance, also known as edit distance, is a string metric for measuring the difference between two sequences. It represents the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. This algorithm is fundamental in many applications including spell checkers, DNA analysis, fuzzy search, and text similarity comparison.
The distance is named after Vladimir Levenshtein, who considered this distance in 1965.
How It Works
The Levenshtein distance between two strings is calculated using dynamic programming. The algorithm builds a matrix where each cell represents the minimum edit distance between substrings of the two input strings.
Operations
There are three basic operations allowed:
- Insertion - Add a character
- Deletion - Remove a character
- Substitution - Replace one character with another
Algorithm Steps
- Create a matrix of size
(m+1) × (n+1)wheremandnare the lengths of the two strings - Initialize the first row and column with increasing values (0, 1, 2, ...)
-
For each cell, calculate the minimum of:
- Cell above + 1 (deletion)
- Cell to the left + 1 (insertion)
- Cell diagonally above-left + 0 (if characters match) or + 1 (substitution)
-
The bottom-right cell contains the final Levenshtein distance
Example
Let's calculate the Levenshtein distance between "KITTEN" and "SITTING":
"" S I T T I N G
"" 0 1 2 3 4 5 6 7
K 1 1 2 3 4 5 6 7
I 2 2 1 2 3 4 5 6
T 3 3 2 1 2 3 4 5
T 4 4 3 2 1 2 3 4
E 5 5 4 3 2 2 3 4
N 6 6 5 4 3 3 2 3
The Levenshtein distance is 3, requiring these operations:
- Substitute K → S
- Substitute E → I
- Insert G at the end
Step-by-Step Matrix Calculation
Let's walk through how this matrix is built:
Initialization:
- Row 0 represents converting empty string to "SITTING" prefixes: "", "S", "SI", "SIT", etc.
- Column 0 represents converting "KITTEN" prefixes to empty string: "", "K", "KI", "KIT", etc.
- Each cell
[i][j]represents the minimum edits to transform the firsticharacters of "KITTEN" into the firstjcharacters of "SITTING"
Key calculations:
-
[1][1]: Transform "K" → "S"- Options: delete K + insert S (1+1=2), substitute K→S (1), insert S + keep K (1+1=2)
- Minimum: 1 (substitution)
-
[2][2]: Transform "KI" → "SI"- "K"→"S" costs 1, "I"→"I" costs 0
- Result: 1
-
[3][3]: Transform "KIT" → "SIT"- Previous "KI"→"SI" cost 1, "T"→"T" costs 0
- Result: 1
-
[5][5]: Transform "KITTE" → "SITTI"- At position [4][4]: "KITT"→"SITT" costs 1
- "E" vs "I": they don't match, so we add substitution cost
- Options: [4][5]+1=3, [5][4]+1=3, [4][4]+1=2
- Result: 2
-
[6][7]: Transform "KITTEN" → "SITTING" (final result)- At [5][6]: "KITTE"→"SITTIN" costs 3
- Need to handle "N" vs "G": they don't match
- Options: [5][7]+1=4, [6][6]+1=3, [5][6]+1=4
- Result: 3
The three operations needed:
- K → S (substitute at position 0)
- E → I (substitute at position 4)
- Insert G (at the end)
This gives us the minimum edit distance of 3.
Implementation
Here's the basic implementation in Rust:
fn levenshtein_distance(str1: &str, str2: &str) -> usize {
let chars1: Vec<char> = str1.chars().collect();
let chars2: Vec<char> = str2.chars().collect();
let m = chars1.len();
let n = chars2.len();
// Create a matrix to store distances
let mut dp = vec![vec![0; n + 1]; m + 1];
// Initialize first row and column
for i in 0..=m {
dp[i][0] = i;
}
for j in 0..=n {
dp[0][j] = j;
}
// Fill the matrix
for i in 1..=m {
for j in 1..=n {
let cost = if chars1[i - 1] == chars2[j - 1] { 0 } else { 1 };
dp[i][j] = std::cmp::min(
std::cmp::min(
dp[i - 1][j] + 1, // deletion
dp[i][j - 1] + 1, // insertion
),
dp[i - 1][j - 1] + cost, // substitution
);
}
}
dp[m][n]
}
fn main() {
println!("{}", levenshtein_distance("kitten", "sitting")); // Output: 3
println!("{}", levenshtein_distance("hello", "world")); // Output: 4
}
Time and Space Complexity
- Time Complexity: O(m × n) where m and n are the lengths of the two strings
- Space Complexity: O(m × n) for the standard implementation
Space-Optimized Version
Since we only need the previous row to calculate the current row, we can optimize space complexity to O(min(m, n)):
fn levenshtein_distance_optimized(str1: &str, str2: &str) -> usize {
let (str1, str2) = if str1.len() < str2.len() {
(str2, str1)
} else {
(str1, str2)
};
let chars1: Vec<char> = str1.chars().collect();
let chars2: Vec<char> = str2.chars().collect();
let m = chars1.len();
let n = chars2.len();
let mut prev_row: Vec<usize> = (0..=n).collect();
for i in 1..=m {
let mut curr_row = vec![i];
for j in 1..=n {
let cost = if chars1[i - 1] == chars2[j - 1] { 0 } else { 1 };
let distance = std::cmp::min(
std::cmp::min(
prev_row[j] + 1, // deletion
curr_row[j - 1] + 1, // insertion
),
prev_row[j - 1] + cost, // substitution
);
curr_row.push(distance);
}
prev_row = curr_row;
}
prev_row[n]
}
Applications in Fuzzy Search
1. Spell Checking and Correction
Levenshtein distance is widely used in spell checkers to find the closest matching words in a dictionary:
fn spell_check(word: &str, dictionary: &[&str], max_distance: usize) -> Vec<String> {
let mut suggestions: Vec<(String, usize)> = Vec::new();
for &dict_word in dictionary {
let distance = levenshtein_distance(word, dict_word);
if distance <= max_distance {
suggestions.push((dict_word.to_string(), distance));
}
}
// Sort by distance (closest matches first)
suggestions.sort_by_key(|&(_, distance)| distance);
// Extract just the words
suggestions.into_iter().map(|(word, _)| word).collect()
}
fn main() {
let dictionary = ["hello", "world", "rust", "algorithm", "computer"];
let suggestions = spell_check("helo", &dictionary, 2);
println!("{:?}", suggestions); // Output: ["hello"]
}
2. Fuzzy String Matching
For approximate string matching, we can use normalized Levenshtein distance to get a similarity score:
fn string_similarity(str1: &str, str2: &str) -> f64 {
let distance = levenshtein_distance(str1, str2);
let max_len = std::cmp::max(str1.len(), str2.len());
if max_len == 0 {
return 1.0;
}
1.0 - (distance as f64 / max_len as f64)
}
fn fuzzy_search(query: &str, candidates: &[&str], threshold: f64) -> Vec<(String, f64)> {
let mut results: Vec<(String, f64)> = Vec::new();
for &candidate in candidates {
let similarity = string_similarity(query, candidate);
if similarity >= threshold {
results.push((candidate.to_string(), similarity));
}
}
// Sort by similarity (best matches first)
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
results
}
fn main() {
let candidates = ["apple", "application", "apply", "orange", "grape"];
let matches = fuzzy_search("app", &candidates, 0.3);
for (word, similarity) in matches {
println!("{}: {:.2}", word, similarity);
}
// Output: apple: 0.60, apply: 0.60, application: 0.45
}
3. Database Search
Levenshtein distance can enhance database searches by finding records even when there are typos:
use std::collections::HashMap;
#[derive(Debug, Clone)]
struct Record {
data: HashMap<String, String>,
}
impl Record {
fn new() -> Self {
Record {
data: HashMap::new(),
}
}
fn insert(&mut self, key: String, value: String) {
self.data.insert(key, value);
}
fn get(&self, key: &str) -> Option<&String> {
self.data.get(key)
}
}
struct FuzzyDatabase {
records: Vec<Record>,
}
impl FuzzyDatabase {
fn new(records: Vec<Record>) -> Self {
FuzzyDatabase { records }
}
fn search(&self, query: &str, field: &str, max_distance: usize) -> Vec<(Record, usize)> {
let mut results: Vec<(Record, usize)> = Vec::new();
let query_lower = query.to_lowercase();
for record in &self.records {
if let Some(field_value) = record.get(field) {
let field_lower = field_value.to_lowercase();
let distance = levenshtein_distance(&query_lower, &field_lower);
if distance <= max_distance {
results.push((record.clone(), distance));
}
}
}
results.sort_by_key(|&(_, distance)| distance);
results
}
}
fn main() {
let mut users = Vec::new();
let mut user1 = Record::new();
user1.insert("name".to_string(), "John Smith".to_string());
user1.insert("email".to_string(), "[email protected]".to_string());
users.push(user1);
let mut user2 = Record::new();
user2.insert("name".to_string(), "Jane Doe".to_string());
user2.insert("email".to_string(), "[email protected]".to_string());
users.push(user2);
let mut user3 = Record::new();
user3.insert("name".to_string(), "Jon Snow".to_string());
user3.insert("email".to_string(), "[email protected]".to_string());
users.push(user3);
let db = FuzzyDatabase::new(users);
let matches = db.search("Jon", "name", 1);
for (record, distance) in matches {
if let Some(name) = record.get("name") {
println!("{} (distance: {})", name, distance);
}
}
}
Advanced Applications
1. DNA Sequence Alignment
In bioinformatics, Levenshtein distance helps compare DNA sequences to find similarities and mutations.
2. Machine Translation
Used to measure the quality of translations by comparing them to reference translations.
3. Version Control Systems
Git and other version control systems use edit distance concepts to efficiently store and merge file changes.
4. Plagiarism Detection
Academic and content management systems use edit distance to detect similar or copied content.
Limitations and Alternatives
Limitations
- Equal weighting: All operations (insert, delete, substitute) have the same cost
- Character-level only: Doesn't consider semantic meaning or word-level operations
- Performance: Can be slow for very long strings
Alternatives
- Damerau-Levenshtein Distance: Includes transposition operations
- Jaro-Winkler Distance: Better for short strings and names
- Hamming Distance: For strings of equal length only
- Cosine Similarity: For document comparison using vector representations
Best Practices
- Preprocessing: Convert strings to lowercase and remove unnecessary whitespace
- Thresholds: Use appropriate similarity thresholds based on your use case
- Performance: Use space-optimized versions for large datasets
- Caching: Cache results for frequently compared string pairs
- Early termination: Stop calculation if distance exceeds a threshold
The Levenshtein distance remains one of the most important and widely-used algorithms in text processing, forming the foundation for many modern fuzzy search and similarity matching applications.