35x Improved T-SQL Levenshtein Distance Algorithm... at a cost
At work, we noticed a considerable performance hit using a T-SQL implementation of the Levenshtein algorithm created by Michael Gilleland found here. It seems to me that his approach was to replicate the C algorithm found on Wikipedia in T-SQL rather than taking a step back and re-conceptualizing the algorithm from a T-SQL standpoint. Feel free to correct me, but it seems that the algorithm focuses on three things to find the shortest distance between two strings (what it takes to make them the same):
- How many character insertions are necessary
- How many character deletions are necessary
- How many character substitutions are necessary
More than that, a true Levenshtein function should return the numeric distance between the two strings, not a string value of this distance. Keeping this in mind, I started editing the algorithm to understand what was going on at a low level, but quickly found that I would be better off re-architecting the whole solution, not just to make it performant, but so that I would understand it completely. I followed the above three traits with some small additions for performance. I wanted to do the following:
- If either string is empty, return the length of the other string
- Find the longest string, and loop on that length
- Compare each character and see if they are equal
- If they are not, increment the difference counter
- See if an insertion will re-align the strings; if so, increment the right index forward
- Else, see if a deletion will re-align the strings; if so, increment the left index forward
- Return the number of differences
This new algorithm comes at the cost of not checking for insertions/deletions along the entire length of the comparison string, so it can only handle one consecutive insertion/deletion except at the end of the comparison string. This, in effect, means it no longer matches the full capabilities of the Levenshtein algorithm, but that is an acceptable cost for our needs.
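The steps above can be sketched outside of T-SQL as well. Here is a hypothetical Python version of the same single-pass logic (not from the original post; the function name is illustrative), using empty strings to mimic T-SQL's behavior when SUBSTRING runs past the end of a string:

```python
def approx_levenshtein(left: str, right: str) -> int:
    # If either string is empty, return the length of the other string.
    if not left:
        return len(right)
    if not right:
        return len(left)

    # Loop on the length of the longer string.
    compare_length = max(len(left), len(right))
    diff = 0
    li = ri = 0
    while li < compare_length:
        # Out-of-range reads yield "" to mirror T-SQL's SUBSTRING.
        lc = left[li] if li < len(left) else ""
        rc = right[ri] if ri < len(right) else ""
        if lc != rc:
            next_r = right[ri + 1] if ri + 1 < len(right) else ""
            next_l = left[li + 1] if li + 1 < len(left) else ""
            if lc == next_r:
                ri += 1   # an insertion re-aligns the strings
            elif next_l == rc:
                li += 1   # a deletion re-aligns the strings
            diff += 1     # count one difference either way
        li += 1
        ri += 1
    return diff
```

For example, `approx_levenshtein("brent", "brant")` returns 1, and `approx_levenshtein("bent", "brent")` recovers from the single insertion and also returns 1.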
So, how much does this help my performance?
Test Case: I was matching first and last names in a table of over 5,500 users. Using Mr. Gilleland's approach with a user near the end of the table, the query took ~28 seconds. I optimized that call down to 14 seconds (the function was being called twice in the proc). Still, 14 seconds is a heck of a long time.
Original Query: 28 seconds
Optimized Original Query (removed a second Levenshtein function call, using string lengths as a pre-filter instead): 14 seconds
My 'optimized' Levenshtein algorithm using the Optimized Original Query: with the same table and the same user, my 'improved' algorithm could now perform the same query in ~400 ms. That's a 35x improvement! All of these tests were run on MSSQL 2005.
Here's the code:
/****** Object: UserDefinedFunction [dbo].[LEVENSHTEIN] Script Date: 06/10/2009 09:36:44 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER function [dbo].[LEVENSHTEIN](
    @left varchar(100),
    @right varchar(100)
)
returns int
as
BEGIN
    DECLARE @difference int, @lenRight int, @lenLeft int,
            @leftIndex int, @rightIndex int,
            @left_char char(1), @right_char char(1),
            @compareLength int

    SET @lenLeft = LEN(@left)
    SET @lenRight = LEN(@right)
    SET @difference = 0

    -- If either string is empty, return the length of the other
    IF @lenLeft = 0
    BEGIN
        SET @difference = @lenRight
        GOTO done
    END
    IF @lenRight = 0
    BEGIN
        SET @difference = @lenLeft
        GOTO done
    END

    -- Loop on the length of the longer string
    IF (@lenLeft >= @lenRight)
        SET @compareLength = @lenLeft
    ELSE
        SET @compareLength = @lenRight

    SET @rightIndex = 1
    SET @leftIndex = 1
    WHILE @leftIndex <= @compareLength
    BEGIN
        SET @left_char = substring(@left, @leftIndex, 1)
        SET @right_char = substring(@right, @rightIndex, 1)
        IF @left_char <> @right_char
        BEGIN
            -- Would an insertion make them re-align?
            IF (@left_char = substring(@right, @rightIndex + 1, 1))
                SET @rightIndex = @rightIndex + 1
            -- Would a deletion make them re-align?
            ELSE IF (substring(@left, @leftIndex + 1, 1) = @right_char)
                SET @leftIndex = @leftIndex + 1
            SET @difference = @difference + 1
        END
        SET @leftIndex = @leftIndex + 1
        SET @rightIndex = @rightIndex + 1
    END

done:
    RETURN @difference
END
I have since edited this post to more correctly reflect the cost associated with the performance of my algorithm as it is not as powerful as the original Levenshtein algorithm. However, it does handle most real-world spelling errors that an admin would want to tolerate.
Hi Brent, can you show us a usage? I'd like to be able to query all records where a given field is less than a given Levenshtein distance from a given string.
Seems to me that if we know the maximum allowable distance, we can optimize further, i.e., throw out a lot of records without doing the full calculation.
I know it's been forever since you posted this reply, Matt, but I was looking at this again today and I agree. If you add an argument to the function for the max difference allowed, you can short-circuit the loop in many cases, and it should speed things up even further.
I would hazard a guess of 30-50% faster for a max difference of, say, 3 with an average string length of 10 or more. It would really help stabilize the average overall speed with strings that could be much longer, too.
I should have thought of that a long time ago (and you already did)!
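The short-circuit idea discussed above could look something like this in a hypothetical Python sketch of the simplified algorithm (the function name and `max_diff` parameter are illustrative, not from the post):

```python
def approx_levenshtein_max(left: str, right: str, max_diff: int) -> int:
    # Length pre-filter: strings whose lengths differ by more than
    # max_diff can never be within max_diff edits of each other.
    if abs(len(left) - len(right)) > max_diff:
        return max_diff + 1

    diff = 0
    li = ri = 0
    compare_length = max(len(left), len(right))
    while li < compare_length:
        # Out-of-range reads yield "" to mirror T-SQL's SUBSTRING.
        lc = left[li] if li < len(left) else ""
        rc = right[ri] if ri < len(right) else ""
        if lc != rc:
            if lc == (right[ri + 1] if ri + 1 < len(right) else ""):
                ri += 1   # an insertion re-aligns the strings
            elif (left[li + 1] if li + 1 < len(left) else "") == rc:
                li += 1   # a deletion re-aligns the strings
            diff += 1
            if diff > max_diff:
                return diff   # short circuit: already over the cap
        li += 1
        ri += 1
    return diff
```

The same two tricks (a cheap length check up front, plus bailing out of the loop as soon as the cap is exceeded) translate directly into the T-SQL function.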
Hi Brent,
If you combine the info you find on the following websites, you can speed that up even further. From 45 seconds to 1 ms, I kid you not, and you will have the full Levenshtein functionality!
http://www.stev.org/post/2011/02/05/MSSQL-Levenshtein.aspx
and
http://www.asp.net/data-access/tutorials/creating-stored-procedures-and-user-defined-functions-with-managed-code-cs
From the first website, I altered the function so it can be used straight away, following step 13 from the second website.
Here is the code:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class StoredFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
    public static SqlDouble Levenshtein(SqlString S1, SqlString S2)
    {
        if (S1.IsNull || S2.IsNull)
            throw new ArgumentNullException();

        // Compare case-insensitively
        String S3 = S1.Value.ToUpper();
        String S4 = S2.Value.ToUpper();
        int n = S3.Length;
        int m = S4.Length;
        int[,] d = new int[n + 1, m + 1];
        int cost = 0;

        if (n + m == 0) {
            return 100;   // two empty strings are a 100% match
        } else if (n == 0) {
            return 0;     // an empty string matches nothing
        } else if (m == 0) {
            return 0;
        }

        // Initialize the first row and column of the distance matrix
        for (int i = 0; i <= n; i++)
            d[i, 0] = i;
        for (int j = 0; j <= m; j++)
            d[0, j] = j;

        // Fill the matrix: each cell takes the cheapest of a deletion,
        // an insertion, or a (possibly free) substitution
        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                if (S3[i - 1] == S4[j - 1])
                    cost = 0;
                else
                    cost = 1;
                d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
            }
        }

        // Convert the edit distance into a similarity percentage
        double percentage = System.Math.Round((1.0 - ((double)d[n, m] / (double)System.Math.Max(n, m))) * 100.0, 2);
        return percentage;
    }
}
Cheers,
Tom
A very interesting approach, Tom, and it appears to be much faster, and, of course, more powerful b/c of your C#/.NET approach.
For the case where a user can build and use this code from a referenced DLL in their SQL Server instance, this is definitely better.
As an FYI for my approach, referencing a .NET DLL from our SQL Server instance was not feasible (as crazy as that may sound), so I didn't even bother traversing this path.
So, if the software absolutely must go pure SQL and you don't mind the cost, my solution should work, but I pity your deployment environment if that is the case (as I pity my old deployment environment - switched jobs since then). :)
I do like the SQL approach, though C# in a CLR will kick the crap out of it performance-wise as soon as you encounter any kind of "loop" in SQL.
The Stev.Org Guy :)
Wow, this is amazing. I am able to use this to search our inventory of 150k products in under 1 second. Awesome :)
Thank you for sharing this function! It has worked with 100% reliability for my specific purpose, and at a better-than-expected speed. I also am in a SQL environment that does not support CLRs, so this was a PERFECT solution. Thanks again!
Thanks a lot man. This will help me out a lot.
I just performance tested your T-SQL implementation against 6 others and yours came out on top with 1000 calls in 420 ms.
The only reason I won't be using your implementation is that I need the full version of the algorithm.
Amazing job sir
Found a bug:
SELECT [dbo].[LEVENSHTEIN]('moo', 'mo77o')
returns 3, but should return 2 (delete the 2 numeric characters from the 2nd string to get the 1st).
I wouldn't classify this as a bug b/c my solution has many limitations compared to a full-blown Levenshtein implementation. However, it is worth noting that this is one of them, b/c it doesn't look at the whole string to figure out removals/additions, but one character at a time.
So, the first '7' becomes an 'o', then the second '7' is removed, then the last 'o' is removed = 3 changes.
Thanks for the note!