35x Improved T-SQL Levenshtein Distance Algorithm... at a cost
At work, we noticed a considerable performance hit using a T-SQL implementation of the Levenshtein algorithm created by Michael Gilleland found here. It seems to me that his approach was to replicate the C algorithm found on Wikipedia in T-SQL rather than taking a step back and re-conceptualizing the algorithm from a T-SQL standpoint. Feel free to correct me, but it seems that the algorithm focuses on three things to find the shortest distance between two strings (what it takes to make them the same):
- How many character insertions are necessary
- How many character deletions are necessary
- How many character substitutions are necessary
More than that, a true Levenshtein function should return the numeric distance between the two strings, not a string value of this distance. Keeping this in mind, I started editing the algorithm to understand what was going on at a low level, but quickly found that I would be better off re-architecting the whole solution, not just to make it performant, but so that I would understand it completely. I followed the above three traits with some small additions for performance. I wanted to do the following:
- If either string is empty, return the length of the other string
- Find the longest string, and loop on that length
- Compare each character and see if they are equal
- If they are not, increment the difference counter
- See if an insertion will re-align the strings; if so, increment the right index forward
- Else, see if a deletion will re-align the strings; if so, increment the left index forward
- Return the number of differences
This new algorithm comes at the cost of not checking for insertions/deletions along the entire length of the comparison string, so it can only handle one consecutive insertion/deletion except at the end of the comparison string. This, in effect, means it no longer matches the full capabilities of the Levenshtein algorithm, but that is an acceptable cost for our needs.
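The steps above can be sketched outside of T-SQL as well. Here is a hypothetical Python version of the same single-pass logic (not from the original post; the function name is illustrative), using empty strings to mimic T-SQL's behavior when SUBSTRING runs past the end of a string:

```python
def approx_levenshtein(left: str, right: str) -> int:
    # If either string is empty, return the length of the other string.
    if not left:
        return len(right)
    if not right:
        return len(left)

    # Loop on the length of the longer string.
    compare_length = max(len(left), len(right))
    diff = 0
    li = ri = 0
    while li < compare_length:
        # Out-of-range reads yield "" to mirror T-SQL's SUBSTRING.
        lc = left[li] if li < len(left) else ""
        rc = right[ri] if ri < len(right) else ""
        if lc != rc:
            next_r = right[ri + 1] if ri + 1 < len(right) else ""
            next_l = left[li + 1] if li + 1 < len(left) else ""
            if lc == next_r:
                ri += 1   # an insertion re-aligns the strings
            elif next_l == rc:
                li += 1   # a deletion re-aligns the strings
            diff += 1     # count one difference either way
        li += 1
        ri += 1
    return diff
```

For example, `approx_levenshtein("brent", "brant")` returns 1, and `approx_levenshtein("bent", "brent")` recovers from the single insertion and also returns 1.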
So, how much does this help my performance?
Test Case: I was matching first and last names in a table of over 5,500 users. Using Mr. Gilleland's approach with a user near the end of the table, the query took ~28 seconds. I optimized that call down to 14 seconds (the function was being called twice in the proc). Still, 14 seconds is a heck of a long time.
Original Query: 28 seconds
Optimized Original Query (removed a second Levenshtein function call, using string lengths as a pre-filter instead): 14 seconds
My 'optimized' Levenshtein algorithm using the Optimized Original Query: with the same table and the same user, my 'improved' algorithm could now perform the same query in ~400 ms. That's a 35x improvement! All of these tests were run on MSSQL 2005.
Here's the code:
/****** Object: UserDefinedFunction [dbo].[LEVENSHTEIN] Script Date: 06/10/2009 09:36:44 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER function [dbo].[LEVENSHTEIN](
    @left varchar(100),
    @right varchar(100)
)
returns int
as
BEGIN
    DECLARE @difference int, @lenRight int, @lenLeft int,
            @leftIndex int, @rightIndex int,
            @left_char char(1), @right_char char(1),
            @compareLength int

    SET @lenLeft = LEN(@left)
    SET @lenRight = LEN(@right)
    SET @difference = 0

    -- If either string is empty, return the length of the other
    IF @lenLeft = 0
    BEGIN
        SET @difference = @lenRight
        GOTO done
    END
    IF @lenRight = 0
    BEGIN
        SET @difference = @lenLeft
        GOTO done
    END

    -- Loop on the length of the longer string
    IF (@lenLeft >= @lenRight)
        SET @compareLength = @lenLeft
    ELSE
        SET @compareLength = @lenRight

    SET @rightIndex = 1
    SET @leftIndex = 1
    WHILE @leftIndex <= @compareLength
    BEGIN
        SET @left_char = substring(@left, @leftIndex, 1)
        SET @right_char = substring(@right, @rightIndex, 1)
        IF @left_char <> @right_char
        BEGIN
            -- Would an insertion make them re-align?
            IF (@left_char = substring(@right, @rightIndex + 1, 1))
                SET @rightIndex = @rightIndex + 1
            -- Would a deletion make them re-align?
            ELSE IF (substring(@left, @leftIndex + 1, 1) = @right_char)
                SET @leftIndex = @leftIndex + 1
            SET @difference = @difference + 1
        END
        SET @leftIndex = @leftIndex + 1
        SET @rightIndex = @rightIndex + 1
    END

done:
    RETURN @difference
END
I have since edited this post to more correctly reflect the cost associated with the performance of my algorithm as it is not as powerful as the original Levenshtein algorithm. However, it does handle most real-world spelling errors that an admin would want to tolerate.
Hi Brent, can you show us a usage? I'd like to be able to query all records where a given field is less than a given Levenshtein distance from a given string.
Seems to me that if we know the maximum allowable distance, we can optimize further, i.e., throw out a lot of records without doing the full calculation.
I know it's been forever since you posted this reply, Matt, but I was looking at this again today and I agree. If you add an argument to the function for the max difference allowed, you can short-circuit the loop in many cases, and it should speed things up even further.
I would hazard a guess of 30-50% faster for a max difference of, say, 3 with an average string length of 10 or more. It would really help stabilize the average overall speed with strings that could be much longer, too.
I should have thought of that a long time ago (and you already did)!
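The short-circuit idea discussed above could look something like this in a hypothetical Python sketch of the simplified algorithm (the function name and `max_diff` parameter are illustrative, not from the post):

```python
def approx_levenshtein_max(left: str, right: str, max_diff: int) -> int:
    # Length pre-filter: strings whose lengths differ by more than
    # max_diff can never be within max_diff edits of each other.
    if abs(len(left) - len(right)) > max_diff:
        return max_diff + 1

    diff = 0
    li = ri = 0
    compare_length = max(len(left), len(right))
    while li < compare_length:
        # Out-of-range reads yield "" to mirror T-SQL's SUBSTRING.
        lc = left[li] if li < len(left) else ""
        rc = right[ri] if ri < len(right) else ""
        if lc != rc:
            if lc == (right[ri + 1] if ri + 1 < len(right) else ""):
                ri += 1   # an insertion re-aligns the strings
            elif (left[li + 1] if li + 1 < len(left) else "") == rc:
                li += 1   # a deletion re-aligns the strings
            diff += 1
            if diff > max_diff:
                return diff   # short circuit: already over the cap
        li += 1
        ri += 1
    return diff
```

The same two tricks (a cheap length check up front, plus bailing out of the loop as soon as the cap is exceeded) translate directly into the T-SQL function.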
Hi Brent,
If you combine the info you find on the following websites, you can speed that up even further. From 45 seconds to 1 ms, I kid you not, and you will have the full Levenshtein functionality!
http://www.stev.org/post/2011/02/05/MSSQL-Levenshtein.aspx
and
http://www.asp.net/data-access/tutorials/creating-stored-procedures-and-user-defined-functions-with-managed-code-cs
From the first website, I altered the function so it can be used straight away, following step 13 from the second website.
Here is the code:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class StoredFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
    public static SqlDouble Levenshtein(SqlString S1, SqlString S2)
    {
        if (S1.IsNull || S2.IsNull)
            throw new ArgumentNullException();

        // Compare case-insensitively
        String S3 = S1.Value.ToUpper();
        String S4 = S2.Value.ToUpper();
        int n = S3.Length;
        int m = S4.Length;
        int[,] d = new int[n + 1, m + 1];
        int cost = 0;

        if (n + m == 0) {
            return 100;   // two empty strings are a 100% match
        } else if (n == 0) {
            return 0;     // an empty string matches nothing
        } else if (m == 0) {
            return 0;
        }

        // Initialize the first row and column of the distance matrix
        for (int i = 0; i <= n; i++)
            d[i, 0] = i;
        for (int j = 0; j <= m; j++)
            d[0, j] = j;

        // Fill the matrix: each cell takes the cheapest of a deletion,
        // an insertion, or a (possibly free) substitution
        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                if (S3[i - 1] == S4[j - 1])
                    cost = 0;
                else
                    cost = 1;
                d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
            }
        }

        // Convert the edit distance into a similarity percentage
        double percentage = System.Math.Round((1.0 - ((double)d[n, m] / (double)System.Math.Max(n, m))) * 100.0, 2);
        return percentage;
    }
}
Cheers,
Tom
A very interesting approach, Tom, and it appears to be much faster, and, of course, more powerful b/c of your C#/.NET approach.
For the case where a user can build and use this code from a referenced DLL in their SQL Server instance, this is definitely better.
As an FYI for my approach, referencing a .NET DLL from our SQL Server instance was not feasible (as crazy as that may sound), so I didn't even bother traversing this path.
So, if the software absolutely must go pure SQL and you don't mind the cost, my solution should work, but I pity your deployment environment if that is the case (as I pity my old deployment environment - switched jobs since then). :)
I do like the SQL approach, though C# in a CLR will kick the crap out of it performance-wise as soon as you encounter any kind of "loop" in SQL.
The Stev.Org Guy :)
Wow, this is amazing. I am able to use this to search our inventory of 150k products in under 1 second. Awesome :)
Thank you for sharing this function! It has worked with 100% reliability for my specific purpose, and at a better-than-expected speed. I also am in a SQL environment that does not support CLRs, so this was a PERFECT solution. Thanks again!
Thanks a lot man. This will help me out a lot.
I just performance tested your T-SQL implementation against 6 others and yours came out on top with 1000 calls in 420 ms.
The only reason I won't be using your implementation is that I need the full version of the algorithm.
Amazing job sir
Found a bug:
SELECT [dbo].[LEVENSHTEIN]('moo', 'mo77o')
returns 3, but should return 2 (delete the 2 numeric characters from the 2nd string to get the 1st).
I wouldn't classify this as a bug b/c my solution has many limitations compared to a full-blown Levenshtein implementation. However, it is worth noting that this is one of them, b/c it doesn't look at the whole string to figure out removals/additions, but one character at a time.
So, the first '7' becomes an 'o', then the second '7' is removed, then the last 'o' is removed = 3 changes.
Thanks for the note!