I'm cleaning up a dirty database I inherited and need to "fuzzy match" names for human review. I've come up with a solution that works but it terribly slow — 7 minutes on 15k rows. I have the feeling I'm overlooking some really simple solution.
1 John Smith 2 John Q Smith 3 Janway Smith 4 Jane Chen 5 David Jones 6 Natalia La Brody 7 Natalia LaBrody 8 LaBrody 9 Dave Jones
I need multiple criteria for this fuzzy match. Two I've come up with include:
- Check for matches based on a concat of the first three and last five letters.
- If a single word check against all last words
- (I may add more conditions)
My code looks like this:
UPDATE authors a INNER JOIN (SELECT id, author_name FROM authors) b ON CASE WHEN a.author_name NOT REGEXP ' ' THEN a.author_name = substring_index(b.author_name, ' ', -1) ELSE concat(LEFT(a.author_name, 3), RIGHT(a.author_name, 5)) = concat(LEFT(b.author_name, 3), RIGHT(b.author_name, 5)) END SET tags = concat_ws(',',tags,'Duplicate?') WHERE a.id <> b.id
I was surprised I could put a CASE in an ON clause but it worked. Still, how can I do this with substantially better performance?
Databases (in general) are not designed for this purpose.
One algorithm that gets used is Levenshtein distance. You can readily find implementations for MySQL, but that doesn't help your problem.
To be honest, such string matching often requires manual checking. You might consider just loading the data into a spreadsheet, ordering them alphabetically, and noting in the spreadsheet when values are the same.
In the end, you are going to have to spend a lot of time figuring out where the "duplicates" are, so you might as well plan your workload around that.
One approach is to use soundex . You cannot 100% rely on it but it help you to narrow down your search result and make the query fast
select t, soundex(t) from ( select 'John Smith' as t union select 'John Q Smith' as t union select 'Janway Smith' as t union select 'Jane Chen' as t union select 'David Jones' as t union select 'Natalia La Brody' as t union select 'Natalia LaBrody' as t union select 'LaBrody' as t union select 'dave jones' as t )tbl group by soundex(t)
'Natalia La Brody', 'N34163' 'LaBrody', 'L163' 'John Smith', 'J5253' 'Jane Chen', 'J525' 'David Jones', 'D13252' 'dave jones', 'D1252'