Same Difference

In my previous post, I mentioned how I began to write a query against a list of people and identify potential duplicates.

In my case, I didn’t want to just stop at comparing details for “Rose Jones” against that of “Rosie Jones”.

Though it doesn’t happen often when people are filling out forms on their own, when information is added on behalf of someone else, careless mistakes can happen. If the person’s name is “James Scott”, maybe it would get entered incorrectly as “Scott James” because someone wrote it down as “Scott, James” and the comma wasn’t easy to see. Whatever. Mistakes happen. I’m sure you’ve seen it before.

I also feel like it is important to bring in any records that might not share the same name but do share some key piece of unique data. Depending on the app, how truly unique your data is will vary: it could be something as unique as a Social Security Number all the way down to an email address or phone number that, while not truly unique, is typically limited to a small subset of people.

Here’s the gist of what I’ve done so far:

1. Make “groups” based on the individual first name and last name combinations — being careful to weed out extra name information that probably doesn’t need to be factored in — middle names, punctuation, etc.

2. Create all of my name variations — keeping the ID of the original record and bringing those in as members of the group belonging to the original record

3. Any records with the reversed name matching records in existing groups get brought in (allowing “Scott James” to be brought into the “James Scott” group)

4. Any records whose unique data overlaps with that of members of existing groups get added in, too.

5. Last but not least, if any groups have identical membership, delete all but one of them (this takes care of the scenario where “Bob Smith” and “Bobby Smith” are the only members of Group ‘X’ and “Bobby Smith” and “Bob Smith” are the only members of Group ‘Y’ due to the name reversals)
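My actual implementation is a query, but the five steps above can be sketched in Python. Everything here is illustrative: the tiny `NAME_VARIATIONS` map stands in for my “Names” table, email is standing in for whatever unique data your app has, and the function names are my own for this sketch.

```python
from collections import defaultdict

# Hypothetical stand-in for the "Names" variations table.
NAME_VARIATIONS = {"bob": {"bobby", "robert"}, "rose": {"rosie"}}

def normalize(name):
    # Step 1: lowercase, strip punctuation, keep only first and last name.
    parts = [p.strip(".,") for p in name.lower().split()]
    return (parts[0], parts[-1]) if len(parts) >= 2 else (parts[0], "")

def build_groups(records):
    """records: list of (id, name, email) tuples; returns {group_key: set(ids)}."""
    groups = defaultdict(set)
    key_of = {}
    for rid, name, _ in records:
        key = normalize(name)
        groups[key].add(rid)
        key_of[rid] = key

    # Step 2: fold name-variation matches into the original record's group.
    for rid, (first, last) in list(key_of.items()):
        for alt in NAME_VARIATIONS.get(first, ()):
            if (alt, last) in groups:
                groups[(first, last)] |= groups[(alt, last)]

    # Step 3: reversed names join existing groups
    # ("Scott James" gets pulled into the "James Scott" group).
    for first, last in list(groups):
        if (last, first) in groups and (first, last) != (last, first):
            groups[(first, last)] |= groups[(last, first)]

    # Step 4: records sharing unique data (email, here) with a group's
    # members get added to that group, too.
    email_of = {rid: email for rid, _, email in records}
    for members in groups.values():
        emails = {email_of[m] for m in members}
        members |= {rid for rid, e in email_of.items() if e in emails}

    # Step 5: drop groups whose membership duplicates another group's.
    seen, deduped = set(), {}
    for key, members in groups.items():
        frozen = frozenset(members)
        if frozen not in seen:
            seen.add(frozen)
            deduped[key] = members
    return deduped
```

The real thing does this set logic in the database, of course — the point of the sketch is just how each step reuses the groups built by the steps before it.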

So far, it’s been working like a champ. Even with about 70–100k records, the entire process takes about three minutes, which is acceptable in this case, since it’s for a scheduled report. The main slowdown in my case is bringing in some additional shared data between people, which creates a large many-to-many dataset. Without that extra level of checking, it only takes about 20 seconds. For basic name-only checking, you could probably skip Step 4 and have a pretty fast query.

Total execution time will also depend a lot on how many name variations you have. For reference, I currently have just under 1400 records in my “Names” table.

By utilizing the groups, I’m able to look at the records in chunks and quickly weed out the records that are likely duplicates and skip over ones that just happened to share similar names.
