Wednesday, 9 March 2011

Same Difference

In my previous post, I mentioned how I began to write a query against a list of people and identify potential duplicates.

In my case, I didn’t want to just stop at comparing details for “Rose Jones” against that of “Rosie Jones”.

Though it doesn’t happen often when people are filling out forms on their own, when information is added on behalf of someone else, careless mistakes can happen. If the person’s name is “James Scott”, maybe it would get entered incorrectly as “Scott James” because someone wrote it down as “Scott, James” and the comma wasn’t easy to see. Whatever. Mistakes happen. I’m sure you’ve seen it before.

I also feel like it is important to bring in any records that might not share the same name but do share some key piece of unique data. Depending on the app, how truly unique your data is will vary. It could be something as unique as a Social Security Number all the way down to an email address or phone number that, while not truly unique, is typically limited to a small subset of people.

Here’s the gist of what I’ve done so far:

1. Make “groups” based on the individual first name and last name combinations, being careful to weed out extra name information that probably doesn’t need to be factored in (middle names, punctuation, etc.).

2. Create all of my name variations, keeping the ID of the original record and bringing them in as members of the group belonging to the original record.

3. Any records with the reversed name matching records in existing groups get brought in (allowing “Scott James” to be brought into the “James Scott” group)

4. Any records where unique data overlaps with members of existing groups, add them in, too.

5. Last but not least, if there are any groups with the same members, delete all but one of them. (This takes care of the scenario where “Bob Smith” and “Bobby Smith” are the only members of Group ‘X’ and “Bobby Smith” and “Bob Smith” are the only members of Group ‘Y’ due to the name reversals.)
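The steps above can be sketched roughly as follows. This is a hypothetical Python stand-in, not the actual implementation (the real process is a set of SQL queries against my tables); the record layout, the tiny `NICKNAMES` lookup, and the helper names are all made up for illustration, and two-part “First Last” names are assumed throughout:

```python
import re
from collections import defaultdict

# Step 2's name-variation lookup: a tiny stand-in for the real table.
NICKNAMES = {"bob": "robert", "bobby": "robert", "rosie": "rose"}

def normalize(name):
    """Step 1: strip punctuation and middle names down to (first, last)."""
    parts = re.sub(r"[^a-z ]", "", name.lower()).split()
    return parts[0], parts[-1]

def build_groups(records):
    """records: list of (id, name, email) tuples."""
    groups = defaultdict(set)   # (first, last) -> set of member ids
    email_of = {}
    for rid, name, email in records:
        first, last = normalize(name)
        # Steps 1-2: group under the "parent" form of the first name.
        groups[(NICKNAMES.get(first, first), last)].add(rid)
        email_of[rid] = email
    # Step 3: fold reversed-name groups into existing ones.
    for key in list(groups):
        if key not in groups:
            continue  # already merged away
        rev = (key[1], key[0])
        if rev in groups and rev != key:
            groups[key] |= groups.pop(rev)
    # Step 4: pull in records sharing "unique" data (email, here).
    by_email = defaultdict(set)
    for rid, email in email_of.items():
        by_email[email].add(rid)
    for members in groups.values():
        for rid in list(members):
            members |= by_email[email_of[rid]]
    # Step 5: drop groups with identical membership.
    seen, deduped = set(), {}
    for key, members in groups.items():
        frozen = frozenset(members)
        if frozen not in seen:
            seen.add(frozen)
            deduped[key] = members
    return deduped
```

The defaultdict keyed on the normalized name pair is doing the work of the “groups” described above; everything after that is just merging sets.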

So far, it’s been working like a champ. Even with about 70–100k records’ worth of data, the entire process takes about 3 minutes, which is acceptable in this case since it’s for a scheduled report. The main slowdown in my case is bringing in some additional shared data between people, which creates a large many-to-many dataset. Without that extra level of checking, it only takes about 20 seconds. For basic name-only checking, you could probably skip Step 4 and have a pretty fast query.

Total execution time will also depend a lot on how many name variations you have. For reference, I currently have just under 1400 records in my “Names” table.

By utilizing the groups, I’m able to look at the records in chunks and quickly weed out the records that are likely duplicates and skip over ones that just happened to share similar names.

Monday, 7 March 2011

A Rose By Any Other Name

Whether you’re dealing with an online form that users fill out or an application that internal staff enter data into, names can get a bit tricky.

Not only are there the usual data-accuracy issues caused by accidental typos, but also ones that come from something as simple as variations of a person’s first name.

I’ve had a lot of fun trying to tackle this recently…

At first, I started with the Jaro–Winkler distance. This worked surprisingly well for typos, especially on fairly short names, but the downside was the number of false positives it produced, as well as the overall performance cost of evaluating scores on fairly large datasets.
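For reference, the metric itself is simple enough to write out by hand. This plain-Python version (not the implementation I actually used; any library version will do the same job) shows the two pieces: the Jaro score based on shared characters within a sliding window, and Winkler’s boost for a common prefix:

```python
def jaro(s1, s2):
    """Jaro similarity: matched characters within a window, penalized
    for transpositions (matched characters out of order)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(i + window + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Boost the Jaro score for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

The inner character-by-character loop is also why scoring every pair in a large dataset gets expensive fast; the comparison is quadratic in the number of records unless you pre-filter candidates.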

Next, I began learning what I could about diminutive/short forms of common names. I came across a few projects that I was able to use as a starting point.

Most of what I found either had too few examples or far too many. Eventually, I ended up just compiling a master list and applying my own “sniff test” to it.

After weeding out the names that I didn’t feel were worth the extra bloat, I added in a few of my own. Most of these came from census data or other sources. I mostly just looked for the top 10 or so popular names for the past couple years — both in the U.S. as well as Latin America. Not only did this help me add alternate spellings for some of the names popular in other countries, but it also helped me take into consideration “Americanized” names.

I obviously want the ability to see if “John Smith” is already in the system as “Jon Smith”, but I’d also like to have that same capability for “Juan Smith”.

My lookup table ended up having a thousand or so rows of name data. To keep things simple, there are only two fields, “Name” and “ParentName”. Realistically, ParentName could’ve been an ID field or whatever. It really serves no purpose other than to link “Bob” and “Robert” together, for instance. With the way my original data came in, though, it was just more practical to skip that.
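In code terms, the lookup behaves something like this. The rows below are a tiny illustrative sample, not the real table, and the helper functions are just a sketch of how the two fields get used:

```python
# A tiny stand-in for the real Name/ParentName lookup table.
LOOKUP = [
    ("Bob",   "Robert"),
    ("Bobby", "Robert"),
    ("Rob",   "Robert"),
    ("Jon",   "John"),
    ("Juan",  "John"),
    ("Rosie", "Rose"),
]
PARENT = {name.lower(): parent.lower() for name, parent in LOOKUP}

def root_name(first):
    """Resolve a first name to its ParentName form, or itself if unlisted."""
    return PARENT.get(first.lower(), first.lower())

def same_person_name(a, b):
    """Do two 'First Last' names collapse to the same root name?"""
    first_a, last_a = a.lower().split()
    first_b, last_b = b.lower().split()
    return root_name(first_a) == root_name(first_b) and last_a == last_b
```

Because “Juan” maps to “John” in the table, “Juan Smith” and “Jon Smith” both resolve to the same root and get flagged as a potential match.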

It didn’t matter so much when it was all coming from external sources, but now that I’m maintaining my own list, perhaps I’ll switch to something a little more neutral. While “Robert” might be considered the root name for “Bob”, what about “George” and “Jorge”? It would probably make the most sense to keep things in groupings instead of an implied hierarchy.