Whether you’re dealing with an online form that users fill out or an application that internal staff enter data into, names can get a bit tricky.
On top of the usual data-accuracy problems caused by accidental typos, you also get mismatches from something as simple as a variation of the person’s first name.
I’ve had a lot of fun trying to tackle this recently…
I started with the Jaro–Winkler distance. This worked surprisingly well for typos, especially on fairly short names, but the downsides were the number of false positives it gave and the performance cost of scoring every pair of names in a fairly large dataset.
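For readers who haven't seen it, here is a minimal pure-Python sketch of the Jaro–Winkler similarity. It is an illustration of the standard formula, not the code used for this project; the function names and the default `prefix_scale` of 0.1 and `max_prefix` of 4 are the commonly cited defaults, chosen here as assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("john", "jon"), 4))
```

Both the false positives and the cost follow from how this works: short names that share a prefix score high whether or not they are related, and checking one new name against a dataset means one scoring call per existing record.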
Next, I began learning what I could about diminutive/short forms of common names. I came across a few projects that I was able to use as a starting point.
Most of what I found had either too few examples or far too many. Eventually, I just compiled a master list and applied my own “sniff test” to it.
After weeding out the names that I didn’t feel were worth the extra bloat, I added in a few of my own. Most of these came from census data or other sources. I mostly just looked for the top 10 or so popular names from the past couple of years, in both the U.S. and Latin America. Not only did this help me add alternate spellings for some of the names popular in other countries, but it also helped me take “Americanized” names into consideration.
I obviously want the ability to see if “John Smith” is already in the system as “Jon Smith”, but I’d also like to have that same capability for “Juan Smith”.
My lookup table ended up having a thousand or so rows of name data. To keep things simple, there are only two fields, “Name” and “ParentName”. Realistically, ParentName could’ve been an ID field or whatever. It really serves no purpose other than to link “Bob” and “Robert” together, for instance. With the way my original data came in, though, it was just more practical to skip that.
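As a rough sketch of how a two-field table like this can be used, the Python below loads a handful of (Name, ParentName) rows into a dictionary and compares first names by their parent. The sample rows, the function names, and the simple "first word is the first name" split are all assumptions for illustration; the real table has far more rows and may handle root names differently.

```python
# Hypothetical sample rows in (Name, ParentName) form.
NAME_ROWS = [
    ("Bob", "Robert"),
    ("Rob", "Robert"),
    ("Jon", "John"),
    ("Juan", "John"),
]

# Lowercase both fields so lookups are case-insensitive.
PARENT_BY_NAME = {name.lower(): parent.lower() for name, parent in NAME_ROWS}


def canonical_first_name(name: str) -> str:
    """Return the ParentName for a known variant, else the name itself."""
    key = name.strip().lower()
    return PARENT_BY_NAME.get(key, key)


def same_person_candidate(full_a: str, full_b: str) -> bool:
    """True if last names match and first names share a parent name."""
    first_a, _, last_a = full_a.strip().partition(" ")
    first_b, _, last_b = full_b.strip().partition(" ")
    if last_a.strip().lower() != last_b.strip().lower():
        return False
    return canonical_first_name(first_a) == canonical_first_name(first_b)


print(same_person_candidate("John Smith", "Jon Smith"))
print(same_person_candidate("John Smith", "Juan Smith"))
```

Because each lookup is a single dictionary access, this kind of check avoids the pairwise scoring cost of a string-distance approach, at the price of only catching variants that are actually in the list.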
It didn’t matter so much when it was all coming from external sources, but now that I’m maintaining my own list, perhaps I’ll switch to something a little more neutral. While “Robert” might be considered the root name for “Bob”, what about “George” and “Jorge”? It would probably make the most sense to keep things in groupings instead of an implied hierarchy.
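One possible shape for the grouping idea, sketched in Python: each group is just a set of equivalent names with no root, and two names match if they share at least one group. The group contents and the names `NAME_GROUPS`, `build_index`, and `names_equivalent` are hypothetical, and letting a name belong to more than one group is a design assumption rather than something the current table does.

```python
from collections import defaultdict

# Hypothetical groups; no name is treated as the "parent" of another.
NAME_GROUPS = [
    {"robert", "bob", "rob"},
    {"george", "jorge"},
    {"john", "jon", "juan"},
]


def build_index(groups):
    """Map each lowercase name to the set of group numbers it belongs to."""
    index = defaultdict(set)
    for group_id, names in enumerate(groups):
        for name in names:
            index[name.lower()].add(group_id)
    return index


GROUPS_BY_NAME = build_index(NAME_GROUPS)


def names_equivalent(a: str, b: str) -> bool:
    """True if the names are identical or share at least one group."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return True
    # .get avoids adding empty entries to the defaultdict on a miss.
    return bool(GROUPS_BY_NAME.get(a, set()) & GROUPS_BY_NAME.get(b, set()))


print(names_equivalent("George", "Jorge"))
```

In a two-column table, the same idea would amount to replacing ParentName with a neutral group identifier.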