Life as a software engineer pretty much requires that you’ll be working with data. It might be in a database, or stored in files, or streamed from the network, or beamed down from space, but it’s pretty unavoidable.

If you are dealing with data sent from a 3rd party, things can get tricky. You can’t guarantee that they named things properly, adhered to common sense techniques, or even that they’ll be using a character encoding you recognize (true story, health insurance company didn’t bother to mention that the data would be sent in EBCDIC instead of ASCII…sigh.)

Things only get worse the more sources of data you have. At a recent gig, we had to merge hundreds of different lists together, and there’s no guaranteed consistency from one to the next. Each has core set of information that we need to match against our master database, but each file can be quite different. A lot of differences we can code for, but so far we’ve needed someone — either us or the customer — to manually tag the incoming file to say “This column is a phone number, that column is the city name” so that it can progress through the system. Wouldn’t it just be better (and cooler) for the system to figure it out on its own?

It seems like it can.

TL;DR version:

  • Different types of data (phone numbers, zip codes, first & last names, cities, etc.) demonstrate differences in the probability distributions of the lengths of strings, the characters in the string, and pairs of characters (bigrams) in the string.
  • These probability distributions can be considered as a many-dimensional vector which acts as a fingerprint for that type of data.
  • Those vectors/fingerprints can be compared (using cosine similarity) to classify the columns in an unknown document quite successfully.

I’ve tested this with first & last names, address strings, city names, state abbreviations, phone numbers, email addresses, urls, also categorizing text by language. It could even tell ASCII from EBCDIC. I believe it can do a lot more.

Long version with graphs and code and stuff:
Continue reading