# Using Probability Distribution and Cosine Similarity to Automatically Detect Data Types

Life as a software engineer pretty much requires that you’ll be working with data. It might be in a database, or stored in files, or streamed from the network, or beamed down from space, but it’s pretty unavoidable.

If you are dealing with data sent from a 3rd party, things can get tricky. You can’t guarantee that they named things properly, adhered to common sense techniques, or even that they’ll be using a character encoding you recognize (true story, health insurance company didn’t bother to mention that the data would be sent in EBCDIC instead of ASCII…sigh.)

Things only get worse the more sources of data you have. At a recent gig, we had to merge hundreds of different lists together, and there’s no guaranteed consistency from one to the next. Each has core set of information that we need to match against our master database, but each file can be quite different. A lot of differences we can code for, but so far we’ve needed someone — either us or the customer — to manually tag the incoming file to say “This column is a phone number, that column is the city name” so that it can progress through the system. Wouldn’t it just be better (and cooler) for the system to figure it out on its own?

It seems like it can.

TL;DR version:

• Different types of data (phone numbers, zip codes, first & last names, cities, etc.) demonstrate differences in the probability distributions of the lengths of strings, the characters in the string, and pairs of characters (bigrams) in the string.
• These probability distributions can be considered as a many-dimensional vector which acts as a fingerprint for that type of data.
• Those vectors/fingerprints can be compared (using cosine similarity) to classify the columns in an unknown document quite successfully.

I’ve tested this with first & last names, address strings, city names, state abbreviations, phone numbers, email addresses, urls, also categorizing text by language. It could even tell ASCII from EBCDIC. I believe it can do a lot more.

Long version with graphs and code and stuff:

Different types of data demonstrate differences in the probability distributions

If you consider every phone number in the US, you might assume that the probability of each digit appearing is the same, but phone numbers aren’t random. You’ll never start a phone number with a 0 or 1, for instance. Similarly, within each area code there are specific exchanges (the first 3 digits in a local number), not just a random set of digits, so you’ll see those more often. For a quick look, I took more than a million Washington state phone numbers — 12 million digits in all — and looked at how often each digit appeared, and you can see that it isn’t random:

The zero appears more than twice as often as the number 1! That’s a pretty significant difference. This becomes even more dramatic when you look at pairs of letters (or ‘bigrams’) in the text. If you are looking at phone numbers in the area code 425, you’ll see the bigrams ’42’ and ’25’ pretty darn often compared to other pairs of characters. Again, here’s WA state:

Now consider standard 5 digit zip codes. It’s the same characters as we use for phone numbers, but the probability distribution is different:

It’s more random than the last one, but not totally random. And check out the distribution of number pairs:

You’re not likely to mistake this distribution for the one for telephone numbers, they hint at very different structures underlying the pattern of digits. Also, there are structural differences — phone numbers are typically 10 digits long but zip codes are 5 (or 9) digits long, first names are just a tiny bit shorter (5.5 characters on average) than last names (6.5 characters).

First & last names offer an interesting twist, actually. They’re often quite different, historically they are drawn from different sources — and it shows in the distribution of bigrams. In fact, here’s the top 10 bigrams appearing in first and last names (across a sample of 4M names):

 First Last #1 AN ER #2 AR ON #3 RI AN #4 ER AR #5 EL EN #6 EN LE #7 MA IN #8 IN LL #9 LE SO #10 IC EL

There is some overlap, but it’s clear that first & last names have properties that are distinct from each other. Just like with phone numbers & zip codes, the different probability distributions hint at different ‘grammars’ for the sets of words that make up first and last names.

These probability distributions can be considered as a 1869-dimensional vector which acts as a fingerprint for that type of data.

For our comparisons, I’m going to use the frequencies of:

• The length of the strings in the set of data (for each length from 0-63 characters)
• The appearance of each individual character from a 41-character ‘alphabet’. (The lower case letters (a-z), the digits (0-9), the space character, and a few common punctuation marks.)
• The appearance of every possible pair of characters from the alphabet + an additional character to represent the ‘edge’ (start or end) of a word.

Together, that’s 64+41+422 = 1869 different values that will make up our ‘fingerprint’, and I’ll treat them just as a vector.

In real life, you probably don’t want to construct your vector this way. It can be a wasteful since a lot of bigram combinations won’t ever appear (you don’t see the letters Q & X next to each other all that often, for instance), and the forced n2 is going to kill you when it comes to unicode, and pre-defining that we’re using only strings from 0-63 characters long is an obvious hack waiting to bite the unwary. I’m only doing it this way because it’s the easiest way to explain the next step.

These fingerprint vectors can be compared using Cosine Similarity

It gets hard to picture when you’re working with 1800+ dimensions, but you can calculate the angle between two vectors whether they have 2 dimensions like the figure above, or millions of dimensions. The formula remains the same:

$cos \theta = \frac{a \cdot b}{||a|| ||b||}$

Essentially, take the dot product of the two vectors and divide by the product of the lengths of the two vectors. Here it is in Python:

# Calculate the cosine similarity between the input vector & the training vector
def cosine_sim( ivec, tvec ):
# dot product
dot = 0.0
for i, t in zip(ivec, tvec):
dot += (i*t)

# vector length
ilen = sqrt( sum( [i*i for i in ivec] ) )
tlen = sqrt( sum( [t*t for t in tvec] ) )

return dot/(ilen*tlen)


If the cosine is close to 1, then the vectors are very similar. If the cosine is zero then there’s nothing similar at all (the vectors are ‘orthogonal’). If you need to find the best match, just look for the one with the largest cosine.

And so now we have the outline of a strategy. We’ll use a set of fingerprints generated from as much data as we can get from lots of different types of data — let’s call them the classification fingerprints. Then, when confronted with a file we know nothing about, we will analyze the various columns and generate the set of fingerprint vectors (call them ‘unknown fingerprints’), and compare them against the classification patterns to find the best matches. It looks something like this:

# Compare the vector fingerprints of the input columns against those for
# the training columns.  Return the best match, and any others that exceed
# the threshold value.

def classify( input, training, threshold=0.0 ):

input_cols = input.keys()
training_cols = training.keys()

# store the results here
results = defaultdict(list)

# look at each input column
for icol in input_cols:
# keep track of the similarity measure as we compare each against
# each training column fingerprint
cosines = {}
for tcol in training_cols:
cosines[tcol] = cosine_sim( input[icol], training[tcol] )

# sort the results
sorted_cosines = sorted( cosines.iteritems(),
key=lambda x:x[1],
reverse=True )

# keep the top result and any others above the threshold
for i, (candidate, score) in enumerate(sorted_cosines):
if (i == 0) or (score >= threshold):
results[icol].append( (candidate, round(score,4)) )

return results


A Quick Interruption to Talk About Data

I am keenly aware of the responsibilities we have to keep our data — especially personally identifiable information (PII) — safe and secure.

When I was preparing this post, I wanted to use real data for the ‘training’ vector/fingerprints, but I didn’t want to risk ever revealing PII, so I took pains to anonymize data in ways that maintained the distribution of characters.

I started with 3.6 million people and I kept all the same first and last names, but randomized which went to which. I also took addresses and swapped the street # around with other addresses and then made sure that no part of the address went with the original person. First name, last name, street & street address, and zip code were all randomized, and no two parts from the original person were kept together.

Phone numbers were tricky — after all, they’re more or less unique. I felt like mixing up the area code, prefix & line number weren’t good enough, so I stripped out the area code, changed it completely and then moved numbers around. This had the effect of changing the distribution of the characters, and we’ll see the effect that had next.

That’s all just to say that the ‘classification fingerprints’ are drawn from ‘real-enough’ data without ever risking revealing actual personal information if the test data ever makes it out into the world.

Ready to test my hypothesis, I grabbed a completely different data set (available online here) and fed it into my application. There were only 500 records in that data, will that be sufficient to build frequency distributions to match against?

\$ python classifier.py -t training.p -i us-500.p --threshold 0.70

state           - [('STATE', 0.9882)]
last_name       - [('LAST', 0.9829), ('FIRST', 0.8756), ('CITY', 0.8377)]
first_name      - [('FIRST', 0.9667), ('LAST', 0.9621)]
zip             - [('ZIP', 0.9531)]
city            - [('CITY', 0.9363), ('LAST', 0.8453)]
county          - [('LAST', 0.9268), ('CITY', 0.8964), ('FIRST', 0.8103)]
phone2          - [('PHONE', 0.8645)]
phone1          - [('PHONE', 0.8639)]


Check that out! I consider anything over 0.9 to be a pretty powerful match, and we see a lot of those here. Some very interesting results indeed:

• State is matched very strongly (not surprising since by definition there are only 50 bigrams in that set)
• Last name & First name were matched separately and correctly — though look at ‘first name’. The match for LAST was almost as strong.
• Addresses matched very well, as did Zip codes
• Cities matched very well, but also matched pretty well against Last name — which isn’t too surprising given how many cities are named after their founder!
• County — which I didn’t have a classification vector for — matched itself against Last Name pretty strongly, also not too surprising given how lots of Counties got their name.
• Phone numbers matched, but not a very strong match. What happened there? Well, like I said above, I mucked with phone numbers a lot in preparing the classifier data, but I didn’t muck with the ‘to be identified’ data. They were pretty different, but even still we managed to get some matches.

This is a pretty exciting result. Using only features of the textual strings in the source, we were able to match everything we had data for. We missed email & url because I didn’t pull together sample data for it, but if this technique can tell phone number from zip code and first from last names, it can certainly pick out something as distinctive as an email address or url.

Developers work with data all the time, but it’s almost always us that has to add the context about what the data means. I think techniques like this point the way toward systems that ‘just figure it out’ and add their own context. So far, I’ve built a tool named Glance which uses this and other techniques to describe the meta-content of data files including the language of the content, and I’ve built some examples of larger systems which can detect changes in incoming data feeds and auto-suggestion corrections to the problem. This space is rich with interesting possibilities!