Life as a software engineer pretty much requires that you’ll be working with data. It might be in a database, or stored in files, or streamed from the network, or beamed down from space, but it’s pretty unavoidable.

If you are dealing with data sent from a third party, things can get tricky. You can’t guarantee that they named things properly, followed common-sense conventions, or even that they’ll be using a character encoding you recognize (true story: a health insurance company didn’t bother to mention that the data would be sent in EBCDIC instead of ASCII… sigh).

Things only get worse the more sources of data you have. At a recent gig, we had to merge hundreds of different lists together, with no guaranteed consistency from one to the next. Each has a core set of information that we need to match against our master database, but each file can be quite different. A lot of the differences we can code for, but so far we’ve needed someone (either us or the customer) to manually tag the incoming file to say “This column is a phone number, that column is the city name” so that it can progress through the system. Wouldn’t it be better (and cooler) for the system to figure it out on its own?

It seems like it can.

TL;DR version:

  • Different types of data (phone numbers, zip codes, first & last names, cities, etc.) demonstrate differences in the probability distributions of the lengths of strings, the characters in the string, and pairs of characters (bigrams) in the string.
  • These probability distributions can be considered as a many-dimensional vector which acts as a fingerprint for that type of data.
  • Those vectors/fingerprints can be compared (using cosine similarity) to classify the columns in an unknown document quite successfully.
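
As a rough illustration of the three points above, here is a minimal Python sketch. It is not the code from the full post: the function names (fingerprint, cosine_similarity, classify), the tiny training samples, and the choice to normalize each feature group separately are all assumptions made for the example.

```python
import math
from collections import Counter


def fingerprint(values):
    """Build a sparse vector of length, character, and bigram probabilities."""
    groups = {"len": Counter(), "char": Counter(), "bigram": Counter()}
    for value in values:
        groups["len"][len(value)] += 1
        groups["char"].update(value)
        groups["bigram"].update(a + b for a, b in zip(value, value[1:]))
    vector = {}
    # Normalize each group separately so each one is a probability distribution.
    for name, counts in groups.items():
        total = sum(counts.values())
        for key, count in counts.items():
            vector[(name, key)] = count / total
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(column, known):
    """Return the label whose fingerprint is most similar to the column's."""
    target = fingerprint(column)
    return max(known, key=lambda label: cosine_similarity(target, known[label]))


# Made-up sample data, far smaller than anything you would use in practice.
known = {
    "phone": fingerprint(["555-867-5309", "212-555-0143", "415-555-0199"]),
    "zip": fingerprint(["90210", "10001", "60614"]),
    "email": fingerprint(["ann@example.com", "bob@example.org"]),
}

print(classify(["303-555-0112", "646-555-0177"], known))
print(classify(["carol@example.net"], known))
```

With real data you would build each known fingerprint from thousands of values rather than two or three, but the comparison step stays the same.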

I’ve tested this with first & last names, address strings, city names, state abbreviations, phone numbers, email addresses, and URLs, and also with categorizing text by language. It could even tell ASCII from EBCDIC. I believe it can do a lot more.

Long version with graphs and code and stuff:
Continue reading

Early in my exploration of robotics on the Arduino platform, I ran into a number of issues simultaneously that needed addressing before I could move forward like I wanted:

  1. The setup()/loop() model of programming was pretty simplistic. I was thinking in terms of multiple subsystems all doing their own thing and interacting, but the Arduino just gives you “do this, over and over”.
  2. loop() starts executing immediately, but especially with something like a robotics project you want a chance to disconnect the USB cable and put the robot on the floor before it gets started (or get to a safe distance, depending on your application).
  3. Sometimes you want to shut the robot down from a distance, when it gets out of control.

In my previous blog post I showed how to record and then recognize an IR message from a remote control button press. In this post I’m going to introduce the scheduling framework I am using, and show how to use the remote control to switch between a dormant ‘sleep mode’ (the robot sits quietly while an LED blinks) and the normal operating ‘awake’ mode, in which it follows its programming.

Humanity is safe as long as the blue LED is blinking

Follow me through the jump if you want to see how I did it. Or just check out the completed sketch [GitHub].
Continue reading

In this post I’m going to demonstrate code to record the pulse-width timings coming from an IR sensor and use that to record a button-press from a TV remote. Then I’ll turn that around and use it to identify that same button-press later on, using an interrupt to make sure your robot gets your command. If you’re a TL;DR sort of person, consult the ir_capture.ino sketch to capture the pattern for a given button, and the ir_identify.ino sketch to respond when that specific pattern is detected.
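
The matching idea can be sketched in a few lines. This is Python for illustration only, not the contents of ir_capture.ino or ir_identify.ino; the function name matches_pattern, the 25% tolerance, and the pulse widths are all hypothetical.

```python
def matches_pattern(recorded, observed, tolerance=0.25):
    """Return True if every observed pulse is within tolerance of the recorded one."""
    if len(recorded) != len(observed):
        return False
    # Compare pulse by pulse, allowing a relative error on each width.
    return all(abs(o - r) <= tolerance * r for r, o in zip(recorded, observed))


# Hypothetical pulse widths in microseconds, not real captures.
RECORDED_BUTTON = [9000, 4500, 560, 560, 560, 1690, 560, 560]

same_button = [9050, 4430, 600, 540, 575, 1650, 548, 590]
other_button = [9000, 4500, 560, 1690, 560, 560, 560, 560]

print(matches_pattern(RECORDED_BUTTON, same_button))
print(matches_pattern(RECORDED_BUTTON, other_button))
```

Some tolerance is needed because the measured timings of the same button will vary a little from one press to the next.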

More after the jump, if you want the full story.
Continue reading

This is an ‘encore presentation’ of a post I originally wrote for my old blog — it was lost in the great blog fire of ’12 and (thank you Wayback Machine) is being edited and reposted here since it seemed to be pretty popular at the time. If you’re going to follow along with what I’ve done, give a quick look at Step 4 where I discover I’ve been using the wrong chip and have to change things up a bit.

The source code is hosted on GitHub here. There’s not really too much of it, but it’s worth making public for people.


I’ve long had an interest in experimental electronic music, so I’m excited that I have something to share in that arena.

In the past I’ve mentioned my wonderful wife (who is wonderful, if I didn’t say so), and for Christmas she doubly earned that distinction by buying me a Gakken SX-150 Analog Synthesizer. As far as a ‘kit’ goes, it isn’t much to speak of — just installing the pre-built board and speaker into the plastic case and wiring up the stylus controller — but it is such a simple design that it seems built to be hacked on, and that’s what I wanted to do.

I did find a number of cool SX-150 hacks, but often they were a bit more advanced than I’m ready for, so I figured I’d start with something simple and slowly build on it and make this a multi-part project. Since I’m really enjoying getting into Arduino programming, an Arduino-based sequencer seemed like a good candidate — so let’s get started!

Continue reading

I’m spending a bit of downtime between gigs learning some new stuff, which includes digging into Haskell. One of the articles I am reading referred back to this talk that I hadn’t seen in a year or so.

10 out of 10 for humour only programmers are going to get.


Edit: Thanks to an observant visitor to the site who noticed this video is gone. Now it’s only available here: https://www.destroyallsoftware.com/talks/wat

To inaugurate my latest attempt at a blog, I want to talk about something cool I’ve read about lately. I’ve received this question a couple times at interviews, and I think it was just to demonstrate that I could work with bitwise values.

For instance, take the number 57:

57 = 00111001

It’s easy enough to see that 57 has 4 bits set to 1. How do you check that in code? My answer has always been to loop through the number of bits (in this case 8, but usually 32), shifting to the right each time and counting how often the lowest bit comes up as a 1.

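value = 57
count = 0
# Examine each of the 8 bit positions in turn.
for i in range(8):
    # Shift bit i down to the lowest position and mask it off.
    count += (value >> i) & 1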

Effectively what I’m doing here is just looping through the 8 bits, shifting the value to the right, stripping off the least significant bit, and adding it to the total (if it’s 1, then it adds to the total).

Since I’m using Python for the example, I can abbreviate it a bit like this:

count1s = lambda x: sum((x >> i) & 1 for i in range(32))

print(count1s(57))
> 4

That’s more compact, certainly, but it’s still linear time with respect to the number of bits that you are examining, and that’s where today’s coolness comes into play. It turns out there’s a log time algorithm for doing this.

Because it can get a bit long when I show all my work long-hand, let me jump straight to the code that will give the results for an 8-bit integer in 3 steps (and could be extended to do 32 bits in 5 steps, and so on):

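x = 57
# Step 1: add adjacent bits, leaving a 2-bit count in each pair.
y = (x & 0x55) + ((x & 0xaa) >> 1)
# Step 2: add adjacent pairs, leaving a 4-bit count in each nibble.
y = (y & 0x33) + ((y & 0xcc) >> 2)
# Step 3: add the two nibbles to get the count for the whole byte.
y = (y & 0x0f) + ((y & 0xf0) >> 4)

print(y)
> 4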
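
For the 32-bit case mentioned above, the same pattern just needs wider masks and two more steps. Here is a sketch of how that might look; the function name popcount32 is my own, and the initial mask is only there because Python integers are not limited to 32 bits.

```python
def popcount32(x):
    """Count the set bits of a 32-bit value in five mask-and-add steps."""
    x &= 0xFFFFFFFF  # keep only the low 32 bits
    x = (x & 0x55555555) + ((x & 0xAAAAAAAA) >> 1)   # counts per 2 bits
    x = (x & 0x33333333) + ((x & 0xCCCCCCCC) >> 2)   # counts per 4 bits
    x = (x & 0x0F0F0F0F) + ((x & 0xF0F0F0F0) >> 4)   # counts per byte
    x = (x & 0x00FF00FF) + ((x & 0xFF00FF00) >> 8)   # counts per 16 bits
    x = (x & 0x0000FFFF) + ((x & 0xFFFF0000) >> 16)  # count for all 32 bits
    return x


print(popcount32(57))
```

Each step doubles the width of the groups being summed, which is where the log-time behaviour comes from: 32 bits need 5 steps, 64 bits would need 6.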

Follow through the link if you want the nitty gritty.
Continue reading