Life as a software engineer pretty much requires working with data. It might be in a database, or stored in files, or streamed from the network, or beamed down from space, but it’s unavoidable.

If you are dealing with data sent from a 3rd party, things can get tricky. You can’t guarantee that they named things properly, adhered to common-sense techniques, or even that they’ll be using a character encoding you recognize (true story: a health insurance company didn’t bother to mention that the data would be sent in EBCDIC instead of ASCII… sigh).

Things only get worse the more sources of data you have. At a recent gig, we had to merge hundreds of different lists together, with no guaranteed consistency from one to the next. Each has a core set of information that we need to match against our master database, but each file can be quite different. A lot of differences we can code for, but so far we’ve needed someone (either us or the customer) to manually tag the incoming file to say “This column is a phone number, that column is the city name” so that it can progress through the system. Wouldn’t it just be better (and cooler) for the system to figure it out on its own?

It seems like it can.

TL;DR version:

  • Different types of data (phone numbers, zip codes, first & last names, cities, etc.) demonstrate differences in the probability distributions of the lengths of strings, the characters in the string, and pairs of characters (bigrams) in the string.
  • These probability distributions can be considered as a many-dimensional vector which acts as a fingerprint for that type of data.
  • Those vectors/fingerprints can be compared (using cosine similarity) to classify the columns in an unknown document quite successfully.

I’ve tested this with first & last names, address strings, city names, state abbreviations, phone numbers, email addresses, and URLs, and also with categorizing text by language. It could even tell ASCII from EBCDIC. I believe it can do a lot more.
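
To make the TL;DR a bit more concrete, here is a minimal Python sketch of the idea. It is not the code from the full post: the function names (`fingerprint`, `cosine_similarity`, `classify`), the choice to normalize each feature group separately, and the tiny sample lists are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def fingerprint(values):
    """Build a sparse vector of length, character, and bigram frequencies."""
    groups = {"len": Counter(), "char": Counter(), "bigram": Counter()}
    for value in values:
        groups["len"][len(value)] += 1
        groups["char"].update(value)
        groups["bigram"].update(a + b for a, b in zip(value, value[1:]))

    # Assumption: each group is normalized on its own so it sums to 1.
    vector = {}
    for name, counts in groups.items():
        total = sum(counts.values())
        for key, n in counts.items():
            vector[(name, key)] = n / total
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(column, known):
    """Return the label whose fingerprint is most similar to the column's."""
    target = fingerprint(column)
    return max(known, key=lambda label: cosine_similarity(target, known[label]))


# Made-up training samples, far smaller than anything you would really use.
known = {
    "phone": fingerprint(["415-555-0132", "206-555-0199", "312-555-0175"]),
    "zip": fingerprint(["94110", "98101", "60614"]),
    "name": fingerprint(["Alice", "Robert", "Maria"]),
}

print(classify(["555-867-5309", "212-555-0147"], known))  # phone
```

With real data you would build each known fingerprint from thousands of values rather than three, which is what makes the distributions stable enough to act as a fingerprint.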

Long version with graphs and code and stuff:
Continue reading

Early in my exploration of robotics on the Arduino platform, I ran into a number of issues simultaneously that needed addressing before I could move forward like I wanted:

  1. The setup()/loop() model of programming was pretty simplistic. I was thinking in terms of multiple subsystems all doing their own thing and interacting, but the Arduino just gives you “do this, over and over”.
  2. loop() starts executing immediately, but especially with something like a robotics project you want a chance to disconnect the USB cable and put the robot on the floor before it gets started (or get to a safe distance, depending on your application).
  3. Sometimes you want to shut the robot down from a distance, when it gets out of control.

In my previous blog post I showed how to record and then recognize an IR message from a remote control button press. In this post I’m going to introduce the scheduling framework I am using, and show how to use the remote control to switch between a dormant ‘sleep mode’ (the robot sits quietly while an LED blinks) and the normal ‘awake’ mode, in which it follows its programming.

Humanity is safe as long as the blue LED is blinking

Follow me through the jump if you want to see how I did it. Or just check out the completed sketch [GitHub].
Continue reading

In this post I’m going to demonstrate code to record the pulse-width timings coming from an IR sensor and use that to record a button-press from a TV remote. Then I’ll turn that around and use it to identify that same button-press later on, using an interrupt to make sure your robot gets your command. If you’re a TL;DR sort of person, consult the ir_capture.ino sketch to capture the pattern for a given button, and the ir_identify.ino sketch to respond when that specific pattern is detected.

More after the jump, if you want the full story.
Continue reading

This is an ‘encore presentation’ of a post I originally wrote for my old blog — it was lost in the great blog fire of ’12 and (thank you Wayback Machine) is being edited and reposted here since it seemed to be pretty popular at the time. If you’re going to follow along with what I’ve done, give a quick look at Step 4 where I discover I’ve been using the wrong chip and have to change things up a bit.

The source code is hosted on GitHub here. There’s not really too much of it, but it’s worth making public.

I’ve long had an interest in experimental electronic music, so I’m excited that I have something to share in that arena.

In the past I’ve mentioned my wonderful wife (who is wonderful, if I didn’t say so), and for Christmas she doubly earned that distinction by buying me a Gakken SX-150 Analog Synthesizer. As far as a ‘kit’ goes, it isn’t much to speak of — just installing the pre-built board and speaker into the plastic case and wiring up the stylus controller — but it is such a simple design that it seems built to be hacked on, and that’s what I wanted to do.

I did find a number of cool SX-150 hacks, but often they were a bit more advanced than I’m ready for, so I figured I’d start with something simple and slowly build on it and make this a multi-part project. Since I’m really enjoying getting into Arduino programming, an Arduino-based sequencer seemed like a good candidate — so let’s get started!

Continue reading

This is a brilliant presentation, given by Bret Victor, that really highlights the moribund ‘state of the art’ of software development.

The conceit, that he’s giving his talk in 1973 and looking forward at the future of software development 40 years hence, wears just a tiny bit thin. But given what was happening in the mid-70s, you can’t deny that we have generally locked ourselves into a specific and sub-optimal model of development.

You also can’t deny that a lot of what we take for granted now has its roots almost 45 years in the past. If you doubt that, check this out: “The Mother of All Demos” from Dec 9, 1968:

Next time you’re feeling a little proud of that site you built, consider for a minute what these guys did with stone tablets and chisels.

I’m spending a bit of downtime between gigs learning some new stuff, including digging into Haskell. One of the articles I am reading referred back to this talk, which I hadn’t seen in a year or so.

10 out of 10 for humour only programmers are going to get.

Edit: Thanks to an observant visitor to the site who noticed this video is gone. Now it’s only available here:

This is what Gates McFadden (the actress who played Dr. Beverly Crusher on Star Trek: The Next Generation) does:

Here’s one more, but really, just head over to Ensemble Studio Theatre LA (where she is the artistic director) to see more. Also you can follow her on Twitter.

Also, I just saw that, quite surprisingly to me, the two-part cliffhanger “The Best of Both Worlds” (in which Picard is assimilated) will be released as a single 2-hour remastered, re-CGed event and shown in theaters on April 25! It will be released on Blu-ray on April 30, as will the entirety of Season 3.

Anyway, here’s another cool picture. Now go check out the full set.

In our office, we have an old Robotron:2084 machine. For those not old enough (or too old) to remember it, this is it here:


It featured 2 joysticks (one to move, one to shoot) and about a billion enemies on screen at the start of each level. It’s not uncommon for a single life to last only a second or two. It’s an intense and surprisingly physical game (which turns out to have been part of the plan).

The machine itself was non-functional for pretty much the entire time I’ve worked here, but it was brought back to life recently and the entire company has been competing to see who is the Robotronyist — for the record, it’s Eric. But he’s stopped playing since the right (shoot) joystick has started acting flaky, which is giving me my opening to improve. I don’t let mechanical issues stop me (just ask anyone who saw me playing Galaga after the screen died… that certainly made it more challenging).

It has a 6809 processor running at 1 MHz, and it’s killer fast. There’s a lesson in there somewhere.

A few years back, I spent a bunch of time playing around with Arduino microcontroller projects as a way of getting into hardware & embedded software development. It was a lot of fun and I wrote a number of blog posts and pages on my projects.

I lost all those blog posts in a database incident, and I just kind of left the site sitting dead for a while. But then I ended up on my Google Analytics account and discovered that I had one page that was receiving a fair amount of traffic, 100+ visits a day!

It was a simple calculator that I put together to determine the right components to use for a timing circuit with the 555 Timer IC. It approaches things a little differently from other web resources in that it starts with what most people want (the timing) and gives you what you need (the right components). Most online calculators just implement a simple equation: they take the parts you have and tell you what the timing would be. I always felt that I was solving a problem that those other tools weren’t, and it seems like hundreds of people a week feel the same way.
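
As a rough illustration of working backwards from the timing to the parts, here is a small Python sketch for a 555 in astable mode. It is not the calculator’s actual code. It assumes the standard datasheet relations t_high = ln(2) × (R1 + R2) × C and t_low = ln(2) × R2 × C, and it assumes you pick the capacitor yourself. The function names are hypothetical.

```python
from math import log


def astable_resistors(t_high, t_low, capacitance):
    """Return (R1, R2) in ohms for the wanted high/low times in seconds.

    Assumes the basic astable circuit, where the output is high for
    ln(2) * (R1 + R2) * C and low for ln(2) * R2 * C.
    """
    if capacitance <= 0 or t_low <= 0:
        raise ValueError("capacitance and t_low must be positive")
    if t_high <= t_low:
        # The basic circuit cannot produce a high time at or below the low time.
        raise ValueError("t_high must be longer than t_low")
    k = log(2) * capacitance
    r2 = t_low / k
    r1 = (t_high - t_low) / k
    return r1, r2


def astable_timing(r1, r2, capacitance):
    """The usual forward calculation: parts in, (t_high, t_low) out."""
    k = log(2) * capacitance
    return k * (r1 + r2), k * r2


# Example: 0.7 s high, 0.3 s low, with a 10 uF capacitor (values in farads).
r1, r2 = astable_resistors(0.7, 0.3, 10e-6)
print(f"R1 = {r1:.0f} ohms, R2 = {r2:.0f} ohms")
```

In practice you would then round to the nearest standard resistor values and run the forward calculation to see how far the timing drifts.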

So, when I started this new blog, I wanted to get the calculator ported over and updated. Moving to this new site also let me turn on comments, and I’ve already received several very nice comments from people. I’ve even had a couple of suggestions for ways to extend the tool and I’ve started making changes. It’s really nice to see people finding it, using it, and visiting the rest of the blog once they are here.

Who’d have thought that a throw-away project from a phase I went through a few years ago would be the thing that brings the most people to my new blog? Love it.

I’m honestly not much of a web comic guy. I read xkcd like everyone else, of course, and I had a brief infatuation with FreakAngels (though I have to go back and finish it sometime).

But nothing has quite caught my attention like Questionable Content. Jeph is at 2400 strips now, spanning the last 9-10 years, and I’ve spent hours at night in the last couple weeks getting caught up.

It’s been fun watching his drawing mature, but even if it were stick figures it wouldn’t matter. He’s built a cast of characters that — cliche though it might sound — you really care about. “Yay Hannelore!” I’ve said out loud, to my wife’s confusion.


Oh, also he has great taste in music and I’ve gone looking for a number of bands he has mentioned in his comics and ended up enjoying most of them.

Good work, Jeph. Keep it up, and I’ll be reading eagerly.