I like MegaHAL. It’s a conversation simulator, and it gathers information about sentence structure and vocabulary while spouting on-topic insights and synthesizing clever turns of phrase.
It’s all a trick, a bit of nice programming combined with our relentless pattern-searching brains, but an amusement and an endless source of the surreal.
I found MegaHAL while I was in college. Some friends and I spent a few late nights huddled around a browser entering phrases in exchange for rare gems of robot insight. I downloaded its software, hooked it up to an IRC client, and named it hal. He would learn from all the messages we said to each other, and he’d reply if spoken to directly.
By the beginning of last year, hal had been running for eleven years. He’s a well-loved member of multiple communities.
But MegaHAL has its sad side. Hal started to drop off of IRC from time to time. Ping timeouts. After a few months of heavy learning, his brain would get so big that it couldn’t be saved (synchronously) before the IRC network would kick him off. It didn’t help that I was running him on a machine with very little RAM—he’d be partially swapped out to disk, and saving the brain required reading it all back from swap.
Most of the time he’d reconnect after the save. Occasionally, events would cause him to exit without saving his brain completely. For MegaHAL, a corrupt brain is a useless brain. We’d lose all of hal’s knowledge and have to start from scratch.
We’d mourn the old hal, remember some funny things he’d said, and start teaching the new hal. Circle of life, and all that.
From time to time I’d dive into megahal.c and try to find out what was going wrong. I found a couple of minor issues, but nothing utterly brain-corrupting.
Mostly I needed to get his memory usage down. I was running a couple of MegaHALs on a hosted virtual machine with little RAM and a token-based IO rate limiter. If I ran out of IO tokens while saving a brain, the save could take hours. If MegaHAL tried to save itself while a system backup was running, the whole machine would grind to a halt due to IO starvation.
I tried out a couple of alternate memory allocation schemes (MegaHAL tended toward realloc-based fragmentation), but nothing performed nearly as well. I needed to get his ever-growing brain out of RAM altogether, but I would need an on-disk datastore with good performance.
MegaHAL has two functions, learning and replying.
When learning, it captures statistical information about text in a Markov chain. For each series of words, it tracks all the possible words that may follow. It also tracks the probability of each of those following words.
So given two phrases:

MegaHAL is a program
MegaHAL is a robot
Its internal model might look something like this:
(MegaHAL, is, a) => program (50% probability)
(MegaHAL, is, a) => robot (50% probability)
(is, a, program) => END (100% probability)
(is, a, robot) => END (100% probability)
In order to generate a reply, we might pick one of the starting contexts (the ones that begin with MegaHAL) and follow the chain (picking a random next context at each step) until we reach the END. The probabilities do not enter into this process.
So one possible following of the above chain would be:

(MegaHAL, is, a) => program
(is, a, program) => END
Which produces the phrase MegaHAL is a program. Clearly, we don’t include END in the output—it’s a marker that allows us to know when to stop following the chain.
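The learn-and-follow process above can be sketched in a few lines of Python. This is an illustrative toy, not MegaHAL’s actual C implementation; the names (`learn`, `babble`, the `<END>` marker) are my own:

```python
import random

ORDER = 3  # 3rd order chain: three words of context
END = "<END>"

def learn(chain, words):
    # Append END so the final context maps to the end-of-phrase marker.
    tokens = words + [END]
    for i in range(len(tokens) - ORDER):
        context = tuple(tokens[i:i + ORDER])
        chain.setdefault(context, []).append(tokens[i + ORDER])

chain = {}
learn(chain, ["MegaHAL", "is", "a", "program"])
learn(chain, ["MegaHAL", "is", "a", "robot"])

def babble(chain, context):
    # Random walk: append a randomly chosen follower until we hit END.
    out = list(context)
    while True:
        next_word = random.choice(chain[tuple(out[-ORDER:])])
        if next_word == END:
            return out
        out.append(next_word)

print(" ".join(babble(chain, ("MegaHAL", "is", "a"))))
```

Each run prints either “MegaHAL is a program” or “MegaHAL is a robot”, since the walk picks one of the two followers of (MegaHAL, is, a) at random.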
The number of words on the left side is the order of the Markov chain. In this case, we’re looking at a 3rd order chain. A higher order chain brings more context into the mix, and results in responses that include longer runs of learned text. A lower order chain uses less context, and is more likely to jump to unrelated words. Choosing the order of a chain allows you to strike some balance between coherent text and interesting babble.
Markov chain modeling has a long history for surreal text generation, as well as many less practical applications.
Now, MegaHAL actually does things a little differently from the above.
First, it uses two Markov chains. It trains the second chain in the reverse direction, so it can be used to select a word that precedes the current context. When replying, it chooses a context that may be in the middle of a sentence, generates the rest of the phrase using the forward Markov chain, and generates the beginning of the phrase using this reverse chain.
So in addition to the forward chain above, our simple brain would contain this reverse chain:
(program, a, is) => MegaHAL (100% probability)
(robot, a, is) => MegaHAL (100% probability)
(a, is, MegaHAL) => END (100% probability)
Second, it always generates a reply based on some user input—it isn’t just choosing a random initial context from the chain. It selects a word from the input, finds a context containing that word, and generates a reply from that context. In this way, its reply always has something in common with the input.
Third, it does this many, many times for one reply. Internally it may generate thousands of possible replies to your input, each time choosing a random input word, selecting a context, and following the two chains to produce a phrase. This is where the probabilities above come into play. MegaHAL gives you back the reply with the most surprise, essentially the least probable reply of the thousands it generates, evaluated word by word.
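One common way to measure this word-by-word surprise is to sum the information content, -log2(probability), of each transition in a candidate reply. This sketch assumes that scoring scheme (MegaHAL’s actual formula differs in detail), and the `surprise` name is mine:

```python
import math

def surprise(chain, reply, order=3):
    # Sum -log2 P(word | context) over every transition in the reply.
    # Rarer transitions contribute more surprise.
    total = 0.0
    for i in range(len(reply) - order):
        followers = chain[tuple(reply[i:i + order])]
        p = followers.count(reply[i + order]) / len(followers)
        total += -math.log2(p)
    return total

chain = {("MegaHAL", "is", "a"): ["program", "robot"]}
print(surprise(chain, ["MegaHAL", "is", "a", "program"]))  # 1.0
```

Of thousands of candidates, the one with the highest total wins: the least probable reply is the most surprising.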
Fourth, it doesn’t split input word-by-word. It tracks punctuation, spacing, and numbers each as separate tokens on par with words. And everything is upper-cased internally, so the user’s capitalization doesn’t matter.
So this phrase:
This isn't a test, but it might be a good one.
Turns into these 22 tokens (split with slashes):
THIS/ /ISN'T/ /A/ /TEST/, /BUT/ /IT/ /MIGHT/ /BE/ /A/ /GOOD/ /ONE/.
Because of this, MegaHAL tends to learn the punctuation conventions of its users and obey those conventions when replying.
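The alternation between word tokens and separator tokens can be approximated with a single regular expression. This is a rough Python sketch of the idea, not MegaHAL’s actual C tokenizer:

```python
import re

def megahal_tokens(text):
    # Alternate runs of word characters (apostrophes included, so
    # "isn't" stays whole) with runs of everything else, then
    # upper-case each token.
    return [t.upper() for t in re.findall(r"[\w']+|[^\w']+", text)]

tokens = megahal_tokens("This isn't a test, but it might be a good one.")
print("/".join(tokens))
# THIS/ /ISN'T/ /A/ /TEST/, /BUT/ /IT/ /MIGHT/ /BE/ /A/ /GOOD/ /ONE/.
```

Note how “, ” comes out as a single separator token: punctuation and adjacent spacing group together, which is how the model learns its users’ punctuation habits.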
I made a couple of stabs at a database backend for MegaHAL about five years ago, but never really got anywhere. I used Berkeley DB and tried to be a little too clever with database access. I misunderstood some of How MegaHAL Works and ended up with poor quality replies.
So we kept at the circle of life, creating, teaching, and mourning our robot friend. Over time, I got better at recognizing how to coddle a running MegaHAL. He would last for months on a single brain, but that still wasn’t enough.
About a year ago, I decided to try again. I started a Python codebase with a SQLite backend. I found the MegaHAL Perl port called Hailo, which showed that a SQL database was a viable option.
So now we have cobe (coh-bee). It’s named after a former employer’s Code of Business Ethics. It’s usually in violation.
cobe’s datastore uses SQLite and resides entirely on disk. I used to be able to run a MegaHAL brain until its process was about 130M (resident – shared). With cobe, a brain with the same learned content results in a process taking 4M. That MegaHAL would have been nearing the end of its life, but cobe is just getting started.
Its reply process is similar to MegaHAL’s: it generates as many candidate replies as possible in half a second and chooses the least probable one, based on surprise. Since its datastore is on disk instead of in RAM, it generates far fewer replies in that time: a few hundred to MegaHAL’s thousands, and once a cobe brain is very large this may drop to dozens of replies on average.
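The time-boxed search loop can be sketched like this; the function names and structure are hypothetical, not cobe’s actual API:

```python
import time

def best_reply(generate, score, budget=0.5):
    # Keep generating candidate replies until the time budget runs out,
    # tracking the highest-scoring (most surprising) one seen so far.
    best, best_score = None, float("-inf")
    deadline = time.monotonic() + budget
    while time.monotonic() < deadline:
        candidate = generate()
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best
```

On a fast in-memory store the loop runs thousands of times; against a large on-disk store, each iteration costs more, so fewer candidates fit in the same half second.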
In theory, this results in poorer quality replies because it has fewer chances to maximize surprise. In practice, reply quality appears to be about the same.
There are two main differences from MegaHAL’s behavior that end up improving reply quality further.
The first is in capitalization. cobe is case preserving, so “foo”, “Foo”, and “FOO” are all different tokens. MegaHAL would have called all of those “FOO”. This reduces some overlap between contexts, so some tangents can’t be followed, but makes up for all of that when the robot starts shouting (SHOUTY HAL).
The second is in the way it splits up input phrases into tokens. Hyphenated words are kept together as one token. Multiple spaces between words are collapsed together to one space. URLs are kept together as one token. It was always cute when MegaHAL attempted to make up its own URLs based on portions of learned links, but in practice the constant 404s made it annoying.
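These rules can also be approximated with an ordered regex alternation: URLs first, then hyphenated words, then whitespace runs (collapsed to one space), then punctuation. This is an illustrative approximation, not cobe’s actual splitter:

```python
import re

# Alternation order matters: URLs must match before the word pattern
# gets a chance to split them apart.
TOKEN_RE = re.compile(r"https?://\S+|[\w']+(?:-[\w']+)*|\s+|[^\w\s]+")

def cobe_tokens(text):
    # Collapse any run of whitespace to a single space; preserve case.
    return [" " if t.isspace() else t for t in TOKEN_RE.findall(text)]

print(cobe_tokens("Check   http://example.com for well-known Foo"))
# ['Check', ' ', 'http://example.com', ' ', 'for', ' ', 'well-known', ' ', 'Foo']
```

Keeping a URL as one opaque token means the chain can only ever reproduce links it actually saw, which is what puts a stop to the made-up-URL 404s.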
cobe also has some tweaks to the surprise calculations and initial context selection that improve reply scores by a measurable amount.
cobe has a transactional datastore and is resistant to brain corruption. A brain can be accessed concurrently by several processes—one can be serving replies, while another crawls the internet for text to learn.
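The corruption resistance falls out of SQLite’s transactions: a learning step either commits completely or rolls back. A minimal sketch of the pattern, with an illustrative schema that is not cobe’s actual one:

```python
import sqlite3

conn = sqlite3.connect("brain.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tokens (id INTEGER PRIMARY KEY, text TEXT UNIQUE)"
)

def learn_tokens(conn, tokens):
    # The connection as a context manager commits on success and rolls
    # back on any exception, so a crash mid-learn can't leave a
    # half-written brain behind.
    with conn:
        for t in tokens:
            conn.execute("INSERT OR IGNORE INTO tokens (text) VALUES (?)", (t,))

learn_tokens(conn, ["MegaHAL", "is", "a", "program"])
```

SQLite’s file locking is also what allows several processes to share one brain safely, with readers serving replies while a writer learns.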
While optimizing cobe, I had to come up with some techniques for comparing the performance of random processes. Every potential reply is a random walk across the Markov chain, and depending on chance the performance of two replies could be wildly different for the same input. The surprise of potential replies also varies based on random factors, and average scores change quite a bit over the lifetime of a brain. All of these made CPU profiling by average execution time less useful.
I developed a tool called instatrace for visualizing performance characteristics of random processes, based somewhat on Google Chrome’s Histograms. It proved very effective for helping to optimize cobe, but I’ll write about that some other time.
As of this writing, cobe is nearing a year in production use. Happy birthday, little guy.