Project Notes.

Cobe 2.0

Or, it is what it is what it is what it is what it is what it is what it is.

By peter on September 19, 2011 | Permalink

Step 3: Braze a luglet to a tiny head tube.

I’m releasing Cobe 2.0, with a major conceptual change in its data model and a more flexible scoring system. To explain the changes, let’s get a little deeper into Cobe 1.x.

Cobe 1.x

When Cobe 1.x learns an input phrase, say “this is a test”, it does the following:

First, it splits the phrase into tokens. Each run of alphanumeric characters becomes its own token, and each run of the characters between them (spaces and punctuation) becomes another.

So “this is a test” becomes seven tokens:

seven tokens
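
A minimal sketch of that split (the regex is my guess at the behavior described here, not the actual 1.x code):

    import re

    def tokenize(text):
        # Runs of word characters and runs of everything else (spaces,
        # punctuation) each become their own token.
        return re.findall(r"\w+|\W+", text, re.UNICODE)

    print(tokenize("this is a test"))
    # ['this', ' ', 'is', ' ', 'a', ' ', 'test']  -- seven tokens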

Cobe 1.x groups these tokens into contexts of the Markov order m (five by default) and adds special END tokens that signal the beginning and end of a phrase:

grouped tokens

Each group is then annotated with the token that follows it:

grouped tokens, with next

And the previous token:

grouped tokens, with previous
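
In rough Python terms, the bookkeeping looks something like this (a sketch of the idea, not the actual 1.x code or schema):

    END = "<END>"  # sentinel marking the beginning or end of a phrase

    def contexts(tokens, order=5):
        # Pad the token list with END markers, slide a window of `order`
        # tokens across it, and note the token before and after each group.
        padded = [END] + tokens + [END]
        groups = []
        for i in range(len(padded) - order + 1):
            group = tuple(padded[i:i + order])
            prev_token = padded[i - 1] if i > 0 else None
            next_token = padded[i + order] if i + order < len(padded) else None
            groups.append((prev_token, group, next_token))
        return groups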

As more phrases are learned, Cobe keeps track of all the words that have ever followed these contexts. After a while, one might look like:

groups with edge probabilities

Here you can also see each link annotated with its probability in the learned text. For phrases beginning “this is ”, “this is a” occurred 75% of the time, “this is another” occurred 17% of the time, and “this is third” occurred 8%.

When generating a reply, Cobe 1.x picks a random next word, shifts it into the context, and looks for the words that follow that new context.
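
In sketch form (the names and reply assembly are mine; the real 1.x code also walks the backward chain):

    import random

    END = "<END>"  # sentinel marking the end of a phrase

    def babble_forward(context, following):
        # `following` maps a context tuple to the list of tokens observed
        # after it; a token appears once per observation, so random.choice
        # is effectively weighted by frequency.
        words = list(context)
        while True:
            choices = following.get(context)
            if not choices:
                break
            token = random.choice(choices)
            if token == END:
                break
            words.append(token)
            context = context[1:] + (token,)  # shift the new token in
        return "".join(words)

Since 1.x keeps whitespace as separate tokens, simply concatenating the walk reproduces the spacing.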

In functional terms, this works exceedingly well. But there’s no denying that it’s complex. Each context is clearly related to all of the others, but they’re not explicitly connected to each other—the connections are to the next word.

Maintaining the forward and backward chains independently means it’s difficult to visualize the brain as a whole, and since spaces are separate tokens, half of the useful context is taken up tracking whitespace.

Cobe 2.0

Cobe 2.0 makes two major changes to brain storage that reduce complexity, ease visualization, improve processing speed, and open up a world of existing general-purpose algorithms.

My first change was to connect contexts directly to each other. Since each is already stored in the database, this is just a matter of creating a table that links the previous context to its next context.

Astute readers will note that this creates a graph. Each context is a graph node, and the graph’s edges are the links between them. Using a directed graph, Cobe ends up learning “this is a test” like this:

simple graph
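
In storage terms, this comes down to a nodes table and an edges table. Something like the following captures the idea (the table and column names here are illustrative assumptions, not the actual cobe 2.0 schema):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE nodes (
            id INTEGER PRIMARY KEY,
            context TEXT UNIQUE                      -- the tokens in this context
        );
        CREATE TABLE edges (
            prev_node INTEGER REFERENCES nodes(id),  -- context on the left
            next_node INTEGER REFERENCES nodes(id),  -- context that follows it
            count INTEGER NOT NULL DEFAULT 1         -- times this transition was seen
        );
    """)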

Take note: since the edges of the graph are directed, this same data structure can be used to walk either forward or backward. No more training two chains!

At this point, we’re still looking at the equivalent of a 5th order Markov chain. I wanted to reduce the Markov order by eliminating the whitespace that’s being stored in the contexts.

Other Markov-based chatbots take different strategies here. Some group spaces and punctuation together as the tokens that separate words, like Cobe 1.x. Some tokenize punctuation and spaces separately, but still treat spaces as normal tokens. Some attach spaces as leading or trailing properties of words and reconstruct them when joining the words back together.

Cobe 2.0 takes a different strategy. It eliminates all spaces from the nodes and makes them a property of the graph edges. Combined with the graph layout, this is a major breakthrough. It keeps the database size small and the nodes much more dense with information.
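
Reassembling reply text then just means consulting that edge property while walking. A tiny sketch, assuming each step hands back the word plus the edge's space flag:

    def join_walk(steps):
        # `steps` is a list of (word, has_space) pairs gathered during a
        # graph walk; has_space is the edge property saying whether the
        # original text had whitespace after this word.
        parts = []
        for word, has_space in steps:
            parts.append(word)
            if has_space:
                parts.append(" ")
        return "".join(parts).rstrip()

    print(join_walk([("this", True), ("is", True), ("a", True), ("test", False)]))
    # this is a test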

Below is a dump of a 2nd order Cobe 2.0 brain that has been trained with four phrases:

  1. this is a test
  2. this is another test
  3. another test would help
  4. hal hal hal

The empty node at the top is a special node that marks where phrases begin and end. It serves the same purpose as the special END tokens in Cobe 1.x. Open arrowheads show nodes that were learned with spaces between them.

a full brain graph

Looking at this, you can easily see how a reply might be generated. Cobe follows a random arrow out of each node it visits, building a list of reply words as it goes.

When generating a reply with 2.0, all of the database queries are on the edges table alone. It doesn’t even matter what Markov order the contexts are anymore. Indeed, different contexts could have different numbers of tokens. “Raining cats and dogs” could be identified as a single concept (or collocation), and it could be treated as a single word for replies.

If you follow the graph above, it’s clear that you could get a reply like “this is another test would help” or “hal hal hal hal hal hal hal hal hal hal hal”.
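
Here is a toy version of that walk, over an in-memory adjacency map standing in for the edges table (the real thing also walks backward and pulls spacing from the edge properties):

    import random

    END = ()  # the special empty node at the ends of the graph

    def random_walk_forward(graph, node):
        # Follow a random outgoing edge from each node until the END node
        # is reached; in this toy, adjacent contexts overlap by all but
        # one token, so each step contributes one new word.
        words = list(node)
        while True:
            node = random.choice(graph[node])
            if node == END:
                return words
            words.append(node[-1])

    graph = {
        ("this", "is"): [("is", "a"), ("is", "another")],
        ("is", "a"): [("a", "test")],
        ("is", "another"): [("another", "test")],
        ("a", "test"): [END],
        ("another", "test"): [END, ("test", "would")],
        ("test", "would"): [("would", "help")],
        ("would", "help"): [END],
    }

    print(" ".join(random_walk_forward(graph, ("this", "is"))))
    # e.g. "this is another test would help"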

Being able to say quantitatively that the first is a bad reply and the second is a great reply is a question for reply scoring, and I’ve laid the groundwork for some changes there too.

Scoring changes

Cobe (both old and new) will generate as many replies as possible during half a second. These are evaluated by a scoring routine, which determines which of the candidates is best. At the end of the half second, the best candidate so far becomes the result.

Cobe 1.x used a scorer based on MegaHAL’s evaluate_reply(), which adds together the self-information of all reply tokens that also appeared in the input. This means that more surprising replies score higher.

MegaHAL’s scorer then applies a couple of dividers to discourage overly long replies.
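
In outline, that scorer looks something like this; probability() is a stand-in for the model's estimate of P(token | preceding context) from the learned counts:

    import math

    def information_score(reply_tokens, input_tokens, probability):
        # Sum the self-information, -log2 P(token | context), of every
        # reply token that also appeared in the input; rarer (more
        # surprising) overlaps contribute more to the score.
        input_words = set(input_tokens)
        score = 0.0
        for i, token in enumerate(reply_tokens):
            if token in input_words:
                p = probability(reply_tokens[:i], token)
                score += -math.log(p, 2)
        return score

The dividers from the previous paragraph would then scale this sum down as replies get longer.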

Recently I’ve been looking more closely at the candidate replies. Many of them are subjectively better than the result, and I’d like to explore scoring for the next few releases.

For Cobe 2.x, I have introduced reply scoring based on several signals. The information content of a reply is one good metric, but there are a lot of other things that might contribute to a good reply. Some examples:

  • discouraging overly long replies (as MegaHAL’s scorer does)
  • discouraging replies that are too similar to the input
  • discouraging replies that are too similar to the last N responses
  • encouraging excited or shouting replies
  • encouraging double entendre
  • encouraging (or discouraging?) replies that loop back on themselves
  • preserving the word sense found in the input

There’s a lot of room for playing around with scoring.

Performance

Cobe 2 is quite a bit faster than 1.x, and a Cobe 2 brain is roughly half the size of the same 1.x brain.

Learning the Brown Corpus, 500 samples of English language text, Cobe 2.0 is about five times faster than 1.2. I adjusted the Cobe 1.2 code to use the same database parameters as 2.0, to get a more accurate comparison of the core learning algorithm:

Cobe 1.2 (adjusted) - 224M brain file
real 18m25.908s
user 5m6.115s
sys  0m36.146s

Cobe 2.0 - 102M brain file
real 3m24.447s
user 2m35.319s
sys  0m19.854s

Reply speed is difficult to measure, as it’s dependent on random factors, but I’m seeing at least double the number of candidate replies in the new code.

My largest brain in production has learned about 230,000 phrases, for 1.8 million edges between 1.4 million nodes. This is better than Cobe 1.x could manage, but there’s still room for improvement. Learning one million phrases is my next scaling goal.

Next steps

Where to go from here?

First, explore new scorers. There are a few listed above, plus general-purpose ideas like a Bayesian classifier scorer. That would allow you to train Cobe on a list of good and bad replies.

The scorers are sometimes at odds with each other—you might want to encourage reply length in one case, brevity in another. It might make sense to build reply profiles out of a set of weighted scorers and score each set of candidate replies based on a randomly chosen profile. (e.g. “SHORT AND SHOUTY” or “long and repeating and repeating and repeating and repeating and repeating”)
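
A sketch of that idea; the scorer names and weights here are made up for illustration:

    import random

    # Each profile weights a set of scorer functions (reply -> float).
    PROFILES = {
        "SHORT AND SHOUTY": {"brevity": 2.0, "shouting": 1.5, "information": 1.0},
        "long and repeating": {"length": 2.0, "repetition": 1.0, "information": 0.5},
    }

    def profile_score(reply, scorers, profile):
        # Combine the individual scorer outputs using the profile's weights.
        return sum(weight * scorers[name](reply)
                   for name, weight in profile.items())

    def pick_profile():
        # Choose one profile at random for this round of candidate replies.
        return random.choice(list(PROFILES.values()))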

Second, continue improving the speed of candidate replies. Assuming the scorers are going to select the best response, Cobe needs to make candidates as quickly as possible.

I have two strategies planned for faster replies. Currently, Cobe performs a random walk from scratch for every candidate. This could be made a lot faster by turning the walk into a generator, searching the tree for the END token and yielding a reply every time it’s found.
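
Something like this generator, over an adjacency map like the toy graph above; the names are mine, and a real version would need a depth limit to survive loops like “hal hal hal”:

    import random

    END = ()  # the special empty node at the ends of the graph

    def reply_walks(graph, node, prefix=()):
        # Depth-first search from `node`, visiting outgoing edges in random
        # order and yielding a finished reply every time the END node is
        # reached -- no restarting from scratch for each candidate.
        words = (prefix + (node[-1],)) if prefix else tuple(node)
        edges = graph[node]
        for nxt in random.sample(edges, len(edges)):
            if nxt == END:
                yield words
            else:
                for reply in reply_walks(graph, nxt, words):
                    yield reply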

The second strategy is to start generating replies in parallel. Cobe is easily distributed across a large number of computers.

Have fun out there.

Cobe learns English

Or, a different sort of singularity.

By peter on May 26, 2011 | Permalink

8:47p

A fresh cobe brain starts completely empty. Eventually it will seem to know about vocabulary and sentence structure, but that’s all part of the essential excellence of the halgorithm.

This means you can interact with a fresh brain in any language. As long as it can split the input into words, it will generate sentences that mirror the structure of its input. I’m sure this isn’t a generalization that works for every written language, but it does a decent job in a lot of them.

In fact, cobe does somewhat better than MegaHAL, which only supported 8-bit ASCII. Cobe uses Python’s Unicode support and character tables to detect words, and in practice this works well.
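
Roughly, the word detection amounts to something like this (a sketch, not the actual cobe tokenizer):

    import re

    # With re.UNICODE, \w matches letters and digits from any script, so
    # the same word/non-word split works for non-English input.
    WORD_RE = re.compile(r"\w+|\W+", re.UNICODE)

    tokens = WORD_RE.findall(u"это просто проверка")
    # ['это', ' ', 'просто', ' ', 'проверка']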

This is one of its key features — access to surreal robot chatter is a basic human right —...

...read more Cobe learns English.

Hammond organ simulation

Or, roto.

By peter on May 22, 2011 | Permalink

Teatime 2

If you’re like most people, you want to know how to simulate a Hammond B-3 organ using fixed-point arithmetic on an 8-bit microcontroller.

First, you’re going to need tone generators. In a Hammond, this is a set of 91 tonewheels. Each is positioned adjacent to an electromagnetic pickup, inducing a sinusoidal current at a specific frequency. On a microcontroller, we can simulate the tonewheel assembly with a set of 91 sine oscillators. Take a look at bleep bloop for direct digital synthesis, one method of sine wave generation. We’ll be using it below.

A B-3 has two keyboards, called manuals on an organ. When you press a key, nine of these oscillators connect to the audio output. Each corresponds to a note that’s harmonically related to the key you pressed (including that...

...read more Hammond organ simulation.

New noises from MidiVox

Or, bleep bloop.

By peter on February 20, 2011 | Permalink

Skyline

I don’t like lugging my laptop up to my MIDI keyboard every time I want to play music, so I have been writing a MIDI synth that uses an Arduino and Collin Cunningham’s MidiVox Shield.

The MidiVox shield comes with software called Healer, a synthesizer with several waveforms, a resonant filter, and traditional envelope controls. Healer is nice, but it’s monophonic: it only plays one note at a time. I want a polyphonic synth, so I can play crazy chords.

Here are some notes on making a polyphonic MIDI synth from scratch.

Step 1: beeeeeeeeeeeeeeeeeeee

First step? Make a noise. There are a number of ways to get sound output from an Arduino, and MidiVox is an excellent option. It has a high quality, 12-bit digital to analog converter on board: the ...read more New noises from MidiVox.

Profiling with histograms

Or, you never should have registered with instatrace.

By peter on February 15, 2011 | Permalink

Logjammin' 2

When the time came to improve cobe’s performance, it was easy to measure how well it was learning new text—the learning process is database intensive, but deterministic for a given input. I could use Python’s built-in profiler, find the problem spots, and optimize as usual.

When I tried to use the same tools on cobe’s reply routines, I started to have trouble. A reply is a random walk across its Markov chains, repeated as many times as possible within a half-second time slice. I could use the same input data on the same brain of learned text and get wildly different results, depending on which input word was (randomly) chosen to follow and the state of the chains from that point.

And since each reply always runs for half a second, the...

...read more Profiling with histograms.

cobe

Or, some of my oldest friends are robots.

By peter on February 13, 2011 | Permalink

I Expected More From You, Citrus Degreaser

I like MegaHAL. It’s a conversation simulator, and it gathers information about sentence structure and vocabulary while spouting on-topic insights and synthesizing clever turns of phrase.

It’s all a trick, a bit of nice programming combined with our relentless pattern-searching brains, but an amusement and an endless source of the surreal.

I found MegaHAL while I was in college. Some friends and I spent a few late nights huddled around a browser entering phrases in exchange for rare gems of robot insight. I downloaded its software, hooked it up to an IRC client, and named it hal. He would learn from all the messages we said to each other, and he’d reply if spoken to directly.

At the beginning of last year,...

...read more cobe.

Screen and ssh-agent

By peter on February 23, 2006 | Permalink

My Biggest

I started using Screen on a daily basis, and quickly wanted an ssh-agent tied to a long-lived screen session.

The obvious solution (running "ssh-agent screen") doesn't work—as soon as you log out, screen drops into the background and you lose the agent.

People have solved this problem in various ways—a quick search shows some of them. I didn't like them. They all rely on scripts external to screen for running the agent and maintaining its environment variables.

I spent a little time last night on a solution. It's completely contained in your .screenrc, and produces an ssh-agent tied to screen's life cycle.

It's just two lines:

 setenv SSH_AUTH_SOCK $HOME/.screen-ssh-agent
 screen 10 ssh-agent -a $SSH_AUTH_SOCK $SHELL

This creates a screen (number 10) that holds your ssh agent. If you...

...read more Screen and ssh-agent.

Persecution Complex

By peter on December 16, 2005 | Permalink

LA 8-BALL!

If you are offended by discussion about Emacs, please stop reading now.

Over the last few years, Dan Mills and I have been improving an Emacs configuration style we both enjoy using. There has recently been some interest in this setup internally, so I thought I'd pull some notes together and send it out into the world.

There were two original goals for this design:

  1. A good, clean configuration—none of the mess that usually clutters .emacs files.
  2. Easy deployment of new elisp packages.

The configuration comes down to three things.

First, all the elisp files I own, either packages I've downloaded or things I've written myself, should exist under one tree. This eases maintenance, and allows me to keep everything in Subversion.

Second, all the relevant directories in that tree...

...read more Persecution Complex.