Cobe learns English

Or, a different sort of singularity.

By peter on May 26, 2011 | Permalink

8:47p

A fresh cobe brain starts completely empty. Eventually it will seem to know about vocabulary and sentence structure, but that’s all part of the essential excellence of the halgorithm.

This means you can interact with a fresh brain in any language. As long as it can split the input into words, it will generate sentences that mirror the structure of its input. I’m sure this isn’t a generalization that works for every written language, but it does a decent job in a lot of them.

In fact, cobe does somewhat better than MegaHAL, which only supported 8-bit character encodings. Cobe uses Python’s Unicode support and character tables to detect words, and in practice this works well.

This is one of its key features — access to surreal robot chatter is a basic human right — and I have been reluctant to introduce any features that require knowledge of any specific language.

However, I’m not above making a few tweaks here and there.

A little review: for a single generated reply, cobe:

  1. Identifies all of the words in its input.
  2. Produces a set of possible reply words by removing any words it hasn’t seen before.
  3. Picks one of the known words at random.
  4. Selects a Markov context containing that word.
  5. Performs a random walk on the two Markov chains from that context, keeping track of each new word along the way.

There’s an edge case where the input contains no known words, but in general this guarantees that one word in the input will be found in the generated reply.
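Here’s a rough sketch of that pipeline in Python. The helper names (tokenize, is_known, contexts_for, walk_backward, walk_forward, babble) are stand-ins for cobe’s internals, not its actual API:

import random

def generate_reply(brain, text):
    # 1. Identify all of the words in the input.
    words = brain.tokenize(text)

    # 2. Keep only the words the brain has already learned.
    known = [w for w in words if brain.is_known(w)]
    if not known:
        # Edge case: no known words, so babble from a random context.
        return brain.babble()

    # 3. Pick one of the known words at random.
    pivot = random.choice(known)

    # 4. Select a Markov context containing that word.
    context = random.choice(brain.contexts_for(pivot))

    # 5. Random walk the forward and backward chains from that context,
    #    keeping track of each new word along the way.
    before = brain.walk_backward(context)
    after = brain.walk_forward(context)

    return " ".join(before + list(context) + after)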

This works, but it would be nice if cobe responded more naturally. I want it to reply to the concepts in the input, but not necessarily using the exact same words.

In cobe 1.2.0, I’m introducing an optional feature that gives more natural responses at the expense of some of its language independence.

Conflation

Take a look at step 2 above. Until now, this has been the stage where cobe restricts the input words to the ones it has already seen. But if it expands that set to include similar, related words, we’ll get replies that can be about the same things without using the exact same words.

The clever bit is that this conflation can choose any words, as long as they have already been learned. Since cobe follows the Markov chains using the usual random walk, the replies will only ever include phrases that came from the learned text.

There are a few ways to generate related words. I’ve chosen a strategy called stemming, which has gotten a lot of use in text-based search.

Stemming reduces words to their base form. See this excerpt from the Wikipedia page:

A stemmer for English, for example, should identify the string “cats” (and possibly “catlike”, “catty” etc.) as based on the root “cat”, and “stemmer”, “stemming”, “stemmed” as based on “stem”. A stemming algorithm reduces the words “fishing”, “fished”, “fish”, and “fisher” to the root word, “fish”.
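PyStemmer, which cobe 1.2.0 uses (more on that below), exposes exactly this. A quick interactive check, assuming the package is installed:

>>> import Stemmer
>>> stemmer = Stemmer.Stemmer("english")
>>> stemmer.stemWords(["cats", "stemming", "stemmed", "fishing", "fished"])
['cat', 'stem', 'stem', 'fish', 'fish']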

Cobe will take the set of words from step 2 and add any learned words that share their stems; words with the same stem are considered synonyms.

An example: I’ll use the input text “This is a test of stemming.” on a cobe brain with some experience.

Without a stemmer, this yields these possible words:

This a is of stemming test

And with an English stemmer:

A IS Is OF Of Stem TEST THIS Test Tested This a is of ofs stem stemmed stemming stems test tested testes testing tests this

After conflation, there are many more options. Every word is one cobe has already learned, so they’re all in Markov contexts that can be followed to generate replies.

You’ll note that some of these vary only in case. I’ve chosen to store all the stems in lower case, so conflation can hop between cases. In theory, this can improve replies. In practice, it can be awesome.
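In code, the conflation step might look something like this. The stem index and helper names are illustrative rather than cobe’s actual internals; the key details are that stems are computed on lowercased words and that only already-learned words can be added:

from collections import defaultdict

import Stemmer

stemmer = Stemmer.Stemmer("english")

# Map each lowercased stem to every learned surface form, so "Test",
# "TEST", and "tested" all land under the key "test".
stem_index = defaultdict(set)

def learn_word(word):
    stem_index[stemmer.stemWord(word.lower())].add(word)

def conflate(known_words):
    # Expand the candidate set with every learned word sharing a stem.
    expanded = set(known_words)
    for word in known_words:
        expanded.update(stem_index[stemmer.stemWord(word.lower())])
    return expanded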

08:27 -!- cobe has quit [Ping timeout]
09:03 <steve> nothing like a power outage
09:04 -!- cobe has joined #zzz
09:04 <steve> cobe: welcome back
09:04 <cobe> steve: KILLING ME WON'T BRING BACK YOUR GODDAMNED HONEY

Here cobe has taken the input “welcome back”, conflated that to “BACK Back Backs WELCOME Welcome back backed backing backs welcome welcomed welcoming”, chosen BACK (randomly), and generated a reply confusing a normal power outage with malicious intent.

I’ve chosen to use the PyStemmer package, a port of the Snowball stemmer to Python, because it implements stemmers for several languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.
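If you want to check which stemmers your PyStemmer install provides, it can list its own algorithms (the list includes at least the languages above):

>>> import Stemmer
>>> Stemmer.algorithms()
['danish', 'dutch', 'english', ...]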

To configure a stemmer on an existing brain, close all clients that might be accessing its database and run:

$ cobe set-stemmer <language>

Where <language> is any of the languages listed above. Use lower case, since that’s what PyStemmer expects.

$ cobe set-stemmer english

Adding a stemmer has no effect on the text structure cobe has learned, so it doesn’t require any relearning. The set-stemmer command is idempotent, so you can run it more than once without affecting the list of stems.

Next steps might include using latent semantic analysis to generate lists of related words independent of language.