By peter on February 15, 2011
When the time came to improve cobe’s performance, it was easy to measure how well it was learning new text: the learning process is database-intensive, but deterministic for a given input. I could use Python’s built-in profiler, find the problem spots, and optimize as usual.
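For the learning side, that was just standard profiler work, something like this sketch using cProfile and pstats from the standard library (learn_corpus here is a stand-in for cobe’s real learning call, not its actual code):

import cProfile
import pstats

def learn_corpus():
    # Stand-in for the deterministic learning workload; with cobe this
    # would be feeding lines of a text corpus to the brain's learn call.
    for line in open("corpus.txt"):
        pass  # e.g. brain.learn(line)

# Profile the workload and show the 20 most expensive calls by
# cumulative time.
cProfile.run("learn_corpus()", "learn.prof")
pstats.Stats("learn.prof").sort_stats("cumulative").print_stats(20)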
When I tried to use the same tools on cobe’s reply routines, I started to have trouble. A reply is a random walk across its Markov chains, repeated as many times as possible within a half-second time slice. I could use the same input data on the same brain of learned text and get wildly different results, depending on which input word was (randomly) chosen to start the walk and the state of the chains from that point.
And since each reply always runs for half a second, the CPU usage of the reply routines would always expand to fill the same amount of time regardless of the actual performance.
If I could assume that the reply scoring algorithm was successfully picking the best candidate reply, generating more candidate replies in 0.5s would increase the chance of getting a better reply.
So maximizing the candidate reply count would improve reply scores. This gave me a place to focus.
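In sketch form, the reply loop looks something like this (generate_candidate and score are placeholders standing in for cobe’s random walk and scoring code, not its actual implementation):

import time

def reply(generate_candidate, score, time_slice=0.5):
    # Keep generating candidate replies until the time slice expires,
    # then return the best-scoring one.  The number of candidates we
    # managed to create is the interesting statistic here.
    deadline = time.time() + time_slice
    best, best_score, count = None, None, 0

    while time.time() < deadline:
        candidate = generate_candidate()
        candidate_score = score(candidate)
        count += 1

        if best_score is None or candidate_score > best_score:
            best, best_score = candidate, candidate_score

    return best, count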
But the reply counts would vary widely, and not just with different input. I might see 2000 candidates created in one reply, but the next reply (to the same input!) might yield 10. This typically depended on the word in the input chosen to start the random walk. Some portions of the Markov chain tended to produce candidates very quickly, and some tended to take much longer.
I wanted to focus on optimizing the average case, but needed a way to separate that out from the very fast and very slow paths.
At the time, litl had me working a bit in Chromium. Chromium measures the performance of hundreds of different operations as it’s running; some of these statistics are the ones sent if you’ve enabled “Automatically send usage statistics and crash reports to Google” in Google Chrome.
To see these, visit about:histograms in Chrome. Here’s an example:
Histogram: DNS.PrefetchQueue recorded 699 samples, average = 2.1, standard deviation = 15.9 (flags = 0x1)
0    ------------------------------------------------------------------------O (619 = 88.6%)
1    -O (6 = 0.9%) {88.6%}
2    -O (8 = 1.1%) {89.4%}
3    -O (6 = 0.9%) {90.6%}
4    O (1 = 0.1%) {91.4%}
5    -O (5 = 0.7%) {91.6%}
6    -O (5 = 0.7%) {92.3%}
7    O (2 = 0.3%) {93.0%}
8    -O (13 = 1.9%) {93.3%}
10   -O (14 = 2.0%) {95.1%}
12   O (1 = 0.1%) {97.1%}
14   O (5 = 0.7%) {97.3%}
17   O (2 = 0.3%) {98.0%}
20   O (4 = 0.6%) {98.3%}
24   O (0 = 0.0%) {98.9%}
29   O (1 = 0.1%) {98.9%}
34   O (0 = 0.0%) {99.0%}
40   O (1 = 0.1%) {99.0%}
48   O (3 = 0.4%) {99.1%}
57   ...
81   O (1 = 0.1%) {99.6%}
96   ...
268  O (2 = 0.3%) {99.7%}
318  ...
The top line shows the name of the statistic (DNS.PrefetchQueue) and some aggregate statistics. This looks like it’s the number of DNS entries awaiting prefetch resolution, probably sampled on a schedule or in response to some event. Maybe it’s sampled when a new name is enqueued for resolution.
The rest of the lines in the histogram are buckets the sampled values fall into. In this case, the prefetch queue size is almost always zero—619 of the 699 samples (88.6%) are in that bucket. Six times, the sampled value was 1. Five times, the value was at least 14 but less than 17.
But on two occasions, the queue length was huge—between 268 and 318. Maybe that’s not a big deal, since those account for 0.3% of the samples. But if we had been looking at the average queue length (2.1), we might not have guessed that it would ever get that long. Maybe those two samples signal a more fundamental issue that could be fixed.
So these histograms can be very useful for visualizing the distribution of sampled data.
I decided to log some similar data in cobe, and wrote instatrace to help produce the histograms. I wanted to minimize the burden on the programmer, so instatrace reads text data from log files. All you need to do is log lines that look like this:
Brain.reply_count 50
This is a single sample of the Brain.reply_count statistic, with a value of 50. The sample must have an integer value. Any text found after the value is ignored when building the histogram, but it can be valuable debugging information when tracking down the causes of outlying values.
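Getting those lines into a log is trivial. Here’s one way to do it with the standard logging module (the stats.log file and the trace() helper are my illustration here, not part of instatrace or cobe):

import logging

# Write bare "name value" sample lines to a separate stats log.
handler = logging.FileHandler("stats.log")
handler.setFormatter(logging.Formatter("%(message)s"))

stats_log = logging.getLogger("stats")
stats_log.addHandler(handler)
stats_log.setLevel(logging.DEBUG)

def trace(name, value, extra=""):
    # One sample per line: "<statistic> <integer value> [optional text]"
    line = "%s %d" % (name, value)
    if extra:
        line += " " + extra
    stats_log.debug(line)

trace("Brain.reply_count", 50)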
Optionally, instatrace can strip sample data from lines that contain the string “INSTATRACE:” followed by that format. Maybe the only way your program can output data involves a log format you can’t change, or maybe you’re mixing these samples into a stream that contains other data. In that case, this line would be read the same way:
[[2011-02-15 17:03:40]] log-debug: INSTATRACE: Brain.reply_count 50
Any lines that don’t contain the string “INSTATRACE:” are ignored when that flag (--filter) is used.
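The filtering behavior amounts to something like this (a rough sketch of the idea, not instatrace’s actual parser):

MARKER = "INSTATRACE:"

def extract_sample(line):
    # Ignore lines without the marker; otherwise read "name value ..."
    # from whatever follows it.
    if MARKER not in line:
        return None

    fields = line.split(MARKER, 1)[1].split()
    return fields[0], int(fields[1])

print(extract_sample("[[2011-02-15 17:03:40]] log-debug: INSTATRACE: Brain.reply_count 50"))
# ('Brain.reply_count', 50)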
So here’s a histogram of Brain.reply_count on a running cobe instance:
Histogram: Brain.reply_count recorded 1439 samples, avg = 328.3, std = 283.7
1     -O (29 = 2.0%)
2     --O (64 = 4.4%) {2.0%}
7     -O (46 = 3.2%) {6.5%}
20    -O (37 = 2.6%) {9.7%}
54    ---O (94 = 6.5%) {12.2%}
148   -----------------------------------O (862 = 59.9%) {18.8%}
403   -----------O (268 = 18.6%) {78.7%}
1096  -O (39 = 2.7%) {97.3%}
2980  ...
Here we’re using buckets that are sized exponentially (instatrace’s default). This instance of cobe has replied to 1439 inputs, and 59.9% of the time it was able to create between 148 and 403 candidate replies.
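The bucket labels above look like consecutive powers of e rounded down, so mapping a sample to its bucket works out to roughly this (my own sketch, inferred from the output; instatrace’s exact boundaries may differ):

import math

def bucket_floor(value):
    # Map a sample to the lower bound of an exponentially sized bucket.
    # Base-e buckets are an inference from the histogram above.
    if value < 1:
        return 0
    return int(math.e ** math.floor(math.log(value)))

for sample in (5, 60, 200, 500, 2000):
    print(sample, "->", bucket_floor(sample))
# 5 -> 2, 60 -> 54, 200 -> 148, 500 -> 403, 2000 -> 1096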
If you’d like to see the full set of statistics I’m recording for cobe, see here. I append units (us = microseconds) to most of my statistic names.
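Timing statistics work the same way: measure the elapsed time, convert it, and log it with the unit in the name. Something like this, reusing the trace() helper from the sketch above (the brain.reply call in the comment is just an example of something you might time):

import time

def timed_call(name, fn, *args):
    # Time fn(*args) and log the elapsed wall-clock time in microseconds,
    # using the trace() helper sketched earlier.  The "_us" suffix keeps
    # the unit visible in the statistic name.
    start = time.time()
    result = fn(*args)
    trace(name + "_us", int((time.time() - start) * 1e6))
    return result

# e.g. reply = timed_call("Brain.reply", brain.reply, "hello there")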
Instatrace works well with both time-based and count-based statistics. If your data logging is lightweight, you can leave the measurements in your code indefinitely. On the other hand, it’s so easy to add a trace that it’s great for temporary profiling efforts too.
I have found it extremely valuable in improving cobe toward its current state, but it’s also a useful tool for understanding some more subtle aspects of software performance.