Quick Tip #3: Creating Histograms in Python

Since Python 2.5, creating histograms has become easier. Instead of a plain dict, we can now use collections.defaultdict, which behaves much like awk's associative arrays. Rather than raising a KeyError when a missing key is looked up, defaultdict inserts a default value produced by a user-supplied factory and returns it.

I'll demonstrate this with a simple program that analyzes line length distribution in a file. In older Python versions, you'd typically write code like this:

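# filename is the path of the file to analyze
hist = {}
for line in open(filename):
    # len(line) includes the trailing newline, if there is one
    hist[len(line)] = hist.get(len(line), 0) + 1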

The code using defaultdict is much clearer and more elegant (although an additional import is needed):

from collections import defaultdict

# int() returns 0, so every new line length starts with a count of zero
hist = defaultdict(int)
for line in open(filename):
    hist[len(line)] += 1

Note that defaultdict's constructor expects a factory function that initializes unset items on request.
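To make the factory idea concrete, here is a small sketch. The sample words and variable names are made up for illustration. Any callable that takes no arguments should work as a factory: int yields 0, list yields an empty list, and a lambda can supply some other starting value.

```python
from collections import defaultdict

# int() returns 0, so missing keys start out as counters at zero.
counts = defaultdict(int)
counts["a"] += 1

# list() returns [], so missing keys start out as empty lists.
words_by_length = defaultdict(list)
for word in ["awk", "perl", "python", "ruby"]:  # illustrative sample data
    words_by_length[len(word)].append(word)

# A lambda can supply any other default value.
starts_at_ten = defaultdict(lambda: 10)
starts_at_ten["x"] += 1
```

The list variant is handy for grouping items instead of counting them.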

Unless you dump the contents of hist to gnuplot or a similar tool, you might want to sort its items by value, that is, by count. There are several ways to do this, but I learned from a related blog posting that this is the most efficient one:

from operator import itemgetter
# items() works on Python 2 and 3; on Python 2, iteritems() avoids building a list
sorted(hist.items(), key=itemgetter(1))

The min and max builtins support the key parameter, too, by the way.
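As a rough sketch of that last point, the same itemgetter key can pick out the most and least frequent line length without sorting. The lines list below is invented sample data standing in for a file.

```python
from collections import defaultdict
from operator import itemgetter

# Illustrative sample data standing in for the lines of a file.
lines = ["a\n", "bb\n", "cc\n", "dd\n", "eeee\n", "ffff\n"]

hist = defaultdict(int)
for line in lines:
    hist[len(line)] += 1

# Each item is a (length, count) pair; itemgetter(1) selects the count.
most_common = max(hist.items(), key=itemgetter(1))
least_common = min(hist.items(), key=itemgetter(1))
```

If several lengths share the same count, which of them min or max returns depends on the iteration order of the dict.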
