Since Python 2.5, creating histograms has become easier. Instead of dict, we can now use defaultdict, which is similar in behavior to awk’s associative arrays. Instead of raising a KeyError for undefined keys, defaultdict adds a user-defined item and returns it.
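To make the difference concrete, here is a minimal sketch of the lookup behaviour described above. The key name "missing" and the variable names are made up for illustration.

```python
from collections import defaultdict

counts = defaultdict(int)   # int() returns 0, so unseen keys start at 0
plain = {}

# Reading an unseen key does not raise: the key is created with value 0.
value = counts["missing"]
created = "missing" in counts

# A plain dict raises KeyError for the same lookup.
try:
    plain["missing"]
    raised = False
except KeyError:
    raised = True
```

Note that merely reading a key from a defaultdict inserts it, which can matter if you later iterate over the keys.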
I’ll demonstrate this with a simple program that analyzes line length distribution in a file. In older Python versions, you’d typically write code like this:
    hist = {}
    for line in open(filename):
        hist[len(line)] = hist.get(len(line), 0) + 1
The code using defaultdict is much clearer and more elegant (although an additional import is needed):
    from collections import defaultdict

    hist = defaultdict(int)
    for line in open(filename):
        hist[len(line)] += 1
Note that defaultdict’s constructor expects a factory function that initializes unset items on request.
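As a rough illustration of the factory idea, any callable that takes no arguments should work, not just int. The sample lines and variable names below are invented for this sketch.

```python
from collections import defaultdict

lines = ["ab\n", "cd\n", "efgh\n"]   # made-up sample data

# list() produces a fresh empty list for every new key,
# so lines can be grouped by their length.
by_length = defaultdict(list)
for line in lines:
    by_length[len(line)].append(line)

# A lambda works as a factory too; here unseen keys start at 1.
start_at_one = defaultdict(lambda: 1)
start_at_one["x"] += 1
```

Passing the callable itself (int, list) rather than a value (0, []) is the point: the factory is called once for each missing key.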
Unless you dump the contents of hist to gnuplot or similar, you might want to sort the dict by value. There are several ways to do this, but I learned from a related blog posting that this is the most efficient way:
    from operator import itemgetter

    sorted(hist.iteritems(), key=itemgetter(1))
The min and max builtins support the key parameter, too, by the way.
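Here is a small sketch of sorted, min and max with the same key function. The histogram values are made up, and items() is used in place of iteritems() so that the snippet also runs on Python 3, where iteritems() no longer exists.

```python
from operator import itemgetter

# Made-up histogram: line length -> number of lines with that length.
hist = {3: 2, 5: 7, 80: 1}

# itemgetter(1) picks the count out of each (length, count) pair.
by_count = sorted(hist.items(), key=itemgetter(1))
rarest = min(hist.items(), key=itemgetter(1))
most_common = max(hist.items(), key=itemgetter(1))
```

With this data, by_count runs from the rarest length to the most common one.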
Very useful tip!!
But… with sorted(hist.iteritems(), key=itemgetter(1)) I get an alphanumeric sort like:
[('10', 1), ('2', 1), ('5', 1), ('8', 1), ('6', 1)]
And I would like to get the numeric sort:
[('2', 1), ('5', 1), ('8', 1), ('6', 1), ('10', 1)]
It seems you’re putting strings in your defaultdict (my example uses line lengths which are ints), so you get alphanumeric order. Just convert the data to int before adding it to hist and you should be fine.
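A minimal sketch of that suggestion, assuming the data arrives as strings and the goal is to order the histogram by key. The tokens list is hypothetical, modelled on the output quoted in the comment.

```python
from collections import defaultdict

tokens = ["10", "2", "5", "8", "6"]   # hypothetical string input

hist = defaultdict(int)
for token in tokens:
    hist[int(token)] += 1   # convert to int before using it as a key

# Sorting the (key, count) pairs now compares ints, so 10 comes last.
by_key = sorted(hist.items())

# For contrast: string keys compare character by character.
as_strings = sorted(tokens)
```

Sorting with key=itemgetter(1) would order by count instead; since every count here is 1, that key alone does not determine the order.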