I've been playing with scikit-learn recently, a machine learning package for Python. While there's great documentation on many topics, feature extraction isn't one of them. My use case was to turn article tags (like the ones I use on my blog) into feature vectors.
[Update: Ported the code to scikit-learn 0.11, which is incompatible with 0.10 and older.]
To get some data to play with, let's first extract tags from an RSS feed using feedparser:
>>> import feedparser
>>> feed = feedparser.parse("http://blog.mafr.de/feed/")
>>> tags = [[t.term for t in e.tags] for e in feed.entries]
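Note that this gives us one list of tag terms per article. The vectorizer we'll use later works on one string per article instead, so we'd join each list back into a comma-separated string first. A quick sketch (using a hypothetical two-article result):

```python
# Tags as they might come back from the feed: one list of terms per article.
tags = [["python", "tools"], ["linux", "tools", "ubuntu"]]

# Join each article's tags into a single comma-separated string,
# matching the format used in the rest of this post.
tags = [", ".join(t) for t in tags]

print(tags)  # ['python, tools', 'linux, tools, ubuntu']
```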
To keep the example manageable, we'll use the following data:
tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
We want to turn our tags into a matrix with articles (samples) on the rows and tags (features) on the columns. With three articles and six distinct tags, we expect to get a 3x6 matrix.
After looking around, I found the CountVectorizer class that does almost what I want. All I have to do is provide a tokenizer that takes a string and turns it into a list of tokens. For this particular problem, it's easy to implement:
import re

REGEX = re.compile(r",\s*")

def tokenize(text):
    return [tok.strip().lower() for tok in REGEX.split(text)]
A quick test shows that it seems to work:
>>> tokenize("foo Bar, baz")
['foo bar', 'baz']
Now let's plug it into CountVectorizer and run it:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vec = CountVectorizer(tokenizer=tokenize)
>>> data = vec.fit_transform(tags).toarray()
>>> print data
[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]
In some cases it's useful to restrict the number of features. CountVectorizer has a max_features constructor argument that limits the vocabulary to the top K features only. Also, note that the fit_transform() method returns a sparse matrix, but we're turning it into a numpy array because it's easier to work with.
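For instance, restricting the example data to the two most frequent tags might look like the following sketch. (Note: recent scikit-learn releases expose the vocabulary via the vocabulary_ attribute and get_feature_names_out() instead of get_feature_names(); the version details are an assumption here.)

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

REGEX = re.compile(r",\s*")

def tokenize(text):
    return [tok.strip().lower() for tok in REGEX.split(text)]

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

# Keep only the two most frequent tags across the whole corpus:
# "tools" appears three times, "linux" twice, everything else once.
vec = CountVectorizer(tokenizer=tokenize, max_features=2)
data = vec.fit_transform(tags).toarray()

print(sorted(vec.vocabulary_))  # ['linux', 'tools']
print(data.shape)               # (3, 2) instead of (3, 6)
```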
So, the shape of the matrix looks right, but is it really correct? Let's dump the column headers to make sure:
>>> vocab = vec.get_feature_names()
>>> vocab
[u'distributed systems', u'linux', u'networking', u'python', u'tools', u'ubuntu']
My samples don't contain duplicate tags, but if they did, I could weed them out in tokenize() or with a little numpy magic:
>>> import numpy as np
>>> np.clip(data, 0, 1, out=data)
This sets all array entries greater than 1 to 1 (and those smaller than 0 to 0). While we're at it, we can also get the tag distribution by taking column sums:
>>> dist = np.sum(data, axis=0)
>>> print dist
[1 2 1 1 3 1]
And finally join with the vocabulary to make it readable:
>>> for tag, count in zip(vocab, dist):
...     print count, tag
...
1 distributed systems
2 linux
1 networking
1 python
3 tools
1 ubuntu
That's it. Turning tags from an RSS feed into features with very little code.
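As an aside, the tokenize()-based deduplication mentioned earlier could look like this sketch, which drops repeated tags while preserving first-seen order:

```python
import re

REGEX = re.compile(r",\s*")

def tokenize(text):
    # Normalize tags as before, but skip any tag we've already seen.
    seen = []
    for tok in REGEX.split(text):
        tok = tok.strip().lower()
        if tok not in seen:
            seen.append(tok)
    return seen

print(tokenize("linux, tools, Linux"))  # ['linux', 'tools']
```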