Scikit-learn: Feature Extraction From Text

I've been playing with scikit-learn recently, a machine learning package for Python. While there's great documentation on many topics, feature extraction isn't one of them. My use case was to turn article tags (like the ones I use on my blog) into feature vectors.

[Update: Ported the code to scikit-learn 0.11, which is incompatible with 0.10 and older.]

To get some data to play with, let's first extract tags from an RSS feed using feedparser:

>>> import feedparser
>>> feed = feedparser.parse("http://blog.mafr.de/feed/")
>>> tags = [[t.term for t in e.tags] for e in feed.entries]
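Note that this gives one list of tag strings per article, while the vectorizer below expects a single string per article. If you're working with the live feed data, a one-liner (just a sketch) joins each list into the comma-separated format used next:

>>> tags = [", ".join(t) for t in tags]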

To keep the example manageable, we'll use the following data:

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

We want to turn our tags into a matrix with articles (samples) on the rows and tags (features) on the columns. With three articles and six distinct tags, we expect to get a 3x6 matrix.

After looking around, I found the CountVectorizer class that does almost what I want. All I have to do is provide a tokenizer that takes a string and turns it into a list of tokens. For this particular problem, it's easy to implement:

import re

# tags are comma-separated; split on the comma and any whitespace after it
REGEX = re.compile(r",\s*")

def tokenize(text):
    # normalize each tag: strip surrounding whitespace, lowercase
    return [tok.strip().lower() for tok in REGEX.split(text)]

A quick test shows that it seems to work:

>>> tokenize("foo Bar, baz")
['foo bar', 'baz']

Now let's plug it into CountVectorizer and run it:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vec = CountVectorizer(tokenizer=tokenize)
>>> data = vec.fit_transform(tags).toarray()
>>> print data
[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]

Note that the fit_transform() method returns a sparse matrix; we converted it to a numpy array with toarray() because it's easier to work with. In some cases it's also useful to restrict the number of features: CountVectorizer has a max_features constructor argument that limits the vocabulary to the K most frequent features.
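For instance, restricting our example to the three most frequent tags should give a 3x3 matrix (a quick sketch; vec3 is just a throwaway name):

>>> vec3 = CountVectorizer(tokenizer=tokenize, max_features=3)
>>> vec3.fit_transform(tags).toarray().shape
(3, 3)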

So, the shape of the matrix looks right, but is it really correct? Let's dump the column headers to make sure:

>>> vocab = vec.get_feature_names()
>>> vocab
[u'distributed systems', u'linux', u'networking', u'python', u'tools', u'ubuntu']
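The mapping in the other direction is available too: the fitted vectorizer exposes a vocabulary_ dict from term to column index, which is handy for looking up a single tag's column:

>>> vec.vocabulary_['linux']
1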

My samples don't contain duplicate tags, but if they did, I could weed them out in tokenize() or with a bit of numpy magic.
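In tokenize(), it's a small change; a minimal sketch (tokenize_unique is a hypothetical variant, reusing the REGEX from above and keeping each tag's first occurrence):

def tokenize_unique(text):
    # like tokenize(), but drop duplicate tags,
    # keeping the first occurrence of each
    seen = set()
    result = []
    for tok in REGEX.split(text):
        tag = tok.strip().lower()
        if tag not in seen:
            seen.add(tag)
            result.append(tag)
    return result

The numpy magic operates on the matrix after the fact: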

>>> import numpy as np
>>> np.clip(data, 0, 1, out=data)

This sets all array entries greater than 1 to 1 (and those smaller than 0 to 0). While we're at it, we can also compute the tag distribution by taking column sums:

>>> dist = np.sum(data, axis=0)
>>> print dist
[1 2 1 1 3 1]

And finally, join the counts with the vocabulary to make them readable:

>>> for tag, count in zip(vocab, dist):
...     print count, tag
1 distributed systems
2 linux
1 networking
1 python
3 tools
1 ubuntu
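If you'd rather see the most popular tags first, sorting the zipped pairs does the trick (a small sketch; the counts come first in each pair, so the sort is numeric):

>>> for count, tag in sorted(zip(dist, vocab), reverse=True):
...     print count, tag
3 tools
2 linux
1 ubuntu
1 python
1 networking
1 distributed systems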

That's it: turning tags from an RSS feed into feature vectors with very little code.
