Scikit-learn: Feature Extraction From Text

I’ve been playing with scikit-learn recently, a machine learning package for Python. While there’s great documentation on many topics, feature extraction isn’t one of them. My use case was to turn article tags (like the ones I use on my blog) into feature vectors.

[Update: Ported the code to scikit-learn 0.11 which is incompatible to 0.10 and older.]

To get some data to play with, let’s first extract tags from an RSS feed using feedparser:

>>> import feedparser
>>> feed = feedparser.parse("")
>>> tags = [[t.term for t in e.tags] for e in feed.entries]

To keep the example manageable, we’ll use the following data:

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

We want to turn our tags into a matrix with articles (samples) on the rows and tags (features) on the columns. With three articles and six different tags, we expect to get a 3×6 matrix.

After looking around, I found the CountVectorizer class, which does almost what I want. All I have to do is provide a tokenizer that takes a string and turns it into a list of tokens. For this particular problem, it’s easy to implement:

import re
REGEX = re.compile(r",\s*")
def tokenize(text):
    return [tok.strip().lower() for tok in REGEX.split(text)]

A quick test shows that it seems to work:

>>> tokenize("foo Bar, baz")
['foo bar', 'baz']

Now let’s plug it into CountVectorizer and run it:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vec = CountVectorizer(tokenizer=tokenize)
>>> data = vec.fit_transform(tags).toarray()
>>> print data
[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]

In some cases it’s useful to restrict the number of features. CountVectorizer has a max_features constructor argument that limits the vocabulary to the K most frequent features. Also, note that fit_transform() returns a sparse matrix; we call toarray() on it because a dense numpy array is easier to work with.
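As a quick sketch of max_features (reusing the tags and tokenize() from above; note that which of the equally rare tags survives the cut depends on how ties are broken):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

REGEX = re.compile(r",\s*")

def tokenize(text):
    # Split on commas, normalize whitespace and case
    return [tok.strip().lower() for tok in REGEX.split(text)]

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

# Keep only the 3 most frequent tags instead of all 6
vec = CountVectorizer(tokenizer=tokenize, max_features=3)
data = vec.fit_transform(tags).toarray()
# data.shape is now (3, 3): still three articles, but only three tag columns
```

Since "tools" (3 occurrences) and "linux" (2) are the most frequent tags, they are guaranteed to make the cut; the third column is one of the tags that occur only once.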

So, the shape of the matrix looks right, but is it really correct? Let’s dump the column headers to make sure:

>>> vocab = vec.get_feature_names()
>>> vocab
[u'distributed systems', u'linux', u'networking', u'python', u'tools', u'ubuntu']
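To sanity-check a single article, you can map a row’s nonzero columns back through the vocabulary. A small sketch using the matrix and vocabulary from above (row_tags is just a helper name I made up):

```python
import numpy as np

# Vocabulary and matrix as produced above
vocab = ['distributed systems', 'linux', 'networking', 'python', 'tools', 'ubuntu']
data = np.array([[0, 0, 0, 1, 1, 0],
                 [0, 1, 0, 0, 1, 1],
                 [1, 1, 1, 0, 1, 0]])

def row_tags(row):
    # The indices of the nonzero entries are the article's tags
    return [vocab[i] for i in np.flatnonzero(row)]

row_tags(data[0])  # ['python', 'tools'] -- matches the first article
```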

My samples don’t contain duplicate tags, but if they did, I could weed them out in tokenize() or by modest numpy magic:

>>> import numpy as np
>>> np.clip(data, 0, 1, out=data)

This sets all array entries greater than 1 to 1 (and those smaller than 0 to 0). While we’re at it, we can also compute the tag distribution by taking column sums:

>>> dist = np.sum(data, axis=0)
>>> print dist
[1 2 1 1 3 1]

And finally join with the vocabulary to make it readable:

>>> for tag, count in zip(vocab, dist):
...     print count, tag
1 distributed systems
2 linux
1 networking
1 python
3 tools
1 ubuntu
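If you’d rather see the most popular tags first, an argsort over the distribution does the trick. A sketch using the counts computed above:

```python
import numpy as np

vocab = ['distributed systems', 'linux', 'networking', 'python', 'tools', 'ubuntu']
dist = np.array([1, 2, 1, 1, 3, 1])

# argsort gives ascending order; reverse it for most-frequent-first
order = np.argsort(dist)[::-1]
ranked = [(vocab[i], int(dist[i])) for i in order]
# ranked[0] is ('tools', 3), ranked[1] is ('linux', 2)
```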

That’s it. Turning tags from an RSS feed into features with very little code.


8 Responses to Scikit-learn: Feature Extraction From Text

  1. fix
    -vec = CountVectorizer(tokenizer=tokenize)
    +vec = CountVectorizer(tokenizer=tokenize, min_df=1)

  2. jenn says:

    can you plot a 2d graph based on count and sentence, where a sentence is all words within one pair of quotes " "?

    • Matthias says:

      In principle, yes. You’d need a different tokenizer, perhaps something like this:

      REGEX = re.compile(r'".*?"')
      def tokenize(text):
          return [tok.strip().lower() for tok in REGEX.findall(text)]

      Note the findall() instead of the split(). However, once your input data is a bit more complicated and has errors, you’ll probably need a hand-written tokenizer that handles edge cases. Ah, and you can use matplotlib to make a nice graph.

  3. jenn says:

    if you were to use K-means on your data using the feature you extracted above and say you want to plot a 2d chart, what would your x and y axis be, given you want to see how well the tags are classified into each cluster?

  4. eranst says:

    Thank you for posting this, helped me started with scikit-learn text feature extraction :-)

