Tag Archives: computer science

Scikit-learn: Feature Extraction From Text

I’ve been playing with scikit-learn recently, a machine learning package for Python. While there’s great documentation on many topics, feature extraction isn’t one of them. My use case was to turn article tags (like the ones I use on my blog) … Continue reading

Posted in python | 7 Comments

Basics of Near Duplicate Detection

Finding duplicate files is easy; anyone can do it. Finding files that are almost identical is more difficult, but it’s useful for use cases like detecting plagiarism. In this article, I’ll present a simple Python program that calculates the textual … Continue reading

Posted in computer science | 9 Comments

Software Developers: You Need Computer Science Education!

Computer science and software development are two entirely different things. The former is a science, the latter is mostly craftsmanship, still struggling to become an engineering discipline in its own right. Being a good computer scientist doesn’t make you a … Continue reading

Posted in computer science | 22 Comments

Are Link-Sharing Services Irrelevant?

You can use RSS to easily follow a few high-profile websites, and link-sharing services like Slashdot or Digg to discover popular web content. But that’s like reading a classic newspaper and some magazines: The information provided may have a … Continue reading

Posted in computer science | 2 Comments

Finding the Majority Item in a Stream

Going through old CACM issues, I discovered a paper (PDF) on stream processing. A common problem in this field is to find frequent items in a data stream when you only get one pass through the data and you need … Continue reading

Posted in computer science
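
To make the one-pass constraint from the teaser above concrete, here is a minimal sketch of the Boyer-Moore majority vote, one well-known way to find a majority item using constant extra memory. It is not necessarily the approach the post or the CACM paper describes; the function names `majority_candidate` and `is_majority` are made up for this illustration.

```python
from typing import Hashable, Iterable, Optional, Sequence


def majority_candidate(stream: Iterable[Hashable]) -> Optional[Hashable]:
    """Return the only item that could be a majority, in one pass.

    Uses O(1) extra memory: a single candidate and a counter.
    Returns None for an empty stream.
    """
    candidate = None
    count = 0
    for item in stream:
        if count == 0:
            # No surviving candidate: adopt the current item.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            # A differing item cancels one vote for the candidate.
            count -= 1
    return candidate


def is_majority(items: Sequence[Hashable], candidate: Hashable) -> bool:
    """Second pass: check that the candidate occurs in more than half the items.

    Only possible if the data can be replayed, which a true stream may not allow.
    """
    return sum(1 for x in items if x == candidate) > len(items) / 2


if __name__ == "__main__":
    data = list("aabab")
    winner = majority_candidate(iter(data))
    print(winner, is_majority(data, winner))
```

The first pass only guarantees that if a majority item exists, it is the one returned. When no item exceeds half the stream, the returned candidate is arbitrary, which is why the verification pass is kept as a separate function.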

CAP, Consistent Hashing, etc.

I’ve been reading up on distributed systems again. For quite a while, my monthly copy of CACM has been my only connection to computer science topics. This time, I followed a few references and came across interesting concepts (most of … Continue reading

Posted in computer science | 2 Comments