Basics of Near Duplicate Detection

Finding duplicate files is easy; anyone can do it. Finding files that are almost identical is harder, but it’s useful for tasks like detecting plagiarism. In this article, I’ll present a simple Python program that calculates the textual similarity of two documents.

The basic idea is to reduce each document to a set of strings and then use Jaccard similarity to measure how similar those sets are. Of course, this wasn’t my idea; I got it from chapter 3 of the free book Mining of Massive Datasets. You can find all the details and references to the original research there. I highly recommend it.

To reduce a document to a set of strings, we use a technique called shingling. Shingling extracts the set of all substrings of length k (called k-shingles) from a text. It is implemented by sliding a window of size k over the input document character by character and adding each substring to a set. For example, the document "abcabcac" yields the 3-shingles {"abc", "bca", "cab", "cac"}. There are six substrings of length 3, but only four shingles because the set eliminates duplicates.
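The sliding-window idea fits in one line of Python; here is a quick sketch (the helper name is mine, not from the book):

```python
def k_shingles(text, k):
    """Slide a window of length k over the text and collect the distinct substrings."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(k_shingles("abcabcac", 3))  # the four shingles from the example above
```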

Jaccard similarity is a simple measure for expressing the similarity of sets. I first came across it when studying recommender systems, but it keeps showing up in other machine learning contexts, too. For sets A and B, it is defined as SIM(A, B) := |A ∩ B| / |A ∪ B|. The result ranges from 0 (no elements in common) to 1 (identical). Obviously, plagiarized documents have a higher similarity to the original than unrelated documents.
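As a quick check of the formula, here is the definition applied to the shingles from above (the second document, "abcabcab", is my own made-up example):

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    return len(a & b) / len(a | b)

a = {"abc", "bca", "cab", "cac"}  # 3-shingles of "abcabcac"
b = {"abc", "bca", "cab"}         # 3-shingles of "abcabcab"
print(jaccard(a, b))  # 3 shared shingles out of 4 in the union -> 0.75
```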

Here’s an implementation written in Python 3:

import sys

SHINGLE_SIZE = 4  # shingle length k

def get_shingles(f, size):
    # Read the whole file and slide a window of length `size` over it.
    buf = f.read()
    for i in range(len(buf) - size + 1):
        yield buf[i:i + size]

def jaccard(set1, set2):
    x = len(set1.intersection(set2))
    y = len(set1.union(set2))
    return x / y  # Python 3: / is true division

file1 = sys.argv[1]
file2 = sys.argv[2]
with open(file1) as f1, open(file2) as f2:
    shingles1 = set(get_shingles(f1, size=SHINGLE_SIZE))
    shingles2 = set(get_shingles(f2, size=SHINGLE_SIZE))
print(jaccard(shingles1, shingles2), file1, file2)

You call it from the command line (assuming the script is saved as similarity.py) and it prints the similarity to stdout:

$ python3 similarity.py document1 document2

Obviously, this is the simplest implementation that could work; it isn’t optimized in any way. The book I mentioned above discusses additional techniques for scaling this: minhashing for creating short summaries of the shingle sets that you can still compare, and locality-sensitive hashing (LSH) to cluster documents by similarity.
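To give a rough feel for the minhashing step, here is a toy sketch of my own (it simulates random permutations by XOR-ing Python’s built-in hash with random masks, which is a cheap stand-in for the book’s exact construction): the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets.

```python
import random

def minhash_signature(shingles, num_hashes=200, seed=42):
    """Summarize a set as the per-mask minimum of XOR-masked built-in hashes."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    hashes = [hash(s) for s in shingles]
    return [min(h ^ m for h in hashes) for m in masks]

def estimate_jaccard(sig1, sig2):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)
```

Identical sets always produce identical signatures; for different sets the estimate is only approximate, but it lets you compare 200 integers instead of two arbitrarily large shingle sets.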

This entry was posted in computer science. Bookmark the permalink.

10 Responses to Basics of Near Duplicate Detection

  1. panzi says:

    Interesting, but document similarity algorithms usually work differently. They tokenize the text into words, stem the words, remove unimportant common words (we used nltk for these things), and count the remaining words. The counts are then used as a vector where the distinct words make up the dimensions and the number of occurrences of a certain word in a document makes up the value in that dimension. These vectors are then compared by calculating the distance one way or another (L1, L2 or L-infinity metric).

    Of course treating each word on its own loses the context of the word, so more advanced methods use ordered overlapping n-tuples of words (n=1 -> same as before).
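    A minimal sketch of the word-vector approach described here (the stop list and helper names are mine; a real pipeline would use nltk for tokenizing and stemming, as the comment says):

```python
from collections import Counter
import math

STOPWORDS = {"the", "a", "of", "and", "is"}  # tiny stand-in for a real stop list

def bag_of_words(text):
    """Lowercase, split on whitespace, drop stop words, count the rest."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine_similarity(v1, v2):
    """Compare two word-count vectors by the cosine of the angle between them."""
    dot = sum(count * v2[word] for word, count in v1.items())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2)
```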

    I learned about this in a course named Information Retrieval at the university of technology in Vienna. My team (2 persons) used Python (2.x, since nltk isn’t available yet for Python 3.x) to develop our information extractor (its output is ARFF, a format apparently often used in this context) and retriever (which outputs HTML including a graph using flot). It could be faster, but I guess using any other language would have been much more cumbersome.

    Here is a short example output:

    If you want to take a look at the Python script I’ll ask my team colleague if I can release it as open source (I very much think he will say: “yes, you have to ask?”).

    (In a further exercise we did basically the same with audio files where the features are not words or n-grams but certain audio features a Java application calculated for us.)

    • mafr says:

      Yup, there are different approaches, and which fits best depends on your use case and your data. The one I described is used in Google News (with minhashing and LSH, of course) to group similar articles, because it is easy to parallelize via MapReduce.

      I’ve seen the vector similarity you describe in content-based recommender systems. Its drawback is that the usual similarity metrics (cosine, Pearson) don’t work that well if there are many dimensions (a colleague mentioned 50 as a typical limit). That’s why you then apply dimensionality reduction techniques. As far as I know, TF/IDF is often used in information retrieval to find the terms that are most important to a document. I guess because of this, these approaches don’t work that well for near duplicate detection. But as usual, it depends on your data and your exact use case.
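      For reference, the TF/IDF weighting mentioned here can be sketched in a few lines (this is one common variant, count times log(N / document frequency); the function name is mine):

```python
import math

def tf_idf(docs):
    """Weight each term's count by log(N / document frequency).

    `docs` is a list of {term: count} dictionaries, one per document.
    """
    n_docs = len(docs)
    df = {}  # in how many documents each term appears
    for counts in docs:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}
        for counts in docs
    ]
```

Terms that appear in every document get weight 0, so common words drop out even without a stop list.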

  2. panzi says:

    PS: Because we used Python for the exercise and we had to use ARFF I wrote an ARFF parser/writer in Python:

    If not anything else this might be interesting to people working in the information retrieval context.

    • mafr says:

      Thanks, but I’m just playing around in my free time. If I needed it for work, it would have to be in Java anyway :)

      Anyway, NLTK is pretty cool, I’ve always wanted to play with it but never found the time.

    • Saugata says:

      panzi: I have started a project recently to detect plagiarism. For this purpose I am going to use Python 2.7 along with the nltk package, but I can’t decide how to start. Actually, I am following what you have written earlier in the first comment. Can you please advise me on how to start my project, i.e. the steps I will need to follow?

      • panzi says:

        I just attended one lecture about information retrieval at the university, I’m really not the person to ask about these things. Maybe ask somewhere on stack exchange?

  3. Dongchen says:

    Hey, thanks for your article. But I do not know Python, so can you give me the shingling algorithm written in C/C++? I would be grateful! Here is my email:

  4. mafr says:

    No, sorry. Please try it yourself, it’s pretty simple. Everything you need is in the article and Python code is easy to read, too. If all else fails you can read a few short sections in the book I linked to.

  5. Vadim says:

    Nice book, thanks for reference! If somebody needs Java solution for finding near-duplicates of text documents on large scale, he can try this one:

  6. Pingback: On algorithms: searching for strings with a Hamming distance below a threshold - 4Find
