Basics of Near Duplicate Detection

Finding exact duplicate files is easy; anyone can do it. Finding files that are almost identical is more difficult, but it's useful for tasks like detecting plagiarism. In this article, I'll present a simple Python program that calculates the textual similarity of two documents.

The basic idea is to reduce each document to a set of strings and then use Jaccard similarity to calculate the similarity of those sets. This wasn't my idea, of course; I got it from chapter 3 of the free book Mining of Massive Datasets. You can find all the details and references to the original research there. I highly recommend it.

To reduce a document to a set of strings, we use a technique called shingling. Shingling extracts the set of all substrings of length k (called k-shingles) from a text. It is implemented by sliding a window of size k over the input document character by character and throwing the substrings into a set. For example, the document "abcabcac" yields the 3-shingles {"abc", "bca", "cab", "cac"}. The document contains six substrings of length 3, but only four shingles because the set eliminates duplicates.
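As a quick illustration, here's a minimal stand-alone sketch of shingling (the function name shingles is mine; the full program below reads from files instead):

def shingles(text, k):
    # slide a window of width k over the text; the set drops duplicate substrings
    return {text[i:i+k] for i in range(len(text) - k + 1)}

print(shingles("abcabcac", 3))  # the four shingles above, in arbitrary set order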

Jaccard similarity is a simple measure of the similarity of sets. I first came across it when studying recommender systems, but it keeps showing up in other machine learning contexts, too. For sets A and B, it is defined as SIM(A, B) := |A ∩ B| / |A ∪ B|. The result ranges from 0 (no elements in common) to 1 (identical sets). A plagiarized document therefore scores a much higher similarity against the original than an unrelated document does.
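For example, comparing the 3-shingles of "abcabcac" with those of the shorter document "abcabc" (which yields {"abc", "bca", "cab"}):

a = {"abc", "bca", "cab", "cac"}  # 3-shingles of "abcabcac"
b = {"abc", "bca", "cab"}         # 3-shingles of "abcabc"
print(len(a & b) / len(a | b))    # 3 / 4 = 0.75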

Here's an implementation written in Python 3:

import sys

SHINGLE_SIZE = 5

def get_shingles(f, size):
    """Yield every substring of length `size` (a sliding window over the text)."""
    buf = f.read()  # read the entire file into memory
    for i in range(len(buf) - size + 1):
        yield buf[i:i + size]

def jaccard(set1, set2):
    """Jaccard similarity: |intersection| / |union| of two sets."""
    x = len(set1.intersection(set2))
    y = len(set1.union(set2))
    return x / y  # true division in Python 3; no float() needed

file1 = sys.argv[1]
file2 = sys.argv[2]

with open(file1) as f1, open(file2) as f2:
    shingles1 = set(get_shingles(f1, size=SHINGLE_SIZE))
    shingles2 = set(get_shingles(f2, size=SHINGLE_SIZE))

print(jaccard(shingles1, shingles2), file1, file2)

You call it from the command line and it prints the similarity to stdout:

$ python3 shingle.py document1 document2

Obviously, this is the simplest implementation that could work; it isn't optimized in any way. The book I mentioned above discusses additional techniques for scaling it: Minhashing for creating short summaries of those sets that you can still compare, and Locality-Sensitive Hashing for clustering documents by similarity.
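To give a flavor of minhashing, here's a rough sketch, my own illustration rather than code from the book; the salted use of Python's built-in hash stands in for a family of random hash functions:

import random

NUM_HASHES = 100

def minhash_signature(shingles, num_hashes=NUM_HASHES, seed=42):
    # Salt Python's built-in hash() to simulate num_hashes random hash
    # functions, and keep the minimum hash value per function.
    # Caveat: string hashing is randomized per process, so signatures are
    # only comparable within one run (or with PYTHONHASHSEED fixed).
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingles) for salt in salts]

def estimate_jaccard(sig1, sig2):
    # The fraction of positions where two signatures agree estimates the
    # Jaccard similarity of the underlying shingle sets.
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / len(sig1)

estimate_jaccard(minhash_signature(shingles1), minhash_signature(shingles2)) then approximates the exact value while comparing only 100 integers per document, no matter how large the documents are.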
