Quick Tip #2: Set Operations Using Shell Tools

Everybody knows that Unix shell utilities are powerful. Even though they're text-based, you can build a lot of useful things outside of the text domain. Today I'll show you how to implement set operations. All we need are sorted files as input, with each file representing a set.

Let's create some input files A and B:

$ echo a b d e | tr ' ' '\n' > A
$ echo a c e f | tr ' ' '\n' > B

Note: It's an important precondition that all input files are sorted! To avoid ugly surprises, the files mustn't contain duplicates, but you would expect that from a set.

Since the input files are sorted, equality is trivial:

$ cmp A B

For use in scripts, the silent switch -s is useful.

Intersection:

$ comm -12 A B

The comm utility outputs three columns per default: The lines unique to the first file, the lines unique to the second file and the lines appearing in both files. We turn off the first two columns leaving just the intersection.

Union:

$ sort -m A B | uniq

The merge switch -m isn't necessary for correctness, but it improves runtime for large files quite a bit. Unfortunately, not all sort implementations have it.

Difference:

$ comm -23 A B

This only displays the lines unique to A, leaving the result of A minus B.

You can use the difference to get the complement. This requires a universe U first that we create using our union construct. Then we subtract A from U:

$ sort -m A B | uniq | comm -23 - A

With only two files, this is just an inefficient way of subtracting A from B.

Contains can be implemented using grep, but we have to be careful about false positives:

$ grep -Fx e A

The quiet switch -q is useful in scripts and may also reduce runtime because it exits as soon as the entry is found. This is still far less efficient than it could be if we had fixed size records and used interval halving.

More advanced operations like testing if one set is a subset of another are possible, too. Often you will also have to compare line counts using wc -l, but I'll leave that as an exercise to the reader :)

social