Quick Tip #4: Sorting Large Files

With traditional Unix sort(1), the size of the files you can sort is limited by the amount of available main memory. As soon as a file gets larger than that and the system starts swapping, performance degrades significantly. Even GNU sort, which uses temporary files to work around this limitation, has historically not sorted in parallel. A practical way to sort very large files efficiently is to split them, sort the individual parts in parallel, and merge the results.

First you have to split the input at line boundaries, because sort is line-oriented. Fortunately, most split(1) implementations today (such as GNU split) provide an -l switch. Example:

$ split -l 100000 input input-

This splits the input file into chunks of 100000 lines each. The chunks are named input-aa, input-ab, and so on. You'll have to experiment with the chunk size to see what works well for your problem.

Now sort the individual files using whatever flags you need:

$ sort input-aa > sorted-input-aa

To speed things up, you can parallelize this step. On a quad-core box, for example, you'd typically run four sort processes at a time, because sorting is CPU-bound once the input is large enough.
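
One way to run four sorts at a time is with xargs, assuming an implementation that supports the -P flag (GNU and BSD xargs both do); the 4 here simply matches the four cores:

$ ls input-* | xargs -P 4 -I {} sh -c 'sort {} > sorted-{}'

Each invocation sorts one chunk and writes the result to the corresponding sorted-input-* file.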

Once all the chunks are sorted, merge them using sort's -m flag:

$ sort -m sorted-input-* > sorted-input

Note that you have to use the same flags for the merge that you used to sort the chunks, otherwise the result won't be ordered correctly!
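
For example, if the chunks were sorted numerically, the merge needs the same flag (the -n here is just an illustration):

$ sort -n input-aa > sorted-input-aa
$ sort -n -m sorted-input-* > sorted-input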

This whole process is pretty simple, and you can script it easily (see the sketch below). However, if your files get really big (several hundred GB or more) and you start thinking about parallelizing across multiple machines, you might want to look into a MapReduce cluster instead.
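
Here is a minimal sketch of the whole pipeline as a shell script. The chunk size, the parallelism of four, and the file names are assumptions you'll want to adjust, and it again relies on an xargs with the -P flag:

#!/bin/sh
set -e

# Split the input into 100000-line chunks named input-aa, input-ab, ...
split -l 100000 input input-

# Sort the chunks, four at a time, into sorted-input-aa, sorted-input-ab, ...
ls input-* | xargs -P 4 -I {} sh -c 'sort {} > sorted-{}'

# Merge the sorted chunks into the final result.
sort -m sorted-input-* > sorted-input

# Remove the intermediate chunk files.
rm -f input-?? sorted-input-??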
