Quick Tip #4: Sorting Large Files

With traditional Unix sort(1), the size of the files you can sort is limited by the amount of available main memory. As soon as the file gets larger and your system has to swap, performance degrades significantly. Even GNU sort which uses temporary files to get around this limitation doesn’t sort in parallel. The only viable option for sorting very large files efficiently is to split them, sort the individual parts in parallel and merge them.

First you have to split the input at line boundaries because sort works line oriented. Fortunately, most split(1) utilities today (like GNU split) provide an -l switch. Example:

  $ split -l 100000 input input-

This splits the input file into chunks of 100000 lines. The chunks are named input-aa, input-ab, etc. You’ll have to experiment with file sizes to see what works well for your problem.

Now sort the individual files using whatever flags you need:

  $ sort input-aa > sorted-input-aa

To speed things up, you can parallelize this step. For example, on a quad core box you’d typically run four sort processes in parallel because sorting is CPU-bound if your input is large enough.

As soon as all files are sorted we merge them using sort’s -m flag:

  $ sort -m sorted-input-* > sorted-input

Note that you have to use the same flags you used to sort the chunks to get a correct result!

This whole process is pretty simple and you can script it easily. However, if your files get really big (several hundred GB and more) and you start considering to parallelize across multiple machines, you might want to consider using a MapReduce cluster.

About these ads
This entry was posted in shell and tagged , , , , . Bookmark the permalink.

One Response to Quick Tip #4: Sorting Large Files

  1. Good article. Very useful. Thank you.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s