Quick Tip #4: Sorting Large Files

With traditional Unix sort(1), the size of the files you can sort is limited by the amount of available main memory. As soon as the file gets larger and your system has to swap, performance degrades significantly. Even GNU sort, which uses temporary files to get around this limitation, doesn't sort in parallel in older versions (newer coreutils releases add a --parallel option). A dependable way to sort very large files efficiently is to split them, sort the individual parts in parallel, and merge them.

First you have to split the input at line boundaries because sort is line-oriented. Fortunately, most split(1) utilities today (like GNU split) provide an -l switch that splits by line count. Example:

  $ split -l 100000 input input-

This splits the input file into chunks of 100000 lines. The chunks are named input-aa, input-ab, etc. You’ll have to experiment with file sizes to see what works well for your problem.
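If you would rather choose the number of chunks than the number of lines per chunk, GNU split can do that as well. A minimal sketch, assuming GNU coreutils (the -n l/N form is a GNU extension, not POSIX) and a made-up demo file in a scratch directory:

```shell
# Demo data (hypothetical): 1000 lines in a scratch directory.
cd "$(mktemp -d)" || exit 1
seq 1 1000 > input

# Split into 4 chunks of roughly equal size without breaking lines.
# -n l/4 is GNU-specific; fall back to -l on other systems.
split -n l/4 input input-

# Show how many lines ended up in each chunk.
wc -l input-??
```

One chunk per core is a reasonable starting point, but smaller chunks may still work better if a single chunk does not fit in memory.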

Now sort the individual files using whatever flags you need:

  $ sort input-aa > sorted-input-aa

To speed things up, you can parallelize this step. For example, on a quad-core box you'd typically run four sort processes in parallel, because sorting is CPU-bound if your input is large enough.
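One way to run the sorts in parallel is xargs with its -P option. This is only a sketch: -P is supported by GNU and BSD xargs but is not part of POSIX, the demo data is made up, and the chunk names are assumed to contain no whitespace.

```shell
# Demo data (hypothetical): 1000 lines in reverse order, split into 4 chunks.
cd "$(mktemp -d)" || exit 1
seq 1000 -1 1 > input
split -l 250 input input-

# Sort up to 4 chunks at a time. Each chunk name is passed to a small
# sh script as $1, which writes the result to sorted-<chunk name>.
printf '%s\n' input-?? |
    xargs -P 4 -n 1 sh -c 'sort "$1" > "sorted-$1"' sh
```

xargs returns only after all sort processes have finished, so the merge step can follow immediately.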

As soon as all files are sorted, we merge them using sort's -m flag:

  $ sort -m sorted-input-* > sorted-input

Note that you have to use the same flags you used to sort the chunks to get a correct result!
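To illustrate with a tiny, made-up example: if the chunks were sorted numerically with -n, the merge needs -n as well. The file names below are placeholders.

```shell
# Two small unsorted chunks (hypothetical data) in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '10\n9\n100\n' > part-a
printf '2\n30\n' > part-b

# Sort each chunk numerically.
sort -n part-a > sorted-part-a
sort -n part-b > sorted-part-b

# Merge with the same -n flag so the merge compares lines the same way.
sort -m -n sorted-part-a sorted-part-b > merged

# merged now holds 2, 9, 10, 30, 100 in that order.
sort -c -n merged && echo "merged is in numeric order"
```

Leaving out -n in the merge makes sort compare the lines as text, so the merged output is no longer guaranteed to be in numeric order.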

This whole process is pretty simple and you can script it easily. However, if your files get really big (several hundred GB or more) and you start thinking about parallelizing across multiple machines, you might want to consider using a MapReduce cluster.
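A script for the single-machine case could look roughly like this. It is a sketch, not a polished tool: psort, CHUNK_LINES and JOBS are names invented here, it assumes a non-empty input, an xargs that supports -P, and a temporary directory path without whitespace, and it leaves the temporary directory behind if an early step fails.

```shell
#!/bin/sh
# psort INPUT OUTPUT [SORT_FLAGS...]
# Hypothetical helper: split INPUT, sort the chunks in parallel, merge them.
psort() {
    src=$1
    dst=$2
    shift 2                       # the remaining arguments are sort flags

    tmp=$(mktemp -d) || return 1
    split -l "${CHUNK_LINES:-100000}" "$src" "$tmp/chunk-" || return 1

    # Sort up to JOBS chunks at a time. xargs appends one chunk name after
    # the sort flags; the inner script picks it up as its last argument.
    printf '%s\n' "$tmp"/chunk-* |
        xargs -P "${JOBS:-4}" -n 1 sh -c \
            'for f in "$@"; do :; done; sort "$@" > "$f.sorted"' sh "$@" ||
        return 1

    # Merge with the same flags that were used for the chunks.
    sort -m "$@" "$tmp"/chunk-*.sorted > "$dst"
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Demo with made-up data: 1000 numbers in reverse order, sorted numerically.
demo=$(mktemp -d)
seq 1000 -1 1 > "$demo/input"
CHUNK_LINES=100
JOBS=4
psort "$demo/input" "$demo/sorted" -n
```

Passing the sort flags once and reusing them for both the chunk sorts and the merge avoids the mismatch described above.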


3 Responses to Quick Tip #4: Sorting Large Files

  1. Good article. Very useful. Thank you.

  2. Pingback: Unix:How to sort big files? – Unix Questions

  3. You can sort many huge files (the sorted result can be terabytes or bigger) with [ZZZServer][1], which is free for non-commercial use:

    ZZZServer -sortinit -sort file1.txt
    ZZZServer -sort file2.txt
    ZZZServer -sort file3.txt
    ZZZServer -sortsave sorted.txt

    After sorting, the result is saved in sorted.txt.

    P.S. Your input files must be encoded in UTF-8 or ASCII!

    ZZZServer uses about 1MB of RAM when sorting big files!

    [1]: http://demo.zzz.bg/en/#download
