When setting up a Hadoop cluster using Debian packages, it's often useful to work with a local mirror. In this article, I'll walk you through creating an apt mirror for Cloudera's Hadoop distribution.

There are quite a few advantages to having a local mirror, all of which should be familiar to people running their own Maven proxies:

You can still install packages when the origin server is down or has disappeared completely

When installing packages on a large number of machines, you don't overwhelm the origin server

You save network bandwidth and, as a result, the installation process will typically run much faster

The first point is particularly important because when adding a new machine to a cluster, you usually want it to run the exact same software versions that the rest of the cluster already uses.

To create the mirror, we'll use the apt-mirror command-line tool. You can install apt-mirror from a Debian or Ubuntu package, but it also runs from a checked-out source tree:

$ git clone https://github.com/apt-mirror/apt-mirror.git
$ cd apt-mirror

Inside the source directory, you have to edit the mirror.list config file. The following settings download Cloudera's CDH 4.4.0 Hadoop distribution and try to be nice by using only four threads in parallel:

set base_path /tmp/my-mirror
set defaultarch amd64
set run_postmirror 0
set nthreads 4
deb http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4.4.0 contrib
deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4.4.0 contrib
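If you mirror more than one release, it can be handy to generate this file instead of editing it by hand. The following is only a sketch: write_mirror_list is a hypothetical helper name, not part of apt-mirror, and it assumes that other releases use the same repository URL and only differ in the suite name.

```shell
#!/bin/sh
# Sketch: print a mirror.list for one CDH release to stdout.
# write_mirror_list is a hypothetical helper, not part of apt-mirror.
write_mirror_list() {
    base_path=$1   # work directory, e.g. /tmp/my-mirror
    release=$2     # suite name, e.g. precise-cdh4.4.0
    # Repository URL taken from the settings above.
    repo=http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh
    cat <<EOF
set base_path $base_path
set defaultarch amd64
set run_postmirror 0
set nthreads 4
deb $repo $release contrib
deb-src $repo $release contrib
EOF
}

# Prints the same six settings as shown above.
write_mirror_list /tmp/my-mirror precise-cdh4.4.0
```

Redirecting the output to mirror.list gives you the file that apt-mirror reads in the next step.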

Now that the configuration is ready, we can create the work directory and start the mirroring process:

$ mkdir /tmp/my-mirror
$ ./apt-mirror mirror.list

This will take a while (20 minutes on my system), but afterwards, you'll find the mirror in /tmp/my-mirror/archive.cloudera.com/. To install from your new mirror, you have to serve it from a web server:

# apt-get install apache2
# mv /tmp/my-mirror/archive.cloudera.com/cdh4 /var/www/
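Before pointing clients at the web server, it may be worth checking that the moved tree really contains the repository index. The sketch below looks for the Release (or InRelease) file that the Debian repository layout places under dists/ for each suite. The helper name has_release_file is made up for this example, and the path assumes that Apache serves /var/www/ as its document root.

```shell
#!/bin/sh
# Sketch: check that a repository tree contains the index file for a suite.
# has_release_file is a hypothetical helper name.
has_release_file() {
    repo_root=$1   # directory that corresponds to the URL in the deb line
    suite=$2       # suite name, e.g. precise-cdh4.4.0
    [ -f "$repo_root/dists/$suite/Release" ] ||
        [ -f "$repo_root/dists/$suite/InRelease" ]
}

# Assumed location after the mv command; adjust it if your web root differs.
if has_release_file /var/www/cdh4/ubuntu/precise/amd64/cdh precise-cdh4.4.0; then
    echo "found the repository index"
else
    echo "no Release file found, check the mirror path" >&2
fi
```

If the check fails, the mirror directory is probably somewhere else than expected, for example because your apt-mirror configuration uses a different mirror_path.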

On each machine that is supposed to use the new mirror, reference it from your apt configuration, for example by creating a file local-cdh-mirror.list in /etc/apt/sources.list.d:

deb [arch=amd64] http://example.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4.4.0 contrib
deb-src http://example.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4.4.0 contrib
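When you have to roll this out to many machines, a small script can produce the two entries for you. This is a minimal sketch: print_mirror_sources is a hypothetical helper, and example.com stands in for whatever host name your mirror actually has.

```shell
#!/bin/sh
# Sketch: print the apt source entries for a given mirror host.
# print_mirror_sources is a hypothetical helper; example.com is a placeholder.
print_mirror_sources() {
    host=$1    # host name of the web server that serves the mirror
    suite=$2   # suite name, e.g. precise-cdh4.4.0
    url="http://$host/cdh4/ubuntu/precise/amd64/cdh"
    printf 'deb [arch=amd64] %s %s contrib\n' "$url" "$suite"
    printf 'deb-src %s %s contrib\n' "$url" "$suite"
}

print_mirror_sources example.com precise-cdh4.4.0
```

Run as root, the output can be redirected to /etc/apt/sources.list.d/local-cdh-mirror.list on each machine.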

For some repositories, you'll also have to import the distributor's PGP public key:

$ curl -s https://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -

After running apt-get update, you should be able to install packages.

Matthias Friedrich's Blog
