Maven: Building a Self-Contained Hadoop Job

Non-trivial Hadoop jobs usually have dependencies that go beyond those provided by the Hadoop runtime environment. In other words, if your job needs additional libraries, you have to make sure they are on Hadoop’s classpath when the job is executed. This article shows how to build a self-contained job JAR that contains all of your dependencies.

The Hadoop runtime environment expects additional dependencies inside a lib directory. That’s where Maven’s Assembly plugin comes in: using the plugin, we create a JAR that contains the project’s resources (usually just your job’s class files) and a lib directory with all dependencies that are not already on Hadoop’s classpath.

First of all, we create an assembly descriptor file (I usually put it in src/main/assembly/hadoop-job.xml):

<assembly>
  <id>job</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <unpack>false</unpack>
      <scope>runtime</scope>
      <outputDirectory>lib</outputDirectory>
      <excludes>
        <exclude>${groupId}:${artifactId}</exclude>
      </excludes>
    </dependencySet>
    <dependencySet>
      <unpack>true</unpack>
      <includes>
        <include>${groupId}:${artifactId}</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>
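
The resulting job JAR then looks roughly like this (a hypothetical listing; commons-lang and commons-logging merely stand in for whatever extra libraries your project pulls in):

jar tf target/YourJob-1.0-job.jar
META-INF/MANIFEST.MF
de/mafr/hadoop/Main.class
lib/commons-lang-2.6.jar
lib/commons-logging-1.1.1.jar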

Note that we collect all dependencies in the runtime scope but exclude the project’s own artifact. Instead, we add the artifact in unpacked form, which may seem a little surprising. If we packaged it as a nested JAR instead, Hadoop would look for our dependencies inside the project’s artifact JAR and not inside the surrounding assembly JAR’s lib directory. For the whole mechanism to work, the class you set in your job driver via Job.setJarByClass() must come from your project, not from one of your dependencies!
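
To make this concrete, here is a minimal driver sketch. The class name matches the mainClass we configure below; the mapper, reducer and key/value types are omitted, so treat it as a skeleton rather than a complete job:

package de.mafr.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "my-job");

        // Main is part of this project, so Hadoop resolves the surrounding
        // assembly JAR (including its lib directory), not a dependency JAR.
        job.setJarByClass(Main.class);

        // Mapper, reducer and key/value types omitted for brevity.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}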

Inside the POM’s dependency section, we set Hadoop to the provided scope:

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
    <scope>provided</scope>
  </dependency>
</dependencies>

As a result, the assembly process will ignore the hadoop-core artifact and all its dependencies.
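
Conversely, every dependency left in the default compile scope (or set explicitly to runtime) ends up in the assembly’s lib directory. A made-up extra dependency could look like this:

<dependency>
  <groupId>commons-lang</groupId>
  <artifactId>commons-lang</artifactId>
  <version>2.6</version>
</dependency>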

In the final step, we reference the assembly descriptor from our pom.xml:

<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.2.1</version>
      <configuration>
        <descriptors>
          <descriptor>src/main/assembly/hadoop-job.xml</descriptor>
        </descriptors>
        <archive>
          <manifest>
            <mainClass>de.mafr.hadoop.Main</mainClass>
          </manifest>
        </archive>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

Setting the assembly JAR’s main class is optional, but it makes the job more user-friendly because you don’t have to specify the class name explicitly on each run.
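
Without the manifest entry, you would have to name the driver class on the command line for every run:

hadoop jar YourJob-1.0-job.jar de.mafr.hadoop.Main ARGS...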

Also, we attach the assembly process to the build lifecycle’s package phase for convenience. If we didn’t do this, we would have to run the Assembly plugin manually using its assembly:assembly goal after each mvn package run.
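
In that case, a build that still produces the assembly would look like this:

mvn clean package assembly:assembly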

That’s it, we’re done. You can now build your job JAR:

mvn clean package

Your self-contained job JAR is the file in target ending with -job.jar. Run it using Hadoop’s jar sub-command:

hadoop jar YourJob-1.0-job.jar ARGS...

To get started quickly, consider downloading my example Maven project or using the Maven archetype I present in my next article.


7 Responses to Maven: Building a Self-Contained Hadoop Job

  1. Martin says:

    Matthias
    Many thanks for the example project – saved me many hours fighting with my Maven configuration….
    Martin

  2. Is there any easy way to distribute these jobs to your Hadoop nodes?

    • mafr says:

      Hadoop distributes your job automatically when executing it via the “hadoop jar” command as shown in the article. Of course, your local Hadoop installation has to know the addresses of your job tracker and namenode.

  3. ntgd says:

    thank u

  4. Manoj says:

    Hi,

    Could you kindly tell how the main class will look for its dependencies inside the lib folder?
    Where are we providing it?

    Thanks in advance.

    • mafr says:

      The main class has nothing to do with it; when your main runs, the lib dependencies are on the classpath already. The Hadoop framework extracts the libraries, puts them on the classpath and then executes your main class.

  5. very useful article!
    thank you !
