Maven: Building a Self-Contained Hadoop Job

Non-trivial Hadoop jobs usually have dependencies that go beyond those provided by the Hadoop runtime environment. In other words, if your job needs additional libraries, you have to make sure they are on Hadoop's classpath when the job is executed. This article shows how to build a self-contained job JAR that contains all your dependencies.

The Hadoop runtime environment expects additional dependencies inside a lib directory within the job JAR. That's where Maven's Assembly plugin comes in: using it, we create a JAR that contains the project's resources (usually just your job's class files) and a lib directory with all dependencies that are not already on Hadoop's classpath.

First of all, we create an assembly descriptor file (I usually put it in src/main/assembly/hadoop-job.xml):

<assembly>
  <id>job</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <unpack>false</unpack>
      <scope>runtime</scope>
      <outputDirectory>lib</outputDirectory>
      <excludes>
        <exclude>${groupId}:${artifactId}</exclude>
      </excludes>
    </dependencySet>
    <dependencySet>
      <unpack>true</unpack>
      <includes>
        <include>${groupId}:${artifactId}</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>

Note that we collect all dependencies in the runtime scope but exclude the project's own artifact. Instead, we add the artifact unpacked, which may seem surprising at first. If we didn't do that, Hadoop would look for our dependencies inside the project's artifact JAR rather than inside the surrounding assembly JAR's lib directory. For the whole mechanism to work, the class you pass to Job.setJarByClass() in your job driver must come from your project, not from one of your dependencies!
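With this descriptor, the resulting job JAR ends up with roughly the following layout (the class and library names here are just illustrative):

```
YourJob-1.0-job.jar
├── de/mafr/hadoop/Main.class    <- your project's classes, unpacked at the root
└── lib/
    ├── commons-lang-2.6.jar     <- runtime dependencies, kept as JARs
    └── ...
```

Hadoop adds everything under lib to the classpath when it runs the job, while your own classes are found directly at the JAR's root.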

Inside the POM's dependency section, we set Hadoop to the provided scope:

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
    <scope>provided</scope>
  </dependency>
</dependencies>

As a result, the assembly process ignores the hadoop-core artifact and all of its transitive dependencies, since provided-scope artifacts are assumed to be available at runtime.
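Every other dependency stays in the default compile scope (which is part of the runtime scope the descriptor collects) and therefore lands in the assembly's lib directory. As a sketch, if your job used Commons Lang (an illustrative dependency, not one this project necessarily needs), the dependency section might look like this:

```xml
<dependencies>
  <!-- provided: available on the Hadoop cluster, excluded from lib/ -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
    <scope>provided</scope>
  </dependency>
  <!-- default (compile) scope: copied into the job JAR's lib/ directory -->
  <dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.6</version>
  </dependency>
</dependencies>
```

The rule of thumb: mark everything the cluster already ships as provided, and leave everything your job has to bring along in the default scope.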

In the final step, we reference the assembly descriptor from our pom.xml:

<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.2.1</version>
      <configuration>
        <descriptors>
          <descriptor>src/main/assembly/hadoop-job.xml</descriptor>
        </descriptors>
        <archive>
          <manifest>
            <mainClass>de.mafr.hadoop.Main</mainClass>
          </manifest>
        </archive>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

Setting the assembly JAR's main class is optional, but it makes the job more user-friendly because you don't have to specify the class name explicitly on each run.

Also, we attach the assembly process to the build lifecycle's package phase for convenience. If we didn't, we would have to run the Assembly plugin manually via mvn assembly:single after each mvn package run.

That's it, we're done. You can now build your job JAR:

mvn clean package

Your self-contained job JAR is the file in the target directory whose name ends in -job.jar (the suffix comes from the assembly descriptor's id). Run it using Hadoop's jar sub-command:

hadoop jar YourJob-1.0-job.jar ARGS...

To get started quickly, consider downloading my example Maven project or using the Maven archetype I present in my next article.
