A Maven Archetype for Hadoop Jobs

In my last article I showed how to build a Hadoop job that contains all its dependencies. To make things even easier, I created a Maven archetype that turns project setup into a simple 30-second process.

To generate a new project, run the following command (on one line):

  mvn archetype:generate 

Then follow the instructions on the screen: pick the hadoop-job-basic archetype from the list and enter your project’s coordinates (groupId, artifactId, etc.). If you use a different Hadoop version, you can adjust the version number in the generated pom.xml. And that’s it!

The Maven archetype:generate command above downloads my personal catalog of archetypes, which is just a simple XML file that I created manually.
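For illustration, a hand-written archetype catalog might look roughly like the sketch below. Only the hadoop-job-basic artifactId comes from this article; the groupId, version, repository URL, and description are placeholders, not the actual values from my catalog.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of an archetype catalog; all values except the artifactId are placeholders -->
<archetype-catalog>
  <archetypes>
    <archetype>
      <groupId>com.example.archetypes</groupId>          <!-- placeholder -->
      <artifactId>hadoop-job-basic</artifactId>
      <version>1.0</version>                             <!-- placeholder -->
      <repository>https://repo.example.com/maven</repository> <!-- placeholder -->
      <description>Basic Hadoop job project</description> <!-- placeholder -->
    </archetype>
  </archetypes>
</archetype-catalog>
```

Each archetype entry tells the plugin which artifact to fetch and from which repository, which is why a single static file is enough to publish an archetype.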

Since I last used the Maven Archetype plugin, things have improved considerably. The developers ditched the old descriptor format and introduced a new one that gives you almost complete control over the files that will be part of the generated project. On top of that, creating the archetype was as simple as calling the archetype:create-from-project goal on an existing project.
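To give an idea of the newer descriptor format, here is a minimal sketch of an archetype-metadata.xml file (it lives under META-INF/maven/ in the archetype's resources). The descriptor name and the file sets shown are assumptions for illustration, not a copy of the descriptor in my archetype.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: name and file sets are illustrative assumptions -->
<archetype-descriptor name="hadoop-job-basic">
  <fileSets>
    <!-- filtered: substitute properties such as ${package} in the files;
         packaged: place the files under the package directory chosen by the user -->
    <fileSet filtered="true" packaged="true" encoding="UTF-8">
      <directory>src/main/java</directory>
      <includes>
        <include>**/*.java</include>
      </includes>
    </fileSet>
    <fileSet filtered="true" packaged="true" encoding="UTF-8">
      <directory>src/test/java</directory>
      <includes>
        <include>**/*.java</include>
      </includes>
    </fileSet>
  </fileSets>
</archetype-descriptor>
```

The create-from-project goal generates a descriptor along these lines, which you can then trim by hand to control exactly which files end up in new projects.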

I’m releasing the archetype under the Apache 2.0 license on GitHub, so feel free to use it as you see fit.


8 Responses to A Maven Archetype for Hadoop Jobs

  1. Pingback: Hadoop Maven Archetype « Shekhar Gulati : Java Consultant, Freelance Writer

  2. David Milne says:

    Awesome, thanks for this!

  3. rlclayton says:

    Matthias, very nice. I’m teaching a class on Hadoop Development and was trying to formalize the build process so it wasn’t just in Eclipse. This was perfect, especially the inclusion of the WordCount example and the Unit Test. Thank you so much.

  4. mafr says:

    I’m glad it’s useful, thank you very much for your comment!

  5. Blew says:

    Thanks, this article was very useful!

    What I am stuck on is controlling dependency versions when the versions my code uses conflict with the ones Hadoop is deployed with.

    For example, my code uses newer versions of gson, jets3t, and commons-lang than the ones EMR Hadoop is deployed with. If I use settings that load my dependencies first (mapreduce.job.user.classpath, HADOOP_USER_CLASSPATH_FIRST=true), then both my code and the Hadoop code use the newer dependencies, which appears to break Hadoop.

    I want my code to use my newer versions of the dependencies, and the Hadoop code to keep using its own versions. I imagine this is a pretty common scenario?

    • mafr says:

      Yes, that’s a very common problem. Unfortunately, there’s no solution (see issues MAPREDUCE-1700 and MAPREDUCE-1938). In more mature technologies like Java EE, user code is isolated using classloader hierarchies, with applications only depending on a minimal API. Hadoop just isn’t there yet.

      Ugly workarounds include downgrading your own dependencies, using different libraries in your code (Google Guava instead of commons-lang etc.) or importing your dependencies into a different package namespace (“com.example.thirdparty.org.apache.*”). I agree, this really, really sucks.
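
      As a rough sketch of the last workaround, the relocation can be done at build time with the maven-shade-plugin. This is an assumption about tooling, not something the archetype sets up for you, and the commons-lang package and the com.example.thirdparty prefix are just examples.

      ```xml
      <!-- Sketch: goes into <build><plugins> of your pom.xml; pin a plugin version in real use -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <relocations>
                <!-- Rewrites the bundled classes and your references to them -->
                <relocation>
                  <pattern>org.apache.commons.lang</pattern>
                  <shadedPattern>com.example.thirdparty.org.apache.commons.lang</shadedPattern>
                </relocation>
              </relocations>
            </configuration>
          </execution>
        </executions>
      </plugin>
      ```

      Your job jar then carries its own renamed copy of the library, so it no longer clashes with the version on Hadoop's classpath.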

  6. Pingback: Hadoop on Azure - Creating and Running a simple Java MapReduce - I'm on a mission from God object

  7. Mirko Kämpf says:

    Thank you very much! It saved me a lot of time.
