First Hadoop job

Dataset

You can generate something yourself, take any text file, or simply google the internet for data sets — from random collections to ones specially prepared for experimenting with machine learning and AI.

Code

I started with the immortal word count. By and large I googled the bigger parts of this code and tried to glue something together from those fragments. The biggest problem was that the API had been subject to change. The code below works on Hadoop 2.6, so I am not sure whether it reflects the most recent API version. Roughly, the code schema is as follows:
// the package matches the Main-Class and jar commands further below
package WordCount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // convenient constant: the value (1) emitted with every word by the mapper
  private final static IntWritable one = new IntWritable(1);

  // the code trigger
  public static void main (String [] args) throws Exception {
    // in version 1.x it was done differently 
    Configuration c = new Configuration();
    [..]
    Job j = Job.getInstance(c, "wordcount");

    // main class declaration
    j.setJarByClass(WordCount.class);
    // mapper class declaration (which is declared within the main class scope)
    j.setMapperClass(WordLineMap.class);
    // reducer class declaration (which is declared within the main class scope)
    j.setReducerClass(WordReduce.class);

    // output declaration
    j.setOutputKeyClass(Text.class);
    j.setOutputValueClass(IntWritable.class);

    // input and output paths taken from the command-line arguments
    FileInputFormat.addInputPath(j, new Path(args[0]));
    FileOutputFormat.setOutputPath(j, new Path(args[1]));

    // block until the job finishes and exit with its status
    System.exit(j.waitForCompletion(true) ? 0 : 1);
  }

  // mapper static class definition
  public static class WordLineMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context ctx) 
      throws IOException, InterruptedException { [..] }
  }

  // reducer static class definition
  public static class WordReduce extends Reducer {
    public void reduce(Text key, Iterable values, Context ctx) 
      throws IOException, InterruptedException { [..] }
  }
}
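The map and reduce bodies are elided above ([..]). As a sketch of the logic they implement — the mapper splits each line into words and emits (word, 1), the reducer sums the counts per word — here is a plain-Java simulation without the Hadoop types (the real bodies would call ctx.write instead of touching a map directly; the class name WordCountSketch is mine, not part of the job):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountSketch {
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // "map" phase: split the line into words, emit (word, 1)
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                String word = tok.nextToken().toLowerCase();
                // "reduce" phase (merged here): sum the 1s emitted per word
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count(new String[] {"to be or not to be"});
        System.out.println(c.get("to") + " " + c.get("be") + " " + c.get("or"));
        // prints: 2 2 1
    }
}
```

In the Hadoop version the framework performs the grouping between the two phases, so the reducer only ever sees one key with all of its values.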
The remaining activities on the command line:
# one-time
hadoop fs -mkdir examples/wordcount
hadoop fs -put [dataset] examples/wordcount
echo Main-Class: WordCount.WordCount > Manifest.txt

# every code change
rm -rf WordCount.jar WordCount/
javac -d . WordCount.java
jar cfm WordCount.jar Manifest.txt WordCount/*.class


# MapReduce job launch
# if the output dir already exists, the job fails
hadoop fs -rm -r examples/wordcount-out
hadoop jar WordCount.jar examples/wordcount examples/wordcount-out
