Posts

First Hadoop job

Dataset

You can generate something yourself, take any text file, or simply google for data sets, which range from random dumps to collections specially prepared for experimenting with machine learning and AI.

Code

I started with the immortal word count. By and large I googled the bigger parts of this code and tried to glue something together from those fragments myself. The biggest problem was that the API has been subject to change. The code below works on Hadoop 2.6, so I am not sure whether it matches the most recent API version. By and large the code skeleton looks as follows:

    public class WordCount

        // a convenient constant used as the value emitted for every word
        private final static IntWritable one = new IntWritable(1);

        // the code trigger
        public static void main (String [] args) throws Exception {
            // in version 1.x it was done differently
            Configuration c = new Configuration();
            [..]
            Job j = Job.getInstance(c, "wordcount");
            // main class declaration
            j.setJarByClass(WordCount....
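The excerpt breaks off mid-line, so here is a minimal sketch of how the whole job could look, assuming the standard Hadoop 2.x org.apache.hadoop.mapreduce API; the mapper/reducer classes and the tokenizing logic are the usual textbook fill-in, not code quoted from the post:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // convenient constant: the value emitted for every word occurrence
        private final static IntWritable one = new IntWritable(1);

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // split the input line into words and emit (word, 1) pairs
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context)
                    throws IOException, InterruptedException {
                // sum all counts emitted for a given word
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // the code trigger
        public static void main(String[] args) throws Exception {
            // in version 1.x this was done differently (JobConf/JobClient)
            Configuration c = new Configuration();
            Job j = Job.getInstance(c, "wordcount");
            j.setJarByClass(WordCount.class);
            j.setMapperClass(TokenizerMapper.class);
            j.setCombinerClass(IntSumReducer.class);
            j.setReducerClass(IntSumReducer.class);
            j.setOutputKeyClass(Text.class);
            j.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(j, new Path(args[0]));
            FileOutputFormat.setOutputPath(j, new Path(args[1]));
            System.exit(j.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it would be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>.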

For a start

Initial setup

For an initial setup I got a ready-to-play virtual machine from Oracle: https://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html . Of course it is beneficial to start with your own Hadoop installation, but for looking around this is a much faster start.

Hadoop versions

The first thing I hit was Hadoop versions. There are three.

HDFS

All of them are built on HDFS, which can be thought of as a virtual file system, where the user finds files, directories, ACLs (not fully supported), and other goodies known from other file systems. Data schema/constraints/etc. are enforced here not on write (as e.g. in RDBMSes) but on read, so all those constraints are up to the user code, which by default works with raw input. By and large HDFS consists of a NameNode and DataNodes: the NameNode manages the Hadoop metadata, which covers how data is distributed among the DataNodes, while the DataNodes keep the data itself in the form of standardized blocks - this...
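To make the NameNode/DataNode split concrete from the client side, here is a minimal sketch using the Java FileSystem API (org.apache.hadoop.fs); the paths are hypothetical and the configuration is assumed to be picked up from the VM's core-site.xml:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPeek {
        public static void main(String[] args) throws Exception {
            // picks up fs.defaultFS from core-site.xml on the classpath;
            // on the BigDataLite VM this should point at the local NameNode
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // listing a directory is answered by the NameNode from metadata
            for (FileStatus status : fs.listStatus(new Path("/user"))) {
                System.out.println(status.getPath() + "  "
                        + status.getLen() + " bytes");
            }

            // reading a file streams blocks from the DataNodes holding them
            Path file = new Path("/user/oracle/input.txt"); // hypothetical path
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                System.out.println(reader.readLine());
            }
        }
    }

Note the division of labour described above: the directory listing only talks to the NameNode (pure metadata), while opening the file pulls the actual blocks from whichever DataNodes hold them.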