For a start
Initial setup
For an initial setup I've got a ready-to-play virtual machine from Oracle: https://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html. Of course it is beneficial to start with one's own Hadoop installation, but for a first look around this is a much faster start.
Hadoop versions
What I hit first was Hadoop versions. There are three.

HDFS
All of them are built on HDFS, which can be thought of as a virtual file system, where the user may find files and directories, ACLs (not fully supported) and other goodies known from other file systems. Data schema/constraints/etc. are enforced here not on write (as e.g. in RDBMSes), but on read, so all those limitations are up to the user code, which by default works with raw input.

By and large HDFS consists of a NameNode and DataNodes, where the NameNode is the manager of Hadoop metadata, which covers the data distribution among DataNodes, while the DataNodes keep the data itself in the form of standardized blocks. This allows for the usage of cheap commodity hardware, not necessarily standardized (though standardization is apparently the best approach to cluster creation).
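The schema-on-read idea above can be sketched in a few lines of plain Python (file contents and names are made up for illustration): the storage layer accepts any raw bytes, and it is the reading code that imposes structure and must cope with rows that don't fit it.

```python
import csv
import io

# Raw text as it would sit in storage - nothing validated the second row on write.
raw = "alice,30\nbob,not-a-number\n"

def read_users(text):
    """Apply the schema at read time; malformed rows surface here, not on write."""
    for name, age in csv.reader(io.StringIO(text)):
        try:
            yield name, int(age)
        except ValueError:
            continue  # the write succeeded anyway; the reader must handle it

print(list(read_users(raw)))  # → [('alice', 30)]
```

In an RDBMS the second row would have been rejected on INSERT; here it lands on disk and every consumer decides how to deal with it.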
And here are the few most significant differences between versions:

Hadoop 1:
- there is one NameNode only
- data safety is provided by simple replication times X, where by default X=3

Hadoop 2:
- there is one NameNode plus a standby; there are additional JournalNodes keeping metadata changes
- data safety is still provided by simple replication times X, where by default X=3

Hadoop 3:
- there is one NameNode and many standby nodes; there are additional JournalNodes keeping metadata changes, and it is possible to add many name domains
- data safety is similar to RAID5, where a parity block is created per every 2 data blocks - which in turn leads to a significant reduction in storage allocation (overhead 50% vs 200% in earlier versions)
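The overhead numbers above are simple arithmetic, which a short sketch makes explicit (the function names are mine, not Hadoop's):

```python
def replication_overhead(factor):
    """Extra storage beyond the original data, as a fraction of it."""
    return factor - 1

def erasure_overhead(data_blocks, parity_blocks):
    """Parity storage as a fraction of the data it protects."""
    return parity_blocks / data_blocks

print(replication_overhead(3))  # → 2, i.e. 200% overhead with the default X=3
print(erasure_overhead(2, 1))   # → 0.5, i.e. 50% with one parity per two data blocks
```

So for the same amount of user data, erasure coding in Hadoop 3 needs 1.5x the raw capacity where 3x replication needed 3x.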
Processing framework
By and large the processing framework is of a MapReduce kind. The architecture is shortly described here. The differences between versions:

Hadoop 1:
- it is called MapReduce and is as simple as that

Hadoop 2:
- rewritten and called YARN; the link above refers mostly to this one

Hadoop 3:
- improved YARN
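The MapReduce contract itself fits in a few lines of plain Python - this is only the shape of the computation (map, shuffle, reduce), not a cluster job; all names are illustrative:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (key, value) pairs; for word count that is (word, 1).
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key - on a cluster this is the cross-node data movement.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

lines = ["to be or not to be"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

What YARN adds on top is not a different computation model but resource management: deciding where on the cluster the map and reduce tasks (or any other application) get to run.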
While the processing framework of Hadoop is called YARN, there is also Spark, which seems to me just another framework with a different approach (among other things it leverages the nodes' RAM to a much greater degree and uses different algorithms than MapReduce), but it also uses HDFS as the storage platform.
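Spark's in-memory style can be hinted at with a plain-Python analogue (no Spark required; the comments name the roughly corresponding RDD operations, and the data is made up): the working set stays in memory and flows through chained transformations, instead of being written back to disk between separate MapReduce jobs.

```python
data = ["spark keeps data", "in memory between steps"]

words = [w for line in data for w in line.split()]  # roughly rdd.flatMap(...)
pairs = [(w, 1) for w in words]                     # roughly .map(...)
counts = {}
for w, n in pairs:                                  # roughly .reduceByKey(...)
    counts[w] = counts.get(w, 0) + n

print(counts["data"])  # → 1
```

In real Spark each intermediate list would be a distributed dataset that can be cached in the cluster's RAM and reused, which is where the speed-up over disk-bound MapReduce chains comes from.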