Hadoop Tutorial


Hadoop is an open-source Apache framework developed completely in Java.
Hadoop analyzes and processes large amounts of data, i.e. petabytes of data, in parallel and in less time, with the data located in a distributed environment. Hadoop is not a single tool; it is a combination of different sub-frameworks such as Hadoop Common (Core), MapReduce, HDFS, Pig, Hive and HBase. Hadoop is mostly used for OLAP-style batch processing rather than OLTP transactions; big companies like Facebook use Hadoop for this kind of large-scale analysis. Hadoop can be set up in a clustered environment as well as a single-node environment.

HDFS is developed based on the Google File System (GFS).
MapReduce is developed based on Google's MapReduce concept.
Pig is a high-level data-flow language (Pig Latin) that is compiled into MapReduce jobs, while Hive is the SQL-style wrapper for MapReduce jobs.

Basics of Hadoop HDFS :-

In Hadoop, data is stored in the Hadoop Distributed File System (HDFS) in the form of blocks (chunks of a fixed size), and each block is replicated on different nodes (machines) in a clustered (multi-node) environment. In real-world scenarios, petabytes of data are stored across thousands of nodes in a distributed environment. The advantage of storing replicated data is that the data is still available when one of the nodes goes down, so it is available to client apps all the time. Do we need high-end hardware for all these nodes? The answer is no, we can use commodity hardware for all these nodes.

If the data grows rapidly, we can add nodes without bringing down the whole system or losing data. In networking terminology we call this a scalable system. Hadoop handles adding machines to the existing cluster without losing data, both while the machines are being added and afterwards.
As you know, a cluster has different nodes; if one node fails, Hadoop handles the scenario without losing data and continues to serve the required work as expected.

HDFS stores the data in files, and these files use the underlying operating system's file system.
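To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java client API (org.apache.hadoop.fs.FileSystem). The NameNode address and the file path used here are only assumptions for illustration; in a real cluster the address comes from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write a small file; larger files are split into blocks automatically
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read it back
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}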

HDFS is suitable for storing large amounts of data, i.e. terabytes and petabytes, which is then processed with MapReduce in batch (OLAP-style) fashion rather than for OLTP transactions.

Data storage in HDFS:-

In HDFS, a file is split into blocks, and each block of data is replicated on different nodes or machines. The number of nodes on which each block is replicated is configured in the Hadoop system (the dfs.replication property, which defaults to 3).
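As a small illustration, the replication factor can also be inspected or changed from the same FileSystem API. The path is the same hypothetical file as in the earlier sketch and the numbers are made up; in practice the value is usually set once in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created by this client (cluster default is usually 3)
        conf.set("dfs.replication", "2");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt"); // hypothetical path

        // Change the replication factor of an existing file
        fs.setReplication(file, (short) 3);

        // Inspect the replication and block size reported by the NameNode
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());
        System.out.println("block size  = " + status.getBlockSize());
        fs.close();
    }
}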


Basics of Hadoop Map Reduce :-

In a Hadoop system, petabytes of data are distributed across thousands of nodes in a clustered environment. To process the data stored in HDFS, we need applications.
MapReduce is a Java framework used to write programs that are executed in parallel to process large amounts of data in a clustered environment.
Hadoop provides MapReduce APIs to write MapReduce programs. We have to make use of those APIs and put our data-analysis logic in the code.
The MapReduce code then fetches and processes the data in the distributed environment.

Since we want to process the data stored in HDFS, we need to write these programs in a language such as Java or Python.


MapReduce has two different tasks: 1. Mapper 2. Reducer

The map task takes the input data, divides it into splits, and processes each split; the output of the map phase is a set of intermediate key/value pairs, which are given to the reducer. The reducer processes this intermediate data, combines it, and writes the final output.
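To make the Mapper and Reducer roles concrete, below is the classic word-count example written against the org.apache.hadoop.mapreduce API. The input and output paths are passed on the command line and are just placeholders for the example.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts received for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is submitted with hadoop jar wordcount.jar WordCount <input dir> <output dir>; the framework runs one map task per input split in parallel, and the reducers combine the per-word counts into the final output.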

Basics of Hadoop Hive:-

Hive is an Apache framework that is an SQL-style wrapper over MapReduce programs. Hive provides an SQL-like language (HiveQL) whose queries are translated into jobs that the Hadoop system understands.

Most of the time, data analysis is done by database developers, and a DB developer may not be familiar with Java programs; in that case the Hive tool is useful.

The database developer writes Hive queries in the Hive tool to get the result. These queries call the underlying MapReduce jobs, which process the data, and finally the data is returned to the Hive tool.
The advantage of Hive is that no Java programming is needed to query data in the Hadoop environment.
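That said, an application can also submit the same HiveQL over Hive's JDBC driver when needed. The sketch below is only an illustration: the HiveServer2 address (localhost:10000) and the sales table are assumptions, not part of any real setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (shipped with hive-jdbc)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default"; // assumed HiveServer2 endpoint
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {
            // HiveQL; Hive translates this into one or more MapReduce jobs
            // The sales table and its columns are hypothetical
            ResultSet rs = stmt.executeQuery(
                "SELECT product_id, COUNT(*) AS cnt " +
                "FROM sales GROUP BY product_id ORDER BY cnt DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}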

In some cases, a query does not perform well, or some function we need is not implemented in Hive; then we have to write a custom function (UDF) as a plugin and register it with Hive. This is a one-time task.
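A minimal sketch of such a plugin, written against Hive's classic org.apache.hadoop.hive.ql.exec.UDF base class; the package, class and function names below are made up for illustration.

package com.example.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial custom function that upper-cases a string.
// Registered in Hive (the one-time task) with:
//   ADD JAR my-udfs.jar;
//   CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.ToUpper';
public class ToUpper extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}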


On top of all these, we have Hadoop Common, the core framework that provides the common utilities used by the other modules for processing large distributed data sets and supporting the Hadoop features described above.

Hadoop Use case explained:-

Let me take a scenario from before Apache Hadoop was introduced to the software world.

I am going to explain a data-processing use case for one big retail company.
The retail company has 20,000 stores around the world. Each store has data related to its products in different regions. This data is stored in different data sets, including different popular databases, in a multi-software environment. The company would need licenses for the different software, including the databases, as well as the hardware to run it all.

Every month, the company wants to process the data store by store to find the best product and the most loyal customers; that means we need to process and calculate the data and find out the best customer as well as the best-selling product in each region in order to give better offers.
Assume that these 20,000 stores hold the data for all products, all customer details, and the purchase information of every customer per store.


We want to target the use cases below.

Find the most popular product sold last Christmas in Store A, so that this year we can target customers with discounts on that product.
Find the top 10 products sold.
Select the top 100 customers per store to give them more offers.


From the technical point of view, we need a large data infrastructure to store this data, and we also need to process it; for that we need data-warehousing tools that can handle both normalized and unorganized data, and those tools are costly. The storage also has to be reliable, because a failure would mean data loss.
Over time the data processing becomes complex, and the license and maintenance costs keep growing.

Suppose 10,000 more stores are added and the data grows; more nodes are added to the current infrastructure, but the overall performance of the system degrades as the nodes are added.

Apache Hadoop solves the above problems.