Hadoop tutorial :-
Hadoop is an Apache framework developed completely in Java and released as open source.
Hadoop analyzes and processes large amounts of data, on the order of petabytes, in parallel and in less time, with the data located in a distributed environment. Hadoop is not a single tool; it is a combination of different sub-frameworks such as Hadoop Common, MapReduce, HDFS, Pig, Hive, and HBase. Hadoop is mostly used for OLAP-style analytical processing rather than OLTP transactions; big companies like Facebook use Hadoop for such large-scale analytics. Hadoop can be set up in a clustered environment as well as on a single node.
HDFS is based on the Google File System (GFS) design.
MapReduce is based on Google's MapReduce concept.
Pig is a high-level scripting wrapper (Pig Latin) for MapReduce jobs, while Hive provides the SQL-style wrapper.
Basics of Hadoop HDFS :-
In Hadoop, data is stored in the Hadoop Distributed File System (HDFS) in the form of blocks (fixed-size chunks of a file), and each block is replicated on different nodes (machines) in a clustered (multi-node) environment. In real-world scenarios, petabytes of data are stored across thousands of nodes in a distributed environment. The advantage of storing replicated data is that the data is still available when one of the nodes is down, so it remains accessible to client applications at all times. Do we need high-end hardware for all these nodes? The answer is no; commodity hardware is enough.
If the data grows rapidly, we can add nodes without bringing down the whole system or losing data; this is what makes the system scalable. Hadoop also handles the risk of losing data while machines are being added to the existing cluster or after they have joined it.
As you know, a cluster has many nodes; if one node fails, Hadoop handles the failure without losing data and continues to serve the required work as expected.
HDFS stores data in files, and these files use the underlying operating system's file structure.
HDFS is suitable for storing very large amounts of data, on the scale of terabytes and petabytes, which is then processed with MapReduce for analytical (OLAP-style) workloads rather than OLTP transactions.
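To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API. The NameNode address and file path are assumptions for illustration; it presumes a running HDFS cluster and the hadoop-client libraries on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/sample.txt");

        // Write a small file; HDFS splits large files into blocks and replicates them
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read the file back
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}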
Data storage in HDFS:-
In HDFS, data is divided into blocks, and each block is replicated on different nodes (machines). The number of nodes on which each block is replicated, known as the replication factor, is configured in the Hadoop system.
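As a small, hedged example, the replication factor and block size are normally configured in hdfs-site.xml or through the Configuration API; the values below are common defaults used only for illustration.

import org.apache.hadoop.conf.Configuration;

public class ReplicationConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.replication: how many nodes keep a copy of each block (3 is a typical default)
        conf.setInt("dfs.replication", 3);
        // dfs.blocksize: size of each block in bytes (128 MB is a common default)
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        System.out.println("replication factor = " + conf.get("dfs.replication"));
    }
}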
Basics of Hadoop Map Reduce :-
In a Hadoop system, petabytes of data are distributed across thousands of nodes. To process the data stored in HDFS, we need applications.
MapReduce is a Java framework for writing programs that are executed in parallel to process large amounts of data in a clustered environment.
Hadoop provides MapReduce APIs for writing MapReduce programs. We use those APIs and plug our own data-analysis logic into the code. The MapReduce code then fetches and processes the data in the distributed environment.
Since we want to process the data stored in HDFS, we write these programs in a language such as Java or Python.
A MapReduce job has two kinds of tasks: 1. Mapper 2. Reducer
The mapper takes the input data, divides it, and processes it as a set of tasks. The output of these map tasks is handed to the reducer, which processes and combines it to produce the final output.
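As a hedged illustration of this mapper/reducer split, here is the classic word-count example written against the Hadoop MapReduce API; the input and output HDFS paths are placeholders, and it is a minimal sketch rather than production code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each input line into words and emits (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Placeholder HDFS paths for input and output
        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}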
Basics of Hadoop Hive:-
Hive is an Apache framework that acts as a SQL wrapper over MapReduce programs. Hive provides a SQL-like language (HiveQL) that the Hadoop system understands.
Most of the time, data analysis is done by database developers, and a DB developer may not know Java; in that case the Hive tool is useful.
The database developer writes Hive queries in the Hive tool to get results. These queries invoke the underlying MapReduce jobs, the data is processed, and the results are finally returned to the Hive tool.
The advantage of Hive is that no Java programming is needed in the Hadoop environment.
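Purely for illustration, here is a minimal sketch of running a Hive query from Java over JDBC against HiveServer2; the host, database, and table names are assumptions, and the hive-jdbc driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint; the sales table is hypothetical
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             // Hive compiles this query into MapReduce jobs behind the scenes
             ResultSet rs = stmt.executeQuery(
                     "SELECT product, COUNT(*) AS sold FROM sales GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString("product") + " -> " + rs.getLong("sold"));
            }
        }
    }
}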
In some cases SQL queries do not perform well, or a particular database feature or SQL function is not implemented in Hive; in that case we have to write a MapReduce plugin (a user-defined function) and register it with Hive. This is a one-time task.
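As a hedged sketch of such a plugin, below is a trivial Hive user-defined function (UDF); the class and function names are made up for this example, and the registration statements assume the compiled JAR is reachable from Hive.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF that upper-cases a string column
public class UpperCaseUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}

Once compiled into a JAR, it would typically be registered from the Hive shell with statements along the lines of ADD JAR /path/to/udf.jar; and CREATE TEMPORARY FUNCTION to_upper AS 'UpperCaseUDF';.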
Alongside all of these, we have Hadoop Common, the core framework of shared libraries and utilities that the other Hadoop modules rely on when processing large, distributed data sets.
Hadoop Use case explained:-
Let me take a scenario from before Apache Hadoop was introduced to the software world, and explain the data-processing use cases of one large retail company.
The retail company has 20,000 stores around the world. Each store has data about its products in different regions. This data is stored in different data sets, including several popular databases, in a multi-software environment. The company needs licenses for various software products, including databases, as well as the hardware to run them.
Each month, if the company wants to process the data store by store to find the best products and the most loyal customers, it has to crunch the numbers and work out the best customers and best-selling products in each region so it can make better offers.
Assume that 2,000 of these stores hold the data for all products, customer details, and customer purchase information for each store.
The company wants to target the use cases below:
1. Find the most popular products sold last Christmas at each store (for example, Store A), so that this year it can target customers with discounts on those products.
2. Find the top 10 products sold.
3. Select the top 100 customers per store to give them more offers.
From a technical point of view, we need a large data infrastructure to store this data, and we also need to process it; that calls for data-warehousing tools able to handle both normalized and unorganized data, and these tools are costly. Storage also has to be reliable, because failures would mean data loss.
Over time the data processing becomes complex, and the license and maintenance costs keep growing.
Suppose 10,000 more stores are added: the data grows and more nodes are added to the current infrastructure, but the overall performance of the system degrades as nodes are added.
Apache Hadoop solves the above problems.