Sunday, 20 December 2015

Watch & Learn

https://www.youtube.com/watch?v=Pq3OyQO-l3E&list=PLpc4L8tPSURCdIXH5FspLDUesmTGRQ39I

https://www.youtube.com/watch?v=16bArEVJ-a8&list=PLd3UqWTnYXOl1VueyAuU2pBb8C6PlBMzN

Hadoop Concepts

Big Data is data that is difficult to store and process on a normal computing file system, for the following reasons:
1. Huge Volume (TB, PB), which is too large to process. As per a study made in 2013, around 90 percent of all the data in the world had been generated over the preceding two years.
2. Variety of data sources and variety of data formats.
3. The Velocity at which such data can be processed is too low in comparison to its huge size.





Hadoop-
Hadoop was introduced as a solution for Big Data.

Hadoop knows how to store Big Data and how to process it in less time.

History: In 2003, Google came up with GFS (Google File System) to store data. In 2004, Google came up with the MapReduce algorithm to process the huge volumes of data stored in GFS. Google published white papers on GFS and MapReduce, which Yahoo tried to implement as HDFS and MapReduce.

Mr. Doug Cutting built Hadoop around these two ideas: HDFS (Hadoop Distributed File System) and MapReduce are the two core concepts used by Hadoop.


Hadoop is an "Open Source Framework" given by the Apache Software Foundation for storing huge data sets and for processing huge data sets with a cluster of commodity hardware. A cluster is a set of machines on a single LAN.


Hadoop should be used only when your data set is very large; it is wasteful to use Hadoop for a small data set.




HDFS-

HDFS is a file system specially designed for storing huge data sets on a cluster of commodity hardware, with a streaming access pattern.

The streaming access pattern says: “Write once, read any number of times, but don't try to change the content of the file.”

HDFS defines a block size of 64 MB, compared to the 4 KB block of a normal hard disk.
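To see what the larger block size means in practice, here is a minimal sketch comparing how many block entries a 1 TB file occupies at 64 MB versus 4 KB blocks (the sizes are taken from the text above; the function name is illustrative):

```python
# Illustrative comparison of HDFS's 64 MB block size vs a 4 KB disk block.
HDFS_BLOCK = 64 * 1024 * 1024   # 64 MB
DISK_BLOCK = 4 * 1024           # 4 KB

def blocks_needed(file_size, block_size):
    # Ceiling division: a partial block still needs one block entry.
    return -(-file_size // block_size)

one_tb = 1024 ** 4
print(blocks_needed(one_tb, HDFS_BLOCK))  # 16384 blocks
print(blocks_needed(one_tb, DISK_BLOCK))  # 268435456 blocks
```

The same 1 TB file needs over sixteen thousand times fewer block entries at 64 MB, which matters later when we look at the Name Node's metadata.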
HDFS Services –
  • Master Daemons – Name Node, Secondary Name Node, Job Tracker
  • Slave Daemons – Data Node, Task Tracker


Master services can talk to each other, and slave services can talk to each other. A master service cannot talk to any slave service other than its corresponding slave. For example, the Name Node can talk only to Data Nodes, and a Data Node can talk only to the Name Node.

By default, HDFS stores three replicas of each file as backup.

Data Nodes send a regular “Block Report” and “Heartbeat” to the Name Node. Whenever data is stored on a Data Node, the Data Nodes communicate amongst themselves, updating each other about the replica details. The “Heartbeat” information lets the Name Node know that the Data Node is still alive.

The Name Node stores the metadata about the data stored on the slave Data Nodes.
When the Name Node learns about the failure of a Data Node, it removes that node's blocks from its metadata and re-creates replicas of the data that node held on other live Data Nodes. Thus, at any point in time there are still three replicas of each block. When the admin later repairs the failed Data Node and attaches it back to the cluster, it is treated as a “new Data Node” with no data stored on it.

Single Point of Failure: if the metadata is lost, the cluster becomes inaccessible. The Name Node, which stores the metadata, must therefore be a highly reliable system. The fewer the blocks, the less metadata the Name Node has to store. That is the reasoning behind increasing the block size from 4 KB to 64 MB in the Hadoop architecture.
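Rough arithmetic makes the point concrete. Assuming roughly 150 bytes of Name Node memory per block entry (an illustrative figure, not taken from the text above), the metadata for 1 PB of data shrinks dramatically at 64 MB blocks:

```python
# Back-of-the-envelope metadata sizing; 150 bytes/entry is an assumed figure.
BYTES_PER_ENTRY = 150
one_pb = 1024 ** 5

def metadata_bytes(data_size, block_size):
    blocks = -(-data_size // block_size)   # ceiling division
    return blocks * BYTES_PER_ENTRY

print(metadata_bytes(one_pb, 64 * 1024 ** 2))  # ~2.5 GB of metadata
print(metadata_bytes(one_pb, 4 * 1024))        # ~41 TB of metadata
```

With 4 KB blocks the metadata alone would dwarf any single machine's memory; with 64 MB blocks it fits comfortably on one Name Node, which is exactly the design trade-off the paragraph above describes.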

The Job Tracker accepts from the client a program that processes the data stored in HDFS. The Job Tracker communicates with the Name Node and obtains the metadata from it. It then assigns tasks to Task Trackers.
Each Task Tracker applies the program to its split of the data file. This process is called a Map. Thus, for processing a file, the number of Maps equals the number of splits of the input file.
Task Trackers also send “Heartbeat” information to the Job Tracker, letting it know that the Task Tracker is still alive.


Reducer: The outputs of all the Maps are combined together and a single result is formed by the Reducer. The Reducer runs on one of the computers in the cluster, which is also a Data Node. Once the Reducer has the result ready, that Data Node informs the Name Node about the output through its “Block Report”. The Name Node updates its metadata accordingly, and the client can then check with the Name Node whether the result is available on the Data Node.
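The Map and Reduce flow described above can be sketched with the classic word-count example. This is a single-process illustration of the idea, not real Hadoop code; the function names and the two sample splits are my own:

```python
from collections import defaultdict

def map_phase(split):
    # One mapper per input split: emit a (word, 1) pair for every word.
    return [(word, 1) for line in split for word in line.split()]

def reduce_phase(pairs):
    # The reducer combines the output of all the maps into one result.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

splits = [["big data big"], ["data hadoop"]]   # two splits -> two maps
intermediate = [pair for s in splits for pair in map_phase(s)]
print(reduce_phase(intermediate))  # {'big': 2, 'data': 2, 'hadoop': 1}
```

Note how the number of calls to `map_phase` equals the number of splits, just as the text says, while there is a single reduce step producing the final result.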