Big Data is data that is difficult to process with a normal computing file system, for the following reasons-
1. Huge Volume (in TB or PB), which is too large to process. As per a study made in 2013, around 90 percent of all the data in the world had been generated over the previous two years.
2. Variety of data sources and variety of data formats.
3. Velocity: the speed at which such data can be processed is too low in comparison to its huge size.
Hadoop-
Hadoop was introduced as a solution for Big Data: it knows how to store Big Data and process it in less time.
History: In 2003, Google came up with GFS (Google File System) to store data. In 2004, Google came up with the MapReduce algorithm to process the huge volumes of data stored in GFS. Google published white papers on GFS and MapReduce, which Doug Cutting and the team at Yahoo implemented as HDFS (Hadoop Distributed File System) and MapReduce - the two core concepts used by Hadoop.
Hadoop is a "Open Source Frame work" given by
Apache Software Foundation, for storing huge data set and for processing huge
data sets with a cluster of commodity hardware.
Cluster is set of machines in a single LAN.
Hadoop must be used only when
your data set is very large… It is useless to use Hadoop for small data set.
HDFS-
HDFS is a file system specially designed for storing huge data sets on a cluster of commodity hardware with a streaming access pattern.
The streaming access pattern says: "Write once, read any number of times, but don't try to change the content of the file."
HDFS defines a block size of 64 MB, compared to the 4 KB block of a normal hard disk.
HDFS Services –
- Master Daemons – Name Node, Secondary Name Node, Job Tracker
- Slave Daemons – Data Node, Task Tracker
Master services can talk to each other, and slave services can talk to each other. But a master service cannot talk to any slave service other than its corresponding slave. For example, the Name Node talks only to Data Nodes, and a Data Node talks only to the Name Node.
By default, HDFS stores 3 replicas of the same file as backup.
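A quick back-of-the-envelope sketch of what that replication factor means for raw capacity (the function name and values here are illustrative, not part of any Hadoop API):

```python
# Illustrative only: raw cluster storage consumed by one file
# under HDFS's default replication factor of 3.
REPLICATION_FACTOR = 3

def raw_storage_bytes(file_size_bytes, replication=REPLICATION_FACTOR):
    """Total bytes the cluster must hold for one file, all replicas included."""
    return file_size_bytes * replication

one_tb = 1024**4  # 1 TB in bytes
print(raw_storage_bytes(one_tb))  # 3 TB of raw capacity for a 1 TB file
```

So a cluster sized for Big Data must budget roughly three times the logical data volume.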
Data Nodes send a regular "Block Report" and "Heartbeat" to the Name Node. Whenever data is stored on a Data Node, the Data Nodes communicate among themselves, updating each other about the backup details. The "Heartbeat" information lets the Name Node know that the Data Node is still alive.
The Name Node stores the metadata about the data stored in the slave Data Nodes.
When the Name Node detects the failure of a Data Node, it removes that node's entries from its metadata and re-replicates the data that node held onto other live Data Nodes. Thus, at any point in time there are 3 replicas of the same data. When the admin later repairs the failed Data Node and re-attaches it to the cluster, it is treated as a "new Data Node" without any data stored on it.
Single Point of Failure: if the metadata is lost, the cluster becomes inaccessible. Therefore the metadata, which is stored on the Name Node, must sit on a highly reliable system. The fewer the blocks, the less metadata the Name Node has to store. That is the reasoning behind increasing the block size from 4 KB to 64 MB in the Hadoop architecture.
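The effect of the larger block size on metadata volume is easy to see with a little arithmetic (a sketch; the helper function is hypothetical, not a Hadoop API):

```python
import math

def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks - and hence metadata entries - needed for one file."""
    return math.ceil(file_size_bytes / block_size_bytes)

one_gb = 1024**3
print(block_count(one_gb, 4 * 1024))      # 262144 blocks with 4 KB blocks
print(block_count(one_gb, 64 * 1024**2))  # 16 blocks with 64 MB blocks
```

A single 1 GB file drops from over a quarter-million block entries to just 16, which is what keeps the Name Node's in-memory metadata manageable.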
The Job Tracker accepts from the client a program that processes the data stored in HDFS. The Job Tracker communicates with the Name Node and obtains the metadata from it. It then assigns tasks to the Task Trackers.
Each Task Tracker then applies the program to its part of the data file. This process is called a Map. Thus, for processing a file, the number of Maps equals the number of splits of the input file.
Task Trackers also send "Heartbeat" information to the Job Tracker, letting it know that the Task Tracker is still alive.
Reducer: the output of all the Maps is combined, and a single result is formed by the Reducer. The Reducer runs on one of the machines in the cluster, which is also a Data Node.
Once the Reducer has the result ready, the Data Node informs the Name Node about the output through its "Block Report". The Name Node updates its metadata accordingly, and the client can then check with the Name Node whether the result is available on that Data Node.
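The whole Map/Reduce flow described above can be sketched in miniature with a word count. This is an in-memory toy, not Hadoop's real API: the split size, function names, and single-reducer design are all simplifications for illustration.

```python
from collections import Counter
from itertools import chain

SPLIT_SIZE = 2  # lines per input split (tiny, for illustration)

def make_splits(lines, split_size=SPLIT_SIZE):
    """Mimic HDFS dividing a file into fixed-size input splits."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]

def map_task(split):
    """One Map per split: emit (word, 1) pairs for every word seen."""
    return [(word, 1) for line in split for word in line.split()]

def reduce_task(mapped_outputs):
    """The Reducer combines all Map outputs into a single result."""
    counts = Counter()
    for word, n in chain.from_iterable(mapped_outputs):
        counts[word] += n
    return dict(counts)

lines = ["big data", "big cluster", "data node", "name node"]
splits = make_splits(lines)                           # 4 lines -> 2 splits
result = reduce_task(map_task(s) for s in splits)     # one Map per split
print(len(splits))       # 2 splits, so 2 Map tasks
print(result["node"])    # 2
```

Note how the number of Map tasks falls out of the number of splits, exactly as described above, and how the Reducer sees only the Maps' intermediate (word, 1) pairs, never the raw file.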