Tuesday 4 April 2017

HDFS assumptions and goals


The assumptions and goals of HDFS are:
        1. Horizontal scalability
        2. Fault tolerance
        3. Capability to run on commodity hardware
        4. Write once, read many times
        5. Capacity to handle large data sets
        6. Data locality
        7. HDFS file system namespace
        8. Streaming access
        9. High throughput   
1. Horizontal scalability
  • HDFS is based on a scale-out model.
  • We can scale out to thousands of nodes to store terabytes or petabytes of data. 
  • As the data grows, we can add more DataNodes.
  • Adding DataNodes gives additional storage and more processing power. 
2. Fault tolerance
  • HDFS assumes that failures (hardware and software) are very common.
  • To cope with failures, HDFS provides data replication by default.
  • Rule
    • By default, Hadoop creates three copies of the data.
    • Two copies are kept on the same rack and one copy on a different rack.
    • Even if an entire rack fails, we will not lose the data.
    • If one copy of the data becomes inaccessible or gets corrupted, there is no need to worry.
    • The framework itself takes care of keeping the data highly available.
Still don't understand fault tolerance? Then here is a short definition for you:
  1. Hardware failures are very common.
  2. So, instead of relying on hardware to deliver high availability of the data, it is better to rely on a framework that is designed to handle failures and still deliver the expected service, with recovery built in by default.
If you still don't understand fault tolerance, refresh yourself by having a cup of coffee, then take a look at the small sketch below.
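If you like to see this in code, here is a minimal sketch using the Hadoop FileSystem API. The NameNode address hdfs://namenode:9000 and the file path are only example values, not part of any real cluster; the sketch writes a small file with a replication factor of three and prints the factor the NameNode keeps for it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; adjust for your own cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // Ask for three copies of every block (this is also the default).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/abc/replicated.txt");

        // Each block of this file is stored on three DataNodes, so losing
        // one node (or even one rack) does not lose the data.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("HDFS keeps multiple copies of this line.\n");
        }

        // Report how many replicas the NameNode keeps for this file.
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor: " + replication);
    }
}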
3. Capability to run on commodity hardware
  • HDFS runs on commodity hardware, which means we can use low-cost hardware to store large amounts of data.
  • An RDBMS, by contrast, is more expensive for storing and processing the same data.
 4. Write once, read many times
  • HDFS is based on the concept of write once, read many times: once data is written, it is not modified.
  • HDFS focuses on retrieving the data in the fastest possible way.
  • HDFS was originally designed for batch processing.
  • Since Hadoop 2.0, it can also be used for interactive processing (see the sketch below).  
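To make this concrete, here is a minimal sketch (reusing the hypothetical NameNode address from above) that writes a file exactly once, closes it, and then reads it back. HDFS offers no API for editing the file in place afterwards; readers can only re-read the immutable data.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/abc/sampleone.txt");

        // Write once: the file is written sequentially and then closed.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first line\nsecond line\n");
        }

        // Read many times: the same immutable file can be re-read freely.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}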
 5. Capacity to handle large data sets
  • HDFS is well suited to storing large data sets in the range of gigabytes, terabytes, and petabytes.
 6. Data locality
  • The DataNode and TaskTracker are present on the slave nodes of a Hadoop cluster. 
  • The DataNode stores the data, and the TaskTracker processes the data. 
  • When you run a query or MapReduce job, the TaskTracker processes the data on the node where that data resides. 
  • This minimizes the need to transfer data across nodes and improves job performance; this principle is called data locality. 
  • If the size of the data is HUGE, then 
    • It’s highly recommended to move the computation logic near the data. 
    • It’s not recommended to move the data near the computation logic. 
    • Advantage: minimizes network traffic and improves job performance (see the sketch below).
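To get a feel for the information that makes data locality possible, here is a minimal sketch (the file path and NameNode address are again only examples) that asks the NameNode which hosts store each block of a file. The framework uses exactly this kind of information to schedule tasks next to the data instead of shipping the data over the network.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address

        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/abc/sampleone.txt"));

        // Ask the NameNode where each block of the file is stored.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // These host names tell a scheduler where it can run a task
            // right next to the data.
            System.out.println("Offset " + block.getOffset()
                + " length " + block.getLength()
                + " hosts " + String.join(", ", block.getHosts()));
        }
    }
}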
 7. HDFS file system namespace
  • HDFS uses a traditional hierarchical file organization. 
  • Any user or application can create directories and recursively store files inside these directories. 
  • This enables you to create a file, delete a file, rename a file, and move a file from one directory to another. 
Example 
    • /user/abc/sampleone.txt 
    • /user/xyz/sampletwo.txt 
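Here is a minimal sketch of these namespace operations through the Hadoop FileSystem API, reusing the example paths above (the NameNode address is once more hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
        FileSystem fs = FileSystem.get(conf);

        // Create nested directories, similar to mkdir -p.
        fs.mkdirs(new Path("/user/abc"));

        // Create an (empty) file inside the directory.
        fs.create(new Path("/user/abc/sampleone.txt")).close();

        // Rename the file within the same directory.
        fs.rename(new Path("/user/abc/sampleone.txt"),
                  new Path("/user/abc/renamed.txt"));

        // Move the file to a different directory.
        fs.mkdirs(new Path("/user/xyz"));
        fs.rename(new Path("/user/abc/renamed.txt"),
                  new Path("/user/xyz/sampletwo.txt"));

        // Delete the file (false = non-recursive delete).
        fs.delete(new Path("/user/xyz/sampletwo.txt"), false);
    }
}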
8. Streaming access 
  • HDFS is based on the principle of “write once, read many times.” 
  • This supports streaming access to the data, and its whole focus is on reading the data in the fastest possible way (instead of focusing on the speed of the data write). 
  • HDFS has also been designed for batch processing more than interactive querying (although this has changed in Hadoop 2.0.) 
Still don't understand streaming access? Then read below:
  • In other words, in HDFS, reading the complete data set in the fastest possible way is more important than taking the time to fetch a single record from the data set.
If you still don't understand streaming access, then no coffee this time; the only option is to read it again (or walk through the small sketch below).
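As a small sketch of streaming access (same hypothetical NameNode address and example path), the code below opens a file and streams its entire contents sequentially to standard output using Hadoop's IOUtils helper. HDFS is optimized for this kind of full sequential scan rather than for picking out a single record.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamingReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
        FileSystem fs = FileSystem.get(conf);

        // Stream the whole data set from the first byte to the last;
        // the read is sequential, not record-by-record lookup.
        try (FSDataInputStream in = fs.open(new Path("/user/xyz/sampletwo.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}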
9. High throughput
  • HDFS was designed for parallel data storage and retrieval. 
  • When you run a job, it gets broken down into smaller units called tasks. 
  • These tasks are executed on multiple nodes (DataNodes) in parallel, and their results are merged to produce the final output.
  • Reading data from multiple nodes in parallel reduces the total time needed to read the data (the WordCount sketch below shows this flow).
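The classic WordCount job from the Hadoop MapReduce tutorial is a compact way to see this: the input is split and processed by map tasks running in parallel on the DataNodes, and reduce tasks merge the partial counts into the final output. The sketch below follows that standard example; the input and output paths are passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each map task processes one split of the input, ideally on the
    // node that already stores that split (data locality).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce tasks merge the partial counts produced by all map tasks.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}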
Thanks for your time.
-Nireekshan