Tuesday 4 April 2017

HDFS assumptions and goals


The assumptions and goals of HDFS are:
        1. Horizontal scalability
        2. Fault tolerance
        3. Capability to run on commodity hardware
        4. Write once, read many times
        5. Capacity to handle large data sets
        6. Data locality
        7. HDFS file system namespace
        8. Streaming access
        9. High throughput   
1. Horizontal scalability
  • HDFS is based on a scale-out model.
  • We can scale out to thousands of nodes to store terabytes or petabytes of data. 
  • As the data grows, we can add more DataNodes.
  • Adding DataNodes gives additional storage and more processing power. 
2. Fault tolerance
  • HDFS assumes that failures (hardware and software) are very common.
  • To cope with failures, HDFS provides data replication by default.
  • Rule
    • By default, Hadoop creates three copies of the data.
    • Two copies are kept on the same rack and one copy on a different rack.
    • Even if an entire rack fails, we will not lose the data.
    • If one copy of the data becomes inaccessible or gets corrupted, there is no need to worry.
    • The framework itself takes care of keeping the data highly available.
Still don't understand fault tolerance? Then here is a short definition for you:
  1. Hardware failures are very common.
  2. So, instead of relying on hardware to deliver high availability of the data, it is better to rely on a framework that is designed to handle failures and still deliver the expected service, with recovery built in by default.
If you still don't understand fault tolerance, refresh yourself by having a cup of coffee, then take a look at the small sketch below.
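If you like to see this in code, here is a minimal sketch using the Hadoop FileSystem API. The NameNode address hdfs://namenode:9000 and the file path are only example values, not part of any real cluster; the sketch writes a small file with a replication factor of three and prints the factor the NameNode keeps for it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; adjust for your own cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // Ask for three copies of every block (this is also the default).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/abc/replicated.txt");

        // Each block of this file is stored on three DataNodes, so losing
        // one node (or even one rack) does not lose the data.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("HDFS keeps multiple copies of this line.\n");
        }

        // Report how many replicas the NameNode keeps for this file.
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor: " + replication);
    }
}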
3. Capability to run on commodity hardware
  • HDFS runs on commodity hardware, which means we can use low-cost hardware to store large amounts of data.
  • An RDBMS, by contrast, is more expensive for storing and processing the same data.
 4. Write once, read many times
  • HDFS is based on the concept of write once, read many times: once data is written, it is not modified.
  • HDFS focuses on retrieving the data in the fastest possible way.
  • HDFS was originally designed for batch processing.
  • Since Hadoop 2.0, it can also be used for interactive processing (see the sketch below).  
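To make this concrete, here is a minimal sketch (reusing the hypothetical NameNode address from above) that writes a file exactly once, closes it, and then reads it back. HDFS offers no API for editing the file in place afterwards; readers can only re-read the immutable data.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/abc/sampleone.txt");

        // Write once: the file is written sequentially and then closed.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first line\nsecond line\n");
        }

        // Read many times: the same immutable file can be re-read freely.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}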
 5. Capacity to handle large data sets
  • HDFS is well suited to storing large data sets in the range of gigabytes, terabytes, and petabytes.
 6. Data locality
  • The DataNode and TaskTracker are present on the slave nodes of a Hadoop cluster. 
  • The DataNode stores the data, and the TaskTracker processes the data. 
  • When you run a query or MapReduce job, the TaskTracker processes the data on the node where that data resides. 
  • This minimizes the need to transfer data across nodes and improves job performance; this principle is called data locality. 
  • If the size of the data is HUGE, then 
    • It’s highly recommended to move the computation logic near the data. 
    • It’s not recommended to move the data near the computation logic. 
    • Advantage: minimizes network traffic and improves job performance (see the sketch below).
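To get a feel for the information that makes data locality possible, here is a minimal sketch (the file path and NameNode address are again only examples) that asks the NameNode which hosts store each block of a file. The framework uses exactly this kind of information to schedule tasks next to the data instead of shipping the data over the network.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address

        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/abc/sampleone.txt"));

        // Ask the NameNode where each block of the file is stored.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // These host names tell a scheduler where it can run a task
            // right next to the data.
            System.out.println("Offset " + block.getOffset()
                + " length " + block.getLength()
                + " hosts " + String.join(", ", block.getHosts()));
        }
    }
}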
 7. HDFS file system namespace
  • HDFS uses a traditional hierarchical file organization. 
  • Any user or application can create directories and recursively store files inside these directories. 
  • This enables you to create a file, delete a file, rename a file, and move a file from one directory to another. 
Example 
    • /user/abc/sampleone.txt 
    • /user/xyz/sampletwo.txt 
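Here is a minimal sketch of these namespace operations through the Hadoop FileSystem API, reusing the example paths above (the NameNode address is once more hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
        FileSystem fs = FileSystem.get(conf);

        // Create nested directories, similar to mkdir -p.
        fs.mkdirs(new Path("/user/abc"));

        // Create an (empty) file inside the directory.
        fs.create(new Path("/user/abc/sampleone.txt")).close();

        // Rename the file within the same directory.
        fs.rename(new Path("/user/abc/sampleone.txt"),
                  new Path("/user/abc/renamed.txt"));

        // Move the file to a different directory.
        fs.mkdirs(new Path("/user/xyz"));
        fs.rename(new Path("/user/abc/renamed.txt"),
                  new Path("/user/xyz/sampletwo.txt"));

        // Delete the file (false = non-recursive delete).
        fs.delete(new Path("/user/xyz/sampletwo.txt"), false);
    }
}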
8. Streaming access 
  • HDFS is based on the principle of “write once, read many times.” 
  • This supports streaming access to the data, and its whole focus is on reading the data in the fastest possible way (instead of focusing on the speed of the data write). 
  • HDFS has also been designed for batch processing more than interactive querying (although this has changed in Hadoop 2.0.) 
Still don't understand streaming access? Then read below:
  • In other words, in HDFS, reading the complete data set in the fastest possible way is more important than taking the time to fetch a single record from the data set.
If you still don't understand streaming access, then no coffee this time; the only option is to read it again (or walk through the small sketch below).
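As a small sketch of streaming access (same hypothetical NameNode address and example path), the code below opens a file and streams its entire contents sequentially to standard output using Hadoop's IOUtils helper. HDFS is optimized for this kind of full sequential scan rather than for picking out a single record.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamingReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
        FileSystem fs = FileSystem.get(conf);

        // Stream the whole data set from the first byte to the last;
        // the read is sequential, not record-by-record lookup.
        try (FSDataInputStream in = fs.open(new Path("/user/xyz/sampletwo.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}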
9. High throughput
  • HDFS was designed for parallel data storage and retrieval. 
  • When you run a job, it gets broken down into smaller units called tasks. 
  • These tasks are executed on multiple nodes (DataNodes) in parallel, and their results are merged to produce the final output.
  • Reading data from multiple nodes in parallel reduces the total time needed to read the data (the WordCount sketch below shows this flow).
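The classic WordCount job from the Hadoop MapReduce tutorial is a compact way to see this: the input is split and processed by map tasks running in parallel on the DataNodes, and reduce tasks merge the partial counts into the final output. The sketch below follows that standard example; the input and output paths are passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each map task processes one split of the input, ideally on the
    // node that already stores that split (data locality).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce tasks merge the partial counts produced by all map tasks.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}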
Thanks for your time.
-Nireekshan