
Apache Hadoop

HDFS is the Hadoop Distributed File System. It is designed to provide reliable, high-throughput access to application data. One of its most important design goals is minimizing the cost of handling system faults caused by planned or unplanned outages.

HDFS has a master/slave architecture: the master node runs the NameNode service, which manages the file system namespace and file metadata, while the slave nodes run DataNode services that manage block storage.
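To make the division of labor concrete, here is a minimal sketch of a client writing and reading a file through the Hadoop FileSystem Java API. The NameNode address (hdfs://localhost:9000) and the /tmp/hello.txt path are assumptions for illustration; metadata operations go to the NameNode, while the file’s blocks are streamed to and from DataNodes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; this URI is an assumption and
        // should match fs.defaultFS in your core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Create: the NameNode records the file's metadata, then the
            // client streams the data blocks to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
            }

            // Open: the NameNode returns block locations, and the client
            // reads the blocks from the DataNodes that hold them.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```

Run it with the Hadoop client libraries on the classpath, for example from a project that depends on hadoop-client.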

Why HDFS?

  • HDFS is built to work on low-cost hardware
  • HDFS has a simple design, so it’s easy for developers to code against and reason about
  • HDFS has built-in replication, so it’s very reliable: by default each block is replicated 3 times across different DataNodes, so data can still be recovered if a slave node fails
  • HDFS provides high-throughput data access
  • HDFS keeps data highly available: if a DataNode fails, reads can be served from another replica
  • HDFS provides block-level storage (i.e., storing data in large blocks that can be spread over different nodes)
  • HDFS performs well for large, sequential reads and writes
  • HDFS is highly configurable (block size and replication factor, for example; see the sketch after this list) and can be used for a wide range of applications
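
To illustrate the replication and configurability points above, the sketch below overrides the block size and replication factor from a Java client. The keys dfs.replication and dfs.blocksize are the standard HDFS settings; the NameNode address and file path are assumptions, and the same values can instead be set cluster-wide in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; match it to fs.defaultFS in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        // Client-side overrides for files created by this client:
        // replicate each block twice instead of the default three times,
        // and use 256 MB blocks instead of the default 128 MB.
        conf.set("dfs.replication", "2");
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        try (FileSystem fs = FileSystem.get(conf)) {
            // The replication factor of an existing file can also be
            // changed after the fact; this path is hypothetical.
            fs.setReplication(new Path("/tmp/hello.txt"), (short) 2);
        }
    }
}
```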