HDFS is the Hadoop Distributed File System. Its design aims for reliability and high-throughput access to application data. One of its most important design goals is to minimize the cost of handling system faults caused by planned or unplanned outages.
HDFS has a master/slave architecture: the master node runs the NameNode service, which manages the file system namespace and its metadata, and the slave nodes run DataNode services that store and serve the data blocks.
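The split between metadata and block storage is largely invisible to applications, which talk to HDFS through the regular FileSystem API. Below is a minimal sketch of a Java client, assuming the Hadoop client libraries are on the classpath and `fs.defaultFS` points at the NameNode; the path and file contents are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client asks the NameNode only for metadata (which blocks make up a
        // file and where they live); the block bytes themselves are streamed
        // directly to and from DataNodes.
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // placeholder path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello, hdfs");
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```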
Why HDFS?
- HDFS is built to run on low-cost commodity hardware
- HDFS has a simple design, so it is easy for developers to code against and reason about
- HDFS has built-in replication, which makes it very reliable. Each block is replicated (three times by default) across different slave nodes, so if a node fails the data can still be read from the remaining replicas and the NameNode re-replicates the lost blocks
- HDFS provides high throughput, favoring large sequential reads and writes over low-latency random access
- HDFS keeps data available as long as at least one DataNode holding a replica of each block is reachable
- HDFS provides block-level storage: files are split into large blocks (typically 128 MB) that can be spread over different nodes
- HDFS scales performance by adding nodes, spreading both storage and I/O across the cluster
- HDFS is highly configurable (replication factor, block size, and many other settings can be tuned per cluster or even per file) and can be used for a wide range of applications; see the sketch after this list
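As a minimal sketch of that configurability, the snippet below overrides the replication factor and block size from a Java client using the standard `dfs.replication` and `dfs.blocksize` keys; the same keys can be set cluster-wide in hdfs-site.xml. The path and payload are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-client overrides of the cluster defaults.
        conf.setInt("dfs.replication", 3);                  // copies kept of each block
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128 MB blocks

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/data.bin");        // placeholder path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write(new byte[1024]);                      // sample payload
        }
        // Replication can also be raised or lowered per file after creation.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}
```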