Apache Hadoop is the most popular and powerful big data tool, Hadoop provides world’s most reliable storage layer, a batch Processing engine and a Resource Management Layer . HADOOP can be further explained as below ::
- Open-source– Apache Hadoop is an open source project. It means its code can be modified according to business requirements.
- Distributed Processing– As data is stored in a distributed manner in HDFS across the cluster, data is processed in parallel on cluster of nodes.
- Fault Tolerance– By default 3 replicas of each block is stored across the cluster in Hadoop and it can be changed also as per the requirement. So if any node goes down, data on that node can be recovered from other nodes easily. Failures of nodes or tasks are recovered automatically by the framework.
- Reliability– Due to replication of data in the cluster, data is reliably stored on the cluster of machine despite machine failures. If your machine goes down, then also your data will be stored reliably.
- High Availability– Data is highly available and accessible despite hardware failure due to multiple copies of data. If a machine or few hardware crashes, then data will be accessed from other path.
- Scalability– Hadoop is highly scalable in the way new hardware can be easily added to the nodes. It also provides horizontal scalability which means new nodes can be added on the fly without any downtime.
- Economic– Hadoop is not very expensive as it runs on cluster of commodity hardware. Hadoop provides huge cost saving also as it is very easy to add more nodes here.
- Easy to use– No need of client to deal with distributed computing, framework takes care of all the things. So it is easy to use.
- Data Locality– Hadoop works on data locality principle which states that move computation to data instead of data to computation. When client submits the algorithm, this algorithm is moved to data in the cluster rather than bringing data to the location where algorithm is submitted and then processing it.