Many customers ask what kind of machines to purchase for a Hadoop environment, and how to configure them. The answer can essentially be derived from a few simple calculations, which I want to walk through and demonstrate.
Sizing a Hadoop cluster is important, as the right resources will allow you to optimize the environment for your purpose. However, this is no simple task as optimizing a distributed environment and its related software can have its complexities.
The number of machines, and specs of the machines, depends on a few factors:
- The volume of data (obviously)
- The data retention policy (how much can you afford to keep before throwing away)
- The type of workload you have (data science/CPU driven vs “vanilla” use case/IO-bound)
- Also the data storage mechanism (data container, type of compression used if any)
We have to make some assumptions from the beginning; otherwise there are just too many parameters to deal with. These assumptions drive the DataNode configuration. The other types of machines (NameNode/JobTracker in Hadoop) need different specs, and are generally more straightforward.
Hardware Requirement for NameNodes:
Both the active and standby NameNode servers should have highly reliable storage for the namespace and edit-log journaling. 4-6 1 TB SAS disks are recommended: one pair for the OS [RAID 1], and the rest for the fsimage and edit log [RAID 5/6]. Remember that the disks on which you plan to store the file-system image and edit log should be RAIDed. JournalNodes also need reliable storage for the edit log. If your cluster is an HA cluster, plan it so that the JournalNodes run on the master nodes; at least 3 JournalNodes are required for a typical HA cluster.
Hardware Requirement for JobTracker/ResourceManager:
JobTracker/ResourceManager servers do not need RAID storage, because they save their persistent state to HDFS. The JobTracker/ResourceManager can even run on a slave node with a bit of extra RAM, so commodity hardware can be used for it. If you want to run the NameNode and JobTracker/ResourceManager on the same server, then reliable hardware is recommended.
- Memory sizing:
The amount of memory required for the master nodes depends on the size of the data, i.e. the number of file-system objects (files and block replicas) the NameNode must create and track. Typical memory ranges are 24 GB to 64 GB.
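A quick back-of-the-envelope sketch of that sizing, assuming the commonly cited rule of thumb of roughly 1 GB of NameNode heap per million file-system objects (the average blocks-per-file figure is an illustrative assumption, not a measured value):

```python
def namenode_heap_gb(num_files, avg_blocks_per_file=1.5):
    """Estimate NameNode heap in GB, assuming ~1 GB per million
    file-system objects (files + blocks). Rule of thumb only."""
    objects = num_files * (1 + avg_blocks_per_file)
    return objects / 1_000_000

# e.g. 20 million files averaging 1.5 blocks each
print(round(namenode_heap_gb(20_000_000), 1))  # 50.0
```

A result like this (~50 GB) lands within the 24-64 GB range above; small-file-heavy workloads push the object count, and therefore the heap, up fast.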
NameNodes and their clients are very “chatty”. We therefore recommend providing 2 CPUs (8 or even 12 cores each) on the master nodes to handle the messaging traffic. A single CPU with 16 or even 24 cores can also be used. CPU clock speed should be at least 2-2.5 GHz.
Hardware Requirement for Slave Nodes: In general, when weighing higher-performance against lower-performance components, the rule is:
“Save the money, buy more nodes!”
HDFS is usually configured to replicate data 3 ways, so you will need 3x the actual storage capacity for your data. In addition, you need to reserve machine capacity for temporary storage during computation (transient map outputs stay local to the machine rather than being stored on HDFS, and local storage for compression is also needed). A good rule of thumb is to keep the disks at 70% capacity. Then we also need to take the compression ratio into account.
Let’s take an example:
Say we have 70 TB of raw data to store on a yearly basis (i.e. a moving window of 1 year). After compression (say, with Gzip at a 60% savings) we get 70 – (70 × 60%) = 28 TB, which we multiply by 3x replication to get 84 TB. Keeping disks at 70% capacity, 84 TB = x × 70%, thus x = 84 / 0.7 = 120 TB is the value we need for capacity planning.
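The same arithmetic as a small sketch; the 60% Gzip savings and 70% disk-utilization ceiling are the article's working assumptions, not universal constants:

```python
def required_capacity_tb(raw_tb, compression_savings=0.60,
                         replication=3, max_utilization=0.70):
    """Raw capacity needed, given compression, replication and a
    disk-utilization ceiling. All assumptions are example values."""
    compressed = raw_tb * (1 - compression_savings)  # 70 -> 28 TB
    replicated = compressed * replication            # 28 -> 84 TB
    return replicated / max_utilization              # 84 / 0.7 = 120 TB

print(round(required_capacity_tb(70), 1))  # 120.0
```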
Number of nodes: Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster from Cloudera:
- 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration (no RAID, please!)
- multi-core CPUs, running at least 2-2.5GHz
- So let’s divide the capacity-planning figure by the storage per node: 120 TB / (12 × 1 TB) = 10 nodes.
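Putting that division into a sketch (the 12 × 1 TB JBOD layout is the example configuration above; rounding up covers capacities that don't divide evenly):

```python
import math

def num_datanodes(capacity_tb, disks_per_node=12, disk_size_tb=1):
    """Number of DataNodes needed for a target capacity, rounded up
    to whole machines. Disk layout is the example JBOD config."""
    return math.ceil(capacity_tb / (disks_per_node * disk_size_tb))

print(num_datanodes(120))  # 10
```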
Number of tasks per node: First, let’s figure out the number of tasks per node. Usually count 1 core per task. If the job is not too CPU-heavy, the number of tasks can exceed the number of cores.
Example: 12 cores, jobs use ~75% of CPU
Let’s assign 14 free slots (slightly more than the number of cores is a good rule of thumb): maxMapTasks=8, maxReduceTasks=6.
Memory: Now let’s figure out the memory we can assign to these tasks. By default, the TaskTracker and DataNode each take up 1 GB of RAM. For each task, budget mapred.child.java.opts (200 MB by default) of RAM. In addition, count 2 GB for the OS. So say we have 24 GB of memory available: 24 – 2 (OS) – 1 (DataNode) – 1 (TaskTracker) = 20 GB for our 14 tasks, so we can assign roughly 1.4 GB to each task (14 × 1.4 = 19.6 GB).
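The per-task budget as a sketch, subtracting the OS and daemon footprints mentioned above (all figures are the article's example values):

```python
def heap_per_task_gb(total_ram_gb=24, os_gb=2, datanode_gb=1,
                     tasktracker_gb=1, task_slots=14):
    """Per-task RAM budget after reserving for the OS and the
    DataNode/TaskTracker daemons. Example figures, in GB."""
    available = total_ram_gb - os_gb - datanode_gb - tasktracker_gb
    return round(available / task_slots, 1)

print(heap_per_task_gb())  # 1.4
```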
The memory requirement also depends on the type of your Hadoop cluster, e.g. balanced workload vs. compute-intensive workload. A compute-intensive cluster needs more memory and more CPU cores on the slaves for faster processing.
All slave machines should be able to accept additional CPUs, disks and RAM in the future. If you plan to expand your Hadoop cluster, this lets you grow an existing cluster without adding more racks or making network changes. Alternatively, you can also add more machines.
Network Considerations: Hadoop is very bandwidth-intensive! Often, all nodes are communicating with each other at the same time.
- Use dedicated switches for your Hadoop cluster
- Nodes are connected to a top-of-rack switch
- Nodes should be connected at a minimum speed of 1Gb/sec
- For clusters where large amounts of intermediate data is generated, consider 10Gb/sec connections – Expensive – Alternative: bond two 1Gb/sec connections to each node
- Racks are interconnected via core switches
- Core switches should connect to top-of-rack switches at 10Gb/sec or faster
- Beware of oversubscription in top-of-rack and core switches
- Consider bonded Ethernet to mitigate against failure
- Consider redundant top-of-rack and core switches
Operating System Recommendations: You should always choose an OS that you are comfortable administering.
- CentOS: geared towards servers rather than individual workstations – Conservative about package versions – Very widely used in production
- RedHat Enterprise Linux (RHEL): RedHat-supported analog to CentOS – Includes support contracts, for a price
- In production, we often see a mixture of RHEL and CentOS machines – Often RHEL on master nodes, CentOS on slaves
- Fedora Core: geared towards individual workstations – Includes newer versions of software, at the expense of some stability – We recommend server-based, rather than workstation-based, Linux distributions
- Ubuntu: Very popular distribution, based on Debian – Both desktop and server versions available – Try to use an LTS (Long Term Support) version
- SuSE: popular distribution, especially in Europe – Cloudera provides CDH packages for SuSE
- Solaris, OpenSolaris: not commonly seen in production clusters
Conclusion: Apache Hadoop cluster design is a serious platform-engineering project, and design decisions need to be well understood. What you can get away with in a small cluster may cause issues as the cluster grows. I have tried to explain the basics of Hadoop cluster design, and I hope it is helpful to you, but I would always suggest discussing with your distribution provider whenever you are planning to set up a cluster.