Namenode may keep crashing due to excessive logging
Category : HDFS
The NameNode may keep crashing even after you restart all services and have sufficient heap size, and you see one of the following errors in the logs:
java.io.IOException: IPC's epoch 197 is less than the last promised epoch 198
or
2017-09-28 09:16:11,371 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(851)) - Local service NameNode at m1.hdp22 entered state: SERVICE_NOT_RESPONDING
Root Cause: In my case the NameNode was logging excessively through the BlockStateChange and org.apache.hadoop.hdfs.StateChange loggers. When this logging runs nonstop, the NameNode becomes slow to respond to other RPC requests, including the ZKFC health check and JournalNode calls, which can trigger a failover (hence the epoch mismatch) and the SERVICE_NOT_RESPONDING state above. Raising the log level for these two loggers from INFO to a less verbose level takes that load off the NameNode.
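A quick way to confirm this is to count how much of the current NameNode log comes from these two loggers. This is a rough check; the log path and hostname below are examples based on the default HDP layout and the host shown in the message above, so adjust them to your installation:
# Count block/state-change lines versus the total size of the NameNode log
grep -c 'BlockStateChange' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-m1.hdp22.log
grep -c 'hdfs.StateChange' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-m1.hdp22.log
wc -l < /var/log/hadoop/hdfs/hadoop-hdfs-namenode-m1.hdp22.log
If the two counts together account for most of the log lines, excessive logging is a likely culprit.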
Solution: Increase the log level for the two loggers by adding the lines below to the HDFS log4j configuration in Ambari (Ambari UI > HDFS > Configs > Advanced hdfs-log4j), then restart the NameNode when Ambari prompts for it:
log4j.logger.BlockStateChange=ERROR
log4j.logger.org.apache.hadoop.hdfs.StateChange=ERROR
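If you want the quieter levels to take effect immediately, without waiting for the NameNode restart, the same change can be applied at runtime with the hadoop daemonlog command. The host and HTTP port below are examples for a plain-HTTP HDP 2.x NameNode; the runtime change does not survive a restart, so the Ambari log4j change above is still needed to make it permanent:
# Apply the new level at runtime (temporary, lost on restart)
hadoop daemonlog -setlevel m1.hdp22:50070 BlockStateChange ERROR
hadoop daemonlog -setlevel m1.hdp22:50070 org.apache.hadoop.hdfs.StateChange ERROR
# Verify the current level
hadoop daemonlog -getlevel m1.hdp22:50070 BlockStateChange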