Standby NameNode is failing and only one NameNode is running
Category: HDFS
The Standby NameNode is unable to start up, or, once the Standby NameNode is brought up, the Active NameNode soon goes down, leaving only one live NameNode. The NameNode log shows:
FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to ))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
ROOT CAUSE:
This can be caused by a network issue that makes the JournalNode take a long time to sync edits. In the following snippet from the JournalNode log, a single sync took 44461 ms, well over the 20000 ms quorum write timeout, so the NameNode aborted the flush to the required journal:
WARN server.Journal (Journal.java:journal(384)) - Sync of transaction range 187176137-187176137 took 44461ms
WARN ipc.Server (Server.java:processResponse(1027)) - IPC Server handler 3 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from Call#10890 Retry#0: output error
INFO ipc.Server (Server.java:run(2105)) - IPC Server handler 3 on 8485 caught an exception
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2573)
at org.apache.hadoop.ipc.Server.access$1900(Server.java:135)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:977)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1042)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2094)
RESOLUTION:
Increase the values of the following JournalNode timeout properties (each defaults to 20000 ms) in hdfs-site.xml:
dfs.qjournal.select-input-streams.timeout.ms = 60000
dfs.qjournal.start-segment.timeout.ms = 60000
dfs.qjournal.write-txns.timeout.ms = 60000
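As a minimal sketch, the three properties above would be set in hdfs-site.xml roughly as shown below. The property names and the 60000 ms values come from the resolution above; the file location, whether you edit it directly or through a management tool such as Ambari or Cloudera Manager, and the restart procedure are assumptions that depend on your distribution:

<!-- hdfs-site.xml: raise QJM timeouts from the 20000 ms default to 60000 ms -->
<property>
  <name>dfs.qjournal.select-input-streams.timeout.ms</name>
  <value>60000</value>
</property>
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>60000</value>
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>

Both NameNodes need to be restarted for the new timeouts to take effect. Note that raising these timeouts only gives the JournalNode quorum more time to acknowledge writes; it works around slow syncs but does not fix the underlying network or disk latency, which should still be investigated.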