Standby NameNode is failing and only one is running


Category : HDFS

The Standby NameNode is unable to start up, or, once the Standby NameNode is brought up, the Active NameNode soon goes down, leaving only one live NameNode. The NameNode log shows:

FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to ))

java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. 
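To confirm which NameNode is still live, the HA state of each NameNode can be checked with hdfs haadmin (nn1 and nn2 below are placeholder service IDs; substitute the IDs defined by dfs.ha.namenodes.<nameservice> in hdfs-site.xml):

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

A healthy HA pair reports one "active" and one "standby"; in this scenario only one of the two commands succeeds.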

ROOT CAUSE: 

This can be caused by a network issue that makes the JournalNode take a long time to sync. The following snippet from the JournalNode log shows a sync that took an unusually long time (over 44 seconds, well beyond the 20000 ms the NameNode waits for a quorum response), so the edit-log flush fails and the Active NameNode aborts:

WARN server.Journal (Journal.java:journal(384)) - Sync of transaction range 187176137-187176137 took 44461ms

WARN ipc.Server (Server.java:processResponse(1027)) - IPC Server handler 3 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from Call#10890 Retry#0: output error

INFO ipc.Server (Server.java:run(2105)) - IPC Server handler 3 on 8485 caught an exception

java.nio.channels.ClosedChannelException

at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)

at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)

at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2573)

at org.apache.hadoop.ipc.Server.access$1900(Server.java:135)

at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:977)

at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1042)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2094)

 

RESOLUTION:

Increase the values of the following JournalNode timeout properties in hdfs-site.xml (a sample configuration snippet follows the list):

dfs.qjournal.select-input-streams.timeout.ms = 60000 

dfs.qjournal.start-segment.timeout.ms = 60000 

dfs.qjournal.write-txns.timeout.ms = 60000
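
A minimal hdfs-site.xml sketch raising all three timeouts from their 20000 ms default (the same 20000 ms that appears in the FATAL message above) to 60000 ms would look like the following; the NameNodes need to be restarted afterwards, and if the cluster is managed by Ambari or Cloudera Manager the properties should be set through the management UI rather than by editing the file directly:

<property>
  <name>dfs.qjournal.select-input-streams.timeout.ms</name>
  <value>60000</value>
</property>
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>60000</value>
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>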

