Tuesday, October 15, 2019

NameNode HA fails over due to connection interruption with JournalNodes

Issue:

Occasionally, the active NameNode in an HA pair shuts down and fails over after its network connection to the JournalNodes is interrupted.

The NameNode log shows the following error:

2015-11-06 17:39:09,497 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: starting log segment 2710407 failed for required journal (JournalAndStream(mgr=QJM to [172.xxx.xxx.xxx:8485, 172.xxx.xxx.xxx:8485, 172.xxx.xxx.xxx:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2015-11-06 17:39:09,500 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2015-11-06 17:39:09,506 INFO namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at <FQDN>/172.xxx.xxx.xxx
************************************************************

Cause:

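A probable network interruption between the active NameNode and the JournalNodes. The NameNode did not get a response from a quorum of JournalNodes within the 20000 ms timeout shown in the log, so it exited with status 1.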


Solution:

Increase the journal quorum timeout to 60 seconds (60000 ms) by adding the
following property to the HDFS configuration in hdfs-site:

Property Name:  dfs.qjournal.write-txns.timeout.ms
Property Value: 60000
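
If hdfs-site.xml is edited by hand rather than through a management UI, the property above could be written as the fragment below. This is only a sketch. The first entry is the setting from this post. The commented-out second entry is an assumption that is not part of the original fix: the stack trace above times out inside startLogSegment, and dfs.qjournal.start-segment.timeout.ms (default 20000 in hdfs-default.xml) appears to be the timeout for that call, so it may be worth raising as well. Verify it against the hdfs-default.xml for your Hadoop version before enabling it.

```xml
<!-- Goes inside the existing <configuration> element of hdfs-site.xml -->
<property>
  <!-- Setting from this post: quorum timeout for writing transactions, in ms -->
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>

<!-- Assumption, not part of the original fix: timeout for starting a log segment.
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>60000</value>
</property>
-->
```

The NameNodes normally need a restart to pick up the change. A longer timeout only lets the NameNode ride out short interruptions. It does not fix the underlying network problem.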
