Hadoop Administration
A place to discuss all Hadoop admin issues - Go green
Tuesday, October 22, 2019
Delete Old log files from HDFS - Older than 10 days
Script (tested):
today=$(date +'%s')
# List only directories (lines starting with "d") under the target path
hdfs dfs -ls /file/Path/ | grep "^d" | while read -r line ; do
  dir_date=$(echo "${line}" | awk '{print $6}')   # modification date (column 6)
  filePath=$(echo "${line}" | awk '{print $8}')   # full HDFS path (column 8)
  # Age of the directory in days
  difference=$(( ( today - $(date -d "${dir_date}" +%s) ) / ( 24*60*60 ) ))
  if [ "${difference}" -gt 10 ]; then
    hdfs dfs -rm -r "${filePath}"
  fi
done
Note: Change the file path to match your environment. The delete above does not skip the trash, so removed directories are first moved to the HDFS trash (when trash is enabled).
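If the old directories should bypass the trash and be deleted permanently, the delete line can use the -skipTrash option instead, for example:
hdfs dfs -rm -r -skipTrash "${filePath}"
Use this variant with care, since the data cannot be recovered from the trash afterwards.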
Tuesday, October 15, 2019
NameNode HA fails over due to connection interruption with JournalNodes
Issue:
Occasionally NameNode HA fails over due to network connection interruption with JournalNodes.
The error message in the NameNode log:
2015-11-06 17:39:09,497 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: starting log segment 2710407 failed for required journal (JournalAndStream(mgr=QJM to [172.xxx.xxx.xxx:8485, 172.xxx.xxx.xxx:8485, 172.xxx.xxx.xxx:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2015-11-06 17:39:09,500 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2015-11-06 17:39:09,506 INFO namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at <FQDN>/172.xxx.xxx.xxx
************************************************************
Cause:
Potential network interruption between the active NameNode and all JournalNodes.
Solution:
Increase the journal quorum write timeout to 60 seconds by adding the
following property to the HDFS configuration (hdfs-site.xml):
Property Name: dfs.qjournal.write-txns.timeout.ms
Property Value: 60000
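For reference, a minimal sketch of how this would look in hdfs-site.xml (on a managed cluster, set it through Ambari or Cloudera Manager instead; a NameNode restart is typically needed for the change to take effect):
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>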
Tez job fails with a 'Vertex failed' error
The error below is seen while a Hive query runs in Tez execution mode:
Logs:
Vertex failed, vertexName=Reducer 34, vertexId=vertex_1424999265634_0222_1_23, diagnostics=[Task failed, taskId=task_1424999265634_0222_1_23_000007, diagnostics=[AttemptID:attempt_1424999265634_0222_1_23_000007_0 Info:Error: java.lang.RuntimeException: java.lang.RuntimeException: Reduce operator initialization failed
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:188)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:307)
at org.apache.hadoop.mapred.YarnTezDagChild$5.run(YarnTezDagChild.java:564)
at java.se
... 6 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: : init not supported
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFStreamingEvaluator.init(GenericUDAFStreamingEvaluator.java:70)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:160)
... 7 more
CAUSE:
Tez containers do not have enough memory allocated to run the query.
SOLUTION:
Set the following configuration values in the Hive query to increase the memory (a sketch of setting them in a Hive session follows the notes below):
tez.am.resource.memory.mb=4096
tez.am.java.opts=-server -Xmx3276m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC
hive.tez.container.size=4096
hive.tez.java.opts=-server -Xmx3276m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC
Note:
- Ensure that the *.opts heap sizes (-Xmx) are about 80% of the corresponding *.mb values, and that the *.mb values can actually be allocated by the NodeManagers.
- If the issue happens again, increase the above values by the minimum container size and rerun the query.
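As a rough sketch, assuming the query is run from the Hive CLI or Beeline, these values can be set at the session level just before the query, for example:
set tez.am.resource.memory.mb=4096;
set tez.am.java.opts=-server -Xmx3276m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC;
set hive.tez.container.size=4096;
set hive.tez.java.opts=-server -Xmx3276m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC;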
Monday, October 14, 2019
How to limit the number of mappers for a MapReduce job via Java code?
The number of mappers launched is based on the input split size.
The input split size is the size of the chunk of data handed to each mapper as it is read from HDFS.
So, in order to control the number of mappers, we have to control the split size.
Now, there are two properties we can look into:
- mapred.min.split.size
- mapred.max.split.size (size in bytes)
For example, if we have a 20 GB file and we want to launch 40 mappers, we need a split size of 20480 MB / 40 = 512 MB (536870912 bytes).
So the code would be:
conf.set("mapred.min.split.size", "536870912");
conf.set("mapred.max.split.size", "536870912");
Setting "mapred.map.tasks" is just a directive on how many mapper we want to launch. It doesn't guarantee that many mappers.
More information on this topic: http://wiki.apache.org/hadoop/HowManyMapsAndReduces