The number of mappers launched is based on the input block size.
The input block size is the size of the data block sent to different mappers while it is being read from the HDFS.
So in order to control the number of mappers we have to control the block size.
Now, there are two properties we can look into:
- mapred.min.split.size
- mapred.max.split.size (size in bytes)
For example, if we have a 20 GB file, and we want to launch 40 mappers, then we need to set it to 20480 / 40 = 512 MB each.
So for that the code would be:
conf.set("mapred.min.split.size", "536870912");
conf.set("mapred.max.split.size", "536870912");
Setting "mapred.map.tasks" is just a directive on how many mapper we want to launch. It doesn't guarantee that many mappers.
More information on this topic: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
The input block size is the size of the data block sent to different mappers while it is being read from the HDFS.
So in order to control the number of mappers we have to control the block size.
Now, there are two properties we can look into:
- mapred.min.split.size
- mapred.max.split.size (size in bytes)
For example, if we have a 20 GB file, and we want to launch 40 mappers, then we need to set it to 20480 / 40 = 512 MB each.
So for that the code would be:
conf.set("mapred.min.split.size", "536870912");
conf.set("mapred.max.split.size", "536870912");
Setting "mapred.map.tasks" is just a directive on how many mapper we want to launch. It doesn't guarantee that many mappers.
More information on this topic: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
No comments:
Post a Comment