Record-Based Block Distribution (RBBD) and weighted set cover scheduling (WSCS) in MapReduce

Journal of Internet Services and Applications, Volume 3, Issue 3, pp 319–327

First Online: 23 October 2012. Received: 15 May 2012. Accepted: 16 September 2012.


The massive growth in data volume, together with increasing computational capability, has made compute-intensive clusters ill-suited to the analysis of large-scale datasets, because the large amount of data transferred over the network creates a bottleneck. Chunk-based storage systems are typical data-intensive clusters introduced for big data analysis. They split data into blocks of the same predefined size and store them randomly across nodes. These systems co-locate computing and storage to reduce network transfer, scheduling each computation on the node that holds the most required data. This strategy performs well when the record that serves as the input of the analysis resides mostly on a single node. However, this does not always hold, because there is a gap between records and blocks: blocks are scattered across the data nodes with no regard to the semantics of the records. The current solution overlooks the relationship between the computation unit (a record) and the storage unit (a block). For a record contained in a single block, scheduling by block location incurs no data transfer. In practice, however, one record may consist of several blocks, which is common for binary files. Because these blocks are randomly stored across the nodes and must be transferred to the selected compute node, extra data transfer is incurred to prepare the input data. Our contribution is a Record-Based Block Distribution (RBBD) framework for data-intensive analytics that eliminates the gap between records and blocks, reducing the data transfer volume before the analytics are processed. In addition, a weighted set cover scheduling (WSCS) algorithm is implemented to further improve the performance of data-intensive analytics by choosing the best combination of data nodes to perform the computation.
Our experiments show that with the RBBD framework and WSCS, the data transfer volume is reduced by an average of 36.37 %, and our weighted set cover algorithm outperforms the random algorithm by 51–62 %, deviating from the ideal solution by no more than 6.8 %.
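
The abstract does not spell out how WSCS selects nodes, so the following is only a minimal sketch of the textbook greedy approximation for weighted set cover, not the authors' implementation. It assumes each data node holds a known set of blocks and has a scalar cost; the function name `greedy_weighted_set_cover`, the node and block labels, and the cost values are all hypothetical.

```python
def greedy_weighted_set_cover(required, node_blocks, node_cost):
    """Greedy approximation for weighted set cover (illustrative only).

    required    -- iterable of block ids the job must read
    node_blocks -- dict: node name -> set of block ids stored on that node
    node_cost   -- dict: node name -> cost of using that node (assumed metric)

    Repeatedly picks the node with the lowest cost per newly covered block
    until every required block is covered.
    """
    uncovered = set(required)
    chosen = []
    while uncovered:
        best_node, best_ratio = None, float("inf")
        # Sorted iteration makes tie-breaking deterministic.
        for node in sorted(node_blocks):
            gain = len(node_blocks[node] & uncovered)
            if gain == 0:
                continue  # node adds nothing new (or was already chosen)
            ratio = node_cost[node] / gain
            if ratio < best_ratio:
                best_node, best_ratio = node, ratio
        if best_node is None:
            raise ValueError("some required blocks are not stored on any node")
        chosen.append(best_node)
        uncovered -= node_blocks[best_node]
    return chosen


# Hypothetical cluster layout: four nodes, six blocks.
node_blocks = {
    "n1": {"b1", "b2", "b3"},
    "n2": {"b3", "b4"},
    "n3": {"b4", "b5", "b6"},
    "n4": {"b1", "b6"},
}
node_cost = {"n1": 3.0, "n2": 1.0, "n3": 3.0, "n4": 4.0}
required = {"b1", "b2", "b3", "b4", "b5", "b6"}

plan = greedy_weighted_set_cover(required, node_blocks, node_cost)
print(plan)  # ['n2', 'n1', 'n3']
```

In this toy layout the greedy choice costs 7.0, whereas picking only `n1` and `n3` would cost 6.0. The greedy method is an approximation, so some deviation from the ideal solution of the kind reported above is to be expected.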

Keywords: Hadoop, Data-intensive, MapReduce, HDFS, Co-located compute and storage, HEC

Authors: Qiangju Xiao, Pengju Shang, Jun Wang


