Chaudhari, Shivangi. Accelerating Hadoop Map-Reduce for small/intermediate data sizes using the Comet coordination framework. Retrieved from https://doi.org/doi:10.7282/T3M045NS
MapReduce has emerged as a popular programming paradigm for data-intensive computing in clustered environments. As a framework for solving embarrassingly parallel problems, MapReduce has been used extensively on large clusters. Such frameworks scale to petabytes of data largely through the use of a distributed file system, for example the Google File System used by the proprietary Google MapReduce.
In the "Map", the master node takes the input, divides it into smaller sub-problems, and distributes those to worker nodes. The worker node processes that smaller problem, and passes the answer back to its master node. In the "Reduce", the master node then takes the answers of the sub-problems and combines them to get the final output after reduces. The advantage of MapReduce is that, it allows for distributed processing of the map and reduction operations, assuming each operation is independent of the other, all can be executed in parallel.
We found that file reads and writes to the distributed file system carry a noticeable overhead, especially for smaller data sizes on the order of a few tens of gigabytes. Our solution is a MapReduce framework built over the Comet framework that uses TCP sockets for communication and coordination, and that keeps operations on data in memory whenever possible. The objectives of this thesis are to:
(1) understand the behavior and limitations of MapReduce for small-to-moderate datasets;
(2) develop a coordination and interaction framework that complements Hadoop MapReduce and addresses these shortcomings; and
(3) demonstrate and evaluate the framework using a real-world application.
In this thesis we use Comet and its services to build a MapReduce infrastructure that addresses the above requirements; specifically, it enables pull-based scheduling of Map tasks as well as stream-based coordination and data exchange. The framework is based on the master-worker model. Comet is a decentralized (peer-to-peer) computational infrastructure that supports applications with high computational requirements.
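To make the pull-based scheduling idea concrete, here is a minimal, hypothetical sketch in plain Java. A shared queue stands in for Comet's coordination space (the actual Comet tuple-space operations are not shown): the master publishes Map tasks into it, and idle workers pull the next task themselves, so faster workers automatically take on more work.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class PullSchedulingSketch {
      // Hypothetical task descriptor; Comet's actual tuples are richer than this.
      record MapTask(int id, String inputSplit) {}

      public static void main(String[] args) throws InterruptedException {
        // A shared queue standing in for Comet's coordination space.
        BlockingQueue<MapTask> space = new LinkedBlockingQueue<>();

        // Master side: publish the Map tasks up front.
        for (int i = 0; i < 8; i++) {
          space.put(new MapTask(i, "split-" + i));
        }

        // Worker side: each worker pulls its next task when idle, so load
        // balances itself without the master pushing assignments.
        for (int w = 0; w < 3; w++) {
          final int workerId = w;
          new Thread(() -> {
            MapTask task;
            while ((task = space.poll()) != null) {
              System.out.printf("worker %d processing %s%n", workerId, task.inputSplit());
            }
          }).start();
        }
      }
    }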
Our system's interfaces are similar to those of the Hadoop MapReduce framework, so that applications built on Hadoop are easily portable to the Comet-based framework. The implementation is described in detail and evaluated, with results, on a real pharmaceutical problem. We found that our solution can accelerate computations on medium-sized data by delaying or avoiding distributed file reads and writes.
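To suggest what such Hadoop-like interfaces might look like, here is a hypothetical sketch; the names and signatures are illustrative only, not the thesis's actual API. The mapper and reducer contracts mirror Hadoop's, while the emitter would stream intermediate pairs over TCP rather than write them to a distributed file system.

    // Hypothetical sketch: illustrative names and signatures, not the
    // thesis's actual API. The contracts mirror Hadoop's Mapper/Reducer,
    // so porting an application needs little more than a change of base type.
    public final class CometStyleApi {
      public interface Emitter<K, V> {
        // In the thesis's design, intermediate pairs would be streamed
        // over TCP instead of being written to a distributed file system.
        void emit(K key, V value);
      }
      public interface Mapper<KI, VI, KO, VO> {
        void map(KI key, VI value, Emitter<KO, VO> out);
      }
      public interface Reducer<K, V, KO, VO> {
        void reduce(K key, Iterable<V> values, Emitter<KO, VO> out);
      }
    }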