Virga: Incremental Recomputations in MapReduce

 

In 2004, Google introduced the MapReduce paradigm for parallel data processing in large shared-nothing clusters. MapReduce was primarily built for processing large amounts of unstructured data, such as web request logs or crawled web documents stored in Google’s distributed file system (GFS). More recently, MapReduce has been reported to be used on top of Google’s Bigtable, a distributed storage system for managing structured data. A key difference between GFS and Bigtable is their update model. While GFS files are typically append-only data sets, Bigtable supports row-level updates.

 

When Bigtable is used as an input source for MapReduce jobs, often only parts of the source data have changed since the job's previous run. So far, MapReduce results have to be recomputed from scratch to incorporate the latest base data changes. This approach is obviously inefficient in many situations, and it seems desirable to maintain MapReduce results incrementally, similar to materialized views. From an abstract point of view, materialized views and MapReduce computations have a lot in common. A materialized view is derived from one or more base tables in a way specified by a user-supplied view definition and persistently stored in the database. Similarly, a MapReduce job reads data from one or more Bigtable datasets, transforms it in a way specified by user-supplied Map and Reduce functions, and persistently stores the result in Bigtable.
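To make the analogy concrete, the following is a minimal sketch (not the project's implementation; all names are illustrative) of a batch MapReduce-style word-count job, with an explicit map phase, shuffle, and reduce phase:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical Map function: emit (word, 1) for every word in a document.
def map_fn(doc_id, text):
    for word in text.split():
        yield (word, 1)

# Hypothetical Reduce function: sum all counts emitted for one word.
def reduce_fn(word, counts):
    return (word, sum(counts))

def run_job(documents):
    # Map phase over all input records.
    pairs = [kv for doc_id, text in documents.items()
                for kv in map_fn(doc_id, text)]
    # Shuffle: group intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(k, [v for _, v in grp])
                for k, grp in groupby(pairs, key=itemgetter(0)))

docs = {"d1": "a b a", "d2": "b c"}
result = run_job(docs)  # {'a': 2, 'b': 2, 'c': 1}
```

Like a materialized view, `result` is fully determined by the source data and the user-supplied functions; whenever the source changes, a from-scratch rerun recomputes it.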

 

However, applying view maintenance techniques in the MapReduce environment is challenging, because the programming models (or query languages) and data models differ considerably. View definitions are specified in SQL, a language closely tied to the relational algebra and the relational data model. The MapReduce programming model is more generic; the framework provides hooks to plug in custom Map and Reduce functions written in standard programming languages. In this project, we explore incremental recomputation techniques in the MapReduce environment to find answers to the following questions.

 

  • Given a (sequence of) MapReduce jobs, how can “incremental counterparts” be derived that consume source deltas and compute deltas to be applied to the target view?
  • For such a derivation process, what is an appropriate level of abstraction? MapReduce by itself requires programmers to plug in custom code. Is it feasible to identify classes of Mappers and Reducers that share interesting properties with regard to incremental processing?
  • Infrastructure has been built on top of MapReduce to provide programmers with high-level languages such as Jaql, PigLatin, or HiveQL. Is it possible to derive incremental MapReduce programs automatically for (a subset of) any of these languages?
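The first question can be illustrated with a toy sketch (an assumption-laden example, not the project's derivation technique): for a word-count view, the Reduce function is a summation, which is distributive, so an incremental counterpart can map only the inserted and deleted source rows and apply the resulting additive delta to the materialized target view.

```python
from collections import Counter

# Hypothetical incremental Map function: sign is +1 for inserted
# source rows and -1 for deleted source rows.
def map_delta(text, sign):
    for word in text.split():
        yield (word, sign)

def incremental_update(view, inserted, deleted):
    # Compute the delta from the changed source rows only.
    delta = Counter()
    for text in inserted:
        for w, s in map_delta(text, +1):
            delta[w] += s
    for text in deleted:
        for w, s in map_delta(text, -1):
            delta[w] += s
    # Apply the delta to the materialized view.
    for w, d in delta.items():
        view[w] = view.get(w, 0) + d
        if view[w] == 0:
            del view[w]
    return view

view = {"a": 2, "b": 2, "c": 1}           # result of the initial full run
incremental_update(view, inserted=["c d"], deleted=["a b a"])
# view is now {'b': 1, 'c': 2, 'd': 1}
```

The sketch only works because summation is self-maintainable from deltas; identifying which classes of Reducers admit such counterparts is precisely the second question above.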

 

 

Awards

The Virga project has been selected to receive a Google Research Award in December 2010.

Publications

2011

Thomas Jörg, Roya Parvizi, Hu Yong and Stefan Dessloch
Can MapReduce learn from Materialized Views?
In: Proc. LADIS 2011, pp. 1-5, September 2011
Roya Parvizi
Inkrementelle Neuberechnungen mit MapReduce (Incremental Recomputations with MapReduce)
Bachelor's Thesis, Technische Universität Kaiserslautern, June 2011

