Virga: Incremental Recomputations in MapReduce


In 2004, Google introduced the MapReduce paradigm for parallel data processing in large shared-nothing clusters. MapReduce was primarily built for processing large amounts of unstructured data, such as web request logs or crawled web documents stored in Google’s distributed file system (GFS). More recently, MapReduce has been reported to be used on top of Google’s Bigtable, a distributed storage system for managing structured data. A key difference between GFS and Bigtable is their update model. While GFS files are typically append-only data sets, Bigtable supports row-level updates.


When Bigtable is used as input source for MapReduce jobs, often only parts of the source data have been changed since the job’s previous run. As yet, MapReduce results have to be recomputed from scratch to incorporate the latest base data changes. This approach is obviously inefficient in many situations and it seems desirable to maintain MapReduce results in an incremental way similar to materialized views. From an abstract point of view, materialized views and MapReduce computations have a lot in common. A materialized view is derived from one or more base tables in a way specified by a user-supplied view definition and persistently stored in the database. Similarly, a MapReduce job reads data from one or more Bigtable datasets, transforms it in a way specified by user-supplied Map and Reduce functions and persistently stores the result in Bigtable.


However, applying view maintenance techniques in the MapReduce environment is challenging, because the programming models (or query languages) and data models differ heavily. View definitions are specified in SQL, a language closely tied to the relational algebra and the relational data model. The MapReduce programming model is more generic; the framework provides hooks to plug-in custom Map and Reduce functions written in standard programming languages. In this project, we explore incremental recomputation techniques in the MapReduce environment to find answers to the following questions.


  • Given a (sequence of) MapReduce jobs, how can “incremental counterparts” be derived that consume source deltas and compute deltas to be applied to the target view?
  • For such a derivation process, what is an appropriate level of abstraction? MapReduce by itself requires programmers to plug-in custom code. Is it feasible to identify classes of Mappers and Reducers that share interesting properties with regard to incremental processing?
  • Infrastructure has been build on top of MapReduce to provide programmers with high-level languages such as Jaql, PigLatin, or HiveQL. Is it possible to derive incremental MapReduce programs automatically for (a subset) of any of these languages?




The Virga project has been selected to receive a Google Research Award in December 2010.



Marc Schäfer, Johannes Schildgen and Stefan Deßloch
Sampling with Incremental MapReduce
In: Proc. Workshop on Big Data in Science (BigDS) @ BTW, LNI, Hamburg
March 2015


Yong Hu and Stefan Dessloch
Extracting deltas from column oriented NoSQL databases for different incremental applications and diverse data targets.
Special issues of ADBIS 2013
ELsevier, 2014
Johannes Schildgen, Thomas Jörg, Manuel Hoffmann and Stefan Deßloch
Marimba: A Framework for Making MapReduce Jobs Incremental
IEEE International Congress on Big Data
Marc Schäfer
Samping mit inkrementellem MapReduce
Master's Thesis, Technische Universität Kaiserslautern, 2014


Yong Hu and Stefan Dessloch
Extracting Deltas from Column Oriented NoSQL Databases for Different Incremental Applications and Diverse Data Targets
In: Proc. ADBIS 2013, LNCS 8133,
September 2013
Johannes Schildgen
Marimba: A MapReduce-based Framework for Incremental Recomputations on Big Data
In: Proc. PhD Consortium poster presentations at ADBIS, Genoa, Italy
September 2013 accepted
Manuel Hoffmann
Incremental Iterative Graph Algorithms with MapReduce
Bachelor's Thesis, Technische Universität Kaiserslautern, May 2013
Johannes Schildgen, Thomas Jörg and Stefan Deßloch
Inkrementelle Neuberechnungen in MapReduce
Datenbank-Spektrum, pp. 33-43
Springer-Verlag, March 2013
ISSN: 1618-2162
Jonathan Priebe
Collaborative Filtering with Marimba and Incremental MapReduce
Bachelor's Thesis, Technische Universität Kaiserslautern, 2013


Johannes Schildgen
Ein MapReduce-basiertes Programmiermodell für selbstwartbare Aggregatsichten
Master's Thesis, TU Kaiserslautern, August 2012
Marc Schäfer
Inkrementelle Wartung von Data Cubes mit MapReduce
Bachelor's Thesis, Technische Universität Kaiserslautern, January 2012


Thomas Jörg, Roya Parvizi, Hu Yong and Stefan Dessloch
Incremental Recomputations in MapReduce
In: Proc. CloudDB 2011
October 2011 accepted
Previous | 1, 2 | Next
Export as:

tx_sevenpack_pi1 error

No storage pid given. Select a Starting point.