Virga: Incremental Recomputations in MapReduce


In 2004, Google introduced the MapReduce paradigm for parallel data processing in large shared-nothing clusters. MapReduce was primarily built for processing large amounts of unstructured data, such as web request logs or crawled web documents stored in Google’s distributed file system (GFS). More recently, MapReduce has been reported to be used on top of Google’s Bigtable, a distributed storage system for managing structured data. A key difference between GFS and Bigtable is their update model. While GFS files are typically append-only data sets, Bigtable supports row-level updates.


When Bigtable is used as input source for MapReduce jobs, often only parts of the source data have been changed since the job’s previous run. As yet, MapReduce results have to be recomputed from scratch to incorporate the latest base data changes. This approach is obviously inefficient in many situations and it seems desirable to maintain MapReduce results in an incremental way similar to materialized views. From an abstract point of view, materialized views and MapReduce computations have a lot in common. A materialized view is derived from one or more base tables in a way specified by a user-supplied view definition and persistently stored in the database. Similarly, a MapReduce job reads data from one or more Bigtable datasets, transforms it in a way specified by user-supplied Map and Reduce functions and persistently stores the result in Bigtable.


However, applying view maintenance techniques in the MapReduce environment is challenging, because the programming models (or query languages) and data models differ heavily. View definitions are specified in SQL, a language closely tied to the relational algebra and the relational data model. The MapReduce programming model is more generic; the framework provides hooks to plug-in custom Map and Reduce functions written in standard programming languages. In this project, we explore incremental recomputation techniques in the MapReduce environment to find answers to the following questions.


  • Given a (sequence of) MapReduce jobs, how can “incremental counterparts” be derived that consume source deltas and compute deltas to be applied to the target view?
  • For such a derivation process, what is an appropriate level of abstraction? MapReduce by itself requires programmers to plug-in custom code. Is it feasible to identify classes of Mappers and Reducers that share interesting properties with regard to incremental processing?
  • Infrastructure has been build on top of MapReduce to provide programmers with high-level languages such as Jaql, PigLatin, or HiveQL. Is it possible to derive incremental MapReduce programs automatically for (a subset) of any of these languages?




The Virga project has been selected to receive a Google Research Award in December 2010.



Johannes Schildgen, Thomas Lottermann and Stefan Deßloch
Cross-System NoSQL Data Transformations with NotaQL
In: Proc. SIGMOD Workshop "Algorithms and Systems for MapReduce and Beyond (BeyondMR)", San Francisco
July 2016 accepted
Johannes Schildgen and Stefan Deßloch
Heterogenität überwinden mit der Datentransformationssprache NotaQL
Springer Verlag, 2016


Jan Adamczyk
Inkrementelle Transformationen mittels Trigger-basiertem Change-Data Capture
Bachelor's Thesis, TU Kaiserslautern, November 2015
Stefan Braun, Johannes Schildgen and Stefan Deßloch
Visualisierung von NoSQL-Transformationen unter der Verwendung von Sampling-Techniken
Herbsttreffen der Fachgruppe Datenbanken@LWA 2015
October 2015
Johannes Schildgen and Stefan Deßloch
NotaQL Is Not a Query Language! It's for Data Transformation on Wide-Column Stores
In: Proc. British International Conference on Databases - BICOD 2015
Johannes Schildgen and Stefan Deßloch
Incremental Data Transformations on Wide-Column Stores with NotaQL
In: Proc. 9th Symposium and Summer School On Service-Oriented Computing (SummerSoc), Crete, Greece
June 2015
Marc Schäfer, Johannes Schildgen and Stefan Deßloch
Sampling with Incremental MapReduce
In: Proc. Workshop on Big Data in Science (BigDS) @ BTW, LNI, Hamburg
March 2015
Sougata Bhattacharjee
Views and Virtual Table Transformations on HBase
Master's Thesis, TU Kaiserslautern, 2015
Fabian Süß
A Web-based Sandbox for NotaQL Table Transformations
Bachelor's Thesis, TU Kaiserslautern, 2015
Nico Schäfer
An SQL-based Intermediate Layer for Table Transformations on Wide-Column Stores
Bachelor's Thesis, TU Kaiserslautern, 2015


Yong Hu and Stefan Dessloch
Extracting deltas from column oriented NoSQL databases for different incremental applications and diverse data targets.
Special issues of ADBIS 2013
ELsevier, 2014
Johannes Schildgen, Thomas Jörg, Manuel Hoffmann and Stefan Deßloch
Marimba: A Framework for Making MapReduce Jobs Incremental
IEEE International Congress on Big Data
Previous | 1, 2 | Next
Export as:

tx_sevenpack_pi1 error

No storage pid given. Select a Starting point.