AG DBIS AG HIS Staff Projects NotaQL Virga INDI GEM Caro Paladin Courses Publications AG DBIS Publications AG HIS Contact Impressum News Intern

Virga: Incremental Recomputations in MapReduce

In 2004, Google introduced the MapReduce paradigm for parallel data processing in large shared-nothing clusters. MapReduce was primarily built for processing large amounts of unstructured data, such as web request logs or crawled web documents stored in Google’s distributed file system (GFS). More recently, MapReduce has been reported to be used on top of Google’s Bigtable, a distributed storage system for managing structured data. A key difference between GFS and Bigtable is their update model. While GFS files are typically append-only data sets, Bigtable supports row-level updates.

When Bigtable is used as input source for MapReduce jobs, often only parts of the source data have been changed since the job’s previous run. As yet, MapReduce results have to be recomputed from scratch to incorporate the latest base data changes. This approach is obviously inefficient in many situations and it seems desirable to maintain MapReduce results in an incremental way similar to materialized views. From an abstract point of view, materialized views and MapReduce computations have a lot in common. A materialized view is derived from one or more base tables in a way specified by a user-supplied view definition and persistently stored in the database. Similarly, a MapReduce job reads data from one or more Bigtable datasets, transforms it in a way specified by user-supplied Map and Reduce functions and persistently stores the result in Bigtable.

However, applying view maintenance techniques in the MapReduce environment is challenging, because the programming models (or query languages) and data models differ heavily. View definitions are specified in SQL, a language closely tied to the relational algebra and the relational data model. The MapReduce programming model is more generic; the framework provides hooks to plug-in custom Map and Reduce functions written in standard programming languages. In this project, we explore incremental recomputation techniques in the MapReduce environment to find answers to the following questions.

Given a (sequence of) MapReduce jobs, how can “incremental counterparts” be derived that consume source deltas and compute deltas to be applied to the target view?
For such a derivation process, what is an appropriate level of abstraction? MapReduce by itself requires programmers to plug-in custom code. Is it feasible to identify classes of Mappers and Reducers that share interesting properties with regard to incremental processing?
Infrastructure has been build on top of MapReduce to provide programmers with high-level languages such as Jaql, PigLatin, or HiveQL. Is it possible to derive incremental MapReduce programs automatically for (a subset) of any of these languages?

Awards

The Virga project has been selected to receive a Google Research Award in December 2010.

Publications

2015
	Marc Schäfer, Johannes Schildgen and Stefan Deßloch Sampling with Incremental MapReduce In: Proc. Workshop on Big Data in Science (BigDS) @ BTW, LNI, Hamburg March 2015
2014
	Yong Hu and Stefan Dessloch Extracting deltas from column oriented NoSQL databases for different incremental applications and diverse data targets. Special issues of ADBIS 2013 ELsevier, 2014
	Johannes Schildgen, Thomas Jörg, Manuel Hoffmann and Stefan Deßloch Marimba: A Framework for Making MapReduce Jobs Incremental IEEE International Congress on Big Data 2014 DOI: 10.1109/BigData.Congress.2014.27
	Marc Schäfer Samping mit inkrementellem MapReduce Master's Thesis, Technische Universität Kaiserslautern, 2014
2013
	Yong Hu and Stefan Dessloch Extracting Deltas from Column Oriented NoSQL Databases for Different Incremental Applications and Diverse Data Targets In: Proc. ADBIS 2013, LNCS 8133, September 2013
	Johannes Schildgen Marimba: A MapReduce-based Framework for Incremental Recomputations on Big Data In: Proc. PhD Consortium poster presentations at ADBIS, Genoa, Italy September 2013 accepted
	Manuel Hoffmann Incremental Iterative Graph Algorithms with MapReduce Bachelor's Thesis, Technische Universität Kaiserslautern, May 2013
	Johannes Schildgen, Thomas Jörg and Stefan Deßloch Inkrementelle Neuberechnungen in MapReduce Datenbank-Spektrum, pp. 33-43 Springer-Verlag, March 2013 ISSN: 1618-2162 DOI: 10.1007/s13222-012-0109-3
	Jonathan Priebe Collaborative Filtering with Marimba and Incremental MapReduce Bachelor's Thesis, Technische Universität Kaiserslautern, 2013 accepted
2012
	Johannes Schildgen Ein MapReduce-basiertes Programmiermodell für selbstwartbare Aggregatsichten Master's Thesis, TU Kaiserslautern, August 2012
	Marc Schäfer Inkrementelle Wartung von Data Cubes mit MapReduce Bachelor's Thesis, Technische Universität Kaiserslautern, January 2012
2011
	Thomas Jörg, Roya Parvizi, Hu Yong and Stefan Dessloch Incremental Recomputations in MapReduce In: Proc. CloudDB 2011 October 2011 accepted