Virga: Incremental Recomputations in MapReduce
In 2004, Google introduced the MapReduce paradigm for parallel data processing in large shared-nothing clusters. MapReduce was primarily built for processing large amounts of unstructured data, such as web request logs or crawled web documents stored in Google’s distributed file system (GFS). More recently, MapReduce has been reported to be used on top of Google’s Bigtable, a distributed storage system for managing structured data. A key difference between GFS and Bigtable is their update model. While GFS files are typically append-only data sets, Bigtable supports row-level updates.
When Bigtable is used as input source for MapReduce jobs, often only parts of the source data have been changed since the job’s previous run. As yet, MapReduce results have to be recomputed from scratch to incorporate the latest base data changes. This approach is obviously inefficient in many situations and it seems desirable to maintain MapReduce results in an incremental way similar to materialized views. From an abstract point of view, materialized views and MapReduce computations have a lot in common. A materialized view is derived from one or more base tables in a way specified by a user-supplied view definition and persistently stored in the database. Similarly, a MapReduce job reads data from one or more Bigtable datasets, transforms it in a way specified by user-supplied Map and Reduce functions and persistently stores the result in Bigtable.
However, applying view maintenance techniques in the MapReduce environment is challenging, because the programming models (or query languages) and data models differ heavily. View definitions are specified in SQL, a language closely tied to the relational algebra and the relational data model. The MapReduce programming model is more generic; the framework provides hooks to plug-in custom Map and Reduce functions written in standard programming languages. In this project, we explore incremental recomputation techniques in the MapReduce environment to find answers to the following questions.
|
|
Awards
The Virga project has been selected to receive a Google Research Award in December 2010.
Publications
2015 | |
Sampling with Incremental MapReduce
In: Proc.
Workshop on Big Data in Science (BigDS) @ BTW, LNI, Hamburg
March 2015
|
|
2014 | |
Extracting deltas from column oriented NoSQL databases for different incremental applications and diverse data targets.
Special issues of ADBIS 2013
ELsevier,
2014
|
|
Marimba: A Framework for Making MapReduce Jobs Incremental
IEEE International Congress on Big Data
2014
|
|
Samping mit inkrementellem MapReduce
Master's Thesis,
Technische Universität Kaiserslautern,
2014
|
|
2013 | |
Extracting Deltas from Column Oriented NoSQL Databases for Different Incremental Applications and Diverse Data Targets
In: Proc. ADBIS 2013, LNCS 8133,
September
2013
|
|
Marimba: A MapReduce-based Framework for Incremental Recomputations on Big Data
In: Proc.
PhD Consortium poster presentations at ADBIS, Genoa, Italy
September 2013
accepted
|
|
Incremental Iterative Graph Algorithms with MapReduce
Bachelor's Thesis,
Technische Universität Kaiserslautern,
May
2013
|
|
Inkrementelle Neuberechnungen in MapReduce
Datenbank-Spektrum, pp. 33-43
Springer-Verlag,
March
2013
ISSN: 1618-2162
|
|
Collaborative Filtering with Marimba and Incremental MapReduce
Bachelor's Thesis,
Technische Universität Kaiserslautern,
2013
accepted
|
|
2012 | |
Ein MapReduce-basiertes Programmiermodell für selbstwartbare Aggregatsichten
Master's Thesis,
TU Kaiserslautern,
August
2012
|
|
Inkrementelle Wartung von Data Cubes mit
MapReduce
Bachelor's Thesis,
Technische Universität Kaiserslautern,
January
2012
|
|
2011 | |
Incremental Recomputations in MapReduce
In: Proc.
CloudDB 2011
October 2011
accepted
|
tx_sevenpack_pi1 error
Contact