INDI: Incremental Recomputations in Materialized Data Integration

 

Incremental recomputations have been studied by the database research community mainly in the context of materialized view maintenance. Materialized views and data integration systems such as Extract-Transform-Load (ETL) tools share a key characteristic: the result data is pre-computed and materialized so that future queries can be evaluated efficiently. When the base data is updated, materialized views become stale and need to be maintained. A naïve solution is to recompute the views from scratch. However, incremental recomputation, which propagates only the changes to the base data, is often more efficient. While database systems are able to maintain views incrementally, today’s ETL tools lack this capability. We believe that incremental recomputation techniques can be applied to the ETL process to make data warehouse maintenance more efficient. Doing so will shrink the data warehouse update window, improve data timeliness in the warehouse, and thus be a step towards near real-time data warehousing.
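
To illustrate the difference between recomputing from scratch and recomputing incrementally, here is a minimal Python sketch of the textbook delta rule for a join under insertions. It is not taken from the INDI project: the relation layout and the function names `join` and `join_insert_delta` are assumptions chosen for illustration, and the sketch covers set semantics and insert-only deltas.

```python
def join(r, s):
    """Natural join of R(a, b) and S(b, c) on b; both are sets of tuples."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


def join_insert_delta(r, s, delta_r, delta_s):
    """Tuples to add to the materialized join when delta_r and delta_s are
    inserted. r and s are the OLD states of the base relations.

    Delta rule (insertions only, set semantics):
        new tuples = (dR join S) | (R join dS) | (dR join dS)
    """
    return join(delta_r, s) | join(r, delta_s) | join(delta_r, delta_s)


# Old base data and the materialized result.
r = {(1, "x"), (2, "y")}
s = {("x", 10), ("y", 20)}
view = join(r, s)

# Changes captured at the sources.
delta_r = {(3, "x")}
delta_s = {("y", 30)}

# Incremental maintenance only touches tuples that involve a change.
view_incremental = view | join_insert_delta(r, s, delta_r, delta_s)

# Recomputation from scratch joins the complete new base data.
view_recomputed = join(r | delta_r, s | delta_s)

assert view_incremental == view_recomputed
```

Both paths yield the same result, but the incremental path joins the small deltas against the base data instead of joining the full relations again.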

 

 

The INDI project is carried out in close cooperation with the IBM Research & Development Lab Böblingen. We follow an algebraic approach in the sense that we aim at deriving incremental variants from ETL jobs, which are built from standard ETL processing primitives. This is analogous to algebraic view maintenance, where incremental expressions are derived from SQL/RA view definitions and are themselves expressed in SQL/RA. This approach has several advantages: incremental ETL jobs can be executed by standard ETL tools without modification, existing ETL jobs can be “incrementalized”, and the development of new incremental ETL solutions becomes easier. The ETL environment has distinct characteristics that require us to rethink and adapt traditional view maintenance techniques. We identified the following major research challenges:

  • A common language, such as SQL/RA for relational database systems, does not exist in the ETL world. Instead, commercial ETL tools provide proprietary scripting languages or graphical user interfaces for defining ETL jobs. Because of these programming model differences, standard view maintenance techniques cannot be directly applied.
  • Standardizing and improving the quality of source data is a key task in data integration. For this purpose, ETL tools provide rich sets of data cleansing operators. This class of operators has no counterpart in the relational world and calls for new optimization strategies.
  • In a DWH environment, so-called Change Data Capture (CDC) techniques are used to gather deltas at the source systems. The captured deltas may be incomplete (or partial), either because of inherent restrictions of the CDC technique or because completeness was traded for CDC efficiency. Traditional view maintenance techniques, however, require complete deltas.
  • Database view maintenance depends on transactional guarantees. In particular, transactions make it possible to synchronize view maintenance with concurrent base data updates. In a warehousing environment, the source systems are distributed; distributed transactions, however, are prohibitively expensive, so warehouse maintenance must proceed without them.
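
The notion of a partial delta can be made concrete with a small Python sketch. It contrasts a complete delta, obtained here by comparing two full snapshots, with the delta obtained from a timestamp (audit column) based capture. Everything in it is an assumption for illustration: the `last_modified` column, the function names, and the sample rows are hypothetical and do not describe a specific CDC product or the techniques developed in INDI.

```python
def snapshot_diff(old, new):
    """Complete delta from two snapshots keyed by primary key:
    inserts, deletes, and updates with before and after images."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, deletes, updates


def audit_column_capture(new, last_extraction):
    """Partial delta: rows modified after the last extraction.
    Only the current source state is read, so there are no before
    images and deleted rows are not visible at all."""
    return {k: row for k, row in new.items()
            if row["last_modified"] > last_extraction}


old = {
    1: {"name": "Ann", "last_modified": 5},
    2: {"name": "Bob", "last_modified": 6},
    3: {"name": "Cid", "last_modified": 7},
}
new = {
    1: {"name": "Anne", "last_modified": 12},  # updated
    3: {"name": "Cid", "last_modified": 7},    # unchanged
    4: {"name": "Dan", "last_modified": 11},   # inserted
}                                               # key 2 was deleted

inserts, deletes, updates = snapshot_diff(old, new)
captured = audit_column_capture(new, last_extraction=10)

# The complete delta distinguishes all three kinds of change.
assert set(inserts) == {4} and set(deletes) == {2} and set(updates) == {1}

# The partial delta mixes the insert and the update and misses the delete.
assert set(captured) == {1, 4}
```

In this sketch the audit column capture is cheaper because it never reads an old snapshot, but the resulting delta cannot tell the insertion from the update and loses the deletion, which is the kind of incompleteness that traditional view maintenance does not expect.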

 

Publications

2011

Thomas Jörg and Stefan Dessloch
View Maintenance using Partial Deltas
In: Proc. BTW, LNI P-180, pp. 287-306
March 2011

2010

Michael Koch
An Applied Data Matching Methodology
Master's Thesis, University of Kaiserslautern, December 2010

Muhammad Faisal Inam
A Comparison and Evaluation of Change Data Capture Techniques
Master's Thesis, University of Kaiserslautern, November 2010

Andreas Behrend and Thomas Jörg
Optimized Incremental ETL Jobs for Maintaining Data Warehouses
In: Proc. IDEAS, pp. 216-224
2010

2009

Thomas Jörg, Albert Maier and Oliver Suhre
Generating Extract, Transform, and Load (ETL) Jobs for Loading Data Incrementally
United States Patent
March 2009

Thomas Jörg and Stefan Dessloch
Formalizing ETL Jobs for Incremental Loading of Data Warehouses
In: Proc. BTW, pp. 327-346
2009

Thomas Jörg and Stefan Dessloch
Near Real-Time Data Warehousing Using State-of-the-Art ETL Tools
In: Proc. BIRTE, LNBIP 41, pp. 100-117
Springer, 2009
ISBN: 3-642-14558-2

2008

Stefan Dessloch, Mauricio A. Hernández, Ryan Wisnesky, Ahmed Radwan and Jindan Zhou
Orchid: Integrating Schema Mapping and ETL
In: Proc. ICDE, pp. 1307-1316

Thomas Jörg and Stefan Dessloch
Towards generating ETL processes for incremental loading
In: Proc. IDEAS, pp. 101-110

Stefan Dessloch, Mauricio A. Hernández, Ryan Wisnesky, Ahmed Radwan and Jindan Zhou
Orchid: Integrating Schema Mapping and ETL
TU Kaiserslautern, 2008