INDI: Incremental Recomputations in Materialized Data Integration
Incremental recomputations have been studied by the database research community mainly in the context of materialized view maintenance. Materialized views and data integration systems such as Extract-Transform-Load (ETL) tools share a key characteristic: the result data is pre-computed and materialized so that future queries can be evaluated efficiently. Upon updates to the base data, materialized views become stale and need to be maintained. A naïve solution is to recompute views from scratch. However, an incremental recomputation approach, which propagates only the changes, is often more efficient. While database systems are able to maintain views incrementally, today’s ETL tools lack this capability. We believe that incremental recomputation techniques can be applied to the ETL process to improve the efficiency of data warehouse maintenance. Doing so will shrink the data warehouse update window, improve data timeliness in the warehouse, and thus be a step towards near real-time data warehousing.
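The difference between recomputing from scratch and recomputing incrementally can be illustrated with a small sketch. It assumes a materialized SUM-per-group aggregate over a hypothetical `sales` table; the names `recompute` and `apply_delta` are illustrative and are not taken from the INDI project.

```python
from collections import defaultdict

def recompute(sales):
    """Naive maintenance: rebuild the aggregate from all base rows."""
    totals = defaultdict(int)
    for region, amount in sales:
        totals[region] += amount
    return dict(totals)

def apply_delta(totals, inserted, deleted):
    """Incremental maintenance: touch only the groups affected by the delta.

    Simplification: a group whose last row is deleted keeps a total of 0;
    dropping it would require an additional per-group row count.
    """
    result = dict(totals)
    for region, amount in inserted:
        result[region] = result.get(region, 0) + amount
    for region, amount in deleted:
        result[region] = result.get(region, 0) - amount
    return result

# Hypothetical base data and its materialized aggregate.
base = [("north", 10), ("south", 5), ("north", 7)]
view = recompute(base)

# A delta captured at the source: one inserted and one deleted row.
inserted = [("south", 3)]
deleted = [("north", 10)]
new_base = [row for row in base if row not in deleted] + inserted

# Both strategies agree, but apply_delta reads only the delta rows.
incremental = apply_delta(view, inserted, deleted)
assert incremental == recompute(new_base)
```

The cost of `apply_delta` grows with the size of the delta, whereas `recompute` always scans the full base data, which is why the incremental variant tends to pay off when deltas are small relative to the base tables.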
The INDI project is carried out in close cooperation with the IBM Research & Development Lab Böblingen. We follow an algebraic approach in the sense that we aim at deriving incremental variants of ETL jobs that are themselves built from standard ETL processing primitives. This is analogous to algebraic view maintenance, where incremental expressions are derived from SQL/RA view definitions and are again expressed in SQL/RA. This approach has several advantages: incremental ETL jobs can be executed by standard ETL tools without modification, existing ETL jobs may be “incrementalized”, and the development of new incremental ETL solutions is eased. The ETL environment has distinct characteristics that require us to rethink and adapt traditional view maintenance techniques. We identified the following major research challenges:
- A common language, such as SQL/RA for relational database systems, does not exist in the ETL world. Instead, commercial ETL tools provide proprietary scripting languages or graphical user interfaces for defining ETL jobs. Because of these programming model differences, standard view maintenance techniques cannot be directly applied.
- Standardizing and improving the quality of source data is a key task in data integration. For this purpose, ETL tools provide rich sets of data cleansing operators. This class of operators has no counterpart in the relational world and calls for new optimization strategies.
- In a data warehouse (DWH) environment, so-called Change Data Capture (CDC) techniques are used to gather deltas at the source systems. The captured deltas may be incomplete (or partial), either because of fundamental restrictions of the CDC technique or to improve CDC efficiency. Traditional view maintenance techniques, however, require deltas to be complete.
- Database view maintenance depends on transactional guarantees. In particular, transactions allow view maintenance to be synchronized with concurrent base data updates. In a warehousing environment, the source systems are distributed; distributed transactions, however, are prohibitively expensive, and warehouse maintenance must therefore proceed without them.
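To make the algebraic approach concrete, the following minimal sketch shows the textbook delta rule for an equi-join under set semantics and insert-only deltas. It is a generic illustration of algebraic view maintenance, not the derivation used in INDI; the relations and the helper names `join` and `join_delta` are hypothetical.

```python
def join(r, s):
    """Equi-join on the first column of r and the first column of s."""
    return {(k, a, b) for (k, a) in r for (k2, b) in s if k == k2}

def join_delta(r, s, dr, ds):
    """Rows to add to join(r, s) once the insert-only deltas dr and ds are applied.

    Textbook rule: delta(R join S) = (dR join S) | (R join dS) | (dR join dS),
    where R and S denote the states before the update.
    """
    return join(dr, s) | join(r, ds) | join(dr, ds)

# Hypothetical source relations as sets of (key, value) pairs.
orders = {(1, "o1"), (2, "o2")}
customers = {(1, "alice")}

# Complete, insert-only deltas captured at the sources.
new_orders = {(1, "o3")}
new_customers = {(2, "bob")}

# Maintain the materialized join incrementally.
view = join(orders, customers)
view |= join_delta(orders, customers, new_orders, new_customers)

# The incremental result matches a full recomputation.
assert view == join(orders | new_orders, customers | new_customers)
```

The rule reads both the deltas and the pre-update source states. This suggests why partial deltas and the lack of transactional synchronization with the sources are obstacles: if a delta omits information, or a source changes while the rule is being evaluated, the derived expression may no longer yield the correct result.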
Publications
2011
- View Maintenance using Partial Deltas. In: Proc. BTW, LNI P-180, pp. 287-306, March 2011.
2010
- An Applied Data Matching Methodology. Master's Thesis, University of Kaiserslautern, December 2010.
- A Comparison and Evaluation of Change Data Capture Techniques. Master's Thesis, University of Kaiserslautern, November 2010.
- Optimized Incremental ETL Jobs for Maintaining Data Warehouses. In: Proc. IDEAS, pp. 216-224, 2010.
2009
- Generating Extract, Transform, and Load (ETL) Jobs for Loading Data Incrementally. United States Patent, March 2009.
- Formalizing ETL Jobs for Incremental Loading of Data Warehouses. In: Proc. BTW, pp. 327-346, 2009.
- Near Real-Time Data Warehousing Using State-of-the-Art ETL Tools. In: Proc. BIRTE, LNBIP 41, pp. 100-117, Springer, 2009. ISBN: 3-642-14558-2.
2008
- Orchid: Integrating Schema Mapping and ETL. In: Proc. ICDE, pp. 1307-1316, 2008.
- Towards generating ETL processes for incremental loading. In: Proc. IDEAS, pp. 101-110, 2008.
- Orchid: Integrating Schema Mapping and ETL. TU Kaiserslautern, 2008.