Data Warehouse Extraction Design- Concept
NOTE- While this page is part of Data Warehouse section, the concept and process can be equally applied in the areas of Master Data Management and Operational Data Store. Both MDM and ODS involve the use of Extraction, Transformation and Loading. The extent and intensity of transformation varies depending upon the purpose.
Ensuring that the data is pulled out from the correct and reliable 'master' source. For example-The customer ID and address may be available in three different systems, but it should pull it out of a source where it is most authentic and complete. The main reference for this purpose is Source Systems Mapping.
Extraction programs pick data from tens (if not scores) of different sources. Making sure that all the data across these systems represent a single business world and not the different business world in different time warps. This subject also has been covered in Data Quality topic. Not all the systems have the date and time stamps in all the transactions (for saving overhead). Therefore sometimes the Extraction/Transformation services have to do this noble task. This is possible by:
- The Extraction is done after all the interface and batch runs are complete. The consistency rules are typically same as that of production reports.
- The time stamping to be done for all the extracted data, where dimensions are interlocked with the time dimensions (most of them will be)
Ensuring that Data is made available for further processing at the right time and in the right form. For example – If the Transformation process (post Extraction) starts at 5 A.M. in the morning, the Extraction process should be able to pull out data from all the source systems before that, synchronized with their end of day batch runs etc.
Though looking simple, but this can be one of pretty cataclysmic challenges facing the monolithic systems. These large systems may belong to different technology groups and owned by different business organizations. One may find that extract routine windows keep on getting smaller with increasing production processing and enlarged production processing batches. You may think that offline databases as sources could be a solution, but the production reports generation is the first priority and challenge remains the same.
Completeness of Extraction for ETL
Ensuring that Data Extraction is well audited – The Extraction process should be able to run the quality checks to confirm that all the data has been extracted from all sources before giving a go-ahead to the Transformation activity. For example – These checks are done typically on the basis of last business transaction date OR the activity time stamp. For large Extraction routines, there are separate Extraction audit tables.
Quality of Extraction for ETL
Ensuring that Extraction process has been of quality- the process should ensure that there is a complete synch between the extracted data what was in production databases and the staging area. This can be done is many ways like running count and aggregation matches on critical fields across the staging and sources systems. You can refer interface controls for some of the checks you can place.
Preservation of Data
Ensure that Extraction Process preserve the data: The Data-Warehouse typically needs to maintain historical snapshots, which in source systems many a times are over written OR archive and also purged. There are no standard rules, but staging area typically should 'append' the extracted information, till the transformed data history is preserved either in staging OR in the final 'presentation/loaded' area.
In synch with business rules/System Logic
The common saying of 'there can be hundred answers to a question' applies perhaps most of data-warehousing project. An Extraction service can have many sources to choose from for the same set of data.
Extract programs typically take out the source data in the absolute raw form (the way it resides in the source systems) without any Transformation and also data quality checks. The key action between the source database and the extracted data is the filtration and selective Extraction. This is done mainly to reduce the Extraction time and also to avoid any unforeseen quality OR availability issues. You would not like to have an Extraction program stop because the agent code existing in the sales database is not available in the master table.
Data Warehouse Data Extraction- Process
Technical Specifications Data Extraction for ETL
You have identified the source system and which system will be the 'master' supplier of a dimension/fact. The next step is to make a list of the table and the fields you want to pick-up from each source. The dimensions and the fact table grains do not drive the Extraction routines. They are more driven by the source system. The source system mapping will give you the logical input to start writing the technical specifications of the Extraction routines. The technical specification routines (may of them are now being automated by state of the art 'Extraction tools':
The Sources systems for data extraction
The Table and field names, which have to be extracted, reference to dimensions and facts of the dimension models.
Data Extraction routing details containing:
- The Extraction routine number
- The Extraction routine details on the 'Extraction key' attributes, filters.
- Any Transformations during Extraction.
- The time of Extraction routines
- The pre-Extraction routine checks
- During Extraction routine checks
- Post-Extraction routine checks