The big question is on why we have to create a Data-Warehouse?. The reasons are as follows.
As a starter, lets say you want to know the monthly variations in 3 months running average on your customer balances over last twelve months grouped by products+ channels+ customer segments. Lets see why do you need a data-warehouse for this purpose.
Keeping Analysis/Reporting and Production Separate.
If you run the above-said query on your production systems, you will find that it will lock all your tables and will eat-up most of your resources, as it will be accessing a lot of data doing a lot of calculations. This results in the production work to come to a virtual halt. Imagine hundreds of such above-said queries running at the same time on your production systems.
Reporting and analysis work typically access data across the database tables, whereas production work typically accesses specific customer OR product OR channel record at a point of time. That’s why it is important to have the Information generation work to be done from an offline platform (aka. Data Warehouse). Purpose of Data Warehouse is to keep analysis/reporting (non-production use data) separate from production data.
Information Integration from multiple systems- Single point source for information
As an example- Lets say you have different systems for say a loan product vs. credit card product. The above-said query, if run on production will need to pick the data on real time basis from these systems.
This will make the query extremely slow, and will need to do connects in the intermediate tables OR in run-time memory. Moreover it will not be a reliable result as at a particular point of time, the databases may not be in synch as many of such synching happen in the end of day batch runs.
DW purpose for Data Consistency and Quality
Organizations are riddled with tens of important systems from which their information comes. Each of these systems may carry the information in different formats and also may be having out of synch information. (Different customer ID formats, mismatch in the supplier statuses). By bringing the data from these disparate sources at a common place, one can effectively undertake to bring the uniformity and consistency in data (Refer to cleansing and Data Transformation).
High Response Time- Production Databases are tuned to expected transaction load
Even if you run the above-said query on an offline database, it will take a lot of time on the database design, which is same as that of production. This is because the production databases are created to cater to production work. In production systems, there is some level of expected intensity for different kind of actions. Therefore, the indexing and normalization and other design considerations are for given transaction loads. However, the Data-warehouse has to be ready for fairly unexpected loads and type of queries, which demands a high degree of flexibility and quick response time.
High Response time- Normalized Data vs. Dimensional Modeling
Production/Source system database are typically normalized to enable integrity and non-redundancy of data. This kind of design is fine for transactions, which involved few records at a time. However, for large analysis and mining queries, the response time in normalized databases will be slow given the joins that have to be created.
Data Warehouse objective of providing an adaptive and flexible source of information.
Its easier for users to define the production work and functionalities they want, but difficult to define the analysis they need. The analysis needs keep on changing and Data-Warehouse has the capabilities to adapt quickly to the changing requirements. Please refer to 'Dimension Modeling'
Establish the foundation for Decision Support
Decisioning process of an organization will involve analysis, data mining, forecasting, decision modeling etc. By having a common point, which can provide consistent, quality data with high response time provides the core enabler for making fast and informed decisions.