As Thomas pointed out there seems to be a big gap on how to model a Data Warehouse. So I decided to write a little bit more about this topic and will add additionally some ETL loading pattern on top.
I think we can all agree that most people understand how to model data marts, because the Kimball Group made a really great job in pushing the star schema idea. I think we can also agree that the star schema is the preferred data model for business user analytics (also confirmed by Linstedt & Inmon). If not, please let me know.
First of all I want to explain the Data Warehouse reference architecture that I have in mind, to get a common understanding of the names and layers.
Before explaining the picture let my shortly define the abbreviations:
- CDW: Core Data Warehouse. Also know as EDW (Enterprise Data Warehouse)
- There are two types of Data Warehouse architectures that I typical see and recommend to customers:
- Single Data Warehouse: A single Data Warehouse only has one enterprise wide data mart on top of the CDW.
- Multi-Mart Data Warehouse: A multi-mart Data Warehouse has more that one Data Mart on top of the CDW. This design I typically see at really large customers where it doesn’t make any sense to push all the data only in one single Data Mart. A good example is a separated Data Mart for global financial reporting, one for production analytics and third one for supply chain optimization.
So let me explain the layers a little bit more and also how they are used.
How many layers and for which purpose?
First of all, every layer defined in a Data Warehouse architecture must have a certain reason to be there. The definition of layers is a conscious decision of the Data Warehouse architect! No layer is just there by accident.
So let me define my layers and the purpose of each layer:
- Staging: The staging layer supports 1:1 copy of source system extraction. The main goal is to bring the data as fast and as easy as possible from the sources to the SQL Server also in order to minimize the source system interaction. Data in the staging area is temporary or semi temporary and can be delete after all data is loaded into the CDW and the archive.
- Archive: The archive layer is our “life insurance”. Why? Because it keeps the full data history. I hear often the claim that the Data Warehouse has to track the complete history. That means that every single attribute change would have been tracked in the CDW layer. This is very expensive, especially if only for around 10% of the attribute history tracking is of interest for business user analysis. So why to track 100% in the CDW and nobody use it? Does that make sense? I think no, but in order to be flexible, because we don’t know which attribute history is of interest in the future, we track the full history in our archive.
- CDW: This layer is the relational Data Warehouse layer. And because it can be used for business analysis it is modeled as star schema (Thomas made some good points here and here). But an additional very import purpose of this layer is fast ETL. What does that mean? If the analysis requirements getting harder and you need to optimize the data model for reporting (aggregation, indexes, …) in order to fulfill business user requirements, you need to push it to the Data Mart. Otherwise your whole ETL process would suffer on performance and optimize a layer for both (ETL and reporting) is a really hard job. The CDW is not designed as a mix of all data marts. The goal of the CDW is to represent the whole enterprise of a customer at the lowest granularity if possible.
- Data Mart: Designed as an optional layer above the CDW the Data Mart data model is also designed as a star schema. But there needs to be a certain reason to have an additional star schema on top of the CDW. What can this be? Aggregation, attribute history usage, business logic, strong requirements for relational reporting, … As already said if the requirements for relational reporting are hard to handle in the CDW layer, push them to the Data Mart.
- Multidimensional Data Mart: The multidimensional layer, typically implemented with Analysis Services, is also an optional layer on top of the relational Data Mart or the CDW. Which technology to use in case of reporting depends always on the business user reporting requirements.
What comes next?
I will talk about the following topics in my next posts and update the links right here:
- design principles of each layer in detail
- present some modeling examples
- ETL Loading pattern for each layer
Staging & Archive: Data Warehouse reference architecture – Staging & Archive
If you have further question, please let me know.