In my last post I presented the Data Warehouse reference architecture that I use to explain design principles and ETL loading strategies for each layer. I will mainly focus on SSIS as ETL technology. So let’s first start with the Staging area and the Archive.
Staging Area – The Purpose
As I already pointed out the staging layer supports 1:1 copy of source system extraction. The main goal is to bring the data as fast and as easy as possible from the sources to the SQL Server also in order to minimize the source system interaction. Data in the staging area is temporary or semi temporary and can be deleted after all data is loaded into the CDW and the archive.
Staging Area – Design Principles
The following list of design rules apply for the Staging area:
- 1:1 copy of the source system
- Same naming conventions and data types as the source system
- No transformations
- No aggregations
- No quality checks
- 1 Staging area per source system
Let me explain some of the principles in more detail. The first five principles focus all on the idea of bringing the data as easy and as fast as possible into the staging layer of the Data Warehouse in order to achieve a smooth connectivity to our sources. We also want to keep the extraction workload as short as possible in order to minimize the load of out source systems. An ERP system like SAP ERP is typically mission critical in a customer enterprise environment so we want to minimize the data extraction influence on the daily business. Therefore we always prefer a delta extraction process, if the source system is able to deliver changes only.
The point of having one staging are per source system is because of workload and data separation. This design depends a little bit on the flavor of the architect and also on the different numbers and complexities of the source systems. Of course it is also possible to separate source systems in the staging area by defining different schemas per source system in large enterprise environments a dedicated database per source is more common.
Staging Area – SSIS Loading Pattern
Because we have no further logic to implement during data source extraction the SSIS packages look very simple.
We use a source system adapter (the one that fits best for the specific source system connectivity), in this case an OleDB source, extract the data, add some audit info like CreatedDate or LoadID within a Derived Column task and store the data in our staging table, done.
This looks very easy, isn’t it? And exactly this was the goal for staging area data loads.
Archive – The Purpose
The archive is our “life insurance”. Why? Because it keeps the full data history. I hear often the claim that the Data Warehouse has to track the complete history. So let’s be precisely about “what is data history” first. There are two types of history tracking in a Data Warehouse:
- Fact Data Changes: This is the easy one. Because every fact record has a link to a certain point in time our facts are history tracked automatically.
- Master Data Changes: This is the difficult one and is also know as slowly changing dimensions (SCD). In this case we track attribute changes in our dimension like phone number or address information.
- When I say I suggest not to track the full history within the Data Warehouse I mean not to track all attribute changes that occur just in advance. Based on my experience only 10% of history tracked dimension attributes are used by business users. So why should I then track the additional 90%, if nobody use them?
- Most of the time I hear the answer: “Because we don’t know what will be used in the future.”
- But do you really want to pay the price for that? And the price would be lot’s of between joins in your Data Warehouse and between joins are very cost extensive operations. I will dig into that topic in my post about the Core Data Warehouse design.
- To make it short, we have the full history (fact and master data) in our archive and are always able to create a new SCD2 table in our Core Data Warehouse when ever needed.
Archive – Design Principles
The archive follows the same design principles as the Staging area plus some additional principles:
- 1:1 copy of the staging area
- Same naming conventions and data types as the staging area
- No transformations
- No aggregations
- No quality checks
- 1 Archive per Staging area
- Use Compression and Partitioning
- Apply delta detection if necessary
If the source systems are not able to deliver data changes only and instead give you a full extract of their data every time I would add some logic before loading that data to the archive. In case of full extracts your archive will grow very rapidly so you need to add a delta detection logic between staging and archive and load only data changes into the archive. I also would apply table compression and partitioning. Partitioning in case you need to load a certain extract again from the archive to your Data Warehouse.
Archive – SSIS Loading Pattern
Because we have no further logic to implement during the SSIS packages look very simple.
We use an OleDB adapter to connect to our Staging database, extract the data, add some audit info like CreatedDate or LoadID within a Derived Column task and store the data in our Archive table, done.
Again a very easy and straightforward approach.
Next Steps
I hope I made my design principles clear and explained it detailed enough. If not, shoot me a comment and let me know.
Within my next post I will explain the same principles and Loading pattern for the Core Data Warehouse.
References
October 14, 2011 at 3:41 pm
>Most of the time I hear the answer: “Because we don’t know what will be used in the future.”
Agreed this answer I have heard n-times too.
It makes sense to archive and track the changes. Values tend to change in the source system the SAP extractors built on top of the LIS are very cumbersome. Assuming you tend to use the extractors and a consultant uses the exits called on every record you can run into the problem of not being in the position to do a full extract, although if it worked over a long time.
We kept all loads too and changes in archives and always implemented a way to recreate a cube’s content on BW, but the same counts for maybe Analysis Server which at least allow you to extend a reportable structure especially cubes a lot more easy. With the times and the more scenarios I faced I started to create backup copies of all the messages that arrived at an interface (PSA).
No doubt, source systems try to get rid of data very quick and don’t care a lot about false messages (or records) that are updated afterwards anyway. A data warehouse must be in the position to prove the correctness of its own data and because of this, yes archive. Source systems are always correct .. because this what the users see and expect to see.
I am happy in the meanwhile that more and more reporting can be done on the source systems anyway in the fashion of analytical queries. This will help little with having to store the whole range of characteristics for certain key figures. Typical one to many relations in source systems of flexible key value pairs, stored in a separate entity but transposed are good candidates for – ‘We need everything’. No one knows what the future will bring, we want to be prepared for the day…. For me this is a sign for the absence of a source systems capability when it comes to reporting, especially when it comes to SAP. The result of solid cleansed values from different systems is something different.
November 9, 2012 at 3:45 pm
are you going to post remaining parts of reference architecture?
January 27, 2013 at 4:37 am
That is a really good tip particularly to those fresh to the blogosphere.
Brief but very accurate information… Thank you for sharing this one.
A must read article!
August 28, 2014 at 3:47 pm
hi dwjunkie,
very nice post.
In a situation where 99% of my source data is delivered as flat files and therefore untyped. Would you recommend compressing and storing away those flat files in order to build up the staging archive or would you create loosely typed staging db tables (all columns typeof varchar(max) ) for staging and archive?
Also flat file structures may evolve over time as new columns are added gradually. This in turn leads to altered downstream etl processing. So in order to perform a full scale historic reimport i’d have to run several versions of etl process (depending on ff version) …
you’re welcome to share your thoughts regarding this scenario
August 28, 2014 at 4:11 pm
Hi,
typically I would recommend to store the data in the staging with the same data types as the source. So if your sources are mostly flat files, just store them as string, but not varchar(max). Varchar(max) is not a good data type for SQL Server as well for SSIS. In SSIS for example varchar(max) is handled as blobs and if one column is blob the SSIS directly handles disk based and not in memory anymore. There is a package property for Blob location. So keep in mind that data loading with varchar(max) will be very slow.
Secondly for your archive you can whether archive your flat files or you can create a SQL archive. This also depends on how fast you would like to make data available from the archive, if needed. And yes you can apply compression on your archive table in order so save disk space.
August 28, 2014 at 5:02 pm
Hi, dwjunkie,
what abou remaining parts of the architecture? 🙂
August 28, 2014 at 6:52 pm
you are absolutly right. I promised to deliver it, so I will. need to reserve some time for it in the next days.
September 22, 2015 at 8:14 am
Hi,
any news about the remaining parts of the architecture?
May 3, 2016 at 4:33 pm
I assume that you treat your archive tables as regular SCD Type 2 tables. But what is your choice of business key (BK) for preserving history? Is the BK automatically set to the primary key of the source table – or does each table need to analyzed to determine the BK!?
May 31, 2016 at 8:34 am
I’d not expect an archive table to be implemented with scd2 Approach. However it would be very interesting to see an example of a physical archive table with all relevant choices (audit/metadata rows, loose vs strong datatypes, partitioning, compression) explicitly outlined.
July 5, 2016 at 11:25 pm
You cannot use the Business Key as a Primary Key in the Archive or the Data Warehouse, if you are using SCD Type 2, as this would create duplicates in that column. This is why we use Surrogate Keys (usually auto-incrementing ID column).
Hope that helps
May 25, 2016 at 9:02 pm
So are you suggesting keeping a daily “snapshot” of the source system instead of the SCD? If you’re not going to track changes, I guess you’ll have to.
I’ve seen this approach in other systems and over the years this can lead to some very significant accumulation of redundant data. For example, I saw a country table which had 244 unique members contain close to 60 thousand records. Another table with 477,000 unique rows weighed in at 180 million+ records.
While it’s true that it might seem costly to pay for the SCD in the PSA, I can see that this makes the SCD to the data warehouse much easier – so what if my 477K rows becomes 157K? It’s still much faster and easier than plowing through 180 million rows ..!