This is the first of my series of blog posts about Business Intelligence and Data Warehouse best practices. The series will contain a number of best practices for data warehouse design, ETL design and SSAS (Analysis Services) configuration and design.
First, let’s look at a pattern I call ETL reentrancy, a term I believe originates with Ralph Kimball. It is best explained like this:
There are many reasons an ETL process can fail to execute. Running out of disk space is one; receiving incorrect data from source systems is another. The question is whether a failed ETL process can leave the data warehouse in a state that requires a lot of manual work to restore. An ETL process is reentrant if you can rerun it (possibly from some later starting point than the very first) after it has failed and you have corrected the error that made it fail.
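To make the idea concrete, here is a minimal sketch of a step runner that records which steps have completed, so a rerun resumes after the last successful step instead of from scratch. All names (the checkpoint file, the step functions) are illustrative, not part of any real ETL tool:

```python
import json
import os

CHECKPOINT = "etl_checkpoint.json"  # hypothetical checkpoint file


def run_pipeline(steps):
    """Run (name, callable) steps in order, skipping steps that
    already completed in a previous, failed run."""
    done = []
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            done = json.load(f)
    for name, step in steps:
        if name in done:
            continue  # completed before the failure; don't redo it
        step()  # may raise; checkpoint keeps earlier progress
        done.append(name)
        with open(CHECKPOINT, "w") as f:
            json.dump(done, f)
    os.remove(CHECKPOINT)  # whole pipeline finished; next run starts fresh
```

After a failure you fix the underlying problem and simply call `run_pipeline` again with the same step list; the completed steps are skipped.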
Why this is important should be pretty clear. How do you achieve reentrancy?
- Have an ETL process that is clearly divided into logical steps. Don’t mix the E, T and L operations in the same step.
- Begin each step with any applicable cleanup operations.
- Prepare updates in staging tables and use set-based updates.
- Wrap updates in transactions if they could leave the warehouse in an unstable state.
- Catch and handle errors in critical places.
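The steps above can be sketched in a single reentrant load step. This is an illustration only, using Python with SQLite standing in for the warehouse; the table names `stg_customer` and `dim_customer` are made up for the example:

```python
import sqlite3


def load_customers(conn, rows):
    """One reentrant ETL step: cleanup first, prepare in staging,
    then apply a set-based update inside a transaction."""
    cur = conn.cursor()
    # 1. Begin with cleanup: wipe any half-loaded staging data from a failed run
    cur.execute("DELETE FROM stg_customer")
    # 2. Prepare the update in the staging table
    cur.executemany("INSERT INTO stg_customer (id, name) VALUES (?, ?)", rows)
    # 3. Apply the set-based update; a failure rolls back and leaves
    #    the target table untouched, so the step can simply be rerun
    try:
        cur.execute(
            "DELETE FROM dim_customer "
            "WHERE id IN (SELECT id FROM stg_customer)"
        )
        cur.execute("INSERT INTO dim_customer SELECT id, name FROM stg_customer")
        conn.commit()
    except Exception:
        conn.rollback()
        raise  # let the caller catch and handle the error
```

Because the step starts by clearing its own staging table and applies the target update atomically, running it twice (or rerunning it after a failure) yields the same result as running it once.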