Monday, February 28, 2011

Data Warehousing - Slowly Changing DimensionsBy nature of the data warehouse, fact and dimension tables are likely to have very different transactional activity. Fact tables will be mostly inserted into as you load data. Very infrequently you might have to update the facts that were loaded incorrectly. It is even less likely that you'll ever delete rows from the fact table; the only time this might happen is if you wanted to archive the old data that is no longer relevant for the current portion of the data warehouse. Dimension tables, on the other hand are likely to see frequent updates. A classic example of this is a consumer dimension which tracks people that buy your products or services. People can change names due to marriage, divorce or another reason, they can change their title of courtesy from Ms to Mrs or to Dr. As people age they also fit a different age group and perhaps even a different education and income level. Therefore you are likely to change dimension table values often. Dimensions that are changing over time are referred to as slowly changing dimensions (SCD). There are several ways to handle SCD; most common approaches are referred to as type 1, type 2, type 3 and type 4. The easiest way to manage SCD is to simply override the existing value with the new value type 1. Indeed if a department changes the title from "Finance" to "Financial" overwriting such a value isn't likely to do much harm. However, you must weigh the consequences of overriding existing values carefully to ensure that your reports don't lose credibility. For example, suppose Mr. Smith who has near perfect credit marries Ms. Jones who has recently declared bankruptcy. Since Mr. Smith had near perfect credit before his marriage your bank has issued him a loan for $200,000. If you overwrite Mr. Smith's credit score with his new, considerably lower credit score you will make decisions from the not very distant past look very incorrect. If your data analysts happen to run reports immediately before and immediately after the credit rating was changed your data warehouse credibility will be compromised. Here is what type 1 SCD records would look like before and after change: Customer_key Customer_name Customer_city Customer_state 123 Ms. Brown Houston Texas 123 Mrs. Green Houston Texas As you can see type 1 SCD does not allow tracking of changes. Once you overwrite the value all historical reports will display Mrs. Green instead of Ms. Brown. The second and more common way of handling SCD is to create a new record when dimension attributes change - type 2. For example, when Ms. Brown becomes Mrs. Green you create a new record for that customer and mark the first record as obsolete. The second approach adds effective date and obsolete date columns to dimension tables. Doing so allows a data warehouse architect to maintain a history of SCD changes. After a change occurs the warehouse will have the following two records (one of them marked as obsolete): Customer_key Customer_name Customer_city Customer_state Effective_date Obsolete_date 123 Ms.Brown Houston Texas 1/1/2000 1/1/2005 234 Mrs.Green Houston Texas 1/1/2005 NULL The third approach is to add columns to the dimension tables in order to maintain the history of changes in a single row. In the case of consumer dimension you would create an old last name and current last name columns along with the date column that indicates when the change occurred. This type of auditing of changes is typically reserved for very few dimensions where change history tracking is critical. For example if your product prices tend to change over time and you must compare revenues based on old and new price then you might wish to use this method. Type 3 of maintaining SCD carries much overhead and is not used very frequently. After type 3 change the warehouse will still have a single record shown below: Customer_key Customer_name Customer_city Customer_state Change_date Old_name 123 Mrs.Green Houston Texas 1/1/2005 Ms.Brown Most environments that use type 3 slowly changing dimensions track 2 iterations of changes; that is you would record the current version, previous version and version before last for the changed column. Type 4 is a special way of handling type 2 changes. Type 2 usually tracks unlimited history by creating new records each time the change occurs. If you know how far back you wish to track history you can add logic to your code to delete any records that go too far back in history. For example, suppose we created an exam and offered it to our students. As time goes on we might decide to edit the exam. We could manage these edits using type 2 method, but if we accumulate 10 versions of the same exam the data might get confusing. So we might decide that we only store up-to 9 records per each exam; any time 10th version of an exam is created we'll delete the 1st version from the warehouse. SOURCE:http://sqlserverpedia.com/wiki/Data_Warehousing_-_Slowly_Changing_Dimensions

By nature of the data warehouse, fact and dimension tables are likely to have very different transactional activity. Fact tables will be mostly inserted into as you load data. Very infrequently you might have to update the facts that were loaded incorrectly. It is even less likely that you'll ever delete rows from the fact table; the only time this might happen is if you wanted to archive the old data that is no longer relevant for the current portion of the data warehouse. Dimension tables, on the other hand are likely to see frequent updates. A classic example of this is a consumer dimension which tracks people that buy your products or services. People can change names due to marriage, divorce or another reason, they can change their title of courtesy from Ms to Mrs or to Dr. As people age they also fit a different age group and perhaps even a different education and income level. Therefore you are likely to change dimension table values often.



Dimensions that are changing over time are referred to as slowly changing dimensions (SCD). There are several ways to handle SCD; most common approaches are referred to as type 1, type 2, type 3 and type 4.



The easiest way to manage SCD is to simply override the existing value with the new value type 1. Indeed if a department changes the title from "Finance" to "Financial" overwriting such a value isn't likely to do much harm. However, you must weigh the consequences of overriding existing values carefully to ensure that your reports don't lose credibility. For example, suppose Mr. Smith who has near perfect credit marries Ms. Jones who has recently declared bankruptcy. Since Mr. Smith had near perfect credit before his marriage your bank has issued him a loan for $200,000. If you overwrite Mr. Smith's credit score with his new, considerably lower credit score you will make decisions from the not very distant past look very incorrect. If your data analysts happen to run reports immediately before and immediately after the credit rating was changed your data warehouse credibility will be compromised. Here is what type 1 SCD records would look like before and after change:

Customer_key     Customer_name     Customer_city     Customer_state
123     Ms. Brown     Houston     Texas
123     Mrs. Green     Houston     Texas


As you can see type 1 SCD does not allow tracking of changes. Once you overwrite the value all historical reports will display Mrs. Green instead of Ms. Brown.



The second and more common way of handling SCD is to create a new record when dimension attributes change - type 2. For example, when Ms. Brown becomes Mrs. Green you create a new record for that customer and mark the first record as obsolete. The second approach adds effective date and obsolete date columns to dimension tables. Doing so allows a data warehouse architect to maintain a history of SCD changes. After a change occurs the warehouse will have the following two records (one of them marked as obsolete):

Customer_key     Customer_name     Customer_city     Customer_state     Effective_date     Obsolete_date
123     Ms.Brown     Houston     Texas     1/1/2000     1/1/2005
234     Mrs.Green     Houston     Texas     1/1/2005     NULL


The third approach is to add columns to the dimension tables in order to maintain the history of changes in a single row. In the case of consumer dimension you would create an old last name and current last name columns along with the date column that indicates when the change occurred. This type of auditing of changes is typically reserved for very few dimensions where change history tracking is critical. For example if your product prices tend to change over time and you must compare revenues based on old and new price then you might wish to use this method. Type 3 of maintaining SCD carries much overhead and is not used very frequently. After type 3 change the warehouse will still have a single record shown below:

Customer_key     Customer_name     Customer_city     Customer_state     Change_date     Old_name
123     Mrs.Green     Houston     Texas     1/1/2005     Ms.Brown


Most environments that use type 3 slowly changing dimensions track 2 iterations of changes; that is you would record the current version, previous version and version before last for the changed column.



Type 4 is a special way of handling type 2 changes. Type 2 usually tracks unlimited history by creating new records each time the change occurs. If you know how far back you wish to track history you can add logic to your code to delete any records that go too far back in history. For example, suppose we created an exam and offered it to our students. As time goes on we might decide to edit the exam. We could manage these edits using type 2 method, but if we accumulate 10 versions of the same exam the data might get confusing. So we might decide that we only store up-to 9 records per each exam; any time 10th version of an exam is created we'll delete the 1st version from the warehouse.


SOURCE:http://sqlserverpedia.com/wiki/Data_Warehousing_-_Slowly_Changing_Dimensions

0 comments:

Post a Comment

newer post older post Home