This CodeGen template project aims to help data warehouse developers to quickly set up new schemas in which to hold data in a normalized fashion. The template project can be easily adapted for other database engines. The deliverables of this template project are:
- DDL for staging tables in which to upload bulk data
- DDL for entity tables in which to store the normalized bulk data from the staging tables
- stored procedures for normalizing the data from the staging tables and importing it into the entity tables
This CodeGen template project assumes the following rules:
Data is always collected and never deleted
Data from the staging tables is always added to the entity tables. Data from the entity tables is never deleted. The data warehouse only collects the data. Nobody ever deletes data from the data warehouse.
Each entity has an unique key
Each entity should be uniquely identifiable, so that it can be referenced by other entities. Entities without unique keys are ignorred by this project.
The order in which entities are loaded should not matter
Data may come from different places and may not always come in sequence. If an entity references another entity by its key, and if that other entity does not exist, a new record containing the missing key is added in the dependency table, while the rest of the data will be completed later, if and when the information about the dependency is loaded.
No null keys
Every entity reference must be valid. If a dependency to an entity does not exist, then the key should point to a special dependency
The following diagram represents an use case of the normalization project. On the left there are the staging tables in which data is bulk-loaded, while on the right there are the entity tables in which data from the staging tables is loaded.
The TRAIN_STG table has a field named TYPE, which is the key of the TRAIN TYPE entity. The NAME field of the TRAIN table is the key of the TRAIN entity. The same applies to the SCHEDULE_STG table.
There are no staging tables for the Train Type and Schedule Status entities. In this example, the only information needed about these entities is an unique set of names, which can already be extracted from the existing staging tables.
The TRAIN_SCHEDULE_STG table has a composite key, made up of the TRAIN and SCHEDULE, FROM_DATE and TO_DATE fields. Since it does not have a single key, this entity will be ignorred by the template project.
Using this template project, the CodeGen Builder will generate one stored procedure for the normalization of each entity. Each generated normalization procedure follows a linear workflow which is depicted in the following diagram.
Resolve UNKNOWN records in dependency tables
For each dependency of the main entity, the procedure attempts to get the ID of the record which contains the key "UNKNOWN". If such a record does not exist, then it is added. This step ensures that no initialization is necessary prior to running the procedures for the first time.
Replace NULL dependency keys in the staging table with the "UNKNOWN" string
As the following steps will be relying on the keys in order to resolve the dependencies of the entity being processed, this step must ensure that these keys exist.
Insert new dependency keys from the staging table into dependency tables
This step ensures that there are no broken references in the destination tables and that the defined foreign keys are not violated.
Update records in the main table with information from the staging table while matching records using keys
Some of the records in the staging table might already exist in the entity table. In that case, the existing records in the entity table get updated with new information. This adds previously missing information and renews existing information about the entity.
Insert new records from the staging table into the main table where the keys are not found in the main table
Some of the keys in the staging table may not be found in the entity table. This means that these records are new and must be added to the collection schema.
The following stored procedure loads data from the SCHEDULE_STG table into the SCHEDULE table from the above example.
create procedure dbo.SCHEDULE_NORMALIZATION (@Status varchar(200) output) AS BEGIN begin TRY -- Transforming nulls into unknowns in the staging table declare @dummyValue bigint update dbo.[SCHEDULE_STG] set [FROM_STATION] = isnull([FROM_STATION], 'UNKNOWN'), [TO_STATION] = isnull([TO_STATION], 'UNKNOWN'), @dummyValue = 1 -- Adding any new values that might be found in the FROM_STATION field to the STATION entity insert into dbo.[STATION] (NAME) select distinct FROM_STATION from dbo.[SCHEDULE_STG] where FROM_STATION not in (select NAME from dbo.[STATION]); -- Adding any new values that might be found in the TO_STATION field to the STATION entity insert into dbo.[STATION] (NAME) select distinct TO_STATION from dbo.[SCHEDULE_STG] where TO_STATION not in (select NAME from dbo.[STATION]); -- Update existing data update TRG set TRG.[CODE] = isnull(STG.[CODE], TRG.[CODE]), TRG.FROM_STATION_ID = FROM_STATION_TBL.ID_STATION, TRG.TO_STATION_ID = TO_STATION_TBL.ID_STATION, TRG.[FROM_TIME] = isnull(STG.[FROM_TIME], TRG.[FROM_TIME]), TRG.[TO_TIME] = isnull(STG.[TO_TIME], TRG.[TO_TIME]) from dbo.[SCHEDULE] TRG inner join dbo.[SCHEDULE_STG] STG on TRG.[CODE] = STG.[CODE] inner join dbo.[STATION] FROM_STATION_TBL on FROM_STATION_TBL.[NAME] = STG.[FROM_STATION] inner join dbo.[STATION] TO_STATION_TBL on TO_STATION_TBL.[NAME] = STG.[TO_STATION] -- Insert new data insert into dbo.[SCHEDULE] ( [CODE], FROM_STATION_ID, TO_STATION_ID, [FROM_TIME], [TO_TIME] ) select distinct STG.[CODE], FROM_STATION_TBL.ID_STATION, TO_STATION_TBL.ID_STATION, STG.[FROM_TIME], STG.[TO_TIME] from dbo.[SCHEDULE_STG] STG inner join dbo.[STATION] FROM_STATION_TBL on FROM_STATION_TBL.[NAME] = STG.[FROM_STATION] inner join dbo.[STATION] TO_STATION_TBL on TO_STATION_TBL.[NAME] = STG.[TO_STATION] left outer join dbo.[SCHEDULE] TRG on TRG.[CODE] = STG.[CODE] where TRG.[CODE] is null -- Report success set @Status = 'SUCCESS' end TRY begin CATCH set @Status = 'FAILURE' end CATCH END