FX CodeGen Wiki

Compiles complex data models from drastically reduced syntax

Brought to you by: fx89

SQL Server normalization project architecture

This CodeGen template project aims to help data warehouse developers to quickly set up new schemas in which to hold data in a normalized fashion. The template project can be easily adapted for other database engines. The deliverables of this template project are:
- DDL for staging tables in which to upload bulk data
- DDL for entity tables in which to store the normalized bulk data from the staging tables
- stored procedures for normalizing the data from the staging tables and importing it into the entity tables

This CodeGen template project assumes the following rules:

Data is always collected and never deleted
Data from the staging tables is always added to the entity tables. Data from the entity tables is never deleted. The data warehouse only collects the data. Nobody ever deletes data from the data warehouse.

Each entity has an unique key
Each entity should be uniquely identifiable, so that it can be referenced by other entities. Entities without unique keys are ignorred by this project.

The order in which entities are loaded should not matter
Data may come from different places and may not always come in sequence. If an entity references another entity by its key, and if that other entity does not exist, a new record containing the missing key is added in the dependency table, while the rest of the data will be completed later, if and when the information about the dependency is loaded.

No null keys
Every entity reference must be valid. If a dependency to an entity does not exist, then the key should point to a special dependency

Example

The following diagram represents an use case of the normalization project. On the left there are the staging tables in which data is bulk-loaded, while on the right there are the entity tables in which data from the staging tables is loaded.

The TRAIN_STG table has a field named TYPE, which is the key of the TRAIN TYPE entity. The NAME field of the TRAIN table is the key of the TRAIN entity. The same applies to the SCHEDULE_STG table.

There are no staging tables for the Train Type and Schedule Status entities. In this example, the only information needed about these entities is an unique set of names, which can already be extracted from the existing staging tables.

The TRAIN_SCHEDULE_STG table has a composite key, made up of the TRAIN and SCHEDULE, FROM_DATE and TO_DATE fields. Since it does not have a single key, this entity will be ignorred by the template project.

Workflow

Using this template project, the CodeGen Builder will generate one stored procedure for the normalization of each entity. Each generated normalization procedure follows a linear workflow which is depicted in the following diagram.

Resolve UNKNOWN records in dependency tables
For each dependency of the main entity, the procedure attempts to get the ID of the record which contains the key "UNKNOWN". If such a record does not exist, then it is added. This step ensures that no initialization is necessary prior to running the procedures for the first time.

Replace NULL dependency keys in the staging table with the "UNKNOWN" string
As the following steps will be relying on the keys in order to resolve the dependencies of the entity being processed, this step must ensure that these keys exist.

Insert new dependency keys from the staging table into dependency tables
This step ensures that there are no broken references in the destination tables and that the defined foreign keys are not violated.

Update records in the main table with information from the staging table while matching records using keys
Some of the records in the staging table might already exist in the entity table. In that case, the existing records in the entity table get updated with new information. This adds previously missing information and renews existing information about the entity.

Insert new records from the staging table into the main table where the keys are not found in the main table
Some of the keys in the staging table may not be found in the entity table. This means that these records are new and must be added to the collection schema.

Example generated code

The following stored procedure loads data from the SCHEDULE_STG table into the SCHEDULE table from the above example.

create procedure dbo.SCHEDULE_NORMALIZATION (@Status varchar(200) output) AS
BEGIN
    begin TRY

        -- Transforming nulls into unknowns in the staging table
        declare @dummyValue bigint
        update dbo.[SCHEDULE_STG] set
            [FROM_STATION] = isnull([FROM_STATION], 'UNKNOWN'),
            [TO_STATION] = isnull([TO_STATION], 'UNKNOWN'),
            @dummyValue = 1

        -- Adding any new values that might be found in the FROM_STATION field to the STATION entity
        insert into dbo.[STATION] (NAME)
        select distinct FROM_STATION
        from dbo.[SCHEDULE_STG]
        where FROM_STATION not in (select NAME from dbo.[STATION]);

        -- Adding any new values that might be found in the TO_STATION field to the STATION entity
        insert into dbo.[STATION] (NAME)
        select distinct TO_STATION
        from dbo.[SCHEDULE_STG]
        where TO_STATION not in (select NAME from dbo.[STATION]);

        -- Update existing data
        update
            TRG
        set
            TRG.[CODE] = isnull(STG.[CODE], TRG.[CODE]),
            TRG.FROM_STATION_ID = FROM_STATION_TBL.ID_STATION,
            TRG.TO_STATION_ID = TO_STATION_TBL.ID_STATION,
            TRG.[FROM_TIME] = isnull(STG.[FROM_TIME], TRG.[FROM_TIME]),
            TRG.[TO_TIME] = isnull(STG.[TO_TIME], TRG.[TO_TIME])
        from
                        dbo.[SCHEDULE] TRG
            inner join  dbo.[SCHEDULE_STG] STG on TRG.[CODE] = STG.[CODE]
            inner join  dbo.[STATION] FROM_STATION_TBL on FROM_STATION_TBL.[NAME] = STG.[FROM_STATION]
            inner join  dbo.[STATION] TO_STATION_TBL on TO_STATION_TBL.[NAME] = STG.[TO_STATION]

        -- Insert new data
        insert into dbo.[SCHEDULE] (
            [CODE],
            FROM_STATION_ID,
            TO_STATION_ID,
            [FROM_TIME],
            [TO_TIME]
        )
        select distinct
            STG.[CODE],
            FROM_STATION_TBL.ID_STATION,
            TO_STATION_TBL.ID_STATION,
            STG.[FROM_TIME],
            STG.[TO_TIME]
        from
                        dbo.[SCHEDULE_STG] STG
            inner join  dbo.[STATION] FROM_STATION_TBL on FROM_STATION_TBL.[NAME] = STG.[FROM_STATION]
            inner join  dbo.[STATION] TO_STATION_TBL on TO_STATION_TBL.[NAME] = STG.[TO_STATION]

            left outer join dbo.[SCHEDULE] TRG on TRG.[CODE] = STG.[CODE]
        where
            TRG.[CODE] is null

        -- Report success
        set @Status = 'SUCCESS'
    end TRY
    begin CATCH
        set @Status = 'FAILURE'
    end CATCH
END