Latest changes (version 2.4, 2015-01-04)
- Added PostgreSQL support
- Split the generic transformations for links and link-satellites into one generic transformation per number of hubs (1 to 10).
This prevents the creation of many unused database connections.
- End of changes version 2.4
Changes (version 2.3, 2014-11-19):
- Added optional mailing on error and/or success of the main jobs.
Mailing properties such as the SMTP server and port are specified via Kettle properties.
- Added a 'management' job that collects database table sizes.
- Fixed a bug where a run did not end with status 3 (error) after it failed.
The directory upgrade_to_version_2.3 contains the steps to upgrade from version 2.2 to 2.3.
- End of changes version 2.3
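The mail settings added in 2.3 live in Kettle properties (kettle.properties in /home/percona/.kettle). A hedged sketch of what such entries could look like; the key names below are illustrative assumptions, not the framework's actual keys:

```properties
# Hypothetical mail-related entries in /home/percona/.kettle/kettle.properties
# (illustrative names only -- check the framework's documentation for the real keys)
MAIL_SMTP_SERVER=smtp.example.com
MAIL_SMTP_PORT=25
MAIL_FROM=etl@example.com
MAIL_TO_ON_ERROR=ops@example.com
MAIL_TO_ON_SUCCESS=ops@example.com
```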
Changes (version 2.2, 2014-10-27):
- Subruntype functionality: process only a subset of the staging and/or Data Vault tables.
- Reference tables listing all satellite columns separately (no change to the Excel sheet; intended for easy impact analysis).
- Change detection of satellite columns via a hash function: SHA1 by default, MD5 also possible. The function is specified in the properties file.
- Upgraded to Pentaho Data Integration 5.2 CE, versions 5.1 and 5.0.1 also work with the provided jobs and transformations.
- Updated the documentation PDF
- End of changes version 2.2
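The hash-based change detection introduced in 2.2 can be sketched as follows: concatenate a satellite row's attribute values, hash them, and compare against the stored hash; a different hash means the row changed. This is a minimal sketch of the idea, not the framework's actual transformation -- the '|' separator and the sample column values are illustrative assumptions.

```shell
# Compute a row hash from concatenated attribute values.
# $1 = concatenated values, $2 = hash function (sha1 or md5; sha1 is the default)
row_hash() {
  case "$2" in
    md5) printf '%s' "$1" | md5sum  | awk '{print $1}' ;;
    *)   printf '%s' "$1" | sha1sum | awk '{print $1}' ;;
  esac
}

old=$(row_hash 'John|Doe|1980-01-01' sha1)
new=$(row_hash 'John|Doe|1980-01-02' sha1)   # one attribute changed
if [ "$old" != "$new" ]; then
  echo "change detected: a new satellite record is needed"
fi
```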
Description of the project
Given a designed Data Vault (hub, link, and satellite tables already present) and an Excel sheet with the mappings, no Data Vault ETL development is needed to add hubs, links, etc.
Kasper de Graaf played a big part in the specifications for the tool set: he is a Data Vault expert, while I am an ETL designer/developer.
The Virtual Machine (VMware) is a 64-bit Ubuntu 14.04 Server with two RDBMS flavours:
Percona Server, a MySQL replacement with an improved InnoDB storage engine.
PostgreSQL 9.4.
The latest version is available in directory version_2.4, which includes the most recent VMWare Virtual Machine.
User: percona
Password: percona
MySQL: root/percona
NB: entries to add/modify in my.cnf
max_connections = 2048
table_definition_cache = 1200
PostgreSQL users/passwords (yes, the PostgreSQL passwords are also percona, so all passwords are percona :):
postgres/percona
sakila/percona
sakila_data_vault/percona
staging/percona
pdi_meta/percona
NB: changed in postgresql.conf
max_connections=250 (was 100 by default)
Starting Pentaho Data Integration (Kettle)
Run the launcher on the Desktop (run in terminal): PDI_kff_launcher.sh
Choose which RDBMS you want to use: M=MySQL, P=PostgreSQL
Based on your choice, the applicable configuration files are copied to /home/percona/.kettle,
and the applicable Excel sheet is copied to the right location:
/media/percona/disk2/pdi_dv_framework/dwh_datavault_mappings_1.xls
Select the file-based repository (appears as the default): pdi_file_repository_dv_demo_kff
The job that 'does it all' (metadata + all Data Vault objects): job_data_vault_all_incl_md
Running a complete batch including staging and the Data Vault (in directory /media/percona/disk2/pdi_dv_framework):
percona@ubuntu:~$ ./run_job_complete_batch_data_warehouse.sh
The logs of that batch run are in /media/percona/disk2/pdi_dv_framework/logs
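One plausible way the batch script could invoke the main job is via Kettle's command-line runner kitchen.sh. The exact contents of run_job_complete_batch_data_warehouse.sh are an assumption; the repository and job names are taken from this README, and the log file name is illustrative.

```shell
# Hedged sketch: build the kitchen.sh command line and print it (dry run).
# Repository and job names come from this README; the log name is an assumption.
FRAMEWORK_DIR=/media/percona/disk2/pdi_dv_framework
CMD="kitchen.sh -rep pdi_file_repository_dv_demo_kff -job job_data_vault_all_incl_md -level Basic"
echo "$CMD > $FRAMEWORK_DIR/logs/complete_batch.log"   # print only; don't execute here
```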
----Attention----
After editing the metadata Excel sheet, be sure to refresh the column 'source_concat' in the sheet 'source_tables'.
For some reason this column is sometimes seen as 'null' by Kettle, breaking the joins in the metadata queries that obtain the 'record_source_id'.
If you refresh this column by copying the value in the first row to all others, you'll be fine.
----Attention number 2----
If you discover errors/bugs in my code, please inform me at eacweber@gmail.com, so I can use your collective brains to improve it.
Greetings,
Edwin Weber, owner of the one-man army Weber Solutions.