Download: pdi_dv_framework_vmware_virtual_machine_24.7z (3.4 GB, 2015-01-04)
Latest changes (version 2.4, 2015-01-04):
- Added PostgreSQL support.
- Split the generic transformations for links and link-satellites into one generic transformation per number of hubs (1 to 10).
  This prevents the creation of many unused database connections.

- End of changes version 2.4

Changes (version 2.3, 2014-11-19):

- Added optional mailing after error and/or success of the main jobs.
  Mailing properties such as the SMTP server and port are specified via Kettle properties.
- Added a 'management' job that collects database table sizes.
- Fixed a bug where a run did not end with status 3 (error) after it failed.
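
The mail settings are read from Kettle properties. A hypothetical kettle.properties fragment (the variable names below are illustrative only; the actual names are whatever the framework's Mail job entries reference, so check them in Spoon):

```properties
# Hypothetical kettle.properties entries for the mail-on-error/success feature.
# The real variable names are defined by the framework's jobs.
SMTP_SERVER=smtp.example.com
SMTP_PORT=25
MAIL_SENDER=dwh@example.com
MAIL_RECIPIENTS=ops@example.com
```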

The directory upgrade_to_version_2.3 describes how to upgrade from version 2.2 to 2.3.

- End of changes version 2.3

Changes (version 2.2, 2014-10-27):

- Subruntype functionality (process only a subset of the staging and/or Data Vault tables).
- Reference tables with all columns of the satellites specified separately (no change to the Excel sheet; just for easy impact analysis).
- Change detection of satellite columns by a hash function; SHA1 is the default, MD5 is also possible. The function is specified in the properties file.
- Upgraded to Pentaho Data Integration 5.2 CE; versions 5.1 and 5.0.1 also work with the provided jobs and transformations.
- Updated the documentation PDF.

- End of changes version 2.2
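
The hash-based change detection above compares a digest of the concatenated satellite columns with the stored digest; a new satellite row is only written when they differ. A minimal sketch of the idea in shell (the pipe delimiter and the example columns are assumptions for illustration, not the framework's actual concatenation rules):

```shell
#!/bin/sh
# Digest of the satellite columns as stored during the previous load
stored_hash=$(printf '%s' "Smith|Amsterdam|1975-01-04" | sha1sum | cut -d' ' -f1)

# A later load delivers the same business key with a changed address
new_hash=$(printf '%s' "Smith|Utrecht|1975-01-04" | sha1sum | cut -d' ' -f1)

if [ "$new_hash" != "$stored_hash" ]; then
    echo "changed: write a new satellite row"
else
    echo "unchanged: skip"
fi
```

Comparing two fixed-length digests is cheaper than comparing every column pair-wise, which is why only the function (SHA1 by default, or MD5) needs to be chosen in the properties file.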

Description of the project

Based on a designed Data Vault (hub, link, and satellite tables are present) and an Excel sheet with the mappings, no Data Vault ETL development is needed to add hubs, links, etc.
Kasper de Graaf, a Data Vault expert, played a big part in the specifications for the tool set; I contributed as the ETL designer/developer.

The Virtual Machine (VMware) is a 64-bit Ubuntu 14.04 Server with two RDBMS flavours:
- Percona Server, a MySQL replacement with an improved InnoDB storage engine
- PostgreSQL 9.4

The latest version is available in directory version_2.4, which includes the most recent VMWare Virtual Machine.

User:     percona
Password: percona

MySQL root account: root/percona
NB: entries to add/modify in my.cnf:
max_connections = 2048
table_definition_cache = 1200
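
After restarting the server, the my.cnf entries can be verified with the standard mysql client (credentials as above; requires the Percona server running inside the VM):

```shell
mysql -u root -ppercona -e "SHOW VARIABLES LIKE 'max_connections';"
mysql -u root -ppercona -e "SHOW VARIABLES LIKE 'table_definition_cache';"
```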

PostgreSQL users/passwords (yes, the PostgreSQL passwords are also 'percona', so all passwords are the same :):
postgres/percona
sakila/percona
sakila_data_vault/percona
staging/percona
pdi_meta/percona

NB: changed in postgresql.conf:
max_connections = 250 (was 100 by default)
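
Similarly, the PostgreSQL setting can be checked after a restart (requires the server running inside the VM):

```shell
psql -U postgres -c "SHOW max_connections;"
```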


Starting Pentaho Data Integration (Kettle)

Run the launcher on the Desktop (run in terminal): PDI_kff_launcher.sh
Choose which RDBMS you want to use: M=MySQL, P=PostgreSQL.
  Based on your choice, the applicable configuration files are copied to /home/percona/.kettle,
  and the applicable Excel sheet is copied to the right location:
  /media/percona/disk2/pdi_dv_framework/dwh_datavault_mappings_1.xls
Select the file-based repository (appears as the default): pdi_file_repository_dv_demo_kff
The job that 'does it all' (metadata + all Data Vault objects): job_data_vault_all_incl_md

Running a complete batch, including staging and the Data Vault (from directory /media/percona/disk2/pdi_dv_framework):
percona@ubuntu:~$ ./run_job_complete_batch_data_warehouse.sh

The logs of that batch run are in /media/percona/disk2/pdi_dv_framework/logs
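
Besides the provided script, the main job can also be run headless with Kettle's own command-line runner, kitchen.sh. The PDI install path below is an assumption; the repository and job names are the ones listed above:

```shell
# /opt/pdi is an assumed install location; adjust to the PDI path inside the VM.
/opt/pdi/kitchen.sh \
  -rep=pdi_file_repository_dv_demo_kff \
  -job=job_data_vault_all_incl_md \
  -level=Basic
```

kitchen.sh returns a non-zero exit status on failure, so the call is easy to wrap in cron or another scheduler.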


----Attention----

After editing the metadata Excel sheet, be sure to refresh the column 'source_concat' in the sheet 'source_tables'.
For some reason this column is sometimes seen as 'null' by Kettle, breaking the joins in the metadata queries that obtain the 'record_source_id'.
If you refresh this column by copying the value in the first row to all the others, you'll be fine.

----Attention number 2----
If you discover errors or bugs in my code, please inform me at eacweber@gmail.com so I can use your collective brains to improve it.


Greetings,

Edwin Weber, owner of the one-man army Weber Solutions.
Source: readme.txt, updated 2015-01-04