Hey.
I'm a DWH architect at a large Russian telecom company that operates an LTE network. As part of automating the company's processes, we built a data warehouse on HP Vertica. To load data into the warehouse, we used the commercial version of Talend.
Talend is a good product, but we still ran into problems that its standard functionality could not solve. At first we solved them by writing our own Java code, but that took a long time and required our ETL developers to know Java very well. Then we plugged the more lightweight and flexible Groovy into Talend and began solving such problems with it. Over time, Groovy took over a lot of the work.
The main advantages of Groovy over Java for our work in Talend were:
- Lightweight syntax
- Dynamic typing
- Strong support for working with strings, data sets and files
- Ability to extend the language with DSL
A big plus of Groovy is that it is compatible with Java and can work with any Java solution. This lets it use the full potential of Java while adding its own capabilities.
In the end, I had the idea of creating an open source ETL framework in Groovy, code-named GETL. The main features of GETL should be:
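As a small illustration of these points, here is a minimal sketch in plain Groovy (no Talend or GETL involved). The sample data, field names and variable names are invented for the example. It only shows the lightweight syntax, dynamic typing, collection helpers and direct use of a Java class, not how our production code looks.

```groovy
// Hypothetical sample data: the field names and values are invented for illustration.
def csv = '''id;phone;traffic_mb
1;+70000000001;120.5
2;+70000000002;80
3;+70000000001;40.25'''

// Strings and collections: split the text into lines, then each line into fields.
def lines = csv.readLines()
def header = lines.head().split(';').toList()
def rows = lines.tail().collect { line ->
    // transpose() pairs each column name with its value; collectEntries builds a map.
    [header, line.split(';').toList()].transpose().collectEntries { [(it[0]): it[1]] }
}

// Dynamic typing: no row class is declared, fields are addressed by name at run time.
def totals = rows.groupBy { it.phone }.collectEntries { phone, group ->
    [(phone): group.sum { it.traffic_mb as BigDecimal }]
}

// Java interoperability: java.math.RoundingMode is used directly from Groovy.
def report = totals.collect { phone, mb ->
    "${phone}: ${mb.setScale(1, java.math.RoundingMode.HALF_UP)} MB"
}
report.each { println it }
```

The same task in Java would typically need a row class or explicit map types, a parsing loop and more boilerplate, which is the kind of overhead that made Groovy attractive for us.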
- Support for dynamic data structures and structures unknown at development time
- Data transformation templates
- Unit tests for automated testing
- Configurable multi-threading
- A hierarchy of base classes for extending the functionality
- Temporary data storage for intermediate operations
- Use of an RDBMS to store task control data
- High-level features of the Groovy language (OOP, DSL, meta-programming, etc.)
- Modular integration into any Java program
- Integration with other ETL tools via the JVM (Talend, Pentaho, Informatica, etc.)
- Development and testing of code in Groovy editors (Groovy Shell, Eclipse, NetBeans)
- Visual development with an Eclipse plug-in
- Support for the SCD (slowly changing dimension) mechanism
- Support for the CDC (change data capture) mechanism
This functionality will be developed gradually, from version to version. I think it would be a good idea in the future to start a separate open source project that runs GETL as a servlet under Tomcat. That would be a full dedicated server for cluster deployment and for controlling scheduled GETL tasks. For performance, GETL should in time also support Hadoop and HDFS.
I see the guarantee of GETL's future in my own projects and those of my friends, where we need an ETL tool with this kind of functionality. I would appreciate your ideas on the design and development of the product, your contributions of components for GETL, and your suggestions on how to organize support for GETL.
P.S. Unfortunately, my English is not as good as it should be. Please let me know if I have written something incorrectly. Russian-speaking users can read my blog at http://ascrus.blogspot.ru/, where I will mirror the topics of this blog in Russian.
Good luck with your projects, and a good mood to all!
Alexsey.