If anyone is interested, there looks to be a project using JRecord to read copybook data under HDFS (Hadoop) MapReduce.
Interesting. I have been watching the COBOLSerDe at https://rbheemana.github.io/Cobol-to-Hive/serde.html
Unfortunately both require the jar to be added to Hive; where I work, that raises red flags. It is a neat approach, though.
If you look at the problems, there are still some basic issues to be resolved, e.g.:
The Cobol parser is very basic; have a look at https://github.com/rbheemana/Cobol-to-Hive/blob/master/src/main/java/com/savy3/hadoop/hive/serde3/cobol/CobolFieldFactory.java
Is Avro built into Hadoop? For simpler files, a Cobol-to-Avro conversion should be possible.
It looks as though Datameer are using JRecord for reading Cobol files.
See:
http://www.datameer.com/community/questions/249/cobol-copybook-file-uploads
There is also Syncsort, which is having another go at explaining that their DMX-h product is much better than JRecord. It probably is; it would want to be, at the price they charge:
There are some notes on using CopybookInputFormat (which uses JRecord):
JRecord is also used in:
Bruce, back in 2014 I made some changes before you added the new ways of adding IO readers/writers. I just have to say how awesome your code is and how much of a help it has been when loading legacy mainframe data into Hadoop environments. I have used it twice over the last few years to get legacy data into Hadoop, converting the files into CSV/TSV tables using MapReduce. Not sure why I never saw this thread, but I would like to give back a bit. Here is the code I wrote that does a variety of things with mainframe copybooks on Z, even if folks use it just as an example:
https://github.com/gss2002/copybook_formatter
Also, here is some code that allows FTPing RDW files directly from the Mainframe into Hadoop/HDFS, as a MapReduce job or a standalone client.
https://github.com/gss2002/ftp2hdfs
All the code is open; no licensing, etc.
Thanks for the information on your packages; I have added them to the ReadMe in a Hadoop section. The ReadMe should flow through to the GitHub clone.
I am interested in how JRecord is being used, so please keep in touch. My whole purpose is to produce something that is useful to as many people as possible.
Suggestions
If you have suggestions that will help multiple users, please make them; I will probably do them when I have time.
Future & CodeGen
I have started a thread on future changes.
CodeGen will be the major focus. My aims will be:
All the Best
Bruce
Last edit: Bruce Martin 2018-02-02
Hey Bruce,
First off thanks for all of your work on JRecord. Our team has been leveraging it to convert EBCDIC files with a good deal of success. We are currently looking into an approach for using this on our Hadoop cluster, but I think we have found a few sticking points for generalizing it.
When decoding an EBCDIC file, as soon as you have more than one record type it is not trivial to calculate where to split the file. Say, for example, you have two record types of lengths 128 and 75. It is effectively impossible to break that file into chunks that are guaranteed to start at a new record (well, not impossible, but computationally inefficient). For example, if you split the file at 128 MB, you would need to find the next offset of a valid record or the mapping will break.
Is this a valid line of thinking? Or is there an easier way to figure out where you can split a file defined by a COBOL copybook?
I presume the files are VB on the Mainframe. One question: are you transporting the RDW (Record Descriptor Word) from the Mainframe?

No RDW: you are correct, you will have to go through the file record by record.
RDW: it should be possible to search for it.
RDW (Record Descriptor Word)

In a VB file the data is stored as:

(RDW)(Record-data)

The RDW consists of:
a 2-byte record length
2 hex-zero bytes.

In your example file, including the RDW will increase the size by about 4%, while if the average record length were 20 bytes, it would add a 20% overhead.
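The RDW layout described above can be decoded in a few lines of Java. This is an illustrative sketch, not JRecord's API (the class and method names here are made up); it assumes the z/OS convention that the RDW's length field includes the 4 RDW bytes themselves.

```java
public class RdwExample {
    /**
     * Parse the data length from a 4-byte RDW at the given offset.
     * The first 2 bytes are a big-endian length (which, on z/OS,
     * includes the 4 RDW bytes); the last 2 bytes are hex zeros.
     * Returns -1 if the bytes do not look like a valid RDW.
     */
    public static int rdwDataLength(byte[] buf, int offset) {
        if (offset + 4 > buf.length) return -1;
        // The two trailer bytes of an RDW must be hex zeros.
        if (buf[offset + 2] != 0 || buf[offset + 3] != 0) return -1;
        int total = ((buf[offset] & 0xFF) << 8) | (buf[offset + 1] & 0xFF);
        if (total < 4) return -1;   // must at least cover the RDW itself
        return total - 4;           // length of the record data
    }

    public static void main(String[] args) {
        // A 128-byte record: RDW length = 128 + 4 = 132 = x'0084'
        byte[] file = new byte[132];
        file[0] = 0x00; file[1] = (byte) 0x84;
        System.out.println(rdwDataLength(file, 0)); // prints 128
    }
}
```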
Using the RDW

In your example (128 & 75 byte records, 128 MB split) you could:
Use a random-access file to read 128 KB starting at 128 MB - 128 KB.
Search for a sequence of RDW records. If you find a sequence of 500 RDWs, you can be very certain you have located the record start/end; either that, or you have an unusual file.
Even if you somehow make a mistake (it would be a very unusual file), you will know the first time you run the file in Hadoop.
Having the RDW would allow generic readers.
Sequence being searched for (128 & 75 byte records).
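That search could be sketched as follows. Both helpers are hypothetical (not JRecord or Hadoop API); a real implementation would read the chunk via random-access or HDFS I/O rather than from an in-memory array, and would scan backwards from the nominal split point as described above.

```java
public class RdwSplitFinder {
    /** Returns true if, starting at offset, the buffer contains a chain of
     *  at least minRecords consecutive well-formed RDWs. */
    static boolean isRdwChain(byte[] buf, int offset, int minRecords) {
        int pos = offset, found = 0;
        while (found < minRecords) {
            if (pos + 4 > buf.length) return false;
            // Trailer bytes of each RDW must be hex zeros.
            if (buf[pos + 2] != 0 || buf[pos + 3] != 0) return false;
            int len = ((buf[pos] & 0xFF) << 8) | (buf[pos + 1] & 0xFF);
            if (len < 4) return false;
            pos += len;   // RDW length includes the RDW itself
            found++;
        }
        return true;
    }

    /** Scan forward for the first offset that starts a chain of minRecords
     *  RDWs; returns -1 if none is found in the chunk. */
    static int findRecordStart(byte[] chunk, int minRecords) {
        for (int i = 0; i + 4 <= chunk.length; i++) {
            if (isRdwChain(chunk, i, minRecords)) return i;
        }
        return -1;
    }
}
```

A false positive requires many consecutive byte positions to decode as self-consistent RDWs, which is why a run of a few hundred is near-certain proof of a record boundary.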
General comments
Consider adding the copybook name to the file name (at the end or the start, whatever suits). It means you can look at the file name and know exactly what the copybook is. That can be handy if you use the RecordEditor / FileAid. It also makes it easier/quicker for new people to become productive (and they will not be asking you what copybook to use with a specific file).
With the CodeGen sub-project I have started generating code from a Cobol copybook + Cobol data file. Basically you give CodeGen a Cobol copybook and a Cobol data file; CodeGen analyses the copybook/file and passes the schema details to one or more Velocity templates. It may be possible to set up Hadoop templates.
If there is something that will be useful for general Hadoop users, please let me know.
With version 0.90 I introduced IByteRecordReader and IByteRecordWriter.
These allow you to interface JRecord with data sources other than files/streams.
See Description
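As a rough illustration of the byte-reader idea (this is a standalone sketch with invented names, not JRecord's actual interfaces; see the linked description for the real API): a reader hands JRecord one record's bytes at a time, so the bytes can come from anywhere, e.g. an HDFS block already in memory.

```java
/** Illustrative only: a minimal byte-record source in the spirit of
 *  JRecord 0.90's IByteRecordReader (the real interface differs). */
interface ByteRecordSource {
    byte[] read();   // next record's bytes, or null at end of data
}

/** Serves fixed-length records from an in-memory byte array, e.g. bytes
 *  that arrived over the network rather than from a local file. */
class FixedLengthByteSource implements ByteRecordSource {
    private final byte[] data;
    private final int recordLength;
    private int pos = 0;

    FixedLengthByteSource(byte[] data, int recordLength) {
        this.data = data;
        this.recordLength = recordLength;
    }

    public byte[] read() {
        if (pos + recordLength > data.length) return null;
        byte[] rec = new byte[recordLength];
        System.arraycopy(data, pos, rec, 0, recordLength);
        pos += recordLength;
        return rec;
    }
}
```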
Last edit: Bruce Martin 2018-05-06