If anyone is interested, there looks to be a project using JRecord to read copybook data under HDFS (Hadoop) MapReduce.
Interesting. I have been watching the COBOLSerDe at https://rbheemana.github.io/Cobol-to-Hive/serde.html
Unfortunately both require the jar to be added to Hive; where I work, that raises red flags. It is a neat approach, though.
If you look at the problems, there are still some basic issues to be resolved, e.g.:
The Cobol parser is very basic; have a look at https://github.com/rbheemana/Cobol-to-Hive/blob/master/src/main/java/com/savy3/hadoop/hive/serde3/cobol/CobolFieldFactory.java
Is Avro built into Hadoop? For simpler files, a Cobol-to-Avro conversion should be possible.
It looks as though Datameer are using JRecord for reading Cobol files.
See:
http://www.datameer.com/community/questions/249/cobol-copybook-file-uploads
There is also Syncsort, which is having another go at explaining that their DMX-h product is much better than JRecord. It probably is; it would want to be, at the price they charge:
There are some notes on using CopybookInputFormat (which uses JRecord):
JRecord is also used in:
Bruce, back in 2014 I made some changes before you added the new ways of adding IO readers/writers. I just have to say how awesome your code is and how much of a help it has been when loading legacy mainframe data into Hadoop environments. I have used it twice over the last few years to get legacy data into Hadoop, converting the files into CSV/TSV tables using MapReduce. Not sure why I never saw this thread, but I would like to give back a bit. Here is the code I wrote that does a variety of things with mainframe copybooks on Z, even if folks use it just as an example:
https://github.com/gss2002/copybook_formatter
Also, here is some code that allows FTPing RDW files directly from the Mainframe into Hadoop/HDFS, as a MapReduce job or a standalone client.
https://github.com/gss2002/ftp2hdfs
All the code is open; no licensing, etc.
Thanks for the information on your packages; I have added them to the ReadMe in a Hadoop section. The ReadMe should flow through to the GitHub clone.
I am interested in how JRecord is being used, so please keep in touch. My whole purpose is to produce something that is useful to as many people as possible.
Suggestions
If you have suggestions that will help multiple users, please make them; I will probably do them when I have time.
Future & CodeGen
I have started a thread on future changes.
CodeGen will be the major focus. My aims will be:
All the Best
Bruce
Last edit: Bruce Martin 2018-02-02
Hey Bruce,
First off thanks for all of your work on JRecord. Our team has been leveraging it to convert EBCDIC files with a good deal of success. We are currently looking into an approach for using this on our Hadoop cluster, but I think we have found a few sticking points for generalizing it.
When decoding an EBCDIC file, as soon as you have more than one record type it is not trivial to calculate where to split the file. Say, for example, you have two record types of lengths 128 and 75. It is effectively impossible to break that file into chunks that are guaranteed to start at a new record (well, not impossible, but computationally inefficient). For example, if you split the file at 128 MB, you would need to find the next offset of a valid record or the mapping will break.
Is this a valid line of thinking? Or is there an easier way to figure out where you can split a file defined by a COBOL copybook?
I presume the files are VB on the Mainframe. One question: are you transporting the RDW (Record Descriptor Word) from the Mainframe?

No RDW: you are correct, you will have to go through the file record by record.
RDW: it should be possible to search for it.
RDW (Record Descriptor Word)

In a VB file the data is stored as:

(RDW)(Record-data)

The RDW consists of:
a 2-byte record length
2 hex-zero bytes.

In your example file, including the RDW will increase the size by about 4%, while if the average record length were 20 bytes, it would add a 20% overhead.
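The RDW layout described above can be decoded in a few lines of Java. This is an illustrative sketch, not JRecord's API (the class and method names here are made up); it assumes the z/OS convention that the RDW's length field includes the 4 RDW bytes themselves.

```java
public class RdwExample {
    /**
     * Parse the data length from a 4-byte RDW at the given offset.
     * The first 2 bytes are a big-endian length (which, on z/OS,
     * includes the 4 RDW bytes); the last 2 bytes are hex zeros.
     * Returns -1 if the bytes do not look like a valid RDW.
     */
    public static int rdwDataLength(byte[] buf, int offset) {
        if (offset + 4 > buf.length) return -1;
        // The two trailer bytes of an RDW must be hex zeros.
        if (buf[offset + 2] != 0 || buf[offset + 3] != 0) return -1;
        int total = ((buf[offset] & 0xFF) << 8) | (buf[offset + 1] & 0xFF);
        if (total < 4) return -1;   // must at least cover the RDW itself
        return total - 4;           // length of the record data
    }

    public static void main(String[] args) {
        // A 128-byte record: RDW length = 128 + 4 = 132 = x'0084'
        byte[] file = new byte[132];
        file[0] = 0x00; file[1] = (byte) 0x84;
        System.out.println(rdwDataLength(file, 0)); // prints 128
    }
}
```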
Using the RDW

In your example (128 & 75 byte records, 128 MB split) you could:
Use a random-access file to read 128 KB starting at 128 MB - 128 KB.
Search for a sequence of RDW records. If you find a sequence of 500 RDWs, you can be very certain you have located the record start/end; either that, or you have an unusual file.
Even if you somehow make a mistake (it would be a very unusual file), you will know the first time you run the file in Hadoop.
Having the RDW would allow generic readers.
Sequence being searched for (128 & 75 byte records).
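That search could be sketched as follows. Both helpers are hypothetical (not JRecord or Hadoop API); a real implementation would read the chunk via random-access or HDFS I/O rather than from an in-memory array, and would scan backwards from the nominal split point as described above.

```java
public class RdwSplitFinder {
    /** Returns true if, starting at offset, the buffer contains a chain of
     *  at least minRecords consecutive well-formed RDWs. */
    static boolean isRdwChain(byte[] buf, int offset, int minRecords) {
        int pos = offset, found = 0;
        while (found < minRecords) {
            if (pos + 4 > buf.length) return false;
            // Trailer bytes of each RDW must be hex zeros.
            if (buf[pos + 2] != 0 || buf[pos + 3] != 0) return false;
            int len = ((buf[pos] & 0xFF) << 8) | (buf[pos + 1] & 0xFF);
            if (len < 4) return false;
            pos += len;   // RDW length includes the RDW itself
            found++;
        }
        return true;
    }

    /** Scan forward for the first offset that starts a chain of minRecords
     *  RDWs; returns -1 if none is found in the chunk. */
    static int findRecordStart(byte[] chunk, int minRecords) {
        for (int i = 0; i + 4 <= chunk.length; i++) {
            if (isRdwChain(chunk, i, minRecords)) return i;
        }
        return -1;
    }
}
```

A false positive requires many consecutive byte positions to decode as self-consistent RDWs, which is why a run of a few hundred is near-certain proof of a record boundary.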
General comments
Consider adding the copybook name to the file name (at the end or the start, whatever suits). It means you can look at the file name and know exactly what the copybook is. That can be handy if you use the RecordEditor / FileAid. It also makes it easier/quicker for new people to become productive (and they will not be asking you what copybook to use with a specific file).
With the CodeGen sub-project I have started generating code from a Cobol copybook + Cobol data file. Basically you give CodeGen a Cobol copybook and a Cobol data file; CodeGen analyses the copybook/file and passes the schema details to one or more Velocity templates. It may be possible to set up Hadoop templates.
If there is something that will be useful for general Hadoop users, please let me know.
With version 0.90 I introduced IByteRecordReader and IByteRecordWriter.
These allow you to interface JRecord with data sources other than files/streams.
See Description
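As a rough illustration of the byte-reader idea (this is a standalone sketch with invented names, not JRecord's actual interfaces; see the linked description for the real API): a reader hands JRecord one record's bytes at a time, so the bytes can come from anywhere, e.g. an HDFS block already in memory.

```java
/** Illustrative only: a minimal byte-record source in the spirit of
 *  JRecord 0.90's IByteRecordReader (the real interface differs). */
interface ByteRecordSource {
    byte[] read();   // next record's bytes, or null at end of data
}

/** Serves fixed-length records from an in-memory byte array, e.g. bytes
 *  that arrived over the network rather than from a local file. */
class FixedLengthByteSource implements ByteRecordSource {
    private final byte[] data;
    private final int recordLength;
    private int pos = 0;

    FixedLengthByteSource(byte[] data, int recordLength) {
        this.data = data;
        this.recordLength = recordLength;
    }

    public byte[] read() {
        if (pos + recordLength > data.length) return null;
        byte[] rec = new byte[recordLength];
        System.arraycopy(data, pos, rec, 0, recordLength);
        pos += recordLength;
        return rec;
    }
}
```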
Last edit: Bruce Martin 2018-05-06