
JRecord in use - MapReduce InputFormat for HDFS, MapReduce, Pig, Hive, Spark

2014-06-28
2018-10-18
  • Bruce Martin

    Bruce Martin - 2014-06-28

    If anyone is interested, there looks to be a project:

    using JRecord to read copybook data under HDFS (Hadoop) MapReduce
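    For anyone new to JRecord, a minimal sketch of reading a mainframe file against a copybook
    might look roughly like the code below. It uses the newer IOBuilder style API; the copybook
    name, code page, field name and file organisation are placeholders, and exact method names
    can differ between JRecord versions.

      import net.sf.JRecord.Common.Constants;
      import net.sf.JRecord.Details.AbstractLine;
      import net.sf.JRecord.IO.AbstractLineReader;
      import net.sf.JRecord.JRecordInterface1;

      public class ReadCopybookFile {
          public static void main(String[] args) throws Exception {
              // Build a reader from a COBOL copybook; "DTAR020.cbl" and the field
              // name below are illustrative only.
              AbstractLineReader reader = JRecordInterface1.COBOL
                      .newIOBuilder("DTAR020.cbl")
                      .setFont("cp037")                               // EBCDIC code page
                      .setFileOrganization(Constants.IO_FIXED_LENGTH) // fixed-length records
                      .newReader("DTAR020.bin");                      // the data file

              AbstractLine line;
              while ((line = reader.read()) != null) {
                  // Access fields by their COBOL names
                  System.out.println(line.getFieldValue("DTAR020-KEYCODE-NO").asString());
              }
              reader.close();
          }
      }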

     
  • Bruce Martin

    Bruce Martin - 2015-11-29
     
  • Greg Senia

    Greg Senia - 2018-02-01

    Bruce, back in 2014 I made some changes before you added the new ways of adding IO Readers/Writers. I just have to say how awesome your code is and how much of a help it has been when dealing with loading legacy mainframe data into Hadoop environments. I have used it twice over the last few years to get legacy data into Hadoop, converting the files with MapReduce into CSV/TSV tables. Not sure why I never saw this thread before, but I would like to give back a bit. Here is the code I wrote that does a variety of things with mainframe copybooks on IBM Z; even if folks use it just as an example:

    https://github.com/gss2002/copybook_formatter

    Also, here is some code that allows FTPing RDW files directly from the mainframe into
    Hadoop/HDFS, either as a MapReduce job or as a standalone client:
    https://github.com/gss2002/ftp2hdfs
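    For readers who just want the general shape of such a transfer, a rough sketch (not the
    ftp2hdfs code itself) using Apache Commons Net and the Hadoop FileSystem API could look like
    the following. Host, credentials, dataset and HDFS path are placeholders, and it assumes the
    z/OS FTP server honours SITE RDW so the Record Descriptor Words are kept.

      import org.apache.commons.net.ftp.FTP;
      import org.apache.commons.net.ftp.FTPClient;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;

      import java.io.InputStream;
      import java.io.OutputStream;

      public class FtpRdwToHdfs {
          public static void main(String[] args) throws Exception {
              FTPClient ftp = new FTPClient();
              ftp.connect("mainframe.example.com");        // placeholder host
              ftp.login("user", "password");               // placeholder credentials
              ftp.setFileType(FTP.BINARY_FILE_TYPE);       // binary so the EBCDIC bytes are untouched
              ftp.sendSiteCommand("RDW");                  // keep Record Descriptor Words on VB files
              ftp.enterLocalPassiveMode();

              FileSystem fs = FileSystem.get(new Configuration());
              try (InputStream in = ftp.retrieveFileStream("'HLQ.VB.DATASET'");        // placeholder dataset
                   OutputStream out = fs.create(new Path("/data/legacy/vb_file.bin"))) {
                  IOUtils.copyBytes(in, out, 64 * 1024);   // stream straight into HDFS
              }
              ftp.completePendingCommand();
              ftp.logout();
              ftp.disconnect();
          }
      }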

    All the code is open, no licensing, etc.

     
  • Bruce Martin

    Bruce Martin - 2018-02-02

    Thanks for the information on your packages; I have added them to the ReadMe in a Hadoop section. The ReadMe should flow through to the GitHub clone.

    I am interested in how JRecord is being used, so please keep in touch. My whole purpose is to produce something that is useful to as many people as possible.

    Suggestions

    If you have suggestions that will help multiple users, please make them; I will probably do them when I have time.

    Future & CodeGen

    I have started a thread on future changes.

    CodeGen will be the major focus. My aims will be:

    • Creating a standalone program
    • Inspecting both the data file and the copybook. It should be possible to detect various date formats
    • Generating code (I am using Velocity templates; see the sketch below)
    • I could generate Hadoop code if useful
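    To give a feel for the Velocity side, here is a minimal, made-up sketch of merging a template
    with some schema details. The template text, class name and field names are purely
    illustrative; they are not CodeGen's real templates.

      import org.apache.velocity.VelocityContext;
      import org.apache.velocity.app.VelocityEngine;

      import java.io.StringWriter;
      import java.util.Arrays;

      public class VelocitySketch {
          public static void main(String[] args) {
              VelocityEngine engine = new VelocityEngine();
              engine.init();

              // A made-up inline template; CodeGen's real templates live in .vm files
              String template =
                    "public class $className {\n"
                  + "#foreach($f in $fields)"
                  + "    private String $f;\n"
                  + "#end\n"
                  + "}\n";

              VelocityContext ctx = new VelocityContext();
              ctx.put("className", "Dtar020Record");                          // illustrative name
              ctx.put("fields", Arrays.asList("keycode", "storeNo", "qty"));  // illustrative fields

              StringWriter out = new StringWriter();
              engine.evaluate(ctx, out, "codegen-sketch", template);
              System.out.println(out);   // prints the generated class source
          }
      }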

    All the Best
    Bruce

     

    Last edit: Bruce Martin 2018-02-02
  • Collin Forsyth

    Collin Forsyth - 2018-05-04

    Hey Bruce,

    First off, thanks for all of your work on JRecord. Our team has been leveraging it to convert EBCDIC files with a good deal of success. We are currently looking into an approach for using this on our Hadoop cluster, but I think we have found a few sticking points when it comes to generalizing it.

    When decoding an EBCDIC file, as soon as you have more than one record length it is not trivial to calculate where to split the file. Say, for example, you have two record types of length 128 and 75. It is impossible to break that file up into chunks where we are guaranteed to start at a new record (well, not impossible, but computationally inefficient). For example, if you split the file at 128 MB, you would need to find the next offset of a valid record or the mapping will break.

    Is this a valid line of thinking? Or is there an easier way to figure out where you can split a file defined by a COBOL copybook?

     
  • Bruce Martin

    Bruce Martin - 2018-05-05

    I presume the files are VB on the mainframe. One question: are you transporting
    the RDW (Record Descriptor Word) from the mainframe?

    • No RDW - you are correct, you will have to go through the file record by record.
    • RDW - it should be possible to search for.

    RDW (Record Descriptor Word)

    In a VB file the data is stored as:

    (RDW)(Record-data)

    The RDW consists of:

    • a 2-byte record length
    • 2 hex-zero bytes

    In your example file, including the RDW will increase the size by about 4%,
    while if the average record length were 20 bytes it would add a 20% overhead.
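    As a concrete illustration of that layout, one RDW-prefixed record can be read along these
    lines. The length in the RDW is taken here to include the 4 RDW bytes themselves (which
    matches the 132 / 79 values in the example further down); treat it as a sketch, not a
    complete VB reader.

      import java.io.DataInputStream;
      import java.io.EOFException;
      import java.io.FileInputStream;
      import java.io.IOException;

      public class RdwReader {
          /** Reads one RDW-prefixed record; returns null at end of file. */
          static byte[] readRecord(DataInputStream in) throws IOException {
              int rdwLength;
              try {
                  rdwLength = in.readUnsignedShort();   // 2-byte big-endian length (includes the RDW)
              } catch (EOFException e) {
                  return null;
              }
              int filler = in.readUnsignedShort();      // the 2 hex-zero bytes
              if (filler != 0 || rdwLength < 4) {
                  throw new IOException("Invalid RDW: length=" + rdwLength + " filler=" + filler);
              }
              byte[] data = new byte[rdwLength - 4];    // record data follows the 4 RDW bytes
              in.readFully(data);
              return data;
          }

          public static void main(String[] args) throws IOException {
              try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
                  int count = 0;
                  while (readRecord(in) != null) {
                      count++;                          // process the EBCDIC record bytes here
                  }
                  System.out.println("records: " + count);
              }
          }
      }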

    Using RDW

    In your example (128 & 75 byte records, 128 MB split)
    you could:

    • Use a random-access file to read 128 KB starting at 128 MB - 128 KB.
    • Search for a sequence of RDWs. If you find a sequence of 500 RDWs
      you can be very certain you have located the line start / end. Either that
      or you have an unusual file.
    • Even if you somehow make a mistake (it would be a very unusual file) you will
      know the first time you run the file in Hadoop.
    • Having the RDW would allow generic readers.

    Sequence being searched for (128 & 75 byte records):

      0,132,0,0 <- 128 bytes -> 0,79,0,0 <- 75 bytes -> 0,79,0,0 ....
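    In code, that search could be sketched as below: read a window of bytes just before the
    proposed split point, then slide through it looking for an offset from which a chain of
    plausible RDWs (length in a sensible range, followed by the two zero bytes) can be walked.
    The chain length and length bounds here are arbitrary illustration values.

      public class RdwSplitFinder {

          /**
           * Returns the offset within the window of the first position from which
           * chainLength consecutive, plausible RDWs can be walked, or -1 if none.
           * A plausible RDW here is: a length between minLen and maxLen, followed by 0,0.
           */
          static int findRecordStart(byte[] window, int chainLength, int minLen, int maxLen) {
              for (int start = 0; start + 4 <= window.length; start++) {
                  int pos = start;
                  int found = 0;
                  while (pos + 4 <= window.length) {
                      int rdwLen = ((window[pos] & 0xFF) << 8) | (window[pos + 1] & 0xFF);
                      boolean zeroFiller = window[pos + 2] == 0 && window[pos + 3] == 0;
                      if (!zeroFiller || rdwLen < minLen || rdwLen > maxLen) {
                          break;                   // chain broken, try the next start offset
                      }
                      if (++found >= chainLength) {
                          return start;            // long enough chain: very likely a record boundary
                      }
                      pos += rdwLen;               // the RDW length includes the 4 RDW bytes
                  }
              }
              return -1;
          }

          public static void main(String[] args) {
              // Tiny synthetic window: a 128-byte record (RDW length 132) then a 75-byte one (79)
              byte[] window = new byte[132 + 79];
              window[1] = (byte) 132;              // 0,132,0,0 then 128 data bytes
              window[133] = 79;                    // 0,79,0,0 then 75 data bytes
              System.out.println(findRecordStart(window, 2, 79, 132));   // prints 0
          }
      }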
    
     
  • Bruce Martin

    Bruce Martin - 2018-05-05

    General comments

    Consider adding the copybook name to the file name (at the end or start, whatever suits).
    It means you can look at the file name and know exactly which copybook it uses.
    That can be handy if you use the RecordEditor / FileAid. It also makes it
    easier/quicker for new people to become productive (and they will not be
    asking you which copybook to use with a specific file).

    With the CodeGen sub-project I have started generating code from a Cobol copybook and Cobol data file. Basically you give CodeGen a Cobol copybook and a Cobol data file; CodeGen analyses the copybook/file and passes the schema details to one or more Velocity templates. It may be possible to set up Hadoop templates.
    If there is something that will be useful for general Hadoop users, please let me know.

    With Version 0.90 I introduced IByteRecordReader and IByteRecordWriter.
    These allow you to interface JRecord with data sources other than files / streams.
    See the description.
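    For what it is worth, when the data source is still a stream (for example a file already
    sitting in HDFS), the stream overload of the IOBuilder can usually be used directly rather
    than IByteRecordReader. The sketch below assumes a newReader(InputStream) overload exists in
    your JRecord version and uses placeholder copybook, field and HDFS path names.

      import net.sf.JRecord.Common.Constants;
      import net.sf.JRecord.Details.AbstractLine;
      import net.sf.JRecord.IO.AbstractLineReader;
      import net.sf.JRecord.JRecordInterface1;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ReadCopybookFromHdfs {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              // Open the HDFS file as a plain InputStream and hand it to JRecord.
              // Copybook, code page and field name are placeholders.
              AbstractLineReader reader = JRecordInterface1.COBOL
                      .newIOBuilder("DTAR020.cbl")
                      .setFont("cp037")
                      .setFileOrganization(Constants.IO_FIXED_LENGTH)
                      .newReader(fs.open(new Path("/data/legacy/dtar020.bin")));

              AbstractLine line;
              while ((line = reader.read()) != null) {
                  System.out.println(line.getFieldValue("DTAR020-KEYCODE-NO").asString());
              }
              reader.close();
          }
      }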

     

    Last edit: Bruce Martin 2018-05-06
