Stingray - Schema-Based File Reader Blog

Simple handling of numerous data file formats, even COBOL EBCDIC.

Brought to you by: slott56

Moving to GitHub

It's about time to move to GitHub. The SourceForge workflow has (finally) started to feel too complex. Details to follow.
See https://github.com/slott56 for further developments.
Also see https://slott-softwarearchitect.blogspot.com for some announcements.

Posted by 2018-10-28 Labels: github

Fix the cobol.RECFM_VB file handling

Also updated the Git repository and tagged this release as 4.4.5, which makes it official.

The unit tests for cobol.RECFM_VB are clearly inadequate, since the bug was obvious on first use.

Posted by 2015-02-19

Fix the COBOL Schema Parser Functions

The revised version of COBOL_schemata looks like this:

def COBOL_schemata( source, replacing=None, lexer_class=Lexer ):
    lexer= lexer_class( replacing )
    parser= RecordFactory()
    dde_list= list( parser.makeRecord( lexer.scan(source) ) )
    schema_list= list( make_schema( dde ) for dde in dde_list )
    return dde_list, schema_list

This gives us an API that looks like this:... read more

Posted by 2014-06-07

COBOL Schema vs. COBOL Schemata

Added a COBOL_schemata function in the Git repository only. This is not built into the distribution kit because it's not clear whether it truly helps solve problems with real-world copybooks.

The code looks like this:

def COBOL_schemata( source, replacing=None ):
    lexer= Lexer( replacing )
    parser= RecordFactory()
    dde_list= list( parser.makeRecord( lexer.scan(source) ) )
    schema_list= list( make_schema( dde ) for dde in dde_list )
    return dde_list, schema_list

... [read more](/p/stingrayreader/blog/2014/05/cobol-schema-vs-cobol-schemata/)

Posted by 2014-05-20

Release 4.4.3 Bug Fixes and Small API Change

Make the embedded schema loader tolerate blank sheets by producing a warning and returning None instead of raising a StopIteration exception.
Tweak the Data validation demo to handle the None-instead-of-schema feature.
Change cobol.COBOL_file.row_get() to leave trailing spaces intact. Stripping them trashed COMP items whose value was exactly 0x40, an EBCDIC space.... read more
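A minimal sketch of why stripping trailing spaces is unsafe for EBCDIC records. It assumes a hypothetical six-byte record (a PIC X(4) field followed by a two-byte COMP field) and Python's built-in cp037 codec; none of the names below come from Stingray itself.

```python
# Hypothetical layout: NAME PIC X(4), COUNT PIC S9(4) COMP (2-byte binary).
EBCDIC_SPACE = b"\x40"

name = "AB  ".encode("cp037")      # EBCDIC text, padded with 0x40 spaces
count = (64).to_bytes(2, "big")    # 64 == 0x0040, so the last byte is 0x40
record = name + count

# Stripping "trailing spaces" also removes the low-order byte of COUNT.
stripped = record.rstrip(EBCDIC_SPACE)

intact_count = int.from_bytes(record[4:6], "big", signed=True)
damaged_count = int.from_bytes(stripped[4:6], "big", signed=True)
print(len(record), len(stripped), intact_count, damaged_count)
```

The binary field loses a byte because a 0x40 byte is indistinguishable from padding once the record is treated as text.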

Posted by 2014-05-18

Split large files

Made some tweaks to the documentation and the cobol package that support splitting large files.

See http://stingrayreader.sourceforge.net/cobol.html#low-level-split-processing for a way to handle splits with this.

This change is only in the Git repository, for now. It's not a formal part of the 4.4.2 release.

Posted by 2014-05-14

Version 4.4.2

Made two small performance tweaks for handling large COBOL files.
These are minor; it's difficult to make larger performance improvements without a fundamental, wholesale change to the way attributes are handled for COBOL.

Currently, offset and size are calculated lazily when there are OCCURS DEPENDING ON clauses. Each calculation is fast, but not instantaneous.... read more
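To illustrate why offsets can't be fixed in advance when OCCURS DEPENDING ON is present, here is a small sketch. The `Field` class, the `locate` function, and the three-field layout are all hypothetical and use ASCII digits for simplicity; this is not Stingray's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Field:
    name: str
    size: int                          # bytes per occurrence
    depends_on: Optional[str] = None   # count field for an ODO item


# Hypothetical layout: a 2-digit count, a repeating 3-byte item, a 4-byte trailer.
LAYOUT = [
    Field("N", 2),
    Field("ITEM", 3, depends_on="N"),
    Field("TRAILER", 4),
]


def locate(layout: List[Field], row: bytes) -> Dict[str, Tuple[int, int]]:
    """Return {name: (offset, total size)} for one specific row.

    Anything after an ODO item moves with the row's count, so the
    positions have to be worked out per row rather than once per schema.
    """
    positions: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for field in layout:
        occurs = 1
        if field.depends_on is not None:
            start, size = positions[field.depends_on]
            occurs = int(row[start:start + size].decode("ascii"))
        total = field.size * occurs
        positions[field.name] = (offset, total)
        offset += total
    return positions


row_a = b"02" + b"aaabbb" + b"TAIL"
row_b = b"01" + b"aaa" + b"TAIL"
print(locate(LAYOUT, row_a)["TRAILER"])
print(locate(LAYOUT, row_b)["TRAILER"])
```

The same TRAILER field lands at a different offset in each row, which is the per-row cost the post describes.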

Posted by 2014-05-13

Code Update to 4.4.1.

The code in the Git repository has been updated to version 4.4.1 to include support for Numbers '09 and Numbers '13 files.

Some of the unit tests for RECFM handling have been corrected, and the RECFM handling has been revised based on the unit test fixes.

Ticket 15 (wrong ODO calculations) appears to be finally resolved.

Posted by 2014-05-09

Switching to Git

The old SVN repository (which was stale code) has been dropped.

The new repository uses the Git server, and will have the live code.

This will be separate from the distribution kits, which should become less useful in the future. Cloning the Git repository should be a simpler way to get the code.

Posted by 2014-05-08

Version 4.3.4

Ticket #12 resolved. This supports what we'll call RECFM=N processing.

It will handle EBCDIC files that lack BDW/RDW words at the start of blocks and records.

It uses a very large buffer, computes the "Occurs Depending On" record size, then does an unget of the unused bytes at the end of the buffer. It's slower than processing a file with proper RDW/BDW words from the mainframe filesystem.
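The buffer-and-push-back approach can be sketched roughly as follows. The function name, the `record_size` callback, and the toy record format (a one-byte count followed by two-byte items) are assumptions made for illustration, not Stingray's API.

```python
import io
from typing import BinaryIO, Callable, Iterator


def records_without_rdw(
    source: BinaryIO,
    record_size: Callable[[bytes], int],
    buffer_size: int = 32768,
) -> Iterator[bytes]:
    """Yield records from a file that has no BDW/RDW length words.

    `record_size` is a hypothetical callback that inspects the buffer and
    returns the length of the record at its start.
    """
    buffer = source.read(buffer_size)
    while buffer:
        size = record_size(buffer)
        if size <= 0 or size > len(buffer):
            raise ValueError("record size doesn't fit the remaining data")
        yield buffer[:size]
        # Push back the unused bytes: keep them and top the buffer up again.
        unused = buffer[size:]
        buffer = unused + source.read(buffer_size - len(unused))


def size_from_count(buffer: bytes) -> int:
    # Toy format: one count byte, then `count` two-byte items.
    return 1 + 2 * buffer[0]


data = bytes([2]) + b"aabb" + bytes([1]) + b"cc"
records = list(records_without_rdw(io.BytesIO(data), size_from_count, buffer_size=16))
print(records)
```

Every record costs a size computation plus a buffer copy, which is consistent with this mode being slower than reading files that carry their own length words.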

Posted by 2014-05-07

Version 4.3.3

Numerous changes:

Support iWork '09 Numbers Workbook files.
Fix the cobol.defs.Usage class hierarchy to properly handle data which can't be converted. An ErrorCell is created in the (all too common) case where the COBOL data is invalid.
Handle the precision of COMP-3 correctly. Ticket #9.
Add the cobol.loader.Lexer_Long_Lines class to parse copybooks with junk in positions 72:80 of each line. Ticket #11.
Update the developers' guide. Ticket #7.... read more
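For readers unfamiliar with COMP-3, here is a minimal sketch of packed-decimal decoding with an explicit precision. The function name and the `scale` parameter are hypothetical; where this sketch raises ValueError on bad data, Stingray (per the notes above) creates an ErrorCell instead.

```python
from decimal import Decimal


def unpack_comp3(data: bytes, scale: int = 0) -> Decimal:
    """Decode packed-decimal bytes.

    `scale` is the number of digits after the implied decimal point,
    for example 2 for PIC S9(3)V99.
    """
    if not data:
        raise ValueError("empty COMP-3 field")
    digits = []
    for byte in data[:-1]:
        digits.extend((byte >> 4, byte & 0x0F))
    digits.append(data[-1] >> 4)
    sign_nibble = data[-1] & 0x0F      # the final nibble carries the sign
    if any(d > 9 for d in digits) or sign_nibble < 0x0A:
        raise ValueError("invalid COMP-3 data: " + data.hex())
    # 0xB and 0xD mean negative; 0xA, 0xC, 0xE and 0xF are treated as positive.
    sign = 1 if sign_nibble in (0x0B, 0x0D) else 0
    return Decimal((sign, tuple(digits), -scale))


print(unpack_comp3(b"\x12\x34\x5C", scale=2))
print(unpack_comp3(b"\x12\x34\x5D", scale=2))
```

Building the Decimal from a digit tuple and exponent avoids any intermediate float, so the declared precision is preserved exactly.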

Posted by 2014-05-06

Version 4.3.2.

Fixed stingray.cobol.dump() to properly iterate through all fields, including all indices of indexed fields. It should produce a comprehensive record dump.

The dump() function is now an iterable over all fields in the record; each result is a tuple with (dde, attribute, indices, bytes, Cell). An error cell will also contain the raw bytes.

def raw_dump( schema, sheet ):
    for row in sheet.rows():
        stingray.cobol.dump( schema, row )

... [read more](/p/stingrayreader/blog/2014/05/version-432/)

Posted by 2014-05-05

Version 4.3 Changes

This release may handle COBOL EBCDIC files with Occurs Depending On properly.
This should give an interface which is compatible with other spreadsheets even
though the record layout is rather complex.

Handle more complex VALUES clauses for more complex 88-level items.
Restructure cobol, cobol.loader to add a cobol.defs module.
Handle Occurs Depending On. Parse the syntax for ODO. Update LazyRow to tweak size and offset information for each row fetched.
Add z/OS RECFM handling in the cobol.EBCDIC_File class. This will allow processing "Raw" EBCDIC files with RECFM of V and RECFM of VB; these files include BDW and RDW headers on blocks and records.... read more
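As a rough illustration of the BDW and RDW headers, the sketch below walks RECFM=VB data held in memory. It assumes the basic format only (no spanned records, no extended BDW): each descriptor word is four bytes, and its first two bytes are a big-endian length that includes the descriptor itself. The function and helper names are hypothetical, not part of Stingray.

```python
import struct
from typing import Iterator, List


def vb_records(data: bytes) -> Iterator[bytes]:
    """Yield record payloads from RECFM=VB bytes (basic BDW/RDW format only)."""
    position = 0
    while position < len(data):
        block_length, _ = struct.unpack_from(">HH", data, position)
        if block_length < 4:
            raise ValueError("invalid BDW at offset %d" % position)
        block_end = position + block_length
        position += 4                      # step over the BDW
        while position < block_end:
            record_length, _ = struct.unpack_from(">HH", data, position)
            if record_length < 4:
                raise ValueError("invalid RDW at offset %d" % position)
            yield data[position + 4:position + record_length]
            position += record_length


def with_rdw(payload: bytes) -> bytes:
    return struct.pack(">HH", len(payload) + 4, 0) + payload


def with_bdw(records: List[bytes]) -> bytes:
    body = b"".join(records)
    return struct.pack(">HH", len(body) + 4, 0) + body


data = with_bdw([with_rdw(b"ONE"), with_rdw(b"THREE")]) + with_bdw([with_rdw(b"TWO")])
print(list(vb_records(data)))
```

In a real file the payloads would be EBCDIC bytes; plain ASCII literals are used here only to keep the example readable.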

Posted by 2014-04-16

Version 4.2 Changes

More COBOL syntax is handled properly.

To an extent, there is some support of Occurs Depending On.

Also, some more complex 88-level items are properly parsed.

Small scripts like this can be used to confirm how well COBOL is supported.

import logging
import pprint
import sys

import stingray.cobol.loader

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
with open("CCCCMST3.TXT", "r") as cobol:
    schema = stingray.cobol.loader.COBOLSchemaLoader(
        cobol, replacing=[("'XXXX'", "X")]).load()
    pprint.pprint(schema)
logging.shutdown()

Posted by 2014-04-07

Version 4.1 Changes

Version 4 dates from March 2014. It switches to Python 3.3. This is a rewrite; it doesn't use 2to3 or six.

If you're forced to use Python 2, you'll need a two-part application. The Stingray parts use Python 3.3; all the other parts use Python 2.

Added a "replacing" option to the COBOL Schema Loader.

with open("xyzzy.cob", "r") as cobol:
    schema= stingray.cobol.loader.COBOLSchemaLoader( cobol, replacing=("'WORD'", "BAR") )

... [read more](/p/stingrayreader/blog/2014/03/version-41-changes/)

Posted by 2014-03-30

Audit Balance and Control

The sample applications use a simplistic counts['stage'] += 1 to accumulate balance information.

After spending some quality time reviewing Audit Balance and Control designs, it's time to update the demo programs to reflect a slightly smarter approach to this.

We don't need to go all the way down the ABC road -- yet -- but we do need to acknowledge that the balances need to reflect something we can call the "ABC Invariant".... read more
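For context, the simplistic counting pattern looks roughly like the sketch below. The stage names, the sample rows, and the final balance check are assumptions made for illustration; the excerpt doesn't show how the post defines the "ABC Invariant", so the check is only one plausible reading.

```python
from collections import Counter


def process(rows):
    """Count rows per stage while separating good rows from rejects."""
    counts = Counter()
    loaded = []
    for row in rows:
        counts["read"] += 1
        if row.get("zip"):
            loaded.append(row)
            counts["loaded"] += 1
        else:
            counts["rejected"] += 1
    return loaded, counts


rows = [{"zip": "12345"}, {"zip": ""}, {"zip": "02134"}]
loaded, counts = process(rows)

# Assumed balance check: every row read is either loaded or rejected.
balanced = counts["read"] == counts["loaded"] + counts["rejected"]
print(dict(counts), balanced)
```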

Posted by 2012-01-10 Labels: future

Bad Embedded Schema

Without naming names or pointing the finger of blame, let's talk about datasets that are created without much care. To be a little more specific, this is econometric data. Lots of data that's pretty well defined and mostly regular.

Except.

Attributes like ZIP code (or Postal code) often become part of the dataset for a number of reasons. The ZIP code is commonly used to disambiguate similarly-named business enterprises. ... read more

Posted by 2011-10-06 Labels: python csv stingray schema

The Big Picture

After spending the last three years working with complex econometric data sets, as well as COBOL files, I finally realized what The Big Picture (TBP™) is.

Data files (workbooks, COBOL files, whatever) are useless without a schema.

Further, there's rarely an obvious association between application program and schema. Indeed, one of the few desirable features of COBOL is that the schema is explicitly bound into the program. It's in the File Section of the Data Division as the "FD" definition for the file. (Sometimes this is just filler and the real definition is elsewhere in the Data Division, but the principle is still being followed.)... read more

Posted by 2011-09-29 Labels: COBOL DDE ETL XLS XLRD CSV