The main classes of the COBOL lexer.

Provides the API for parsing a COBOL source program and converting it to a list of lexical tokens.

Example

Suppose you want to read the COBOL source program "test.cbl" and obtain it as a list of lexical tokens:


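        // Open the COBOL source file and hand it to the lexer.
        FileReader reader = new FileReader("test.cbl");
        CobolLexer lexer = new CobolLexer(reader);
        // Retrieve the whole program as a list of lexical tokens.
        TokenList tokens = lexer.getTokens();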
    

And that's all.

What are lexical tokens?

A lexical token is a logically indivisible element of a COBOL source program. TokenList contains one CobolToken for each lexical token.

CobolToken gives you the following information about each token:

Each lexical token is given a type (see CobolType).

Most tokens contain text that does not span more than one line. The exception is strings: every string, including a multiline string, is represented as one token.

Parsing applies only to proper COBOL code. Compiler options and the contents of pseudo-text are not parsed, only stored as found.

What the results mean

TokenList contains a list of lexical tokens, a combination of the following types:
Type                     Value                          Description
STRING                   A string                       Includes the start and end quotation marks. For multiline strings, the content is the final value of the string (no further processing required).
WORD                     A COBOL word                   Variable names, reserved words, numeric values, and so on (no whitespace inside).
TEXT                     Unparsed text                  Used to store comments, compiler options and other information that is not the code itself.
SEPARATOR                A comma, period or semicolon   These characters when used as separators in code, not as part of a PICTURE format string.
COLON                    A colon                        Used for substrings, like TITLE(START:LEN).
LEFT_PAREN, RIGHT_PAREN  "(", ")"                       Left and right parentheses.
AMPERSAND                "&"                            The concatenation operator.
START_PSEUDO_TEXT        "=="                           Signals the start of pseudo-text.
PSEUDO_TEXT              Free text                      The pseudo-text itself, minus the initial and final "==".
END_PSEUDO_TEXT          "=="                           Signals the end of pseudo-text.
NEW_PAGE                 "/"                            A page break. The text includes the line's content starting from the initial "/".
SPECIAL_LINE             "$"                            A compiler option. The text includes the line's content starting from the initial "$".

PICTURE string format

A PICTURE format string, like X(9).99, is always expressed as a single WORD token. In any other situation, parentheses and periods are expressed as separate tokens.
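
The difference between the two cases can be illustrated with a small, self-contained sketch. It is not the lexer's implementation: the class PictureSketch and its methods are hypothetical, tokens are shown as plain strings such as WORD[X(9).99], and the sketch assumes whitespace-delimited input without string literals, with the picture string following PIC or PICTURE (optionally followed by IS).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the lexer API.
final class PictureSketch {

    /** Tokenizes one line and returns entries of the form TYPE[text]. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        boolean pictureNext = false; // true when the next chunk is a picture string
        for (String chunk : line.trim().split("\\s+")) {
            if (chunk.isEmpty()) continue;
            String body = chunk;
            String separator = "";
            // A trailing comma, period or semicolon is treated as a separator.
            char last = chunk.charAt(chunk.length() - 1);
            if (last == '.' || last == ',' || last == ';') {
                body = chunk.substring(0, chunk.length() - 1);
                separator = String.valueOf(last);
            }
            if (!body.isEmpty()) {
                if (pictureNext && !body.equalsIgnoreCase("IS")) {
                    // Picture string: kept whole, inner parentheses and periods included.
                    tokens.add("WORD[" + body + "]");
                    pictureNext = false;
                } else {
                    splitPunctuation(body, tokens);
                    pictureNext = pictureNext
                            || body.equalsIgnoreCase("PIC")
                            || body.equalsIgnoreCase("PICTURE");
                }
            }
            if (!separator.isEmpty()) tokens.add("SEPARATOR[" + separator + "]");
        }
        return tokens;
    }

    /** Outside a picture string, parentheses and colons become separate tokens. */
    static void splitPunctuation(String text, List<String> out) {
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            String type = c == '(' ? "LEFT_PAREN" : c == ')' ? "RIGHT_PAREN" : c == ':' ? "COLON" : null;
            if (type == null) {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                out.add("WORD[" + word + "]");
                word.setLength(0);
            }
            out.add(type + "[" + c + "]");
        }
        if (word.length() > 0) out.add("WORD[" + word + "]");
    }

    public static void main(String[] args) {
        System.out.println(tokenize("05 PRICE PIC X(9).99."));
        System.out.println(tokenize("MOVE TITLE(1:3) TO OUT."));
    }
}
```

In the first line the picture string stays in one WORD entry, while in the second line the parentheses and the colon of TITLE(1:3) are reported separately.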

Continuation lines

Strings split across two or more physical lines are joined into a single STRING token. You don't need to care about continuation characters in the source code.

The same processing applies to other items split across multiple lines.
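
To show what joining means for a string, here is a rough sketch of the usual fixed-format rule, written independently of the lexer. The class ContinuationJoiner is hypothetical. It assumes double-quoted literals without embedded quotes, exactly one continuation line, a "-" in column 7 of that line, and that the unfinished literal on the first line runs through column 72.

```java
// Hypothetical illustration only; not part of the lexer API.
final class ContinuationJoiner {

    /** Pads or truncates a line to exactly 72 columns. */
    static String toWidth72(String line) {
        StringBuilder sb = new StringBuilder(line.length() > 72 ? line.substring(0, 72) : line);
        while (sb.length() < 72) sb.append(' ');
        return sb.toString();
    }

    /** Joins a literal opened on 'first' and closed on 'continuation'; quotes are kept. */
    static String joinLiteral(String first, String continuation) {
        String head = toWidth72(first);
        String tail = toWidth72(continuation);
        if (tail.charAt(6) != '-') {
            throw new IllegalArgumentException("missing '-' in column 7");
        }
        int open = head.indexOf('"', 7);    // opening quote on the first line
        int reopen = tail.indexOf('"', 7);  // quote that resumes the literal
        int close = reopen < 0 ? -1 : tail.indexOf('"', reopen + 1);
        if (open < 0 || close < 0) {
            throw new IllegalArgumentException("literal not found");
        }
        // First part runs to column 72; the rest starts right after the resuming quote.
        return head.substring(open) + tail.substring(reopen + 1, close + 1);
    }

    public static void main(String[] args) {
        String first = "000100     DISPLAY \"ABC";
        String next = "000200-    \"DEF\".";
        System.out.println("[" + joinLiteral(first, next) + "]");
    }
}
```

The spaces between the end of the first line and column 72 become part of the joined value in this sketch, which is why source code normally fills the first line up to column 72 before continuing.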

Debug lines

Debug lines are parsed only if you explicitly ask for it. You must indicate which debug lines you want parsed (see {@link jcobol.lexer.CobolLexer#getTokens(char...) getTokens(final char... debug)}).
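
One plausible reading of the debug parameter is a set of indicator characters that select which debug lines are parsed. The sketch below illustrates that reading only. It is an assumption, not the documented behavior of getTokens: the class DebugLineFilter is hypothetical, and it assumes fixed format, where the indicator sits in column 7 and "D" conventionally marks a debugging line.

```java
// Hypothetical illustration only; the real selection logic lives inside CobolLexer.
final class DebugLineFilter {

    /** Returns the indicator character (column 7), or a space for shorter lines. */
    static char indicator(String line) {
        return line.length() >= 7 ? line.charAt(6) : ' ';
    }

    /** True when the line's indicator matches one of the requested characters. */
    static boolean isSelectedDebugLine(String line, char... debug) {
        char ind = Character.toUpperCase(indicator(line));
        for (char d : debug) {
            if (Character.toUpperCase(d) == ind) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String debugLine = "000100D    DISPLAY \"TRACE\".";
        String normalLine = "000200     MOVE A TO B.";
        System.out.println(isSelectedDebugLine(debugLine, 'D'));  // requested
        System.out.println(isSelectedDebugLine(debugLine));       // nothing requested
        System.out.println(isSelectedDebugLine(normalLine, 'D')); // not a debug line
    }
}
```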

Tab treatment

Tabs are expanded to spaces before parsing.

For fixed format, tab stops are set at columns 7, 12, 20, 28, 36, 42, 50, 58, 64, 72.

For free format, tab stops are set at columns 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89.
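
The expansion can be sketched as follows. This is a minimal illustration rather than the lexer's own code: TabExpander and expandTabs are hypothetical names, the listed numbers are assumed to be 1-based columns on which the character after a tab lands, and a tab beyond the last stop is assumed to become a single space.

```java
// Hypothetical illustration only; not part of the lexer API.
final class TabExpander {

    // Tab stop columns (1-based) as listed in this document.
    static final int[] FIXED_FORMAT_STOPS = {7, 12, 20, 28, 36, 42, 50, 58, 64, 72};
    static final int[] FREE_FORMAT_STOPS = {9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89};

    /** Replaces each tab with spaces so that the next character lands on the next stop. */
    static String expandTabs(String line, int[] stops) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c != '\t') {
                out.append(c);
                continue;
            }
            int column = out.length() + 1; // 1-based column where the tab sits
            int target = column + 1;       // assumed fallback: a single space
            for (int stop : stops) {
                if (stop > column) {
                    target = stop;
                    break;
                }
            }
            while (out.length() + 1 < target) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String expanded = expandTabs("\tMOVE A\tTO B.", FIXED_FORMAT_STOPS);
        System.out.println("[" + expanded + "]");
    }
}
```

With the fixed-format stops, the leading tab moves MOVE to column 7 and the second tab moves TO to column 20, the next stop after column 13.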