Provides the API for parsing a COBOL source program and converting it into a list of lexical tokens.
To convert a COBOL source program "test.cbl" into a list of lexical tokens:
FileReader reader = new FileReader("test.cbl");
CobolLexer lexer = new CobolLexer(reader);
TokenList tokens = lexer.getTokens();
And that's all.
A lexical token is a logically indivisible element of a COBOL source program.
TokenList contains one CobolToken for each lexical token.
CobolToken gives you the following information about each token:
Each lexical token is given a type (see CobolType).
Most tokens contain text from a single line only. The exception is strings: every string, including a multiline string, is represented as one token.
Parsing applies only to COBOL code proper. Compiler options and the contents of pseudo-text are not parsed; they are stored as found.
TokenList contains a list of lexical tokens, each of one of the following types:
Type | Value | Description |
---|---|---|
STRING | A string | Includes the opening and closing quotation marks. For multiline strings, the content is the final value of the string (no further processing required). |
WORD | A COBOL word | Variable names, reserved words, numeric values, ... (no whitespace inside). |
TEXT | Unparsed text | Used to store comments, compiler options and other information that is not the code itself. |
SEPARATOR | A comma, period or semicolon | These characters when used as separators in code, not as part of a PICTURE format string. |
COLON | A colon | Used for substrings, like TITLE(START:LEN). |
LEFT_PAREN, RIGHT_PAREN | "(", ")" | Left and right parentheses. |
AMPERSAND | "&" | The concatenation operator. |
START_PSEUDO_TEXT | "==" | Signals the start of pseudo-text. |
PSEUDO_TEXT | Free text | The pseudo-text itself, without the initial and final "==". |
END_PSEUDO_TEXT | "==" | Signals the end of pseudo-text. |
NEW_PAGE | "/" | A page break. The text includes the line's content starting from the initial "/". |
SPECIAL_LINE | "$" | A compiler option. The text includes the line's content starting from the initial "$". |
A PICTURE format string, like X(9).99, is always expressed as a single WORD token.
In any other situation, parentheses and periods are expressed as separate tokens.
Strings split across two or more physical lines are joined into a single STRING token. You do not need to handle continuation characters in the source code.
The same processing applies to other items split across multiple lines.
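To illustrate what joining a continued string involves, here is a minimal sketch of the standard fixed-format rule as commonly described: a hyphen in column 7 marks a continuation line, the first line contributes everything up to column 72 (trailing spaces included), and the literal resumes after the first quotation mark on the continuation line. The class and method names are hypothetical, doubled quotation marks inside the literal are not handled, and nothing here is taken from the lexer's actual implementation.

```java
final class ContinuationJoiner {
    /** Returns columns 8..72 of a fixed-format line, padded with spaces to 65 characters. */
    static String codeArea(String line) {
        String area = line.length() > 7 ? line.substring(7, Math.min(line.length(), 72)) : "";
        StringBuilder sb = new StringBuilder(area);
        while (sb.length() < 65) {
            sb.append(' ');
        }
        return sb.toString();
    }

    /** A hyphen in column 7 marks a continuation line. */
    static boolean isContinuation(String line) {
        return line.length() > 6 && line.charAt(6) == '-';
    }

    /** Index of the first single or double quotation mark, or -1 if there is none. */
    static int indexOfQuote(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\'') {
                return i;
            }
        }
        return -1;
    }

    /** Joins a literal left open on the first line with its continuation line. */
    static String joinLiteral(String first, String next) {
        if (!isContinuation(next)) {
            throw new IllegalArgumentException("second line is not a continuation line");
        }
        String head = codeArea(first);
        int open = indexOfQuote(head);
        if (open < 0) {
            throw new IllegalArgumentException("no literal starts on the first line");
        }
        char quote = head.charAt(open);
        String tail = codeArea(next);
        int resume = tail.indexOf(quote);            // quote that reopens the literal
        if (resume < 0) {
            throw new IllegalArgumentException("continuation line does not reopen the literal");
        }
        int close = tail.indexOf(quote, resume + 1); // quote that closes the literal
        if (close < 0) {
            throw new IllegalArgumentException("literal is not closed on the continuation line");
        }
        return head.substring(open) + tail.substring(resume + 1, close + 1);
    }

    public static void main(String[] args) {
        // The literal starts in column 63 and runs exactly to column 72.
        String first = " ".repeat(62) + "\"ABCDEFGHI";
        String next = "      -    \"JKL\".";
        System.out.println(joinLiteral(first, next));
    }
}
```

With the lexer, none of this is needed: the joined value is already the content of the STRING token.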
Debug lines are parsed only if you specifically ask for them. You must indicate which debug lines you want parsed (see {@link jcobol.lexer.CobolLexer#getTokens(char...) getTokens(final char... debug)}).
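As a rough illustration of selecting debug lines by character, the sketch below checks the indicator in column 7 of a fixed-format line against a list of requested characters. The varargs parameter mirrors the getTokens(char...) signature referenced above; the class and method names are hypothetical, and both the column-7 convention and the case-insensitive comparison are assumptions rather than documented lexer behavior.

```java
final class DebugLineFilter {
    /** Returns the character in column 7, or a space if the line is shorter. */
    static char indicator(String line) {
        return line.length() >= 7 ? line.charAt(6) : ' ';
    }

    /**
     * Returns true when the column-7 indicator matches one of the requested
     * debug characters. Assumption: the comparison ignores case.
     */
    static boolean isRequestedDebugLine(String line, char... debug) {
        char ind = Character.toUpperCase(indicator(line));
        for (char d : debug) {
            if (Character.toUpperCase(d) == ind) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String debugLine = "000100D    DISPLAY 'X'.";
        System.out.println(isRequestedDebugLine(debugLine));      // no characters requested
        System.out.println(isRequestedDebugLine(debugLine, 'D')); // 'D' lines requested
    }
}
```

With the lexer itself, the corresponding request would presumably be a call such as lexer.getTokens('D').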
Tabs are expanded to spaces before parsing.
For fixed format, tab stops are at columns 7, 12, 20, 28, 36, 42, 50, 58, 64, 72.
For free format, tab stops are at columns 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89.
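The expansion can be sketched as follows, using the tab stops listed above. This is a minimal illustration and not the lexer's own code: the class name is hypothetical, columns are taken to be 1-based, a tab is assumed to advance so that the next character lands on the next stop, and a tab past the last stop is assumed to become a single space.

```java
final class TabExpander {
    // Tab stops quoted in the text above (1-based columns).
    static final int[] FIXED_STOPS = {7, 12, 20, 28, 36, 42, 50, 58, 64, 72};
    static final int[] FREE_STOPS = {9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89};

    /** Replaces each tab with spaces so that the following character lands on the next stop. */
    static String expand(String line, int[] stops) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c != '\t') {
                out.append(c);
                continue;
            }
            int column = out.length() + 1; // 1-based column where the tab sits
            int target = column + 1;       // assumption: one space past the last stop
            for (int stop : stops) {
                if (stop > column) {
                    target = stop;
                    break;
                }
            }
            while (out.length() + 1 < target) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The tab sits in column 4, so "X" moves to column 7.
        System.out.println("[" + expand("ABC\tX", FIXED_STOPS) + "]"); // prints "[ABC   X]"
    }
}
```

Under these assumptions, two leading tabs in fixed format place the following text in column 12, and one leading tab in free format places it in column 9.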