Description:
COBOL source files that contain a Ctrl+Z character (ASCII 26, SUB) as an EOF marker cause cobc to enter an infinite loop during compilation on Windows.
These files originate from the book:
"SAMS - Teach Yourself COBOL in 24 Hours"
which includes example source code on a CD-ROM.
The files use Ctrl+Z as an EOF marker, which was common in DOS and CP/M environments.
Observed behavior:
When compiling such a file, cobc does not terminate and appears to hang indefinitely.
The process must be manually interrupted (e.g., via Ctrl+C).
Tested environments:
1) MSYS2 with self-compiled GnuCOBOL 3.3-dev
2) SuperBOL AIO package (June 2024)
3) Arnold Trembley GnuCOBOL 3.2 build (August 2023)
Expected behavior:
cobc should correctly handle Ctrl+Z EOF markers (possibly issuing a warning) and terminate compilation normally without entering an infinite loop.
Additional information:
Hm, recent cobc on GNU/Linux says:
The COBOL compilers on godbolt seem to simply ignore the data as "in column 1-6 -> ignored". The result is identical on GNU/Linux if the file is first converted to Unix LF line endings; if it is then converted back, there is no issue with Windows compilers either...
But compiling directly from the file as-is with native Windows builds runs into that error - both with old and new versions of cobc.
Checking further: this applies only to the final processing ->
cobc -E -o hello.i hello.cob does not lead to that error but a following cobc hello.i does.... and because of the "special" debug handling (you can't interrupt a MinGW / DWARF generated binary and see something reasonable [not with GDB, LLDB seems to only support minimal DWARF...] so can't just go "up" to see where the issue is)
I think that:
I'll have a further look.
Thank you very much.
Additional finding with Hex Editor-Plugin (VSCode) after converting/reconverting:
The issue seems to occur specifically when Ctrl+Z (0x1A) follows a CR (0x0D) without a trailing LF (0x0A).
Working file ending:
CR LF SUB
Failing file ending:
CR SUB
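To reproduce the two endings without a hex editor, a small script can build both variants and report how each one ends. This is only a minimal sketch: the sample program text, the file names and the helper names (build_source, describe_ending, write_variants) are made up for illustration and are not taken from the book's CD-ROM.

```python
import os

SUB = b"\x1a"  # Ctrl+Z, ASCII 26

# Hypothetical minimal program; lines are joined with CR LF as on DOS.
LINES = [
    b"       IDENTIFICATION DIVISION.",
    b"       PROGRAM-ID. HELLO.",
    b"       PROCEDURE DIVISION.",
    b"           DISPLAY 'HELLO'.",
    b"           STOP RUN.",
]


def build_source(tail: bytes) -> bytes:
    """Return the program text followed by the given trailing bytes."""
    return b"\r\n".join(LINES) + tail


def describe_ending(data: bytes) -> str:
    """Classify the last bytes of a source file."""
    if data.endswith(b"\r\n" + SUB):
        return "CR LF SUB"
    if data.endswith(b"\r" + SUB):
        return "CR SUB"
    if data.endswith(SUB):
        return "SUB"
    return "no SUB"


working = build_source(b"\r\n" + SUB)  # ending reported as working
failing = build_source(b"\r" + SUB)    # ending reported as failing


def write_variants(directory: str = ".") -> None:
    """Write both variants in binary mode so no byte gets translated."""
    for name, data in (("ok.cob", working), ("bad.cob", failing)):
        with open(os.path.join(directory, name), "wb") as fh:
            fh.write(data)
```

Calling write_variants() leaves ok.cob and bad.cob in the current directory, which can then be passed to cobc by hand on a native Windows build.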
Last edit: Michael Del Solio 2026-03-23
The difference between the preparser and the parser lies in their scanners: the preparser's scanner reads single characters from the stream (getc), builds up a buffer and handles 0x1a on its own, while the parser's scanner reads until a newline or until the buffer is full (using fgets). That buffer has 32k, a limit I think may only be reached for internal directives (like reserved word specifications or source/line references), as the preparser's scanner buffer has a much smaller limit.
And if fgets on Windows sees 0x1a it returns 0x00; in the case of an unexpected symbol (here: a spare 0x0d) we read until the end of the word (newline or EOF) - and never reach that, as 0x00 was not explicitly checked (this needs to be done because of fgets and 0x1a = 0x00 on Windows).
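The hang described here can be modelled in a few lines. The sketch below is not cobc's scanner code; it is a Python model built on the assumption stated above, namely that the reader hands back 0x00 where the 0x1a stood and never reports EOF afterwards. The names char_source, skip_word and check_nul are hypothetical, and max_steps exists only so the model can report a hang instead of actually hanging.

```python
import itertools

NEWLINE = 0x0A
EOF = -1


def char_source(buffer: bytes):
    """Yield the buffered bytes, then 0x00 forever.

    Assumption for this model: once 0x1a has been turned into 0x00,
    the reader keeps handing back 0x00 and never signals EOF.
    """
    return itertools.chain(buffer, itertools.repeat(0x00))


def skip_word(buffer: bytes, check_nul: bool, max_steps: int = 1000) -> bool:
    """Skip to the end of the current word; return True if the loop ended."""
    source = char_source(buffer)
    for _ in range(max_steps):
        c = next(source)
        if c == NEWLINE or c == EOF:
            return True  # regular terminators
        if check_nul and c == 0x00:
            return True  # the explicit 0x00 check
    return False  # gave up: the unbounded loop would spin forever
```

In this model skip_word(b"\r", check_nul=False) never finds a terminator, while the same call with check_nul=True stops at the first 0x00.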
If we ever would want to read past 0x1a we'd need to change the parser's scanner to read byte-wise from the stream and create the buffer - similar to what we do with the preparser.
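A byte-wise reader of that kind could look roughly like the following. It only sketches the idea in Python rather than in the C that cobc is written in; iter_lines, the 32k default limit and the decision to treat 0x1a as end of input are assumptions for illustration. A reader that wanted to continue past 0x1a would change just that one branch.

```python
import io

SUB = 0x1A


def iter_lines(stream, limit: int = 32 * 1024):
    """Yield lines read one byte at a time from a binary stream.

    0x1a is handled here, so it never reaches the caller's buffer.
    """
    buf = bytearray()
    while True:
        b = stream.read(1)
        if not b or b[0] == SUB:
            # Real EOF or Ctrl+Z: flush what was collected and stop.
            if buf:
                yield bytes(buf)
            return
        buf += b
        if b == b"\n" or len(buf) >= limit:
            # Line complete (or buffer full): hand it over and start anew.
            yield bytes(buf)
            buf.clear()
```

For example, list(iter_lines(io.BytesIO(data))) collects all lines of an in-memory source; a file opened with open(name, "rb") works the same way.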
... the sole reason that the scanner has seen it is that the explicit handling of 0x1a was not used because it was placed after a catch-all rule (it needs to be up-front: both match a single character, and for matches of the same size the position in the scanner definition decides the order), so I've fixed that as well.
I'll check all those adjustments when I get back to GC, and I need to think about how best to add that to the testsuite (most likely a normal source + printf on the command line, if we already use that in other places)...