Hello.
Recently I did many Profile-Guided Optimization (PGO) benchmarks on multiple projects (including static analysis tools and compilers like Rustc, Clang, Clangd, Clang Tidy, and many others) - the results are available here. I think it's worth trying to apply PGO to the GnuCobol tooling ecosystem like the compiler (and maybe something else like code formatters, etc.).
I can suggest the following things to do:
Here are some examples of how PGO is already integrated into other projects' build scripts:
configure script
Some PGO documentation examples in various projects:
Regarding other optimizations, I recommend enabling Link-Time Optimization (LTO); it is usually easy to implement. After PGO, I suggest evaluating LLVM BOLT as an additional Post-Link Optimization (PLO) step. But I suggest starting with PGO, since it is a more stable optimization than PLO in the general case. You can read more about LLVM BOLT use cases at https://github.com/zamazan4ik/awesome-pgo .
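For readers unfamiliar with the workflow, the usual GCC-based PGO cycle (instrument, train, rebuild) could be sketched roughly as below. This is only an illustration under assumptions: the PROFILE_DIR location and the run/pgo_* helper names are hypothetical, the training workload is simply make checkall as proposed later in this thread, and whether libtool passes these flags through unchanged has not been verified here.

```shell
#!/bin/sh
# Sketch of a three-phase PGO build for an autotools project using GCC.
# PROFILE_DIR and all function names below are hypothetical.

PROFILE_DIR="${PROFILE_DIR:-/tmp/pgo-profiles}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Phase 1: instrumented build; the binaries write profile data on exit.
pgo_instrument() {
    run ./configure CFLAGS="-O2 -fprofile-generate=$PROFILE_DIR" \
                    LDFLAGS="-fprofile-generate=$PROFILE_DIR"
    run make
}

# Phase 2: run a representative workload to collect the profiles.
pgo_train() {
    run make checkall
}

# Phase 3: rebuild, letting GCC use the collected profiles.
pgo_optimize() {
    run make clean
    run ./configure CFLAGS="-O2 -fprofile-use=$PROFILE_DIR -fprofile-correction"
    run make
}
```

Calling pgo_instrument, pgo_train and pgo_optimize in that order from the source tree would perform one full cycle; setting DRY_RUN=1 first only prints the commands.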
Thank you for the suggestion and for taking the time to do this summary writeup.
For LTO: that could be useful for cobc and libcob, not for generated modules, could it?
Can you please do two countings with make checkall and share the comparison, along with the flags used in configure?
I guess it would be most useful to check during configure if it is available and, in that case, auto-add it to COBC_LIBS/COBC_CFLAGS/LIBCOB_LIBS/LIBCOB_CFLAGS if not explicitly disabled; do you agree? This should be quite easy to check and implement, so it could get into this year's release, too.
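The configure-time check described above might look roughly like the following plain shell sketch. It is not GnuCOBOL's actual configure logic (a real check would more likely live in configure.ac); the function names and the DISABLE_LTO variable are hypothetical, and only -flto itself is a real GCC/Clang option.

```shell
#!/bin/sh
# Sketch: detect whether the C compiler accepts -flto, unless disabled.
# probe_cflag, lto_flags and DISABLE_LTO are hypothetical names.

# probe_cflag FLAG: succeed if $CC can compile and link a trivial
# program with FLAG.
probe_cflag() {
    flag="$1"
    tmpdir=$(mktemp -d) || return 1
    printf 'int main(void) { return 0; }\n' > "$tmpdir/conftest.c"
    if ${CC:-cc} $flag -o "$tmpdir/conftest" "$tmpdir/conftest.c" \
            >/dev/null 2>&1; then
        status=0
    else
        status=1
    fi
    rm -rf "$tmpdir"
    return $status
}

# lto_flags: print "-flto" if it is usable and not explicitly disabled,
# otherwise print nothing.
lto_flags() {
    if [ "${DISABLE_LTO:-no}" = "yes" ]; then
        return 0
    fi
    if probe_cflag -flto; then
        echo "-flto"
    fi
}
```

The output of lto_flags could then be appended to variables such as COBC_CFLAGS or LIBCOB_CFLAGS; how those are actually assembled in the build system is not shown here.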
I do think it is reasonable to have PGO as a test project.
I don't expect much difference in the compiler, at least not for common workloads, as in this case the C compiler that is called takes most of the CPU time, and strcasecmp accounts for most of the time during parsing within cobc. It likely provides some benefit for mini sources with -fsyntax-only, but that's so special that people can tune this themselves.
It likely doesn't make much difference for a generated program, as nearly all time (outside of memset/memcmp) is spent in libcob.
But applying that to libcob could indeed make a difference. A test can prove that easily by gathering the instruction counts during make checkall (we can easily check this by just seeing an environment variable).
Can you please test that after taking LTO? I'd be available to help by mail or Matrix, or in this ticket.
I will try to perform these tests when I get free time. Thanks a lot for a test scenario suggestion!
By the way, since GnuCobol implements an Ahead-of-Time compilation model, is it possible to compile a Cobol program with PGO? Are there PGO-related compiler switches in GnuCobol? Have you maybe heard about any Cobol compilers with PGO support?
If you manually specify the needed options via -A "compile options here" -Q "link options here", then yes. But as noted, I don't think there will be much benefit, as the "real work" is nearly completely done in libcob, and the best optimizations you commonly get are with a recent GCC and passing CFLAGS="-O2 -march=native -mtune=native -g" to GnuCOBOL's configure command (all but -O2 will be included in cobc calling the C compiler).
As noted, it would be quite useful to have this comparison for LTO, so I'll let it run during the day and report the results back.
LTO check on up-to-date Debian, gcc (Debian 10.2.1-6) 10.2.1 20210110 + GNU assembler (GNU Binutils for Debian) 2.35.2, with an AMD Ryzen 7 3700X 8-Core Processor (I've used the options mentioned in the Debian LTO status page).
So in this setup there isn't any improvement with native march/mtune (there may be better specific values), and the LTO options used bring around a 0.05% decrease in CPU instructions and CPU time.
Using newer versions of GCC/binutils will likely bring different timing differences. Using a different OS or compiler may lead to different results as well.
Note that the only parts that were traced are the actual COBOL runs, and that optimization was only applied to the runtime (and to cobc, but this was not traced). It was not applied to the COBOL modules, so as not to increase the time the called compiler spends (if you have huge COBOL sources after all preparsers and preprocessing are done, which is not uncommon, then you may already have to wait for minutes).
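The percentages quoted in this comparison are relative changes between a baseline run and an optimized run. As a small sketch of that arithmetic (the helper name pct_change is hypothetical, and the numbers in the usage note are made up for illustration, not measurements from this thread):

```shell
#!/bin/sh
# pct_change BASE NEW: print the relative change from BASE to NEW in
# percent, with two decimals. Negative values mean NEW is smaller.
# Hypothetical helper; fails if BASE is zero.
pct_change() {
    awk -v base="$1" -v new="$2" 'BEGIN {
        if (base == 0) exit 1
        printf "%.2f\n", (new - base) / base * 100
    }'
}
```

For example, pct_change 200 199 prints -0.50, meaning the second run needed 0.5% fewer instructions (or less time) than the first.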
Last edit: Simon Sobisch 2023-11-16
Rerun on Rocky 9 with gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4) + GNU assembler version 2.35.2-37.el9 on an Intel(R) Xeon(R) Gold 6242R CPU (same options as above: -O2 -g, then one time with the native and -O2 options in CFLAGS only, one time additionally placing them in LDFLAGS, and one time with LTO).
The LTO version is the one that takes the most time; native brings a 1% improvement.
Last edit: Simon Sobisch 2023-11-16
@zamazan4ik I think LTO is out given my tests from November?
PGO seems like something that should be inspected - Do you have any plans to do so?
Oh, sorry that I didn't reply in time - was busy with other stuff, my bad.
Yeah, I think your LTO tests are fine. Regarding PGO - yes, I still have such plans to perform PGO tests for GnuCobol but right now I am still busy with other stuff so cannot guarantee any ETA in this case.
Re-mapped to GC v3.X and not Prog guide.
friendly ping @zamazan4ik for possible further checks / status updates