[Rlib-users] OpenCReports 0.1 released

Hi,

This has been brewing for about 3 years now, and I am
happy to announce the first pre-release of OpenCReports,
my take on re-implementing RLIB from scratch.

https://github.com/zboszor/OpenCReports
https://github.com/zboszor/OpenCReports/releases/tag/v0.1

I don't have any ETA for actually finishing it, though.

FYI, the name comes from the fact that it's written in C
and it's developed in the open.

THIS PRE-RELEASE DOESN'T HAVE ANY OUTPUT DRIVER.
AS SUCH, IT'S NOT USEFUL FOR END-USERS YET.

Having said that, it's quite full featured in the
data handling department.

I apologize in advance for the RLIB bashing, but I know
quite a lot about its internals since I am its current
maintainer.

OpenCReports started out as an adventure in Flex and
Bison, mostly because expressions in RLIB used a
home-grown parser that had quite a few bugs. For one, it
was forgiving about syntax errors in corner cases.
E.g. a missing closing parenthesis at the end of the
expression string was allowed.

On the other hand, OpenCReports is not forgiving.
It throws an error in this case, i.e. the expression
result will be an error message.

The grammar code is quite bulletproof, as in it doesn't
leak memory and doesn't have use-after-free bugs.
In general, the code is always compiled with ASAN and
UBSAN during development.

The grammar handles:
* Arithmetic operators, including the famous Facebook
   challenge about implicit multiplication. Implicit
   multiplication binds tighter than explicit
   multiplication and division, so the two expressions
   below are not the same.
   Controversial, but correct in academic environments.

   1/(1+1)(2+2) equals 1/8
   1/(1+1)*(2+2) equals 2

* Binary operators
* Logic operators
* Unary operators
* Function calls
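
The implicit multiplication rule in the first bullet above
can be sketched with a toy recursive descent parser. This is
an illustration in Python, not the Flex/Bison grammar of
OpenCReports. It only handles integers, + - * / and
parentheses, and all function names are hypothetical.

```python
from fractions import Fraction
import re

def evaluate(text):
    """Evaluate +, -, *, / and juxtaposition such as "(1+1)(2+2)"."""
    # Toy tokenizer: anything that is not a digit run or operator is skipped.
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():
        if peek() is None:
            raise SyntaxError("unexpected end of expression")
        tok = take()
        if tok == "(":
            value = additive()
            if peek() != ")":
                raise SyntaxError("missing closing parenthesis")
            take()
            return value
        if tok.isdigit():
            return Fraction(int(tok))
        raise SyntaxError("unexpected token: " + tok)

    def implicit():
        # Juxtaposed factors bind tighter than explicit * and /.
        value = primary()
        while peek() == "(":
            value *= primary()
        return value

    def multiplicative():
        value = implicit()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= implicit()
            else:
                value /= implicit()
        return value

    def additive():
        value = multiplicative()
        while peek() in ("+", "-"):
            if take() == "+":
                value += multiplicative()
            else:
                value -= multiplicative()
        return value

    result = additive()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return result

print(evaluate("1/(1+1)(2+2)"))    # 1/8
print(evaluate("1/(1+1)*(2+2)"))   # 2
```

Like OpenCReports, the sketch is strict: "(1+1" raises an
error instead of being silently accepted.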

One ambiguous operator is "^". By default "x^y" is
"x XOR y" (since I like C operators) but it's selectable
to be pow(x, y) to be more compatible with RLIB.

Expressions can be (and are) optimized after parsing.
This is done to reduce the amount of work during dataset
traversal. Fully constant expressions, no matter how
complex they are, are pre-computed by the optimizer.
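
The idea of pre-computing constant subexpressions can be
sketched as follows. This is a minimal Python illustration
under assumed conventions, not the OpenCReports optimizer:
the tuple-based tree, the fold() name and the "q.price"
field name are all hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    """Return an equivalent tree with constant subtrees pre-computed.

    A node is a number (constant), a string (a field reference that is
    only known during dataset traversal) or a tuple (op, left, right).
    """
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)   # both sides constant: compute once
    return (op, left, right)

tree = ("+", ("*", 2, ("+", 3, 4)), "q.price")
print(fold(tree))   # ('+', 14, 'q.price')
```

After folding, only the addition with the field value is left
to be evaluated for every row.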

There are four data types in OpenCReports: string,
error, number and datetime.

Strings are UTF-8 through-and-through.

Errors are actually strings behind the scenes; they just
contain an error message. But if they are used in other
expressions, the error message and error type are
propagated upward to the parent expression.
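
Error propagation of this kind can be sketched as below.
This is a Python illustration only. The ReportError class
and the apply() helper are hypothetical names and do not
reflect how OpenCReports represents errors internally.

```python
class ReportError:
    """An error value: a message that travels through expressions."""
    def __init__(self, message):
        self.message = message

def apply(func, *args):
    """Call func unless one of the arguments is already an error."""
    for arg in args:
        if isinstance(arg, ReportError):
            return arg                 # propagate the error upward
    try:
        return func(*args)
    except (ArithmeticError, TypeError, ValueError) as exc:
        return ReportError(str(exc))   # the result becomes an error value

inner = apply(lambda a, b: a / b, 1, 0)       # division by zero
outer = apply(lambda a, b: a + b, inner, 5)   # parent expression
print(outer.message)
```

The parent expression never runs its own operation. It simply
hands the child's error on as its own result.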

In RLIB, numbers were handled as fixed point values stored
in a 64-bit integer with 7 decimal digits. Integers were
multiplied by 10 million and stored in the 64-bit
representation. It had its drawbacks:
* The constant multiplication and division by 10 million
   always rounded down. In some cases, adding small
   percentages that added up to 100.0% on paper didn't add up
   to 100.0% in an RLIB report.
* Relatively small numbers could overflow the 64-bit
   integer when processed further, e.g. in variables.
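
Both drawbacks listed above can be reproduced with a small
fixed-point model. This is a Python sketch that assumes a
naive multiply-then-rescale scheme. Whether RLIB's arithmetic
had exactly this shape is an assumption, and the helper names
are hypothetical. Python integers don't overflow, so the
64-bit limit is checked explicitly.

```python
SCALE = 10_000_000        # 7 decimal digits, as described above
INT64_MAX = 2**63 - 1

def to_fixed(n):
    """Store an integer as a fixed-point value."""
    return n * SCALE

def fixed_div(a, b):
    # Rescale, then integer-divide: the result is rounded down.
    return a * SCALE // b

def fixed_mul(a, b):
    product = a * b       # intermediate value before rescaling
    if product > INT64_MAX:
        raise OverflowError("intermediate product exceeds 64 bits")
    return product // SCALE

# Three equal shares of 100 percent no longer add up to 100.
share = fixed_div(to_fixed(100), to_fixed(3))
print(share * 3, to_fixed(100))   # 999999999 1000000000

# 5000 * 5000 is a small number, yet the intermediate overflows.
try:
    fixed_mul(to_fixed(5000), to_fixed(5000))
except OverflowError as exc:
    print(exc)
```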

On the other hand, numbers are handled by MPFR in
OpenCReports. The precision is selectable but by default
it's 256 bits. Since there is no constant adjustment for
the fixed precision and there is always surplus precision,
processing numbers doesn't suffer from the same bugs as RLIB.

Using MPFR is certainly slower than using 64-bit storage
and fixed precision, but RLIB doesn't have an expression
optimizer, and the optimizer in OpenCReports already
recovers most of the speed loss. The fact that the results
are actually numerically correct is worth the change.

Datetime is four data types in one:
* datetime (timestamp) with valid date and time
* date
* time
* interval

RLIB separated parsing these into different functions.
In OpenCReports, all of them are aliases to stodt().

There is also a separate interval() function to parse or
create an interval value.

Expressions may be "delayed", i.e. their result will show
the value of the expression in the last row of the dataset.
This is also a feature of RLIB.

All values may be NULL.

Data traversal is done a little differently.
E.g. RLIB needs to go back one record in the dataset to
detect breaks. Some data sources don't allow going backwards
but allow restarting the dataset from the first row.
Because of this, RLIB needed to cache all the rows regardless
of the data source, be it PostgreSQL, MySQL or ODBC.

On the other hand, OpenCReports separated the datasource
from the row traversal in a way that the dataset pointer
doesn't need to go backward. OpenCReports caches the last
2 rows from the dataset with one row lookahead to detect the
end. This allows OpenCReports to avoid extra caching of rows.
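
A forward-only traversal with a small row window can be
sketched like this. It is a Python illustration, not the
OpenCReports implementation. The traverse() name, the tuple
layout of rows and the exact window (previous row, current
row, one lookahead row) are assumptions.

```python
def traverse(rows, key):
    """Yield (row, is_break, is_last) from a forward-only iterator.

    Only the previous row, the current row and one lookahead row are
    kept, so the source is never rewound and never fully cached.
    """
    it = iter(rows)
    sentinel = object()
    previous = sentinel
    current = next(it, sentinel)
    while current is not sentinel:
        lookahead = next(it, sentinel)
        # A break starts on the first row or when the key changes.
        is_break = previous is sentinel or key(previous) != key(current)
        yield current, is_break, lookahead is sentinel
        previous, current = current, lookahead

rows = iter([("a", 1), ("a", 2), ("b", 3)])   # can only move forward
for row, is_break, is_last in traverse(rows, key=lambda r: r[0]):
    print(row, is_break, is_last)
```

The break is detected by comparing with the remembered
previous row, and the end is detected by the lookahead, so
stepping back in the data source is never needed.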

According to the original developers of RLIB, the follower
queries should work like this:
* 1:1 followers are laid out side by side (record by record)
   along with the main query. The dataset lasts while the
   main query lasts, the 1:1 followers are either cut if they
   contain more rows, or their fields are empty (NULL) if
   they contain fewer rows than the main query.
* N:1 followers should work exactly like LEFT OUTER JOIN in SQL

The RLIB implementation of N:1 follower queries is not correct
and doesn't produce the same result as a LEFT OUTER JOIN.
It's fixed in OpenCReports.
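
The two follower semantics described above can be sketched
as below. This is a Python illustration on in-memory lists
with hypothetical function names, not the OpenCReports API.

```python
def follow_one_to_one(main, follower):
    """Pair rows by position; pad a short follower with None (NULL)
    and cut a follower that is longer than the main query."""
    follower = list(follower)
    return [
        (row, follower[i] if i < len(follower) else None)
        for i, row in enumerate(main)
    ]

def follow_n_to_one(main, follower, main_key, follower_key):
    """LEFT OUTER JOIN semantics: every main row appears at least once.

    follower must be a list because it is scanned once per main row.
    """
    result = []
    for row in main:
        matches = [f for f in follower if follower_key(f) == main_key(row)]
        if matches:
            result.extend((row, f) for f in matches)
        else:
            result.append((row, None))   # no match: follower is NULL
    return result

main = [1, 2, 3]
print(follow_one_to_one(main, ["x", "y"]))
# [(1, 'x'), (2, 'y'), (3, None)]
print(follow_n_to_one(main, [(1, "a"), (1, "b"), (3, "c")],
                      main_key=lambda m: m,
                      follower_key=lambda f: f[0]))
# [(1, (1, 'a')), (1, (1, 'b')), (2, None), (3, (3, 'c'))]
```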

Breaks are implemented in OpenCReports.

All of the RLIB variable types (and more) are implemented
in OpenCReports.

In RLIB, variables are special entities.

In OpenCReports, they reuse expression handling with a
twist: recursive expressions were added exactly for
satisfying variables.

But recursive expressions (referencing "r.self") are an
integral part of expression handling in OpenCReports and
can be used by user expressions. In fact, it's on my TODO
list to allow creating custom variables by specifying
the base type, base expression, initial value, two
intermediate expressions and the result expression.

OpenCReports supports all the basic variable types of RLIB:
count, expression, sum, average, lowest and highest.

There are some variable variants with or without ignoring
NULLs from the dataset. These are: "countall" and "averageall".
When NULLs are not ignored, rows are counted and NULLs are
replaced with 0 when averaging.
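
The difference between the NULL-ignoring variants and the
"all" variants can be sketched as follows. This is a Python
illustration with None standing in for NULL. The behaviour
for an empty dataset is an assumption of the sketch.

```python
def count(values):
    """Count only the non-NULL values."""
    return sum(1 for v in values if v is not None)

def countall(values):
    """Count every row, NULL or not."""
    return len(values)

def average(values):
    """Average of the non-NULL values; NULLs are ignored."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

def averageall(values):
    """Average over every row; NULLs are replaced with 0."""
    if not values:
        return None
    return sum(0 if v is None else v for v in values) / len(values)

column = [10, None, 20]
print(count(column), countall(column))      # 2 3
print(average(column), averageall(column))  # 15.0 10.0
```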

Variables may have a "resetonbreak" setting, like in RLIB.

Variables may also be "precalculated", like in RLIB.
If they have a resetonbreak setting, they will show the value
of the last row in the break. Without resetonbreak, they will
show the value of the last row in the dataset.

The dataset is processed twice if there are delayed
expressions or precalculated variables.
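
Why two passes are needed can be sketched with a
precalculated sum that resets on a break. This is a Python
illustration on an in-memory list, with hypothetical names,
not the OpenCReports implementation.

```python
from itertools import groupby

def precalculated_sums(rows, break_key, value):
    """Pair every row with the final sum of its own break.

    rows must be a list because it is walked twice.
    """
    # First pass: final sum for each run of consecutive rows
    # that share the same break key.
    totals = [sum(value(r) for r in group)
              for _, group in groupby(rows, break_key)]
    # Second pass: every row shows the total of its break,
    # i.e. the value the variable has on the break's last row.
    result = []
    for total, (_, group) in zip(totals, groupby(rows, break_key)):
        result.extend((row, total) for row in group)
    return result

rows = [("a", 1), ("a", 2), ("b", 5)]
for row, total in precalculated_sums(rows, lambda r: r[0], lambda r: r[1]):
    print(row, total)
# ('a', 1) 3
# ('a', 2) 3
# ('b', 5) 5
```

The first row of a break can only show the break's final
value if the whole break was already seen once.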

OpenCReports allows mixing delayed, non-delayed subexpressions
and precalculated variables in the same expression.
AFAIK, this was not possible in RLIB.

Almost all of the RLIB functions are implemented in
OpenCReports. The two missing ones are format() and dtosf().
Many other functions supported by MPFR are also implemented.

The C API of OpenCReports is extensive.
There are quite a few unit tests that exercise certain
aspects of the API.

There is initial documentation in SGML from which
a PDF is generated during the build. It's far from
complete and it doesn't even cover the current state of
the code.

The original XML DTD did not cover everything that was
possible with RLIB's report XML. I reconstructed it from
the source code, extended it with everything RLIB actually
supports, and made some new additions. E.g. "delayed" and
"precalculate" are now aliases in variables.

Currently, OpenCReports only handles the XML tags related
to the report data processing described above. The output
related ones, i.e. <Output>, <Detail>, <NoData> are not
handled.

There is one extension to the RLIB DTD. If the report XML's
top node is <OpenCReport> then further XML nodes are available:
<Datasources> and <Queries>. This will allow describing
practically everything in XML with minimal programming.

An RLIB wrapper is on my TODO list.

As I described above, OpenCReports isn't and won't be
bug-for-bug compatible with RLIB.

Comments are welcome.

Best regards,
Zoltán Böszörményi