From: Gerrit E.G. H. <Ger...@be...> - 2007-08-06 19:59:24
L.S., I've been thinking a bit about this binary/native file data stuff and had an idea I'd like to discuss.

First off, I'm pretty sure by now that a 'native binary format' for the storage is a Good Thing(tm), as it makes CRM act _fast_. (And that's precisely what I need.) Why is it 'good'? Because crm uses a very fast mechanism to access the files, called memory mapping (mmap(3) or the Win32 equivalent). Having crm translate/convert those bytes to/from 'native format' would either slow down the memory-mapped file access (each read/write would have to convert int/double formats when mmap is used) or we'd have to throw memory mapping out the window, i.e. load every file into memory, convert once on load and once on save (for learn), causing an increase in both startup and memory usage overhead. Both these methods, aimed at providing a 'portable binary file format', would have a serious impact on crm performance IMHO. (And it just might make the code a bit less legible too, but that can be handled in other ways.)

Then another thought struck me: while scanning through my (rather short) archive of the mailing list, I've seen a few questions about the 'portability' of these files across multiple CRM versions. (e.g. the 64-bit discussion with regard to a new Debian build, IIRC) I don't have a proper solution for resolving the latter, but we might consider ways to ensure crm binary formats are 'forwards portable' across platforms _and_ versions/releases. (forwards portable = anything before this moment may be 'broken', but from now on we can offer portability)

The idea is this (it has been done in a minor way before within crm):

- start each file with a versioned header (I'll come back to that later)
- the rest of the file stays as it is right now: the hashes and all the other goodies, stored in a native format (int, double, etc.) for _fast_ access/use every time.
The way to provide the forward portability would be through an export/import mechanism (one already exists for a few formats: cssdump) which can be used to export binary files to a portable (preferably text?) format, which can then be imported into another crm: different version and/or platform.

A bit of a snag there: I've found that the current way of calculating the hashes is not portable, i.e. it does not deliver the exact same values on different platforms. That has to be fixed before 'cross-platform' would be feasible. (As I don't see a 'hash converter' happening: hashes are essentially a one-way trip, so there's no way to convert hashes that were calculated from slightly different bit patterns.) I'm not sure fixing the strnhash() code to calculate the same hash bit pattern on every platform is quite enough to ensure cross-platform compatibility, though: the data in the files is sometimes derived from the hashes, which would imply those 'derivation paths' would also have to be bit-compatible across platforms. (I'm not sure; I have to check again whether we store derived data next to config params and hash series.)

Let's assume however that we've got the hashes/import/export thing covered. And even if we don't, we would at least be well aware of the limitations.

That leaves the 'versioned header' I mentioned. The versioned header should contain enough information for an export/import function to operate correctly: (a) import all acceptable data, or (b) report the incompatibility and hence the need to 'recreate/relearn' the files. Especially (b) is important, as that would enable (automated) upgrades to properly interact with the users: one would then be able to select whether to commence with an 'incompatible' upgrade or delay it until a later, opportune moment.

To make it all cross compatible, the header would need to include these bits of info:

- actual crm version
- the classifier used
- the platform (little/big endian, 32/64/... bits, etc.)
- data type (just to make sure we're not messing with the wrong sort of stuff)

so the im/exporter can handle possible data content/format changes and/or data compatibility checks. The header has to be in a cross-platform portable format so that every crm toolset out there can recognize copied/moved files and act accordingly: report or process the data. Given the 'memory mapped' file access approach of crm, this also means that the header has a fixed length (in bytes). The software can then easily jump over this header to get at the data itself.

Conclusion
----------

I suggest the introduction of a cross-platform binary format, fixed-width header for all crm binary files (css et al) to provide (at least) forward compatibility. For the remainder, all these files stay as they are: data will be stored in a native format for fast access.

The binary format header will include (at least) these information items:

- the crm version used to create the file
- the platform (integer size and format (endianness), floating point size and format, structure alignment, etc.)
- the classifier used to create the file
- the data content type (some classifiers use multiple files)
- space for future expansion (this is a research tool too: allow folks to add their own stuff which may not fit the header items above)

The approach includes an export/import tool to convert the data to/from a cross-platform portable text format, where applicable. At the very least, the im/exporter will then be able to accurately predict whether you can 'migrate' your data or need to start with a clean slate. This also depends on the sophistication of the im/exporter, of course, but the minimum benefit is a quick yet accurate decision on the 'migratability' of your data. Which, I believe, doesn't exist now, unless you find 'better always rebuild from scratch' a good choice for every major/minor software upgrade. ;-)

What are your thoughts on this matter?
Is this worth pursuing (and hence augmenting the code to support such a header from now on) or is this, well...

Best regards,

Ger
From: Paolo <oo...@us...> - 2007-08-06 23:31:45
On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:

> - start each file with a versioned header (I'll come back to that later)

that's well established for Fidelis' OSBF

> The way to provide the forward portability would be through providing an
> export/import mechanism (already exists for a few formats: cssdump)

...

> The versioned header should contain enough information for an
> export/import function to operate correctly:
> a) import all acceptable data, or

there's a catch, as the original arch on which to do the export 1st might not be avail anymore ...

> b) report the incompatibility and hence the need to 'recreate/relearn'
> the files.

... and b) might not always be an option.

> Especially (b) is important as that'd enable (automated) upgrades to
> properly interact with the users: one would then be able to select

yep, but I'd consider it a bug (which might be just a TODO) to have a conversion util/function which is unable to properly convert our own stuff from arch1 to arch2, both ways, whatever arch* are. Such converters won't be exactly trivial (byte swapping, aligning, padding, etc.) but feasible.

> The binary format header will include these information items (at least):
>
> - the crm version used to create the file
> - the platform (integer size and format (endianness), floating point size
>   and format, structure alignment, etc.)
> - the classifier used to create the file
> - the data content type (some classifiers use multiple files)
> - space for future expansion (this is a research tool too: allow folks
>   to add their own stuff which may not fit the header items above)

+ file-format version and, since there'll be plenty of space, a plain-text file-format blurb and summary file stats, so that head -x css would be just fine to report the relevant things.
> The approach includes the existence of an export/import tool to convert
> the data to/from a cross-platform portable text format, where

that's the current CSV inter-format, though the converter should be able to do it at once binary-2-binary.

> What are your thoughts on this matter? Is this worth pursuing (and hence
> augmenting the code to support such a header from now on) or is this,
> well...

for spam filtering, it's easier (and usually better) to start from scratch, but in other applications the hash DBs might be precious stuff, so as people extend crm114 use to other tasks, such a tool might become highly desirable.

--
paolo
From: Gerrit E.G. H. <Ger...@be...> - 2007-08-07 18:37:13
Paolo wrote:
> On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:
>> - start each file with a versioned header (I'll come back to that later)
>
> that's well established for Fidelis' OSBF

I saw. It's just that I'm looking for a rather more generic solution, one which is copy&paste-able when anyone (probably Bill) feels like adding other classifiers to crm114. Say some sort of 'file format / coding practice' thing: rip it off the other classifiers and just add your own classifier constant (so no fancy footwork with index [0] in the data arrays themselves or anything like that).

>> a) import all acceptable data, or
>
> there's a catch, as the original arch on which to do the export 1st might
> not be avail anymore ...

Heh :-) That's where I refer to the legalese in there: 'sorry sir, it's _forward_ compatible as of this release' ;-)

The whole point is that I'm trying to get at a mechanism which clearly identifies the data, both in type and version, so that we can develop a 'sure-fire' and sane conversion. This while keeping in mind that design/devel/test time is a rather limited resource, so the 'management decision' may well turn out to be to forego the availability of a complete 'conversion' for specific versions (and that may include crm file versions predating this versioning mechanism).

Right now, as I see it, you can't provide hard guarantees that conversions will work (and I suspect that, given my goal with crm114, I'll need that sort of thing), as you have several classifiers and software versions, while there's no way to tell them apart in a _guaranteed_ manner: all one can go on is some version info (OSBF et al) and a bit of heuristics. And 'it may work' isn't an option for me when I'm going to employ crm114, so I'd like to be able to _specifically_ test (and thus support) crm software versions and classifiers.
Long-winded paragraphs cut short: I want to end up with a chart which tells me: "You've got crm114 release X and are using classifier C; well, we do support a 'full data transfer' to the current crm114 release." And maybe an additional (sub-)chart which says: "And incidentally, when you have crm114 running on system S, you can also _share_ that classifier's data on system type T using our import/export-based sync system."

These charts have three possible ticks in each cell of their matrices: (a) may work (a.k.a. there's code for this in there) + (b) tested, a.k.a. we got word it works + (c) supported, a.k.a. you may bother us / complain when it isn't working. No tick in your cell on those charts means: you're SOL. Time for a retraining and a ditching of the old files, probably.

This would solve the problem of the everlasting questions: can I keep my files or should I start from scratch? For folks that cannot retrain as they go, this 'charted' approach will provide them with a clear decision chart: can/should I upgrade, or shouldn't I?

>> b) report the incompatibility and hence the need to 'recreate/relearn'
>> the files.
>
> ... and b) might not always be an option.

See above. I'm well aware of that. I'm driving at a mechanism which allows everyone to clearly see when and what can be / has been done. That includes you (J.R. User) helping the crm114 team by adding export/import support for those situations where the chart says 'not available' while you need that sort of thing. That also includes collecting and archiving feedback on [user] test results: did their transfer/upgrade work out OK?

It's added work, but the benefit is that the upgrade process (and the decision to upgrade) can be fully automated in the end, e.g. for unmanned systems: only upgrade when our locally used version + classifier has a tested (and supported?) data migration path towards this new crm114 upgrade release.
> yep, but I'd consider a bug (which might be just a TODO) a conversion
> util/function which is unable to properly convert our own stuff from arch1
> to arch2, both ways, whatever arch* are.
> Such converters won't be exactly trivial (byte swapping, aligning, padding,
> etc) but feasible.

That's where the limited design/devel resourcing comes into play: I don't mind if the 'standard' decision is NOT to support/provide a data conversion path. That's understandable, as we don't have an unlimited supply of dev power. But when we do choose to provide a conversion path, it should be clearly identifiable. (Someone may need it and help Bill, you and the others by putting in the dev effort there, just like I'm reviving the Win32 port and adding error checking and stuff along the way.)

And, BTW, I've written that sort of cross-platform stuff quite often before. It gets a bit wicked when you need to convert VAX/VMS Fortran floating point values to PC/x86 IEEE format, for instance. ;-)) Otherwise, it's just really careful coding and a bit of proper up-front thinking. And then keeping a lookout for register/word-size issues (e.g. 32- vs. 64-bit) throughout the crm implementation, which is the hard part. Padding, endianness, etc. can be handled rather easily: define a 'special struct' with all the basic types in there and load it with a special byte sequence: that gives you endianness and alignment for all basic types. Floating point values need a bit of special treatment when you travel outside the IEEE realm, but that's doable too. Not trivial, though, indeed.

>> The binary format header will include these information items (at least):
>>
>> - the crm version used to create the file
>> - the platform (integer size and format (endianness), floating point size
>>   and format, structure alignment, etc.)
>> - the classifier used to create the file
>> - the data content type (some classifiers use multiple files)
>> - space for future expansion (this is a research tool too: allow folks
>>   to add their own stuff which may not fit the header items above)
>
> +file-format version and, since there'll be plenty of space, plain-text
> file-format blurb and summary file-stats, so that head -x css would be
> just fine to report the relevant things.

Brilliant idea! I hadn't thought about the 'head -x', but I _like_ it. I was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, so, yes, plenty of space for a little informational text up front. A few Kbytes won't hurt.

+file-format: yes. In case we find the format needs to be changed again (hopefully not before 2038 ;-) ). Another very good point.

>> The approach includes the existence of an export/import tool to convert
>> the data to/from a cross-platform portable text format, where
>
> that's the current CSV inter-format, though the converter should be able
> to do it at once binary-2-binary.

I've seen the CSV inter-format and I was thinking about using that. No bin-2-bin direct stuff, as that would complicate matters beyond control: given the 'cross-platform' tack, it would mean that a developer would have to code (and maintain) software which includes a table of file layout definitions, one for each supported platform (and probably each crm release version too).

Compare this to databases: right now I'm in a project where I've found that Oracle cannot copy database files as-is across patch versions (that's the ultra-minor version number), let alone move the binary database files as-is onto different Unix architectures (HPUX vs. Linux, of course with different CPUs too). And that makes sense! The point?
When Oracle DBAs are used to export-dumping and importing databases running into the many-multi-Gigabyte range to provide a migration/upgrade path for the data stored therein, I'd like to do _exactly_ the same. That means: use the CSV format (probably augmented) as an intermediate. (Or XML when I feel like getting fancy and really 21st century ;-) )

I've done direct bin-2-bin conversions in the past, but they're a true support nightmare. It's doable, but you can have someone spend a serious chunk of his/her life on that alone. And when that person quits supporting it, you're SOL as a tool provider, really. (Imagine your customers use a platform which you didn't support just yet. Maybe even a new CPU type. Can your _design_ of the bin2bin handle that? Or do you need to spend a significant amount of devel effort just to add the generation of these new-CPU-type files to your ware?)

The easy way out is to provide all your customers with a single, portable format: they've got the software built on their own machines, and who better than the machine itself to convert to/from that portable format? Thus, the conversion effort is off-loaded to the compiler vendor, who has to cope with it anyway. (sscanf/printf/etc.)

XML is a good example of a solution invented for solving precisely this very issue. (cross-platform, cross-version, cross-X-whatever data transfer) We might even consider using XML as a replacement for the CSV format, though XML tends to be rather, er, obese when it comes to data file sizes. XML is hierarchical, so we can easily store our header info and crm classifier data in there, nicely separated/organized.

> for spam filtering, it's easier (and usually better) to start from scratch,
> but in other applications the hash DBs might be precious stuff, so as people
> extend crm114 use to other tasks, such a tool might become highly desirable.

Yes, indeed. Verily.

<off-topic>
I have looked around at software supporting Bayesian/Markovian/etc.
statistics, and selected crm114 because it looked like it had the right amount of 'vim' (i.e. a lively dev community) while offering a feature set which might cover my needs, or get very close indeed. I intend to use crm114 for spam filtering (combined with xmail) and for a second purpose. I'm not going to disclose what that is exactly, but think of it as a sort of fuzzy decision-making / monitoring process, a bit of a cross-breed between a constraint-driven scheduler and a _learning_ 'fuzzy' discriminator, which has to wade through a slew of 'crap' to arrive at a 'proper' rule or decision.

Here I'm more interested in decision _vectors_ (rather small ones) than _scalars_, but I'll tackle that hurdle when I've got crm114 to a state where I can really dive into the classifiers themselves, because I believe right now it only supports a single output scalar (pR?), but I'm not entirely sure there (lacking sufficient algorithm understanding). Anyway, I guess the 'vector solution' would be to use multiple crm (file) instances in parallel: one pR for each decision item in the output vector. Of course, that's a crude way, so the 'clean' approach I was originally aiming for was to convert crm114 into a library which could be called/used from within my own special-purpose software. Alas, that's not a Q4 2007 target anyway. ;-)

The problem for me is that I need to understand/learn the algorithm internals of this advanced statistics stuff, as it is new to me and I want to understand what it's actually doing, i.e. how this stuff arrives at a decision, as I need to understand the implicit restrictions on the classifiers (and learning methods). Let's just say I don't want to join the masses who can't handle the meaning and implications of 'statistical significance', e.g. by just grabbing a likely classifier and 'slapping it on'. I fear that would cause some serious burn in the long term.
You may have seen from my work so far that I'm a bit paranoid at times ^H^H^H^H^H^H^H acutely aware of failure conditions, and it would be utterly stupid to fall into that beartrap at a systems level by grabbing this tool and applying it to a problem without really understanding where and what the limitations of the various parts are. I've met too many design decisions _not_ to worry. I've got the idea, I have a 'feeling' that this is the right direction, but it's really still just guesswork regarding feasibility so far.

I arrived at crm114 while I had been looking for a decision filter which could easily handle _huge_ inputs for tiny outputs (spam: input = whole emails, output vector size = 1), produce consistent and significant decisions (spam: > 99% filter success rate in a very short learning period) while including a good 'learning' mode: somehow I don't think Bayesian is the bee's knees when it comes to my second goal. And it has been shown it's certainly not the end of the line for spam either. And besides, crm114 isn't written in Perl (or some other interpreted language), which in my world is a big plus. ;-)

I don't mind too much if crm114 doesn't work out for goal #2 (though it would be a serious setback) as there's still the spam filter feature, which is useful to me. So I don't mind spending some time on this baby to push it to a level where I can sit back, have a beer and say "Yeah! Looks good, feels good. Let's do it!"
</off-topic>

Best regards,

Ger
From: Paolo <oo...@us...> - 2007-08-08 08:41:51
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote:
>
> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I
> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway,

these are ideas that floated in ML threads long ago. Note that OSBF makes room for a 4k header.

> I've seen the CSV interformat and I was thinking about using that. No
> bin-2-bin direct stuff, as that would complicate matters beyond control:
...
> I've done direct bin-2-bin conversions in the past, but they're a true
> support nightmare. It's doable, but you can have someone spend a serious

ok, ok - no b2b ;)

> <off-topic>
...
> and a _learning_ 'fuzzy' discriminator, which has to wade through a slew
> of 'crap' to arrive at a 'proper' rule or decision. Here I'm more
> interested in decision _vectors_ (rather small ones) than _scalars_, but
> I'll tackle that hurdle when I've got crm114 to a state where I can
> really dive into the classifiers themselves, because I believe right now
> it only supports single output bits(scalar) (pR?) but I'm not entirely

no; if you put the classes in 2 sets like

  ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...)

you get a scalar (success|fail, but still all pR values). If you instead say

  ! classify (classA_1 classA_2 ... classB_1 classB_2 ...)

(note: no '|'), i.e. run in 'stats-only' mode, you get just the pR vector. I think you can use that for building your fuzzifier, either in CRM or your favourite prog. lang. A tricky point is that pR is normalized, so it cannot be used as a class-membership function as-is; an artifice could be to add a class 'AnythingElse', i.e. the complement to the set of your classes.

> The problem for me is that I need to understand/learn the algorithm

note that not all classifiers work well for N > 2, nor have those that are *supposed* to work been thoroughly tested.
> I've got the idea, I have a 'feeling' that this is the right direction,
> but it's really still just guesswork regarding feasibility so far.

well, crm114 is a jit engine + classifiers plugged in (bolted in, at present). The whole thing about pR is how you measure the stats for X against the N classes, which is just a bunch of lines that can be tweaked at pleasure.

...

> I don't mind too much if crm114 doesn't work out for goal #2 - though it
> would be a serious setback - as there's still the spam filter feature

I think that, if none of the (pR outputs from) current classifiers fits your task, it'd be relatively easy to hack one of them into a new one, which would be named e.g. f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;).

--
paolo
From: Gerrit E.G. H. <Ger...@be...> - 2007-08-08 21:51:14
Paolo wrote:
>> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I
>> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway,
>
> these are ideas that floated in ML threads long ago. Note that OSBF makes
> room for 4k header.

Yes, I saw there was some version checking and header code in there already. BTW, 'man head' on my box doesn't give a -x option. Is that an option to read until the EOF (or NUL?) character in an ASCII file?

> ok, ok - no b2b ;)

Sorry, I recalled some 'cool hacking' sessions of long past that went pear-shaped as nobody could 'handle' it. With 20/20 hindsight, it was an exercise in complexity capability (how many nasty little details can you handle all at once).

> no, if you put the classes in 2 sets like
> ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...)
> you get a scalar (success|fail, but still all pR values). If you instead
> say
> ! classify (classA_1 classA_2 ... classB_1 classB_2 ...)
> (note no '|') ie run in 'stats-only' - you get just the pR vector. I think
> you can use that for building your fuzzifier, either in CRM or your favourite
> prog.lang. A tricky point is that pR is normalized, so that it cannot be
> used as class-membership function as is; an artifice could be to add a
> class 'AnythingElse', ie the complement to the set of your classes.

I've copied this to my project notes. At the moment, the details of this are beyond my grasp, but that will change when I move away from the code cleanup into the actual algorithmic material of crm114. Thank you for this tip, for it gives me a direction to investigate.

> note that not all classifiers work well for N >2, nor those that are
> *supposed* to work have been thoroughly tested.

I already suspected as much. That's why I don't mind going through all the code: I expect I'll need this exercise later on.

> well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present).
> [...]
> I think that, if none of the (pR output from) current classifiers fits your
> task, it'd be relatively easy to hack one of them into a new one, which
> would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;).

:-) Heh, OSBG, now that would be something.

Seriously though, I immediately recognized the plug/bolting-in features when I first had a look at the crm114 code. Of course, a bit less of a copy&paste approach would have been 'nice' from a certain design point of view, but given the research nature of this type of tool (as Bill put it so eloquently somewhere: 'spam is a moving target'), copy&paste is a very good approach (you can always refactor the sections that have stabilized). Besides, there are very nice tools out there to ease diff&merge-ing source files, so it's not much of a hassle to keep them in sync for now. (Like I did with my copy of SVM vs. SKS: SKS seems to have started as an utterly stripped version of SVM, but the behaviour is _very_ similar, so I merged the SVM code back in, just so I have fewer diffs to look at when cross-checking SKS vs. SVM after a code change in either one of them.)

Ger
From: Raul M. <mo...@ma...> - 2007-08-08 16:00:33
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote:
> Right now, as I see it, you can't provide hard guarantees that
> conversions will work (and I suspect that, given my goal with crm114,

Sure you can: Reaver Cache. That works across versions, across classifiers, etc.

--
Raul