From: Gerrit E.G. H. <Ger...@be...> - 2007-08-06 19:59:24
L.S., I've been thinking a bit about this binary/native file data stuff and had an idea I'd like to discuss.

First off, I'm pretty sure by now that a 'native binary format' for the storage is a Good Thing(tm), as it makes CRM act _fast_. (And that's precisely what I need.) Why is it 'good'? Because crm uses a very fast mechanism to access the files, called memory mapping (mmap(3) or the Win32 equivalent). Having crm translate/convert those bytes to/from 'native format' would either slow down the memory-mapped file access (each read/write would have to convert int/double formats when mmap is used) or we'd have to throw memory mapping out the window, i.e. load every file into memory, convert once on load and once on save (for learn), causing an increase in both startup and memory usage overhead. Both these methods, aimed at providing a 'portable binary file format', would have a serious impact on crm performance IMHO. (And it just might make the code a bit less legible too, but that can be handled in other ways.)

Then another thought struck me: while scanning through my (rather short) archive of the mailing list, I've seen a few questions about the 'portability' of these files across multiple CRM versions. (e.g. the 64-bit discussion with regard to a new Debian build, IIRC) I don't have a proper solution for resolving the latter, but we might consider ways to ensure crm binary formats are 'forwards portable' across platforms _and_ versions/releases. (forwards portable = anything before this moment may be 'broken', but from now on we can offer portability)

The idea is this (it has been done in a minor way before within crm):

- start each file with a versioned header (I'll come back to that later)
- the rest of the file stays as it is right now: the hashes and all the other goodies, stored in a native format (int, double, etc.) for _fast_ access/use every time.
The way to provide the forward portability would be through an export/import mechanism (one already exists for a few formats: cssdump) which can be used to export binary files to a portable (preferably text?) format, which can then be imported into another crm: different version and/or platform.

A bit of a snag there: I've found that the current way of calculating the hashes is not portable, i.e. it does not deliver the exact same values on different platforms. That has to be fixed before 'cross-platform' would be feasible. (As I don't see a 'hash converter' happening: hashes are essentially a one-way trip, so there's no way to convert hashes that were calculated from slightly different bit patterns.) I'm not sure fixing the strnhash() code to calculate the same hash bit pattern on every platform is quite enough to ensure cross-platform compatibility, though: the data in the files is sometimes derived from the hashes, which would imply those 'derivation paths' would also have to be bit-compatible across platforms. (I'm not sure; I have to check again whether we store derived data next to config params and hash series.)

Let's assume however that we've got the hashes/import/export thing covered. And even if we don't, we would at least be well aware of the limitations.

That leaves the 'versioned header' I mentioned. The versioned header should contain enough information for an export/import function to operate correctly: (a) import all acceptable data, or (b) report the incompatibility and hence the need to 'recreate/relearn' the files. Especially (b) is important, as that would enable (automated) upgrades to properly interact with the users: one would then be able to select whether to commence with an 'incompatible' upgrade or delay it until a later, opportune moment.

To make it all cross compatible, the header would need to include these bits of info:

- actual crm version
- the classifier used
- the platform (little/big endian, 32/64/... bits, etc.)
- data type (just to make sure we're not messing with the wrong sort of stuff)

so the im/exporter can handle possible data content/format changes and/or data compatibility checks. The header has to be in a cross-platform portable format so that every crm toolset out there can recognize copied/moved files and act accordingly: report or process the data. Given the 'memory mapped' file access approach of crm, this also means that the header has a fixed length (in bytes). The software can then easily jump over this header to get at the data itself.

Conclusion
----------

I suggest the introduction of a cross-platform binary format, fixed-width header for all crm binary files (css et al) to provide (at least) forward compatibility. For the remainder, all these files stay as they are: data will be stored in a native format for fast access.

The binary format header will include (at least) these information items:

- the crm version used to create the file
- the platform (integer size and format (endianness), floating point size and format, structure alignment, etc.)
- the classifier used to create the file
- the data content type (some classifiers use multiple files)
- space for future expansion (this is a research tool too: allow folks to add their own stuff which may not fit the header items above)

The approach includes an export/import tool to convert the data to/from a cross-platform portable text format, where applicable. At the very least, the im/exporter will then be able to accurately predict whether you can 'migrate' your data or need to start with a clean slate. This also depends on the sophistication of the im/exporter, of course, but the minimum benefit is a quick yet accurate decision on the 'migratability' of your data. Which, I believe, doesn't exist now, unless you find 'better always rebuild from scratch' a good choice for every major/minor software upgrade. ;-)

What are your thoughts on this matter?
Is this worth pursuing (and hence augmenting the code to support such a header from now on) or is this, well...

Best regards,

Ger
From: Paolo <oo...@us...> - 2007-08-06 23:31:45
On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:

> - start each file with a versioned header (I'll come back to that later)

that's well established for Fidelis' OSBF

> The way to provide the forward portability would be through providing an
> export/import mechanism (already exists for a few formats: cssdump)

...

> The versioned header should contain enough information for an
> export/import function to operate correctly:
> a) import all acceptable data, or

there's a catch, as the original arch on which to do the export 1st might not be avail anymore ...

> b) report the incompatibility and hence the need to 'recreate/relearn'
> the files.

... and b) might not always be an option.

> Especially (b) is important as that'd enable (automated) upgrades to
> properly interact with the users: one would then be able to select

yep, but I'd consider it a bug (which might be just a TODO) to have a conversion util/function which is unable to properly convert our own stuff from arch1 to arch2, both ways, whatever arch* are. Such converters won't be exactly trivial (byte swapping, aligning, padding, etc.) but feasible.

> The binary format header will include these information items (at least):
>
> - the crm version used to create the file
> - the platform (integer size and format (endianness), floating point size
>   and format, structure alignment, etc.)
> - the classifier used to create the file
> - the data content type (some classifiers use multiple files)
> - space for future expansion (this is a research tool too: allow folks
>   to add their own stuff which may not fit the header items above)

+ file-format version and, since there'll be plenty of space, a plain-text file-format blurb and summary file stats, so that head -x css would be just fine to report the relevant things.
> The approach includes the existence of an export/import tool to convert
> the data to/from a cross-platform portable text format, where

that's the current CSV inter-format, though the converter should be able to do it at once binary-2-binary.

> What are your thoughts on this matter? Is this worth pursuing (and hence
> augmenting the code to support such a header from now on) or is this,
> well...

for spam filtering, it's easier (and usually better) to start from scratch, but in other applications the hash DBs might be precious stuff, so as people extend crm114 use to other tasks, such a tool might become highly desirable.

--
paolo
From: Gerrit E.G. H. <Ger...@be...> - 2007-08-07 18:37:13
Paolo wrote:
> On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:
>> - start each file with a versioned header (I'll come back to that later)
>
> that's well established for Fidelis' OSBF

I saw. It's just that I'm looking for a rather more generic solution, one which is copy&paste-able when anyone (probably Bill) feels like adding other classifiers to crm114. Say some sort of 'file format / coding practice' thing: rip it off the other classifiers and just add your own classifier constant (so no fancy footwork with index [0] in the data arrays themselves or anything like that).

>> a) import all acceptable data, or
>
> there's a catch, as the original arch on which to do the export 1st might
> not be avail anymore ...

Heh :-) That's where I refer to the legalese in there: 'sorry sir, it's _forward_ compatible as of this release' ;-)

The whole point is that I'm trying to get at a mechanism which clearly identifies the data, both in type and version, so that we can develop a 'sure-fire' and sane conversion. This while keeping in mind that design/devel/test time is a rather limited resource, so the 'management decision' may well turn out to be to forego the availability of a complete 'conversion' for specific versions (and that may include crm file versions predating this versioning mechanism).

Right now, as I see it, you can't provide hard guarantees that conversions will work (and I suspect that, given my goal with crm114, I'll need that sort of thing), as you have several classifiers and software versions, while there's no way to tell them apart in a _guaranteed_ manner: all one can go on is some version info (OSBF et al) and a bit of heuristics. And 'it may work' isn't an option for me when I'm going to employ crm114, so I'd like to be able to _specifically_ test (and thus support) crm software versions and classifiers.
Long-winded paragraphs cut short: I want to end up with a chart which tells me: "You've got crm114 release X and are using classifier C; well, we do support a 'full data transfer' to the current crm114 release." And maybe an additional (sub-)chart which says: "And incidentally, when you have crm114 running on system S, you can also _share_ that classifier's data on system type T using our import/export-based sync system."

These charts have three possible ticks in each cell of their matrices: (a) may work (a.k.a. there's code for this in there) + (b) tested, a.k.a. we got word it works + (c) supported, a.k.a. you may bother us / complain when it isn't working. No tick in your cell on those charts means: you're SOL. Time for a retraining and a ditching of the old files, probably.

This would solve the problem of the everlasting questions: can I keep my files or should I start from scratch? For folks that cannot retrain as they go, this 'charted' approach will provide them with a clear decision chart: can/should I upgrade, or shouldn't I?

>> b) report the incompatibility and hence the need to 'recreate/relearn'
>> the files.
>
> ... and b) might not always be an option.

See above. I'm well aware of that. I'm driving at a mechanism which allows everyone to clearly see when and what can be / has been done. That includes you (J.R. User) helping the crm114 team by adding export/import support for those situations where the chart says 'not available' while you need that sort of thing. That also includes collecting and archiving feedback on [user] test results: did their transfer/upgrade work out OK?

It's added work, but the benefit is that the upgrade process (and the decision to upgrade) can be fully automated in the end, e.g. for unmanned systems: only upgrade when our locally used version + classifier has a tested (and supported?) data migration path towards this new crm114 upgrade release.
> yep, but I'd consider a bug (which might be just a TODO) a conversion
> util/function which is unable to properly convert our own stuff from arch1
> to arch2, both ways, whatever arch* are.
> Such converters won't be exactly trivial (byte swapping, aligning, padding,
> etc) but feasible.

That's where the limited design/devel resourcing comes into play: I don't mind if the 'standard' decision is NOT to support/provide a data conversion path. That's understandable, as we don't have an unlimited supply of dev power. But when we do choose to provide a conversion path, it should be clearly identifiable. (Someone may need it and help Bill, you and the others by putting in the dev effort there, just like I'm reviving the Win32 port and adding error checking and stuff along the way.)

And, BTW, I've written that sort of cross-platform stuff quite often before. It gets a bit wicked when you need to convert VAX/VMS Fortran floating point values to PC/x86 IEEE format, for instance. ;-)) Otherwise, it's just really careful coding and a bit of proper up-front thinking. And then keeping a lookout for register/word-size issues (e.g. 32- vs. 64-bit) throughout the crm implementation, which is the hard part. Padding, endianness, etc. can be handled rather easily: define a 'special struct' with all the basic types in there and load it with a special byte sequence: that gives you endianness and alignment for all basic types. Floating point values need a bit of special treatment when you travel outside the IEEE realm, but that's doable too. Not trivial, though, indeed.

>> The binary format header will include these information items (at least):
>>
>> - the crm version used to create the file
>> - the platform (integer size and format (endianness), floating point size
>>   and format, structure alignment, etc.)
>> - the classifier used to create the file
>> - the data content type (some classifiers use multiple files)
>> - space for future expansion (this is a research tool too: allow folks
>>   to add their own stuff which may not fit the header items above)
>
> +file-format version and, since there'll be plenty of space, plain-text
> file-format blurb and summary file-stats, so that head -x css would be
> just fine to report the relevant things.

Brilliant idea! I hadn't thought about the 'head -x', but I _like_ it. I was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, so, yes, plenty of space for a little informational text up front. A few Kbytes won't hurt.

+file-format: yes. In case we find the format needs to be changed again (hopefully not before 2038 ;-) ). Another very good point.

>> The approach includes the existence of an export/import tool to convert
>> the data to/from a cross-platform portable text format, where
>
> that's the current CSV inter-format, though the converter should be able
> to do it at once binary-2-binary.

I've seen the CSV inter-format and I was thinking about using that. No bin-2-bin direct stuff, as that would complicate matters beyond control: given the 'cross-platform' tack, it would mean that a developer would have to code (and maintain) software which includes a table of file layout definitions, one for each supported platform (and probably each crm release version too).

Compare this to databases: right now I'm in a project where I've found that Oracle cannot copy database files as-is across patch versions (that's the ultra-minor version number), let alone move the binary database files as-is onto different Unix architectures (HPUX vs. Linux, of course with different CPUs too). And that makes sense! The point?
When Oracle DBAs are used to export-dumping and importing databases running into the many-multi-Gigabyte range to provide a migration/upgrade path for the data stored therein, I'd like to do _exactly_ the same. That means: use the CSV format (probably augmented) as an intermediate. (Or XML when I feel like getting fancy and really 21st century ;-) )

I've done direct bin-2-bin conversions in the past, but they're a true support nightmare. It's doable, but you can have someone spend a serious chunk of his/her life on that alone. And when that person quits supporting it, you're SOL as a tool provider, really. (Imagine your customers use a platform which you didn't support just yet. Maybe even a new CPU type. Can your _design_ of the bin2bin handle that? Or do you need to spend a significant amount of devel effort just to add the generation of these new-CPU-type files to your ware?)

The easy way out is to provide all your customers with a single, portable format: they've got the software built on their own machines, and who better than the machine itself to convert to/from that portable format? Thus, the conversion effort is off-loaded to the compiler vendor, who has to cope with it anyway. (sscanf/printf/etc.)

XML is a good example of a solution invented for solving precisely this very issue. (cross-platform, cross-version, cross-X-whatever data transfer) We might even consider using XML as a replacement for the CSV format, though XML tends to be rather, er, obese when it comes to data file sizes. XML is hierarchical, so we can easily store our header info and crm classifier data in there, nicely separated/organized.

> for spam filtering, it's easier (and usually better) to start from scratch,
> but in other applications the hash DBs might be precious stuff, so as people
> extend crm114 use to other tasks, such a tool might become highly desirable.

Yes, indeed. Verily.

<off-topic>
I have looked around at software supporting Bayesian/Markovian/etc.
statistics, and selected crm114 because it looked like it had the right amount of 'vim' (i.e. a lively dev community) while offering a feature set which might cover my needs, or get very close indeed. I intend to use crm114 for spam filtering (combined with xmail) and for a second purpose. I'm not going to disclose what that is exactly, but think of it as a sort of fuzzy decision-making / monitoring process, a bit of a cross-breed between a constraint-driven scheduler and a _learning_ 'fuzzy' discriminator, which has to wade through a slew of 'crap' to arrive at a 'proper' rule or decision.

Here I'm more interested in decision _vectors_ (rather small ones) than _scalars_, but I'll tackle that hurdle when I've got crm114 to a state where I can really dive into the classifiers themselves, because I believe right now it only supports a single output scalar (pR?), but I'm not entirely sure there (lacking sufficient algorithm understanding). Anyway, I guess the 'vector solution' would be to use multiple crm (file) instances in parallel: one pR for each decision item in the output vector. Of course, that's a crude way, so the 'clean' approach I was originally aiming for was to convert crm114 into a library which could be called/used from within my own special-purpose software. Alas, that's not a Q4 2007 target anyway. ;-)

The problem for me is that I need to understand/learn the algorithm internals of this advanced statistics stuff, as it is new to me and I want to understand what it's actually doing, i.e. how this stuff arrives at a decision, as I need to understand the implicit restrictions on the classifiers (and learning methods). Let's just say I don't want to join the masses who can't handle the meaning and implications of 'statistical significance', e.g. by just grabbing a likely classifier and 'slapping it on'. I fear that would cause some serious burn in the long term.
You may have seen from my work so far that I'm a bit paranoid at times ^H^H^H^H^H^H^H acutely aware of failure conditions, and it would be utterly stupid to fall into that beartrap at a systems level by grabbing this tool and applying it to a problem without really understanding where and what the limitations of the various parts are. I've met too many design decisions _not_ to worry. I've got the idea, I have a 'feeling' that this is the right direction, but it's really still just guesswork regarding feasibility so far.

I arrived at crm114 while I had been looking for a decision filter which could easily handle _huge_ inputs for tiny outputs (spam: input = whole emails, output vector size = 1), produce consistent and significant decisions (spam: > 99% filter success rate in a very short learning period) while including a good 'learning' mode: somehow I don't think Bayesian is the bee's knees when it comes to my second goal. And it has been shown it's certainly not the end of the line for spam either. And besides, crm114 isn't written in Perl (or some other interpreted language), which in my world is a big plus. ;-)

I don't mind too much if crm114 doesn't work out for goal #2 (though it would be a serious setback) as there's still the spam filter feature, which is useful to me. So I don't mind spending some time on this baby to push it to a level where I can sit back, have a beer and say "Yeah! Looks good, feels good. Let's do it!"
</off-topic>

Best regards,

Ger
From: Paolo <oo...@us...> - 2007-08-08 08:41:51
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote:
>
> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I
> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway,

these are ideas that floated in ML threads long ago. Note that OSBF makes room for a 4k header.

> I've seen the CSV interformat and I was thinking about using that. No
> bin-2-bin direct stuff, as that would complicate matters beyond control:
...
> I've done direct bin-2-bin conversions in the past, but they're a true
> support nightmare. It's doable, but you can have someone spend a serious

ok, ok - no b2b ;)

> <off-topic>
...
> and a _learning_ 'fuzzy' discriminator, which has to wade through a slew
> of 'crap' to arrive at a 'proper' rule or decision. Here I'm more
> interested in decision _vectors_ (rather small ones) than _scalars_, but
> I'll tackle that hurdle when I've got crm114 to a state where I can
> really dive into the classifiers themselves, because I believe right now
> it only supports single output bits(scalar) (pR?) but I'm not entirely

no; if you put the classes in 2 sets like

  ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...)

you get a scalar (success|fail, but still all pR values). If you instead say

  ! classify (classA_1 classA_2 ... classB_1 classB_2 ...)

(note: no '|'), i.e. run in 'stats-only' mode, you get just the pR vector. I think you can use that for building your fuzzifier, either in CRM or your favourite prog. lang. A tricky point is that pR is normalized, so it cannot be used as a class-membership function as-is; an artifice could be to add a class 'AnythingElse', i.e. the complement to the set of your classes.

> The problem for me is that I need to understand/learn the algorithm

note that not all classifiers work well for N > 2, nor have those that are *supposed* to work been thoroughly tested.
> I've got the idea, I have a 'feeling' that this is the right direction,
> but it's really still just guesswork regarding feasibility so far.

well, crm114 is a jit engine + classifiers plugged in (bolted in, at present). The whole thing about pR is how you measure the stats for X against the N classes, which is just a bunch of lines that can be tweaked at pleasure.

...

> I don't mind too much if crm114 doesn't work out for goal #2 - though it
> would be a serious setback - as there's still the spam filter feature

I think that, if none of the (pR outputs from) current classifiers fits your task, it'd be relatively easy to hack one of them into a new one, which would be named e.g. f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;).

--
paolo
From: Gerrit E.G. H. <Ger...@be...> - 2007-08-08 21:51:14
Paolo wrote:
>> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I
>> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway,
>
> these are ideas that floated in ML threads long ago. Note that OSBF makes
> room for 4k header.

Yes, I saw there was some version checking and header code in there already. BTW, 'man head' on my box doesn't give a -x option. Is that an option to read until the EOF (or NUL?) character in an ASCII file?

> ok, ok - no b2b ;)

Sorry, I recalled some 'cool hacking' sessions of long past that went pear-shaped as nobody could 'handle' it. With 20/20 hindsight, it was an exercise in complexity capability (how many nasty little details can you handle all at once).

> no, if you put the classes in 2 sets like
> ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...)
> you get a scalar (success|fail, but still all pR values). If you instead
> say
> ! classify (classA_1 classA_2 ... classB_1 classB_2 ...)
> (note no '|') ie run in 'stats-only' - you get just the pR vector. I think
> you can use that for building your fuzzifier, either in CRM or your favourite
> prog.lang. A tricky point is that pR is normalized, so that it cannot be
> used as class-membership function as is; an artifice could be to add a
> class 'AnythingElse', ie the complement to the set of your classes.

I've copied this to my project notes. At the moment, the details of this are beyond my grasp, but that will change when I move away from the code cleanup into the actual algorithmic material of crm114. Thank you for this tip, for it gives me a direction to investigate.

> note that not all classifiers work well for N >2, nor those that are
> *supposed* to work have been thoroughly tested.

I already suspected as much. That's why I don't mind going through all the code: I expect I'll need this exercise later on.

> well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present).
> [...]
> I think that, if none of the (pR output from) current classifiers fits your
> task, it'd be relatively easy to hack one of them into a new one, which
> would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;).

:-) Heh, OSBG, now that would be something.

Seriously though, I immediately recognized the plug/bolting-in features when I first had a look at the crm114 code. Of course, a bit less of a copy&paste approach would have been 'nice' from a certain design point of view, but given the research nature of this type of tool (as Bill put it so eloquently somewhere: 'spam is a moving target'), copy&paste is a very good approach (you can always refactor the sections that have stabilized). Besides, there are very nice tools out there to ease diff&merge-ing source files, so it's not much of a hassle to keep them in sync for now. (Like I did with my copy of SVM vs. SKS: SKS seems to have started as an utterly stripped version of SVM, but the behaviour is _very_ similar, so I merged the SVM code back in, just so I have fewer diffs to look at when cross-checking SKS vs. SVM after a code change in either one of them.)

Ger
From: Raul M. <mo...@ma...> - 2007-08-08 16:00:33
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote:
> Right now, as I see it, you can't provide hard guarantees that
> conversions will work (and I suspect that, given my goal with crm114,

Sure you can: Reaver Cache. That works across versions, across classifiers, etc.

--
Raul