From: Christian Siefkes <siefkes@...>
Fidelis Assis wrote:
> Ok, I tried with your standard SA shuffle01 but couldn't reproduce the
> results. I got low accuracy (8 errors/last 500 x 1/500 OSBF), slow
> performance (2+ times slower than OSBF) and high disk space usage (23.3
> MB x 2.2 MB OSBF).
I also tried <hyperspace> on our standard SA test set using thick
thresholding (pR thickness 0.5) as requested by Bill (with plain
tokenization). My results were somewhat better but not great:
last 10x500: 49 errors
last 10x1000: 94 errors
all mails: 719 errors
That's acceptable but not really up to what we're getting from
Winnow or even OSBF. I also noted that the CSS files tend to grow
quite large (~10-20 MB) which makes you worry what would happen if
you use this classifier for 100,000s of messages instead of 4000.
No, that's definitely broken. I'm getting half that error rate when
I run it on my old dying laptop, limping as it is, and only 300K or so
in the files.
Time to see WHERE I screwed up something, how badly, etc. Because it
very simply should not do that. I broke the code somehow, in the
process of pushing it from the old dying laptop up to Sourceforge and
then pulling it back down onto the new laptop. All of the stats runs
were on the old (966 MHz) Transmeta laptop, not the new 1.6 GHz Pentium M
laptop (as the new one isn't even fully configured yet, it doesn't
have the SA test set on it yet. Heck, it doesn't even have my MP3s
on it yet...) but I *did* build the kit on the new one as I can write
documentation with it....
... or, a worse possibility - the old laptop has some secret
sauce that makes this work, and the new one doesn't... which
might be the OS - the old one is running Core 3 and the new one
is running Core 4 and I built the CRM114 release on the new one, not the
old one.
Two more observations that are unrelated to classifier accuracy:
- As usual, I called rm to delete the CSS files between test runs,
but rm complained about "Resource or device busy" and just renamed
the files to hidden NFS nodes (.nfsSOMETHING) instead of removing
them. I could later delete the renamed files from another
machine. Is it possible that unmap doesn't work as expected so the
file system still considers a CSS file as open after crm
terminated? (I'm on Debian Sarge.)
I wonder if that is truly unrelated or not. If the filesystem is
dropping things on the floor, then it's possible that some "learns"
are being dropped as well. If the learned data (which is
appended to the end of the file in hyperspace, rather than being
embedded as in the other classifiers) isn't seen by subsequent
invocations, then it might learn and learn and learn ad infinitum.
Hmmm...
I have a hunch. We may be hoist by our own CSS caching petard.
If the caches are being held open, then there's an open file handle on
the .chp file (Crm HyPerspace). Then when we write, we write off past
what the mmap has. But the cached mmap is *still* open. We close the
LEARN write. But the cache (and the cache's open() ) is still there.
And the cache's open() is *also* RW.
Maybe we're truncating where we shouldn't be? That would account for
some of the issues.
And you know what's interesting... I thought of the OTHER side of this
problem: we do a crm_munmap_all() *after* we write, to force the cached
mmaps to see the new data. Maybe it should unmap *before* the write, not
after.
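To make the "write off past what the mmap has" worry concrete: a mapping is created with a fixed length, so anything appended afterwards lies outside it until the file is mapped again. The sketch below is not CRM114 code. It assumes a POSIX system, and cached_map, map_whole_file, append_bytes and stale_gap are hypothetical names invented for the illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for one cache entry: a mapping plus the
   length it was created with. */
typedef struct {
    void  *addr;
    size_t len;
} cached_map;

/* Map the whole file as it exists right now.  Returns 0 on success. */
static int map_whole_file(const char *path, cached_map *m)
{
    struct stat st;
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    m->len  = (size_t) st.st_size;
    m->addr = mmap(NULL, m->len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    return m->addr == MAP_FAILED ? -1 : 0;
}

/* Append n copies of byte c through stdio in "ab+" mode. */
static int append_bytes(const char *path, int c, size_t n)
{
    size_t i;
    FILE *f = fopen(path, "ab+");
    if (f == NULL)
        return -1;
    for (i = 0; i < n; i++) {
        if (fputc(c, f) == EOF) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f);
}

/* Create a file of `initial` bytes, map it, append `appended` bytes,
   and return how many bytes of the file the old mapping does NOT
   cover (or -1 on error).  The file is removed afterwards. */
static long stale_gap(const char *path, size_t initial, size_t appended)
{
    cached_map m;
    struct stat st;
    long gap;
    size_t i;
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    for (i = 0; i < initial; i++)
        fputc('a', f);
    if (fclose(f) != 0)
        return -1;

    if (map_whole_file(path, &m) != 0)
        return -1;
    if (append_bytes(path, 'b', appended) != 0 || stat(path, &st) != 0) {
        munmap(m.addr, m.len);
        return -1;
    }
    gap = (long) st.st_size - (long) m.len;   /* data the cache cannot see */
    munmap(m.addr, m.len);
    remove(path);
    return gap;
}
```

Calling stale_gap(path, 100, 50) on a scratch path reports that the 50 appended bytes lie beyond the cached length. Whether that is what actually goes wrong here is still a guess.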
Quick thing to try:
Here's the CURRENT code (lines 535ish of crm_osb_hyperspace.c):
    // open the output file
    hashf = fopen ( hashfilename , "ab+");
    if (user_trace)
      fprintf (stderr, "Writing to hash file %s\n", hashfilename);
    // and write the sorted hashes out.
    fwrite (hashes, 1,
            sizeof (HYPERSPACE_FEATUREBUCKET_STRUCT) * hashcounts,
            hashf);
    fclose (hashf);
    // let go of the hashes.
    free (hashes);
    // and force all mmaps to be re-read by later invocations; this
    // is needed for demonized processes to work correctly.
    crm_munmap_all ();
See that trailing crm_munmap_all() call at the very end? Move that to
right before the fopen() call.
Give that a try, and see if the weirdness changes. If it changes,
either for better or worse, then we have file truncation/caching
problems biting us.
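A minimal sketch of that reordering, written so it compiles on its own. It is not the real crm_osb_hyperspace.c: bucket_t is a hypothetical stand-in for HYPERSPACE_FEATUREBUCKET_STRUCT, stub_munmap_all() stands in for crm_munmap_all() and only counts calls, and the error checks are additions the original snippet does not have.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for HYPERSPACE_FEATUREBUCKET_STRUCT; the real
   layout lives in the CRM114 headers. */
typedef struct {
    unsigned long hash;
} bucket_t;

/* Stub standing in for crm_munmap_all(); here it only counts calls. */
static int munmap_calls = 0;
static void stub_munmap_all(void)
{
    munmap_calls++;
}

/* Append hashcounts buckets to hashfilename, dropping the cached
   mmaps BEFORE the file is opened for append.  Takes ownership of
   hashes and frees it.  Returns 0 on success, -1 on failure. */
static int append_hashes(const char *hashfilename,
                         bucket_t *hashes, size_t hashcounts)
{
    FILE *hashf;
    size_t written;

    /* moved up from the end of the block: no cached mapping should
       be live while the file grows */
    stub_munmap_all();

    /* open the output file */
    hashf = fopen(hashfilename, "ab+");
    if (hashf == NULL) {
        free(hashes);
        return -1;
    }

    /* and write the sorted hashes out */
    written = fwrite(hashes, sizeof(bucket_t), hashcounts, hashf);

    /* let go of the hashes */
    free(hashes);

    if (fclose(hashf) != 0 || written != hashcounts)
        return -1;
    return 0;
}
```

If the error counts or file sizes move after this change, in either direction, that points at the cache rather than the classifier math.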
- Strangely, classification does not seem to work reliably when files reside
on a RAID system. I get tremendously higher error figures than those above
when the CSS files __or even the mails to classify__ are on a RAID. I could
imagine that the memory mapping of CSS files is incompatible with RAID but I
was very surprised when I noticed that it doesn't even work for input files
(272 errors instead of 49) -- that was totally unexpected. This is probably
unrelated to the <hyperspace> classifier; I remember having the same problem
with other classifiers, at least for CSS files.
Hmmm.... that's also quite disturbing. How the heck can that
be?
-Bill Yerazunis