On Wed, Dec 01, 2004 at 09:19:07PM -0200, Fidelis Assis wrote:
> Paolo wrote:
> >#0 (m.css): features: 1589664, hits: -1040300874, prob: 2.23e-307, pR: -306.65
> >#1 (y.css): features: 1935888, hits: -1083013448, prob: 2.23e-307, pR: -306.65
> >#2 (h.css): features: 1680720, hits: -584445275, prob: 2.23e-307, pR: -306.65
> >#3 (e.css): features: 2888992, hits: 2067210680, prob: 1.00e+00, pR: 306.05
> sprintf (buf,
> "#%ld (%s):"\
> - " features: %ld, hits: %ld, prob: %3.2e, pR: %6.2f \n",
> + " features: %lu, hits: %lu, prob: %3.2e, pR: %6.2f \n",
> This error is also present in OSBF but, because features are counted
> only once per document, the problem is rarer.
> It is also possible that the positive values for hits are wrong,
> because of wrap-arounds, especially for SBPH.
hmm, yeah - %ld was in the original code. I've tested old binaries, both
static and shared, and got the same big positive/negative numbers, so I
must conclude that I had only ever checked that stat printout with small
sample texts.
So that little bug is gone.
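
For the record, a minimal sketch of why the counts came out negative and
why positive ones may still be wrong. It assumes a 32-bit counter (my
guess for the binaries involved, not checked against the crm source), and
format_hits/add_hits are hypothetical helpers made up for illustration:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not from the crm source: format one 32-bit hit
   counter twice, once reinterpreted as signed (the effect of the old
   signed conversion) and once as unsigned (the effect of the fix). */
void format_hits(uint32_t hits, char *as_signed, char *as_unsigned, size_t n)
{
    /* The cast is implementation-defined in C; on the usual
       two's-complement machines it yields the negative value. */
    snprintf(as_signed, n, "%" PRId32, (int32_t)hits);
    snprintf(as_unsigned, n, "%" PRIu32, hits);
}

/* Unsigned arithmetic wraps modulo 2^32, so even a positive total can be
   wrong once the true count has passed 4294967295. */
uint32_t add_hits(uint32_t total, uint32_t more)
{
    return total + more;
}
```

Under that assumption, a counter holding 3254666422 would be the value
behind the -1040300874 reported for m.css, and add_hits(4294967295u, 2u)
wraps to 1, which prints as a perfectly plausible but wrong positive count.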
BUT there's still the problem with r.txt and r64.txt: I can't believe
that crm got 872468646 hits in e.css, i.e. 42% of the hits from e.txt, the
only text learned into e.css! I'd expect 0, since these are random cruft
that shares not a single word/token with the learned texts.
And - but that's yet another story - I'd say crm shouldn't force a class
on input text, when actual hits are too low (below some pre-set threshold).
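
Roughly what I have in mind, as a sketch only: pick_class and min_hits
are hypothetical names, and it ranks classes by raw hit count for
simplicity, which is not how crm actually scores them.

```c
#include <stddef.h>

/* Hypothetical sketch: return the index of the class with the most hits,
   or -1 ("no class") when even the best class has fewer than min_hits. */
int pick_class(const unsigned long *hits, size_t nclasses,
               unsigned long min_hits)
{
    size_t best = 0;

    if (nclasses == 0)
        return -1;
    for (size_t i = 1; i < nclasses; i++) {
        if (hits[i] > hits[best])
            best = i;
    }
    /* Refuse to force a class on input that barely matched anything. */
    return hits[best] >= min_hits ? (int)best : -1;
}
```

With hits of {3, 10, 7} and min_hits of 5 this picks class 1; raise
min_hits to 11 and it returns -1 instead of forcing a class.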
GPG/PGP id:0x21426690 kfp:EDFB 0103 A8D8 4180 8AB5 D59E 9771 0F28 2142 6690
"Indeed, it does come with warranty: it *will* fail, sometimes, somehow..."
- software vendor