#65 Odd glitch in index in a very small corpus

Status: open
Labels: Indexer (22)
Priority: 5
Updated: 2007-01-07
Created: 2007-01-07
Private: No

I have indexed a corpus consisting of a single file (it's M01 from FLOB, so a total of about 2000 words).

Words with multiple forms are not being indexed correctly. This affects the freq0 file and also the lists you get in the Word Query dialog.

In the freq0 file, I get, for instance, three lines for "of" instead of the one line I ought to have:

of 1 1
of 46 1
of 1 1

This should, surely, be
of 48 3

(the "forms" are distinguished by POS tags -- respectively, ???, IO, and II22).
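To make the expected result concrete, here is a minimal sketch of the merge I would expect to see. It assumes each freq0 line holds three whitespace-separated fields (headword, total frequency, number of forms), which is only inferred from the lines quoted above; the real file format may differ, and `merge_freq_lines` is a hypothetical helper, not part of the indexer.

```python
def merge_freq_lines(lines):
    """Combine repeated headword lines into one (headword, total, forms) row.

    Assumed layout per line: "<headword> <frequency> <number of forms>".
    """
    merged = {}  # dicts keep insertion order, so first-seen order is preserved
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines or lines that do not fit the assumed layout
        word, count, forms = parts[0], int(parts[1]), int(parts[2])
        total, nforms = merged.get(word, (0, 0))
        merged[word] = (total + count, nforms + forms)
    return [(word, total, nforms) for word, (total, nforms) in merged.items()]


# The three lines reported in this bug
sample = ["of 1 1", "of 46 1", "of 1 1"]
for word, total, nforms in merge_freq_lines(sample):
    print(word, total, nforms)  # prints: of 48 3
```

Running the same function over a whole freq0 file and comparing the number of rows before and after would also show how many headwords are affected.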

This is reflected in the WQ dialog: each form of "of" appears as a separate line in the top half of the word-list, instead of the forms being combined into one entry in the top half and then separated out in the bottom half. See enclosed graphic.

This is carried over into the XML listing saved from this dialog as well.

Since it is a consistent problem, I'm guessing it has to do with the indexer rather than the client. I don't get this bug with bigger corpora, e.g. the full whack of FLOB (whose freq0 has one line, "of 34094 14"), and it's fine with the BNC too.

Andrew.

Discussion

  • Andrew Hardie

    Andrew Hardie - 2007-01-07

    bug graphic

     
  • Tony Dodd

    Tony Dodd - 2007-01-08

    Logged In: YES
    user_id=1036552
    Originator: NO

    I tested this with 1.22 and got the correct behaviour, so whatever the problem is, it appears to have gone away. Strange: I can't explain how it could have arisen. I suspected the addkeys setup, but it is quite correct.

    Andrew, will you test this when 1.22 appears and report the result here? If it really has gone away, I'll close the bug.

     
  • Andrew Hardie

    Andrew Hardie - 2007-01-10

    Logged In: YES
    user_id=1460495
    Originator: YES

    No change in 1.22. There are still three separate listings for "of" in both the freq0 file and in the result I get from the Word Query dialog lookup ...

     
  • Tony Dodd

    Tony Dodd - 2007-01-15

    Logged In: YES
    user_id=1036552
    Originator: NO

    I can reproduce this now: I missed it before because it doesn't happen with the debug version. This also makes it rather hard to track, but I'll catch it one way or another!