From: Lachlan A. <lh...@ee...> - 2002-12-05 08:01:42
On Wed, 4 Dec 2002 19:12, Gabriele Bartolini wrote:
> IMHO, the ultimate goal for a search process is to get a
> set of documents satisfying a semantic criterion, or better, a
> context criterion.
The ultimate is the singular value decomposition approach
that someone (Geoff?) was suggesting using for a "similar
documents" search. I'd really like to see this in HtDig
eventually. It moves away from indexing on words at all,
and instead indexes on an abstract notion of "how often are
the search words used in similar contexts to the words of
the target document?"
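
As a rough illustration of the idea (not HtDig code), here is a minimal
latent-semantic sketch in plain Python. It assumes a tiny term-document
count matrix, and it swaps a simple power iteration with deflation in for a
real SVD routine; the names `top_eigenpairs`, `lsi`, `fold_in` and `cosine`
are hypothetical, and k is assumed to be no larger than the matrix rank.

```python
import math

def top_eigenpairs(sym, k, iters=200):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation).
    Sketch only: uses a uniform start vector, which must not be orthogonal
    to the eigenvector being sought."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iters):
            w = [sum(a * b for a, b in zip(row, v)) for row in work]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * work[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found before the next round.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def lsi(docs, k):
    """Return (vocab, left singular vectors, k-dimensional document vectors)."""
    vocab = sorted({w for d in docs for w in d.split()})
    n_docs = len(docs)
    # Term-document count matrix A: rows are terms, columns are documents.
    a = [[float(d.split().count(t)) for d in docs] for t in vocab]
    ata = [[sum(row[i] * row[j] for row in a) for j in range(n_docs)]
           for i in range(n_docs)]
    pairs = top_eigenpairs(ata, k)
    sigmas = [math.sqrt(lam) for lam, _ in pairs]
    # Left singular vectors: u = A v / sigma.
    us = [[sum(row[j] * v[j] for j in range(n_docs)) / s for row in a]
          for (_, v), s in zip(pairs, sigmas)]
    # Document j in the latent space: (sigma_1 v_1[j], ..., sigma_k v_k[j]).
    doc_vecs = [[s * v[j] for (_, v), s in zip(pairs, sigmas)]
                for j in range(n_docs)]
    return vocab, us, doc_vecs

def fold_in(query, vocab, us):
    """Project a query's term counts into the same latent space (U^T q)."""
    q = [float(query.split().count(t)) for t in vocab]
    return [sum(ui * qi for ui, qi in zip(u, q)) for u in us]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

docs = ["car engine", "auto engine", "flower petal"]
vocab, us, doc_vecs = lsi(docs, 2)
q = fold_in("car", vocab, us)
scores = [cosine(q, d) for d in doc_vecs]
```

With these three toy documents, the query "car" scores close to 1 against
"auto engine" even though the two share no word, because "car" and "auto"
both occur alongside "engine"; "flower petal" scores close to 0.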
> ... italian and latin languages, but they
> are for sure more complex, having different affix rules
> and lots more of different tenses.
Good point. Am I also correct in believing that some
languages like German have a lot of changes to the stems
themselves ("schwimmen, schwamm, geschwommen", "trinken,
trank, getrunken")? Is there an approach that can handle
that much generality?
> As Geoff suggested, we could implement a
> different fuzzy algorithm for the 'Porter stemming' which
> builds a new index (a stemmed one).
Yes, it would be good to have a fuzzy stemming algorithm
which doesn't simply return a query with (variant1 OR
variant2 OR ...), but actually searches a stemmed index.
That would be more efficient when a word has lots of
different forms, since each query term stays a single
index lookup.
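
To make the contrast concrete, here is a minimal sketch of a stemmed index
in plain Python. It is not HtDig's fuzzy mechanism: `toy_stem` is a
hypothetical, deliberately crude suffix stripper standing in for the Porter
algorithm, and `build_stemmed_index` and `search` are illustrative names.

```python
from collections import defaultdict

# Toy suffix list, checked in order; these are NOT the Porter rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_stemmed_index(docs):
    """Map each stem to the set of document ids containing any of its forms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index

def search(index, word):
    """One lookup on the stem, instead of OR-ing every variant of the word."""
    return index.get(toy_stem(word), set())

docs = {1: "indexing documents", 2: "he indexed it", 3: "search engine"}
index = build_stemmed_index(docs)
hits = search(index, "indexes")
```

Here "indexes", "indexing" and "indexed" all reduce to the stem "index", so
the single lookup returns documents 1 and 2 without expanding the query.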
> > word-level indexing, to give (much) smaller inverted
> > files if people don't need phrase searching.
>
> I guess customisation is our goal. In a retrieval phase,
> we'd want to store almost *anything* we can, then maybe
> with different fuzzy algorithms build alternative indexes
> (smaller or bigger, depending on users' settings).
Yes, a document-level inverted file could be generated from
the word-level one after the whole dig. I don't know much
about htdig's fuzzy mechanism yet; is it possible to delete
the main inverted file and just rely on a "fuzzy" one? If
so, the only other disadvantages would be speed and the
amount of temporary space required (RAM and disk space).
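
The derivation of a document-level file from a word-level one might look
like the sketch below. The data layout is an assumption made for
illustration (each word maps to a list of `(doc_id, position)` pairs), not
htdig's actual on-disk format, and `to_document_level` is a hypothetical
name.

```python
def to_document_level(word_level):
    """Collapse (doc_id, position) postings to a sorted list of doc ids.
    Positions are dropped, so phrase searching is no longer possible."""
    return {
        word: sorted({doc_id for doc_id, _position in postings})
        for word, postings in word_level.items()
    }

# Assumed word-level layout: word -> [(doc_id, word position in document)]
word_level = {
    "fuzzy": [(1, 0), (1, 7), (2, 3)],
    "index": [(2, 4)],
}
doc_level = to_document_level(word_level)
# doc_level == {"fuzzy": [1, 2], "index": [2]}
```

The saving comes from storing one entry per document rather than one per
occurrence; the cost is that the word-level file has to exist, at least
temporarily, before it can be collapsed.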
Cheers,
Lachlan
--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg
University of Melbourne, Victoria, 3010 AUSTRALIA
CRICOS Provider Code 00116K