Re: [htdig-dev] stemming

Ciao guys,

   Again, sorry in advance for the mistakes I will certainly make. I would
love to learn more about this area, which is pretty new to me too, so
please be patient. :-)

On Mon, 2002-12-09 at 02:14, Lachlan Andrew wrote:
> - The format you describe sounds like a "half-inverted"
>   file -- listing locations *within* a document by word, but
>   listing *document* locations by document.  Is that
>   correct?

I think that was a flat representation of the index file, just an
example. Am I right, Neal?

In a simple scenario we would have (please bear in mind that this is a
very rough draft!):
- a word index (word id, stemmed/unstemmed flag, maybe language?)
- a document index (document id, info regarding the document, pretty
much as now: title, modification date, etc.)
- an inverted index (word id, document id, locations)

Words
-----
ID	Word		S/U	Lang
--	----		---	----
1	traveling	0	en
3	casa		0	it
12	travel		0	en
23	travels		0	en
45	pasta		0	it
60	travel		1	en
...

Documents
---------
ID	URL			Other info
--	---			----------
1	http://www.pippo.it/	.....
2	http://www.htdig.org/	...

Index
-----
Word ID	Doc ID	Locations and related info (position and markup)
-------	------	------------------------------------------------
1	2	1 Value_location 3 Value_location

Value_location is the value given to each location of the word.

Am I right?

Of course it's just an example ... :-)
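
To make the three tables above a bit more concrete, here is a minimal
Python sketch of how they could fit together. It is only an illustration
and not ht://Dig code: the names words, documents, inverted and lookup
are hypothetical, and I am assuming that each index row holds pairs of
(position, Value_location).

```python
from collections import defaultdict

# Word index: word id -> (word, stemmed flag, language).
# Rows are copied from the example "Words" table.
words = {
    1: ("traveling", 0, "en"),
    3: ("casa", 0, "it"),
    12: ("travel", 0, "en"),
    23: ("travels", 0, "en"),
    45: ("pasta", 0, "it"),
    60: ("travel", 1, "en"),
}

# Document index: document id -> URL (other info omitted here).
documents = {
    1: "http://www.pippo.it/",
    2: "http://www.htdig.org/",
}

# Inverted index: word id -> {document id -> [(position, value), ...]}.
# Assumption: the row "1 2 1 Value_location 3 Value_location" means that
# word 1 occurs in document 2 at positions 1 and 3.
inverted = defaultdict(dict)
inverted[1][2] = [(1, "Value_location"), (3, "Value_location")]


def lookup(word, lang, stemmed=0):
    """Return (url, locations) pairs for every document containing the word."""
    hits = []
    for word_id, entry in words.items():
        if entry == (word, stemmed, lang):
            for doc_id, locations in inverted.get(word_id, {}).items():
                hits.append((documents[doc_id], locations))
    return hits


print(lookup("traveling", "en"))
```

Keeping the stemmed flag and the language in the word index means that
"travel" as a literal word (ID 12) and "travel" as a stem (ID 60) get
separate posting lists, so a search can choose which one to consult.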

Any comments about the language?

> - With stemming in general, what is done about negating
>   affixes?  If I searched for 'mercy', I wouldn't want
>   results about 'merciless' (although I would want results
>   about 'merciful').

Good point. Are there any plans to handle negated words too?
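
Just to illustrate the problem, here is a tiny Python sketch. It is
not a real stemmer and not something ht://Dig does today: the suffix
lists and the names naive_stem, stem_with_polarity and matches are all
made up for this example, and a real implementation would need proper
per-language rules.

```python
# Hypothetical suffix lists, just enough for the 'mercy' example.
SUFFIXES = ("ful", "less")
NEGATING_SUFFIXES = ("less",)


def naive_stem(word):
    """Strip one known suffix and undo the y -> i spelling change."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if stem.endswith("i"):
                stem = stem[:-1] + "y"  # "merci" -> "mercy"
            return stem
    return word


def stem_with_polarity(word):
    """Return (stem, negated) so negated forms can be kept apart."""
    negated = word.endswith(NEGATING_SUFFIXES)
    return naive_stem(word), negated


def matches(query, candidate):
    """True if both words share a stem and the same polarity."""
    return stem_with_polarity(query) == stem_with_polarity(candidate)


print(matches("mercy", "merciful"))
print(matches("mercy", "merciless"))
```

With the polarity stored next to the stem, 'merciful' still matches a
search for 'mercy' while 'merciless' does not.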

Ciao ciao
-Gabriele
-- 
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ;