Re: First attempt: RE: [Omseek-devel] Omega terms selection

On Wed, Aug 22, 2001 at 04:33:46PM +0100, Sam Liddicott wrote:

> > The big downside to the above implementation is that the position
> > lists for the terms will be wrong, because of the way the terms from
> > the two different lists are merged back in: you'll get all the
> > capitalised ones (B.i) first, then the stemmed ones (B.ii). 
> 
> But they'll still have the correct "position" value, right?

Actually, they sort of will. Each time you use OmTermlistAddNode, it
starts numbering the positions of the added terms at 1. So the B.ii
terms will all be fine, but the B.i terms will be numbered only among
themselves, so they will all have very early positions. That may
cause some issues.

> > This means that a phrase like "Mary had" won't work properly if it's 
> > converted to  a term list of (mary) [had] and then turned into a
> > query using a positional operator.
> 
> Is this because omseek doesn't support two separate terms having the same
> "position"?

I don't think there's any such restriction. That's actually a bad
example, because (mary) and [had] will have positions 1 and 2
respectively. This is misleading, though. Consider:

	Mike told me that Sally was in the Congo

->	(mike:1) (sally:2) (congo:3)
	[mike:1] [told:2] [me:3] [that:4] [sally:5] [was:6] [in:7]
	[the:8] [congo:9]

At least, I think so. So a phrase search for 'mike sally' with window 3
will match on the capitalised terms (positions 1 and 2), even though it
probably shouldn't, because in the actual text the two words are at
positions 1 and 5, which is further apart than that.
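
To make the numbers concrete, here is a minimal sketch of the two term
lists from the example and a toy window check. Nothing here is omseek
API: TermPositions, within_window, capitalised_list and stemmed_list
are hypothetical names, and the window rule (second term after the
first, fewer than `window` positions later) is an assumed
simplification of what a positional operator does.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a term list: term -> position.
typedef std::map<std::string, int> TermPositions;

// Assumed window rule: b must come after a, fewer than `window`
// positions later. Real positional operators may differ in detail.
bool within_window(const TermPositions &list, const std::string &a,
                   const std::string &b, int window) {
    assert(window > 0);
    TermPositions::const_iterator ia = list.find(a);
    TermPositions::const_iterator ib = list.find(b);
    if (ia == list.end() || ib == list.end()) return false;
    int gap = ib->second - ia->second;
    return gap > 0 && gap < window;
}

// (mike:1) (sally:2) (congo:3): numbered only among themselves.
TermPositions capitalised_list() {
    TermPositions t;
    t["mike"] = 1; t["sally"] = 2; t["congo"] = 3;
    return t;
}

// [mike:1] ... [congo:9]: numbered by position in the text.
// Terms are written unstemmed, as in the example above.
TermPositions stemmed_list() {
    TermPositions t;
    t["mike"] = 1; t["told"] = 2; t["me"] = 3; t["that"] = 4;
    t["sally"] = 5; t["was"] = 6; t["in"] = 7; t["the"] = 8;
    t["congo"] = 9;
    return t;
}
```

With these definitions, within_window(capitalised_list(), "mike",
"sally", 3) is true, while the same call on stemmed_list() is false,
which is the mismatch described above.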

It's a bit of a judgement call. Again, you should be able to design
your query construction to avoid that.

> ? t
> ? omindex_sam.cc
> Index: omindex.cc
> ===================================================================
> RCS file: /cvsroot/omseek/om-examples/omega/omindex.cc,v
> retrieving revision 1.28
> diff -u -r1.28 omindex.cc
> --- omindex.cc	2001/07/30 13:22:42	1.28
> +++ omindex.cc	2001/08/22 15:32:56
> @@ -150,6 +150,10 @@
>  	if (k == string::npos || !isalnum(s[k])) j = k;
>  	om_termname term = s.substr(i, j - i);
>  	lowercase_term(term);
> +	if (isupper(s[i]) || isdigit(s[i])) {
> +	  doc.add_posting(term, pos);
> +	}
> +
>          term = stemmer.stem_word(term);
>  	doc.add_posting(term, pos++);
>  	i = j + 1;

That's good, yes. :)
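
For anyone reading along, a rough standalone sketch of what the patched
loop does: a word that starts with a capital or a digit gets a posting
for its lowercased, unstemmed form at the same position as the stemmed
form. This is not the omindex code. Posting, toy_stem and index_words
are hypothetical names, and toy_stem (strip one trailing 's' from words
longer than three letters) only stands in for the real stemmer.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical posting record: (term, position).
typedef std::pair<std::string, int> Posting;

// Toy stand-in for the real stemmer, for illustration only.
std::string toy_stem(const std::string &term) {
    if (term.size() > 3 && term[term.size() - 1] == 's')
        return term.substr(0, term.size() - 1);
    return term;
}

std::vector<Posting> index_words(const std::string &s) {
    std::vector<Posting> postings;
    int pos = 1;
    std::string::size_type i = 0;
    while (i < s.size()) {
        // Skip separators, then find the end of the next word.
        while (i < s.size() &&
               !std::isalnum(static_cast<unsigned char>(s[i]))) ++i;
        if (i == s.size()) break;
        std::string::size_type j = i;
        while (j < s.size() &&
               std::isalnum(static_cast<unsigned char>(s[j]))) ++j;

        std::string term = s.substr(i, j - i);
        for (std::string::size_type k = 0; k < term.size(); ++k)
            term[k] = static_cast<char>(
                std::tolower(static_cast<unsigned char>(term[k])));

        // As in the patch: capitalised or numeric words also get an
        // unstemmed posting, at the same position.
        unsigned char first = static_cast<unsigned char>(s[i]);
        if (std::isupper(first) || std::isdigit(first))
            postings.push_back(Posting(term, pos));

        postings.push_back(Posting(toy_stem(term), pos++));
        i = j;
    }
    return postings;
}
```

Under those assumptions, index_words("Paris has rivers") yields
(paris,1) (pari,1) (has,2) (river,3), so the unstemmed and stemmed
forms share one position sequence at index time.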

> Now I'm looking at indexgraphs.

They're fun. Someone should write a dia plugin for them or something,
because they give me headaches to read :)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                            zap.tartarus.org
  ja...@ta...                                    www.footlights.org