Re: [PyIndexer] Thoughts on MySQL Implementation
From: Casey D. <c.d...@nl...> - 2001-12-17 14:05:13
On Monday 17 December 2001 07:57 am, Chris Withers allegedly wrote:
[snip load data stuff]
> Well, my reluctance on this is that it needs a temporary file. Where do we
> put this temporary file? How do we know we're going to be allowed to write
> to the filesystem?

Yup, other than using mktemp or creating some sort of var directory in the
install, dunno. That said, I don't think it's too much to ask for a program
to have write access to the temp dir... 8^)

I think you could potentially see a big speed improvement, and even more so
if the MySQL server is on another box...

> That said, the only other option is to build a big
> INSERT INTO tbl_name VALUES (expression,...),(...),...
>
> ...and then we have to worry about max. SQL length I guess?
> Anyone know what the maximum length of SQL you can shove down a single
> c.execute() is?

I dunno. Perhaps the C API would allow you to do a LOAD DATA from an
in-memory data structure? Just a thought, we can't be the only ones trying
to stuff the proverbial goose here 8^)

> > At the search end, it seems to me that app side
> > processing will be best for positional matches, such
> > as for phrases.
>
> Can you elaborate?

My thought was that you would treat the query just like A AND B AND C on the
SQL side of things, but bring word-position information back with the
results, so that you could determine on the application side which results
actually satisfy the phrase match. I think there will be relatively few
comparisons there for any meaningful search terms. I just think that trying
to do that type of vertical comparison on the SQL side will be a pain.

Also, how are you planning to deal with stop words in phrase searches? I
notice you have included a previous-word reference in the index. I'm
assuming stop words are thrown out of both the word index and the search
words, correct?

> > I agree that storing a document count for each word
> > could help with optimizing since you could start with
> > the smallest dataset first and prune it from there.
>
> Does this still hold true when you're OR'ing terms together?

For simplicity, I would just treat each part of the OR as a separate query
and combine the results on the application side. I think ORs are going to be
expensive no matter what you do, so you might as well keep it simple. So I
guess that would be a no 8^)

> I prefer to not require the BTrees module, as it's not part of the standard
> python library, but if needs must ;-)

I hear you. Just thinking out loud.

> Thanks :-) Once I get the initial implementation and scalability testing
> package finalized, maybe you guys could try some tweakage?

What are friends for?

[snip full-text]
> Yeah, sadly doesn't do phrase matching :-S

8^(

> We could use this and do the usual cheap hack that ZCatalog does, matching
> the phrase "x y z" simply gets turned into "x" AND "y" AND "z", but that
> won't be good enough for the specific application I need the indexer for
> :-S

Bleah. Real phrase matching is a definite requirement for me too. I wonder
if MySQL stores any positional info in its full-text index that you could
get your hands on... I'd imagine it is using a table internally to store the
index...

/---------------------------------------------------\
 Casey Duncan, Sr. Web Developer
 National Legal Aid and Defender Association
 c.d...@nl...
\---------------------------------------------------/
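
A rough sketch of the temp-file LOAD DATA idea discussed above, assuming a
DB-API cursor (e.g. from MySQLdb) and a made-up word_index(document_id,
word_id, position) table; LOCAL lets the client ship the file even when the
MySQL server is on another box, provided both client and server allow
LOCAL INFILE:

    import os
    import tempfile

    def bulk_load(cursor, rows):
        # rows: iterable of (document_id, word_id, position) tuples.
        # Write them as tab-separated text, then let MySQL slurp the whole
        # file in one statement instead of issuing one INSERT per row.
        fd, path = tempfile.mkstemp(suffix=".tsv")
        try:
            with os.fdopen(fd, "w") as f:
                for document_id, word_id, position in rows:
                    f.write("%d\t%d\t%d\n" % (document_id, word_id, position))
            # LOAD DATA's default field/line terminators are tab and newline,
            # matching the format written above. Table and column names here
            # are illustrative, not PyIndexer's actual schema.
            cursor.execute(
                "LOAD DATA LOCAL INFILE %s INTO TABLE word_index "
                "(document_id, word_id, position)",
                (path,))
        finally:
            os.remove(path)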
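
And a minimal sketch of the app-side phrase check: SQL narrows things down to
documents containing all of the words (the A AND B AND C query), and the
application verifies that the positions line up consecutively. The shape of
the positions mapping, and the assumption that stop words have already been
stripped from both the index and the query, are illustrative only:

    def is_phrase_match(positions, phrase_words):
        # positions: dict mapping each query word to the set of positions it
        # occupies in one candidate document (stop words assumed already
        # removed from both the index and phrase_words).
        first_word = phrase_words[0]
        for start in positions[first_word]:
            if all(start + offset in positions[word]
                   for offset, word in enumerate(phrase_words[1:], 1)):
                return True
        return False

    # e.g. {"quick": {4}, "brown": {5, 17}, "fox": {6}} matches the phrase
    # "quick brown fox" because positions 4, 5, 6 are consecutive.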
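
Likewise, treating each branch of an OR as its own query and merging on the
application side could look something like this (run_word_query is an assumed
helper returning a set of document ids for one term):

    def search_any(run_word_query, words):
        # Union the per-term result sets; each term is a separate SQL query.
        results = set()
        for word in words:
            results |= run_word_query(word)
        return results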