Re: [PyIndexer] Thoughts on MySQL Implementation

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

On Mon, 17 Dec 2001 at 09:08:06 -0500, Casey Duncan wrote:

[ LOAD DATA and temporary files ]

> On Monday 17 December 2001 07:57 am, Chris Withers allegedly wrote:
> [snip load data stuff]
> > Well, my reluctance on this is that it needs a temporary file. Where do we
> > put this temporary file? How do we know we're going to be allowed to write
> > to the filesystem?
> 
> Yup, other than using mktemp or creating some sort or var directory in the 
> install, dunno. That said, I don't think it's too much to ask for a program 
> to have write access to the temp dir... 8^)

I think you can use a FIFO with LOAD DATA LOCAL INFILE... Anyone got
experience with this? Probably something for the mysql list ;-)

[ snip separating app and db server ]

A good idea particularly if the app is handling some of the grunt 
work :-)

TCP/IP comms between app and server may be slower than communication 
using a unix socket, though.

> > That said, the only other option is to build a big
> > INSERT INTO tbl_name VALUES (expression,...),(...),...
> >
> > ...and then we have to worry about max. sql length I guess?
> > Anyone know what the maximum length of SQL you can shove down a single
> > c.execute() is?

Whatever you set it to ;-)

The buffer starts off at net_buffer_length, and grows to a maximum of
max_allowed_packet (default 16MB, which should be enough ;-). See the
manual entries for SHOW VARIABLES.

[ app-side processing for positional matches ]

> > Can you elaborate?
> 
> My thought was that you would treat the query just like A and B and C on the 
> SQL side of things. But you would bring in word position information with the 
> queries, so that you could on the application-side determine which queries 
> actually statisfy the phrase match. I think there will be relatively few 
> comparisons there for any meaningful search terms.
> 
> I just think that trying to do that type of vertical comparison on the 
> SQL-side will be a pain.

IIRC a search for 'windows 2000' on the test data Chris has been using has
~ 10K rows for each of 'windows' and '2000'. Doing that processing
app-side would be costly (first have to match doc ids, then compare each
in O(n**2)), whereas using the previous word id in the textindex when you 
already know what the word must be, should be cheaper.

> Also, how are you planning to deal with stop-words in phrase searches? I 
> notice you have included a previous word reference in the index. I'm assuming 
> stop words are thrown out of both the word index and the search words, 
> correct?

The same splitter is applied to source docs and search string.

> > > I agree that storing a document count for each word
> > > could help with optimizing since you could start with
> > > the smallest dataset first and prune it from there.
> >
> > Does this still hold true when you're OR'ign terms together?
> 
> For simplicity, I would just treat each part of the OR as a separate query 
> and combine the results on the application side. I think ORs are going to be 
> expensive no matter what you do, so you might as well keep is simple.

Fully agree. I'm not sure what the other DBMS's do with ORs in the
WHERE, but this looks like the best way of doing it all-round.

[ MySQL's FULLTEXT index ]

> Bleah. Real phrase matching is a definite requirement for me too. I wonder if 
> MySQL stores any positional info in its full-text index that you could get 
> your hands on... I'd imagine it is using a table internally to store the 
> index...

No, MySQL doesn't store the position; it uses relative occurrences only
in determining relevance, and does so at the row level.

Cheers

-- Marcus