Re: [PyIndexer] Thoughts on MySQL Implementation
From: Casey D. <c.d...@nl...> - 2001-12-17 14:05:13
On Monday 17 December 2001 07:57 am, Chris Withers allegedly wrote:
[snip load data stuff]
> Well, my reluctance on this is that it needs a temporary file. Where do we
> put this temporary file? How do we know we're going to be allowed to write
> to the filesystem?

Yup, other than using mktemp or creating some sort of var directory in the
install, dunno. That said, I don't think it's too much to ask for a program
to have write access to the temp dir... 8^)

I think you could potentially see a big speed improvement, and even more so
if the MySQL server is on another box...

> That said, the only other option is to build a big
> INSERT INTO tbl_name VALUES (expression,...),(...),...
>
> ...and then we have to worry about max. SQL length I guess?
> Anyone know what the maximum length of SQL you can shove down a single
> c.execute() is?

I dunno. Perhaps the C API would allow you to do a LOAD DATA from an
in-memory data structure? Just a thought, we can't be the only ones trying
to stuff the proverbial goose here 8^)

> > At the search end, it seems to me that app side
> > processing will be best for positional matches, such
> > as for phrases.
>
> Can you elaborate?

My thought was that you would treat the query just like A AND B AND C on the
SQL side of things, but bring word-position information back with the
results, so that you could determine on the application side which results
actually satisfy the phrase match. I think there will be relatively few
comparisons there for any meaningful search terms. I just think that trying
to do that type of vertical comparison on the SQL side will be a pain.

Also, how are you planning to deal with stop words in phrase searches? I
notice you have included a previous-word reference in the index. I'm
assuming stop words are thrown out of both the word index and the search
words, correct?

> > I agree that storing a document count for each word
> > could help with optimizing since you could start with
> > the smallest dataset first and prune it from there.
>
> Does this still hold true when you're OR'ing terms together?

For simplicity, I would just treat each part of the OR as a separate query
and combine the results on the application side. I think ORs are going to be
expensive no matter what you do, so you might as well keep it simple. So I
guess that would be a no 8^)

> I prefer to not require the BTrees module, as it's not part of the standard
> python library, but if needs must ;-)

I hear you. Just thinking out loud.

> Thanks :-) Once I get the initial implementation and scalability testing
> package finalized, maybe you guys could try some tweakage?

What are friends for?

[snip full-text]
> Yeah, sadly doesn't do phrase matching :-S

8^(

> We could use this and do the usual cheap hack that ZCatalog does, matching
> the phrase "x y z" simply gets turned into "x" AND "y" AND "z", but that
> won't be good enough for the specific application I need the indexer for
> :-S

Bleah. Real phrase matching is a definite requirement for me too. I wonder
if MySQL stores any positional info in its full-text index that you could
get your hands on... I'd imagine it is using a table internally to store the
index...

/---------------------------------------------------\
 Casey Duncan, Sr. Web Developer
 National Legal Aid and Defender Association
 c.d...@nl...
\---------------------------------------------------/
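
A rough sketch of the temp-file LOAD DATA idea discussed above, assuming a
DB-API cursor (e.g. from MySQLdb) and a made-up word_index(document_id,
word_id, position) table; LOCAL lets the client ship the file even when the
MySQL server is on another box, provided both client and server allow
LOCAL INFILE:

    import os
    import tempfile

    def bulk_load(cursor, rows):
        # rows: iterable of (document_id, word_id, position) tuples.
        # Write them as tab-separated text, then let MySQL slurp the whole
        # file in one statement instead of issuing one INSERT per row.
        fd, path = tempfile.mkstemp(suffix=".tsv")
        try:
            with os.fdopen(fd, "w") as f:
                for document_id, word_id, position in rows:
                    f.write("%d\t%d\t%d\n" % (document_id, word_id, position))
            # LOAD DATA's default field/line terminators are tab and newline,
            # matching the format written above. Table and column names here
            # are illustrative, not PyIndexer's actual schema.
            cursor.execute(
                "LOAD DATA LOCAL INFILE %s INTO TABLE word_index "
                "(document_id, word_id, position)",
                (path,))
        finally:
            os.remove(path)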
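
And a minimal sketch of the app-side phrase check: SQL narrows things down to
documents containing all of the words (the A AND B AND C query), and the
application verifies that the positions line up consecutively. The shape of
the positions mapping, and the assumption that stop words have already been
stripped from both the index and the query, are illustrative only:

    def is_phrase_match(positions, phrase_words):
        # positions: dict mapping each query word to the set of positions it
        # occupies in one candidate document (stop words assumed already
        # removed from both the index and phrase_words).
        first_word = phrase_words[0]
        for start in positions[first_word]:
            if all(start + offset in positions[word]
                   for offset, word in enumerate(phrase_words[1:], 1)):
                return True
        return False

    # e.g. {"quick": {4}, "brown": {5, 17}, "fox": {6}} matches the phrase
    # "quick brown fox" because positions 4, 5, 6 are consecutive.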
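
Likewise, treating each branch of an OR as its own query and merging on the
application side could look something like this (run_word_query is an assumed
helper returning a set of document ids for one term):

    def search_any(run_word_query, words):
        # Union the per-term result sets; each term is a separate SQL query.
        results = set()
        for word in words:
            results |= run_word_query(word)
        return results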