Re: [Pyds-dev] Re: [PyCS-devel] big cvs commit
From: Phillip P. <pp...@my...> - 2003-10-14 21:28:36
> > Sound OK?
>
> Depends. :-)
>
> I still have the problem that PyDS creates much more than just posts.
> What about that stuff - I would like to search that stuff, too. So
> seeding would require to push in much more than just my 1400 posts ...

1400 posts?! It would be interesting to see some benchmark results for
your blog -- search for something common, and go right to the last page
of the results, then run ApacheBench on that URL. I've benchmarked the
raw database access (the search code can run through 450 posts [from
Second p0st] about 45 times a second) but haven't tried it since
plugging it in as /system/search.py.

> The problem is, your solution only searches what is in the database. And
> we need to think about what to do with stuff that is upstreamed, but not
> mirrored to the search database. For example what about text stuff
> people just put into their upstreaming spool? This is upstreamed (both
> PyDS and Radio have this feature), but isn't generated. And so it isn't
> mirrored.
>
> Maybe files upstreamed should be automatically mirrored, too? Hmm. That
> would produce duplicates on weblog stuff ...
>
> So, no, I am not fully satisfied ;-)

Hmm, right. Part of my motivation for doing it this way is that it lets
you filter out junk (page templates, blogrolls etc) that would otherwise
screw up searches. So right now it only really searches blog posts and
stories, if you send them specifically (I only have bzero sending
posts). I don't really want to search everything (e.g. the 500K HTML
file that contains the first run of the Blogging Ecosystem), just stuff
I've written.

If you _do_ want to search everything, it might make sense to run the
query both through the new search code and through Swish, and combine
the results. When you index with Swish, if the user has sent in some
posts to mirror, get Swish to ignore all URLs that have been sent via
mirrorPosts() but index everything else. Then combine the results later
on...

Does that sound better?
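For illustration only, a rough sketch of the "combine the results" step might look something like the code below. Nothing here is taken from the PyCS code: the function name combine_results, the (url, title) hit tuples and the mirrored_urls set are all assumptions made up for the example. It also filters mirrored URLs at merge time, rather than getting Swish to skip them at index time as suggested above, which is simpler to show but does less work up front.

```python
def combine_results(db_hits, swish_hits, mirrored_urls):
    """Merge hits from the mirrored-post search with hits from a full-text index.

    All names and data shapes are hypothetical:
      db_hits       -- list of (url, title) tuples from the post database search
      swish_hits    -- list of (url, title) tuples from the full-text index
      mirrored_urls -- set of URLs that were sent in as mirrored posts
    """
    # Database hits come first: they are the posts the user chose to mirror.
    combined = list(db_hits)
    seen = {url for url, _title in db_hits}

    for url, title in swish_hits:
        # Drop anything the post database already covers, so a weblog
        # page does not show up twice in the combined list.
        if url in mirrored_urls or url in seen:
            continue
        seen.add(url)
        combined.append((url, title))

    return combined
```

With this shape, an upstreamed text file that was never mirrored still appears (via the full-text index), while a post that exists in both places is listed once.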
Cheers,
Phil :)