Re: [Pyds-dev] Re: [PyCS-devel] big cvs commit

> > Sound OK?
> 
> Depends. :-)
> 
> I still have the problem that PyDS creates much more than just posts.
> What about that stuff - I would like to search that stuff, too. So
> seeding would require pushing in much more than just my 1400 posts ...

1400 posts?!

It would be interesting to see some benchmark results for your blog --
search for something common, and go right to the last page of the
results, then run ApacheBench on that URL.  I've benchmarked the raw
database access (the search code can run through 450 posts [from
Second p0st] about 45 times a second) but haven't tried it since
plugging it in as /system/search.py.
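
For the raw-access side of that, a minimal timing sketch in Python might look like the following. It is only an illustration: search_posts() is a hypothetical stand-in (a plain substring match), not the real search code, and the 450 posts are made-up strings. It measures full passes per second over the post set and says nothing about HTTP overhead.

```python
import time

def search_posts(posts, term):
    """Hypothetical stand-in for the real search: case-insensitive substring match."""
    needle = term.lower()
    return [p for p in posts if needle in p.lower()]

def searches_per_second(posts, term, runs=20):
    """Time `runs` full passes over `posts` and return passes per second."""
    start = time.perf_counter()
    for _ in range(runs):
        search_posts(posts, term)
    elapsed = time.perf_counter() - start
    return runs / elapsed if elapsed > 0 else float("inf")

# Made-up data: 450 short posts, about the size of the set mentioned above.
posts = ["post %d about python and weblogs" % i for i in range(450)]
rate = searches_per_second(posts, "python")
print("%.1f full searches per second" % rate)
```

Swapping the real search function in for search_posts() would give a number to compare against whatever the HTTP benchmark reports.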

> The problem is, your solution only searches what is in the database. And
> we need to think about what to do with stuff that is upstreamed, but not
> mirrored to the search database. For example what about text stuff
> people just put into their upstreaming spool? This is upstreamed (both
> PyDS and Radio have this feature), but isn't generated. And so it isn't
> mirrored.
> 
> Maybe files upstreamed should be automatically mirrored, too? Hmm. That
> would produce duplicates on weblog stuff ...
> 
> So, no, I am not fully satisfied ;-)

Hmm, right.  Part of my motivation for doing it this way is that it
lets you filter out junk (page templates, blogrolls, etc.) that would
otherwise screw up searches.  So right now it only really searches
blog posts and stories, if you send them specifically (I only have
bzero sending posts).  I don't really want to search everything
(e.g. the 500K HTML file that contains the first run of the Blogging
Ecosystem), just stuff I've written.

If you _do_ want to search everything, it might make sense to run the
query both through the new search code and through Swish, and combine
the results.  When you index with Swish, if the user has sent in some
posts to mirror, get Swish to ignore all URLs that have been sent via
mirrorPosts() but index everything else.  Then combine the results
later on...
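
Roughly, the two halves of that could be sketched in Python as below. All names and data here are hypothetical: urls_for_swish() only works out which URLs are left for Swish to index (how that list is actually handed to Swish is not shown), and combine_results() assumes both searches return (url, title) pairs. Ranking across the two result sets is not addressed; database hits simply come first.

```python
def urls_for_swish(all_urls, mirrored_urls):
    """Index-time filter: everything except URLs already sent for mirroring."""
    mirrored = set(mirrored_urls)
    return [u for u in all_urls if u not in mirrored]

def combine_results(db_hits, swish_hits):
    """Query-time merge: database hits first, then Swish hits not seen yet."""
    seen = set()
    combined = []
    for url, title in list(db_hits) + list(swish_hits):
        if url not in seen:
            seen.add(url)
            combined.append((url, title))
    return combined

# Made-up example data (hypothetical URLs and titles).
all_urls = ["/2003/01/post1.html", "/stories/about.html", "/notes.txt"]
mirrored = ["/2003/01/post1.html", "/stories/about.html"]
to_index = urls_for_swish(all_urls, mirrored)

db_hits = [("/2003/01/post1.html", "First post")]
swish_hits = [("/notes.txt", "Notes"), ("/2003/01/post1.html", "First post")]
results = combine_results(db_hits, swish_hits)
print(to_index)
print(results)
```

The duplicate check in combine_results() is just a safety net in case a mirrored URL ends up in the Swish index anyway.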

Does that sound better?

Cheers,
Phil :)