
#3 Keeping old docs in the scoop

Status: open
Owner: nobody
Priority: 5
Updated: 2002-04-24
Created: 2002-04-24
Private: No

I've scheduled sitescooper in my crontab to scoop up a
number of sites into Plucker files, and my script to do
so runs sitescooper as follows:

sitescooper -mplucker -outputtemplate -prctitle Site
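
For illustration, a crontab entry along these lines runs such a script once a day (the script name and schedule here are hypothetical, not the exact setup):

# run the scoop script daily at 05:30
30 5 * * * $HOME/bin/scoop-sites.sh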

This works fine for sites that update once a day or less often, and which I have time to check in Plucker once a day. Scoops that contain new articles get updated and show up as such in Plucker (when I sort the pdb's by date).

However, I don't always have the time to read everything every day, and on the other hand I'd like to run sitescooper more often for some news sites, so that no matter what time I synchronize my Palm, the news scoops would be more up to date than they are now (worst case, they're still from yesterday because the scheduled sitescooper run hasn't happened yet).

However, if I schedule sitescooper more often, the Plucker pdb will not include the articles that were picked up by an earlier scoop, since they are already in the cache. I could bypass this with the -refresh option, but then some sites would produce gigantic pdb's (for example, The Register has a week's worth of links on the page I scoop, and I'd like at most two days in the scoop).

I could use -refresh -maxstories n, but I don't know beforehand how many stories a site produces, and I'd have to run sitescooper separately for each site to set the limit on a site-by-site basis. I could set SizeLimit in the site file, but that's not much better.
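
To make the duplication concrete, a per-site wrapper would have to look something like this (the site file names, the -site option usage, and the per-site limits are assumptions for illustration, not a tested recipe):

#!/bin/sh
# each site needs its own -maxstories value, guessed in advance
sitescooper -mplucker -refresh -maxstories 30 -site the_register.site
sitescooper -mplucker -refresh -maxstories 60 -site some_news.site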

What I'd like to suggest is a -maxage parameter which, perhaps combined with -refresh, would pull pages from the cache into the scoop as long as they are no older than the specified limit. This would be perfect: I could run sitescooper every hour, limit the scoop to stories less than a day or two old, and get fresh stuff onto the Palm any time I sync.
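
The cutoff I have in mind is roughly what find's -mtime test expresses. Assuming the cached pages live under ~/.sitescooper/cache (the actual location depends on the installation), this would list the pages that a hypothetical -maxage 2 should still pull into the scoop:

# cached pages modified less than two days ago
find ~/.sitescooper/cache -type f -mtime -2

In other words, -refresh -maxage 2 would re-include cached pages newer than two days, instead of excluding everything already seen.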

