From: J. K. <ja...@ka...> - 2004-03-26 06:40:40
Hello there.
> I think we should add the following HTTP header fields:
>
> Expires:
> it may be helpful for the master: do not index the document earlier
> than its Expires date.
>
> Last-Modified:
> it might be good for the indexer:
> if (date_now>Expires: and Last-Modified:<Expires:)
> {
> do not index this document
> }
>
> or
>
> if (date_of_index<=Last-Modified:)
> {
> do not index this document
> }
I supported the whole idea of using this information in the HTTP headers.
But after chatting with Eric on the #sprawler channel on irc.freenode.net
the other day, he informed me that the "Last-Modified" date isn't always
accurate. That doesn't mean we can't use it, it just means we can't rely
on it. But there is some other information we can get: the size of the
document, which you may notice is what Google stores for pages it has
cached (for some reason, non-cached pages don't have this information in
Google's results). Now, I do understand the pages may change before we
can reindex them, but do they tend to change so much that the size data
would be far off? Showing the size on a results page gives users an idea
of how much content is there, which can be very useful. It's also good
to know when you have a dialup connection. :)
Here's a tweak for you:
I see that we have a line in Client.pm that says:

    my @docheader=LWP::Simple::head($document);

Also added was:

    $self->{CONTENT_TYPE}=undef;

Well, we can add this line right after the one that gets the header:

    $self->{SIZE_IN_BYTES}=$docheader[1];

Of course, SIZE_IN_BYTES would need to be initialized first, the same way
CONTENT_TYPE is, but you get the idea. If I'm not mistaken, the other
values returned from the header include modification time and expiration
date (in that order, in the list the function returns). We can take that
info as well, but what we do with it is another story.
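
To make the idea concrete, here is a rough sketch of pulling all of the
head() values out by name instead of by index. It assumes the documented
list-context return of LWP::Simple::head(), which is (content type,
document length, modified time, expires, server), or an empty list when
the request fails. The helper name header_fields and every field name
other than CONTENT_TYPE and SIZE_IN_BYTES are made up for illustration
and are not in Client.pm.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Client.pm: give names to the values
# that LWP::Simple::head() returns in list context. Assumed order:
# content type, document length, modified time, expires, server.
sub header_fields {
    my @docheader = @_;

    # head() returns an empty list when the request fails.
    return unless @docheader;

    my %fields;
    @fields{qw(CONTENT_TYPE SIZE_IN_BYTES LAST_MODIFIED EXPIRES SERVER)}
        = @docheader;

    # Any header the server did not send ends up as undef here,
    # so callers should check before using a value.
    return \%fields;
}

# Usage with a live request (needs network access, so left commented):
#   use LWP::Simple ();
#   my $fields = header_fields(LWP::Simple::head($document));
#   $self->{SIZE_IN_BYTES} = $fields->{SIZE_IN_BYTES} if $fields;
```

Keep in mind that servers can leave out Content-Length, so the size may
come back undefined, and the two dates come back as epoch seconds rather
than as strings.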
Anyway, just thought I'd throw that in. I've mentioned that any data we
can get from headers may be valuable, and size can also be good for
gathering statistics, such as a better idea of the average web page size
for our purposes. That will be good to know when we work out how much
hardware we'll need. And then there's the issue of caching web pages,
where the size of pages is definitely a factor.
On a completely unrelated note, I plan on commenting on the Sprawler map
within the next few days (this time I'll make sure of it, Eric.) So much
to do, so little time.
Thanks,
J.K.