Lately, we've observed enormous traffic spikes whenever our DSpace instance
(about 55,000 items) is being indexed by Google. This month that caused
over 600GB of traffic, which we'd of course like to avoid.
The main culprits are the various browse pages (browse-title, browse-date,
browse-author) because of the way they're constructed - with the
"top=somehandle" argument. This means that, in our case, tens of thousands
of browse pages can exist, and all are created and indexed by Google.
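As a rough back-of-envelope illustration of why the crawl load gets so large (the multiplier per index is an assumption on my part, not a figure from DSpace itself):

```python
# Rough estimate of how many distinct browse URLs a crawler can discover.
# Assumption (hypothetical, not from DSpace config): every item handle is
# a valid "top=<handle>" argument, so each browse index exposes on the
# order of one entry-point page per item.
items = 55_000
browse_indexes = 3  # browse-title, browse-date, browse-author

crawlable_pages = items * browse_indexes
print(crawlable_pages)  # 165000 distinct URLs under these assumptions
```

Even if the real multiplier is smaller, the crawlable URL space clearly scales with the item count rather than with the number of browse indexes.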
It seems to me that the way these pages are constructed could be
altered to prevent this kind of behaviour - by, for example, changing the
way the pages paginate. (Adding a robots.txt would be no use since 1. we
do want this content to be indexed and 2. the file itself would be
Tom De Mulder <tdm27@...> - Cambridge University Computing Service
New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 23/03/2005 : The Moon is Waxing Gibbous (82% of Full)