-
ia_igor committed revision 5465 to the Heritrix: Internet Archive Web Crawler SVN repository, changing 1 files.
2007-09-05 09:27:28 UTC in Heritrix: Internet Archive Web Crawler
-
ia_igor committed revision 4943 to the Heritrix: Internet Archive Web Crawler SVN repository, changing 1 files.
2007-02-27 02:54:54 UTC in Heritrix: Internet Archive Web Crawler
-
ia_igor committed revision 4940 to the Heritrix: Internet Archive Web Crawler SVN repository, changing 1 files.
2007-02-27 01:15:21 UTC in Heritrix: Internet Archive Web Crawler
-
The first regex has three capturing groups instead of two :(
Therefore,
^(?i)([^\?]+/)(\((?:S\(|)[0-9a-z]{24}\)(?:\)|)/)([^\?]+\.aspx.*)$
should be
^(?i)([^\?]+/)(?:\((?:S\(|)[0-9a-z]{24}\)(?:\)|)/)([^\?]+\.aspx.*)$
(The only change is the added ?: that makes the second group non-capturing.)
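A minimal sketch of the difference, using java.util.regex directly rather than any Heritrix class. The class name, the helper methods, and the example.com URL are hypothetical and only serve as illustration; the two patterns are copied from the comment above. It counts the capturing groups in each pattern and rebuilds a URL from the two groups of the corrected one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Heritrix.
class AspxSessionRegexDemo {
    // Original pattern: the session-ID path segment is a capturing group.
    static final String THREE_GROUPS =
        "^(?i)([^\\?]+/)(\\((?:S\\(|)[0-9a-z]{24}\\)(?:\\)|)/)([^\\?]+\\.aspx.*)$";

    // Corrected pattern: the session-ID path segment is non-capturing (?:...).
    static final String TWO_GROUPS =
        "^(?i)([^\\?]+/)(?:\\((?:S\\(|)[0-9a-z]{24}\\)(?:\\)|)/)([^\\?]+\\.aspx.*)$";

    // Number of capturing groups declared in a pattern.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Assumed usage: join groups 1 and 2 of the corrected pattern,
    // dropping the session-ID segment. Non-matching URLs are returned unchanged.
    static String stripSessionId(String url) {
        Matcher m = Pattern.compile(TWO_GROUPS).matcher(url);
        return m.matches() ? m.group(1) + m.group(2) : url;
    }

    public static void main(String[] args) {
        // Made-up URL with a 24-character session ID in the S(...) form.
        String url = "http://example.com/app/(S(abcdefghijklmnopqrstuvwx))/home.aspx?id=1";
        System.out.println(groupCount(THREE_GROUPS));
        System.out.println(groupCount(TWO_GROUPS));
        System.out.println(stripSessionId(url));
    }
}
```

With the original pattern, a replacement written for two groups would pick up the session-ID segment as group 2, which is presumably why the extra group matters here.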
2007-01-11 02:18:01 UTC in Heritrix: Internet Archive Web Crawler
-
Resources:
.NET 1.0/1.1
http://MySite.com/MyWebApplication/(XXXXXXXXXXX)/home.aspx
.NET 2.0
http://MySite.com/MyWebApplication/(A(XXXX)S(XXXX)F(XXXX))/home.aspx
Breaking it down:
1. A(XXXX): This is the Anonymous-ID. It is used to identify the (anonymous) user accessing your application. The string may or may not be encrypted, depending on your configuration settings in the...
2007-01-11 01:43:25 UTC in Heritrix: Internet Archive Web Crawler
-
I noticed that the following Archival URL syntax is not
supported or it is broken:
http://hostname/collection-name/url as in
http://web.archive.org/web/appels.com
It seems that the classic Wayback Machine (WM) redirects
to the latest known date. I am not sure if this is a
feature, but I think that we should implement/fix it.
2006-10-10 00:20:24 UTC in Web Archive Access Utilities
-
1a) It is not working for me in 1.8
2a) OK
3a) I was not clear on the masquerade setting. I thought
that this setting would override the user-agent; having the
user-agent in override seemed redundant. I see that UI
comments on these settings have been improved. The last part
of the original comment should be ignored.
2006-08-07 18:53:24 UTC in Heritrix: Internet Archive Web Crawler
-
It seems that we cannot override (per host) the user-agents
and masquerade fields. Any change at the per-host level is
reflected in the global settings.
Also, we should not have the user-agent (not
user-agents) and from fields in the overrides.
2006-08-03 19:50:33 UTC in Heritrix: Internet Archive Web Crawler
-
Our Mike started seeing this error in one of our weekly
crawls:
Time: Jul. 26, 2006 06:34:18 GMT
Level: SEVERE
Message:
Fatal exception in ToeThread #9:
Exception:
com.sleepycat.util.RuntimeExceptionWrapper
Cause: java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2502)
at...
2006-07-26 22:43:45 UTC in Heritrix: Internet Archive Web Crawler
-
ia_igor committed patchset 3883 of module ArchiveOpenCrawler to the Heritrix: Internet Archive Web Crawler CVS repository, changing 1 files.
2006-04-11 22:43:20 UTC in Heritrix: Internet Archive Web Crawler