See
http://groups.yahoo.com/group/archive-crawler/message/1961
Crawl 'archive.org' and see how things work across the
redirect to 'www.archive.org'. This works fine; we
have code to handle this case.
What we don't have is code to handle the inverse, going
from 'www.netarkivet.dk' to 'netarkivet.dk'.
TO REPRODUCE:
Crawl using the seed 'www.netarkivet.dk'. See what the
crawl log looks like:
2005-06-17T04:49:32.368Z 1 77 dns:www.netarkivet.dk P http://www.netarkivet.dk/ text/dns #001 20050617044931942+18 - -
2005-06-17T04:49:33.375Z 302 301 http://www.netarkivet.dk/robots.txt P http://www.netarkivet.dk/ text/html #001 20050617044932942+402 ESVJLA2FRUB6UDKGUQWRMONKMVJQR325 -
2005-06-17T04:49:35.804Z 302 291 http://www.netarkivet.dk/ - - text/html #001 20050617044935387+402 FJZLU5E256LIJ5NFHPXYMTWVX7E5CS3L 3t
... and then we crawl no more.
The site wants to redirect us to the domain minus the
'www', but we're ruling those pages as already seen.
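
One way such a collision could arise is sketched below. It assumes the already-seen check compares URIs after stripping a leading 'www.' from the host; that is an assumption made for illustration, since the report only says the pages are ruled as already seen. The names seen_key and Frontier are hypothetical and are not taken from the crawler's code. The sketch also does not model the existing special handling for the 'archive.org' to 'www.archive.org' direction.

```python
from urllib.parse import urlsplit


def seen_key(uri: str) -> str:
    """Hypothetical already-seen key: lowercase host with a leading 'www.' removed."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return f"{parts.scheme}://{host}{parts.path or '/'}"


class Frontier:
    """Toy frontier (hypothetical): queues a URI only if its key is new."""

    def __init__(self) -> None:
        self.seen = set()
        self.queue = []

    def schedule(self, uri: str) -> bool:
        key = seen_key(uri)
        if key in self.seen:
            # Same key as an earlier URI, so the URI is dropped as already seen.
            return False
        self.seen.add(key)
        self.queue.append(uri)
        return True


frontier = Frontier()
# The seed is scheduled and fetched; the server answers 302.
seed_ok = frontier.schedule("http://www.netarkivet.dk/")
# The redirect target differs only by the 'www.' prefix, so its key collides.
redirect_ok = frontier.schedule("http://netarkivet.dk/")
```

Under that assumption the seed is accepted, the redirect target is rejected, and the queue never receives 'http://netarkivet.dk/', which would match the crawl stopping after the 302.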
Karl Thiessen
None
1.6.0
Public
Date: 2007-03-14 00:55
Date: 2005-07-18 19:45
Date: 2005-06-24 01:33
Date: 2005-06-17 16:25
Date: 2005-06-17 15:57
Date: 2005-06-17 05:11
Date: 2005-06-17 05:08
| Field | Old Value | Date | By |
|---|---|---|---|
| artifact_group_id | None | 2005-09-23 18:02 | gojomo |
| status_id | Open | 2005-07-18 19:45 | karl-ia |
| close_date | 2005-06-17 05:08 | 2005-07-18 19:45 | karl-ia |
| assigned_to | nobody | 2005-06-17 05:11 | stack-sf |
| status_id | Closed | 2005-06-17 05:11 | stack-sf |
| status_id | Open | 2005-06-17 05:08 | stack-sf |
| close_date | - | 2005-06-17 05:08 | stack-sf |
| resolution_id | None | 2005-06-17 05:08 | stack-sf |