From: Umanda D. <abe...@gm...> - 2014-01-07 18:11:30
*Re: [Archive-access-discuss] [archive-crawler] Heritrix 3.1.0 and wayback questions <http://sourceforge.net/mailarchive/message.php?msg_id=31804713>*
From: Coram, Roger <Roger.Coram@bl...> - 2014-01-03 10:28

Sometimes, specifically with links added client-side, Wayback doesn't have the opportunity to rewrite correctly. The images being 'pushed' below are absolute paths, and it's possible that your browser is trying to load them relative to your own domain (you can check this via your browser's developer tools: you should be able to see what it's actually requesting, and any 404s). Rewriting links like this is an ongoing problem, but one being actively pursued. However, running Wayback in proxy mode should potentially fix this (provided the content is there).

Hi Coram Roger,

I saw your reply, and thank you for that. I tested this with the browser tools. All the requests are successful, with no 404s. I have not run Wayback in proxy mode before. Can you please point me to a link where I can get help with this?

Regards

On Thu, Jan 2, 2014 at 11:06 PM, Umanda Dikwatta <abe...@gm...> wrote:
> Hello,
>
> I have crawled http://www.cse.mrt.ac.lk/ and I'm trying to recreate it
> with Wayback 1.8, but the following JavaScript is in the HTML:
>
> <script type="text/javascript">
> RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
> RokStoriesImage.push('/images/stories/demo/rokstories/rs3.jpg');
> RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
> window.addEvent('domready', function() {
>   new RokStories('.feature-block', {
>     'startElement': 0, 'thumbsOpacity': 0.5, 'mousetype': 'click',
>     'autorun': 0, 'delay': 5000, 'startWidth': 615
>   });
> });
> </script>
> <div class="feature-block">
>   <div class="image-container">
>     <div class="image-full"></div>
>     <div class="image-small">
>       <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
>       <img src="/images/stories/demo/rokstories/rs3_thumb.jpg" class="feature-sub" alt="image" />
>       <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
>     </div>
>   </div>
>   <div class="desc-container">
>
> When the web site starts up, rs4.jpg is loaded into the image-full div
> block, but this is not working in Wayback. Is there a special reason for
> that? Please help me find the cause.
>
> Regards
>
>
> On Wed, Dec 18, 2013 at 5:31 AM, Noah Levitt <nl...@ar...> wrote:
>
>> Hello,
>>
>> Basically yeah that's what hops means, except the seed is hop=0, and
>> the links from seed are hop=1, I think.
>>
>> By "max-depth" do you mean the property maxPathDepth of
>> org.archive.modules.deciderules.TooManyPathSegmentsDecideRule? If so,
>> you have the right idea. "TooManyPathSegmentsDecideRule... Rule
>> REJECTs any CrawlURIs whose total number of path-segments (as
>> indicated by the count of '/' characters not including the first '//')
>> is over a given threshold."
>>
>> http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/deciderules/TooManyPathSegmentsDecideRule.html
>>
>> On "Problem2", the wayback issue, the wayback mailing list might be a
>> better place to ask.
>> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
>> You can cc this list if you want. Please include relevant information
>> about your wayback setup and the behavior you are seeing as precisely
>> as you can.
>>
>> Noah
>>
>>
>> On Sat, Dec 14, 2013 at 10:38 PM, Umanda Dikwatta <abe...@gm...> wrote:
>> >
>> > Hi Noah,
>> >
>> > Thank you so much for your reply. To get a clearer idea, I have
>> > explained what I understood here. Please tell me whether it is correct.
>> >
>> > Problem1
>> >
>> > Suppose we take http://www.mrt.ac.lk/web/ as a seed and specify
>> > max-hops=3 and max-depth=7.
>> >
>> > Does this mean http://www.mrt.ac.lk/web/ is hop=1, all the links in
>> > http://www.mrt.ac.lk/web/ have hop=2, and all the links inside those
>> > links have hop=3? Since max-hops=3, links inside these will not be
>> > crawled. Then what is max-depth? Is this the correct definition of hops?
>> >
>> > According to this definition of hops,
>> > http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/public/field/slideshow/ERU%202013%204.jpg
>> > is linked from http://www.mrt.ac.lk/web/ and is therefore at hop=2.
>> > But if we count the slashes, it has more than 7 (max-depth) slashes.
>> > So do the slashes indicate the max-depth? As far as I can see in my
>> > crawl log, URLs with >=7 slashes have not been crawled; only the other
>> > links have been crawled.
>> >
>> > Is this what you mean, Noah?
>> >
>> > Problem2
>> >
>> > I tried this with wayback 1.6 and wayback 1.8, but the issue with the
>> > duplicate content is still there. Is there any solution for this?
>> >
>> > Thank you and Regards
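
Two points raised in this thread can be checked with a few lines of Python. This is only a rough sketch under stated assumptions, not Wayback or Heritrix code. The first part shows how a browser resolves a path such as '/images/stories/demo/rokstories/rs4.jpg' when the page is served from a replay URL: the path is resolved against the replay host, not the archived site. The replay prefix (host, port, timestamp) is a made-up placeholder. The second part counts '/' characters as described in the Javadoc Noah quoted; `slash_count` and `too_many_path_segments` are hypothetical helper names, and the comparison ("over a given threshold") is a reading of that quoted sentence only.

```python
from urllib.parse import urljoin

# Hypothetical replay prefix: host, port and timestamp are placeholders,
# not values taken from the thread.
WAYBACK_PREFIX = "http://localhost:8080/wayback/20140102000000/"
ORIGINAL_PAGE = "http://www.cse.mrt.ac.lk/"
IMAGE_PATH = "/images/stories/demo/rokstories/rs4.jpg"


def browser_resolution(replay_page_url: str, path: str) -> str:
    """Resolve a path the way a browser does: against the page being viewed."""
    return urljoin(replay_page_url, path)


def archival_url(original_page_url: str, path: str) -> str:
    """Resolve against the original page, then prepend the replay prefix."""
    return WAYBACK_PREFIX + urljoin(original_page_url, path)


def slash_count(url: str) -> int:
    """Count '/' characters, not including the first '//'."""
    rest = url.split("//", 1)[-1]  # drop everything up to the first '//'
    return rest.count("/")


def too_many_path_segments(url: str, max_path_depth: int) -> bool:
    """True when the slash count is over the threshold (assumed semantics)."""
    return slash_count(url) > max_path_depth


if __name__ == "__main__":
    # What the browser asks for when the JavaScript-pushed path is left as-is.
    print(browser_resolution(WAYBACK_PREFIX + ORIGINAL_PAGE, IMAGE_PATH))
    # What a rewritten link would need to look like under the assumed prefix.
    print(archival_url(ORIGINAL_PAGE, IMAGE_PATH))

    deep = ("http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/"
            "public/field/slideshow/ERU%202013%204.jpg")
    print(slash_count("http://www.mrt.ac.lk/web/"), slash_count(deep))
    print(too_many_path_segments(deep, 7))
```

With the placeholder prefix, the unrewritten path resolves to the replay host's own /images/... location, which is the situation Roger describes. Under the counting rule as quoted, the slideshow image URL from Problem1 has 10 slashes and so is over a threshold of 7, while the seed has 2.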