From: Coram, R. <Rog...@bl...> - 2014-01-03 10:28:44
|
Sometimes, particularly with links added client-side by JavaScript, Wayback doesn't get the opportunity to rewrite them correctly. The images being 'pushed' below are absolute paths, and it's possible that your browser is trying to load them relative to your own domain. You can check this via your browser's developer tools: you should be able to see what it is actually requesting and any 404s.

Rewriting links like this is an ongoing problem, but one that is being actively pursued. In the meantime, running Wayback in proxy mode may fix this (provided the content is there).

From: arc...@ya... [mailto:arc...@ya...] On Behalf Of Umanda Dikwatta
Sent: 02 January 2014 17:36
To: arc...@ya...
Cc: arc...@li...
Subject: Re: [archive-crawler] Heritrix 3.1.0 and wayback questions

Hello,

I have crawled http://www.cse.mrt.ac.lk/ and I'm trying to recreate it from Wayback 1.8. But the following JavaScript is in the HTML.

    <script type="text/javascript">
    RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
    RokStoriesImage.push('/images/stories/demo/rokstories/rs3.jpg');
    RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
    window.addEvent('domready', function() {
        new RokStories('.feature-block', {
            'startElement': 0,
            'thumbsOpacity': 0.5,
            'mousetype': 'click',
            'autorun': 0,
            'delay': 5000,
            'startWidth': 615
        });
    });
    </script>
    <div class="feature-block">
      <div class="image-container">
        <div class="image-full"></div>
        <div class="image-small">
          <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
          <img src="/images/stories/demo/rokstories/rs3_thumb.jpg" class="feature-sub" alt="image" />
          <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
        </div>
      </div>
      <div class="desc-container">

When the web site starts up, rs4.jpg is loaded into the image-full div block. But this is not working in Wayback. Is there a particular reason for that? Please help me find it.

Regards

On Wed, Dec 18, 2013 at 5:31 AM, Noah Levitt <nl...@ar...> wrote:

Hello,

Basically, yeah, that's what hops means, except the seed is hop=0 and the links from the seed are hop=1, I think.

By "max-depth" do you mean the property maxPathDepth of org.archive.modules.deciderules.TooManyPathSegmentsDecideRule? If so, you have the right idea.

"TooManyPathSegmentsDecideRule... Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold."
http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/deciderules/TooManyPathSegmentsDecideRule.html

On "Problem2", the Wayback issue, the Wayback mailing list might be a better place to ask.
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
You can cc this list if you want. Please include relevant information about your Wayback setup and describe the behavior you are seeing as precisely as you can.

Noah

On Sat, Dec 14, 2013 at 10:38 PM, Umanda Dikwatta <abe...@gm...> wrote:
> Hi Noah,
>
> Thank you so much for your reply. To get a clearer idea, I have explained
> what I understood here. Please tell me, is it correct?
>
> Problem1
>
> Suppose we take http://www.mrt.ac.lk/web/ as a seed and specify
> max-hops=3 and max-depth=7.
>
> Does this mean http://www.mrt.ac.lk/web/ is hop=1, and all the links in
> http://www.mrt.ac.lk/web/ have hop=2? All the links inside those links
> have hop=3. Since max-hops=3, links inside these will not be crawled.
> Then what is the max-depth? Is this the correct definition of hops?
>
> According to this definition of hops,
> http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/public/field/slideshow/ERU%202013%204.jpg
> is linked from http://www.mrt.ac.lk/web/ and is therefore at hop=2. But
> if we count the slashes, it has more than 7 (max-depth) slashes.
> So do these slashes indicate the max-depth? As far as I can see in my
> crawl log, URLs with 7 or more slashes have not been crawled; only the
> other links have been.
>
> Is this what you mean, Noah?
>
> Problem2
>
> I tried this with Wayback 1.6 and Wayback 1.8, but the issue with the
> duplicate content is still there. Is there any solution for this?
>
> Thank you and Regards
|
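Two points from the thread above can be illustrated with a small sketch. It is not Wayback or Heritrix code. The first function shows how a browser would resolve one of the pushed image paths when the page is served from a replay URL; the replay host, port, and timestamp in `REPLAY_PAGE` are assumptions made up for illustration. The second follows the wording of the javadoc Noah quotes (count '/' characters, not including the first '//', and reject when the count is over the threshold); the function names are hypothetical, and reading "over" as strictly greater is an assumption.

```python
from urllib.parse import urljoin

# Hypothetical Wayback replay URL for the archived page. The host, port and
# timestamp are assumptions for illustration, not taken from the thread.
REPLAY_PAGE = "http://localhost:8080/wayback/20140102000000/http://www.cse.mrt.ac.lk/"


def resolve_like_browser(page_url: str, link: str) -> str:
    """Resolve a link against the URL of the page that contains it."""
    return urljoin(page_url, link)


def count_path_slashes(uri: str) -> int:
    """Count '/' characters, not including the first '//' (per the quoted javadoc)."""
    first_double = uri.find("//")
    rest = uri[first_double + 2:] if first_double != -1 else uri
    return rest.count("/")


def too_many_segments(uri: str, max_path_depth: int) -> bool:
    """Assumed reading of 'over a given threshold': strictly greater than."""
    return count_path_slashes(uri) > max_path_depth


if __name__ == "__main__":
    # A path starting with '/' replaces the whole path of the replay URL,
    # so the request goes to the replay host's root, not to the archived copy.
    print(resolve_like_browser(REPLAY_PAGE, "/images/stories/demo/rokstories/rs4.jpg"))

    seed = "http://www.mrt.ac.lk/web/"
    image = ("http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/"
             "public/field/slideshow/ERU%202013%204.jpg")
    print(count_path_slashes(seed), too_many_segments(seed, 7))
    print(count_path_slashes(image), too_many_segments(image, 7))
```

Under these assumptions, the image path resolves against the replay host rather than the archived site, which is the kind of request that would show up as a 404 in the browser's developer tools. The slideshow image URL counts 10 slashes and so would be rejected with a threshold of 7, even though it is only one hop from the seed.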