From: Ko, L. <Lau...@un...> - 2014-01-07 19:28:18
The administrator manual at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html has a section called "Proxy Replay Mode" that should help somewhat.

As mentioned in it, you will need to give Tomcat a connector on a port to be used by proxy mode, so in your Tomcat's server.xml file, where you see other Connectors defined, assuming you wanted to use port 8090, you would add something similar to:

    <Connector port="8090" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8543" />

In your wayback.xml file, in addition to a default archival URL Replay AccessPoint, you define an AccessPoint for proxy replay. It might be something similar to (this assumes you set a connector on port 8090 and have your archival AccessPoint defined with name "8080:wayback"):

    <import resource="ProxyReplay.xml"/>
    <bean name="8090" parent="8080:wayback">
      <property name="serveStatic" value="true" />
      <property name="bounceToReplayPrefix" value="false" />
      <property name="bounceToQueryPrefix" value="false" />
      <property name="refererAuth" value="" />
      <property name="staticPrefix" value="http://localhost:8090/" />
      <property name="replayPrefix" value="http://localhost:8090/" />
      <property name="queryPrefix" value="http://localhost:8090/" />
      <property name="replay" ref="proxyreplay" />
      <property name="uriConverter">
        <bean class="org.archive.wayback.proxy.RedirectResultURIConverter">
          <property name="redirectURI" value="http://localhost:8090/jsp/QueryUI/Redirect.jsp" />
        </bean>
      </property>
      <property name="parser">
        <bean class="org.archive.wayback.proxy.ProxyRequestParser">
          <property name="localhostNames">
            <list>
              <value>localhost</value>
            </list>
          </property>
          <property name="maxRecords" value="1000" />
          <property name="addDefaults" value="false" />
        </bean>
      </property>
    </bean>
    </beans>

Once proxy mode is set up, if you want to access it via a browser, you need to set your browser's proxy server setting to use the Wayback proxy mode URL you defined.
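For a check that does not depend on browser settings, the same kind of request can be sent through the proxy from a script. The following is a minimal Python sketch using only the standard library. It assumes the connector on localhost:8090 from the example above; the function names proxy_map and fetch_via_proxy are made up for illustration and are not part of Wayback.

```python
import urllib.request


def proxy_map(host: str, port: int) -> dict:
    """Build the proxy mapping urllib expects for plain-HTTP requests."""
    return {"http": f"http://{host}:{port}"}


def fetch_via_proxy(url: str, host: str = "localhost", port: int = 8090,
                    timeout: float = 20.0) -> bytes:
    """Fetch url through an HTTP proxy (assumed here to be Wayback proxy mode)."""
    handler = urllib.request.ProxyHandler(proxy_map(host, port))
    opener = urllib.request.build_opener(handler)
    # The request names the original site; the proxy decides what to return.
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch_via_proxy("http://www.cse.mrt.ac.lk/") while Tomcat is running should return the archived page body if proxy mode is configured as described. A connection error would suggest the connector is not listening on that port.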
After your browser's proxy setting is in place, go to http://www.cse.mrt.ac.lk/ via your browser's address bar, and if things are correct, it should pull from your archived site, not the live web.

Hope this helps,

Lauren Ko
Programmer/Analyst
UNT Libraries

________________________________________
From: Umanda Dikwatta [abe...@gm...]
Sent: Tuesday, January 07, 2014 12:11 PM
To: arc...@li...
Subject: Re: [Archive-access-discuss] [archive-crawler] Heritrix 3.1.0 and wayback questions

Re: [Archive-access-discuss] [archive-crawler] Heritrix 3.1.0 and wayback questions <http://sourceforge.net/mailarchive/message.php?msg_id=31804713>
From: Coram, Roger <Roger.Coram@bl...> - 2014-01-03 10:28

Sometimes, specifically with links added client-side, Wayback doesn't have the opportunity to rewrite correctly. The images being 'pushed' below are absolute paths and it's possible that your browser is trying to load them relative to your own domain (you can check this via your browser's developer tools - you should be able to see what it's actually requesting and any 404s). Rewriting links like this is an ongoing problem but one being actively pursued. However, potentially running Wayback in proxy mode should fix this (provided the content is there).

Hi Coram Roger,

I saw your reply and thank you for that. Actually I tested this with browser tools. All the requests are successful. No 404s. I have not run wayback in proxy mode before. Can you please point me to a link where I can get help with this?

Regards

On Thu, Jan 2, 2014 at 11:06 PM, Umanda Dikwatta <abe...@gm...> wrote:

Hello,

I have crawled http://www.cse.mrt.ac.lk/ and I'm trying to recreate this from wayback 1.8. But the following JavaScript is in the HTML.
<script type="text/javascript">
  RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
  RokStoriesImage.push('/images/stories/demo/rokstories/rs3.jpg');
  RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
  window.addEvent('domready', function() {
    new RokStories('.feature-block', {
      'startElement': 0,
      'thumbsOpacity': 0.5,
      'mousetype': 'click',
      'autorun': 0,
      'delay': 5000,
      'startWidth': 615
    });
  });
</script>
<div class="feature-block">
  <div class="image-container">
    <div class="image-full"></div>
    <div class="image-small">
      <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
      <img src="/images/stories/demo/rokstories/rs3_thumb.jpg" class="feature-sub" alt="image" />
      <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
    </div>
  </div>
  <div class="desc-container">

When the web site starts up, rs4.jpg is loaded into the image-full div block. But this is not working in Wayback. Is there a special reason for that? Please help me to find this.

Regards

On Wed, Dec 18, 2013 at 5:31 AM, Noah Levitt <nl...@ar...> wrote:

Hello,

Basically yeah that's what hops means, except the seed is hop=0, and the links from seed are hop=1, I think.

By "max-depth" do you mean the property maxPathDepth of org.archive.modules.deciderules.TooManyPathSegmentsDecideRule? If so, you have the right idea.

"TooManyPathSegmentsDecideRule... Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold."
http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/deciderules/TooManyPathSegmentsDecideRule.html

On "Problem2", the wayback issue, the wayback mailing list might be a better place to ask.
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
You can cc this list if you want. Please include relevant information about your wayback setup and the behavior you are seeing as precisely as you can.

Noah

On Sat, Dec 14, 2013 at 10:38 PM, Umanda Dikwatta <abe...@gm...> wrote:
>
>
> Hi Noah,
>
> Thank you so much for your reply. To get more clear idea, I have explained,
> what I understood here. Please tell is it correct?
>
> Problem1
>
> If we consider http://www.mrt.ac.lk/web/ as a seed and then if we specify
> max-hops = 3 and max-depth=7.
>
> Is it mean, http://www.mrt.ac.lk/web/ is hop=1. Then all the links in the
> http://www.mrt.ac.lk/web/ has hop=2.
> All the links inside those links has hop=3. Since max-hops=3, links inside
> these will not crawled. Then what
> is the max-depth? Is this the correct definition for hops?
>
> According to this hops definition
> http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/public/field/slideshow/ERU%202013%204.jpg
> is in http://www.mrt.ac.lk/web/ and therefore it is in hop=2. But if we
> consider number of slashes, it has more than 7
> (max-depth) slashes.
> So is this slashes indicates the max-depth. As I could see in my crawl log,
> number of slashes >=7 has not crawled.
> Only other links have been crawled.
>
> Is this what do mean Noah?
>
> Problem2
>
> I tried this with wayback 1.6 and wayback 1.8. But still the issue is there
> with the duplicate content. Is there any solution for this?
>
> Thank you and Regards
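
The slash-counting description that Noah quotes for TooManyPathSegmentsDecideRule can be tried against the two URLs from Problem1. The following is a minimal Python sketch of that quoted description only, not Heritrix's actual code. The function names are made up for illustration, and the threshold of 7 is the max-depth value from the question.

```python
def count_path_segments(uri: str) -> int:
    """Count '/' characters in uri, not including the first '//'."""
    # Drop only the first '//' (the one after the scheme), then count slashes.
    without_first_double_slash = uri.replace("//", "", 1)
    return without_first_double_slash.count("/")


def exceeds_max_path_depth(uri: str, max_path_depth: int) -> bool:
    """True when the slash count is over the threshold, as the quoted text puts it."""
    return count_path_segments(uri) > max_path_depth


seed = "http://www.mrt.ac.lk/web/"
image = ("http://www.mrt.ac.lk/web/sites/default/files/styles/"
         "slideshow/public/field/slideshow/ERU%202013%204.jpg")

print(count_path_segments(seed))         # 2
print(count_path_segments(image))        # 10
print(exceeds_max_path_depth(image, 7))  # True
```

Under this reading, the image URL is one hop from the seed but has 10 slashes, so a threshold of 7 would reject it regardless of the hop limit. That is consistent with the crawl log behavior described in Problem1.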