From: Umanda D. <abe...@gm...> - 2014-01-11 17:41:03
|
Hi, I'm running Wayback 1.8 in an Ubuntu VM, but I have an issue. Whenever I make a change in BDBCollection.xml or wayback.xml, I cannot access Wayback afterwards. I normally shut down Apache, make my changes, and restart Apache, but then it gives an HTTP 404 "resource not found" error. I therefore have to restart my remote VM and start Apache again, and after that it takes a long time to view the snapshots for specific URLs in Wayback. Does anyone know the reason for this? Regards |
From: Umanda D. <abe...@gm...> - 2014-01-08 05:58:51
|
Hi Edward, Thank you so much for the reply. I'm using the newest version of Wayback, which I downloaded from http://builds.archive.org/maven2/org/archive/wayback/dist/1.8.0-SNAPSHOT/ (dist-1.8.0-SNAPSHOT-1.8.0-SNAPSHOT.tar.gz<http://builds.archive.org/maven2/org/archive/wayback/dist/1.8.0-SNAPSHOT/dist-1.8.0-SNAPSHOT-1.8.0-SNAPSHOT.tar.gz>). So, as you mentioned in 1) and 2), all the necessary files are there in the wayback directory. The third point you mentioned is yet to be implemented in Wayback, isn't it? Thank you and best regards

On Tue, Jan 7, 2014 at 9:26 PM, Edward Summers <eh...@po...> wrote:
> [Edward's reply of 2014-01-07 quoted in full; his message appears separately later in this archive.] |
From: Ko, L. <Lau...@un...> - 2014-01-07 19:28:18
|
The administrator manual at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html has a section called "Proxy Replay Mode" that should help somewhat. As mentioned in it, you will need to give Tomcat a connector on a port to be used by proxy mode. So in your Tomcat's server.xml file, where you see other Connectors defined, assuming you wanted to use port 8090, you would add something similar to:

<Connector port="8090" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8543" />

In your wayback.xml file, in addition to a default archival URL replay AccessPoint, you define an AccessPoint for proxy replay. It might be something similar to the following (this assumes you set a connector on port 8090 and have your archival AccessPoint defined with name "8080:wayback"):

<import resource="ProxyReplay.xml"/>

<bean name="8090" parent="8080:wayback">
  <property name="serveStatic" value="true" />
  <property name="bounceToReplayPrefix" value="false" />
  <property name="bounceToQueryPrefix" value="false" />
  <property name="refererAuth" value="" />
  <property name="staticPrefix" value="http://localhost:8090/" />
  <property name="replayPrefix" value="http://localhost:8090/" />
  <property name="queryPrefix" value="http://localhost:8090/" />
  <property name="replay" ref="proxyreplay" />
  <property name="uriConverter">
    <bean class="org.archive.wayback.proxy.RedirectResultURIConverter">
      <property name="redirectURI" value="http://localhost:8090/jsp/QueryUI/Redirect.jsp" />
    </bean>
  </property>
  <property name="parser">
    <bean class="org.archive.wayback.proxy.ProxyRequestParser">
      <property name="localhostNames">
        <list>
          <value>localhost</value>
        </list>
      </property>
      <property name="maxRecords" value="1000" />
      <property name="addDefaults" value="false" />
    </bean>
  </property>
</bean>
</beans>

Once proxy mode is set up, if you want to access it via a browser, you need to set your browser's proxy server setting to use the Wayback proxy mode URL you defined.
After this is set, go to http://www.cse.mrt.ac.lk/ via your browser's address bar, and if things are correct, it should pull from your archived site, not the live web. Hope this helps, Lauren Ko Programmer/Analyst UNT Libraries ________________________________________ From: Umanda Dikwatta [abe...@gm...] Sent: Tuesday, January 07, 2014 12:11 PM To: arc...@li... Subject: Re: [Archive-access-discuss] [archive-crawler] Heritrix 3.1.0 and wayback questions Re: [Archive-access-discuss] [archive-crawler] Heritrix 3.1.0 and wayback questions<http://sourceforge.net/mailarchive/message.php?msg_id=31804713> From: Coram, Roger <Roger.Coram@bl...> - 2014-01-03 10:28 Attachments: Message as HTML<http://sourceforge.net/mailarchive/attachment.php?list_name=archive-access-discuss&message_id=74C97E7DF5A7784D997217FF75D1216612EC81E7%40w2k3-bspex1&counter=1> Sometimes, specifically with links added client-side, Wayback doesn't have the opportunity to rewrite correctly. The images being 'pushed' below are absolute paths and it's possible that your browser is trying to load them relative to your own domain (you can check this via your browser's developer's tools - you should be able to see what it's actually requesting and any 404s). Rewriting links like this is an ongoing problem but one being actively pursued. However, potentially running Wayback in proxy mode should fix this (provided the content is there). Hi Coram Roger, I saw your reply and thank you for that. Actually I tested this with browser tools. All the requests are successful. No 404s. I have not run wayback in proxy mode before. Can you please provide me a link which I can get help about this? Regards On Thu, Jan 2, 2014 at 11:06 PM, Umanda Dikwatta <abe...@gm...<mailto:abe...@gm...>> wrote: Hello, I have crawled http://www.cse.mrt.ac.lk/ and I.m trying to recreate this from wayback 1.8. But following javascript isin the html. 
<script type="text/javascript">
RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
RokStoriesImage.push('/images/stories/demo/rokstories/rs3.jpg');
RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
window.addEvent('domready', function() {
  new RokStories('.feature-block', {
    'startElement': 0,
    'thumbsOpacity': 0.5,
    'mousetype': 'click',
    'autorun': 0,
    'delay': 5000,
    'startWidth': 615
  });
});
</script>

<div class="feature-block">
  <div class="image-container">
    <div class="image-full"></div>
    <div class="image-small">
      <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
      <img src="/images/stories/demo/rokstories/rs3_thumb.jpg" class="feature-sub" alt="image" />
      <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
    </div>
  </div>
  <div class="desc-container">

In the start up of the web site, rs4.jpg is loaded into the image-full div block. But this is not working in the wayback. Is there a special reason for that? Please help me to find this. Regards

On Wed, Dec 18, 2013 at 5:31 AM, Noah Levitt <nl...@ar...<mailto:nl...@ar...>> wrote: Hello, Basically yeah that's what hops means, except the seed is hop=0, and the links from seed are hop=1, I think. By "max-depth" do you mean the property maxPathDepth of org.archive.modules.deciderules.TooManyPathSegmentsDecideRule? If so, you have the right idea. "TooManyPathSegmentsDecideRule... Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold." 
http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/deciderules/TooManyPathSegmentsDecideRule.html On "Problem2", the wayback issue, the wayback mailing list might be a better place to ask. https://lists.sourceforge.net/lists/listinfo/archive-access-discuss You can cc this list if you want. Please include relevant information your wayback setup and the behavior you are seeing as precisely as you can. Noah On Sat, Dec 14, 2013 at 10:38 PM, Umanda Dikwatta <abe...@gm...<mailto:abe...@gm...>> wrote: > > > Hi Noah, > > Thank you so much for your reply. To get more clear idea, I have explained, > what I understood here. Please tell is it correct? > > Problem1 > > If we consider http://www.mrt.ac.lk/web/ as a seed and then if we specify > max-hops = 3 and max-depth=7. > > Is it mean, http://www.mrt.ac.lk/web/ is hop=1. Then all the links in the > http://www.mrt.ac.lk/web/ has hop=2. > All the links inside those links has hop=3. Since max-hops=3, links inside > these will not crawled. Then what > is the max-depth? Is this the correct definition for hops? > > According to this hops definition > http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/public/field/slideshow/ERU%202013%204.jpg > is in http://www.mrt.ac.lk/web/ and therefore it is in hop=2. But if we > consider number of slashes, it has more than 7 > (max-depth) slashes. > So is this slashes indicates the max-depth. As I could see in my crawl log, > number of slashes >=7 has not crawled. > Only other links have been crawled. > > Is this what do mean Noah? > > Problem2 > > I tried this with wayback 1.6 and wayback 1.8. But still the issue is there > with the duplicate content. Is there any solution for this? 
> > Thank you and Regards |
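The path-segment rule Noah describes in the quoted reply above can be sketched as follows. This is an illustrative JavaScript sketch of the counting rule only; Heritrix's real TooManyPathSegmentsDecideRule is implemented in Java, and the function names here are invented:

```javascript
// Illustrative sketch (not Heritrix code): TooManyPathSegmentsDecideRule
// REJECTs a URI whose count of '/' characters -- not counting the '//'
// after the scheme -- exceeds the configured maxPathDepth.
function pathSegmentCount(uri) {
  // Drop "scheme://" so the leading '//' is not counted.
  var rest = uri.replace(/^[a-z][a-z0-9+.-]*:\/\//i, '');
  var slashes = rest.match(/\//g);
  return slashes ? slashes.length : 0;
}

function rejectedByMaxDepth(uri, maxPathDepth) {
  return pathSegmentCount(uri) > maxPathDepth;
}

console.log(pathSegmentCount('http://www.mrt.ac.lk/web/'));  // 2
console.log(rejectedByMaxDepth(
  'http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/public/field/slideshow/ERU%202013%204.jpg',
  7));  // true: 10 slashes > 7
```

This matches the behavior Umanda reports in the quoted thread: the seed at two segments is crawled, while the slideshow image at ten segments falls over a max-depth of 7.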
From: Umanda D. <abe...@gm...> - 2014-01-07 18:11:30
|
Re: [Archive-access-discuss] [archive-crawler] Heritrix 3.1.0 and wayback questions <http://sourceforge.net/mailarchive/message.php?msg_id=31804713> From: Coram, Roger <Roger.Coram@bl...> - 2014-01-03 10:28 Attachments: Message as HTML<http://sourceforge.net/mailarchive/attachment.php?list_name=archive-access-discuss&message_id=74C97E7DF5A7784D997217FF75D1216612EC81E7%40w2k3-bspex1&counter=1> Sometimes, specifically with links added client-side, Wayback doesn't have the opportunity to rewrite correctly. The images being 'pushed' below are absolute paths and it's possible that your browser is trying to load them relative to your own domain (you can check this via your browser's developer tools - you should be able to see what it's actually requesting and any 404s). Rewriting links like this is an ongoing problem, but one being actively pursued. However, potentially running Wayback in proxy mode should fix this (provided the content is there). Hi Coram Roger, I saw your reply and thank you for that. Actually I tested this with browser tools. All the requests are successful. No 404s. I have not run Wayback in proxy mode before. Can you please provide me a link where I can get help about this? Regards On Thu, Jan 2, 2014 at 11:06 PM, Umanda Dikwatta <abe...@gm...> wrote: > Hello, > > I have crawled http://www.cse.mrt.ac.lk/ and I'm trying to recreate this > from wayback 1.8. But the following javascript is in the html. 
> > <script type="text/javascript">
> > RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
> > RokStoriesImage.push('/images/stories/demo/rokstories/rs3.jpg');
> > RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg');
> > window.addEvent('domready', function() {
> >   new RokStories('.feature-block', { 'startElement': 0, 'thumbsOpacity': 0.5, 'mousetype': 'click', 'autorun': 0, 'delay': 5000, 'startWidth': 615 });
> > });
> > </script>
> > <div class="feature-block">
> >   <div class="image-container">
> >     <div class="image-full"></div>
> >     <div class="image-small">
> >       <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
> >       <img src="/images/stories/demo/rokstories/rs3_thumb.jpg" class="feature-sub" alt="image" />
> >       <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" />
> >     </div>
> >   </div>
> >   <div class="desc-container">
> In the start up of the web site, rs4.jpg is loaded into the image-full div block. But this is not working in the wayback. Is there a special reason for that? Please help me to find this. > > Regards > > On Wed, Dec 18, 2013 at 5:31 AM, Noah Levitt <nl...@ar...> wrote: > >> >> >> Hello, >> >> Basically yeah that's what hops means, except the seed is hop=0, and >> the links from seed are hop=1, I think. >> >> By "max-depth" do you mean the property maxPathDepth of >> org.archive.modules.deciderules.TooManyPathSegmentsDecideRule? If so, >> you have the right idea. "TooManyPathSegmentsDecideRule... Rule >> REJECTs any CrawlURIs whose total number of path-segments (as >> indicated by the count of '/' characters not including the first '//') >> is over a given threshold." 
>> >> http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/deciderules/TooManyPathSegmentsDecideRule.html >> >> On "Problem2", the wayback issue, the wayback mailing list might be a >> better place to ask. >> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss >> You can cc this list if you want. Please include relevant information >> your wayback setup and the behavior you are seeing as precisely as you >> can. >> >> Noah >> >> >> On Sat, Dec 14, 2013 at 10:38 PM, Umanda Dikwatta <abe...@gm...> >> wrote: >> > >> > >> > Hi Noah, >> > >> > Thank you so much for your reply. To get more clear idea, I have >> explained, >> > what I understood here. Please tell is it correct? >> > >> > Problem1 >> > >> > If we consider http://www.mrt.ac.lk/web/ as a seed and then if we >> specify >> > max-hops = 3 and max-depth=7. >> > >> > Is it mean, http://www.mrt.ac.lk/web/ is hop=1. Then all the links in >> the >> > http://www.mrt.ac.lk/web/ has hop=2. >> > All the links inside those links has hop=3. Since max-hops=3, links >> inside >> > these will not crawled. Then what >> > is the max-depth? Is this the correct definition for hops? >> > >> > According to this hops definition >> > >> http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/public/field/slideshow/ERU%202013%204.jpg >> > is in http://www.mrt.ac.lk/web/ and therefore it is in hop=2. But if we >> > consider number of slashes, it has more than 7 >> > (max-depth) slashes. >> > So is this slashes indicates the max-depth. As I could see in my crawl >> log, >> > number of slashes >=7 has not crawled. >> > Only other links have been crawled. >> > >> > Is this what do mean Noah? >> > >> > Problem2 >> > >> > I tried this with wayback 1.6 and wayback 1.8. But still the issue is >> there >> > with the duplicate content. Is there any solution for this? 
>> > Thank you and Regards |
From: Edward S. <eh...@po...> - 2014-01-07 15:56:25
|
Hi Umanda,

It looks like the slideshow in 1) has the same problem at Internet Archive:

https://web.archive.org/web/20131230013555/http://www.cmb.ac.lk/

So the good news is you are not alone. If you pull up the JavaScript console you should see an error like this:

Uncaught TypeError: Object function (selector,context){return new jQuery.fn.init(selector,context)} has no method 'isPlainObject'

You can see on line 36 of the HTML that jQuery v1.4.2 is being loaded correctly, as it is on the original page:

<script type='text/javascript' src='/web/20131230013555js_/http://www.cmb.ac.lk/wp-includes/js/jquery/jquery.js?ver=1.4.2'></script>

But on line 170 you can see Wayback's boilerplate is loading jQuery again, but an older version (v1.3.2), which seems not to have the isPlainObject method, which causes the error above. Also, this reload probably stomps on any jQuery plugins that have been installed.

I don't know what the best solution here is, but it seems to me there are (at least) three options:

1) From a brief search it looks like [1] jQuery takes backwards compatibility pretty seriously. So if Wayback must have a jQuery dependency, perhaps it could simply be upgraded to use the latest version in git [2].

2) It looks like it's possible for multiple versions of jQuery to co-exist on the same page [3]. So perhaps Wayback could be updated to use jQuery in this way, so that it doesn't interfere with archived pages that also use jQuery?

3) Perhaps Wayback should test to see whether jQuery is already loaded before re-loading it? This is what was recommended in a previous bug report [4].
//Ed

[1] http://stackoverflow.com/questions/281438/how-good-is-jquerys-backward-compatibility
[2] https://github.com/internetarchive/wayback/tree/master/wayback-webapp/src/main/webapp/js
[3] http://stackoverflow.com/questions/1566595/can-i-use-multiple-versions-of-jquery-on-the-same-page
[4] https://webarchive.jira.com/browse/ACC-118?jql=project%20%3D%20ACC%20AND%20component%20%3D%20Wayback%20AND%20text%20~%20%22jquery%22

On Jan 7, 2014, at 10:04 AM, Umanda Dikwatta <abe...@gm...> wrote: > Hello, > I'm using Heritrix 3.1.0 and wayback 1.8 in order to crawl and re-create web sites. I have the following seed URLs: > 1) http://www.cmb.ac.lk > 2) http://www.pdn.ac.lk > 3) http://www.kln.ac.lk > When I'm trying to re-create these web sites, the slide show on the main page of the first 2 sites is not working. Also, a drop-down menu activated on mouse-over of the main menu on the 3rd site is not working. In all these cases the relevant files are in the crawl log and were successfully crawled. As I saw in the code: > http://www.cmb.ac.lk - the slide show works using http://www.cmb.ac.lk/wp-content/plugins/widgetkit/widgets/slideshow/js/slideshow.js > http://www.pdn.ac.lk - the slide show works using skitter.js (http://www.pdn.ac.lk/assist/js1/jquery.skitter.min.js). This js file is in the archive. > http://www.kln.ac.lk - the main menu drop-down works using the javascript below. > <script type="text/javascript"> var megamenu = new jaMegaMenuMoo ('ja-megamenu', { 'bgopacity': 0, 'delayHide': 300, 'slide': 0, 'fading': 1, 'direction':'down', 'action':'mouseover', 'tips': false, 'duration': 300, 'hidestyle': 'fastwhenshow' }); </script> > Is this a limitation of the Wayback Machine? Please help me. > Regards
_______________________________________________
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
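Edward's option 3 amounts to a guard before Wayback's banner injects its bundled jQuery. The sketch below is hypothetical, not Wayback's actual boilerplate: needsBundledJQuery is an invented name, and the check keys off isPlainObject because that is the method missing from v1.3.2, as described above.

```javascript
// Hypothetical sketch of option 3 (not Wayback's real code): decide
// whether the replay banner should inject its bundled jQuery at all.
function needsBundledJQuery(pageJQuery) {
  // No jQuery on the archived page: inject the bundled copy.
  if (typeof pageJQuery === 'undefined') return true;
  // The page already loaded a copy new enough for the banner
  // (isPlainObject arrived in jQuery 1.4): leave it alone.
  if (typeof pageJQuery.isPlainObject === 'function') return false;
  // An older copy is present; re-injecting would stomp it (the bug
  // in this thread), so skip injection here too.
  return false;
}

console.log(needsBundledJQuery(undefined));                         // true
console.log(needsBundledJQuery({ isPlainObject: function () {} })); // false
```

Note that simply skipping injection when an older copy is present leaves the banner without the jQuery version it expects, which is why option 2's coexistence approach (jQuery.noConflict) may be the fuller fix.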
From: Umanda D. <abe...@gm...> - 2014-01-07 15:04:16
|
Hello, I'm using Heritrix 3.1.0 and wayback 1.8 in order to crawl and re-create web sites. I have the following seed URLs:

1) http://www.cmb.ac.lk
2) http://www.pdn.ac.lk
3) http://www.kln.ac.lk

When I'm trying to re-create these web sites, the slide show on the main page of the first 2 sites is not working. Also, a drop-down menu activated on mouse-over of the main menu on the 3rd site is not working. In all these cases the relevant files are in the crawl log and were successfully crawled. As I saw in the code:

http://www.cmb.ac.lk - the slide show works using http://www.cmb.ac.lk/wp-content/plugins/widgetkit/widgets/slideshow/js/slideshow.js

http://www.pdn.ac.lk - the slide show works using skitter.js (http://www.pdn.ac.lk/assist/js1/jquery.skitter.min.js). This js file is in the archive.

http://www.kln.ac.lk - the main menu drop-down works using the javascript below.

<script type="text/javascript">
var megamenu = new jaMegaMenuMoo ('ja-megamenu', {
  'bgopacity': 0,
  'delayHide': 300,
  'slide': 0,
  'fading': 1,
  'direction':'down',
  'action':'mouseover',
  'tips': false,
  'duration': 300,
  'hidestyle': 'fastwhenshow'
});
</script>

Is this a limitation of the Wayback Machine? Please help me. Regards |
From: Coram, R. <Rog...@bl...> - 2014-01-03 10:28:44
|
Sometimes, specifically with links added client-side, Wayback doesn't have the opportunity to rewrite correctly. The images being 'pushed' below are absolute paths and it's possible that your browser is trying to load them relative to your own domain (you can check this via your browser's developer tools - you should be able to see what it's actually requesting and any 404s). Rewriting links like this is an ongoing problem, but one being actively pursued. However, potentially running Wayback in proxy mode should fix this (provided the content is there). From: arc...@ya... [mailto:arc...@ya...] On Behalf Of Umanda Dikwatta Sent: 02 January 2014 17:36 To: arc...@ya... Cc: arc...@li... Subject: Re: [archive-crawler] Heritrix 3.1.0 and wayback questions Hello, I have crawled http://www.cse.mrt.ac.lk/ and I'm trying to recreate this from wayback 1.8. But the following javascript is in the html. <script type="text/javascript"> RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg'); RokStoriesImage.push('/images/stories/demo/rokstories/rs3.jpg'); RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg'); window.addEvent('domready', function() { new RokStories('.feature-block', { 'startElement': 0, 'thumbsOpacity': 0.5, 'mousetype': 'click', 'autorun': 0, 'delay': 5000, 'startWidth': 615 }); }); </script> <div class="feature-block"> <div class="image-container"> <div class="image-full"></div> <div class="image-small"> <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" /> <img src="/images/stories/demo/rokstories/rs3_thumb.jpg" class="feature-sub" alt="image" /> <img src="/images/stories/demo/rokstories/rs4_thumb.jpg" class="feature-sub" alt="image" /> </div> </div> <div class="desc-container"> In the start up of the web site, rs4.jpg is loaded into the image-full div block. But this is not working in the wayback. Is there a special reason for that? Please help me to find this. Regards On Wed, Dec 18, 2013 at 5:31 AM, Noah Levitt <nl...@ar...> wrote: Hello, Basically yeah that's what hops means, except the seed is hop=0, and the links from seed are hop=1, I think. By "max-depth" do you mean the property maxPathDepth of org.archive.modules.deciderules.TooManyPathSegmentsDecideRule? If so, you have the right idea. "TooManyPathSegmentsDecideRule... Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold." http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/deciderules/TooManyPathSegmentsDecideRule.html On "Problem2", the wayback issue, the wayback mailing list might be a better place to ask. https://lists.sourceforge.net/lists/listinfo/archive-access-discuss You can cc this list if you want. Please include relevant information about your wayback setup and the behavior you are seeing as precisely as you can. Noah On Sat, Dec 14, 2013 at 10:38 PM, Umanda Dikwatta <abe...@gm...> wrote: > > > Hi Noah, > > Thank you so much for your reply. To get a clearer idea, I have explained > what I understood here. Please tell me, is it correct? > > Problem1 > > If we consider http://www.mrt.ac.lk/web/ as a seed and then if we specify > max-hops = 3 and max-depth=7. > > Does it mean http://www.mrt.ac.lk/web/ is hop=1? Then all the links in > http://www.mrt.ac.lk/web/ have hop=2, > and all the links inside those links have hop=3. Since max-hops=3, links inside > these will not be crawled. Then what > is the max-depth? Is this the correct definition for hops? > > According to this hops definition, > http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/public/field/slideshow/ERU%202013%204.jpg > is in http://www.mrt.ac.lk/web/ and therefore it is in hop=2. 
But if we > consider the number of slashes, it has more than 7 > (max-depth) slashes. > So do these slashes indicate the max-depth? As I could see in my crawl log, > URLs with a number of slashes >= 7 have not been crawled. > Only the other links have been crawled. > > Is this what you mean, Noah? > > Problem2 > > I tried this with wayback 1.6 and wayback 1.8. But the issue is still there > with the duplicate content. Is there any solution for this? > > Thank you and Regards
|
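The slash-counting rule Noah quotes from the TooManyPathSegmentsDecideRule javadoc can be sketched roughly as follows. This is an illustration of the documented behavior, not Heritrix's actual implementation; the function names are invented here.

```python
def path_segment_count(url: str) -> int:
    """Count '/' characters in a URL, excluding the two in the leading
    'scheme://', mirroring how the javadoc describes the rule's count."""
    scheme_sep = url.find('//')
    # Start counting just past the '//' after the scheme, if present
    start = scheme_sep + 2 if scheme_sep != -1 else 0
    return url.count('/', start)

def too_many_segments(url: str, max_path_depth: int = 7) -> bool:
    """REJECT (return True) when the segment count exceeds the threshold,
    e.g. a max-depth of 7 as in the discussion above."""
    return path_segment_count(url) > max_path_depth
```

On the slideshow image URL discussed above, the count after `http://` is 10, which exceeds a threshold of 7, matching the observation that such URLs were not crawled.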
From: Umanda D. <abe...@gm...> - 2014-01-02 17:36:13
|
Hello, I have crawled http://www.cse.mrt.ac.lk/ and I'm trying to recreate this from wayback 1.8. But the following JavaScript is in the HTML. <script type="text/javascript"> RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg'); RokStoriesImage.push('/images/stories/demo/rokstories/rs3.jpg'); RokStoriesImage.push('/images/stories/demo/rokstories/rs4.jpg'); window.addEvent('domready', function() { new RokStories('.feature-block', { 'startElement': 0, 'thumbsOpacity': 0.5, 'mousetype': 'click', 'autorun': 0, 'delay': 5000, 'startWidth': 615 }); }); </script> <div class="feature-block"> <div class="image-container"> <div class="image-full"></div> <div class="image-small"> <img src="/images/stories/demo/rokstories/rs4_thumb.jpg <http://www.cse.mrt.ac.lk/images/stories/demo/rokstories/rs4_thumb.jpg>" class="feature-sub" alt="image" /> <img src="/images/stories/demo/rokstories/rs3_thumb.jpg <http://www.cse.mrt.ac.lk/images/stories/demo/rokstories/rs3_thumb.jpg>" class="feature-sub" alt="image" /> <img src="/images/stories/demo/rokstories/rs4_thumb.jpg <http://www.cse.mrt.ac.lk/images/stories/demo/rokstories/rs4_thumb.jpg>" class="feature-sub" alt="image" /> </div> </div> <div class="desc-container"> In the start up of the web site, rs4.jpg is loaded into the image-full div block. But this is not working in the wayback. Is there a special reason for that? Please help me to find this. Regards On Wed, Dec 18, 2013 at 5:31 AM, Noah Levitt <nl...@ar...> wrote: > > > Hello, > > Basically yeah that's what hops means, except the seed is hop=0, and > the links from the seed are hop=1, I think. > > By "max-depth" do you mean the property maxPathDepth of > org.archive.modules.deciderules.TooManyPathSegmentsDecideRule? If so, > you have the right idea. "TooManyPathSegmentsDecideRule... Rule > REJECTs any CrawlURIs whose total number of path-segments (as > indicated by the count of '/' characters not including the first '//') > is over a given threshold."
> > http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/deciderules/TooManyPathSegmentsDecideRule.html > > On "Problem2", the wayback issue, the wayback mailing list might be a > better place to ask. > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > You can cc this list if you want. Please include relevant information > about your wayback setup and the behavior you are seeing as precisely as you > can. > > Noah > > > On Sat, Dec 14, 2013 at 10:38 PM, Umanda Dikwatta <abe...@gm...> > wrote: > > > > > > Hi Noah, > > > > Thank you so much for your reply. To get a clearer idea, I have > explained > > what I understood here. Please tell me if it is correct. > > > > Problem1 > > > > Consider http://www.mrt.ac.lk/web/ as a seed, and then > specify > > max-hops = 3 and max-depth=7. > > > > Does this mean http://www.mrt.ac.lk/web/ is hop=1? Then all the links in > > http://www.mrt.ac.lk/web/ have hop=2, > > and all the links inside those links have hop=3. Since max-hops=3, links > inside > > these will not be crawled. Then what > > is the max-depth? Is this the correct definition of hops? > > > > According to this hops definition, > > > http://www.mrt.ac.lk/web/sites/default/files/styles/slideshow/public/field/slideshow/ERU%202013%204.jpg > > is in http://www.mrt.ac.lk/web/ and therefore it is in hop=2. But if we > > consider the number of slashes, it has more than 7 > > (max-depth) slashes. > > So do these slashes indicate the max-depth? As I could see in my crawl > log, > > URLs with a number of slashes >= 7 have not been crawled. > > Only the other links have been crawled. > > > > Is this what you mean, Noah? > > > > Problem2 > > > > I tried this with wayback 1.6 and wayback 1.8. But the issue is still > there > > with the duplicate content. Is there any solution for this?
> > > > Thank you and Regards |
From: Umanda D. <abe...@gm...> - 2014-01-01 04:28:20
|
Hi, I have configured wayback 1.8 (dist-1.8.0-SNAPSHOT-1.8.0-SNAPSHOT.tar) in Apache Tomcat (apache-tomcat-6.0.37). I'm using heritrix 3.1.0 to crawl web sites. I'm running all of these in Ubuntu. I have problems with the slide shows (the slide show of the main page) of the following sites: http://www.cse.mrt.ac.lk/ http://www.pdn.ac.lk/ But the slide show of the following site is working: https://www.ucsc.cmb.ac.lk/ I checked the heritrix crawl log and all the links are archived. So I want to clarify whether this is a limitation of the wayback machine or some other issue. Regards |
From: Umanda D. <abe...@gm...> - 2013-12-29 16:42:06
|
I have configured wayback 1.8 (dist-1.8.0-SNAPSHOT-1.8.0-SNAPSHOT.tar) in Apache Tomcat (apache-tomcat-6.0.37). I'm using heritrix 3.1.0 to crawl web sites. I'm running all of these in Ubuntu. I have problems with the slide shows (the slide show of the main page) of the following sites: http://www.cse.mrt.ac.lk/ http://www.pdn.ac.lk/ But the slide show of the following site is working: https://www.ucsc.cmb.ac.lk/ I checked the heritrix crawl log and all the links are archived. So I want to clarify whether this is a limitation of the wayback machine or some other issue. Regards |
From: Armin S. <sch...@gm...> - 2013-11-05 15:54:52
|
Hi everyone, I set up a local Wayback instance a while ago and everything worked fine, until I tried to add newly harvested .warc files to my filestore directory. When I browse the index-data dir, there is nothing in the queue and all my files were added to the merged directory; however, when I try to access them via the browser, only my old harvests are showing up in Wayback. Does anyone have an idea what the problem could be? Any hints are highly appreciated - I will of course try to provide all the info that you guys may need... Thanks |
From: Nicholas T. <ta...@gm...> - 2013-10-18 17:07:15
|
Hi folks, I have a fresh Wayback 1.6.1 installation configured to use CDX files. Everything's working except... how do I create a path index? I mean, I can create one manually with a bash script, but I figure that Wayback must include a utility to do this? Reading between the lines of the outdated Administrator Manual, I'm guessing it's the location-client ( http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#location-client), but it expects the locationDB to be a URL rather than a local file path. Thanks for any assistance you can provide. ~Nicholas |
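The manual route Nicholas mentions can be sketched as a small script. This is an assumption-laden illustration: the tab-separated "basename, full path" layout sorted by name reflects the path-index files Wayback produces, but verify against your own installation before relying on it; the function name and extensions list are invented here.

```python
import os

def build_path_index(store_dir, out_path="path-index.txt"):
    """Write one 'basename<TAB>full-path' line per (W)ARC file found under
    store_dir, sorted by basename (assumed path-index layout)."""
    entries = []
    for root, _dirs, files in os.walk(store_dir):
        for name in files:
            if name.endswith((".warc.gz", ".arc.gz", ".warc", ".arc")):
                entries.append(f"{name}\t{os.path.join(root, name)}")
    entries.sort()
    with open(out_path, "w") as f:
        f.write("\n".join(entries) + "\n")
```

Run it over the filestore directory and point Wayback's path-index setting at the resulting file.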
From: Noah L. <nl...@ar...> - 2013-08-21 17:52:02
|
A wayback question... ---------- Forwarded message ---------- From: Anna Kugler <ku...@bs...> Date: Wed, Aug 21, 2013 at 5:38 AM Subject: [archive-crawler] replay of gzipped css To: arc...@ya... Hi, we came across the issue https://webarchive.jira.com/browse/ACC-81 when we archived a gzipped css. Is there a workaround/fix in heritrix or wayback to replay gzipped content? Thanks, Anna _______________________ Bayerische Staatsbibliothek Münchener Digitalisierungs-Zentrum (MDZ)/ Digitale Bibliothek Munich Digitization Centre/ Digital Library Bayerische Staatsbibliothek Ludwigstr.16 D-80539 München E-Mail: Ann...@bs... Tel. +49/89/28638-2998 URL: www.bsb-muenchen.de |
From: Kristinn S. <kri...@la...> - 2013-08-09 10:12:55
|
See my replies below... > It is possible to disable the interstitial by removing the bean that > handles that in ArchivalUrlReplay.xml which probably looks like this: > I do not want to remove the interstitial. It is useful information in most cases, and hiding it causes all sorts of weirdness. I just want to eliminate it in these kinds of pointless instances. > >> Also, as a side note, the XML search results used to contain >> the destination for redirect captures, but no longer. This limits my >> options of dealing with this in the frontend search results. > > > Hmm.. no changes have been made to the xml search results in a long > time. (I have been working on an alternative tool for viewing search > results: > https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) > > For some old cdxs, we've discovered the cdx field was often > improperly encoded (or encoding was ambiguous) and simply started > writing a "-" for new cdxs. > > It is not directly useful to wayback replay except to determine if a > url is a self-redirect. > > That said, it should still appear in the search results if it is in > the cdx. Yes, the issue is that it is not (anymore) included in the CDX (we were actually not using CDXs in our old installation). This info may not be directly useful to wayback replay, but it can be useful for rendering a results page. I think this merits a closer look (although I'm not exactly eager to rebuild all our CDX files). At minimum, it would be useful if a self-redirect was annotated so we could suppress it in the results page. I do wonder if it is even useful to have self-redirects in the CDX at all. We treat www.example.com and example.com as the same URL anyway. Would there be any harm in eliminating all redirects like this? > 2.
When navigating search results using the "last" arrow on the > injected Toolbar, if the previous capture was such a redirect, you > will not get anywhere as the redirect is resolved by sending you back > to the same capture you are on. > > I found an instance of this in Internet Archive's Wayback: > http://web.archive.org/web/20130602062836/http://timarit.is/ > > Try clicking the back arrow. You are not going anywhere. It > doesn't even give you the URL redirect notice. > > > Thanks for pointing out this example! This happens due to a slightly > complicated combination of different things. > > The short answer is that this should be fixable by checking the > referrer, or better yet, I'd like to propose having a dedicated "prev > from X" or "next from X" timestamp modifier, > such as: > "20130602062836-" - redirect to the previous available capture before > 20130602062836, or back to 20130602062836 if it is the first. > "20130602062836+" - redirect to the next available capture after > 20130602062836, or back to 20130602062836 if it is the last. > > The long answer is [snipped] > > By adding an explicit timestamp modifier of + or -, one could request > /web/16-/ and guarantee to get a previous capture to 16, if at all > possible. > > Such a modifier (or a different one) for specifying prev/next capture > could be useful for other reasons as well. > > Any thoughts on this idea? I like it. Would definitely address the replay part of this issue. You would, though, need to address how to handle cases where you specify a TIME- and there are no older instances but there are newer ones. Do you then ignore the - or do you return 'nothing found'? - Kris ------------------------------------------------------------------------- Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík Sími/Tel: +354 5255600 | www.landsbokasafn.is ------------------------------------------------------------------------- fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is |
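The "+"/"-" timestamp-modifier proposal discussed above can be sketched against a sorted list of capture timestamps. This is a sketch of the proposed semantics (fall back to the requested timestamp at either end of the list), not Wayback's actual implementation; the function name is invented here.

```python
import bisect

def resolve_capture(timestamps, ts, modifier=""):
    """Resolve a 14-digit timestamp plus an optional '+'/'-' modifier
    against a sorted list of capture timestamps:
      '-' -> previous available capture, or ts itself if it is the first
      '+' -> next available capture, or ts itself if it is the last
    """
    if modifier == "-":
        i = bisect.bisect_left(timestamps, ts)
        return timestamps[i - 1] if i > 0 else ts
    if modifier == "+":
        j = bisect.bisect_right(timestamps, ts)
        return timestamps[j] if j < len(timestamps) else ts
    return ts
```

Note the edge case Kris raises: with this sketch, `TIME-` with no older capture simply echoes `TIME` back rather than reporting 'nothing found'; either behavior could be chosen.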
From: Kristinn S. <kri...@la...> - 2013-08-07 13:08:20
|
Hi all, I have an issue with redirects. It is semi common for us to have crawled an URL like http://www.example.com only to find a 301 or 302 to http://example.com (or vice versa). The second URL is then crawled a few seconds later. This creates two issues. 1. The first capture (the redirect) is rendered by default when a user selects that particular date. This starts the user off on a URL redirect notice page. Less than ideal. It is possible to just resolve the redirect immediately, without alerting the user, but that will cause them to see an URL from a slightly different time than they expected. Also, as a side note, the XML search results used to contain the destination for redirect captures, but no longer. This limits my options of dealing with this in the frontend search results. 2. When navigating search results using the "last" arrow on the injected Toolbar, if the previous capture was such a redirect, you will not get anywhere as the redirect is resolved by sending you back to the same capture you are on. I found an instance of this in Internet Archive's Wayback: http://web.archive.org/web/20130602062836/http://timarit.is/ Try clicking the back arrow. You are not going anywhere. It doesn't even give you the URL redirect notice. You can, of course, use the sparkline or jump by years/months, but moving one capture at a time is a desired ability. I currently have it set so that redirects are never resolved behind the scenes. Another side note. Why does the toolbar disappear when displaying the URL redirect notice? Many URLs have gone from serving content to being a redirect and back (sometimes repeatedly). Just because you hit a redirect doesn't mean that you are either going to want to follow it or go back. You may want to choose another capture for that URL from the toolbar. This is actually easy to fix (as I have done), but I feel displaying the toolbar should be standard. Anyway, that is my rant on redirects. Any thoughts and suggestions? 
Best, Kris ------------------------------------------------------------------------- Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík Sími/Tel: +354 5255600 | www.landsbokasafn.is ------------------------------------------------------------------------- fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is |
From: Ilya <il...@ar...> - 2013-07-31 01:59:54
|
Hi, It should be possible to configure wayback with more than 10,000 records. Note that both the LocalResourceIndex and the ArchivalUrlRequestParser have maxRecords properties; both should be updated to the same value. For our main index, we use the "ZipNum" CDX format, which creates a secondary index for the CDX and allows compression. There's also an option to "collapse" results based on a portion of the timestamp (for example, don't show more than 1 snapshot per hour). This is the configuration seen here: http://web.archive.org/web/*/google.com We are working on releasing additional documentation on how to create the "ZipNum" index from plain CDX files. In addition, we've been working on a separate, new CDX server API for wayback, which allows for more control over querying. For example, the following query returns a first page of "uncollapsed" results (a page is configured at 150,000 max cdx lines on our end at the moment): http://web.archive.org/cdx/search/cdx?url=google.com The following returns far fewer results, collapsed to no more than 1 per hour (by ignoring duplicates of the first 10 digits of the timestamp field): http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10 The full documentation for this new API is available here: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server On 7/30/13 11:55 AM, nic...@bn... wrote: > > Hi Kristinn, > > The index lookup algorithm, as you probably know, boils down to: > > 1) Perform a binary search on the set of CDX files defined in the > WaybackCollection > 2) Sequentially iterate over the records starting at the first found > occurrence, applying filters along the way > 3) Stop the process after examining maxRecords (10,000 by default) > > At BnF we have recently changed the way we merge CDX to have a single > CDX in a collection, whenever possible, or at least as few CDX as > possible. This allowed us to raise maxRecords to 100,000 with > not-stellar-but-acceptable search times.
> > However, we also have this problem with sites that are captured on a > daily basis, and that's one of the motivations behind trying to use a > search engine framework like SOLR or ElasticSearch to index individual > CDX. > > Best regards, > > Nicolas Giraud > ------------------------------------------------------------------------ > > Exposition */Zellidja, carnets de voyage > <http://www.bnf.fr/fr/evenements_et_culture/anx_expositions/f.zellidja.html>/* > - extended until 3 August 2013 - BnF - François-Mitterrand > > *Before printing, think of the environment.* > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
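Ilya's examples can be composed programmatically. The `url` and `collapse` parameters are the ones shown in his message (and documented for the CDX server API); the helper itself is an illustrative sketch, not part of Wayback.

```python
from urllib.parse import urlencode

def cdx_query_url(url, collapse=None,
                  base="http://web.archive.org/cdx/search/cdx"):
    """Build a CDX-server query string like the examples above.
    collapse='timestamp:10' asks the server to drop consecutive records
    whose first 10 timestamp digits match (i.e. at most one per hour)."""
    params = {"url": url}
    if collapse:
        params["collapse"] = collapse
    return base + "?" + urlencode(params)
```

For instance, `cdx_query_url("google.com", collapse="timestamp:10")` reproduces the collapsed query from the message (with the `:` percent-encoded).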
From: <nic...@bn...> - 2013-07-30 16:20:34
|
Hi Kristinn, The index lookup algorithm, as you probably know, boils down to: 1) Perform a binary search on the set of CDX files defined in the WaybackCollection 2) Sequentially iterate over the records starting at the first found occurrence, applying filters along the way 3) Stop the process after examining maxRecords (10,000 by default) At BnF we have recently changed the way we merge CDX to have a single CDX in a collection, whenever possible, or at least as few CDX as possible. This allowed us to raise maxRecords to 100,000 with not-stellar-but-acceptable search times. However, we also have this problem with sites that are captured on a daily basis, and that's one of the motivations behind trying to use a search engine framework like SOLR or ElasticSearch to index individual CDX. Best regards, Nicolas Giraud Exposition Zellidja, carnets de voyage - extended until 3 August 2013 - BnF - François-Mitterrand Before printing, think of the environment. |
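The three-step lookup Nicolas describes can be sketched in miniature. Real Wayback binary-searches sorted CDX files on disk; this illustration works on an in-memory sorted list and omits the per-record filters, so treat it as a model of the algorithm, not the implementation.

```python
import bisect

def cdx_lookup(cdx_lines, searchkey, max_records=10000):
    """Binary-search a sorted list of CDX lines for the first record whose
    line begins with `searchkey` (step 1), then scan forward collecting
    matches (step 2), stopping after max_records examined lines (step 3)."""
    i = bisect.bisect_left(cdx_lines, searchkey)
    results = []
    for line in cdx_lines[i:i + max_records]:
        if not line.startswith(searchkey):
            break  # past the key range: CDX lines sort by canonicalized URL
        results.append(line)
    return results
```

The `max_records` cutoff is what makes heavily-captured URLs problematic: once a single URL has more captures than the limit, later snapshots are simply never examined.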
From: Kristinn S. <kri...@la...> - 2013-07-30 15:49:53
|
We've been experimenting with crawling using RSS feeds. This has generally gone well, but has led to a concern over how well Wayback can handle a URL that has a LOT of snapshots. In our RSS experiment we've seen that the front pages (which are crawled each time an item is added to the feed) are crawled as often as 2,000 times a month (and, yes, those are all unique captures!). Wayback has a default "maxRecords" of 10,000, a value we'll hit in just a few months of crawling. Interestingly, while I can lower that value in the wayback.xml config file, raising it causes all searches to return a "Bad Query Exception"; the 10,000 limit seems pretty hard-wired in. Has anyone looked into how Wayback handles scaling along this axis? - Kris ------------------------------------------------------------------------- Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík Sími/Tel: +354 5255600 | www.landsbokasafn.is ------------------------------------------------------------------------- fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is |
From: <Ism...@te...> - 2013-06-07 14:09:36
|
Hi, I am performing file format conversions on WARC record payloads (for example, turning .doc files to .pdf). Based on the WARC specification, I believe the method of doing this is to create a new WARC record resembling this format: WARC-Type: conversion WARC-Target-URI: same as the original record WARC-Date: the date-time when this conversion record is created WARC-Record-ID: new unique ID WARC-Refers-To: old record ID Content-Type: new content type ... (for an example conversion record see appendix C.7 of ISO 28500:2009) I would like to use WARC files containing conversion records with WBM. I would like to be able to set a configuration option to tell WBM to always use the most recent conversion record payload when I request the target URI. In other words, the most recent conversion record payload/content-type is used in place of the original record. Will this feature be implemented in WBM, and is there a time scale for it? I feel this is an important feature to have in WBM, since the alternative method of carrying out file format conversions requires modifying the contents of the WARC, which would compromise its integrity, so it is something I'd like to avoid. - Ismail Programmer, Tessella Ltd |
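The header layout Ismail outlines can be assembled as follows. The header names are the ones from the WARC specification (cf. ISO 28500:2009, appendix C.7); the function name and the dict-based record representation are illustrative only, and this sketch builds only the headers, not a complete record with payload and Content-Length.

```python
import uuid
from datetime import datetime, timezone

def conversion_record_headers(original, new_content_type):
    """Build the named headers for a WARC 'conversion' record, where
    `original` is a dict holding the source record's 'uri' and
    'record_id' (an assumed shape for this sketch)."""
    return {
        "WARC-Type": "conversion",
        "WARC-Target-URI": original["uri"],           # same as original record
        "WARC-Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",  # new unique ID
        "WARC-Refers-To": original["record_id"],      # old record ID
        "Content-Type": new_content_type,             # e.g. application/pdf
    }
```

A replay tool honoring conversion records would then look for the most recent record whose WARC-Refers-To chain leads back to the requested capture.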
From: Jackson, A. <And...@bl...> - 2013-06-07 13:38:35
|
Hi Ilya, Thanks for the info. Just to be clear, I am not trying to argue that you or IA should have done anything differently. You are doing what you need to do to get your work done. It’s not reasonable for the rest of us to expect you to implement the tests that we need! Rather, I am trying to argue that if the IIPC members want portable, stable, predictable and heavily tested releases of these tools, then we all need to step up and make it happen (rather than passively rely on IA, or patching and fixing in private). This is not so hard, but does require some communication and technical effort. My preferred approach is that, starting with Wayback, we make the IIPC github fork into the ‘canonical’ version, and IIPC members work together to take over managing the roadmap, reviewing issues and pull requests, testing and releases. Any member would be free to do their own thing, of course, but the IIPC roadmap would pin particular features (e.g. deduplication variants) to upcoming major/minor releases, and pool bug-fixes into point releases. IA would no longer be responsible for making official releases. Ideally, an IIPC funded role would oversee the development, communications, and the release process, so that everyone knows where we are. For example, in the case of the ACC-126 bug, that person could check if the issue was being dealt with, and could have marked the bug as ‘assigned’ and pinned the issue against the 1.8.0 release on the JIRA ACC roadmap [1]. This stuff takes time and effort we don’t always have, but a little feedback makes all the difference to the success of an open source project like this one. It’s very disheartening to feel like your tickets and pull requests are disappearing into a black hole. I imagine a pool of ‘core committers’ (drawn from the IIPC membership) would have to be set up to support this individual, agree the roadmap, help review difficult pull requests (e.g. ones that mean changing tests), and so on.
For example, the project lead might be responsible for ensuring that there is an up to date roadmap, but would *not* actually be responsible for creating it – the core committers would have to do that. That group could also define policies to make the development coordinator’s job easier, e.g. ‘if your pull request contains a new feature without a new test, it will be rejected’ [2]. I think we all want something like this to work, and that we all want to pool our resources as efficiently as possible (especially when we are all working around or patching the same bugs in private). We’re all just pressed for time, so I think things will work really well if IIPC can invest in making sure the information moves around, a roadmap can be agreed, the issues and pull requests are reviewed, the tests pass, and the releases happen. This is just my proposal for discussion, now and at the autumn meeting. I’ll happily go along with whatever structure or process makes this work. As for my integration tests, I’m planning to set them up in a separate project for now, to see how they work (https://github.com/ukwa/warc-explorer, specifically in the warc-explorer-wayback sub-project which takes the IA Wayback release and overlays it with a suitable config for local testing). Once this seems to be working, I’d be interested in patching it back into the main Wayback codebase. Thanks, Andy 1. https://webarchive.jira.com/browse/ACC#selectedTab=com.atlassian.jira.plugin.system.project%3Aroadmap-panel 2. See for example https://github.com/diaspora/diaspora/wiki/Pull-Request-Guidelines, https://django-admin2.readthedocs.org/en/latest/contributing.html, https://github.com/adobe/brackets/wiki/Pull-Request-Review-Checklist etc. From: Ilya Kreymer [mailto:il...@ar...] Sent: 06 June 2013 20:13 To: arc...@li... Subject: Re: [Archive-access-discuss] Wayback Indexer Hi Andy, I totally agree with you regarding the need for additional integration tests.
We have unfortunately not had the resources to devote to ensuring full stability of the snapshot distributions, but we are now focusing on creating a stable 1.8.0 release in the upcoming month(s). If you have any integration tests you would like to contribute or suggest, please let me know. I am aware of this bug that was filed regarding url-agnostic dedup: https://webarchive.jira.com/browse/ACC-126 This is planned to be addressed before the 1.8.0 release. If there are other bug reports, feel free to file them under this JIRA. I believe the meeting in the fall is planned to better figure out how to ensure the stability of wayback in the long term for the IIPC. Thanks, Ilya Engineer IA On 06/06/2013 09:13 AM, Jackson, Andrew wrote: It's not just the indexer. The front-end logic and the coupling to H3 have all been problematic recently. We have suffered a range of problems deploying recent Wayback versions, due to unintended consequences of recent changes that break functionality that we require. As well as the de-duplication problems I mentioned in a separate email, we've also had issues with Memento access points (which don't return link-format timemaps as they should/used to) and the XML query endpoint failing under certain conditions (due to changes in URL handling/'cleaning'). In my opinion, one of the critical jobs for the future Wayback OS project is to set up proper, automated integration tests that exercise all the functionality the IIPC partners need, and will therefore detect if changes to the source code have unintentionally altered critical behaviour. It is technically fairly straightforward to make an integration test that, say, indexes a few WARCs, fires up a Wayback instance, and checks the responses to some queries. It does, of course, require some investment of time and effort. However, that investment would enable future modifications to the code base to be carried out with far more confidence.
I've started doing some work in this area, but would appreciate knowing if anyone else is willing to put some effort into building up the testing framework. Thanks, Andy -----Original Message----- From: Jones, Gina [mailto:gj...@lo...] Sent: 06 June 2013 13:13 To: arc...@li... Subject: [Archive-access-discuss] Wayback Indexer I believe that the wayback indexer is the weakest link to long-term access to our collections. And it isn't obvious sometimes what is going on when you index content until you actually access that content. One of the projects I want to do this year (or next) is to take the available indexers and index a set of content that we have (2000-now) and review the output. gina _______________________________________________ Archive-access-discuss mailing list Arc...@li... https://lists.sourceforge.net/lists/listinfo/archive-access-discuss ************************************************************************** Experience the British Library online at http://www.bl.uk/ The British Library’s latest Annual Report and Accounts : http://www.bl.uk/aboutus/annrep/index.html Help the British Library conserve the world's knowledge. Adopt a Book. http://www.bl.uk/adoptabook The Library's St Pancras site is WiFi - enabled ************************************************************************* The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the mailto:pos...@bl... : The contents of this e-mail must not be disclosed or copied without the sender's consent.
The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author.
*************************************************************************
Think before you print
|
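Gina's idea of indexing the same content with each available indexer and reviewing the output can be partly automated. The sketch below is purely illustrative and not part of any existing tool; it assumes space-separated CDX lines with the URL key, timestamp and payload digest in columns 1, 2 and 6, which holds for both the 1.4.2 and 1.6.1 layouts quoted later in this thread.

```python
# Illustrative sketch only: compare the output of two CDX indexers run
# over the same (W)ARC files. Assumes space-separated CDX lines where
# field 0 is the canonical URL key (N), field 1 the 14-digit timestamp
# (b), and field 5 the payload digest (k), as in the "N b h m s k r V g"
# and "N b a m s k r M V g" layouts discussed in this thread.

def cdx_captures(lines):
    """Yield (urlkey, timestamp, digest) for each CDX data line."""
    for line in lines:
        if line.lstrip().startswith('CDX'):  # skip the format header line
            continue
        fields = line.split()
        if len(fields) > 5:
            yield (fields[0], fields[1], fields[5])

def compare_indexes(lines_a, lines_b):
    """Report captures that only one of the two indexers emitted."""
    a, b = set(cdx_captures(lines_a)), set(cdx_captures(lines_b))
    return sorted(a - b), sorted(b - a)
```

A run over real indexer output would feed each function the lines of the corresponding CDX file; any capture appearing on only one side is a candidate indexer bug to investigate.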
From: Jackson, A. <And...@bl...> - 2013-06-07 12:07:17
|
Actually, I was wrong about the 1.7.1-SNAPSHOT version of Wayback. In my experiments it doesn't ignore the revisit record, but rather indexes it as if it had NOT been deduplicated. This means deduplicated records actually break playback, because you just get sent back an empty payload instead of being redirected to the older copy! I'll try to start setting up some test cases for these.

The current (2nd-generation?) deduplicated record style (keeping the HTTP response headers in the revisit record, because only the payload is checksummed) implies that reconstruction of the original response can only be guaranteed if you combine both the original and the revisit records. That is, the CDX would have to contain references to both records in order to ensure fidelity of playback; leaving one out is only safe if you are sure the response headers won't cause significant problems.

Andy

> -----Original Message-----
> From: Kristinn Sigurðsson [mailto:kri...@la...]
> Sent: 06 June 2013 17:24
> To: arc...@li...
> Subject: Re: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
>
> A question on the indexing of de-duplicated records ... are they of any use as Wayback is currently implemented?
>
> The warc/revisit record in the CDX file will point at the WARC that contains that revisit record. That record does not give any indication as to where the actual payload is found. That can only be inferred as: same URL, earliest date prior to this. An inference that may or may not be accurate.
>
> The crawl logs I have contain a bit more detail, and I was planning on mining them to generate 'deduplication' cdx files that would augment the ones generated from WARCs and ARCs (especially necessary for the ARCs, as they have no record of the duplicates).
>
> It seems to me that for deduplicated content, CDX files really need to contain two file+offset values. One for the payload and another (optional one!) for the warc/revisit record.
>
> Or maybe I've completely missed something.
>
> - Kris
>
> -------------------------------------------------------------------------
> Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
> Sími/Tel: +354 5255600 | www.landsbokasafn.is
> -------------------------------------------------------------------------
> fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is
>
> > -----Original Message-----
> > From: Jackson, Andrew [mailto:And...@bl...]
> > Sent: 6. júní 2013 15:17
> > To: Jones, Gina; arc...@li...
> > Subject: Re: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
> >
> > The latest versions of Wayback still seem to have major problems. The 1.7.1-SNAPSHOT line appears to ignore de-duplication records, although this is confused by the fact that H3/Wayback has recently been changed so that de-duplication records are not empty, but rather contain the headers of the response (in case only the payload of the resource itself was unchanged). However, recent Wayback versions *require* this header, which breaks playback of older (but WARC-spec compliant) WARC files with empty de-duplication records.
> >
> > This appears to be the same in the 1.8.0-SNAPSHOT line, but other regressions mean I can't use that version (it has started refusing to accept as valid some particular compressed WARC files that the 1.7.1-SNAPSHOT line copes with just fine).
> >
> > Best wishes,
> > Andy Jackson
> >
> > > -----Original Message-----
> > > From: Jones, Gina [mailto:gj...@lo...]
> > > Sent: 04 June 2013 19:27
> > > To: arc...@li...
> > > Subject: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
> > >
> > > We have not found issues here at the Library as our collection has gotten bigger.
> > > In the past, we have had separate access points for each "collection", but we are in the process of combining our content into one access point for a more cohesive collection.
> > >
> > > However, we have found challenges in indexing and combining those indexes, specifically due to deduplicated content. We have content beginning in 2009 that has been deduplicated using the WARC/revisit field.
> > >
> > > This is what we think we have figured out. If anyone has any other information on these indexers, we would love to know about it. We posted a question to the listserv about 2 years ago and didn't get any comments back:
> > >
> > > Wayback 1.4.x Indexers
> > > - The Wayback 1.4.2 indexer produces "warc/revisit" fields in the file content index that Wayback 1.4.2 cannot process and display.
> > > - When we re-indexed the same content with the Wayback 1.4.0 indexer, Wayback was able to handle the revisit entries. Since the "warc/revisit" field didn't exist at the time that Wayback 1.4.0 was released, we suppose that Wayback 1.4.0 responds to those entries as it would to any date instance link where content was missing: by redirecting to the next most temporally proximate capture.
> > > - Wayback 1.6.0 can handle file content indexes with "warc/revisit" fields, as well as the older 1.4.0 file content indexes.
> > > - We have been unable to get the Wayback 1.6.0 indexer to run on an AIX server.
> > > - The Wayback 1.6.0 indexer writes an alpha key code to the top line of the file content index. If you are merging indexes and re-sorting manually, be sure to remove that line after the index is generated.
> > >
> > > Combining cdx's from multiple indexers
> > > - As for the issue of combining the indexes, it comes down to the number of fields that 1.4.0 / 1.4.2 and 1.6.X generate. The older version generates a different version of the index, with a different subset of fields.
> > > - Wayback 1.6.0 can handle both indexes, so it doesn't matter which of the two your content is indexed with. However, if you plan to combine the indexes into one big index, they need to match.
> > > - The specific problem we had was with sections of an ongoing crawl. 2009 content was indexed with 1.4.X, but 2009+2010 content was indexed with 1.6.X, so if we merge and sort we would get the 2009 entries twice, because they do not match exactly (different number of fields).
> > > - The field configurations for the two versions (as we have them) are:
> > >
> > > 1.4.2: CDX N b h m s k r V g
> > > 1.6.1: CDX N b a m s k r M V g
> > >
> > > For definitions of the fields, here is an old reference:
> > > http://archive.org/web/researcher/cdx_legend.php
> > >
> > > Gina Jones
> > > Ignacio Garcia del Campo
> > > Laura Graham
> > >
> > > -----Original Message-----
> > > From: arc...@li... [mailto:arc...@li...]
> > > Sent: Tuesday, June 04, 2013 8:03 AM
> > > To: arc...@li...
> > > Subject: Archive-access-discuss Digest, Vol 78, Issue 2
> > >
> > > Today's Topics:
> > >
> > >    1. Best practices for indexing a growing 2+ billion document collection (Kristinn Sigurðsson)
> > >    2. Re: Best practices for indexing a growing 2+ billion document collection (Erik Hetzner)
> > >    3. Re: Best practices for indexing a growing 2+ billion document collection (Colin Rosenthal)
> > >
> > > ----------------------------------------------------------------------
> > >
> > > Message: 1
> > > Date: Mon, 3 Jun 2013 11:39:40 +0000
> > > From: Kristinn Sigurðsson <kri...@la...>
> > > Subject: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection
> > > To: "arc...@li..." <arc...@li...>
> > > Message-ID: <E48...@bl...khlada.local>
> > > Content-Type: text/plain; charset="utf-8"
> > >
> > > Dear all,
> > >
> > > We are planning on updating our Wayback installation and I would like to poll your collective wisdom on the best approach for managing the Wayback index.
> > >
> > > Currently, our collection is about 2.2 billion items. It is also growing at a rate of approximately 350-400 million records per year.
> > >
> > > The obvious approach would be to use a sorted CDX file (or files) as the index. I'm, however, concerned about its performance at this scale. Additionally, updating a CDX-based index can be troublesome, especially as we would like to update it continuously as new material is ingested.
> > >
> > > Any relevant experience and advice you could share on this topic would be greatly appreciated.
> > >
> > > Best regards,
> > > Mr. Kristinn Sigurðsson
> > > Head of IT
> > > National and University Library of Iceland
> > >
> > > -------------------------------------------------------------------------
> > > Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
> > > Sími/Tel: +354 5255600 | www.landsbokasafn.is
> > > -------------------------------------------------------------------------
> > > fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is
> > >
> > > ------------------------------
> > >
> > > Message: 2
> > > Date: Mon, 03 Jun 2013 11:49:04 -0700
> > > From: Erik Hetzner <eri...@uc...>
> > > Subject: Re: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection
> > > To: Kristinn Sigurðsson <kri...@la...>
> > > Cc: "arc...@li..." <arc...@li...>
> > > Message-ID: <201...@ma...>
> > > Content-Type: text/plain; charset="utf-8"
> > >
> > > At Mon, 3 Jun 2013 11:39:40 +0000, Kristinn Sigurðsson wrote:
> > > > [...]
> > >
> > > Hi Kristinn,
> > >
> > > We use 4 different CDX files. One is updated every ten minutes, one hourly, one daily, and one monthly. We use the unix sort command to sort. This has worked pretty well for us. We aren't doing it in the most efficient manner, and we will probably switch to sorting with hadoop at some point, but it works pretty well.
> > >
> > > best, Erik
> > >
> > > -------------- next part --------------
> > > Sent from my free software system <http://fsf.org/>.
> > >
> > > ------------------------------
> > >
> > > Message: 3
> > > Date: Tue, 4 Jun 2013 12:17:18 +0200
> > > From: Colin Rosenthal <cs...@st...>
> > > Subject: Re: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection
> > > To: arc...@li...
> > > Message-ID: <51A...@st...>
> > > Content-Type: text/plain; charset="UTF-8"; format=flowed
> > >
> > > On 06/03/2013 08:49 PM, Erik Hetzner wrote:
> > > > [...]
> > >
> > > Hi Kristinn,
> > >
> > > Our strategy for building cdx indexes is described at
> > > https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackConfiguration-AggregatorApplication
> > >
> > > Essentially we have multiple threads creating unsorted cdx files for all new arc/warc files in the archive. These are then sorted and merged into an intermediate index file. When the intermediate file grows larger than 100MB, it is merged with the current main index file, and when that grows larger than 50GB we roll over to a new main index file. We currently have about 5TB of cdx index in total. This includes 16 older cdx files of 150GB-300GB each, built by hand-rolled scripts before we had a functional automatic indexing workflow.
> > >
> > > We would be fascinated to hear if anyone is using an entirely different strategy (e.g. bdb) for a large archive.
> > >
> > > One of our big issues at the moment is QA of our cdx files. How can we be sure that our indexes actually cover all the files and records in the archive?
> > >
> > > Colin Rosenthal
> > > IT-Developer
> > > Netarkivet, Denmark
> > >
> > > ------------------------------
> > >
> > > End of Archive-access-discuss Digest, Vol 78, Issue 2
> > > *****************************************************
|
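The resolution step discussed above, pairing a warc/revisit CDX entry with the earlier record that actually holds the payload, can be illustrated with a toy resolver. This is not Wayback's actual code; the capture dicts and field names are invented for the example, and the return value reflects the point that faithful playback needs both records (payload from the original, response headers from the revisit).

```python
# Toy model of revisit resolution, NOT Wayback's real implementation.
# Each capture is a dict holding the CDX fields that matter here:
# timestamp (14-digit string), digest, mimetype, filename, offset.

def resolve_revisit(captures, revisit):
    """Find the earlier capture holding the payload for a revisit entry.

    Returns (payload_capture, revisit): a replay tool would combine the
    stored payload from the first with the revisit record's own response
    headers. Raises LookupError if no matching original is indexed.
    """
    candidates = [c for c in captures
                  if c['mimetype'] != 'warc/revisit'
                  and c['digest'] == revisit['digest']
                  and c['timestamp'] < revisit['timestamp']]
    if not candidates:
        # The failure mode described in the thread: a revisit with no
        # resolvable original breaks playback.
        raise LookupError('no original found for digest %s' % revisit['digest'])
    # The most recent earlier capture with the same digest wins.
    return max(candidates, key=lambda c: c['timestamp']), revisit
```

Matching on the payload digest rather than on "same URL, nearest earlier date" also covers URL-agnostic deduplication, at the cost of requiring the original to be present in the same index.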
From: Kristinn S. <kri...@la...> - 2013-06-07 11:03:39
|
I believe a fundamental mistake is being made in how duplicates are treated by these tools. We should not treat URI-specific and URI-agnostic duplicates any differently: both should have the URI and date of the original payload recorded in the warc/revisit record. This simplifies WARC reading, as (ignoring legacy WARCs) you will always have the information necessary to replay the content available in the revisit record itself. The WARC writer also doesn't need to handle more complex cases: either it is a revisit record or it is not.

Additionally, this makes building deduplication indexes (to be consulted at crawl time) easier, as you have all the necessary info in the WARCs; you don't need to dereference the revisit records you encounter there.

As for WARC-Refers-To-Filename and WARC-Refers-To-File-Offset: these should not be used. It was very clear that many organizations do not view their WARC records as immutable objects. Any support for them should be implemented as optional and defaulting to off.

- Kris

-------------------------------------------------------------------------
Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is
-------------------------------------------------------------------------
fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is

> -----Original Message-----
> From: Noah Levitt [mailto:nl...@ar...]
> Sent: 6. júní 2013 19:23
> To: Ilya Kreymer
> Cc: arc...@li...
> Subject: Re: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
>
> Re WARC-Refers-To-Filename and WARC-Refers-To-File-Offset, as some of you know there is a proposed spec
> https://docs.google.com/document/d/1QyQBA7Ykgxie75V8Jziz_O7hbhwf7PF6_u9O6w6zgp0
> that discourages these two fields, and instead "strongly recommends" WARC-Refers-To-Target-URI and WARC-Refers-To-Date.
>
> Url-agnostic revisit records written by heritrix currently contain all four of those headers.
> The wayback implementation does support replay using WARC-Refers-To-Target-URI + WARC-Refers-To-Date, but that code path hasn't been exercised much at IA yet.
>
> Of course we plan to update heritrix and wayback to be more in line with the new spec soon, perhaps dropping support for WARC-Refers-To-Filename + WARC-Refers-To-File-Offset. If someone else out there wants to work on that, that would also be welcome.
>
> (There seems to be a feeling that IA has too much control over the code. Should more people have commit privs maybe? And/or maybe the repos under https://github.com/iipc should be canonical?)
>
> Noah
>
> On Thu, Jun 6, 2013 at 11:53 AM, Ilya Kreymer <il...@ar...> wrote:
> > Hi,
> >
> > I wanted to clear up some confusion about how the revisit system works.
> >
> > When wayback reads cdx records for a given url, it stores them by their digest hash in a cache (a map) for that request. If/when a record of "warc/revisit" type is encountered, wayback will look up the digest in this map and resolve the lookup to the original. If the original cannot be found for that revisit digest, wayback will display an error.
> >
> > The traditional implementation, going back several versions, was to play back the original warc headers and content from the original. We realized that this was incorrect, due to the fact that the digest only accounts for the response body and not the headers. Since the warc that produces the revisit record still has the latest captured headers, wayback will replay the headers from the latest capture with the content from the original; again, the digest guarantees only that the body is the same, not the headers.
> >
> > Thus, to handle a revisit record, wayback will be reading from two warcs: the one with the revisit record and the one with the original.
> >
> > Finally, we've recently added support for the url-agnostic features that were added to Heritrix, which support looking up the original based on annotations found in the warc, such as WARC-Refers-To-Filename and WARC-Refers-To-File-Offset (https://webarchive.jira.com/browse/HER-2022). This allows wayback to resolve the revisit against a cdx record from a different url by pointing to the warc name and offset directly. This feature is still somewhat experimental and is not yet in wide use.
> >
> > I hope this clears things up a bit; if not, feel free to respond and we'll try to elaborate further, as this is a potentially confusing area.
> >
> > Thanks,
> >
> > Ilya
> > Internet Archive
> > Engineer
One is > updated every ten minutes, one > > hourly, > > one daily, and one monthly. We use the > unix sort command to sort. > > This > has > > worked pretty well for us. We aren?t doing > it in the most efficient > > manner, > > and we will probably switch to sorting > with hadoop at some point, > > but > it > > works pretty well. > > best, Erik > -------------- next part -------------- > Sent from my free software system > <http://fsf.org/> <http://fsf.org/> . > > ------------------------------ > > Message: 3 > Date: Tue, 4 Jun 2013 12:17:18 +0200 > From: Colin Rosenthal > <cs...@st...> <mailto:cs...@st...> > Subject: Re: [Archive-access-discuss] Best > practices for indexing a > growing 2+ billion document > collection > To: archive-access- > di...@li... > Message-ID: > <51A...@st...> > <mailto:51A...@st...> > Content-Type: text/plain; charset="UTF-8"; > format=flowed > > On 06/03/2013 08:49 PM, Erik Hetzner > wrote: > > At Mon, 3 Jun 2013 11:39:40 +0000, > Kristinn Sigur?sson wrote: > > Dear all, > > We are planning on updating > our Wayback installation and I would > > like > > to poll your collective > wisdom on the best approach for managing > > the > > Wayback index. > > Currently, our collection is > about 2.2 billion items. It is also > growing at a rate of > approximately 350-400 million records per > > year. > > The obvious approach would be > to use a sorted CDX file (or > > files) > as > > the index. I'm, however, > concerned about its performance at this > scale. Additionally, updating > a CDX based index can be > > troublesome. > > Especially as we would like > to update it continuously as new > > material > > is ingested. > > Any relevant experience and > advice you could share on this topic > would be greatly appreciated. > > Hi Kristinn, > > We use 4 different CDX files. One is > updated every ten minutes, > > one > > hourly, one daily, and one monthly. > We use the unix sort command > > to > > sort. This has worked pretty well > for us. 
We aren?t doing it in > > the > > most efficient manner, and we will > probably switch to sorting > > with > > hadoop at some point, but it works > pretty well. > > best, Erik > > Hi Kristinn, > > Our strategy for building cdx indexes is > described at > > > > https://sbforge.org/display/NASDOC321/Wayback+Configuration#Way > backC > > onfiguration-AggregatorApplication > . > > Essentially we have multiple threads > creating unsorted cdx files > > for > all new > > arc/warc files in the archive. These are > then sorted and merged > > into > an > > intermediate index file. When the > intermediate file grows larger > > than > 100MB, > > it is merged with the current main index > file, and when that grows > > larger than > > 50GB we rollover to a new main index file. > We currently have about > > 5TB > total > > cdx index. This includes 16 older cdx > files of size 150GB-300GB, > > built > by > > handrolled scripts before we had a > functional automatic indexing > > workflow. > > We would be fascinated to hear if anyone > is using an entirely > > different > > strategy (e.g. bdb) for a large archive. > > One of our big issues at the moment is QA > of our cdx files. How can > > we > be > > sure that our indexes actually cover all > the files and records in > > the > archive? > > Colin Rosenthal > IT-Developer > Netarkivet, Denmark > > > > > ------------------------------ > > > > ------------------------------------------------- > -------------------- > --- > ------ > > How ServiceNow helps IT people transform > IT departments: > 1. A cloud service to automate IT design, > transition and operations > > 2. > > Dashboards that offer high-level views of > enterprise services 3. A > > single > > system of record for all IT processes > > http://p.sf.net/sfu/servicenow-d2d-j > > ------------------------------ > > > _______________________________________________ > Archive-access-discuss mailing list > Archive-access- > di...@li... 
> > https://lists.sourceforge.net/lists/listinfo/archive-access- > discuss > > > End of Archive-access-discuss Digest, Vol > 78, Issue 2 > > ***************************************************** > > > > ------------------------------------------------- > -------------------- > --- > ------ > > How ServiceNow helps IT people transform > IT departments: > 1. A cloud service to automate IT design, > transition and operations > > 2. > > Dashboards that offer high-level views of > enterprise services 3. A > > single > > system of record for all IT processes > > http://p.sf.net/sfu/servicenow-d2d-j > > > _______________________________________________ > Archive-access-discuss mailing list > Archive-access- > di...@li... > > https://lists.sourceforge.net/lists/listinfo/archive-access- > discuss > > > *************************************************************** > ****** > ***** > Experience the British Library online at > http://www.bl.uk/ > > The British Library's latest Annual Report and > Accounts : > http://www.bl.uk/aboutus/annrep/index.html > > Help the British Library conserve the world's > knowledge. Adopt a > Book. http://www.bl.uk/adoptabook > > The Library's St Pancras site is WiFi - enabled > > > *************************************************************** > ****** > **** > > The information contained in this e-mail is > confidential and may be > legally privileged. It is intended for the > addressee(s) only. If you > are not the intended recipient, please delete > this e-mail and notify > the mailto:pos...@bl... : The contents of > this e-mail must not be > disclosed or copied without the sender's consent. > > The statements and opinions expressed in this > message are those of > the author and do not necessarily reflect those > of the British > Library. The British Library does not take any > responsibility for the > views of the author. 
> > > *************************************************************** > ****** > **** > Think before you print > > ------------------------------------------------- > -------------------- > --------- > How ServiceNow helps IT people transform IT > departments: > 1. A cloud service to automate IT design, > transition and operations > 2. Dashboards that offer high-level views of > enterprise services > 3. A single system of record for all IT processes > http://p.sf.net/sfu/servicenow-d2d-j > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > > https://lists.sourceforge.net/lists/listinfo/archive-access- > discuss > > -------------------------------------------------------- > ---------------------- > How ServiceNow helps IT people transform IT departments: > 1. A cloud service to automate IT design, transition and > operations > 2. Dashboards that offer high-level views of enterprise > services > 3. A single system of record for all IT processes > http://p.sf.net/sfu/servicenow-d2d-j > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive- > access-discuss > > > > --------------------------------------------------------------- > --------------- > How ServiceNow helps IT people transform IT departments: > 1. A cloud service to automate IT design, transition and > operations > 2. Dashboards that offer high-level views of enterprise > services > 3. A single system of record for all IT processes > http://p.sf.net/sfu/servicenow-d2d-j > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access- > discuss > > > |
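One way around the field-count mismatch Gina describes is to project both CDX flavors onto the fields they share before merging, so the same capture indexed by 1.4.2 and 1.6.1 collapses to a single line. Below is a minimal Python sketch: the field letters come from the message above, while the function names and sample data are illustrative, not any shipping Wayback tool.

```python
# Project two CDX flavors onto their shared fields so duplicate captures
# collapse when the indexes are merged. Field letters per the message:
SPEC_142 = "N b h m s k r V g".split()    # Wayback 1.4.2 index header
SPEC_161 = "N b a m s k r M V g".split()  # Wayback 1.6.1 index header
COMMON = [f for f in SPEC_142 if f in SPEC_161]  # N b m s k r V g

def project(line, spec):
    """Map one space-delimited CDX line onto the fields shared by both specs."""
    values = dict(zip(spec, line.split()))
    return " ".join(values[f] for f in COMMON)

def merge_dedup(lines_142, lines_161):
    """Sort both indexes together on the raw lines (primary order is still
    urlkey + timestamp), project each onto the common fields, and drop
    exact duplicates of the projection."""
    seen = set()
    for line, spec in sorted(
        [(l, SPEC_142) for l in lines_142] + [(l, SPEC_161) for l in lines_161]
    ):
        key = project(line, spec)
        if key not in seen:
            seen.add(key)
            yield key
```

The trade-off is that the merged index loses the fields unique to each flavor (h, a, M); in exchange, re-indexed 2009 content no longer appears twice.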
From: Nicholas C. <ni...@kb...> - 2013-06-06 21:23:44
|
I experimented with an alternative flatfile lookup implementation which caches the first 16 levels of binary search decisions. That could be fun to mix with compressed blocks.

So unless a prefix spans more than 3000 lines, you only decompress one block of gzip'ed data per lookup? Is this format faster, or predominantly used to save disk space?

-Nicholas

> -----Original Message-----
> From: Ilya Kreymer [mailto:il...@ar...]
> Sent: 6. juni 2013 21:38
> To: arc...@li...
> Subject: Re: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
>
> I also wanted to provide a brief overview of the new indexing format we are using at IA for large indexes.
>
> We refer to this as "zipnum" or "ziplines" and it is basically concatenated gzip'd blocks, each with 3000 lines of cdx.
>
> The concatenated .gz file has a corresponding sorted text index summary. The text index summary has the first url of each 3000-line block and a filename and offset into the full concatenated .gz file.
>
> This allows the full .gz index to be spread over multiple shards, and lends itself well to being built in Hadoop.
>
> We have tools to generate the zipnum sharded index in hadoop as well as standalone Java and Python tools.
>
> We are working on providing more documentation of this format, but I just wanted to give a brief overview for now.
>
> Using this format, we have been using a similar approach of having a full zipnum cluster (updated less frequently) and smaller zipnum clusters that are updated daily or hourly, and then re-merged into the full zipnum cluster.
>
> Wayback has stable support for reading this data format (via ZipNumClusterSearchResultSource), which we have been using for over a year, and the tools to generate the format are in the ia-hadoop-tools repository; however, we definitely need to provide more documentation on using this system.
>
> Please feel free to let us know if you have further questions in the mean time.
>
> Ilya,
> Engineer
> IA
>
> On 06/06/2013 12:17 AM, Colin Rosenthal wrote:
> > On 06/04/2013 08:27 PM, Jones, Gina wrote:
> >> -Wayback 1.6.0 can handle both indexes, so it doesn't matter if you have your content indexed with either of the two. However, if you plan to combine the indexes into one big index, they need to match.
> >>
> >> -The specific problem we had was with sections of an ongoing crawl. 2009 content was indexed with 1.4.X, but 2009+2010 content was indexed with 1.6.X, so if we merge and sort, we would get the 2009 entries twice, because they do not match exactly (different number of fields).
> >>
> >> -The field configurations for the two versions (as we have them) are:
> >>
> >> 1.4.2: CDX N b h m s k r V g
> >> 1.6.1: CDX N b a m s k r M V g
> >>
> >> For definitions of the fields here is an old reference: http://archive.org/web/researcher/cdx_legend.php
> >>
> > Thank you, Gina, that is extremely interesting!
> >
> > Colin Rosenthal
> > Netarkivet
|
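On Nicholas's question: with the block layout Ilya describes, a lookup binary-searches the plain-text summary and then inflates only the one candidate block (or a handful, for a prefix that spans block boundaries). A rough Python sketch follows; the tab-separated summary layout and function names here are assumptions for illustration, not the exact IA on-disk format.

```python
import bisect
import gzip

def load_summary(path):
    """Assumed summary line layout: '<first-urlkey>\t<shard-file>\t<offset>\t<length>',
    one line per 3000-line gzip block, sorted by urlkey."""
    entries = []
    with open(path) as f:
        for line in f:
            key, shard, off, length = line.rstrip("\n").split("\t")
            entries.append((key, shard, int(off), int(length)))
    return entries

def lookup(entries, url_key):
    """Binary-search the summary for the last block starting <= url_key,
    then decompress just that one gzip member and scan its CDX lines."""
    keys = [e[0] for e in entries]          # could be precomputed once
    i = bisect.bisect_right(keys, url_key) - 1
    if i < 0:
        return []
    _, shard, off, length = entries[i]
    with open(shard, "rb") as f:
        f.seek(off)
        block = gzip.decompress(f.read(length))  # one member = <=3000 CDX lines
    return [l for l in block.decode().splitlines() if l.startswith(url_key)]
```

Because each block is an independent gzip member at a known (offset, length), a query touches one small summary file plus one ~3000-line block, which is why the format saves disk space without giving up flat-file-style lookup speed.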
From: Ilya K. <il...@ar...> - 2013-06-06 19:38:01
|
I also wanted to provide a brief overview of the new indexing format we are using at IA for large indexes.

We refer to this as "zipnum" or "ziplines" and it is basically concatenated gzip'd blocks, each with 3000 lines of cdx.

The concatenated .gz file has a corresponding sorted text index summary. The text index summary has the first url of each 3000-line block and a filename and offset into the full concatenated .gz file.

This allows the full .gz index to be spread over multiple shards, and lends itself well to being built in Hadoop.

We have tools to generate the zipnum sharded index in hadoop as well as standalone Java and Python tools.

We are working on providing more documentation of this format, but I just wanted to give a brief overview for now.

Using this format, we have been using a similar approach of having a full zipnum cluster (updated less frequently) and smaller zipnum clusters that are updated daily or hourly, and then re-merged into the full zipnum cluster.

Wayback has stable support for reading this data format (via ZipNumClusterSearchResultSource), which we have been using for over a year, and the tools to generate the format are in the ia-hadoop-tools repository; however, we definitely need to provide more documentation on using this system.

Please feel free to let us know if you have further questions in the mean time.

Ilya,
Engineer
IA

On 06/06/2013 12:17 AM, Colin Rosenthal wrote:
> On 06/04/2013 08:27 PM, Jones, Gina wrote:
>> -Wayback 1.6.0 can handle both indexes, so it doesn't matter if you have your content indexed with either of the two. However, if you plan to combine the indexes into one big index, they need to match.
>>
>> -The specific problem we had was with sections of an ongoing crawl. 2009 content was indexed with 1.4.X, but 2009+2010 content was indexed with 1.6.X, so if we merge and sort, we would get the 2009 entries twice, because they do not match exactly (different number of fields).
>>
>> -The field configurations for the two versions (as we have them) are:
>>
>> 1.4.2: CDX N b h m s k r V g
>> 1.6.1: CDX N b a m s k r M V g
>>
>> For definitions of the fields here is an old reference: http://archive.org/web/researcher/cdx_legend.php
>>
> Thank you, Gina, that is extremely interesting!
>
> Colin Rosenthal
> Netarkivet
|
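The tiered setup Ilya and Erik describe (small hourly/daily indexes periodically re-merged into a full index) boils down to a streaming k-way merge of already-sorted CDX files. A hedged Python sketch, not the actual IA or unix-sort tooling; it assumes the inputs are sorted and newline-terminated:

```python
import heapq

def merge_sorted_cdx(paths, out_path):
    """Fold several already-sorted CDX files (e.g. hourly + daily + full)
    into one sorted index with a streaming k-way merge, so the large
    index is never held in memory. Exact duplicate lines across tiers
    are dropped."""
    files = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            last = None
            for line in heapq.merge(*files):  # files iterate line by line
                if line != last:
                    out.write(line)
                last = line
    finally:
        for f in files:
            f.close()
```

This is the same shape as `sort -m` on sorted inputs; the point of the tiers is that only the small, frequently-updated files are re-sorted often, while the expensive fold into the full index happens rarely.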
From: Noah L. <nl...@ar...> - 2013-06-06 19:23:29
|
Re WARC-Refers-To-Filename and WARC-Refers-To-File-Offset: as some of you know, there is a proposed spec, https://docs.google.com/document/d/1QyQBA7Ykgxie75V8Jziz_O7hbhwf7PF6_u9O6w6zgp0 , that discourages these two fields and instead "strongly recommends" WARC-Refers-To-Target-URI and WARC-Refers-To-Date. Url-agnostic revisit records written by heritrix currently contain all four of those headers. The wayback implementation does support replay using WARC-Refers-To-Target-URI + WARC-Refers-To-Date, but that code path hasn't been exercised much at IA yet. Of course we plan to update heritrix and wayback to be more in line with the new spec soon, perhaps dropping support for WARC-Refers-To-Filename + WARC-Refers-To-File-Offset. If someone else out there wants to work on that, that would also be welcome. (There seems to be a feeling that IA has too much control over the code. Should more people have commit privs maybe? And/or maybe the repos under https://github.com/iipc should be canonical?)

Noah

On Thu, Jun 6, 2013 at 11:53 AM, Ilya Kreymer <il...@ar...> wrote:

> Hi,
>
> I wanted to clear up some confusion about how the revisit system is working.
>
> When wayback reads cdx records for a given url, it stores them by their digest hash in a cache (a map) for that request.
>
> If/when a record of "warc/revisit" type is encountered, wayback will look up the digest in this map and resolve the revisit to the original. If the original cannot be found for that revisit digest, wayback will display an error.
>
> The traditional implementation, going back several versions, was to play back the original warc headers and content from the original.
>
> We realized that this was incorrect due to the fact that the digest only accounts for the response body and not the headers.
> Since the warc that produces the revisit record still has the latest captured headers, wayback will replay the headers from the latest capture with the content from the original - again, since the digest guarantees only that the body is the same, not the headers.
>
> Thus, to handle the revisit record, wayback will be reading from two warcs: the one with the revisit record and the original.
>
> Finally, we've recently added support for the url-agnostic features that were added to Heritrix, which support looking up the original based on annotations found in the warc, such as WARC-Refers-To-Filename and WARC-Refers-To-File-Offset (https://webarchive.jira.com/browse/HER-2022). This allows wayback to resolve the revisit against a cdx record from a different url by pointing to the warc name and offset directly. This feature is still somewhat experimental and is not yet in wide use.
>
> I hope this clears things up a bit; if not, feel free to respond and we'll try to elaborate further, as this is a potentially confusing area.
>
> Thanks,
>
> Ilya
> Internet Archive
> Engineer
>
> On 06/06/2013 09:24 AM, Kristinn Sigurðsson wrote:
>
> A question on the indexing of de-duplicated records ... are they of any use as Wayback is currently implemented?
>
> The warc/revisit record in the CDX file will point at the WARC that contains that revisit record. That record does not give any indication as to where the actual payload is found. That can only be inferred as: same URL, earliest date prior to this. An inference that may or may not be accurate.
>
> The crawl logs I have contain a bit more detail, and I was planning on mining them to generate 'deduplication' cdx files that would augment the ones generated from WARCs and ARCs (especially necessary for the ARCs, as they have no record of the duplicates).
>
> It seems to me that for deduplicated content, CDX files really need to contain two file+offset values. One for the payload and another (optional one!) for the warc/revisit record.
>
> Or maybe I've completely missed something.
>
> - Kris
>
> Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
> Sími/Tel: +354 5255600 | www.landsbokasafn.is
> fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is
>
> -----Original Message-----
> From: Jackson, Andrew [mailto:And...@bl...]
> Sent: 6. júní 2013 15:17
> To: Jones, Gina; arc...@li...
> Subject: Re: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
>
> The latest versions of Wayback still seem to have major problems. The 1.7.1-SNAPSHOT line appears to ignore de-duplication records, although this is confused by the fact that H3/Wayback has recently been changed so that de-duplication records are not empty, but rather they contain the headers of the response (in case only the payload of the resource itself was unchanged). However, recent Wayback versions *require* this header, which breaks playback in older (but WARC-spec compliant) WARC files with empty de-duplication records.
>
> This appears to be the same in the 1.8.0-SNAPSHOT line, but other regressions mean I can't use that version (it has started refusing to accept as valid some particular compressed WARC files that the 1.7.1-SNAPSHOT line copes with just fine).
>
> Best wishes,
> Andy Jackson
>
> -----Original Message-----
> From: Jones, Gina [mailto:gj...@lo...]
> Sent: 04 June 2013 19:27
> To: arc...@li...
> Subject: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
>
> We have not found issues here at the Library as our collection has
In the past, we have had separate access points to the > > each > > "collection" but are in the process of combining our content into > > one > access > > point for a more cohesive collection. > > However, we have found challenges in indexing and combining those > indexes, specifically due to deduplicated content. We have content > beginning in 2009 that has been deduplicated using the WARC/revisit > > field. > > This is what we have think we have figured out. If anyone has any > > other > > information on these indexers, we would love to know about it. We > > posted > > a question to the listserv about 2 years ago and didn't get any > > comments > > back: > > Wayback 1.4.x Indexers > -The Wayback 1.4.2 indexer produces "warc/revisit" fields in the > > file > content > > index that Wayback 1.4.2 cannot process and display. > > -When we re-indexed the same content with Wayback 1.4.0 indexer, > Wayback was able to handle the revisit entries. Since the > > "warc/revisit" field > > didn't exist at the time that Wayback 1.4.0 was released, we > > suppose > that > > Wayback 1.4.0 responds to those entries as it would to any date > > instance link > > where content was missing - by redirecting to the next most > > temporally- > > proximate capture. > > -Wayback 1.6.0 can handle file content indexes with "warc/revisit" > > fields, as > > well as the older 1.4.0 file content indexes > > -We have been unable to get Wayback 1.6.0 indexer to run on an AIX > > server. > > -Wayback 1.6.0 indexer writes an alpha key code to the top line of > > the > file > > content index. If you are merging indexes and resorting manually, > > be > sure to > > remove that line after the index is generated. > > Combining cdx's from multiple indexers > > -As for the issue on combining the indexes, it has to do with the > > number of > > fields that 1.4.0 / 1.4.2 and 1.6.X generate. The older version > > generates a > > different version of the index, with a different subset of fields. 
> > -Wayback 1.6.0 can handle both indexes, so it doesn't matter if you > > have > > your content indexed with either of the two. However, if you plan > > to > > combine the indexes into one big index, they need to match. > > -The specific problem we had was with sections of an ongoing crawl. > > 2009 > > content was indexed with 1.4.X, but 2009+2010 content was indexed > > with > > 1.6.X, so if we merge and sort, we would get the 2009 entries > > twice, > because > > they do not match exactly (different number of fields). > > -The field configurations for the two versions (as we have them > > are) > > 1.4.2: CDX N b h m s k r V g > 1.6.1: CDX N b a m s k r M V g > > For definitions of the fields here is an old reference:http://archive.org/web/researcher/cdx_legend.php > > > Gina Jones > Ignacio Garcia del Campo > Laura Graham > > > -----Original Message----- > From: arc...@li... > > [mailto:archive <archive>- > > acc...@li...] > Sent: Tuesday, June 04, 2013 8:03 AM > To: arc...@li... > Subject: Archive-access-discuss Digest, Vol 78, Issue 2 > > Send Archive-access-discuss mailing list submissions to > arc...@li... > > To subscribe or unsubscribe via the World Wide Web, visit > > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > or, via email, send a message with subject or body 'help' to > arc...@li... > > You can reach the person managing the list at > arc...@li... > > When replying, please edit your Subject line so it is more specific > > than "Re: > > Contents of Archive-access-discuss digest..." > > > Today's Topics: > > 1. Best practices for indexing a growing 2+ billion document > collection (Kristinn Sigur?sson) > 2. Re: Best practices for indexing a growing 2+ billion > > document > > collection (Erik Hetzner) > 3. 
Re: Best practices for indexing a growing 2+ billion document > collection (Colin Rosenthal) > > > ------------------------------------------------------------------- > > --- > > Message: 1 > Date: Mon, 3 Jun 2013 11:39:40 +0000 > From: Kristinn Sigur?sson <kri...@la...> <kri...@la...> > Subject: [Archive-access-discuss] Best practices for indexing a > growing 2+ billion document collection > To: "arc...@li..." <arc...@li...> > <arc...@li...> <arc...@li...> > Message-ID: > <E48...@bl...khlada.local> <E48...@bl...khlada.local> > Content-Type: text/plain; charset="utf-8" > > Dear all, > > We are planning on updating our Wayback installation and I would > > like > to poll > > your collective wisdom on the best approach for managing the > > Wayback > > index. > > Currently, our collection is about 2.2 billion items. It is also > > growing at a rate of > > approximately 350-400 million records per year. > > The obvious approach would be to use a sorted CDX file (or files) > > as > the > > index. I'm, however, concerned about its performance at this scale. > Additionally, updating a CDX based index can be troublesome. > > Especially as > > we would like to update it continuously as new material is > > ingested. > > Any relevant experience and advice you could share on this topic > > would > be > > greatly appreciated. > > > Best regards, > Mr. 
Kristinn Sigur?sson > Head of IT > National and University Library of Iceland > > > > > > > --------------------------------------------------------------------- > --- > - > > Landsb?kasafn ?slands - H?sk?lab?kasafn | Arngr?msg?tu 3 - 107 > > Reykjav?k > > S?mi/Tel: +354 5255600 | www.landsbokasafn.is > > --------------------------------------------------------------------- > --- > - > > fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is > > ------------------------------ > > Message: 2 > Date: Mon, 03 Jun 2013 11:49:04 -0700 > From: Erik Hetzner <eri...@uc...> <eri...@uc...> > Subject: Re: [Archive-access-discuss] Best practices for indexing a > growing 2+ billion document collection > To: Kristinn Sigur?sson <kri...@la...> <kri...@la...> > Cc: "arc...@li..." <arc...@li...> > <arc...@li...> <arc...@li...> > Message-ID: <201...@ma...> <201...@ma...> > Content-Type: text/plain; charset="utf-8" > > At Mon, 3 Jun 2013 11:39:40 +0000, > Kristinn Sigur?sson wrote: > > Dear all, > > We are planning on updating our Wayback installation and I would > > like > > to poll your collective wisdom on the best approach for managing > > the > > Wayback index. > > Currently, our collection is about 2.2 billion items. It is also > growing at a rate of approximately 350-400 million records per > > year. > > The obvious approach would be to use a sorted CDX file (or files) > > as > > the index. I'm, however, concerned about its performance at this > scale. Additionally, updating a CDX based index can be > > troublesome. > > Especially as we would like to update it continuously as new > > material > > is ingested. > > Any relevant experience and advice you could share on this topic > > would > > be greatly appreciated. > > Hi Kristinn, > > We use 4 different CDX files. One is updated every ten minutes, one > > hourly, > > one daily, and one monthly. We use the unix sort command to sort. > > This > has > > worked pretty well for us. 
We aren?t doing it in the most efficient > > manner, > > and we will probably switch to sorting with hadoop at some point, > > but > it > > works pretty well. > > best, Erik > -------------- next part -------------- > Sent from my free software system <http://fsf.org/> <http://fsf.org/>. > > ------------------------------ > > Message: 3 > Date: Tue, 4 Jun 2013 12:17:18 +0200 > From: Colin Rosenthal <cs...@st...> <cs...@st...> > Subject: Re: [Archive-access-discuss] Best practices for indexing a > growing 2+ billion document collection > To: arc...@li... > Message-ID: <51A...@st...> <51A...@st...> > Content-Type: text/plain; charset="UTF-8"; format=flowed > > On 06/03/2013 08:49 PM, Erik Hetzner wrote: > > At Mon, 3 Jun 2013 11:39:40 +0000, > Kristinn Sigur?sson wrote: > > Dear all, > > We are planning on updating our Wayback installation and I would > > like > > to poll your collective wisdom on the best approach for managing > > the > > Wayback index. > > Currently, our collection is about 2.2 billion items. It is also > growing at a rate of approximately 350-400 million records per > > year. > > The obvious approach would be to use a sorted CDX file (or > > files) > as > > the index. I'm, however, concerned about its performance at this > scale. Additionally, updating a CDX based index can be > > troublesome. > > Especially as we would like to update it continuously as new > > material > > is ingested. > > Any relevant experience and advice you could share on this topic > would be greatly appreciated. > > Hi Kristinn, > > We use 4 different CDX files. One is updated every ten minutes, > > one > > hourly, one daily, and one monthly. We use the unix sort command > > to > > sort. This has worked pretty well for us. We aren?t doing it in > > the > > most efficient manner, and we will probably switch to sorting > > with > > hadoop at some point, but it works pretty well. 
> best, Erik

Hi Kristinn,

Our strategy for building cdx indexes is described at
https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackConfiguration-AggregatorApplication

Essentially we have multiple threads creating unsorted cdx files for all
new arc/warc files in the archive. These are then sorted and merged into
an intermediate index file. When the intermediate file grows larger than
100MB, it is merged with the current main index file, and when that
grows larger than 50GB we roll over to a new main index file. We
currently have about 5TB total cdx index. This includes 16 older cdx
files of size 150GB-300GB, built by hand-rolled scripts before we had a
functional automatic indexing workflow.

We would be fascinated to hear if anyone is using an entirely different
strategy (e.g. bdb) for a large archive.

One of our big issues at the moment is QA of our cdx files. How can we
be sure that our indexes actually cover all the files and records in
the archive?

Colin Rosenthal
IT-Developer
Netarkivet, Denmark

------------------------------
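[Editor's note: the sort/merge/rollover workflow Colin describes can be sketched with plain unix sort, as Erik also suggests. The following is a minimal, illustrative sketch only: the file names, the sample CDX lines, and the exact thresholds are assumptions, not Netarkivet's actual scripts. The key points are `LC_ALL=C` (byte-wise ordering, so the sorted order matches Wayback's binary search) and `sort -m`, which merges already-sorted inputs without re-sorting.]

```shell
#!/bin/sh
# Sketch of the sort / merge / rollover workflow described above.
# All file names and the sample CDX lines are illustrative assumptions.
set -e
export LC_ALL=C   # byte-wise ordering; CDX lookups assume this collation

# Stand-in for a batch of unsorted CDX output from newly indexed (w)arcs.
printf 'org,example)/b 20130601000000\norg,example)/a 20130602000000\n' > new.cdx
: > intermediate.cdx    # already-sorted intermediate index
: > main.cdx            # already-sorted main index

# 1. Sort the fresh batch.
sort -u new.cdx -o new.sorted.cdx

# 2. Merge (-m) it into the intermediate index; both inputs are sorted,
#    so this is a linear merge rather than a full re-sort.
sort -m -u intermediate.cdx new.sorted.cdx -o intermediate.next.cdx
mv intermediate.next.cdx intermediate.cdx

# 3. Past ~100MB, fold the intermediate index into the main index.
if [ "$(wc -c < intermediate.cdx)" -gt $((100 * 1024 * 1024)) ]; then
    sort -m -u main.cdx intermediate.cdx -o main.next.cdx
    mv main.next.cdx main.cdx
    : > intermediate.cdx
fi

# 4. Past ~50GB, roll the main index over and start a new one.
if [ "$(wc -c < main.cdx)" -gt $((50 * 1024 * 1024 * 1024)) ]; then
    mv main.cdx "main-$(date +%Y%m%d).cdx"
    : > main.cdx
fi
```

Wayback can then be pointed at the intermediate and main index files together, so newly ingested material becomes visible without rewriting the large main index on every batch.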
_______________________________________________
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss

End of Archive-access-discuss Digest, Vol 78, Issue 2
*****************************************************
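[Editor's note: on the QA question raised above (how to be sure the index covers every file and record in the archive), one common approach is per-file reconciliation: count the response records in each (w)arc and compare against the number of CDX lines citing that file. The sketch below is purely illustrative and uses plain-text stand-ins for the archive file and the index; real WARCs are gzipped per record and the filename field position depends on the CDX format line in use.]

```shell
#!/bin/sh
# Hypothetical per-file coverage check. The stand-in files, the record
# marker, and the CDX layout are all illustrative assumptions.
set -e
export LC_ALL=C

# Stand-ins: a "warc" with two response records, and a CDX citing it once.
printf 'WARC-Type: response\nWARC-Type: response\n' > sample.warc
printf 'org,example)/a 20130602000000 ... sample.warc\n' > sample.cdx

# Records in the archive file vs. CDX lines that reference it.
records=$(grep -c '^WARC-Type: response' sample.warc)
indexed=$(grep -c 'sample\.warc' sample.cdx)

if [ "$records" -ne "$indexed" ]; then
    echo "MISMATCH: sample.warc has $records responses but $indexed CDX lines"
fi
```

Run over the whole archive, a report like this flags exactly which files need re-indexing, rather than only revealing gaps when a replay request 404s.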