From: Colin R. <cs...@st...> - 2011-03-01 14:29:56
On 2011-03-01 13:17, Bradley Tofel wrote:
> I'm not aware of any way to record fetched-but-not-stored metadata in
> ARC files, only in WARC files.
>
> Is the information about the second(,third,etc)
> downloaded-but-not-stored being recorded only in the crawl logs? Forging
> CDX records from information in crawl logs may be possible, but as far
> as I know has never been attempted.

Actually NetarchiveSuite has a utility to do just that:
http://netarchive.dk/suite/Additional%20Tools%20Manual%203.14#Additional_Tools_Manual_3.14.2BAC8-Tools_in_Wayback_Module.dk.netarkivet.wayback.DeduplicateToCDXApplication

cheers,
Colin Rosenthal
IT-Developer
State and University Library, Aarhus

> We use WARC files with content-digest duplicate reduction (as opposed to
> sending if-modified/if-none-match headers, which has only been used and
> replayed via Wayback experimentally.)
>
> Brad
>
> On 3/1/11 4:43 PM, Natalia Torres wrote:
>> Hi Brad,
>>
>> thanks a lot for your advice. I added the "dedupeRecords" property to
>> the LocalResourceIndex Bean in CDXCollection.xml and restart tomcat, but
>> I can't view correctly the crawls as before: viewing the first crawl
>> everything is correct and viewing the second version the images/css/pdf
>> (only crawled at the first time) aren't displayed...
>>
>> We are using arc files, the behavior is the same that using warc or we
>> need to change to warc?
>>
>> Here is the CDXCollection.xml file:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <beans xmlns="http://www.springframework.org/schema/beans"
>>        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>        xsi:schemaLocation="http://www.springframework.org/schema/beans
>>        http://www.springframework.org/schema/beans/spring-beans-2.5.xsd"
>>        default-init-method="init">
>>
>>   <bean id="localcdxcollection"
>>         class="org.archive.wayback.webapp.WaybackCollection">
>>     <property name="resourceStore">
>>       <bean class="org.archive.wayback.resourcestore.LocationDBResourceStore">
>>         <property name="db">
>>           <bean
>>             class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
>>             <property name="path" value="${wayback.basedir}/path-index.txt" />
>>           </bean>
>>         </property>
>>       </bean>
>>     </property>
>>
>>     <property name="resourceIndex">
>>       <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
>>         <property name="canonicalizer" ref="waybackCanonicalizer" />
>>         <property name="source">
>>
>>           <!--
>>             A single CDX SearchResultSource example.
>>           -->
>>           <bean class="org.archive.wayback.resourceindex.cdx.CDXIndex">
>>             <property name="path" value="${wayback.basedir}/dedup2011.cdx" />
>>           </bean>
>>
>>         </property>
>>         <property name="maxRecords" value="10000" />
>>         <property name="dedupeRecords" value="true" />
>>       </bean>
>>     </property>
>>   </bean>
>>
>> </beans>
>>
>> thanks,
>>
>> natalia
>>
>>
>> ------------------------------------------------------------------------------
>> Free Software Download: Index, Search& Analyze Logs and other IT data in
>> Real-Time with Splunk. Collect, index and harness all the fast moving IT data
>> generated by your applications, servers and devices whether physical, virtual
>> or in the cloud. Deliver compliance at lower cost and gain new business
>> insights. http://p.sf.net/sfu/splunk-dev2dev
>> _______________________________________________
>> Archive-access-discuss mailing list
>> Arc...@li...
>> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
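
A rough sketch of the idea discussed in this thread, for illustration only: a duplicate capture can be replayed only if the index holds some record for it that can be pointed back at the earlier stored copy. This is not Wayback's or NetarchiveSuite's actual code. The `Capture` type, the `resolve_duplicates` function, and the use of "-" as the filename of a fetched-but-not-stored capture are all assumptions made for this example.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Capture:
    """Hypothetical, simplified index record (not the real CDX layout)."""
    url: str
    timestamp: str  # 14-digit capture time, e.g. "20110301120000"
    digest: str     # content digest of the payload
    filename: str   # assumption: "-" marks a fetched-but-not-stored duplicate
    offset: int


def resolve_duplicates(captures: Iterable[Capture]) -> List[Capture]:
    """Point each duplicate at the most recent stored copy with the same
    URL and digest. Duplicates with no earlier stored copy stay unresolved."""
    last_stored: Dict[Tuple[str, str], Capture] = {}
    resolved: List[Capture] = []
    # Process in (url, timestamp) order so "earlier" captures come first.
    for cap in sorted(captures, key=lambda c: (c.url, c.timestamp)):
        key = (cap.url, cap.digest)
        if cap.filename == "-":
            original = last_stored.get(key)
            if original is not None:
                # Reuse the location of the earlier stored payload.
                cap = replace(cap, filename=original.filename,
                              offset=original.offset)
        else:
            last_stored[key] = cap
        resolved.append(cap)
    return resolved


captures = [
    Capture("http://example.org/a.css", "20110201120000", "sha1:AAA",
            "crawl1.warc.gz", 1234),
    Capture("http://example.org/a.css", "20110301120000", "sha1:AAA", "-", 0),
    Capture("http://example.org/b.png", "20110301120000", "sha1:BBB", "-", 0),
]
resolved = resolve_duplicates(captures)
```

If the second crawl leaves no record at all for the unchanged images, CSS and PDFs, as Brad describes for ARC files, there is nothing in the index to resolve. Records for those captures would then have to come from somewhere else, such as the crawl logs.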