From: Bradley T. <br...@ar...> - 2011-03-01 12:11:17
I'm not aware of any way to record fetched-but-not-stored metadata in ARC files, only in WARC files. Is the information about the second (third, etc.) downloaded-but-not-stored captures being recorded only in the crawl logs?

Forging CDX records from information in crawl logs may be possible, but as far as I know it has never been attempted.

We use WARC files with content-digest duplicate reduction (as opposed to sending If-Modified-Since/If-None-Match headers, which has only been used and replayed via Wayback experimentally).

Brad

On 3/1/11 4:43 PM, Natalia Torres wrote:
> Hi Brad,
>
> thanks a lot for your advice. I added the "dedupeRecords" property to
> the LocalResourceIndex Bean in CDXCollection.xml and restart tomcat, but
> I can't view correctly the crawls as before: viewing the first crawl
> everything is correct and viewing the second version the images/css/pdf
> (only crawled at the first time) aren't displayed...
>
> We are using arc files, the behavior is the same that using warc or we
> need to change to warc?
>
> Here is the CDXCollection.xml file:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <beans xmlns="http://www.springframework.org/schema/beans"
>        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>        xsi:schemaLocation="http://www.springframework.org/schema/beans
>            http://www.springframework.org/schema/beans/spring-beans-2.5.xsd"
>        default-init-method="init">
>
>   <bean id="localcdxcollection"
>         class="org.archive.wayback.webapp.WaybackCollection">
>     <property name="resourceStore">
>       <bean class="org.archive.wayback.resourcestore.LocationDBResourceStore">
>         <property name="db">
>           <bean class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
>             <property name="path" value="${wayback.basedir}/path-index.txt" />
>           </bean>
>         </property>
>       </bean>
>     </property>
>
>     <property name="resourceIndex">
>       <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
>         <property name="canonicalizer" ref="waybackCanonicalizer" />
>         <property name="source">
>
>           <!--
>             A single CDX SearchResultSource example.
>           -->
>           <bean class="org.archive.wayback.resourceindex.cdx.CDXIndex">
>             <property name="path" value="${wayback.basedir}/dedup2011.cdx" />
>           </bean>
>
>         </property>
>         <property name="maxRecords" value="10000" />
>         <property name="dedupeRecords" value="true" />
>       </bean>
>     </property>
>   </bean>
>
> </beans>
>
> thanks,
>
> natalia
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
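
For anyone reading this thread later, the content-digest duplicate reduction mentioned in the reply can be sketched roughly as follows. This is only an illustration under stated assumptions, not the crawler's or Wayback's actual code: `payload_digest`, `DigestDeduper`, and the dictionary fields are hypothetical names, and the "sha1:" plus Base32 digest label is assumed here for readability. The idea is that the first capture of a given URL and payload is stored in full, while later identical captures are recorded only as a small "revisit" entry that points back to the stored one.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a 'sha1:<base32>' label for the payload bytes (assumed format)."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


class DigestDeduper:
    """Hypothetical helper: remembers the first capture of each (URL, digest)."""

    def __init__(self):
        # Maps (url, digest) -> timestamp of the capture that was stored in full.
        self._first_seen = {}

    def classify(self, url: str, timestamp: str, payload: bytes) -> dict:
        """Decide whether a capture is stored in full or noted as a revisit."""
        digest = payload_digest(payload)
        original = self._first_seen.get((url, digest))
        record = {"url": url, "timestamp": timestamp, "digest": digest}
        if original is None:
            # New content for this URL: keep the full response.
            self._first_seen[(url, digest)] = timestamp
            record["type"] = "response"
        else:
            # Same content as before: keep only a pointer to the stored copy.
            record["type"] = "revisit"
            record["refers_to"] = original
        return record


deduper = DigestDeduper()
first = deduper.classify("http://example.org/logo.png", "20110201000000", b"image-bytes")
second = deduper.classify("http://example.org/logo.png", "20110301000000", b"image-bytes")
print(first["type"], second["type"], second["refers_to"])
```

Replay of the second crawl then depends on the index being able to resolve the revisit entry back to the earlier stored capture, which is the metadata an ARC file has no place to hold.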