From: Erik H. <eri...@uc...> - 2011-04-05 17:05:06
At Tue, 05 Apr 2011 08:19:59 -0700,
Gary Wesley wrote:
>
> I have the new 1.6.0 and want to use only CDX
> and no BDB for
> my indexing, since I have a lot of files.
> I put a small number of files in my
> wayback.basedir=/lfs/1/tmp/wayback
> and started Tomcat.
> I commented out the indexqueueupdater,
> to prevent BDB from indexing the files.
> I see the files in file-db/incoming
> and file-db/state/filesk.
>
> 1) How do I get them to appear where CDX
> can use them?
>
> You sent me a script:
> find /lfs/1/tmp/wayback/index-data/{incoming,merged} -type f -name
> "*.arc.gz" | xargs cat | /lfs/1/tmp/wayback/bin/url-client | sort -u -S
> 50% -T /lfs/1/tmp/wayback/sort-tmp > /lfs/1/tmp/wayback/cdx/Katrina.cdx
> but I don't see any files in those directories.
> (because it was for when I had already partially
> indexed with BDB, in my previous attempt?).
>
> 2) How do I update my CDX when I add files?

Hi Gary,

FYI, attached is an almost complete wayback config for a CDX based
system.

Re. your 2nd question, we work in the following manner. For every ARC
file we have a corresponding CDX file on hand. We maintain 4 sorted CDX
files for wayback’s use:

1) One is regenerated every month from all the ARC files (this takes a
   long time, though the sort command is pretty efficient).
2) One is generated once a day from every ARC file that is *not* in the
   monthly CDX file.
3) One is generated every hour from every ARC file that is in neither
   the monthly nor the daily CDX file.
4) One is generated every 10 minutes from every ARC file that is not in
   the monthly, daily, or hourly CDX file.

I hope that helps.

best,
Erik
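
The tiered scheme described above could be sketched roughly as follows. This is only an illustration of the selection logic, not Wayback code or the actual scripts in use: the tier names, the `arcs_for_tier` and `merge_cdx_lines` helpers, and the idea of tracking which ARC files each CDX tier currently covers in a dictionary are all assumptions made for the example. It also assumes that the per-ARC CDX lines are plain ASCII, so that Python's default string ordering matches a byte-order `sort -u`.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical tier names, ordered from the slowest rebuild to the fastest.
TIERS = ["monthly", "daily", "hourly", "ten_minute"]


def arcs_for_tier(tier: str,
                  all_arcs: Iterable[str],
                  covered: Dict[str, Set[str]]) -> List[str]:
    """Return the ARC files that should feed `tier` when it is rebuilt.

    `covered` maps a tier name to the set of ARC files already contained
    in that tier's current CDX file. A tier takes every ARC file that is
    not covered by any slower tier, so the monthly tier takes everything.
    """
    remaining = set(all_arcs)
    for slower in TIERS[:TIERS.index(tier)]:
        remaining -= covered.get(slower, set())
    return sorted(remaining)


def merge_cdx_lines(per_arc_cdx: Dict[str, List[str]],
                    arcs: Iterable[str]) -> List[str]:
    """Merge the per-ARC CDX lines for `arcs` into one sorted, de-duplicated list."""
    lines: Set[str] = set()
    for arc in arcs:
        lines.update(per_arc_cdx.get(arc, []))
    return sorted(lines)


if __name__ == "__main__":
    # Made-up file names, purely for illustration.
    all_arcs = ["a.arc.gz", "b.arc.gz", "c.arc.gz", "d.arc.gz"]
    covered = {
        "monthly": {"a.arc.gz"},
        "daily": {"b.arc.gz"},
        "hourly": {"c.arc.gz"},
    }
    for tier in TIERS:
        print(tier, arcs_for_tier(tier, all_arcs, covered))
```

The point of the layering is that the expensive full sort only runs monthly, while each faster tier only has to sort the CDX lines of the ARC files that the slower tiers have not picked up yet.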