From: Neil G. <n.g...@bu...> - 2011-04-12 15:36:49
|
Hey, I can see ".ver1.arc" files created in "C:\tmp\wayback\arcs" now that I've enabled Wayback as the quality-review tool, but when I actually click "Submit to archive", these files are deleted and nothing takes their place. Shouldn't there be an ".arc" file without the "ver1", representing the submitted archive?

wct-das.properties:

    # Location of the folder Wayback is watching for auto indexing
    waybackIndexer.waybackInputFolder=/tmp/wayback/arcs
    # Location of the folder where Wayback places merged indexes
    waybackIndexer.waybackMergedFolder=/tmp/wayback/index-data/merged
    # Location of the folder where Wayback places failed indexes
    waybackIndexer.waybackFailedFolder=/tmp/wayback/index-data/failed

Also, I'm just starting out with this - should I be using the CDX file WCT creates in Wayback or not?

Neil Gibbons
Technical Consultant, Building Blocks (UK) Ltd
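A quick way to narrow down where the submission goes wrong is to watch the folders named in wct-das.properties while re-submitting: if the ARC never reaches the input folder the problem is on the WCT side, and if it appears but ends up under the failed folder the Wayback indexer is rejecting it. This is only a diagnostic sketch, not WCT documentation; the paths are the defaults quoted above and will differ on a Windows install (C:\tmp\wayback\...).

    $ ls -l /tmp/wayback/arcs                  # indexer input folder (waybackIndexer.waybackInputFolder)
    $ ls -l /tmp/wayback/index-data/merged     # indexes Wayback merged successfully
    $ ls -l /tmp/wayback/index-data/failed     # indexes Wayback rejected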
From: Neil G. <n.g...@bu...> - 2011-04-08 15:13:06
|
Hi, I'm having the exact same problem as Lewis describes: http://sourceforge.net/mailarchive/message.php?msg_id=27071510

My understanding is that in a default installation the access point name is set to "8080:wayback", so it should be accessible via http://localhost.archive.org:8080/wayback/wayback - hence the WCT Core has its quality-review tool property set to "qualityReviewToolController.archiveUrl=http://localhost:8080/wayback/wayback/*/" by default. However, as Lewis describes, with the default setting the "staticPrefix" is always "/".

I've followed the advice on the lists and set the access point name to simply "8080" to get a working Wayback, but I'd love to know why the defaults don't work - especially when they seem to fit the documentation. Is this a bug?

Kind regards,
Neil Gibbons
Technical Consultant, Building Blocks (UK) Ltd
From: Erik H. <eri...@uc...> - 2011-04-05 17:05:06
|
At Tue, 05 Apr 2011 08:19:59 -0700, Gary Wesley wrote:
> I have the new 1.6.0 and want to use only CDX and no BDB for my indexing, since I have a lot of files. I put a small number of files in my wayback.basedir=/lfs/1/tmp/wayback and started Tomcat. I commented out the indexqueueupdater, to prevent BDB from indexing the files. I see the files in file-db/incoming and file-db/state/filesk.
>
> 1) How do I get them to appear where CDX can use them?
>
> You sent me a script:
>   find /lfs/1/tmp/wayback/index-data/{incoming,merged} -type f -name "*.arc.gz" | xargs cat | /lfs/1/tmp/wayback/bin/url-client | sort -u -S 50% -T /lfs/1/tmp/wayback/sort-tmp > /lfs/1/tmp/wayback/cdx/Katrina.cdx
> but I don't see any files in those directories. (Because it was for when I had already partially indexed with BDB, in my previous attempt?)
>
> 2) How do I update my CDX when I add files?

Hi Gary,

FYI, attached is an almost complete wayback config for a CDX based system.

Re. your 2nd question, we work in the following manner. For every ARC file we have a corresponding CDX file on hand. We maintain 4 sorted CDX files for wayback's use:
- One is regenerated every month from all the ARC files (this takes a long time, though the sort command is pretty efficient).
- One is generated once a day from every ARC file that is *not* in the monthly CDX file.
- One is generated every hour from every ARC file that is neither in the monthly nor the daily CDX file.
- And one is generated every 10 minutes from everything that is not in the monthly, daily, or hourly CDX file.

I hope that helps.

best,
Erik
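Erik's tiered scheme maps naturally onto a couple of cron-driven jobs. The sketch below is not the config he attached: it shows only the monthly and daily tiers (the hourly and 10-minute tiers follow the same pattern), the list-file bookkeeping and the ARCS path are my own assumptions, and the url-client pipeline is the one quoted in this thread. Adjust paths to your layout.

    #!/bin/sh
    # Sketch of the monthly/daily CDX tiers Erik describes.
    # Each tier writes a .list of the ARCs it covers, so the next tier down
    # only indexes ARCs not yet covered by a higher tier.
    BASE=/lfs/1/tmp/wayback
    ARCS=$BASE/arcs            # wherever your ARC files actually live (hypothetical)

    make_cdx () {              # $1 = list of ARC files, $2 = output CDX
        xargs cat < "$1" \
            | $BASE/bin/url-client \
            | sort -u -S 50% -T $BASE/sort-tmp > "$2"
    }

    monthly () {               # e.g. from cron on the 1st of each month
        find $ARCS -name '*.arc.gz' | sort > $BASE/cdx/monthly.list
        make_cdx $BASE/cdx/monthly.list $BASE/cdx/monthly.cdx
    }

    daily () {                 # e.g. from cron once a day: ARCs not in the monthly tier
        find $ARCS -name '*.arc.gz' | sort > $BASE/cdx/all.list
        comm -13 $BASE/cdx/monthly.list $BASE/cdx/all.list > $BASE/cdx/daily.list
        make_cdx $BASE/cdx/daily.list $BASE/cdx/daily.cdx
    }

    "$@"                       # cdx-tiers.sh monthly | daily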
From: Gary W. <je...@cs...> - 2011-04-05 15:35:30
|
I have the new 1.6.0 and want to use only CDX and no BDB for my indexing, since I have a lot of files. I put a small number of files in my wayback.basedir=/lfs/1/tmp/wayback and started Tomcat. I commented out the indexqueueupdater, to prevent BDB from indexing the files. I see the files in file-db/incoming and file-db/state/filesk.

1) How do I get them to appear where CDX can use them?

You sent me a script:

    find /lfs/1/tmp/wayback/index-data/{incoming,merged} -type f -name "*.arc.gz" | xargs cat | /lfs/1/tmp/wayback/bin/url-client | sort -u -S 50% -T /lfs/1/tmp/wayback/sort-tmp > /lfs/1/tmp/wayback/cdx/Katrina.cdx

but I don't see any files in those directories. (Because it was for when I had already partially indexed with BDB, in my previous attempt?)

2) How do I update my CDX when I add files?

Gary Wesley
--
A witty saying proves nothing. -- Voltaire
From: Aaron B. <aa...@ar...> - 2011-03-17 19:37:38
|
Gerard Suades i Méndez <gs...@ce...> writes: > NutchWAX 0.13 official release. Good :) In that case, I can say that you do *not* need the crawl_parse crawl_data content sub-directories of the segments. You can safely delete them to save on disk space. For example: $ rm -rfv segments/*/c* Also, the NutchWAX 0.13 official release *does* have the content of the documents stored in the index (in compressed form). This means that the indexes are 100% self-contained and you do not need the segments for the live search service. However, NutchWAX 0.13 official release does *not* perform de-duplication during indexing. That feature was added to a branch I created from NW 0.13 but has not been officially released yet. The branch is http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive Aaron |
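Before deleting, it can be reassuring to see how much space the unused sub-directories actually occupy. The following is just a slightly more explicit variant of Aaron's command; per his notes in this thread, only parse_text and parse_data are needed for indexing:

    $ du -sch segments/*/content segments/*/crawl_*    # space used by the unneeded sub-dirs
    $ rm -rfv segments/*/content segments/*/crawl_*    # same effect as "rm -rfv segments/*/c*"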
From: Gerard S. i M. <gs...@ce...> - 2011-03-17 10:45:46
|
NutchWAX 0.13 official release. Aaron Binns escribió: > Gerard, > > I looked through the SVN logs for NutchWAX and it looks like most of the > interesting features I described were done on a branch of NutchWAX 0.13 > > http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive > > Which version of NutchWAX have you been using? > > > > Aaron > -- Gerard ...................................................................... __ / / Gerard Suades Méndez C E / S / C A Departament d'Aplicacions i Projectes /_/ Centre de Supercomputació de Catalunya Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona T. 93 551 62 20 · F. 93 205 6979 · gs...@ce... ...................................................................... |
From: Aaron B. <aa...@ar...> - 2011-03-15 23:34:17
|
Gerard, I looked through the SVN logs for NutchWAX and it looks like most of the interesting features I described were done on a branch of NutchWAX 0.13 http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive Which version of NutchWAX have you been using? Aaron |
From: Aaron B. <aa...@ar...> - 2011-03-15 21:58:15
|
Gerard Suades i Méndez <gs...@ce...> writes: > If segments needs to be kept in order to update the indexes with new > crawls then we need to bear in mind that indexes+segments size > represents somewhere around 50% of the all ARCs, specially in terms of > scalability. Are these numbers usual? That seems large to me. In your segments, which of the following sub-directories appear? <segment>/crawl_data crawl_? parse_text parse_data content I'll have to check the SVN revision in NutchWAX where I removed the dependencies on the "crawl_*" and "content" sub-dirs. Those sub-dirs are left-overs from Nutch and we don't use them at all for NutchWAX. If you see those sub-dirs, re-calculate the sizes w/o them for a more realistic measurement. Once I can confirm the SVN revision where I removed their use in NutchWAX (never used by JBs), then you can delete them as they aren't used at all. In earlier versions of NutchWAX, there was code to check that the sub-dirs were there, even though they weren't used. For the JBs and more recent versions of NutchWAX, only parse_text parse_data are used. In my indexing projects, after the (w)arcs are all imported, I just do a rm -rf segments/*/c* to remove the sub-dirs starting with 'c'. > We tried both approaches for the entire ARC collection: > > a) IndexMerger Lucene API (inside TNH). > index size: 813GB > > b) Re-built the entire index giving as input both old and new NutchWAX > segments of the ARC files. > index size: 563GB > > is it normal that there is this difference of sizes in the indexes? It's quite possible. If there was a lot of duplicate captures, then you could see such a large reduction in size. Method A would preserve the duplicates whereas method B de-duplicates. In later versions of NutchWAX and the JBs, de-duplication happens automatically, but only *within a single indexing job*. If you index all the segments in one job, then they will be de-duplicated. If you index subsets of segments, creating multiple indexes, then there could be duplicates across the indexes. NutchWAX segments and their sub-directories are rather simple data structures actually. They are in a compressed binary (Hadoop) format, so you can't simply 'cat' them, but they are in essence: [<unique-id>, <set of key/value properties], ... Each record has a unique key, for which we use "<url> <hash>". Then the record is simply key/value pairs of properties. In the parse_data sub-dir, we have records following the form ['http://example.com/ 123456...', ["title" => "My webpage", "date" => "20101202092343", "type" => "text/hmtl", ....] ] ['http://example.com/contact.html 3452...', ["title" => "Contact us", "date" => "20101202092355", "type" => "text/hmtl", ....] ] And in the pase_text, we have ['http://example.com/ 123456...', ["body" => "Here is the body of the webpage."] ] ... When indexing, with either NutchWAX or the JBs, the sub-dirs are opened up and Hadoop combines the records together from the sub-dirs, matching according to unique-id. In NutchWAX and JBs, we also detect multiple merged records with the same unique-id and then perform our own merging by retaining only 1 key/value pair for properties such as "title" and combine values for the "date" property so that we have *all* the capture dates for the unique version of a URL. For example, imagine that we had two records ['http://example.com/ 123456...', ["title" => "My webpage", "date" => "20090304101509", "type" => "text/hmtl", ....] 
] ['http://example.com/ 123456...', ["title" => "My webpage", "date" => "20101202092343", "type" => "text/hmtl", ....] ] during the indexing process, these would be combined into a single record with two capture dates ['http://example.com/ 123456...', ["title" => "My webpage", "date" => ["20090304101509", "20101202092343"], "type" => "text/hmtl", ....] ] but for all the other properties, we only have one value. It doesn't make sense to have the title or mime-type twice. This is the core of the de-duplication process during indexing. But this de-duplication process is done by the Java code in the NutchWAX Indexer and the JBs Indexer. Lucene doesn't know anything about it. > 3.- We have only one collection for all the ARC files. We have our > collection on open access and the service is load balanced through > several nodes. That's the scenario in where several tomcats are > accessing the same indexes. Does that mean that each node has a local copy of the index? Or perhaps the index is on an NFS share or SAN mounted on each node? Lastly, the indexing process for the JBs is pretty much the same as for NutchWAX. The command-lines are similar, but for the JBs, you have to use the Hadoop command-line driver, whereas NutchWAX comes with its own. E.g. $ nutchwax index indexes segments/* vs. $ hadoop jar jbs-*.jar Indexer indexes segments/* The version of Hadoop that we use is the Cloudera distribution, which is based on Hadoop 0.20.2 with some Cloudera patches to fix bugs. I believe you can use Hadoop 0.20.1 or 0.20.2 w/o any problems. The JBs also does a better job of filtering out obvious crap, such as "words" which are do not contain any letters, such as "34983545$%23432" is filtered out when indexing with JBs. It also canonicalizes the mime-types so that all the dozens of different known variaties of MS Office mime-types are all mapped to one standard set. It also omits 'robots.txt' files and ignores mime-types that probably don't have text in them, such as "application/octet-stream". I'd recommend giving JBs a try, at least to test and compare to the index built with NutchWAX; especially since JBs does the accented character collapsing. Aaron -- Aaron Binns Senior Software Engineer, Web Group, Internet Archive Program Officer, IIPC aa...@ar... |
From: Bradley T. <br...@ar...> - 2011-03-14 10:12:56
|
Hi Hamid, This is indeed an area that seems to require more research, and my understanding agrees with the document you referred: there are two camps, which I refer to as the "forward-convert" camp, and the "emulation-will-save-us" camp. The paper looks to describe some limited scale research into the forward-convert approach. Wayback does not currently ship with any code to support forward converted formats, but my feeling is that adding this sort of functionality would be pretty straightforward, if not trivial. This all requires further discussion and analysis, but a technical Straw-Man to implementing this functionality within Wayback might look like: 1) alter standard CDX format to include 2 extra fields: WARC-ID, and WARC-Refers-To The former would be present for all records. The WARC-Refers-To record would be present for all transformed records. As a side note, it seems that some indication of the type of conversion performed, as well as the specific version and configuration of the software used might be useful in the conversion records. If this information was included in the conversion WARC records, it could be included in a 3rd (and possibly 4th) field, and then consulted at query time by Wayback, to choose the "best" conversion available if newer techniques or software surfaced. Lastly, subsequent index steps could be simplified (could save an extra "sort" operation) if the capture date of the original record were to be included in the conversion WARC record. 2) modify Wayback indexing code to include WARC-ID, and WARC-Refers-To data into the CaptureSearchResults (as well as the aforementioned conversion data, and original capture date, if available) 3) create a Wayback CaptureSearchResult filter, which reads both original and conversion records from the index, and produces a new set of results which prefer the converted records, if available. Logic for the preference of converted records seems likely to change over time, so making this somewhat flexible might be a desirable design goal. I mentioned in #1, that including the original records capture date might simplify later steps, specifically step 3. If the converted records included the original capture date, they would sort along side the original records, and the filter could simply omit the original records, if a converted record was present. Likely the filter would also annotate the converted record with information for the user about the original format, the conversion, etc. If the original records capture date is not included, then this filter would have to: * buffer all the matching records into a data structure * match up converted records to their originals by WARC-ID and WARC-Refers-To * annotate the converted records with info about the original, and the conversion process used * discard original records * re-sort the resulting search results for the rest of the Wayback system, which (currently) expects search results will be returned in data ascending order. Unquestionable there will need to be substantial QA effort to determine the viability of the solution, but the experience gained now will certainly be valuable. Another tactic for solving the antiquated format issue within Wayback, would be to implement specialized ReplayRenderers for various formats, and experiment with converting those formats on-the-fly, at Replay Time. Possibly compute time will get(remain?) cheap enough that this solution could be tractable in the long-term. Looking forward to comments or questions on the topic. 
Are other institutions interested in this in the near term? Is there other ongoing IIPC research in this arena that would be a better venue for the discussion? Brad On 3/14/11 3:31 PM, Hamid Rofoogaran wrote: > Hi Brad, > This is the link to the document i mentioned > http://publik.tuwien.ac.at/files/PubDat_181115.pdf > By "migrated content" i mean for example that within your web > archive (WARC files) there are a number of MS Word and TIFF > objects. Your organisation decides that all the MS Word objects shall > be converted to PDF/A and all the TIFF images will be converted to png > format. The "new" WARC has now a migrated content. > Talking about this document, there are two issues in the "summary an > outlook" which i wonder if there has been any progress since 2009 namely: > 1- "..... but further experiments with larger data sets are required > to evaluate the scalability of this approach." > > 2- "The support of access engines ((WayBack) , my comment) for > migrated records and extracted > > metadata needs to be further analysed > > Best > Hamid > ----------------------------------------------------- > Hamid Rofoogaran > LDP Centre > Tel: +46 921 57308 > Mobile: +46 76 81 57308 > ham...@ld... > ham...@lt... > www.ldb-centrum.se > ----------------------------------------------------- > ------------------------------------------------------------------------ > *Från:* Bradley Tofel [br...@ar...] > *Skickat:* den 11 mars 2011 kl 4:47 > *Till:* Hamid Rofoogaran > *Kopia:* arc...@li... > *Ämne:* Re: [Archive-access-discuss] Migration & WBM > > Hi Hamid, > > Can you elaborate on what you mean by "migrated"? > > Do you have any links to the report you mentioned? > > One of the design goals of the WARC format is to allow content which > was recorded in other formats, for example, as millions of files on a > "standard filesystem" to be encapsulated in more manageable WARC > files. Is this the kind of "migration" to which you're referring? > > If so, Wayback has not currently be used in this application, but it's > design has considered this as a future goal. > > Wayback attempts to be a framework for: > 1) creating indexes of large amounts of semi-structured data > 2) providing search of those indexes, both to query what content is > available, and for retrieving pointers to specific resources captured > 3) returning specific captured resources, in many cases altering the > resources to provide contextual metadata, or to enhance viewing of > those resources by clients. > > Currently, the modules that have been developed within this framework > primarily index HTTP content within W/ARC files, provide search of > those indexes by URL, and alter returned resources, namely HTML, CSS, > and Javascript, to assist replay within a web browser. > > So, depending on what you mean by "migrated" Wayback may be a good > starting point to provide access to large bodies of content stored in > W/ARC format. I'd be happy to provide suggestions, assistance, and as > time permits, code to help with your Wayback extensions. > > Looking forward to hearing back about your specific needs! > > Brad > > On 3/10/11 8:16 PM, Hamid Rofoogaran wrote: >> Hi everybody, >> Is waybackmachine able to access (and present) WARC files where the >> content have been migrated ? Is there any developement ongoing >> regarding this matter ? Any documents, papers, reports to read about it ? >> >> I will be very gratefull for any kind of information about "migrating >> of WARC content AND Waybackmachine" . 
>> The only report i have found is from Vienna University of Technology >> written by Andreas Rauber , ...(2009) >> >> Regards >> Hamid >> >> >> ------------------------------------------------------------------------------ >> Colocation vs. Managed Hosting >> A question and answer guide to determining the best fit >> for your organization - today and in the future. >> http://p.sf.net/sfu/internap-sfd2d >> >> >> _______________________________________________ >> Archive-access-discuss mailing list >> Arc...@li... >> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
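As a very rough offline illustration of step 3 of Brad's straw-man (this is not Wayback code), the "prefer the converted record" logic can be sketched over a hypothetical extended CDX in which the conversion record carries the capture date of the original record and the last field is WARC-Refers-To ("-" on ordinary capture records):

    #!/bin/sh
    # Offline sketch of the "prefer converted records" filter from the straw-man above.
    # HYPOTHETICAL extended CDX layout assumed here:
    #   $1  = canonicalized URL key
    #   $2  = capture date of the original record (also copied onto conversion records)
    #   $NF = WARC-Refers-To ("-" for ordinary captures, set on conversion records)
    # Input must be sorted by URL key + date, as standard CDX files are.
    awk '
      function flush() {
        if (conv != "") print conv; else if (orig != "") print orig
        orig = ""; conv = ""
      }
      {
        key = $1 " " $2
        if (key != prev) { flush(); prev = key }
        if ($NF == "-") orig = $0; else conv = $0
      }
      END { flush() }
    ' extended.cdx > filtered.cdx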
From: Hamid R. <ham...@lt...> - 2011-03-14 08:32:01
|
Hi Brad, This is the link to the document i mentioned http://publik.tuwien.ac.at/files/PubDat_181115.pdf By "migrated content" i mean for example that within your web archive (WARC files) there are a number of MS Word and TIFF objects. Your organisation decides that all the MS Word objects shall be converted to PDF/A and all the TIFF images will be converted to png format. The "new" WARC has now a migrated content. Talking about this document, there are two issues in the "summary an outlook" which i wonder if there has been any progress since 2009 namely: 1- "..... but further experiments with larger data sets are required to evaluate the scalability of this approach." 2- "The support of access engines ((WayBack) , my comment) for migrated records and extracted metadata needs to be further analysed Best Hamid ----------------------------------------------------- Hamid Rofoogaran LDP Centre Tel: +46 921 57308 Mobile: +46 76 81 57308 ham...@ld... ham...@lt... www.ldb-centrum.se ----------------------------------------------------- ________________________________ Från: Bradley Tofel [br...@ar...] Skickat: den 11 mars 2011 kl 4:47 Till: Hamid Rofoogaran Kopia: arc...@li... Ämne: Re: [Archive-access-discuss] Migration & WBM Hi Hamid, Can you elaborate on what you mean by "migrated"? Do you have any links to the report you mentioned? One of the design goals of the WARC format is to allow content which was recorded in other formats, for example, as millions of files on a "standard filesystem" to be encapsulated in more manageable WARC files. Is this the kind of "migration" to which you're referring? If so, Wayback has not currently be used in this application, but it's design has considered this as a future goal. Wayback attempts to be a framework for: 1) creating indexes of large amounts of semi-structured data 2) providing search of those indexes, both to query what content is available, and for retrieving pointers to specific resources captured 3) returning specific captured resources, in many cases altering the resources to provide contextual metadata, or to enhance viewing of those resources by clients. Currently, the modules that have been developed within this framework primarily index HTTP content within W/ARC files, provide search of those indexes by URL, and alter returned resources, namely HTML, CSS, and Javascript, to assist replay within a web browser. So, depending on what you mean by "migrated" Wayback may be a good starting point to provide access to large bodies of content stored in W/ARC format. I'd be happy to provide suggestions, assistance, and as time permits, code to help with your Wayback extensions. Looking forward to hearing back about your specific needs! Brad On 3/10/11 8:16 PM, Hamid Rofoogaran wrote: Hi everybody, Is waybackmachine able to access (and present) WARC files where the content have been migrated ? Is there any developement ongoing regarding this matter ? Any documents, papers, reports to read about it ? I will be very gratefull for any kind of information about "migrating of WARC content AND Waybackmachine" . The only report i have found is from Vienna University of Technology written by Andreas Rauber , ...(2009) Regards Hamid ------------------------------------------------------------------------------ Colocation vs. Managed Hosting A question and answer guide to determining the best fit for your organization - today and in the future. 
http://p.sf.net/sfu/internap-sfd2d _______________________________________________ Archive-access-discuss mailing list Arc...@li...<mailto:Arc...@li...> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
From: Bradley T. <br...@ar...> - 2011-03-11 03:40:12
|
Hi Hamid, Can you elaborate on what you mean by "migrated"? Do you have any links to the report you mentioned? One of the design goals of the WARC format is to allow content which was recorded in other formats, for example, as millions of files on a "standard filesystem" to be encapsulated in more manageable WARC files. Is this the kind of "migration" to which you're referring? If so, Wayback has not currently be used in this application, but it's design has considered this as a future goal. Wayback attempts to be a framework for: 1) creating indexes of large amounts of semi-structured data 2) providing search of those indexes, both to query what content is available, and for retrieving pointers to specific resources captured 3) returning specific captured resources, in many cases altering the resources to provide contextual metadata, or to enhance viewing of those resources by clients. Currently, the modules that have been developed within this framework primarily index HTTP content within W/ARC files, provide search of those indexes by URL, and alter returned resources, namely HTML, CSS, and Javascript, to assist replay within a web browser. So, depending on what you mean by "migrated" Wayback may be a good starting point to provide access to large bodies of content stored in W/ARC format. I'd be happy to provide suggestions, assistance, and as time permits, code to help with your Wayback extensions. Looking forward to hearing back about your specific needs! Brad On 3/10/11 8:16 PM, Hamid Rofoogaran wrote: > Hi everybody, > Is waybackmachine able to access (and present) WARC files where the > content have been migrated ? Is there any developement ongoing > regarding this matter ? Any documents, papers, reports to read about it ? > > I will be very gratefull for any kind of information about "migrating > of WARC content AND Waybackmachine" . > The only report i have found is from Vienna University of Technology > written by Andreas Rauber , ...(2009) > > Regards > Hamid > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
From: Bradley T. <br...@ar...> - 2011-03-11 03:26:15
|
Hi Laura, Wayback 1.6.0 contains code to run a special AccessPoint which acts as a "modified" proxy server. When proxy requests are received by this AccessPoint, a request to the live web, for the URL requested by the client, is recorded into an ARC file on the spot. The single compressed ARC record is then returned as the HTTP entity to the requesting client. Note this means you cannot point a web browser directly at this service, since the browser doesn't know how to unpack the enclosed ARC record (there is another "unwrapping" proxy AccessPoint which does this, allowing experimenting with recording a web browser session.) However, a client which expects to be returned an ARC record, can then unpack the returned ARC record and use it, to access the entire HTTP response to a robots.txt request, for example. This service is used in Wayback 1.6.0 to request content from the live web for both checking robots.txt files, and for "backfilling" content requested via replay sessions, but which is not in the archive. Some of the driving factors behind returning a compressed ARC record instead of proxy returning the actual response is to simplify inserting an HTTP cache between the Wayback service and the live web proxy AccessPoint. We use varnish, which handles caching of the returned ARC record, and coalescing of multiple concurrent requests into a single request to the live web proxy AccessPoint. We intend to make this service record WARC files in the near term - porting the old Wayback ARC recording code was more expediant for 1.6.0. Currently, there's some complexity in implementing this, which will probably require some additional documentation. If you're interested, please let me know, and we'll try to prioritize this documentation. Lastly, note that we've discovered some significant bugs in the 1.6.0 codebase specifically related to this live web proxy AccessPoint, mostly in bad handling of connection errors and timeouts. These fixes are all in SVN currently, but we have not scheduled a 1.6.1 release at the moment. Brad On 3/10/11 8:02 PM, Graham, Laura wrote: > We were wondering here at the Library of Congress about the LiveWeb.xml in Wayback 1.6. The wayback.xml explains: > > " LiveWeb.xml contains the 'proxylivewebcache' bean that enable fetching > content from the live web, recording that content in ARC files. > To use the "excluder-factory-robot" bean as an exclusionFactory property of > AccessPoints, which will cause live robots.txt files to be consulted > retroactively before showing archived content, you'll need to import > LiveWeb.xml as well." > > We understand about consulting the robots.txt for display, of course, but can the Wayback actually write data to ARC (WARC?) files? What does "recording" mean? > > Thanks! > Laura Graham > > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
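For anyone experimenting with what Brad describes: a plain proxy-style request to the live-web AccessPoint should come back as a single gzipped ARC record rather than a renderable page, so fetch it with a client and unpack it. The host and port below are hypothetical and depend on how the AccessPoint is configured:

    $ curl -s -x localhost:8081 http://example.com/robots.txt -o robots.arc.gz
    $ zcat robots.arc.gz | head    # ARC record header followed by the recorded HTTP response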
From: Gerard S. i M. <gs...@ce...> - 2011-03-10 16:33:51
|
Hi Aaron, If segments needs to be kept in order to update the indexes with new crawls then we need to bear in mind that indexes+segments size represents somewhere around 50% of the all ARCs, specially in terms of scalability. Are these numbers usual? 1.- Regarding to the merge index process we don't have any de-duplication strategy right now due to OOM errors we found when we were building the indexes in the first steps with NutchWAX. We were unable to build the indexes from scratch in a single job, we had to split in different processes with a small set of segments (we discuss that in the beginning of this thread). You pointed out that a NutchWAX minor revision on the 0.13 version has some new feature related to duplicate records during index-building. That might be helpful to try de-duplication index building. We tried both approaches for the entire ARC collection: a) IndexMerger Lucene API (inside TNH). index size: 813GB b) Re-built the entire index giving as input both old and new NutchWAX segments of the ARC files. index size: 563GB is it normal that there is this difference of sizes in the indexes? JBs sounds good. "Accented letter collapsing" is an interesting feature for a web pages in Catalan, that clearly meets our needs for a Catalonia web archive. We think its worth trying it. We would be glad if you could give us further information on JBs process and de-duplication index. Is de-duplication enabled by default? 2.- OK, our archive files are mainly text (~82%), so its usual that kind of percentage. 3.- We have only one collection for all the ARC files. We have our collection on open access and the service is load balanced through several nodes. That's the scenario in where several tomcats are accessing the same indexes. 4.- Having read your "SOLR-Nutch Report" I understand the situation we are now. Some of the "key problems" were pointed out also in IWAW2010 in Viena. P.S.: Don't worry if your answers are very long. P.S.2: This thread has been evolving through several topics, if you think it's better to answer in a different thread (JBs tool) with a new title feel free to switch it. Thank you very match for your answers. Best regards, Gerard Aaron Binns escribió: > Gerard Suades i Méndez <gs...@ce...> writes >> 1.- We have a new set of ARC that we would like to include in full >> text search. We were wondering if there is any special procedure to >> update the already existing NutchWAX indexes with the new crawls. Any >> idea for the merge process? Do we need to keep segments of old crawls >> in order to generate the indexes of the new crawls before merging all >> together? >> > > Yes, for *building* the indexes you need to keep the segments, only for > the TNH search service you don't need the segments as the index has all > the information in it needed for search services. > > There are basically two ways to merge indexes, which one you choose > depends on your de-duplication strategy. > > If you have two Lucene indexes A and B, you can just use the IndexMerger > command in TNH to merge them together. TNH provides a simple > command-line wrapper around the Lucene index merging API call. Since > TNH is a webapp, you have to un-jar it to be able to use the Java > command-line wrappers, for example > > $ mkdir tnh > $ cd tnh > $ jar xf ~/tnh.war > $ export CLASSPATH=WEB-INF/classes:WEB-INF/lib/lucene-core-*.jar > $ java IndexMerger <merged> <index-A> <index-B> > > This simply calls the Lucene library index-merge function, so it does > *not* know anything about de-duplication. 
If you have the same record > in both index A and index B, then you will have them both in the merged > index. > > So, if you already have an index for your existing collection, then get > some new (W)ARC files, you and index those separately and then merge the > two indexes together. > > > Another approach is to re-build the entire index, giving as inputs the > initial NutchWAX segments and the new NutchWAX segment for the new > (W)ARCs. Then, you will have one single index with everything in it. > > In this case, any duplicate records can be detected and merged when the > combined index is being built. The merging of duplicate records during > index-building was a feature put into a minor revision of NutchWAX 0.13. > I'll have to look up the specific SVN revision. > > > With regards to indexing, there is a side-project of mine similar to TNH > which does a better job of index-building than NutchWAX. This project > is called "The JBs", which was the name of the band for the famous > musician James Brown. > > One of the many improvements in The JBs does is "accented letter > collapsing" so that words with accented characters are indexed so that > they can be found with or without the accent mark. For example, > > Méndez > > with NutchWAX it is put into the index exactly as "Méndez". If someone > searches for "Mendez", it will not be found. But if the index is built > with then both "Méndez" and "Mendez" can be found. > > The JBs also performs merging of duplicates when building a single index > from multiple NutchWAX segments. > > But, this email is getting rather long already, with more below, so I > will conclude this section on The JBs. We can discuss further if you > are interested. > > >> 2.- The size of the index which self-contained the segments >> information is a linear growth size related to the ARC? at this moment >> index represents pretty much 7.5% of the whole collection ARCs size. >> > > It depends on the mix of file types in the original ARC files. Only > text types are put into the full-text search, so things like JPG, MP3, > AVI, ZIP, etc. are omitted. You're 7.5% number does not seem unusual to > me. In our full-text search for Archive-It.org, there are just over 1 > billion documents in the index and the on-disk index size is ~3.5TB, and > the size of all the (W)ARC files is somewhere around 100TB. But I know > there are lots of large binary files, including lots of YouTube video in > the Archive-It collection. > > >> 3.- Is it possible to install TNH in several tomcats sharing the same >> index? in other words, does TNH block index while searching as Wayback >> used to? >> > > I don't remember if that specific use-case was tested. It should work. > > TNH is built on Lucene and when TNH opens the index, it uses the Lucene > API call to open the index in read-only mode; so there should be no > exclusive locking and multiple TNH web application instances should be > able to open the same index. > > However, TNH and the Lucene library do cache parts of the index in > memory, so if you have multiple instances of the TNH web appliction, you > will have multiple instances of the caches as well. > > An alternative approach might be to use a multi-index setup in a single > TNH instance and use the "i=<indexname>" URL parameter to select which > index to search. > > Maybe you can describe what you are trying to do with multiple TNH > webapp instances reading the same index and I can provide some > suggestions on how to implement it. 
> > >> 4.- Based on the results of our tests we are thinking of using TNH for >> full text search instead of WERA. Is there any roadmap or a major >> release planned for the future? >> > > No, there isn't any roadmap. Well, the roadmap is to migrate everything > to Apache SOLR, which merged projects with Lucene last year and is now > considered *the* open-source full-text search platform. > > Unfortunately, there are some features missing from SOLR which are > required for full-text search on web archives. Also, we don't know yet > how SOLR will scale, especially in a multi-server configuration. > > I produced a report for the IIPC covering the issues with migrating from > NutchWAX to SOLR. > > http://archive.org/~aaron/iipc/ > > > So, that leaves us in an intermediate state where NutchWAX's search > service performance is not sufficient, but SOLR is not quite ready for > full-scale migration. The Internet Archive needs to decide if we commit > to supporting TNH (with an official release) as an intermediate step in > the migration path to SOLR. > > And if people are finding TNH useful and an adequate replacement for the > NutchWAX search service, then we would have a stronger case to commit > the resources to support an official TNH release. > > > -- ...................................................................... __ / / Gerard Suades Méndez C E / S / C A Departament d'Aplicacions i Projectes /_/ Centre de Supercomputació de Catalunya Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona T. 93 551 62 20 · F. 93 205 6979 · gs...@ce... ...................................................................... |
From: Hamid R. <ham...@lt...> - 2011-03-10 13:19:20
|
Hi everybody,

Is the Wayback Machine able to access (and present) WARC files where the content has been migrated? Is there any development ongoing regarding this matter? Any documents, papers, or reports to read about it?

I would be very grateful for any kind of information about "migrating of WARC content AND Waybackmachine". The only report I have found is from Vienna University of Technology, written by Andreas Rauber, ... (2009).

Regards,
Hamid
From: Graham, L. <lg...@lo...> - 2011-03-10 13:02:36
|
We were wondering here at the Library of Congress about the LiveWeb.xml in Wayback 1.6. The wayback.xml explains: " LiveWeb.xml contains the 'proxylivewebcache' bean that enable fetching content from the live web, recording that content in ARC files. To use the "excluder-factory-robot" bean as an exclusionFactory property of AccessPoints, which will cause live robots.txt files to be consulted retroactively before showing archived content, you'll need to import LiveWeb.xml as well." We understand about consulting the robots.txt for display, of course, but can the Wayback actually write data to ARC (WARC?) files? What does "recording" mean? Thanks! Laura Graham |
From: Bradley T. <br...@ar...> - 2011-03-10 02:40:50
|
We omitted the CDATA around some javascript that used the '<' comparator, and also needed to close 2 INPUT tags. Brad On 3/10/11 8:42 AM, Ed Summers wrote: > On Wed, Mar 9, 2011 at 8:16 PM, Bradley Tofel<br...@ar...> wrote: >> The problem appears to have been simple to address and should be fixed now - >> please let us know if you believe there's still a problem. > That was fast--Thanks! What did you end up having to do to fix it? > I'll be sure to send questions like this via the feedback page in the > future. > > //Ed > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
From: Ed S. <eh...@po...> - 2011-03-10 01:42:46
|
On Wed, Mar 9, 2011 at 8:16 PM, Bradley Tofel <br...@ar...> wrote: > The problem appears to have been simple to address and should be fixed now - > please let us know if you believe there's still a problem. That was fast--Thanks! What did you end up having to do to fix it? I'll be sure to send questions like this via the feedback page in the future. //Ed |
From: Bradley T. <br...@ar...> - 2011-03-10 01:10:16
|
Ed, Thanks for bringing this to our attention! The problem appears to have been simple to address and should be fixed now - please let us know if you believe there's still a problem. In the future, if you notice issues or have questions about the beta service at waybackmachine.org, please let us know via the feedback page at: http://faq.waybackmachine.org/contact/ Thanks again for reporting the issue, and for using the service! Brad On 3/10/11 7:39 AM, Kris Carpenter Negulescu wrote: > > forwarding to Wayback discussion list... > > Begin forwarded message: > >> *From: *Ed Summers <eh...@PO... <mailto:eh...@PO...>> >> *Date: *March 9, 2011 3:56:19 PM PST >> *To: *CUR...@LI... >> <mailto:CUR...@LI...> >> *Subject: **[IIPC-Web-Curators] xhtml & wayback* >> *Reply-To: *IIPC Web Curators <CUR...@LI... >> <mailto:CUR...@LI...>> >> >> I noticed some problems getting XHTML out of the new Wayback Machine, >> which I wrote about [1]. I'd be interested in your thoughts about >> whether there might be a workable solution. >> >> //Ed >> >> [1] http://inkdroid.org/journal/2011/03/09/xhtml-wayback/ >> >> ############################ >> >> To unsubscribe from the CURATORS list: >> write to: mailto:CUR...@LI... >> or click the following link: >> http://list.netpreserve.org/SCRIPTS/WA-NETPRESERVE.EXE?SUBED1=CURATORS&A=1 >> <http://list.netpreserve.org/SCRIPTS/WA-NETPRESERVE.EXE?SUBED1=CURATORS&A=1> > > > ------------------------------------------------------------------------------ > Colocation vs. Managed Hosting > A question and answer guide to determining the best fit > for your organization - today and in the future. > http://p.sf.net/sfu/internap-sfd2d > > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
From: Kris C. N. <kca...@ar...> - 2011-03-10 00:40:05
|
forwarding to Wayback discussion list... Begin forwarded message: > From: Ed Summers <eh...@PO...> > Date: March 9, 2011 3:56:19 PM PST > To: CUR...@LI... > Subject: [IIPC-Web-Curators] xhtml & wayback > Reply-To: IIPC Web Curators <CUR...@LI...> > > I noticed some problems getting XHTML out of the new Wayback Machine, > which I wrote about [1]. I'd be interested in your thoughts about > whether there might be a workable solution. > > //Ed > > [1] http://inkdroid.org/journal/2011/03/09/xhtml-wayback/ > > ############################ > > To unsubscribe from the CURATORS list: > write to: mailto:CUR...@LI... > or click the following link: > http://list.netpreserve.org/SCRIPTS/WA-NETPRESERVE.EXE?SUBED1=CURATORS&A=1 |
From: <aa...@ar...> - 2011-03-07 17:36:19
|
A few links I omitted from my previous response. Some documentation on TNH and the JBs: https://webarchive.jira.com/wiki/display/search/The+New+Hotness https://webarchive.jira.com/wiki/display/search/The+JBs And a direct link to the Nutch(WAX)-Solr report: http://www.archive.org/~aaron/iipc/solr-nutch-report.html Aaron |
From: Aaron B. <aa...@ar...> - 2011-03-05 21:49:48
|
Gerard Suades i Méndez <gs...@ce...> writes: > 1.- We have a new set of ARC that we would like to include in full > text search. We were wondering if there is any special procedure to > update the already existing NutchWAX indexes with the new crawls. Any > idea for the merge process? Do we need to keep segments of old crawls > in order to generate the indexes of the new crawls before merging all > together? Yes, for *building* the indexes you need to keep the segments, only for the TNH search service you don't need the segments as the index has all the information in it needed for search services. There are basically two ways to merge indexes, which one you choose depends on your de-duplication strategy. If you have two Lucene indexes A and B, you can just use the IndexMerger command in TNH to merge them together. TNH provides a simple command-line wrapper around the Lucene index merging API call. Since TNH is a webapp, you have to un-jar it to be able to use the Java command-line wrappers, for example $ mkdir tnh $ cd tnh $ jar xf ~/tnh.war $ export CLASSPATH=WEB-INF/classes:WEB-INF/lib/lucene-core-*.jar $ java IndexMerger <merged> <index-A> <index-B> This simply calls the Lucene library index-merge function, so it does *not* know anything about de-duplication. If you have the same record in both index A and index B, then you will have them both in the merged index. So, if you already have an index for your existing collection, then get some new (W)ARC files, you and index those separately and then merge the two indexes together. Another approach is to re-build the entire index, giving as inputs the initial NutchWAX segments and the new NutchWAX segment for the new (W)ARCs. Then, you will have one single index with everything in it. In this case, any duplicate records can be detected and merged when the combined index is being built. The merging of duplicate records during index-building was a feature put into a minor revision of NutchWAX 0.13. I'll have to look up the specific SVN revision. With regards to indexing, there is a side-project of mine similar to TNH which does a better job of index-building than NutchWAX. This project is called "The JBs", which was the name of the band for the famous musician James Brown. One of the many improvements in The JBs does is "accented letter collapsing" so that words with accented characters are indexed so that they can be found with or without the accent mark. For example, Méndez with NutchWAX it is put into the index exactly as "Méndez". If someone searches for "Mendez", it will not be found. But if the index is built with then both "Méndez" and "Mendez" can be found. The JBs also performs merging of duplicates when building a single index from multiple NutchWAX segments. But, this email is getting rather long already, with more below, so I will conclude this section on The JBs. We can discuss further if you are interested. > 2.- The size of the index which self-contained the segments > information is a linear growth size related to the ARC? at this moment > index represents pretty much 7.5% of the whole collection ARCs size. It depends on the mix of file types in the original ARC files. Only text types are put into the full-text search, so things like JPG, MP3, AVI, ZIP, etc. are omitted. You're 7.5% number does not seem unusual to me. In our full-text search for Archive-It.org, there are just over 1 billion documents in the index and the on-disk index size is ~3.5TB, and the size of all the (W)ARC files is somewhere around 100TB. 
But I know there are lots of large binary files, including lots of YouTube video in the Archive-It collection. > 3.- Is it possible to install TNH in several tomcats sharing the same > index? in other words, does TNH block index while searching as Wayback > used to? I don't remember if that specific use-case was tested. It should work. TNH is built on Lucene and when TNH opens the index, it uses the Lucene API call to open the index in read-only mode; so there should be no exclusive locking and multiple TNH web application instances should be able to open the same index. However, TNH and the Lucene library do cache parts of the index in memory, so if you have multiple instances of the TNH web appliction, you will have multiple instances of the caches as well. An alternative approach might be to use a multi-index setup in a single TNH instance and use the "i=<indexname>" URL parameter to select which index to search. Maybe you can describe what you are trying to do with multiple TNH webapp instances reading the same index and I can provide some suggestions on how to implement it. > 4.- Based on the results of our tests we are thinking of using TNH for > full text search instead of WERA. Is there any roadmap or a major > release planned for the future? No, there isn't any roadmap. Well, the roadmap is to migrate everything to Apache SOLR, which merged projects with Lucene last year and is now considered *the* open-source full-text search platform. Unfortunately, there are some features missing from SOLR which are required for full-text search on web archives. Also, we don't know yet how SOLR will scale, especially in a multi-server configuration. I produced a report for the IIPC covering the issues with migrating from NutchWAX to SOLR. http://archive.org/~aaron/iipc/ So, that leaves us in an intermediate state where NutchWAX's search service performance is not sufficient, but SOLR is not quite ready for full-scale migration. The Internet Archive needs to decide if we commit to supporting TNH (with an official release) as an intermediate step in the migration path to SOLR. And if people are finding TNH useful and an adequate replacement for the NutchWAX search service, then we would have a stronger case to commit the resources to support an official TNH release. -- Aaron Binns Senior Software Engineer, Web Group, Internet Archive Program Officer, IIPC aa...@ar... |
From: Awakash B. <abo...@ac...> - 2011-03-03 17:04:00
|
Hello Brad and/or Lori,

I haven't received a reply to the e-mail below. Would you know why we are still seeing issues? Any help would be great!

Thanks,
Awakash

________________________________
From: Awakash Bodiwala
Sent: Thursday, February 24, 2011 10:15 AM
To: 'Bradley Tofel'
Cc: 'arc...@li...'; Jennie Corman
Subject: RE: [Archive-access-discuss] Instructions on running wayback and to unpack files

Hello Brad,

I've made this update to read from another location. After the restart, the settings seem good, but I'm still seeing a 404 error after I click the 'Take Me Back' submit button. Here is the URL:

http://ipaddress:8080/query?type=urlquery&url=http%3A%2F%2Fwww.mysite.com&date=2009&Submit=Take+Me+Back

Any suggestions on the issue? I think Tomcat still isn't reading the .warc and .warc.gz files (or the .manifest, .log files).

Best,
Awakash

________________________________
From: Bradley Tofel [mailto:br...@ar...]
Sent: Friday, February 04, 2011 2:44 AM
To: Awakash Bodiwala
Cc: arc...@li...; Jennie Corman
Subject: Re: [Archive-access-discuss] Instructions on running wayback and to unpack files

Hi Awakash,

You can change the basedir to whatever is simpler for your installation, in this case likely /home/site/archivefiles/ or wherever the files will show up - no need to move them to /tmp/wayback; that's just the default directory wayback uses in the default configuration.

Let me know how this works for you,
Brad
From: Gerard S. i M. <gs...@ce...> - 2011-03-02 16:09:34
|
Aaron Binns escribió: > Gerard Suades i Méndez <gs...@ce...> writes >> using dumper tool -c option: 146.235.591 documents >> > > Hmm, a 12GB machine should be able to serve a 146 million document > index. > > In one of our deployments, we have a ~380 million document index spread > (unevenly) across three nodes, each with 8GB RAM. The sizes of each > are: > > 114.555.371 > 152.748.262 > 114.567.931 > > So if a 8GB RAM node can handle between 115-150 million documents, I > would expect your 12GB machine could as well. > > Now, our deployment is using the "tnh" code I mentioned before; so > that could be a differentiating factor. > > Also, since you are using 64-bit JVM, I strongly recommend using the JVM > option: > > -XX:+UseCompressedOops > > With this feature enabled, the JVM will use 32-bit object references > rather than 64-bit. As long as the number of *objects* in your system > are below 2^32 (~4billion) then 32-bit references are sufficient. > > This can save a lot of memory since there are going to be hundreds of > millions of references in the JVM's heap. > > For example, on the 8GB nodes in our ~380 million document deployment, > the JVM options we use are: > > JAVA_OPTS="-Djava.awt.headless=true -Xmx5000m -XX:+UseCompressedOops" > We are using java 1.6.0_12 version and unfortunately it has a bug with UseCompressedOops which seems to be solved in update 14. We will update java version and try it again. >> yes, fields stored in the index are: collection, content, date, >> digest, length, segment, site, title, type and url. >> > > Since the 'content' field is stored in the index, if you use the "tnh" > code, you don't need the Nutch(WAX) segments. Everything is > self-contained in the index. > > I just wanted to point this out in case you want to use "tnh" rather > than NutchWAX for search serving. I recommend "tnh" over NutchWAX, it's > what we are using at the Archive now for all our deployments. > We gave it a spin with the whole collection of ARC and TNH shows a dramatic improvement on performance compared with NutchWAX. Really impressive. Congratz. good job ;) As you said before, NutchWAX segments are no longer needed with TNH. We would like to ask a few questions: 1.- We have a new set of ARC that we would like to include in full text search. We were wondering if there is any special procedure to update the already existing NutchWAX indexes with the new crawls. Any idea for the merge process? Do we need to keep segments of old crawls in order to generate the indexes of the new crawls before merging all together? 2.- The size of the index which self-contained the segments information is a linear growth size related to the ARC? at this moment index represents pretty much 7.5% of the whole collection ARCs size. 3.- Is it possible to install TNH in several tomcats sharing the same index? in other words, does TNH block index while searching as Wayback used to? 4.- Based on the results of our tests we are thinking of using TNH for full text search instead of WERA. Is there any roadmap or a major release planned for the future? -- Gerard ...................................................................... __ / / Gerard Suades Méndez C E / S / C A Departament d'Aplicacions i Projectes /_/ Centre de Supercomputació de Catalunya Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona T. 93 551 62 20 · F. 93 205 6979 · gs...@ce... ...................................................................... |
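Once on a JVM where -XX:+UseCompressedOops is safe (the 1.6.0_12 bug Gerard mentions is reported fixed in later updates), a common place to set it for a Tomcat-hosted TNH or Wayback is bin/setenv.sh, which catalina.sh picks up automatically. The values here are just the ones quoted earlier in the thread, not recommendations:

    # $CATALINA_HOME/bin/setenv.sh
    # Heap size is the example value from this thread; tune -Xmx to the node's RAM and index size.
    export JAVA_OPTS="-Djava.awt.headless=true -Xmx5000m -XX:+UseCompressedOops"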
From: Natalia T. <nt...@ce...> - 2011-03-01 15:21:43
|
Hi Brad,

Thanks for the information. We'll recrawl with Heritrix using WARC files and try again with Wayback. Is there any problem if we have a mix of ARC and WARC files in the same Wayback? Is it good practice?

Natalia
From: Colin R. <cs...@st...> - 2011-03-01 14:29:56
|
On 2011-03-01 13:17, Bradley Tofel wrote: > I'm not aware of any way to record fetched-but-not-stored metadata in > ARC files, only in WARC files. > > Is the information about the second(,third,etc) > downloaded-but-not-stored being recorded only in the crawl logs? Forging > CDX records from information in crawl logs may be possible, but as far > as I know has never been attempted. Actually NetarchiveSuite has a utility to do just that: http://netarchive.dk/suite/Additional%20Tools%20Manual%203.14#Additional_Tools_Manual_3.14.2BAC8-Tools_in_Wayback_Module.dk.netarkivet.wayback.DeduplicateToCDXApplication cheers, Colin Rosenthal IT-Developer State and University Library, Aarhus > We use WARC files with content-digest duplicate reduction (as opposed to > sending if-modified/if-none-match headers, which has only been used and > replayed via Wayback experimentally.) > > Brad > > On 3/1/11 4:43 PM, Natalia Torres wrote: >> Hi Brad, >> >> thanks a lot for your advice. I added the "dedupeRecords" property to >> the LocalResourceIndex Bean in CDXCollection.xml and restart tomcat, but >> I can't view correctly the crawls as before: viewing the first crawl >> everything is correct and viewing the second version the images/css/pdf >> (only crawled at the first time) aren't displayed... >> >> We are using arc files, the behavior is the same that using warc or we >> need to change to warc? >> >> Here is the CDXCollection.xml file: >> >> <?xml version="1.0" encoding="UTF-8"?> >> <beans xmlns="http://www.springframework.org/schema/beans" >> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >> xsi:schemaLocation="http://www.springframework.org/schema/beans >> http://www.springframework.org/schema/beans/spring-beans-2.5.xsd" >> default-init-method="init"> >> >> <bean id="localcdxcollection" >> class="org.archive.wayback.webapp.WaybackCollection"> >> <property name="resourceStore"> >> <bean class="org.archive.wayback.resourcestore.LocationDBResourceStore"> >> <property name="db"> >> <bean >> class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB"> >> <property name="path" value="${wayback.basedir}/path-ind >> ex.txt" /> >> </bean> >> </property> >> </bean> >> </property> >> >> <property name="resourceIndex"> >> <bean class="org.archive.wayback.resourceindex.LocalResourceIndex"> >> <property name="canonicalizer" ref="waybackCanonicalizer" /> >> <property name="source"> >> >> <!-- >> A single CDX SearchResultSource example. >> --> >> <bean class="org.archive.wayback.resourceindex.cdx.CDXIndex"> >> <property name="path" value="${wayback.basedir}/dedup2011.cdx" /> >> </bean> >> >> </property> >> <property name="maxRecords" value="10000" /> >> <property name="dedupeRecords" value="true" /> >> </bean> >> </property> >> </bean> >> >> </beans> >> >> thanks, >> >> natalia >> >> >> ------------------------------------------------------------------------------ >> Free Software Download: Index, Search& Analyze Logs and other IT data in >> Real-Time with Splunk. Collect, index and harness all the fast moving IT data >> generated by your applications, servers and devices whether physical, virtual >> or in the cloud. Deliver compliance at lower cost and gain new business >> insights. http://p.sf.net/sfu/splunk-dev2dev >> _______________________________________________ >> Archive-access-discuss mailing list >> Arc...@li... 
>> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > ------------------------------------------------------------------------------ > Free Software Download: Index, Search& Analyze Logs and other IT data in > Real-Time with Splunk. Collect, index and harness all the fast moving IT data > generated by your applications, servers and devices whether physical, virtual > or in the cloud. Deliver compliance at lower cost and gain new business > insights. http://p.sf.net/sfu/splunk-dev2dev > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |