From: Bradley T. <br...@ar...> - 2011-03-14 10:12:56
Hi Hamid,

This is indeed an area that seems to require more research, and my understanding agrees with the document you referred to: there are two camps, which I refer to as the "forward-convert" camp and the "emulation-will-save-us" camp. The paper looks to describe some limited-scale research into the forward-convert approach.

Wayback does not currently ship with any code to support forward-converted formats, but my feeling is that adding this sort of functionality would be pretty straightforward, if not trivial.

This all requires further discussion and analysis, but a technical straw man for implementing this functionality within Wayback might look like:

1) Alter the standard CDX format to include 2 extra fields: WARC-ID and WARC-Refers-To.

   The former would be present for all records. The WARC-Refers-To field would be present for all transformed records.

   As a side note, it seems that some indication of the type of conversion performed, as well as the specific version and configuration of the software used, might be useful in the conversion records. If this information were included in the conversion WARC records, it could be included in a 3rd (and possibly 4th) field, and then consulted at query time by Wayback to choose the "best" conversion available if newer techniques or software surfaced.

   Lastly, subsequent index steps could be simplified (could save an extra "sort" operation) if the capture date of the original record were to be included in the conversion WARC record.

2) Modify the Wayback indexing code to include the WARC-ID and WARC-Refers-To data in the CaptureSearchResults (as well as the aforementioned conversion data and original capture date, if available).

3) Create a Wayback CaptureSearchResult filter which reads both original and conversion records from the index, and produces a new set of results which prefers the converted records, if available.
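To make steps 1 to 3 a little more concrete, here is a minimal sketch in Python (Wayback itself is Java). Every name in it (Capture, prefer_converted, the field names) is hypothetical and not part of Wayback's API. It assumes each index entry carries the two proposed extra fields, that converted records do not carry the original capture date (so the filter copies it over and re-sorts), and, as a placeholder policy only, that the most recently written conversion wins when several exist.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Capture:
    """One index entry. Field names are illustrative, not Wayback's."""
    url: str
    capture_date: str                 # 14-digit timestamp, e.g. "20090101000000"
    mimetype: str
    warc_id: str                      # proposed extra field, present on every record
    refers_to: Optional[str] = None   # proposed extra field, set on converted records only
    original_mimetype: Optional[str] = None  # annotation filled in by the filter


def prefer_converted(results: Iterable[Capture]) -> List[Capture]:
    """Return results in which originals are replaced by their conversions."""
    buffered = list(results)  # buffer all matching records
    originals = {c.warc_id: c for c in buffered if c.refers_to is None}

    # Match conversions to originals. If several exist for one original, keep
    # the most recently written one (placeholder policy, an assumption).
    best: Dict[str, Capture] = {}
    for c in buffered:
        if c.refers_to in originals:
            current = best.get(c.refers_to)
            if current is None or c.capture_date > current.capture_date:
                best[c.refers_to] = c

    out: List[Capture] = []
    for c in buffered:
        if c.refers_to is None:
            conv = best.get(c.warc_id)
            if conv is None:
                out.append(c)  # no conversion available: keep the original
            else:
                # Emit the conversion in place of the original, carrying over
                # the original capture date and noting the original format.
                out.append(replace(conv,
                                   capture_date=c.capture_date,
                                   original_mimetype=c.mimetype))
        elif c.refers_to not in originals:
            out.append(c)  # conversion whose original is not in this result set

    # The rest of the system expects date-ascending order.
    out.sort(key=lambda c: c.capture_date)
    return out
```

Given two captures of a Word document and one PDF conversion that refers to the first capture, prefer_converted would return the PDF (dated as the first capture) followed by the untouched second capture. Swapping the placeholder policy for one that compares conversion tool versions would only change the loop that fills `best`.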
Logic for the preference of converted records seems likely to change over time, so making this somewhat flexible might be a desirable design goal.

I mentioned in #1 that including the original record's capture date might simplify later steps, specifically step 3. If the converted records included the original capture date, they would sort alongside the original records, and the filter could simply omit the original records if a converted record was present. Likely the filter would also annotate the converted record with information for the user about the original format, the conversion, etc.

If the original record's capture date is not included, then this filter would have to:

* buffer all the matching records into a data structure
* match up converted records to their originals by WARC-ID and WARC-Refers-To
* annotate the converted records with info about the original, and the conversion process used
* discard original records
* re-sort the resulting search results for the rest of the Wayback system, which (currently) expects search results to be returned in date-ascending order.

Unquestionably there will need to be substantial QA effort to determine the viability of the solution, but the experience gained now will certainly be valuable.

Another tactic for solving the antiquated format issue within Wayback would be to implement specialized ReplayRenderers for various formats, and experiment with converting those formats on the fly, at replay time. Possibly compute time will get (remain?) cheap enough that this solution could be tractable in the long term.

Looking forward to comments or questions on the topic. Are other institutions interested in this in the near term? Is there other ongoing IIPC research in this arena that would be a better venue for the discussion?

Brad

On 3/14/11 3:31 PM, Hamid Rofoogaran wrote:
> Hi Brad,
> This is the link to the document I mentioned:
> http://publik.tuwien.ac.at/files/PubDat_181115.pdf
> By "migrated content" I mean, for example, that within your web
> archive (WARC files) there are a number of MS Word and TIFF
> objects. Your organisation decides that all the MS Word objects shall
> be converted to PDF/A and all the TIFF images will be converted to PNG
> format. The "new" WARC now has migrated content.
> Talking about this document, there are two issues in the "Summary and
> Outlook" for which I wonder if there has been any progress since 2009,
> namely:
> 1- "..... but further experiments with larger data sets are required
> to evaluate the scalability of this approach."
>
> 2- "The support of access engines ((WayBack), my comment) for
> migrated records and extracted metadata needs to be further analysed"
>
> Best
> Hamid
> -----------------------------------------------------
> Hamid Rofoogaran
> LDP Centre
> Tel: +46 921 57308
> Mobile: +46 76 81 57308
> ham...@ld...
> ham...@lt...
> www.ldb-centrum.se
> -----------------------------------------------------
> ------------------------------------------------------------------------
> *From:* Bradley Tofel [br...@ar...]
> *Sent:* 11 March 2011, 4:47
> *To:* Hamid Rofoogaran
> *Cc:* arc...@li...
> *Subject:* Re: [Archive-access-discuss] Migration & WBM
>
> Hi Hamid,
>
> Can you elaborate on what you mean by "migrated"?
>
> Do you have any links to the report you mentioned?
>
> One of the design goals of the WARC format is to allow content which
> was recorded in other formats, for example, as millions of files on a
> "standard filesystem", to be encapsulated in more manageable WARC
> files. Is this the kind of "migration" to which you're referring?
>
> If so, Wayback has not currently been used in this application, but its
> design has considered this as a future goal.
>
> Wayback attempts to be a framework for:
> 1) creating indexes of large amounts of semi-structured data
> 2) providing search of those indexes, both to query what content is
> available, and for retrieving pointers to specific resources captured
> 3) returning specific captured resources, in many cases altering the
> resources to provide contextual metadata, or to enhance viewing of
> those resources by clients.
>
> Currently, the modules that have been developed within this framework
> primarily index HTTP content within W/ARC files, provide search of
> those indexes by URL, and alter returned resources, namely HTML, CSS,
> and Javascript, to assist replay within a web browser.
>
> So, depending on what you mean by "migrated", Wayback may be a good
> starting point to provide access to large bodies of content stored in
> W/ARC format. I'd be happy to provide suggestions, assistance, and as
> time permits, code to help with your Wayback extensions.
>
> Looking forward to hearing back about your specific needs!
>
> Brad
>
> On 3/10/11 8:16 PM, Hamid Rofoogaran wrote:
>> Hi everybody,
>> Is waybackmachine able to access (and present) WARC files where the
>> content has been migrated? Is there any development ongoing
>> regarding this matter? Any documents, papers, reports to read about it?
>>
>> I will be very grateful for any kind of information about "migrating
>> of WARC content AND Waybackmachine".
>> The only report I have found is from Vienna University of Technology,
>> written by Andreas Rauber, ... (2009)
>>
>> Regards
>> Hamid
>>
>>
>> ------------------------------------------------------------------------------
>> Colocation vs. Managed Hosting
>> A question and answer guide to determining the best fit
>> for your organization - today and in the future.
>> http://p.sf.net/sfu/internap-sfd2d
>>
>>
>> _______________________________________________
>> Archive-access-discuss mailing list
>> Arc...@li...
>> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss