From: Sverre B. <sv...@us...> - 2005-11-03 13:25:54
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/lib/seal In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25466/lib/seal Modified Files: nutch.inc Log Message: RFE1346889 Google-like result presentation Index: nutch.inc =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/lib/seal/nutch.inc,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** nutch.inc 20 Oct 2005 10:40:48 -0000 1.8 --- nutch.inc 3 Nov 2005 13:25:29 -0000 1.9 *************** *** 61,66 **** var $sort; var $debug; - var $supressduplicates; var $morepages; /** --- 61,67 ---- var $sort; var $debug; var $morepages; + var $dedupfield; + var $hitsperdup; /** *************** *** 77,82 **** $this->offset = 0; $this->timespent = 0; ! $this->unsetSupressDuplicates(); ! $this->morepages = false; } --- 78,83 ---- $this->offset = 0; $this->timespent = 0; ! $this->morepages = false; ! $this->setDedup(); } *************** *** 116,120 **** # e.g &dedupField=date&hitsPerDup=100&sort=date if ($sortorder == "ascending" or $sortorder == "descending") { ! $this->sort = "&dedupField=date&sort=date"; if ($sortorder == "descending") { $this->sort .= "&reverse=true"; --- 117,122 ---- # e.g &dedupField=date&hitsPerDup=100&sort=date if ($sortorder == "ascending" or $sortorder == "descending") { ! $this->setDedup(100, "date"); ! $this->sort = "&sort=date"; if ($sortorder == "descending") { $this->sort .= "&reverse=true"; *************** *** 123,140 **** } - - /** - * Set suppress duplicate urls - */ - function setSupressDuplicates() { - $this->supressduplicates = "&hitsPerDup=1&dedupField=exacturl"; - } ! /** ! * Unset suppress duplicate urls ! */ ! function unsetSupressDuplicates() { ! $this->supressduplicates = "&hitsPerDup=0"; ! } /** --- 125,142 ---- } ! /** ! * Set deduplication ! * ! * If dedupfield is emty, NutchWax defaults to 'site' ! 
* To turn off dedup, set hitsperdup to 0 ! * ! * @param integer Hits per duplicate ! * @param string Field to deduplicate on ! */ ! function setDedup($hitsperdup = 0, $dedupfield = "") { ! $this->hitsperdup = $hitsperdup; ! $this->dedupfield = $dedupfield; ! } /** *************** *** 171,175 **** $time_start = microtime_float(); ! $this->queryurl = $this->searchengineurl . "?query=" . $this->adaptQuery($this->query) . "&start=" . $this->offset . "&hitsPerPage=" . $this->hitsperset . $this->supressduplicates; if ($this->sort != "") { --- 173,177 ---- $time_start = microtime_float(); ! $this->queryurl = $this->searchengineurl . "?query=" . $this->adaptQuery($this->query) . "&start=" . $this->offset . "&hitsPerPage=" . $this->hitsperset . "&hitsPerDup=" . $this->hitsperdup . "&dedupField=" . $this->dedupfield; if ($this->sort != "") { *************** *** 287,291 **** $this->resultset[$this->hitno]['encoding'] .= $data; } ! break; } } --- 289,298 ---- $this->resultset[$this->hitno]['encoding'] .= $data; } ! break; ! case "NUTCH:SITE": ! if (in_array("site", $this->resultfields)) { ! $this->resultset[$this->hitno]['site'] .= $data; ! } ! break; } } |
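The query-URL change in the nutch.inc commit above can be sketched in Java roughly as follows. This is only an illustrative rendering, not the WERA code: the class name, the example search URL, and the use of plain URL encoding in place of the PHP adaptQuery() helper are assumptions, and so is appending the sort string at the end of the URL, since the diff does not show that line.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical Java rendering of the dedup/sort query-URL logic from the nutch.inc diff. */
final class DedupQueryUrlSketch {
    private final String searchEngineUrl;
    private int hitsPerDup = 0;      // 0 turns deduplication off, per the comment in the diff
    private String dedupField = "";  // empty: NutchWax defaults to 'site', per the comment in the diff
    private String sort = "";

    DedupQueryUrlSketch(String searchEngineUrl) {
        this.searchEngineUrl = searchEngineUrl;
    }

    void setDedup(int hitsPerDup, String dedupField) {
        this.hitsPerDup = hitsPerDup;
        this.dedupField = dedupField;
    }

    /** Mirrors the sort branch in the diff: date sorting dedups on date, 100 hits per duplicate. */
    void setSortOrder(String sortOrder) {
        if (sortOrder.equals("ascending") || sortOrder.equals("descending")) {
            setDedup(100, "date");
            sort = "&sort=date";
            if (sortOrder.equals("descending")) {
                sort += "&reverse=true";
            }
        }
    }

    String buildQueryUrl(String query, int offset, int hitsPerPage) {
        String encoded;
        try {
            // Assumption: plain URL encoding stands in for the PHP adaptQuery() helper.
            encoded = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
        // Assumption: the sort string is appended last; the diff elides that line.
        return searchEngineUrl + "?query=" + encoded
            + "&start=" + offset
            + "&hitsPerPage=" + hitsPerPage
            + "&hitsPerDup=" + hitsPerDup
            + "&dedupField=" + dedupField
            + sort;
    }

    public static void main(String[] args) {
        // The search URL below is a placeholder, not a real endpoint.
        DedupQueryUrlSketch q = new DedupQueryUrlSketch("http://localhost:8080/search");
        System.out.println(q.buildQueryUrl("irish music", 0, 10));
        q.setSortOrder("descending");
        System.out.println(q.buildQueryUrl("irish music", 0, 10));
    }
}
```

Unlike the removed setSupressDuplicates()/unsetSupressDuplicates() pair, the two dedup parameters are always emitted, so turning deduplication off is just the default state (hitsPerDup=0) rather than a separate code path.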
From: Michael S. <sta...@us...> - 2005-11-01 19:17:17
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3465/xdocs

Modified Files:
  faq.fml
Log Message:
* xdocs/faq.fml
  More edit of scoring section.

Index: faq.fml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/faq.fml,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** faq.fml 1 Nov 2005 19:13:47 -0000 1.14
--- faq.fml 1 Nov 2005 19:17:09 -0000 1.15
***************
*** 272,276 ****
  query.host.boost, 2.0f
  query.phrase.boost, 1.0f</pre></p>
! <p>You can change the above boosts by editing your nutch-site.xml</p>
  <p>Anchor text makes a large contribution to a document ranking score.
  You can see the anchor text for a page by browsing to the 'explain' then
--- 272,278 ----
  query.host.boost, 2.0f
  query.phrase.boost, 1.0f</pre></p>
! <p>From the list above, you can see that terms found in a document URL get
! the highest boost with anchor text next, etc.
! You can change the above boosts by editing your nutch-site.xml</p>
  <p>Anchor text makes a large contribution to a document ranking score.
  You can see the anchor text for a page by browsing to the 'explain' then
|
From: Michael S. <sta...@us...> - 2005-11-01 19:13:56
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv2303/xdocs

Modified Files:
  faq.fml
Log Message:
* xdocs/faq.fml
  Edit on ranking on how you can change query time boost.

Index: faq.fml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/faq.fml,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** faq.fml 1 Nov 2005 19:12:10 -0000 1.13
--- faq.fml 1 Nov 2005 19:13:47 -0000 1.14
***************
*** 272,275 ****
--- 272,276 ----
  query.host.boost, 2.0f
  query.phrase.boost, 1.0f</pre></p>
+ <p>You can change the above boosts by editing your nutch-site.xml</p>
  <p>Anchor text makes a large contribution to a document ranking score.
  You can see the anchor text for a page by browsing to the 'explain' then
|
From: Michael S. <sta...@us...> - 2005-11-01 19:12:18
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv1807/xdocs Modified Files: faq.fml Log Message: * xdocs/faq.fml Add question on nutch ranking. Index: faq.fml =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/faq.fml,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** faq.fml 20 Oct 2005 23:51:35 -0000 1.12 --- faq.fml 1 Nov 2005 19:12:10 -0000 1.13 *************** *** 237,241 **** <question>How to sort results by date? </question> - <answer><p> <code>http://localhost:8080/archive-access-nutch/search.jsp?query=traditional+irish+music+paddy&hitsPerPage=100&dedupField=date&hitsPerDup=100&sort=date</code> --- 237,240 ---- *************** *** 251,256 **** </p></answer> </faq> ! <faq> ! <question id="mimetype">How to query for mimetypes? </question> <answer> --- 250,255 ---- </p></answer> </faq> ! <faq id="mimetype"> ! <question>How to query for mimetypes? </question> <answer> *************** *** 263,266 **** --- 262,281 ---- </answer> </faq> + <faq id="scoring"> + <question>Tell me more about how scoring is done in + nutch/nutchwax.</question> + <answer> + <p>By default, at query time, the following fields are boosted as follows: + <pre>query.url.boost, 4.0f + query.anchor.boost, 2.0f + query.title.boost, 1.5f + query.host.boost, 2.0f + query.phrase.boost, 1.0f</pre></p> + <p>Anchor text makes a large contribution to a document ranking score. + You can see the anchor text for a page by browsing to the 'explain' then + editing the URL to put in place 'anchors.jsp' instead of 'explain.jsp'. + </p> + </answer> + </faq> </part> </faqs> |
From: Michael S. <sta...@us...> - 2005-10-31 21:08:28
|
Update of /cvsroot/archive-access/archive-access/projects/wayback
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv27518

Modified Files:
  .classpath
Log Message:
* .classpath
  Had a full path for the codec jar. Fix.

Index: .classpath
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/.classpath,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** .classpath 25 Oct 2005 20:09:31 -0000 1.5
--- .classpath 31 Oct 2005 21:08:20 -0000 1.6
***************
*** 18,22 ****
  path="src/webapp/WEB-INF/lib/libidn-0.5.9.jar"/>
  <classpathentry kind="lib"
! path="/src/webapp/WEB-INF/lib/commons-codec-1.3.jar"/>
  <classpathentry kind="lib"
  path="src/webapp/WEB-INF/lib/dsi-unimi-it-1.0.0.kb.jar"/>
--- 18,22 ----
  path="src/webapp/WEB-INF/lib/libidn-0.5.9.jar"/>
  <classpathentry kind="lib"
! path="src/webapp/WEB-INF/lib/commons-codec-1.3.jar"/>
  <classpathentry kind="lib"
  path="src/webapp/WEB-INF/lib/dsi-unimi-it-1.0.0.kb.jar"/>
|
From: Michael S. <sta...@us...> - 2005-10-31 18:00:26
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv13604/src/java/org/archive/access/nutch Modified Files: NutchwaxSegmentMergeTool.java Log Message: * src/java/org/archive/access/nutch/NutchwaxSegmentMergeTool.java Added deduping that counts the collection name. Index: NutchwaxSegmentMergeTool.java =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch/NutchwaxSegmentMergeTool.java,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** NutchwaxSegmentMergeTool.java 27 Oct 2005 16:09:52 -0000 1.3 --- NutchwaxSegmentMergeTool.java 31 Oct 2005 18:00:17 -0000 1.4 *************** *** 238,256 **** String name = sr.segmentDir.getName(); FetcherOutput fo = new FetcherOutput(); for (long i = 0; i < sr.size; i++) { try { ! if (!sr.get(i, fo, null, null, null)) break; Document doc = new Document(); // compute boost ! float boost = IndexSegment.calculateBoost(fo.getFetchListEntry().getPage().getScore(), scorePower, boostByLinkCount, fo.getAnchors().length); doc.add(new Field("sd", name + "|" + i, true, false, false)); ! doc.add(new Field("uh", MD5Hash.digest(fo.getUrl().toString()).toString(), true, true, false)); ! doc.add(new Field("ch", fo.getMD5Hash().toString(), true, true, false)); ! doc.add(new Field("time", DateField.timeToString(fo.getFetchDate()), true, false, false)); ! doc.add(new Field("score", boost + "", true, false, false)); ! doc.add(new Field("ul", fo.getUrl().toString().length() + "", true, false, false)); iw.addDocument(doc); processedRecords++; --- 238,269 ---- String name = sr.segmentDir.getName(); FetcherOutput fo = new FetcherOutput(); + ParseData pd = new ParseData(); for (long i = 0; i < sr.size; i++) { try { ! if (!sr.get(i, fo, null, null, pd)) ! break; Document doc = new Document(); // compute boost ! 
float boost = IndexSegment.calculateBoost( ! fo.getFetchListEntry().getPage().getScore(), scorePower, boostByLinkCount, fo.getAnchors().length); doc.add(new Field("sd", name + "|" + i, true, false, false)); ! // doc.add(new Field("uh", ! // MD5Hash.digest(fo.getUrl().toString()).toString(), true, true, false)); ! // doc.add(new Field("ch", fo.getMD5Hash().toString(), ! // true, true, false)); ! doc.add(new Field("time", ! DateField.timeToString(fo.getFetchDate()), true, false, false)); ! // doc.add(new Field("score", boost + "", true, false, false)); ! // doc.add(new Field("ul", fo.getUrl().toString().length() + "", true, ! // false, false)); ! ! // Hash up the content hash, the url itself and the collection name. ! String hashStr = fo.getMD5Hash().toString() + fo.getUrl().toString() + ! pd.getMetadata().getProperty("collection"); ! doc.add(new Field("ucc", MD5Hash.digest(hashStr).toString(), true, true, ! false)); iw.addDocument(doc); processedRecords++; *************** *** 298,411 **** } iw.close(); ! LOG.info("* Optimizing index took " + (System.currentTimeMillis() - s1) + " ms"); ! LOG.info("* Skipping deduplicate step..."); ! // LOG.info("* Removing duplicate entries..."); ! // stage = SegmentMergeStatus.STAGE_DEDUP; ! IndexReader ir = IndexReader.open(masterDir); ! // int i = 0; ! // long cnt = 0L; ! // processedRecords = 0L; ! // s1 = System.currentTimeMillis(); ! // delta = s1; ! // TermEnum te = ir.terms(); ! // while(te.next()) { ! // Term t = te.term(); ! // if (t == null) continue; ! // if (!(t.field().equals("ch") || t.field().equals("uh"))) continue; ! // cnt++; ! // processedRecords = cnt / 2; ! // if (cnt > 0 && (cnt % (LOG_STEP * 2) == 0)) { ! // LOG.info(" Processed " + processedRecords + " records (" + ! // (float)(LOG_STEP * 1000)/(float)(System.currentTimeMillis() - delta) + " rec/s)"); ! // delta = System.currentTimeMillis(); ! // } ! // // Enumerate all docs with the same URL hash or content hash ! // TermDocs td = ir.termDocs(t); ! 
// if (td == null) continue; ! // if (t.field().equals("uh")) { ! // // Keep only the latest version of the document with ! // // the same url hash. Note: even if the content ! // // hash is identical, other metadata may be different, so even ! // // in this case it makes sense to keep the latest version. ! // int id = -1; ! // String time = null; ! // Document doc = null; ! // while (td.next()) { ! // int docid = td.doc(); ! // if (!ir.isDeleted(docid)) { ! // doc = ir.document(docid); ! // if (time == null) { ! // time = doc.get("time"); ! // id = docid; ! // continue; ! // } ! // String dtime = doc.get("time"); ! // // "time" is a DateField, and can be compared lexicographically ! // if (dtime.compareTo(time) > 0) { ! // if (id != -1) { ! // ir.delete(id); ! // } ! // time = dtime; ! // id = docid; ! // } else { ! // ir.delete(docid); ! // } ! // } ! // } ! // } else if (t.field().equals("ch")) { ! // // Keep only the version of the document with ! // // the highest score, and then with the shortest url. ! // int id = -1; ! // int ul = 0; ! // float score = 0.0f; ! // Document doc = null; ! // while (td.next()) { ! // int docid = td.doc(); ! // if (!ir.isDeleted(docid)) { ! // doc = ir.document(docid); ! // if (ul == 0) { ! // try { ! // ul = Integer.parseInt(doc.get("ul")); ! // score = Float.parseFloat(doc.get("score")); ! // } catch (Exception e) {}; ! // id = docid; ! // continue; ! // } ! // int dul = 0; ! // float dscore = 0.0f; ! // try { ! // dul = Integer.parseInt(doc.get("ul")); ! // dscore = Float.parseFloat(doc.get("score")); ! // } catch (Exception e) {}; ! // int cmp = Float.compare(dscore, score); ! // if (cmp == 0) { ! // // equal scores, select the one with shortest url ! // if (dul < ul) { ! // if (id != -1) { ! // ir.delete(id); ! // } ! // ul = dul; ! // id = docid; ! // } else { ! // ir.delete(docid); ! // } ! // } else if (cmp < 0) { ! // ir.delete(docid); ! // } else { ! // if (id != -1) { ! // ir.delete(id); ! // } ! // ul = dul; ! 
// id = docid; ! // } ! // } ! // } ! // } ! // } ! // // ! // // keep the IndexReader open... ! // // ! // ! // LOG.info("* Deduplicating took " + (System.currentTimeMillis() - s1) + " ms"); stage = SegmentMergeStatus.STAGE_WRITING; processedRecords = 0L; --- 311,375 ---- } iw.close(); ! LOG.info("* Optimizing index took " + (System.currentTimeMillis() - s1) + ! " ms"); ! LOG.info("* Dedupling based off hash of content-md5 + url + collection..."); ! stage = SegmentMergeStatus.STAGE_DEDUP; ! IndexReader ir = IndexReader.open(masterDir); ! int i = 0; ! long cnt = 0L; ! processedRecords = 0L; ! s1 = System.currentTimeMillis(); ! delta = s1; ! TermEnum te = ir.terms(); ! while(te.next()) { ! Term t = te.term(); ! if (t == null) continue; ! if (!(t.field().equals("ucc"))) continue; ! cnt++; ! processedRecords = cnt / 2; ! if (cnt > 0 && (cnt % (LOG_STEP * 2) == 0)) { ! LOG.info(" Processed " + processedRecords + " records (" + ! (float)(LOG_STEP * 1000)/(float)(System.currentTimeMillis() - delta) + ! " rec/s)"); ! delta = System.currentTimeMillis(); ! } ! // Enumerate all docs with the same URL + content + collection hash. ! TermDocs td = ir.termDocs(t); ! if (td == null) continue; ! if (t.field().equals("ucc")) { ! // Keep only the latest version of the document with ! // the same url + content + collection hash. ! int id = -1; ! String time = null; ! Document doc = null; ! while (td.next()) { ! int docid = td.doc(); ! if (!ir.isDeleted(docid)) { ! doc = ir.document(docid); ! if (time == null) { ! time = doc.get("time"); ! id = docid; ! continue; ! } ! String dtime = doc.get("time"); ! // "time" is a DateField, and can be compared lexicographically ! if (dtime.compareTo(time) > 0) { ! if (id != -1) { ! ir.delete(id); ! } ! time = dtime; ! id = docid; ! } else { ! ir.delete(docid); ! } ! } ! } ! } ! } ! // ! // keep the IndexReader open... ! // ! ! 
LOG.info("* Deduplicating took " + (System.currentTimeMillis() - s1) + " ms"); stage = SegmentMergeStatus.STAGE_WRITING; processedRecords = 0L; |
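The deduplication rule introduced in the commit above (hash the content MD5, the URL and the collection name together, then keep only the most recent document per hash) can be illustrated without Lucene. The following is a minimal sketch under stated assumptions: the Doc class, its field names and the fixed-width timestamp strings are hypothetical stand-ins for the FetcherOutput and ParseData records and the DateField value used in NutchwaxSegmentMergeTool, and an in-memory map replaces the TermEnum/TermDocs walk over the "ucc" field.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative, Lucene-free sketch of "ucc" deduplication. */
final class UccDedupSketch {

    /** Hypothetical stand-in for one fetched record. */
    static final class Doc {
        final String contentMd5;
        final String url;
        final String collection;
        final String time; // assumed fixed-width, so string order equals time order

        Doc(String contentMd5, String url, String collection, String time) {
            this.contentMd5 = contentMd5;
            this.url = url;
            this.collection = collection;
            this.time = time;
        }
    }

    /** MD5 over content hash + url + collection, concatenated without separators as in the diff. */
    static String uccKey(Doc d) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(
                (d.contentMd5 + d.url + d.collection).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is a required JDK algorithm", e);
        }
    }

    /** Keeps one document per key: the latest; on equal times the first one seen wins. */
    static List<Doc> keepLatest(List<Doc> docs) {
        Map<String, Doc> latest = new LinkedHashMap<>();
        for (Doc d : docs) {
            String key = uccKey(d);
            Doc kept = latest.get(key);
            if (kept == null || d.time.compareTo(kept.time) > 0) {
                latest.put(key, d);
            }
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("abc", "http://example.org/", "c1", "20051020104048"));
        docs.add(new Doc("abc", "http://example.org/", "c1", "20051103132529"));
        docs.add(new Doc("abc", "http://example.org/", "c2", "20051020104048"));
        for (Doc d : keepLatest(docs)) {
            System.out.println(d.collection + " " + d.time);
        }
    }
}
```

Because the collection name is part of the key, identical captures of the same URL held in two different collections both survive, which is the behavior the log message describes as deduping that counts the collection name.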
From: Michael S. <sta...@us...> - 2005-10-27 16:10:00
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv30856/src/java/org/archive/access/nutch

Modified Files:
  NutchwaxSegmentMergeTool.java
Log Message:
* src/java/org/archive/access/nutch/NutchwaxSegmentMergeTool.java
  Change information message from severe to info.

Index: NutchwaxSegmentMergeTool.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch/NutchwaxSegmentMergeTool.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** NutchwaxSegmentMergeTool.java 6 Oct 2005 01:45:35 -0000 1.2
--- NutchwaxSegmentMergeTool.java 27 Oct 2005 16:09:52 -0000 1.3
***************
*** 225,229 ****
  }
  masters.add(masterDir);
! LOG.severe("MasterDir is " + masterDir.toString());
  IndexWriter iw = new IndexWriter(masterDir, new WhitespaceAnalyzer(), true);
  iw.setUseCompoundFile(false);
--- 225,229 ----
  }
  masters.add(masterDir);
! LOG.info("MasterDir is " + masterDir.toString());
  IndexWriter iw = new IndexWriter(masterDir, new WhitespaceAnalyzer(), true);
  iw.setUseCompoundFile(false);
|
From: Sverre B. <sv...@us...> - 2005-10-26 09:21:13
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/articles In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv31313/src/articles Modified Files: what-is-wera.xml Log Message: Added section on WERA future Index: what-is-wera.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wera/src/articles/what-is-wera.xml,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** what-is-wera.xml 21 Oct 2005 07:33:39 -0000 1.4 --- what-is-wera.xml 26 Oct 2005 09:21:02 -0000 1.5 *************** *** 138,200 **** </listitem> </itemizedlist> ! <section> ! <title>Practical use</title> ! <para>The original vision for the <ulink ! url="http://nwa.nb.no">NwaToolset</ulink> (the predecessor of Wera) was ! to enable search across the different Nordic Web Archives and provide ! seamless navigation within the different archives. The ability to search ! across the different indexes was solved by the using <ulink ! url="http://fastsearch.com/">Fast Search & Transfer</ulink>'s multi ! node architecture. To enable Wera to retrieve a particular document with ! a given <literal>aid</literal> (Archive ID) from the right archive the ! collection field was introduced in the index (also present in the ! NutchWax index). The Wera config file holds the mapping from collection ! to archive (or rather Wera installation).</para> ! <para>Another reason to include the collection field was to ensure that ! the actual link rewriting was done by the owner of the document. Each ! archive holder would have to set up their own Wera installation. When ! one Wera was requesting a document from a remote archive, the remote ! Wera should make the necessary changes to the document before delivering ! it to the calling Wera. The reason for this was to make sure that the ! owner had full control over what was delivered to the calling site, thus ! 
being able to threat the document in accordance with local policies ! rather than the policies of the caller site. The figure below ! illustrates the currently supported use of mapping between collection ! and archive nodes.</para> ! <figure> ! <title>Wera interfacing several archive nodes</title> ! <mediaobject> ! <imageobject> ! <imagedata fileref="images/wera3.png" /> ! </imageobject> ! </mediaobject> ! </figure> ! <para>In the Wera installation of <emphasis>W1</emphasis> the different ! collections indexed in NutchWax are mapped to corresponding Wera ! installations of <emphasis>W2- Wn</emphasis>. When the timeline view on ! W1 encounters a resource located on a different node (e.g. the ! collection mapping points to the Wera installation of ! <emphasis>W2</emphasis>) it requests that resource from the Wera ! installation at <literal>W2</literal>. Wera at <literal>W2</literal> ! fetches the resource from its Retriever and does the necessary changes ! to the file before delivering it to Wera at <literal>W1</literal> (e.g. ! inserts javascript link rewriter or rewrites it server side). When Wera ! at <literal>W1</literal> receives this file it does an additional ! rewrite in order to have the links point to itself rather than to ! <literal>W2</literal>'s Wera.</para> ! <para>In a real-life large scale Web Archive where the ARC files are ! distributed across tens or hundreds of hosts it will not be practical to ! set up one Wera installation for each of these. A better solution will ! be to introduce communication between the different retrievers or have ! one front-end retriever interfacing all the other retrievers within one ! archive. This has to be added in a later release of Wera.</para> ! </section> </section> </article> \ No newline at end of file --- 138,236 ---- </listitem> </itemizedlist> + </section> ! <section> ! <title>Practical use</title> ! <para>The original vision for the <ulink ! 
url="http://nwa.nb.no">NwaToolset</ulink> (the predecessor of Wera) was to ! enable search across the different Nordic Web Archives and provide ! seamless navigation within the different archives. The ability to search ! across the different indexes was solved by the using <ulink ! url="http://fastsearch.com/">Fast Search & Transfer</ulink>'s multi ! node architecture. To enable Wera to retrieve a particular document with a ! given <literal>aid</literal> (Archive ID) from the right archive the ! collection field was introduced in the index (also present in the NutchWax ! index). The Wera config file holds the mapping from collection to archive ! (or rather Wera installation).</para> ! <para>Another reason to include the collection field was to ensure that ! the actual link rewriting was done by the owner of the document. Each ! archive holder would have to set up their own Wera installation. When one ! Wera was requesting a document from a remote archive, the remote Wera ! should make the necessary changes to the document before delivering it to ! the calling Wera. The reason for this was to make sure that the owner had ! full control over what was delivered to the calling site, thus being able ! to threat the document in accordance with local policies rather than the ! policies of the caller site. The figure below illustrates the currently ! supported use of mapping between collection and archive nodes.</para> ! <figure> ! <title>Wera interfacing several archive nodes</title> ! <mediaobject> ! <imageobject> ! <imagedata fileref="images/wera3.png" /> ! </imageobject> ! </mediaobject> ! </figure> ! <para>In the Wera installation of <emphasis>W1</emphasis> the different ! collections indexed in NutchWax are mapped to corresponding Wera ! installations of <emphasis>W2- Wn</emphasis>. When the timeline view on W1 ! encounters a resource located on a different node (e.g. the collection ! mapping points to the Wera installation of <emphasis>W2</emphasis>) it ! 
requests that resource from the Wera installation at ! <literal>W2</literal>. Wera at <literal>W2</literal> fetches the resource ! from its Retriever and does the necessary changes to the file before ! delivering it to Wera at <literal>W1</literal> (e.g. inserts javascript ! link rewriter or rewrites it server side). When Wera at ! <literal>W1</literal> receives this file it does an additional rewrite in ! order to have the links point to itself rather than to ! <literal>W2</literal>'s Wera.</para> ! <para>In a real-life large scale Web Archive where the ARC files are ! distributed across tens or hundreds of hosts it will not be practical to ! set up one Wera installation for each of these. A better solution will be ! to introduce communication between the different retrievers or have one ! front-end retriever interfacing all the other retrievers within one ! archive. This has to be added in a later release of Wera.</para> ! </section> ! ! <section> ! <title>The future of WERA</title> ! ! <para>As long as there are institutions using WERA, and these institutions ! see a need for fixing bugs and adding functionality, WERA will evolve. Of ! course, the actual work put into it will depend on the resources available ! at these institutions. We also hope that future enhancements of WERA will ! be funded, or partly funded by IIPC, as was the case with the work done to ! enable release 0.4.0 of WERA (and NutchWax).</para> ! ! <para>The most important requirement for a future release of WERA will be ! to support retrieval from several Web Archive hosts through one single ARC ! retriever interface. In addition we need to do something with the ! remaining bugs that didn't make it into the 0.4.0. release (handling of ! redirects and better handling of frames). There are also a few requests ! for enhancements registered that needs attention, one of them being the ! advanced search interface.</para> ! ! <para>One of the main complaints from users has been that WERA required ! 
the user to install and set up Tomcat, Apache + PHP and Perl + a number of ! CPAN modules. The dependency on Perl is long since removed but WERA still ! requires Tomcat (java Arc Retriever) and Apache (PHP web applications for ! searching and navigating). Over time, we would like WERA to move ! completely to Java, both for simplifying the install, setup and ! maintenance as well as improving the chances of getting users involved in ! the further development of WERA. Fortunately the move to Java may be done ! gradually because WERA is modular, and http is used to communicate between ! the different modules. The work of porting WERA to Java should be ! coordinated with the work done on <ulink ! url="http://archive-access.sourceforge.net/projects/wayback/">wayback</ulink>, ! to prevent implementing the same functionallity twice.</para> ! ! <para>We strongly encourage users of WERA/NutchWax to contribute by ! submitting bugs and RFE's, as well as providing feedback on the ! usefullness of the tools.</para> </section> </article> \ No newline at end of file |
From: Brad <bra...@us...> - 2005-10-26 01:17:24
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/WEB-INF In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv12135/src/webapp/WEB-INF Modified Files: web.xml Log Message: TWEAK: switched to JSReplayUI Index: web.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/WEB-INF/web.xml,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** web.xml 21 Oct 2005 03:24:40 -0000 1.3 --- web.xml 26 Oct 2005 01:17:13 -0000 1.4 *************** *** 19,23 **** --- 19,26 ---- <context-param> <param-name>replayui.class</param-name> + <!-- <param-value>org.archive.wayback.rawreplayui.RawReplayUI</param-value> + --> + <param-value>org.archive.wayback.jsreplayui.JSReplayUI</param-value> <description>Class that implements ReplayUI for this Wayback</description> </context-param> *************** *** 103,115 **** </context-param> - <!-- - <context-param> - <param-name></param-name> - <param-value></param-value> - <description></description> - </context-param> - --> - - <!-- Replay Servlet Configuration --> --- 106,109 ---- |
From: Brad <bra...@us...> - 2005-10-26 01:16:58
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/xdocs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv12039/xdocs Modified Files: faq.fml Log Message: FEATURE: added basic "what is" answer, added question and answer, "how to install" Index: faq.fml =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/xdocs/faq.fml,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** faq.fml 20 Oct 2005 01:30:37 -0000 1.1 --- faq.fml 26 Oct 2005 01:16:47 -0000 1.2 *************** *** 11,15 **** <answer> <p> ! Fill in.. </p> </answer> --- 11,79 ---- <answer> <p> ! The project is designed to replace the current Wayback Machine with an ! all Java solution that is flexible enough to provide an easy-to-use ! solution for the single-machine at-home user, as well as scaling up ! to hundreds of machines for a full historical collection. ! </p> ! <p> ! Primarily it is a few interfaces, and some core classes that utilize ! those interfaces to provide the Wayback service. Presently only ! trivial implementations of those interfaces have been developed, ! but we hope that these interfaces will allow a high degree of ! flexibility and experimentation. ! </p> ! </answer> ! </faq> ! <faq id="install"> ! <question> ! How can I install and use this? ! </question> ! <answer> ! <p> ! The project output is a .WAR file, so it can be used with any servlet ! container (but it has only been tested on Tomcat on Linux.) ! </p> ! <p> ! Once it is unpacked, there are 5 modifications that can ! be made to the web.xml file: ! <table> ! <tr> ! <td>parameter</td> ! <td>description</td> ! <td>default</td> ! </tr> ! <tr> ! <td>arcpath</td> ! <td>directory where ARC are found</td> ! <td><b>/tmp/wayback/arcs</b></td> ! </tr> ! <tr> ! <td>resourceindex.indexpath</td> ! <td>directory where index should be stored</td> ! <td><b>/tmp/wayback/index</b></td> ! </tr> ! <tr> ! 
<td>resourceindex.dbname</td> ! <td>Name of index within directory</td> ! <td><b>DB1</b></td> ! </tr> ! <tr> ! <td>indexpipeline.workpath</td> ! <td>directory where temporary files and processing state is stored</td> ! <td><b>/tmp/wayback/pipeline</b></td> ! </tr> ! <tr> ! <td>indexpipeline.runpipeline</td> ! <td>if set to '1', then new ARC files will be indexed</td> ! <td><b>1</b></td> ! </tr> ! </table> ! </p> ! <p> ! All directories MUST exist before the servlet is initialized. After ! these configurations are set, and the servlet container is running, ! the service can be accessed at http://localhost:8080/wayback/. ! Of course, you might be running on a different port, machine, or ! ContextPath, so you might need to vary the URL. </p> </answer> |
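The FAQ text in the commit above states that all configured directories must exist before the servlet is initialized. A small pre-flight check along these lines could catch a missing directory before deployment. This is a hypothetical helper, not part of wayback: the class and method names are invented for illustration, and only the three directory-valued parameters and their defaults are taken from the FAQ table.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check for the directory parameters listed in the wayback FAQ. */
final class WaybackDirCheckSketch {

    /** Directory-valued web.xml parameters with the defaults from the FAQ table. */
    static Map<String, String> defaultDirectories() {
        Map<String, String> dirs = new LinkedHashMap<>();
        dirs.put("arcpath", "/tmp/wayback/arcs");
        dirs.put("resourceindex.indexpath", "/tmp/wayback/index");
        dirs.put("indexpipeline.workpath", "/tmp/wayback/pipeline");
        return dirs;
    }

    /** Returns "parameter -> path" for every configured path that is not an existing directory. */
    static List<String> missingDirectories(Map<String, String> dirs) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> e : dirs.entrySet()) {
            if (!new File(e.getValue()).isDirectory()) {
                missing.add(e.getKey() + " -> " + e.getValue());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        for (String m : missingDirectories(defaultDirectories())) {
            System.out.println("Missing directory for " + m);
        }
    }
}
```

Running the main method against the defaults prints one line per directory that still has to be created; an empty output means the three default paths are in place.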
From: Brad <bra...@us...> - 2005-10-26 01:16:15
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/xdocs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11951/xdocs Modified Files: index.xml Log Message: TWEAK -- slightly flushed out, long ways to go Index: index.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/xdocs/index.xml,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** index.xml 20 Oct 2005 16:51:54 -0000 1.2 --- index.xml 26 Oct 2005 01:16:08 -0000 1.3 *************** *** 10,16 **** <body> <section name="Introduction"> ! <p><b>wayback</b> is an open source implementation of the ! The Internet Archive Wayback Machine. Stay tuned for first release. ! </p> </section> </body> --- 10,27 ---- <body> <section name="Introduction"> ! <p><b>wayback</b> is an open source java implementation of the ! The Internet Archive Wayback Machine. ! </p> ! <p> ! The first revision is intended to operate as a standalone webapp. ! It currently supports Archival URL queries, similar to the current ! Wayback Machine, and hopefully soon will integrate fully with ! Heritrix to provide browsing of crawled data as it is crawled. ! </p> ! <p> ! This version includes some basic ARC file indexing, so it can be directed to ! scan for and automatically index new content in the location that Heritrix ! is writing output ARC files. ! </p> </section> </body> |
From: Brad <bra...@us...> - 2005-10-26 01:15:43
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/webapp
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11845/src/webapp
Modified Files:
  help.jsp index.jsp
Log Message:
TWEAK: minimal UI improvement -- still very rough..
Index: help.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/help.jsp,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** help.jsp 20 Oct 2005 00:40:41 -0000 1.1
--- help.jsp 26 Oct 2005 01:15:35 -0000 1.2
***************
*** 1,3 ****
  <jsp:include page="template/UI-header.jsp" />
! Sorry, no help yet.
  <jsp:include page="template/UI-footer.jsp" />
--- 1,4 ----
  <jsp:include page="template/UI-header.jsp" />
! Please refer to the FAQs
! <a href="http://archive-access.sourceforge.net/projects/wayback/faq.html">here</a>.
  <jsp:include page="template/UI-footer.jsp" />
Index: index.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/index.jsp,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** index.jsp 20 Oct 2005 00:40:41 -0000 1.1
--- index.jsp 26 Oct 2005 01:15:35 -0000 1.2
***************
*** 1,3 ****
  <jsp:include page="template/UI-header.jsp" />
! This is the wayback Machine!
  <jsp:include page="template/UI-footer.jsp" />
--- 1,10 ----
  <jsp:include page="template/UI-header.jsp" />
! <p>
! This is the new Wayback Machine prototype. Any URL in ARC files accessible to
! this service can be searched above.
! </p>
! <p>
! If you have configured the ARC indexing pipeline, basic status can be accessed
! <a href="pipeline">here</a>.
! </p>
  <jsp:include page="template/UI-footer.jsp" />
|
From: Brad <bra...@us...> - 2005-10-26 01:15:16
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/template
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11715/src/webapp/template
Modified Files:
  UI-header.jsp
Log Message:
BUGFIX: after ArchivalUrl Query or PathQuery, form at top of page was missing -- the ACTION was relative, now is absolute
Index: UI-header.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/template/UI-header.jsp,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** UI-header.jsp 20 Oct 2005 00:40:40 -0000 1.1
--- UI-header.jsp 26 Oct 2005 01:15:01 -0000 1.2
***************
*** 44,48 ****
  <!-- URL FORM -->
! <form action="query" method="GET">
--- 44,48 ----
  <!-- URL FORM -->
! <form action="<%= request.getContextPath() %>/query" method="GET">
|
From: Brad <bra...@us...> - 2005-10-26 01:14:08
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/jsp/QueryUI
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11461/src/webapp/jsp/QueryUI
Modified Files:
  requestform.jsp
Log Message:
TWEAK: added minimal instructions, put FORM into TABLE to pretty it up a bit.
Index: requestform.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/jsp/QueryUI/requestform.jsp,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** requestform.jsp 20 Oct 2005 00:40:41 -0000 1.1
--- requestform.jsp 26 Oct 2005 01:14:00 -0000 1.2
***************
*** 1,12 ****
  <jsp:include page="../../template/UI-header.jsp" />
  <FORM ACTION="../../query">
! URL:<INPUT TYPE="TEXT" NAME="url" WIDTH="80"><BR>
! Exact Date:<INPUT TYPE="TEXT" NAME="date" WIDTH="80"><BR>
! Earliest Date:<INPUT TYPE="TEXT" NAME="earliest" WIDTH="80"><BR>
! Latest Date:<INPUT TYPE="TEXT" NAME="latest" WIDTH="80"><BR>
! Type:
! Query<INPUT TYPE="RADIO" NAME="type" VALUE="query" CHECKED="YES">
! PathQuery<INPUT TYPE="RADIO" NAME="type" VALUE="pathQuery">
! <INPUT TYPE="SUBMIT" VALUE="Submit">
  </FORM>
  <jsp:include page="../../template/UI-footer.jsp" />
--- 1,24 ----
  <jsp:include page="../../template/UI-header.jsp" />
+ <h2>Wayback Search form:</h2>
+ <p>The URL field is required. All date fields are optional.<br>
+ To search for a single URL only, use the Query Type.<br>
+ To search for all URLs beginning with a prefix URL, use PathQuery Type.<br>
+ </p>
+ <hr>
+ <table>
  <FORM ACTION="../../query">
! <tr><td>URL:</td><td><INPUT TYPE="TEXT" NAME="url" WIDTH="80"></td></tr>
! <tr><td>Exact Date:</td><td><INPUT TYPE="TEXT" NAME="date" WIDTH="80"></td></tr>
! <tr><td>Earliest Date:</td><td><INPUT TYPE="TEXT" NAME="earliest" WIDTH="80"></td></tr>
! <tr><td>Latest Date:</td><td><INPUT TYPE="TEXT" NAME="latest" WIDTH="80"></td></tr>
! <tr>
! <td>Type:</td>
! <td>
! Query <INPUT TYPE="RADIO" NAME="type" VALUE="query" CHECKED="YES">
! PathQuery <INPUT TYPE="RADIO" NAME="type" VALUE="pathQuery">
! </td>
! </tr>
! <tr><td colspan="2" align="left"><INPUT TYPE="SUBMIT" VALUE="Submit"></td></tr>
  </FORM>
+ </table>
  <jsp:include page="../../template/UI-footer.jsp" />
|
From: Brad <bra...@us...> - 2005-10-26 01:13:38
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/jsp/QueryUI
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11316/src/webapp/jsp/QueryUI
Modified Files:
  PathQueryResults.jsp
Log Message:
TWEAK: added HR before new URLs to help break up the results.
Index: PathQueryResults.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/jsp/QueryUI/PathQueryResults.jsp,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** PathQueryResults.jsp 20 Oct 2005 00:40:41 -0000 1.2
--- PathQueryResults.jsp 26 Oct 2005 01:13:30 -0000 1.3
***************
*** 52,56 ****
  if(newUrl) {
  %>
! <B><%= url %></B><BR>
  <%
  }
--- 52,56 ----
  if(newUrl) {
  %>
! <HR><B><%= url %></B><BR>
  <%
  }
|
From: Brad <bra...@us...> - 2005-10-26 01:12:43
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/rawreplayui
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11139/src/java/org/archive/wayback/rawreplayui
Modified Files:
  RawReplayUI.java
Log Message:
BUGFIX: now uses current timestamp as end of search, instead of last possible timestamp (in 2099...)
Index: RawReplayUI.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/rawreplayui/RawReplayUI.java,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** RawReplayUI.java 25 Oct 2005 21:42:34 -0000 1.4
--- RawReplayUI.java 26 Oct 2005 01:12:35 -0000 1.5
***************
*** 145,149 ****
  wmRequest.setExactTimestamp(Timestamp.parseBefore(dateStr));
  wmRequest.setStartTimestamp(Timestamp.earliestTimestamp());
! wmRequest.setEndTimestamp(Timestamp.latestTimestamp());
  } catch (ParseException e1) {
  e1.printStackTrace();
--- 145,149 ----
  wmRequest.setExactTimestamp(Timestamp.parseBefore(dateStr));
  wmRequest.setStartTimestamp(Timestamp.earliestTimestamp());
! wmRequest.setEndTimestamp(Timestamp.currentTimestamp());
  } catch (ParseException e1) {
  e1.printStackTrace();
|
From: Brad <bra...@us...> - 2005-10-26 01:11:39
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/jsreplayui In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv10835/src/java/org/archive/wayback/jsreplayui Added Files: JSReplayUI.java Log Message: FEATURE: new ReplayUI that adds Javascript to HTML result pages which attempts to make URLs point back to this service. --- NEW FILE: JSReplayUI.java --- /* JSReplayUI * * Created on Oct 25, 2005 * * Copyright (C) 2005 Internet Archive. * * This file is part of the wayback (crawler.archive.org). * * wayback is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or * any later version. * * wayback is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Lesser Public License for more details. * * You should have received a copy of the GNU Lesser Public License * along with wayback; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package org.archive.wayback.jsreplayui; import java.io.IOException; import java.text.ParseException; import javax.servlet.ServletOutputStream; import javax.servlet.http.HttpServletRequest; import javax.servlet.http.HttpServletResponse; import org.archive.io.arc.ARCRecord; import org.archive.wayback.core.Resource; import org.archive.wayback.core.ResourceResult; import org.archive.wayback.core.ResourceResults; import org.archive.wayback.core.Timestamp; import org.archive.wayback.core.WMRequest; import org.archive.wayback.rawreplayui.RawReplayUI; /** * ReplayUI that inserts classic Wayback Machine Javascript into pages to * rewrite images and anchors for HTML pages. 
* * @author brad * @version $Date: 2005/10/26 01:11:27 $, $Revision: 1.1 $ */ public class JSReplayUI extends RawReplayUI { /** * Constructor */ public JSReplayUI() { super(); // TODO Auto-generated constructor stub } private boolean isRawReplayResult(ResourceResult result) { if (-1 == result.getMimeType().indexOf("text/html")) { return true; } return false; } public void replayResource(WMRequest wmRequest, ResourceResult result, Resource resource, HttpServletRequest request, HttpServletResponse response, ResourceResults results) throws IOException { if (resource == null) { throw new IllegalArgumentException("No resource"); } if (result == null) { throw new IllegalArgumentException("No result"); } if (isRawReplayResult(result)) { super.replayResource(wmRequest, result, resource, request, response, results); return; } ARCRecord record = resource.getArcRecord(); record.skipHttpHeader(); copyRecordHttpHeader(response, record, true); // slurp the whole thing into RAM: byte[] bbuffer = new byte[4 * 1024]; StringBuffer sbuffer = new StringBuffer(); for (int r = -1; (r = record.read(bbuffer, 0, bbuffer.length)) != -1;) { String chunk = new String(bbuffer); sbuffer.append(chunk.substring(0, r)); } markUpPage(sbuffer, result, results, request); response.setHeader("Content-Length", "" + sbuffer.length()); ServletOutputStream out = response.getOutputStream(); out.print(new String(sbuffer)); } private void markUpPage(StringBuffer page, ResourceResult result, ResourceResults results, HttpServletRequest request) { insertBaseTag(page, result, request); insertJavascript(page, result, request); } private void insertBaseTag(StringBuffer page, ResourceResult result, HttpServletRequest request) { String resultUrl = result.getUrl(); String baseTag = "<BASE HREF=\"http://" + resultUrl + "\">"; int insertPoint = page.indexOf("<head>"); if (-1 == insertPoint) { insertPoint = page.indexOf("<HEAD>"); } if (-1 == insertPoint) { insertPoint = 0; } else { insertPoint += 6; // just after the 
tag } page.insert(insertPoint, baseTag); } private void insertJavascript(StringBuffer page, ResourceResult result, HttpServletRequest request) { String resourceTS = result.getTimestamp().getDateStr(); String nowTS; try { nowTS = Timestamp.currentTimestamp().getDateStr(); } catch (ParseException e) { nowTS = "UNKNOWN"; } String protocol = "http"; String serverName = request.getServerName(); int serverPort = request.getServerPort(); String context = request.getContextPath(); String contextPath = protocol + "://" + serverName + (serverPort == 80 ? "" : ":" + serverPort) + context + "/" + result.getTimestamp().getDateStr() + "/"; String scriptInsert = "<SCRIPT language=\"Javascript\">\n" + "<!--\n" + "\n" + "// FILE ARCHIVED ON " + resourceTS + " AND RETRIEVED FROM THE\n" + "// INTERNET ARCHIVE ON " + nowTS + ".\n" + "// JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.\n" + "//\n" + "// ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C.\n" + "// SECTION 108(a)(3)).\n" + "\n" + " var sWayBackCGI = \"" + contextPath + "\";\n" + " \n" + "function xResolveUrl(url) {\n" + " var image = new Image();\n" + " image.src = url;\n" + " return image.src;\n" + "}\n" + "function xLateUrl(aCollection, sProp) {\n" + " var i = 0;\n" + " for(i = 0; i < aCollection.length; i++) {\n" + " if (typeof(aCollection[i][sProp]) == \"string\") {\n" + " if (aCollection[i][sProp].indexOf(\"mailto:\") == -1 &&\n" + " aCollection[i][sProp].indexOf(\"javascript:\") == -1) {\n" + " if(aCollection[i][sProp].indexOf(\"http\") == 0) {\n" + " aCollection[i][sProp] = sWayBackCGI + aCollection[i][sProp];\n" + " } else {\n" + " aCollection[i][sProp] = sWayBackCGI + xResolveUrl(aCollection[i][sProp]);\n" + " }\n" + " }\n" + " }\n" + " }\n" + "}\n" + " \n" + " xLateUrl(document.getElementsByTagName(\"IMG\"),\"src\");\n" + " xLateUrl(document.getElementsByTagName(\"A\"),\"href\");\n" + " xLateUrl(document.getElementsByTagName(\"AREA\"),\"href\");\n" + " 
xLateUrl(document.getElementsByTagName(\"OBJECT\"),\"codebase\");\n" + " xLateUrl(document.getElementsByTagName(\"OBJECT\"),\"data\");\n" + " xLateUrl(document.getElementsByTagName(\"APPLET\"),\"codebase\");\n" + " xLateUrl(document.getElementsByTagName(\"APPLET\"),\"archive\");\n" + " xLateUrl(document.getElementsByTagName(\"EMBED\"),\"src\");\n" + " xLateUrl(document.getElementsByTagName(\"BODY\"),\"background\");\n" + "\n" + "// -->\n" + "\n" + "</SCRIPT>\n"; int insertPoint = page.indexOf("</body>"); if (-1 == insertPoint) { insertPoint = page.indexOf("</BODY>"); } if (-1 == insertPoint) { insertPoint = page.length(); } page.insert(insertPoint, scriptInsert); } } |
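Editorial note on insertBaseTag above: the committed code looks for the literal strings "<head>" and "<HEAD>" only, so a page using "<Head>" or "<head profile=...>" would get its BASE tag at offset 0. The following is a minimal sketch of a more tolerant lookup, not the project's code. The class name BaseTagInserter and its method are hypothetical, it works on a plain String rather than the StringBuffer used in the commit, and the href value is not escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of the wayback source tree. */
public final class BaseTagInserter {
    // Opening <head> tag in any letter case, with optional attributes.
    // The \s requirement keeps tags such as <header> from matching.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    private BaseTagInserter() {}

    /** Inserts a BASE tag just after the opening head tag, or at offset 0 if none is found. */
    public static String insertBaseTag(String page, String baseHref) {
        String baseTag = "<base href=\"" + baseHref + "\">";
        Matcher m = HEAD_OPEN.matcher(page);
        int insertPoint = m.find() ? m.end() : 0;
        return page.substring(0, insertPoint) + baseTag + page.substring(insertPoint);
    }
}
```

The fallback to offset 0 mirrors what the commit does when no head tag is present.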
From: Brad <bra...@us...> - 2005-10-26 01:10:37
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/jsreplayui In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv10481/src/java/org/archive/wayback/jsreplayui Log Message: Directory /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/jsreplayui added to the repository |
From: Brad <bra...@us...> - 2005-10-26 01:10:15
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/core In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv10314/src/java/org/archive/wayback/core Modified Files: WMRequest.java Log Message: BUGFIX: was not correctly parsing CGI Queries where url had no trailing '/' Index: WMRequest.java =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/core/WMRequest.java,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** WMRequest.java 20 Oct 2005 00:40:41 -0000 1.3 --- WMRequest.java 26 Oct 2005 01:10:00 -0000 1.4 *************** *** 269,275 **** } parseCGIArgsDates(queryMap); ! if (!requestURIStr.startsWith("http://")) { requestURIStr = "http://" + requestURIStr; } requestURI = new UURI(requestURIStr,false); setRetrieval(); --- 269,283 ---- } parseCGIArgsDates(queryMap); ! if (requestURIStr.startsWith("http://")) { ! if(-1 == requestURIStr.indexOf('/',8)) { ! requestURIStr = requestURIStr + "/"; ! } ! } else { ! if (!requestURIStr.contains("/")) { ! requestURIStr = requestURIStr + "/"; ! } requestURIStr = "http://" + requestURIStr; } + requestURI = new UURI(requestURIStr,false); setRetrieval(); *************** *** 302,306 **** } parseCGIArgsDates(queryMap); ! if (!requestURIStr.startsWith("http://")) { requestURIStr = "http://" + requestURIStr; } --- 310,321 ---- } parseCGIArgsDates(queryMap); ! if (requestURIStr.startsWith("http://")) { ! if(-1 == requestURIStr.indexOf('/',8)) { ! requestURIStr = requestURIStr + "/"; ! } ! } else { ! if (!requestURIStr.contains("/")) { ! requestURIStr = requestURIStr + "/"; ! } requestURIStr = "http://" + requestURIStr; } *************** *** 358,362 **** // the latest possible: if(origExactDateRequest == null) { ! endTimestamp = Timestamp.latestTimestamp(); } else { // no end specified, but they asked for an exact date. 
--- 373,377 ---- // the latest possible: if(origExactDateRequest == null) { ! endTimestamp = Timestamp.currentTimestamp(); } else { // no end specified, but they asked for an exact date. *************** *** 365,369 **** if(origExactDateRequest.equals(exactTimestamp.getDateStr())) { ! endTimestamp = Timestamp.latestTimestamp(); } else { endTimestamp = Timestamp.parseAfter(exactDateRequest); --- 380,384 ---- if(origExactDateRequest.equals(exactTimestamp.getDateStr())) { ! endTimestamp = Timestamp.currentTimestamp(); } else { endTimestamp = Timestamp.parseAfter(exactDateRequest); |
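Editorial note on the WMRequest change above: the same normalization block is pasted into two parse methods. A minimal sketch of that rule as one helper follows, assuming it is acceptable to factor it out. The class name RequestUrlNormalizer is hypothetical, and the sketch searches for '/' immediately after the "http://" prefix (offset 7) where the commit starts at offset 8.

```java
/** Hypothetical helper; not part of the wayback source tree. */
public final class RequestUrlNormalizer {
    private static final String SCHEME = "http://";

    private RequestUrlNormalizer() {}

    /** Returns the URL with an http:// prefix and a path of at least "/". */
    public static String normalize(String url) {
        // Strip the scheme if present so both branches share one check.
        String rest = url.startsWith(SCHEME) ? url.substring(SCHEME.length()) : url;
        if (rest.indexOf('/') == -1) {
            rest = rest + "/";   // host only: add the root path
        }
        return SCHEME + rest;
    }
}
```

For example, normalize("archive.org") and normalize("http://archive.org") both yield "http://archive.org/", while a URL that already has a path is returned unchanged apart from the prefix.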
From: Brad <bra...@us...> - 2005-10-26 01:08:30
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/arcindexer In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9934/src/java/org/archive/wayback/arcindexer Modified Files: ArcIndexer.java Log Message: BUGFIX: was including filedesc record in output Index: ArcIndexer.java =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/arcindexer/ArcIndexer.java,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** ArcIndexer.java 21 Oct 2005 03:24:40 -0000 1.3 --- ArcIndexer.java 26 Oct 2005 01:08:22 -0000 1.4 *************** *** 83,87 **** continue; } ! results.addResourceResult(result); } return results; --- 83,89 ---- continue; } ! if(result != null) { ! results.addResourceResult(result); ! } } return results; *************** *** 103,107 **** result.setMd5Fragment(meta.getDigest()); result.setMimeType(meta.getMimetype()); ! UURI uri = new UURI(meta.getUrl(), false); result.setOrigHost(uri.getHost()); --- 105,115 ---- result.setMd5Fragment(meta.getDigest()); result.setMimeType(meta.getMimetype()); ! ! String uriStr = meta.getUrl(); ! if(uriStr.startsWith("filedesc")) { ! // skip filedesc record... ! return null; ! } ! UURI uri = new UURI(uriStr, false); result.setOrigHost(uri.getHost()); |
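Editorial note on the ArcIndexer fix above: the commit returns null for a record whose URL starts with "filedesc" and makes the caller skip null results. A minimal sketch of the same filter over plain URL strings follows. The class FiledescFilter is hypothetical and stands in for the ResourceResult handling, whose full API is not shown in this message.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of the wayback source tree. */
public final class FiledescFilter {
    private FiledescFilter() {}

    /** True for the ARC header record, whose URL starts with "filedesc". */
    public static boolean isFiledesc(String url) {
        return url != null && url.startsWith("filedesc");
    }

    /** Returns only the URLs that should be added to the index, in order. */
    public static List<String> indexableUrls(List<String> urls) {
        List<String> kept = new ArrayList<String>();
        for (String url : urls) {
            if (!isFiledesc(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Filtering with a boolean check rather than a null return keeps the skip decision in one place, which is a design preference rather than a correction to the commit.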
From: Brad <bra...@us...> - 2005-10-25 21:42:45
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/simplequeryui In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3731/src/java/org/archive/wayback/simplequeryui Modified Files: SimpleQueryUI.java Log Message: BUGFIX: incorrect generation of query arguments -- only append '?' + query args if they are present. Index: SimpleQueryUI.java =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/simplequeryui/SimpleQueryUI.java,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** SimpleQueryUI.java 22 Oct 2005 00:29:20 -0000 1.3 --- SimpleQueryUI.java 25 Oct 2005 21:42:34 -0000 1.4 *************** *** 82,87 **** WMRequest wmRequest = null; Matcher matcher = null; ! ! String origRequestPath = request.getRequestURI() + "?" + request.getQueryString(); String contextPath = request.getContextPath(); if (!origRequestPath.startsWith(contextPath)) { --- 82,90 ---- WMRequest wmRequest = null; Matcher matcher = null; ! String queryString = request.getQueryString(); ! String origRequestPath = request.getRequestURI(); ! if(queryString != null) { ! origRequestPath = request.getRequestURI() + "?" + queryString; ! } String contextPath = request.getContextPath(); if (!origRequestPath.startsWith(contextPath)) { |
From: Brad <bra...@us...> - 2005-10-25 21:42:44
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/rawreplayui In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3731/src/java/org/archive/wayback/rawreplayui Modified Files: RawReplayUI.java Log Message: BUGFIX: incorrect generation of query arguments -- only append '?' + query args if they are present. Index: RawReplayUI.java =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/rawreplayui/RawReplayUI.java,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** RawReplayUI.java 22 Oct 2005 00:29:20 -0000 1.3 --- RawReplayUI.java 25 Oct 2005 21:42:34 -0000 1.4 *************** *** 98,102 **** Matcher matcher = null; ! String origRequestPath = request.getRequestURI() + "?" + request.getQueryString(); String contextPath = request.getContextPath(); if (!origRequestPath.startsWith(contextPath)) { --- 98,106 ---- Matcher matcher = null; ! String queryString = request.getQueryString(); ! String origRequestPath = request.getRequestURI(); ! if(queryString != null) { ! origRequestPath = request.getRequestURI() + "?" + queryString; ! } String contextPath = request.getContextPath(); if (!origRequestPath.startsWith(contextPath)) { |
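Editorial note on the two query-argument fixes above (SimpleQueryUI and RawReplayUI): HttpServletRequest.getQueryString() returns null when the request has no query string, so unconditional concatenation produced paths ending in "?null". A minimal sketch of the corrected join follows, written against plain strings so it runs without a servlet container. The class name RequestPaths is hypothetical.

```java
/** Hypothetical helper; not part of the wayback source tree. */
public final class RequestPaths {
    private RequestPaths() {}

    /**
     * Joins a request URI and an optional query string.
     * A null query yields the bare URI rather than "uri?null".
     */
    public static String withQuery(String requestUri, String queryString) {
        if (queryString == null) {
            return requestUri;
        }
        return requestUri + "?" + queryString;
    }
}
```

In a servlet this would be called as RequestPaths.withQuery(request.getRequestURI(), request.getQueryString()), and sharing one helper would keep the two UI classes from drifting apart.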
From: Michael S. <sta...@us...> - 2005-10-25 20:09:45
|
Update of /cvsroot/archive-access/archive-access/projects/wayback In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv13537 Modified Files: .classpath project.properties project.xml Log Message: * .classpath * project.properties * project.xml * build.xml * src/webapp/WEB-INF/lib/libidn-0.5.9.jar Add libidn jar. Index: .classpath =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/.classpath,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** .classpath 25 Oct 2005 03:23:41 -0000 1.4 --- .classpath 25 Oct 2005 20:09:31 -0000 1.5 *************** *** 16,19 **** --- 16,21 ---- path="src/webapp/WEB-INF/lib/arc-1.5.1-200510181911.jar"/> <classpathentry kind="lib" + path="src/webapp/WEB-INF/lib/libidn-0.5.9.jar"/> + <classpathentry kind="lib" path="/src/webapp/WEB-INF/lib/commons-codec-1.3.jar"/> <classpathentry kind="lib" Index: project.properties =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/project.properties,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** project.properties 25 Oct 2005 03:23:41 -0000 1.2 --- project.properties 25 Oct 2005 20:09:31 -0000 1.3 *************** *** 21,24 **** --- 21,25 ---- maven.jar.je = ${basedir}/src/webapp/WEB-INF/lib/je-2.0.83.jar maven.jar.arc = ${basedir}/src/webapp/WEB-INF/lib/arc-1.5.1-200510181911.jar + maven.jar.libidn = ${basedir}/src/webapp/WEB-INF/lib/libidn-0.5.9.jar maven.jar.commons-codec = ${basedir}/src/webapp/WEB-INF/lib/commons-codec-1.3.jar maven.jar.dsi-unimi-it = ${basedir}/src/webapp/WEB-INF/lib/dsi-unimi-it-1.0.0.kb.jar Index: project.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/project.xml,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** project.xml 25 Oct 2005 03:23:41 
-0000 1.3 --- project.xml 25 Oct 2005 20:09:31 -0000 1.4 *************** *** 211,214 **** --- 211,231 ---- </dependency> <dependency> + <id>libidn</id> + <version>0.5.9</version> + <url>http://www.gnu.org/software/libidn/</url> + <properties> + <war.bundle>true</war.bundle> + <ear.bundle>true</ear.bundle> + <ear.bundle.dir>APP-INF/lib</ear.bundle.dir> + <description>GNU Libidn is an implementation of the Stringprep, + Punycode and IDNA specifications defined by the IETF + Internationalized Domain Names (IDN) working group, used for + internationalized domain names. + </description> + <license>GNU Lesser General Public License + http://www.gnu.org/licenses/lgpl.txt</license> + </properties> + </dependency> + <dependency> <id>commons-codec</id> <version>1.3</version> |
From: Michael S. <sta...@us...> - 2005-10-25 20:09:40
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/WEB-INF/lib In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv13537/src/webapp/WEB-INF/lib Added Files: libidn-0.5.9.jar Log Message: * .classpath * project.properties * project.xml * build.xml * src/webapp/WEB-INF/lib/libidn-0.5.9.jar Add libidn jar. --- NEW FILE: libidn-0.5.9.jar --- (This appears to be a binary file; contents omitted.) |
From: Michael S. <sta...@us...> - 2005-10-25 03:23:51
|
Update of /cvsroot/archive-access/archive-access/projects/wayback In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv29017 Modified Files: .classpath project.properties project.xml Added Files: build.xml Log Message: * .classpath * project.properties * project.xml Add commons-codec and dsi lib. * build.xml Empty, placeholder build.xml (Prevents harmless exception spew during maven build). * src/webapp/WEB-INF/lib/commons-codec-1.3.jar * src/webapp/WEB-INF/lib/dsi-unimi-it-1.0.0.kb.jar Added. Index: .classpath =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/.classpath,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** .classpath 20 Oct 2005 01:37:39 -0000 1.3 --- .classpath 25 Oct 2005 03:23:41 -0000 1.4 *************** *** 15,18 **** --- 15,22 ---- <classpathentry kind="lib" path="src/webapp/WEB-INF/lib/arc-1.5.1-200510181911.jar"/> + <classpathentry kind="lib" + path="/src/webapp/WEB-INF/lib/commons-codec-1.3.jar"/> + <classpathentry kind="lib" + path="src/webapp/WEB-INF/lib/dsi-unimi-it-1.0.0.kb.jar"/> <classpathentry kind="output" path="src/webapp/WEB-INF/classes"/> </classpath> Index: project.properties =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/project.properties,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** project.properties 20 Oct 2005 01:30:36 -0000 1.1 --- project.properties 25 Oct 2005 03:23:41 -0000 1.2 *************** *** 21,24 **** --- 21,26 ---- maven.jar.je = ${basedir}/src/webapp/WEB-INF/lib/je-2.0.83.jar maven.jar.arc = ${basedir}/src/webapp/WEB-INF/lib/arc-1.5.1-200510181911.jar + maven.jar.commons-codec = ${basedir}/src/webapp/WEB-INF/lib/commons-codec-1.3.jar + maven.jar.dsi-unimi-it = ${basedir}/src/webapp/WEB-INF/lib/dsi-unimi-it-1.0.0.kb.jar # Junit properties --- NEW FILE: build.xml --- <?xml version="1.0" 
encoding="UTF-8"?> <!--Use maven to build. Ant not supported. (This is a placeholder build.xml. Without it, the maven build of src will try to autogenerate an ant build file spewing an ugly exception into the build). --> Index: project.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/project.xml,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** project.xml 20 Oct 2005 16:51:54 -0000 1.2 --- project.xml 25 Oct 2005 03:23:41 -0000 1.3 *************** *** 210,213 **** --- 210,247 ---- </properties> </dependency> + <dependency> + <id>commons-codec</id> + <version>1.3</version> + <url>http://jakarta.apache.org/commons/codec/</url> + <properties> + <war.bundle>true</war.bundle> + <ear.bundle>true</ear.bundle> + <ear.bundle.dir>APP-INF/lib</ear.bundle.dir> + <description>Commons Codec provides implementations of common + encoders and decoders such as Base64, Hex, various phonetic + encodings, and URLs.</description> + <license>Apache 2.0 + http://www.apache.org/licenses/LICENSE-2.0</license> + </properties> + </dependency> + <dependency> + <id>dsi-unimi-it</id> + <version>1.0.0</version> + <url>http://mg4j.dsi.unimi.it/</url> + <properties> + <war.bundle>true</war.bundle> + <ear.bundle>true</ear.bundle> + <ear.bundle.dir>APP-INF/lib</ear.bundle.dir> + <description>Alternatives to String, + StringBuffer, unsynchronized I/0, and a ConsistentHashFunction. 
+ Made from subsets of mg4j-0.9.1 and fastutil-4.4.0, + -- two jars that came of the ubicrawler project, + http://ubi0.iit.cnr.it/projects/ubi/ -- using autojar: + java -jar ~/workspace/autojar-1.2.2/autojar-1.2.2.jar -v -o + dss.unimi.it-1.0.0.jar -c fastutil-4.4.0/fastutil-4.4.0.jar:mg4j-0.9.1/mg4j-0.9.1.jar:ubix-1.0.3/ubix-1.0.3.jar: it.unimi.dsi.mg4j.util.MutableString.class it.unimi.dsi.mg4j.io.FastBufferedInputStream.class it.unimi.dsi.mg4j.io.FastBufferedOutputStream.class it.unimi.dsi.mg4j.io.FastBufferedReader.class it.unimi.dsi.mg4j.io.FastByteArrayInputStream.class it.unimi.dsi.mg4j.io.FastByteArrayOutputStream.class it.unimi.dsi.mg4j.io.FastMultiByteArrayInputStream.class it.unimi.dsi.ubix.ConsistentHashFunction.class</description> + <license>MG4J, ConsistentHashFunction, and fastutils are + LGPL</license> + </properties> + </dependency> </dependencies> |