Revision: 2530
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2530&view=rev
Author:   bradtofel
Date:     2008-08-08 23:35:06 +0000 (Fri, 08 Aug 2008)

Log Message:
-----------
TWEAK: improvements to logging.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBUpdater.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBUpdater.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBUpdater.java	2008-08-08 23:03:50 UTC (rev 2529)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBUpdater.java	2008-08-08 23:35:06 UTC (rev 2530)
@@ -90,8 +90,9 @@
             if(update.getName().endsWith(TMP_SUFFIX)) {
                 continue;
             }
-            updated++;
-            synchronize(update);
+            if(synchronize(update)) {
+                updated++;
+            }
         }
         return updated;
     }
@@ -160,10 +161,8 @@
                 int updated = updater.synchronizeIncoming();
 
                 if(updated > 0) {
-                    LOGGER.info("Updated " + updated + " files..");
                     sleepInterval = runInterval;
                 } else {
-                    LOGGER.info("Updated ZERO files..");
                     sleepInterval += runInterval;
                 }
                 sleep(sleepInterval);

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
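The two behaviours touched by rev 2530 (counting only the files whose synchronization actually succeeded, and resetting or lengthening the sleep interval depending on whether any work was done) can be sketched in isolation. This is only a minimal illustration: the class `BackoffSketch`, its method names, and the `tmpSuffix` parameter are hypothetical and are not part of the Wayback code base, and the real value of `TMP_SUFFIX` is not shown in the diff.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not the actual ResourceFileLocationDBUpdater.
final class BackoffSketch {

    /**
     * Counts only the names for which the synchronize step reports success.
     * Names ending in tmpSuffix are skipped, as in the diff above.
     */
    static int countSynchronized(List<String> names, String tmpSuffix,
            Predicate<String> synchronize) {
        int updated = 0;
        for (String name : names) {
            if (name.endsWith(tmpSuffix)) {
                continue;
            }
            if (synchronize.test(name)) {
                updated++;
            }
        }
        return updated;
    }

    /**
     * Returns the next sleep interval: back to the base interval after a
     * productive run, one more base interval longer after an idle run.
     */
    static long nextSleepInterval(long current, long runInterval, int updated) {
        return updated > 0 ? runInterval : current + runInterval;
    }
}
```

With this shape, an idle updater sleeps progressively longer, and the first productive pass brings it back to the base interval.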
From: <bra...@us...> - 2008-08-08 23:03:43
|
Revision: 2529 http://archive-access.svn.sourceforge.net/archive-access/?rev=2529&view=rev Author: bradtofel Date: 2008-08-08 23:03:50 +0000 (Fri, 08 Aug 2008) Log Message: ----------- TWEAK: improvements to logging. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/updater/LocalResourceIndexUpdater.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/IndexQueueUpdater.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/IndexWorker.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBUpdater.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFileSourceUpdater.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/updater/LocalResourceIndexUpdater.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/updater/LocalResourceIndexUpdater.java 2008-08-08 23:02:29 UTC (rev 2528) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/updater/LocalResourceIndexUpdater.java 2008-08-08 23:03:50 UTC (rev 2529) @@ -304,7 +304,7 @@ sleepInterval = runInterval; } } catch (InterruptedException e) { - e.printStackTrace(); + LOGGER.info("Shutting Down."); return; } } Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/IndexQueueUpdater.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/IndexQueueUpdater.java 
2008-08-08 23:02:29 UTC (rev 2528) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/IndexQueueUpdater.java 2008-08-08 23:03:50 UTC (rev 2529) @@ -87,7 +87,9 @@ CloseableIterator<String> newNames = db.getNamesBetweenMarks(lastMarkPoint, currentMarkPoint); while(newNames.hasNext()) { - queue.enqueue(newNames.next()); + String newName = newNames.next(); + LOGGER.info("Queued " + newName + " for indexing."); + queue.enqueue(newName); added++; } newNames.close(); @@ -143,15 +145,13 @@ int updated = updater.updateQueue(); if(updated > 0) { - LOGGER.info("Updated " + updated + " files.."); sleepInterval = runInterval; } else { - LOGGER.info("Updated ZERO files.."); sleepInterval += runInterval; } sleep(sleepInterval); } catch (InterruptedException e) { - e.printStackTrace(); + LOGGER.info("Shutting Down."); return; } catch (IOException e) { e.printStackTrace(); Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/IndexWorker.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/IndexWorker.java 2008-08-08 23:02:29 UTC (rev 2528) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/IndexWorker.java 2008-08-08 23:03:50 UTC (rev 2529) @@ -103,6 +103,7 @@ try { if(pathsOrUrls != null) { for(String pathOrUrl : pathsOrUrls) { + LOGGER.info("Indexing " + name + " from " + pathOrUrl); CloseableIterator<CaptureSearchResult> itr = indexFile(pathOrUrl); target.addSearchResults(name, itr); itr.close(); @@ -151,17 +152,15 @@ boolean worked = worker.doWork(); if(worked) { - LOGGER.info("Did work, no sleep.."); sleepInterval = 0; } else { - LOGGER.info("No Work to do - sleeping.."); sleepInterval += runInterval; } if(sleepInterval > 0) { sleep(sleepInterval); } } catch 
(InterruptedException e) { - e.printStackTrace(); + LOGGER.info("Shutting Down."); return; } catch (IOException e) { e.printStackTrace(); Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBUpdater.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBUpdater.java 2008-08-08 23:02:29 UTC (rev 2528) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBUpdater.java 2008-08-08 23:03:50 UTC (rev 2529) @@ -168,7 +168,7 @@ } sleep(sleepInterval); } catch (InterruptedException e) { - e.printStackTrace(); + LOGGER.info("Shutting Down."); return; } catch (IOException e) { e.printStackTrace(); Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFileSourceUpdater.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFileSourceUpdater.java 2008-08-08 23:02:29 UTC (rev 2528) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFileSourceUpdater.java 2008-08-08 23:03:50 UTC (rev 2529) @@ -125,7 +125,7 @@ ". Not sleeping."); } } catch (InterruptedException e) { - e.printStackTrace(); + LOGGER.info("Shutting Down."); return; } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2528
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2528&view=rev
Author:   bradtofel
Date:     2008-08-08 23:02:29 +0000 (Fri, 08 Aug 2008)

Log Message:
-----------
HACKHACK: moved log creation after BDB creation - default configuration may need locationdb directory to be created.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/BDBResourceFileLocationDB.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/BDBResourceFileLocationDB.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/BDBResourceFileLocationDB.java	2008-08-08 23:00:44 UTC (rev 2527)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/BDBResourceFileLocationDB.java	2008-08-08 23:02:29 UTC (rev 2528)
@@ -86,16 +86,16 @@
     }
 
     public void init() throws IOException {
-        if(logPath == null) {
-            throw new IOException("No logPath");
-        }
-        log = new ResourceFileLocationDBLog(logPath);
         bdb = new BDBRecordSet();
         try {
             bdb.initializeDB(bdbPath,bdbName);
         } catch (DatabaseException e) {
             throw wrapDBException(e);
         }
+        if(logPath == null) {
+            throw new IOException("No logPath");
+        }
+        log = new ResourceFileLocationDBLog(logPath);
     }
 
     /**
Revision: 2527
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2527&view=rev
Author:   bradtofel
Date:     2008-08-08 23:00:44 +0000 (Fri, 08 Aug 2008)

Log Message:
-----------
BUGFIX(unreported): NPE was occurring with missing directory.
TWEAK: cleaned up logging of missing directory.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/DirectoryResourceFileSource.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/DirectoryResourceFileSource.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/DirectoryResourceFileSource.java	2008-08-08 22:59:14 UTC (rev 2526)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/DirectoryResourceFileSource.java	2008-08-08 23:00:44 UTC (rev 2527)
@@ -29,6 +29,7 @@
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;
+import java.util.logging.Logger;
 
 /**
  * Local directory tree holding ARC and WARC files.
@@ -37,6 +38,8 @@
  * @version $Date$, $Revision$
  */
 public class DirectoryResourceFileSource implements ResourceFileSource {
+    private static final Logger LOGGER =
+        Logger.getLogger(DirectoryResourceFileSource.class.getName());
 
     private static char SEPRTR = '_';
     private String name = null;
@@ -68,16 +71,22 @@
      */
     private void populateFileList(ResourceFileList list, File root,
             boolean recurse) throws IOException {
-
-        File[] files = root.listFiles();
-        for(File file : files) {
-            if(file.isFile() && filter.accept(root, file.getName())) {
-                ResourceFileLocation location = new ResourceFileLocation(
-                        file.getName(),file.getAbsolutePath());
-                list.add(location);
-            } else if(recurse && file.isDirectory()){
-                populateFileList(list, file, recurse);
+        if(root.isDirectory()) {
+            File[] files = root.listFiles();
+            if(files != null) {
+                for(File file : files) {
+                    if(file.isFile() && filter.accept(root, file.getName())) {
+                        ResourceFileLocation location = new ResourceFileLocation(
+                                file.getName(),file.getAbsolutePath());
+                        list.add(location);
+                    } else if(recurse && file.isDirectory()){
+                        populateFileList(list, file, recurse);
+                    }
+                }
             }
+        } else {
+            LOGGER.warning(root.getAbsolutePath() + " is not a directory.");
+            return;
         }
     }
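The NPE fixed in rev 2527 comes from `File.listFiles()`, which returns `null` (rather than an empty array) when the path is not a directory or cannot be read. A minimal standalone sketch of the same guard follows; `ListFilesSketch` and `collect` are hypothetical names, and the filename filter used by the real class is left out.

```java
import java.io.File;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch of the null-safe directory walk; not the Wayback class.
final class ListFilesSketch {
    private static final Logger LOGGER =
            Logger.getLogger(ListFilesSketch.class.getName());

    /** Adds the absolute path of every regular file under root to out. */
    static void collect(List<String> out, File root, boolean recurse) {
        if (!root.isDirectory()) {
            LOGGER.warning(root.getAbsolutePath() + " is not a directory.");
            return;
        }
        File[] files = root.listFiles();
        if (files == null) {
            // listFiles() can still return null, e.g. on an I/O error.
            return;
        }
        for (File file : files) {
            if (file.isFile()) {
                out.add(file.getAbsolutePath());
            } else if (recurse && file.isDirectory()) {
                collect(out, file, recurse);
            }
        }
    }
}
```

Checking both `isDirectory()` and the `null` return keeps the walk safe even if the directory disappears between the two calls.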
From: <bra...@us...> - 2008-08-08 22:59:05
|
Revision: 2526 http://archive-access.svn.sourceforge.net/archive-access/?rev=2526&view=rev Author: bradtofel Date: 2008-08-08 22:59:14 +0000 (Fri, 08 Aug 2008) Log Message: ----------- BUGFIX(unreported) setting start/end dates is the responsibility of RequestParser. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java 2008-08-07 22:00:34 UTC (rev 2525) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java 2008-08-08 22:59:14 UTC (rev 2526) @@ -758,34 +758,34 @@ */ public void fixup(HttpServletRequest httpRequest) { extractHttpRequestInfo(httpRequest); - String startDate = get(REQUEST_START_DATE); - String endDate = get(REQUEST_END_DATE); - String exactDate = get(REQUEST_EXACT_DATE); - String partialDate = get(REQUEST_DATE); - if (partialDate == null) { - partialDate = ""; - } - if (startDate == null || startDate.length() == 0) { - put(REQUEST_START_DATE, Timestamp - .padStartDateStr(partialDate)); - } else if (startDate.length() < 14) { - put(REQUEST_START_DATE, Timestamp - .padStartDateStr(startDate)); - } - if (endDate == null || endDate.length() == 0) { - put(REQUEST_END_DATE, Timestamp - .padEndDateStr(partialDate)); - } else if (endDate.length() < 14) { - put(REQUEST_END_DATE, Timestamp - .padEndDateStr(endDate)); - } - if (exactDate == null || exactDate.length() == 0) { - put(REQUEST_EXACT_DATE, Timestamp - .padEndDateStr(partialDate)); - } else if (exactDate.length() < 14) { - put(REQUEST_EXACT_DATE, Timestamp - .padEndDateStr(exactDate)); - } +// String startDate = get(REQUEST_START_DATE); +// String endDate = 
get(REQUEST_END_DATE); +// String exactDate = get(REQUEST_EXACT_DATE); +// String partialDate = get(REQUEST_DATE); +// if (partialDate == null) { +// partialDate = ""; +// } +// if (startDate == null || startDate.length() == 0) { +// put(REQUEST_START_DATE, Timestamp +// .padStartDateStr(partialDate)); +// } else if (startDate.length() < 14) { +// put(REQUEST_START_DATE, Timestamp +// .padStartDateStr(startDate)); +// } +// if (endDate == null || endDate.length() == 0) { +// put(REQUEST_END_DATE, Timestamp +// .padEndDateStr(partialDate)); +// } else if (endDate.length() < 14) { +// put(REQUEST_END_DATE, Timestamp +// .padEndDateStr(endDate)); +// } +// if (exactDate == null || exactDate.length() == 0) { +// put(REQUEST_EXACT_DATE, Timestamp +// .padEndDateStr(partialDate)); +// } else if (exactDate.length() < 14) { +// put(REQUEST_EXACT_DATE, Timestamp +// .padEndDateStr(exactDate)); +// } } /** This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2525 http://archive-access.svn.sourceforge.net/archive-access/?rev=2525&view=rev Author: bradtofel Date: 2008-08-07 22:00:34 +0000 (Thu, 07 Aug 2008) Log Message: ----------- INITIAL-REV: flat file partial implementation of the locationDB, does not support any of the indexing features - add, remove, and mark-related methods. Added Paths: ----------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FlatFileResourceFileLocationDB.java Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FlatFileResourceFileLocationDB.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FlatFileResourceFileLocationDB.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FlatFileResourceFileLocationDB.java 2008-08-07 22:00:34 UTC (rev 2525) @@ -0,0 +1,88 @@ +package org.archive.wayback.resourcestore.locationdb; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Iterator; + +import org.archive.wayback.util.CloseableIterator; +import org.archive.wayback.util.flatfile.FlatFile; + +public class FlatFileResourceFileLocationDB implements ResourceFileLocationDB { + private String path = null; + private FlatFile flatFile = null; + private String delimiter = "\t"; + + + public void addNameUrl(String name, String url) throws IOException { + // NO-OP + } + + public long getCurrentMark() throws IOException { + return 0; + } + + public CloseableIterator<String> getNamesBetweenMarks(long start, long end) + throws IOException { + return null; + } + + public String[] nameToUrls(String name) throws IOException { + ArrayList<String> urls = new ArrayList<String>(); + String prefix = name + delimiter; + Iterator<String> itr = 
flatFile.getRecordIterator(prefix); + while(itr.hasNext()) { + String line = itr.next(); + if(line.startsWith(prefix)) { + urls.add(line.substring(prefix.length())); + } else { + break; + } + } + if(itr instanceof CloseableIterator) { + CloseableIterator<String> citr = (CloseableIterator<String>) itr; + citr.close(); + } + String[] a = new String[urls.size()]; + for(int i=0; i < urls.size(); i++) { + a[i] = urls.get(i); + } + return a; + } + + public void removeNameUrl(String name, String url) throws IOException { + // NO-OP + } + + public void shutdown() throws IOException { + // NO-OP + } + + /** + * @param path the path to set + */ + public void setPath(String path) { + this.path = path; + flatFile = new FlatFile(path); + } + + /** + * @return the path + */ + public String getPath() { + return path; + } + + /** + * @param delimter the delimiter to set + */ + public void setDelimiter(String delimiter) { + this.delimiter = delimiter; + } + + /** + * @return the delimiter + */ + public String getDelimiter() { + return delimiter; + } +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
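The lookup in `nameToUrls` relies on records of the form `name<delimiter>url` being stored in sorted order, so that all URLs for one name are adjacent and the scan can stop at the first non-matching line. The sketch below shows that idea with a `TreeSet` standing in for the flat file. It is an assumption that `FlatFile.getRecordIterator(prefix)` starts at the first record not less than the prefix; `PrefixLookupSketch` and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Hypothetical sketch: a sorted in-memory set stands in for the sorted flat file.
final class PrefixLookupSketch {

    /** Returns every URL recorded for name, in sorted record order. */
    static String[] nameToUrls(NavigableSet<String> sortedLines, String name,
            String delimiter) {
        String prefix = name + delimiter;
        List<String> urls = new ArrayList<>();
        // Start at the first record >= prefix and stop at the first mismatch.
        for (String line : sortedLines.tailSet(prefix, true)) {
            if (!line.startsWith(prefix)) {
                break;
            }
            urls.add(line.substring(prefix.length()));
        }
        return urls.toArray(new String[0]);
    }
}
```

Including the delimiter in the prefix matters: without it, a lookup for `a.arc` would also match records for `a.arc.gz`.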
From: <bra...@us...> - 2008-08-05 01:28:37
|
Revision: 2524
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2524&view=rev
Author:   bradtofel
Date:     2008-08-05 01:28:47 +0000 (Tue, 05 Aug 2008)

Log Message:
-----------
BUGFIX (ACC-28): check that guessed charset is supported before attempting to use.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/TextDocument.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/TextDocument.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/TextDocument.java	2008-08-01 17:17:44 UTC (rev 2523)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/TextDocument.java	2008-08-05 01:28:47 UTC (rev 2524)
@@ -91,6 +91,9 @@
     private boolean isCharsetSupported(String charsetName) {
         // can you believe that this throws a runtime? Just asking if it's
         // supported!!?! They coulda just said "no"...
+        if(charsetName == null) {
+            return false;
+        }
         try {
             return Charset.isSupported(charsetName);
         } catch(IllegalCharsetNameException e) {
@@ -192,8 +195,10 @@
 
         // (5)
         detector.reset();
-
-        return charsetName;
+        if(isCharsetSupported(charsetName)) {
+            return charsetName;
+        }
+        return null;
     }
 
     /**
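The check added in rev 2524 guards against two runtime exceptions from `java.nio.charset.Charset.isSupported`: `IllegalArgumentException` for a `null` name and `IllegalCharsetNameException` for a syntactically illegal one. A minimal standalone sketch of the same guard follows; the class `CharsetCheckSketch` and the helper `supportedOrNull` are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical sketch of the guard; not the actual TextDocument class.
final class CharsetCheckSketch {

    /** Returns true only if the name is non-null, legal, and supported by this JVM. */
    static boolean isCharsetSupported(String charsetName) {
        if (charsetName == null) {
            return false;
        }
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    /** Returns the guessed name if it is usable, otherwise null. */
    static String supportedOrNull(String guessedName) {
        return isCharsetSupported(guessedName) ? guessedName : null;
    }
}
```

Returning `null` for an unusable guess lets the caller fall through to its next detection strategy instead of failing later when the charset is actually used.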
Revision: 2523
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2523&view=rev
Author:   miklosh
Date:     2008-08-01 17:17:44 +0000 (Fri, 01 Aug 2008)

Log Message:
-----------
Throw UnsupportedOperationException when searching against an index not suitable for image search.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/imagesearch/src/java/org/archive/nutchwax/imagesearch/ImageSearcherBean.java

Modified: trunk/archive-access/projects/nutchwax/imagesearch/src/java/org/archive/nutchwax/imagesearch/ImageSearcherBean.java
===================================================================
--- trunk/archive-access/projects/nutchwax/imagesearch/src/java/org/archive/nutchwax/imagesearch/ImageSearcherBean.java	2008-07-31 21:29:53 UTC (rev 2522)
+++ trunk/archive-access/projects/nutchwax/imagesearch/src/java/org/archive/nutchwax/imagesearch/ImageSearcherBean.java	2008-08-01 17:17:44 UTC (rev 2523)
@@ -188,7 +188,10 @@
             currentDoc = spans.doc();
             doc = reader.document(currentDoc);
             // Skip document with no images
-            if ("0".equals(doc.getField(ImageSearch.HAS_IMAGE_KEY).stringValue())) {
+            String hasImagesValue = doc.getField(ImageSearch.HAS_IMAGE_KEY).stringValue();
+            if (hasImagesValue == null) {
+                throw new UnsupportedOperationException("Index not suitable for image search.");
+            } else if ("0".equals(hasImagesValue)) {
                 while (more && spans.doc() == currentDoc) {
                     more = spans.next();
                 }
Revision: 2522
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2522&view=rev
Author:   bradtofel
Date:     2008-07-31 21:29:53 +0000 (Thu, 31 Jul 2008)

Log Message:
-----------
TWEAK: removed(commented out) that old, annoying log message about closing ARC files, also removed references to the LOGGER, as this leaves it unused.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ArcResource.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ArcResource.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ArcResource.java	2008-07-31 18:49:49 UTC (rev 2521)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ArcResource.java	2008-07-31 21:29:53 UTC (rev 2522)
@@ -7,7 +7,7 @@
 import java.util.Iterator;
 import java.util.Map;
 import java.util.Set;
-import java.util.logging.Logger;
+//import java.util.logging.Logger;
 
 import org.apache.commons.httpclient.Header;
 import org.archive.io.ArchiveRecord;
@@ -19,8 +19,8 @@
     /**
      * Logger for this class
      */
-    private static final Logger LOGGER = Logger.getLogger(ArcResource.class
-            .getName());
+//    private static final Logger LOGGER = Logger.getLogger(ArcResource.class
+//            .getName());
 
     /**
      * String prefix for ARC file related metadata namespace of keys within
@@ -157,7 +157,7 @@
         arcRecord.close();
         if(arcReader != null) {
             arcReader.close();
-            LOGGER.info("closed..("+arcReader+")");
+//            LOGGER.fine("closed..("+arcReader+")");
         }
     }
From: <bra...@us...> - 2008-07-31 18:49:42
|
Revision: 2521 http://archive-access.svn.sourceforge.net/archive-access/?rev=2521&view=rev Author: bradtofel Date: 2008-07-31 18:49:49 +0000 (Thu, 31 Jul 2008) Log Message: ----------- REFACTOR: moved client-side watching of location.href changes to client-rewrite.js, which is where it should have been from the start. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/js/client-rewrite.js trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/js/disclaim.js Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/js/client-rewrite.js =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/js/client-rewrite.js 2008-07-29 06:43:31 UTC (rev 2520) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/js/client-rewrite.js 2008-07-31 18:49:49 UTC (rev 2521) @@ -63,3 +63,24 @@ } } } +var interceptRunAlready = false; +function intercept_js_href_iawm(destination) { + if(!interceptRunAlready &&top.location.href != destination) { + interceptRunAlready = true; + top.location.href = sWayBackCGI+xResolveUrl(destination); + } +} +// ie triggers +href_iawmWatcher = document.createElement("a"); +top.location.href_iawm = top.location.href; +if(href_iawmWatcher.setExpression) { + href_iawmWatcher.setExpression("dummy","intercept_js_href_iawm(top.location.href_iawm)"); +} +// mozilla triggers +function intercept_js_moz(prop,oldval,newval) { + intercept_js_href_iawm(newval); + return newval; +} +if(top.location.watch) { + top.location.watch("href_iawm",intercept_js_moz); +} Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/js/disclaim.js =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/js/disclaim.js 2008-07-29 06:43:31 UTC (rev 2520) +++ 
trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/js/disclaim.js 2008-07-31 18:49:49 UTC (rev 2521) @@ -1,26 +1,3 @@ - -var interceptRunAlready = false; -function intercept_js_href_iawm(destination) { - if(!interceptRunAlready &&top.location.href != destination) { - interceptRunAlready = true; - top.location.href = sWayBackCGI+xResolveUrl(destination); - } -} -// ie triggers -href_iawmWatcher = document.createElement("a"); -top.location.href_iawm = top.location.href; -if(href_iawmWatcher.setExpression) { - href_iawmWatcher.setExpression("dummy","intercept_js_href_iawm(top.location.href_iawm)"); -} -// mozilla triggers -function intercept_js_moz(prop,oldval,newval) { - intercept_js_href_iawm(newval); - return newval; -} -if(top.location.watch) { - top.location.watch("href_iawm",intercept_js_moz); -} - var notice = "<div style='" + "position:relative;z-index:99999;"+ This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <mi...@us...> - 2008-07-29 06:43:22
|
Revision: 2520 http://archive-access.svn.sourceforge.net/archive-access/?rev=2520&view=rev Author: miklosh Date: 2008-07-29 06:43:31 +0000 (Tue, 29 Jul 2008) Log Message: ----------- Update the README with info about DocIndexer, ImageProcessor and the web UI. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/imagesearch/README.txt Modified: trunk/archive-access/projects/nutchwax/imagesearch/README.txt =================================================================== --- trunk/archive-access/projects/nutchwax/imagesearch/README.txt 2008-07-29 02:18:10 UTC (rev 2519) +++ trunk/archive-access/projects/nutchwax/imagesearch/README.txt 2008-07-29 06:43:31 UTC (rev 2520) @@ -47,13 +47,58 @@ Then install the "nutch-1.0-dev.tar.gz" tarball as normal. +Indexing +-------- +After performing the usual steps to import or fetch the files, invert +the links, indexing has to be done using the DocIndexer: + + $ bin/nutch org.archive.nutchwax.imagesearch.DocIndexer <index> <crawldb> <linkdb> <segment> ... + +DocIndexer is based on Nutch's indexer and has to be parameterized the +same way as Nutch's indexer. The difference between the two indexers is +that DocIndexer does an extra MapReduce step to determine the exact +image version to be used for image URLs embedded in HTML pages. + + +Thumbnail generation +-------------------- +Image metadata and thumbnails have to be created by the ImageProcessor: + + $ bin/nutch org.archive.nutchwax.imagesearch.ImageProcessor <segmentDir> + +This tool processes one segment at a time, making thumbnails for any +images found in the segment and recording some metadata about them. The +results of this operation are stored in a directory named "image_data" +in the segment's directory. 
+ +The ImageProcessor can be configured by the following properties: + o imagesearcher.thumbnail.quality: specifies the JPEG quality of + thumbnails (specified by an integer between 0 and 100) + o imagesearcher.thumbnail.maxSize: specifies the maximum width and + height of a thumbnail + +In order to have thumbnails shown in the search results on the web UI, +ImageProcessor has to be run for every indexed segment. However, +thumbnail generation is not needed for command-line searching. + + Searching --------- -After performing the usual steps to import or fetch the files, invert -the links and index the documents, you can search the resulting indexes -for images by: +After performing the steps needed for index generation, you can search +the resulting indexes for images by: - bin/nutch org.archive.nutchwax.imagesearch.ImageSearcherBean product + $ bin/nutch org.archive.nutchwax.imagesearch.ImageSearcherBean product This calls the ImageSearcherBean to execute a simple keyword search for "product". + + +Web deployment +-------------- +The web application for image searching can be built by invoking the +following command in this contrib's directory: + + $ ant imagesearch-war + +This will generate a WAR file named "imagesearch.war" in the "build" +directory of Nutch, which can be deployed as usual. This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2008-07-29 02:18:01
|
Revision: 2519 http://archive-access.svn.sourceforge.net/archive-access/?rev=2519&view=rev Author: bradtofel Date: 2008-07-29 02:18:10 +0000 (Tue, 29 Jul 2008) Log Message: ----------- TWEAK: now uses WaybackRequest.getContextPrefix() instead of assembling it from HttpServletRequest fields. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/ClientSideJSInsert.jsp trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Disclaimer.jsp Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/ClientSideJSInsert.jsp =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/ClientSideJSInsert.jsp 2008-07-29 02:16:35 UTC (rev 2518) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/ClientSideJSInsert.jsp 2008-07-29 02:18:10 UTC (rev 2519) @@ -8,12 +8,9 @@ UIResults results = UIResults.extractReplay(request); String requestDate = results.getResult().getCaptureTimestamp(); String contextPath = results.getURIConverter().makeReplayURI(requestDate,""); -String contextRoot = request.getScheme() + "://" + request.getServerName() + ":" - + request.getServerPort() + request.getContextPath(); - -String jsUrl = contextRoot + "/replay/client-rewrite.js"; +String contextRoot = results.getWbRequest().getContextPrefix(); %> <script type="text/javascript"> var sWayBackCGI = "<%= contextPath %>"; </script> -<script type="text/javascript" src="<%= jsUrl %>" ></script> +<script type="text/javascript" src="<%= contextRoot %>js/client-rewrite.js" ></script> Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Disclaimer.jsp =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Disclaimer.jsp 
2008-07-29 02:16:35 UTC (rev 2518) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Disclaimer.jsp 2008-07-29 02:18:10 UTC (rev 2519) @@ -30,13 +30,10 @@ String wmNotice = fmt.format("ReplayView.banner", resultUrl, resultDate); String wmHideNotice = fmt.format("ReplayView.bannerHideLink"); - -String contextRoot = request.getScheme() + "://" + request.getServerName() + ":" -+ request.getServerPort() + request.getContextPath(); -String jsUrl = contextRoot + "/replay/disclaim.js"; +String contextRoot = results.getWbRequest().getContextPrefix(); %> <script type="text/javascript"> var wmNotice = "<%= wmNotice %><%= dupeMsg %>"; var wmHideNotice = "<%= wmHideNotice %>"; </script> -<script type="text/javascript" src="<%= jsUrl %>"></script> +<script type="text/javascript" src="<%= contextRoot %>js/disclaim.js"></script> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2518
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2518&view=rev
Author:   bradtofel
Date:     2008-07-29 02:16:35 +0000 (Tue, 29 Jul 2008)

Log Message:
-----------
FEATURE: added translation of old ORIGINAL-HOST field to ORIGINAL-URL field

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/CDXLineToSearchResultAdapter.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/CDXLineToSearchResultAdapter.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/CDXLineToSearchResultAdapter.java	2008-07-29 02:15:02 UTC (rev 2517)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/CDXLineToSearchResultAdapter.java	2008-07-29 02:16:35 UTC (rev 2518)
@@ -27,6 +27,7 @@
 
 import org.archive.wayback.core.CaptureSearchResult;
 import org.archive.wayback.util.Adapter;
+import org.archive.wayback.util.url.UrlOperations;
 
 /**
  * Adapter that converts a CDX record String into a CaptureSearchResult
@@ -36,6 +37,43 @@
  */
 public class CDXLineToSearchResultAdapter implements Adapter<String,CaptureSearchResult> {
+
+    private final static String SCHEME_STRING = "://";
+    private final static String DEFAULT_SCHEME = "http://";
+
+    private static int getEndOfHostIndex(String url) {
+        int portIdx = url.indexOf(UrlOperations.PORT_SEPARATOR);
+        int pathIdx = url.indexOf(UrlOperations.PATH_START);
+        if(portIdx == -1 && pathIdx == -1) {
+            return url.length();
+        }
+        if(portIdx == -1) {
+            return pathIdx;
+        }
+        if(pathIdx == -1) {
+            return portIdx;
+        }
+        if(pathIdx > portIdx) {
+            return portIdx;
+        } else {
+            return pathIdx;
+        }
+    }
+
+    /* (non-Javadoc)
+     * @see org.archive.wayback.util.Adapter#adapt(java.lang.Object)
+     */
+    public CaptureSearchResult adapt(CaptureSearchResult o) {
+        String urlKey = o.getUrlKey();
+        StringBuilder sb = new StringBuilder(urlKey.length());
+        sb.append(DEFAULT_SCHEME);
+        sb.append(o.getOriginalUrl());
+        sb.append(urlKey.substring(getEndOfHostIndex(urlKey)));
+        o.setOriginalUrl(sb.toString());
+        return o;
+    }
+
+
     public CaptureSearchResult adapt(String line) {
         return doAdapt(line);
     }
@@ -53,6 +91,15 @@
         String urlKey = tokens[0];
         String captureTS = tokens[1];
         String originalUrl = tokens[2];
+
+        // convert from ORIG_HOST to ORIG_URL here:
+        if(!originalUrl.contains(SCHEME_STRING)) {
+            StringBuilder sb = new StringBuilder(urlKey.length());
+            sb.append(DEFAULT_SCHEME);
+            sb.append(originalUrl);
+            sb.append(urlKey.substring(getEndOfHostIndex(urlKey)));
+            originalUrl = sb.toString();
+        }
         String mimeType = tokens[3];
         String httpCode = tokens[4];
         String digest = tokens[5];
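The ORIGINAL-HOST to ORIGINAL-URL translation in the revision above can be tried in isolation with a small standalone sketch. The class and method names below are hypothetical, and the values of the two separator constants are assumptions (the diff only references `UrlOperations.PORT_SEPARATOR` and `UrlOperations.PATH_START` without showing them); the logic otherwise mirrors the added lines.

```java
// Standalone sketch of the ORIG_HOST -> ORIG_URL conversion shown in rev 2518.
// Not the Wayback class itself: names are hypothetical, and the two
// separator values are ASSUMED, since the diff does not show them.
final class OrigHostToUrlSketch {
    private static final String PORT_SEPARATOR = ":"; // assumed value
    private static final String PATH_START = "/";     // assumed value
    private static final String SCHEME_STRING = "://";
    private static final String DEFAULT_SCHEME = "http://";

    // Index where the host part of a scheme-less URL key ends: the first
    // port separator or path start, whichever comes first.
    static int endOfHostIndex(String urlKey) {
        int portIdx = urlKey.indexOf(PORT_SEPARATOR);
        int pathIdx = urlKey.indexOf(PATH_START);
        if (portIdx == -1 && pathIdx == -1) {
            return urlKey.length();
        }
        if (portIdx == -1) {
            return pathIdx;
        }
        if (pathIdx == -1) {
            return portIdx;
        }
        return Math.min(portIdx, pathIdx);
    }

    // If the CDX field already holds a full URL it is returned unchanged;
    // otherwise it is treated as a bare host and combined with the
    // port/path portion of the URL key.
    static String toOriginalUrl(String urlKey, String originalField) {
        if (originalField.contains(SCHEME_STRING)) {
            return originalField;
        }
        return DEFAULT_SCHEME + originalField
                + urlKey.substring(endOfHostIndex(urlKey));
    }
}
```

For instance, `OrigHostToUrlSketch.toOriginalUrl("example.com/a/b.html", "www.example.com")` yields `http://www.example.com/a/b.html`. The scheme is always assumed to be `http://`, so an old-style record cannot recover an `https` capture this way.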
Revision: 2517
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2517&view=rev
Author:   bradtofel
Date:     2008-07-29 02:15:02 +0000 (Tue, 29 Jul 2008)

Log Message:
-----------
BUGFIX (unreported): fix to recent change in parser/fixup interaction which set endDate to the date of request, prevented forward-in-time redirects for embedded/click-thru replay requests.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/requestparser/ReplayRequestParser.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/requestparser/ReplayRequestParser.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/requestparser/ReplayRequestParser.java	2008-07-29 02:13:46 UTC (rev 2516)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/requestparser/ReplayRequestParser.java	2008-07-29 02:15:02 UTC (rev 2517)
@@ -66,6 +66,9 @@
             if (dateStr.length() == 14) {
                 startDate = getEarliestTimestamp();
                 endDate = getLatestTimestamp();
+                if(endDate == null) {
+                    endDate = Timestamp.currentTimestamp().getDateStr();
+                }
             } else {
 
                 // classic behavior:
From: <bra...@us...> - 2008-07-29 02:13:37
|
Revision: 2516
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2516&view=rev
Author:   bradtofel
Date:     2008-07-29 02:13:46 +0000 (Tue, 29 Jul 2008)

Log Message:
-----------
TWEAK: added some common cookie keys to omit from generated query URLs.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java	2008-07-28 21:58:30 UTC (rev 2515)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java	2008-07-29 02:13:46 UTC (rev 2516)
@@ -332,6 +332,12 @@
 
     private static String UI_RESOURCE_BUNDLE_NAME = "WaybackUI";
 
+    private static String STD_LOGGED_IN_VER = "logged-in-ver";
+    private static String STD_LOGGED_IN_NAME = "logged-in-name";
+    private static String STD_LOGGED_IN_USER = "logged-in-user";
+    private static String STD_PHP_SESSION_ID = "PHPSESSID";
+    private static String STD_J_SESSION_ID = "JSESSIONID";
+
     /**
      * set of filter keys that are not forwarded to subsequent paginated
      * requests.
@@ -344,7 +350,12 @@
             REQUEST_WAYBACK_CONTEXT,
             REQUEST_AUTH_TYPE,
             REQUEST_REMOTE_USER,
-            REQUEST_LOCALE_LANG };
+            REQUEST_LOCALE_LANG,
+            STD_LOGGED_IN_USER,
+            STD_LOGGED_IN_VER,
+            STD_LOGGED_IN_NAME,
+            STD_PHP_SESSION_ID,
+            STD_J_SESSION_ID };
 
     /**
      * @return Returns the resultsPerPage.
From: <bi...@us...> - 2008-07-28 21:58:22
|
Revision: 2515
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2515&view=rev
Author:   binzino
Date:     2008-07-28 21:58:30 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Creation of NutchWAX 0.12.1 release tag.

Added Paths:
-----------
    tags/nutchwax-0_12_1/
    tags/nutchwax-0_12_1/archive/
From: <bi...@us...> - 2008-07-28 21:56:50
|
Revision: 2514
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2514&view=rev
Author:   binzino
Date:     2008-07-28 21:57:00 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Oops, wrong name format...removing.

Removed Paths:
-------------
    tags/nutchwax_0_12_1/
From: <bi...@us...> - 2008-07-28 21:54:47
|
Revision: 2513
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2513&view=rev
Author:   binzino
Date:     2008-07-28 21:54:56 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Creation of target directory for NutchWAX 0.12.1 tag.

Added Paths:
-----------
    tags/nutchwax_0_12_1/
From: <bra...@us...> - 2008-07-28 20:54:28
|
Revision: 2512 http://archive-access.svn.sourceforge.net/archive-access/?rev=2512&view=rev Author: bradtofel Date: 2008-07-28 20:54:37 +0000 (Mon, 28 Jul 2008) Log Message: ----------- CONFIG: eclipse project files changed for Eclipse 3.4 Modified Paths: -------------- trunk/archive-access/projects/wayback/.classpath trunk/archive-access/projects/wayback/.project Added Paths: ----------- trunk/archive-access/projects/wayback/.settings/ trunk/archive-access/projects/wayback/.settings/org.eclipse.jdt.core.prefs trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.common.component trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.common.project.facet.core.xml trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.jsdt.ui.superType.container trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.jsdt.ui.superType.name Modified: trunk/archive-access/projects/wayback/.classpath =================================================================== --- trunk/archive-access/projects/wayback/.classpath 2008-07-28 20:52:25 UTC (rev 2511) +++ trunk/archive-access/projects/wayback/.classpath 2008-07-28 20:54:37 UTC (rev 2512) @@ -1,8 +1,18 @@ <?xml version="1.0" encoding="UTF-8"?> <classpath> - <classpathentry kind="src" path="wayback-core/src/main/java"/> - <classpathentry kind="src" path="wayback-core/src/test/java"/> + <classpathentry kind="src" output="wayback-core/target/classes" path="wayback-core/src/main/java"/> + <classpathentry kind="src" output="wayback-core/target/test-classes" path="wayback-core/src/test/java"/> + <classpathentry kind="src" output="wayback-mapreduce-prereq/target/classes" path="wayback-mapreduce-prereq/src/main/java"/> + <classpathentry kind="src" output="wayback-webapp/target/classes" path="wayback-webapp/src/main/java"/> + <classpathentry kind="src" output="wayback-webapp/target/test-classes" path="wayback-webapp/src/test/java"/> <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/> - 
<classpathentry kind="con" path="org.maven.ide.eclipse.MAVEN2_CLASSPATH_CONTAINER"/> + <classpathentry kind="con" path="org.eclipse.jst.server.core.container/org.eclipse.jst.server.tomcat.runtimeTarget/Apache Tomcat v5.5"/> + <classpathentry kind="con" path="org.eclipse.jst.j2ee.internal.module.container"/> + <classpathentry kind="con" path="org.maven.ide.eclipse.MAVEN2_CLASSPATH_CONTAINER"> + <attributes> + <attribute name="org.eclipse.jst.component.dependency" value="/WEB-INF/lib"/> + </attributes> + </classpathentry> + <classpathentry kind="con" path="org.eclipse.jst.j2ee.internal.module.container"/> <classpathentry kind="output" path="target/classes"/> </classpath> Modified: trunk/archive-access/projects/wayback/.project =================================================================== --- trunk/archive-access/projects/wayback/.project 2008-07-28 20:52:25 UTC (rev 2511) +++ trunk/archive-access/projects/wayback/.project 2008-07-28 20:54:37 UTC (rev 2512) @@ -6,11 +6,26 @@ </projects> <buildSpec> <buildCommand> + <name>org.eclipse.wst.jsdt.core.javascriptValidator</name> + <arguments> + </arguments> + </buildCommand> + <buildCommand> <name>org.eclipse.jdt.core.javabuilder</name> <arguments> </arguments> </buildCommand> <buildCommand> + <name>org.eclipse.wst.common.project.facet.core.builder</name> + <arguments> + </arguments> + </buildCommand> + <buildCommand> + <name>org.eclipse.wst.validation.validationbuilder</name> + <arguments> + </arguments> + </buildCommand> + <buildCommand> <name>org.maven.ide.eclipse.maven2Builder</name> <arguments> </arguments> @@ -19,6 +34,9 @@ <natures> <nature>org.eclipse.jdt.core.javanature</nature> <nature>org.maven.ide.eclipse.maven2Nature</nature> - <nature>com.sysdeo.eclipse.tomcat.tomcatnature</nature> + <nature>org.eclipse.wst.common.project.facet.core.nature</nature> + <nature>org.eclipse.wst.common.modulecore.ModuleCoreNature</nature> + <nature>org.eclipse.jem.workbench.JavaEMFNature</nature> + 
<nature>org.eclipse.wst.jsdt.core.jsNature</nature> </natures> </projectDescription> Added: trunk/archive-access/projects/wayback/.settings/org.eclipse.jdt.core.prefs =================================================================== --- trunk/archive-access/projects/wayback/.settings/org.eclipse.jdt.core.prefs (rev 0) +++ trunk/archive-access/projects/wayback/.settings/org.eclipse.jdt.core.prefs 2008-07-28 20:54:37 UTC (rev 2512) @@ -0,0 +1,12 @@ +#Fri Jul 25 18:19:09 PDT 2008 +eclipse.preferences.version=1 +org.eclipse.jdt.core.compiler.codegen.inlineJsrBytecode=enabled +org.eclipse.jdt.core.compiler.codegen.targetPlatform=1.5 +org.eclipse.jdt.core.compiler.codegen.unusedLocal=preserve +org.eclipse.jdt.core.compiler.compliance=1.5 +org.eclipse.jdt.core.compiler.debug.lineNumber=generate +org.eclipse.jdt.core.compiler.debug.localVariable=generate +org.eclipse.jdt.core.compiler.debug.sourceFile=generate +org.eclipse.jdt.core.compiler.problem.assertIdentifier=error +org.eclipse.jdt.core.compiler.problem.enumIdentifier=error +org.eclipse.jdt.core.compiler.source=1.5 Added: trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.common.component =================================================================== --- trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.common.component (rev 0) +++ trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.common.component 2008-07-28 20:54:37 UTC (rev 2512) @@ -0,0 +1,13 @@ +<?xml version="1.0" encoding="UTF-8"?> +<project-modules id="moduleCoreId" project-version="1.5.0"> +<wb-module deploy-name="wayback"> +<wb-resource deploy-path="/" source-path="/wayback-webapp/src/main/webapp"/> +<wb-resource deploy-path="/WEB-INF/classes" source-path="/wayback-core/src/main/java"/> +<wb-resource deploy-path="/WEB-INF/classes" source-path="/wayback-core/src/test/java"/> +<wb-resource deploy-path="/WEB-INF/classes" source-path="/wayback-mapreduce-prereq/src/main/java"/> +<wb-resource 
deploy-path="/WEB-INF/classes" source-path="/wayback-webapp/src/main/java"/> +<wb-resource deploy-path="/WEB-INF/classes" source-path="/wayback-webapp/src/test/java"/> +<property name="context-root"/> +<property name="java-output-path"/> +</wb-module> +</project-modules> Added: trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.common.project.facet.core.xml =================================================================== --- trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.common.project.facet.core.xml (rev 0) +++ trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.common.project.facet.core.xml 2008-07-28 20:54:37 UTC (rev 2512) @@ -0,0 +1,5 @@ +<?xml version="1.0" encoding="UTF-8"?> +<faceted-project> + <installed facet="jst.java" version="5.0"/> + <installed facet="jst.web" version="2.4"/> +</faceted-project> Added: trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.jsdt.ui.superType.container =================================================================== --- trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.jsdt.ui.superType.container (rev 0) +++ trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.jsdt.ui.superType.container 2008-07-28 20:54:37 UTC (rev 2512) @@ -0,0 +1 @@ +org.eclipse.wst.jsdt.launching.baseBrowserLibrary \ No newline at end of file Added: trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.jsdt.ui.superType.name =================================================================== --- trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.jsdt.ui.superType.name (rev 0) +++ trunk/archive-access/projects/wayback/.settings/org.eclipse.wst.jsdt.ui.superType.name 2008-07-28 20:54:37 UTC (rev 2512) @@ -0,0 +1 @@ +Window \ No newline at end of file This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2008-07-28 20:52:16
|
Revision: 2511 http://archive-access.svn.sourceforge.net/archive-access/?rev=2511&view=rev Author: bradtofel Date: 2008-07-28 20:52:25 +0000 (Mon, 28 Jul 2008) Log Message: ----------- INITIAL REV: developer tool which can be useful for debugging, and may provide useful examples to other .jsp developers. Added Paths: ----------- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/DebugBanner.jsp Added: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/DebugBanner.jsp =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/DebugBanner.jsp (rev 0) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/DebugBanner.jsp 2008-07-28 20:52:25 UTC (rev 2511) @@ -0,0 +1,161 @@ +<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%> +<%@ page import="java.util.Date" %> +<%@ page import="java.util.Map" %> +<%@ page import="java.util.Set" %> +<%@ page import="java.util.Iterator" %> +<%@ page import="org.archive.wayback.WaybackConstants" %> +<%@ page import="org.archive.wayback.core.CaptureSearchResult" %> +<%@ page import="org.archive.wayback.core.CaptureSearchResults" %> +<%@ page import="org.archive.wayback.core.SearchResult" %> +<%@ page import="org.archive.wayback.core.UIResults" %> +<%@ page import="org.archive.wayback.core.WaybackRequest" %> +<%@ page import="org.archive.wayback.util.StringFormatter" %> +<%@ page import="org.archive.wayback.util.html.SelectHTML" %> +<jsp:include page="/WEB-INF/template/CookieJS.jsp" flush="true" /> +<% +SelectHTML window = new SelectHTML("foo"); +window.setProps("onchange=\"SetAnchorWindow(this.value); location.reload(true);\""); +window.addOption("1 day","86400"); +window.addOption("1 week","604800"); +window.addOption("1 month","2592000"); +window.addOption("1 year","31536000"); +window.addOption("10 
years","315360000"); +UIResults results = UIResults.extractReplay(request); +WaybackRequest wbr = results.getWbRequest(); +String contextRoot = wbr.getContextPrefix(); +window.setActive(wbr.get(WaybackRequest.REQUEST_ANCHOR_WINDOW)); +Set<String> keys = wbr.keySet(); +Iterator<String> keysItr = keys.iterator(); +Map<String,String> headers = results.getResource().getHttpHeaders(); +%> +<!-- +Start of DebugBanner.jsp output +--> +<script type="text/javascript"> +function SetCookie(cookieName,cookieValue,nDays) { + var today = new Date(); + var expire = new Date(); + if (nDays==null || nDays==0) nDays=1; + expire.setTime(today.getTime() + 3600000*24*nDays); + document.cookie = cookieName+"="+escape(cookieValue) + + ";expires="+expire.toGMTString() + ";path=/"; +} +function DoCookieThing() { + var nI = document.getElementById("cookName"); + var vI = document.getElementById("cookValue"); + if(nI != null) { + if(vI != null) { + SetCookie(nI.value,vI.value,365); + } + } +} +function toggleID(id) { + var nI = document.getElementById(id); + if(nI != null) { + if(nI.style.display == "none") { + nI.style.display = "block"; + } else { + nI.style.display = "none"; + } + } +} +function showHide(id,val) { + var nI = document.getElementById(id); + if(nI != null) { + nI.style.display=val; + } +} +</script> +<div id="wm-debug-banner" style="display:none; position:relative; z-index:99999; background-color:#ffffff; font-size:10px; text-align:center; width:100%;"> + <button onmouseover="showHide('requestDiv','block');" onmouseout="showHide('requestDiv','none');">Request Parameters</button> + <div id="requestDiv" style="display:none; position:absolute; background-color:white; border:line;"> + <table style="border:0px solid #000000; margin:0px; padding:0px; border-spacing:0px; color:black; width:100%;"> +<% +while(keysItr.hasNext()) { + String key = keysItr.next(); + String val = wbr.get(key); + %> + <tr> + <td> + <%= key %> + </td> + <td> + <%= val %> + </td> + </tr> + <% +} +%> + 
</table> + </div> + <button onmouseover="showHide('resultDiv','block');" onmouseout="showHide('resultDiv','none');">Result Data</button> + <div id="resultDiv" style="display:none; position:absolute; background-color:white; border:line;" class="fdfdfd"> + <table style="border:0px solid #000000; margin:0px; padding:0px; border-spacing:0px; color:black; width:100%;"> +<% +CaptureSearchResult result = results.getResult(); +Map<String,String> resultMap = result.toCanonicalStringMap(); +keysItr = resultMap.keySet().iterator(); +while(keysItr.hasNext()) { + String key = keysItr.next(); + String val = resultMap.get(key); + %> + <tr> + <td> + <%= key %> + </td> + <td> + <%= val %> + </td> + </tr> + <% + } +%> + </table> + </div> + <%= window.draw() %> + <button onclick="toggleID('cookieDiv');">CookieForm</button> + <div id="cookieDiv" style="position:absolute; display:none; background-color:white; color:black;"> + <form name="setCookie"> + <table border=0 cellpadding=3 cellspacing=3> + <tr> + <td>Cookie Name: </td> + <td><input name=t1 type=text size=20 value="cookieName"></td> + </tr> + <tr> + <td>Cookie Value: </td> + <td><input name=t2 type=text size=20 value="cookieValue"></td> + </tr> + <tr> + <td>Must expire in: </td> + <td><input name=t3 type=text size=3 value="5"> days from today</td> + </tr> + <tr> + <td></td> + <td> + <input name=b1 type=button value="Set Cookie" + onClick="SetCookie(this.form.t1.value,this.form.t2.value,this.form.t3.value);"> + </td> + </tr> + </table> + </form> + </div> + <!-- + <select onchange="SetAnchorWindow(this.value); location.reload(true);"> + <option value="86400">1 day</option> + <option value="604800">1 week</option> + <option value="2592000">1 month</option> + <option value="31536000">1 year</option> + <option value="315360000">10 years</option> + </select> + --> +</div> +<script type="text/javascript" src="<%= contextRoot %>js/disclaim-element.js" ></script> +<script type="text/javascript"> + var debugBanner = 
document.getElementById("wm-debug-banner"); + if(debugBanner != null) { + disclaimElement(debugBanner); + } +</script> +<!-- +End of DebugBanner.jsp output +-->
From: <bra...@us...> - 2008-07-28 20:48:52
|
Revision: 2510
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2510&view=rev
Author:   bradtofel
Date:     2008-07-28 20:49:00 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
INITIAL REV: likely temporary class to simplify creation of an HTML select.. Handy until we switch to JSTL.

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/html/
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/html/SelectHTML.java

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/html/SelectHTML.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/html/SelectHTML.java	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/html/SelectHTML.java	2008-07-28 20:49:00 UTC (rev 2510)
@@ -0,0 +1,50 @@
+package org.archive.wayback.util.html;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class SelectHTML {
+    List<String[]> options = null;
+    String activeValue = null;
+    String name = null;
+    String props = null;
+    public SelectHTML(String name) {
+        this.name = name;
+        options = new ArrayList<String[]>();
+    }
+    public void addOption(String name, String value) {
+        String[] newOption = {name,value};
+        options.add(newOption);
+    }
+    public void addOption(String name) {
+        addOption(name,name);
+    }
+    public void setActive(String value) {
+        activeValue = value;
+    }
+    public void setProps(String props) {
+        this.props = props;
+    }
+    public String draw() {
+        StringBuilder sb = new StringBuilder(100);
+        sb.append("<select");
+        if(props != null) {
+            sb.append(" ").append(props);
+        }
+        sb.append(" name=\"").append(name).append("\">");
+
+        for(String[] option : options) {
+            sb.append("<option value=\"").append(option[1]).append("\"");
+            if(activeValue != null) {
+                if(activeValue.equals(option[1])) {
+                    sb.append(" selected");
+                }
+            }
+            sb.append(">");
+            sb.append(option[0]).append("</option>");
+        }
+
+        sb.append("</select>");
+        return sb.toString();
+    }
+}
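To see what markup the SelectHTML class in this revision produces, the sketch below condenses it into a standalone copy and adds a small usage method. `SelectHtmlSketch` and `example()` are hypothetical names introduced for illustration; the `draw()` logic follows the committed code.

```java
import java.util.ArrayList;
import java.util.List;

// Condensed standalone copy of the SelectHTML logic from rev 2510, renamed
// (hypothetically) so it does not clash with the real class.
final class SelectHtmlSketch {
    private final List<String[]> options = new ArrayList<String[]>();
    private final String name;
    private String activeValue;
    private String props;

    SelectHtmlSketch(String name) {
        this.name = name;
    }

    void addOption(String label, String value) {
        options.add(new String[] {label, value});
    }

    void setActive(String value) {
        activeValue = value;
    }

    void setProps(String props) {
        this.props = props;
    }

    // Same output format as the committed draw(): optional extra attributes,
    // then the name attribute, then one <option> per entry.
    String draw() {
        StringBuilder sb = new StringBuilder(100);
        sb.append("<select");
        if (props != null) {
            sb.append(" ").append(props);
        }
        sb.append(" name=\"").append(name).append("\">");
        for (String[] option : options) {
            sb.append("<option value=\"").append(option[1]).append("\"");
            if (activeValue != null && activeValue.equals(option[1])) {
                sb.append(" selected");
            }
            sb.append(">").append(option[0]).append("</option>");
        }
        sb.append("</select>");
        return sb.toString();
    }

    // Usage: two options, the second one pre-selected.
    static String example() {
        SelectHtmlSketch window = new SelectHtmlSketch("window");
        window.addOption("1 day", "86400");
        window.addOption("1 week", "604800");
        window.setActive("604800");
        return window.draw();
    }
}
```

`example()` returns `<select name="window"><option value="86400">1 day</option><option value="604800" selected>1 week</option></select>`. Labels and values are appended without HTML escaping, as in the committed class, so callers would need to pass values that are already safe for markup.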
From: <bi...@us...> - 2008-07-28 19:49:58
|
Revision: 2509 http://archive-access.svn.sourceforge.net/archive-access/?rev=2509&view=rev Author: binzino Date: 2008-07-28 19:50:07 +0000 (Mon, 28 Jul 2008) Log Message: ----------- Updated for NutchWAX 0.12.1 release. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt trunk/archive-access/projects/nutchwax/archive/INSTALL.txt trunk/archive-access/projects/nutchwax/archive/README.txt trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-07-28 19:43:10 UTC (rev 2508) +++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-07-28 19:50:07 UTC (rev 2509) @@ -1,6 +1,6 @@ HOWTO.txt -2008-05-20 +2008-07-28 Aaron Binns Table of Contents Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-07-28 19:43:10 UTC (rev 2508) +++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-07-28 19:50:07 UTC (rev 2509) @@ -1,6 +1,6 @@ INSTALL.txt -2008-07-02 +2008-07-28 Aaron Binns This installation guide assumes the reader is already familiar with @@ -46,11 +46,11 @@ Nutch SVN trunk. 
The specific SVN revision that NutchWAX 0.12 is built against is: - 673823 + 676736 To checkout this revision of Nutch, use: - $ svn checkout -r 673823 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch + $ svn checkout -r 676736 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch $ cd nutch Modified: trunk/archive-access/projects/nutchwax/archive/README.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/README.txt 2008-07-28 19:43:10 UTC (rev 2508) +++ trunk/archive-access/projects/nutchwax/archive/README.txt 2008-07-28 19:50:07 UTC (rev 2509) @@ -1,9 +1,9 @@ README.txt -2008-07-02 +2008-07-25 Aaron Binns -Welcome to NutchWAX 0.12! +Welcome to NutchWAX 0.12.1! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. @@ -76,15 +76,15 @@ ====================================================================== -This 0.12 release of NutchWAX is radically different in source-code +This 0.12.x release of NutchWAX is radically different in source-code form compared to the previous release, 0.10. -One of the design goals of 0.12 was to reduce or even eliminate the +One of the design goals of 0.12.x was to reduce or even eliminate the "copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX releases had to copy/paste/edit large chunks of Nutch source code in order to add the NutchWAX features. -Also, the NutchWAX 0.12 sources and build are designed to one day be +Also, the NutchWAX 0.12.x sources and build are designed to one day be added into mainline Nutch as a proper "contrib" package; then eventually be fully integrated into the core Nutch source code. 
Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-07-28 19:43:10 UTC (rev 2508) +++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-07-28 19:50:07 UTC (rev 2509) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2007-07-03 +2007-07-25 Aaron Binns -Release notes for NutchWAX 0.12 +Release notes for NutchWAX 0.12.1 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,28 +15,10 @@ Overview ====================================================================== -NutchWAX 0.12-beta-1 was released on June 2, 2008. We anticipated -releasing another beta mid-June with bug fixes and some minor -enhancements based on feedback from the community. +NutchWAX 0.12.1 contains some minor enhancements and fixes to NutchWAX +0.12. One of the driving forces behind some of the enhancements was +integration with the Wayback machine. -During internal testing by the Internet Archive Web Team, a few -serious problems were found, the most critical being the failure to -store different copies of the same URL when importing large batches of -archive files. - -The NutchWAX team canceled the mid-month release in order to focus on -fixing this problem. - -The good news is that not only has that problem been fixed, but the -solution is part of a broader enhancement to manage the de-duplication -of archive contnet during import and indexing. 
- -For more details on de-duplication in NutchWAX, please see - - HOWTO-dedup.txt - README-dedup.txt - - ====================================================================== Issues ====================================================================== @@ -47,16 +29,24 @@ Issues resolved in this release: -WAX-9 Entire file not imported -WAX-8 Investigate why so many PDFs fail to parse +WAX-16 + Option to skip ARC record import based on HTTP status code of + content - Fixing the first one caused nearly all of the PDF parsing errors to - disappear. +WAX-12 + Add metadata field "fileoffset" -WAX-7 Change config to that URL filters are not applied during link inversion +WAX-11 + Change metadata field name in search results from "arcname" to + "filename" - This is easily achieved by using command-line options when invoking - the Nutch "invertlinks" command. +WAX-10 + Add "exacturl" metadata field to indexing so it can be searched + as-is, not parsed/tokenized like the "url" field. -WAX-3 Observe content size limit on importing -WAX-2 Date queries cause TooManyClauses exceptions +WAX-6 + Change DateAdder to allow for implementation of URLCanonicalizer to + be defined in property. + +WAX-4 + Implementor/user-provided XSLT for OpenSearch results This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-07-28 19:43:01
|
Revision: 2508
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2508&view=rev
Author:   binzino
Date:     2008-07-28 19:43:10 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Initial revision. Added in NutchWAX version 0.12.1.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt

Added: trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt 2008-07-28 19:43:10 UTC (rev 2508)
@@ -0,0 +1,105 @@
+
+HOWTO-xslt.txt
+2008-07-25
+Aaron Binns
+
+Table of Contents
+  o Prerequisites
+    - NutchWAX HOWTO.txt
+  o Overview
+  o XSLTFilter and web.xml
+
+
+======================================================================
+Prerequisites
+======================================================================
+
+This HOWTO assumes you've already read the main NutchWAX HOWTO and are
+familiar with importing and indexing archive files with NutchWAX.
+
+Also, we assume that you are familiar with deploying the Nutch(WAX)
+web application into a servlet container such as Tomcat.
+
+
+======================================================================
+Overview
+======================================================================
+
+Nutch is bundled with two search interfaces
+
+  JSP pages: search.jsp, refine-query.jsp, etc.
+  Servlet  : OpenSearchServlet
+
+If you read the OpenSearchServlet.java source code and the search.jsp
+page, you'll notice a lot of similarity, if not duplication of code.
+
+The Internet Archive Web Team plans to improve and expand upon the
+existing OpenSearchServlet interface as well as adding more XML-based
+capabilities, including replacements for the existing JSP pages. In
+short, moving away from JSP and toward XML.
+
+But by favoring XML over JSP, how does one make an HTML UI? By adding
+XSLT to the XML interfaces.
+
+This HOWTO describes the process for adding an XSL transformation to
+the OpenSearch XML output.
+
+This shall be the blueprint for future XML-based interfaces as well.
+
+
+======================================================================
+XSLTFilter and web.xml
+======================================================================
+
+Adding an XSL transformation to an XML-based interface, such as the
+OpenSearchServlet is straightforward. Simply add the XSLTFilter to
+the servlet's path and specify the XSL transform to apply.
+
+For example, consider the default Nutch web.xml
+
+  <servlet>
+    <servlet-name>OpenSearch</servlet-name>
+    <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class>
+  </servlet>
+
+  <servlet-mapping>
+    <servlet-name>OpenSearch</servlet-name>
+    <url-pattern>/opensearch</url-pattern>
+  </servlet-mapping>
+
+Let's say we want to retain the '/opensearch' path for the XML output,
+and add the human-friendly HTML page at '/coolsearch'
+
+First, we add an additional 'servlet-mapping' for our new path:
+
+  <servlet-mapping>
+    <servlet-name>OpenSearch</servlet-name>
+    <url-pattern>/coolsearch</url-pattern>
+  </servlet-mapping>
+
+Then, we add the XSLTFilter, passing it a URL to the XSLT file
+
+  <filter>
+    <filter-name>XSLT Filter</filter-name>
+    <filter-class>org.archive.nutchwax.XSLTFilter</filter-class>
+    <init-param>
+      <param-name>xsltUrl</param-name>
+      <param-value>[URL to XSLT file]</param-value>
+    </init-param>
+  </filter>
+
+Lastly, we apply the filter to the same path as the our human-friendly
+HTML path:
+
+  <filter-mapping>
+    <filter-name>XSLT Filter</filter-name>
+    <url-pattern>/coolsearch</url-pattern>
+  </filter-mapping>
+
+This way, we have two URLs, which run the exact same
+OpenSearchServlet, but one produces the unperturbed OpenSearch XML
+output whereas the other produces human-friendly HTML output.
+
+  OpenSearch XML      : http://someserver/opensearch?query=foo
+  Human-friendly HTML : http://someserver/coolsearch?query=foo
+
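
To make the transformation step in HOWTO-xslt.txt more concrete, here is a
minimal standalone sketch of applying an XSL stylesheet to a search result
document with the JDK's javax.xml.transform API. It is not the XSLTFilter
source. The class name XsltSketch, the stylesheet, and the RSS-like shape
of the sample input are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration only; this is not the NutchWAX XSLTFilter.
class XsltSketch {

    // Assumed stylesheet: renders each <item> of an RSS-like result as a link.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\"/>"
      + "<xsl:template match=\"/\">"
      + "<html><body><ul>"
      + "<xsl:for-each select=\"rss/channel/item\">"
      + "<li><a href=\"{link}\"><xsl:value-of select=\"title\"/></a></li>"
      + "</xsl:for-each>"
      + "</ul></body></html>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given stylesheet text to the given XML text and returns the result.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input; real OpenSearch output carries more elements.
        String xml = "<rss><channel><item><title>Example</title>"
                   + "<link>http://example.org/</link></item></channel></rss>";
        System.out.println(transform(xml, STYLESHEET));
    }
}
```

A servlet filter would presumably run the same kind of transform over the
servlet's buffered response body instead of a fixed string, which is why the
same servlet can serve both '/opensearch' and '/coolsearch'.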
From: <bi...@us...> - 2008-07-28 19:40:56
Revision: 2507
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2507&view=rev
Author:   binzino
Date:     2008-07-28 19:41:05 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Updated for 0.12.1 release.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-07-28 19:34:33 UTC (rev 2506)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-07-28 19:41:05 UTC (rev 2507)
@@ -213,12 +213,15 @@
   <name>nutchwax.filter.index</name>
   <value>
     url:false:true:true
+    url:flase:true:false:true:exacturl
     orig:false
     digest:false
-    arcname:false
+    filename:false
+    fileoffset:false
     collection
     date
     type
+    length
   </value>
 </property>
 
@@ -263,7 +266,9 @@
   <name>nutchwax.filter.query</name>
   <value>
     raw:digest:false
-    raw:arcname:false
+    raw:filename:false
+    raw:fileoffset:false
+    raw:exacturl:false
     group:collection
     group:type
     field:anchor
@@ -304,6 +309,62 @@
 must be specified here.
 
 --------------------------------------------------
+nutchwax.filter.http.status
+--------------------------------------------------
+This property configures a filter with a list of ranges
+of HTTP status codes to allow.
+
+Typically, most NutchWAX implementors do not wish to import and index
+404, 500, 302 and other non-success pages. This is an inclusion
+filter, meaning that only ARC records with an HTTP status code
+matching any of the values will be imported.
+
+There is a special "unknown" value which can be used to include ARC
+records that don't have an HTTP status code (for whatever reason).
+
+The default setting provided in nutch-site.xml is to allow any 2XX
+success code:
+
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+    </value>
+  </property>
+
+But some other examples are:
+
+  Allow any 2XX success code *and* redirects, use:
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+      300-399
+    </value>
+  </property>
+
+  Be really strict about only certain codes, use:
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200
+      301
+      302
+      304
+    </value>
+  </property>
+
+  Mix of ranges and specific codes, including the "unknown"
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      Unknown
+      200
+      300-399
+    </value>
+  </property>
+
+--------------------------------------------------
 nutchwax.import.content.limit
 --------------------------------------------------
 Similar to Nutch's
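
The inclusion semantics described for nutchwax.filter.http.status can be
sketched in a few lines of Java. This is only an illustration of how a
whitespace-separated list of codes, ranges, and the "unknown" token might be
interpreted, not the NutchWAX implementation. The class name
HttpStatusRanges and its methods are hypothetical, and treating "unknown"
case-insensitively is an assumption based on the "Unknown" spelling in the
last example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an inclusion filter over HTTP status codes.
class HttpStatusRanges {
    private final List<int[]> ranges = new ArrayList<>();
    private boolean allowUnknown = false;

    // Parses a property value such as "Unknown 200 300-399".
    static HttpStatusRanges parse(String value) {
        HttpStatusRanges filter = new HttpStatusRanges();
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty property value
            }
            if (token.equalsIgnoreCase("unknown")) { // assumed case-insensitive
                filter.allowUnknown = true;
                continue;
            }
            int dash = token.indexOf('-');
            // A single code N is stored as the range N-N.
            int low = Integer.parseInt(dash < 0 ? token : token.substring(0, dash));
            int high = Integer.parseInt(dash < 0 ? token : token.substring(dash + 1));
            filter.ranges.add(new int[] { low, high });
        }
        return filter;
    }

    // A null status stands for a record with no HTTP status code.
    boolean allows(Integer status) {
        if (status == null) {
            return allowUnknown;
        }
        for (int[] range : ranges) {
            if (status >= range[0] && status <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        HttpStatusRanges filter = parse("Unknown 200 300-399");
        System.out.println(filter.allows(302));  // inside 300-399
        System.out.println(filter.allows(404));  // not listed
        System.out.println(filter.allows(null)); // no status code
    }
}
```

Because the filter is an inclusion list, a record is dropped unless some
entry matches it, so an empty value would admit nothing under this sketch.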
From: <bi...@us...> - 2008-07-28 19:34:24
Revision: 2506
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2506&view=rev
Author:   binzino
Date:     2008-07-28 19:34:33 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Corrected some type-o's.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/README-dedup.txt

Modified: trunk/archive-access/projects/nutchwax/archive/README-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README-dedup.txt 2008-07-28 19:29:16 UTC (rev 2505)
+++ trunk/archive-access/projects/nutchwax/archive/README-dedup.txt 2008-07-28 19:34:33 UTC (rev 2506)
@@ -29,7 +29,7 @@
 ======================================================================
 
 This sounds simple enough, but in practice the implementation is not
-as straightfoward as suggested by the above.
+as straightforward as suggested by the above.
 
 For one, Nutch's underlying Lucene search indexes are not easily
 modified "in place". That is, updating an existing record by adding
@@ -96,7 +96,7 @@
 
 To prevent the importing of multiple copies of the same version of a
 page, we could get the URL+digest of the page to be imported, then
-look in the existing Nutch index to see if we alread have it. If we
+look in the existing Nutch index to see if we already have it. If we
 do, do not import it, instead add the crawl date to the existing
 record in the search index.
 
@@ -109,7 +109,7 @@
 have a solution, which we'll describe in more detail later.
 
 The first doesn't seem challenging at first and in theory it isn't.
-However, in practice it is difficult becuase for a a large deployment,
+However, in practice it is difficult because for a a large deployment,
 we usually have many Lucene indexes spread over many machines. It's
 not as simple as opening up a single Lucene index on the local machine
 and searching for a matching URL+digest. In one of the deployments at
@@ -192,7 +192,7 @@
 simple example, we have an webmaster who just can't make up his mind
 on what to say.
 
-Thep point is that our CDX file will have lines of the form
+The point is that our CDX file will have lines of the form
 
   20071001 abc123 example.org/index.html
   20071002 abc123 example.org/index.html
@@ -376,7 +376,7 @@
 This is all fine and good when calling the NutchWaxBean from the
 command-line, but what about in a webapp?
 
-The NutchBean has a static method for self-initialization upon recipt
+The NutchBean has a static method for self-initialization upon receipt
 of a application startup message from the servlet container. We have
 a similar hook in NutchWaxBean, which is run after the NutchBean is
 initialized.
@@ -402,7 +402,7 @@
 Taking our example from above, whenever the page is crawled and hasn't
 changed, a revisit record would be written to the WARC file.
 
-For de-duplication, WARC files with revisit records are nice becuase
+For de-duplication, WARC files with revisit records are nice because
 the crawler is doing the duplicate detection for us. Rather than
 write a duplicate copy of the page, it writes a record that has
@@ -539,7 +539,7 @@
 your search index. This means that you'll have a search result hit
 for each copy of the page in the index. If you imported the same page
 10 times, then a search query that finds that page will find all 10
-copies and return 10 identical search results -- one for eaach copy.
+copies and return 10 identical search results -- one for each copy.
 
 
 In addition, the de-duplication feature and the add-dates feature with
@@ -548,7 +548,7 @@
 the records in the Lucene index.
 
 In this case, you would only have 1 date associated with each record:
-the date the record was imorted. Any information about subsequent
+the date the record was imported. Any information about subsequent
 revisits to the same version of the page would not be in the search
 index.
 
@@ -577,7 +577,7 @@
 The core of the change from URL to URL+digest happens in the NutchWAX
 Indexer class. In that class the segment is created and all the
 document-related information is added to it. When a document is added
-to a segment, it is written to a Haddop MapFile.
+to a segment, it is written to a Hadoop MapFile.
 
 Hadoop MapFiles act like Java Maps. They are essentially key/value
 pairs. In Nutch, the key is the URL and the value is a collection of
@@ -632,7 +632,7 @@
 Not only that, but the BasicIndexingFilter goes on to insert that
 urlString into the Lucene document in the "url" field.
 
-We work around this by configuring our NutchWAX indexing filter plugin
+We work around this by configuring our NutchWAX indexing filter plugin
 to run *after* the BasicIndexingFilter and over-write the "url" field
 with the correct URL.
 
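
The URL+digest idea that runs through README-dedup.txt can be illustrated
with a small sketch: CDX-style lines that share a URL and digest are treated
as one version of a page, and their crawl dates are collected under that
key. This is not NutchWAX code. The class name CdxDedupSketch is
hypothetical, and the "date digest url" field order is assumed from the two
sample lines quoted in the diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group crawl dates by URL+digest.
class CdxDedupSketch {

    // Each input line is assumed to look like "20071001 abc123 example.org/index.html".
    static Map<String, List<String>> datesByUrlAndDigest(List<String> cdxLines) {
        Map<String, List<String>> byKey = new LinkedHashMap<>();
        for (String line : cdxLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 3) {
                continue; // skip lines that do not have date, digest and URL
            }
            String key = fields[2] + " " + fields[1]; // URL + digest
            byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(fields[0]);
        }
        return byKey;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "20071001 abc123 example.org/index.html",
            "20071002 abc123 example.org/index.html");
        // One key, carrying both crawl dates.
        System.out.println(datesByUrlAndDigest(lines));
    }
}
```

Under this sketch only the first capture of each key would need to be
imported, with the remaining dates attached to the existing record.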