From: Brad <bra...@us...> - 2005-10-21 03:24:49
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/WEB-INF In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv14129/src/webapp/WEB-INF Modified Files: web.xml Log Message: Heavy modification of configuration to be Context-level, instead of Servlet-level, which dramatically reduces configuration redundancy. Cleaned up IndexPipeline, moved a few classes around, added a really simple JSP to view the Index and Merge queue sizes, and a filter, which both allows the index thread to start with the context, and allows access to the jsp. Index: web.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/WEB-INF/web.xml,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** web.xml 20 Oct 2005 00:40:41 -0000 1.2 --- web.xml 21 Oct 2005 03:24:40 -0000 1.3 *************** *** 4,142 **** <web-app> ! <servlet> ! <servlet-name>RetrievalServlet</servlet-name> ! <servlet-class>org.archive.wayback.servletglue.WBReplayUIServlet</servlet-class> - <init-param> - <param-name>UNUSED-replayui.class</param-name> - <param-value>org.archive.wayback.ippreplayui.InPagePresenceReplayUI</param-value> - </init-param> - <init-param> - <param-name>replayui.class</param-name> - <param-value>org.archive.wayback.rawreplayui.RawReplayUI</param-value> - </init-param> - <init-param> - <param-name>replayui.jsppath</param-name> - <param-value>jsp/ReplayUI</param-value> - </init-param> ! <init-param> ! <param-name>queryui.class</param-name> ! <param-value>org.archive.wayback.simplequeryui.SimpleQueryUI</param-value> ! </init-param> ! <init-param> ! <param-name>queryui.jsppath</param-name> ! <param-value>jsp/QueryUI</param-value> ! </init-param> ! <init-param> ! <param-name>resourcestore.class</param-name> ! <param-value>org.archive.wayback.localresourcestore.LocalARCResourceStore</param-value> ! </init-param> ! <init-param> ! 
<param-name>resourcestore.arcpath</param-name> ! <param-value>/home/brad/test-arc3</param-value> ! </init-param> - <init-param> - <param-name>resourceindex.class</param-name> - <param-value>org.archive.wayback.localbdbresourceindex.LocalBDBResourceIndex</param-value> - </init-param> - <init-param> - <param-name>resourceindex.indexPath</param-name> - <param-value>/tmp/test-db</param-value> - </init-param> - <init-param> - <param-name>resourceindex.dbName</param-name> - <param-value>DB1</param-value> - </init-param> - <init-param> - <param-name>resourceindex.arcPath</param-name> - <param-value>/home/brad/test-arc3</param-value> - </init-param> - <init-param> - <param-name>resourceindex.workPath</param-name> - <param-value>/tmp/index-pipeline</param-value> - </init-param> - <init-param> - <param-name>resourceindex.runPipeline</param-name> - <param-value>1</param-value> - </init-param> - </servlet> - <servlet-mapping> - <servlet-name>RetrievalServlet</servlet-name> - <url-pattern>/replay</url-pattern> - </servlet-mapping> ! <servlet> ! <servlet-name>QueryServlet</servlet-name> ! <servlet-class>org.archive.wayback.servletglue.WBQueryUIServlet</servlet-class> - <init-param> - <param-name>UNUSED-replayui.class</param-name> - <param-value>org.archive.wayback.ippreplayui.InPagePresenceReplayUI</param-value> - </init-param> - <init-param> - <param-name>replayui.class</param-name> - <param-value>org.archive.wayback.rawreplayui.RawReplayUI</param-value> - </init-param> - <init-param> - <param-name>replayui.jsppath</param-name> - <param-value>jsp/ReplayUI</param-value> - </init-param> - <init-param> - <param-name>queryui.class</param-name> - <param-value>org.archive.wayback.simplequeryui.SimpleQueryUI</param-value> - </init-param> - <init-param> - <param-name>queryui.jsppath</param-name> - <param-value>jsp/QueryUI</param-value> - </init-param> ! <init-param> ! <param-name>resourcestore.class</param-name> ! 
<param-value>org.archive.wayback.localresourcestore.LocalARCResourceStore</param-value> ! </init-param> ! <init-param> ! <param-name>resourcestore.arcpath</param-name> ! <param-value>/home/brad/test-arc3</param-value> ! </init-param> - <init-param> - <param-name>resourceindex.class</param-name> - <param-value>org.archive.wayback.localbdbresourceindex.LocalBDBResourceIndex</param-value> - </init-param> - <init-param> - <param-name>resourceindex.indexPath</param-name> - <param-value>/tmp/test-db</param-value> - </init-param> - <init-param> - <param-name>resourceindex.dbName</param-name> - <param-value>DB1</param-value> - </init-param> - <init-param> - <param-name>resourceindex.arcPath</param-name> - <param-value>/home/brad/test-arc3</param-value> - </init-param> - <init-param> - <param-name>resourceindex.workPath</param-name> - <param-value>/tmp/index-pipeline</param-value> - </init-param> - <init-param> - <param-name>resourceindex.runPipeline</param-name> - <param-value>1</param-value> - </init-param> - </servlet> <servlet-mapping> ! <servlet-name>QueryServlet</servlet-name> ! <url-pattern>/query</url-pattern> </servlet-mapping> ! <filter> --- 4,128 ---- <web-app> ! <context-param> ! <param-name>arcpath</param-name> ! <param-value>/tmp/wayback/arcs</param-value> ! <description> ! Directory where ARC files are found (possibly where Heritrix writes them.) ! This directory must exist. ! </description> ! </context-param> ! <!-- ReplayUI Configuration --> ! ! <context-param> ! <param-name>replayui.class</param-name> ! <param-value>org.archive.wayback.rawreplayui.RawReplayUI</param-value> ! <description>Class that implements ReplayUI for this Wayback</description> ! </context-param> ! <context-param> ! <param-name>replayui.jsppath</param-name> ! <param-value>jsp/ReplayUI</param-value> ! <description> ! RawReplayUI specific path to jsp pages. relative to webapp/ ! </description> ! 
</context-param> + <!-- QueryUI Configuration --> + + <context-param> + <param-name>queryui.class</param-name> + <param-value>org.archive.wayback.simplequeryui.SimpleQueryUI</param-value> + <description>Class that implements QueryUI for this Wayback</description> + </context-param> ! <context-param> ! <param-name>queryui.jsppath</param-name> ! <param-value>jsp/QueryUI</param-value> ! <description> ! SimpleQueryUI specific path to jsp pages. relative to webapp/ ! </description> ! </context-param> ! <!-- ResourceStore Configuration --> ! ! <context-param> ! <param-name>resourcestore.class</param-name> ! <param-value>org.archive.wayback.localresourcestore.LocalARCResourceStore</param-value> ! <description>Class that implements ResourceStore for this Wayback</description> ! </context-param> + <!-- ResourceIndex Configuration --> + + <context-param> + <param-name>resourceindex.class</param-name> + <param-value>org.archive.wayback.localbdbresourceindex.LocalBDBResourceIndex</param-value> + <description>Class that implements ResourceIndex for this Wayback</description> + </context-param> + + <context-param> + <param-name>resourceindex.indexpath</param-name> + <param-value>/tmp/wayback/index</param-value> + <description> + LocalBDBResourceIndex specific directory to store the BDB files. + Directory must exists. + </description> + </context-param> + + <context-param> + <param-name>resourceindex.dbname</param-name> + <param-value>DB1</param-value> + <description> + LocalBDBResourceIndex specific name for BDB database + </description> + </context-param> + + <context-param> + <param-name>indexpipeline.workpath</param-name> + <param-value>/tmp/wayback/pipeline</param-value> + <description> + LocalBDBResourceIndex specific directory to store flag files and temporary index data. 
+ </description> + </context-param> + + <context-param> + <param-name>indexpipeline.runpipeline</param-name> + <param-value>1</param-value> + <description> + if set to '1' then a background indexing thread will automatically update the BDB + index when new ARC files are noticed in the 'arcpath' directory. + </description> + </context-param> + + <!-- + <context-param> + <param-name></param-name> + <param-value></param-value> + <description></description> + </context-param> + --> + + + + <!-- Replay Servlet Configuration --> + + <servlet> + <servlet-name>ReplayServlet</servlet-name> + <servlet-class>org.archive.wayback.servletglue.WBReplayUIServlet</servlet-class> + </servlet> <servlet-mapping> ! <servlet-name>ReplayServlet</servlet-name> ! <url-pattern>/replay</url-pattern> </servlet-mapping> ! ! <!-- Replay Filter Configuration --> <filter> *************** *** 145,149 **** <init-param> ! <param-name>requestParser.class</param-name> <param-value>org.archive.wayback.rawreplayui.RawReplayUI</param-value> </init-param> --- 131,135 ---- <init-param> ! <param-name>requestparser.class</param-name> <param-value>org.archive.wayback.rawreplayui.RawReplayUI</param-value> </init-param> *************** *** 152,156 **** <param-value>/replay</param-value> </init-param> - </filter> <filter-mapping> --- 138,141 ---- *************** *** 160,163 **** --- 145,162 ---- + + <!-- Query Servlet Configuration --> + + <servlet> + <servlet-name>QueryServlet</servlet-name> + <servlet-class>org.archive.wayback.servletglue.WBQueryUIServlet</servlet-class> + </servlet> + <servlet-mapping> + <servlet-name>QueryServlet</servlet-name> + <url-pattern>/query</url-pattern> + </servlet-mapping> + + <!-- Query Filter Configuration --> + <filter> <filter-name>QueryFilter</filter-name> *************** *** 165,169 **** <init-param> ! <param-name>requestParser.class</param-name> <param-value>org.archive.wayback.simplequeryui.SimpleQueryUI</param-value> </init-param> --- 164,168 ---- <init-param> ! 
<param-name>requestparser.class</param-name> <param-value>org.archive.wayback.simplequeryui.SimpleQueryUI</param-value> </init-param> *************** *** 178,180 **** --- 177,197 ---- <url-pattern>/*</url-pattern> </filter-mapping> + + + <!-- Pipeline Filter Configuration --> + + <filter> + <filter-name>PipelineFilter</filter-name> + <filter-class>org.archive.wayback.arcindexer.PipelineFilter</filter-class> + <init-param> + <param-name>pipeline.statusjsp</param-name> + <param-value>jsp/PipelineUI/PipelineStatus.jsp</param-value> + </init-param> + </filter> + <filter-mapping> + <filter-name>PipelineFilter</filter-name> + <url-pattern>/pipeline</url-pattern> + </filter-mapping> + + </web-app> |
From: Brad <bra...@us...> - 2005-10-21 03:24:49
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/jsp/PipelineUI
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv14129/src/webapp/jsp/PipelineUI

Added Files:
    PipelineStatus.jsp
Log Message:
Heavy modification of configuration to be Context-level, instead of Servlet-level, which dramatically reduces configuration redundancy. Cleaned up IndexPipeline, moved a few classes around, added a really simple JSP to view the Index and Merge queue sizes, and a filter, which both allows the index thread to start with the context, and allows access to the jsp.

--- NEW FILE: PipelineStatus.jsp ---
<jsp:include page="../../template/UI-header.jsp" />
<%@ page import="org.archive.wayback.arcindexer.PipelineStatus" %>
<%
PipelineStatus status = (PipelineStatus) request.getAttribute("pipelinestatus");
%>
<H2>Pipeline Status</H2>
<HR>
Queued For Index:<B><%= status.getNumQueuedForIndex() %></B><BR>
Queued For Merge:<B><%= status.getNumQueuedForMerge() %></B><BR>
<jsp:include page="../../template/UI-footer.jsp" />
From: Brad <bra...@us...> - 2005-10-21 03:24:49
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/servletglue
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv14129/src/java/org/archive/wayback/servletglue

Modified Files:
    RequestFilter.java WBQueryUIServlet.java WBReplayUIServlet.java
Log Message:
Heavy modification of configuration to be Context-level, instead of Servlet-level, which dramatically reduces configuration redundancy. Cleaned up IndexPipeline, moved a few classes around, added a really simple JSP to view the Index and Merge queue sizes, and a filter, which both allows the index thread to start with the context, and allows access to the jsp.

Index: WBQueryUIServlet.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/servletglue/WBQueryUIServlet.java,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** WBQueryUIServlet.java 20 Oct 2005 00:40:41 -0000 1.3
--- WBQueryUIServlet.java 21 Oct 2005 03:24:40 -0000 1.4
***************
*** 29,32 ****
--- 29,33 ----
  import javax.servlet.ServletConfig;
+ import javax.servlet.ServletContext;
  import javax.servlet.ServletException;
  import javax.servlet.http.HttpServlet;
***************
*** 66,70 ****
        p.put(key, c.getInitParameter(key));
      }
!
      try {
        wayback.init(p);
--- 67,76 ----
        p.put(key, c.getInitParameter(key));
      }
!     ServletContext sc = c.getServletContext();
!     for (Enumeration e = sc.getInitParameterNames(); e.hasMoreElements();) {
!       String key = (String) e.nextElement();
!       p.put(key, sc.getInitParameter(key));
!     }
!
      try {
        wayback.init(p);

Index: WBReplayUIServlet.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/servletglue/WBReplayUIServlet.java,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** WBReplayUIServlet.java 20 Oct 2005 00:40:41 -0000 1.3
--- WBReplayUIServlet.java 21 Oct 2005 03:24:40 -0000 1.4
***************
*** 29,32 ****
--- 29,33 ----
  import javax.servlet.ServletConfig;
+ import javax.servlet.ServletContext;
  import javax.servlet.ServletException;
  import javax.servlet.http.HttpServlet;
***************
*** 65,68 ****
--- 66,74 ----
        p.put(key, c.getInitParameter(key));
      }
+     ServletContext sc = c.getServletContext();
+     for (Enumeration e = sc.getInitParameterNames(); e.hasMoreElements();) {
+       String key = (String) e.nextElement();
+       p.put(key, sc.getInitParameter(key));
+     }
      try {

Index: RequestFilter.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/servletglue/RequestFilter.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** RequestFilter.java 19 Oct 2005 01:22:37 -0000 1.2
--- RequestFilter.java 21 Oct 2005 03:24:40 -0000 1.3
***************
*** 55,59 ****
    private static final String WMREQUEST_ATTRIBUTE = "wmrequest.attribute";
!   private static final String REQUEST_PARSER_CLASS = "requestParser.class";
    private static final String HANDLER_URL = "handler.url";
--- 55,59 ----
    private static final String WMREQUEST_ATTRIBUTE = "wmrequest.attribute";
!   private static final String REQUEST_PARSER_CLASS = "requestparser.class";
    private static final String HANDLER_URL = "handler.url";
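Note on the servlet changes above: both servlets copy their own init parameters into the Properties object first and the context-level parameters second, so a context-level value would replace a servlet-level value with the same key. The following is a minimal sketch of that ordering using plain maps instead of the servlet API. The class name InitParamMergeSketch and the sample keys in main are hypothetical and are not part of the wayback sources.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical stand-in for the servlet init code; not part of the wayback sources. */
final class InitParamMergeSketch {

    /**
     * Copies servlet-level parameters first and context-level parameters second,
     * mirroring the order of the two loops in the diff. On a key collision the
     * context-level value replaces the servlet-level one.
     */
    static Properties merge(Map<String, String> servletParams,
                            Map<String, String> contextParams) {
        Properties p = new Properties();
        for (Map.Entry<String, String> e : servletParams.entrySet()) {
            p.put(e.getKey(), e.getValue());
        }
        for (Map.Entry<String, String> e : contextParams.entrySet()) {
            p.put(e.getKey(), e.getValue());
        }
        return p;
    }

    public static void main(String[] args) {
        // Sample values are assumptions chosen for illustration only.
        Map<String, String> servlet = new LinkedHashMap<>();
        servlet.put("handler.url", "/replay");
        servlet.put("arcpath", "/servlet/level/arcs");

        Map<String, String> context = new LinkedHashMap<>();
        context.put("arcpath", "/tmp/wayback/arcs");
        context.put("resourceindex.dbname", "DB1");

        Properties merged = merge(servlet, context);
        System.out.println(merged.getProperty("arcpath")); // /tmp/wayback/arcs
        System.out.println(merged.size());                 // 3
    }
}
```

Whether that override order is intended is not stated in the log message; the sketch only restates what the two loops do.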
From: Brad <bra...@us...> - 2005-10-21 03:24:43
|
Update of /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/jsp/PipelineUI
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv14120/src/webapp/jsp/PipelineUI

Log Message:
Directory /cvsroot/archive-access/archive-access/projects/wayback/src/webapp/jsp/PipelineUI added to the repository
From: Michael S. <sta...@us...> - 2005-10-21 01:29:37
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20626/src/java/org/archive/access/nutch

Modified Files:
    Arc2Segment.java
Log Message:
Implement '[ 1309781 ] Add in skipping certain types if > size'
* src/java/org/archive/access/nutch/Arc2Segment.java
    Test for text/html that is larger than the archive.skip.big.html value.
    Log and skip any found.
* conf/nutch-site.xml.nutchwax
    Edit.

Index: Arc2Segment.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch/Arc2Segment.java,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** Arc2Segment.java 18 Oct 2005 23:21:11 -0000 1.30
--- Arc2Segment.java 21 Oct 2005 01:29:29 -0000 1.31
***************
*** 87,90 ****
--- 87,101 ----
      }
    }
+   private static boolean skipBigHtml = false;
+   private static long bigHtmlMax = -1;
+   static {
+     String tmp = NutchConf.get().get("archive.skip.big.html");
+     if (tmp != null) {
+       bigHtmlMax = Long.parseLong(tmp);
+       if (bigHtmlMax != -1) {
+         skipBigHtml = true;
+       }
+     }
+   }
    /** Get the MimeTypes resolver instance. */
***************
*** 195,201 ****
        metaData.put(header.getName(), header.getValue());
      }
-
      String noSpacesMimetype = (mimetype == null)? "null":
        TextUtils.replaceAll(WHITESPACE, mimetype, "-");
      LOG.info("adding " + Long.toString(arcData.getLength()) +
        " bytes of mimetype " + noSpacesMimetype + " " + url);
--- 206,224 ----
        metaData.put(header.getName(), header.getValue());
      }
      String noSpacesMimetype = (mimetype == null)? "null":
        TextUtils.replaceAll(WHITESPACE, mimetype, "-");
+
+     // New test for Dan. If text/html and > than a certain size, then
+     // skip completly.
+     if (skipBigHtml && mimetype != null &&
+         mimetype.startsWith("text/html")) {
+       if (arcData.getLength() >= bigHtmlMax) {
+         LOG.info("skipping big html " +
+           Long.toString(arcData.getLength()) + " bytes of mimetype " +
+           noSpacesMimetype + " " + url);
+         return;
+       }
+     }
+
      LOG.info("adding " + Long.toString(arcData.getLength()) +
        " bytes of mimetype " + noSpacesMimetype + " " + url);
***************
*** 209,213 ****
        metaData.put(CONTENT_TYPE_KEY, mimetype);
      }
!
      // Collect content bytes
      // TODO: Skip if unindexable type.
--- 232,236 ----
        metaData.put(CONTENT_TYPE_KEY, mimetype);
      }
!
      // Collect content bytes
      // TODO: Skip if unindexable type.
***************
*** 322,325 ****
--- 345,349 ----
      Arc2Segment arc2Segment = new Arc2Segment(segmentDir, collectionName, nfs);
      LOG.info("Index all mimetypes: " + arc2Segment.isIndexAll());
+     LOG.info("skipBigHtml " + skipBigHtml + ", cutoff size " + bigHtmlMax);
      try {
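Note on the size check above: the decision can be restated without any Nutch classes, as a minimal sketch. The class name BigHtmlSkipSketch and its constructor argument are hypothetical; only the comparison logic is taken from the diff. The configured value is passed in as a string where the real code reads it from NutchConf.

```java
/** Hypothetical restatement of the size check; not the NutchWAX Arc2Segment class. */
final class BigHtmlSkipSketch {
    private final boolean skipBigHtml;
    private final long bigHtmlMax;

    /** @param configured raw value of archive.skip.big.html, or null if the property is unset */
    BigHtmlSkipSketch(String configured) {
        long max = -1;
        if (configured != null) {
            max = Long.parseLong(configured);
        }
        this.bigHtmlMax = max;
        // As in the diff, -1 is the only value that disables skipping.
        this.skipBigHtml = max != -1;
    }

    /** True when the record is text/html and its length is at or above the cutoff. */
    boolean shouldSkip(String mimetype, long length) {
        return skipBigHtml
            && mimetype != null
            && mimetype.startsWith("text/html")
            && length >= bigHtmlMax;
    }

    public static void main(String[] args) {
        BigHtmlSkipSketch check = new BigHtmlSkipSketch("1000");
        System.out.println(check.shouldSkip("text/html; charset=utf-8", 1000)); // true
        System.out.println(check.shouldSkip("text/html", 999));                 // false
        System.out.println(check.shouldSkip("application/pdf", 5000));          // false
    }
}
```

Two details follow from the comparison as committed: a record exactly at the cutoff is skipped, because the test is >= rather than >, and a configured value of 0 would skip every text/html record.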
From: Michael S. <sta...@us...> - 2005-10-21 00:42:27
|
Update of /cvsroot/archive-access/archive-access/projects/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9063

Modified Files:
    maven.xml
Log Message:
* maven.xml
    nutch-site.xml.all renamed as nutch-site.xml.nutchwax.
* conf/nutch-site.xml.nutchwax
    Added. Replaces...
* conf/nutch-site.xml.all
    Deleted.

Index: maven.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/maven.xml,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** maven.xml 5 Oct 2005 23:16:47 -0000 1.10
--- maven.xml 21 Oct 2005 00:42:14 -0000 1.11
***************
*** 121,125 ****
          in place to a viewer such as wera.-->
          <copy tofile="${maven.dist.bin.assembly.dir}/conf/nutch-site.xml"
!             file="${basedir}/conf/nutch-site.xml.all"
              filtering="true" overwrite="true" />
      </postGoal>
--- 121,125 ----
          in place to a viewer such as wera.-->
          <copy tofile="${maven.dist.bin.assembly.dir}/conf/nutch-site.xml"
!             file="${basedir}/conf/nutch-site.xml.nutchwax"
              filtering="true" overwrite="true" />
      </postGoal>
From: Michael S. <sta...@us...> - 2005-10-21 00:42:27
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/conf In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9063/conf Added Files: nutch-site.xml.nutchwax Removed Files: nutch-site.xml.all Log Message: * maven.xml nutch-site.xml.all renamed as nutch-site.xml.nutchwax. * conf/nutch-site.xml.nutchwax Added. Replaces... * conf/nutch-site.xml.all Deleted. --- nutch-site.xml.all DELETED --- --- NEW FILE: nutch-site.xml.nutchwax --- <?xml version="1.0"?> <!--Internet Archive Nutch configuration. This config. is what gets built into nutchwax. Overrides a few Nutch defaults and adds nutchwax specific config (Such config. options have an 'archive' prefix). --> <nutch-conf> <!-- Enable parse-ext (parse-ext is a parser that calls the 'ext'ernal program xpdf to parse pdf files). Also enable parse-default and the ia plugins. --> <property> <name>plugin.includes</name> <value>urlfilter-regex|parse-(text|html|ext|default)|index-(basic|ia)|query-(basic|site|url|ia)</value> </property> <property> <name>db.ignore.internal.links</name> <value>false</value> <description>Keep all links, not just inter-host. db updates will be FASTER if set to true. Downside is that link text from same site won't be included (More valuable to take anchor text from other hosts). Use this if wide variety of sites to index. </description> </property> <property> <name>indexer.boost.by.link.count</name> <value>true</value> <description>Use in-degree as poor-man's link analysis.</description> </property> <property> <name>indexer.max.tokens</name> <value>100000</value> <description>Don't truncate documents as much as by default. </description> </property> <property> <name>http.content.limit</name> <value>10000000</value> </property> <property> <name>io.map.index.skip</name> <value>7</value> <description>Use less RAM. Index files get read into memory. This config. says read 1/7th only in at a time. Random access is slower but use more memory. 
</description> </property> <property> <name>indexer.termIndexInterval</name> <value>1024</value> <description>Determines the fraction of terms which Lucene keeps in RAM when searching, to facilitate random-access. Smaller values use more memory but make searches somewhat faster. Larger values use less memory but make searches somewhat slower. For lucene indexes, normally. The default is 128. Write every 1024 entries rather than every 128, the default. </description> </property> <property> <name>indexer.maxMergeDocs</name> <value>2147483647</value> <description>This number determines the maximum number of Lucene Documents to be merged into a new Lucene segment. Larger values increase indexing speed and reduce the number of Lucene segments, which reduces the number of open file handles; however, this also increases RAM usage during indexing. Doug says: "There was a bogus value for indexer.maxMergeDocs in nutch-default.xml which made indexing really slow. The correct value is something really big (like Integer.MAX_VALUE)." </description> </property> <property> <name>searcher.summary.context</name> <value>20</value> <description> The number of context terms to display preceding and following matching terms in a hit summary. Make summaries a little longer than the default. </description> </property> <property> <name>searcher.summary.length</name> <value>80</value> <description> The total number of terms to display in a hit summary. </description> </property> <property> <name>collections.host</name> <value>collections.example.org</value> <description>The name of the server hosting collections. </description> </property> <!-- The name of this archive collection. DEPRECATED. Now search.jsp uses the 'collection' returned by the search result drawing up the wayback URL and at index time, use the command-line 'collection' option. 
<property> <name>archive.collection</name> <value>be05</value> </property> --> <!-- <property> <name>searcher.dir</name> <value>/home/stack/workspace/nutch-datadir</value> <description>Optionally, hardcode the nutch datadir location rather than rely on tomcat startup location. </description> </property> --> <property> <name>archive.index.all</name> <value>true</value> <description>If set to true, all contenttypes are indexed. Otherwise we only index text/* and application/* </description> </property> <property> <name>archive.skip.big.html</name> <value>-1</value> <description>If text/html is larger than value, just skip it completely. Use this setting to bypass problematic massive text/html (We were seeing the text/html parser hang for hours in bad, big html docs). Default value is -1 which says don't skip text/html docs.</description> </property> <property> <name>archive.dedup.count.collection</name> <value>false</value> <description>If true, when deduping, compare collection names as well as URL and content-md5 deduping. </description> </property> </nutch-conf> |
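Note on archive.dedup.count.collection: the description says that, when true, deduplication compares collection names as well as URL and content MD5. How the actual dedup job builds its key is not shown in this message, so the following is only a sketch of that idea under the assumption that duplicates are detected by a composite string key. DedupKeySketch, its methods, and the sample records are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of a dedup key; not taken from the NutchWAX sources. */
final class DedupKeySketch {

    /** Builds the comparison key; the collection name takes part only when countCollection is true. */
    static String key(String url, String contentMd5, String collection,
                      boolean countCollection) {
        String base = url + '\n' + contentMd5;
        return countCollection ? base + '\n' + collection : base;
    }

    /** Counts distinct keys; each document is given as {url, contentMd5, collection}. */
    static int countSurvivors(List<String[]> docs, boolean countCollection) {
        Set<String> seen = new HashSet<>();
        for (String[] d : docs) {
            seen.add(key(d[0], d[1], d[2], countCollection));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Sample records are assumptions chosen for illustration only.
        List<String[]> docs = Arrays.asList(
            new String[] {"http://example.org/", "md5-aaa", "coll-one"},
            new String[] {"http://example.org/", "md5-aaa", "coll-two"},
            new String[] {"http://example.org/", "md5-bbb", "coll-one"});
        System.out.println(countSurvivors(docs, false)); // 2
        System.out.println(countSurvivors(docs, true));  // 3
    }
}
```

With the flag off, the first two sample records collapse into one because they share URL and MD5; with it on, the differing collection names keep them apart.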
From: Michael S. <sta...@us...> - 2005-10-20 23:51:43
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv30555/xdocs

Modified Files:
    faq.fml
Log Message:
* xdocs/faq.fml
    Add id for mimetype faq.

Index: faq.fml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/faq.fml,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** faq.fml 19 Oct 2005 00:55:25 -0000 1.11
--- faq.fml 20 Oct 2005 23:51:35 -0000 1.12
***************
*** 252,256 ****
      </faq>
      <faq>
!       <question>How to query for mimetypes?
        </question>
        <answer>
--- 252,256 ----
      </faq>
      <faq>
!       <question id="mimetype">How to query for mimetypes?
        </question>
        <answer>
From: Doug C. <cu...@us...> - 2005-10-20 23:31:19
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/conf In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv27500/conf Added Files: Tag: mapred parse-plugins.xml Log Message: Pre-au fixes. --- NEW FILE: parse-plugins.xml --- <?xml version="1.0" encoding="UTF-8"?> <!-- Copyright 2005 The Apache Software Foundation Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. Author : mattmann Description: This xml file represents a natural ordering for which parsing plugin should get called for a particular mimeType. --> <parse-plugins> <!-- by default if the mimeType is set to *, or can't be determined, use parse-text --> <mimeType name="*"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/java"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/msword"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/pdf"> <plugin id="parse-ext" /> </mimeType> <mimeType name="application/rss+xml"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/vnd.ms-excel"> <plugin id="parse-msexcel" /> </mimeType> <mimeType name="application/vnd.ms-powerpoint"> <plugin id="parse-mspowerpoint" /> </mimeType> <mimeType name="application/vnd.wap.wbxml"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/vnd.wap.wmlc"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/vnd.wap.wmlscriptc"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/xhtml+xml"> <plugin id="parse-text" /> </mimeType> <mimeType 
name="application/x-bzip2"> <!-- try and parse it with the zip parser --> <plugin id="parse-zip" /> </mimeType> <mimeType name="application/x-csh"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-gzip"> <!-- try and parse it with the zip parser --> <plugin id="parse-zip" /> </mimeType> <mimeType name="application/x-javascript"> <plugin id="parse-js" /> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-kword"> <!-- try and parse it with the word parser --> <plugin id="parse-msword" /> </mimeType> <mimeType name="application/x-kspread"> <!-- try and parse it with the msexcel parser --> <plugin id="parse-msexcel" /> </mimeType> <mimeType name="application/x-latex"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-netcdf"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-sh"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-tcl"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-tex"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-texinfo"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-troff"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-troff-man"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-troff-me"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/x-troff-ms"> <plugin id="parse-text" /> </mimeType> <mimeType name="application/zip"> <plugin id="parse-zip" /> </mimeType> <mimeType name="message/news"> <plugin id="parse-text" /> </mimeType> <mimeType name="message/rfc822"> <plugin id="parse-text" /> </mimeType> <mimeType name="text/css"> <plugin id="parse-text" /> </mimeType> <mimeType name="text/html"> <plugin id="parse-html" /> </mimeType> <mimeType name="text/plain"> <plugin id="parse-text" /> </mimeType> <mimeType name="text/richtext"> <plugin id="parse-rtf" /> <plugin id="parse-msword" /> </mimeType> <mimeType name="text/rtf"> <plugin 
id="parse-rtf" /> <plugin id="parse-msword" /> </mimeType> <mimeType name="text/sgml"> <plugin id="parse-html" /> <plugin id="parse-text" /> </mimeType> <mimeType name="text/tab-separated-values"> <plugin id="parse-msexcel" /> <plugin id="parse-text" /> </mimeType> <mimeType name="text/vnd.wap.wml"> <plugin id="parse-text" /> </mimeType> <mimeType name="text/vnd.wap.wmlscript"> <plugin id="parse-text" /> </mimeType> <mimeType name="text/xml"> <plugin id="parse-text" /> <plugin id="parse-html" /> <plugin id="parse-rss" /> </mimeType> <mimeType name="text/x-setext"> <plugin id="parse-text" /> </mimeType> <!-- Types for parse-ext plugin: required for unit tests to pass. --> <mimeType name="application/vnd.nutch.example.cat"> <plugin id="parse-ext" /> </mimeType> <mimeType name="application/vnd.nutch.example.md5sum"> <plugin id="parse-ext" /> </mimeType> </parse-plugins> |
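Note on parse-plugins.xml: the file's own comment describes it as an ordering of parser plugins per mimeType, with the "*" entry used when the type is unknown. The lookup can be sketched as below, assuming an exact match on the mimeType string followed by a fallback to "*". How Nutch itself resolves the entries is not shown in this message, and ParsePluginOrderSketch with its methods is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lookup table modelled on parse-plugins.xml; not the Nutch implementation. */
final class ParsePluginOrderSketch {
    private final Map<String, List<String>> byMimeType = new HashMap<>();

    /** Registers the plugin ids for one mimeType in the order they should be tried. */
    void register(String mimeType, String... pluginIds) {
        byMimeType.put(mimeType, Arrays.asList(pluginIds));
    }

    /** Returns the ordered plugin ids for the type, falling back to the "*" entry. */
    List<String> pluginsFor(String mimeType) {
        List<String> plugins = (mimeType == null) ? null : byMimeType.get(mimeType);
        if (plugins == null) {
            plugins = byMimeType.get("*");
        }
        return (plugins == null) ? Collections.<String>emptyList() : plugins;
    }

    public static void main(String[] args) {
        ParsePluginOrderSketch table = new ParsePluginOrderSketch();
        // Entries copied from the file above.
        table.register("*", "parse-text");
        table.register("text/html", "parse-html");
        table.register("text/xml", "parse-text", "parse-html", "parse-rss");

        System.out.println(table.pluginsFor("text/xml"));  // [parse-text, parse-html, parse-rss]
        System.out.println(table.pluginsFor("image/png")); // [parse-text]
    }
}
```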
From: Doug C. <cu...@us...> - 2005-10-20 23:30:57
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv27408/src/java/org/archive/access/nutch

Modified Files:
      Tag: mapred
    ImportArcs.java IndexArcs.java
Log Message:
Pre-au fixes.

Index: ImportArcs.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch/Attic/ImportArcs.java,v
retrieving revision 1.1.2.1
retrieving revision 1.1.2.2
diff -C2 -d -r1.1.2.1 -r1.1.2.2
*** ImportArcs.java 1 Sep 2005 18:45:29 -0000 1.1.2.1
--- ImportArcs.java 20 Oct 2005 23:30:49 -0000 1.1.2.2
***************
*** 55,60 ****
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseStatus;
! import org.apache.nutch.parse.Parser;
! import org.apache.nutch.parse.ParserFactory;
  import org.apache.nutch.parse.ParseImpl;

--- 55,59 ----
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseStatus;
! import org.apache.nutch.parse.ParseUtil;
  import org.apache.nutch.parse.ParseImpl;

***************
*** 127,135 ****

          if (arcName == null) { // first entry has arc name
!           String arcPath = new URI(rec.getMetaData().getUrl()).getPath();
!           arcName = new File(arcPath).getName();
!           if (arcName.endsWith(".arc")) {
!             arcName = arcName.substring(0, arcName.indexOf(".arc"));
!           }
            reporter.setStatus(arcName);
          }
--- 126,130 ----

          if (arcName == null) { // first entry has arc name
!           arcName = rec.getMetaData().getUrl();
            reporter.setStatus(arcName);
          }
***************
*** 211,220 ****
        Content content = new Content(url, url, contentBytes, mimetype, metaData);

-       metaData.put(Fetcher.DIGEST_KEY, MD5Hash.digest(contentBytes).toString());
-       metaData.put(Fetcher.SEGMENT_NAME_KEY, segmentName);
-
        CrawlDatum datum = new CrawlDatum();
        datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);

        long date = 0;
        try {
--- 206,216 ----
        Content content = new Content(url, url, contentBytes, mimetype, metaData);

        CrawlDatum datum = new CrawlDatum();
        datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);

+       metaData.put(Fetcher.DIGEST_KEY, MD5Hash.digest(contentBytes).toString());
+       metaData.put(Fetcher.SEGMENT_NAME_KEY, segmentName);
+       metaData.put(Fetcher.SCORE_KEY, Float.toString(datum.getScore()));
+
        long date = 0;
        try {
***************
*** 228,234 ****
        ParseStatus parseStatus;
        try {
!         Parser parser = ParserFactory.getParser(content.getContentType(),
!                                                 content.getBaseUrl());
!         parse = parser.getParse(content);
          parseStatus = parse.getData().getStatus();
        } catch (Exception e) {
--- 224,228 ----
        ParseStatus parseStatus;
        try {
!         parse = ParseUtil.parse(content);
          parseStatus = parse.getData().getStatus();
        } catch (Exception e) {

Index: IndexArcs.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch/Attic/IndexArcs.java,v
retrieving revision 1.1.2.2
retrieving revision 1.1.2.3
diff -C2 -d -r1.1.2.2 -r1.1.2.3
*** IndexArcs.java 12 Oct 2005 16:49:04 -0000 1.1.2.2
--- IndexArcs.java 20 Oct 2005 23:30:49 -0000 1.1.2.3
***************
*** 28,31 ****
--- 28,32 ----
  import org.apache.nutch.mapred.*;
  import org.apache.nutch.crawl.*;
+ import org.apache.nutch.indexer.IndexMerger;

  public class IndexArcs {
***************
*** 51,54 ****
--- 52,56 ----

      boolean noImport = false;
+     boolean noUpdate = false;
      boolean noInvert = false;
      boolean noIndex = false;
***************
*** 57,60 ****
--- 59,64 ----
        if ("-noimport".equals(args[i])) {
          noImport = true;
+       } else if ("-noupdate".equals(args[i])) {
+         noUpdate = true;
        } else if ("-noinvert".equals(args[i])) {
          noInvert = true;
***************
*** 69,81 ****
      LOG.info("arcsDir = " + arcsDir);

      File linkDb = new File(crawlDir + "/linkdb");
      File segments = new File(crawlDir + "/segments");

      if (!noImport) { // import arcs
-       File segment = new File(segments, getDate());
        LOG.info("importing arcs in " + arcsDir + " to " + segment);
        new ImportArcs(conf).importArcs(arcsDir, segment);
      }

      if (!noInvert) { // invert links
        LOG.info("inverting links in " + segments);
--- 73,95 ----
      LOG.info("arcsDir = " + arcsDir);

+     File crawlDb = new File(crawlDir + "/crawldb");
      File linkDb = new File(crawlDir + "/linkdb");
      File segments = new File(crawlDir + "/segments");
+     File segment = new File(segments, getDate());
+     File indexes = new File(crawlDir + "/indexes");
+     File index = new File(crawlDir + "/index");
+
+     File tmpDir = conf.getLocalFile("crawl", getDate());

      if (!noImport) { // import arcs
        LOG.info("importing arcs in " + arcsDir + " to " + segment);
        new ImportArcs(conf).importArcs(arcsDir, segment);
      }

+     if (!noUpdate) { // update crawldb
+       LOG.info("updating crawldb in " + crawlDb);
+       new CrawlDb(conf).update(crawlDb, segment);
+     }
+
      if (!noInvert) { // invert links
        LOG.info("inverting links in " + segments);
***************
*** 84,92 ****

      if (!noIndex) { // index
-       File index = new File(crawlDir + "/indexes");
        LOG.info("indexing " + crawlDir);
!       new Indexer(conf).index(index, linkDb, fs.listFiles(segments));
      }

      LOG.info("IndexArcs finished: " + crawlDir);
    }
--- 98,108 ----

      if (!noIndex) { // index
        LOG.info("indexing " + crawlDir);
!       new Indexer(conf).index(indexes,crawlDb,linkDb,fs.listFiles(segments));
      }

+     new DeleteDuplicates(conf).dedup(new File[] { indexes });
+     new IndexMerger(fs, fs.listFiles(indexes), index, tmpDir).merge();
+
      LOG.info("IndexArcs finished: " + crawlDir);
    }
|
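Editorial note on the IndexArcs change above: each "-noXXX" switch disables one pipeline stage, while the new dedup and merge steps run unconditionally. The following is a minimal sketch of that control flow in isolation, not the committed code. The class StageFlags and its method names are hypothetical, the "-noindex" spelling is assumed by analogy with the three flags visible in the diff, and no Nutch classes are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of per-stage skip flags; not the actual IndexArcs code. */
final class StageFlags {
    boolean noImport;
    boolean noUpdate;
    boolean noInvert;
    boolean noIndex;

    /** Scans the argument list for the skip switches and ignores everything else. */
    static StageFlags parse(String[] args) {
        StageFlags flags = new StageFlags();
        for (String arg : args) {
            if ("-noimport".equals(arg)) {
                flags.noImport = true;
            } else if ("-noupdate".equals(arg)) {
                flags.noUpdate = true;
            } else if ("-noinvert".equals(arg)) {
                flags.noInvert = true;
            } else if ("-noindex".equals(arg)) { // assumed spelling, not shown in the diff
                flags.noIndex = true;
            }
        }
        return flags;
    }

    /** Lists the stages that would run, in the order the diff executes them. */
    List<String> stagesToRun() {
        List<String> stages = new ArrayList<>();
        if (!noImport) stages.add("import");
        if (!noUpdate) stages.add("update");
        if (!noInvert) stages.add("invert");
        if (!noIndex) stages.add("index");
        stages.add("dedup"); // not guarded by any flag in the diff
        stages.add("merge"); // not guarded by any flag in the diff
        return stages;
    }

    public static void main(String[] args) {
        System.out.println(parse(args).stagesToRun());
    }
}
```

Because dedup and merge are not guarded, skipping the index stage still leaves them operating on whatever already sits under the indexes directory.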
From: Doug C. <cu...@us...> - 2005-10-20 23:30:57
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/conf
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv27408/conf

Modified Files:
      Tag: mapred
    nutch-site.xml
Log Message:
Pre-au fixes.

Index: nutch-site.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml,v
retrieving revision 1.24.2.3
retrieving revision 1.24.2.4
diff -C2 -d -r1.24.2.3 -r1.24.2.4
*** nutch-site.xml 1 Sep 2005 18:45:29 -0000 1.24.2.3
--- nutch-site.xml 20 Oct 2005 23:30:48 -0000 1.24.2.4
***************
*** 9,83 ****
  <!-- <property> -->
  <!--   <name>fs.default.name</name> -->
! <!--   <value>ia109102:8009</value> -->
  <!-- </property> -->

- <property>
-   <name>ndfs.name.dir</name>
-   <value>/0/nutch/ndfs/names</value>
- </property>
-
- <property>
-   <name>ndfs.data.dir</name>
-   <value>/0/nutch/ndfs/doug,/1/nutch/ndfs/doug</value>
- </property>
-
- <property>
-   <name>ndfs.replication</name>
-   <value>2</value>
- </property>
-
  <!-- MapReduce -->

  <!-- <property> -->
  <!--   <name>mapred.job.tracker</name> -->
! <!--   <value>ia109102:8010</value> -->
! <!-- </property> -->
!
! <!-- <property> -->
! <!--   <name>mapred.job.tracker.info.port</name> -->
! <!--   <value>7846</value> -->
! <!-- </property> -->
!
! <!-- <property> -->
! <!--   <name>mapred.local.dir</name> -->
! <!--   <value>/0/nutch/mapred/local</value> -->
! <!-- </property> -->
!
! <!-- <property> -->
! <!--   <name>mapred.system.dir</name> -->
! <!--   <value>/mapred/system</value> -->
! <!-- </property> -->
!
! <!-- <property> -->
! <!--   <name>mapred.task.timeout</name> -->
! <!--   <value>3600000</value> -->
  <!-- </property> -->

  <!-- Override a few Nutch defaults -->

-
- <!-- Enable parse-ext (parse-ext is a parser that calls the 'ext'ernal program
- xpdf to parse pdf files. Also enable parse-default and the ia plugins.
- -->
  <property>
!   <name>plugin.includes</name>
!   <value>urlfilter-regex|parse-(text|html|ext|default)|index-(basic|ia)|query-(basic|site|url|ia)</value>
  </property>

! <!-- keep all links, not just inter-host -->
! <!-- db updates will be FASTER if set to true.
! Downside is that link text from same site won't be included.
! (More valuable to take anchor text from other hosts). Use this
! if wide variety of sites to index.
! -->
  <property>
!   <name>db.ignore.internal.links</name>
!   <value>false</value>
  </property>

- <!-- use in-degree as poor-man's link analysis -->
  <property>
!   <name>indexer.boost.by.link.count</name>
!   <value>true</value>
  </property>

--- 9,38 ----
  <!-- <property> -->
  <!--   <name>fs.default.name</name> -->
! <!--   <value>ia109102.archive.org:8009</value> -->
  <!-- </property> -->

  <!-- MapReduce -->

  <!-- <property> -->
  <!--   <name>mapred.job.tracker</name> -->
! <!--   <value>ia109102.archive.org:8010</value> -->
  <!-- </property> -->

  <!-- Override a few Nutch defaults -->

  <property>
!   <name>archive.collection</name>
!   <value>au</value>
  </property>

! <!-- the name of the archive server hosting this archive -->
  <property>
!   <name>archive.host</name>
!   <value>crawls.archive.org</value>
  </property>

  <property>
!   <name>plugin.includes</name>
!   <value>urlfilter-regex|parse-(text|html|js|ext)|index-(basic|ia)|query-(basic|site|url|ia)</value>
  </property>

***************
*** 132,160 ****
  </property>

- <!-- the name of the archive server hosting this archive -->
- <property>
-   <name>archive.host</name>
-   <value>crawls.archive.org</value>
- </property>
-
- <!-- The name of this archive collection.
- DEPRECATED. Now search.jsp uses the 'collection' returned by the search
- result drawing up the wayback URL and at index time, use the
- command-line 'collection' option.
-
- <property>
-   <name>archive.collection</name>
-   <value>be05</value>
- </property>
- -->
-
- <!--Optionally, hardcode the nutch datadir location rather
- than rely on tomcat startup location.
- <property>
-   <name>searcher.dir</name>
-   <value>/home/stack/workspace/nutch-datadir</value>
- </property>
- -->
-
  <!--If set to true, all contenttypes are indexed.
  Otherwise we only index text/* and application/*
--- 87,90 ----
***************
*** 164,166 ****
--- 94,97 ----
    <value>false</value>
  </property>
+
  </nutch-conf>
|
From: Doug C. <cu...@us...> - 2005-10-20 23:30:57
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/web
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv27408/src/web

Modified Files:
      Tag: mapred
    search.jsp
Log Message:
Pre-au fixes.

Index: search.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/web/search.jsp,v
retrieving revision 1.17
retrieving revision 1.17.2.1
diff -C2 -d -r1.17 -r1.17.2.1
*** search.jsp 2 Aug 2005 18:34:02 -0000 1.17
--- search.jsp 20 Oct 2005 23:30:49 -0000 1.17.2.1
***************
*** 16,19 ****
--- 16,20 ----
    import="org.apache.nutch.util.NutchConf"
    import="org.archive.access.nutch.NutchwaxQuery"
+   import="org.archive.util.ArchiveUtils"

  %><%!
***************
*** 169,176 ****
      String summary = summaries[i];
      String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo();
!
!     String archiveDate = FORMAT.format(new Date(bean.getFetchDate(detail)));
!     String archiveDisplayDate =
!       DISPLAY_FORMAT.format(new Date(bean.getFetchDate(detail)));
      String archiveCollection = detail.getValue("collection");
      String url = detail.getValue("url");
--- 170,177 ----
      String summary = summaries[i];
      String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo();
!
!     Date date = new Date(Long.valueOf(detail.getValue("date"))*1000);
!     String archiveDate = FORMAT.format(date);
!     String archiveDisplayDate = DISPLAY_FORMAT.format(date);
      String archiveCollection = detail.getValue("collection");
      String url = detail.getValue("url");
|
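Editorial note on the search.jsp change above: the new code reads the "date" field as a count of seconds since the epoch and multiplies by 1000 because java.util.Date expects milliseconds. Below is a minimal sketch of that conversion. The class name ArchiveDates, the 14-digit pattern "yyyyMMddHHmmss" and the UTC time zone are assumptions made for illustration, since the definition of FORMAT is not part of the diff.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Hypothetical helper; mirrors the seconds-to-milliseconds step in the diff. */
final class ArchiveDates {
    /** Converts a decimal string of epoch seconds into a 14-digit timestamp. */
    static String toArchiveTimestamp(String epochSeconds) {
        // java.util.Date counts milliseconds, the index field counts seconds.
        long millis = Long.parseLong(epochSeconds) * 1000L;
        // Assumed pattern and zone; the real FORMAT constant is not shown in the diff.
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss");
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(new Date(millis));
    }

    public static void main(String[] args) {
        System.out.println(toArchiveTimestamp("0"));
    }
}
```

A new SimpleDateFormat is created per call because the class is not thread-safe, which matters in a JSP served to concurrent requests.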
From: Doug C. <cu...@us...> - 2005-10-20 23:30:57
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/plugin
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv27408/src/plugin

Modified Files:
      Tag: mapred
    build.xml
Log Message:
Pre-au fixes.

Index: build.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/plugin/build.xml,v
retrieving revision 1.5
retrieving revision 1.5.2.1
diff -C2 -d -r1.5 -r1.5.2.1
*** build.xml 25 Jul 2005 20:35:00 -0000 1.5
--- build.xml 20 Oct 2005 23:30:49 -0000 1.5.2.1
***************
*** 9,14 ****
      <ant dir="index-ia" target="deploy"/>
      <ant dir="query-ia" target="deploy"/>
!     <ant dir="parse-default" target="deploy"/>
!   </target>

    <!-- ====================================================== -->
--- 9,14 ----
      <ant dir="index-ia" target="deploy"/>
      <ant dir="query-ia" target="deploy"/>
!     <!-- <ant dir="parse-default" target="deploy"/> -->
!   </target>

    <!-- ====================================================== -->
***************
*** 18,22 ****
      <ant dir="index-ia" target="test"/>
      <ant dir="query-ia" target="test"/>
!     <ant dir="parse-default" target="test"/>
    </target>

--- 18,22 ----
      <ant dir="index-ia" target="test"/>
      <ant dir="query-ia" target="test"/>
!     <!-- <ant dir="parse-default" target="test"/> -->
    </target>

***************
*** 27,31 ****
      <ant dir="index-ia" target="clean"/>
      <ant dir="query-ia" target="clean"/>
!     <ant dir="parse-default" target="clean"/>
    </target>

--- 27,31 ----
      <ant dir="index-ia" target="clean"/>
      <ant dir="query-ia" target="clean"/>
!     <!-- <ant dir="parse-default" target="clean"/> -->
    </target>

|
From: Doug C. <cu...@us...> - 2005-10-20 23:30:57
|
Update of /cvsroot/archive-access/archive-access/projects/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv27408

Modified Files:
      Tag: mapred
    build.xml
Log Message:
Pre-au fixes.

Index: build.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/build.xml,v
retrieving revision 1.10.2.1
retrieving revision 1.10.2.2
diff -C2 -d -r1.10.2.1 -r1.10.2.2
*** build.xml 15 Aug 2005 21:06:25 -0000 1.10.2.1
--- build.xml 20 Oct 2005 23:30:48 -0000 1.10.2.2
***************
*** 117,120 ****
--- 117,121 ----
    <target name="job" depends="compile">
      <jar jarfile="${build.dir}/${name}.job.jar">
+       <zipfileset prefix="classes" file="${conf.dir}/parse-plugins.xml"/>
        <zipfileset prefix="classes" dir="${build.classes}"/>
        <zipfileset refid="lib.jars"/>
|
From: Michael S. <sta...@us...> - 2005-10-20 20:51:14
|
Update of /cvsroot/archive-access/archive-access/projects/wera
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv15246

Added Files:
    build.xml
Log Message:
* build.xml
  Added empty build.xml. Avoids spew of exceptions during maven build.

--- NEW FILE: build.xml ---
<?xml version="1.0" encoding="UTF-8"?>
<!--Use maven to build. Ant not supported.
(This is a placeholder build.xml. Without it, the maven build of src
will try to autogenerate an ant build file spewing an ugly exception
into the build).
-->
|
From: Michael S. <sta...@us...> - 2005-10-20 20:49:20
|
Update of /cvsroot/archive-access/archive-access/projects/infiniteurl
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv14824

Added Files:
    build.xml
Log Message:
* build.xml
  Add empty build.xml. Avoids harmless spew of exceptions during maven build.

--- NEW FILE: build.xml ---
<?xml version="1.0" encoding="UTF-8"?>
<!--Use maven to build. Ant not supported.
(This is a placeholder build.xml. Without it, the maven build of src
will try to autogenerate an ant build file spewing an ugly exception
into the build).
-->
|
From: Michael S. <sta...@us...> - 2005-10-20 20:45:07
|
Update of /cvsroot/archive-access/archive-access/projects/wb
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv13701

Added Files:
    build.xml
Log Message:
* build.xml
  Add empty build.xml (Avoids ugly, harmless spew of exceptions during maven build).

--- NEW FILE: build.xml ---
<?xml version="1.0" encoding="UTF-8"?>
<!--Use maven to build. Ant not supported.
(This is a placeholder build.xml. Without it, the maven build of src
will try to autogenerate an ant build file spewing an ugly exception
into the build).
-->
|
From: Michael S. <sta...@us...> - 2005-10-20 20:39:09
|
Update of /cvsroot/archive-access/archive-access/projects/wb
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11835

Modified Files:
    project.xml
Log Message:
* project.xml
  Don't generate reports not read.

Index: project.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wb/project.xml,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** project.xml 2 Aug 2005 20:59:49 -0000 1.6
--- project.xml 20 Oct 2005 20:38:59 -0000 1.7
***************
*** 153,155 ****
--- 153,184 ----
      </resources>
    </build>
+   <!--List of reports to generate.
+   Some are not working. Fix.
+   -->
+   <reports>
+     <!--Use the heritrix javadoc goal rather than the default
+     maven javadoc plugin. The latter doesn't copy over doc-files
+     nor package.html files.
+     -->
+     <report>maven-license-plugin</report>
+     <!--Takes a long time. No one looks at it. Comment in when wanted.
+     <report>maven-changelog-plugin</report>
+     <report>maven-checkstyle-plugin</report>
+     -->
+     <!--
+     <report>maven-jdepend-plugin</report>
+     -->
+     <report>maven-junit-report-plugin</report>
+     <report>maven-jxr-plugin</report>
+     <report>maven-pmd-plugin</report>
+     <report>maven-tasklist-plugin</report>
+     <!--<report>maven-findbugs-plugin</report>
+     -->
+     <!--<report>maven-developer-activity-plugin</report>-->
+     <!--TODO: <report>maven-file-activity-plugin</report>-->
+     <!--TODO: OOME and takes long time.
+     <report>maven-linkcheck-plugin</report>
+     -->
+   </reports>
+
  </project>
|
From: Michael S. <sta...@us...> - 2005-10-20 20:13:42
|
Update of /cvsroot/archive-access/archive-access/projects/infiniteurl
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5946

Modified Files:
    project.xml
Log Message:
* project.xml
  Don't generate unread reports.

Index: project.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/infiniteurl/project.xml,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** project.xml 28 Nov 2004 01:20:49 -0000 1.4
--- project.xml 20 Oct 2005 20:13:34 -0000 1.5
***************
*** 124,126 ****
--- 124,154 ----
      </resources>
    </build>
+   <!--List of reports to generate.
+   Some are not working. Fix.
+   -->
+   <reports>
+     <!--Use the heritrix javadoc goal rather than the default
+     maven javadoc plugin. The latter doesn't copy over doc-files
+     nor package.html files.
+     -->
+     <report>maven-license-plugin</report>
+     <!--Takes a long time. No one looks at it. Comment in when wanted.
+     <report>maven-changelog-plugin</report>
+     <report>maven-checkstyle-plugin</report>
+     -->
+     <!--
+     <report>maven-jdepend-plugin</report>
+     -->
+     <report>maven-junit-report-plugin</report>
+     <report>maven-jxr-plugin</report>
+     <report>maven-pmd-plugin</report>
+     <report>maven-tasklist-plugin</report>
+     <!--<report>maven-findbugs-plugin</report>
+     -->
+     <!--<report>maven-developer-activity-plugin</report>-->
+     <!--TODO: <report>maven-file-activity-plugin</report>-->
+     <!--TODO: OOME and takes long time.
+     <report>maven-linkcheck-plugin</report>
+     -->
+   </reports>
  </project>
|
From: Michael S. <sta...@us...> - 2005-10-20 19:54:33
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv1754/xdocs

Modified Files:
    srcbuild.xml
Log Message:
* xdocs/srcbuild.xml
  nutch-site.all renamed as nutch-site.nutchwax

Index: srcbuild.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/srcbuild.xml,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** srcbuild.xml 18 Oct 2005 23:21:11 -0000 1.8
--- srcbuild.xml 20 Oct 2005 19:54:26 -0000 1.9
***************
*** 40,50 ****
  </p>

! <p>Symlink <literal>${NUTCHWAX}/nutch/conf/nutch-site.xml.all</literal> to
  ${NUTCHWAX}/conf/nutch-site.xml. Doing this, there is only one nutch-site.xml
  shared by core Nutch and by NutchWAX extensions.
  <pre> % cd ${NUTCH_HOME}/nutch/conf
  % mv nutch-site.xml nutch-site.xml.original
! % ln -s ${NUTCHWAX}/conf/nutch-site.xml.all nutch-site.xml</pre>
! The <literal>nutch-site.xml.all</literal> that is in ${NUTCHWAX} has NutchWAX specific configuration overrides
  as well as hardcodings of collection names and the name of the archive host that
  holds archived pages. Edit these to suit your
--- 40,51 ----
  </p>

! <p>Symlink <literal>${NUTCHWAX}/nutch/conf/nutch-site.xml.nutchwax</literal> to
  ${NUTCHWAX}/conf/nutch-site.xml. Doing this, there is only one nutch-site.xml
  shared by core Nutch and by NutchWAX extensions.
  <pre> % cd ${NUTCH_HOME}/nutch/conf
  % mv nutch-site.xml nutch-site.xml.original
! % ln -s ${NUTCHWAX}/conf/nutch-site.xml.nutchwax nutch-site.xml</pre>
! The <literal>nutch-site.xml.nutchwax</literal> that is in ${NUTCHWAX} has
! NutchWAX specific configuration overrides
  as well as hardcodings of collection names and the name of the archive host that
  holds archived pages. Edit these to suit your
|
From: Sverre B. <sv...@us...> - 2005-10-20 18:53:14
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18906

Modified Files:
    index.php
Log Message:
removed link to pdf manual

Index: index.php
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/index.php,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** index.php 20 Oct 2005 10:40:48 -0000 1.8
--- index.php 20 Oct 2005 18:53:01 -0000 1.9
***************
*** 386,390 ****
      </td>
      <td align="right" class="norm">
!       Manual : <a href="./manual/manual.html">HTML</a> - <a href="./manual/manual.pdf">pdf</a> | <a href="./RELEASE-NOTES">Release Notes</a> | <a href="http://sourceforge.net/tracker/?group_id=118427&atid=681137">Report bugs</a>
      </td>
    </tr>
--- 386,390 ----
      </td>
      <td align="right" class="norm">
!       <a href="./manual/manual.html">Manual</a> | <a href="./releasenotes.html">Release Notes</a> | <a href="http://sourceforge.net/tracker/?group_id=118427&atid=681137">Report bugs</a>
      </td>
    </tr>
|
From: Sverre B. <sv...@us...> - 2005-10-20 18:28:04
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/lib
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv14288/lib

Modified Files:
    config.inc.template
Log Message:
removed some rubbish

Index: config.inc.template
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/lib/config.inc.template,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** config.inc.template 20 Oct 2005 07:20:39 -0000 1.4
--- config.inc.template 20 Oct 2005 18:27:53 -0000 1.5
***************
*** 43,48 ****
  $conf_searchengine = "@searchEngine@";
  $conf_searchengine_url = "@searchEngineUrl@";
- //$conf_searchengine = "fast";
- //$conf_searchengine_url = "http://utvikling1.nb.no:15100/cgi-bin/asearch";
  $conf_index_file = $conf_searchenginepath . "/" . $conf_searchengine . ".inc";
  $conf_index_class = $conf_searchengine . "Search";
--- 43,46 ----
***************
*** 98,102 ****
  // JavaScript disable/enabled for archived pages viewed in Archive Document View
  // values can be "on" or "off"
! $conf_javascript = "off";

  //Help file links
--- 96,100 ----
  // JavaScript disable/enabled for archived pages viewed in Archive Document View
  // values can be "on" or "off"
! $conf_javascript = "off";

  //Help file links
|
From: Michael S. <sta...@us...> - 2005-10-20 17:47:55
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/java/no/nb/nwa/retriever
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3413/src/java/no/nb/nwa/retriever

Modified Files:
    AID.java ARCRetriever.java
Log Message:
Implement [ 1246916 ] Means of asking for resource http headers
* src/java/no/nb/nwa/retriever/AID.java
* src/java/no/nb/nwa/retriever/ARCRetriever.java
  Changes to make the meta fetching -- and status and archive info --
  work again.
* src/webapps/arcretriever/index.jsp
  Doc. edits.

Index: AID.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wera/src/java/no/nb/nwa/retriever/AID.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** AID.java 19 Oct 2005 23:59:22 -0000 1.2
--- AID.java 20 Oct 2005 17:47:41 -0000 1.3
***************
*** 25,28 ****
--- 25,30 ----
  package no.nb.nwa.retriever;

+ import java.io.File;
+
  /**
   *
***************
*** 58,61 ****
--- 60,71 ----

      /**
+      * @param Full path to the directory of arcs.
+      * @return Full path to arc file.
+      */
+     public File getFile(final File arcdir) {
+         return new File(arcdir, getFilename());
+     }
+
+     /**
       * @return Returns the filename (If no suffix, appends arc.gz).
       */

Index: ARCRetriever.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wera/src/java/no/nb/nwa/retriever/ARCRetriever.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** ARCRetriever.java 19 Oct 2005 23:59:22 -0000 1.2
--- ARCRetriever.java 20 Oct 2005 17:47:41 -0000 1.3
***************
*** 39,42 ****
--- 39,45 ----
  import java.util.logging.Logger;
  import java.util.logging.Level;
+ import java.util.Iterator;
+ import java.util.regex.Pattern;
+ import java.util.regex.Matcher;

  import javax.servlet.ServletException;
***************
*** 79,82 ****
--- 82,86 ----

      private File arcdir = null;
+     private Pattern CR = Pattern.compile("\r");

      /**
***************
*** 164,168 ****
          String status = "";
          String status_long = "";
!         File file = new File(aid.getFilename());
          ARCReader arc = null;
          try {
--- 168,172 ----
          String status = "";
          String status_long = "";
!         File file = aid.getFile(this.arcdir);
          ARCReader arc = null;
          try {
***************
*** 265,269 ****
          OutputStream out = null;
          try {
!             File file = new File(aid.getFilename());
              try {
                  arc = ARCReaderFactory.get(file);
--- 269,273 ----
          OutputStream out = null;
          try {
!             File file = aid.getFile(this.arcdir);
              try {
                  arc = ARCReaderFactory.get(file);
***************
*** 326,332 ****
              addTextElement(metadata, "filestatus", "online");
              addTextElement(metadata, "filestatus_long", "");
!             // TODO: Fix.
!             // String header = rec.getHttpHeaderString().replaceAll("\r", "");
!             String header = "UNIMPLEMENTED-TODO";
              //remove illegal XML-characters
              header = header.replaceAll("[\\p{Cc}&&[^\\u0009\\u000A\\u000D]]+",
--- 330,334 ----
              addTextElement(metadata, "filestatus", "online");
              addTextElement(metadata, "filestatus_long", "");
!             String header = getAllHeadersAsString(headers);
              //remove illegal XML-characters
              header = header.replaceAll("[\\p{Cc}&&[^\\u0009\\u000A\\u000D]]+",
***************
*** 366,369 ****
--- 368,385 ----
      }

+     private String getAllHeadersAsString(final HeaderGroup headers) {
+         StringBuffer buffer = new StringBuffer();
+         for (final Iterator i = headers.getIterator(); i.hasNext();) {
+             Header h = (Header)i.next();
+             String hdrStr = h.toString();
+             Matcher m = CR.matcher(hdrStr);
+             if (m != null) {
+                 hdrStr = m.replaceAll(" ");
+             }
+             buffer.append(hdrStr);
+         }
+         return buffer.toString();
+     }
+
      private String getHttpHeader(HeaderGroup headers, String headerName) {
          Header header = headers.getCondensedHeader(headerName);
***************
*** 378,382 ****
      }

!     private void addCDataElement(Node parent, String elementName, String value) {
          value = value == null ? "" : value;
          Document dom = parent.getOwnerDocument();
--- 394,399 ----
      }

!     private void addCDataElement(Node parent, String elementName,
!             String value) {
          value = value == null ? "" : value;
          Document dom = parent.getOwnerDocument();
***************
*** 390,394 ****
          ARCRecord rec = null;
          ARCReader arc = null;
!         File file = new File(this.arcdir, aid.getFilename());
          arc = ARCReaderFactory.get(file);
          rec = arc.get(aid.getOffset());
--- 407,411 ----
          ARCRecord rec = null;
          ARCReader arc = null;
!         File file = aid.getFile(this.arcdir);
          arc = ARCReaderFactory.get(file);
          rec = arc.get(aid.getOffset());
|
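Editorial note on the ARCRetriever change above: the new getAllHeadersAsString concatenates every header after replacing carriage returns with spaces, and the caller then strips control characters that are illegal in XML. Here is a minimal sketch of those two steps on plain strings, not the committed code. The class HeaderText is hypothetical, it takes a List of String rather than the HeaderGroup type used in the diff, and the single-space replacement for illegal characters is an assumption because the second argument of that replaceAll call is cut off in the diff.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the header flattening shown in the diff. */
final class HeaderText {
    private static final Pattern CR = Pattern.compile("\r");

    // Control characters except TAB, LF and CR; same character class as in the diff.
    private static final Pattern ILLEGAL_XML =
        Pattern.compile("[\\p{Cc}&&[^\\u0009\\u000A\\u000D]]+");

    /** Joins header lines, turning each CR into a space, then cleans the result. */
    static String join(List<String> headerLines) {
        StringBuilder buffer = new StringBuilder();
        for (String line : headerLines) {
            buffer.append(CR.matcher(line).replaceAll(" "));
        }
        // Replacement string assumed; the diff truncates this argument.
        return ILLEGAL_XML.matcher(buffer).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.print(join(Arrays.asList("Server: example\r\n", "Content-Type: text/html\r\n")));
    }
}
```

Pattern.matcher never returns null, so the null check in the committed method is harmless but not needed; the sketch omits it.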
From: Michael S. <sta...@us...> - 2005-10-20 17:47:53
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps/arcretriever
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3413/src/webapps/arcretriever

Modified Files:
    index.jsp
Log Message:
Implement [ 1246916 ] Means of asking for resource http headers
* src/java/no/nb/nwa/retriever/AID.java
* src/java/no/nb/nwa/retriever/ARCRetriever.java
  Changes to make the meta fetching -- and status and archive info --
  work again.
* src/webapps/arcretriever/index.jsp
  Doc. edits.

Index: index.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wera/src/webapps/arcretriever/index.jsp,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** index.jsp 19 Oct 2005 23:59:22 -0000 1.4
--- index.jsp 20 Oct 2005 17:47:41 -0000 1.5
***************
*** 15,31 ****
  directory that holds ARC files and then redeploy this webapp.
  Be aware that changing this value in the web.xml of an
! arcretriever sitting under a containers webapp directory
! can prove frustrating. The container usually notices your
! change then re-undoes the original WAR file overwriting your
! edits. If you remove the WAR file version, the container will
  subsequently 'cleanup' the lone webapp directory. Best to
  unjar outside of the container webapp directory and
! copy the unjarred WAR into the webapp dir.
  </p>
  <h2>Request Parameters</h2>
  <p>This webapp takes the following request parameters.
  <ul>
! <li><b>reqtype</b>: Possible values include: getfile, getmeta,
! getfilestatus, getarchiveinfo.</li>
  <li><b>aid</b>: The archive identifier. Its format is
  <i>OFFSET '/' ARCNAME</i>.</li>
--- 15,32 ----
  directory that holds ARC files and then redeploy this webapp.
  Be aware that changing this value in the web.xml of an
! <i>arcretriever</i> sitting under a containers' webapp directory
! can prove frustrating. The container usually notices your
! change and redeploys. But if you are not careful,
! you will lose your edit.
! If you remove the WAR file version, the container will
  subsequently 'cleanup' the lone webapp directory. Best to
  unjar outside of the container webapp directory and
! copy the unjarred WAR under the webapp dir.
  </p>
  <h2>Request Parameters</h2>
  <p>This webapp takes the following request parameters.
  <ul>
! <li><b>reqtype</b>: Possible values include: <i>getfile</i>,
! <i>getmeta</i>, <i>getfilestatus</i>, and <i>getarchiveinfo</i>.</li>
  <li><b>aid</b>: The archive identifier. Its format is
  <i>OFFSET '/' ARCNAME</i>.</li>
|
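Editorial note on the aid parameter documented above: an identifier of the form OFFSET '/' ARCNAME splits into a record offset and an ARC file name, which the AID.java change in the same commit resolves against the arc directory. The following is a rough sketch of that parsing under stated assumptions. The class ArchiveId is hypothetical and is not the project's AID class; the rule that a name already ending in ".arc" or ".arc.gz" counts as having a suffix, and the exact ".arc.gz" spelling appended otherwise, are guesses based on the javadoc line "If no suffix, appends arc.gz".

```java
import java.io.File;

/** Hypothetical parser for identifiers of the form OFFSET '/' ARCNAME. */
final class ArchiveId {
    private final long offset;
    private final String name;

    ArchiveId(String aid) {
        int slash = aid.indexOf('/');
        if (slash <= 0 || slash == aid.length() - 1) {
            throw new IllegalArgumentException("expected OFFSET/ARCNAME but got: " + aid);
        }
        this.offset = Long.parseLong(aid.substring(0, slash));
        this.name = aid.substring(slash + 1);
    }

    long getOffset() {
        return offset;
    }

    /** Returns the file name, appending ".arc.gz" when no ARC suffix is present (assumed rule). */
    String getFilename() {
        if (name.endsWith(".arc") || name.endsWith(".arc.gz")) {
            return name;
        }
        return name + ".arc.gz";
    }

    /** Resolves the file name against the directory that holds the ARC files. */
    File getFile(File arcdir) {
        return new File(arcdir, getFilename());
    }

    public static void main(String[] args) {
        ArchiveId id = new ArchiveId("1234/EXAMPLE-0001");
        System.out.println(id.getOffset() + " " + id.getFile(new File("arcs")));
    }
}
```

Resolving against the configured directory in one place is what lets the three call sites in ARCRetriever.java share the same lookup instead of building File objects by hand.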
From: Michael S. <sta...@us...> - 2005-10-20 16:52:05
|
Update of /cvsroot/archive-access/archive-access/projects/wayback
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv21905

Modified Files:
    project.xml
Log Message:
* project.xml
* xdocs/index.xml
  Minor description changes (mostly to see if wayback bundles are showing up
  on continuous build server).

Index: project.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/wayback/project.xml,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** project.xml 20 Oct 2005 01:30:36 -0000 1.1
--- project.xml 20 Oct 2005 16:51:54 -0000 1.2
***************
*** 29,33 ****

    <!-- A short but descriptive name for the project -->
!   <name>Internet Archive Wayback Machine</name>
    <!-- The version of the project under development, e.g.
    1.1, 1.2, 2.0-SNAPSHOT
--- 29,33 ----

    <!-- A short but descriptive name for the project -->
!   <name>Wayback</name>
    <!-- The version of the project under development, e.g.
    1.1, 1.2, 2.0-SNAPSHOT
***************
*** 54,62 ****
    <package>org.archive</package>
    <logo>/images/logo.gif</logo>
!   <description>The Internet Archive's Wayback Machine.
    </description>
    <!-- a short description of what the project does -->
!   <shortDescription>
!   Internet Archive Wayback Machine.
    </shortDescription>

--- 54,62 ----
    <package>org.archive</package>
    <logo>/images/logo.gif</logo>
!   <description>The wayback project is an open source implementation of the
!   Internet Archive's Wayback Machine.
    </description>
    <!-- a short description of what the project does -->
!   <shortDescription>Open source wayback machine.
    </shortDescription>

|