Revision: 3330
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3330&view=rev
Author:   bradtofel
Date:     2010-11-11 05:29:16 +0000 (Thu, 11 Nov 2010)

Log Message:
-----------
TWEAK: now uses the "Prefix" AccessPoint special configuration, which is wrong. Need to completely rework the URL configuration system for AccessPoints...

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/memento/TimeBundleRequestParser.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/memento/TimeBundleRequestParser.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/memento/TimeBundleRequestParser.java	2010-11-11 05:26:09 UTC (rev 3329)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/memento/TimeBundleRequestParser.java	2010-11-11 05:29:16 UTC (rev 3330)
@@ -72,10 +72,8 @@
         wbRequest.setCaptureQueryRequest();
         wbRequest.setRequestUrl(urlStr);
 
-        ArchivalUrlResultURIConverter conv =
-            (ArchivalUrlResultURIConverter) accessPoint.getUriConverter();
-
-        String uriPrefix = conv.getReplayURIPrefix();
+        String uriPrefix = accessPoint.getConfigs().getProperty("Prefix");
+
         String betterUrl = uriPrefix + "timemap/rdf/" + urlStr;
 
         throw new BetterRequestException(betterUrl, 303);

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <bra...@us...> - 2010-11-11 05:26:15
Revision: 3329
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3329&view=rev
Author:   bradtofel
Date:     2010-11-11 05:26:09 +0000 (Thu, 11 Nov 2010)

Log Message:
-----------
BUGFIX: was not sending correct Location redirect URL, was using resultToReplayUrl, which uses AccessPoint ResultURIConverter, which was recently changed to not rewrite URLs. Now explicitly uses an ArchivalUrl object to construct the correct Location.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/Memento.jsp

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/Memento.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/Memento.jsp	2010-11-11 05:22:40 UTC (rev 3328)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/Memento.jsp	2010-11-11 05:26:09 UTC (rev 3329)
@@ -1,4 +1,5 @@
 <%@ page import="java.util.Date"
+%><%@ page import="org.archive.wayback.archivalurl.ArchivalUrl"
 %><%@ page import="org.archive.wayback.core.UIResults"
 %><%@ page import="org.archive.wayback.util.StringFormatter"
 %><%@ page import="org.archive.wayback.core.WaybackRequest"
@@ -53,7 +54,10 @@
             + ">;rel=\"timemap\"; type=\"text/csv\"";
     String origlink = ", <" + u + ">;rel=\"original\"";
     String uriPrefix = wbRequest.getAccessPoint().getReplayPrefix();
-    String replayUrl = results.resultToReplayUrl(res);
+
+    ArchivalUrl aUrl = new ArchivalUrl(wbRequest);
+    String replayUrl = uriPrefix + aUrl.toString(res.getCaptureTimestamp(),
+            res.getOriginalUrl());
 
     StringBuffer sb = new StringBuffer();
 
From: <bra...@us...> - 2010-11-11 05:22:47
Revision: 3328
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3328&view=rev
Author:   bradtofel
Date:     2010-11-11 05:22:40 +0000 (Thu, 11 Nov 2010)

Log Message:
-----------
BUGFIX: was not sending correct URL prefixes for timemaps. Split out replayPrefix and queryPrefix for timegate and timemaps, respectively

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/ORE.jsp

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/ORE.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/ORE.jsp	2010-11-09 00:59:07 UTC (rev 3327)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/ORE.jsp	2010-11-11 05:22:40 UTC (rev 3328)
@@ -38,12 +38,13 @@
     CaptureSearchResults cResults = results.getCaptureResults();
     CaptureSearchResult res = cResults.getClosest();
 
-    String uriPrefix = wbRequest.getAccessPoint().getReplayPrefix();
+    String replayPrefix = wbRequest.getAccessPoint().getReplayPrefix();
+    String queryPrefix = wbRequest.getAccessPoint().getQueryPrefix();
     String u = wbRequest.getRequestUrl();
-    String agguri = uriPrefix + "timebundle/" + u;
+    String agguri = replayPrefix + "timebundle/" + u;
     String format = wbRequest.get("format");
     Aggregation agg = OREFactory.createAggregation(new URI(agguri));
-    ResourceMap rem = agg.createResourceMap(new URI(uriPrefix
+    ResourceMap rem = agg.createResourceMap(new URI(queryPrefix
             + "timemap/" + format + "/" + u));
 
     Date now = new Date();
@@ -79,7 +80,7 @@
             "http://www.mementoweb.org/terms/tb/OriginalResource"));
     //include timegate into aggregation
     AggregatedResource ar_tg = agg.createAggregatedResource(new URI(
-            results.getContextConfig("Prefix") + "timegate/" + u));
+            replayPrefix + "timegate/" + u));
     Predicate pr_format = new Predicate();
     pr_format.setURI(new URI("http://purl.org/dc/elements/1.1/format"));
@@ -104,9 +105,9 @@
 
     linkbf.append("<" + u + ">;rel=\"original\"\n");
     linkbf.append(",<" + agguri + ">;rel=\"timebundle\"\n");
-    linkbf.append(",<" + results.getContextConfig("Prefix")
+    linkbf.append(",<" + replayPrefix
             + "timegate/" + u + ">;rel=\"timegate\"\n");
-    linkbf.append(",<" + uriPrefix + "timemap/" + format + "/" + u
+    linkbf.append(",<" + queryPrefix + "timemap/" + format + "/" + u
             + ">;rel=\"timemap\";type=\"text/csv\"\n");
 
     String firstmemento = null;
@@ -115,7 +116,7 @@
         CaptureSearchResult cur = itr.next();
         //I am not deduping urls (by digest) for the rdf serialization running out of time, extra efforts for me now ;)
 
-        String resurl = results.getContextConfig("Prefix")
+        String resurl = replayPrefix
                 + formatterk.format(cur.getCaptureDate()) + "/" + u;
 
         String digest = cur.getDigest();
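The prefix split in r3328 can be sketched outside the JSP as follows. This is a minimal illustration, not the code in ORE.jsp: the class name `MementoLinkSketch`, the method `linkHeader`, and the example prefixes are hypothetical, and the `type` attribute and trailing newlines used in the original are left out. It assumes, as the diff suggests, that timebundle and timegate URLs are built from the replay prefix while timemap URLs are built from the query prefix.

```java
// Hypothetical sketch: builds a Memento-style Link header value from two prefixes.
class MementoLinkSketch {

    // replayPrefix and queryPrefix are expected to end with "/".
    static String linkHeader(String replayPrefix, String queryPrefix,
                             String format, String url) {
        StringBuilder sb = new StringBuilder();
        sb.append("<").append(url).append(">;rel=\"original\"");
        // timebundle and timegate are served under the replay prefix
        sb.append(",<").append(replayPrefix).append("timebundle/").append(url)
          .append(">;rel=\"timebundle\"");
        sb.append(",<").append(replayPrefix).append("timegate/").append(url)
          .append(">;rel=\"timegate\"");
        // timemaps are served under the query prefix
        sb.append(",<").append(queryPrefix).append("timemap/").append(format)
          .append("/").append(url).append(">;rel=\"timemap\"");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkHeader("http://replay.example/", "http://query.example/",
                "rdf", "http://example.org/"));
    }
}
```

Keeping the two prefixes as separate parameters makes it harder to mix them up again, which appears to be what the original bug was.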
Revision: 3327
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3327&view=rev
Author:   binzino
Date:     2010-11-09 00:59:07 +0000 (Tue, 09 Nov 2010)

Log Message:
-----------
Added check for null hostname. Warning log message is issued.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java	2010-11-03 07:13:24 UTC (rev 3326)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java	2010-11-09 00:59:07 UTC (rev 3327)
@@ -149,8 +149,6 @@
   {
     Metadata meta = parse.getData().getContentMeta();
 
-    //
-
     for ( FieldSpecification spec : this.fieldSpecs )
       {
         String value = null;
@@ -169,6 +167,13 @@
 
             value = uri.getHost( );
 
+            if ( value == null )
+              {
+                LOG.warn( "URL has no hostname: \"" + uri + "\"");
+
+                return null;
+              }
+
             // Strip off any "www." header.
             if ( value.startsWith( "www." ) )
               {
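For readers wondering when the host can be null: `java.net.URI.getHost()` returns null for opaque URIs such as `mailto:` and for hierarchical URIs without an authority such as `file:///...`. The sketch below shows the same guard in isolation. The class `HostnameGuard` and method `normalizedHost` are hypothetical names, and the sketch writes to `System.err` where the commit uses the plugin's logger.

```java
import java.net.URI;

// Hypothetical stand-alone version of the null-hostname guard.
class HostnameGuard {

    // Returns the host with a leading "www." removed,
    // or null when the URI carries no host component.
    static String normalizedHost(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            System.err.println("URL has no hostname: \"" + uri + "\"");
            return null;
        }
        // Strip off any "www." header, as the indexing filter does.
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        System.out.println(normalizedHost("http://www.example.org/page"));
        System.out.println(normalizedHost("mailto:someone@example.org"));
        System.out.println(normalizedHost("file:///tmp/a.txt"));
    }
}
```

Without the guard, the `startsWith` call would throw a NullPointerException on the second and third inputs.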
From: <bra...@us...> - 2010-11-03 07:13:30
Revision: 3326
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3326&view=rev
Author:   bradtofel
Date:     2010-11-03 07:13:24 +0000 (Wed, 03 Nov 2010)

Log Message:
-----------
added targets directory to svn.ignore

Property Changed:
----------------
    trunk/archive-access/projects/wayback/wayback-hadoop/
    trunk/archive-access/projects/wayback/wayback-hadoop-java/

Property changes on: trunk/archive-access/projects/wayback/wayback-hadoop
___________________________________________________________________
Added: svn:ignore
   + target

Property changes on: trunk/archive-access/projects/wayback/wayback-hadoop-java
___________________________________________________________________
Added: svn:ignore
   + target
From: <bra...@us...> - 2010-11-03 07:12:46
Revision: 3325
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3325&view=rev
Author:   bradtofel
Date:     2010-11-03 07:12:38 +0000 (Wed, 03 Nov 2010)

Log Message:
-----------
added nested module support

Added Paths:
-----------
    trunk/archive-access/projects/wayback/.settings/org.maven.ide.eclipse.prefs

Added: trunk/archive-access/projects/wayback/.settings/org.maven.ide.eclipse.prefs
===================================================================
--- trunk/archive-access/projects/wayback/.settings/org.maven.ide.eclipse.prefs	                        (rev 0)
+++ trunk/archive-access/projects/wayback/.settings/org.maven.ide.eclipse.prefs	2010-11-03 07:12:38 UTC (rev 3325)
@@ -0,0 +1,9 @@
+#Wed Nov 03 14:03:43 GMT+07:00 2010
+activeProfiles=
+eclipse.preferences.version=1
+fullBuildGoals=process-test-resources
+includeModules=true
+resolveWorkspaceProjects=true
+resourceFilterGoals=process-resources resources\:testResources
+skipCompilerPlugin=true
+version=1
From: <bi...@us...> - 2010-10-28 22:47:11
Revision: 3324
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3324&view=rev
Author:   binzino
Date:     2010-10-28 22:47:05 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Do not use compound index format.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/DateAdder.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/DateAdder.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/DateAdder.java	2010-10-28 22:46:40 UTC (rev 3323)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/DateAdder.java	2010-10-28 22:47:05 UTC (rev 3324)
@@ -107,7 +107,8 @@
       }
 
     IndexWriter writer = new IndexWriter( new NIOFSDirectory( new File( destIndexDir ) ), new KeywordAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED );
-
+    writer.setUseCompoundFile( false );
+
     UrlCanonicalizer canonicalizer = getCanonicalizer( this.getConf( ) );
 
     for ( int i = 0 ; i < reader.numDocs( ) ; i++ )
From: <bi...@us...> - 2010-10-28 22:46:46
Revision: 3323
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3323&view=rev
Author:   binzino
Date:     2010-10-28 22:46:40 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Added parameter to IndexReader.open to open in read/write mode.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java	2010-10-28 04:32:32 UTC (rev 3322)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java	2010-10-28 22:46:40 UTC (rev 3323)
@@ -124,7 +124,8 @@
 
     String pagerankFile = args[pos++];
 
-    IndexReader reader = IndexReader.open( new NIOFSDirectory( new File( args[pos++] ) ) );
+    // The 'false' means to open read/write not read-only.
+    IndexReader reader = IndexReader.open( new NIOFSDirectory( new File( args[pos++] ) ), false );
 
     try
       {
From: <bi...@us...> - 2010-10-28 04:32:38
Revision: 3322
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3322&view=rev
Author:   binzino
Date:     2010-10-28 04:32:32 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Replace per-format parsers with Tika.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml	2010-10-28 04:32:11 UTC (rev 3321)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml	2010-10-28 04:32:32 UTC (rev 3322)
@@ -10,7 +10,7 @@
   <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. -->
   <!-- Also, add 'parse-pdf' -->
   <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' -->
-  <value>protocol-http|parse-(text|html|pdf|msword|mspowerpoint|oo)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value>
+  <value>protocol-http|parse-tika|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value>
 </property>
 
 <!--
From: <bi...@us...> - 2010-10-28 04:32:17
Revision: 3321
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3321&view=rev
Author:   binzino
Date:     2010-10-28 04:32:11 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Add parse-tika.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/src/plugin/build.xml

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/src/plugin/build.xml
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/src/plugin/build.xml	2010-10-28 04:31:47 UTC (rev 3320)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/src/plugin/build.xml	2010-10-28 04:32:11 UTC (rev 3321)
@@ -62,6 +62,7 @@
      <!-- <ant dir="parse-rtf" target="deploy"/> -->
      <ant dir="parse-swf" target="deploy"/>
      <ant dir="parse-text" target="deploy"/>
+     <ant dir="parse-tika" target="deploy"/>
      <ant dir="parse-zip" target="deploy"/>
      <ant dir="query-basic" target="deploy"/>
      <ant dir="query-more" target="deploy"/>
@@ -172,6 +173,7 @@
     <ant dir="parse-rtf" target="clean"/>
     <ant dir="parse-swf" target="clean"/>
     <ant dir="parse-text" target="clean"/>
+    <ant dir="parse-tika" target="clean"/>
     <ant dir="parse-zip" target="clean"/>
     <ant dir="query-basic" target="clean"/>
     <ant dir="query-more" target="clean"/>
From: <bi...@us...> - 2010-10-28 04:31:54
Revision: 3320
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3320&view=rev
Author:   binzino
Date:     2010-10-28 04:31:47 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Over-rides of Nutch's default parse-tika config. We only want Tika to handle an explicit list of content types, not everything.

Added Paths:
-----------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/parse-plugins.xml
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/src/plugin/parse-tika/
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/src/plugin/parse-tika/plugin.xml

Added: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/parse-plugins.xml
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/parse-plugins.xml	                        (rev 0)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/parse-plugins.xml	2010-10-28 04:31:47 UTC (rev 3320)
@@ -0,0 +1,169 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+	Licensed to the Apache Software Foundation (ASF) under one or more
+	contributor license agreements.  See the NOTICE file distributed with
+	this work for additional information regarding copyright ownership.
+	The ASF licenses this file to You under the Apache License, Version 2.0
+	(the "License"); you may not use this file except in compliance with
+	the License.  You may obtain a copy of the License at
+	
+	http://www.apache.org/licenses/LICENSE-2.0
+	
+	Unless required by applicable law or agreed to in writing, software
+	distributed under the License is distributed on an "AS IS" BASIS,
+	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+	See the License for the specific language governing permissions and
+	limitations under the License.
+	
+	Author     : mattmann 
+	Description: This xml file represents a natural ordering for which parsing 
+	plugin should get called for a particular mimeType. 
+-->
+
+<parse-plugins>
+
+  <!-- Explicitly set parse-tika as the parser for *only* the types we want
+       to parse.  In the parse-tika plugin's plugin.xml, we disable the '*'
+       (wildcard) which matches everything. -->
+
+  <mimeType name="application/msword">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/pdf">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.ms-excel">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.ms-powerpoint">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.oasis.opendocument.text">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.oasis.opendocument.text-template">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.oasis.opendocument.text-master">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.oasis.opendocument.text-web">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.oasis.opendocument.presentation">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.oasis.opendocument.presentation-template">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.oasis.opendocument.spreadsheet">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.oasis.opendocument.spreadsheet-template">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.sun.xml.calc">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.sun.xml.calc.template">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.sun.xml.impress">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.sun.xml.impress.template">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.sun.xml.writer">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/vnd.sun.xml.writer.template">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/x-kword">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/x-kspread">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="text/html">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="application/xhtml+xml">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="text/plain">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="text/richtext">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="text/rtf">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <!--
+  <mimeType name="text/sgml">
+    <plugin id="parse-tika" />
+  </mimeType>
+
+  <mimeType name="text/tab-separated-values">
+    <plugin id="parse-tika" />
+  </mimeType>
+  -->
+
+  <!-- Types for parse-ext plugin: required for unit tests to pass. -->
+  <mimeType name="application/vnd.nutch.example.cat">
+    <plugin id="parse-ext" />
+  </mimeType>
+
+  <mimeType name="application/vnd.nutch.example.md5sum">
+    <plugin id="parse-ext" />
+  </mimeType>
+
+  <!--  alias mappings for parse-xxx names to the actual extension implementation 
+        ids described in each plugin's plugin.xml file -->
+  <aliases>
+    <alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.Parser" />
+    <alias name="parse-ext"  extension-id="ExtParser" />
+    <!--
+    <alias name="parse-html" extension-id="org.apache.nutch.parse.html.HtmlParser" />
+    <alias name="parse-js" extension-id="JSParser" />
+    <alias name="parse-msexceld" extension-id="org.apache.nutch.parse.msexcel.MSExcelParser" />
+    <alias name="parse-mspowerpoint" extension-id="org.apache.nutch.parse.mspowerpoint.MSPowerPointParser" />
+    <alias name="parse-msword" extension-id="org.apache.nutch.parse.msword.MSWordParser" />
+    <alias name="parse-oo" extension-id="org.apache.nutch.parse.oo.OpenDocument.Text" />
+    <alias name="parse-pdf" extension-id="org.apache.nutch.parse.pdf.PdfParser" />
+    <alias name="parse-rss" extension-id="org.apache.nutch.parse.rss.RSSParser" />
+    <alias name="feed" extension-id="org.apache.nutch.parse.feed.FeedParser" />
+    <alias name="parse-swf" extension-id="org.apache.nutch.parse.swf.SWFParser" />
+    <alias name="parse-text" extension-id="org.apache.nutch.parse.text.TextParser" />
+    <alias name="parse-zip" extension-id="org.apache.nutch.parse.zip.ZipParser" />
+    -->
+  </aliases>
+
+</parse-plugins>

Added: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/src/plugin/parse-tika/plugin.xml
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/src/plugin/parse-tika/plugin.xml	                        (rev 0)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/src/plugin/parse-tika/plugin.xml	2010-10-28 04:31:47 UTC (rev 3320)
@@ -0,0 +1,68 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<plugin
+   id="parse-tika"
+   name="Tika Parser Plug-in"
+   version="1.0.0"
+   provider-name="nutch.org">
+
+   <runtime>
+      <library name="parse-tika.jar">
+         <export name="*"/>
+      </library>
+
+      <library name="asm-3.1.jar"/>
+      <library name="bcmail-jdk14-136.jar"/>
+      <library name="bcmail-jdk15-1.45.jar"/>
+      <library name="bcprov-jdk14-136.jar"/>
+      <library name="bcprov-jdk15-1.45.jar"/>
+      <library name="commons-compress-1.0.jar"/>
+      <library name="commons-logging-1.1.1.jar"/>
+      <library name="dom4j-1.6.1.jar"/>
+      <library name="fontbox-1.1.0.jar"/>
+      <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
+      <library name="jempbox-1.1.0.jar"/>
+      <library name="metadata-extractor-2.4.0-beta-1.jar"/>
+      <library name="pdfbox-1.1.0.jar"/>
+      <library name="poi-3.6.jar"/>
+      <library name="poi-ooxml-3.6.jar"/>
+      <library name="poi-ooxml-schemas-3.6.jar"/>
+      <library name="poi-scratchpad-3.6.jar"/>
+      <library name="tagsoup-1.2.jar"/>
+      <library name="tika-parsers-0.7.jar"/>
+      <library name="xml-apis-1.0.b2.jar"/>
+      <library name="xmlbeans-2.3.0.jar"/>
+   </runtime>
+
+   <requires>
+      <import plugin="nutch-extensionpoints"/>
+   </requires>
+
+
+   <extension point="org.apache.nutch.parse.Parser"
+              id="org.apache.nutch.parse.tika"
+              name="TikaParser">
+
+      <implementation id="org.apache.nutch.parse.tika.Parser"
+                      class="org.apache.nutch.parse.tika.TikaParser">
+        <parameter name="contentType" value=""/>
+      </implementation>
+
+   </extension>
+
+</plugin>
From: <bi...@us...> - 2010-10-28 00:54:36
Revision: 3319
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3319&view=rev
Author:   binzino
Date:     2010-10-28 00:54:30 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Changed log message to debug.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/IndexerMapReduce.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/IndexerMapReduce.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/IndexerMapReduce.java	2010-10-28 00:54:00 UTC (rev 3318)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/IndexerMapReduce.java	2010-10-28 00:54:30 UTC (rev 3319)
@@ -103,6 +103,8 @@
       return ;
     }
 
+    if ( LOG.isDebugEnabled( ) ) LOG.debug( "Indexing: " + metadata.get("type") + " " + key );
+
     // add segment, used to map from merged index back to segment files
     //doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));
 
From: <bi...@us...> - 2010-10-28 00:54:07
Revision: 3318
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3318&view=rev
Author:   binzino
Date:     2010-10-28 00:54:00 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Added key to store original content-type from (w)arc file.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/NutchWax.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/NutchWax.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/NutchWax.java	2010-10-28 00:53:11 UTC (rev 3317)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/NutchWax.java	2010-10-28 00:54:00 UTC (rev 3318)
@@ -30,6 +30,7 @@
   public static final String DATE_KEY           = "date";
   public static final String DIGEST_KEY         = "digest";
   public static final String CONTENT_TYPE_KEY   = "type";
+  public static final String ORIGINAL_TYPE_KEY  = "type_original";
   public static final String CONTENT_LENGTH_KEY = "length";
   public static final String HTTP_RESPONSE_KEY  = "http_response_code";
 }
From: <bi...@us...> - 2010-10-28 00:53:17
Revision: 3317
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3317&view=rev
Author:   binzino
Date:     2010-10-28 00:53:11 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Added digest to metadata.
Added use of auto-content-type-detection.
Disabled BoilerPipe.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/Importer.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/Importer.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/Importer.java	2010-10-28 00:52:10 UTC (rev 3316)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/Importer.java	2010-10-28 00:53:11 UTC (rev 3317)
@@ -334,13 +334,29 @@
         contentMetadata.set( NutchWax.COLLECTION_KEY,     collectionName );
         contentMetadata.set( NutchWax.DATE_KEY,           meta.getDate()   );
         contentMetadata.set( NutchWax.DIGEST_KEY,         meta.getDigest() );
-        contentMetadata.set( NutchWax.CONTENT_TYPE_KEY,   meta.getMimetype() );
         contentMetadata.set( NutchWax.CONTENT_LENGTH_KEY, String.valueOf( meta.getLength() ) );
         contentMetadata.set( NutchWax.HTTP_RESPONSE_KEY,  String.valueOf( record.getStatusCode() ) );
 
+        String type = (meta.getMimetype( ) == null ? "" : meta.getMimetype( )).split( "[;]" )[0].toLowerCase().trim();
+        // If the Content-Type from the HTTP response is "text/plain",
+        // set it to null to trigger full auto-detection via Tika.
+        if ( "text/plain".equals( type ) )
+          {
+            type = null;
+          }
+
+        Content content = new Content( url, url, bytes, type, contentMetadata, getConf() );
+
+        if ( LOG.isDebugEnabled() ) LOG.debug( "Auto-detect content-type: " + type + " " + content.getContentType( ) + " " + url );
+
+        // Store both the original and auto-detected content types.
+        contentMetadata.set( NutchWax.ORIGINAL_TYPE_KEY, meta.getMimetype( ) );
+        contentMetadata.set( NutchWax.CONTENT_TYPE_KEY,  content.getContentType( ) );
+
         // BoilerPipe!
         /*
-        if ( "text/html".equals( meta.getMimetype() ) )
+        if ( "text/html".equals( content.getContentType( ) ) )
           {
             String boiledHTML = de.l3s.boilerpipe.extractors.DefaultExtractor.INSTANCE.getText( new org.xml.sax.InputSource( new java.io.ByteArrayInputStream( bytes ) ) );
 
@@ -348,8 +364,6 @@
           }
         */
 
-        Content content = new Content( url, url, bytes, meta.getMimetype(), contentMetadata, getConf() );
-
         output( output, new Text( key ), content );
 
         return true;
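The header clean-up in r3317 is easier to follow in isolation. The sketch below reproduces only the string handling: drop any parameters after `;`, lower-case, trim, and return null for `text/plain` so that a downstream detector is asked to guess. The class `ContentTypeHint` and method `detectionHint` are hypothetical names, and the sketch uses `Locale.ROOT` for lower-casing, which the commit does not.

```java
import java.util.Locale;

// Hypothetical stand-alone version of the Content-Type clean-up.
class ContentTypeHint {

    // Returns the bare media type to pass on as a hint, or null when the
    // declared type is "text/plain" and auto-detection should decide.
    static String detectionHint(String rawMimetype) {
        String type = (rawMimetype == null ? "" : rawMimetype)
                .split("[;]")[0]            // drop parameters such as "; charset=UTF-8"
                .toLowerCase(Locale.ROOT)   // assumption: locale-independent lower-casing
                .trim();
        return "text/plain".equals(type) ? null : type;
    }

    public static void main(String[] args) {
        System.out.println(detectionHint("Text/HTML; charset=UTF-8"));
        System.out.println(detectionHint("text/plain; charset=ISO-8859-1"));
        System.out.println(detectionHint(null));
    }
}
```

A missing header yields an empty string here, not null, which matches the expression in the diff. Whether an empty hint also triggers full detection depends on the `Content` class and is not shown in this commit.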
Revision: 3316
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3316&view=rev
Author:   binzino
Date:     2010-10-28 00:52:10 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Changed log messages to debug.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java	2010-10-28 00:51:19 UTC (rev 3315)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java	2010-10-28 00:52:10 UTC (rev 3316)
@@ -162,7 +162,7 @@
 
             if ( ! this.urlfilter.isAllowed( uri ) )
               {
-                LOG.info( "Rejecting: " + key + " due to url: " + uri );
+                if ( LOG.isDebugEnabled( ) ) LOG.debug( "Rejecting: " + key + " due to url: " + uri );
 
                 return null;
               }
@@ -201,7 +201,7 @@
 
             if ( ! this.typefilter.isAllowed( value ) )
               {
-                LOG.info( "Rejecting: " + key + " due to type: " + value );
+                if ( LOG.isDebugEnabled( ) ) LOG.debug( "Rejecting: " + key + " due to type: " + value );
 
                 return null;
               }
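The `isDebugEnabled` guard in r3316 means the message string is only concatenated when debug output is switched on. The sketch below shows the same pattern with `java.util.logging` substituted for the plugin's logging API (an assumption made so the example needs no extra libraries; `isLoggable(Level.FINE)` plays the role of `isDebugEnabled`). The class `GuardedDebugLog`, its counter, and its methods are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of a guarded debug log call using java.util.logging.
class GuardedDebugLog {
    private static final Logger LOG = Logger.getLogger("GuardedDebugLog");

    // Counts how often the (potentially costly) message is actually built.
    static int messagesBuilt = 0;

    static void setDebug(boolean on) {
        LOG.setLevel(on ? Level.FINE : Level.INFO);
    }

    private static String describe(String key, String uri) {
        messagesBuilt++;
        return "Rejecting: " + key + " due to url: " + uri;
    }

    static void reject(String key, String uri) {
        // Guard first, so describe() is skipped when debug logging is off.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(describe(key, uri));
        }
    }

    public static void main(String[] args) {
        setDebug(false);
        reject("k1", "http://example.org/a");
        setDebug(true);
        reject("k2", "http://example.org/b");
        System.out.println("messages built: " + messagesBuilt);
    }
}
```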
Revision: 3315
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3315&view=rev
Author:   binzino
Date:     2010-10-28 00:51:19 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Added another alias for text/html: application/xhtml

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/TypeNormalizer.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/TypeNormalizer.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/TypeNormalizer.java	2010-10-28 00:50:30 UTC (rev 3314)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/TypeNormalizer.java	2010-10-28 00:51:19 UTC (rev 3315)
@@ -29,6 +29,7 @@
     { "application/x-pdf", "application/pdf" },
     // HTML aliases.
     { "application/xhtml+xml", "text/html" },
+    { "application/xhtml", "text/html" },
     // MS Word aliases.
     { "application/vnd.ms-word", "application/msword" },
     { "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "application/msword" },
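For readers who do not have TypeNormalizer.java at hand, the { alias, canonical } pairs in the diff above can be applied roughly as sketched below. Only the table entries come from the diff. The class name TypeAliasSketch, the normalize method, and the choice to strip parameters such as "; charset=..." and to lower-case the type are assumptions for illustration; the real TypeNormalizer may behave differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class TypeAliasSketch {
    // { alias, canonical } pairs, copied from the diff in revision 3315.
    private static final String[][] ALIASES = {
        { "application/x-pdf", "application/pdf" },
        // HTML aliases.
        { "application/xhtml+xml", "text/html" },
        { "application/xhtml", "text/html" },
        // MS Word aliases.
        { "application/vnd.ms-word", "application/msword" },
    };

    private static final Map<String, String> ALIAS_MAP = new HashMap<>();
    static {
        for (String[] pair : ALIASES) {
            ALIAS_MAP.put(pair[0], pair[1]);
        }
    }

    /** Maps an alias to its canonical type; unknown types pass through. */
    static String normalize(String type) {
        if (type == null) {
            return null;
        }
        // Assumption: drop parameters ("; charset=utf-8") and ignore case.
        int semi = type.indexOf(';');
        String bare = (semi >= 0 ? type.substring(0, semi) : type)
                .trim().toLowerCase(Locale.ROOT);
        return ALIAS_MAP.getOrDefault(bare, bare);
    }

    public static void main(String[] args) {
        System.out.println(normalize("application/xhtml"));
        System.out.println(normalize("Application/XHTML+XML; charset=UTF-8"));
        System.out.println(normalize("image/png"));
    }
}
```

A lookup map keeps each alias to a single table row, which is why the commit itself is a one-line change.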
From: <bi...@us...> - 2010-10-28 00:50:41
Revision: 3314
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3314&view=rev
Author:   binzino
Date:     2010-10-28 00:50:30 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Enabled mime-type deduction via magic numbers.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml	2010-10-28 00:49:16 UTC (rev 3313)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml	2010-10-28 00:50:30 UTC (rev 3314)
@@ -81,7 +81,7 @@
      the Content-Type that is already in the (W)ARC file. -->
 <property>
   <name>mime.type.magic</name>
-  <value>false</value>
+  <value>true</value>
   <description>Defines if the mime content type detector uses magic resolution.</description>
 </property>
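As background for the mime.type.magic switch: "magic resolution" means looking at the first bytes of the content instead of trusting the declared Content-Type. The sketch below illustrates the idea with three offset-0 signatures that also appear in the tika-mimetypes.xml added in revision 3313 ("%PDF-", "OggS", "{\rtf"). It is not the Tika or Nutch implementation; the class MagicSniffSketch, the detect method, and the fallback to the declared type are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

public class MagicSniffSketch {
    // { signature at offset 0, mime type }; values as listed in tika-mimetypes.xml.
    private static final String[][] MAGIC = {
        { "%PDF-", "application/pdf" },
        { "OggS", "application/ogg" },
        { "{\\rtf", "application/rtf" },   // the five characters {\rtf
    };

    /** Returns the type implied by the leading bytes, else the declared type. */
    static String detect(byte[] head, String declaredType) {
        for (String[] entry : MAGIC) {
            byte[] signature = entry[0].getBytes(StandardCharsets.US_ASCII);
            if (startsWith(head, signature)) {
                return entry[1];
            }
        }
        // Assumption: with no signature match, keep what the (W)ARC record says.
        return declaredType;
    }

    /** True when data begins with all bytes of prefix. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        byte[] html = "<html><body>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "text/html"));
        System.out.println(detect(html, "text/html"));
    }
}
```

This is the case the switch is meant for: a PDF served with a wrong text/html header is typed by its bytes rather than by the header.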
From: <bi...@us...> - 2010-10-28 00:49:25
Revision: 3313
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3313&view=rev
Author:   binzino
Date:     2010-10-28 00:49:16 +0000 (Thu, 28 Oct 2010)

Log Message:
-----------
Copied from Nutch's conf. Added magic number for Microsoft ASF format.

Added Paths:
-----------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/tika-mimetypes.xml

Added: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/tika-mimetypes.xml
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/tika-mimetypes.xml	                        (rev 0)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/tika-mimetypes.xml	2010-10-28 00:49:16 UTC (rev 3313)
@@ -0,0 +1,4044 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+<!--
+  Description: This xml file defines the valid mime types used by Tika.
+  The mime type data within this file is based on information from various
+  sources like Apache Nutch, Apache HTTP Server, the file(1) command, etc.
+--> +<mime-info> + + <mime-type type="application/activemessage"/> + <mime-type type="application/andrew-inset"> + <glob pattern="*.ez"/> + </mime-type> + <mime-type type="application/applefile"/> + <mime-type type="application/applixware"> + <glob pattern="*.aw"/> + </mime-type> + + <mime-type type="application/atom+xml"> + <root-XML localName="feed" namespaceURI="http://purl.org/atom/ns#"/> + <glob pattern="*.atom"/> + </mime-type> + + <mime-type type="application/atomcat+xml"> + <glob pattern="*.atomcat"/> + </mime-type> + <mime-type type="application/atomicmail"/> + <mime-type type="application/atomsvc+xml"> + <glob pattern="*.atomsvc"/> + </mime-type> + <mime-type type="application/auth-policy+xml"/> + <mime-type type="application/batch-smtp"/> + <mime-type type="application/beep+xml"/> + <mime-type type="application/cals-1840"/> + <mime-type type="application/ccxml+xml"> + <glob pattern="*.ccxml"/> + </mime-type> + <mime-type type="application/cea-2018+xml"/> + <mime-type type="application/cellml+xml"/> + <mime-type type="application/cnrp+xml"/> + <mime-type type="application/commonground"/> + <mime-type type="application/conference-info+xml"/> + <mime-type type="application/cpl+xml"/> + <mime-type type="application/csta+xml"/> + <mime-type type="application/cstadata+xml"/> + <mime-type type="application/cu-seeme"> + <glob pattern="*.cu"/> + </mime-type> + <mime-type type="application/cybercash"/> + <mime-type type="application/davmount+xml"> + <glob pattern="*.davmount"/> + </mime-type> + <mime-type type="application/dca-rft"/> + <mime-type type="application/dec-dx"/> + <mime-type type="application/dialog-info+xml"/> + <mime-type type="application/dicom"/> + <mime-type type="application/dns"/> + <mime-type type="application/dvcs"/> + <mime-type type="application/ecmascript"> + <glob pattern="*.ecma"/> + </mime-type> + <mime-type type="application/edi-consent"/> + <mime-type type="application/edi-x12"/> + <mime-type type="application/edifact"/> + <mime-type 
type="application/emma+xml"> + <glob pattern="*.emma"/> + </mime-type> + <mime-type type="application/epp+xml"/> + + <mime-type type="application/epub+zip"> + <acronym>EPUB</acronym> + <comment>Electronic Publication</comment> + <magic priority="50"> + <match value="PK\003\004" type="string" offset="0"> + <match value="mimetypeapplication/epub+zip" type="string" offset="30"/> + </match> + </magic> + <glob pattern="*.epub"/> + </mime-type> + + <mime-type type="application/eshop"/> + <mime-type type="application/example"/> + <mime-type type="application/fastinfoset"/> + <mime-type type="application/fastsoap"/> + <mime-type type="application/fits"/> + <mime-type type="application/font-tdpfr"> + <glob pattern="*.pfr"/> + </mime-type> + <mime-type type="application/h224"/> + <mime-type type="application/http"/> + <mime-type type="application/hyperstudio"> + <glob pattern="*.stk"/> + </mime-type> + <mime-type type="application/ibe-key-request+xml"/> + <mime-type type="application/ibe-pkg-reply+xml"/> + <mime-type type="application/ibe-pp-data"/> + <mime-type type="application/iges"/> + <mime-type type="application/im-iscomposing+xml"/> + <mime-type type="application/index"/> + <mime-type type="application/index.cmd"/> + <mime-type type="application/index.obj"/> + <mime-type type="application/index.response"/> + <mime-type type="application/index.vnd"/> + <mime-type type="application/iotp"/> + <mime-type type="application/ipp"/> + <mime-type type="application/isup"/> + + <mime-type type="application/java-archive"> + <sub-class-of type="application/zip"/> + <glob pattern="*.jar"/> + </mime-type> + + <mime-type type="application/java-serialized-object"> + <glob pattern="*.ser"/> + </mime-type> + + <mime-type type="application/javascript"> + <sub-class-of type="text/plain"/> + <glob pattern="*.js"/> + </mime-type> + + <mime-type type="application/json"> + <sub-class-of type="application/javascript"/> + <glob pattern="*.json"/> + </mime-type> + + <mime-type 
type="application/java-vm"> + <magic priority="40"> + <match value="0xcafebabe" type="string" offset="0" /> + </magic> + <glob pattern="*.class"/> + </mime-type> + + <mime-type type="application/kpml-request+xml"/> + <mime-type type="application/kpml-response+xml"/> + <mime-type type="application/lost+xml"> + <glob pattern="*.lostxml"/> + </mime-type> + + <mime-type type="application/mac-binhex40"> + <alias type="application/mac-binhex"/> + <alias type="application/binhex"/> + <magic priority="50"> + <match value="must\ be\ converted\ with\ BinHex" type="string" offset="11"/> + </magic> + <glob pattern="*.hqx"/> + </mime-type> + + <mime-type type="application/mac-compactpro"> + <glob pattern="*.cpt"/> + </mime-type> + + <mime-type type="application/macwriteii"/> + <mime-type type="application/marc"> + <glob pattern="*.mrc"/> + </mime-type> + <mime-type type="application/mathematica"> + <glob pattern="*.ma"/> + <glob pattern="*.nb"/> + <glob pattern="*.mb"/> + </mime-type> + <mime-type type="application/mathml+xml"> + <glob pattern="*.mathml"/> + </mime-type> + <mime-type type="application/mbms-associated-procedure-description+xml"/> + <mime-type type="application/mbms-deregister+xml"/> + <mime-type type="application/mbms-envelope+xml"/> + <mime-type type="application/mbms-msk+xml"/> + <mime-type type="application/mbms-msk-response+xml"/> + <mime-type type="application/mbms-protection-description+xml"/> + <mime-type type="application/mbms-reception-report+xml"/> + <mime-type type="application/mbms-register+xml"/> + <mime-type type="application/mbms-register-response+xml"/> + <mime-type type="application/mbms-user-service-description+xml"/> + <mime-type type="application/mbox"> + <sub-class-of type="text/plain"/> + <glob pattern="*.mbox"/> + </mime-type> + <mime-type type="application/media_control+xml"/> + <mime-type type="application/mediaservercontrol+xml"> + <glob pattern="*.mscml"/> + </mime-type> + <mime-type type="application/mikey"/> + <mime-type 
type="application/moss-keys"/> + <mime-type type="application/moss-signature"/> + <mime-type type="application/mosskey-data"/> + <mime-type type="application/mosskey-request"/> + <mime-type type="application/mp4"> + <glob pattern="*.mp4s"/> + </mime-type> + <mime-type type="application/mpeg4-generic"/> + <mime-type type="application/mpeg4-iod"/> + <mime-type type="application/mpeg4-iod-xmt"/> + + <!-- http://www.iana.org/assignments/media-types/application/msword --> + <mime-type type="application/msword"> + <alias type="application/vnd.ms-word"/> + <comment>Microsoft Word Document</comment> + <magic priority="50"> + <match value="Microsoft\ Word\ 6.0\ Document" type="string" offset="2080"/> + <match value="Documento\ Microsoft\ Word\ 6" type="string" offset="2080"/> + <match value="MSWordDoc" type="string" offset="2112"/> + <match value="0x31be0000" type="big32" offset="0"/> + <match value="PO^Q`" type="string" offset="0"/> + <match value="\376\067\0\043" type="string" offset="0"/> + <match value="\333\245-\0\0\0" type="string" offset="0"/> + <match value="\354\245\301" type="string" offset="512"/> + <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/> + <match value="\224\246\056" type="string" offset="0"/> + <match value="R\0o\0o\0t\0\ \0E\0n\0t\0r\0y" type="string" offset="512"/> + </magic> + <glob pattern="*.doc"/> + <glob pattern="*.dot"/> + <sub-class-of type="application/x-tika-msoffice"/> + </mime-type> + + <mime-type type="application/mxf"> + <glob pattern="*.mxf"/> + </mime-type> + <mime-type type="application/nasdata"/> + <mime-type type="application/news-checkgroups"/> + <mime-type type="application/news-groupinfo"/> + <mime-type type="application/news-transmission"/> + <mime-type type="application/nss"/> + <mime-type type="application/ocsp-request"/> + <mime-type type="application/ocsp-response"/> + + <mime-type type="application/octet-stream"> + <magic priority="50"> + <match value="#\ This\ is\ a\ shell\ archive" type="string" 
offset="10"/> + <match value="\037\036" type="string" offset="0"/> + <match value="017437" type="host16" offset="0"/> + <match value="0x1fff" type="host16" offset="0"/> + <match value="\377\037" type="string" offset="0"/> + <match value="0145405" type="host16" offset="0"/> + </magic> + <glob pattern="*.bin"/> + <glob pattern="*.dms"/> + <glob pattern="*.lha"/> + <glob pattern="*.lrf"/> + <glob pattern="*.lzh"/> + <glob pattern="*.so"/> + <glob pattern="*.iso"/> + <glob pattern="*.dmg"/> + <glob pattern="*.dist"/> + <glob pattern="*.distz"/> + <glob pattern="*.pkg"/> + <glob pattern="*.bpk"/> + <glob pattern="*.dump"/> + <glob pattern="*.elc"/> + <glob pattern="*.deploy"/> + </mime-type> + + <mime-type type="application/oda"> + <glob pattern="*.oda"/> + </mime-type> + <mime-type type="application/oebps-package+xml"> + <glob pattern="*.opf"/> + </mime-type> + + <mime-type type="application/ogg"> + <alias type="application/x-ogg"/> + <magic priority="50"> + <match value="OggS" type="string" offset="0"/> + </magic> + <glob pattern="*.ogx"/> + </mime-type> + + <mime-type type="application/onenote"> + <glob pattern="*.onetoc"/> + <glob pattern="*.onetoc2"/> + <glob pattern="*.onetmp"/> + <glob pattern="*.onepkg"/> + </mime-type> + <mime-type type="application/parityfec"/> + <mime-type type="application/patch-ops-error+xml"> + <glob pattern="*.xer"/> + </mime-type> + + <mime-type type="application/pdf"> + <alias type="application/x-pdf"/> + <acronym>PDF</acronym> + <comment>Portable Document Format</comment> + <magic priority="50"> + <match value="%PDF-" type="string" offset="0"/> + </magic> + <glob pattern="*.pdf"/> + </mime-type> + + <mime-type type="application/pgp-encrypted"> + <glob pattern="*.pgp"/> + </mime-type> + <mime-type type="application/pgp-keys"/> + <mime-type type="application/pgp-signature"> + <glob pattern="*.asc"/> + <glob pattern="*.sig"/> + </mime-type> + <mime-type type="application/pics-rules"> + <glob pattern="*.prf"/> + </mime-type> + <mime-type 
type="application/pidf+xml"/> + <mime-type type="application/pidf-diff+xml"/> + <mime-type type="application/pkcs10"> + <glob pattern="*.p10"/> + </mime-type> + <mime-type type="application/pkcs7-mime"> + <glob pattern="*.p7m"/> + <glob pattern="*.p7c"/> + </mime-type> + <mime-type type="application/pkcs7-signature"> + <glob pattern="*.p7s"/> + </mime-type> + <mime-type type="application/pkix-cert"> + <glob pattern="*.cer"/> + </mime-type> + <mime-type type="application/pkix-crl"> + <glob pattern="*.crl"/> + </mime-type> + <mime-type type="application/pkix-pkipath"> + <glob pattern="*.pkipath"/> + </mime-type> + <mime-type type="application/pkixcmp"> + <glob pattern="*.pki"/> + </mime-type> + <mime-type type="application/pls+xml"> + <glob pattern="*.pls"/> + </mime-type> + <mime-type type="application/poc-settings+xml"/> + + <mime-type type="application/postscript"> + <comment>PostScript</comment> + <magic priority="50"> + <match value="%!" type="string" offset="0" /> + <match value="\004%!" 
type="string" offset="0" /> + <!-- Windows format EPS --> + <match value="0xc5d0d3c6" type="string" offset="0"/> + </magic> + <glob pattern="*.ai"/> + <glob pattern="*.ps"/> + <glob pattern="*.eps"/> + <glob pattern="*.epsf"/> + <glob pattern="*.epsi"/> + </mime-type> + + <mime-type type="application/prs.alvestrand.titrax-sheet"/> + <mime-type type="application/prs.cww"> + <glob pattern="*.cww"/> + </mime-type> + <mime-type type="application/prs.nprend"/> + <mime-type type="application/prs.plucker"/> + <mime-type type="application/qsig"/> + + <mime-type type="application/rdf+xml"> + <root-XML localName="RDF"/> + <root-XML localName="RDF" namespaceURI="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/> + <sub-class-of type="application/xml"/> + <acronym>RDF/XML</acronym> + <comment>XML syntax for RDF graphs</comment> + <glob pattern="*.rdf"/> + <glob pattern="*.owl"/> + <glob pattern="^rdf$" isregex="true"/> + <glob pattern="^owl$" isregex="true"/> + </mime-type> + + <mime-type type="application/reginfo+xml"> + <glob pattern="*.rif"/> + </mime-type> + <mime-type type="application/relax-ng-compact-syntax"> + <sub-class-of type="text/plain"/> + <glob pattern="*.rnc"/> + </mime-type> + <mime-type type="application/remote-printing"/> + <mime-type type="application/resource-lists+xml"> + <glob pattern="*.rl"/> + </mime-type> + <mime-type type="application/resource-lists-diff+xml"> + <glob pattern="*.rld"/> + </mime-type> + <mime-type type="application/riscos"/> + <mime-type type="application/rlmi+xml"/> + <mime-type type="application/rls-services+xml"> + <glob pattern="*.rs"/> + </mime-type> + <mime-type type="application/rsd+xml"> + <glob pattern="*.rsd"/> + </mime-type> + + <mime-type type="application/rss+xml"> + <alias type="text/rss"/> + <root-XML localName="rss"/> + <root-XML namespaceURI="http://purl.org/rss/1.0/"/> + <glob pattern="*.rss"/> + </mime-type> + + <mime-type type="application/rtf"> + <alias type="text/rtf"/> + <magic priority="50"> + <match value="{\\rtf" 
type="string" offset="0"/> + </magic> + <glob pattern="*.rtf"/> + <sub-class-of type="text/plain"/> + </mime-type> + + <mime-type type="application/rtx"/> + <mime-type type="application/samlassertion+xml"/> + <mime-type type="application/samlmetadata+xml"/> + <mime-type type="application/sbml+xml"> + <glob pattern="*.sbml"/> + </mime-type> + <mime-type type="application/scvp-cv-request"> + <glob pattern="*.scq"/> + </mime-type> + <mime-type type="application/scvp-cv-response"> + <glob pattern="*.scs"/> + </mime-type> + <mime-type type="application/scvp-vp-request"> + <glob pattern="*.spq"/> + </mime-type> + <mime-type type="application/scvp-vp-response"> + <glob pattern="*.spp"/> + </mime-type> + <mime-type type="application/sdp"> + <glob pattern="*.sdp"/> + </mime-type> + <mime-type type="application/set-payment"/> + <mime-type type="application/set-payment-initiation"> + <glob pattern="*.setpay"/> + </mime-type> + <mime-type type="application/set-registration"/> + <mime-type type="application/set-registration-initiation"> + <glob pattern="*.setreg"/> + </mime-type> + <mime-type type="application/sgml"/> + <mime-type type="application/sgml-open-catalog"/> + <mime-type type="application/shf+xml"> + <glob pattern="*.shf"/> + </mime-type> + <mime-type type="application/sieve"/> + <mime-type type="application/simple-filter+xml"/> + <mime-type type="application/simple-message-summary"/> + <mime-type type="application/simplesymbolcontainer"/> + <mime-type type="application/slate"/> + <mime-type type="application/smil"/> + <mime-type type="application/smil+xml"> + <glob pattern="*.smi"/> + <glob pattern="*.smil"/> + </mime-type> + <mime-type type="application/soap+fastinfoset"/> + <mime-type type="application/soap+xml"/> + <mime-type type="application/sparql-query"> + <glob pattern="*.rq"/> + </mime-type> + <mime-type type="application/sparql-results+xml"> + <glob pattern="*.srx"/> + </mime-type> + <mime-type type="application/spirits-event+xml"/> + <mime-type 
type="application/srgs"> + <glob pattern="*.gram"/> + </mime-type> + <mime-type type="application/srgs+xml"> + <glob pattern="*.grxml"/> + </mime-type> + <mime-type type="application/ssml+xml"> + <glob pattern="*.ssml"/> + </mime-type> + <mime-type type="application/timestamp-query"/> + <mime-type type="application/timestamp-reply"/> + <mime-type type="application/tve-trigger"/> + <mime-type type="application/ulpfec"/> + <mime-type type="application/vemmi"/> + <mime-type type="application/vividence.scriptfile"/> + <mime-type type="application/vnd.3gpp.bsf+xml"/> + <mime-type type="application/vnd.3gpp.pic-bw-large"> + <glob pattern="*.plb"/> + </mime-type> + <mime-type type="application/vnd.3gpp.pic-bw-small"> + <glob pattern="*.psb"/> + </mime-type> + <mime-type type="application/vnd.3gpp.pic-bw-var"> + <glob pattern="*.pvb"/> + </mime-type> + <mime-type type="application/vnd.3gpp.sms"/> + <mime-type type="application/vnd.3gpp2.bcmcsinfo+xml"/> + <mime-type type="application/vnd.3gpp2.sms"/> + <mime-type type="application/vnd.3gpp2.tcap"> + <glob pattern="*.tcap"/> + </mime-type> + <mime-type type="application/vnd.3m.post-it-notes"> + <glob pattern="*.pwn"/> + </mime-type> + <mime-type type="application/vnd.accpac.simply.aso"> + <glob pattern="*.aso"/> + </mime-type> + <mime-type type="application/vnd.accpac.simply.imp"> + <glob pattern="*.imp"/> + </mime-type> + <mime-type type="application/vnd.acucobol"> + <glob pattern="*.acu"/> + </mime-type> + <mime-type type="application/vnd.acucorp"> + <glob pattern="*.atc"/> + <glob pattern="*.acutc"/> + </mime-type> + <mime-type type="application/vnd.adobe.air-application-installer-package+zip"> + <glob pattern="*.air"/> + </mime-type> + <mime-type type="application/vnd.adobe.xdp+xml"> + <glob pattern="*.xdp"/> + </mime-type> + <mime-type type="application/vnd.adobe.xfdf"> + <glob pattern="*.xfdf"/> + </mime-type> + <mime-type type="application/vnd.aether.imp"/> + <mime-type type="application/vnd.airzip.filesecure.azf"> + 
<glob pattern="*.azf"/> + </mime-type> + <mime-type type="application/vnd.airzip.filesecure.azs"> + <glob pattern="*.azs"/> + </mime-type> + <mime-type type="application/vnd.amazon.ebook"> + <glob pattern="*.azw"/> + </mime-type> + <mime-type type="application/vnd.americandynamics.acc"> + <glob pattern="*.acc"/> + </mime-type> + <mime-type type="application/vnd.amiga.ami"> + <glob pattern="*.ami"/> + </mime-type> + <mime-type type="application/vnd.android.package-archive"> + <glob pattern="*.apk"/> + </mime-type> + <mime-type type="application/vnd.anser-web-certificate-issue-initiation"> + <glob pattern="*.cii"/> + </mime-type> + <mime-type type="application/vnd.anser-web-funds-transfer-initiation"> + <glob pattern="*.fti"/> + </mime-type> + <mime-type type="application/vnd.antix.game-component"> + <glob pattern="*.atx"/> + </mime-type> + <mime-type type="application/vnd.apple.installer+xml"> + <glob pattern="*.mpkg"/> + </mime-type> + <mime-type type="application/vnd.arastra.swi"> + <glob pattern="*.swi"/> + </mime-type> + <mime-type type="application/vnd.audiograph"> + <glob pattern="*.aep"/> + </mime-type> + <mime-type type="application/vnd.autopackage"/> + <mime-type type="application/vnd.avistar+xml"/> + <mime-type type="application/vnd.blueice.multipass"> + <glob pattern="*.mpm"/> + </mime-type> + <mime-type type="application/vnd.bluetooth.ep.oob"/> + <mime-type type="application/vnd.bmi"> + <glob pattern="*.bmi"/> + </mime-type> + <mime-type type="application/vnd.businessobjects"> + <glob pattern="*.rep"/> + </mime-type> + <mime-type type="application/vnd.cab-jscript"/> + <mime-type type="application/vnd.canon-cpdl"/> + <mime-type type="application/vnd.canon-lips"/> + <mime-type type="application/vnd.cendio.thinlinc.clientconf"/> + <mime-type type="application/vnd.chemdraw+xml"> + <glob pattern="*.cdxml"/> + </mime-type> + <mime-type type="application/vnd.chipnuts.karaoke-mmd"> + <glob pattern="*.mmd"/> + </mime-type> + <mime-type 
type="application/vnd.cinderella"> + <glob pattern="*.cdy"/> + </mime-type> + <mime-type type="application/vnd.cirpack.isdn-ext"/> + <mime-type type="application/vnd.claymore"> + <glob pattern="*.cla"/> + </mime-type> + <mime-type type="application/vnd.clonk.c4group"> + <glob pattern="*.c4g"/> + <glob pattern="*.c4d"/> + <glob pattern="*.c4f"/> + <glob pattern="*.c4p"/> + <glob pattern="*.c4u"/> + </mime-type> + <mime-type type="application/vnd.commerce-battelle"/> + <mime-type type="application/vnd.commonspace"> + <glob pattern="*.csp"/> + </mime-type> + <mime-type type="application/vnd.contact.cmsg"> + <glob pattern="*.cdbcmsg"/> + </mime-type> + <mime-type type="application/vnd.cosmocaller"> + <glob pattern="*.cmc"/> + </mime-type> + <mime-type type="application/vnd.crick.clicker"> + <glob pattern="*.clkx"/> + </mime-type> + <mime-type type="application/vnd.crick.clicker.keyboard"> + <glob pattern="*.clkk"/> + </mime-type> + <mime-type type="application/vnd.crick.clicker.palette"> + <glob pattern="*.clkp"/> + </mime-type> + <mime-type type="application/vnd.crick.clicker.template"> + <glob pattern="*.clkt"/> + </mime-type> + <mime-type type="application/vnd.crick.clicker.wordbank"> + <glob pattern="*.clkw"/> + </mime-type> + <mime-type type="application/vnd.criticaltools.wbs+xml"> + <glob pattern="*.wbs"/> + </mime-type> + <mime-type type="application/vnd.ctc-posml"> + <glob pattern="*.pml"/> + </mime-type> + <mime-type type="application/vnd.ctct.ws+xml"/> + <mime-type type="application/vnd.cups-pdf"/> + <mime-type type="application/vnd.cups-postscript"/> + <mime-type type="application/vnd.cups-ppd"> + <glob pattern="*.ppd"/> + </mime-type> + <mime-type type="application/vnd.cups-raster"/> + <mime-type type="application/vnd.cups-raw"/> + <mime-type type="application/vnd.curl.car"> + <glob pattern="*.car"/> + </mime-type> + <mime-type type="application/vnd.curl.pcurl"> + <glob pattern="*.pcurl"/> + </mime-type> + <mime-type type="application/vnd.cybank"/> + 
<mime-type type="application/vnd.data-vision.rdz"> + <glob pattern="*.rdz"/> + </mime-type> + <mime-type type="application/vnd.denovo.fcselayout-link"> + <glob pattern="*.fe_launch"/> + </mime-type> + <mime-type type="application/vnd.dir-bi.plate-dl-nosuffix"/> + <mime-type type="application/vnd.dna"> + <glob pattern="*.dna"/> + </mime-type> + <mime-type type="application/vnd.dolby.mlp"> + <glob pattern="*.mlp"/> + </mime-type> + <mime-type type="application/vnd.dolby.mobile.1"/> + <mime-type type="application/vnd.dolby.mobile.2"/> + <mime-type type="application/vnd.dpgraph"> + <glob pattern="*.dpg"/> + </mime-type> + <mime-type type="application/vnd.dreamfactory"> + <glob pattern="*.dfac"/> + </mime-type> + <mime-type type="application/vnd.dvb.esgcontainer"/> + <mime-type type="application/vnd.dvb.ipdcdftnotifaccess"/> + <mime-type type="application/vnd.dvb.ipdcesgaccess"/> + <mime-type type="application/vnd.dvb.ipdcroaming"/> + <mime-type type="application/vnd.dvb.iptv.alfec-base"/> + <mime-type type="application/vnd.dvb.iptv.alfec-enhancement"/> + <mime-type type="application/vnd.dvb.notif-aggregate-root+xml"/> + <mime-type type="application/vnd.dvb.notif-container+xml"/> + <mime-type type="application/vnd.dvb.notif-generic+xml"/> + <mime-type type="application/vnd.dvb.notif-ia-msglist+xml"/> + <mime-type type="application/vnd.dvb.notif-ia-registration-request+xml"/> + <mime-type type="application/vnd.dvb.notif-ia-registration-response+xml"/> + <mime-type type="application/vnd.dvb.notif-init+xml"/> + <mime-type type="application/vnd.dxr"/> + <mime-type type="application/vnd.dynageo"> + <glob pattern="*.geo"/> + </mime-type> + <mime-type type="application/vnd.ecdis-update"/> + <mime-type type="application/vnd.ecowin.chart"> + <glob pattern="*.mag"/> + </mime-type> + <mime-type type="application/vnd.ecowin.filerequest"/> + <mime-type type="application/vnd.ecowin.fileupdate"/> + <mime-type type="application/vnd.ecowin.series"/> + <mime-type 
type="application/vnd.ecowin.seriesrequest"/> + <mime-type type="application/vnd.ecowin.seriesupdate"/> + <mime-type type="application/vnd.emclient.accessrequest+xml"/> + <mime-type type="application/vnd.enliven"> + <glob pattern="*.nml"/> + </mime-type> + <mime-type type="application/vnd.epson.esf"> + <glob pattern="*.esf"/> + </mime-type> + <mime-type type="application/vnd.epson.msf"> + <glob pattern="*.msf"/> + </mime-type> + <mime-type type="application/vnd.epson.quickanime"> + <glob pattern="*.qam"/> + </mime-type> + <mime-type type="application/vnd.epson.salt"> + <glob pattern="*.slt"/> + </mime-type> + <mime-type type="application/vnd.epson.ssf"> + <glob pattern="*.ssf"/> + </mime-type> + <mime-type type="application/vnd.ericsson.quickcall"/> + <mime-type type="application/vnd.eszigno3+xml"> + <glob pattern="*.es3"/> + <glob pattern="*.et3"/> + </mime-type> + <mime-type type="application/vnd.etsi.aoc+xml"/> + <mime-type type="application/vnd.etsi.cug+xml"/> + <mime-type type="application/vnd.etsi.iptvcommand+xml"/> + <mime-type type="application/vnd.etsi.iptvdiscovery+xml"/> + <mime-type type="application/vnd.etsi.iptvprofile+xml"/> + <mime-type type="application/vnd.etsi.iptvsad-bc+xml"/> + <mime-type type="application/vnd.etsi.iptvsad-cod+xml"/> + <mime-type type="application/vnd.etsi.iptvsad-npvr+xml"/> + <mime-type type="application/vnd.etsi.iptvueprofile+xml"/> + <mime-type type="application/vnd.etsi.mcid+xml"/> + <mime-type type="application/vnd.etsi.sci+xml"/> + <mime-type type="application/vnd.etsi.simservs+xml"/> + <mime-type type="application/vnd.eudora.data"/> + <mime-type type="application/vnd.ezpix-album"> + <glob pattern="*.ez2"/> + </mime-type> + <mime-type type="application/vnd.ezpix-package"> + <glob pattern="*.ez3"/> + </mime-type> + <mime-type type="application/vnd.f-secure.mobile"/> + <mime-type type="application/vnd.fdf"> + <glob pattern="*.fdf"/> + </mime-type> + <mime-type type="application/vnd.fdsn.mseed"> + <glob pattern="*.mseed"/> 
+ </mime-type> + <mime-type type="application/vnd.fdsn.seed"> + <glob pattern="*.seed"/> + <glob pattern="*.dataless"/> + </mime-type> + <mime-type type="application/vnd.ffsns"/> + <mime-type type="application/vnd.fints"/> + <mime-type type="application/vnd.flographit"> + <glob pattern="*.gph"/> + </mime-type> + <mime-type type="application/vnd.fluxtime.clip"> + <glob pattern="*.ftc"/> + </mime-type> + <mime-type type="application/vnd.font-fontforge-sfd"/> + <mime-type type="application/vnd.framemaker"> + <glob pattern="*.fm"/> + <glob pattern="*.frame"/> + <glob pattern="*.maker"/> + <glob pattern="*.book"/> + </mime-type> + <mime-type type="application/vnd.frogans.fnc"> + <glob pattern="*.fnc"/> + </mime-type> + <mime-type type="application/vnd.frogans.ltf"> + <glob pattern="*.ltf"/> + </mime-type> + <mime-type type="application/vnd.fsc.weblaunch"> + <glob pattern="*.fsc"/> + </mime-type> + <mime-type type="application/vnd.fujitsu.oasys"> + <glob pattern="*.oas"/> + </mime-type> + <mime-type type="application/vnd.fujitsu.oasys2"> + <glob pattern="*.oa2"/> + </mime-type> + <mime-type type="application/vnd.fujitsu.oasys3"> + <glob pattern="*.oa3"/> + </mime-type> + <mime-type type="application/vnd.fujitsu.oasysgp"> + <glob pattern="*.fg5"/> + </mime-type> + <mime-type type="application/vnd.fujitsu.oasysprs"> + <glob pattern="*.bh2"/> + </mime-type> + <mime-type type="application/vnd.fujixerox.art-ex"/> + <mime-type type="application/vnd.fujixerox.art4"/> + <mime-type type="application/vnd.fujixerox.hbpl"/> + <mime-type type="application/vnd.fujixerox.ddd"> + <glob pattern="*.ddd"/> + </mime-type> + <mime-type type="application/vnd.fujixerox.docuworks"> + <glob pattern="*.xdw"/> + </mime-type> + <mime-type type="application/vnd.fujixerox.docuworks.binder"> + <glob pattern="*.xbd"/> + </mime-type> + <mime-type type="application/vnd.fut-misnet"/> + <mime-type type="application/vnd.fuzzysheet"> + <glob pattern="*.fzs"/> + </mime-type> + <mime-type 
type="application/vnd.genomatix.tuxedo"> + <glob pattern="*.txd"/> + </mime-type> + <mime-type type="application/vnd.geogebra.file"> + <glob pattern="*.ggb"/> + </mime-type> + <mime-type type="application/vnd.geogebra.tool"> + <glob pattern="*.ggt"/> + </mime-type> + <mime-type type="application/vnd.geometry-explorer"> + <glob pattern="*.gex"/> + <glob pattern="*.gre"/> + </mime-type> + <mime-type type="application/vnd.gmx"> + <glob pattern="*.gmx"/> + </mime-type> + <mime-type type="application/vnd.google-earth.kml+xml"> + <glob pattern="*.kml"/> + </mime-type> + <mime-type type="application/vnd.google-earth.kmz"> + <glob pattern="*.kmz"/> + </mime-type> + <mime-type type="application/vnd.grafeq"> + <glob pattern="*.gqf"/> + <glob pattern="*.gqs"/> + </mime-type> + <mime-type type="application/vnd.gridmp"/> + <mime-type type="application/vnd.groove-account"> + <glob pattern="*.gac"/> + </mime-type> + <mime-type type="application/vnd.groove-help"> + <glob pattern="*.ghf"/> + </mime-type> + <mime-type type="application/vnd.groove-identity-message"> + <glob pattern="*.gim"/> + </mime-type> + <mime-type type="application/vnd.groove-injector"> + <glob pattern="*.grv"/> + </mime-type> + <mime-type type="application/vnd.groove-tool-message"> + <glob pattern="*.gtm"/> + </mime-type> + <mime-type type="application/vnd.groove-tool-template"> + <glob pattern="*.tpl"/> + </mime-type> + <mime-type type="application/vnd.groove-vcard"> + <glob pattern="*.vcg"/> + </mime-type> + <mime-type type="application/vnd.handheld-entertainment+xml"> + <glob pattern="*.zmm"/> + </mime-type> + <mime-type type="application/vnd.hbci"> + <glob pattern="*.hbci"/> + </mime-type> + <mime-type type="application/vnd.hcl-bireports"/> + <mime-type type="application/vnd.hhe.lesson-player"> + <glob pattern="*.les"/> + </mime-type> + <mime-type type="application/vnd.hp-hpgl"> + <glob pattern="*.hpgl"/> + </mime-type> + <mime-type type="application/vnd.hp-hpid"> + <glob pattern="*.hpid"/> + </mime-type> + 
<mime-type type="application/vnd.hp-hps"> + <glob pattern="*.hps"/> + </mime-type> + <mime-type type="application/vnd.hp-jlyt"> + <glob pattern="*.jlt"/> + </mime-type> + <mime-type type="application/vnd.hp-pcl"> + <glob pattern="*.pcl"/> + </mime-type> + <mime-type type="application/vnd.hp-pclxl"> + <glob pattern="*.pclxl"/> + </mime-type> + <mime-type type="application/vnd.httphone"/> + <mime-type type="application/vnd.hydrostatix.sof-data"> + <glob pattern="*.sfd-hdstx"/> + </mime-type> + <mime-type type="application/vnd.hzn-3d-crossword"> + <glob pattern="*.x3d"/> + </mime-type> + <mime-type type="application/vnd.ibm.afplinedata"/> + <mime-type type="application/vnd.ibm.electronic-media"/> + <mime-type type="application/vnd.ibm.minipay"> + <glob pattern="*.mpy"/> + </mime-type> + <mime-type type="application/vnd.ibm.modcap"> + <glob pattern="*.afp"/> + <glob pattern="*.listafp"/> + <glob pattern="*.list3820"/> + </mime-type> + <mime-type type="application/vnd.ibm.rights-management"> + <glob pattern="*.irm"/> + </mime-type> + <mime-type type="application/vnd.ibm.secure-container"> + <glob pattern="*.sc"/> + </mime-type> + <mime-type type="application/vnd.iccprofile"> + <glob pattern="*.icc"/> + <glob pattern="*.icm"/> + </mime-type> + <mime-type type="application/vnd.igloader"> + <glob pattern="*.igl"/> + </mime-type> + <mime-type type="application/vnd.immervision-ivp"> + <glob pattern="*.ivp"/> + </mime-type> + <mime-type type="application/vnd.immervision-ivu"> + <glob pattern="*.ivu"/> + </mime-type> + <mime-type type="application/vnd.informedcontrol.rms+xml"/> + <mime-type type="application/vnd.informix-visionary"/> + <mime-type type="application/vnd.intercon.formnet"> + <glob pattern="*.xpw"/> + <glob pattern="*.xpx"/> + </mime-type> + <mime-type type="application/vnd.intertrust.digibox"/> + <mime-type type="application/vnd.intertrust.nncp"/> + <mime-type type="application/vnd.intu.qbo"> + <glob pattern="*.qbo"/> + </mime-type> + <mime-type 
type="application/vnd.intu.qfx"> + <glob pattern="*.qfx"/> + </mime-type> + <mime-type type="application/vnd.iptc.g2.conceptitem+xml"/> + <mime-type type="application/vnd.iptc.g2.knowledgeitem+xml"/> + <mime-type type="application/vnd.iptc.g2.newsitem+xml"/> + <mime-type type="application/vnd.iptc.g2.packageitem+xml"/> + <mime-type type="application/vnd.ipunplugged.rcprofile"> + <glob pattern="*.rcprofile"/> + </mime-type> + <mime-type type="application/vnd.irepository.package+xml"> + <glob pattern="*.irp"/> + </mime-type> + <mime-type type="application/vnd.is-xpr"> + <glob pattern="*.xpr"/> + </mime-type> + <mime-type type="application/vnd.jam"> + <glob pattern="*.jam"/> + </mime-type> + <mime-type type="application/vnd.japannet-directory-service"/> + <mime-type type="application/vnd.japannet-jpnstore-wakeup"/> + <mime-type type="application/vnd.japannet-payment-wakeup"/> + <mime-type type="application/vnd.japannet-registration"/> + <mime-type type="application/vnd.japannet-registration-wakeup"/> + <mime-type type="application/vnd.japannet-setstore-wakeup"/> + <mime-type type="application/vnd.japannet-verification"/> + <mime-type type="application/vnd.japannet-verification-wakeup"/> + <mime-type type="application/vnd.jcp.javame.midlet-rms"> + <glob pattern="*.rms"/> + </mime-type> + <mime-type type="application/vnd.jisp"> + <glob pattern="*.jisp"/> + </mime-type> + <mime-type type="application/vnd.joost.joda-archive"> + <glob pattern="*.joda"/> + </mime-type> + <mime-type type="application/vnd.kahootz"> + <glob pattern="*.ktz"/> + <glob pattern="*.ktr"/> + </mime-type> + <mime-type type="application/vnd.kde.karbon"> + <glob pattern="*.karbon"/> + </mime-type> + <mime-type type="application/vnd.kde.kchart"> + <glob pattern="*.chrt"/> + </mime-type> + <mime-type type="application/vnd.kde.kformula"> + <glob pattern="*.kfo"/> + </mime-type> + <mime-type type="application/vnd.kde.kivio"> + <glob pattern="*.flw"/> + </mime-type> + <mime-type 
type="application/vnd.kde.kontour"> + <glob pattern="*.kon"/> + </mime-type> + <mime-type type="application/vnd.kde.kpresenter"> + <glob pattern="*.kpr"/> + <glob pattern="*.kpt"/> + </mime-type> + <mime-type type="application/vnd.kde.kspread"> + <glob pattern="*.ksp"/> + </mime-type> + <mime-type type="application/vnd.kde.kword"> + <glob pattern="*.kwd"/> + <glob pattern="*.kwt"/> + </mime-type> + <mime-type type="application/vnd.kenameaapp"> + <glob pattern="*.htke"/> + </mime-type> + <mime-type type="application/vnd.kidspiration"> + <glob pattern="*.kia"/> + </mime-type> + <mime-type type="application/vnd.kinar"> + <glob pattern="*.kne"/> + <glob pattern="*.knp"/> + </mime-type> + <mime-type type="application/vnd.koan"> + <alias type="application/x-koan"/> + <_comment>SSEYO Koan File</_comment> + <glob pattern="*.skp"/> + <glob pattern="*.skd"/> + <glob pattern="*.skt"/> + <glob pattern="*.skm"/> + </mime-type> + <mime-type type="application/vnd.kodak-descriptor"> + <glob pattern="*.sse"/> + </mime-type> + <mime-type type="application/vnd.liberty-request+xml"/> + <mime-type type="application/vnd.llamagraphics.life-balance.desktop"> + <glob pattern="*.lbd"/> + </mime-type> + <mime-type type="application/vnd.llamagraphics.life-balance.exchange+xml"> + <glob pattern="*.lbe"/> + </mime-type> + <mime-type type="application/vnd.lotus-1-2-3"> + <glob pattern="*.123"/> + </mime-type> + <mime-type type="application/vnd.lotus-approach"> + <glob pattern="*.apr"/> + </mime-type> + <mime-type type="application/vnd.lotus-freelance"> + <glob pattern="*.pre"/> + </mime-type> + <mime-type type="application/vnd.lotus-notes"> + <glob pattern="*.nsf"/> + </mime-type> + <mime-type type="application/vnd.lotus-organizer"> + <glob pattern="*.org"/> + </mime-type> + <mime-type type="application/vnd.lotus-screencam"> + <glob pattern="*.scm"/> + </mime-type> + + <mime-type type="application/vnd.lotus-wordpro"> + <magic priority="50"> + <match value="WordPro\0" type="string" offset="0" /> 
+ <match value="WordPro\r\373" type="string" offset="0" /> + </magic> + <glob pattern="*.lwp"/> + </mime-type> + + <mime-type type="application/vnd.macports.portpkg"> + <glob pattern="*.portpkg"/> + </mime-type> + <mime-type type="application/vnd.marlin.drm.actiontoken+xml"/> + <mime-type type="application/vnd.marlin.drm.conftoken+xml"/> + <mime-type type="application/vnd.marlin.drm.license+xml"/> + <mime-type type="application/vnd.marlin.drm.mdcf"/> + <mime-type type="application/vnd.mcd"> + <glob pattern="*.mcd"/> + </mime-type> + <mime-type type="application/vnd.medcalcdata"> + <glob pattern="*.mc1"/> + </mime-type> + <mime-type type="application/vnd.mediastation.cdkey"> + <glob pattern="*.cdkey"/> + </mime-type> + <mime-type type="application/vnd.meridian-slingshot"/> + <mime-type type="application/vnd.mfer"> + <glob pattern="*.mwf"/> + </mime-type> + <mime-type type="application/vnd.mfmp"> + <glob pattern="*.mfm"/> + </mime-type> + <mime-type type="application/vnd.micrografx.flo"> + <glob pattern="*.flo"/> + </mime-type> + <mime-type type="application/vnd.micrografx.igx"> + <glob pattern="*.igx"/> + </mime-type> + + <mime-type type="application/vnd.mif"> + <comment>FrameMaker MIF document</comment> + <alias type="application/x-mif"/> + <alias type="application/x-frame"/> + <magic priority="50"> + <match value="\<MakerFile" type="string" offset="0" /> + <match value="\<MIFFile" type="string" offset="0" /> + <match value="\<MakerDictionary" type="string" offset="0" /> + <match value="\<MakerScreenFont" type="string" offset="0" /> + <match value="\<MML" type="string" offset="0" /> + <match value="\<Book" type="string" offset="0" /> + <match value="\<Maker" type="string" offset="0" /> + </magic> + <glob pattern="*.mif"/> + </mime-type> + + <mime-type type="application/vnd.minisoft-hp3000-save"/> + <mime-type type="application/vnd.mitsubishi.misty-guard.trustweb"/> + <mime-type type="application/vnd.mobius.daf"> + <glob pattern="*.daf"/> + </mime-type> + <mime-type 
type="application/vnd.mobius.dis"> + <glob pattern="*.dis"/> + </mime-type> + <mime-type type="application/vnd.mobius.mbk"> + <glob pattern="*.mbk"/> + </mime-type> + <mime-type type="application/vnd.mobius.mqy"> + <glob pattern="*.mqy"/> + </mime-type> + <mime-type type="application/vnd.mobius.msl"> + <glob pattern="*.msl"/> + </mime-type> + <mime-type type="application/vnd.mobius.plc"> + <glob pattern="*.plc"/> + </mime-type> + <mime-type type="application/vnd.mobius.txf"> + <glob pattern="*.txf"/> + </mime-type> + <mime-type type="application/vnd.mophun.application"> + <glob pattern="*.mpn"/> + </mime-type> + <mime-type type="application/vnd.mophun.certificate"> + <glob pattern="*.mpc"/> + </mime-type> + <mime-type type="application/vnd.motorola.flexsuite"/> + <mime-type type="application/vnd.motorola.flexsuite.adsi"/> + <mime-type type="application/vnd.motorola.flexsuite.fis"/> + <mime-type type="application/vnd.motorola.flexsuite.gotap"/> + <mime-type type="application/vnd.motorola.flexsuite.kmr"/> + <mime-type type="application/vnd.motorola.flexsuite.ttc"/> + <mime-type type="application/vnd.motorola.flexsuite.wem"/> + <mime-type type="application/vnd.motorola.iprm"/> + <mime-type type="application/vnd.mozilla.xul+xml"> + <glob pattern="*.xul"/> + </mime-type> + <mime-type type="application/vnd.ms-artgalry"> + <glob pattern="*.cil"/> + </mime-type> + <mime-type type="application/vnd.ms-asf"/> + <mime-type type="application/vnd.ms-cab-compressed"> + <glob pattern="*.cab"/> + </mime-type> + + <!-- http://www.iana.org/assignments/media-types/application/vnd.ms-excel --> + <mime-type type="application/vnd.ms-excel"> + <alias type="application/msexcel" /> + <comment>Microsoft Excel Spreadsheet</comment> + <magic priority="50"> + <match value="Microsoft\ Excel\ 5.0\ Worksheet" type="string" offset="2080"/> + <match value="Foglio\ di\ lavoro\ Microsoft\ Exce" type="string" offset="2080"/> + <match value="Biff5" type="string" offset="2114"/> + <match value="Biff5" 
type="string" offset="2121"/> + <match value="\x09\x04\x06\x00\x00\x00\x10\x00" type="string" offset="0"/> + </magic> + <glob pattern="*.xls"/> + <glob pattern="*.xlm"/> + <glob pattern="*.xla"/> + <glob pattern="*.xlc"/> + <glob pattern="*.xlt"/> + <glob pattern="*.xlw"/> + <glob pattern="*.xll"/> + <glob pattern="*.xld"/> + <sub-class-of type="application/x-tika-msoffice"/> + </mime-type> + + <mime-type type="application/vnd.ms-excel.addin.macroenabled.12"> + <comment>Office Open XML Workbook Add-in (macro-enabled)</comment> + <glob pattern="*.xlam"/> + <sub-class-of type="application/x-tika-ooxml"/> + </mime-type> + + <mime-type type="application/vnd.ms-excel.sheet.macroenabled.12"> + <comment>Office Open XML Workbook (macro-enabled)</comment> + <glob pattern="*.xlsm"/> + <sub-class-of type="application/x-tika-ooxml"/> + </mime-type> + + <mime-type type="application/vnd.ms-excel.sheet.binary.macroenabled.12"> + <comment>Microsoft Excel 2007 Binary Spreadsheet</comment> + <glob pattern="*.xlsb"/> + <sub-class-of type="application/vnd.ms-excel"/> + </mime-type> + + <mime-type type="application/vnd.ms-excel.template.macroenabled.12"> + <glob pattern="*.xltm"/> + <sub-class-of type="application/x-tika-ooxml"/> + </mime-type> + + <mime-type type="application/vnd.ms-fontobject"> + <glob pattern="*.eot"/> + </mime-type> + <mime-type type="application/vnd.ms-htmlhelp"> + <glob pattern="*.chm"/> + </mime-type> + <mime-type type="application/vnd.ms-ims"> + <glob pattern="*.ims"/> + </mime-type> + <mime-type type="application/vnd.ms-lrm"> + <glob pattern="*.lrm"/> + </mime-type> + + <mime-type type="application/vnd.ms-outlook"> + <comment>Microsoft Outlook Message</comment> + <glob pattern="*.msg" /> + <sub-class-of type="application/x-tika-msoffice"/> + </mime-type> + + <mime-type type="application/vnd.ms-pki.seccat"> + <glob pattern="*.cat"/> + </mime-type> + <mime-type type="application/vnd.ms-pki.stl"> + <glob pattern="*.stl"/> + </mime-type> + <mime-type 
type="application/vnd.ms-playready.initiator+xml"/> + + <!-- http://www.iana.org/assignments/media-types/application/vnd.ms-powerpoint --> + <mime-type type="application/vnd.ms-powerpoint"> + <alias type="application/mspowerpoint"/> + <comment>Microsoft Powerpoint Presentation</comment> + <glob pattern="*.ppz"/> + <glob pattern="*.ppt"/> + <glob pattern="*.pps"/> + <glob pattern="*.pot"/> + <glob pattern="*.ppa"/> + <sub-class-of type="application/x-tika-msoffice"/> + </mime-type> + + <mime-type type="application/vnd.ms-powerpoint.addin.macroenabled.12"> + <comment>Office Open XML Presentation Add-in (macro-enabled)</comment> + <glob pattern="*.ppam"/> + <sub-class-of type="application/x-tika-msoffice"/> + </mime-type> + + <mime-type type="application/vnd.ms-powerpoint.presentation.macroenabled.12"> + <comment>Office Open XML Presentation (macro-enabled)</comment> + <glob pattern="*.pptm"/> + <sub-class-of type="application/x-tika-msoffice"/> + </mime-type> + + <mime-type type="application/vnd.ms-powerpoint.slide.macroenabled.12"> + <glob pattern="*.sldm"/> + <sub-class-of type="application/x-tika-msoffice"/> + </mime-type> + + <mime-type type="application/vnd.ms-powerpoint.slideshow.macroenabled.12"> + <comment>Office Open XML Presentation Slideshow (macro-enabled)</comment> + <glob pattern="*.ppsm"/> + <sub-class-of type="application/x-tika-msoffice"/> + </mime-type> + + <mime-type type="application/vnd.ms-powerpoint.template.macroenabled.12"> + <glob pattern="*.potm"/> + <sub-class-of type="application/x-tika-msoffice"/> + </mime-type> + + <mime-type type="application/vnd.ms-project"> + <glob pattern="*.mpp"/> + <glob pattern="*.mpt"/> + </mime-type> + + <mime-type type="application/vnd.ms-tnef"> + <magic priority="50"> + <match value="0x223e9f78" type="little16" offset="0" /> + </magic> + </mime-type> + + <mime-type type="application/vnd.ms-wmdrm.lic-chlg-req"/> + <mime-type type="application/vnd.ms-wmdrm.lic-resp"/> + <mime-type 
type="application/vnd.ms-wmdrm.meter-chlg-req"/> + <mime-type type="application/vnd.ms-wmdrm.meter-resp"/> + + <mime-type type="application/vnd.ms-word.document.macroenabled.12"> + <comment>Office Open XML Document (macro-enabled)</comment> + <glob pattern="*.docm"/> + <sub-class-of type="application/x-tika-ooxml"/> + </mime-type> + + <mime-type type="application/vnd.ms-word.template.macroenabled.12"> + <comment>Office Open XML Document Template (macro-enabled)</comment> + <glob pattern="*.dotm"/> + <sub-class-of type="application/x-tika-ooxml"/> + </mime-type> + + <mime-type type="application/vnd.ms-works"> + <glob pattern="*.wps"/> + <glob pattern="*.wks"/> + <glob pattern="*.wcm"/> + <glob pattern="*.wdb"/> + </mime-type> + <mime-type type="application/vnd.ms-wpl"> + <glob pattern="*.wpl"/> + </mime-type> + <mime-type type="application/vnd.ms-xpsdocument"> + <glob pattern="*.xps"/> + </mime-type> + <mime-type type="application/vnd.mseq"> + <glob pattern="*.mseq"/> + </mime-type> + <mime-type type="application/vnd.msign"/> + <mime-type type="application/vnd.multiad.creator"/> + <mime-type type="application/vnd.multiad.creator.cif"/> + <mime-type type="application/vnd.music-niff"/> + <mime-type type="application/vnd.musician"> + <glob pattern="*.mus"/> + </mime-type> + <mime-type type="application/vnd.muvee.style"> + <glob pattern="*.msty"/> + </mime-type> + <mime-type type="application/vnd.ncd.control"/> + <mime-type type="application/vnd.ncd.reference"/> + <mime-type type="application/vnd.nervana"/> + <mime-type type="application/vnd.netfpx"/> + <mime-type type="application/vnd.neurolanguage.nlu"> + <glob pattern="*.nlu"/> + </mime-type> + <mime-type type="application/vnd.noblenet-directory"> + <glob pattern="*.nnd"/> + </mime-type> + <mime-type type="application/vnd.noblenet-sealer"> + <glob pattern="*.nns"/> + </mime-type> + <mime-type type="application/vnd.noblenet-web"> + <glob pattern="*.nnw"/> + </mime-type> + <mime-type 
type="application/vnd.nokia.catalogs"/> + <mime-type type="application/vnd.nokia.conml+wbxml"/> + <mime-type type="application/vnd.nokia.conml+xml"/> + <mime-type type="application/vnd.nokia.isds-radio-presets"/> + <mime-type type="application/vnd.nokia.iptv.config+xml"/> + <mime-type type="application/vnd.nokia.landmark+wbxml"/> + <mime-type type="application/vnd.nokia.landmark+xml"/> + <mime-type type="application/vnd.nokia.landmarkcollection+xml"/> + <mime-type type="application/vnd.nokia.n-gage.ac+xml"/> + <mime-type type="application/vnd.nokia.n-gage.data"> + <glob pattern="*.ngdat"/> + </mime-type> + <mime-type type="application/vnd.nokia.n-gage.symbian.install"> + <glob pattern="*.n-gage"/> + </mime-type> + <mime-type type="application/vnd.nokia.ncd"/> + <mime-type type="application/vnd.nokia.pcd+wbxml"/> + <mime-type type="application/vnd.nokia.pcd+xml"/> + <mime-type type="application/vnd.nokia.radio-preset"> + <glob pattern="*.rpst"/> + </mime-type> + <mime-type type="application/vnd.nokia.radio-presets"> + <glob pattern="*.rpss"/> + </mime-type> + <mime-type type="application/vnd.novadigm.edm"> + <glob pattern="*.edm"/> + </mime-type> + <mime-type type="application/vnd.novadigm.edx"> + <glob pattern="*.edx"/> + </mime-type> + <mime-type type="application/vnd.novadigm.ext"> + <glob pattern="*.ext"/> + </mime-type> + + <!-- =================================================================== --> + <!-- Open Document Format for Office Applications (OpenDocument) v1.0 --> + <!-- http://www.oasis-open.org/specs/index.php#opendocumentv1.0 --> + <!-- =================================================================== --> + + <mime-type type="application/vnd.oasis.opendocument.chart"> + <alias type="application/x-vnd.oasis.opendocument.chart"/> + <comment>OpenDocument v1.0: Chart document</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.chart"/> + </match> + 
</magic> + <glob pattern="*.odc"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.chart-template"> + <alias type="application/x-vnd.oasis.opendocument.chart-template"/> + <comment>OpenDocument v1.0: Chart document used as template</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.chart-template"/> + </match> + </magic> + <glob pattern="*.otc"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.database"> + <glob pattern="*.odb"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.formula"> + <alias type="application/x-vnd.oasis.opendocument.formula"/> + <comment>OpenDocument v1.0: Formula document</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.formula" /> + </match> + </magic> + <glob pattern="*.odf"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.formula-template"> + <alias type="application/x-vnd.oasis.opendocument.formula-template"/> + <comment>OpenDocument v1.0: Formula document used as template</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.formula-template"/> + </match> + </magic> + <glob pattern="*.odft"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.graphics"> + <alias type="application/x-vnd.oasis.opendocument.graphics"/> + <comment>OpenDocument v1.0: Graphics document (Drawing)</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.graphics"/> + </match> + </magic> + <glob pattern="*.odg"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.graphics-template"> + <alias 
type="application/x-vnd.oasis.opendocument.graphics-template"/> + <comment>OpenDocument v1.0: Graphics document used as template</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.graphics-template"/> + </match> + </magic> + <glob pattern="*.otg"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.image"> + <alias type="application/x-vnd.oasis.opendocument.image"/> + <comment>OpenDocument v1.0: Image document</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.image"/> + </match> + </magic> + <glob pattern="*.odi"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.image-template"> + <alias type="application/x-vnd.oasis.opendocument.image-template"/> + <comment>OpenDocument v1.0: Image document used as template</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.image-template"/> + </match> + </magic> + <glob pattern="*.oti"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.presentation"> + <alias type="application/x-vnd.oasis.opendocument.presentation"/> + <comment>OpenDocument v1.0: Presentation document</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.presentation"/> + </match> + </magic> + <glob pattern="*.odp"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.presentation-template"> + <alias type="application/x-vnd.oasis.opendocument.presentation-template"/> + <comment>OpenDocument v1.0: Presentation document used as template</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + 
value="mimetypeapplication/vnd.oasis.opendocument.presentation-template"/> + </match> + </magic> + <glob pattern="*.otp"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.spreadsheet"> + <alias type="application/x-vnd.oasis.opendocument.spreadsheet"/> + <comment>OpenDocument v1.0: Spreadsheet document</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.spreadsheet"/> + </match> + </magic> + <glob pattern="*.ods"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.spreadsheet-template"> + <alias type="application/x-vnd.oasis.opendocument.spreadsheet-template"/> + <comment>OpenDocument v1.0: Spreadsheet document used as template</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.spreadsheet-template"/> + </match> + </magic> + <glob pattern="*.ots"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.text"> + <alias type="application/x-vnd.oasis.opendocument.text"/> + <comment>OpenDocument v1.0: Text document</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.text"/> + </match> + </magic> + <glob pattern="*.odt"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.text-master"> + <alias type="application/x-vnd.oasis.opendocument.text-master"/> + <comment>OpenDocument v1.0: Global Text document</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.text-master"/> + </match> + </magic> + <glob pattern="*.otm"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.text-template"> + <alias type="application/x-vnd.oasis.opendocument.text-template"/> + <comment>OpenDocument 
v1.0: Text document used as template</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.text-template"/> + </match> + </magic> + <glob pattern="*.ott"/> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.text-web"> + <alias type="application/x-vnd.oasis.opendocument.text-web"/> + <comment>OpenDocument v1.0: Text document used as template for HTML documents</comment> + <magic> + <match type="string" offset="0" value="PK"> + <match type="string" offset="30" + value="mimetypeapplication/vnd.oasis.opendocument.text-web"/> + </match> + </magic> + <glob pattern="*.oth"/> + </mime-type> + + <mime-type type="application/vnd.obn"/> + <mime-type type="application/vnd.olpc-sugar"> + <glob pattern="*.xo"/> + </mime-type> + <mime-type type="application/vnd.oma-scws-config"/> + <mime-type type="application/vnd.oma-scws-http-request"/> + <mime-type type="application/vnd.oma-scws-http-response"/> + <mime-type type="application/vnd.oma.bcast.associated-procedure-parameter+xml"/> + <mime-type type="application/vnd.oma.bcast.drm-trigger+xml"/> + <mime-type type="application/vnd.oma.bcast.imd+xml"/> + <mime-type type="application/vnd.oma.bcast.ltkm"/> + <mime-type type="application/vnd.oma.bcast.notification+xml"/> + <mime-type type="application/vnd.oma.bcast.provisioningtrigger"/> + <mime-type type="application/vnd.oma.bcast.sgboot"/> + <mime-type type="application/vnd.oma.bcast.sgdd+xml"/> + <mime-type type="application/vnd.oma.bcast.sgdu"/> + <mime-type type="application/vnd.oma.bcast.simple-symbol-container"/> + <mime-type type="application/vnd.oma.bcast.smartcard-trigger+xml"/> + <mime-type type="application/vnd.oma.bcast.sprov+xml"/> + <mime-type type="application/vnd.oma.bcast.stkm"/> + <mime-type type="application/vnd.oma.dcd"/> + <mime-type type="application/vnd.oma.dcdc"/> + <mime-type type="application/vnd.oma.dd2+xml"> + <glob pattern="*.dd2"/> + 
</mime-type> + <mime-type type="application/vnd.oma.drm.risd+xml"/> + <mime-type type="application/vnd.oma.group-usage-list+xml"/> + <mime-type type="application/vnd.oma.poc.detailed-progress-report+xml"/> + <mime-type type="application/vnd.oma.poc.final-report+xml"/> + <mime-type type="application/vnd.oma.poc.groups+xml"/> + <mime-type type="application/vnd.oma.poc.invocation-descriptor+xml"/> + <mime-type type="application/vnd.oma.poc.optimized-progress-report+xml"/> + <mime-type type="application/vnd.oma.xcap-directory+xml"/> + <mime-type type="application/vnd.omads-email+xml"/> + <mime-type type="application/vnd.omads-file+xml"/> + <mime-type type="application/vnd.omads-folder+xml"/> + <mime-type type="application/vnd.omaloc-supl-init"/> + + <mime-type type="application/vnd.openofficeorg.extension"> + <glob pattern="*.oxt"/> + </mime-type> + + <mime-type type="application/vnd.openxmlformats-officedocument.presentationml.presentation"> + <comment>Office Open XML Presentation</comment> + <glob pattern="*.pptx"/> + <glob pattern="*.thmx"/> + <sub-c... [truncated message content] |
From: <bi...@us...> - 2010-10-27 16:13:32
Revision: 3312
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3312&view=rev
Author:   binzino
Date:     2010-10-27 16:13:26 +0000 (Wed, 27 Oct 2010)

Log Message:
-----------
Store digest in index.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml 2010-10-27 07:08:09 UTC (rev 3311)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml 2010-10-27 16:13:26 UTC (rev 3312)
@@ -47,6 +47,7 @@
   content:false:compress:tokenized
   site:false:false:untokenized
   url:false:true:tokenized
+  digest:false:true:no
   type:true:true:no_norms
   length:false:true:no
 </value>

From: <bi...@us...> - 2010-10-27 07:08:15
Revision: 3311
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3311&view=rev
Author:   binzino
Date:     2010-10-27 07:08:09 +0000 (Wed, 27 Oct 2010)

Log Message:
-----------
Removed log message about merging dates.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/IndexerMapReduce.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/IndexerMapReduce.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/IndexerMapReduce.java 2010-10-27 07:07:51 UTC (rev 3310)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/IndexerMapReduce.java 2010-10-27 07:08:09 UTC (rev 3311)
@@ -138,7 +138,6 @@
   String[] sourceDates = src.getValues( "date" );
   for ( String date : sourceDates )
   {
-    LOG.warn( "Merging: " + key + " : " + date );
     dest.add( "date", date );
   }

From: <bi...@us...> - 2010-10-27 07:07:57
Revision: 3310
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3310&view=rev
Author:   binzino
Date:     2010-10-27 07:07:51 +0000 (Wed, 27 Oct 2010)

Log Message:
-----------
Disabled the BoilerPipe stuff for now.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/Importer.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/Importer.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/Importer.java 2010-10-27 07:07:20 UTC (rev 3309)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/Importer.java 2010-10-27 07:07:51 UTC (rev 3310)
@@ -338,6 +338,16 @@
   contentMetadata.set( NutchWax.CONTENT_LENGTH_KEY, String.valueOf( meta.getLength() ) );
   contentMetadata.set( NutchWax.HTTP_RESPONSE_KEY, String.valueOf( record.getStatusCode() ) );
+  // BoilerPipe!
+  /*
+  if ( "text/html".equals( meta.getMimetype() ) )
+  {
+    String boiledHTML = de.l3s.boilerpipe.extractors.DefaultExtractor.INSTANCE.getText( new org.xml.sax.InputSource( new java.io.ByteArrayInputStream( bytes ) ) );
+
+    contentMetadata.set( "boiledHTML", boiledHTML );
+  }
+  */
+
   Content content = new Content( url, url, bytes, meta.getMimetype(), contentMetadata, getConf() );
   output( output, new Text( key ), content );

From: <bi...@us...> - 2010-10-27 07:07:26
Revision: 3309
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3309&view=rev
Author:   binzino
Date:     2010-10-27 07:07:20 +0000 (Wed, 27 Oct 2010)

Log Message:
-----------
Disable the import.content.limit.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml 2010-10-27 07:06:57 UTC (rev 3308)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/nutch/conf/nutch-site.xml 2010-10-27 07:07:20 UTC (rev 3309)
@@ -125,7 +125,7 @@
   -->
 <property>
   <name>nutchwax.import.content.limit</name>
-  <value>1048576</value>
+  <value>-1</value>
 </property>
 
 <!-- Whether or not we store the full content in the segment's

From: <bi...@us...> - 2010-10-27 07:07:03
|
Revision: 3308
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3308&view=rev
Author:   binzino
Date:     2010-10-27 07:06:57 +0000 (Wed, 27 Oct 2010)

Log Message:
-----------
Dates are stored as-is, but indexed in YYYY and YYYYMM format.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/DateAdder.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/DateAdder.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/DateAdder.java 2010-10-27 07:06:04 UTC (rev 3307)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/java/org/archive/nutchwax/tools/DateAdder.java 2010-10-27 07:06:57 UTC (rev 3308)
@@ -27,7 +27,7 @@
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
-import org.apache.lucene.analysis.WhitespaceAnalyzer;
+import org.apache.lucene.analysis.*;
 import org.apache.lucene.store.NIOFSDirectory;
 import org.apache.hadoop.conf.Configured;
@@ -106,7 +106,7 @@
       sourceReaders[i] = IndexReader.open( new NIOFSDirectory( new File( args[i+1] ) ), true );
     }
-    IndexWriter writer = new IndexWriter( new NIOFSDirectory( new File( destIndexDir ) ), null, IndexWriter.MaxFieldLength.UNLIMITED );
+    IndexWriter writer = new IndexWriter( new NIOFSDirectory( new File( destIndexDir ) ), new KeywordAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED );
     UrlCanonicalizer canonicalizer = getCanonicalizer( this.getConf( ) );
@@ -128,7 +128,9 @@
       }
     for ( String date : uniqueDates )
       {
-        newDoc.add( new Field( NutchWax.DATE_KEY, date, Field.Store.YES, Field.Index.NOT_ANALYZED ) );
+        newDoc.add( new Field( NutchWax.DATE_KEY, date, Field.Store.YES, Field.Index.NO ) );
+        newDoc.add( new Field( NutchWax.DATE_KEY, date.substring( 0, 4 ), Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS ) );
+        newDoc.add( new Field( NutchWax.DATE_KEY, date.substring( 0, 6 ), Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS ) );
       }
     // Obtain the new dates for the document.
@@ -156,7 +158,9 @@
       {
         for ( String date : newDates.split("\\s+") )
           {
-            newDoc.add( new Field( NutchWax.DATE_KEY, date, Field.Store.YES, Field.Index.NOT_ANALYZED ) );
+            newDoc.add( new Field( NutchWax.DATE_KEY, date, Field.Store.YES, Field.Index.NO ) );
+            newDoc.add( new Field( NutchWax.DATE_KEY, date.substring( 0, 4 ), Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS ) );
+            newDoc.add( new Field( NutchWax.DATE_KEY, date.substring( 0, 6 ), Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS ) );
           }
       }
@@ -207,6 +211,5 @@
     System.exit( result );
   }
-
 }
|
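The log message for r3308 says dates are stored as-is but indexed under their YYYY and YYYYMM prefixes. A minimal sketch of that prefix derivation follows, assuming timestamps of the form YYYYMMDDhhmmss. The class name `DatePrefixes`, the method `indexTerms`, and the length guards are illustrative only; the committed DateAdder code calls `substring` directly and does not use such a helper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NutchWAX: derives the index terms
// that r3308 adds for each capture date.
final class DatePrefixes {

    /** Returns the YYYY term, then the YYYYMM term, for one date string. */
    static List<String> indexTerms(String date) {
        List<String> terms = new ArrayList<>();
        // Guards are an addition for this sketch; a date shorter than six
        // characters would make an unguarded substring( 0, 6 ) throw.
        if (date.length() >= 4) {
            terms.add(date.substring(0, 4));   // year, e.g. "2010"
        }
        if (date.length() >= 6) {
            terms.add(date.substring(0, 6));   // year and month, e.g. "201010"
        }
        return terms;
    }

    public static void main(String[] args) {
        // Assumed 14-digit timestamp format.
        System.out.println(indexTerms("20101027070657"));
    }
}
```

Running `main` prints `[2010, 201010]`. Indexing only these two prefixes keeps the term dictionary small while still allowing queries by year or by month; the full timestamp remains available as a stored value.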
Revision: 3307
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3307&view=rev
Author:   binzino
Date:     2010-10-27 07:06:04 +0000 (Wed, 27 Oct 2010)

Log Message:
-----------
Add type normalization and filtering. Added uri/path filtering.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2010-10-27 07:00:57 UTC (rev 3306)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2010-10-27 07:06:04 UTC (rev 3307)
@@ -1,30 +1,26 @@
 /*
- * Copyright (C) 2008 Internet Archive.
- *
- * This file is part of the archive-access tools project
- * (http://sourceforge.net/projects/archive-access).
- *
- * The archive-access tools are free software; you can redistribute them and/or
- * modify them under the terms of the GNU Lesser Public License as published by
- * the Free Software Foundation; either version 2.1 of the License, or any
- * later version.
- *
- * The archive-access tools are distributed in the hope that they will be
- * useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
- * Public License for more details.
- *
- * You should have received a copy of the GNU Lesser Public License along with
- * the archive-access tools; if not, write to the Free Software Foundation,
- * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
  */
+
 package org.archive.nutchwax.index;
-import java.net.MalformedURLException;
-import java.net.URL;
-import java.util.ArrayList;
-import java.util.List;
+import java.net.*;
+import java.util.*;
+
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 import org.apache.hadoop.conf.Configuration;
@@ -51,6 +47,9 @@
   private List<FieldSpecification> fieldSpecs;
   private int MAX_TITLE_LENGTH;
+  private TypeNormalizer typenormalizer;
+  private TypeFilter typefilter;
+  private URLFilter urlfilter;
   public void setConf( Configuration conf )
   {
@@ -58,6 +57,13 @@
     this.MAX_TITLE_LENGTH = conf.getInt("indexer.max.title.length", 100);
+    // this.allowedTypes = new HashSet<String>( conf.get( "indexer.mimetypes.allowed", "" ).split( "\\s+" ) );
+    this.typenormalizer = new TypeNormalizer( );
+    this.typenormalizer.setAliases( typenormalizer.getDefaultAliases( ) );
+
+    this.typefilter = new TypeFilter( TypeFilter.getDefaultAllowed( ), this.typenormalizer );
+    this.urlfilter = new URLFilter ( URLFilter.getDefaultProhibited( ) );
+
     String filterSpecs = conf.get( "nutchwax.filter.index" );
     if ( null == filterSpecs )
@@ -143,6 +149,8 @@
   {
     Metadata meta = parse.getData().getContentMeta();
+    //
+
     for ( FieldSpecification spec : this.fieldSpecs )
       {
         String value = null;
@@ -150,15 +158,24 @@
           {
             try
               {
-                value = (new URL( meta.get( "url" ) ) ).getHost( );
+                URI uri = new URI( meta.get( "url" ) );
+                if ( ! this.urlfilter.isAllowed( uri ) )
+                  {
+                    LOG.info( "Rejecting: " + key + " due to url: " + uri );
+
+                    return null;
+                  }
+
+                value = uri.getHost( );
+                // Strip off any "www." header.
                 if ( value.startsWith( "www." ) )
                   {
                     value = value.substring( 4 );
                   }
               }
-            catch ( MalformedURLException mue ) { /* Eat it */ }
+            catch ( URISyntaxException use ) { /* Eat it */ }
           }
         else if ( "content".equals( spec.srcKey ) )
           {
@@ -178,8 +195,16 @@
             if ( value == null ) continue ;
-            int p = value.indexOf( ';' );
-            if ( p >= 0 ) value = value.substring( 0, p );
+            //int p = value.indexOf( ';' );
+            //if ( p >= 0 ) value = value.substring( 0, p );
+            value = this.typenormalizer.normalize( value );
+
+            if ( ! this.typefilter.isAllowed( value ) )
+              {
+                LOG.info( "Rejecting: " + key + " due to type: " + value );
+
+                return null;
+              }
           }
         else if ( "collection".equals( spec.srcKey ) )
           {
|
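r3307 replaces the inline "cut at the first ';'" logic with TypeNormalizer and TypeFilter, and switches host extraction from java.net.URL to java.net.URI. Their implementations are not part of this diff, so the following is only a sketch of what such normalization and filtering might look like. The class `IndexFilterSketch`, its alias table, and its allow-list are hypothetical stand-ins for the project's `getDefaultAliases( )` and `getDefaultAllowed( )` defaults. The null check on the host is also an addition: `URI.getHost()` returns null for opaque URIs such as `mailto:` addresses.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the NutchWAX TypeNormalizer/TypeFilter/URLFilter.
final class IndexFilterSketch {
    // Assumed example values; the real defaults are defined in NutchWAX.
    static final Map<String, String> ALIASES = new HashMap<>();
    static final Set<String> ALLOWED = new HashSet<>();
    static {
        ALIASES.put("application/xhtml+xml", "text/html");
        ALLOWED.add("text/html");
        ALLOWED.add("text/plain");
    }

    /** Drops parameters such as "; charset=UTF-8", lowercases, applies aliases. */
    static String normalizeType(String raw) {
        int p = raw.indexOf(';');
        String type = (p >= 0 ? raw.substring(0, p) : raw).trim().toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(type, type);
    }

    /** True if the normalized type is on the allow-list. */
    static boolean isAllowedType(String raw) {
        return ALLOWED.contains(normalizeType(raw));
    }

    /** Returns the host without a leading "www.", or null if there is none. */
    static String hostWithoutWww(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;   // opaque or host-less URI
            }
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (URISyntaxException e) {
            return null;       // malformed input: no host value
        }
    }

    public static void main(String[] args) {
        System.out.println(normalizeType("Text/HTML; charset=UTF-8"));
        System.out.println(isAllowedType("image/png"));
        System.out.println(hostWithoutWww("http://www.example.org/index.html"));
    }
}
```

Normalizing before filtering means a single allow-list entry covers every spelling and parameterized variant of a type, which is presumably why the TypeFilter in the commit is constructed with the normalizer.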
Revision: 3306
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3306&view=rev
Author:   binzino
Date:     2010-10-27 07:00:57 +0000 (Wed, 27 Oct 2010)

Log Message:
-----------
Changed to just store the date, no indexing.

Modified Paths:
--------------
    tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/DateIndexer.java

Modified: tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/DateIndexer.java
===================================================================
--- tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/DateIndexer.java 2010-10-27 06:56:42 UTC (rev 3305)
+++ tags/nutchwax-0_13-JIRA-WAX-75/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/DateIndexer.java 2010-10-27 07:00:57 UTC (rev 3306)
@@ -77,8 +77,6 @@
     for ( String date : dates )
       {
         doc.add( "date", date );
-        doc.add( "year", date.substring( 0, 4 ) );
-        doc.add( "yearmonth", date.substring( 0, 6 ) );
       }
     return doc;
@@ -87,8 +85,6 @@
   public void addIndexBackendOptions( Configuration conf )
   {
     LuceneWriter.addFieldOptions( "date", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf );
-    LuceneWriter.addFieldOptions( "year", LuceneWriter.STORE.NO, LuceneWriter.INDEX.UNTOKENIZED, conf );
-    LuceneWriter.addFieldOptions( "yearmonth", LuceneWriter.STORE.NO, LuceneWriter.INDEX.UNTOKENIZED, conf );
   }
 }
|