From: <jer...@us...> - 2014-05-10 11:52:28
Revision: 8263
http://sourceforge.net/p/bigdata/code/8263
Author: jeremy_carroll
Date: 2014-05-10 11:52:25 +0000 (Sat, 10 May 2014)
Log Message:
-----------
Cleaning up of ConfigurableAnalyzerFactory, adding TermCompletionAnalyzer, deprecating DefaultAnalyzerFactory
Finishing of trac 912, work on 915
Unit tests for the old and new behaviors
This merges the branch TEXT_ANALYZERS.
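A note for reviewers: the reworked ConfigurableAnalyzerFactory selects analyzers by matching language tags against configured language ranges using the "Extended Filtering" algorithm of RFC 4647, section 3.3.2 (see the Javadoc changes below). A minimal standalone sketch of that match step follows; the class and method names are illustrative, not the committed code.

```java
import java.util.Locale;

// Minimal standalone sketch of RFC 4647 "Extended Filtering" (section 3.3.2);
// illustrative only, not the committed LanguageRange implementation.
public class ExtendedFilterSketch {

    // Returns true if the language tag matches the extended language range.
    public static boolean matches(String range, String langTag) {
        String[] r = range.toLowerCase(Locale.ROOT).split("-");
        String[] t = langTag.toLowerCase(Locale.ROOT).split("-");
        // Step 2: the first subtags must be equal (or the range starts with '*').
        if (!r[0].equals("*") && !r[0].equals(t[0])) {
            return false;
        }
        int ri = 1, ti = 1;
        // Step 3: walk the remaining subtags.
        while (ri < r.length) {
            if (r[ri].equals("*")) {
                ri++;                 // 3A: '*' matches zero or more subtags
            } else if (ti >= t.length) {
                return false;         // 3B: tag exhausted before the range
            } else if (r[ri].equals(t[ti])) {
                ri++; ti++;           // 3C: subtags match; advance both
            } else if (t[ti].length() == 1) {
                return false;         // 3D: a singleton in the tag blocks matching
            } else {
                ti++;                 // 3E: skip this subtag of the tag
            }
        }
        return true;                  // Step 4: range exhausted, so it matches
    }

    public static void main(String[] args) {
        // Examples adapted from RFC 4647, section 3.3.2.
        if (!matches("de-*-DE", "de-Latn-DE-1996")) throw new AssertionError();
        if (matches("de-*-DE", "de-x-DE")) throw new AssertionError();
    }
}
```

The committed LanguageRange class additionally sorts ranges so that longer ranges are tried first, giving the longest-match behaviour described in the Javadoc.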
Modified Paths:
--------------
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/ConfigurableAnalyzerFactory.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/DefaultAnalyzerFactory.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractSearchTest.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/NonEnglishExamples.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestAll.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAsDefaultAnalyzerFactory.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestDefaultAnalyzerFactory.java
Added Paths:
-----------
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/TermCompletionAnalyzer.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractDefaultAnalyzerFactoryTest.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAnalyzerFactory.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestUnconfiguredAnalyzerFactory.java
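For reviewers, a sketch of how the new options fit together in a configuration. The property names follow the Options/AnalyzerOptions constants in this change; the x-term language tag and the regex values are invented for illustration only.

```properties
# Baseline: enable all the bundled Lucene natural-language analyzers
# (replaces the old includeDefaults option).
com.bigdata.search.ConfigurableAnalyzerFactory.naturalLanguageSupport=true
# Hypothetical private-use tag routed to the new TermCompletionAnalyzer.
# Note: '_' stands in for '*' in wildcard language ranges, since bigdata
# does not allow '*' in property names.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-term.wordBoundary=\\s+
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-term.subWordBoundary=[-]
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-term.softHyphens=-
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-term.alwaysRemoveSoftHyphens=false
```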
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/ConfigurableAnalyzerFactory.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/ConfigurableAnalyzerFactory.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/ConfigurableAnalyzerFactory.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -66,6 +66,7 @@
* Supported classes included all the natural language specific classes from Lucene, and also:
* <ul>
* <li>{@link PatternAnalyzer}
+ * <li>{@link TermCompletionAnalyzer}
* <li>{@link KeywordAnalyzer}
* <li>{@link SimpleAnalyzer}
* <li>{@link StopAnalyzer}
@@ -76,7 +77,6 @@
* <ul>
* <li>no arguments
* <li>{@link Version}
- * <li>{@link Set} (of strings, the stop words)
* <li>{@link Version}, {@link Set}
* </ul>
* is usable. If the class has a static method named <code>getDefaultStopSet()</code> then this is assumed
@@ -89,19 +89,17 @@
* abbreviate to <code>c.b.s.C</code> in this documentation.
* Properties from {@link Options} apply to the factory.
* <p>
- *
- * If there are no such properties at all then the property {@link Options#INCLUDE_DEFAULTS} is set to true,
- * and the behavior of this class is the same as the legacy {@link DefaultAnalyzerFactory}.
- * <p>
* Other properties, from {@link AnalyzerOptions} start with
* <code>c.b.s.C.analyzer.<em>language-range</em></code> where <code><em>language-range</em></code> conforms
- * with the extended language range construct from RFC 4647, section 2.2. These are used to specify
- * an analyzer for the given language range.
+ * with the extended language range construct from RFC 4647, section 2.2.
+ * Note that bigdata does not allow '*' in property names, so the character '_' is
+ * used in place of '*' in extended language ranges within property names.
+ * These are used to specify an analyzer for the given language range.
* <p>
* If no analyzer is specified for the language range <code>*</code> then the {@link StandardAnalyzer} is used.
* <p>
* Given any specific language, then the analyzer matching the longest configured language range,
- * measured in number of subtags is used {@link #getAnalyzer(String, boolean)}
+ * measured in number of subtags, is returned by {@link #getAnalyzer(String, boolean)}.
* In the event of a tie, the alphabetically first language range is used.
* The algorithm to find a match is "Extended Filtering" as defined in section 3.3.2 of RFC 4647.
* <p>
@@ -113,11 +111,13 @@
* <dd>This uses whitespace to tokenize</dd>
* <dt>{@link PatternAnalyzer}</dt>
* <dd>This uses a regular expression to tokenize</dd>
+ * <dt>{@link TermCompletionAnalyzer}</dt>
+ * <dd>This uses up to three regular expressions to specify multiple tokens for each word, to address term completion use cases.</dd>
* <dt>{@link EmptyAnalyzer}</dt>
* <dd>This suppresses the functionality, by treating every expression as a stop word.</dd>
* </dl>
* there are in addition the language specific analyzers that are included
- * by using the option {@link Options#INCLUDE_DEFAULTS}
+ * by using the option {@link Options#NATURAL_LANGUAGE_SUPPORT}
*
*
* @author jeremycarroll
@@ -126,11 +126,26 @@
public class ConfigurableAnalyzerFactory implements IAnalyzerFactory {
final private static transient Logger log = Logger.getLogger(ConfigurableAnalyzerFactory.class);
- static class LanguageRange implements Comparable<LanguageRange> {
+ /**
+ * This is an implementation of RFC 4647 language range,
+ * targeted at the specific needs within bigdata, and only
+ * supporting the extended filtering specified in section 3.3.2
+ * <p>
+ * Language ranges are comparable so that
+ * sorting an array and then matching a language tag against each
+ * member of the array in sequence will give the longest match,
+ * i.e. the longer ranges come first.
+ * @author jeremycarroll
+ *
+ */
+ public static class LanguageRange implements Comparable<LanguageRange> {
private final String range[];
private final String full;
-
+ /**
+ * Note: the range must be in lower case; this is not verified.
+ * @param range
+ */
public LanguageRange(String range) {
this.range = range.split("-");
full = range;
@@ -174,12 +189,22 @@
return full.hashCode();
}
+ /**
+ * This implements the algorithm of section 3.3.2 of RFC 4647
+ * as modified with the observation about private use tags
+ * in <a href="http://lists.w3.org/Archives/Public/www-international/2014AprJun/0084">
+ * this message</a>.
+ *
+ *
+ * @param langTag The RFC 5646 Language tag in lower case
+ * @return The result of the algorithm
+ */
public boolean extendedFilterMatch(String langTag) {
return extendedFilterMatch(langTag.toLowerCase(Locale.ROOT).split("-"));
}
// See RFC 4647, 3.3.2
- public boolean extendedFilterMatch(String[] language) {
+ boolean extendedFilterMatch(String[] language) {
// RFC 4647 step 2
if (!matchSubTag(language[0], range[0])) {
return false;
@@ -227,13 +252,14 @@
*/
public interface Options {
/**
- * By setting this option to true, then the behavior of the legacy {@link DefaultAnalyzerFactory}
- * is added, and may be overridden by the settings of the user.
+ * If this option is set to true, then all the known Lucene Analyzers for natural
+ * languages are used for a range of language tags.
+ * These settings may then be overridden by the settings of the user.
* Specifically the following properties are loaded, prior to loading the
* user's specification (with <code>c.b.s.C</code> expanding to
* <code>com.bigdata.search.ConfigurableAnalyzerFactory</code>)
<pre>
-c.b.s.C.analyzer.*.like=eng
+c.b.s.C.analyzer._.like=eng
c.b.s.C.analyzer.por.analyzerClass=org.apache.lucene.analysis.br.BrazilianAnalyzer
c.b.s.C.analyzer.pt.like=por
c.b.s.C.analyzer.zho.analyzerClass=org.apache.lucene.analysis.cn.ChineseAnalyzer
@@ -265,18 +291,13 @@
*
*
*/
- String INCLUDE_DEFAULTS = ConfigurableAnalyzerFactory.class.getName() + ".includeDefaults";
+ String NATURAL_LANGUAGE_SUPPORT = ConfigurableAnalyzerFactory.class.getName() + ".naturalLanguageSupport";
/**
* This is the prefix to all properties configuring the individual analyzers.
*/
String ANALYZER = ConfigurableAnalyzerFactory.class.getName() + ".analyzer.";
-/**
- * If there is no configuration at all, then the defaults are included,
- * but any configuration at all totally replaces the defaults, unless
- * {@link #INCLUDE_DEFAULTS}
- * is explicitly set to true.
- */
- String DEFAULT_INCLUDE_DEFAULTS = "false";
+
+ String DEFAULT_NATURAL_LANGUAGE_SUPPORT = "false";
}
/**
* Options understood by analyzers created by {@link ConfigurableAnalyzerFactory}.
@@ -286,7 +307,9 @@
/**
* If specified this is the fully qualified name of a subclass of {@link Analyzer}
* that has appropriate constructors.
- * Either this or {@link #LIKE} or {@link #PATTERN} must be specified for each language range.
+ * This is set implicitly if some of the options below are selected (for example {@link #PATTERN}).
+ * For each configured language range, if it is not set, either explicitly or implicitly, then
+ * {@link #LIKE} must be specified.
*/
String ANALYZER_CLASS = "analyzerClass";
@@ -326,16 +349,52 @@
String STOPWORDS_VALUE_NONE = "none";
/**
- * If this property is present then the analyzer being used is a
- * {@link PatternAnalyzer} and the value is the pattern to use.
+ * The value of the pattern parameter to
+ * {@link PatternAnalyzer#PatternAnalyzer(Version, Pattern, boolean, Set)}
* (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
* It is an error if a different analyzer class is specified.
*/
- String PATTERN = ".pattern";
+ String PATTERN = "pattern";
+ /**
+ * The value of the wordBoundary parameter to
+ * {@link TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)}
+ * (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
+ * It is an error if a different analyzer class is specified.
+ */
+ String WORD_BOUNDARY = "wordBoundary";
+ /**
+ * The value of the subWordBoundary parameter to
+ * {@link TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)}
+ * (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
+ * It is an error if a different analyzer class is specified.
+ */
+ String SUB_WORD_BOUNDARY = "subWordBoundary";
+ /**
+ * The value of the softHyphens parameter to
+ * {@link TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)}
+ * (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
+ * It is an error if a different analyzer class is specified.
+ */
+ String SOFT_HYPHENS = "softHyphens";
+ /**
+ * The value of the alwaysRemoveSoftHypens parameter to
+ * {@link TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)}
+ * (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
+ * It is an error if a different analyzer class is specified.
+ */
+ String ALWAYS_REMOVE_SOFT_HYPHENS = "alwaysRemoveSoftHyphens";
+
+ boolean DEFAULT_ALWAYS_REMOVE_SOFT_HYPHENS = false;
+
+ /**
+ * The default sub-word boundary is a pattern that never matches,
+ * i.e. there are no sub-word boundaries.
+ */
+ Pattern DEFAULT_SUB_WORD_BOUNDARY = Pattern.compile("(?!)");
}
- private static final String DEFAULT_PROPERTIES =
+ private static final String ALL_LUCENE_NATURAL_LANGUAGES =
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.*.like=eng\n" +
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.por.analyzerClass=org.apache.lucene.analysis.br.BrazilianAnalyzer\n" +
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.pt.like=por\n" +
@@ -365,33 +424,67 @@
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.eng.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer\n" +
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.en.like=eng\n";
+ private static final String LUCENE_STANDARD_ANALYZER =
+ "com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.*.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer\n";
+
+ /**
+ * This comment describes the implementation of {@link ConfigurableAnalyzerFactory}.
+ * The only method in the interface is {@link ConfigurableAnalyzerFactory#getAnalyzer(String, boolean)};
+ * a map is used from language tag to {@link AnalyzerPair}, where the pair contains
+ * an {@link Analyzer} both with and without stopwords configured (sometimes these two analyzers
+ * are identical, if, for example, stop words are not supported or not required).
+ * <p>
+ * If there is no entry for the language tag in the map {@link ConfigurableAnalyzerFactory#langTag2AnalyzerPair},
+ * then one is created, by walking down the array {@link ConfigurableAnalyzerFactory#config} of AnalyzerPairs
+ * until a matching one is found.
+ * <p>
+ * The bulk of the code in this class is invoked from the constructor in order to set up this
+ * {@link ConfigurableAnalyzerFactory#config} array. For example, all of the subclasses of {@link AnalyzerPair}s,
+ * are simply to call the appropriate constructor in the appropriate way: the difficulty is that many subclasses
+ * of {@link Analyzer} have constructors with different signatures, and our code needs to navigate each sort.
+ * @author jeremycarroll
+ *
+ */
private static class AnalyzerPair implements Comparable<AnalyzerPair>{
- private final LanguageRange range;
+ final LanguageRange range;
private final Analyzer withStopWords;
private final Analyzer withoutStopWords;
+ public Analyzer getAnalyzer(boolean filterStopwords) {
+ return filterStopwords ? withStopWords : withoutStopWords;
+ }
+
+ public boolean extendedFilterMatch(String[] language) {
+ return range.extendedFilterMatch(language);
+ }
+
AnalyzerPair(String range, Analyzer withStopWords, Analyzer withOutStopWords) {
this.range = new LanguageRange(range);
this.withStopWords = withStopWords;
this.withoutStopWords = withOutStopWords;
}
+ /**
+ * This clone constructor implements {@link AnalyzerOptions#LIKE}.
+ * @param range
+ * @param copyMe
+ */
AnalyzerPair(String range, AnalyzerPair copyMe) {
this.range = new LanguageRange(range);
this.withStopWords = copyMe.withStopWords;
this.withoutStopWords = copyMe.withoutStopWords;
-
}
-
- public Analyzer getAnalyzer(boolean filterStopwords) {
- return filterStopwords ? withStopWords : withoutStopWords;
- }
- @Override
- public String toString() {
- return range.full + "=(" + withStopWords.getClass().getSimpleName() +")";
- }
-
+ /**
+ * If we have a constructor whose arguments include a populated
+ * stop word set, then we can use it to make both the withStopWords
+ * analyzer and the withoutStopWords analyzer.
+ * @param range
+ * @param cons A Constructor including a {@link java.util.Set} argument
+ * for the stop words.
+ * @param params The arguments to pass to the constructor including a populated stopword set.
+ * @throws Exception
+ */
AnalyzerPair(String range, Constructor<? extends Analyzer> cons, Object ... params) throws Exception {
this(range, cons.newInstance(params), cons.newInstance(useEmptyStopWordSet(params)));
}
@@ -409,38 +502,52 @@
}
return rslt;
}
+
@Override
+ public String toString() {
+ return range.full + "=(" + withStopWords.getClass().getSimpleName() +")";
+ }
+
+ @Override
public int compareTo(AnalyzerPair o) {
return range.compareTo(o.range);
}
-
- public boolean extendedFilterMatch(String[] language) {
- return range.extendedFilterMatch(language);
- }
}
+ /**
+ * Used for Analyzer classes with a constructor with signature (Version, Set).
+ * @author jeremycarroll
+ *
+ */
private static class VersionSetAnalyzerPair extends AnalyzerPair {
public VersionSetAnalyzerPair(ConfigOptionsToAnalyzer lro,
Class<? extends Analyzer> cls) throws Exception {
super(lro.languageRange, getConstructor(cls, Version.class, Set.class), Version.LUCENE_CURRENT, lro.getStopWords());
}
}
-
+
+ /**
+ * Used for Analyzer classes which do not support stopwords and have a constructor with signature (Version).
+ * @author jeremycarroll
+ *
+ */
private static class VersionAnalyzerPair extends AnalyzerPair {
-
public VersionAnalyzerPair(String range, Class<? extends Analyzer> cls) throws Exception {
super(range, getConstructor(cls, Version.class).newInstance(Version.LUCENE_CURRENT));
}
}
-
+ /**
+ * Special case code for {@link PatternAnalyzer}
+ * @author jeremycarroll
+ *
+ */
private static class PatternAnalyzerPair extends AnalyzerPair {
-
- public PatternAnalyzerPair(ConfigOptionsToAnalyzer lro, String pattern) throws Exception {
+ public PatternAnalyzerPair(ConfigOptionsToAnalyzer lro, Pattern pattern) throws Exception {
super(lro.languageRange, getConstructor(PatternAnalyzer.class,Version.class,Pattern.class,Boolean.TYPE,Set.class),
Version.LUCENE_CURRENT,
- Pattern.compile(pattern, Pattern.UNICODE_CHARACTER_CLASS),
+ pattern,
true,
lro.getStopWords());
}
@@ -451,6 +558,16 @@
* This class is initialized with the config options, using the {@link #setProperty(String, String)}
* method, for a particular language range and works out which pair of {@link Analyzer}s
* to use for that language range.
+ * <p>
+ * Instances of this class are only alive during the execution of
+ * {@link ConfigurableAnalyzerFactory#ConfigurableAnalyzerFactory(FullTextIndex)},
+ * the life-cycle is:
+ * <ol>
+ * <li>The relevant config properties are applied, and are used to populate the fields.
+ * <li>The fields are validated
+ * <li>An {@link AnalyzerPair} is constructed
+ * </ol>
+ *
* @author jeremycarroll
*
*/
@@ -459,9 +576,13 @@
String like;
String className;
String stopwords;
- String pattern;
+ Pattern pattern;
final String languageRange;
AnalyzerPair result;
+ Pattern wordBoundary;
+ Pattern subWordBoundary;
+ Pattern softHyphens;
+ Boolean alwaysRemoveSoftHyphens;
public ConfigOptionsToAnalyzer(String languageRange) {
this.languageRange = languageRange;
@@ -474,7 +595,7 @@
*/
public Set<?> getStopWords() {
- if (AnalyzerOptions.STOPWORDS_VALUE_NONE.equals(stopwords))
+ if (doNotUseStopWords())
return Collections.EMPTY_SET;
if (useDefaultStopWords()) {
@@ -484,6 +605,10 @@
return getStopWordsForClass(stopwords);
}
+ boolean doNotUseStopWords() {
+ return AnalyzerOptions.STOPWORDS_VALUE_NONE.equals(stopwords) || (stopwords == null && pattern != null);
+ }
+
protected Set<?> getStopWordsForClass(String clazzName) {
Class<? extends Analyzer> analyzerClass = getAnalyzerClass(clazzName);
try {
@@ -500,9 +625,13 @@
}
protected boolean useDefaultStopWords() {
- return stopwords == null || AnalyzerOptions.STOPWORDS_VALUE_DEFAULT.equals(stopwords);
+ return ( stopwords == null && pattern == null ) || AnalyzerOptions.STOPWORDS_VALUE_DEFAULT.equals(stopwords);
}
+ /**
+ * The first step in the life-cycle, used to initialize the fields.
+ * @return true if the property was recognized.
+ */
public boolean setProperty(String shortProperty, String value) {
if (shortProperty.equals(AnalyzerOptions.LIKE) ) {
like = value;
@@ -511,13 +640,24 @@
} else if (shortProperty.equals(AnalyzerOptions.STOPWORDS) ) {
stopwords = value;
} else if (shortProperty.equals(AnalyzerOptions.PATTERN) ) {
- pattern = value;
+ pattern = Pattern.compile(value,Pattern.UNICODE_CHARACTER_CLASS);
+ } else if (shortProperty.equals(AnalyzerOptions.WORD_BOUNDARY) ) {
+ wordBoundary = Pattern.compile(value,Pattern.UNICODE_CHARACTER_CLASS);
+ } else if (shortProperty.equals(AnalyzerOptions.SUB_WORD_BOUNDARY) ) {
+ subWordBoundary = Pattern.compile(value,Pattern.UNICODE_CHARACTER_CLASS);
+ } else if (shortProperty.equals(AnalyzerOptions.SOFT_HYPHENS) ) {
+ softHyphens = Pattern.compile(value,Pattern.UNICODE_CHARACTER_CLASS);
+ } else if (shortProperty.equals(AnalyzerOptions.ALWAYS_REMOVE_SOFT_HYPHENS) ) {
+ alwaysRemoveSoftHyphens = Boolean.valueOf(value);
} else {
return false;
}
return true;
}
+ /**
+ * The second phase of the life-cycle, used for sanity checking.
+ */
public void validate() {
if (pattern != null ) {
if ( className != null && className != PatternAnalyzer.class.getName()) {
@@ -525,6 +665,27 @@
}
className = PatternAnalyzer.class.getName();
}
+ if (this.wordBoundary != null ) {
+ if ( className != null && className != TermCompletionAnalyzer.class.getName()) {
+ throw new RuntimeException("Bad Option: Language range "+languageRange + " with wordBoundary property for class "+ className);
+ }
+ className = TermCompletionAnalyzer.class.getName();
+
+ if ( subWordBoundary == null ) {
+ subWordBoundary = AnalyzerOptions.DEFAULT_SUB_WORD_BOUNDARY;
+ }
+ if ( alwaysRemoveSoftHyphens != null && softHyphens == null ) {
+ throw new RuntimeException("Bad option: Language range "+languageRange + ": must specify softHyphens when setting alwaysRemoveSoftHyphens");
+ }
+ if (softHyphens != null && alwaysRemoveSoftHyphens == null) {
+ alwaysRemoveSoftHyphens = AnalyzerOptions.DEFAULT_ALWAYS_REMOVE_SOFT_HYPHENS;
+ }
+
+ } else if ( subWordBoundary != null || softHyphens != null || alwaysRemoveSoftHyphens != null ||
+ TermCompletionAnalyzer.class.getName().equals(className) ) {
+ throw new RuntimeException("Bad option: Language range "+languageRange + ": must specify wordBoundary for TermCompletionAnalyzer");
+ }
+
if (PatternAnalyzer.class.getName().equals(className) && pattern == null ) {
throw new RuntimeException("Bad Option: Language range "+languageRange + " must specify pattern for PatternAnalyzer.");
}
@@ -537,21 +698,45 @@
}
+ /**
+ * The third and final phase of the life-cycle, used for identifying
+ * the AnalyzerPair.
+ */
private AnalyzerPair construct() throws Exception {
if (className == null) {
return null;
}
if (pattern != null) {
return new PatternAnalyzerPair(this, pattern);
-
- }
+ }
+ if (softHyphens != null) {
+ return new AnalyzerPair(
+ languageRange,
+ new TermCompletionAnalyzer(
+ wordBoundary,
+ subWordBoundary,
+ softHyphens,
+ alwaysRemoveSoftHyphens));
+ }
+ if (wordBoundary != null) {
+ return new AnalyzerPair(
+ languageRange,
+ new TermCompletionAnalyzer(
+ wordBoundary,
+ subWordBoundary));
+ }
final Class<? extends Analyzer> cls = getAnalyzerClass();
if (hasConstructor(cls, Version.class, Set.class)) {
// RussianAnalyzer is missing any way to access stop words.
- if (RussianAnalyzer.class.equals(cls) && useDefaultStopWords()) {
- return new AnalyzerPair(languageRange, new RussianAnalyzer(Version.LUCENE_CURRENT), new RussianAnalyzer(Version.LUCENE_CURRENT, Collections.EMPTY_SET));
+ if (RussianAnalyzer.class.equals(cls)) {
+ if (useDefaultStopWords()) {
+ return new AnalyzerPair(languageRange, new RussianAnalyzer(Version.LUCENE_CURRENT), new RussianAnalyzer(Version.LUCENE_CURRENT, Collections.EMPTY_SET));
+ }
+ if (doNotUseStopWords()) {
+ return new AnalyzerPair(languageRange, new RussianAnalyzer(Version.LUCENE_CURRENT, Collections.EMPTY_SET));
+ }
}
return new VersionSetAnalyzerPair(this, cls);
}
@@ -569,6 +754,29 @@
throw new RuntimeException("Bad option: cannot find constructor for class " + className + " for language range " + languageRange);
}
+ /**
+ * Also part of the third phase of the life-cycle, following the {@link AnalyzerOptions#LIKE}
+ * properties.
+ * @param depth
+ * @param max
+ * @param analyzers
+ * @return
+ */
+ AnalyzerPair followLikesToAnalyzerPair(int depth, int max,
+ Map<String, ConfigOptionsToAnalyzer> analyzers) {
+ if (result == null) {
+ if (depth == max) {
+ throw new RuntimeException("Bad configuration: - 'like' loop for language range " + languageRange);
+ }
+ ConfigOptionsToAnalyzer next = analyzers.get(like);
+ if (next == null) {
+ throw new RuntimeException("Bad option: - 'like' not found for language range " + languageRange+ " (not found: '"+ like +"')");
+ }
+ result = new AnalyzerPair(languageRange, next.followLikesToAnalyzerPair(depth+1, max, analyzers));
+ }
+ return result;
+ }
+
protected Class<? extends Analyzer> getAnalyzerClass() {
return getAnalyzerClass(className);
}
@@ -587,22 +795,6 @@
void setAnalyzerPair(AnalyzerPair ap) {
result = ap;
}
-
- AnalyzerPair followLikesToAnalyzerPair(int depth, int max,
- Map<String, ConfigOptionsToAnalyzer> analyzers) {
- if (result == null) {
- if (depth == max) {
- throw new RuntimeException("Bad configuration: - 'like' loop for language range " + languageRange);
- }
- ConfigOptionsToAnalyzer next = analyzers.get(like);
- if (next == null) {
- throw new RuntimeException("Bad option: - 'like' not found for language range " + languageRange+ " (not found: '"+ like +"')");
- }
- result = new AnalyzerPair(languageRange, next.followLikesToAnalyzerPair(depth+1, max, analyzers));
- }
- return result;
- }
-
}
private final AnalyzerPair config[];
@@ -615,12 +807,19 @@
* strategy so the code will still work on the {@link #MAX_LANG_CACHE_SIZE}+1 th entry.
*/
private static final int MAX_LANG_CACHE_SIZE = 500;
+
private String defaultLanguage;
private final FullTextIndex<?> fullTextIndex;
+ /**
+ * Builds a new ConfigurableAnalyzerFactory.
+ * @param fullTextIndex
+ */
public ConfigurableAnalyzerFactory(final FullTextIndex<?> fullTextIndex) {
+ // A description of the operation of this method is found on AnalyzerPair and
+ // ConfigOptionsToAnalyzer.
// despite our name, we actually make all the analyzers now, and getAnalyzer method is merely a lookup.
if (fullTextIndex == null)
@@ -717,9 +916,9 @@
while (en.hasMoreElements()) {
String prop = (String)en.nextElement();
- if (prop.equals(Options.INCLUDE_DEFAULTS)) continue;
+ if (prop.equals(Options.NATURAL_LANGUAGE_SUPPORT)) continue;
if (prop.startsWith(Options.ANALYZER)) {
- String languageRangeAndProperty[] = prop.substring(Options.ANALYZER.length()).split("[.]");
+ String languageRangeAndProperty[] = prop.substring(Options.ANALYZER.length()).replaceAll("_","*").split("[.]");
if (languageRangeAndProperty.length == 2) {
String languageRange = languageRangeAndProperty[0].toLowerCase(Locale.US); // Turkish "I" could create a problem
@@ -745,25 +944,29 @@
protected Properties initProperties() {
final Properties parentProperties = fullTextIndex.getProperties();
Properties myProps;
- if (Boolean.getBoolean(parentProperties.getProperty(Options.INCLUDE_DEFAULTS, Options.DEFAULT_INCLUDE_DEFAULTS))) {
- myProps = defaultProperties();
+ if (Boolean.valueOf(parentProperties.getProperty(
+ Options.NATURAL_LANGUAGE_SUPPORT,
+ Options.DEFAULT_NATURAL_LANGUAGE_SUPPORT))) {
+
+ myProps = loadPropertyString(ALL_LUCENE_NATURAL_LANGUAGES);
+
+ } else if (hasPropertiesForStarLanguageRange(parentProperties)){
+
+ myProps = new Properties();
+
} else {
- myProps = new Properties();
+
+ myProps = loadPropertyString(LUCENE_STANDARD_ANALYZER);
}
copyRelevantProperties(fullTextIndex.getProperties(), myProps);
-
- if (myProps.isEmpty()) {
- return defaultProperties();
- } else {
- return myProps;
- }
+ return myProps;
}
- protected Properties defaultProperties() {
+ Properties loadPropertyString(String props) {
Properties rslt = new Properties();
try {
- rslt.load(new StringReader(DEFAULT_PROPERTIES));
+ rslt.load(new StringReader(props));
} catch (IOException e) {
throw new RuntimeException("Impossible - well clearly not!", e);
}
@@ -780,6 +983,17 @@
}
}
+ private boolean hasPropertiesForStarLanguageRange(Properties from) {
+ Enumeration<?> en = from.propertyNames();
+ while (en.hasMoreElements()) {
+ String prop = (String)en.nextElement();
+ if (prop.startsWith(Options.ANALYZER+"_.")
+ || prop.startsWith(Options.ANALYZER+"*.")) {
+ return true;
+ }
+ }
+ return false;
+ }
@Override
public Analyzer getAnalyzer(String languageCode, boolean filterStopwords) {
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/DefaultAnalyzerFactory.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/DefaultAnalyzerFactory.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/DefaultAnalyzerFactory.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -29,7 +29,6 @@
import java.util.Collections;
import java.util.HashMap;
-import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
@@ -52,11 +51,21 @@
import com.bigdata.btree.keys.KeyBuilder;
/**
- * Default implementation registers a bunch of {@link Analyzer}s for various
- * language codes and then serves the appropriate {@link Analyzer} based on
- * the specified language code.
+ * This is the default implementation but should be regarded as legacy since
+ * it fails to use the correct {@link Analyzer} for almost all languages (other than
+ * English). It uses the correct natural language analyzer only for literals tagged with
+ * certain three letter ISO 639 codes:
+ * "por", "deu", "ger", "zho", "chi", "jpn", "kor", "ces", "cze", "dut", "nld", "gre", "ell",
+ * "fra", "fre", "rus" and "tha". All other tags are treated as English.
+ * These codes do not work if they are used with subtags, e.g. "ger-AT" is treated as English.
+ * No two letter code other than "en" works correctly: note that the W3C and
+ * IETF recommend the use of the two letter forms instead of the three letter forms.
*
* @author <a href="mailto:tho...@us...">Bryan Thompson</a>
+ * @deprecated Use {@link ConfigurableAnalyzerFactory} with
+ * the {@link ConfigurableAnalyzerFactory.Options#NATURAL_LANGUAGE_SUPPORT} option,
+ * which uses the appropriate natural language analyzers for the two letter codes
+ * and for tags which include sub-tags.
* @version $Id$
*/
public class DefaultAnalyzerFactory implements IAnalyzerFactory {
Added: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/TermCompletionAnalyzer.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/TermCompletionAnalyzer.java (rev 0)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/TermCompletionAnalyzer.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -0,0 +1,248 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2014. All rights reserved.
+
+Contact:
+ SYSTAP, LLC
+ 4501 Tower Road
+ Greensboro, NC 27410
+ lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+*/
+/*
+ * Created on May 8, 2014 by Jeremy J. Carroll, Syapse Inc.
+ */
+package com.bigdata.search;
+
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+import java.nio.CharBuffer;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.KeywordAnalyzer;
+import org.apache.lucene.analysis.tokenattributes.TermAttribute;
+
+
+/**
+ * An analyzer intended for the term-completion use case; particularly
+ * for technical vocabularies and concept schemes.
+ *
+ * <p>
+ * This analyzer generates several index terms for each word in the input.
+ * These are intended to match short sequences (e.g. three or more) characters
+ * of user-input, to then give the user a drop-down list of matching terms.
+ * <p>
+ * This can be set up to address issues like matching <q>half-time</q> when the user types
+ * <q>tim</q>, or when the user types <q>halft</q> (treating the hyphen as a soft hyphen); or
+ * to match <q>TermCompletionAnalyzer</q> when the user types <q>Ana</q>.
+ * <p>
+ * In contrast, the Lucene Analyzers are mainly geared around the free text search use
+ * case.
+ * <p>
+ * The intended use cases will typically involve a prefix query of the form:
+ * <pre>
+ * ?t bds:search "prefix*" .
+ * </pre>
+ * to find all literals in the selected graphs, which are indexed by a term starting in <q>prefix</q>,
+ * so the problem this class addresses is finding the appropriate index terms to allow
+ * matching, at sensible points, mid-way through words (such as at hyphens).
+ * <p>
+ * To get maximum effectiveness it may be best to use private use subtags (see RFC 5646),
+ * e.g. <code>"x-term"</code>
+ * which are mapped to this class by {@link ConfigurableAnalyzerFactory} for
+ * the data being loaded into the store, and linked to some very simple process
+ * like {@link KeywordAnalyzer} for queries which are tagged with a different language tag
+ * that is only used for <code>bds:search</code>, e.g. <code>"x-query"</code>.
+ * The above prefix query then becomes:
+ * <pre>
+ * ?t bds:search "prefix*"@x-query .
+ * </pre>
+ *
+ *
+ *
+ * @author jeremycarroll
+ *
+ */
+public class TermCompletionAnalyzer extends Analyzer {
+
+ private final Pattern wordBoundary;
+ private final Pattern subWordBoundary;
+
+ private final Pattern discard;
+ private final boolean alwaysDiscard;
+
+ /**
+ * Divide the input into words and short tokens
+ * as with {@link #TermCompletionAnalyzer(Pattern, Pattern)}.
+ * Each term is generated, and then an additional term
+ * is generated with the soft hyphens (defined by the softHyphens pattern)
+ * removed. If the alwaysRemoveSoftHypens flag is true,
+ * then the first term (before the removal) is suppressed.
+ *
+ * @param wordBoundary The definition of space (e.g. " ")
+ * @param subWordBoundary Also index after matches to this (e.g. "-")
+ * @param softHyphens Discard characters matching this pattern from matches
+ * @param alwaysRemoveSoftHypens If false, both the original term and the term with soft hyphens removed are generated.
+ */
+ public TermCompletionAnalyzer(Pattern wordBoundary,
+ Pattern subWordBoundary,
+ Pattern softHyphens,
+ boolean alwaysRemoveSoftHypens) {
+ this.wordBoundary = wordBoundary;
+ this.subWordBoundary = subWordBoundary;
+ if (softHyphens != null) {
+ discard = softHyphens;
+ alwaysDiscard = alwaysRemoveSoftHypens;
+ } else {
+ discard = Pattern.compile("(?!)"); // never matches
+ alwaysDiscard = true;
+ }
+ }
+ /**
+ * Divide the input into words, separated by the wordBoundary,
+ * and return a token for each whole word, and then
+ * generate further tokens for each word by removing prefixes
+ * up to and including each successive match of
+ * subWordBoundary.
+ * @param wordBoundary
+ * @param subWordBoundary
+ */
+ public TermCompletionAnalyzer(Pattern wordBoundary,
+ Pattern subWordBoundary) {
+ this(wordBoundary, subWordBoundary, null, true);
+ }
+
+
+ @Override
+ public TokenStream tokenStream(String ignoredFieldName, Reader reader) {
+ return new TermCompletionTokenStream((StringReader)reader);
+ }
+
+ /**
+ * This class has three processes going on,
+ * all driven from the {@link #incrementToken()} method.
+ *
+ * One process is that of iterating over the words in the input:
+ * - the words are identified in the constructor, and the iteration
+ * is performed by {@link #nextWord()}
+ *
+ * - the subword boundaries are identified in {@link #next()}
+ * We then set up {@link #found} to contain the most
+ * recently found subword.
+ *
+ * - the soft hyphen discarding is processed in {@link #maybeDiscardHyphens()}
+ *
+ * - if we are not {@link #alwaysDiscard}ing then {@link #afterDiscard}
+ * can be set to null to return the non-discarded version on the next cycle.
+ *
+ */
+ private class TermCompletionTokenStream extends TokenStream {
+
+ final String[] words;
+ final TermAttribute termAtt;
+
+
+
+ char currentWord[] = new char[]{};
+ Matcher softMatcher;
+ int currentWordIx = -1;
+
+
+ int charPos = 0;
+ private String afterDiscard;
+ private CharBuffer found;
+
+ public TermCompletionTokenStream(StringReader reader) {
+ termAtt = addAttribute(TermAttribute.class);
+ try {
+ reader.mark(Integer.MAX_VALUE);
+ int length = (int) reader.skip(Integer.MAX_VALUE);
+ reader.reset();
+ char fileContent[] = new char[length];
+ reader.read(fileContent);
+ words = wordBoundary.split(new String(fileContent));
+ } catch (IOException e) {
+ throw new RuntimeException("Impossible",e);
+ }
+ }
+
+ @Override
+ public boolean incrementToken() throws IOException {
+ if ( next() ) {
+ if (afterDiscard != null) {
+ int lg = afterDiscard.length();
+ afterDiscard.getChars(0, lg, termAtt.termBuffer(), 0);
+ termAtt.setTermLength(lg);
+ } else {
+ int lg = found.length();
+ found.get(termAtt.termBuffer(), 0, lg);
+ termAtt.setTermLength(lg);
+ }
+ return true;
+ } else {
+ return false;
+ }
+ }
+
+ private boolean next() {
+ if (currentWordIx >= words.length) {
+ return false;
+ }
+ if (!alwaysDiscard) {
+ // Last match was the discarded version,
+ // now do the non-discard version.
+ if (afterDiscard != null) {
+ afterDiscard = null;
+ return true;
+ }
+ }
+ afterDiscard = null;
+ if (charPos + 1 < currentWord.length && softMatcher.find(charPos+1)) {
+ charPos = softMatcher.end();
+ maybeDiscardHyphens();
+ return true;
+ } else {
+ return nextWord();
+ }
+ }
+
+ void maybeDiscardHyphens() {
+ found = CharBuffer.wrap(currentWord, charPos, currentWord.length - charPos);
+ Matcher discarding = discard.matcher(found);
+ if (discarding.find()) {
+ afterDiscard = discarding.replaceAll("");
+ }
+ }
+
+ private boolean nextWord() {
+ currentWordIx++;
+ if (currentWordIx >= words.length) {
+ return false;
+ }
+ currentWord = words[currentWordIx].toCharArray();
+ termAtt.resizeTermBuffer(currentWord.length);
+ charPos = 0;
+ softMatcher = subWordBoundary.matcher(words[currentWordIx]);
+ maybeDiscardHyphens();
+ return true;
+ }
+
+ }
+
+}
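The javadoc above describes the token-generation rules in prose: split the input on the word boundary, emit the suffix starting at each sub-word boundary match, and (when a soft-hyphen pattern is supplied) emit a variant with those characters removed, optionally alongside the unmodified form. The following standalone sketch, assuming plain `java.util.regex` rather than the Lucene `TokenStream` machinery in the patch, illustrates those rules; the class and method names are hypothetical and not part of the commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch (not the Lucene analyzer in the diff) of the
// term-generation rules described in the TermCompletionAnalyzer javadoc.
public class TermCompletionSketch {

    // For each word (split on wordBoundary) emit the suffix starting at
    // each sub-word boundary; when softHyphen matches inside a suffix,
    // emit the suffix with those characters removed, and optionally also
    // the unmodified suffix.
    static List<String> terms(String input, Pattern wordBoundary,
                              Pattern subWordBoundary, Pattern softHyphen,
                              boolean alwaysDiscard) {
        List<String> out = new ArrayList<>();
        for (String word : wordBoundary.split(input)) {
            if (word.isEmpty())
                continue;
            int pos = 0;
            while (true) {
                String suffix = word.substring(pos);
                String stripped = softHyphen.matcher(suffix).replaceAll("");
                if (!stripped.equals(suffix)) {
                    out.add(stripped);          // soft hyphens removed
                    if (!alwaysDiscard)
                        out.add(suffix);        // original form as well
                } else {
                    out.add(suffix);
                }
                if (pos + 1 >= word.length())
                    break;
                Matcher m = subWordBoundary.matcher(word);
                if (!m.find(pos + 1))
                    break;
                pos = m.end();
                if (pos >= word.length())
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "half-time" with "-" as both sub-word boundary and soft hyphen:
        // indexes "halftime", "half-time", and "time", so the user inputs
        // "halft" and "tim" from the javadoc both find a matching term.
        System.out.println(terms("half-time", Pattern.compile("\\s+"),
                Pattern.compile("-"), Pattern.compile("-"), false));
    }
}
```

With `alwaysDiscard` set to true the sketch suppresses the hyphenated form, mirroring the `alwaysRemoveSoftHypens` flag of the real constructor.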
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -22,151 +22,25 @@
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
/*
- * Created on May 7, 2014
+ * Created on May 9, 2014
*/
package com.bigdata.search;
-import java.io.IOException;
-import java.io.StringReader;
-
-import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.analysis.tokenattributes.TermAttribute;
-
public abstract class AbstractAnalyzerFactoryTest extends AbstractSearchTest {
- public AbstractAnalyzerFactoryTest() {
+ public AbstractAnalyzerFactoryTest() {
}
-
- public AbstractAnalyzerFactoryTest(String arg0) {
- super(arg0);
+
+ public AbstractAnalyzerFactoryTest(String arg0) {
+ super(arg0);
}
-
- public void setUp() throws Exception {
- super.setUp();
- init(getExtraProperties());
- }
- abstract String[] getExtraProperties();
-
- private Analyzer getAnalyzer(String lang, boolean filterStopWords) {
- return getNdx().getAnalyzer(lang, filterStopWords);
+
+ @Override
+ public void setUp() throws Exception {
+ super.setUp();
+ init(getExtraProperties());
}
-
- private void comparisonTest(String lang,
- boolean stopWordsSignificant,
- String text,
- String spaceSeparated) throws IOException {
- compareTokenStream(getAnalyzer(lang, stopWordsSignificant), text,
- spaceSeparated.split(" ")); //$NON-NLS-1$
- }
- private void compareTokenStream(Analyzer a, String text, String expected[]) throws IOException {
- TokenStream s = a.tokenStream(null, new StringReader(text));
- int ix = 0;
- while (s.incrementToken()) {
- final TermAttribute term = s.getAttribute(TermAttribute.class);
- final String word = term.term();
- assertTrue(ix < expected.length);
- assertEquals(word, expected[ix++]);
- }
- assertEquals(ix, expected.length);
- }
-
- public void testEnglishFilterStopWords() throws IOException {
- for (String lang: new String[]{ "eng", null, "" }) { //$NON-NLS-1$ //$NON-NLS-2$
- comparisonTest(lang,
- true,
- "The test to end all tests! Forever.", //$NON-NLS-1$
- "test end all tests forever" //$NON-NLS-1$
- );
- }
- }
- public void testEnglishNoFilter() throws IOException {
- for (String lang: new String[]{ "eng", null, "" }) { //$NON-NLS-1$ //$NON-NLS-2$
- comparisonTest(lang,
- false,
- "The test to end all tests! Forever.", //$NON-NLS-1$
- "the test to end all tests forever" //$NON-NLS-1$
- );
- }
- }
-
- // Note we careful use a three letter language code for german.
- // 'de' is more standard, but the DefaultAnalyzerFactory does not
- // implement 'de' correctly.
- public void testGermanFilterStopWords() throws IOException {
- comparisonTest("ger", //$NON-NLS-1$
- true,
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.10") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.11"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.12") //$NON-NLS-1$
- );
-
- }
+ abstract String[] getExtraProperties();
- // Note we careful use a three letter language code for Russian.
- // 'ru' is more standard, but the DefaultAnalyzerFactory does not
- // implement 'ru' correctly.
- public void testRussianFilterStopWords() throws IOException {
- comparisonTest("rus", //$NON-NLS-1$
- true,
- // I hope this is not offensive text.
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.14") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.15"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.16") //$NON-NLS-1$
- );
-
- }
- public void testGermanNoStopWords() throws IOException {
- comparisonTest("ger", //$NON-NLS-1$
- false,
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.18") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.19"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.20") //$NON-NLS-1$
- );
-
- }
- public void testRussianNoStopWords() throws IOException {
- comparisonTest("rus", //$NON-NLS-1$
- false,
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.22") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.23"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.24") //$NON-NLS-1$
- );
-
- }
- public void testJapanese() throws IOException {
- for (boolean filterStopWords: new Boolean[]{true, false}) {
- comparisonTest("jpn", //$NON-NLS-1$
- filterStopWords,
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.26"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.27") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.28") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.29")); //$NON-NLS-1$
- }
- }
- public void testConfiguredLanguages() {
- checkConfig("BrazilianAnalyzer", "por", "pt"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
- checkConfig("ChineseAnalyzer", "zho", "chi", "zh"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("CJKAnalyzer", "jpn", "ja", "kor", "ko"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$ //$NON-NLS-5$
- checkConfig("CzechAnalyzer", "ces", "cze", "cs"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("DutchAnalyzer", "dut", "nld", "nl"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("GermanAnalyzer", "deu", "ger", "de"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("GreekAnalyzer", "gre", "ell", "el"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("RussianAnalyzer", "rus", "ru"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
- checkConfig("ThaiAnalyzer", "th", "tha"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
- checkConfig("StandardAnalyzer", "en", "eng", "", null); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- }
-
- private void checkConfig(String classname, String ...langs) {
- for (String lang:langs) {
- // The DefaultAnalyzerFactory only works for language tags of length exactly three.
-// if (lang != null && lang.length()==3)
- {
- assertEquals(classname, getAnalyzer(lang,true).getClass().getSimpleName());
- assertEquals(classname, getAnalyzer(lang+NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.0"),true).getClass().getSimpleName()); //$NON-NLS-1$
- }
- }
-
- }
}
Copied: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractDefaultAnalyzerFactoryTest.java (from rev 8253, branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java)
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractDefaultAnalyzerFactoryTest.java (rev 0)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractDefaultAnalyzerFactoryTest.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -0,0 +1,133 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2014. All rights reserved.
+
+Contact:
+ SYSTAP, LLC
+ 4501 Tower Road
+ Greensboro, NC 27410
+ lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+*/
+/*
+ * Created on May 7, 2014
+ */
+package com.bigdata.search;
+
+import java.io.IOException;
+
+
+public abstract class AbstractDefaultAnalyzerFactoryTest extends AbstractAnalyzerFactoryTest {
+
+ public AbstractDefaultAnalyzerFactoryTest() {
+ }
+
+ public AbstractDefaultAnalyzerFactoryTest(String arg0) {
+ super(arg0);
+ }
+
+ public void testEnglishFilterStopWords() throws IOException {
+ for (String lang: new String[]{ "eng", null, "" }) { //$NON-NLS-1$ //$NON-NLS-2$
+ comparisonTest(lang,
+ true,
+ "The test to end all tests! Forever.", //$NON-NLS-1$
+ "test end all tests forever" //$NON-NLS-1$
+ );
+ }
+ }
+ public void testEnglishNoFilter() throws IOException {
+ for (String lang: new String[]{ "eng", null, "" }) { //$NON-NLS-1$ //$NON-NLS-2$
+ comparisonTest(lang,
+ false,
+ "The test to end all tests! Forever.", //$NON-NLS-1$
+ "the test to end all tests forever" //$NON-NLS-1$
+ );
+ }
+ }
+
+ // Note we carefully use a three-letter language code for German.
+ // 'de' is more standard, but the DefaultAnalyzerFactory does not
+ // implement 'de' correctly.
+ public void testGermanFilterStopWords() throws IOException {
+ comparisonTest("ger", //$NON-NLS-1$
+ true,
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.10") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.11"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.12") //$NON-NLS-1$
+ );
+
+ }
+
+ // Note we carefully use a three-letter language code for Russian.
+ // 'ru' is more standard, but the DefaultAnalyzerFactory does not
+ // implement 'ru' correctly.
+ public void testRussianFilterStopWords() throws IOException {
+ comparisonTest("rus", //$NON-NLS-1$
+ true,
+ // I hope this is not offensive text.
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.14") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.15"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.16") //$NON-NLS-1$
+ );
+
+ }
+ public void testGermanNoStopWords() throws IOException {
+ comparisonTest("ger", //$NON-NLS-1$
+ false,
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.18") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.19"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.20") //$NON-NLS-1$
+ );
+
+ }
+ public void testRussianNoStopWords() throws IOException {
+ comparisonTest("rus", //$NON-NLS-1$
+ false,
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.22") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.23"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.24") //$NON-NLS-1$
+ );
+
+ }
+ public void testJapanese() throws IOException {
+ for (boolean filterStopWords: new Boolean[]{true, false}) {
+ comparisonTest("jpn", //$NON-NLS-1$
+ filterStopWords,
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.26"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.27") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.28") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.29")); //$NON-NLS-1$
+ }
+ }
+ public void testConfiguredLanguages() {
+ checkConfig("BrazilianAnalyzer", "por", "pt"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
+ checkConfig("ChineseAnalyzer", "zho", "chi", "zh"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("CJKAnalyzer", "jpn", "ja", "kor", "ko"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$ //$NON-NLS-5$
+ checkConfig("CzechAnalyzer", "ces", "cze", "cs"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("DutchAnalyzer", "dut", "nld", "nl"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("GermanAnalyzer", "deu", "ger", "de"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("GreekAnalyzer", "gre", "ell", "el"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("RussianAnalyzer", "rus", "ru"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
+ checkConfig("ThaiAnalyzer", "th", "tha"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
+ checkConfig("StandardAnalyzer", "en", "eng", "", null); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ }
+
+ @Override
+ protected void checkConfig(String classname, String ...langs) {
+ checkConfig(isBroken(), classname, langs);
+
+ }
+ abstract boolean isBroken() ;
+}
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractSearchTest.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractSearchTest.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractSearchTest.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -26,8 +26,14 @@
*/
package com.bigdata.search;
+import java.io.IOException;
+import java.io.StringReader;
import java.util.Properties;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.TermAttribute;
+
import com.bigdata.journal.IIndexManager;
import com.bigdata.journal.ITx;
import com.bigdata.journal.ProxyTestCase;
@@ -62,7 +68,7 @@
}
FullTextIndex<Long> createFullTextIndex(String namespace, String ...propertyValuePairs) {
- return createFullTextIndex(namespace, getProperties(), propertyValuePairs);
+ return createFullTextIndex(namespace, (Properties)getProperties().clone(), propertyValuePairs);
}
public void tearDown() throws Exception {
@@ -92,4 +98,65 @@
return properties;
}
+ protected Analyzer getAnalyzer(String lang, boolean filterStopWords) {
+ return getNdx().getAnalyzer(lang, filterStopWords);
+ }
+
+ protected void comparisonTest(String lang, boolean filterStopWords, String text, String spaceSeparated)
+ throws IOException {
+ if (spaceSeparated == null) {
+ String rslt = getTokenStream(getAnalyzer(lang, filterStopWords), text);
+ throw new RuntimeException("Got \"" + rslt+ "\"");
+ }
+ compareTokenStream(getAnalyzer(lang, filterStopWords), text,
+ split(spaceSeparated)); //$NON-NLS-1$
+ }
+
+ private String[] split(String spaceSeparated) {
+ if (spaceSeparated.length()==0) {
+ return new String[0];
+ }
+ return spaceSeparated.split(" ");
+ }
+
+ protected String getTokenStream(Analyzer a, String text) throws IOException {
+ StringBuffer sb = new StringBuffer();
+ TokenStream s = a.tokenStream(null, new StringReader(text));
+ while (s.incrementToken()) {
+ final TermAttribute term = s.getAttribute(TermAttribute.class);
+ if (sb.length()!=0) {
+ sb.append(" ");
+ }
+ sb.append(term.term());
+ }
+ return sb.toString();
+ }
+
+ private void compareTokenStream(Analyzer a, String text, String expected[]) throws IOException {
+ TokenStream s = a.tokenStream(null, new StringReader(text));
+ int ix = 0;
+ while (s.incrementToken()) {
+ final TermAttribute term = s.getAttribute(TermAttribute.class);
+ final String word = term.term();
+ assertTrue(ix < expected.length);
+ assertEquals(expected[ix++], word);
+ }
+ assertEquals(ix, expected.length);
+ }
+
+ protected void checkConfig(boolean threeLetterOnly, String classname, String ...langs) {
+ for (String lang:langs) {
+ // The DefaultAnalyzerFactory only works for language tags of length exactly three.
+ if ((!threeLetterOnly) || (lang != null && lang.length()==3)) {
+ assertEquals(classname, getAnalyzer(lang,true).getClass().getSimpleName());
+ if (!threeLetterOnly) {
+ assertEquals(classname, getAnalyzer(lang+"-x-foobar",true).getClass().getSimpleName()); //$NON-NLS-1$
+ }
+ }
+ }
+ }
+ protected void checkConfig(String classname, String ...langs) {
+ checkConfig(false, classname, langs);
+ }
+
}
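The `checkConfig` helper in the hunk above appends a private use extension (`"-x-foobar"`) to each language tag and expects the same analyzer to be selected, i.e. analyzer lookup should key on the primary subtag of a BCP 47 tag. A minimal sketch of that extraction, assuming a hypothetical helper that is not part of the patch:

```java
// Hypothetical helper (not in the commit) illustrating the language-tag
// behavior the checkConfig tests exercise: analyzer selection should
// depend only on the primary subtag, so "ger-x-foobar" resolves like "ger".
public class LangTagSketch {

    // Return the primary subtag of a BCP 47 language tag; null or the
    // empty string map to "" (the default-analyzer case in the tests).
    static String primarySubtag(String languageTag) {
        if (languageTag == null || languageTag.isEmpty())
            return "";
        int dash = languageTag.indexOf('-');
        return dash < 0 ? languageTag : languageTag.substring(0, dash);
    }

    public static void main(String[] args) {
        System.out.println(primarySubtag("ger-x-foobar"));
    }
}
```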
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/NonEnglishExamples.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/NonEnglishExamples.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/NonEnglishExamples.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -1,3 +1,29 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2014. All rights reserved.
+
+Contact:
+ SYSTAP, LLC
+ 4501 Tower Road
+ Greensboro, NC 27410
+ lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+*/
+/*
+ * Created on May 7, 2014 by Jeremy J. Carroll, Syapse Inc.
+ */
package com.bigdata.search;
import java.util.MissingResourceException;
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestAll.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestAll.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestAll.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -114,6 +114,8 @@
// which is intended to be the same as the intended
// behavior of DefaultAnalyzerFactory
suite.addTestSuite(TestConfigurableAsDefaultAnalyzerFactory.class);
+ suite.addTestSuite(TestConfigurableAnalyzerFactory.class);
+ suite.addTestSuite(TestUnconfiguredAnalyzerFactory.class);
return suite;
}
Added: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAnalyzerFactory.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAnalyzerFactory.java (rev 0)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAnalyzerFactory.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -0,0 +1,244 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2014. All rights reserved.
+
+Contact:
+ SYSTAP, LLC
+ 4501 Tower Road
+ Greensboro, NC 27410
+ lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Fr...
[truncated message content] |