From: <jer...@us...> - 2014-05-10 11:52:28
Revision: 8263
http://sourceforge.net/p/bigdata/code/8263
Author: jeremy_carroll
Date: 2014-05-10 11:52:25 +0000 (Sat, 10 May 2014)
Log Message:
-----------
Cleaning up of ConfigurableAnalyzerFactory, adding TermCompletionAnalyzer, deprecating DefaultAnalyzerFactory
Finishing of trac 912, work on 915
Unit tests for the old and new behaviors
This merges the branch TEXT_ANALYZERS.
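A note for reviewers: the reworked ConfigurableAnalyzerFactory selects analyzers by matching language tags against configured language ranges using the "Extended Filtering" algorithm of RFC 4647, section 3.3.2 (see the Javadoc changes below). A minimal standalone sketch of that match step follows; the class and method names are illustrative, not the committed code.

```java
import java.util.Locale;

// Minimal standalone sketch of RFC 4647 "Extended Filtering" (section 3.3.2);
// illustrative only, not the committed LanguageRange implementation.
public class ExtendedFilterSketch {

    // Returns true if the language tag matches the extended language range.
    public static boolean matches(String range, String langTag) {
        String[] r = range.toLowerCase(Locale.ROOT).split("-");
        String[] t = langTag.toLowerCase(Locale.ROOT).split("-");
        // Step 2: the first subtags must be equal (or the range starts with '*').
        if (!r[0].equals("*") && !r[0].equals(t[0])) {
            return false;
        }
        int ri = 1, ti = 1;
        // Step 3: walk the remaining subtags.
        while (ri < r.length) {
            if (r[ri].equals("*")) {
                ri++;                 // 3A: '*' matches zero or more subtags
            } else if (ti >= t.length) {
                return false;         // 3B: tag exhausted before the range
            } else if (r[ri].equals(t[ti])) {
                ri++; ti++;           // 3C: subtags match; advance both
            } else if (t[ti].length() == 1) {
                return false;         // 3D: a singleton in the tag blocks matching
            } else {
                ti++;                 // 3E: skip this subtag of the tag
            }
        }
        return true;                  // Step 4: range exhausted, so it matches
    }

    public static void main(String[] args) {
        // Examples adapted from RFC 4647, section 3.3.2.
        if (!matches("de-*-DE", "de-Latn-DE-1996")) throw new AssertionError();
        if (matches("de-*-DE", "de-x-DE")) throw new AssertionError();
    }
}
```

The committed LanguageRange class additionally sorts ranges so that longer ranges are tried first, giving the longest-match behaviour described in the Javadoc.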
Modified Paths:
--------------
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/ConfigurableAnalyzerFactory.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/DefaultAnalyzerFactory.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractSearchTest.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/NonEnglishExamples.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestAll.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAsDefaultAnalyzerFactory.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestDefaultAnalyzerFactory.java
Added Paths:
-----------
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/TermCompletionAnalyzer.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractDefaultAnalyzerFactoryTest.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAnalyzerFactory.java
branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestUnconfiguredAnalyzerFactory.java
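For reviewers, a sketch of how the new options fit together in a configuration. The property names follow the Options/AnalyzerOptions constants in this change; the x-term language tag and the regex values are invented for illustration only.

```properties
# Baseline: enable all the bundled Lucene natural-language analyzers
# (replaces the old includeDefaults option).
com.bigdata.search.ConfigurableAnalyzerFactory.naturalLanguageSupport=true
# Hypothetical private-use tag routed to the new TermCompletionAnalyzer.
# Note: '_' stands in for '*' in wildcard language ranges, since bigdata
# does not allow '*' in property names.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-term.wordBoundary=\\s+
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-term.subWordBoundary=[-]
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-term.softHyphens=-
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-term.alwaysRemoveSoftHyphens=false
```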
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/ConfigurableAnalyzerFactory.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/ConfigurableAnalyzerFactory.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/ConfigurableAnalyzerFactory.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -66,6 +66,7 @@
* Supported classes included all the natural language specific classes from Lucene, and also:
* <ul>
* <li>{@link PatternAnalyzer}
+ * <li>{@link TermCompletionAnalyzer}
* <li>{@link KeywordAnalyzer}
* <li>{@link SimpleAnalyzer}
* <li>{@link StopAnalyzer}
@@ -76,7 +77,6 @@
* <ul>
* <li>no arguments
* <li>{@link Version}
- * <li>{@link Set} (of strings, the stop words)
* <li>{@link Version}, {@link Set}
* </ul>
* is usable. If the class has a static method named <code>getDefaultStopSet()</code> then this is assumed
@@ -89,19 +89,17 @@
* abbreviate to <code>c.b.s.C</code> in this documentation.
* Properties from {@link Options} apply to the factory.
* <p>
- *
- * If there are no such properties at all then the property {@link Options#INCLUDE_DEFAULTS} is set to true,
- * and the behavior of this class is the same as the legacy {@link DefaultAnalyzerFactory}.
- * <p>
* Other properties, from {@link AnalyzerOptions} start with
* <code>c.b.s.C.analyzer.<em>language-range</em></code> where <code><em>language-range</em></code> conforms
- * with the extended language range construct from RFC 4647, section 2.2. These are used to specify
- * an analyzer for the given language range.
+ * with the extended language range construct from RFC 4647, section 2.2.
+ * Note that bigdata does not allow '*' in property names, so the character '_' is
+ * used in place of '*' in extended language ranges within property names.
+ * These are used to specify an analyzer for the given language range.
* <p>
* If no analyzer is specified for the language range <code>*</code> then the {@link StandardAnalyzer} is used.
* <p>
* Given any specific language, then the analyzer matching the longest configured language range,
- * measured in number of subtags is used {@link #getAnalyzer(String, boolean)}
+ * measured in number of subtags, is returned by {@link #getAnalyzer(String, boolean)}.
* In the event of a tie, the alphabetically first language range is used.
* The algorithm to find a match is "Extended Filtering" as defined in section 3.3.2 of RFC 4647.
* <p>
@@ -113,11 +111,13 @@
* <dd>This uses whitespace to tokenize</dd>
* <dt>{@link PatternAnalyzer}</dt>
* <dd>This uses a regular expression to tokenize</dd>
+ * <dt>{@link TermCompletionAnalyzer}</dt>
+ * <dd>This uses up to three regular expressions to specify multiple tokens for each word, to address term completion use cases.</dd>
* <dt>{@link EmptyAnalyzer}</dt>
* <dd>This suppresses the functionality, by treating every expression as a stop word.</dd>
* </dl>
* there are in addition the language specific analyzers that are included
- * by using the option {@link Options#INCLUDE_DEFAULTS}
+ * by using the option {@link Options#NATURAL_LANGUAGE_SUPPORT}
*
*
* @author jeremycarroll
@@ -126,11 +126,26 @@
public class ConfigurableAnalyzerFactory implements IAnalyzerFactory {
final private static transient Logger log = Logger.getLogger(ConfigurableAnalyzerFactory.class);
- static class LanguageRange implements Comparable<LanguageRange> {
+ /**
+ * This is an implementation of RFC 4647 language range,
+ * targeted at the specific needs within bigdata, and only
+ * supporting the extended filtering specified in section 3.3.2
+ * <p>
+ * Language ranges are comparable so that
+ * sorting an array and then matching a language tag against each
+ * member of the array in sequence will give the longest match,
+ * i.e. the longer ranges come first.
+ * @author jeremycarroll
+ *
+ */
+ public static class LanguageRange implements Comparable<LanguageRange> {
private final String range[];
private final String full;
-
+ /**
+ * Note: the range must be in lower case; this is not verified.
+ * @param range
+ */
public LanguageRange(String range) {
this.range = range.split("-");
full = range;
@@ -174,12 +189,22 @@
return full.hashCode();
}
+ /**
+ * This implements the algorithm of section 3.3.2 of RFC 4647
+ * as modified with the observation about private use tags
+ * in <a href="http://lists.w3.org/Archives/Public/www-international/2014AprJun/0084">
+ * this message</a>.
+ *
+ *
+ * @param langTag The RFC 5646 Language tag in lower case
+ * @return The result of the algorithm
+ */
public boolean extendedFilterMatch(String langTag) {
return extendedFilterMatch(langTag.toLowerCase(Locale.ROOT).split("-"));
}
// See RFC 4647, 3.3.2
- public boolean extendedFilterMatch(String[] language) {
+ boolean extendedFilterMatch(String[] language) {
// RFC 4647 step 2
if (!matchSubTag(language[0], range[0])) {
return false;
@@ -227,13 +252,14 @@
*/
public interface Options {
/**
- * By setting this option to true, then the behavior of the legacy {@link DefaultAnalyzerFactory}
- * is added, and may be overridden by the settings of the user.
+ * If this option is set to true, then all the known Lucene Analyzers for natural
+ * languages are used for a range of language tags.
+ * These settings may then be overridden by the settings of the user.
* Specifically the following properties are loaded, prior to loading the
* user's specification (with <code>c.b.s.C</code> expanding to
* <code>com.bigdata.search.ConfigurableAnalyzerFactory</code>)
<pre>
-c.b.s.C.analyzer.*.like=eng
+c.b.s.C.analyzer._.like=eng
c.b.s.C.analyzer.por.analyzerClass=org.apache.lucene.analysis.br.BrazilianAnalyzer
c.b.s.C.analyzer.pt.like=por
c.b.s.C.analyzer.zho.analyzerClass=org.apache.lucene.analysis.cn.ChineseAnalyzer
@@ -265,18 +291,13 @@
*
*
*/
- String INCLUDE_DEFAULTS = ConfigurableAnalyzerFactory.class.getName() + ".includeDefaults";
+ String NATURAL_LANGUAGE_SUPPORT = ConfigurableAnalyzerFactory.class.getName() + ".naturalLanguageSupport";
/**
* This is the prefix to all properties configuring the individual analyzers.
*/
String ANALYZER = ConfigurableAnalyzerFactory.class.getName() + ".analyzer.";
-/**
- * If there is no configuration at all, then the defaults are included,
- * but any configuration at all totally replaces the defaults, unless
- * {@link #INCLUDE_DEFAULTS}
- * is explicitly set to true.
- */
- String DEFAULT_INCLUDE_DEFAULTS = "false";
+
+ String DEFAULT_NATURAL_LANGUAGE_SUPPORT = "false";
}
/**
* Options understood by analyzers created by {@link ConfigurableAnalyzerFactory}.
@@ -286,7 +307,9 @@
/**
* If specified this is the fully qualified name of a subclass of {@link Analyzer}
* that has appropriate constructors.
- * Either this or {@link #LIKE} or {@link #PATTERN} must be specified for each language range.
+ * This is set implicitly if some of the options below are selected (for example {@link #PATTERN}).
+ * For each configured language range, if it is not set, either explicitly or implicitly, then
+ * {@link #LIKE} must be specified.
*/
String ANALYZER_CLASS = "analyzerClass";
@@ -326,16 +349,52 @@
String STOPWORDS_VALUE_NONE = "none";
/**
- * If this property is present then the analyzer being used is a
- * {@link PatternAnalyzer} and the value is the pattern to use.
+ * The value of the pattern parameter to
+ * {@link PatternAnalyzer#PatternAnalyzer(Version, Pattern, boolean, Set)}
* (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
* It is an error if a different analyzer class is specified.
*/
- String PATTERN = ".pattern";
+ String PATTERN = "pattern";
+ /**
+ * The value of the wordBoundary parameter to
+ * {@link TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)}
+ * (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
+ * It is an error if a different analyzer class is specified.
+ */
+ String WORD_BOUNDARY = "wordBoundary";
+ /**
+ * The value of the subWordBoundary parameter to
+ * {@link TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)}
+ * (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
+ * It is an error if a different analyzer class is specified.
+ */
+ String SUB_WORD_BOUNDARY = "subWordBoundary";
+ /**
+ * The value of the softHyphens parameter to
+ * {@link TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)}
+ * (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
+ * It is an error if a different analyzer class is specified.
+ */
+ String SOFT_HYPHENS = "softHyphens";
+ /**
+ * The value of the alwaysRemoveSoftHypens parameter to
+ * {@link TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)}
+ * (Note the {@link Pattern#UNICODE_CHARACTER_CLASS} flag is enabled).
+ * It is an error if a different analyzer class is specified.
+ */
+ String ALWAYS_REMOVE_SOFT_HYPHENS = "alwaysRemoveSoftHyphens";
+
+ boolean DEFAULT_ALWAYS_REMOVE_SOFT_HYPHENS = false;
+
+ /**
+ * The default sub-word boundary is a pattern that never matches,
+ * i.e. there are no sub-word boundaries.
+ */
+ Pattern DEFAULT_SUB_WORD_BOUNDARY = Pattern.compile("(?!)");
}
- private static final String DEFAULT_PROPERTIES =
+ private static final String ALL_LUCENE_NATURAL_LANGUAGES =
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.*.like=eng\n" +
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.por.analyzerClass=org.apache.lucene.analysis.br.BrazilianAnalyzer\n" +
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.pt.like=por\n" +
@@ -365,33 +424,67 @@
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.eng.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer\n" +
"com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.en.like=eng\n";
+ private static final String LUCENE_STANDARD_ANALYZER =
+ "com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.*.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer\n";
+
+ /**
+ * This comment describes the implementation of {@link ConfigurableAnalyzerFactory}.
+ * The only method in the interface is {@link ConfigurableAnalyzerFactory#getAnalyzer(String, boolean)};
+ * a map is used from language tag to {@link AnalyzerPair}, where the pair contains
+ * an {@link Analyzer} both with and without stopwords configured (sometimes these two analyzers
+ * are identical, if, for example, stop words are not supported or not required).
+ * <p>
+ * If there is no entry for the language tag in the map {@link ConfigurableAnalyzerFactory#langTag2AnalyzerPair},
+ * then one is created, by walking down the array {@link ConfigurableAnalyzerFactory#config} of AnalyzerPairs
+ * until a matching one is found.
+ * <p>
+ * The bulk of the code in this class is invoked from the constructor in order to set up this
+ * {@link ConfigurableAnalyzerFactory#config} array. For example, all of the subclasses of {@link AnalyzerPair}s,
+ * are simply to call the appropriate constructor in the appropriate way: the difficulty is that many subclasses
+ * of {@link Analyzer} have constructors with different signatures, and our code needs to navigate each sort.
+ * @author jeremycarroll
+ *
+ */
private static class AnalyzerPair implements Comparable<AnalyzerPair>{
- private final LanguageRange range;
+ final LanguageRange range;
private final Analyzer withStopWords;
private final Analyzer withoutStopWords;
+ public Analyzer getAnalyzer(boolean filterStopwords) {
+ return filterStopwords ? withStopWords : withoutStopWords;
+ }
+
+ public boolean extendedFilterMatch(String[] language) {
+ return range.extendedFilterMatch(language);
+ }
+
AnalyzerPair(String range, Analyzer withStopWords, Analyzer withOutStopWords) {
this.range = new LanguageRange(range);
this.withStopWords = withStopWords;
this.withoutStopWords = withOutStopWords;
}
+ /**
+ * This clone constructor implements {@link AnalyzerOptions#LIKE}.
+ * @param range
+ * @param copyMe
+ */
AnalyzerPair(String range, AnalyzerPair copyMe) {
this.range = new LanguageRange(range);
this.withStopWords = copyMe.withStopWords;
this.withoutStopWords = copyMe.withoutStopWords;
-
}
-
- public Analyzer getAnalyzer(boolean filterStopwords) {
- return filterStopwords ? withStopWords : withoutStopWords;
- }
- @Override
- public String toString() {
- return range.full + "=(" + withStopWords.getClass().getSimpleName() +")";
- }
-
+ /**
+ * If we have a constructor whose arguments include a populated
+ * stop word set, then we can use it to make both the withStopWords
+ * analyzer and the withoutStopWords analyzer.
+ * @param range
+ * @param cons A Constructor including a {@link java.util.Set} argument
+ * for the stop words.
+ * @param params The arguments to pass to the constructor including a populated stopword set.
+ * @throws Exception
+ */
AnalyzerPair(String range, Constructor<? extends Analyzer> cons, Object ... params) throws Exception {
this(range, cons.newInstance(params), cons.newInstance(useEmptyStopWordSet(params)));
}
@@ -409,38 +502,52 @@
}
return rslt;
}
+
@Override
+ public String toString() {
+ return range.full + "=(" + withStopWords.getClass().getSimpleName() +")";
+ }
+
+ @Override
public int compareTo(AnalyzerPair o) {
return range.compareTo(o.range);
}
-
- public boolean extendedFilterMatch(String[] language) {
- return range.extendedFilterMatch(language);
- }
}
+ /**
+ * Used for Analyzer classes with a constructor with signature (Version, Set).
+ * @author jeremycarroll
+ *
+ */
private static class VersionSetAnalyzerPair extends AnalyzerPair {
public VersionSetAnalyzerPair(ConfigOptionsToAnalyzer lro,
Class<? extends Analyzer> cls) throws Exception {
super(lro.languageRange, getConstructor(cls, Version.class, Set.class), Version.LUCENE_CURRENT, lro.getStopWords());
}
}
-
+
+ /**
+ * Used for Analyzer classes which do not support stopwords and have a constructor with signature (Version).
+ * @author jeremycarroll
+ *
+ */
private static class VersionAnalyzerPair extends AnalyzerPair {
-
public VersionAnalyzerPair(String range, Class<? extends Analyzer> cls) throws Exception {
super(range, getConstructor(cls, Version.class).newInstance(Version.LUCENE_CURRENT));
}
}
-
+ /**
+ * Special case code for {@link PatternAnalyzer}
+ * @author jeremycarroll
+ *
+ */
private static class PatternAnalyzerPair extends AnalyzerPair {
-
- public PatternAnalyzerPair(ConfigOptionsToAnalyzer lro, String pattern) throws Exception {
+ public PatternAnalyzerPair(ConfigOptionsToAnalyzer lro, Pattern pattern) throws Exception {
super(lro.languageRange, getConstructor(PatternAnalyzer.class,Version.class,Pattern.class,Boolean.TYPE,Set.class),
Version.LUCENE_CURRENT,
- Pattern.compile(pattern, Pattern.UNICODE_CHARACTER_CLASS),
+ pattern,
true,
lro.getStopWords());
}
@@ -451,6 +558,16 @@
* This class is initialized with the config options, using the {@link #setProperty(String, String)}
* method, for a particular language range and works out which pair of {@link Analyzer}s
* to use for that language range.
+ * <p>
+ * Instances of this class are only alive during the execution of
+ * {@link ConfigurableAnalyzerFactory#ConfigurableAnalyzerFactory(FullTextIndex)},
+ * the life-cycle is:
+ * <ol>
+ * <li>The relevant config properties are applied, and are used to populate the fields.
+ * <li>The fields are validated
+ * <li>An {@link AnalyzerPair} is constructed
+ * </ol>
+ *
* @author jeremycarroll
*
*/
@@ -459,9 +576,13 @@
String like;
String className;
String stopwords;
- String pattern;
+ Pattern pattern;
final String languageRange;
AnalyzerPair result;
+ Pattern wordBoundary;
+ Pattern subWordBoundary;
+ Pattern softHyphens;
+ Boolean alwaysRemoveSoftHyphens;
public ConfigOptionsToAnalyzer(String languageRange) {
this.languageRange = languageRange;
@@ -474,7 +595,7 @@
*/
public Set<?> getStopWords() {
- if (AnalyzerOptions.STOPWORDS_VALUE_NONE.equals(stopwords))
+ if (doNotUseStopWords())
return Collections.EMPTY_SET;
if (useDefaultStopWords()) {
@@ -484,6 +605,10 @@
return getStopWordsForClass(stopwords);
}
+ boolean doNotUseStopWords() {
+ return AnalyzerOptions.STOPWORDS_VALUE_NONE.equals(stopwords) || (stopwords == null && pattern != null);
+ }
+
protected Set<?> getStopWordsForClass(String clazzName) {
Class<? extends Analyzer> analyzerClass = getAnalyzerClass(clazzName);
try {
@@ -500,9 +625,13 @@
}
protected boolean useDefaultStopWords() {
- return stopwords == null || AnalyzerOptions.STOPWORDS_VALUE_DEFAULT.equals(stopwords);
+ return ( stopwords == null && pattern == null ) || AnalyzerOptions.STOPWORDS_VALUE_DEFAULT.equals(stopwords);
}
+ /**
+ * The first step in the life-cycle, used to initialize the fields.
+ * @return true if the property was recognized.
+ */
public boolean setProperty(String shortProperty, String value) {
if (shortProperty.equals(AnalyzerOptions.LIKE) ) {
like = value;
@@ -511,13 +640,24 @@
} else if (shortProperty.equals(AnalyzerOptions.STOPWORDS) ) {
stopwords = value;
} else if (shortProperty.equals(AnalyzerOptions.PATTERN) ) {
- pattern = value;
+ pattern = Pattern.compile(value,Pattern.UNICODE_CHARACTER_CLASS);
+ } else if (shortProperty.equals(AnalyzerOptions.WORD_BOUNDARY) ) {
+ wordBoundary = Pattern.compile(value,Pattern.UNICODE_CHARACTER_CLASS);
+ } else if (shortProperty.equals(AnalyzerOptions.SUB_WORD_BOUNDARY) ) {
+ subWordBoundary = Pattern.compile(value,Pattern.UNICODE_CHARACTER_CLASS);
+ } else if (shortProperty.equals(AnalyzerOptions.SOFT_HYPHENS) ) {
+ softHyphens = Pattern.compile(value,Pattern.UNICODE_CHARACTER_CLASS);
+ } else if (shortProperty.equals(AnalyzerOptions.ALWAYS_REMOVE_SOFT_HYPHENS) ) {
+ alwaysRemoveSoftHyphens = Boolean.valueOf(value);
} else {
return false;
}
return true;
}
+ /**
+ * The second phase of the life-cycle, used for sanity checking.
+ */
public void validate() {
if (pattern != null ) {
if ( className != null && className != PatternAnalyzer.class.getName()) {
@@ -525,6 +665,27 @@
}
className = PatternAnalyzer.class.getName();
}
+ if (this.wordBoundary != null ) {
+ if ( className != null && className != TermCompletionAnalyzer.class.getName()) {
+ throw new RuntimeException("Bad Option: Language range "+languageRange + " with wordBoundary property for class "+ className);
+ }
+ className = TermCompletionAnalyzer.class.getName();
+
+ if ( subWordBoundary == null ) {
+ subWordBoundary = AnalyzerOptions.DEFAULT_SUB_WORD_BOUNDARY;
+ }
+ if ( alwaysRemoveSoftHyphens != null && softHyphens == null ) {
+ throw new RuntimeException("Bad option: Language range "+languageRange + ": must specify softHyphens when setting alwaysRemoveSoftHyphens");
+ }
+ if (softHyphens != null && alwaysRemoveSoftHyphens == null) {
+ alwaysRemoveSoftHyphens = AnalyzerOptions.DEFAULT_ALWAYS_REMOVE_SOFT_HYPHENS;
+ }
+
+ } else if ( subWordBoundary != null || softHyphens != null || alwaysRemoveSoftHyphens != null ||
+ TermCompletionAnalyzer.class.getName().equals(className) ) {
+ throw new RuntimeException("Bad option: Language range "+languageRange + ": must specify wordBoundary for TermCompletionAnalyzer");
+ }
+
if (PatternAnalyzer.class.getName().equals(className) && pattern == null ) {
throw new RuntimeException("Bad Option: Language range "+languageRange + " must specify pattern for PatternAnalyzer.");
}
@@ -537,21 +698,45 @@
}
+ /**
+ * The third and final phase of the life-cycle, used for identifying
+ * the AnalyzerPair.
+ */
private AnalyzerPair construct() throws Exception {
if (className == null) {
return null;
}
if (pattern != null) {
return new PatternAnalyzerPair(this, pattern);
-
- }
+ }
+ if (softHyphens != null) {
+ return new AnalyzerPair(
+ languageRange,
+ new TermCompletionAnalyzer(
+ wordBoundary,
+ subWordBoundary,
+ softHyphens,
+ alwaysRemoveSoftHyphens));
+ }
+ if (wordBoundary != null) {
+ return new AnalyzerPair(
+ languageRange,
+ new TermCompletionAnalyzer(
+ wordBoundary,
+ subWordBoundary));
+ }
final Class<? extends Analyzer> cls = getAnalyzerClass();
if (hasConstructor(cls, Version.class, Set.class)) {
// RussianAnalyzer is missing any way to access stop words.
- if (RussianAnalyzer.class.equals(cls) && useDefaultStopWords()) {
- return new AnalyzerPair(languageRange, new RussianAnalyzer(Version.LUCENE_CURRENT), new RussianAnalyzer(Version.LUCENE_CURRENT, Collections.EMPTY_SET));
+ if (RussianAnalyzer.class.equals(cls)) {
+ if (useDefaultStopWords()) {
+ return new AnalyzerPair(languageRange, new RussianAnalyzer(Version.LUCENE_CURRENT), new RussianAnalyzer(Version.LUCENE_CURRENT, Collections.EMPTY_SET));
+ }
+ if (doNotUseStopWords()) {
+ return new AnalyzerPair(languageRange, new RussianAnalyzer(Version.LUCENE_CURRENT, Collections.EMPTY_SET));
+ }
}
return new VersionSetAnalyzerPair(this, cls);
}
@@ -569,6 +754,29 @@
throw new RuntimeException("Bad option: cannot find constructor for class " + className + " for language range " + languageRange);
}
+ /**
+ * Also part of the third phase of the life-cycle, following the {@link AnalyzerOptions#LIKE}
+ * properties.
+ * @param depth
+ * @param max
+ * @param analyzers
+ * @return
+ */
+ AnalyzerPair followLikesToAnalyzerPair(int depth, int max,
+ Map<String, ConfigOptionsToAnalyzer> analyzers) {
+ if (result == null) {
+ if (depth == max) {
+ throw new RuntimeException("Bad configuration: - 'like' loop for language range " + languageRange);
+ }
+ ConfigOptionsToAnalyzer next = analyzers.get(like);
+ if (next == null) {
+ throw new RuntimeException("Bad option: - 'like' not found for language range " + languageRange+ " (not found: '"+ like +"')");
+ }
+ result = new AnalyzerPair(languageRange, next.followLikesToAnalyzerPair(depth+1, max, analyzers));
+ }
+ return result;
+ }
+
protected Class<? extends Analyzer> getAnalyzerClass() {
return getAnalyzerClass(className);
}
@@ -587,22 +795,6 @@
void setAnalyzerPair(AnalyzerPair ap) {
result = ap;
}
-
- AnalyzerPair followLikesToAnalyzerPair(int depth, int max,
- Map<String, ConfigOptionsToAnalyzer> analyzers) {
- if (result == null) {
- if (depth == max) {
- throw new RuntimeException("Bad configuration: - 'like' loop for language range " + languageRange);
- }
- ConfigOptionsToAnalyzer next = analyzers.get(like);
- if (next == null) {
- throw new RuntimeException("Bad option: - 'like' not found for language range " + languageRange+ " (not found: '"+ like +"')");
- }
- result = new AnalyzerPair(languageRange, next.followLikesToAnalyzerPair(depth+1, max, analyzers));
- }
- return result;
- }
-
}
private final AnalyzerPair config[];
@@ -615,12 +807,19 @@
* strategy so the code will still work on the {@link #MAX_LANG_CACHE_SIZE}+1 th entry.
*/
private static final int MAX_LANG_CACHE_SIZE = 500;
+
private String defaultLanguage;
private final FullTextIndex<?> fullTextIndex;
+ /**
+ * Builds a new ConfigurableAnalyzerFactory.
+ * @param fullTextIndex
+ */
public ConfigurableAnalyzerFactory(final FullTextIndex<?> fullTextIndex) {
+ // A description of the operation of this method is found on AnalyzerPair and
+ // ConfigOptionsToAnalyzer.
// despite our name, we actually make all the analyzers now, and getAnalyzer method is merely a lookup.
if (fullTextIndex == null)
@@ -717,9 +916,9 @@
while (en.hasMoreElements()) {
String prop = (String)en.nextElement();
- if (prop.equals(Options.INCLUDE_DEFAULTS)) continue;
+ if (prop.equals(Options.NATURAL_LANGUAGE_SUPPORT)) continue;
if (prop.startsWith(Options.ANALYZER)) {
- String languageRangeAndProperty[] = prop.substring(Options.ANALYZER.length()).split("[.]");
+ String languageRangeAndProperty[] = prop.substring(Options.ANALYZER.length()).replaceAll("_","*").split("[.]");
if (languageRangeAndProperty.length == 2) {
String languageRange = languageRangeAndProperty[0].toLowerCase(Locale.US); // Turkish "I" could create a problem
@@ -745,25 +944,29 @@
protected Properties initProperties() {
final Properties parentProperties = fullTextIndex.getProperties();
Properties myProps;
- if (Boolean.getBoolean(parentProperties.getProperty(Options.INCLUDE_DEFAULTS, Options.DEFAULT_INCLUDE_DEFAULTS))) {
- myProps = defaultProperties();
+ if (Boolean.valueOf(parentProperties.getProperty(
+ Options.NATURAL_LANGUAGE_SUPPORT,
+ Options.DEFAULT_NATURAL_LANGUAGE_SUPPORT))) {
+
+ myProps = loadPropertyString(ALL_LUCENE_NATURAL_LANGUAGES);
+
+ } else if (hasPropertiesForStarLanguageRange(parentProperties)){
+
+ myProps = new Properties();
+
} else {
- myProps = new Properties();
+
+ myProps = loadPropertyString(LUCENE_STANDARD_ANALYZER);
}
copyRelevantProperties(fullTextIndex.getProperties(), myProps);
-
- if (myProps.isEmpty()) {
- return defaultProperties();
- } else {
- return myProps;
- }
+ return myProps;
}
- protected Properties defaultProperties() {
+ Properties loadPropertyString(String props) {
Properties rslt = new Properties();
try {
- rslt.load(new StringReader(DEFAULT_PROPERTIES));
+ rslt.load(new StringReader(props));
} catch (IOException e) {
throw new RuntimeException("Impossible - well clearly not!", e);
}
@@ -780,6 +983,17 @@
}
}
+ private boolean hasPropertiesForStarLanguageRange(Properties from) {
+ Enumeration<?> en = from.propertyNames();
+ while (en.hasMoreElements()) {
+ String prop = (String)en.nextElement();
+ if (prop.startsWith(Options.ANALYZER+"_.")
+ || prop.startsWith(Options.ANALYZER+"*.")) {
+ return true;
+ }
+ }
+ return false;
+ }
@Override
public Analyzer getAnalyzer(String languageCode, boolean filterStopwords) {
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/DefaultAnalyzerFactory.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/DefaultAnalyzerFactory.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/DefaultAnalyzerFactory.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -29,7 +29,6 @@
import java.util.Collections;
import java.util.HashMap;
-import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
@@ -52,11 +51,21 @@
import com.bigdata.btree.keys.KeyBuilder;
/**
- * Default implementation registers a bunch of {@link Analyzer}s for various
- * language codes and then serves the appropriate {@link Analyzer} based on
- * the specified language code.
+ * This is the default implementation but should be regarded as legacy since
+ * it fails to use the correct {@link Analyzer} for almost all languages (other than
+ * English). It uses the correct natural language analyzer only for literals tagged with
+ * certain three letter ISO 639 codes:
+ * "por", "deu", "ger", "zho", "chi", "jpn", "kor", "ces", "cze", "dut", "nld", "gre", "ell",
+ * "fra", "fre", "rus" and "tha". All other tags are treated as English.
+ * These codes do not work if they are used with subtags, e.g. "ger-AT" is treated as English.
+ * No two letter code other than "en" works correctly: note that the W3C and
+ * IETF recommend the use of the two letter forms instead of the three letter forms.
*
* @author <a href="mailto:tho...@us...">Bryan Thompson</a>
+ * @deprecated Use {@link ConfigurableAnalyzerFactory} with
+ * the {@link ConfigurableAnalyzerFactory.Options#NATURAL_LANGUAGE_SUPPORT} option,
+ * which uses the appropriate natural language analyzers for the two letter codes
+ * and for tags which include sub-tags.
* @version $Id$
*/
public class DefaultAnalyzerFactory implements IAnalyzerFactory {
Added: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/TermCompletionAnalyzer.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/TermCompletionAnalyzer.java (rev 0)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/TermCompletionAnalyzer.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -0,0 +1,248 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2014. All rights reserved.
+
+Contact:
+ SYSTAP, LLC
+ 4501 Tower Road
+ Greensboro, NC 27410
+ lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+*/
+/*
+ * Created on May 8, 2014 by Jeremy J. Carroll, Syapse Inc.
+ */
+package com.bigdata.search;
+
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+import java.nio.CharBuffer;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.KeywordAnalyzer;
+import org.apache.lucene.analysis.tokenattributes.TermAttribute;
+
+
+/**
+ * An analyzer intended for the term-completion use case; particularly
+ * for technical vocabularies and concept schemes.
+ *
+ * <p>
+ * This analyzer generates several index terms for each word in the input.
+ * These are intended to match short sequences (e.g. three or more) characters
+ * of user-input, to then give the user a drop-down list of matching terms.
+ * <p>
+ * This can be set up to address issues like matching <q>half-time</q> when the user types
+ * <q>tim</q>, or when the user types <q>halft</q> (treating the hyphen as a soft hyphen); or
+ * to match <q>TermCompletionAnalyzer</q> when the user types <q>Ana</q>.
+ * <p>
+ * In contrast, the Lucene Analyzers are mainly geared around the free text search use
+ * case.
+ * <p>
+ * The intended use cases will typically involve a prefix query of the form:
+ * <pre>
+ * ?t bds:search "prefix*" .
+ * </pre>
+ * to find all literals in the selected graphs, which are indexed by a term starting in <q>prefix</q>,
+ * so the problem this class addresses is finding the appropriate index terms to allow
+ * matching, at sensible points, mid-way through words (such as at hyphens).
+ * <p>
+ * To get maximum effectiveness it may be best to use private use subtags (see RFC 5646),
+ * e.g. <code>"x-term"</code>
+ * which are mapped to this class by {@link ConfigurableAnalyzerFactory} for
+ * the data being loaded into the store, and linked to some very simple process
+ * like {@link KeywordAnalyzer} for queries which are tagged with a different language tag
+ * that is only used for <code>bds:search</code>, e.g. <code>"x-query"</code>.
+ * The above prefix query then becomes:
+ * <pre>
+ * ?t bds:search "prefix*"@x-query .
+ * </pre>
+ *
+ *
+ *
+ * @author jeremycarroll
+ *
+ */
+public class TermCompletionAnalyzer extends Analyzer {
+
+ private final Pattern wordBoundary;
+ private final Pattern subWordBoundary;
+
+ private final Pattern discard;
+ private final boolean alwaysDiscard;
+
+ /**
+ * Divide the input into words and short tokens
+ * as with {@link #TermCompletionAnalyzer(Pattern, Pattern)}.
+ * Each term is generated, and then an additional term
+ * is generated with the soft hyphens (defined by the softHyphens pattern)
+ * removed. If the alwaysRemoveSoftHypens flag is true,
+ * then the first term (before the removal) is suppressed.
+ *
+ * @param wordBoundary The definition of space (e.g. " ")
+ * @param subWordBoundary Also index after matches to this (e.g. "-")
+ * @param softHyphens Discard characters matching this pattern from matches
+ * @param alwaysRemoveSoftHypens If false, both the original term and the term with soft hyphens removed are generated.
+ */
+ public TermCompletionAnalyzer(Pattern wordBoundary,
+ Pattern subWordBoundary,
+ Pattern softHyphens,
+ boolean alwaysRemoveSoftHypens) {
+ this.wordBoundary = wordBoundary;
+ this.subWordBoundary = subWordBoundary;
+ if (softHyphens != null) {
+ discard = softHyphens;
+ alwaysDiscard = alwaysRemoveSoftHypens;
+ } else {
+ discard = Pattern.compile("(?!)"); // never matches
+ alwaysDiscard = true;
+ }
+ }
+ /**
+ * Divide the input into words, separated by the wordBoundary,
+ * and return a token for each whole word, and then
+ * generate further tokens for each word by removing prefixes
+ * up to and including each successive match of
+ * subWordBoundary.
+ * @param wordBoundary
+ * @param subWordBoundary
+ */
+ public TermCompletionAnalyzer(Pattern wordBoundary,
+ Pattern subWordBoundary) {
+ this(wordBoundary, subWordBoundary, null, true);
+ }
+
+
+ @Override
+ public TokenStream tokenStream(String ignoredFieldName, Reader reader) {
+ return new TermCompletionTokenStream((StringReader)reader);
+ }
+
+ /**
+ * This class has three processes going on,
+ * all driven from the {@link #incrementToken()} method.
+ *
+ * One process is that of iterating over the words in the input:
+ * - the words are identified in the constructor, and the iteration
+ * is performed by {@link #nextWord()}
+ *
+ * - the subword boundaries are identified in {@link #next()}
+ * We then set up {@link #found} to contain the most
+ * recently found subword.
+ *
+ * - the soft hyphen discarding is processed in {@link #maybeDiscardHyphens()}
+ *
+ * - if we are not {@link #alwaysDiscard}ing then {@link #afterDiscard}
+ * can be set to null to return the non-discarded version on the next cycle.
+ *
+ */
+ private class TermCompletionTokenStream extends TokenStream {
+
+ final String[] words;
+ final TermAttribute termAtt;
+
+
+
+ char currentWord[] = new char[]{};
+ Matcher softMatcher;
+ int currentWordIx = -1;
+
+
+ int charPos = 0;
+ private String afterDiscard;
+ private CharBuffer found;
+
+ public TermCompletionTokenStream(StringReader reader) {
+ termAtt = addAttribute(TermAttribute.class);
+ try {
+ reader.mark(Integer.MAX_VALUE);
+ int length = (int) reader.skip(Integer.MAX_VALUE);
+ reader.reset();
+ char fileContent[] = new char[length];
+ reader.read(fileContent);
+ words = wordBoundary.split(new String(fileContent));
+ } catch (IOException e) {
+ throw new RuntimeException("Impossible",e);
+ }
+ }
+
+ @Override
+ public boolean incrementToken() throws IOException {
+ if ( next() ) {
+ if (afterDiscard != null) {
+ int lg = afterDiscard.length();
+ afterDiscard.getChars(0, lg, termAtt.termBuffer(), 0);
+ termAtt.setTermLength(lg);
+ } else {
+ int lg = found.length();
+ found.get(termAtt.termBuffer(), 0, lg);
+ termAtt.setTermLength(lg);
+ }
+ return true;
+ } else {
+ return false;
+ }
+ }
+
+ private boolean next() {
+ if (currentWordIx >= words.length) {
+ return false;
+ }
+ if (!alwaysDiscard) {
+ // Last match was the discarded version,
+ // now do the non-discard version.
+ if (afterDiscard != null) {
+ afterDiscard = null;
+ return true;
+ }
+ }
+ afterDiscard = null;
+ if (charPos + 1 < currentWord.length && softMatcher.find(charPos+1)) {
+ charPos = softMatcher.end();
+ maybeDiscardHyphens();
+ return true;
+ } else {
+ return nextWord();
+ }
+ }
+
+ void maybeDiscardHyphens() {
+ found = CharBuffer.wrap(currentWord, charPos, currentWord.length - charPos);
+ Matcher discarding = discard.matcher(found);
+ if (discarding.find()) {
+ afterDiscard = discarding.replaceAll("");
+ }
+ }
+
+ private boolean nextWord() {
+ currentWordIx++;
+ if (currentWordIx >= words.length) {
+ return false;
+ }
+ currentWord = words[currentWordIx].toCharArray();
+ termAtt.resizeTermBuffer(currentWord.length);
+ charPos = 0;
+ softMatcher = subWordBoundary.matcher(words[currentWordIx]);
+ maybeDiscardHyphens();
+ return true;
+ }
+
+ }
+
+}
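The javadoc above describes the token-generation rules in prose: split the input on the word boundary, emit the suffix starting at each sub-word boundary match, and (when a soft-hyphen pattern is supplied) emit a variant with those characters removed, optionally alongside the unmodified form. The following standalone sketch, assuming plain `java.util.regex` rather than the Lucene `TokenStream` machinery in the patch, illustrates those rules; the class and method names are hypothetical and not part of the commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch (not the Lucene analyzer in the diff) of the
// term-generation rules described in the TermCompletionAnalyzer javadoc.
public class TermCompletionSketch {

    // For each word (split on wordBoundary) emit the suffix starting at
    // each sub-word boundary; when softHyphen matches inside a suffix,
    // emit the suffix with those characters removed, and optionally also
    // the unmodified suffix.
    static List<String> terms(String input, Pattern wordBoundary,
                              Pattern subWordBoundary, Pattern softHyphen,
                              boolean alwaysDiscard) {
        List<String> out = new ArrayList<>();
        for (String word : wordBoundary.split(input)) {
            if (word.isEmpty())
                continue;
            int pos = 0;
            while (true) {
                String suffix = word.substring(pos);
                String stripped = softHyphen.matcher(suffix).replaceAll("");
                if (!stripped.equals(suffix)) {
                    out.add(stripped);          // soft hyphens removed
                    if (!alwaysDiscard)
                        out.add(suffix);        // original form as well
                } else {
                    out.add(suffix);
                }
                if (pos + 1 >= word.length())
                    break;
                Matcher m = subWordBoundary.matcher(word);
                if (!m.find(pos + 1))
                    break;
                pos = m.end();
                if (pos >= word.length())
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "half-time" with "-" as both sub-word boundary and soft hyphen:
        // indexes "halftime", "half-time", and "time", so the user inputs
        // "halft" and "tim" from the javadoc both find a matching term.
        System.out.println(terms("half-time", Pattern.compile("\\s+"),
                Pattern.compile("-"), Pattern.compile("-"), false));
    }
}
```

With `alwaysDiscard` set to true the sketch suppresses the hyphenated form, mirroring the `alwaysRemoveSoftHypens` flag of the real constructor.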
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -22,151 +22,25 @@
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
/*
- * Created on May 7, 2014
+ * Created on May 9, 2014
*/
package com.bigdata.search;
-import java.io.IOException;
-import java.io.StringReader;
-
-import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.analysis.tokenattributes.TermAttribute;
-
public abstract class AbstractAnalyzerFactoryTest extends AbstractSearchTest {
- public AbstractAnalyzerFactoryTest() {
+ public AbstractAnalyzerFactoryTest() {
}
-
- public AbstractAnalyzerFactoryTest(String arg0) {
- super(arg0);
+
+ public AbstractAnalyzerFactoryTest(String arg0) {
+ super(arg0);
}
-
- public void setUp() throws Exception {
- super.setUp();
- init(getExtraProperties());
- }
- abstract String[] getExtraProperties();
-
- private Analyzer getAnalyzer(String lang, boolean filterStopWords) {
- return getNdx().getAnalyzer(lang, filterStopWords);
+
+ @Override
+ public void setUp() throws Exception {
+ super.setUp();
+ init(getExtraProperties());
}
-
- private void comparisonTest(String lang,
- boolean stopWordsSignificant,
- String text,
- String spaceSeparated) throws IOException {
- compareTokenStream(getAnalyzer(lang, stopWordsSignificant), text,
- spaceSeparated.split(" ")); //$NON-NLS-1$
- }
- private void compareTokenStream(Analyzer a, String text, String expected[]) throws IOException {
- TokenStream s = a.tokenStream(null, new StringReader(text));
- int ix = 0;
- while (s.incrementToken()) {
- final TermAttribute term = s.getAttribute(TermAttribute.class);
- final String word = term.term();
- assertTrue(ix < expected.length);
- assertEquals(word, expected[ix++]);
- }
- assertEquals(ix, expected.length);
- }
-
- public void testEnglishFilterStopWords() throws IOException {
- for (String lang: new String[]{ "eng", null, "" }) { //$NON-NLS-1$ //$NON-NLS-2$
- comparisonTest(lang,
- true,
- "The test to end all tests! Forever.", //$NON-NLS-1$
- "test end all tests forever" //$NON-NLS-1$
- );
- }
- }
- public void testEnglishNoFilter() throws IOException {
- for (String lang: new String[]{ "eng", null, "" }) { //$NON-NLS-1$ //$NON-NLS-2$
- comparisonTest(lang,
- false,
- "The test to end all tests! Forever.", //$NON-NLS-1$
- "the test to end all tests forever" //$NON-NLS-1$
- );
- }
- }
-
- // Note we careful use a three letter language code for german.
- // 'de' is more standard, but the DefaultAnalyzerFactory does not
- // implement 'de' correctly.
- public void testGermanFilterStopWords() throws IOException {
- comparisonTest("ger", //$NON-NLS-1$
- true,
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.10") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.11"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.12") //$NON-NLS-1$
- );
-
- }
+ abstract String[] getExtraProperties();
- // Note we careful use a three letter language code for Russian.
- // 'ru' is more standard, but the DefaultAnalyzerFactory does not
- // implement 'ru' correctly.
- public void testRussianFilterStopWords() throws IOException {
- comparisonTest("rus", //$NON-NLS-1$
- true,
- // I hope this is not offensive text.
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.14") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.15"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.16") //$NON-NLS-1$
- );
-
- }
- public void testGermanNoStopWords() throws IOException {
- comparisonTest("ger", //$NON-NLS-1$
- false,
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.18") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.19"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.20") //$NON-NLS-1$
- );
-
- }
- public void testRussianNoStopWords() throws IOException {
- comparisonTest("rus", //$NON-NLS-1$
- false,
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.22") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.23"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.24") //$NON-NLS-1$
- );
-
- }
- public void testJapanese() throws IOException {
- for (boolean filterStopWords: new Boolean[]{true, false}) {
- comparisonTest("jpn", //$NON-NLS-1$
- filterStopWords,
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.26"), //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.27") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.28") + //$NON-NLS-1$
- NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.29")); //$NON-NLS-1$
- }
- }
- public void testConfiguredLanguages() {
- checkConfig("BrazilianAnalyzer", "por", "pt"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
- checkConfig("ChineseAnalyzer", "zho", "chi", "zh"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("CJKAnalyzer", "jpn", "ja", "kor", "ko"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$ //$NON-NLS-5$
- checkConfig("CzechAnalyzer", "ces", "cze", "cs"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("DutchAnalyzer", "dut", "nld", "nl"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("GermanAnalyzer", "deu", "ger", "de"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("GreekAnalyzer", "gre", "ell", "el"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- checkConfig("RussianAnalyzer", "rus", "ru"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
- checkConfig("ThaiAnalyzer", "th", "tha"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
- checkConfig("StandardAnalyzer", "en", "eng", "", null); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
- }
-
- private void checkConfig(String classname, String ...langs) {
- for (String lang:langs) {
- // The DefaultAnalyzerFactory only works for language tags of length exactly three.
-// if (lang != null && lang.length()==3)
- {
- assertEquals(classname, getAnalyzer(lang,true).getClass().getSimpleName());
- assertEquals(classname, getAnalyzer(lang+NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.0"),true).getClass().getSimpleName()); //$NON-NLS-1$
- }
- }
-
- }
}
Copied: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractDefaultAnalyzerFactoryTest.java (from rev 8253, branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractAnalyzerFactoryTest.java)
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractDefaultAnalyzerFactoryTest.java (rev 0)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractDefaultAnalyzerFactoryTest.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -0,0 +1,133 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2014. All rights reserved.
+
+Contact:
+ SYSTAP, LLC
+ 4501 Tower Road
+ Greensboro, NC 27410
+ lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+*/
+/*
+ * Created on May 7, 2014
+ */
+package com.bigdata.search;
+
+import java.io.IOException;
+
+
+public abstract class AbstractDefaultAnalyzerFactoryTest extends AbstractAnalyzerFactoryTest {
+
+ public AbstractDefaultAnalyzerFactoryTest() {
+ }
+
+ public AbstractDefaultAnalyzerFactoryTest(String arg0) {
+ super(arg0);
+ }
+
+ public void testEnglishFilterStopWords() throws IOException {
+ for (String lang: new String[]{ "eng", null, "" }) { //$NON-NLS-1$ //$NON-NLS-2$
+ comparisonTest(lang,
+ true,
+ "The test to end all tests! Forever.", //$NON-NLS-1$
+ "test end all tests forever" //$NON-NLS-1$
+ );
+ }
+ }
+ public void testEnglishNoFilter() throws IOException {
+ for (String lang: new String[]{ "eng", null, "" }) { //$NON-NLS-1$ //$NON-NLS-2$
+ comparisonTest(lang,
+ false,
+ "The test to end all tests! Forever.", //$NON-NLS-1$
+ "the test to end all tests forever" //$NON-NLS-1$
+ );
+ }
+ }
+
+ // Note we carefully use a three-letter language code for German.
+ // 'de' is more standard, but the DefaultAnalyzerFactory does not
+ // implement 'de' correctly.
+ public void testGermanFilterStopWords() throws IOException {
+ comparisonTest("ger", //$NON-NLS-1$
+ true,
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.10") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.11"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.12") //$NON-NLS-1$
+ );
+
+ }
+
+ // Note we carefully use a three-letter language code for Russian.
+ // 'ru' is more standard, but the DefaultAnalyzerFactory does not
+ // implement 'ru' correctly.
+ public void testRussianFilterStopWords() throws IOException {
+ comparisonTest("rus", //$NON-NLS-1$
+ true,
+ // I hope this is not offensive text.
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.14") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.15"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.16") //$NON-NLS-1$
+ );
+
+ }
+ public void testGermanNoStopWords() throws IOException {
+ comparisonTest("ger", //$NON-NLS-1$
+ false,
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.18") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.19"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.20") //$NON-NLS-1$
+ );
+
+ }
+ public void testRussianNoStopWords() throws IOException {
+ comparisonTest("rus", //$NON-NLS-1$
+ false,
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.22") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.23"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.24") //$NON-NLS-1$
+ );
+
+ }
+ public void testJapanese() throws IOException {
+ for (boolean filterStopWords: new Boolean[]{true, false}) {
+ comparisonTest("jpn", //$NON-NLS-1$
+ filterStopWords,
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.26"), //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.27") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.28") + //$NON-NLS-1$
+ NonEnglishExamples.getString("AbstractAnalyzerFactoryTest.29")); //$NON-NLS-1$
+ }
+ }
+ public void testConfiguredLanguages() {
+ checkConfig("BrazilianAnalyzer", "por", "pt"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
+ checkConfig("ChineseAnalyzer", "zho", "chi", "zh"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("CJKAnalyzer", "jpn", "ja", "kor", "ko"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$ //$NON-NLS-5$
+ checkConfig("CzechAnalyzer", "ces", "cze", "cs"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("DutchAnalyzer", "dut", "nld", "nl"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("GermanAnalyzer", "deu", "ger", "de"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("GreekAnalyzer", "gre", "ell", "el"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ checkConfig("RussianAnalyzer", "rus", "ru"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
+ checkConfig("ThaiAnalyzer", "th", "tha"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
+ checkConfig("StandardAnalyzer", "en", "eng", "", null); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+ }
+
+ @Override
+ protected void checkConfig(String classname, String ...langs) {
+ checkConfig(isBroken(), classname, langs);
+
+ }
+ abstract boolean isBroken() ;
+}
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractSearchTest.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractSearchTest.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/AbstractSearchTest.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -26,8 +26,14 @@
*/
package com.bigdata.search;
+import java.io.IOException;
+import java.io.StringReader;
import java.util.Properties;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.TermAttribute;
+
import com.bigdata.journal.IIndexManager;
import com.bigdata.journal.ITx;
import com.bigdata.journal.ProxyTestCase;
@@ -62,7 +68,7 @@
}
FullTextIndex<Long> createFullTextIndex(String namespace, String ...propertyValuePairs) {
- return createFullTextIndex(namespace, getProperties(), propertyValuePairs);
+ return createFullTextIndex(namespace, (Properties)getProperties().clone(), propertyValuePairs);
}
public void tearDown() throws Exception {
@@ -92,4 +98,65 @@
return properties;
}
+ protected Analyzer getAnalyzer(String lang, boolean filterStopWords) {
+ return getNdx().getAnalyzer(lang, filterStopWords);
+ }
+
+ protected void comparisonTest(String lang, boolean filterStopWords, String text, String spaceSeparated)
+ throws IOException {
+ if (spaceSeparated == null) {
+ String rslt = getTokenStream(getAnalyzer(lang, filterStopWords), text);
+ throw new RuntimeException("Got \"" + rslt+ "\"");
+ }
+ compareTokenStream(getAnalyzer(lang, filterStopWords), text,
+ split(spaceSeparated)); //$NON-NLS-1$
+ }
+
+ private String[] split(String spaceSeparated) {
+ if (spaceSeparated.length()==0) {
+ return new String[0];
+ }
+ return spaceSeparated.split(" ");
+ }
+
+ protected String getTokenStream(Analyzer a, String text) throws IOException {
+ StringBuffer sb = new StringBuffer();
+ TokenStream s = a.tokenStream(null, new StringReader(text));
+ while (s.incrementToken()) {
+ final TermAttribute term = s.getAttribute(TermAttribute.class);
+ if (sb.length()!=0) {
+ sb.append(" ");
+ }
+ sb.append(term.term());
+ }
+ return sb.toString();
+ }
+
+ private void compareTokenStream(Analyzer a, String text, String expected[]) throws IOException {
+ TokenStream s = a.tokenStream(null, new StringReader(text));
+ int ix = 0;
+ while (s.incrementToken()) {
+ final TermAttribute term = s.getAttribute(TermAttribute.class);
+ final String word = term.term();
+ assertTrue(ix < expected.length);
+ assertEquals(expected[ix++], word);
+ }
+ assertEquals(ix, expected.length);
+ }
+
+ protected void checkConfig(boolean threeLetterOnly, String classname, String ...langs) {
+ for (String lang:langs) {
+ // The DefaultAnalyzerFactory only works for language tags of length exactly three.
+ if ((!threeLetterOnly) || (lang != null && lang.length()==3)) {
+ assertEquals(classname, getAnalyzer(lang,true).getClass().getSimpleName());
+ if (!threeLetterOnly) {
+ assertEquals(classname, getAnalyzer(lang+"-x-foobar",true).getClass().getSimpleName()); //$NON-NLS-1$
+ }
+ }
+ }
+ }
+ protected void checkConfig(String classname, String ...langs) {
+ checkConfig(false, classname, langs);
+ }
+
}
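The `checkConfig` helper in the hunk above appends a private use extension (`"-x-foobar"`) to each language tag and expects the same analyzer to be selected, i.e. analyzer lookup should key on the primary subtag of a BCP 47 tag. A minimal sketch of that extraction, assuming a hypothetical helper that is not part of the patch:

```java
// Hypothetical helper (not in the commit) illustrating the language-tag
// behavior the checkConfig tests exercise: analyzer selection should
// depend only on the primary subtag, so "ger-x-foobar" resolves like "ger".
public class LangTagSketch {

    // Return the primary subtag of a BCP 47 language tag; null or the
    // empty string map to "" (the default-analyzer case in the tests).
    static String primarySubtag(String languageTag) {
        if (languageTag == null || languageTag.isEmpty())
            return "";
        int dash = languageTag.indexOf('-');
        return dash < 0 ? languageTag : languageTag.substring(0, dash);
    }

    public static void main(String[] args) {
        System.out.println(primarySubtag("ger-x-foobar"));
    }
}
```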
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/NonEnglishExamples.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/NonEnglishExamples.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/NonEnglishExamples.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -1,3 +1,29 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2014. All rights reserved.
+
+Contact:
+ SYSTAP, LLC
+ 4501 Tower Road
+ Greensboro, NC 27410
+ lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+*/
+/*
+ * Created on May 7, 2014 by Jeremy J. Carroll, Syapse Inc.
+ */
package com.bigdata.search;
import java.util.MissingResourceException;
Modified: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestAll.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestAll.java 2014-05-10 02:56:35 UTC (rev 8262)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestAll.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -114,6 +114,8 @@
// which is intended to be the same as the intended
// behavior of DefaultAnalyzerFactory
suite.addTestSuite(TestConfigurableAsDefaultAnalyzerFactory.class);
+ suite.addTestSuite(TestConfigurableAnalyzerFactory.class);
+ suite.addTestSuite(TestUnconfiguredAnalyzerFactory.class);
return suite;
}
Added: branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAnalyzerFactory.java
===================================================================
--- branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAnalyzerFactory.java (rev 0)
+++ branches/BIGDATA_RELEASE_1_3_0/bigdata/src/test/com/bigdata/search/TestConfigurableAnalyzerFactory.java 2014-05-10 11:52:25 UTC (rev 8263)
@@ -0,0 +1,244 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2014. All rights reserved.
+
+Contact:
+ SYSTAP, LLC
+ 4501 Tower Road
+ Greensboro, NC 27410
+ lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Fr...
[truncated message content] |