From: <Sim...@cs...> - 2014-04-07 12:37:09
Victor,

The behaviour you are seeing is determined by the type of Lucene tokenizer (which is part of the Lucene analyzer). As you observed, the analyzer is applied both to the text being indexed and to the search terms, for consistent results. Lucene ships with a range of analyzers that you can choose from and configure GeoNetwork to apply to your index - see WEB-INF/config-lucene.xml. You can even write your own and configure GeoNetwork to use it.

By default, GeoNetwork applies its own analyzer (with a set of tokenizers) to many fields (eg. "any") - again, see WEB-INF/config-lucene.xml for which ones. The GeoNetworkAnalyzer uses a StandardTokenizer, an ASCIIFoldingFilter and a StopFilter (with a configurable list of stop words) - see the Lucene doco for the details of what these do, too much to explain here. The main advantage of the GeoNetworkAnalyzer over the Lucene StandardAnalyzer is support for wildcard queries - see the javadoc for GeoNetworkAnalyzer (web/src/main/java/org/fao/geonet/kernel/search/GeoNetworkAnalyzer.java).

We've found that changing the GeoNetworkAnalyzer to use a WhitespaceTokenizer (instead of the StandardTokenizer) gives the results users expect when searching with terms that contain characters like apostrophes, slashes etc. The change is easy to make - see https://github.com/marlin2/core-geonetwork/commit/a6b2830577b29d8096b6efd586bf0b16ee16869c#diff-0d1a0e81fbf55a25e7971a5fd50b5471 for an example. (Note: if you apply a recompiled jar containing your modified GeoNetworkAnalyzer, or make other changes in config-lucene.xml, on an existing catalogue, then you need to reindex from the admin page.)

I've been thinking we should apply this to trunk, but as yet I haven't had time to fully explore all the implications of this change or other alternatives, or to discuss it with anyone :-).

Cheers,
Simon

________________________________________
From: Victor Sinceac [vic...@co...]
Sent: Monday, 7 April 2014 8:06 PM
To: geo...@li...; geo...@li...
Subject: [GeoNetwork-devel] GeoNetwork: Lucene query tokenized for ANY

Hi all,

I'm having trouble getting the expected result when entering the uuid of a metadata record, when the uuid contains one or more of the characters ":" and ".". I mean Simple Search, with the uuid entered in the "What?" input field.

So, for a metadata record with UUID="TEST:LUCENE1:TESTLUCENE2:TEST.LUCENE.3::TEST_AFTER_DOTS", when I enter the full UUID in the keywords field it is not clear to me what happens, as both the indexer and the searcher tokenize the input string in a strange manner. The searcher does this when entering WHAT?=TEST:LUCENE1:TESTLUCENE2 (picked from the logs):

* Analyze field any : TEST:LUCENE1:TESTLUCENE2
* Analyzed text is test:lucene1 testlucene2
* Lucene query: ... ... +(+any:test:lucene1 +any:testlucene2) ... ...

The Lucene index has similar content for the field "any" (I guess the same tokenizer is used):

* any=test:lucene1
* any=testlucene2

The Lucene index does not hold the full content of such a uuid in the field "any"; it is only kept in the field "_uuid", but the latter is not used in a Simple Search (where only the field "any" is considered). The same behaviour occurs for the "." separator: for "TEST.LUCENE.3", for example, the tokenizer produces two different values for the field "any", "TEST.LUCENE" and "3". Moreover, the Lucene index does keep the full uuid in the "any" field for uuids in the default format (i.e. uuids generated by GeoNetwork), like ****-****-****-****.

Is this the correct behaviour? Why does Lucene keep the first occurrence of ":" or "." but split on the following occurrences when indexing/searching the field "any", and why does it keep the full uuid in the field "any" when there are no ":" or "." characters inside?

Many thanks,
Victor
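[Editor's note: the tokenizer difference discussed in this thread can be made concrete with a small simulation. The sketch below is NOT Lucene itself - the regex-based `standard_like_tokens` is an assumed, crude approximation of a punctuation-splitting tokenizer (the real StandardTokenizer follows Unicode word-break rules and splits slightly differently, e.g. the logs above show "test:lucene1" surviving as one token), and `matches` only roughly models the logged `+any:...` clauses. It illustrates why a whitespace tokenizer makes full-uuid searches work while a punctuation-splitting one indexes fragments.]

```python
import re

UUID = "TEST:LUCENE1:TESTLUCENE2:TEST.LUCENE.3::TEST_AFTER_DOTS"

def standard_like_tokens(text):
    # Crude stand-in for a punctuation-splitting tokenizer: lowercase,
    # then split on runs of non-word characters, so ":" and "." break
    # the uuid into fragments (underscores survive, as \W keeps "_").
    return [t for t in re.split(r"\W+", text.lower()) if t]

def whitespace_tokens(text):
    # Stand-in for WhitespaceTokenizer plus a lowercase filter:
    # punctuation such as ":" and "." survives inside a token.
    return text.lower().split()

def matches(indexed_text, query, tokenize):
    # A query matches when every analyzed query term occurs among the
    # analyzed index terms. The same tokenizer is applied on both
    # sides, which is why indexing and searching behave consistently.
    indexed = set(tokenize(indexed_text))
    return all(t in indexed for t in tokenize(query))

print(standard_like_tokens(UUID))
# ['test', 'lucene1', 'testlucene2', 'test', 'lucene', '3', 'test_after_dots']
print(whitespace_tokens(UUID))
# ['test:lucene1:testlucene2:test.lucene.3::test_after_dots']

# With a whitespace tokenizer the full uuid is a single indexed term,
# so searching for the complete uuid matches:
print(matches(UUID, UUID, whitespace_tokens))  # True
# ...whereas with the punctuation-splitting tokenizer only fragments
# are indexed, so only fragment searches match:
print(matches(UUID, "TEST:LUCENE1:TESTLUCENE2", standard_like_tokens))  # True
print(matches(UUID, "TEST:LUCENE1:TESTLUCENE2", whitespace_tokens))  # False
```

This mirrors Simon's suggestion: swapping StandardTokenizer for WhitespaceTokenizer keeps punctuation-heavy identifiers intact in the "any" field, at the cost of no longer matching their individual fragments.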