
Official CLucene FAQ



If you have a question about using CLucene, please do not add it directly to this FAQ. Join the CLucene mailing list and email your question there. Questions should only be added to this Wiki page when they already have an answer that can be added at the same time.

Are there any mailing lists available?

There's a user list and a developer list, both available at https://lists.sourceforge.net/lists/listinfo/clucene-developers.

How can I get the latest greatest development code?

See Downloads.

Are there any C++ alternatives to CLucene?

Besides commercial products, which we don't know much about, there are also Xapian and Estraier.

Does CLucene have a web crawler?

No, but check out Strigi.

How do I contribute an improvement?

Please follow all of these steps to submit a CLucene patch.

Searching

Why am I getting no hits / incorrect hits?

Some possible causes:

  • The desired term is in a field that was not defined as 'indexed'. Re-index the document and make the field indexed.
  • The term is in a field that was not tokenized during indexing, so the entire content of the field was treated as a single term. Re-index the documents and make sure the field is tokenized.
  • The field specified in the query simply does not exist. You won't get an error message in this case; you'll just get no matches.
  • The field specified in the query has the wrong case. Field names are case sensitive.
  • The term you are searching for is a stop word that was dropped by the analyzer you use. For example, if your analyzer uses the StopFilter, a search for the word 'the' will always fail (i.e. produce no hits).
  • You are using different analyzers (or the same analyzer but with different stop words) for indexing and searching, and as a result the same term is transformed differently during indexing and searching.
  • The analyzer you are using is case sensitive (e.g. it does not use the LowerCaseFilter) and the term in the query has different case than the term in the document.
  • The documents you are indexing are very large. By default, CLucene only indexes the first 10,000 terms of a document to avoid out-of-memory errors.
  • Make sure to open a new IndexSearcher after adding documents. An IndexSearcher only sees the documents that were in the index at the moment it was opened.
  • If you are using the QueryParser, it may not be parsing your Boolean query syntax the way you think it is.

If none of the possible causes above apply to your case, this will help you to debug the problem:

  • Use the Query's toString() method to see how it actually got parsed (a minimal sketch follows this list).
  • Use Luke to browse your index.
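
Here is a minimal sketch of the toString() debugging step. It assumes the CLucene 0.9.x-style API (static QueryParser::parse, wide-character TCHAR strings) and a hypothetical default field named "contents"; names and memory-management macros may differ between versions.

```cpp
#include <iostream>
#include "CLucene.h"

using namespace lucene::analysis::standard;
using namespace lucene::queryParser;
using namespace lucene::search;

// Print how QueryParser actually interpreted the user's input.
void dumpQuery(const TCHAR* userInput) {
    StandardAnalyzer analyzer;
    // "contents" is a hypothetical default field name.
    Query* q = QueryParser::parse(userInput, _T("contents"), &analyzer);
    TCHAR* repr = q->toString(_T("contents"));
    std::wcout << repr << std::endl;  // assumes TCHAR == wchar_t (the usual build)
    _CLDELETE_CARRAY(repr);           // toString() returns a caller-owned string
    _CLDELETE(q);
}
```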

Why am I getting a TooManyClauses exception?

The following types of queries are expanded by CLucene before it does the search: RangeQuery,
PrefixQuery, WildcardQuery, FuzzyQuery. For example, if the indexed documents contain the terms "car"
and "cars" the query "ca*" will be expanded to "car OR cars" before the search takes place. The
number of these terms is limited to 1024 by default. Here are a few approaches that can be used to avoid the TooManyClauses exception:

  • Use a filter to replace the part of the query that causes the exception. For example, a RangeFilter can replace a RangeQuery on date fields and it will never throw the TooManyClauses exception. Note that filters are slower than queries when used for the first time, so you should cache them using CachingWrapperFilter.
  • Increase the number of terms using BooleanQuery->setMaxClauseCount(int) (see the sketch after this list). Note that this will increase the memory requirements for searches that expand to many terms. To deactivate the limit entirely, use BooleanQuery->setMaxClauseCount(INT_MAX).
  • A specific solution that can work on very precise fields is to reduce the precision of the data in order to reduce the number of terms in the index. For example, the DateField class uses a millisecond resolution, which is often not required. Instead you can save your dates in the "yyyymmddHHMM" format, maybe even without hours and minutes if you don't need them.
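
A minimal sketch of raising (or removing) the clause limit, assuming BooleanQuery::setMaxClauseCount is a static method as in the Java API that CLucene mirrors:

```cpp
#include <climits>
#include "CLucene.h"

using namespace lucene::search;

void raiseClauseLimit() {
    // Allow more expanded terms; this increases memory use on big expansions.
    BooleanQuery::setMaxClauseCount(4096);
    // ...or remove the limit entirely:
    BooleanQuery::setMaxClauseCount(INT_MAX);
}
```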

How can I search over multiple fields?

Parse your query using MultiFieldQueryParser. Note that terms which occur in short fields have a higher effect on the result ranking. Also, MultiFieldQueryParser builds queries that sometimes behave unexpectedly, namely for AND queries: it requires all terms to appear in ''all'' fields. This is typically not what one wants, for example in a search over "title" and "body" fields (CLucene 1.9 fixes this problem).

Alternatively you could create a field which concatenates the content you would like to search and
search only that field.
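
A hedged sketch of the MultiFieldQueryParser approach; the field names are hypothetical, and the exact parse() signature (NULL-terminated field array vs. an explicit count) may vary between CLucene versions:

```cpp
#include "CLucene.h"

using namespace lucene::analysis::standard;
using namespace lucene::queryParser;
using namespace lucene::search;

// Parse a user query against both "title" and "body".
Query* parseOverFields(const TCHAR* userInput) {
    StandardAnalyzer analyzer;
    const TCHAR* fields[] = { _T("title"), _T("body"), NULL };
    return MultiFieldQueryParser::parse(userInput, fields, &analyzer);
}
```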

What wildcard search support is available from CLucene?

CLucene supports wild card queries which allow you to perform searches such as ''book*'', which will find documents containing terms such as ''book'', ''bookstore'', ''booklet'', etc. CLucene refers to this type of query as a 'prefix query'.

CLucene also supports wild card queries which allow you to place a wild card in the middle of the query term. For instance, you could make searches like: ''mi*pelling''. That will match both ''misspelling'', which is the correct way to spell this word, as well as ''mispelling'', which is a common spelling mistake.

Another wild card character that you can use is '?', a question mark. The ? will match a single character. This allows you to perform queries such as ''Bra?il''. Such a query will match both ''Brasil'' and ''Brazil''. CLucene refers to this type of query as a 'wildcard query'.

'''Note''': Leading wildcards (e.g. ''*ook'') are '''not''' supported by the QueryParser (although CLucene could handle them).

Is the QueryParser thread-safe?

No, it's not.

How do I restrict searches to only return results from a limited subset of documents in the index (e.g. for privacy reasons)? What is the best way to approach this?

The QueryFilter class is designed precisely for such cases.

Another way of doing it is the following:

Just before calling `IndexSearcher->search()`, add a clause to the query to exclude documents in categories not permitted for this search.

If you are restricting access with a prohibited term, and someone tries to require that term, then the prohibited restriction wins. If you are restricting access with a required term, and they try prohibiting that term, then they will get no documents in their search result.

As for deciding whether to use required or prohibited terms, if possible,
you should choose the method that names the less frequent term. That will
make queries faster.
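
A hedged sketch of the required-term variant; the "acl" field is hypothetical, and some CLucene versions insert an extra ownership flag before the required/prohibited arguments of BooleanQuery::add:

```cpp
#include "CLucene.h"

using namespace lucene::index;
using namespace lucene::search;

// Wrap the user's query so only documents tagged with the given group match.
Query* restrictToGroup(Query* userQuery, const TCHAR* group) {
    BooleanQuery* bq = _CLNEW BooleanQuery();
    Term* t = _CLNEW Term(_T("acl"), group);
    bq->add(userQuery, true, false);            // required
    bq->add(_CLNEW TermQuery(t), true, false);  // required: the permitted group
    _CLDECDELETE(t);                            // TermQuery keeps its own reference
    return bq;
}
```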


What is the order of fields returned by Document.fields()?

Fields are returned in the same order they were added to the document.


How does one determine which documents do not have a certain term?

There is no direct way of doing that. You could add a term "x" to every document, and then search for "+x -y" to find all of the documents that don't have "y". Note that for large collections this would be slow because of the high term frequency for term "x".

There is an as-yet unported Java Lucene query, MatchAllDocsQuery, that would make this easier.

How do I get the last document added that has a particular term?

Call:

`TermDocs* td = reader->termDocs(term);`

Then grab the last document number in the `TermDocs` enumeration that this method returns.
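
A hedged sketch of that loop, assuming the 0.9.x-style enumeration API; the field and value are hypothetical:

```cpp
#include "CLucene.h"

using namespace lucene::index;

// Return the number of the last document containing the term, or -1.
int32_t lastDocWithTerm(IndexReader* reader) {
    Term* t = _CLNEW Term(_T("id"), _T("some-value"));
    TermDocs* td = reader->termDocs(t);
    int32_t last = -1;
    while (td->next())
        last = td->doc();  // document numbers arrive in increasing order
    td->close();
    _CLDELETE(td);
    _CLDECDELETE(t);
    return last;
}
```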


Does MultiSearcher do anything particularly efficient to search multiple indices or does it simply search one after the other?

`MultiSearcher` searches indices sequentially.

Please note that there is a known bug in the MultiSearcher's result ranking in Lucene versions prior to 1.9; whether CLucene has fixed this has not been confirmed.


Is there a way to use a proximity operator (like near or within) with CLucene?

There is a variable called `slop` in `PhraseQuery` that allows you to perform NEAR/WITHIN-like queries.

By default, `slop` is set to 0 so that only exact phrases will match.
However, you can alter the value using the `setSlop(int)` method.

When using QueryParser you can use this syntax to specify the slop: "doug cutting"~2 will find documents that contain "doug cutting" as well as ones that contain "cutting doug".
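
A minimal sketch of a NEAR-like query built in code (the "body" field is hypothetical):

```cpp
#include "CLucene.h"

using namespace lucene::index;
using namespace lucene::search;

// Match "doug" and "cutting" within two positions of each other.
Query* nearQuery() {
    PhraseQuery* pq = _CLNEW PhraseQuery();
    pq->add(_CLNEW Term(_T("body"), _T("doug")));
    pq->add(_CLNEW Term(_T("body"), _T("cutting")));
    pq->setSlop(2);  // 0 would require the exact phrase
    return pq;
}
```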


Are Wildcard, Prefix, and Fuzzy queries case sensitive?

No, but unlike other types of CLucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the `Analyzer`, which is the component that performs operations such as stemming.

The reason for skipping the `Analyzer` is that if you were searching for ''"dogs*"'' you would not want ''"dogs"'' first stemmed to ''"dog"'', since that would then match ''"dog*"'', which is not the
intended query.


Why does IndexReader's maxDoc() return an 'incorrect' number of documents sometimes?

According to the Javadoc for `IndexReader` `maxDoc()` method ''"returns one greater than the largest possible document number".''

In other words, the number returned by `maxDoc()` does not necessarily match the actual number of undeleted documents in the index.

Deleted documents do not get removed from the index immediately, unless you call `optimize()`.


Is there a way to get a text summary of an indexed document with CLucene (a.k.a. a "snippet" or "fragment") to display along with the search result?

You need to store the documents' summary in the index (make it a stored field when creating that field) and then use the Highlighter from the contrib package. It's important to use a rewritten query as the input for the highlighter, i.e. call rewrite() on the query. Otherwise simple queries will work but prefix queries etc. will not be highlighted.

Can I search an index while it is being optimized?

Yes, an index can be searched and optimized simultaneously.


Can I cache search results with CLucene?

CLucene does come with a simple cache mechanism, if you use CLucene Filters.
The classes to look at are CachingWrapperFilter and QueryFilter.


Is the IndexSearcher thread-safe?

Yes, IndexSearcher is thread-safe. Multiple search threads may use the same instance of IndexSearcher concurrently without any problems. It is recommended to use only one IndexSearcher from all threads in order to save memory.


Is there a way to retrieve the original term positions during the search?

Yes, see IndexReader->termPositions().


Can CLucene do a "search within search", so that the second search is constrained by the results of the first query?

Yes. There are two primary options:

  • Use `QueryFilter` with the previous query as the filter. (you can search the Java Lucene mailing list archives for `QueryFilter` and Doug Cutting's recommendations against using it for this purpose)
  • Combine the previous query with the current query using `BooleanQuery`, using the previous query as required.

The `BooleanQuery` approach is the recommended one.
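
A hedged sketch of both options; as elsewhere on this page, BooleanQuery::add may carry an extra ownership argument in some CLucene versions:

```cpp
#include "CLucene.h"

using namespace lucene::search;

// Constrain 'current' by the results of 'previous'.
Hits* searchWithin(IndexSearcher& searcher, Query* previous, Query* current) {
    // Option 1: filter by the previous query.
    // QueryFilter filter(previous);
    // return searcher.search(current, &filter);

    // Option 2 (recommended): require both queries.
    BooleanQuery* bq = _CLNEW BooleanQuery();
    bq->add(previous, true, false);  // required
    bq->add(current, true, false);   // required
    return searcher.search(bq);
}
```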

Does the position of the matches in the text affect the scoring?

No, the position of matches within a field does not affect ranking.


How do I make sure that a match in a document title has greater weight than a match in a document body?

If you put the title in a separate field from the body, and search both fields, matches in the title will usually be stronger without explicit boosting. This is because the scores are normalized by the length of the field, and the title tends to be much shorter than the body. Therefore, even without boosting, title matches usually come before body matches.

How do I implement paging, i.e. showing result from 1-10, 11-20 etc?

Just re-execute the search and ignore the hits you don't want to show. Since people usually look only at the first few results, this approach is generally fast enough.
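
A minimal paging sketch under that approach, assuming the Hits-returning search() of the 0.9.x API:

```cpp
#include "CLucene.h"

using namespace lucene::document;
using namespace lucene::search;

// Display one page of results, re-running the search each time.
void showPage(IndexSearcher& searcher, Query* q, int32_t page, int32_t pageSize) {
    Hits* hits = searcher.search(q);
    int32_t first = page * pageSize;
    int32_t last = first + pageSize;
    if (last > hits->length()) last = hits->length();
    for (int32_t i = first; i < last; ++i) {
        Document& doc = hits->doc(i);  // loads the stored fields of hit i
        // render doc.get(_T("title")), doc.get(_T("summary")), etc.
    }
    _CLDELETE(hits);
}
```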

The search is slow when there are many hits.

Iterating over all hits is slow for two reasons. Firstly, the search() method that returns a Hits object re-executes the search internally when you need more than 100 hits. Solution: use the search method that takes a HitCollector instead (see the sketch below). Secondly, the hits will probably be spread over the disk, so accessing them all requires a lot of I/O activity. This cannot easily be avoided unless the index is small enough to be loaded into RAM.
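
A hedged sketch of a custom HitCollector; the exact search entry point varies by version (some expose search(query, collector), others _search(query, filter, collector)):

```cpp
#include "CLucene.h"

using namespace lucene::search;

// Counts matches without materializing a Hits object.
class CountingCollector : public HitCollector {
public:
    int32_t count;
    CountingCollector() : count(0) {}
    void collect(const int32_t doc, const float_t score) {
        ++count;  // called once per matching document
    }
};

// usage sketch:
//   CountingCollector c;
//   searcher._search(query, NULL, &c);
```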

Indexing

Can I use CLucene to crawl my site or other sites on the Internet?

No. CLucene does not know how to access external documents, nor does it know how to extract the content and links from HTML and other document formats. CLucene focuses on indexing and searching, and does those very well.

I get "No tvx file". What does that mean?

It's a warning that can safely be ignored. The warning has reportedly been removed in later versions, though this has not been confirmed.

Does CLucene store a full copy of the indexed documents?

It is up to you. You can tell CLucene what document information to use just for indexing and what document information to also store in the index (with or without indexing).


What happens when you IndexWriter.add() a document that is already in the index? Does it overwrite the previous document?

No, there will be multiple copies of the same document in the index.


How do I delete documents from the index?

If you know the document number of a document (e.g. when iterating over Hits) that you want to delete you may use:

`IndexReader->deleteDocument(docNum)`

That will delete the document numbered `docNum` from the index. Once a document is deleted it will not appear in `TermDocs` nor `TermPositions` enumerations.

Attempts to read its field with the `document` method will result in an exception. The presence of this document may still be reflected in the `docFreq` statistic, though this will be corrected eventually as the index is further modified.

If you want to delete all (one or more) documents that contain a specific term you may use:

`IndexReader->deleteDocuments(term)`

This is useful if one uses a document field to hold a unique ID string for the document. To delete such a document, one merely constructs a term with the appropriate field and the unique ID string as its text and passes it to this method. Because this call can affect a variable number of documents, it returns the number of documents deleted.
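
A hedged sketch of deleting by a unique ID; the "id" field name is hypothetical:

```cpp
#include "CLucene.h"

using namespace lucene::index;

// Delete every document whose "id" field holds the given value.
int32_t deleteById(IndexReader* reader, const TCHAR* id) {
    Term* t = _CLNEW Term(_T("id"), id);
    int32_t n = reader->deleteDocuments(t);  // number of documents deleted
    _CLDECDELETE(t);
    return n;  // 0 or 1 if the ID really is unique
}
```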

Starting with CLucene 0.9.15, the new class `IndexModifier` also allows deleting documents.


Is there a way to limit the size of an index?

This question is sometimes brought up because of the 2GB file size limit of some 32-bit operating systems. The biggest CLucene index we know of is 36GB for one .cfs file.

This is a slightly modified answer from Doug Cutting:

The easiest thing is to use `IndexWriter.setMaxMergeDocs()`.

If, for instance, you hit the 2GB limit at 8M documents, set `maxMergeDocs` to 7M. That will keep CLucene from trying to merge an index that won't fit in your filesystem. In effect, the value is rounded down to the next lower power of `mergeFactor`.

So with the default `mergeFactor` set to 10 and `maxMergeDocs` set to 7M CLucene will generate a series of 1M document indexes, since merging 10 of these would exceed the maximum.
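
A hedged sketch of capping segment size this way; older CLucene versions expose maxMergeDocs as a public field rather than a setter, so check your headers:

```cpp
#include "CLucene.h"

using namespace lucene::analysis::standard;
using namespace lucene::index;

void openCappedWriter() {
    StandardAnalyzer analyzer;
    // false = append to an existing index rather than create a new one
    IndexWriter writer("/path/to/index", &analyzer, false);
    writer.setMaxMergeDocs(7000000);  // never merge a segment beyond ~7M docs
    // ... add documents ...
    writer.close();
}
```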

A slightly more complex solution:

You could further minimize the number of segments if, once you've added 7M documents, you optimize the index and start a new one. Then use `MultiSearcher` to search the indexes.

An even more complex and optimal solution:

Write a version of `FSDirectory` that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files.


Why is it important to use the same analyzer type during indexing and search?

The analyzer controls how the text is broken into terms which are then used to index the document. If you are using an analyzer of one type to index and an analyzer of a different type to parse the search query, it is possible that the same word will be mapped to two different terms and this will result in missing or false hits.

'''NOTE:''' It's not a rule that the same analyzer be used for both indexing and searching, and there are cases where it makes sense to use different ones (e.g. when dealing with synonyms). The analyzers must be compatible, though.

Also be careful with fields that are not tokenized (like keywords). During indexing, the Analyzer won't be called for these fields, but for a search, the QueryParser can't know this and will pass all search strings through the selected Analyzer. Usually searches for keywords are constructed in code, but during development it can be handy to use general purpose tools (e.g. Luke) to examine your index. Those tools won't know which fields are tokenized either. In the contrib/analyzers area there's a KeywordTokenizer with an example KeywordAnalyzer for cases like this.

What is index optimization and when should I use it?

The IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental update adds documents frequently, you may want to perform the optimization only once in a while to avoid the extra overhead of the optimization.

What are Segments?

The index database is composed of 'segments' each stored in a separate file. When you add documents to the index, new segments may be created. You can compact the database and reduce the number of segments by optimizing it (see a separate question regarding index optimization).


Is CLucene index database platform independent?

Yes, you can copy a CLucene index directory from one platform to another and it will work just as well.


When I recreate an index from scratch, do I have to delete the old index files?

No, creating the IndexWriter with the create flag set to "true" removes all old files from the old index (a minimal sketch follows).
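
In the 0.9.x-style API, the third constructor argument is the create flag:

```cpp
#include "CLucene.h"

using namespace lucene::analysis::standard;
using namespace lucene::index;

void recreateIndex() {
    StandardAnalyzer analyzer;
    // true = create from scratch, wiping any existing index files here
    IndexWriter writer("/path/to/index", &analyzer, true);
    // ... re-add all documents ...
    writer.close();
}
```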


How can I index and search digits and other non-alphabetic characters?

The components responsible for this are various `Analyzers`. Make sure you use the appropriate analyzer.

Is the IndexWriter class, and especially the method addIndexes(Directory) thread safe?

Yes, the `IndexWriter->addIndexes(Directory**)` method is thread safe. IndexWriter in general is thread safe, i.e. you should use the same IndexWriter object from all of your threads. In fact, it's impossible to use more than one IndexWriter on the same index directory, as this will lead to an exception when trying to create the lock file.


When is it possible for document IDs to change?

Documents are only re-numbered after there have been deletions. Once there have been deletions, renumbering may be triggered by any document addition or index optimization. Once an index is optimized, no renumbering will be performed until more deletions are made.

If you require a persistent document id that survives deletions, then add it as a field to your documents.


What is the purpose of write.lock file, when is it used, and by which classes?

The write.lock is used to keep processes from concurrently attempting to modify an index.

It is obtained by an IndexWriter while it is open, and by an IndexReader once documents have been deleted and until it is closed.


What is the purpose of the commit.lock file, when is it used, and by which classes?

The commit.lock file is used to coordinate the contents of the 'segments' file with the files in the index. It is obtained by an IndexReader before it reads the 'segments' file, which names all of the other files in the index, and held until the IndexReader has opened all of these other files.

The commit.lock is also obtained by the IndexWriter when it is about to write the segments file, and held until it has finished trying to delete obsolete index files.

The commit.lock should thus never be held for long, since while it is held files are only opened or deleted, and one small file is read or written.

My program crashed and now I get a "Lock obtain timed out." error. Where is the lock and how can I delete it?

When using FSDirectory, by default, Lock files are kept in the directory specified by the TEMP or TMP system variables. If for some strange reason these variables are not set, then /tmp is used.

Lock files have names that start with "lucene-" followed by an MD5 hash of the index directory path.

If you are certain that a lock file is not in use, you can delete it manually. You should also look at the methods IndexReader::isLocked and IndexReader::unlock if you are interested in writing recovery code that can remove locks automatically.
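
A hedged recovery sketch using those methods; the Directory-based unlock signature is assumed from the Java API that CLucene mirrors, and cleanup details vary by version:

```cpp
#include "CLucene.h"

using namespace lucene::index;
using namespace lucene::store;

// Remove a stale lock ONLY when no other process can be using the index.
void clearStaleLock(const char* path) {
    if (IndexReader::isLocked(path)) {
        Directory* dir = FSDirectory::getDirectory(path, false /*don't create*/);
        IndexReader::unlock(dir);
        dir->close();  // release the directory handle when done
    }
}
```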


Is there a maximum number of segment infos whose summary (name and document count) is stored in the segments file?

All segments in the index are listed in the segments file. There is no hard limit. For an un-optimized index it is proportional to the log of the number of documents in the index. An optimized index contains a single segment.


What happens when I open an IndexWriter, optimize the index, and then close the IndexWriter? Which files will be added or modified?

All of the segments are merged into a single new segment file.
If the index was empty to begin with, no segments will be created, only the `segments` file.


If I decide not to optimize the index, when will the deleted documents actually get deleted?

Documents that are deleted are marked as deleted. However, the space they consume in the index does not get reclaimed until the index is optimized. That space will also eventually be reclaimed as more documents are added to the index, even if the index does not get optimized.


How do I update a document or a set of documents that are already indexed?

There is no direct update procedure in CLucene. To update an index incrementally you must first '''delete''' the documents that were updated, and '''then re-add''' them to the index.


If I use a compound file-style index, can I still optimize my index?

Yes. Each .cfs file created in the compound file-style index represents a single segment, which means you can still merge multiple segments into a single segment by optimizing the index.


What is the difference between IndexWriter->addIndexes(IndexReader) and IndexWriter->addIndexes(Directory**), besides them taking different arguments?

When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.

The primary advantage of the IndexReader-based method is that one can pass it IndexReaders that don't reside in a Directory.


Analyzers

How do I write my own Analyzer?

How do I index non-Latin characters?

How can I index HTML documents?

In order to index HTML documents you need to first parse them to extract text that you want to index from them.

How can I index XML documents?

In order to index XML documents you need to first parse them to extract text that you want to index from them.

How can I index OpenOffice.org files?

How can I index MS-Word documents?

In order to index MS-Word documents you need to first parse them to extract the text that you want to index from them.

How can I index MS-Excel documents?

How can I index MS-Powerpoint documents?

How can I index Email (from MS-Exchange or another IMAP server) ?

How can I index RTF documents?

How can I index PDF documents?

Can I use CLucene to index text in Chinese, Japanese, Korean, and other multi-byte character sets?

Yes, you can. CLucene is not limited to English, nor any other language. To index text properly, you need to use an Analyzer appropriate for the language of the text you are indexing. CLucene's default Analyzers work well for English. There are a number of other Analyzers in the CLucene Sandbox, including those for Chinese, Japanese, and Korean.
