From: Doug C. <cu...@ap...> - 2008-02-29 21:57:30
Yonik Seeley wrote:
> It depends on the documents I guess... if they are big, putting them
> in the index can be a burden because they get copied on every segment
> merge, and loading the other stored fields takes longer.

Didn't Mike change that? Segments can now point to fields in a separate
file, according to:

http://lucene.apache.org/java/docs/fileformats.html#Segments%20File

I think that's so that they don't have to be copied with every merge.

> There are also two levels of "Document"... things like PDF, Word, etc,
> also need to be parsed and have fields extracted make a lucene-style
> Document. I assume that's out of scope for this project though.

Yes, I think an application could implement that with, e.g., a binary
field for the raw data, another field for the mime type, and a third for
the extracted text to index. The raw data and text might be compressed.

Doug
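
The three-field layout described above could be sketched roughly as follows. This is only an illustration in plain Java, not Lucene code: the class StoredDocSketch, the nested StoredDoc type, and its field names are hypothetical, and compression uses java.util.zip from the standard library as an assumed stand-in for whatever the application or index would actually do. Mapping these three values onto index fields is left to the application.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: one record holding the raw bytes (compressed),
// the mime type, and the extracted text that would be indexed.
class StoredDocSketch {

    static final class StoredDoc {
        final byte[] rawCompressed;   // raw PDF/Word/etc. bytes, deflated
        final String mimeType;        // e.g. "application/pdf"
        final String extractedText;   // text produced by some external parser

        private StoredDoc(byte[] rawCompressed, String mimeType, String extractedText) {
            this.rawCompressed = rawCompressed;
            this.mimeType = mimeType;
            this.extractedText = extractedText;
        }

        // Builds a record, compressing the raw bytes on the way in.
        static StoredDoc of(byte[] raw, String mimeType, String extractedText) {
            return new StoredDoc(compress(raw), mimeType, extractedText);
        }

        // Returns the original, uncompressed raw bytes.
        byte[] rawBytes() {
            return decompress(rawCompressed);
        }
    }

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read means the data is bad.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or invalid compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("invalid compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

An application would parse the source file with whatever extractor it prefers, call StoredDoc.of(raw, mimeType, text), and then store the first two values while indexing the third. Whether compressing the raw bytes pays off depends on the format; already-compressed inputs may not shrink.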