From: SourceForge.net <no...@so...> - 2005-01-27 23:06:04
|
Bugs item #1063968, was opened at 2004-11-11 04:28 Message generated for change (Comment added) made by joneswoohoo You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=558446&aid=1063968&group_id=80013 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Jacquie (jacquieh) Assigned to: Nobody/Anonymous (nobody) Summary: Problem with phrase queries Initial Comment: I am having a problem with phrase queries eg "george bush". What happens is that when I run a search the query gets parsed ok, but then somewhere down the line during the actual search SegmentsTermPositions::nextPosition() gets called and it calls ((SegmentTermPositions*)current)->nextPosition(); but "current" in this case isn't a SegmentTermPositions* so it throws an error. I think its a SegmentTermDocs* instead. The problem doesn't happen in the Demo example that I based my program on - it has an identical stack trace except the last call is to SegmentTermPositions::nextPosition() . Optimizing does not make a difference. I ran the demo example and my program side by side and the query parsing seems to be identical in each case. I would appreciate any insights into a) whether it should have gone down this path in the first place and b) if it should have - how to fix it or if it shouldn't - what I did to make it happen. I include the stack trace. NewsView.exe! lucene::index::SegmentsTermPositions::nextPosition() Line 228 C++ NewsView.exe! lucene::search::PhrasePositions::nextPosition() Line 47 + 0x14 C++ NewsView.exe! lucene::search::PhrasePositions::firstPosition() Line 43 C++ NewsView.exe! lucene::search::ExactPhraseScorer::phraseFreq() Line 21 C++ NewsView.exe! lucene::search::PhraseScorer::score (lucene::search::HitCollector & results={...}, const int end=1024) Line 44 + 0xd C++ NewsView.exe! lucene::search::BooleanScorer::score (lucene::search::HitCollector & results={...}, const int maxDoc=16070) Line 68 + 0x1f C++ NewsView.exe! lucene::search::IndexSearcher::Search (lucene::search::Query & query={...}, const lucene::search::Filter * filter=0x00000000, const int nDocs=100) Line 114 + 0x2e C++ NewsView.exe! lucene::search::Hits::getMoreDocs(const int Min=50) Line 88 + 0x24 C++ NewsView.exe!lucene::search::Hits::Hits (lucene::search::Searcher & s={...}, lucene::search::Query & q={...}, const lucene::search::Filter * f=0x00000000) Line 45 C++ NewsView.exe! lucene::search::Searcher::search (lucene::search::Query & query={...}, const lucene::search::Filter * filter=0x00000000) Line 88 + 0x2e C++ NewsView.exe! lucene::search::Searcher::search (lucene::search::Query & query={...}) Line 83 C++ ---------------------------------------------------------------------- Comment By: Paul Jones (joneswoohoo) Date: 2005-01-28 10:05 Message: Logged In: YES user_id=1156975 I think the benifit of implementing the template for SegmentsTermPositions and SegmentsTermDocs, is that it makes it clear what underlying class is handled by these classes. In a template happy moment, I considered having another template for SegmentTermPositions and SegmentTermDocs, but on closer examination of the code, there is too much difference between the two classes for this to make sense. I think Ken's idea for SegmentsTermPositions to use SegmentTermDocs through composition (membership) makes sense because SegmentsTermPositions does actually use the SegmentTermDocs to traverse the TermPositions. ---------------------------------------------------------------------- Comment By: Ken Ensdorf (kensdorf) Date: 2005-01-28 01:24 Message: Logged In: YES user_id=671771 Ok John I'll give my 2 cents event though I'm not one fo the devs- Currently the basic class design has TermDocs and TermPositions as abstract base classes, and SegmentTermDocs and SeqmentTermPositions as their respective derived concrete classes. So far so good. But, since the concept of TermPositions is really just an extension of the concept of TermDocs, TermPositions inherits from TermDocs. No problem there. So now the SegmentTermPositions class must implement the TermDocs "interface", which it does by re-using the implementation of SegmentTermDocs through inheritance. Now we get into trouble, as SegmentTermPostions inherits from TermDocs twice- once through TermPositions and again through SegmentTermDocs. Alas, the "dreaded diamond", that makes the faint of heart run screaming into the night. So, basically, I think the goal is to avoid the diamond if at all possible. IMO, the best way to do this is to have SegmentTermPositions re-use SegmentTermDocs through composition (membership) instead of inheritance. I haven't tried this, but I'm pretty sure it will work. Another alternative would be for SegmentTermPositions and SegmentTermDocs use a common class that would do the grunt work of implementing the TermDocs interface. -Ken ---------------------------------------------------------------------- Comment By: John Wheeler (j_wheeler) Date: 2005-01-27 09:54 Message: Logged In: YES user_id=1149323 I must admit my latest set of patches on this issue really suck. And I'd like to apologize about submitting them. So I took the patches from Paul and ran some small tests. My first impression is that they work, which is good especially in the case of a phrase query. But I am a bit reluctant submitting them right away. (Do not want to mess up things twice) Maybe we must first have a a discussion about the inheritance tree causing all these problems. So my fellow developers what do you think? What are we going to about it? Finally a word to Paul, Thanks Thanks Thanks! You saved my demo that I have tomorrow :D John ---------------------------------------------------------------------- Comment By: Paul Jones (joneswoohoo) Date: 2005-01-26 14:59 Message: Logged In: YES user_id=1156975 Well I think I have fixed this. I was worried about how SegmentsTermPositions inherited from SegmentsTermDocs and all the class type casting involved. I made a template class: template<class T> class SegmentsDocs:public virtual TermDocs Which allowed me to have the types appropriate to the class declarations as below: // SegmentTermDocs across multiple readers class SegmentsTermDocs:public virtual SegmentsDocs<SegmentTermDocs> { public: SegmentsTermDocs ():SegmentsDocs<SegmentTermDocs>() {} SegmentsTermDocs(SegmentReader** r, int_t rLen, const int_t* s):SegmentsDocs<SegmentTermDocs> (r, rLen, s) {}; }; // SegmentTermPositions across multiple readers class SegmentsTermPositions:public virtual SegmentsDocs<SegmentTermPositions>,public virtual TermPositions { public: int_t nextPosition(); SegmentsTermPositions ():SegmentsDocs<SegmentTermPositions>() {} SegmentsTermPositions(SegmentReader** r, int_t rLen, const int_t* s):SegmentsDocs<SegmentTermPositions>(r, rLen, s) {}; }; I have edited SegmentsReader.cpp, SegmentsReader.h, Terms.h I reckon pasting the files in this comment would be a bit excessive...but I am happy to make them available. PS. On to the next bug now - two word search phrases work now but >2 doesn't (dooh) ---------------------------------------------------------------------- Comment By: Paul Jones (joneswoohoo) Date: 2005-01-24 16:12 Message: Logged In: YES user_id=1156975 I am still getting an error here even after the latest fixes. I thought int_t SegmentsTermPositions::nextPosition() should have had: return ((SegmentTermPositions *)current)->nextPosition(); rather than return ((SegmentsTermPositions *)current)->nextPosition(); but I could not get the type cast to work either. I have also tried a few other things without any luck. I am currently feeling a bit bamboozled by the wierd inheritance relationships and method overrides between TermDocs, SegmentTermDocs, SegmentsTermDocs, SegmentTermPositions, SegmentsTermPositions, TermPositions etc. How did everyone else go with this fix? ---------------------------------------------------------------------- Comment By: John Wheeler (j_wheeler) Date: 2005-01-23 07:46 Message: Logged In: YES user_id=1149323 Fixed (I think) in CVS: Terms.h 1.3 SegmentsReader.h 1.8 SegmentsReader.cpp 1.20 SegmentHeader.h 1.6 Basically what I did what was to have the constructor of SegmentsTermPositions initialise itself with an array of SegmentTermPositions instead of SegmentTermDocs. Let me know Jacquie if this solves the problems you face. ---------------------------------------------------------------------- Comment By: Petr Gladkikh (batyi) Date: 2004-12-22 00:17 Message: Logged In: YES user_id=622476 Hope this will add some info to the subject. This bug is triggered after flushRamSegments is called. If maybeMergeSegments() changes something then this bug does not appear anymore. Thus for large indexes it usually can not be reproduced. Code paths which works invoke SegmentTermPositions.cpp: Segment{}TermPositions::nextPosition Code paths which fail call SegmentsReader.cpp:: Segment{s}TermPositions::nextPosition Seems to be a plain typo somewhere... ---------------------------------------------------------------------- Comment By: Ken Ensdorf (kensdorf) Date: 2004-12-14 08:42 Message: Logged In: YES user_id=671771 The initial summary correctly diagnoses the problem: "but "current" in this case isn't a SegmentTermPositions* so it throws an error. I think its a SegmentTermDocs* instead." The SegmentsTermPositions object should contain an array of pointers to SegmentTermPositions objects, but it doesn't. This is because the code that creates the objects which fill the array explicitly creates SegmentTermDocs objects. To fix this problem, the following method needs to be overridden by the SegmentsTermPositions class: SegmentTermDocs* SegmentsTermDocs::termDocs(const SegmentReader& reader) const { SegmentTermDocs& ret = SegmentTermDocs&)reader.termDocs(); return &ret; } To do this, this method must first be made virtual. Unfortunately this still will not fix the problem, as the termPositions() method of SegmentReader returns a TermPositions&, and this cannot be cast into a SegmentTermDocs&. One way to fix this is to add another method to SegmentReader which returns a SegmentTermPositions&. This is what I did. As a general note- the use of multiple inheritance, virtual inheritance, and explicit casting seems a bit convoluted and ad-hoc. It may be worth re-thinking the overall design of this group of classes to avoid these types of problems going forward. ---------------------------------------------------------------------- Comment By: Ken Ensdorf (kensdorf) Date: 2004-12-14 08:30 Message: Logged In: YES user_id=671771 The initial summary correctly diagnoses the problem: "but "current" in this case isn't a SegmentTermPositions* so it throws an error. I think its a SegmentTermDocs* instead." The SegmentsTermPositions object should contain an array of pointers to SegmentTermPositions objects, but it doesn't. This is because the code that creates the objects which fill the array explicitly creates SegmentTermDocs objects. To fix this problem, the following method needs to be overridden by the SegmentsTermPositions class: SegmentTermDocs* SegmentsTermDocs::termDocs(const SegmentReader& reader) const { SegmentTermDocs& ret = SegmentTermDocs&)reader.termDocs(); return &ret; } To do this, this method must first be made virtual. Unfortunately this still will not fix the problem, as the termPositions() method of SegmentReader returns a TermPositions&, and this cannot be cast into a SegmentTermDocs&. One way to fix this is to add another method to SegmentReader which returns a SegmentTermPositions&. This is what I did. As a general note- the use of multiple inheritance, virtual inheritance, and explicit casting seems a bit convoluted and ad-hoc. It may be worth re-thinking the overall design of this group of classes to avoid these types of problems going forward. ---------------------------------------------------------------------- Comment By: Jacquie (jacquieh) Date: 2004-11-13 11:07 Message: Logged In: YES user_id=1113468 Thanks for your help - it showed me where to look for the problem - however there seems to be a little more to it than just closing the IndexWriter. My scenario is a program with documents continuously arriving and being indexed. I have an index directory for each day's worth of files. The user can initiate a search on the indexes at any time. For efficiency I keep the indexWriter open all the time closing it only immediately before a search and at the end of the day (when I also optimize it). I had a merge factor of 1000. This is what was happening: If I did a phrase search before the 1000 items is reached it was successful. If I did the search after 1000 items and a second set of index files appeared in the directory then the phrase search fails. I have no idea how to fix this so I am working around the problem by setting the mergefactor to a higher value than the number of documents I would ever expect to index in any one day ie 20000 so it never produces more than one set of index files. This also cures the "niños" problem. ---------------------------------------------------------------------- Comment By: sergej panic (serdzo) Date: 2004-11-12 08:01 Message: Logged In: YES user_id=1157360 Hi, I had the same problem with Phrase Queries, until I discovered I had forgoten to close the IndexWriter at the end of indexing process. Make sure you call "writer->close();" at the end. ---------------------------------------------------------------------- Comment By: Jacquie (jacquieh) Date: 2004-11-11 08:42 Message: Logged In: YES user_id=1113468 Another PS: I can reproduce the problem with the Demo Example if I run it against the index produced by my program (having commented out the indexing part of the Demo) - so it seems to be a problem with the way I build the index. I could send you the index but as it is a folder I can't attach it here. ---------------------------------------------------------------------- Comment By: Jacquie (jacquieh) Date: 2004-11-11 04:53 Message: Logged In: YES user_id=1113468 PS the same thing happens with some but not all single words with special characters eg años works but niños doesn't ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=558446&aid=1063968&group_id=80013 |