From: Rui O. <rui...@ho...> - 2010-04-27 15:06:05
I have tested using SimpleAnalyzer and now my application works with Portuguese characters. I can index and search words with Portuguese characters.

Thanks Itamar! :)

One strange thing: when I use Luke to analyse the index, the Portuguese characters are still wrong, and with Luke I cannot search for these words.

Thanks & Regards,
Rui

From: it...@di...
To: clu...@li...
Date: Mon, 26 Apr 2010 21:23:56 +0300
Subject: Re: [CLucene-dev] Clucene search - Do not found some words

CString is MFC's string object, and is TCHAR-based.

Rui, the function we are actually interested in is m_GetFileContents. The error most likely lies there, in the way you are loading your text documents (which we already established are ANSI). Please also let us know how you compile your app (MBCS or Unicode).

In the meantime, try two more things:

- Index UTF8 / Unicode encoded files instead of your ANSI ones.
- Use SimpleAnalyzer instead of Standard. StandardAnalyzer is meant primarily for English texts, and might be incompatible with accented letters.

See cl_test::TestAnalyzers.cpp (esp. testISOLatin1AccentFilter) -- try perhaps playing with it a bit to see if it is an issue with CLucene or with your own code.

HTH

Itamar.

From: Onilton Maciel [mailto:oni...@gm...]
Sent: Monday, April 26, 2010 5:20 PM
To: clu...@li...
Subject: Re: [CLucene-dev] Clucene search - Do not found some words

Shouldn't ls_text be TCHAR? (I'm asking other people reading this thread)

On Mon, Apr 26, 2010 at 9:58 AM, Rui Oliveira <rui...@ho...> wrote:

    void c_IndexEx::m_Add(CString avs_codRevsId)
    {
        CString ls_origem = "c_IndexEx::m_Add";
        try {
            m_InitVariables();
            if(!ii_enmIndx) return;

            IndexWriter* writer = NULL;
            lucene::analysis::standard::StandardAnalyzer an;

            if ( IndexReader::indexExists(iclp_indexPath) ){
                if ( IndexReader::isLocked(iclp_indexPath) ) {
                    m_AppendLog("Index was locked... unlocking it.");
                    IndexReader::unlock(iclp_indexPath);
                }
                writer = _CLNEW IndexWriter( iclp_indexPath, &an, false);
            } else {
                writer = _CLNEW IndexWriter( iclp_indexPath ,&an, true);
            }
            writer->setMaxFieldLength(IndexWriter::DEFAULT_MAX_FIELD_LENGTH);
            writer->setUseCompoundFile(true);

            uint64_t str = lucene::util::Misc::currentTimeMillis();

            // make a new, empty document
            Document* lcl_doc = _CLNEW Document();
            if(m_FileDocument( avs_codRevsId, lcl_doc )) {
                writer->addDocument( lcl_doc );
            }
            _CLDELETE(lcl_doc);

            writer->optimize();
            writer->close();
            _CLDELETE(writer);
        }
        catch(CLuceneError& err) {
            // e->Delete();
            return;
        }
        catch( CException* e ) {
            // e->Delete();
            m_AppendLog(ls_origem);
            return;
        }
        catch(...) {
            // e->Delete();
            return;
        }
    }

    BOOL c_IndexEx::m_FileDocument(CString avs_codRevsId, Document* arcl_doc)
    {
        // make a new, empty document
        CString ls_codDocmId;
        CString ls_Path = m_GetFilePath(avs_codRevsId, &ls_codDocmId);
        if(ls_Path.IsEmpty()) {
            return FALSE;
        }

        char* lcl_Path = NULL;
        lcl_Path = new char[ls_Path.GetLength()+1];
        _tcscpy(lcl_Path, ls_Path);

        CString ls_text;
        m_GetFileContents(lcl_Path, &ls_text);

        arcl_doc->add( *_CLNEW Field(_T("contents"), ls_text, Field::STORE_YES | Field::INDEX_TOKENIZED) );

        icl_file.m_DeleteFile(ls_Path);

        // return the document
        delete[] lcl_Path;   // allocated with new[], so release with delete[]
        return TRUE;
    }

From: oni...@gm...
Date: Mon, 26 Apr 2010 10:36:45 -0300
To: clu...@li...
Subject: Re: [CLucene-dev] Clucene search - Do not found some words

Can you send the code where you index?

On Mon, Apr 26, 2010 at 9:55 AM, Rui Oliveira <rui...@ho...> wrote:

How can I check this?
I just get the text from files into a CString, and after this put it in CLucene.
Apparently, the text I get from the file into the CString is right; I have checked in debug mode and it looks good.

Rui

> Date: Mon, 26 Apr 2010 14:44:56 +0200
> From: nun...@go...
> To: clu...@li...
> Subject: Re: [CLucene-dev] Clucene search - Do not found some words
>
> Rui,
>
> which encoding do you use internally before you give it to CLucene?
> Maybe you use an encoding different to the encoding expected by CLucene.
>
> Kind regards,
>
> Veit
>
> 2010/4/26 Rui Oliveira <rui...@ho...>:
> > Hi,
> >
> > I have been using Luke to analyze the index.
> >
> > Well, all Portuguese characters appear replaced by a strange character.
> >
> > What can I do to avoid this?
> > Is it not possible to make CLucene work with Portuguese characters?
> >
> > Thanks & Regards,
> > Rui
> >
> >> Date: Fri, 23 Apr 2010 20:43:49 +0200
> >> From: bva...@gm...
> >> To: clu...@li...
> >> Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >>
> >> I suggest using a program called luke (google it). You can then look into the index and see what is indexed. Let us know if u see all the words you would expect to see. And see if u can find the document if u search from luke
> >>
> >> handy program :)
> >>
> >> cheers
> >> ben
> >>
> >> On Friday, April 23, 2010, Rui Oliveira <rui...@ho...> wrote:
> >> >
> >> > Itamar,
> >> >
> >> > The tests were all made with the same file. The same file has "orçamento" and "administração"; "administração" is found and "orçamento" is not.
> >> >
> >> > The results are the same for a file in ANSI, Unicode or UTF8 encoding.
> >> > The problem is not in loading the files, because I debugged the text loaded from the file and this text is OK.
> >> >
> >> > Rui
> >> >
> >> > From: it...@di...
> >> > To: clu...@li...
> >> > Date: Fri, 23 Apr 2010 17:59:27 +0300
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> > Rui,
> >> >
> >> > This file is ANSI encoded. Are the other files, the ones you do succeed in finding, Unicode / UTF8 encoded perhaps? If that's the case your routine for loading the files is buggy. You should either have them all encoded using the same encoding, or have more intelligent code to convert incompatible encodings.
> >> >
> >> > HTH
> >> >
> >> > Itamar.
> >> >
> >> > From: Rui Oliveira [mailto:rui...@ho...]
> >> > Sent: Friday, April 23, 2010 4:32 PM
> >> > To: clucene-developers; oni...@gm...
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> > I have just attached the file.
> >> >
> >> > Tks, Rui
> >> >
> >> > From: oni...@gm...
> >> > Date: Fri, 23 Apr 2010 09:22:05 -0400
> >> > To: clu...@li...
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> > Can you send me this file that has both "orçamento" and "administração"?
> >> >
> >> > Or you can do a test: Open the file and delete the ç from orçamento and administração.
> >> > And then type ç again.
> >> >
> >> > Index again and try to search both words again.
> >> >
> >> > On Fri, Apr 23, 2010 at 9:14 AM, Rui Oliveira <rui...@ho...> wrote:
> >> >
> >> > They are text files (*.txt) and both words are in the same document.
> >> > When I search for "orçamento" nothing is found, and when I search for "administração" the document is found.
> >> >
> >> > Rui
> >> >
> >> > From: oni...@gm...
> >> > Date: Fri, 23 Apr 2010 09:09:30 -0400
> >> > To: clu...@li...
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> > Seems like an encoding problem with these documents. Are they html pages?
> >> > Are the words "orçamento" and "administração" in the same page, for example?
> >> >
> >> > Can you dump one of these files here? (One that has the problem and one that has not)
> >> >
> >> > On Fri, Apr 23, 2010 at 9:05 AM, Rui Oliveira <rui...@ho...> wrote:
> >> >
> >> > I am indexing some separate documents.
> >> >
> >> > The document that has these words is a small text document. This document is indexed without any visible error. This same document is found when I search for other words in it.
> >> >
> >> > Rui
> >> >
> >> > From: oni...@gm...
> >> > Date: Fri, 23 Apr 2010 08:58:05 -0400
> >> > To: clu...@li...
> >> > Subject: Re: [CLucene-dev] Clucene search - Do not found some words
> >> >
> >> > What are you indexing?
> >> >
> >> > Just a big document?
> >> > Or a lot of separate documents? (html documents?)
> >> >
> >> > On Fri, Apr 23, 2010 at 8:54 AM, Rui Oliveira <rui...@ho...> wrote:
> >> >
> >> > Hi Onilton,
> >> >
> >> > I have tested with "orcamento" instead of "orçamento" and didn't get anything.
> >> >
> >> > I do not know if lucene indexes "orçamento" in a wrong way, because it indexes without any error, but when I search for it I do not get anything.
> >> >
> >> > Thanks & Regards,
> >> > Rui
> >> >
> >> > From:
> >> >
> >>
> >> ------------------------------------------------------------------------------
> >> _______________________________________________
> >> CLucene-developers mailing list
> >> CLu...@li...
> >> https://lists.sourceforge.net/lists/listinfo/clucene-developers
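To illustrate the loading step Itamar points to: if the ANSI files are Windows-1252 / ISO-8859-1 (an assumption; the thread only says "ANSI"), each accented Portuguese letter is a single byte in the range 0xC0-0xFF whose value equals its Unicode code point. A minimal sketch of widening such bytes follows. latin1_to_wide is a hypothetical helper, not part of CLucene or MFC, and this says nothing about what m_GetFileContents actually does.

```cpp
#include <string>

// Hypothetical helper (not CLucene or MFC code): widen ISO-8859-1 bytes to
// wchar_t. For the accented letters used in Portuguese, the byte value is
// also the Unicode code point, so no lookup table is needed.
std::wstring latin1_to_wide(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        // Go through unsigned char first: where plain char is signed, a
        // direct cast of a byte such as 0xE7 would sign-extend and produce
        // a different code unit.
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

With this helper the bytes "or\xE7" "amento" become the wide string for "orçamento". It does not handle UTF-8 input or the Windows-1252 characters in the 0x80-0x9F range; in an MFC application the Win32 function MultiByteToWideChar is the more general route.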
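Rui's test with "orcamento" touches on accent folding, which is what the ISOLatin1AccentFilter test Itamar mentions exercises. A search for the unaccented form can only match if accents are also removed when the text is indexed, and the same folding is applied to the query. A rough sketch of the idea, limited to letters common in Portuguese; fold_accents is a hypothetical helper and not CLucene's implementation:

```cpp
#include <string>

// Hypothetical helper, not CLucene code: map one accented Latin-1 letter
// to its unaccented ASCII base letter; other characters pass through.
wchar_t fold_accent(wchar_t c) {
    switch (c) {
        // lowercase: a-grave, a-acute, a-circumflex, a-tilde
        case 0x00E0: case 0x00E1: case 0x00E2: case 0x00E3: return L'a';
        case 0x00E7: return L'c';                              // c-cedilla
        case 0x00E9: case 0x00EA: return L'e';                 // e-acute, e-circumflex
        case 0x00ED: return L'i';                              // i-acute
        case 0x00F3: case 0x00F4: case 0x00F5: return L'o';    // o-acute, o-circumflex, o-tilde
        case 0x00FA: case 0x00FC: return L'u';                 // u-acute, u-diaeresis
        // uppercase counterparts
        case 0x00C0: case 0x00C1: case 0x00C2: case 0x00C3: return L'A';
        case 0x00C7: return L'C';
        case 0x00C9: case 0x00CA: return L'E';
        case 0x00CD: return L'I';
        case 0x00D3: case 0x00D4: case 0x00D5: return L'O';
        case 0x00DA: case 0x00DC: return L'U';
        default: return c;
    }
}

// Fold every character of a term; meant to be applied to both the indexed
// text and the query text so that the two sides agree.
std::wstring fold_accents(const std::wstring& s) {
    std::wstring out;
    out.reserve(s.size());
    for (wchar_t c : s) out.push_back(fold_accent(c));
    return out;
}
```

Under this sketch both "orçamento" and "orcamento" reduce to the same term, so either spelling in a query would match, provided the indexed text went through the same folding.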