Hi,
Storing fields isn't that important for me; I just thought it would be nice if they could be buffered.

I probably don't understand the problem well enough, but what if we just pass
in the length of the stream? In my case I do know the length of the stream.
Would that help?

Pseudo code:

Field* contentsField = new Field(L"contents", [streams stuff], streamLength, Field::STORE_YES | Field::INDEX_TOKENIZED);
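
To show why a known length might help, here is a minimal sketch. It assumes the stored length is written as a Lucene-style VInt (7 data bits per byte, low-order bits first, high bit set while more bytes follow). The names writeVInt and storeStream are hypothetical, not CLucene API, and the sketch counts plain bytes rather than characters to keep it short:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>   // only needed for the string streams in the usage note
#include <stdexcept>
#include <string>

// Write `value` as a variable-length int: 7 data bits per byte,
// low-order bits first, high bit set on every byte except the last.
inline void writeVInt(std::ostream& out, uint32_t value) {
    while (value > 0x7F) {
        out.put(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.put(static_cast<char>(value));
}

// Hypothetical helper: the caller supplies the length up front, so the
// length prefix is written first and the content is copied chunk by chunk.
inline void storeStream(std::istream& in, uint32_t streamLength,
                        std::ostream& out, std::size_t chunkSize = 1024) {
    assert(chunkSize > 0);
    writeVInt(out, streamLength);

    std::string buffer(chunkSize, '\0');
    uint32_t remaining = streamLength;
    while (remaining > 0) {
        const std::size_t want =
            remaining < chunkSize ? remaining : chunkSize;
        in.read(&buffer[0], static_cast<std::streamsize>(want));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) {
            // The declared length was wrong; the prefix is already written.
            throw std::runtime_error("stream shorter than declared length");
        }
        out.write(buffer.data(), static_cast<std::streamsize>(got));
        remaining -= static_cast<uint32_t>(got);
    }
}
```

Called with a std::istringstream holding 3000 bytes and a std::ostringstream as output, storeStream writes a two-byte length prefix followed by the 3000 content bytes, with only one 1 KiB chunk in memory at a time. It only works if the declared length is exact, which is why the helper throws when the stream ends early.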

> ----- Original Message ----
> From: Ben van Klinken <bvanklinken@gmail.com>
> To: clucene-developers@lists.sourceforge.net
> Sent: Wednesday, January 17, 2007 3:01:48 PM
> Subject: Re: [CLucene-dev] Huge memory footprint
>
> Hi Rene,
>
> The issue actually lies in the underlying Lucene file format. Let me explain:
>
> The two processes of indexing (storage and tokenising) are separate.
> Therefore, using a reader with a field which is both stored and
> indexed means that the reading must be done twice. However, the
> reader interface does not directly support the concept of rewind (only
> mark and reset), and for a large stream this is a pretty bad
> performance hit. At the time of writing, I considered fixing the
> multiple-pass requirement, but there are several problems with this.
> The main one is that the length of the field is expressed as a VInt
> (variable-length int), which means you cannot write the stream, then
> go back and write the stream length after you know the length of the
> stream. So basically the whole issue is still up in the air.
>
> There are two workarounds to this: a) read everything into a
> null-terminated string, then add a string to the field, or b) give the
> field 2 readers, 1 for indexing and 1 for storage.
>
> Neither is optimal, but without a major rework of that part of the
> indexer, it is unavoidable as it stands (the problem also exists in Java
> Lucene, afaik).
>
> If you can come up with a good solution to doing multiple passes and
> reading the stream into memory (which in effect spikes the memory
> because of having data in memory 2 or 3 times), I'd love to hear your
> ideas.
>
>
> ben
>
> On 17/01/07, Rene Rattur <renerattur@yahoo.com> wrote:
> >
> > Hi,
> > It's me again, I ran into another problem.
> >
> > Using InputStreamReader, indexing without storing fields works great, but
> > trying to store fields doesn't work.
> >
> > Here's my code:
> >
> > Field* contentsField = new Field(L"contents",
> >     new Reader(new InputStreamReader(new FileInputStream("temp.txt", 1024),
> >         Utility::PLATFORM_ENCODING), true),
> >     Field::STORE_YES | Field::INDEX_TOKENIZED);
> > [yeah there's a memory leak, I know]
> >
> > I get a segmentation fault at gconv(); going down the stack trace leads
> > here:
> > jstreams::InputStreamReader::decode(wchar_t* start, int32_t space)
> >
> > space is 2147483647
> >
> > I guess iconv(conv, inbuf, inbytesleft, outbuf, outbytesleft)
> > can't handle an outbytesleft of [ space * sizeof(wchar_t) ].
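
To make the suspected arithmetic concrete, here is a small sketch. It assumes a 4-byte wchar_t (typical with glibc) and a 32-bit byte count. Neither is confirmed by the stack trace, and bytesNeeded, truncatedTo32Bits and clampSpace are hypothetical names, not jstreams functions:

```cpp
#include <cassert>
#include <cstdint>

// Assumption: wchar_t is 4 bytes wide, as is typical with glibc.
constexpr uint64_t kWideCharSize = 4;

// Bytes needed to hold `space` wide characters, computed in 64 bits.
constexpr uint64_t bytesNeeded(int32_t space) {
    return static_cast<uint64_t>(space) * kWideCharSize;
}

// The same product as seen through a 32-bit unsigned byte count.
constexpr uint32_t truncatedTo32Bits(int32_t space) {
    return static_cast<uint32_t>(bytesNeeded(space));
}

// Cap a requested character count at a fixed chunk size, so the byte
// count handed to a converter always describes a buffer that exists.
constexpr int32_t clampSpace(int32_t space, int32_t chunkChars) {
    return space < chunkChars ? space : chunkChars;
}

// space = 2147483647 needs about 8 GiB of output, which no 32-bit
// count can express and no real buffer in this path would provide.
static_assert(bytesNeeded(2147483647) == 8589934588ULL,
              "INT32_MAX wide characters need 8589934588 bytes");
```

Clamping the request to, say, 1024 characters brings the byte count down to 4096, which is the kind of bounded chunk a buffered reader could actually allocate.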
> >
> > I think Lucene should use buffered chunks even when just storing fields,
> > rather than trying to read everything in one go.
> >
> > How should I approach this problem?
> >
> > >
> > >
> > > ----- Original Message ----
> > > From: Rene Rattur <renerattur@yahoo.com>
> > > To: clucene-developers@lists.sourceforge.net
> > > Sent: Monday, January 15, 2007 5:08:01 PM
> > > Subject: Re: [CLucene-dev] Huge memory footprint
> > >
> >
> > > Nice, this is exactly what I need.
> > >
> > > Thanks :)
> > >
> > > Now to see if I can get rid of the memory spiking.
> > >
> > > > ----- Original Message ----
> > > > From: Ben van Klinken <bvanklinken@gmail.com>
> > > > To: clucene-developers@lists.sourceforge.net
> > > > Sent: Monday, January 15, 2007 11:56:31 AM
> > > > Subject: Re: [CLucene-dev] Huge memory footprint
> > > >
> > > > Hi,
> > > >
> > > > Have a look at the contributions package. There is an
> > > > InputStreamReader where you can specify the input encoding. Actually,
> > > > the core has the same but limited functionality. I don't see why you
> > > > would need to write your own reader if it is as simple as an encoding
> > > > converter.
> > > >
> > > > If you are unhappy with the documentation, you can of course always
> > > > work on it... volunteers will never be rebuffed :D
> > > >
> > > > cheers
> > > > ben
> > > >
> > > > On 15/01/07, Rene Rattur <renerattur@yahoo.com> wrote:
> > > > >
> > > > > Yeah, so I tried messing around with IndexWriter's options. No luck.
> > > > > Memory still spikes up.
> > > > >
> > > > > I came up with an alternative.
> > > > > I create a temporary file, and as content gets extracted
> > > > > it is written to that temporary file. Then I construct a FileReader
> > > > > for that temporary file, which in turn gets passed to
> > > > > Field(_T("content"), FileReader), the Field to Document#addField,
> > > > > and the Document to IndexWriter#addDocument.
> > > > >
> > > > > Now here's the problem: I need a FileReader that reads in wchar_t.
> > > > > I tried writing one, but thanks to the amazing documentation of
> > > > > jstreams, I couldn't figure out what exactly (int32_t space) is in
> > > > > BufferedInputStream<T>#fillBuffer(T* start, int32_t space).
> > > > > Is it the count of T or (count of T * sizeof(T)), am I supposed to
> > > > > allocate [start], and what's the method supposed to return?
> > > > >
> > > > >
> > > > >
> > > > > > ----- Original Message ----
> > > > > > From: Ben van Klinken <bvanklinken@gmail.com>
> > > > > > To: clucene-developers@lists.sourceforge.net
> > > > > > Sent: Wednesday, January 10, 2007 8:47:22 PM
> > > > > > Subject: Re: [CLucene-dev] Huge memory footprint
> > > > > >
> > > > > > Have a look at the settings in the IndexWriter object. There are lots
> > > > > > of options there to customise how much to buffer in memory.
> > > > > >
> > > > > > ben
> > > > > >
> > > > > > > On 10/01/07, Rene Rattur <renerattur@yahoo.com> wrote:
> > > > > > >
> > > > > > > 400 KiB worth of text -> Field[STORE_NO, INDEX_TOKENIZED] -> Document ==
> > > > > > > huge memory footprint (14 MiB) until [IndexWriter->addDocument(); delete
> > > > > > > Document].
> > > > > > >
> > > > > > > Is there any way of getting the text indexed, buffered in chunks of a
> > > > > > > couple of kilobytes?
> > > > > > >
> > > > > > _______________________________________________
> > > > > > CLucene-developers mailing list
> > > > > > CLucene-developers@lists.sourceforge.net
> > > > > > https://lists.sourceforge.net/lists/listinfo/clucene-developers
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > -------------------------------------------------------------------------
> > > > > > Take Surveys. Earn Cash. Influence the Future of IT
> > > > > > Join SourceForge.net's Techsay panel and you'll get the chance to
> > share
> > > > > your
> > > > > > opinions on IT & business topics through brief surveys - and earn
> > cash
> > > > > >
> > > > >
> > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > > > > > _______________________________________________
> > > > > > CLucene-developers mailing list
> > > > > > CLucene-developers@lists.sourceforge.net
> > > > > >
> > > > > >
> > https://lists.sourceforge.net/lists/listinfo/clucene-developers
> > > > > >
> > > > > >
> > > > >  ________________________________
> > > > > TV dinner still cooling?
> > > > > Check out "Tonight's Picks" on Yahoo! TV.
> > > > >
> > -------------------------------------------------------------------------
> > > > > Take Surveys. Earn Cash. Influence the Future of IT
> > > > > Join SourceForge.net's Techsay panel and you'll get the chance to
> > share your
> > > > > opinions on IT & business topics through brief surveys - and earn cash
> > > > >
> > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > > > >
> > > > > _______________________________________________
> > > > > CLucene-developers mailing list
> > > > > CLucene-developers@lists.sourceforge.net
> > > > >
> > https://lists.sourceforge.net/lists/listinfo/clucene-developers
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > -------------------------------------------------------------------------
> > > > Take Surveys. Earn Cash. Influence the Future of IT
> > > > Join SourceForge.net's Techsay panel and you'll get the chance to share
> > your
> > > > opinions on IT & business topics through brief surveys - and earn cash
> > > >
> > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > > > _______________________________________________
> > > > CLucene-developers mailing list
> > > > CLucene-developers@lists.sourceforge.net
> > > >
> > https://lists.sourceforge.net/lists/listinfo/clucene-developers
> > > > > ________________________________
> > > Any questions? Get answers on any topic at Yahoo! Answers. Try it now.
> > >
> > -------------------------------------------------------------------------
> > > > Take Surveys. Earn Cash. Influence the Future of IT
> > > > Join SourceForge.net's Techsay panel and you'll get the chance to share
> > your
> > > > opinions on IT & business topics through brief surveys - and earn cash
> > >
> > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > > _______________________________________________
> > > CLucene-developers mailing list
> > > CLucene-developers@lists.sourceforge.net
> > >
> > https://lists.sourceforge.net/lists/listinfo/clucene-developers
> >
> >
> >  ________________________________
> > TV dinner still cooling?
> > Check out "Tonight's Picks" on Yahoo! TV.
> > -------------------------------------------------------------------------
> > Take Surveys. Earn Cash. Influence the Future of IT
> > Join SourceForge.net's Techsay panel and you'll get the chance to share your
> > opinions on IT & business topics through brief surveys - and earn cash
> > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> >
> > _______________________________________________
> > CLucene-developers mailing list
> > CLucene-developers@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/clucene-developers
> >
> >
> >
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> CLucene-developers mailing list
> CLucene-developers@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/clucene-developers
>

