From: Christiaan F. <chr...@ad...> - 2006-06-16 08:49:26
|
For anyone using Aperture Extractors, the following may be useful to know.

Theoretically, the invocation of an Extractor may never finish, e.g. when an Extractor does a read on an InputStream that does not return because a web server fails to respond in a timely manner. Sometimes this results in an IOException; sometimes the read simply never returns. Depending on how you apply Extractors, this may halt your entire application.

To circumvent this problem, I have just committed a utility class called ThreadedExtractorWrapper. This class implements the Extractor interface and wraps an existing Extractor. The invocation of the wrapper's extract method creates a separate Thread on which the wrapped Extractor is invoked. Furthermore, the InputStream passed to the wrapped Extractor is itself wrapped in a dedicated FilterInputStream that registers when the last read took place. When no read has been done for a long time (see the class's code for the specifics of how this is determined), we assume that the Extractor hangs and the wrapper returns. The created Thread is also interrupted.

This class is based on code we have used in AutoFocus for over a year now, with great success. It has proved able to prevent hanging crawl processes, e.g. because of the aforementioned lazy web server or some huge and problematic PDF files that tended to blow up our system. The extra overhead of starting a separate Thread for *every* crawled URL seems negligible in the context of desktop search, as the majority of the time is still spent on PDF processing, network communication, etc. Of course this is highly subjective, but I couldn't notice a difference in performance between using this class and invoking the Extractors directly. Only people indexing plain text files on a local hard drive may notice any difference.

Still, this class is not used by default, as there are some consequences that I believe the system integrator should be aware of.
Although unlikely, interrupting the created Threads may have some undesired side effects, depending on the implementation of the Extractor and any third-party libs it uses. See the Thread javadocs for details on what interruption does. Also, the RDFContainer passed to the wrapper is passed to the wrapped Extractor as-is, so it may already be partially filled with information by the time the wrapper decides to interrupt its thread. Whether this is good or bad depends entirely on your application.

Chris -- |
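The mechanism Chris describes can be sketched roughly as follows. This is an illustrative reconstruction, not the actual ThreadedExtractorWrapper source: all class and method names here are invented, and the simple "interrupt when idle too long" policy stands in for whatever heuristic the real class uses.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the watchdog idea: run the extraction work on a separate
// thread, watch an InputStream for read activity, and interrupt the
// worker when reads have stalled for too long.
public class TimedExtraction {

    // Stream wrapper that records the time of the last read.
    public static class MonitoredInputStream extends FilterInputStream {
        final AtomicLong lastRead = new AtomicLong(System.currentTimeMillis());

        public MonitoredInputStream(InputStream in) { super(in); }

        @Override
        public int read() throws IOException {
            lastRead.set(System.currentTimeMillis());
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            lastRead.set(System.currentTimeMillis());
            return super.read(b, off, len);
        }
    }

    // Runs 'work' on its own thread and interrupts it when no read has
    // happened on 'in' for 'idleTimeoutMillis'. Returns true if the work
    // completed normally, false if it was aborted.
    public static boolean runWithWatchdog(MonitoredInputStream in, Runnable work,
                                          long idleTimeoutMillis)
            throws InterruptedException {
        Thread worker = new Thread(work, "extractor-worker");
        worker.start();
        while (worker.isAlive()) {
            worker.join(50);  // poll the worker in small steps
            long idle = System.currentTimeMillis() - in.lastRead.get();
            if (worker.isAlive() && idle > idleTimeoutMillis) {
                worker.interrupt();  // best effort; see the Thread javadocs
                worker.join(1000);
                return false;
            }
        }
        return true;
    }
}
```

As in the real class, interruption is only a request: an Extractor that never checks its interrupt status (or swallows InterruptedException) may keep running on the abandoned thread.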
From: Leo S. <sau...@df...> - 2006-06-21 17:12:30
|
Christiaan Fluit schrieb:
> For anyone using Aperture Extractors, the following may be useful to know.
>
> Theoretically, the invocation of an Extractor may never finish, e.g.
> when an Extractor does a read on an InputStream which for some reason
> does not return because of a web server failing to respond in a timely
> manner. Sometimes this results in an IOException, sometimes the read
> does not return. Depending on how you apply Extractors, this may halt
> your entire application.

Ok, as you describe below, that is a good solution! It also hits on something else I have had on my mind for a while: I think we might get a more stable solution (without the performance cost of threads) if we give up the direct passing of InputStreams and let the DataAccessor retrieve the whole stream before doing the extractor magic. We would then buffer the stream in a memory stream.

This will have two benefits:
* we simplify the mime-type problem a little (mark/reset)
* we solve the "hanging extractor" problem

And a side benefit that might help: if we buffer anyway, then we might add a new notion of a "file-based extractor", called FileExtractor, for Extractors that INSIST on using java File objects as input (like the MP3 libraries). Buffering the stream to a file is a hack, but at least it makes more features work, and it is worth it I think.

> To circumvent this problem, I have just committed a utility class called
> ThreadedExtractorWrapper. This class implements the Extractor interface
> and wraps an existing Extractor. [...]

Thanks for the description. It seems that the threading/ThreadedExtractorWrapper approach is good for some applications, buffering for others.

--
____________________________________________________
DI Leo Sauermann       http://www.dfki.de/~sauermann
DFKI GmbH
P.O. Box 2080          Fon: +49 631 205-3503
67608 Kaiserslautern   Fax: +49 631 205-3472
Germany                Mail: leo...@df...
____________________________________________________ |
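Leo's buffering proposal amounts to the following. This is a generic sketch, not Aperture code: read the entire stream up front, so later consumers get a ByteArrayInputStream that fully supports mark/reset and can never block on a slow source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Buffer an entire InputStream into memory before handing it to an
// Extractor. Any I/O hang now happens here, at retrieval time, and the
// resulting stream supports mark/reset for MIME-type sniffing.
public final class StreamBuffer {

    public static ByteArrayInputStream bufferFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);  // the whole document ends up in memory
        }
        return new ByteArrayInputStream(out.toByteArray());
    }
}
```

The comment in the loop is also the weak point Chris raises in his reply: the whole document ends up in memory, and the `in.read` call itself can still hang.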
From: Christiaan F. <chr...@ad...> - 2006-06-21 18:36:01
|
Leo Sauermann wrote:
> It hit something else I had on my mind longer: I think that we might
> have a more stable solution (without the performance problem of threads)
> give up the direct passing of inputstreams and let the DataAccessor
> retrieve the whole stream before doing the extractor magic.
> We would then buffer the stream in a memory stream.
>
> this will have two benefits:
> * we simplify the mime-type problem a little. (mark/reset)
> * we solve the "hanging extractor" problem

A few remarks:

- I haven't seen a performance problem at all with these threads. It seems that the cost of starting and managing an extra thread per file is largely insignificant compared to the cost of accessing the file, applying the extractor, storing the results, etc.

- You don't solve the hanging extractor problem, although it may not always be the extractor that is hanging: (1) the code for creating the buffered stream may still hang for a long time or indefinitely, e.g. when a web server suddenly stops responding, and (2) I've seen cases with very large and complex documents (I believe only PDFs) where, practically speaking, the extractor could be considered hanging and the I/O system was not to blame.

- Buffering files in memory seems like a rather heavy solution. Trust me, you *will* encounter 60 MB PDF files ;)

> and a side-benefit that might help:
> * if we buffer anyway, then we might add a new notion of "file-based
> extractor" called FileExtractor for Extractors that INSIST on using
> java File objects as input (= like the MP3 libraries). buffering the
> stream then to a file is a hack, but at least it makes more features
> work and is worth it i think

But this would mean that in the case of MP3 files residing on a file system, each file will be copied before it is processed, right? That doesn't feel like a good solution to me.

A reasonable alternative is to extend FileDataObject with an extra File property, next to the InputStream. FileDataAccessor makes sure that this property has a value; all other DataAccessors won't. A library that insists on processing a File can then see whether one is available. If not, it can still decide to retrieve the entire stream and save it to a local file. Or perhaps FileDataObject should offer this functionality itself. In any case, try to prevent copying these potentially large files.

Chris -- |
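Chris's alternative could look something like the sketch below. The class and method names are invented for illustration; this is not the actual Aperture FileDataObject API, only the idea of an optional File property with lazy spooling as the fallback.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical data object carrying a File next to its InputStream.
// A FileDataAccessor would set the File; stream-based accessors would
// pass null, and the stream is spooled to disk only when a library
// actually demands a File.
public class BufferedFileDataObject {
    private final InputStream stream;
    private File file;  // null unless the accessor had a real file

    public BufferedFileDataObject(InputStream stream, File file) {
        this.stream = stream;
        this.file = file;
    }

    // Returns the underlying File, downloading the stream to a temp
    // file only when no local file is available. Local files on a file
    // system are never copied.
    public synchronized File getFile() throws IOException {
        if (file == null) {
            file = File.createTempFile("aperture-", ".bin");
            file.deleteOnExit();
            try (FileOutputStream out = new FileOutputStream(file)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = stream.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }
        return file;
    }
}
```

With this shape, an MP3 library crawling a local directory gets the original File directly, and only genuinely remote sources pay the copying cost.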
From: Arjohn K. <arj...@ad...> - 2006-06-22 08:29:07
|
Christiaan Fluit wrote:
> Leo Sauermann wrote:
>> [...] I think that we might have a more stable solution (without the
>> performance problem of threads) [...]
>
> A few remarks:
>
> - I haven't seen a performance problem at all with these threads. It
> seems that the cost of starting and managing an extra thread per file is
> largely insignificant compared to the cost of accessing the file,
> applying the extractor, storing the results, etc.

FWIW: using additional threads may even increase performance on multi-core/multi-CPU machines. Given that all major CPU makers are headed that way, using additional threads sounds like a good idea to me.

Arjohn |
From: jm <jmu...@gm...> - 2008-01-10 10:11:36
|
I am successfully using the ThreadedExtractorWrapper, but I will have to copy the code to a new class just to modify the timeout values. I would like to request that the values be made parameterizable for the next release, if possible (some of my extractions get aborted even though they hadn't actually hung yet).

thanks

On Jun 22, 2006 9:28 AM, Arjohn Kampman <arj...@ad...> wrote:
> FWIW: using additional threads may even increase performance on
> multi-core/multi-cpu machines. Given that all major CPU makers are
> headed that way, using additional threads sounds like a good idea to me.
>
> Arjohn
|
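The parameterization jm asks for might look like this. All names and default values here are invented for illustration; the actual ThreadedExtractorWrapper constants and any later API are not shown in this thread.

```java
// Hypothetical timeout configuration: instead of hard-coded constants,
// the wrapper would accept the values from the caller and validate them.
public class ExtractionTimeouts {
    public static final long DEFAULT_MAX_IDLE_READ_TIME = 10_000L;    // ms without a read
    public static final long DEFAULT_MINIMUM_MAX_WORK_TIME = 5_000L;  // ms granted regardless

    private final long maxIdleReadTime;
    private final long minimumMaxWorkTime;

    public ExtractionTimeouts() {
        this(DEFAULT_MAX_IDLE_READ_TIME, DEFAULT_MINIMUM_MAX_WORK_TIME);
    }

    public ExtractionTimeouts(long maxIdleReadTime, long minimumMaxWorkTime) {
        if (maxIdleReadTime <= 0 || minimumMaxWorkTime <= 0) {
            throw new IllegalArgumentException("timeout values must be positive");
        }
        this.maxIdleReadTime = maxIdleReadTime;
        this.minimumMaxWorkTime = minimumMaxWorkTime;
    }

    public long getMaxIdleReadTime() { return maxIdleReadTime; }
    public long getMinimumMaxWorkTime() { return minimumMaxWorkTime; }
}
```

Slow-but-alive extractions like jm's could then simply be configured with larger values instead of forking the class.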
From: Christiaan F. <chr...@ad...> - 2008-01-10 10:22:42
|
jm wrote:
> I am using successfully the ThreadedExtractorWrapper, but I will have
> to copy the code to a new class just to modify the timout values, I
> would request the values to be parametrizable for next release if
> possible (some of my extractions get aborted even if they didnt hang
> yet)

Sounds like a good idea! Could you add a feature request to SourceForge's issue tracker for this?

Kind regards,

Chris -- |