From: Christiaan F. <chr...@ad...> - 2006-06-16 08:49:26
|
For anyone using Aperture Extractors, the following may be useful to know.

Theoretically, the invocation of an Extractor may never finish, e.g. when an Extractor does a read on an InputStream that does not return because a web server fails to respond in a timely manner. Sometimes this results in an IOException; sometimes the read simply never returns. Depending on how you apply Extractors, this may halt your entire application.

To circumvent this problem, I have just committed a utility class called ThreadedExtractorWrapper. This class implements the Extractor interface and wraps an existing Extractor. The invocation of the wrapper's extract method creates a separate Thread on which the wrapped Extractor is invoked. Furthermore, the InputStream passed to the wrapped Extractor is itself wrapped in a dedicated FilterInputStream that registers when the last read took place. When no read has been done for a long time (see the class's code for the specifics of how this is determined), we assume that the Extractor hangs and the wrapper returns. The created Thread is also interrupted.

This class is based on code we have used in AutoFocus for over a year now, with great success. It has proved able to prevent hanging crawl processes, e.g. because of the aforementioned lazy web server or some huge and problematic PDF files that tended to blow up our system. The extra overhead of starting a separate Thread for *every* crawled URL seems negligible in the context of desktop search, as the majority of the time is still spent on PDF processing, network communication, etc. Of course this is highly subjective, but I couldn't notice a difference in performance between using this class and invoking the Extractors directly. Only people indexing plain text files on a local hard drive may notice any difference.

Still, this class is not used by default, as there are some consequences that I believe the system integrator should be aware of.
Although unlikely, interrupting the created Threads may have some undesired side effects, depending on the implementation of the Extractor and any third-party libs it uses. See the Thread javadocs for details on what interruption does. Also, the RDFContainer passed to the wrapper is passed to the wrapped Extractor as-is, so it may already be partially filled with information by the time the wrapper decides to interrupt its thread. Whether this is good or bad depends entirely on your application.

Chris -- |
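The mechanism Chris describes can be sketched roughly as follows. This is an illustrative reconstruction, not the actual ThreadedExtractorWrapper source: all class and method names here are invented, and the simple "interrupt when idle too long" policy stands in for whatever heuristic the real class uses.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the watchdog idea: run the extraction work on a separate
// thread, watch an InputStream for read activity, and interrupt the
// worker when reads have stalled for too long.
public class TimedExtraction {

    // Stream wrapper that records the time of the last read.
    public static class MonitoredInputStream extends FilterInputStream {
        final AtomicLong lastRead = new AtomicLong(System.currentTimeMillis());

        public MonitoredInputStream(InputStream in) { super(in); }

        @Override
        public int read() throws IOException {
            lastRead.set(System.currentTimeMillis());
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            lastRead.set(System.currentTimeMillis());
            return super.read(b, off, len);
        }
    }

    // Runs 'work' on its own thread and interrupts it when no read has
    // happened on 'in' for 'idleTimeoutMillis'. Returns true if the work
    // completed normally, false if it was aborted.
    public static boolean runWithWatchdog(MonitoredInputStream in, Runnable work,
                                          long idleTimeoutMillis)
            throws InterruptedException {
        Thread worker = new Thread(work, "extractor-worker");
        worker.start();
        while (worker.isAlive()) {
            worker.join(50);  // poll the worker in small steps
            long idle = System.currentTimeMillis() - in.lastRead.get();
            if (worker.isAlive() && idle > idleTimeoutMillis) {
                worker.interrupt();  // best effort; see the Thread javadocs
                worker.join(1000);
                return false;
            }
        }
        return true;
    }
}
```

As in the real class, interruption is only a request: an Extractor that never checks its interrupt status (or swallows InterruptedException) may keep running on the abandoned thread.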
From: Leo S. <sau...@df...> - 2006-06-21 17:12:30
|
Christiaan Fluit schrieb:
> For anyone using Aperture Extractors, the following may be useful to know.
>
> Theoretically, the invocation of an Extractor may never finish, e.g.
> when an Extractor does a read on an InputStream which for some reason
> does not return because of a web server failing to respond in a timely
> manner. Sometimes this results in an IOException, sometimes the read
> does not return. Depending on how you apply Extractors, this may halt
> your entire application.

Ok, as you describe below, that is a good solution! It also hits on something else I have had on my mind for a while: I think we might get a more stable solution (without the performance cost of threads) if we give up the direct passing of InputStreams and let the DataAccessor retrieve the whole stream before doing the extractor magic. We would then buffer the stream in a memory stream.

This will have two benefits:
* we simplify the mime-type problem a little (mark/reset)
* we solve the "hanging extractor" problem

And a side benefit that might help: if we buffer anyway, then we might add a new notion of a "file-based extractor", called FileExtractor, for Extractors that INSIST on using java File objects as input (like the MP3 libraries). Buffering the stream to a file is a hack, but at least it makes more features work, and it is worth it I think.

> To circumvent this problem, I have just committed a utility class called
> ThreadedExtractorWrapper. This class implements the Extractor interface
> and wraps an existing Extractor. [...]

Thanks for the description. It seems that the threading/ThreadedExtractorWrapper approach is good for some applications, buffering for others.

--
____________________________________________________
DI Leo Sauermann       http://www.dfki.de/~sauermann
DFKI GmbH
P.O. Box 2080          Fon: +49 631 205-3503
67608 Kaiserslautern   Fax: +49 631 205-3472
Germany                Mail: leo...@df...
____________________________________________________ |
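Leo's buffering proposal amounts to the following. This is a generic sketch, not Aperture code: read the entire stream up front, so later consumers get a ByteArrayInputStream that fully supports mark/reset and can never block on a slow source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Buffer an entire InputStream into memory before handing it to an
// Extractor. Any I/O hang now happens here, at retrieval time, and the
// resulting stream supports mark/reset for MIME-type sniffing.
public final class StreamBuffer {

    public static ByteArrayInputStream bufferFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);  // the whole document ends up in memory
        }
        return new ByteArrayInputStream(out.toByteArray());
    }
}
```

The comment in the loop is also the weak point Chris raises in his reply: the whole document ends up in memory, and the `in.read` call itself can still hang.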
From: Christiaan F. <chr...@ad...> - 2006-06-21 18:36:01
|
Leo Sauermann wrote:
> It hit something else I had on my mind longer: I think that we might
> have a more stable solution (without the performance problem of threads)
> give up the direct passing of inputstreams and let the DataAccessor
> retrieve the whole stream before doing the extractor magic.
> We would then buffer the stream in a memory stream.
>
> this will have two benefits:
> * we simplify the mime-type problem a little. (mark/reset)
> * we solve the "hanging extractor" problem

A few remarks:

- I haven't seen a performance problem at all with these threads. It seems that the cost of starting and managing an extra thread per file is largely insignificant compared to the cost of accessing the file, applying the extractor, storing the results, etc.

- You don't solve the hanging extractor problem, although it may not always be the extractor that is hanging: (1) the code for creating the buffered stream may still hang for a long time or indefinitely, e.g. when a web server suddenly stops responding, and (2) I've seen cases with very large and complex documents (I believe only PDFs) where, practically speaking, the extractor could be considered hanging and the I/O system was not to blame.

- Buffering files in memory seems like a rather heavy solution. Trust me, you *will* encounter 60 MB PDF files ;)

> and a side-benefit that might help:
> * if we buffer anyway, then we might add a new notion of "file-based
> extractor" called FileExtractor for Extractors that INSIST on using
> java File objects as input (= like the MP3 libraries). buffering the
> stream then to a file is a hack, but at least it makes more features
> work and is worth it i think

But this would mean that in the case of MP3 files residing on a file system, each file will be copied before it is processed, right? That doesn't feel like a good solution to me.

A reasonable alternative is to extend FileDataObject with an extra File property, next to the InputStream. FileDataAccessor makes sure that this property has a value; all other DataAccessors won't. A library that insists on processing a File can then see whether one is available. If not, it can still decide to retrieve the entire stream and save it to a local file. Or perhaps FileDataObject should offer this functionality itself. In any case, try to prevent copying these potentially large files.

Chris -- |
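Chris's alternative could look something like the sketch below. The class and method names are invented for illustration; this is not the actual Aperture FileDataObject API, only the idea of an optional File property with lazy spooling as the fallback.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical data object carrying a File next to its InputStream.
// A FileDataAccessor would set the File; stream-based accessors would
// pass null, and the stream is spooled to disk only when a library
// actually demands a File.
public class BufferedFileDataObject {
    private final InputStream stream;
    private File file;  // null unless the accessor had a real file

    public BufferedFileDataObject(InputStream stream, File file) {
        this.stream = stream;
        this.file = file;
    }

    // Returns the underlying File, downloading the stream to a temp
    // file only when no local file is available. Local files on a file
    // system are never copied.
    public synchronized File getFile() throws IOException {
        if (file == null) {
            file = File.createTempFile("aperture-", ".bin");
            file.deleteOnExit();
            try (FileOutputStream out = new FileOutputStream(file)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = stream.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }
        return file;
    }
}
```

With this shape, an MP3 library crawling a local directory gets the original File directly, and only genuinely remote sources pay the copying cost.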
From: Arjohn K. <arj...@ad...> - 2006-06-22 08:29:07
|
Christiaan Fluit wrote:
> Leo Sauermann wrote:
>> [...] I think that we might have a more stable solution (without the
>> performance problem of threads) [...]
>
> A few remarks:
>
> - I haven't seen a performance problem at all with these threads. It
> seems that the cost of starting and managing an extra thread per file is
> largely insignificant compared to the cost of accessing the file,
> applying the extractor, storing the results, etc.

FWIW: using additional threads may even increase performance on multi-core/multi-CPU machines. Given that all major CPU makers are headed that way, using additional threads sounds like a good idea to me.

Arjohn |
From: jm <jmu...@gm...> - 2008-01-10 10:11:36
|
I am successfully using the ThreadedExtractorWrapper, but I will have to copy the code to a new class just to modify the timeout values. I would like to request that the values be made parameterizable for the next release, if possible (some of my extractions get aborted even though they hadn't actually hung yet).

thanks

On Jun 22, 2006 9:28 AM, Arjohn Kampman <arj...@ad...> wrote:
> FWIW: using additional threads may even increase performance on
> multi-core/multi-cpu machines. Given that all major CPU makers are
> headed that way, using additional threads sounds like a good idea to me.
>
> Arjohn
|
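The parameterization jm asks for might look like this. All names and default values here are invented for illustration; the actual ThreadedExtractorWrapper constants and any later API are not shown in this thread.

```java
// Hypothetical timeout configuration: instead of hard-coded constants,
// the wrapper would accept the values from the caller and validate them.
public class ExtractionTimeouts {
    public static final long DEFAULT_MAX_IDLE_READ_TIME = 10_000L;    // ms without a read
    public static final long DEFAULT_MINIMUM_MAX_WORK_TIME = 5_000L;  // ms granted regardless

    private final long maxIdleReadTime;
    private final long minimumMaxWorkTime;

    public ExtractionTimeouts() {
        this(DEFAULT_MAX_IDLE_READ_TIME, DEFAULT_MINIMUM_MAX_WORK_TIME);
    }

    public ExtractionTimeouts(long maxIdleReadTime, long minimumMaxWorkTime) {
        if (maxIdleReadTime <= 0 || minimumMaxWorkTime <= 0) {
            throw new IllegalArgumentException("timeout values must be positive");
        }
        this.maxIdleReadTime = maxIdleReadTime;
        this.minimumMaxWorkTime = minimumMaxWorkTime;
    }

    public long getMaxIdleReadTime() { return maxIdleReadTime; }
    public long getMinimumMaxWorkTime() { return minimumMaxWorkTime; }
}
```

Slow-but-alive extractions like jm's could then simply be configured with larger values instead of forking the class.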
From: Christiaan F. <chr...@ad...> - 2008-01-10 10:22:42
|
jm wrote:
> I am using successfully the ThreadedExtractorWrapper, but I will have
> to copy the code to a new class just to modify the timout values, I
> would request the values to be parametrizable for next release if
> possible (some of my extractions get aborted even if they didnt hang
> yet)

Sounds like a good idea! Could you add a feature request to SourceForge's issue tracker for this?

Kind regards,

Chris -- |