From: Leo S. <leo...@df...> - 2008-02-14 10:03:26
Hi Grant,

At the moment I don't optimize for performance. If you find anything that
can be optimized, could you please write a wiki page about optimization to
gather your results? That would help us all, and if I optimize later
myself, I could build upon your work.

best
Leo

It was Grant Ingersoll who said at the right time 13.02.2008 14:11 the
following words:
> On Feb 12, 2008, at 9:25 AM, Christiaan Fluit wrote:
>
>> Grant Ingersoll wrote:
>>
>>> Has anyone done any benchmarking of Aperture in terms of things like
>>> docs/second to extract or anything like that? Or other tests related
>>> to scalability and load? Are there best practices for achieving
>>> better performance that people have noticed?
>>
>> As far as I'm aware, there are no such tests. Useful benchmark numbers
>> are also hard to produce given the difference in complexity between the
>> various document formats, even the fact that one PDF may be much more
>> complex to process than another PDF regardless of file size, etc.
>>
>> I do some informal testing with one of my apps on a collection of
>> gathered documents (unfortunately not something I can put online), just
>> to see whether performance in speed and reliability changes over time.
>>
>> Best practices probably boil down to tips on how best to realize a
>> CrawlerHandler, as the decisions you make there have a big impact on how
>> well it performs. This mainly depends on the storage framework that you
>> use, e.g. Sesame, Jena, Lucene.
>>
>> Can you tell us more about why you need this information? Perhaps it's
>> worth the effort to start initiatives on benchmarking and other kinds of
>> tests?
>
> Mostly, I want to know how to make it run as fast as possible during
> production and what bottlenecks I may run into, etc.
>
> I use a few main pieces: a file crawler (that is pretty straightforward,
> but I wish Java had some lower level hooks, but that is not Aperture's
> fault), a URL crawler, and then the whole extraction piece including mime
> identification. For instance, some things that might be of interest: for
> the mime identification, the examples show using 8kb for identification.
> What if I wanted to use 4kb or some other value? What kind of results
> would I get per the performance tradeoff (probably depends on OS page
> cache settings)? I know these are things I can do, mostly just wondering
> if anyone else has done them.
>
>> FYI: we do have some performance/runtime issues with our software, we
>> now get frequent OutOfMemoryErrors. I suspect that PDFBox is doing some
>> heavy caching in static members that are never disposed during the
>> lifetime of the application, but I'm still tracking that down.
>>
>> Regards,
>>
>> Chris
>> --
>>
>> _______________________________________________
>> Aperture-devel mailing list
>> Ape...@li...
>> https://lists.sourceforge.net/lists/listinfo/aperture-devel
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
> http://www.lucenebootcamp.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ

--
____________________________________________________
DI Leo Sauermann       http://www.dfki.de/~sauermann

Deutsches Forschungszentrum fuer
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:  +49 631 20575-116
D-67663 Kaiserslautern  Fax:  +49 631 20575-102
Germany                 Mail: leo...@df...

Geschaeftsfuehrung:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
____________________________________________________
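
As a possible starting point for the 4kb vs. 8kb question raised in the thread, here is a minimal sketch of a harness that reads only the first N bytes of each file in a directory and reports files per second for two head sizes. It uses plain Java I/O only; the class name HeadReadBenchmark and the methods readHead and filesPerSecond are made up for illustration and are not part of Aperture. The actual identification call is deliberately left as a placeholder comment, so the numbers cover the read cost only. As noted in the thread, results will likely depend on the OS page cache, so a first (cold) pass and later (warm) passes should be compared separately.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Aperture: measures how fast file heads can be read.
final class HeadReadBenchmark {

    /** Reads at most maxBytes from the start of the file; shorter files yield a shorter array. */
    static byte[] readHead(Path file, int maxBytes) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[maxBytes];
            int total = 0;
            // A single read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < maxBytes) {
                int n = in.read(buf, total, maxBytes - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            return total == maxBytes ? buf : Arrays.copyOf(buf, total);
        }
    }

    /** Makes one pass over the files and returns files per second for the given head size. */
    static double filesPerSecond(List<Path> files, int headBytes) throws IOException {
        long bytesSeen = 0;
        long start = System.nanoTime();
        for (Path f : files) {
            byte[] head = readHead(f, headBytes);
            // Placeholder: the real MIME type identification call would go here.
            bytesSeen += head.length;
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        if (bytesSeen < 0) {
            throw new IllegalStateException("unreachable; keeps the read results in use");
        }
        return files.size() * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HeadReadBenchmark.java <directory>");
            return;
        }
        List<Path> files;
        try (Stream<Path> entries = Files.list(Paths.get(args[0]))) {
            files = entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (int size : new int[] {4096, 8192}) {
            System.out.printf("%d bytes: %.1f files/s%n", size, filesPerSecond(files, size));
        }
    }
}
```

Running it several times over the same directory gives a rough idea of the warm-cache read cost; it says nothing about whether a smaller head still identifies every format correctly, which would need a separate check against documents with known types.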