From: Kevin C. B. <kbo...@er...> - 2007-01-26 02:09:38
I changed it to use the NativeStore and it worked much better. 1.2GB of
source documents turned into a 461MB .dat file and a 471MB TriX output
file, roughly 60% smaller than the original source documents. That is
consistent with what I have seen using miscellaneous extraction
utilities.

1. This is a small test dataset for me. My next tests will be 10GB,
50GB, 100GB, 250GB, 500GB, ...

a) The .dat file

What happens when the .dat file starts to grow very large? Has anyone
ever broken it out into separate files? Maybe a rolling .dat file based
on size? How would this affect the ischanged/ismodified aspect (it does
nothing if the data source has not changed, processes it if it has
changed)?

b) repository.export and RDFWriter

What happens when the repository.export starts to grow very large? Has
anyone ever broken it out into separate files? Maybe individual files
on export?

Thanks
Kevin

Kevin C. Bombardier wrote:
> 4. I tried to crawl my test directory, which had 1.2GB of data in it:
> 4400 files (a mixture of xls, pdf, doc, odt, xml, ps, ppt and rtf).
> While it started out well, it ran out of Java memory at file 78. I
> increased the JVM heap to start at 512MB and grow to 1GB. It looks
> like it made it to the last file ("Crawling completed, saving
> results...") but I get a stack trace in the startup window with an
> OutOfMemoryError. The TriX output file is 86.2MB. It does not look
> like it finished correctly? It did only take 15 minutes to reach the
> completion message, though.
>
> Any information on how much it can handle (# of docs in one crawl,
> types, sizes, memory, ...) would be welcome; pretty much any
> performance-related information.

I'm not surprised to see this happen. Again, the file crawler UI is
meant as a coding example and has been kept as simple as possible.
Because of this, and for some historic reasons (the state of Sesame 2
at the time this code was written), it uses a data structure that holds
all extracted information (full-text and metadata) in RAM and only
writes it to disk at the end of the entire crawl. Clearly, this doesn't
scale even remotely.

Sesame 2 has progressed a lot since then and now contains a stable
disk-based RDF store. We can update the example code to use this native
store to improve scalability. However, I would recommend that you look
into the CrawlerHandler API (see the tutorials on
aperture.sourceforge.net), not least because of your next question. As
Aperture focuses on providing middleware components that handle
crawling and extraction tasks, I'm hesitant to make the examples too
complex.

Regards,

Chris

--
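
For readers who want to try the NativeStore switch discussed above,
here is a minimal sketch of opening a disk-based Sesame 2 repository.
The class and method names (NativeStoreSetup, openStore) and the data
directory are illustrative, not part of Aperture; the openrdf imports
are the standard Sesame 2 API. The .dat files mentioned in the thread
are the store's on-disk statement files.

    import java.io.File;

    import org.openrdf.repository.Repository;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.sail.nativerdf.NativeStore;

    public class NativeStoreSetup {
        // Opens (or creates) a NativeStore in the given data directory.
        // The store keeps its statements in .dat/index files on disk,
        // so the full model never has to be held in RAM.
        public static Repository openStore(File dataDir) throws Exception {
            Repository repository = new SailRepository(new NativeStore(dataDir));
            repository.initialize();
            return repository;
        }
    }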
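
On question (b), splitting a large export into individual files: one
way to sketch this is to export each context (named graph) to its own
file, assuming the crawl stores each document's statements in a
separate context. That assumption, the class name, and the file-naming
scheme are illustrative; getContextIDs() and export() are standard
Sesame 2 RepositoryConnection calls.

    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.openrdf.model.Resource;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.RepositoryResult;
    import org.openrdf.rio.trix.TriXWriter;

    public class SplitExport {
        // Writes each context to its own TriX file, so no single
        // output file has to hold the entire repository.
        public static void exportPerContext(Repository repository) throws Exception {
            RepositoryConnection con = repository.getConnection();
            try {
                RepositoryResult<Resource> contexts = con.getContextIDs();
                int fileNr = 0;
                while (contexts.hasNext()) {
                    Resource context = contexts.next();
                    OutputStream out = new FileOutputStream("export-" + fileNr++ + ".trix");
                    try {
                        con.export(new TriXWriter(out), context);
                    } finally {
                        out.close();
                    }
                }
            } finally {
                con.close();
            }
        }
    }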
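
The CrawlerHandler recommendation boils down to processing each
DataObject as it is crawled instead of accumulating everything in RAM.
A rough sketch of that idea follows; it deliberately omits the
interface's other callbacks (crawl start/stop, object removal, etc.),
and the method names shown should be checked against the tutorials on
aperture.sourceforge.net before use.

    import org.semanticdesktop.aperture.accessor.DataObject;
    import org.semanticdesktop.aperture.crawler.Crawler;

    // Sketch only: a real handler implements the full CrawlerHandler
    // interface; see the Aperture tutorials for the exact signatures.
    public class StreamingCrawlerHandler {

        public void objectNew(Crawler crawler, DataObject object) {
            store(object);
        }

        public void objectChanged(Crawler crawler, DataObject object) {
            store(object);
        }

        public void objectNotModified(Crawler crawler, String url) {
            // Incremental crawl: the source is unchanged, nothing to do.
        }

        private void store(DataObject object) {
            try {
                // Write object.getMetadata() to the disk-based repository
                // here, rather than keeping it in an in-memory structure.
            } finally {
                // Release the object (and any streams it holds) right away
                // so memory use stays flat across thousands of files.
                object.dispose();
            }
        }
    }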