From: Kevin C. B. <kbo...@er...> - 2007-01-26 02:09:38
I changed it to use the NativeStore and it worked much better. 1.2GB of
source documents turned into a 461MB .dat file and a 471MB TriX output
file, roughly 60% smaller than the original source documents. That is
consistent with what I have seen using miscellaneous extraction
utilities.

1. This is a small test dataset for me. My next tests will be 10GB,
50GB, 100GB, 250GB, 500GB, ...

a) The .dat file

What happens when the .dat file starts to grow very large? Has anyone
ever broken it out into separate files? Maybe a rolling .dat file based
on size? How would this affect the ischanged/ismodified aspect (it does
nothing if the data source has not changed, processes it if it has
changed)?

b) repository.export and RDFWriter

What happens when the repository.export starts to grow very large? Has
anyone ever broken it out into separate files? Maybe individual files
on export?

Thanks
Kevin

Kevin C. Bombardier wrote:
> 4. I tried to crawl my test directory, which had 1.2GB of data in it:
> 4400 files (a mixture of xls, pdf, doc, odt, xml, ps, ppt and rtf).
> While it started out well, it ran out of Java memory at file 78. I
> increased the JVM heap to start at 512MB and grow to 1GB. It looks
> like it made it to the last file ("Crawling completed, saving
> results...") but I get a stack trace in the startup window with an
> OutOfMemoryError. The TriX output file is 86.2MB. It does not look
> like it finished correctly? It did only take 15 minutes to reach the
> completion message, though.
>
> Any information on how much it can handle (# of docs in one crawl,
> types, sizes, memory, ...) would be welcome; pretty much any
> performance-related information.

I'm not surprised to see this happen. Again, the file crawler UI is
meant as a coding example and has been kept as simple as possible.
Because of this, and for some historic reasons (the state of Sesame 2
at the time this code was written), it uses a data structure that holds
all extracted information (full-text and metadata) in RAM and only
writes it to disk at the end of the entire crawl. Clearly, this doesn't
scale even remotely.

Sesame 2 has progressed a lot since then and now contains a stable
disk-based RDF store. We can update the example code to use this native
store to improve scalability. However, I would recommend that you look
into the CrawlerHandler API (see the tutorials on
aperture.sourceforge.net), not least because of your next question. As
Aperture focuses on providing middleware components that handle
crawling and extraction tasks, I'm hesitant to make the examples too
complex.

Regards,

Chris

--
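
For readers who want to try the NativeStore switch discussed above,
here is a minimal sketch of opening a disk-based Sesame 2 repository.
The class and method names (NativeStoreSetup, openStore) and the data
directory are illustrative, not part of Aperture; the openrdf imports
are the standard Sesame 2 API. The .dat files mentioned in the thread
are the store's on-disk statement files.

    import java.io.File;

    import org.openrdf.repository.Repository;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.sail.nativerdf.NativeStore;

    public class NativeStoreSetup {
        // Opens (or creates) a NativeStore in the given data directory.
        // The store keeps its statements in .dat/index files on disk,
        // so the full model never has to be held in RAM.
        public static Repository openStore(File dataDir) throws Exception {
            Repository repository = new SailRepository(new NativeStore(dataDir));
            repository.initialize();
            return repository;
        }
    }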
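
On question (b), splitting a large export into individual files: one
way to sketch this is to export each context (named graph) to its own
file, assuming the crawl stores each document's statements in a
separate context. That assumption, the class name, and the file-naming
scheme are illustrative; getContextIDs() and export() are standard
Sesame 2 RepositoryConnection calls.

    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.openrdf.model.Resource;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.RepositoryResult;
    import org.openrdf.rio.trix.TriXWriter;

    public class SplitExport {
        // Writes each context to its own TriX file, so no single
        // output file has to hold the entire repository.
        public static void exportPerContext(Repository repository) throws Exception {
            RepositoryConnection con = repository.getConnection();
            try {
                RepositoryResult<Resource> contexts = con.getContextIDs();
                int fileNr = 0;
                while (contexts.hasNext()) {
                    Resource context = contexts.next();
                    OutputStream out = new FileOutputStream("export-" + fileNr++ + ".trix");
                    try {
                        con.export(new TriXWriter(out), context);
                    } finally {
                        out.close();
                    }
                }
            } finally {
                con.close();
            }
        }
    }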
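
The CrawlerHandler recommendation boils down to processing each
DataObject as it is crawled instead of accumulating everything in RAM.
A rough sketch of that idea follows; it deliberately omits the
interface's other callbacks (crawl start/stop, object removal, etc.),
and the method names shown should be checked against the tutorials on
aperture.sourceforge.net before use.

    import org.semanticdesktop.aperture.accessor.DataObject;
    import org.semanticdesktop.aperture.crawler.Crawler;

    // Sketch only: a real handler implements the full CrawlerHandler
    // interface; see the Aperture tutorials for the exact signatures.
    public class StreamingCrawlerHandler {

        public void objectNew(Crawler crawler, DataObject object) {
            store(object);
        }

        public void objectChanged(Crawler crawler, DataObject object) {
            store(object);
        }

        public void objectNotModified(Crawler crawler, String url) {
            // Incremental crawl: the source is unchanged, nothing to do.
        }

        private void store(DataObject object) {
            try {
                // Write object.getMetadata() to the disk-based repository
                // here, rather than keeping it in an in-memory structure.
            } finally {
                // Release the object (and any streams it holds) right away
                // so memory use stays flat across thousands of files.
                object.dispose();
            }
        }
    }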