[Aperture-devel] xls extract; Trix; crawling


I just came across Aperture. It is a program I want to use. I read through the docs (which are great). I have a few questions about things I either missed or could not find.

My basic goal seems to align with the project: use Aperture as an embedded program (either Java or Ruby) or create a command line app (eventually a GUI) that ingests multiple MIME types from a specified source and outputs text/XML to be indexed and viewed. There are other solutions out there, but none put together as well as this project seems to be. I would appreciate advice on the following:

1.  When I use the Aperture File System Crawler GUI (aperture-2006.1-alpha-3.zip) on a directory that contains xls files, the Trix output file does not seem to have all the data from each of the rows (fullText xml tag). It has some text but not the data from the rows and columns. I have checked "Determine MIME type" and "Extract document text and metadata".

It prints the following message(s) in the startup window.

INFO: regular POI-based processing failed, falling back to heuristic string extraction for file:/C:/eclipse/workspace/test/28PyrmdNE2.xls
Jan 19, 2007 12:03:05 PM org.semanticdesktop.aperture.extractor.util.PoiUtil extractAll

2.  Once the data has been extracted to XML, can the XML document be rendered/viewed the same as the original it was extracted from? I would like to keep the single XML extract around and not the original. I will eventually parse out the XML to individual files, but that is later on. I can keep both the original and the XML extract, but if I do not need to then I will not. I was told that I should be able to view the images and formulas (or anything else from the original document) if I have the correct XML plugins/libs in the XML extracted file (MathXML, ...).

3.  The File Inspector GUI allows you to change the type of metadata output. If I pick a format at the beginning, can I change it down the road (i.e. I decide on Trix now but want to change to Turtle or another serialization format later)? How hard would it be to convert from one to the other?

4.  I tried to crawl my test directory, which had 1.2 GB of data in it, 4400 files (a mixture of xls, pdf, doc, odt, xml, ps, ppt, rtf). While it started out well, it ran out of Java memory at file 78. I increased the JVM heap to start at 512 MB and grow to 1 GB. It looks like it made it to the last file ("Crawling completed, saving results...") but I get a stack trace in the startup window with an OutOfMemory error. The Trix output file is 86.2 MB. It does not look like it finished correctly. It only took 15 minutes to get to the completed message, though.

Is there any information on how much it can handle (number of docs in one crawl, types, sizes, memory, ...)? Pretty much any performance-related information would help.

5.  Has anyone looked into multi-threading this?

6.  Is there a way/file to configure the metadata that is pulled out?

I know that is a lot of questions; hopefully someone has the time and some answers.

If I am going to be using this, I would be happy to help in any way once I get up to speed and get going.

Thanks
Kevin
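
To make question 3 concrete, here is a rough sketch of the kind of conversion I have in mind: reading a Trix document and writing each triple out as an N-Triples line (N-Triples is a subset of Turtle). It uses only the Python standard library and is not Aperture code. The namespace is the one from the TriX specification, and I am assuming Aperture's output uses it. The function names, the sample file URI and the fullText property URI are made up for illustration, and graph names are simply dropped.

```python
import xml.etree.ElementTree as ET

# Namespace from the TriX specification; assumed to match Aperture's output.
TRIX_NS = "{http://www.w3.org/2004/03/trix/trix-1/}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def escape(text):
    """Escape the characters that N-Triples string literals require."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def term_to_ntriples(elem):
    """Turn one TriX term element into its N-Triples form."""
    tag = elem.tag.replace(TRIX_NS, "")
    text = elem.text or ""
    if tag == "uri":
        return "<%s>" % text.strip()
    if tag == "id":  # blank node
        return "_:%s" % text.strip()
    if tag == "plainLiteral":
        lang = elem.get(XML_LANG)
        return '"%s"' % escape(text) + ("@" + lang if lang else "")
    if tag == "typedLiteral":
        return '"%s"^^<%s>' % (escape(text), elem.get("datatype"))
    raise ValueError("unexpected TriX element: " + tag)


def trix_to_ntriples(trix_xml):
    """Return one N-Triples line per <triple>; graph names are ignored."""
    root = ET.fromstring(trix_xml)
    lines = []
    for triple in root.iter(TRIX_NS + "triple"):
        subject, predicate, obj = list(triple)
        lines.append("%s %s %s ." % (term_to_ntriples(subject),
                                     term_to_ntriples(predicate),
                                     term_to_ntriples(obj)))
    return lines


# Hypothetical sample; the property URI is invented for this example.
SAMPLE = """<TriX xmlns="http://www.w3.org/2004/03/trix/trix-1/">
  <graph>
    <triple>
      <uri>file:/C:/test/example.xls</uri>
      <uri>http://example.org/hypothetical#fullText</uri>
      <plainLiteral>row 1 col A</plainLiteral>
    </triple>
  </graph>
</TriX>"""

if __name__ == "__main__":
    for line in trix_to_ntriples(SAMPLE):
        print(line)
```

If something this simple is enough, then the choice of output format would not lock me in. For an 86 MB file I would probably need a streaming parser rather than loading the whole tree, which this sketch does not do.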