From: Kevin C. B. <kev...@ya...> - 2007-01-19 17:32:48
I just came across Aperture, and it is a program I want to use. I read through the docs (which are great), but I have a few questions about things I either missed or could not find.

My basic goal seems to align with the project: use Aperture as an embedded program (either Java or Ruby), or create a command-line app (eventually a GUI), that ingests multiple MIME types from a specified source and outputs text/XML to be indexed and viewed. There are other solutions out there, but none put together as well as this project seems to be. I would appreciate advice on the following:

1. When I use the Aperture File System Crawler GUI (aperture-2006.1-alpha-3.zip) on a directory that contains xls files, the TriX output file does not seem to have all the data from each of the rows (fullText XML tag). It has some text, but not the data from the rows and columns. I have checked "Determine MIME type" and "Extract document text and metadata".

It prints the following message(s) in the startup window:

INFO: regular POI-based processing failed, falling back to heuristic string extraction for file:/C:/eclipse/workspace/test/28PyrmdNE2.xls
Jan 19, 2007 12:03:05 PM org.semanticdesktop.aperture.extractor.util.PoiUtil extractAll

2. Once the data has been extracted to XML, can the XML document be rendered/viewed the same as the original it was extracted from? I would like to keep only the single XML extract and not the original. I will eventually split the XML into individual files, but that is later on. I can keep both the original and the XML extract, but if I do not need to then I will not. I was told that I should be able to view the images and formulas (or anything else from the original document) if the extracted XML file has the correct XML plugins/libs (MathML, ...).

3. The File Inspector GUI allows you to change the type of metadata output. If I pick a format at the beginning, can I change it down the road (i.e. I decide on TriX now but want to change to Turtle or another serialization format later)? How hard would it be to convert from one to the other?

4. I tried to crawl my test directory, which holds 1.2GB of data in 4400 files (a mixture of xls, pdf, doc, odt, xml, ps, ppt, rtf). It started out well but ran out of Java memory at file 78. I then increased the JVM heap to start at 512MB and grow to 1GB. It looks like it made it to the last file ("Crawling completed, saving results...") but I get a stack trace in the startup window with an OutOfMemory error. The TriX output file is 86.2MB, so it does not look like it finished correctly. It did only take 15 minutes to get to the completed message, though.

Is there any information on how much it can handle (# of docs in one crawl, types, sizes, memory, ...)? Pretty much any performance-related information would help.

5. Has anyone looked into multi-threading this?

6. Is there a way/file to configure which metadata is pulled out?

I know that is a lot of questions; hopefully someone has the time and some answers.

If I end up using this, I would be happy to help in any way once I get up to speed.

Thanks
Kevin
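
A note on the heap settings in question 4: a JVM that starts with 512MB and grows to 1GB is normally configured with the standard launcher flags -Xms512m and -Xmx1024m. That is an assumption about how the limits were set, since the exact launch command is not shown here. The small class below, HeapCheck, is a hypothetical helper and not part of Aperture. It is only a sketch of how one might confirm that the flags actually reached the JVM before starting a long crawl.

```java
// Hypothetical helper, not part of Aperture: reports the heap limits the
// running JVM actually received, using only the standard Runtime API.
public class HeapCheck {

    /** Converts a byte count to whole mebibytes (rounding down). */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Builds a one-line summary of the current JVM heap figures. */
    public static String describeHeap() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();          // upper bound the heap may grow to
        long committed = rt.totalMemory();  // heap currently reserved by the JVM
        long used = committed - rt.freeMemory();
        return "max=" + toMiB(max) + " MiB, committed=" + toMiB(committed)
                + " MiB, used=" + toMiB(used) + " MiB";
    }

    public static void main(String[] args) {
        System.out.println(describeHeap());
    }
}
```

Running it as `java -Xms512m -Xmx1024m HeapCheck` should report a maximum close to 1024 MiB; the exact figure can be somewhat lower depending on the JVM and garbage collector. If the maximum is far below that, the flags are probably not being passed to the process that runs the crawler.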