From: Florent J. <flo...@un...> - 2007-08-14 09:07:06
Hi,

It might be related to this bug:
http://sourceforge.net/tracker/index.php?func=detail&aid=1745133&group_id=78314&atid=552832
but I'm not sure. A five-minute test with PDFBox 0.7.4dev didn't give
me anything better.

I'm still trying to find more crash PDFs, and I'll let you know what I
find.

Florent

> > Florent Jochaud wrote:
> > > Hi Aperture,
> > >
> > > I keep getting OutOfMemoryErrors during indexing. At the
> > > beginning, I thought it was due to my code, but it seems to be
> > > due to other libs. I already found one PDF which crashes PDFBox
> > > every time. Lurking on the POI mailing list gives me the
> > > impression that POI might also throw OutOfMemoryErrors on some
> > > files (still looking for crash files).
> > >
> > > My question is: have you (Aperture) already encountered such a
> > > situation? If yes, how do you manage it? Do you simply let the
> > > crawl crash? Do you catch something to end the crawl cleanly?
> > >
> > > Do you have any good ideas for making Aperture completely robust
> > > against file-parsing crashes?
> >
> > We do keep getting OutOfMemoryErrors when crawling large
> > collections of PDFs. I haven't tried it with other file types. It
> > seems that PDFBox does some bad things when parsing a document. It
> > is difficult to debug, because it happens sometimes and when you
> > try to repeat it, it doesn't happen :). I guess this issue will
> > have to bubble up on our todo list. It would also be nice if you
> > could post that PDF somewhere if it's not proprietary, e.g. as an
> > attachment to a bug report in the SourceForge bug tracker:
> >
> > http://sourceforge.net/tracker/?group_id=150969&atid=779500
>
> Sadly, this one is a proprietary file with a name on it... I can't
> post it. But I will run some tests on lots of files to find others
> which make the system crash (I already found an ugly/hacky way to
> keep the crawler alive by catching the error :)), so I will let it
> run on some big packs of files and build a list. Hopefully I'll find
> some free PDFs I can send for debugging.
>
> > You could try to pass an AccessData instance to the crawler, run
> > the crawler, wait for it to crash, remove the obstacle (a faulty
> > file) and rerun crawling with the same AccessData instance. It
> > should use the AccessData to start from the place where it
> > stopped.
>
> My problem with this approach is that it sometimes crashes my Lucene
> index if it's not properly closed... Moreover, it might be hard to
> debug in a production environment, and we can't ask people to remove
> their files...
>
> > Apart from that there is little that can be done on the user side
> > for the time being. There is much to do to improve Aperture itself
> > though. I've been working on a complete overhaul of the Aperture
> > ontologies (DATA and DATASOURCE don't look the same anymore).
> > Probably many things got broken in the process. There will be a
> > merge in the near future (say within a week). After that there
> > will be time for testing and bugfixing.
>
> I followed the changes on the NIE branch a bit, and I'm looking
> forward to using it. I'll be happy to debug then :)
>
> Florent
>
> > Antoni Mylka
> > ant...@df...
> >
> > -------------------------------------------------------------------------
> > This SF.net email is sponsored by: Splunk Inc.
> > Still grepping through log files to find problems? Stop.
> > Now Search log events and configuration files using AJAX and a browser.
> > Download your FREE copy of Splunk now >> http://get.splunk.com/
> > _______________________________________________
> > Aperture-devel mailing list
> > Ape...@li...
> > https://lists.sourceforge.net/lists/listinfo/aperture-devel
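
For illustration only: the "catch the error to keep the crawler alive"
workaround mentioned in the thread could look roughly like the sketch
below. The names Extractor, CrawlReport and crawl() are hypothetical
and are not part of Aperture, PDFBox or POI; the simulated extractor
merely stands in for a parser that runs out of memory on one file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types exist in Aperture, PDFBox or POI.
class RobustCrawl {

    /** Stand-in for whatever parses a single file (e.g. a PDF text extractor). */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    /** Collects the files that were processed and the files that crashed. */
    static final class CrawlReport {
        final List<String> indexed = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
    }

    /** Processes every path, isolating failures so one bad file cannot stop the run. */
    static CrawlReport crawl(List<String> paths, Extractor extractor) {
        CrawlReport report = new CrawlReport();
        for (String path : paths) {
            try {
                extractor.extract(path);
                report.indexed.add(path);
            } catch (OutOfMemoryError e) {
                // The hacky part: swallow the Error, note the crash file, move on.
                report.failed.add(path);
            } catch (Exception e) {
                // Ordinary parse failures are recorded the same way.
                report.failed.add(path);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Simulated extractor: "bad.pdf" plays the role of the crash file.
        Extractor simulated = path -> {
            if (path.equals("bad.pdf")) {
                throw new OutOfMemoryError("simulated");
            }
            return "text of " + path;
        };
        CrawlReport report = crawl(Arrays.asList("a.pdf", "bad.pdf", "b.pdf"), simulated);
        System.out.println("indexed: " + report.indexed); // indexed: [a.pdf, b.pdf]
        System.out.println("failed:  " + report.failed);  // failed:  [bad.pdf]
    }
}
```

The list of failed paths doubles as the "crash file" list discussed
above. Note that the JVM gives no guarantees about its state after an
OutOfMemoryError, so this remains a workaround; any index writer
should still be closed in a finally block, and running extraction in a
separate process is likely the more robust option.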