From: Antoni M. <ant...@gm...> - 2009-09-26 22:50:23
I looked at the ThreadedSubCrawler wrapper issue and brainstormed a little
about usage scenarios.

1. A normal, well-behaving subcrawler:
   - it is started on a different thread
   - it returns some data objects, where either
     1.a. all of them are processed with extractors and disposed, or
     1.b. they aren't processed with anything, but are disposed
   - it reads the stream to the end
   - the thread finishes
   - everything is OK

2. A one-level halting scenario:
   - the subcrawler is started on a different thread
   - it returns two data objects; they aren't processed with anything, but are closed
   - then it stops reading
   - the thread finishes
   - a SubCrawlingAbortedException is thrown

3. A two-level halting scenario:
   - subcrawler A is started on thread A
   - subcrawler A returns a data object (DO) A1, which is processed and disposed
   - subcrawler A returns DO A2, which is detected as an archive
   - subcrawler B is started on thread B for DO A2
   - subcrawler B returns a DO B1, which is processed and closed
   - then subcrawler B halts
   - the halting of subcrawler B is detected and thread B is cancelled
   - DO A2 is properly disposed
   - subcrawler A still correctly subcrawls the stream and returns DO A3, which is processed and closed
   - thread A stops
   - everything is OK

The issue is that this setup needs to work for subcrawlers at all levels: a broken zip can come up inside a normal zip attached to an email stored in a .eml file, and all other files must still be crawled correctly.

BTW, I'd need an example of a real broken zip file that causes an infinite loop inside the zip subcrawler. I'll write mock subcrawlers that simulate the three scenarios above, but a real broken zip would be helpful for some proper integration tests.

What do you think?

--
Antoni Myłka
ant...@gm...
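For what it's worth, here is a minimal sketch of how the wrapper could detect scenario 2 (a subcrawler that halts mid-stream). All names here are illustrative assumptions, not the actual Aperture API: the worker hands data objects to the wrapper through a queue, and if nothing arrives within a timeout the wrapper cancels the worker thread and throws the abort exception, leaving the caller free to continue with the enclosing stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the real Aperture classes.
public class HaltingSubCrawlerDemo {

    static class SubCrawlingAbortedException extends Exception {
        SubCrawlingAbortedException(String msg) { super(msg); }
    }

    // Poison pill marking a clean end of the subcrawled stream.
    static final String END = "<end>";

    /**
     * Runs a subcrawler body on its own thread and collects the data
     * objects it hands over through the queue. If neither an object nor
     * the END marker arrives within the timeout, the subcrawler is
     * considered halted: its thread is interrupted and an exception is
     * thrown (scenario 2). A clean END gives scenario 1.
     */
    static List<String> runSubCrawler(Runnable crawlerBody,
                                      BlockingQueue<String> queue,
                                      long timeoutMs)
            throws SubCrawlingAbortedException, InterruptedException {
        Thread worker = new Thread(crawlerBody, "subcrawler");
        worker.start();
        List<String> results = new ArrayList<>();
        while (true) {
            String item = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
            if (item == null) {              // no progress: subcrawler halted
                worker.interrupt();          // cancel the worker thread
                throw new SubCrawlingAbortedException(
                        "no data object within " + timeoutMs + " ms");
            }
            if (END.equals(item)) {          // clean finish
                worker.join();
                return results;
            }
            results.add(item);               // processing/disposal would happen here
        }
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Scenario 2: two data objects, then the crawler stops reading forever.
        Runnable halting = () -> {
            try {
                queue.put("DO-1");
                queue.put("DO-2");
                Thread.sleep(Long.MAX_VALUE);  // simulate the hang
            } catch (InterruptedException e) {
                // cancelled by the wrapper: just exit
            }
        };
        try {
            runSubCrawler(halting, queue, 200);
            System.out.println("finished normally");
        } catch (SubCrawlingAbortedException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

Scenario 3 would then be the same mechanism applied recursively: the wrapper around subcrawler B aborts and cleans up DO A2, while the loop around subcrawler A keeps polling its own queue undisturbed.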