Reg. Skipping files/folders while crawling

2008-06-05
2013-05-13
  • Charanya Mohan
    2008-06-05

    As far as extraction is concerned, we allow only the following file types into the crawling process:
    MS Office files (.doc, .ppt, .xls), PDF files, HTML files, and plain-text files (.txt).
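
    The allow-list described above could be expressed as a small extension check that runs before a file is handed to any extractor. This is only a minimal sketch in plain Java (9 or later, for Set.of) and is not part of Aperture; the class name CrawlFileFilter, the method isAllowed, and the inclusion of .htm alongside .html are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not an Aperture class: decides by file extension
// whether a file should be passed on to text extraction.
final class CrawlFileFilter {

    // Extensions named in the post, plus "htm" as an assumed HTML variant.
    private static final Set<String> ALLOWED =
            Set.of("doc", "ppt", "xls", "pdf", "html", "htm", "txt");

    private CrawlFileFilter() {
    }

    // Returns true when the file name ends in one of the allowed extensions.
    // The comparison is case-insensitive; names without a dot are rejected.
    static boolean isAllowed(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ALLOWED.contains(extension);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"report.XLS", "slides.ppt", "archive.zip", "README"}) {
            System.out.println(name + " -> " + isAllowed(name));
        }
    }
}
```

    A check like this only narrows what enters the crawl; it does not help with allowed files that fail during extraction.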

    Although Aperture has extractors for all of the above file types, the problem is that crawling gets stuck while extracting the text.
       
    Crawling exception for a .xls file
    ----------------------------------
    -----exception 1----java.lang.ClassCastException: org.apache.poi.hssf.record.UnknownRecord
    [Info] org.semanticdesktop.aperture.extractor.util.PoiUtil regular POI-based processing failed, falling back to heuristic string extraction for file:...\training\Temp\filename.xls.... (size of the file:16 MB)
    [Warning] org.semanticdesktop.aperture.extractor.util.PoiUtil IOException while processing
    java.io.IOException: Resetting to invalid mark

    -----------------------------------------------------------------------
    For a .ppt file, while extracting the text, it prints the following messages and exits with ExitCode: completed (without completing the entire process):
    No core record found with ID 2 based on PersistPtr lookup
    No core record found with ID 3 based on PersistPtr lookup
    No core record found with ID 4 based on PersistPtr lookup
    No core record found with ID 5 based on PersistPtr lookup
    No core record found with ID 6 based on PersistPtr lookup
    No core record found with ID 7 based on PersistPtr lookup
    No core record found with ID 8 based on PersistPtr lookup
    No core record found with ID 9 based on PersistPtr lookup
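
    In both cases a single problematic file holds up or ends the whole crawl. One possible way to skip such a file, sketched here as an assumption rather than as anything Aperture provides, is to run each extraction under a time limit and fall back to a default value when it fails or takes too long. The names TimedExtraction and callWithTimeout are hypothetical, and the Callable stands in for whatever extraction call is actually used.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not an Aperture class: runs one task with a time
// limit so that a slow or failing file can be skipped.
final class TimedExtraction {

    private TimedExtraction() {
    }

    // Runs the task on a separate daemon thread. Returns its result, or the
    // fallback if the task throws or does not finish within the timeout.
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "timed-extraction");
            thread.setDaemon(true); // a stuck task must not keep the JVM alive
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that ignores interrupts keeps running.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself threw, for example a ClassCastException.
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

    A call might look like callWithTimeout(() -> extractText(file), 60, TimeUnit.SECONDS, ""), where extractText is a placeholder for the real extraction step. Note that cancelling only interrupts the worker thread, so an extractor that never checks for interruption would continue in the background.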

    Thanks in advance.