From: Demian K. <dem...@vi...> - 2012-09-07 11:03:13
The only reason I mentioned Leechcrawler is that the existing Aperture team recommended it as a successor to Aperture... but it's entirely possible that Leechcrawler replicates Aperture functionality that VuFind doesn't need. If the stand-alone Tika app is good enough, I see no pressing reason to complicate things with extra tools.

- Demian

________________________________
From: Ronan McHugh [rm...@nl...]
Sent: Friday, September 07, 2012 6:57 AM
To: Demian Katz; vuf...@li...
Subject: RE: Problems in importing fulltext from pdfs

We were just discussing that – is there any specific reason for using Leechcrawler as a wrapper? Tika can be called directly from the command line similar to the way in which Aperture is called, so it doesn’t seem necessary in this case.

________________________________
From: Demian Katz [mailto:dem...@vi...]
Sent: 07 September 2012 11:56
To: Ronan McHugh; vuf...@li...
Subject: RE: Problems in importing fulltext from pdfs

Excellent, thanks! I'm also very glad to hear you're experimenting with Tika -- if that all works out it will be a valuable addition to VuFind. Have you looked at the Leechcrawler tool mentioned on http://vufind.org/jira/browse/VUFIND-600? (Of course, if that's not necessary to achieve the necessary functionality, using Tika alone would probably be even better).

- Demian

________________________________
From: Ronan McHugh [rm...@nl...]
Sent: Friday, September 07, 2012 6:53 AM
To: Demian Katz; vuf...@li...
Subject: RE: Problems in importing fulltext from pdfs

No problem at all. I’ll stick it on the JIRA ticket. We’re experimenting with using Tika instead at the moment which looks promising. Specifically, I’m trying to update getFullText to use either Aperture or Tika. I’ll let you know how it goes.

Ronan

________________________________
From: Demian Katz [mailto:dem...@vi...]
Sent: 07 September 2012 11:50
To: Ronan McHugh; vuf...@li...
Subject: RE: Problems in importing fulltext from pdfs

Makes sense. Still in the midst of the last push to finish up 2.0beta, but I'll try to take care of this when I get back to JIRA catchup. If there's a way to also post a sample PDF that causes the problem, that would be very helpful too for testing purposes... though I imagine there's a good chance that copyright issues make that impossible.

- Demian

________________________________
From: Ronan McHugh [rm...@nl...]
Sent: Friday, September 07, 2012 4:22 AM
To: Demian Katz; vuf...@li...
Subject: RE: Problems in importing fulltext from pdfs

Sure, I have created http://vufind.org/jira/browse/VUFIND-644 with the patch attached. I reckon the try/catch bit at least should probably be committed when you have time since it will prevent the indexer from crashing out on encountering illegal characters.

Best,
Ronan

________________________________
From: Demian Katz [mailto:dem...@vi...]
Sent: 06 September 2012 20:53
To: Ronan McHugh; vuf...@li...
Subject: RE: Problems in importing fulltext from pdfs

I worked with Nathan a bit more on the HTML problem he had encountered, and we reached a similar solution – strip out invalid characters from the stream prior to parsing. It’s too bad there isn’t a mode to simply ignore these weird control characters… but I think this is the best we can do in the meantime! It will be interesting to see if Tika is more robust if we eventually adopt that in place of Aperture (though that project is not high on my priority list right now, I have to confess). In any case, it might be worth posting this patch on JIRA for future reference.

- Demian

________________________________
From: Ronan McHugh [mailto:rm...@nl...]
Sent: Thursday, September 06, 2012 12:00 PM
To: vuf...@li...
Subject: [VuFind-Tech] Problems in importing fulltext from pdfs

Hi all,

We are experimenting at present with importing large pdf documents into our index and have run into a problem with bad character encoding.
It seems that Aperture sometimes outputs an invalid character, 0xc (the form feed control character), which causes the parser to break, in turn crashing the indexer and preventing the record from being indexed at all. This error was noticed previously by Nathan in relation to an HTML document (http://vufind-tech.2307425.n4.nabble.com/Re-VuFind-General-Aperture-Issue-td4411904.html), but it seems the problem wasn’t solved back then.

To solve the problem we have modified the getFullText bsh script as follows:

1) We have inserted an additional method to clean this specific character out of the file before it is sent to be parsed. There are potentially other characters which should be removed, but we decided to err on the side of caution, as using a more zealous replaceAll tended to break the document in a myriad of other ways. Anyone who runs into problems with other characters should simply expand the string being replaced at this point.

2) The parse call has been wrapped in a try / catch block, so that an exception thrown during parsing no longer prevents the rest of the record from being indexed.

Attached is our patch; any questions or comments are more than welcome.

Best,
Ronan

Visit our free exhibitions <http://www.nli.ie/en/udlist/current-exhibitions.aspx>

The contents of this e-mail (including attachments) are private and confidential and may also be subject to legal privilege. It is intended only for the use of the addressee. If you are not the addressee, or the person responsible for delivering it to the addressee, you may not copy or deliver this e-mail or any attachments to anyone else or make any use of its contents; you should not read any part of this e-mail or any attachments. Unauthorised disclosure or communication or other use of the contents of this e-mail or any part thereof may be prohibited by law and may constitute a criminal offence.
If you receive this e-mail by mistake please notify the system manager @ 6030219.
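
For readers who cannot get at the attachment: the patch itself is not reproduced in this archive, so what follows is only a rough sketch of the two changes Ronan describes in the original message (strip the 0xc character before parsing, and wrap the parse call in try / catch). It is written as plain Java, which BeanShell also accepts. The class name FullTextCleaner, the method names, and the stand-in parser are all hypothetical and are not taken from the real getFullText script.

```java
import java.util.function.Function;

public class FullTextCleaner {

    // Step 1: remove only the form feed character (0x0C), mirroring the
    // cautious approach in the message. Other characters can be added here
    // if they turn out to cause trouble. (Hypothetical helper name.)
    static String stripFormFeed(String text) {
        return text.replace("\f", "");
    }

    // Step 2: run the parser inside try / catch so that a parsing failure
    // yields an empty full-text value instead of aborting the whole record.
    // The parser is passed in as a function because the real parser call
    // used by the script is not shown in the thread.
    static String safeParse(String rawText, Function<String, String> parser) {
        try {
            return parser.apply(stripFormFeed(rawText));
        } catch (RuntimeException e) {
            System.err.println("Full text parsing failed, skipping full text: "
                    + e.getMessage());
            return "";
        }
    }

    public static void main(String[] args) {
        // Stand-in parser (an assumption for this demo): rejects form feeds,
        // otherwise returns the trimmed text.
        Function<String, String> strictParser = s -> {
            if (s.indexOf('\f') >= 0) {
                throw new IllegalArgumentException("invalid character 0xc");
            }
            return s.trim();
        };

        // The form feed is removed before the parser ever sees it.
        System.out.println(safeParse("page one \fpage two", strictParser));

        // A parser that always fails no longer stops processing.
        Function<String, String> brokenParser = s -> {
            throw new IllegalStateException("parser error");
        };
        System.out.println("[" + safeParse("some text", brokenParser) + "]");
    }
}
```

Returning an empty string on failure is one possible choice; it means the record is still indexed, just without full text, which matches the behaviour described in point 2.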