There was indeed a bug specific to using Aperture to harvest full text through SolrMarc.  The code that sanitizes the full text was running too late -- the text was being cleaned up AFTER XML parsing had taken place, so if it contained characters that broke XML parsing, the code failed before cleanup could occur.  I've fixed this bug in the SolrMarc trunk.  You have three options to move forward:

1.) Switch from Aperture to Tika.  The bug does not affect harvesting with Tika, and Tika is probably the way of the future since the Aperture project has dried up.  It's really easy -- just download the Tika .jar file from here (http://tika.apache.org/) and point to it from web/conf/fulltext.ini (the VuFind 1.4 fulltext.ini has a section for Tika -- obviously you'll also need the 1.4 version of SolrMarc to go with this; you can continue to use Aperture for harvesting through PHP until you upgrade to 1.4).

2.) Get the latest version of getFulltext.bsh from here and use that: http://code.google.com/p/solrmarc/source/browse/trunk/examples/GenericVuFind/index_scripts/getFulltext.bsh (it fixes the Aperture bug).

3.) Compile a fresh copy of SolrMarc from the latest trunk.  I can post the .jar files for you somewhere if you want.
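
For anyone curious about the ordering problem described at the top of this message, here is a minimal sketch of the "sanitize first, parse second" idea.  It is not the actual code from getFulltext.bsh; the class and method names (SanitizeBeforeParse, stripInvalidXmlChars, parse) and the sample string are made up for illustration, and the character ranges are the ones XML 1.0 allows.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration only; not taken from SolrMarc or getFulltext.bsh.
class SanitizeBeforeParse {

    // Drop every code point that XML 1.0 does not allow in a document.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (legal) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // Parse an XML string with the JDK's default DOM parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample: U+001F hidden inside otherwise valid XML.
        String raw = "<doc>Copyright\u001F 2013</doc>";

        // Parsing first fails, so any cleanup placed after this point never runs.
        try {
            parse(raw);
            System.out.println("raw text parsed");
        } catch (SAXException e) {
            System.out.println("raw text rejected by the parser");
        }

        // Cleaning first lets the same text through.
        Document doc = parse(stripInvalidXmlChars(raw));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The point is only the order of operations: the cleanup has to see the raw text before the XML parser does.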

Thanks for catching the problem; let me know if I can be of further assistance.

- Demian

From: Nathan Tallman [ntallman@gmail.com]
Sent: Thursday, February 07, 2013 3:43 PM
To: Demian Katz
Cc: Steven P. Johnson; vufind-tech@lists.sourceforge.net
Subject: Re: [VuFind-Tech] Problems in importing fulltext from pdfs

What's really odd to me is that this (and other similarly offending PDFs) work fine when they're scraped using the website indexing routine. It's only when they are imported from a MARC 856u that this problem occurs. Scraping the full text of an individual PDF using Aperture from the command line works without a hitch too.

Look at the first search result here and you'll see what I mean: http://americanjewisharchives.org/catalog/Web/Results?lookfor=dawn+in+the+west&submit=Find


On Thu, Feb 7, 2013 at 3:39 PM, Nathan Tallman <ntallman@gmail.com> wrote:
Upgrading to the new SolrMarc didn't make a difference for me. You can grab a copy of the PDF that is offending my server here: http://americanjewisharchives.org/media/docs/marcus/dawnInTheWest.pdf 

The problem character (U+001F, the unit separator control character) is an invisible spacer. It's in the copyright statement. Also, unfortunately, I can't figure out how to edit the PDF in InDesign or anything else to eliminate the character. Retyping that part doesn't get rid of it.

Nathan


On Thu, Feb 7, 2013 at 3:31 PM, Demian Katz <demian.katz@villanova.edu> wrote:
Have you tried upgrading to the new version of SolrMarc included in VuFind 1.4?  You should be able to upgrade this without having to upgrade your entire VuFind installation -- just download the 1.4 release and copy the .jar files from the import directory.  (I'd make a backup first, just to be on the safe side).  This includes improved handling of weird characters in the full text processing, so it might help.

If that's still no good, can you figure out exactly which record in the collection is causing the problem?  Would it be possible for you to share the offending PDF so I can try to reproduce the problem?  (Obviously, I understand if copyright restrictions prevent this).

- Demian

> -----Original Message-----
> From: Steven P. Johnson [mailto:steve@arlis.org]
> Sent: Thursday, February 07, 2013 3:06 PM
> To: vufind-tech@lists.sourceforge.net
> Subject: [VuFind-Tech] Problems in importing fulltext from pdfs
>
> With VuFind 1.3, I have not been able to get full text indexing working with
> either Aperture or Tika. Import fails after a dozen or so records with the
> illegal character message mentioned by others on this list.
>
> The  "wonders of google" fix on the page at
> http://www.exampledepot.com/egs/java.io/ReadFromUTF8.html is no longer
> available at that address.  I wondered if anyone has summarized that fix, and
> others required to get full text indexing working in VuFind 1.3.
>
> My network admin has tried the patches posted to the list, without success.
>
> Our VuFind collection contains 566 records at present.  Each record links to
> one or more pdfs in the ARLIS document store.
>
>
> Steve Johnson
> Systems Coordinator/Management Team
> Alaska Resources Library & Information Services (ARLIS)
> Anchorage, AK 99508
> 907 786-7661
> steve@arlis.org
> Steven_johnson@fws.gov
> ________________________________________
> From: Demian Katz [demian.katz@villanova.edu]
> Sent: Friday, September 07, 2012 8:06 AM
> To: Ronan McHugh; guenter.hipler@unibas.ch; vufind-tech@lists.sourceforge.net
> Subject: Re: [VuFind-Tech] Problems in importing fulltext from pdfs
>
> Wonderful!  Oh, the wonders of Google. :-)
>
> - Demian
> ________________________________
> From: Ronan McHugh [rmchugh@nli.ie]
> Sent: Friday, September 07, 2012 12:05 PM
> To: Demian Katz; guenter.hipler@unibas.ch; vufind-tech@lists.sourceforge.net
> Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
>
> Yes.
>
> ________________________________
> From: Demian Katz [mailto:demian.katz@villanova.edu]
> Sent: 07 September 2012 17:00
> To: Ronan McHugh; guenter.hipler@unibas.ch; vufind-tech@lists.sourceforge.net
> Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
>
> Is it possible the solution is this simple?
>
> http://www.exampledepot.com/egs/java.io/ReadFromUTF8.html
>
> - Demian
> ________________________________
> From: Ronan McHugh [rmchugh@nli.ie]
> Sent: Friday, September 07, 2012 11:55 AM
> To: Demian Katz; guenter.hipler@unibas.ch; vufind-tech@lists.sourceforge.net
> Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
> Hi,
>
> Tika takes an encoding argument. We set this to UTF8 and read the output into
> a StringBuilder from stdout. If we run Tika directly from the command line and
> save the contents of the cmd line to file (using the > operator) this encodes
> the text correctly. Therefore I tried to replicate this in the bsh script.
> Since Tika doesn't take an argument to save output to file, I used the >
> operator in the script also. However, for whatever reason, doing this didn't
> work and nothing was saved to file. Therefore, I had to read directly from
> stdout to the String.
>
> So, our guess is that something is going wrong between Tika's output at stdout
> and the Buffered Reader reading this in. However, we don't know how to fix it
> at this stage. I attempted to use a BufferedWriter to write to file in UTF8
> but this only made matters worse.
>
> Best,
>
> Ronan
>
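
A common cause of the symptom described above (curly apostrophes and diacritics garbled between a child process's stdout and a BufferedReader) is an InputStreamReader built without an explicit charset, which falls back to the platform default.  The sketch below shows the usual remedy of naming UTF-8 explicitly.  It is an assumption that this matches the fix on the exampledepot page linked in this thread, since that page is no longer available; the class and method names (ReadUtf8Output, readAll) are hypothetical, and a byte array stands in for Tika's output so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from the bsh script in JIRA.
class ReadUtf8Output {

    // Read a whole stream into a String, decoding the bytes as UTF-8
    // instead of relying on the platform default encoding.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the extractor's stdout: UTF-8 bytes containing a curly
        // apostrophe, an accented letter, and a long dash.
        String expected = "It\u2019s a caf\u00E9 \u2014 really";
        byte[] simulated = expected.getBytes(StandardCharsets.UTF_8);

        String text = readAll(new ByteArrayInputStream(simulated));
        System.out.println(text.equals(expected));
    }
}
```

With a real child process, the stream passed to readAll would be the one returned by Process.getInputStream().  StandardCharsets and try-with-resources need Java 7 or later; on older runtimes the charset name "UTF-8" can be passed to InputStreamReader as a string instead.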
> ________________________________
> From: Demian Katz [mailto:demian.katz@villanova.edu]
> Sent: 07 September 2012 16:33
> To: Ronan McHugh; guenter.hipler@unibas.ch; vufind-tech@lists.sourceforge.net
> Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
>
> Is Tika incorrectly interpreting characters, or is the problem simply that
> your documents are in a variety of encodings and need to be normalized?  Does
> Tika include a parameter for normalizing character sets?  If not, it might be
> possible to add an intermediate step to detect character sets and normalize
> everything to UTF-8.
>
> - Demian
> ________________________________
> From: Ronan McHugh [rmchugh@nli.ie]
> Sent: Friday, September 07, 2012 11:28 AM
> To: guenter.hipler@unibas.ch; vufind-tech@lists.sourceforge.net
> Subject: Re: [VuFind-Tech] Problems in importing fulltext from pdfs
> Hi,
>
> I've created a new bsh script which uses Tika to parse input files and
> uploaded it to JIRA. We have noticed that there are problems with encoding of
> some characters such as curly apostrophes, long hyphens and diacritics.
> However, we can't see any obvious way to fix this. If anyone had any ideas, it
> would be much appreciated.
>
> Thanks,
>
> Ronan
>
>
> ------------------------------------------------------------------------------
> Free Next-Gen Firewall Hardware Offer
> Buy your Sophos next-gen firewall before the end March 2013
> and get the hardware for free! Learn more.
> http://p.sf.net/sfu/sophos-d2d-feb
> _______________________________________________
> Vufind-tech mailing list
> Vufind-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/vufind-tech
