From: Nathan T. <nta...@gm...> - 2013-02-07 20:40:04
Upgrading to the new SolrMarc didn't make a difference for me.

You can grab a copy of the PDF that is offending my server here:
http://americanjewisharchives.org/media/docs/marcus/dawnInTheWest.pdf

The problem character (U+001F, the unit separator control character) is an invisible spacer. It's in the copyright statement.

Also, unfortunately, I can't figure out how to edit the PDF in InDesign or anything else to eliminate the character. Retyping that part doesn't get rid of it.

Nathan

On Thu, Feb 7, 2013 at 3:31 PM, Demian Katz <dem...@vi...> wrote:

> Have you tried upgrading to the new version of SolrMarc included in VuFind 1.4? You should be able to upgrade this without having to upgrade your entire VuFind installation -- just download the 1.4 release and copy the .jar files from the import directory. (I'd make a backup first, just to be on the safe side). This includes improved handling of weird characters in the full text processing, so it might help.
>
> If that's still no good, can you figure out exactly which record in the collection is causing the problem? Would it be possible for you to share the offending PDF so I can try to reproduce the problem? (Obviously, I understand if copyright restrictions prevent this).
>
> - Demian
>
> > -----Original Message-----
> > From: Steven P. Johnson [mailto:st...@ar...]
> > Sent: Thursday, February 07, 2013 3:06 PM
> > To: vuf...@li...
> > Subject: [VuFind-Tech] Problems in importing fulltext from pdfs
> >
> > With VuFind 1.3, I have not been able to get full text indexing working with either Aperture or Tika. Import fails after a dozen or so records with the illegal character message mentioned by others on this list.
> >
> > The "wonders of google" fix on the page at http://www.exampledepot.com/egs/java.io/ReadFromUTF8.html is no longer available at that address. I wondered if anyone has summarized that fix, and any others required to get full text indexing working in VuFind 1.3.
> >
> > My network admin has tried the patches posted to the list, without success.
> >
> > Our VuFind collection contains 566 records at present. Each record links to one or more PDFs in the ARLIS document store.
> >
> >
> > Steve Johnson
> > Systems Coordinator/Management Team
> > Alaska Resources Library & Information Services (ARLIS)
> > Anchorage, AK 99508
> > 907 786-7661
> > st...@ar...
> > Ste...@fw...
> > ________________________________________
> > From: Demian Katz [dem...@vi...]
> > Sent: Friday, September 07, 2012 8:06 AM
> > To: Ronan McHugh; gue...@un...; vuf...@li...
> > Subject: Re: [VuFind-Tech] Problems in importing fulltext from pdfs
> >
> > Wonderful! Oh, the wonders of Google. :-)
> >
> > - Demian
> > ________________________________
> > From: Ronan McHugh [rm...@nl...]
> > Sent: Friday, September 07, 2012 12:05 PM
> > To: Demian Katz; gue...@un...; vuf...@li...
> > Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
> >
> > Yes.
> >
> > ________________________________
> > From: Demian Katz [mailto:dem...@vi...]
> > Sent: 07 September 2012 17:00
> > To: Ronan McHugh; gue...@un...; vuf...@li...
> > Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
> >
> > Is it possible the solution is this simple?
> >
> > http://www.exampledepot.com/egs/java.io/ReadFromUTF8.html
> >
> > - Demian
> > ________________________________
> > From: Ronan McHugh [rm...@nl...]
> > Sent: Friday, September 07, 2012 11:55 AM
> > To: Demian Katz; gue...@un...; vuf...@li...
> > Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
> >
> > Hi,
> >
> > Tika takes an encoding argument. We set this to UTF-8 and read the output from stdout into a StringBuilder. If we run Tika directly from the command line and save the output to a file (using the > operator), the text is encoded correctly. Therefore I tried to replicate this in the bsh script.
> > Since Tika doesn't take an argument to save output to a file, I used the > operator in the script as well. However, for whatever reason, this didn't work and nothing was saved to the file. Therefore, I had to read directly from stdout into the String.
> >
> > So, our guess is that something is going wrong between Tika's output on stdout and the BufferedReader reading it in. However, we don't know how to fix it at this stage. I attempted to use a BufferedWriter to write to a file in UTF-8, but this only made matters worse.
> >
> > Best,
> >
> > Ronan
> >
> > ________________________________
> > From: Demian Katz [mailto:dem...@vi...]
> > Sent: 07 September 2012 16:33
> > To: Ronan McHugh; gue...@un...; vuf...@li...
> > Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
> >
> > Is Tika incorrectly interpreting characters, or is the problem simply that your documents are in a variety of encodings and need to be normalized? Does Tika include a parameter for normalizing character sets? If not, it might be possible to add an intermediate step to detect character sets and normalize everything to UTF-8.
> >
> > - Demian
> > ________________________________
> > From: Ronan McHugh [rm...@nl...]
> > Sent: Friday, September 07, 2012 11:28 AM
> > To: gue...@un...; vuf...@li...
> > Subject: Re: [VuFind-Tech] Problems in importing fulltext from pdfs
> >
> > Hi,
> >
> > I've created a new bsh script which uses Tika to parse input files and uploaded it to JIRA. We have noticed that there are problems with the encoding of some characters such as curly apostrophes, long hyphens and diacritics. However, we can't see any obvious way to fix this. If anyone has any ideas, it would be much appreciated.
> >
> > Thanks,
> >
> > Ronan
> >
> > _______________________________________________
> > Vufind-tech mailing list
> > Vuf...@li...
> > https://lists.sourceforge.net/lists/listinfo/vufind-tech
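
For anyone reading this thread later, here is a minimal sketch of the two ideas discussed above: decoding an extractor's stdout explicitly as UTF-8 instead of relying on the platform default, and dropping characters such as U+001F before the text is indexed. This is not the fix from the exampledepot page (which is no longer online) and it is not SolrMarc's or VuFind's own code. The class and method names (`FullTextCleaner`, `readUtf8`, `stripIllegalXmlChars`) are hypothetical, and the filter assumes the "illegal character" message comes from characters outside the XML 1.0 character range, which may not be the whole story on a given installation.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of VuFind or SolrMarc.
class FullTextCleaner {

    // Read a byte stream (for example the stdout of an external extractor)
    // with an explicit UTF-8 decoder rather than the platform default charset.
    static String readUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Keep only code points allowed by the XML 1.0 Char production:
    // tab, LF, CR, U+0020..U+D7FF, U+E000..U+FFFD, U+10000..U+10FFFF.
    // Control characters such as U+001F are dropped.
    static String stripIllegalXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (legal) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes standing in for extractor output: a unit separator,
        // a copyright sign and a curly apostrophe, all encoded as UTF-8.
        byte[] sample = "Copyright\u001F \u00A9 2013 \u2019"
                .getBytes(StandardCharsets.UTF_8);
        String text = readUtf8(new ByteArrayInputStream(sample));
        System.out.println(stripIllegalXmlChars(text));
    }
}
```

In a bsh script that launches the extractor as a subprocess, the stream passed to `readUtf8` would presumably be the one returned by `Process.getInputStream()`. Whether stripping is preferable to replacing the character with a space depends on how the full text is used; the sketch simply drops it.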