From: Jay R. <ja...@gm...> - 2013-03-07 20:20:04
For the record and anyone who runs into this in the future, here's how I'm
handling it:

1. Export MARC records from Horizon (marc8 is the only option).
2. Convert the marc8-encoded file into utf8-encoded MARCXML using
   yaz-marcdump. The literal <U+2019> type strings are carried through
   this step as-is.
3. Using sed, convert the <U+2019> type strings to the actual ’ character.
4. Import into VuFind.

If you need the data in marc21 format you can convert it back with
yaz-marcdump. So far every record I've checked looks good.

On Tue, Mar 5, 2013 at 9:15 AM, Jay Roos <ja...@gm...> wrote:

> I did some analysis on the dump file and found that there are 4,631 of
> these unicode characters found in 1,773 different tags across 1,723 of
> our marc records. My preference is to clean them up, but it seems like
> they occur more frequently in newer records, so I might need to do some
> custom processing if SirsiDynix can't offer a better solution.
>
> On Mon, Mar 4, 2013 at 5:55 PM, Tod Olson <to...@uc...> wrote:
>
>> Oh yes, this. IIRC, this happens sometimes in Horizon. "<U+2019>" will
>> be the ASCII representation of Unicode code point 2019 (always
>> hexadecimal), RIGHT SINGLE QUOTATION MARK. Maybe this was in an imported
>> record, maybe someone did a cut and paste into the cataloging client;
>> uncertain. That literal string will be in your Sybase or MSSQL tables
>> for that record. Maybe the cataloging client hides this from you, and I
>> seem to recall that HIP might hide it from you. But VuFind has no magic
>> to correct it for you.
>>
>> You might want to run your MARC dumps through something like
>> yaz-marcdump and then grep for the '<U+[0-9A-F]+>' pattern to see how
>> extensive a problem it is. Or do a pattern match in the bib and
>> bib_longtext tables. If it's only a few, maybe the catalogers can
>> correct these manually. Otherwise, you might want to write a batch
>> filter for your MARC records that would fix up this data in the
>> exported files.
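A minimal sketch of the conversion and survey steps discussed above. The file names are placeholders, the yaz-marcdump invocation is an assumption to be checked against your installed version, and the sed pattern targets the XML-escaped form the literal string takes inside a MARCXML file:

```shell
# Assumed yaz-marcdump invocation for the marc8 -> utf8 MARCXML step
# (commented out; verify flags against your version's man page):
#   yaz-marcdump -f marc8 -t utf8 -o marcxml export.mrc > records.xml

# The literal string is plain ASCII, so inside MARCXML its angle brackets
# appear XML-escaped; replace it with the real RIGHT SINGLE QUOTATION MARK.
printf 'It&lt;U+2019&gt;s a test\n' |
  sed 's/&lt;U+2019&gt;/’/g'
# -> It’s a test

# Survey how widespread the problem is. Note the A-F in the character
# class: the code points are hexadecimal, so '<U+[0-9]+>' would miss some.
printf 'a&lt;U+2019&gt;b c&lt;U+2014&gt;d\n' |
  grep -o '&lt;U+[0-9A-Fa-f]*&gt;' | sort | uniq -c
```

The same sed and grep calls work unchanged on a real exported file in place of the printf stand-ins.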
>> Pymarc would let you bust apart the records, correct the strings, and
>> reassemble them correctly, for example. Or you could write something to
>> handle it during the SolrMarc import, but I think that's the least
>> appealing option.
>>
>> Best to sanitize the data before handing it to SolrMarc. I'd really
>> suggest identifying the problem records and fixing them one way or
>> another. And document the process, because that stuff will creep in
>> again.
>>
>> -Tod
>>
>> Tod Olson <to...@uc...>
>> Systems Librarian
>> University of Chicago Library
>>
>> On Mar 4, 2013, at 2:28 PM, Jay Roos <ja...@gm...> wrote:
>>
>>> I'm getting several special character encodings when exporting from
>>> Horizon. They show up in the marc records like this: '<U+2019>'. Does
>>> anyone have a recommendation on how to deal with them? They stay this
>>> way when imported into VuFind.
>>>
>>> --
>>> Jay Roos
>>> Information Technology Coordinator
>>> Great River Regional Library

--
Jay Roos
Information Technology Coordinator
Great River Regional Library
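The sed step above handles one code point at a time, but the analysis found these literals spread across 1,723 records, where more than one distinct code point may occur. A short Python filter (a sketch, not Tod's pymarc approach; the function name and the assumption that both the raw and XML-escaped forms can occur are mine) decodes any such literal in one pass:

```python
import re

# Matches the raw form "<U+2019>" and the XML-escaped form "&lt;U+2019&gt;"
# that the same ASCII string takes inside a MARCXML file.
LITERAL = re.compile(r'(?:<|&lt;)U\+([0-9A-Fa-f]{2,6})(?:>|&gt;)')

def decode_literals(text: str) -> str:
    """Replace each <U+XXXX> literal with the character it names,
    interpreting XXXX as a hexadecimal Unicode code point."""
    return LITERAL.sub(lambda m: chr(int(m.group(1), 16)), text)

# The right-single-quote case from this thread, plus an em dash:
print(decode_literals("It&lt;U+2019&gt;s fine"))  # -> It’s fine
print(decode_literals("dash: <U+2014>"))          # -> dash: —
```

Running the exported MARCXML through this before the VuFind import replaces every such literal, whatever its code point, instead of one sed rule per character.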