We use marcexport with UTF-8 output at Yale without difficulty, but so far we have not loaded all 8 million records or most of our non-Latin scripts (we are still in the design and customization stage, and re-indexing too often for a full load to be worth the time). We did find one vendor who was sending non-roman characters encoded as "&<charname>;" tags, which were designed to be rendered through stylesheets; those records had to be cleaned up.
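For anyone facing the same kind of vendor data, here is a rough sketch of that sort of cleanup in Python. It is not the script we used. It assumes the vendor's character names happen to match the standard HTML entity names (which may not hold for a given vendor), and the function name and the `extra` override table are made up for illustration.

```python
import re
from html.entities import html5  # maps names such as "eacute;" to "é"

# Matches "&name;" style tags; the exact shape of the vendor's tags is an assumption.
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def replace_named_entities(text, extra=None):
    """Replace "&name;" tags with the characters they name.

    `extra` is a hypothetical override table for vendor-specific names
    that are not standard HTML entities. Unknown names are left as-is
    so they can be found and reviewed later.
    """
    extra = extra or {}

    def _sub(match):
        name = match.group(1)
        if name in extra:
            return extra[name]
        # Exact-name lookup only; unknown tags fall through unchanged.
        return html5.get(name + ";", match.group(0))

    return ENTITY.sub(_sub, text)
```

Leaving unknown names untouched, rather than dropping them, makes it easy to grep the output for any "&...;" tags that still need a mapping.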
-----------original message -------------------------------------------
Date: Thu, 12 Jun 2008 17:17:55 -0500
From: Chris Delis <cedelis@...>
Subject: [VuFind-General] "Cleaning" MARC files for use with java
importer (was Re: diacritic display -- font problem?)
Content-Type: text/plain; charset=iso-8859-1
Are there any Voyager customers out there using Voyager's marcexport
tool along with the java importer? If so, are you exporting as MARC21
MARC-8? And how are you "cleaning" your marc records, if at all? I
am having trouble getting the ISOLatin1Filter to work properly in SOLR
and am guessing the problem may have to do with a bad encoding
somewhere. Are there any good tools (which can run in a batch on a
*nix system) someone can recommend? Or is it just better to translate
(via yaz-marcdump or whatever) to MARCXML and modify the java importer
to read MARCXML?
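One way to check for the kind of encoding mismatch suspected above is to look at what each record claims about itself. In MARC 21, leader position 09 is "a" for Unicode and blank for MARC-8. The following is only a sketch: the function name is made up, it assumes well-formed binary MARC with the usual 0x1D record terminator, and it reports only what the leader says, which may not match the bytes actually in the record.

```python
RECORD_TERMINATOR = b"\x1d"  # ends each record in binary MARC (ISO 2709)
LEADER_LENGTH = 24

def leader_encodings(data):
    """Return the encoding declared in leader/09 for each record in `data`."""
    labels = []
    for record in data.split(RECORD_TERMINATOR):
        if len(record) < LEADER_LENGTH:
            continue  # skip the empty tail after the last terminator
        flag = record[9:10]
        if flag == b"a":
            labels.append("utf-8")
        elif flag == b" ":
            labels.append("marc-8")
        else:
            labels.append("unknown")
    return labels
```

A file that mixes both labels, or that is labelled one way and exported another, would be a likely source of the garbled diacritics.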
On Wed, May 21, 2008 at 02:29:36PM -0700, Naomi Dushay wrote:
> There is a C "utf8conditioner" program available at the OAI-PMH web
> site (look under "tools"). It changes bad UTF-8 characters to a
> benign (but unmeaningful) character. The program comes with test
> files with bad UTF-8 characters.
> When I worked for the National Science Digital Library, we harvested
> OAI data that had bad UTF-8 chars. It was fairly common.
> The multi-byte UTF-8 characters tend to be particularly thorny, as I
> - Naomi
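The same general idea (swap bad UTF-8 for a harmless placeholder) can be sketched in a few lines of Python. This is not the utf8conditioner program and does not claim to match its behavior; the function name and the "?" placeholder are arbitrary choices for illustration. It is meant for text already pulled out of a field, since changing byte counts inside a raw binary MARC file would throw off the lengths recorded in the leader and directory.

```python
def condition_utf8(raw, placeholder="?"):
    """Decode bytes as UTF-8, swapping undecodable sequences for `placeholder`.

    Returns (clean_text, replacement_count). The count is taken from the
    number of U+FFFD characters after decoding, so input that legitimately
    contains U+FFFD would be over-counted.
    """
    decoded = raw.decode("utf-8", errors="replace")
    count = decoded.count("\ufffd")
    return decoded.replace("\ufffd", placeholder), count
```

A nonzero count is a cheap signal that a file deserves a closer look before it goes to the importer.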