From: Barnett, J. <jef...@ya...> - 2008-06-13 13:06:48
We use marcexport as UTF-8 at Yale without difficulty, but so far we have not loaded all 8 million records or most of our non-Latin scripts (we are still in the design and customization stage, and re-indexing too often to make it worth the time). We did find one vendor who was sending non-Roman characters encoded as "&<charname>;" tags, designed to be rendered through stylesheets, which had to be cleaned up.

----------- Original Message -------------------------------------------
Date: Thu, 12 Jun 2008 17:17:55 -0500
From: Chris Delis <ce...@ui...>
Subject: [VuFind-General] "Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)
To: vuf...@li...
Message-ID: <200...@ui...>
Content-Type: text/plain; charset=iso-8859-1

Hello all,

Are there any Voyager customers out there using Voyager's marcexport tool along with the java importer? If so, are you exporting as MARC21 MARC-8? And how are you "cleaning" your MARC records, if at all? I am having trouble getting the ISOLatin1Filter to work properly in Solr and am guessing the problem may have to do with a bad encoding somewhere. Are there any good tools (which can run in a batch on a *nix system) someone can recommend? Or is it better to translate (via yaz-marcdump or whatever) to MARCXML and modify the java importer to read MARCXML?

Thanks!

Chris

On Wed, May 21, 2008 at 02:29:36PM -0700, Naomi Dushay wrote:
> There is a C "utf8conditioner" program available at the OAI-PMH web
> site (look under "tools"). It changes bad UTF-8 characters to a
> benign (but unmeaningful) character. The program comes with test
> files with bad UTF-8 characters.
>
> When I worked for the National Science Digital Library, we harvested
> OAI data that had bad UTF-8 chars. It was fairly common.
>
> The multi-byte UTF-8 characters tend to be particularly thorny, as I
> recall.
>
> - Naomi
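The "&<charname>;" cleanup mentioned at the top of the thread can be sketched in a few lines. This is a minimal, hypothetical sketch, not Yale's actual cleanup script; it assumes the vendor used standard HTML/XML named character references (e.g. &eacute;), which Python's stdlib can resolve directly:

```python
import html

def clean_entities(field: str) -> str:
    """Replace named character references (e.g. &eacute;) with their
    Unicode equivalents. Names the HTML spec does not define are
    left untouched rather than guessed at."""
    return html.unescape(field)

# Hypothetical vendor data: an entity-encoded field value.
print(clean_entities("&eacute;tude"))  # -> étude
```

If the vendor used non-standard tag names, a lookup table mapping each name to its code point would be needed instead of `html.unescape`.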
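The utf8conditioner behavior Naomi describes — substituting a benign character for bad UTF-8 byte sequences — can be approximated with Python's built-in decoder. This is a sketch of the same idea, not the utf8conditioner program itself:

```python
def condition_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, turning any invalid byte sequence into
    U+FFFD (the replacement character) -- benign but unmeaningful,
    much as utf8conditioner does for harvested OAI data."""
    return raw.decode("utf-8", errors="replace")

# A valid sequence passes through; a stray Latin-1 0xE9 is replaced.
print(condition_utf8(b"caf\xc3\xa9"))  # -> café
print(condition_utf8(b"caf\xe9"))      # -> caf\ufffd
```

For MARC records specifically, conditioning the bytes before handing them to the java importer avoids the importer ever seeing a malformed sequence.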