Problems importing Endnote XML

2010-01-22
2013-05-28
  • Matthias Guth
    2010-01-22

    I encountered an error while importing XML files from Endnote into refbase. First of all: importing the same file into the demo version at http://www.refbase.org/ works fine. But when I try to import it on other installations (one using Bibutils 4.1-1 and PHP version 5.2.10-2ubuntu6.4, the other using bibutils_4.4 and PHP version 5.2.9), the following error occurs.

    Conversion with importBibutils($sourceText,"endx2xml") or importBibutils($sourceText,"xml2ris") in import/bibutils/import_endx2refbase.php seems to produce three extra characters at the start of the first record in $recordArray, which is used in the function validateRecords() in /includes/import.inc.php. This causes the subsequent preg_match (which checks whether the string starts with the regex "/^TY  - /m") to fail, since the string starts with these three characters and not with T.

    As a result, the import ends with the validation error "Record 1: Unrecognized data format! Required field missing: TY". Using the "Skip records with unrecognized data format" option skips the first entry and imports the remaining entries without problems.

    The hex code of the three bytes is always "efbbbf", and I think this is a Unicode issue in code that is not multibyte-safe.
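
    The effect can be reproduced in isolation with a few lines of PHP. This is only a minimal sketch, not refbase code: the sample record is made up, and only the regex is the one quoted above.

```php
<?php
// Made-up sample: a RIS record preceded by the three bytes EF BB BF.
$prefix = "\xEF\xBB\xBF";
$record = $prefix . "TY  - JOUR\nTI  - Example title\nER  - \n";

// The same pattern as quoted above; with /m, ^ matches at the start of any line.
$startsWithTY = preg_match("/^TY  - /m", $record);

var_dump($startsWithTY);                   // int(0): "TY" is not at a line start
var_dump(bin2hex(substr($record, 0, 3)));  // string(6) "efbbbf"
```

    Because the three bytes sit on the same line as "TY", no line in the record begins with the tag, so the check fails for the first record only.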

    I think something changed in newer versions of Bibutils, and using an older version should work without problems.

    Therefore I added a simple workaround to the function validateRecords() in the file includes/import.inc.php:

    // For the first record only: drop the three leading bytes if they are "efbbbf"
    if ($i == 0 AND (strtolower(bin2hex(substr($recordArray[0], 0, 3))) == "efbbbf"))
        $recordArray[0] = substr($recordArray[0], 3);
    

    If there are any comments or suggestions, I would appreciate it if they were posted here.

    You can also get the patched file here: http://www.gono.info/refbase/patches/includes/import.inc.phps

    Greetings,
    Matthias

     
  • Hi Matthias,

    ATM, refbase-0.9.5 only works reliably with Bibutils v3.4, which is still available at:

    http://bibutils.refbase.org/

    Bibutils 4.x introduced changes that are incompatible with refbase-0.9.5, but these should get fixed in the next refbase update.

    The problem you're seeing seems to be the BOM character (byte order mark) at the beginning of the RIS file created by Bibutils 4. Please see my post from 16-Dec-2009 in this thread:

    https://sourceforge.net/projects/refbase/forums/forum/218758/topic/3483874

    As a workaround, if you're starting with a RIS file, you could just resave that RIS file to UTF-8 without BOM before importing.

    Alternatively, if you're starting with an Endnote XML file (named, say "literature.xml"), you could convert your Endnote XML data to RIS (UTF-8, no BOM) using Bibutils 4.x locally, then import that RIS file instead:

    endx2xml -i utf8 -un literature.xml | xml2ris -o utf8 -nb > literature.ris
    

    Let us know if this works better.

    Sorry for the trouble,

    Matthias

     
  • I realize I'm bumping a very old thread, but I just checked in code that will remove BOMs from uploaded files, if present. RefWorks and some other providers include a BOM by default.
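
    For anyone who needs the same behaviour elsewhere, stripping a leading UTF-8 BOM can be sketched roughly as follows. This is an illustration only, not the code that was checked in; the function name stripUtf8Bom and the sample data are made up for the example.

```php
<?php
// Hypothetical helper (not a refbase function): remove a leading
// UTF-8 BOM (bytes EF BB BF) from a string if one is present.
function stripUtf8Bom($text)
{
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        // Cast guards against substr() returning false on older PHP
        // versions when the input consists of the BOM alone.
        return (string) substr($text, 3);
    }
    return $text;
}

// Made-up sample: a RIS record that starts with a BOM.
$withBom = "\xEF\xBB\xBFTY  - JOUR\nER  - \n";
$clean = stripUtf8Bom($withBom);

var_dump(preg_match("/^TY  - /m", $clean));  // int(1): the record is now recognized
```

    Applying the check once to the whole uploaded text, before it is split into records, avoids having to special-case the first record later on.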