Bug: MedLine XML format + Diacritic marks

2008-03-03
2013-05-28
  • Before being negative, first off, I want to mention that I just recently discovered RefBase, and I think it's a very useful tool. We're currently evaluating it for in-house management and tracking of in-house publications.

    As most of our publications are referenced in MedLine, I took a closer look at the PubMed import. There seems to be a bug in the PubMed XML toolchain in regard to diacritical marks (Umlaute). This leads to the author field appearing empty after import.

    Please note: Importing from Medline format is not affected.

    Example/how to reproduce:
    PubMed ID 17302433 has a first author of "Aänismaa".

    In the XML file obtained from NCBI, the author name is stated as "Aänismaa".
    For import, RefBase throws this through a series of filters:
    1) med2xml (bibutils) -> in the MODS, the name is represented as "A&#228;nismaa" ... so far, so good.
    2) xml2ris (bibutils) -> Now, the name is represented as "A<E4>nismaa" (i.e. the diacritic is now represented as 0xE4).
    3) After ris import, the author field shows up completely empty.

    This is using utf-8 throughout (database, character-encoding). Switching to latin-1 for testing did not resolve the problem. With latin-1 the author name gets mangled instead of showing up blank.
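
    For reference, a small Python sketch (not part of RefBase or bibutils) of how the "ä" looks in the two encodings involved. The 0xE4 byte seen after xml2ris is the latin-1 form of the character. In UTF-8 the same character takes two bytes, and a lone 0xE4 byte is not valid UTF-8. Whether that is what trips the importer is an assumption, not something verified against the RefBase source.

```python
name = "Aänismaa"

latin1_bytes = name.encode("latin-1")   # "ä" becomes the single byte 0xE4
utf8_bytes = name.encode("utf-8")       # "ä" becomes the two bytes 0xC3 0xA4


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

    A check like this on the intermediate RIS file would show which of the two forms the toolchain actually produced.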

    I realize
    - This might be more of a bibutils bug than a RefBase bug
    - Importing MedLine format data circumvents the problem (MedLine data appears to be plain vanilla ASCII)

    but I thought I'd mention it in case someone else runs into the same issue. Also, in the long run, it might be a good idea to import directly from MODS rather than RIS

     
    • Hi,

      > Before being negative, first off, I want to mention that I just
      > recently discovered RefBase, and I think it's a very useful tool.

      Thanks, I appreciate the comment.

      > We're currently evaluating it for in-house management and tracking
      > of in-house publications.

      Let us know if you run into further problems, or have some suggestions for improvement.

      > As most of our publications are referenced in MedLine, I took a
      > closer look at the PubMed import. There seems to be a bug in the
      > PubMed XML toolchain in regard to diacritical marks (Umlaute). This
      > leads to the author field appearing empty after import.

      I can replicate this problem partly, i.e. I see that there are some issues with conversion of higher ASCII chars. However, using your test record, I always get something imported to the 'author' field, i.e. the 'author' field is not empty after import.

      An empty author field may be caused by a different issue in refbase-0.9.0, where the MEDLINE importer only inspects the MEDLINE 'FAU' field for author information. The newest version that's available in the "trunk" of the refbase SVN repository[1] will also look into the 'AU' field if there's no 'FAU' field.

      [1]: http://svn.refbase.net/

      That said, if you can point us to an example where the existence of higher ASCII chars causes the 'author' field to be empty, please let us know.

      > Example/how to reproduce:
      > PubMed ID 17302433 has a first author of "Aänismaa".
      >
      > For import, RefBase throws this through a series of filters: 1)
      > med2xml (bibutils) -> in the MODS, the name is represented as
      > "A&#228;nismaa"

      Correct. However, note that Bibutils lets you use the "-u" flag to write Unicode directly (instead of XML entities).

      > 2) xml2ris (bibutils) -> Now, the name is represented as
      > "A<E4>nismaa" (i.e. the diacritic is now represented as 0xE4).

      Using Bibutils v3.38, I cannot replicate this problem. Using these conversion steps on the command line:

      med2xml -i utf8 -o utf8 pubmed_result.xml > pubmed_mods.xml
      xml2ris -i utf8 -o utf8 pubmed_mods.xml > pubmed.ris

      I get a RIS file (encoded as UTF-8, no BOM) that correctly contains the author name "Aänismaa".

      Which version of Bibutils are you using? Have you tried the newest version (v3.40)?

      http://www.scripps.edu/~cdputnam/software/bibutils/

      > 3) After ris import, the author field shows up completely empty.

      This is not the case for me. When I try to import this into a Unicode-based refbase database (such as the one at http://www.refbase.org ), I do get an author string, however the umlaut character in the first author's forename seems to cause problems for the routine that reduces given names to initials. I get:

      Aänismaa, P..ivi; Seelig, A.

      This is probably a bug in refbase (and/or the PHP version used on the server). I'd need to further investigate this.

      > This is using utf-8 throughout (database, character-encoding).

      If in doubt, you might want to double check your system using the hints at:

      http://wiki.refbase.net/index.php/Installation-Troubleshooting#Problems_with_special_characters

      > Switching to latin-1 for testing did not resolve the problem. With
      > latin-1 the author name gets mangled instead of showing up blank.

      I just tried pasting the PubMed XML for this record into the import form at http://demo.refbase.net/ (which is a latin1-based refbase installation) and it actually seems to import fine.

      What exactly do you see after import into a latin1-based database?

      > I realize
      > - This might be more of a bibutils bug than a RefBase bug

      Not necessarily so. At least in my tests, the RIS file generated by Bibutils looks fine.

      > - Importing MedLine format data circumvents the problem (MedLine
      > data appears to be plain vanilla ASCII)

      Yes, unfortunately, MEDLINE data don't contain any higher ASCII chars.

      > but I thought I'd mention it in case someone else runs into the
      > same issue.

      Thanks for the clear problem report! I hope that we can resolve this issue.

      > Also, in the long run, it might be a good idea to import directly
      > from MODS rather than RIS

      Right, and this is what we have been planning for a very long time.

      http://wiki.refbase.net/index.php/Importing_records#Import_road_map

      However, nothing has been done about this, partly because RIS import via Bibutils works quite well, and also because we haven't really heard (m)any complaints. Besides that, developing a good parser for MODS XML isn't trivial, and the Bibutils developer (Chris Putnam) really has done an outstanding job with his *2xml tools. That said, of course I'd love to support import of MODS XML natively.

      Thanks, Matthias

       
      • Hi Matthias,

        I'm the original poster of the bug. Sorry for not replying for a long time --- I was caught up with another project.

        Now, to the original matter:
        I am currently running a recent SVN checkout (revision 1170) of RefBase, and BibUtils version 3.41

        > I can replicate this problem partly, i.e. I see that there are some issues with conversion of higher ASCII chars.
        > However, using your test record, I always get something imported to the 'author' field, i.e. the 'author' field is
        > not empty after import.

        That seems to be the case with the SVN trunk version indeed. The author's name is not completely correct, though, I get "Aìnismaa, P.Ã.¬ivi; Seelig, A.", so the diacritics are not quite rendered correctly.

        > An empty author field may be caused by a different issue in refbase-0.9.0, where the MEDLINE importer only inspects
        > the MEDLINE 'FAU' field for author information. The newest version that's available in the "trunk" of the refbase SVN
        > repository[1] will also look into the 'AU' field if there's no 'FAU' field.

        Good to hear it's fixed in SVN, maybe this was an additional distraction from the problem.

        >> Example/how to reproduce:
        >> PubMed ID 17302433 has a first author of "Aänismaa". 

        >Using Bibutils v3.38, I cannot replicate this problem. Using these conversions steps on the command line:
        >
        >med2xml -i utf8 -o utf8 pubmed_result.xml > pubmed_mods.xml
        >xml2ris -i utf8 -o utf8 pubmed_mods.xml > pubmed.ris
        >
        >I get a RIS file (encoded as UTF-8, no BOM) that correctly contains the author name "Aänismaa".

        I can confirm this with latest RefBase/Bibutils versions.

        > 3) After ris import, the author field shows up completely empty. 

        > When I try to import this into a Unicode-based refbase database (such as the one at http://www.refbase.org ),
        > I do get an author string, however the umlaut character in the first author's forename seems to cause
        > problems for the routine that reduces given names to initials. I get:
        >Aänismaa, P..ivi; Seelig, A.

        Right. When I do this in two different ways, I get the following (on a UTF-8 configured system):
        PubMed XML import: "Aìnismaa, P.Ã.¬ivi; Seelig, A."
        PubMed manually through the med2xml/xml2ris chain as described above: "Aänismaa, P.Ã.¤ivi; Seelig, A."

        So, we're definitely closer, but not there yet.
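
        For what it's worth, the "Ã¤" pair is what the two UTF-8 bytes of "ä" look like when they are read as latin-1. A minimal Python sketch of that effect (purely illustrative, RefBase itself is PHP):

```python
given = "Päivi"

# UTF-8 bytes misread as latin-1: each byte of the two-byte "ä"
# turns into a character of its own.
misread = given.encode("utf-8").decode("latin-1")
print(misread)

# The misreading is reversible as long as no bytes were dropped or altered.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)
```

        The "Ã.¤" in the output above looks like the same pair with a period inserted between the two bytes, which would fit a routine that handles the name byte by byte. That reading is an inference from the output, not something confirmed in the code.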

        > This is probably a bug in refbase (and/or the PHP version used on the server).
        > I'd need to further investigate this.

        Let me know if you come up with something. But the problem looks more tractable already, and I'll probably tinker with it a bit, as well.

        > I just tried pasting the PubMEd XML for record into the import form at http://demo.refbase.net/ (which is a
        > latin1-based refbase installation) and it actually seems to import fine.
        > What exactly do you see after import into a latin1-based database?

        With DB as latin-1:
        After PubMed XML import: "Aìnismaa, P.Ã.¬ivi; Seelig, A."

        Thanks for your helpful comments already,
        Michael

         
    • Hi Michael,

      thanks for the followup.

      >> the umlaut character in the first author's forename seems to cause
      >> problems for the routine that reduces given names to initials. I get:
      >> Aänismaa, P..ivi; Seelig, A.

      I've done a bit more testing, and it seems that when importing a record such as PMID:17302433 as PubMed XML (where any of the author's given names contains higher ASCII chars) this causes problems for the routine that reduces given names to initials. As a result, one may get e.g. "Aänismaa, P..ivi; Seelig, A." instead of "Aänismaa, P.; Seelig, A.".

      In the current SVN trunk version, if the initials/given names contain any higher ASCII chars, they currently get garbled IF a latin1-based database is used AND variable '$convertExportDataToUTF8' in 'ini.inc.php' is set to "yes". This is because the splitting is currently done AFTER the person string has been converted to UTF-8.
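
      To illustrate why the order matters, here is a rough Python sketch. The pattern and the helper names (reduce_to_initial, latin1_to_utf8) are made up for this example and are not the actual refbase code. The exact garbling depends on the pattern used, so the sketch only shows the order dependence:

```python
import re

# Hypothetical byte-oriented pattern that knows latin-1 lowercase letters
# (bytes 0xDF-0xFF; this range also includes the division sign 0xF7).
LATIN1_NAME = re.compile(rb"([A-Z])[a-z\xdf-\xff]*")


def reduce_to_initial(raw: bytes) -> bytes:
    """Replace each capitalised name part with its initial plus a period."""
    return LATIN1_NAME.sub(rb"\1.", raw)


def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode latin-1 bytes as UTF-8 bytes."""
    return raw.decode("latin-1").encode("utf-8")


given_latin1 = "Päivi".encode("latin-1")

# Reduce first, convert afterwards: the pattern sees the bytes it expects.
good = latin1_to_utf8(reduce_to_initial(given_latin1))

# Convert first, reduce afterwards: "ä" is now 0xC3 0xA4, which the pattern
# does not treat as part of the name, so the rest of the name is left behind.
bad = reduce_to_initial(latin1_to_utf8(given_latin1))

print(good.decode("utf-8"))
print(bad.decode("utf-8"))
```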

      However, in the bleeding-edge version of the SVN repository, I've added some magic that converts/transliterates UTF-8 data to latin1 if necessary. This seems to have also solved the issue reported here.

      You can try out the bleeding-edge version at:

      http://beta.refbase.net/import.php

      Importing PMID:17302433 as PubMed XML now seems to work fine there. Can you confirm this?

      >> What exactly do you see after import into a latin1-based database?
      >
      > With DB as latin-1:
      > After PubMed XML import: "Aìnismaa, P.Ã.¬ivi; Seelig, A."

      That's strange. Are you sure that you've set up refbase correctly (w.r.t. character encoding)? If in doubt, please check out:

      http://wiki.refbase.net/index.php/Installation-Troubleshooting#Problems_with_special_characters

      So, basically, for author names that contain higher ASCII chars in their family name (but not within the given name), can you enter, display AND search for these characters correctly?

      Thanks, Matthias

       
      • >>> What exactly do you see after import into a latin1-based database? 
        >>
        >> With DB as latin-1: 
        >> After PubMed XML import: "Aìnismaa, P.Ã.¬ivi; Seelig, A."

        >That's strange. Are you sure that you've setup refbase correctly (w.r.t. character encoding)?

        Sorry, my bad. After explicitly DROPing the literature db and re-initializing to latin-1, this now works.
        I get "Aanismaa, P.; Seelig, A." as the authors. No Umlauts there, but at least no "weird characters", and output is identical to that from demo.refbase.net, which is also latin-1 based.

        -Michael

         
    • Hi Michael,

      glad that it's now (more or less) working for you.

      > >That's strange. Are you sure that you've setup refbase correctly
      > >(w.r.t. character encoding)?
      >
      > Sorry, my bad. After explicitly DROPing the literature db and
      > re-initializing to latin-1, this now works.
      > I get "Aanismaa, P.; Seelig, A." as the authors. No Umlauts there,
      > but at least no "weird characters", and output is identical to
      > that from demo.refbase.net, which is also latin-1 based.

      How did you import this article? Did you import it from PubMed XML data, or did you use the refbase "Import via PubMed ID" feature? Please note that the latter uses MEDLINE (and not PubMed XML) data under the hood. Unfortunately, MEDLINE data don't include any higher ASCII chars.

      However, when importing PMID:17302433[1] as PubMed XML[2]

      [1]: http://view.ncbi.nlm.nih.gov/pubmed/17302433
      [2]: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&rettype=medline&id=17302433

      into the newest (bleeding-edge SVN, #1170) refbase version (see e.g. beta.refbase.net), the higher ASCII chars should come through fine, no matter whether you are importing it into a latin1-encoded database or into a UTF-8 database.

      Matthias

       
      • Hi Matthias,

        I once more tackled this issue, and while I got a latin-1 setup to work fine and even to import data with Umlauts correctly, I was not able to search for such entries.

        I now switched to UTF8, and things are working (almost) perfect. The only thing that is bugging me is the missing "magic" from the bleeding edge repository that fixed the error with regard to false firstname truncation.

        Can you provide me with any pointers to where in the source this magic happens, so that I can try to merge it back into the version I have currently checked out?

        -Michael

         
    • Hi Michael,

      > I now switched to UTF8, and things are working (almost) perfect.
      > The only thing that is bugging me is the missing "magic" from the
      > bleeding edge repository that fixed the error with regard to false
      > firstname truncation.

      Things seem to work fine now when importing PMID:17302433 as PubMed XML into the refbase beta database (which runs the latest refbase version from the SVN bleeding-edge branch with a *latin1* based MySQL database).

      Still, I fear I spoke too soon. I just tried to import PMID:17302433 as PubMed XML into:

      http://refbase.textdriven.com/beta/

      which runs the latest refbase version from the SVN bleeding-edge branch with a *UTF-8* based MySQL database. When importing just this single XML record from PubMed, all author information is somehow omitted. And when importing multiple records at once, only the first author comes thru correctly. :-/

      So, I fear, the issue still isn't solved. Sorry.

      > Can you provide me with any pointers to where in the source this
      > magic happens, so that I can try to merge it back into the version
      > I have currently checked out?

      Which SVN version are you using? The one from the trunk or the one from the bleeding-edge branch? Only the latter has the changes I mentioned earlier in this thread. There are *a lot* of changes between the trunk and the bleeding-edge version, so it's probably best to check out the bleeding-edge version into a new directory and test it with a completely separate test installation.

      However, this might be all moot since it doesn't seem to work for UTF-8 now (at least this is what my above observation indicates).

      I'll need to find a calm moment and investigate this further.

      Sorry for the trouble this has caused you.

      Matthias

       
      • Hi Matthias,

        > Things seem to work fine now when importing PMID:17302433 as PubMed XML into the refbase
        > beta database (which runs the latest refbase version from the SVN bleeding-edge branch
        > with a *latin1* based MySQL database).

        Hm, for some reason, I never got latin-1 to work correctly (Searching for a string with special characters failed), and thus I'm currently favoring UTF-8.

        > When importing just this single XML record from PubMed, all author information is
        > somehow omitted. And when importing multiple records at once, only the first author
        > comes thru correctly. :-/
        >So, I fear, the issue still isn't solved. Sorry.

        Bizarre. I am seeing a different failure (admittedly with the trunk version). I can import fine now, but author's first name truncation fails (as you described earlier).

        I am using the trunk version, and I'd rather not upgrade to bleeding-edge right now, as I am doing some modifications to the code myself. Once this is all a bit more stable and condensed into a few surgical patches, I can consider switching branches.

        > There are *a lot* of changes between the trunk and the bleeding-edge version, so it's
        > probably best to check out the bleeding-edge version into a new directory and test it
        > with a completely separate test installation.

        I tried that today. Is it possible that there's a bug with the install.php in the current bleeding-edge? It seems to look for the "depends" table before it actually creates the database, or so it seems. At least, installation failed before any table was created. It's also possible that the db creation step failed. I did not have time to trace this down properly, so don't treat it as a bug yet, I'll try to reproduce it first.

        > I'll need to find a calm moment and investigate this further.
        > Sorry for the trouble this has caused you.

        That's quite alright, at least you seem to be on the best of terms with Unicode, whereas I feel a little helpless still in that area :-) I think for now, I will rely on PubMed Medline format import without diacritics, and tackle this at a later stage.

        Cheers,
        Michael

         
    • Hi Michael,

      > Hm, for some reason, I never got latin-1 to work correctly
      > (Searching for a string with special characters failed), and thus
      > I'm currently favoring UTF-8.

      This may happen if there's a mismatch between the used server (or web site) encoding and the database encoding. Unfortunately, these things are often tricky to resolve. The important things to consider are listed here:

      http://wiki.refbase.net/index.php/Installation-Troubleshooting#Problems_with_special_characters

      W.r.t. the server setup, it's especially important that the MySQL server's character set and collation settings are set up correctly and consistently. More info about this topic is given here:

      http://wiki.refbase.net/index.php/Troubleshooting#MySQL_migration_and_character_set_problems

      > Is it possible that there's a bug with the install.php in the
      > current bleeding-edge? It seems to look for the "depends" table
      > before it actually creates the database, or so it seems.

      Strange. The 'install.php' script hasn't really changed that much, in fact it's almost identical to the trunk version.

      > I did not have time to trace this down properly, so don't treat it
      > as a bug yet, I'll try to reproduce it first.

      That would be helpful, thanks. And I agree that the most logical explanation would be that the database wasn't created successfully. Maybe the MySQL user you're using has no permission to create a new MySQL database?

      I promise to look into this issue soon. While I have an idea where things might go wrong (e.g. non-Unicode savvy regex matching when parsing and re-arranging author name parts), I don't yet have a solution to fix it.

      Thanks for your patience, Matthias

       
    • Hi Michael,

      please try to open file 'includes/include.inc.php' in a text editor and re-save it with encoding "Unicode (UTF-8, no BOM)". Does this help? I.e. are you now able to import, say, PMID:17302433 as PubMed XML and have author names imported correctly (with given names getting correctly reduced to initials)?

      In case someone's interested, the regex patterns used in function 'reArrangeAuthorContents()' in file 'includes/include.inc.php' are definitely part of the problem. Generally, the 'start_session()' function (in the same file) *should* establish an appropriate locale via function 'setSystemLocale()' so that e.g. '[[:lower:]]' would also match 'ø' etc.

      However, if a UTF-8 setup is used on some (all?) servers, this doesn't seem to work as expected. I.e. '[[:lower:]]' does not match the "ä" in "Päivi". For this reason, I specified the higher ASCII chars of the latin1 character set literally in the regex patterns. However, this isn't really smart since it isn't a universally working solution, and it causes problems for a UTF-8 based database (unless the file is re-saved with encoding "Unicode (UTF-8, no BOM)"). But when I remove these literal latin1 characters from the regex patterns, I get "Aänismaa, P.äivi" instead of "Aänismaa, P." upon import of PMID:17302433 as PubMed XML. :-/

      So, for a UTF-8 setup, re-saving file 'includes/include.inc.php' with encoding "Unicode (UTF-8, no BOM)" may work as a temporary workaround. But this will only help for higher ASCII chars of the latin1 character set (i.e. ÄÅÁÀÂÃÇÉÈÊËÑÖØÓÒÔÕÜÚÙÛÍÌÎÏÆ or äåáàâãçéèêëñöøóòôõüúùûíìîïæÿß).
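
      The effect of re-saving the file can be mimicked with a small Python sketch. This is only an illustration under assumptions: build_pattern and to_initial are hypothetical helpers, the pattern is much simpler than the real ones in 'reArrangeAuthorContents()', and "Paweł" is a made-up test name with a character outside latin1.

```python
import re

# The lowercase latin1 letters that are written literally in the pattern.
LOWER_EXTRA = "äåáàâãçéèêëñöøóòôõüúùûíìîïæÿß"


def build_pattern(source_encoding: str):
    """Build the character class as a byte-oriented regex engine would see it
    when the file holding the literal characters is saved in source_encoding."""
    extra = LOWER_EXTRA.encode(source_encoding)
    return re.compile(rb"([A-Z])[a-z" + extra + rb"]*")


def to_initial(pattern, given: str) -> str:
    """Reduce a given name to its initial; the data itself arrives as UTF-8."""
    data = given.encode("utf-8")
    return pattern.sub(rb"\1.", data).decode("utf-8")


saved_as_latin1 = build_pattern("latin-1")
saved_as_utf8 = build_pattern("utf-8")

# File saved as latin1: the class holds 0xE4 etc., not the UTF-8 bytes of "ä".
print(to_initial(saved_as_latin1, "Päivi"))

# File saved as UTF-8: the class now holds the individual UTF-8 bytes.
print(to_initial(saved_as_utf8, "Päivi"))

# Characters outside latin1 are still not covered by the literal list.
print(to_initial(saved_as_utf8, "Paweł"))
```

      After re-saving, the class matches the individual UTF-8 bytes rather than whole characters, which is why it only happens to work for the characters listed.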

      If someone else knows a true solution to this problem, please let me know.

      Thanks, Matthias

       
    • Hi,

      I've now fixed this issue, i.e. it should now be possible to correctly import (or cite) records which contain higher ASCII chars in authors' given names.

      I've committed my changes to the refbase SVN repository (bleeding-edge branch):

      http://refbase.svn.sourceforge.net/viewvc/refbase/branches/bleeding-edge/

      and I hope to update the SVN trunk soon.

      The scripts at this UTF8 based installation:

      http://refbase.textdriven.com/beta/

      have been updated, so you could try the fix there.

      Hope this helps,

      Matthias

       
  • heem
    2010-06-22

    Trying to import the same DOI (10.3183/NPPRJ-2008-23-02-p224-230) into the demo.refbase.net database demonstrates the problem.

    Only the first author's last name is imported.

     
  • @heemie

    Please start a new thread when you have an unrelated issue.

    > The crossref openurl resolver shows the same thing, I don't know if it helps?
    >
    > Maybe it is a problem with the Journals database and then again a non-issue?

    Yes, exactly.  refbase gets data from CrossRef.  If the publisher has only supplied partial data to CrossRef, neither CrossRef nor refbase has a way of filling in the blanks.  Encourage your publisher to give more complete information to CrossRef and use an alternative data source for articles from that publisher until they fix things.