bibutils 5.4 Not recognizing MODS doctype when converting to ris, endnote, bibtex

matt
2014-10-24
2014-11-21
  • matt

    matt - 2014-10-24

    First off, great tool. I was preparing to write a similar utility when I found this. I'm really happy I found it, but alas, I am having a couple of problems. I've created a GitHub repository to demonstrate and reproduce my issue.

    Essentially, when I use sample MODS files taken directly from the LOC website, conversions to ris, endnote, and bibtex do not recognize document types properly. Understandably, this is problematic. xml2bib seems to recognize the document types best.

    Here is an example:

    transforming: article
    ris: TY  - STD
    end: %0 Generic
    bib: @Article{Brenner2000,
    ----------------------------------------
    transforming: book_chapter
    ris: TY  - STD
    end: %0 Generic
    bib: @Inbook{Amin1994,
    ----------------------------------------
    transforming: book
    ris: TY  - STD
    end: %0 Generic
    bib: @Book{11761548,
    ----------------------------------------
    transforming: conference_publication
    ris: TY  - CONF
    end: %0 Conference Proceedings
    bib: @Proceedings{4968605,
    ----------------------------------------
    transforming: serial
    ris: TY  - STD
    end: %0 Generic
    bib: @Misc{11315879,
    ----------------------------------------
    

    The best workaround I have is to pipe the bibtex conversion back to MODS and then pipe that to RIS or EndNote, like so:

    xml2bib source/article.xml | bib2xml | xml2end
    

    But of course this fails if bibtex doesn't recognize the proper document type.
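
    For scripting the round trip above, here is a minimal Python sketch. It assumes the bibutils programs xml2bib, bib2xml, and xml2end (or xml2ris) are on the PATH and that bib2xml reads standard input when given no file, as in the command above. The function names roundtrip_commands and run_roundtrip are hypothetical.

```python
import shutil
import subprocess


def roundtrip_commands(source_path, final_tool="xml2end"):
    """Return the argv lists for: xml2bib SOURCE | bib2xml | FINAL_TOOL."""
    return [["xml2bib", source_path], ["bib2xml"], [final_tool]]


def run_roundtrip(source_path, final_tool="xml2end"):
    """Run the pipeline, feeding each tool's stdout to the next tool's stdin."""
    data = None  # the first tool reads the file named on its command line
    for argv in roundtrip_commands(source_path, final_tool):
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} not found on PATH")
        result = subprocess.run(argv, input=data, capture_output=True, check=True)
        data = result.stdout
    return data.decode("utf-8", errors="replace")
```

    Calling run_roundtrip("source/article.xml", "xml2ris") would run the same chain with RIS as the final format. It still inherits whatever type xml2bib picked in the first step.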

    Is anyone aware of this issue? Can anyone offer any support?

    Thanks

     
  • Chris Putnam

    Chris Putnam - 2014-10-27

    There's a lot to discuss here.

    Some of these things are bugs: there's a failure to properly recognize types that should be recognized. For example, the article should be recognized as an article because it has <genre>article</genre>. I'm fixing this (actually, the program recognizes this and then "unrecognizes" it, with a comment that no longer makes sense to me. I'm certain this made sense for one particular input, but now I can't come up with a situation where that's the right thing to do).

    Some of these are awful in terms of figuring out types, but there's enough information there to do so. For example, the "book" type has a <genre>bibliography</genre>, but an <issuance>monographic</issuance> in the main element (i.e. not in a host or other related element), and that issuance is typically a reasonably good hint that this reference is a book. xml2bib obviously does this, but xml2end and xml2ris don't. I'm fixing them. But would it kill them to give a <genre>book</genre> tag? Especially considering that the genre term "book" exists under the MARC authority?

    Some of these are just false advertising. For example, the book_chapter gives absolutely no hints that it's a book chapter. The only element of the chapter that we could try to use to figure out its type is <typeOfResource>text</typeOfResource>. That's not helpful. Lots of things are text. The host element has no <issuance>monographic</issuance> and definitely no <genre>book</genre> tags. All we can figure out definitively is that Amin1994 is some bit of text that's part of something else (a book, a journal, a magazine???). Frankly, the only reason xml2bib happens to get this right is that it defaults to a book chapter type when it can't figure anything else out and there's a host element in the MODS record. I can apply this same logic to the other converters, but this default will break with badly coded journal articles.

    So what do we do when our heuristics fail in identifying the appropriate type? We default to some generic/miscellaneous type, which I think is the right behavior.
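
    The fallback order described in this post can be sketched roughly as follows. This is an illustrative Python sketch, not the actual bibutils code (which is written in C). The function names are hypothetical, only the hints mentioned above are checked, and conference types are not handled.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://www.loc.gov/mods/v3}'."""
    return tag.rsplit("}", 1)[-1]


def main_level_values(mods, name):
    """Collect lowercased text of `name` elements that are not inside a relatedItem."""
    values = []

    def walk(elem):
        for child in elem:
            tag = local(child.tag)
            if tag == "relatedItem":
                continue  # hints inside host/etc. describe the container, not this item
            if tag == name and child.text:
                values.append(child.text.strip().lower())
            walk(child)

    walk(mods)
    return values


def guess_type(mods_xml):
    """Guess a reference type from a single <mods> record (assumed to be the root)."""
    mods = ET.fromstring(mods_xml)
    genres = main_level_values(mods, "genre")
    if "article" in genres:
        return "article"
    if "book" in genres:
        return "book"
    # No usable genre: monographic issuance in the main element suggests a book.
    if "monographic" in main_level_values(mods, "issuance"):
        return "book"
    # No hints at all, but the item is part of something else: assume a chapter.
    has_host = any(
        local(child.tag) == "relatedItem" and child.get("type") == "host"
        for child in mods
    )
    if has_host:
        return "book chapter"
    return "generic"


# Hypothetical record with no type hints other than a host element.
chapter = '<mods><typeOfResource>text</typeOfResource><relatedItem type="host"/></mods>'
print(guess_type(chapter))  # prints "book chapter"
```

    The last two branches show the trade-off: the host-based default rescues the book_chapter example, but it would also label a journal article as a chapter if that article lacks a genre hint.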

    Bibutils tends to do a much better job when you give it "saner" MODS inputs and it tends to provide lots of type hints in the MODS it generates when converting from one format to another (which is why your xml2bib -> bib2xml -> xml2xxx pipeline does better in cases where xml2bib gets things right).

    So I'll fix up the obvious bugs and make the behavior more uniform (for the book_chapter example). But I'm much more interested in dealing with real references coming from real sources. I mostly use the (older) versions of these MODS examples in automated regression testing, but I came to the conclusion many years ago that some of these "examples" are particularly bad. I've worked around "oddities" from particular reference sources in the past, but at some point, these conversion utilities hit the garbage in -> garbage out limitation.

     
  • matt

    matt - 2014-10-30

    Thanks for responding to this, Chris. I've been traveling this week, so apologies for my delayed (and brief) response.

    Essentially, I tend to agree with you on these points. When bibutils does not recognize the documents being passed, defaulting to a general type is ideal. In terms of giving bibutils sane and real input, I am trying to do just that. I am crafting my organization's MODS records with influence from what I can find at the LOC. This structure is regrettably the one LOC has recommended. Alas, I am simply a programmer, not a metadata librarian. This leaves me with some grey areas when constructing my own records.

    I think making the document type recognition more consistent with current MARC genre tags is likely a good course of action. I can parse the rest of your code when I absolutely must make my record identify as a particular doc type. Ultimately, I wanted someone to be aware that the sample MODS records on the LOC's site might not convert properly to some formats using bibutils.

     
  • Chris Putnam

    Chris Putnam - 2014-10-30

    If you're generating your own MODS output, you're in an ideal situation.

    If you just give bibutils hints such as <genre>book</genre>, <genre>journal</genre>, or <genre>article</genre> at the appropriate places (e.g. in the main or 'host' MODS records), bibutils should do the right thing.
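
    As a rough illustration of placing those hints, the following Python sketch builds a small MODS record with a <genre> in the main record and another in the 'host' related item. The function name build_mods and the titles are made up for the example; only the genre values come from the post above.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def q(name):
    """Qualify an element name with the MODS namespace."""
    return f"{{{MODS_NS}}}{name}"


def add_title(parent, title):
    """Append <titleInfo><title>...</title></titleInfo> to parent."""
    title_info = ET.SubElement(parent, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = title


def build_mods(title, genre, host_title=None, host_genre=None):
    """Return a MODS record string carrying explicit genre hints."""
    ET.register_namespace("", MODS_NS)  # serialize MODS as the default namespace
    mods = ET.Element(q("mods"))
    add_title(mods, title)
    ET.SubElement(mods, q("genre")).text = genre  # hint for the item itself
    if host_title is not None:
        host = ET.SubElement(mods, q("relatedItem"), {"type": "host"})
        add_title(host, host_title)
        if host_genre is not None:
            ET.SubElement(host, q("genre")).text = host_genre  # hint for the container
    return ET.tostring(mods, encoding="unicode")


# Hypothetical article inside a hypothetical journal.
print(build_mods("An example article", "article",
                 host_title="An example journal", host_genre="journal"))
```

    A real record would carry names, dates, and identifiers as well; the sketch only shows where the two genre hints sit relative to each other.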

    Honestly, that's really all that's missing from the LOC examples. I expect that the examples were "handcrafted" to illustrate the structure of putting in all of the information into MODS and no one tried to figure out if you could decipher what they were programmatically (as opposed to the ability of humans to infer the reference type from the available info--I can guess that the book chapter example is probably a book chapter based on the data in the fields).

    To reiterate: I have no problem with the structure of the LOC examples. My only problem is that the examples don't have hints for defining what type of reference these records represent. (And I'm happy to fix cases where bibutils isn't using reasonable hints due to my oversight.)

    In terms of bibliographies, I am also simply a programmer and not a librarian. And my strong belief after working with MODS (and all other bibliography formats for that matter) for all of this time is that there are definitely grey areas.

    One suggestion that I have for looking at different data encoded in MODS is to take a bibtex, ris, or other file and use bibutils to convert it to MODS (or examine the "ALWAYS" and "DEFAULT" tags for different formats in lib/bibtextypes.c, lib/ristypes.c, etc.). It'll show you all of the hints that bibutils adds to clearly identify reference types via the MODS <genre>, <issuance>, and <typeOfResource> tags.
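
    To scan a MODS record for those hints, something like the following Python sketch may help. It assumes the hints appear as <genre>, <issuance>, and <typeOfResource> elements. The function name list_type_hints and the sample record are hypothetical, not actual bibutils output.

```python
import xml.etree.ElementTree as ET

HINT_TAGS = {"genre", "issuance", "typeOfResource"}


def list_type_hints(mods_xml):
    """Return (path, text) pairs for every type-hint element in the record."""
    hints = []

    def walk(elem, path):
        for child in elem:
            name = child.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
            label = name
            if name == "relatedItem":
                # Show whether the hint belongs to a host, series, etc.
                label = f"relatedItem[{child.get('type', '?')}]"
            child_path = f"{path}/{label}"
            if name in HINT_TAGS and child.text and child.text.strip():
                hints.append((child_path, child.text.strip()))
            walk(child, child_path)

    root = ET.fromstring(mods_xml)
    walk(root, root.tag.rsplit("}", 1)[-1])
    return hints


# Hand-written sample record for illustration only.
sample = (
    "<mods><genre>article</genre>"
    '<relatedItem type="host"><genre>journal</genre></relatedItem></mods>'
)
for path, text in list_type_hints(sample):
    print(path, "=", text)
```

    Running it over the MODS produced by bib2xml or ris2xml and over a hand-built record makes it easy to compare which hints each one carries and where.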

    And I'll get version 5.5 up pretty soon (I need to double check the changes in the RIS output). Sorry for the delay, but my "hobby time" has been eaten up by my "real work" over the last few days.

     

    Last edit: Chris Putnam 2014-10-30
  • Chris Putnam

    Chris Putnam - 2014-11-15

    The changes are now available in the latest version 5.5. Thanks again for the discussion.

     
  • matt

    matt - 2014-11-21

    Hi Chris,

    Sorry again for my delayed response, but thank you so much for getting this fixed and updated. It is much appreciated. Now, I'm off to figure out the easiest way to get our application to not only export citations, but also display HTML versions. I am currently looking at citeproc-py for this. It seems to accept BibTeX input.

    Thanks again for this great project of yours!

     
