autolink files using metadata (XMP)

2013-04-01
2013-04-04
  • Adrian Daerr

    Adrian Daerr - 2013-04-01

    In order for JabRef to be smarter about linking PDFs from database entries, I'd like it to exploit the XMP metadata when present (currently autolinking is based on (regexp) filename matching only). In particular the DOI, being by construction unique, seems to be a nice candidate for linking bibtex entries to files. I have a few questions on how to go about it:
    (a) How should this matching behave? When comparing the bibtex entry and the entry reconstructed from XMP data, should we require that all fields match, that a selection of fields match (e.g. matches(doi) AND matches(title)), that any of a number of fields match (e.g. matches(doi) OR matches(title)), or do we need a mix of these (e.g. 'matches(doi) OR (matches(title) AND matches(authors))')? The last option is the most flexible, but obviously also the hardest to implement...
    (b) To developers acquainted with The Source: How should I best go about it? It seems the name matching part in autolinking is done in net.sf.jabref.Util.findAssociatedFiles(); shall I extend that method or do you have a better suggestion?
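
The most flexible option in (a) can be pictured as composable predicates. What follows is only a minimal Java sketch under stated assumptions: entries are modelled as plain field-to-value maps rather than JabRef's own classes, and the names `XmpMatchSketch`, `matches` and `normalize` are hypothetical, not existing JabRef API.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical sketch: an entry is a plain field -> value map,
// not JabRef's own entry class.
final class XmpMatchSketch {

    /** Lower-cases, trims and collapses whitespace; null becomes "". */
    static String normalize(String value) {
        if (value == null) {
            return "";
        }
        return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** True when both entries carry the field and the normalized values are equal. */
    static BiPredicate<Map<String, String>, Map<String, String>> matches(String field) {
        return (dbEntry, xmpEntry) -> {
            String a = normalize(dbEntry.get(field));
            String b = normalize(xmpEntry.get(field));
            // A field missing on either side never counts as a match.
            return !a.isEmpty() && a.equals(b);
        };
    }

    /** The mixed rule from the post: matches(doi) OR (matches(title) AND matches(author)). */
    static final BiPredicate<Map<String, String>, Map<String, String>> DOI_OR_TITLE_AND_AUTHOR =
            matches("doi").or(matches("title").and(matches("author")));
}
```

Because `BiPredicate` already provides `and` and `or`, the "all fields", "any field" and mixed variants are just different compositions of the same `matches(field)` building block.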

    Adrian

     
    • Oliver Kopp

      Oliver Kopp - 2013-04-01

      Hi,

      In order for JabRef to be smarter about linking PDFs from database entries,
      I'd like it to exploit the XMP metadata when present (currently autolinking
      is based on (regexp) filename matching only).

      We also have the "find unlinked files" feature, which does more.

      Furthermore, there is the possibility to drag and drop a PDF into JabRef.

      Both call the same import behavior.

      (a) How should this matching behave? When comparing the bibtex entry and
      the entry reconstructed from XMP data, should we require that all fields
      match, that a selection of fields match (e.g. matches(doi) AND
      matches(title)), that any of a number of fields match (e.g. matches(doi) OR
      matches(title)), or do we need a mix of these (e.g. 'matches(doi) OR
      (matches(title) AND matches(authors))')? The last option is the most
      flexible, but obviously also the hardest to implement...

      I currently have no idea about that.
      Maybe the "Search" / "Find duplicates" functionality provides some
      support on that?

      (b) To developers acquainted with The Source: How should I best go about
      it? It seems the name matching part in autolinking is done in
      net.sf.jabref.Util.findAssociatedFiles(); shall I extend that method or do
      you have a better suggestion?

      That method feels like the wrong place, since it is a low-level method.
      I think embedding this in the importers is right. Have a look at how
      the "PDFContentImporter" is embedded. I hope that helps. Feel free
      to ask in a PM or on IRC. Or we could go through JabRef using
      TeamViewer :)

      Please join our developers mailing list
      https://lists.sourceforge.net/lists/listinfo/jabref-devel to have a
      discussion with other developers there.

      Cheers,

      Oliver

       
  • Adrian Daerr

    Adrian Daerr - 2013-04-03

    Hello Oliver,
    Thanks for your comments.

    We also have the "find unlinked files" feature, which does more.

    I saw that, but

    • it does not offer to create bibtex entries from XMP data
    • it does not offer to attribute the files to existing entries (based e.g. on XMP data)

    Let me give a few more details on the cases where I would like more support in linking entries and PDFs (and I'll stand corrected if I have just missed existing tools):

    • I like to be able to name my PDFs as I want (not necessarily obeying a naming convention I have to decide on and then stick to forever), rename them as I please, and/or move them to another subdirectory (e.g. when at some point I have a bunch of papers that I want nicely grouped under a new topic). With any of those actions I lose the entry's links, and I would like JabRef to find the files under their new (path)name faster than by
      • searching for and deleting broken links,
      • finding unlinked files and creating corresponding entries, and
      • finally cleaning up duplicates with the 'merge entries' tool
    • When I get a bunch of XMP-tagged articles from a colleague, I want both to add them to my database and to move the files into appropriate subdirectories.
      1. If I first add them to the database and then move each file into
        its appropriate subdir, I break the links in the newly created
        entries and find myself in the situation described before
        (existing entries, existing files but broken links).
      2. If I first store the files where I want them, there is no easy
        dragging and dropping any more because by then they are all in
        different places. The 'find unlinked files' tool would of course
        then do its job (corresponding entries do not exist yet), if it
        were able to create entries based on the XMP data.

    Furthermore, there is the possibility to drag and drop a PDF into JabRef. Both call the same import behavior.

    Ok, I'll look at the source of the importers to understand how they work.

    That method feels wrong, since it is a low level method.
    I think the embedding in the importers is right. Have a look at the
    embedding of the "PDFContentImporter". I hope, that helps. Feel free
    to ask in a PM or on IRC. Or we could go through JabRef using
    TeamViewer :)

    I saw the tools have evolved a bit, but I do not have a good picture of how things are organised. Thanks for your offer, I have subscribed to jabref-devel and might bother you with questions soon ;-)

    cheers,
    Adrian

     
    • Oliver Kopp

      Oliver Kopp - 2013-04-04

      Hi Adrian,

      it does not offer to create bibtex entries from XMP data

      I think "Create entry based on XMP data" should do exactly that.
      Doesn't it work for you?

      it does not offer to attribute the files to existing entries (based e.g. on
      XMP data)

      I think that could be accomplished by creating new functionality
      based on "Update empty fields with data fetched from Mr.dLib" and
      "Create entry based on XMP data".

      I like to be able to name my PDFs as I want

      This feels like a rewrite of the whole file field functionality.

      I think,
      * creating a SHA256 hash
      * storing the hash in the file field
      * maintaining a mapping from hash to real path in JabRef
      should solve this problem.

      The mapping from hash to real path can be updated on demand: Each time
      a file is requested, the stored path is checked. If the file is not
      found, all files not contained in the mapping are hashed and added to
      the mapping. This hashing is aborted if the file with the searched
      hash is found.
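
The on-demand update described above can be sketched roughly as follows. This is only an illustration under assumptions: the class `HashedFileIndex` and its methods are hypothetical and not part of JabRef, the mapping lives in memory only, and a stored path is trusted as long as a file still exists there (its content is not re-verified).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical sketch of a lazily maintained hash -> path mapping.
final class HashedFileIndex {
    private final Path root;
    private final Map<String, Path> pathByHash = new HashMap<>();

    HashedFileIndex(Path root) {
        this.root = root;
    }

    /** SHA-256 of the file content as a lower-case hex string. */
    static String sha256(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns the current path for a hash, rescanning unmapped files only if needed. */
    Optional<Path> resolve(String hash) throws IOException {
        Path known = pathByHash.get(hash);
        if (known != null && Files.isRegularFile(known)) {
            return Optional.of(known); // stored path is still valid
        }
        pathByHash.remove(hash); // stale or unknown: forget it and rescan
        Set<Path> alreadyMapped = new HashSet<>(pathByHash.values());
        try (Stream<Path> files = Files.walk(root)) {
            Iterator<Path> it = files.filter(Files::isRegularFile).iterator();
            while (it.hasNext()) {
                Path candidate = it.next();
                if (alreadyMapped.contains(candidate)) {
                    continue; // only files not yet in the mapping are hashed
                }
                String candidateHash = sha256(candidate);
                pathByHash.put(candidateHash, candidate);
                if (candidateHash.equals(hash)) {
                    return Optional.of(candidate); // abort the scan once found
                }
            }
        }
        return Optional.empty();
    }
}
```

A real implementation would also have to persist the mapping and decide how to treat two files with identical content, which this sketch leaves open.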

      When I get a bunch of XMP tagged articles from a colleague, I want both to
      add them to my database, and to move the files into appropriate
      subdirectories.

      How is "appropriate subdirectory" determined? Based on a keyword? Or
      can't it be done automatically?

      The 'find unlinked files' tool would of course
      then do its job (corresponding entries do not exist yet), if it
      were able to create entries based on the XMP data.

      Please try the "Create entry based on XMP data" feature and tell us
      whether it works.

      Thanks for your offer, I have subscribed to
      jabref-devel and might bother you with questions soon ;-)

      You are very welcome to do so!

      Cheers,

      Oliver

       
      • Adrian Daerr

        Adrian Daerr - 2013-04-04

        Hi Oliver,

        it does not offer to create bibtex entries from XMP data

        I think "Create entry based on XMP data" should do exactly that. Doesn't it work for you?

        I do not see that option in the 'find unlinked files' tool. This is what the window looks like on 2.10dev for me:

        http://www.msc.univ-paris-diderot.fr/~daerr/tmp/jabref2.10dev_find-unlinked-files.png

        I do have the option to "Create entry based on XMP data" when I drag and drop files, so the building blocks are there.

        it does not offer to attribute the files to existing entries (based e.g. on XMP data)

        I think that could be accomplished by creating new functionality based on "Update empty fields with data fetched from Mr.dLib" and "Create entry based on XMP data".

        Yes, sounds good.

        I like to be able to name my PDFs as I want

        This feels like a rewrite of the whole file field functionality.

        I think,
        - creating a SHA256 hash
        - storing the hash in the file field
        - maintaining a mapping from hash to real path in JabRef
        should solve this problem.

        I also thought about such a solution first, which would have the advantage of not needing to modify the files themselves (by adding XMP data), but why not just use the DOI, which most documents nowadays have and which is also unique (even more so, by definition)? As a side effect it encourages the use of XMP, which I feel is a good thing.
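
If the DOI is used as the key, the two DOI strings need to be compared in a normalized form, since DOI names are case-insensitive and are often written with a "doi:" or resolver URL prefix. A minimal sketch, assuming the DOI strings have already been read from the bibtex entry and from the file's XMP data (the class `DoiKey` and its methods are hypothetical, not JabRef API):

```java
import java.util.Locale;

// Hypothetical helper: brings DOI strings into one comparable form.
final class DoiKey {
    private static final String[] PREFIXES = {
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/",
        "doi:"
    };

    /** Returns the bare lower-case DOI ("10.xxxx/..."), or "" if it does not look like one. */
    static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        String doi = raw.trim().toLowerCase(Locale.ROOT);
        for (String prefix : PREFIXES) {
            if (doi.startsWith(prefix)) {
                doi = doi.substring(prefix.length()).trim();
                break;
            }
        }
        // Every DOI name starts with the directory indicator "10."
        return doi.startsWith("10.") ? doi : "";
    }

    /** True only when both values normalize to the same non-empty DOI. */
    static boolean sameDoi(String a, String b) {
        String x = normalize(a);
        return !x.isEmpty() && x.equals(normalize(b));
    }
}
```

With such a key, relinking a moved file reduces to looking up the file whose XMP DOI satisfies `sameDoi` with the entry's doi field; files without a DOI would still need another mechanism.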

        When I get a bunch of XMP tagged articles from a colleague, I want both to add them to my database, and to move the files into appropriate subdirectories.

        How is "appropriate subdirectory" determined?

        It is just a rough (and evolving) subdivision into categories which I use to organise my files. I agree I could use the keyword functionality in JabRef; the subdirectory organisation simply predates my use of JabRef, and I grew accustomed to grouping similar papers in directories.

        cheers,
        Adrian