JabRef / Discussion / Open Discussion: autolink files using metadata (XMP)

Adrian Daerr - 2013-04-01

In order for JabRef to be smarter about linking PDFs from database entries, I'd like it to exploit the XMP metadata when present (currently autolinking is based on (regexp) filename matching only). In particular the DOI, being by construction unique, seems to be a nice candidate for linking bibtex entries to files. I have a few questions on how to go about it:
(a) How should this matching behave ? When comparing the bibtex entry and the entry reconstructed from XMP data, should we require that all fields match, that a selection of fields match (e.g. matches(doi) AND matches(title)), that any of a number of fields match (e.g. matches(doi) OR matches(title)), or do we need a mix of that (e.g. 'matches(doi) OR (matches(title) AND matches(authors))') ? The last option is the most flexible, but obviously also the hardest to implement...
(b) To developers acquainted with The Source: How should I best go about it ? It seems the name matching part in autolinking is done in net.sf.jabref.Util.findAssociatedFiles(), shall I extend that method or do you have a better suggestion ?
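
To make option (a) concrete, here is a rough sketch of the mixed rule 'matches(doi) OR (matches(title) AND matches(authors))'. It is not JabRef code: the `Record` class, the normalisation helpers and `same_document` are names invented for illustration, and Python is used only for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a BibTeX entry or for metadata read from XMP."""
    doi: str = ""
    title: str = ""
    authors: list[str] = field(default_factory=list)


def norm(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def norm_doi(doi: str) -> str:
    """Lower-case the DOI and strip a few common prefixes (assumed list)."""
    d = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if d.startswith(prefix):
            d = d[len(prefix):]
    return d


def matches_doi(a: Record, b: Record) -> bool:
    # An empty DOI never matches, otherwise two untagged records would be "equal".
    return bool(a.doi) and bool(b.doi) and norm_doi(a.doi) == norm_doi(b.doi)


def matches_title(a: Record, b: Record) -> bool:
    return bool(a.title) and norm(a.title) == norm(b.title)


def matches_authors(a: Record, b: Record) -> bool:
    # Order-insensitive comparison of normalised author names.
    return bool(a.authors) and sorted(map(norm, a.authors)) == sorted(map(norm, b.authors))


def same_document(entry: Record, xmp: Record) -> bool:
    """The mixed rule: DOI alone suffices, otherwise title AND authors must agree."""
    return matches_doi(entry, xmp) or (
        matches_title(entry, xmp) and matches_authors(entry, xmp)
    )
```

Because the rule is just a boolean expression over small predicates, the simpler variants (all fields, any field) are special cases of the same structure.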

Adrian

- Oliver Kopp - 2013-04-01
  
  Hi,
  
  In order for JabRef to be smarter about linking PDFs from database entries,
  I'd like it to exploit the XMP metadata when present (currently autolinking
  is based on (regexp) filename matching only).
  
  We also have the "find unlinked files" feature, which does more.
  
  Furthermore, there is the possibility to drag and drop a PDF into JabRef.
  
  Both call the same import behavior.
  
  (a) How should this matching behave ? When comparing the bibtex entry and
  the entry reconstructed from XMP data, should we require that all fields
  match, that a selection of fields match (e.g. matches(doi) AND
  matches(title)), that any of a number of fields match (e.g. matches(doi) OR
  matches(title)), or do we need a mix of that (e.g. 'matches(doi) OR
  (matches(title) AND matches(authors))') ? The last option is the most
  flexible, but obviously also the hardest to implement...
  
  I currently have no idea about that.
  Maybe the "Search" / "Find duplicates" functionality provides some
  support on that?
  
  (b) To developers acquainted with The Source: How should I best go about it
  ? It seems the name matching part in autolinking is done in
  net.sf.jabref.Util.findAssociatedFiles(), shall I extend that method or do
  you have a better suggestion ?
  
  That method feels wrong, since it is a low-level method.
  I think the embedding in the importers is right. Have a look at the
  embedding of the "PDFContentImporter". I hope that helps. Feel free
  to ask in a PM or on IRC. Or we could go through JabRef using
  TeamViewer :)
  
  Please join our developers' mailing list
  https://lists.sourceforge.net/lists/listinfo/jabref-devel to have a
  discussion with other developers there.
  
  Cheers,
  
  Oliver
  

Adrian Daerr - 2013-04-03

Hello Oliver,
Thanks for your comments.

We also have the "find unlinked files" feature, which does more.

I saw that, but

- it does not offer to create bibtex entries from XMP data
- it does not offer to attribute the files to existing entries (based e.g. on XMP data)

Let me give a bit more details on the cases where I would like to have more support in linking entries and PDFs (and maybe I'll stand corrected if I have just missed existing tools):

I like to be able to name my PDFs as I want (not necessarily obeying a naming convention I have to decide on and then stick to forever), rename them as I feel, and/or move them to another subdirectory (e.g. when at some point I have a bunch of papers that I want nicely grouped under a new topic). With any of those actions I lose the entry's links, and I would like JabRef to find the files under their new (path)name faster than by

- searching for and deleting broken links,
- finding unlinked files and creating corresponding entries, and
- finally cleaning up duplicates with the 'merge entries' tool.

When I get a bunch of XMP tagged articles from a colleague, I want both to add them to my database, and to move the files into appropriate subdirectories.

If I first add them to the database and then move each file into
its appropriate subdir, I break the links in the newly created
entries and find myself in the situation described before
(existing entries, existing files but broken links).

If I first store the files where I want them, there is no easy
dragging and dropping any more because by then they are all in
different places. The 'find unlinked files' tool would of course
then do its job (corresponding entries do not exist yet), if it
were able to create entries based on the XMP data.
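
The relinking wished for in both cases above (existing entries, existing files, broken links) could be sketched roughly as follows. This is only an illustration under stated assumptions: entries are plain dicts with `doi` and `file` keys, and `read_doi` is a hypothetical callable that returns the DOI stored in a file's XMP metadata (or `None`). Neither is JabRef's actual data model or API.

```python
import os
from typing import Callable, Dict, List, Optional


def index_files_by_doi(
    root: str, read_doi: Callable[[str], Optional[str]]
) -> Dict[str, str]:
    """Walk `root` and map each DOI found in a PDF's metadata to that PDF's path."""
    index: Dict[str, str] = {}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(dirpath, name)
            doi = read_doi(path)  # hypothetical XMP reader supplied by the caller
            if doi:
                # Keep the first file seen for a DOI; later duplicates are ignored.
                index.setdefault(doi.strip().lower(), path)
    return index


def relink(
    entries: List[dict], root: str, read_doi: Callable[[str], Optional[str]]
) -> List[dict]:
    """Repair broken file links via the DOI; return the entries that were changed."""
    index: Optional[Dict[str, str]] = None
    repaired = []
    for entry in entries:
        path = entry.get("file")
        if path and os.path.isfile(path):
            continue  # link is still valid, nothing to do
        doi = entry.get("doi", "").strip().lower()
        if not doi:
            continue  # without a DOI this sketch cannot identify the file
        if index is None:
            # Build the index lazily, only once a broken link is actually found.
            index = index_files_by_doi(root, read_doi)
        new_path = index.get(doi)
        if new_path:
            entry["file"] = new_path
            repaired.append(entry)
    return repaired
```

With something like this, renaming or moving files would not require deleting links, re-importing and merging duplicates; the entry keeps its identity and only the `file` value is updated.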

Furthermore, there is the possibility to drag and drop a PDF into JabRef. Both call the same import behavior.

OK, I'll look at the source of the importers to understand how they work.

That method feels wrong, since it is a low level method.
I think the embedding in the importers is right. Have a look at the
embedding of the "PDFContentImporter". I hope, that helps. Feel free
to ask in a PM or on IRC. Or we could go through JabRef using
TeamViewer :)

I saw the tools have evolved a bit, but I do not have a good picture of how things are organised. Thanks for your offer, I have subscribed to jabref-devel and might bother you with questions soon ;-)

cheers,
Adrian
- Oliver Kopp - 2013-04-04
  
  Hi Adrian,
  
  it does not offer to create bibtex entries from XMP data
  
  I think "Create entry based on XMP data" should do exactly that.
  Doesn't it work for you?
  
  it does not offer to attribute the files to existing entries (based e.g. on
  XMP data)
  
  I think that could be accomplished by creating new functionality
  based on "Update empty fields with data fetched from Mr.dLib" and
  "Create entry based on XMP data".
  
  I like to be able to name my PDFs as I want
  
  This feels like a rewrite of the whole file field functionality.
  
  I think,
  * creating a SHA256 hash
  * storing the hash in the file field
  * maintaining a mapping from hash to real path in JabRef
  should solve this problem.
  
  The mapping from hash to real path can be updated on demand: Each time
  a file is requested, the stored path is checked. If the file is not
  found, all files not contained in the mapping are hashed and added to
  the mapping. This hashing is aborted if the file with the searched
  hash is found.
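
As a rough sketch of the on-demand update just described (not JabRef code; `HashIndex`, `resolve` and `sha256_of` are made-up names, and the mapping is kept in memory only):

```python
import hashlib
import os


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class HashIndex:
    """Maps content hashes (as stored in the file field) to real paths."""

    def __init__(self, root):
        self.root = root
        self.path_by_hash = {}

    def resolve(self, file_hash):
        # 1. Check the stored path first (existence only, content is not re-hashed).
        known = self.path_by_hash.get(file_hash)
        if known and os.path.isfile(known):
            return known
        # 2. The stored path is stale or missing: forget it and scan for the file.
        self.path_by_hash.pop(file_hash, None)
        already_mapped = set(self.path_by_hash.values())
        for dirpath, _dirs, filenames in os.walk(self.root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if path in already_mapped:
                    continue  # only hash files not yet contained in the mapping
                found = sha256_of(path)
                self.path_by_hash[found] = path
                if found == file_hash:
                    return path  # abort hashing as soon as the file is found
        return None
```

One consequence of this design is that a file whose content changes (for instance, when XMP metadata is written into it) gets a new hash and is no longer found under the old one.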
  
  When I get a bunch of XMP tagged articles from a colleague, I want both to
  add them to my database, and to move the files into appropriate
  subdirectories.
  
  How is "appropriate subdirectory" determined? Based on a keyword? Or
  can't it be done automatically?
  
  The 'find unlinked files' tool would of course
  then do its job (corresponding entries do not exist yet), if it
  were able to create entries based on the XMP data.
  
  Please try the "Create entry based on XMP data" feature and tell us
  whether it works.
  
  Thanks for your offer, I have subscribed to
  jabref-devel and might bother you with questions soon ;-)
  
  You are very welcome to do so!
  
  Cheers,
  
  Oliver
  
  - Adrian Daerr - 2013-04-04
    
    Hi Oliver,
    
    it does not offer to create bibtex entries from XMP data
    
    I think "Create entry based on XMP data" should do exactly that. Doesn't it work for you?
    
    I do not see that option in the 'find unlinked files' tool. This is what the window looks like on 2.10dev for me:
    
    http://www.msc.univ-paris-diderot.fr/~daerr/tmp/jabref2.10dev_find-unlinked-files.png
    
    I do have the option to "Create entry based on XMP data" when I drag and drop files, so the building blocks are there.
    
    it does not offer to attribute the files to existing entries (based e.g. on XMP data)
    
    I think that could be accomplished by creating new functionality based on "Update empty fields with data fetched from Mr.dLib" and "Create entry based on XMP data".
    
    Yes, sounds good.
    
    I like to be able to name my PDFs as I want
    
    This feels like a rewrite of the whole file field functionality.
    
    I think,
    - creating a SHA256 hash
    - storing the hash in the file field
    - maintaining a mapping from hash to real path in JabRef
    should solve this problem.
    
    I also thought about such a solution first, which would have the advantage of not needing to modify the files themselves (by adding XMP data). But why not just use the DOI, which most documents have nowadays and which is also unique (even more so, by definition)? As a side effect it encourages the use of XMP, which I feel is a good thing.
    
    When I get a bunch of XMP tagged articles from a colleague, I want both to add them to my database, and to move the files into appropriate subdirectories.
    
    How is "appropriate subdirectory" determined?
    
    It is just a rough (and evolving) subdivision into categories which I use to organise my files. I agree I could use the keyword functionality in JabRef; the subdirectory organisation simply predates my use of JabRef and I grew accustomed to grouping similar papers in directories.
    
    cheers,
    Adrian
    