To make JabRef smarter about linking PDFs to database entries, I'd like it to exploit the XMP metadata when present (currently autolinking is based on regexp filename matching only). In particular the DOI, being unique by construction, seems a good candidate for linking BibTeX entries to files. I have a few questions on how to go about it:
(a) How should this matching behave? When comparing the BibTeX entry with the entry reconstructed from XMP data, should we require that all fields match, that a selection of fields match (e.g. matches(doi) AND matches(title)), that any of a number of fields match (e.g. matches(doi) OR matches(title)), or do we need a mix (e.g. 'matches(doi) OR (matches(title) AND matches(authors))')? The last option is the most flexible, but obviously also the hardest to implement...
(b) To developers acquainted with The Source: how should I best go about it? It seems the name-matching part of autolinking is done in net.sf.jabref.Util.findAssociatedFiles(); shall I extend that method, or do you have a better suggestion?
Adrian
In order for JabRef to be smarter about linking PDFs from database entries,
I'd like it to exploit the XMP metadata when present (currently autolinking
is based on (regexp) filename matching only).
We also have the "find unlinked files" feature, which does more.
Furthermore, there is the possibility to drag and drop a PDF into JabRef.
Both call the same import behavior.
(a) How should this matching behave ? When comparing the bibtex entry and
the entry reconstructed from XMP data, should we require that all fields
match, that a selection of fields match (e.g. matches(doi) AND
matches(title)), that any of a number of fields match (e.g. matches(doi) OR
matches(title)), or do we need a mix of that (e.g. 'matches(doi) OR
(matches(title) AND matches(authors))') ? The last option is the most
flexible, but obviously also the hardest to implement...
I currently have no idea about that.
Maybe the "Search" / "Find duplicates" functionality provides some
support on that?
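For illustration, the mixed rule from question (a) can be written by composing small per-field predicates. This is only a sketch under stated assumptions: entries are represented as plain field-to-value maps rather than JabRef's own entry class, the class and method names (`EntryMatchSketch`, `matches`, `normalise`) are hypothetical, the BibTeX field name `author` stands in for "authors", and the normalisation is deliberately crude.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical sketch: composable field-matching rules for two entries given as field maps. */
class EntryMatchSketch {

    /** Rule that holds when both entries carry the field and the normalised values are equal. */
    static BiPredicate<Map<String, String>, Map<String, String>> matches(String field) {
        return (a, b) -> {
            String x = a.get(field);
            String y = b.get(field);
            return x != null && y != null && normalise(x).equals(normalise(y));
        };
    }

    /** Crude normalisation (an assumption): trim, collapse whitespace, ignore case. */
    static String normalise(String value) {
        return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** The mixed rule from the question: matches(doi) OR (matches(title) AND matches(author)). */
    static final BiPredicate<Map<String, String>, Map<String, String>> DOI_OR_TITLE_AND_AUTHOR =
            matches("doi").or(matches("title").and(matches("author")));

    public static void main(String[] args) {
        Map<String, String> bibEntry =
                Map.of("doi", "10.1000/xyz123", "title", "A Title", "author", "Doe, J.");
        Map<String, String> xmpEntry = Map.of("doi", "10.1000/XYZ123");
        System.out.println(DOI_OR_TITLE_AND_AUTHOR.test(bibEntry, xmpEntry));
    }
}
```

Because `BiPredicate` already provides `and`, `or` and `negate`, the "most flexible" option costs little more than the simpler ones once a per-field `matches` rule exists; the harder part is probably deciding how tolerant the per-field comparison should be.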
(b) To developers acquainted with The Source: How should I best go about it
? It seems the name matching part in autolinking is done in
net.sf.jabref.Util.findAssociatedFiles(), shall I extend that method or do
you have a better suggestion ?
That method feels wrong, since it is a low-level method.
I think embedding this in the importers is the right approach. Have a look at
how the "PDFContentImporter" is embedded. I hope that helps. Feel free
to ask in a PM or on IRC. Or we could go through JabRef using
TeamViewer :)
We also have the "find unlinked files" feature, which does more.
I saw that, but
- it does not offer to create BibTeX entries from XMP data
- it does not offer to attribute the files to existing entries (based e.g. on XMP data)
Let me give a few more details on the cases where I would like to have more support in linking entries and PDFs (and maybe I'll stand corrected if I have just missed existing tools):
I like to be able to name my PDFs as I want (not necessarily obeying a naming convention I have to decide on and then stick to forever), rename them as I please, and/or move them to another subdirectory (e.g. when at some point I have a bunch of papers that I want nicely grouped under a new topic). With any of those actions I lose the entry's links, and I would like JabRef to find the files under their new (path)name faster than by
- searching for and deleting broken links,
- finding unlinked files and creating corresponding entries, and
- finally cleaning up duplicates with the 'merge entries' tool.
When I get a bunch of XMP tagged articles from a colleague, I want both to add them to my database, and to move the files into appropriate subdirectories.
If I first add them to the database and then move each file into
its appropriate subdir, I break the links in the newly created
entries and find myself in the situation described before
(existing entries, existing files but broken links).
If I first store the files where I want them, there is no easy
dragging and dropping any more because by then they are all in
different places. The 'find unlinked files' tool would of course
then do its job (corresponding entries do not exist yet), if it
were able to create entries based on the XMP data.
Furthermore, there is the possibility to drag and drop a PDF into JabRef. Both call the same import behavior.
OK, I'll look at the source of the importers to understand how they work.
That method feels wrong, since it is a low level method.
I think the embedding in the importers is right. Have a look at the
embedding of the "PDFContentImporter". I hope, that helps. Feel free
to ask in a PM or on IRC. Or we could go through JabRef using
TeamViewer :)
I saw the tools have evolved a bit, but I do not have a good picture of how things are organised. Thanks for your offer, I have subscribed to jabref-devel and might bother you with questions soon ;-)
cheers,
Adrian
it does not offer to create bibtex entries from XMP data
I think "Create entry based on XMP data" should do exactly that.
Doesn't it work for you?
it does not offer to attribute the files to existing entries (based e.g. on
XMP data)
I think that could be accomplished by creating new functionality
based on "Update empty fields with data fetched from Mr.dLib" and
"Create entry based on XMP data".
I like to be able to name my PDFs as I want
This feels like a rewrite of the whole file field functionality.
I think,
* creating a SHA256 hash
* storing the hash in the file field
* maintaining a mapping from hash to real path in JabRef
should solve this problem.
The mapping from hash to real path can be updated on demand: Each time
a file is requested, the stored path is checked. If the file is not
found, all files not contained in the mapping are hashed and added to
the mapping. This hashing is aborted if the file with the searched
hash is found.
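The on-demand update described above could look roughly like the following. This is a sketch, not JabRef code: the class name `HashFileIndexSketch`, the single root directory and the in-memory map are all assumptions made for illustration, and a stored path is only checked for existence, not re-hashed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;

/** Hypothetical sketch of a hash-to-path mapping that is refreshed only when a lookup fails. */
class HashFileIndexSketch {
    private final Path root;                                  // directory tree holding the PDFs (assumption)
    private final Map<String, Path> pathByHash = new HashMap<>();

    HashFileIndexSketch(Path root) {
        this.root = root;
    }

    /** Hex-encoded SHA-256 of the file content. */
    static String sha256(Path file) throws IOException {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(Files.readAllBytes(file))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);               // SHA-256 is mandatory on every Java platform
        }
    }

    /** Returns the current path for the hash, rescanning unknown files only if the stored path is gone. */
    Optional<Path> resolve(String hash) throws IOException {
        Path known = pathByHash.get(hash);
        if (known != null && Files.isRegularFile(known)) {
            return Optional.of(known);                        // stored path still exists
        }
        pathByHash.remove(hash);                              // drop the stale entry, if any
        try (Stream<Path> files = Files.walk(root)) {
            Iterator<Path> it = files.filter(Files::isRegularFile).iterator();
            while (it.hasNext()) {
                Path candidate = it.next();
                if (pathByHash.containsValue(candidate)) {
                    continue;                                 // already in the mapping, skip hashing
                }
                String candidateHash = sha256(candidate);
                pathByHash.put(candidateHash, candidate);
                if (candidateHash.equals(hash)) {
                    return Optional.of(candidate);            // abort hashing once the file is found
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java HashFileIndexSketch <directory> <sha256-hex>
        HashFileIndexSketch index = new HashFileIndexSketch(Paths.get(args[0]));
        System.out.println(index.resolve(args[1]).map(Path::toString).orElse("not found"));
    }
}
```

The linear `containsValue` check keeps the sketch short; a real implementation would likely keep a second set of known paths and persist the mapping between sessions.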
When I get a bunch of XMP tagged articles from a colleague, I want both to
add them to my database, and to move the files into appropriate
subdirectories.
How is "appropriate subdirectory" determined? Based on a keyword? Or
can't it be done automatically?
The 'find unlinked files' tool would of course
then do its job (corresponding entries do not exist yet), if it
were able to create entries based on the XMP data.
Please try the "Create entry based on XMP data" feature and tell us
whether it works.
Thanks for your offer, I have subscribed to
jabref-devel and might bother you with questions soon ;-)
You are very welcome to do so!
Cheers,
Oliver
I do have the option to "Create entry based on XMP data" when I drag and drop files, so the building blocks are there.
it does not offer to attribute the files to existing entries (based e.g. on XMP data)
I think, that could be accomplished by creating a new functionality based on "Update empty fields with data fetched from Mr.dLib" and "Create entry based on XMP data".
Yes, sounds good.
I like to be able to name my PDFs as I want
This feels like a rewrite of the whole file field functionality.
I think,
- creating a SHA256 hash
- storing the hash in the file field
- maintaining a mapping from hash to real path in JabRef
should solve this problem.
I also thought about such a solution first, which would have the advantage of not needing to modify the files themselves (by adding XMP data), but why not just use the DOI, which most documents nowadays have and which is also unique (even more so, by definition)? As a side effect it encourages the use of XMP, which I feel is a good thing.
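If the DOI is used as the linking key, the two sides need to be brought to a common form before comparing, since DOI names are case-insensitive and are often stored with a `doi:` or resolver-URL prefix. A minimal sketch of such a comparison follows; the class `DoiKeySketch`, its methods and the regular expression are assumptions for illustration, not part of JabRef, and the pattern is intentionally simple.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: reduce DOI strings from BibTeX or XMP to one comparable key. */
class DoiKeySketch {
    // Optional "doi:" or resolver-URL prefix, then "10.<registrant>/<suffix>".
    private static final Pattern DOI = Pattern.compile(
            "^(?:https?://(?:dx\\.)?doi\\.org/|doi:\\s*)?(10\\.\\d{4,9}/\\S+)$",
            Pattern.CASE_INSENSITIVE);

    /** Lower-cased bare DOI, or empty if the string does not look like a DOI. */
    static Optional<String> key(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        Matcher m = DOI.matcher(raw.trim());
        return m.matches()
                ? Optional.of(m.group(1).toLowerCase(Locale.ROOT))
                : Optional.empty();
    }

    /** True only when both strings are recognisable DOIs with the same key. */
    static boolean sameDoi(String a, String b) {
        Optional<String> ka = key(a);
        return ka.isPresent() && ka.equals(key(b));
    }

    public static void main(String[] args) {
        System.out.println(sameDoi("doi:10.1000/XYZ123", "https://doi.org/10.1000/xyz123"));
    }
}
```

Files without a DOI in their XMP data would still need a fallback such as filename matching or a content hash.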
When I get a bunch of XMP tagged articles from a colleague, I want both to add them to my database, and to move the files into appropriate subdirectories.
How is "appropriate subdirectory" determined?
It is just a rough (and evolving) subdivision into categories which I use to organise my files. I agree I could use the keyword functionality in JabRef; the subdirectory organisation simply predates my use of JabRef, and I grew accustomed to grouping similar papers in directories.
cheers,
Adrian
Hi,
Please join our developers mailing list
https://lists.sourceforge.net/lists/listinfo/jabref-devel to have a
discussion with other developers there.
Cheers,
Oliver
Hello Oliver,
Thanks for your comments.
cheers,
Adrian
Hi Oliver,
I do not see that option in the 'find unlinked files' tool. This is what the window looks like in 2.10dev for me:
http://www.msc.univ-paris-diderot.fr/~daerr/tmp/jabref2.10dev_find-unlinked-files.png
cheers,
Adrian