

CVS current?

Lacan
2005-11-16
2013-05-28
  • Lacan
    2005-11-16

    Hi,

    You said 1 year ago:

    "Import/export will change quite a bit (see below). However, the main functionality will stay as it is today. Improvement of import/export capabilities is our main priority, right now. Additionally, we're planning..."

What's the status of this today? I have noticed that this is indeed one of RefBase's greatest weaknesses: importing refs from major sources like PubMed and ISI WoS, for example. I am interested in helping out here if possible...

    -joachim-

  • Import is still the most wanted feature addition to refbase, from the user as well as the developer perspective. However, not much has happened, for a variety of reasons.

    On the programming side, we'd like to provide a standard import interface (such as a standardized PHP array or object structure). This would enable developers of the different bibliophile projects to develop importers that would be interchangeable. I've started to work on such a standard interface but haven't found time to make the actual proposal on the bibliophile developers list. Although these standardization efforts delay things, I consider it very important.
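The standardized structure mentioned above was never spelled out in this thread. As a rough illustration only (refbase itself is PHP, and the field names here are hypothetical, not the actual refbase schema), such an interchangeable record structure and one format-specific importer targeting it might look like:

```python
# Hypothetical sketch of a standardized bibliographic record that
# interchangeable importers could all emit. Field names are illustrative.

def make_record(**fields):
    """Return a record dict with a fixed, known set of keys."""
    template = {
        "author": "",       # "Last, First; Last, First"
        "title": "",
        "publication": "",
        "volume": "",
        "pages": "",
        "year": "",
        "doi": "",
    }
    unknown = set(fields) - set(template)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    template.update(fields)
    return template

def import_ris(text):
    """Example importer: map minimal RIS tags onto the standard record."""
    tag_map = {"AU": "author", "TI": "title", "JO": "publication",
               "VL": "volume", "SP": "pages", "PY": "year", "DO": "doi"}
    record = make_record()
    for line in text.splitlines():
        # RIS lines look like "AU  - Amsler, C.D." (2-char tag, "  - ", value)
        if len(line) >= 6 and line[2:6] == "  - ":
            tag, value = line[:2], line[6:].strip()
            if tag in tag_map:
                key = tag_map[tag]
                record[key] = (record[key] + "; " + value) if record[key] else value
    return record
```

Any other format-specific importer (BibTeX, Endnote, ...) would then only need to emit this one structure.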

    Additionally, we'd like to provide a MODS XML importer right from the start, i.e., not hack our own BibTeX/Endnote/RIS/ISI/PubMed importers. Instead, we'd like to integrate bibutils for import, similar to the current export integration. By using bibutils in combination with a MODS XML importer, we'd immediately gain support for all of the above-mentioned formats. So this is where we want to go. The drawback is that this requires more development effort.
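A sketch of the MODS-reading half of such a pipeline, assuming a bibutils tool (e.g. `ris2xml` or `med2xml`) has already produced the MODS XML; the sample record below is trimmed to just the fields being read, and real bibutils output is richer:

```python
import xml.etree.ElementTree as ET

# Trimmed MODS record of the kind bibutils emits (illustrative sample).
MODS_SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Some article title</title></titleInfo>
  <name type="personal">
    <namePart type="family">Amsler</namePart>
    <namePart type="given">C.D.</namePart>
  </name>
  <originInfo><dateIssued>1991</dateIssued></originInfo>
</mods>"""

NS = {"m": "http://www.loc.gov/mods/v3"}  # MODS v3 namespace

def parse_mods(xml_text):
    """Pull a few basic fields out of a single MODS record."""
    root = ET.fromstring(xml_text)
    title = root.findtext("m:titleInfo/m:title", default="", namespaces=NS)
    family = root.findtext("m:name/m:namePart[@type='family']",
                           default="", namespaces=NS)
    given = root.findtext("m:name/m:namePart[@type='given']",
                          default="", namespaces=NS)
    year = root.findtext("m:originInfo/m:dateIssued", default="", namespaces=NS)
    return {"title": title,
            "author": f"{family}, {given}".strip(", "),
            "year": year}
```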

    If we could at least agree on the standard PHP interface mentioned above, then others could start hacking their own stuff.

    Matthias

  • Lacan
    2005-11-17

    Hi Matthias,

    That sounds really great and is indeed a feature that will be needed, but I have a slightly different solution to the problem. And instead of keeping it secret I'll just spill my guts here.

    Consider the following scenario, most common to researchers: 1) Read about something interesting someone has been doing, take down their name, and search ISI, DOI, or the journal directly. 2) Find the interesting article and immediately download its PDF. 3) [Here's the problem] Copy down all reference info and locations into your favourite literature program or website (like RefWorks, Connotea, RefBase, etc.)

    This takes time, and effort.

    My Solution(s): 

    1) Drag and drop the PDF to RefBase. Then RefBase will automagically extract the DOI reference number from the PDF and go to "http://dx.doi.org/" and extract the required fields and input them to RefBase with the PDF. Done!

    2) In case there is no DOI: Go to the article abstract (like in ISI or PubMed), "cut" the abstract text and "paste" it into your RefBase text parser box, which will then extract the fields. Then drag and drop your downloaded PDF to another entry field. Done!

    All the tools for doing this are out there; it's just a matter of incorporating them into RefBase.
    For (1) there are several: "pdftotext, pdftohtml, pdfsearch" and JavaScript libraries that do drag and drop. For (2) there is BibConverter from:

    http://www.unik.no/~fauske/bibconverter/
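For structured plain text, such a text-parser box is quite feasible. As a hedged sketch (this is not part of refbase or BibConverter), PubMed's MEDLINE display format — tag, dash, value, with continuation lines indented — can be parsed like this:

```python
def parse_medline(text):
    """Parse a pasted PubMed MEDLINE-format record into a dict of
    tag -> list of values. Tags are up to 4 characters, padded so the
    dash sits in column 5; continuation lines start with six spaces."""
    fields = {}
    tag = None
    for line in text.splitlines():
        if len(line) > 6 and line[4] == "-" and line[:4].strip():
            tag = line[:4].strip()
            value = line[6:].strip()
            fields.setdefault(tag, []).append(value)
        elif tag and line.startswith("      "):
            # continuation: append to the last value of the current tag
            fields[tag][-1] += " " + line.strip()
    return fields
```

Mapping the resulting tags (TI, AU, PMID, ...) onto refbase fields would then be a small lookup table.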

    Hey! I'm ready to do this, but I'm going to need some help.

    Best regards,

    -joachim-

  • Hi Joachim,

    > 1) Drag and drop the PDF to RefBase. Then RefBase will automagically
    > extract the DOI reference number from the PDF

    That's an interesting idea. Extracting all bibliographic data from a PDF (such as author, title, etc.) would not work (unless they are provided as meta info). But extracting the DOI number may work, since it's usually preceded by the string "DOI".
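A minimal sketch of that DOI heuristic, assuming the PDF's text has already been extracted (e.g. with `pdftotext article.pdf -`); the regex is a common approximation of DOI syntax, not an official grammar:

```python
import re

# DOIs start with "10." followed by a numeric registrant code, a slash,
# and a suffix. In PDFs the number is often preceded by "DOI" or "doi:",
# which this pattern tolerates but does not require.
DOI_RE = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)')

def find_doi(text):
    """Return the first DOI-like string in text, or None."""
    m = DOI_RE.search(text)
    if not m:
        return None
    # trim punctuation that often trails a DOI in running text
    return m.group(1).rstrip(".,;)")
```

The returned string could then be appended to http://dx.doi.org/ to resolve the record.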

    > and go to "http://dx.doi.org/" and extract the required fields and
    > input them to RefBase with the PDF. Done!

    The problem is that this would involve some heavy screen scraping, which is almost certain to stop working at some point in the future.

    Plus, each publisher site would require another screen scraper. I tried to do this once for publications on www.springerlink.com, and it was a tough job even for this single publisher site. Even worse, the SpringerLink web page layout differed between early volumes and more recent volumes, thus actually requiring two scraping mechanisms. For these reasons (and due to legal concerns), I never finished/published the script.

    I know that Richard Cameron (the developer of CiteULike) provides many screen scrapers (as bookmarklets) for a lot of different sites. So it's doable, but quite a task on its own.

    If we could get the bibliographic data in a structured form (preferably as XML), then this would be much more tempting to do.

    CrossRef.org offers their bibliographic data as XML, but AFAIK you must register with CrossRef. Registering seems to be free of charge for libraries and affiliates. So that might be an option...

    refbase currently autogenerates OpenURLs such as this:

    http://www.crossref.org/openurl?aulast=Amsler&title=Journal%20of%20Phycology&volume=27&issue=&spage=26&date=1991

    which, when clicked, takes the user directly to the record's details page provided by the journal publisher. This is very useful by itself (IMHO). However, as outlined above, I'm hesitant to get into the screen scraping business.

    If we'd append '&redirect=false' to the above URL example:

    http://www.crossref.org/openurl?aulast=Amsler&title=Journal%20of%20Phycology&volume=27&issue=&spage=26&date=1991&redirect=false

    CrossRef will not redirect to the publisher's site but will instead return an XML record containing the DOI and other identifiers (such as ISSN) as well as the exact journal & article titles. This is already pretty useful and could be used to pre-fill the record entry form.
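A sketch of how that could pre-fill the entry form. The query parameters come from the example URL above; the XML element names in the sample response are illustrative only and would need to be checked against CrossRef's actual response schema:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Build the non-redirecting CrossRef OpenURL query from the thread's example.
params = {
    "aulast": "Amsler",
    "title": "Journal of Phycology",
    "volume": "27",
    "spage": "26",
    "date": "1991",
    "redirect": "false",   # return XML instead of redirecting
}
query_url = "http://www.crossref.org/openurl?" + urlencode(params)

# Illustrative response shape; the DOI is a placeholder, and the element
# names are assumptions, not the documented CrossRef schema.
SAMPLE_RESPONSE = """<query status="resolved">
  <doi>10.0000/example.doi</doi>
  <issn type="print">0022-3646</issn>
  <journal_title>Journal of Phycology</journal_title>
</query>"""

def extract_ids(xml_text):
    """Pull the identifiers needed to pre-fill a record entry form."""
    root = ET.fromstring(xml_text)
    return {
        "doi": root.findtext("doi", default=""),
        "issn": root.findtext("issn", default=""),
        "journal": root.findtext("journal_title", default=""),
    }
```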

    And, as far as I understand things, CrossRef would return the full record of (basic) bibliographic metadata if you're a registered CrossRef member. See the bottom of the following page for an example of the XML data that gets returned to a registered member:

    http://crossref.org/03libraries/25query_spec.html

    > 2) In case there is no DOI: Go to article abstract (like in ISI or
    > PubMed) "cut" the article abstract text and "paste" it into your
    > RefBase text parser box, which will then extract the fields.

    Yes, but the same method of screen scraping would be involved. If others want to do this, that's fine. If a standard import mechanism (with a standard PHP structure) existed, then people could develop their own "import plugins". I would prefer such a solution.

    > All the tools for doing this is out there, its just a matter to incorporate
    > them to RefBase.
    > For (1) there are several: "pdftotext, pdftohtml,pdfsearch" and JavaScripts
    > that does Drag and Drop.

    Yes. However, one design principle for refbase was to avoid JavaScript whenever possible. That doesn't mean that we'll avoid stuff like JavaScript forever. In the (very) distant future we might provide an alternate (i.e., NOT a replacement) interface that does fancy things (using AJAX or something similar). But, in my opinion, the current interface should focus on broad interoperability. refbase even works in a text browser (such as lynx), which I think is a good thing. Plus, while I agree that drag and drop would be cool, I think that an upload button is not too difficult to use.

    Btw, the new refbase version in CVS offers a search/retrieve web service which uses standard formats for querying (SRU+CQL) and for returning data (SRW+MODS XML). The easiest way to implement a completely different interface for refbase would be to design custom XSLT stylesheets that output appropriate HTML, CSS, JavaScript, etc.
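For illustration, an SRU searchRetrieve request is just a URL with standardized parameters plus a CQL query. The host and endpoint path below are hypothetical; the parameter names (operation, version, query, recordSchema) are the standard SRU 1.1 ones:

```python
from urllib.parse import urlencode

# Hypothetical refbase host and endpoint path (illustrative only).
base = "http://demo.example.org/sru.php"
params = {
    "operation": "searchRetrieve",          # standard SRU 1.1 operation
    "version": "1.1",
    "query": 'author any "Amsler" and year = 1991',  # CQL query
    "recordSchema": "mods",                 # ask for MODS XML records back
}
sru_url = base + "?" + urlencode(params)
```

A custom front end would fetch such URLs and transform the returned SRW+MODS XML (e.g. via XSLT) into its own HTML.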

    Generally, I agree with you that refbase should somehow assist a user when entering new records by automatically fetching and pre-filling important bibliographic data. I had many plans in that direction but postponed them, since we haven't even completed basic features (such as import).

    Appreciate your ideas.

    Best regards, Matthias