Re: [Refdb-users] New user and questions :-)


Janusz S. Bień writes:
 > There are probably some tools already for extracting metadata from
 > PDFs. I think such facility is in particular built into Greenstone
 >
 >                 http://www.greenstone.org/cgi-bin/library
 >
 > It is GPLed, so the relevant code can be reused.
 >

Thanks for the pointer. They do appear to have some tools that would be
useful for this purpose, but they also mention that these are of limited
use with certain kinds of PDF files (e.g. PDFs of older articles that
contain scanned page images instead of text).

At least the newer PDFs all seem to contain a DOI in the document
properties. These are accessible e.g. through a Perl API, see

http://search.cpan.org/~areibens/PDF-API2-0.51/lib/PDF/API2.pm

The $pdf->info method seems to return the metadata, with the DOI
usually in the title field. This may be a good starting point, at
least for newer PDFs.
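
Once the metadata has been read, picking the DOI out of it could look
roughly like the sketch below. It is written in Python rather than Perl
so that it runs without any PDF library. The dict is only a stand-in for
whatever $pdf->info returns, the function name find_doi and the sample
values are made up for illustration, and the regular expression is an
assumption about what typical DOIs look like, not a complete DOI parser.

```python
import re

# Assumed DOI shape: "10.", a 4-9 digit prefix, "/", then a suffix
# running up to the next whitespace, quote, or angle bracket.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(info):
    """Return the first DOI found in the metadata fields, or None.

    `info` is a plain dict standing in for the PDF info metadata
    (hypothetical keys such as "Title", "Author", "Subject").
    """
    # Look at the title first, since that is where the DOI usually
    # sits, then fall back to the remaining fields.
    keys = ["Title"] + [k for k in info if k != "Title"]
    for key in keys:
        value = info.get(key)
        if not value:
            continue
        match = DOI_RE.search(str(value))
        if match:
            # Drop trailing punctuation picked up from surrounding text.
            return match.group(0).rstrip(".,;)")
    return None


# Made-up sample metadata, not taken from a real PDF.
sample = {"Title": "doi:10.1000/example.123", "Author": "A. Author"}
print(find_doi(sample))
```

For a scanned PDF without usable metadata the function simply returns
None, so such files can be flagged for manual entry.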

regards
Markus

-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de