Re: [Refdb-users] New user and questions :-)
From: Markus H. <mar...@mh...> - 2006-03-03 23:45:06
Janusz S. Bień writes:
> There are probably some tools already for extracting metadata from
> PDFs. I think such facility is in particular built into Greenstone
>
> http://www.greenstone.org/cgi-bin/library
>
> It is GPLed, so the relevant code can be reused.
>

Thanks for the pointer. They do have some tools that look useful for this purpose, but they also mention the limited utility with particular kinds of PDF files (e.g. PDFs of older articles that contain scanned page images instead of text). At least the newer PDFs all seem to contain a DOI in the document properties. These are accessible e.g. through a Perl API, see

http://search.cpan.org/~areibens/PDF-API2-0.51/lib/PDF/API2.pm

The $pdf->info function seems to return the metadata, with the DOI info usually in the title field. This may be a good starting point at least for newer PDFs.

regards
Markus

-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
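
To illustrate the DOI lookup described in the message, here is a minimal sketch in Python rather than Perl. It assumes the document properties have already been read into a plain dictionary by whatever PDF library is at hand; the function name find_doi, the dictionary keys, and the regular expression are illustrative assumptions, not part of PDF::API2 or RefDB. The pattern is a heuristic and will miss unusual DOIs.

```python
import re

# Heuristic DOI pattern (an assumption, not a full validator): "10.", a 4-9
# digit registrant prefix, "/", then a suffix that runs to the next
# whitespace, quote, or angle bracket.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(info):
    """Return the first DOI-like string found in a PDF info dictionary, or None.

    `info` is a hypothetical dict of document properties such as
    {"Title": ..., "Author": ..., "Subject": ...}.
    """
    # Check the title field first, since that is where the DOI usually sits,
    # then fall back to the remaining fields.
    keys = ["Title"] + [k for k in info if k != "Title"]
    for key in keys:
        value = info.get(key)
        if not isinstance(value, str):
            continue
        match = DOI_RE.search(value)
        if match:
            # Drop sentence punctuation that may trail the DOI.
            return match.group(0).rstrip(".,;")
    return None
```

A PDF made of scanned page images will typically have no DOI in its properties, in which case the function returns None and the reference would have to be identified some other way.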