From: stuart yeates <stuart.yeates@vu...> - 2009-02-24 20:00:57
Does anyone have a script that checks all of the previously uploaded
PDFs, finds the ones that are malformed, and reports their URLs/record IDs?
I can see how to write a script that uses the unix command line 'file'
and 'pdftops' tools to check that every file that looks like a PDF is a
good and valid PDF. What I'm not sure about is how to get from a file on
the disk to its database record.
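For illustration, the file-level check could be sketched in Python roughly as follows. This is only a sketch under assumptions: the function names are made up, a plain check of the leading `%PDF-` bytes stands in for the 'file' tool, and an exit status of 0 from 'pdftops' is treated as "valid", which may let through some damaged files that pdftops merely warns about.

```python
import shutil
import subprocess


def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes '%PDF-'.

    Stand-in for the 'file' tool; only the first five bytes are inspected.
    """
    with open(path, "rb") as fh:
        return fh.read(5) == b"%PDF-"


def converts_cleanly(path, tool="pdftops"):
    """Run pdftops on the file and return True if it exits with status 0.

    The PostScript output is discarded. Raises RuntimeError if the tool
    is not installed.
    """
    if shutil.which(tool) is None:
        raise RuntimeError(f"{tool} not found on PATH")
    result = subprocess.run(
        [tool, path, "-"],            # "-" sends the PostScript to stdout
        stdout=subprocess.DEVNULL,    # we only care about the exit status
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0


def find_bad_pdfs(paths):
    """Yield each path that looks like a PDF but fails the pdftops run."""
    for path in paths:
        if looks_like_pdf(path) and not converts_cleanly(path):
            yield path
```

Files that do not start with `%PDF-` are skipped without ever invoking pdftops, so the expensive step only runs on likely PDFs.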
http://www.nzetc.org/ New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository
From: Kim Shepherd <kims@wa...> - 2009-02-24 23:02:47
Example assetstore file:
The filename itself is in bitstream.internal_id in the DSpace database, and the directory names are just the first 6 digits of the internal ID.
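As a rough sketch of that mapping in Python: the helper names and the `/dspace/assetstore` root in the usage note are hypothetical, and the split of the first 6 digits into three two-digit directory levels is an assumption about the default assetstore layout that should be checked against your own installation.

```python
import os


def assetstore_path(assetstore_root, internal_id):
    """Build the on-disk path for a bitstream from its internal_id.

    Assumption: the first 6 digits form three two-digit directory levels.
    """
    levels = [internal_id[i:i + 2] for i in (0, 2, 4)]
    return os.path.join(assetstore_root, *levels, internal_id)


def internal_id_from_path(path):
    """Recover the internal_id from a file path: it is just the filename."""
    return os.path.basename(path)
```

For example, `assetstore_path("/dspace/assetstore", "12345678901234567890")` would point at a file named `12345678901234567890` under `12/34/56/`, and `internal_id_from_path` on that path gives back the value to look up in bitstream.internal_id.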
Here's a SQL query that resolves internal_ids to item_id (aka record ID) and handle (which should tie into URL):
select item.item_id, handle, bitstream.internal_id
from item, item2bundle, bundle2bitstream, handle, bitstream
where item.item_id = item2bundle.item_id
  and item2bundle.bundle_id = bundle2bitstream.bundle_id
  and bundle2bitstream.bitstream_id = bitstream.bitstream_id
  and handle.resource_id = item.item_id
  and handle.resource_type_id = 2;  -- 2 = item, so community/collection handles with the same ID are not matched
I've never looked at writing a script based on this (we are just doing the standard checksum checking at the moment) but it shouldn't be too difficult.
(If you want to cut down on analysing non-PDFs with 'file', you could also use bitstream.bitstream_format_id to build a list of PDFs before running the filesystem-level tools.)
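One way the pieces might be tied together, sketched in Python: export the query result as CSV (columns item_id, handle, internal_id), load it into a lookup keyed by internal_id, and print a line for each file that failed validation. The function names, the tab-separated report format, and the `http://hdl.handle.net/` base URL are assumptions for illustration; substitute your repository's own handle URL prefix as needed.

```python
import csv
import io


def load_lookup(csv_text):
    """Map internal_id -> (item_id, handle) from CSV rows of
    item_id,handle,internal_id (the column order of the query above)."""
    lookup = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines in the export
        item_id, handle, internal_id = row
        lookup[internal_id] = (int(item_id), handle)
    return lookup


def report(bad_internal_ids, lookup, base_url="http://hdl.handle.net/"):
    """Return one tab-separated line per bad file: item_id, URL, internal_id.

    Files with no matching database row are reported with '?' placeholders.
    """
    lines = []
    for internal_id in bad_internal_ids:
        if internal_id in lookup:
            item_id, handle = lookup[internal_id]
            lines.append(f"{item_id}\t{base_url}{handle}\t{internal_id}")
        else:
            lines.append(f"?\t?\t{internal_id}")
    return lines
```

Keying the lookup on internal_id means each bad file costs a single dictionary lookup, and the database is only queried once for the whole run.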
IRR Technical Specialist
ITS Systems & Development
The University of Waikato
DDI +64 7 838 4025
> -----Original Message-----
> From: stuart yeates [mailto:stuart.yeates@...]
> Sent: Wednesday, 25 February 2009 9:03 a.m.
> To: dspace-tech@...
> Subject: [Dspace-tech] script to validate all PDFs ?