
#41 PDF Search Upgrade

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2006-12-01
Created: 2006-12-01
Creator: Rich
Private: No

This is an upgrade to the search tool built into Exponent (v0.96.5). It adds the ability to search the contents of PDF documents.

At the moment the PDF extraction tool requires a licence. If an open source tool can be found, it can simply replace the current one.

(This has also been posted on the exponent forum.)

We've got some code working now that uses pdf2text: on file upload it converts the PDF to a text file. The text file is saved in the same directory as the PDF and given the same name with ‘.txt’ appended. When the search index is built, the system opens the .txt file and reads its contents into a string, which is then added to the body field of the search table. This also works when re-indexing with the spider button on the admin menu.
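As a rough illustration of the last step, here is a minimal sketch of how the extracted text ends up in the search body. The function name buildSearchBody is hypothetical, and strip_tags() is only a stand-in for Exponent's exponent_search_removeHTML(), so that the snippet runs on its own.

```php
<?php
// Hypothetical helper, not part of Exponent.
// strip_tags() stands in for exponent_search_removeHTML() here.
function buildSearchBody($description, $pdfText)
{
    // Pad with spaces in the same way as the search rows in this post.
    return ' ' . strip_tags($description) . ' ' . $pdfText . ' ';
}

$body = buildSearchBody('<p>Annual report</p>', 'Revenue grew');
```

The point is only that the PDF text is appended to the description, so both are searched as one field.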

The text conversion step can be replaced with almost anything, as all it does is convert the uploaded file to a .txt file. I'm sure there are open source tools that run on Linux platforms, but I'm on Windows. We've not found a suitable open source tool for PDF-to-text conversion that supports the latest version of the PDF file format.
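To show how small the swap could be, here is a sketch that builds a command line for pdftotext, the converter shipped with Xpdf and Poppler. This is an assumption, not something tested with Exponent: the helper name pdftotextCommand is hypothetical, the tool must be installed and on the PATH, and whether it copes with the PDF versions you need would have to be checked.

```php
<?php
// Hypothetical helper, not part of Exponent or pdf2text.
// Assumes the pdftotext command-line tool is installed and on the PATH.
function pdftotextCommand($pdfPath)
{
    // pdftotext takes the input PDF followed by the output text file.
    // The output name follows the "<pdf name>.txt" convention used in this post.
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($pdfPath . '.txt');
}

$command = pdftotextCommand('files/resourcesmodule/1164931200_manual.pdf');
// To run it: exec($command, $output, $status); a $status of 0 means success.
```

Because the rest of the search code only looks for the .txt file next to the PDF, nothing else would need to change.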

How to implement PDF search (using pdf2text)

Download the trial version from http://www.pdf2text.com/. The trial works for 15 days, which should be enough time to find out whether the tool does everything that is needed. For our purposes we need to buy the server edition unless we can find an open source solution.

Follow the install instructions for pdf2text. Then try the tool to make sure it works, using the sample code that can be found here: http://www.pdf2text.com/ConvertPDFToText-server-edition.htm#exphp

Edit the save.php in Resource Module
Open <root>\modules\resourcesmodule\actions\save.php
And add the convert PDF code after the following lines (near line 50):

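    $file = file::update('file',$directory,null,time().'_'.$_FILES['file']['name']);
    if (is_object($file)) {
        $resource->file_id = $db->insertObject($file,'file');

        # insert new code here >>>>>>

        # Convert PDF code

        # Build the path from the saved file object instead of calling time()
        # again, which can return a different second than the one used above.
        # (Assumes $file carries directory and filename, as the rows of the
        # file table do in spiderContent.)
        $fileU = $file->directory.'/'.$file->filename;

        # Create the object
        $p2t = new COM("P2TServer.P2T");

        # VerifyLicense always succeeds in the trial version.
        $p2t->VerifyLicense("4747457", "345srwr242342423");

        # Set control flags PDF_OUTPUTRANGE|PDF_OUTPUTPDFINFO.
        # '%%NUM' is the placeholder for the page number.
        # The range "5,10,11-15" presumably limits which pages are converted;
        # check the pdf2text documentation and adjust it for your documents.
        $p2t->EngageProcessor(64+32,"5,10,11-15", "#######################%%NUM#########");

        # Write the text next to the PDF as <pdf name>.txt
        $p2t->Convert($fileU, $fileU.'.txt');
        $p2t = null;

        # / Convert PDF code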
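The converter has to be pointed at exactly the name the upload was stored under. Since that name contains a timestamp, calling time() a second time can produce a different name if the clock ticks over in between. A minimal sketch of deriving both paths from a single timestamp follows; the helper uploadPaths and the example directory are hypothetical.

```php
<?php
// Hypothetical helper, not part of Exponent or pdf2text.
// Builds the stored PDF path and its companion text path from ONE timestamp.
function uploadPaths($directory, $originalName, $timestamp)
{
    $pdfPath = $directory . '/' . $timestamp . '_' . $originalName;
    return array('pdf' => $pdfPath, 'txt' => $pdfPath . '.txt');
}

$now = time(); // call time() once and reuse the value
$paths = uploadPaths('files/resourcesmodule', 'manual.pdf', $now);
```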

Create Text to String Function
Now create a new file in <root>\modules\resourcesmodule\ called txt2string.php
Fill it with the following code:

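<?php
// Read a converted text file and return its contents as a string.
// Returns an empty string if the file is missing or empty, so that a
// failed conversion does not break indexing.
function LoadText($filename){

    //make sure the file exists and has something in it
    if (!is_readable($filename) || filesize($filename) == 0) {
        return '';
    }

    //open file
    $fileToOpen = fopen($filename,"r");

    //set the content of the file into a variable
    $content = fread($fileToOpen, filesize($filename));

    //close the file
    fclose($fileToOpen);

    //convert the UTF-8 text to ISO-8859-1
    return utf8_decode($content);
}
?>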
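If the conversion fails for some upload, the .txt file will not exist when the spider runs. Below is a sketch of a more defensive reader built on file_get_contents(); the name loadTextSafe is hypothetical. Note that it returns the bytes unchanged, whereas LoadText applies utf8_decode(), which newer PHP versions (8.2 and later) flag as deprecated.

```php
<?php
// Hypothetical variant of LoadText(); not part of Exponent.
// Returns '' instead of raising warnings when the text file is unavailable.
function loadTextSafe($filename)
{
    if (!is_file($filename) || !is_readable($filename)) {
        return '';
    }
    $content = file_get_contents($filename);
    return $content === false ? '' : $content;
}

// Small demonstration with a temporary file.
$tmp = tempnam(sys_get_temp_dir(), 'p2t');
file_put_contents($tmp, "sample extracted text");
$loaded = loadTextSafe($tmp);
$missing = loadTextSafe($tmp . '.does-not-exist');
unlink($tmp);
```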

Modify the resource module's class.php
Open <root>\modules\resourcesmodule\class.php

Before the class definition for the resource module, add the following line (near line 20):

include 'txt2string.php';

Finally, replace the code of the spiderContent function, from the "if ($item) {" line down to the function's closing brace, with the following (this function is found at the end of the file):

    if ($item) {
        // Re-index a single resource: drop its old search row first.
        $db->delete('search',"ref_module='resourcesmodule' AND ref_type='resourceitem' AND original_id=" . $item->id);
        $search->original_id = $item->id;

        $contentText = '';

        $dbSelection = $db->selectObject('file','id='.$item->file_id);

        // For PDFs, pull in the text file written at upload time.
        // The extension check is case-insensitive so '.PDF' is matched too.
        if ($dbSelection && strtolower(substr($dbSelection->filename, -4)) == '.pdf'){
            $filename = $dbSelection->directory.'/'.$dbSelection->filename.'.txt';
            $contentText = LoadText($filename);
        }

        $search->body = ' ' . exponent_search_removeHTML($item->description) . ' '.$contentText.' ';
        $search->title = ' ' . $item->name . ' ';
        $search->location_data = $item->location_data;
        $search->view_link = 'index.php?module=resourcesmodule&action=view&id='.$item->id;
        $db->insertObject($search,'search');
    } else {
        // Full re-index: clear every resource row, then rebuild them all.
        $db->delete('search',"ref_module='resourcesmodule' AND ref_type='resourceitem'");
        foreach ($db->selectObjects('resourceitem') as $item) {
            $search->original_id = $item->id;

            $contentText = '';

            $dbSelection = $db->selectObject('file','id='.$item->file_id);

            if ($dbSelection && strtolower(substr($dbSelection->filename, -4)) == '.pdf'){
                $filename = $dbSelection->directory.'/'.$dbSelection->filename.'.txt';
                $contentText = LoadText($filename);
            }

            $search->body = ' ' . exponent_search_removeHTML($item->description) . ' '.$contentText.' ';
            $search->title = ' ' . $item->name . ' ';
            $search->location_data = $item->location_data;
            $search->view_link = 'index.php?module=resourcesmodule&action=view&id='.$item->id;
            $db->insertObject($search,'search');
        }
    }

    return true;
}
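Both branches above repeat the same lookup of the text file. One way to keep them in step is a small helper that decides whether a file row is a PDF and, if so, where its text lives. This is only a sketch: pdfTextPathFor is a hypothetical name, and the example rows are made up to mirror the directory and filename fields used above.

```php
<?php
// Hypothetical helper, not part of Exponent. $fileRow is an object with
// ->directory and ->filename, like the rows read from the file table.
function pdfTextPathFor($fileRow)
{
    if (!is_object($fileRow) || !isset($fileRow->filename, $fileRow->directory)) {
        return null;
    }
    // Case-insensitive check so that "REPORT.PDF" is also indexed.
    if (strtolower(substr($fileRow->filename, -4)) !== '.pdf') {
        return null;
    }
    return $fileRow->directory . '/' . $fileRow->filename . '.txt';
}

// Made-up example rows.
$pdf = (object) array('directory' => 'files/resourcesmodule', 'filename' => '1164931200_Report.PDF');
$doc = (object) array('directory' => 'files/resourcesmodule', 'filename' => 'notes.doc');
```

Each branch could then call the helper once and pass a non-null result to LoadText.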

As you can see, this change does require a bit of core code modification, which could be a problem when it comes to core updates. However, it does provide the ability to search inside PDFs while keeping the access-level security.

Rich

(working on the same project as Partick@m4design)
