I am using the following example to convert a PDF to text with OCR:
// sets properties file for log4j
org.apache.log4j.PropertyConfigurator.configure("C://Projects//Library//Tess4J//log4j.properties.txt");

File image = new File("C://Users//arpit.tandon//Documents//My Received Files//SomePapers//Chamberlain-1979-Thalidomide and lack of terat.pdf");
Tesseract tessInst = new Tesseract();
tessInst.setDatapath("C://Projects//Library//Tess4J");

try {
    String result = tessInst.doOCR(image);
    FileWriter fw = new FileWriter("C://Users//arpit.tandon//Documents//MyReceivedFiles//SomePapers//output.txt");
    fw.write(result);
    // System.out.println(result);
    fw.close();
} catch (TesseractException | IOException e) {
    System.err.println(e.getMessage());
}
The PDF has two columns. Tess4J works great and outputs the text, but it doesn't take the columns into account: it prints the adjacent lines from the two columns as a single line. However, when I convert the same PDF to TIFF using "convert" and then run tesseract directly on the command line, it handles the columns and outputs the text accordingly, and it's also fast. Any idea what I might be missing here?
Last edit: arpit tandon 2018-07-26
Got it. I had to use tessInst.setPageSegMode(3);
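For readers wondering what the 3 means: it is one of Tesseract's page segmentation mode numbers, the same values the command-line tool lists under `--help-psm`, where 3 is marked as the default. The sketch below is a hypothetical helper, not part of Tess4J (the class name `PageSegModes` and its constants are made up for illustration), that gives a few of those numbers readable names so the call site does not rely on a magic number.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (NOT part of Tess4J): readable names for some of
// Tesseract's page segmentation mode numbers, as listed by `tesseract --help-psm`.
final class PageSegModes {
    static final int AUTO = 3;          // the value used in the fix above
    static final int SINGLE_COLUMN = 4;
    static final int SINGLE_BLOCK = 6;

    private static final Map<Integer, String> DESCRIPTIONS = new LinkedHashMap<>();
    static {
        DESCRIPTIONS.put(0, "Orientation and script detection (OSD) only");
        DESCRIPTIONS.put(1, "Automatic page segmentation with OSD");
        DESCRIPTIONS.put(3, "Fully automatic page segmentation, but no OSD");
        DESCRIPTIONS.put(4, "Assume a single column of text of variable sizes");
        DESCRIPTIONS.put(6, "Assume a single uniform block of text");
    }

    // Returns the description for a mode, or fails loudly for numbers not in this partial table.
    static String describe(int mode) {
        String description = DESCRIPTIONS.get(mode);
        if (description == null) {
            throw new IllegalArgumentException("Mode not in this table: " + mode);
        }
        return description;
    }

    public static void main(String[] args) {
        System.out.println(AUTO + " = " + describe(AUTO));
        // With Tess4J on the classpath, the value would be passed before doOCR:
        // tessInst.setPageSegMode(PageSegModes.AUTO);
    }
}
```

The table is deliberately partial. Mode 3 lets Tesseract analyse the page layout itself, which is what allows it to separate the two columns in a case like this one.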