Tess4J doesn't detect the columns like Tesseract

Created: 2018-07-26

Updated: 2018-07-26

I am using the following example to convert a PDF to text via OCR:

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    // Set the properties file for log4j.
    org.apache.log4j.PropertyConfigurator.configure("C://Projects//Library//Tess4J//log4j.properties.txt");

    File image = new File("C://Users//arpit.tandon//Documents//My Received Files//SomePapers//Chamberlain-1979-Thalidomide and lack of terat.pdf");

    Tesseract tessInst = new Tesseract();
    tessInst.setDatapath("C://Projects//Library//Tess4J");

    // try-with-resources closes the writer even if write() throws.
    try (FileWriter fw = new FileWriter("C://Users//arpit.tandon//Documents//My Received Files//SomePapers//output.txt"))
    {
        String result = tessInst.doOCR(image);
        fw.write(result);
        // System.out.println(result);
    }
    catch (TesseractException | IOException e)
    {
        System.err.println(e.getMessage());
    }

The PDF has two columns. Tess4J works well and outputs the text, but it doesn't take the columns into account: it prints adjacent lines from the two columns as a single line. However, when I convert the same PDF to TIFF using "convert" and then run tesseract directly on the command line, it handles the columns and outputs the text accordingly, and it's also fast. Any idea what I might be missing here?

Last edit: arpit tandon 2018-07-26

Chamberlain-1979-Thalidomide and lack of terat.pdf

arpit tandon - 2018-07-26

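Got it. I had to call tessInst.setPageSegMode(3), which selects fully automatic page segmentation, the mode the tesseract command line uses by default.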
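For anyone puzzled by the bare 3: it is one of Tesseract's numeric page segmentation modes. Below is a minimal sketch that gives names to a few of them. The PageSegMode enum and its fromCode method are hypothetical helpers written for this post, not part of Tess4J, and the descriptions paraphrase the tesseract command-line help. Tess4J may already ship equivalent constants, so check its Javadoc before copying this.

```java
import java.util.Optional;

// Hypothetical helper, NOT part of Tess4J: names for a subset of
// Tesseract's numeric page segmentation modes.
enum PageSegMode {
    AUTO_OSD(1, "Automatic page segmentation with orientation and script detection"),
    AUTO(3, "Fully automatic page segmentation, no orientation and script detection"),
    SINGLE_COLUMN(4, "Assume a single column of text of variable sizes"),
    SINGLE_BLOCK(6, "Assume a single uniform block of text");

    final int code;            // numeric value expected by Tesseract
    final String description;  // human-readable summary

    PageSegMode(int code, String description) {
        this.code = code;
        this.description = description;
    }

    // Look up a mode by its numeric code; empty if it is not in this subset.
    static Optional<PageSegMode> fromCode(int code) {
        for (PageSegMode mode : values()) {
            if (mode.code == code) {
                return Optional.of(mode);
            }
        }
        return Optional.empty();
    }
}
```

With this in place, the fix reads as tessInst.setPageSegMode(PageSegMode.AUTO.code); instead of passing a magic number. For a two-column page, AUTO lets Tesseract find the columns itself, whereas a single-block mode treats the whole page as one block and reads straight across both columns.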
