
Tess4J doesn't detect the columns like tesseract

2018-07-26
  • arpit tandon

    arpit tandon - 2018-07-26

    I am using the following example to run OCR on a PDF and convert it to text:

        // Set the properties file for log4j
        org.apache.log4j.PropertyConfigurator.configure("C://Projects//Library//Tess4J//log4j.properties.txt");
        File image = new File("C://Users//arpit.tandon//Documents//My Received Files//SomePapers//Chamberlain-1979-Thalidomide and lack of terat.pdf");
    
        Tesseract tessInst = new Tesseract();
        tessInst.setDatapath("C://Projects//Library//Tess4J");
        try
        {
            String result = tessInst.doOCR(image);
            FileWriter fw = new FileWriter("C://Users//arpit.tandon//Documents//My Received Files//SomePapers//output.txt");
            fw.write(result);
            // System.out.println(result);
    
            fw.close();
    
        }
        catch (TesseractException | IOException e)
        {
            System.err.println(e.getMessage());
        }
    

    The PDF has two columns. Tess4J works well and outputs the text, but it doesn't take the columns into account: it prints adjacent lines from the two columns as a single line. However, when I convert the same PDF to TIFF using "convert" and then run tesseract directly on the command line, it handles the columns and outputs the text accordingly, and it is also fast. Any idea what I might be missing here?

     

    Last edit: arpit tandon 2018-07-26
  • arpit tandon

    arpit tandon - 2018-07-26

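    Got it. I had to call tessInst.setPageSegMode(3); before doOCR. Mode 3 is fully automatic page segmentation without orientation and script detection, which is what the tesseract command line uses by default.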
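
    To avoid a bare magic number in that call, the mode could be given a name. This is only a minimal sketch: the class PageSegModes and its constant names are hypothetical (not part of Tess4J), and the numeric values are the ones listed by the tesseract command-line help for --psm.

```java
/** Hypothetical helper: names for a few Tesseract page segmentation modes (--psm values). */
final class PageSegModes {
    static final int AUTO_OSD = 1;      // automatic page segmentation with OSD
    static final int AUTO = 3;          // fully automatic page segmentation, no OSD
    static final int SINGLE_COLUMN = 4; // single column of text of variable sizes
    static final int SINGLE_BLOCK = 6;  // single uniform block of text

    private PageSegModes() {
        // constants holder, not meant to be instantiated
    }

    /** Returns a short description of the given mode, for logging or debugging. */
    static String describe(int psm) {
        switch (psm) {
            case AUTO_OSD:
                return "automatic page segmentation with OSD";
            case AUTO:
                return "fully automatic page segmentation, no OSD";
            case SINGLE_COLUMN:
                return "single column of text of variable sizes";
            case SINGLE_BLOCK:
                return "single uniform block of text";
            default:
                return "unknown mode " + psm;
        }
    }

    public static void main(String[] args) {
        // Print the mode that fixed the two-column layout in this thread.
        System.out.println(AUTO + ": " + describe(AUTO));
    }
}
```

    With such a helper, the fix would read tessInst.setPageSegMode(PageSegModes.AUTO); instead of tessInst.setPageSegMode(3);. Tess4J may already ship equivalent constants of its own, so it is worth checking the Javadoc of the version in use before adding a class like this.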

     
