Tess4J keeps crashing

Jan Benes
2016-08-26
2016-09-01
  • Jan Benes

    Jan Benes - 2016-08-26

    Hi,

    I'm trying to see if Tess4J is the way to go for my project. Unfortunately, I'm unable to get it running. I link to Tess4J using maven/sbt:

    "net.sourceforge.tess4j" % "tess4j" % "3.2.1"
    

    and try to use it like this (scala, sbt; hopefully, you can read the code even if you don't know Scala):

    import net.sourceforge.tess4j.{Tesseract1, Tesseract, TesseractException}

    def main(args: Array[String]): Unit = {
        // FileSelector is my own helper; selectFile() returns an Option[java.io.File]
        val file = FileSelector.selectFile().getOrElse { return }

        val instance = Tesseract.getInstance() // JNA Interface Mapping
        //val instance = new Tesseract1() // JNA Direct Mapping

        try {
            val result = instance.doOCR(file)
            println(result)
        } catch {
            case e: Exception => println(e.getMessage())
        }
    }
    

    In general, I've had little luck loading PNG/PDF files (though PNGs usually load just fine, so I wager that might be a separate issue), while a GIF seems to load and work fine. Now, I either use Tesseract.getInstance(), in which case I get

    Exception in thread "main" java.lang.Error: Invalid memory access
        at com.sun.jna.Native.invokePointer(Native Method)
        at com.sun.jna.Function.invokePointer(Function.java:470)
        at com.sun.jna.Function.invoke(Function.java:404)
        at com.sun.jna.Function.invoke(Function.java:315)
        at com.sun.jna.Library$Handler.invoke(Library.java:212)
        at com.sun.proxy.$Proxy0.TessBaseAPIGetUTF8Text(Unknown Source)
        at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:437)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:292)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:213)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:197)
    

    or I use val instance = new Tesseract1(), in which case, the app always crashes on ios.seek(0); in ImageIOHelper.java on line 297.

    ERROR net.sourceforge.tess4j.Tesseract1 - null
    java.lang.IndexOutOfBoundsException: null
        at javax.imageio.stream.FileCacheImageOutputStream.seek(FileCacheImageOutputStream.java:170)
        at net.sourceforge.tess4j.util.ImageIOHelper.getImageByteBuffer(ImageIOHelper.java:297)
        at net.sourceforge.tess4j.Tesseract1.setImage(Tesseract1.java:363)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:257)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:180)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:164)
    

    Note that the temp file it's trying to read exists, and I can open it just fine in IrfanView. Also, ios is a valid reference to a FileCacheImageOutputStream.

    I suspected there might be some conflict somewhere with the TwelveMonkeys library I also use, but removing that dependency didn't help. I also tried an older version, choosing 3.0.0 at random, but the issues persisted. I'm running Win10.

    Any help is appreciated!

     
  • Quan Nguyen

    Quan Nguyen - 2016-08-27

    In Tess4J, invalid memory access exceptions usually result from the program being unable to locate the tessdata folder. In such cases, you'd need to set datapath to the parent folder of tessdata.
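
    A minimal sketch of checking the folder before handing it to Tess4J. DatapathCheck and resolveDatapath are hypothetical names, and the sketch assumes the language data file is named <language>.traineddata (eng.traineddata for the default language):

```scala
import java.io.File

object DatapathCheck {
  // Returns Right(parent) when parent/tessdata/<language>.traineddata exists,
  // otherwise Left with a message describing what is missing.
  // Hypothetical helper for illustration; it is not part of Tess4J.
  def resolveDatapath(parent: File, language: String = "eng"): Either[String, File] = {
    val tessdata = new File(parent, "tessdata")
    val trained  = new File(tessdata, s"$language.traineddata")
    if (!tessdata.isDirectory)
      Left(s"No tessdata folder under ${parent.getAbsolutePath}")
    else if (!trained.isFile)
      Left(s"Missing ${trained.getName} in ${tessdata.getAbsolutePath}")
    else
      Right(parent) // this is the folder to pass to setDatapath
  }
}
```

    On a Right(dir) result you would call instance.setDatapath(dir.getPath); on a Left, print the message instead of letting the native call fail.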

    Imageio-related exceptions are thrown when the imageio library cannot read the input image because of a bad format, a damaged image, etc. You may want to trap the exception, perform image cleanup, and try to read it again.
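
    One way the trap, clean up, and retry idea could look, as a rough sketch only. ImageCleanup, normalize, and ocrWithRetry are hypothetical names, and redrawing onto a plain 8-bit greyscale image is just one assumed form of cleanup:

```scala
import java.awt.image.BufferedImage
import scala.util.Try

object ImageCleanup {
  // Redraw the image onto a plain 8-bit greyscale canvas. This drops alpha
  // channels, palettes, and unusual color models that some readers reject.
  def normalize(src: BufferedImage): BufferedImage = {
    val out = new BufferedImage(src.getWidth, src.getHeight, BufferedImage.TYPE_BYTE_GRAY)
    val g = out.createGraphics()
    try g.drawImage(src, 0, 0, null)
    finally g.dispose()
    out
  }

  // Run `ocr` on the image; if it throws an Exception, retry exactly once
  // on the normalized copy. Errors (e.g. java.lang.Error) are not retried.
  def ocrWithRetry(image: BufferedImage)(ocr: BufferedImage => String): Try[String] =
    Try(ocr(image)).recoverWith {
      case _: Exception => Try(ocr(normalize(image)))
    }
}
```

    With Tess4J this might be called as ImageCleanup.ocrWithRetry(ImageIO.read(file))(img => instance.doOCR(img)), which uses the doOCR overload that takes a BufferedImage.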

     
  • Jan Benes

    Jan Benes - 2016-08-29

    Quan,

    thank you for your help. I've made progress, but there are still some weird issues.

    With your advice, I was able to use the old interface

    val instance = Tesseract.getInstance() // JNA Interface Mapping
    instance.setDatapath("somepath")
    

    to OCR a file. However, upon re-running the application, I got the ios.seek exception above (on the same input file). It is also the same exception I get with the val instance = new Tesseract1() // JNA Direct Mapping variant.

    I feel like there's some issue with the temp files (I use Win10) that creates a hidden state that makes the program crash (perhaps the file is left locked? though I don't know how an old file being locked would influence subsequent runs of the program). I managed to run the OCR process successfully once more by deleting the temp image files Tess4J creates, but that also only helped once. After that, it's back to the ios.seek exception, even if I again delete the temp files.

    Ideas?

    Thank you, Jan

     
  • Quan Nguyen

    Quan Nguyen - 2016-08-29

    Did you have problems with the sample eurotext images? Can you attach an image that has given you trouble?

     
  • Jan Benes

    Jan Benes - 2016-08-30

    Yes, I used the eurotext test images.

    I double-checked by downloading Tess4J-3.2.1-src.zip and using the files in Tess4J\test\resources\test-data. eurotext.png, eurotext.bmp, and eurotext.tif all give the same error as above. Note that the seek I mention above is a seek on the temporary output TIF file. As far as I can tell, the error isn't in processing the input file itself; it's in handling the temp file.

     

    Last edit: Jan Benes 2016-08-30
  • Anonymous

    Anonymous - 2016-08-31

    Hi Quan,

    I've been running into the same issue, and my search has led me here.

    I've tried several versions of tess4j and keep getting:

    ERROR net.sourceforge.tess4j.Tesseract1 - null
    java.lang.IndexOutOfBoundsException: null
        at javax.imageio.stream.FileCacheImageOutputStream.seek(FileCacheImageOutputStream.java:170)
        at net.sourceforge.tess4j.util.ImageIOHelper.getImageByteBuffer(ImageIOHelper.java:297)
        at net.sourceforge.tess4j.Tesseract1.setImage(Tesseract1.java:363)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:257)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:180)
        at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:164)
    

    I found the following bug report for Java:

    https://bugs.openjdk.java.net/browse/JDK-6967419

    And the following StackOverflow post, which suggests using MemoryCacheImageOutputStream because the bug won't be fixed until Java 9 is released:

    http://stackoverflow.com/questions/12252143/indexoutofbounds-using-javas-imageio-write-to-create-byte-array-in-png-format
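
    For reference, a minimal sketch of the in-memory approach that post describes, using only the JDK. InMemoryEncode and encode are hypothetical names, and PNG is used here only because a PNG writer ships with every JDK; this is not how Tess4J itself encodes its temp image:

```scala
import java.awt.image.BufferedImage
import java.io.ByteArrayOutputStream
import javax.imageio.ImageIO
import javax.imageio.stream.MemoryCacheImageOutputStream

object InMemoryEncode {
  // Encode an image to bytes through a memory-backed ImageOutputStream,
  // so no FileCacheImageOutputStream (and no temp cache file) is involved.
  def encode(image: BufferedImage, format: String = "png"): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val ios = new MemoryCacheImageOutputStream(bytes)
    try {
      // ImageIO.write returns false when no writer supports the format.
      if (!ImageIO.write(image, format, ios))
        throw new IllegalArgumentException(s"No ImageIO writer for format '$format'")
    } finally ios.close() // flushes the cached data into `bytes`
    bytes.toByteArray
  }
}
```

    ImageIO.setUseCache(false) is the standard JDK switch that makes ImageIO.createImageOutputStream prefer memory-backed streams; whether it changes the Tess4J code path in the stack trace above is untested.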

    Thanks,

    Rick

     

    Last edit: Anonymous 2016-08-31
  • Quan Nguyen

    Quan Nguyen - 2016-09-01

    We have never had any problem reading eurotext image files using imageio library (that's why they are included in the distribution as reference images), so it was really puzzling that Jan reported the problem.

    The imageio version that tess4j is currently using is a fork of the original one. If the current issue is indeed due to that bug, you may want to alert the maintainer of that project to incorporate the fix, so you don't need to wait for JDK9.

    Thanks,
    Quan

     
