Tesseract 4.0

2016-12-16
2016-12-29
  • Christopher Edward

    I'd like to start this topic by thanking you, Nguyenq, for your hard work; it is appreciated.

    On 2016-11-11 Ray released the first outcome of what will be called Tesseract 4.00, and I can only say that it's a work of art; Ray has truly excelled with this masterpiece. The new neural network system based on LSTM has the potential to achieve great recognition rates, and let's not forget that Ray has documented everything, and I mean everything.
    Also, goodbye Cube.

    For those who would like to test on Windows, Mannheim University has made some Tesseract executables available, for testing only, at:
    http://digi.bib.uni-mannheim.de/tesseract/

    But bear in mind that this Tesseract version must be built from source; it uses Leptonica 1.73.
    Finally, a question for the hard-working Nguyenq: are you planning to integrate Tesseract 4.0 into jTessBoxEditor so that we can start testing right away?

    Thank you

     

    Last edit: Christopher Edward 2016-12-16
  • Quan Nguyen

    Quan Nguyen - 2016-12-28

    We'll certainly consider upgrading the training tools. Bear in mind that the new training process is a lot more complex than the previous version -- Tesseract developers have warned that "The training cannot be quite as automated as the training for 3.04 for several reasons."

    So let's try to get some familiarity and experience with the new process and then cautiously proceed. The tools themselves are still in the alpha stage of development, so it could be a while before a compatible jTessBoxEditor appears.

    Thank you.

     

    Last edit: Quan Nguyen 2016-12-28
  • Christopher Edward

    Thanks for replying.
    Ray said that version 4.0 will contain all the engines along with the new LSTM one, hence the addition of the "--oem" option to choose which engine to use.
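
As a rough illustration of choosing an engine, here is a minimal Python sketch that only builds the command line for one run. It assumes the Tesseract 4.0 `--oem` (OCR engine mode) option with values 0 to 3; the helper name `build_ocr_command` and the file names `page.png` and `out` are hypothetical.

```python
import shlex

# OCR engine modes accepted by the --oem option in Tesseract 4.0.
OEM_MODES = {
    "legacy": 0,       # original Tesseract engine only
    "lstm": 1,         # neural-network LSTM engine only
    "legacy+lstm": 2,  # both engines combined
    "default": 3,      # whatever the installed traineddata supports
}


def build_ocr_command(image, output_base, lang="eng", engine="lstm"):
    """Return the argv list for one tesseract run with the chosen engine.

    Hypothetical helper: it builds the command but does not execute it.
    """
    if engine not in OEM_MODES:
        raise ValueError(f"unknown engine: {engine!r}")
    return [
        "tesseract", image, output_base,
        "-l", lang,
        "--oem", str(OEM_MODES[engine]),
    ]


if __name__ == "__main__":
    # page.png and out are placeholder names for illustration.
    print(shlex.join(build_ocr_command("page.png", "out", lang="ara")))
```

Passing the resulting list to `subprocess.run` would start the actual OCR, provided a Tesseract 4.0 build and the matching traineddata are installed.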

    Also, the user Shreeshrii has created for me a list of all the commands required for training the LSTM engine from scratch; have a look at:
    https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

    assuming that:
    arabic1.txt : contains the text that will be used to generate tif/box files.
    ara : the Arabic language.
    Times New Roman : the font used.
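
To make those three assumptions concrete, here is a minimal Python sketch of how the tif/box generation step might be invoked with the `text2image` training tool. This is not the command list from the wiki page; the helper name, the fonts directory, and the `exp0` suffix are assumptions for illustration.

```python
def build_text2image_command(text_file, lang, font, fonts_dir, exp=0):
    """Return the argv list for a text2image run that renders text_file.

    Hypothetical helper: the output base follows the common
    <lang>.<font>.exp<N> naming, with spaces in the font name
    replaced by underscores.
    """
    font_tag = font.replace(" ", "_")
    output_base = f"{lang}.{font_tag}.exp{exp}"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",  # assumed location of the installed fonts
    ]


if __name__ == "__main__":
    cmd = build_text2image_command(
        "arabic1.txt", "ara", "Times New Roman", "/usr/share/fonts"
    )
    print(cmd)
```

If run through `subprocess.run`, this would be expected to produce `ara.Times_New_Roman.exp0.tif` and the matching `.box` file, assuming the font is available in the given directory.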

     
  • Quan Nguyen

    Quan Nguyen - 2017-02-20

    A beta version that includes the latest 4.00 training executables from Mannheim University has been released. The new training process specific to 4.00 has not been implemented, though. That will have to come later.

    I just happened to see a video for Arabic training at https://wn.com/training_tesseract_ocr_for_arabic_language_tutorial . It seems adjusting the boxes and including an ara.config file in the combine step makes a difference in proper recognition of Arabic scripts.

     
  • M.N.S.Rao

    M.N.S.Rao - 2017-04-03

    Sir,
    I downloaded VietOCR v5.0 alpha and, through it, the kan.traineddata file to my desktop machine running Windows XP.
    While trying to OCR a jpg file, the error "The specified module could not be found" appears.
    I checked that the kan.traineddata file (47927 KB) is in the tessdata folder. Please suggest the correction required.
    Thanks,
    MNS Rao

     
  • Quan Nguyen

    Quan Nguyen - 2017-04-13

    Tesseract 4.00alpha is currently not supported for Windows XP. That may change, though.

    https://github.com/tesseract-ocr/tesseract/issues/810

     
  • Christopher Edward

    Quan Nguyen, when do you expect the implementation of LSTM training in jTessBoxEditor?

     
  • Quan Nguyen

    Quan Nguyen - 2017-04-27

    I'm not sure when I will be able to get around to it. I haven't tried to go through the new LSTM training myself, but it seems quite complicated. And since it is script-based, it may not lend itself to being implemented in code.

     

    Last edit: Quan Nguyen 2017-04-28
    • Christopher Edward

      The Mannheim University Library has released an updated version of Tesseract 4.0; have a look:
      http://digi.bib.uni-mannheim.de/tesseract/

      I think it includes the CLSTM training tools and various bug fixes.
      Quan, can you test the training function?
      Will you update jTessBoxEditor to include the newer version of Tesseract?
      Thank you for your hard work.

       
      • Quan Nguyen

        Quan Nguyen - 2017-06-04

        Yes, I happened to be working on incorporating those updates at the time. They have just been uploaded.

         
    • Christopher Edward

      What about installing Git?
      https://git-scm.com/

       
  • Christopher Edward

    Since the Mannheim University Library has integrated the LSTM training tools,
    Quan, will you automate the LSTM training process and add LSTM training to jTessBoxEditor?

     
