Tesseract 4.0

Brought to you by: nguyenq

Forum: Open Discussion

Creator: Christopher Edward

Created: 2016-12-16

Updated: 2016-12-29

Christopher Edward - 2016-12-16

I'm starting this topic by thanking you, Nguyenq, for your hard work; it is appreciated.

On 2016-11-11 Ray released the first version of what will be called Tesseract 4.00, and I can only say that it's a work of art; Ray has truly excelled with this masterpiece. The new LSTM-based neural network system has the potential to achieve great recognition rates, and let's not forget that Ray has documented everything, and I mean everything.
Also, goodbye Cube.

For those who would like to test on Windows, Mannheim University has made some Tesseract executables available, for testing only, at:
http://digi.bib.uni-mannheim.de/tesseract/

But bear in mind that this Tesseract version must be built from source; it uses Leptonica 1.73.
Finally, a question for the hard-working Nguyenq: are you planning to integrate this Tesseract 4.0 version into jTessBoxEditor so that we can start testing right away?

Thank you

Last edit: Christopher Edward 2016-12-16

Quan Nguyen - 2016-12-28

We'll certainly consider upgrading the training tools. Bear in mind that the new training process is a lot more complex than the previous version -- Tesseract developers have warned that "The training cannot be quite as automated as the training for 3.04 for several reasons."

So let's try to get some familiarity and experience with the new process and then cautiously proceed. The tools themselves are still in the alpha stage of development, so it could be a while before a compatible jTessBoxEditor appears.

Thank you.

Last edit: Quan Nguyen 2016-12-28

Christopher Edward - 2016-12-29

Thanks for replying.
Ray said that version 4.0 will contain all the engines along with the new LSTM one, hence the "--oem" option to choose which engine to use.
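
For reference, here is a minimal sketch of what engine selection looks like on the command line. It assumes the Tesseract 4 `--oem` option (0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM, 3 = default); `--psm` is the separate page segmentation option. The helper name `tesseract_cmd` and the file names are made up for illustration, and the function only builds the argument list without running anything.

```python
from typing import List

# OCR engine modes as listed by `tesseract --help-oem` in 4.x builds.
OEM_LEGACY = 0
OEM_LSTM = 1
OEM_LEGACY_AND_LSTM = 2
OEM_DEFAULT = 3


def tesseract_cmd(image: str, output_base: str, lang: str,
                  oem: int = OEM_DEFAULT) -> List[str]:
    """Build (but do not run) a tesseract command line for the chosen engine."""
    if oem not in (OEM_LEGACY, OEM_LSTM, OEM_LEGACY_AND_LSTM, OEM_DEFAULT):
        raise ValueError(f"unknown engine mode: {oem}")
    return ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]


if __name__ == "__main__":
    # Hypothetical input file; prints the command for an LSTM-only Arabic run.
    print(" ".join(tesseract_cmd("page.tif", "page", "ara", OEM_LSTM)))
```

Passing the returned list to `subprocess.run` would run the actual OCR, provided a Tesseract 4 build and the matching traineddata file are installed.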

Also, the user "Shreeshrii" has created a list for me of all the commands required for training the LSTM engine from scratch; have a look at:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

assuming that:
arabic1.txt : contains the text that will be used to generate tif/box files.
ara : the Arabic language.
Times New Roman : the font used.
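
With those assumptions, the first step (rendering the training text into a tif/box pair) might look roughly like the sketch below. It uses the `text2image` options `--text`, `--outputbase`, `--font` and `--fonts_dir`; the fonts directory and the `ara.Times_New_Roman.exp0` output base are placeholders, not values taken from the wiki page, and the helper name `text2image_cmd` is hypothetical. The function only builds the command and does not run it.

```python
from typing import List


def text2image_cmd(text_file: str, lang: str, font: str,
                   fonts_dir: str) -> List[str]:
    """Build (but do not run) a text2image command that renders text_file."""
    # Output base follows the lang.fontname.exp0 naming convention;
    # spaces in the font name are replaced with underscores.
    output_base = f"{lang}.{font.replace(' ', '_')}.exp0"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


if __name__ == "__main__":
    # The fonts directory is an assumed location; adjust it for your system.
    cmd = text2image_cmd("arabic1.txt", "ara", "Times New Roman", "/usr/share/fonts")
    print(cmd)
    # To actually render the images: subprocess.run(cmd, check=True)
```

The later LSTM training steps are listed on the wiki page and are not reproduced here.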

Quan Nguyen - 2017-02-20

A beta version that includes the latest 4.00 training executables from Mannheim University has been released. The new training process specific to 4.00 has not been implemented, though. That will have to come later.

I just happened to see a video for Arabic training at https://wn.com/training_tesseract_ocr_for_arabic_language_tutorial . It seems adjusting the boxes and including an ara.config file in the combine step makes a difference in proper recognition of Arabic scripts.

M.N.S.Rao - 2017-04-03

Sir,
I downloaded VietOCR v5.0 alpha to my Desktop on Windows XP and, through it, the kan.traineddata file.
When I try to OCR a jpg file, the error "The specified module could not be found" appears.
I checked that the kan.traineddata file (47,927 KB) is located in the tessdata folder. Please suggest the correction required.
Thanks,
MNS Rao

Quan Nguyen - 2017-04-13

Tesseract 4.00 alpha is currently not supported on Windows XP. That may change, though.

https://github.com/tesseract-ocr/tesseract/issues/810

Christopher Edward - 2017-04-27

Quan Nguyen, when do you expect LSTM training to be implemented in jTessBoxEditor?

Quan Nguyen - 2017-04-27

I'm not sure when I will be able to get around to it. I haven't tried to go through the new LSTM training myself, but it seems quite complicated. And since it is script-based, it may not lend itself to being implemented in code.

Last edit: Quan Nguyen 2017-04-28

Christopher Edward - 2017-06-04

The Mannheim University Library has released an updated version of Tesseract 4.0, have a look:
http://digi.bib.uni-mannheim.de/tesseract/

I think it includes the LSTM training tools and various bug fixes.
Quan, can you test the training function?
Will you update jTessBoxEditor to include the newer version of Tesseract?
Thank you for your hard work.

Quan Nguyen - 2017-06-04

Yes, I happened to be working on incorporating those updates at the time. They have just been uploaded.

Christopher Edward - 2017-06-13

What about installing Git?
https://git-scm.com/

Quan Nguyen - 2017-06-13

https://github.com/nguyenq/jTessBoxEditor
https://github.com/nguyenq/jTessBoxEditorFX

Christopher Edward - 2017-06-25

Since the Mannheim University Library has integrated the LSTM training tools, Quan, will you automate the LSTM training process and add LSTM training to jTessBoxEditor?
