I'd like to start this topic by thanking you, Nguyenq, for your hard work; you are appreciated.
On 2016-11-11 Ray released the first outcome of what will be called Tesseract 4.00, and I can only say that it's a work of art; Ray has truly excelled in this masterpiece. The new neural network system based on LSTM has the potential to achieve great recognition rates, and let's not forget, Ray has documented everything, and I mean everything.
Also, goodbye Cube.
For those who would like to test on Windows, Mannheim University has made some Tesseract executables available, for testing only, at:
http://digi.bib.uni-mannheim.de/tesseract/
But bear in mind that this Tesseract version must be built from source; it uses Leptonica 1.73.
Finally, a question for the hard-working Nguyenq: are you planning to integrate this Tesseract 4.0 version into jTessBoxEditor so that we can start testing right away?
Thank you
Last edit: Christopher Edward 2016-12-16
We'll certainly consider upgrading the training tools. Bear in mind that the new training process is a lot more complex than the previous version -- Tesseract developers have warned that "The training cannot be quite as automated as the training for 3.04 for several reasons."
So let's try to get some familiarity and experience with the new process and then cautiously proceed. The tools themselves are still in the alpha stage of development, so it could be a while before a compatible jTessBoxEditor appears.
Thank you.
Last edit: Quan Nguyen 2016-12-28
Thanks for replying
Ray said that the 4.0 version will contain all the engines along with the new LSTM one, hence the addition of the "--oem" option to choose which engine to use.
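For anyone who wants to try switching engines from a script, here is a minimal sketch in Python. It only builds the argument list for the tesseract command line and does not run anything. In Tesseract 4 the engine is chosen with --oem (0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM, 3 = default based on what is available), while --psm sets the page segmentation mode. The helper name tesseract_command and the file names are my own, made up for illustration.

```python
def tesseract_command(image, output_base, lang, oem=1):
    """Build (but do not run) an argument list for the tesseract CLI.

    oem is the OCR engine mode: 0 = legacy only, 1 = LSTM only,
    2 = legacy + LSTM, 3 = default based on what is available.
    The function name and defaults are illustrative assumptions.
    """
    if oem not in (0, 1, 2, 3):
        raise ValueError("oem must be 0, 1, 2 or 3")
    return ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run() to execute it.
    print(" ".join(tesseract_command("page.png", "out", "ara", oem=1)))
```

Passing the resulting list to subprocess.run() would launch the OCR, assuming tesseract is on the PATH and the traineddata file for the chosen language supports the selected engine.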
Also, the user "Shreeshrii" has created for me a list of all the commands required for training the LSTM engine from scratch; have a look at:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
assuming that:
arabic1.txt : contains the text that will be used to generate tif/box files.
ara : the Arabic language.
Times New Roman : the font used.
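To make those assumptions concrete, here is a rough Python sketch of how the tif/box generation step could be put together with the text2image tool from the Tesseract training tools. It only builds the command; it does not run it. The helper name, the exp0 suffix and the fonts directory are my own assumptions, and the output base follows the usual lang.fontname.expN naming convention.

```python
def text2image_command(text_file, lang, font, fonts_dir, exp=0):
    """Build (but do not run) a text2image command that renders text_file
    into <outputbase>.tif and <outputbase>.box.

    The function name, the exp number and fonts_dir are illustrative
    assumptions; adjust them to your own setup.
    """
    # Conventional training file base name: lang.fontname.expN
    outputbase = "{}.{}.exp{}".format(lang, font.replace(" ", "_"), exp)
    return [
        "text2image",
        "--text={}".format(text_file),
        "--outputbase={}".format(outputbase),
        "--font={}".format(font),
        "--fonts_dir={}".format(fonts_dir),
    ]


if __name__ == "__main__":
    # The fonts directory below is a hypothetical Windows path.
    cmd = text2image_command("arabic1.txt", "ara", "Times New Roman", "C:/Windows/Fonts")
    print(cmd)
```

The list form keeps the font name with its spaces as a single argument, so it can be handed to subprocess.run() without any shell quoting.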
A beta version that includes the latest 4.00 training executables from Mannheim University has been released. The new training process specific to 4.00 has not been implemented, though. That will have to come later.
I just happened to see a video for Arabic training at https://wn.com/training_tesseract_ocr_for_arabic_language_tutorial . It seems adjusting the boxes and including an ara.config file in the combine step makes a difference in proper recognition of Arabic scripts.
Sir,
I downloaded VietOCR v5.0 alpha and, through it, the kan.traineddata file to my desktop on Windows XP.
When I try to OCR a JPG file, the error "The specified module could not be found" appears.
I checked that the kan.traineddata file (47927 KB) is in the tessdata folder. Please suggest the correction required.
Thanks,
MNS Rao
Tesseract 4.00 alpha is currently not supported on Windows XP. That may change, though.
https://github.com/tesseract-ocr/tesseract/issues/810
Quan Nguyen, when do you expect the implementation of LSTM training in jTessBoxEditor?
I'm not sure when I will be able to get around to it. I haven't tried to go through the new LSTM training myself, but it seems quite complicated. And since it is script-based, it may not lend itself to being implemented in code.
Last edit: Quan Nguyen 2017-04-28
The Mannheim University Library has released an updated version of Tesseract 4.0; have a look:
http://digi.bib.uni-mannheim.de/tesseract/
I think it includes the CLSTM training tools and various bug fixes.
Quan, can you test the training function?
Will you update jTessBoxEditor to include the newer version of Tesseract?
Thank you for your hard work.
Yes, I happened to be working on incorporating those updates at the time. They have just been uploaded.
What about installing Git?
https://git-scm.com/
https://github.com/nguyenq/jTessBoxEditor
https://github.com/nguyenq/jTessBoxEditorFX
Since the Mannheim University Library has integrated the LSTM training tools, Quan, will you automate the LSTM training process and add LSTM training to jTessBoxEditor?