NAPS2 - Not Another PDF Scanner / Discussion / General Discussion: Tesseract-4.0.x

Hello!

You recently added /components/tesseract-4.0.0b4 to the NAPS2 download page.
I tried to use it, without success. NAPS2 automatically downloads tesseract-3.0.4 when I try to use OCR. I unpacked the downloaded tesseract-4.0.0b4 files into the components folder and tried to trick NAPS2 by copying them into the tesseract-3.0.4 folder. NAPS2 then correctly detects the language file (traineddata), but the scanned PDF has no OCR text and errorlog.txt shows:

2018-09-05 14:00:51.8076 Error running OCR System.IO.FileNotFoundException: Could not find file 'C:\Program Files (x86)\util\scan\Data\temp\4m2fsebo.3cb.hocr'.
File name: 'C:\Program Files (x86)\util\scan\Data\temp\4m2fsebo.3cb.hocr'
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy, Boolean useLongPath, Boolean checkHost)
   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize)
   at System.Xml.XmlDownloadManager.GetStream(Uri uri, ICredentials credentials, IWebProxy proxy, RequestCachePolicy cachePolicy)
   at System.Xml.XmlUrlResolver.GetEntity(Uri absoluteUri, String role, Type ofObjectToReturn)
   at System.Xml.XmlTextReaderImpl.FinishInitUriString()
   at System.Xml.XmlTextReaderImpl..ctor(String uriStr, XmlReaderSettings settings, XmlParserContext context, XmlResolver uriResolver)
   at System.Xml.XmlReaderSettings.CreateReader(String inputUri, XmlParserContext inputContext)
   at System.Xml.XmlReader.Create(String inputUri, XmlReaderSettings settings, XmlParserContext inputContext)
   at System.Xml.Linq.XDocument.Load(String uri, LoadOptions options)
   at NAPS2.Ocr.TesseractOcrEngine.ProcessImage(String imagePath, String langCode, Func`1 cancelCallback)
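
The trace shows the exception being raised while the .hocr output file is loaded, so the file Tesseract was expected to write was not there. As an illustration only (this is not NAPS2's code), here is a minimal Python sketch of a wrapper that checks for the output file and reports Tesseract's own error text instead of failing later in the XML loader. It assumes a `tesseract` executable on the PATH; the function names and the `runner` parameter are hypothetical, added so the sketch can be exercised without Tesseract installed.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang):
    # Standard Tesseract CLI form: tesseract <image> <outputbase> -l <lang> hocr
    # The trailing "hocr" config requests hOCR output.
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "hocr"]


def run_ocr_to_hocr(image_path, output_base, lang, runner=subprocess.run):
    """Run Tesseract and return the hOCR text, or raise with a clear message."""
    cmd = build_tesseract_command(image_path, output_base, lang)
    result = runner(cmd, capture_output=True, text=True)

    # Append ".hocr" rather than replacing a suffix: an output base such as
    # "4m2fsebo.3cb" already contains a dot that must be kept.
    hocr_path = Path(str(output_base) + ".hocr")
    if not hocr_path.exists():
        raise RuntimeError(
            f"Tesseract produced no output at {hocr_path} "
            f"(exit code {result.returncode}): {result.stderr.strip()}"
        )
    return hocr_path.read_text(encoding="utf-8")
```

Called as `run_ocr_to_hocr("page.png", "out", "eng")`, this would either return the contents of `out.hocr` or raise an error carrying whatever Tesseract wrote to stderr, which is usually more informative than a FileNotFoundException.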

I need assistance using tesseract-4.

Regards,

Zdenko

Ben Olden-Cooligan - 2018-09-05

Hi,

Tesseract 4 requires a new version of NAPS2, which is coming soon (1-2 weeks hopefully).

Ben

OSS fan - 2018-09-06

Thank you.

Thought so.
I can hardly wait. The best will be even better! Is that possible?

Best regards,

Zdenko

Timo - 2018-09-07

Please enlighten me:

What is Tesseract 4?

Does it come built in with the next NAPS2, and most importantly,

How does it make end-use better/faster/more accurate?

Sorry if this was trivial.

Appreciated :)

//Timo

Ben Olden-Cooligan - 2018-09-07

Tesseract is the underlying program NAPS2 uses for OCR. Version 4 uses a new engine based on a type of neural network.

In the new NAPS2 version it will be integrated the same way as OCR is now, where you just press the OCR button and download the language you want.

I've been very impressed in my testing, with as many as 80% fewer recognition errors (though overall the improvement is probably more modest, 30% fewer) and no noticeable regressions.

It will take more CPU (but for small numbers of pages may be faster due to improved multi-core use). To compensate, I'm adding the ability to run OCR preemptively (before you click the Save PDF button), so from a user perspective it may be much faster (or even instant).
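
To illustrate the idea of running OCR preemptively, here is a rough Python sketch (not NAPS2's actual implementation, which is not shown in this thread). OCR for each page is started in the background as soon as the page is scanned, so that saving only waits for whatever has not finished yet. The class name, method names, and the `ocr_func` parameter are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class PreemptiveOcrQueue:
    """Starts OCR for each page as it arrives, before the user saves."""

    def __init__(self, ocr_func, max_workers=4):
        # ocr_func is any callable that takes an image path and returns text.
        self._ocr_func = ocr_func
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = []

    def page_scanned(self, image_path):
        # Submit immediately; the work runs while the user keeps scanning.
        self._futures.append(self._pool.submit(self._ocr_func, image_path))

    def save(self):
        # Blocks only for pages whose OCR is still running.
        # Results come back in scan order, not completion order.
        results = [future.result() for future in self._futures]
        self._pool.shutdown()
        return results
```

If the user spends longer scanning and reviewing than OCR takes, `save()` returns almost at once, which matches the "much faster (or even instant)" experience described above. The total CPU work is unchanged; it is only moved earlier.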

Timo - 2018-09-10

Appreciated, thx!

Ben Olden-Cooligan - 2018-09-18

This should now work with the latest version (6.0b1).
