From: Torsten N. <tn...@in...> - 2002-05-30 10:52:56
Geoff Hutchison wrote:
>
> On Tue, 28 May 2002, Elaine Fortin wrote:
>
> > We want to store faxes and be able to search them with htdig.
> ...
> > In order to be able to search the content, would we have to run them
> > through an OCR program, or is there something else that can translate them?
>
> You'd have to have some sort of OCR in there.
>
> A fax TIFF file is pure graphic--there's very little text content. (TIFF
> files in general can have some useful text info, but I think you're
> looking for the text in the fax, not text that the fax program may or may
> not store in the TIFF.)

OCR software needs to be "trained" before it can successfully recognize
textual content in a graphic. The text in the graphic also needs to be
properly aligned for the OCR software to recognize it as text at all.

That means: the OCR program needs to know about the font used in the
graphics files *plus* there should be little (less than 3°) alignment
offset, or else even the best commercially available OCR software will
produce almost completely unreadable output.
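To make the alignment limit above concrete, here is a minimal sketch of how one might estimate the skew of a scanned page before handing it to OCR. It uses a simple projection-profile search, which is only one common approach and not something any particular OCR package is known to do this way. The function names, the 0.5° search step, and the ±10° search range are assumptions made up for this illustration; only the 3° limit comes from the text above.

```python
import math


def estimate_skew_degrees(dark_pixels, max_angle=10.0, step=0.5):
    """Return the candidate angle (in degrees) that best levels the text lines.

    dark_pixels is an iterable of (x, y) coordinates of dark pixels.
    Hypothetical helper: a brute-force projection-profile search.
    """
    points = list(dark_pixels)
    steps = int(round(max_angle / step))
    best_angle, best_score = 0.0, 0
    for i in range(-steps, steps + 1):
        angle = i * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        rows = {}
        for x, y in points:
            # Rotate the point by -angle and keep only its new row coordinate.
            row = round(y * cos_t - x * sin_t)
            rows[row] = rows.get(row, 0) + 1
        # Level text lines pile many pixels into few rows, which makes
        # the sum of squared row counts large.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def within_ocr_tolerance(angle_degrees, limit_degrees=3.0):
    """True if the skew is under the (assumed default) 3 degree limit."""
    return abs(angle_degrees) < limit_degrees


# Synthetic "fax": three thin text lines, each tilted by 5 degrees.
tilt = math.tan(math.radians(5.0))
page = [(x, y0 + x * tilt) for y0 in (0, 20, 40) for x in range(200)]

skew = estimate_skew_degrees(page)
print("estimated skew:", skew, "fit for OCR:", within_ocr_tolerance(skew))
```

A page that fails the check could be rotated back by the estimated angle before OCR, but that only addresses alignment, not the font problem.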
In the case of facsimiles to be indexed by ht://Dig by translating
graphics to text content with any given OCR software, one has to take
into account that

(a) facsimiles are in most cases *not* aligned correctly enough to be
    analyzed by an OCR program (hand-faxed sheets will normally have
    offsets of 3° or more)
(b) facsimiles cannot be controlled with regard to character-based
    training of the OCR software (in particular, they will never be
    sent using specially designed OCR fonts)
(c) facsimiles (especially hand-transmitted ones) will in many cases
    contain valuable information added in handwriting
(d) facsimiles will contain "useless" information that cannot be
    skipped by text-indexing software like ht://Dig (since there is no
    way of inserting the respective control statements for the
    analyzing software)

All this makes facsimiles (and most scanned texts) nearly unfit for
automatic processing with OCR and indexing programs.

cheers,
  Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14          Tel: +49-4101-403605
D-25474 Ellerbek          Fax: +49-4101-403606
E-Mail: in...@in...       Internet: http://www.inwise.de