Hello
Newbie here. Would appreciate some pointers.
I need to convert 1000 scanned pages of a medical book (in .jpg format) into HTML and XML. What's the right software combination to get started: OCR software + HTML editor + XML editor?
Is XML Copy Editor the right software for this?
The most important tool will be the OCR software. http://en.wikipedia.org/wiki/Optical_character_recognition_software has some useful links to projects in this area, some of them open source (I think Tesseract has been attracting a fair bit of attention on Slashdot).
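For a batch of 1000 pages you will probably want to script the OCR step rather than run it by hand. Here is a minimal sketch in Python, assuming you pick Tesseract and that its command-line program is installed and on your PATH. The function names, the "scans" folder and the "eng" language code are only illustrative assumptions, not something taken from your setup.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image_path, lang="eng"):
    """Return the tesseract command line for one scanned page.

    Tesseract appends ".txt" to the output base by itself, so the
    base is simply the image path without its extension.
    """
    image = Path(image_path)
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_directory(scan_dir, lang="eng"):
    """Run OCR over every .jpg in scan_dir, in file-name order."""
    for image in sorted(Path(scan_dir).glob("*.jpg")):
        # check=True stops the batch if tesseract reports an error
        subprocess.run(build_ocr_command(image, lang), check=True)
```

Calling ocr_directory("scans") would then leave one .txt file next to each .jpg. Expect to proofread the output, since OCR on medical terminology is rarely error-free.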
Once you have flat text, you need to decide on a format to use. A number of document types would work: you could look into DocBook or TEI, for example. The advantage of using an open format is that you can output the content to various formats (HTML, XHTML, XSL-FO) and you can move to another piece of software if your requirements change. Given that you're starting from flat text, you could also think about OpenOffice.
When you have made these choices, you can mark up the content.
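If you go with DocBook, a first rough pass of the markup can be scripted before you refine it in an editor. The sketch below is only an illustration, assuming the OCR output separates paragraphs with blank lines. The function name and sample title are made up, and the output is a bare chapter without the namespace and version attribute that DocBook 5 expects.

```python
import xml.etree.ElementTree as ET


def text_to_docbook(title, flat_text):
    """Wrap blank-line-separated paragraphs in a minimal DocBook-style chapter."""
    chapter = ET.Element("chapter")
    ET.SubElement(chapter, "title").text = title
    for block in flat_text.split("\n\n"):
        # Collapse the hard line breaks that OCR leaves inside a paragraph
        paragraph = " ".join(block.split())
        if paragraph:
            ET.SubElement(chapter, "para").text = paragraph
    # ElementTree escapes characters such as & and < in the text for us
    return ET.tostring(chapter, encoding="unicode")
```

Headings, lists and tables would still need to be marked up by hand, which is where the XML editor comes in.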
All XML editors - including this one - will help you surround chunks of text with the relevant tags and check that the document is well-formed and valid. If you choose DocBook, for example, XML Copy Editor comes with the relevant stylesheets for HTML output preinstalled.
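If you end up generating files in bulk, the same kind of check can also be scripted. This is a rough sketch using Python's standard library, with a function name I have made up. Note that it only tests well-formedness (matching tags, proper nesting); it does not validate against the DocBook or TEI schema, which the editor can still do for you.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None
```

The error message includes the line and column where parsing failed, which helps locate an unclosed tag in a long chapter.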
If you run into trouble or something isn't working as expected, please don't hesitate to ask!
Regards,
Gerald