HTML Tidy contains hidden functionality to store the original text extracted from the source document. This code is disabled by default, is undocumented (i.e. not supported by the group), and presently has no API to expose the data even when enabled. It is nonetheless present in the source code and therefore potentially usable by the more adventurous HTML Tidy API users.
While the group may not wish to actively support this undocumented functionality, the code really should either be fixed or be pulled entirely from the CVS repository, as it's not reliable in its current state. I would rather see the former than the latter, as there's probably someone else out there who would benefit from it.
The attached patch file contains two things primarily:
1) A new Tidy API function that exposes the original text. This function is wrapped with the TIDY_STORE_ORIGINAL_TEXT #define so it is only available when this functionality is enabled in the build.
2) Fixes for several issues with the current functionality (hopefully all of them) where the original text got out of sync with the lexer state in GetTokenFromStream(). The causes were changing node start points, pushing characters back onto the stream, and splitting read content into two nodes (in the CondReturnTextNode macro).
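To illustrate the first item, here is a minimal sketch of how an accessor can be compiled in only when TIDY_STORE_ORIGINAL_TEXT is defined. It is not the code from the patch: DemoNode, demoMakeNode and demoNodeGetOriginalText are hypothetical names invented for this example, and the struct is a stand-in rather than Tidy's real node type. The macro is defined at the top only so the sketch compiles on its own; in a real build it would come from the build configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: defined here so the sketch builds with the feature on. */
#ifndef TIDY_STORE_ORIGINAL_TEXT
#define TIDY_STORE_ORIGINAL_TEXT
#endif

/* Hypothetical stand-in for a parsed node; NOT Tidy's real node type. */
typedef struct DemoNode {
    const char *text;    /* text as the parser normalized it */
#ifdef TIDY_STORE_ORIGINAL_TEXT
    const char *otext;   /* text exactly as it appeared in the source */
#endif
} DemoNode;

/* Hypothetical constructor: the original text is kept only when the
   feature is compiled in, otherwise the argument is ignored. */
DemoNode demoMakeNode(const char *text, const char *original)
{
    DemoNode n;
    assert(text != NULL);
    n.text = text;
#ifdef TIDY_STORE_ORIGINAL_TEXT
    n.otext = original;
#else
    (void)original;
#endif
    return n;
}

#ifdef TIDY_STORE_ORIGINAL_TEXT
/* Hypothetical accessor: exists only in builds with the feature enabled.
   Returns NULL for a NULL node or when no original text was stored. */
const char *demoNodeGetOriginalText(const DemoNode *node)
{
    return node ? node->otext : NULL;
}
#endif
```

Callers would need to guard their use of the accessor with the same #ifdef, since the symbol does not exist in a default build.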
The patch contains differences from the current CVS version of the following files:
include\tidy.h (Rev 1.22)
src\tidylib.c (Rev 1.75)
src\lexer.c (Rev 1.194)
src\streamio.c (Rev 1.43)
It does not contain any changes from HTML Tidy bugs 2811690, 2819896, and 2819903 (though I would highly recommend them, as they correct reported node position issues).