Thursday, July 09, 2009.

 

Hi Chris,

 

Yes, way back in 2006, I did enable TIDY_STORE_ORIGINAL_TEXT, and made a number of 'fixes' to get it to work properly. These were not TOO numerous, but essential nonetheless...

 

My purpose at the time was to aid in fixing two 'accessibility testing' bugs, by referring back to the original text, BUT these fixes were never applied! And I found alternative ways to address these two bugs...

 

And also I saw potential for using this original text store in some debugging situations... just to know what text Tidy started with... but...

 

(a) I remember at the time Björn advising that he did not 'like' his own code on this ;=(), but after the fixes it all worked fine, and

(b) Even if applied, there was never any intention of adding it to the Tidy API, so you would only be able to access this 'original' store through programming... 

 

As you have discovered, some text 'manipulation' happens down at the parse, lexer, stream IO level, especially for spaces and line endings, presumably for speed and convenience at the time, rather than all at the cleanup and/or pretty print level. So some original file data is lost, and sometimes the file position information is then perhaps not exactly accurate...

 

Maybe this latter file position problem could be addressed as a bug, if you can find a simple example test case. When some characters are seen in the stream, they are sometimes dropped, converted, or put back into a 'store', and maybe there are cases where the reported column position then gets out of sync, and maybe the line number also... With some good simple test cases this could be checked...

 

But it seems strange that you start with HTML Tidy, whose purpose is to TIDY HTML, for an 'HTML to Text' converter project! In the past I have always used a rather simple Perl script to 'remove' everything between '<' and '>', and if I wanted to be fancy, to remove <head...> to </head>, <script...> to </script>, etc... And if this is done on a line by line basis, with character by character parsing, then that script could report absolutely accurate line and column positions...
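
Just to illustrate that character by character idea, here is a minimal sketch, in Python rather than Perl, and NOT my original script. The function name strip_tags and the skip list are made up for this example. It assumes plain '<' ... '>' tags only, and makes no attempt to cope with comments, CDATA, or a stray '<' or '>' inside an attribute value or a script body. Each kept character is returned with the 1-based line and column where it sat in the original input.

```python
def strip_tags(html, skip=("head", "script", "style")):
    """Drop everything between '<' and '>', plus the content of the
    elements named in `skip`. Returns a list of (line, col, char) for
    each kept character, using 1-based positions in the ORIGINAL input.
    Hypothetical helper for illustration only."""
    out = []
    line, col = 1, 1      # position of the character being examined
    in_tag = False        # True while between '<' and '>'
    tag = ""              # text collected between '<' and '>'
    skipping = None       # element whose content is being dropped

    for ch in html:
        if in_tag:
            if ch == ">":
                in_tag = False
                words = tag.lower().split()
                name = words[0] if words else ""
                if skipping is None and name in skip:
                    skipping = name          # e.g. <head ...> seen
                elif skipping is not None and name == "/" + skipping:
                    skipping = None          # e.g. </head> seen
            else:
                tag += ch
        elif ch == "<":
            in_tag, tag = True, ""
        elif skipping is None:
            out.append((line, col, ch))

        # Advance the position for EVERY input character, kept or not,
        # so the reported positions never drift from the source file.
        if ch == "\n":
            line, col = line + 1, 1
        else:
            col += 1
    return out


if __name__ == "__main__":
    sample = ("<html><head><title>T</title></head>\n"
              "<body>\n"
              "<p>Hi <b>you</b></p>\n"
              "</body></html>\n")
    for ln, cl, ch in strip_tags(sample):
        if not ch.isspace():
            print(f"{ln}:{cl} {ch!r}")
```

The point of the design is only that the position counters are advanced on every input character, before anything is dropped or converted, which is where a multi-stage parser can lose track.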

 

Anyway, I do not know the full details of your project...

 

Hope this helps.

 

Regards,

 

Geoff.

 


 


