#856 Cyrillic characters in utf-8

closed-invalid
nobody
5
2007-10-18
2007-09-08
someuser
No

I submitted the report: http://sourceforge.net/tracker/index.php?func=detail&aid=1788387&group_id=27659&atid=390963

Now here is an example.

To reproduce the issue:

I open it in any UTF-8 capable text editor.
Select all text.
On Mac OS X, in the Services menu, select Tidy to XHTML (given that TidyService is installed).

And I get garbage output like:

пњљпњљпњљ пњљпњљпњљ - пњљпњљпњљпњљпњљ
пњљпњљпњљпњљпњљпњљ: пњљпњљ пњљпњљ
пњљпњљпњљпњљпњљпњљпњљ пњ

Discussion

  • someuser

    someuser - 2007-09-08

    Example of cyrillic utf-8 text.

     
  • Geoff

    Geoff - 2007-09-08

    Logged In: YES
    user_id=1408861
    Originator: NO

    Saturday, September 08, 2007.

    Hi Edge (someuser),

    I too am just a 'someuser' ;=))

    Thank you for signing in, and adding an example.html file. But this runs fine in the latest CVS source version of Tidy, compiled in Windows ... If I add -utf8, meaning utf8 for input and output - no garbage output, no warnings or errors ... perfect!

    Likewise if I use a config file, with -
    input-encoding: utf8
    output-encoding: utf8
    I get no warnings or errors, and a well formed output file, the same as the above ... even adding '--output-xml yes' makes no change ...
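    The two invocations described above can be sketched as a small helper that builds the command line. This is only an illustration: the `tidy_command` function and the file names are hypothetical, and the block assumes a command-line `tidy` that accepts the `-utf8` and `-config` options mentioned in this thread. It builds the argument list and only runs it if a `tidy` binary is found.

```python
import shutil
import subprocess


def tidy_command(html_path, config_path=None):
    """Build a tidy command line (hypothetical helper, for illustration).

    With a config file, the encodings come from its input-encoding and
    output-encoding entries; without one, -utf8 sets both to UTF-8.
    """
    cmd = ["tidy"]
    if config_path is not None:
        cmd += ["-config", config_path]
    else:
        cmd.append("-utf8")
    cmd.append(html_path)
    return cmd


if __name__ == "__main__":
    cmd = tidy_command("example.html", "TidyService.conf")
    print(" ".join(cmd))
    # Only attempt to run when a tidy binary is actually on the PATH.
    if shutil.which("tidy"):
        subprocess.run(cmd, check=False)
```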

    BUT if I run Tidy _WITHOUT_ this config file, or do not add the -utf8 command, then I get 122 warnings, and a 'garbage' output file. Not quite what you showed, but with similar repetitions, like :-

    <p>&ETH;&scaron;&ETH;&deg;&ETH;&para;&ETH;&acute;&ETH;&deg;&Ntilde;
    &ETH;&iquest;&Ntilde;&sbquo;&ETH;&cedil;&Ntilde;&dagger;&ETH;&deg;
    ...

    An extract of the error output in that case is :-
    line 5 column 9 - Warning: replacing invalid character code 159
    line 5 column 11 - Warning: replacing invalid character code 128
    line 5 column 19 - Warning: replacing invalid character code 128
    line 9 column 5 - Warning: replacing invalid character code 154
    line 9 column 15 - Warning: discarding invalid character code 143
    line 9 column 19 - Warning: replacing invalid character code 130
    etc, etc, etc ...

    It SEEMS like your MAC Tidy is NOT reading, or using, the TidyService.conf file you specified. You seem to suggest similar warnings - Warning: replacing invalid UTF-8 bytes - does it really say 'UTF-8 bytes'? UTF-8 is multibyte, hence the warning 'invalid character code' I get ...

    If what you get is the same warning as shown above, then Tidy is _NOT_ set, not configured for reading UTF8, and is stumbling over some second bytes like 0x9F (159), 0x80 (128), 0x9A (154), etc ... that is it is reading the file bytes in 'standard' mode, and does its best with the rubbish it finds ...
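    A minimal Python sketch of this byte-at-a-time misreading follows. It rests on assumptions: Windows-1252 stands in for Tidy's 'standard' single-byte handling, the helper names are hypothetical, and the sample word "Кажда" is inferred from the entity sequence quoted above rather than taken from example.html itself.

```python
from html.entities import codepoint2name


def misread_as_entities(text):
    """Encode text as UTF-8, then read each byte as one Windows-1252 character.

    Returns the result as named HTML entities, roughly what a tool that
    ignores the UTF-8 encoding would emit.
    """
    raw = text.encode("utf-8")
    single_byte = raw.decode("cp1252")
    return "".join(f"&{codepoint2name[ord(ch)]};" for ch in single_byte)


def c1_range_bytes(text):
    """List the UTF-8 bytes that fall in 128..159, the range the warnings name."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]


if __name__ == "__main__":
    print(misread_as_entities("Кажда"))
    # &ETH;&scaron;&ETH;&deg;&ETH;&para;&ETH;&acute;&ETH;&deg;
    print(c1_range_bytes("Пример"))
    # [159, 128, 128]
```

    The second call shows where codes such as 159 and 128 can come from: they are the trailing bytes of two-byte UTF-8 sequences for Cyrillic letters.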

    As normal, it ignores the meta tag setting charset=utf-8. That is, Tidy has to be specifically configured to read UTF8 via the input-encoding
    parameter ...

    Regrettably, as you can read on the Tidy development list, Terry Teague died about 2 years ago, and he was the Tidy MAC OS supporter, and I have not read of anyone replacing him. Perhaps YOU?

    There have been a LOT of improvements in the Tidy code over the last few years, but I do not specifically remember anything related to utf8, but there could be!

    So, how to check that your OS X 'TidyService' is reading the 'conf' file? More importantly, how to get an updated version of Tidy code available for the MAC? What is the version you are using?

    And I note your 'example.html' does NOT begin with a UTF8 BYTE ORDER MARK (BOM), but maybe this is optional?

    It does seem your OS X TidyService is not operating properly. I am sorry this is just more questions than answers ;=((

    Regards,

    Geoff.

    EOF - cyrillic-02.doc

     
  • someuser

    someuser - 2007-09-09

    Logged In: YES
    user_id=1885736
    Originator: YES

    Don't really know about porting Tidy. It can be installed via MacPorts.

    TidyService is a rather fresh app.

    I also have an error at the beginning of the log: TidyService[241] Attempt to run diagnostics failed : 1

     
  • Geoff

    Geoff - 2007-09-09

    Logged In: YES
    user_id=1408861
    Originator: NO

    Sunday, September 09, 2007.

    Hi Edge,

    I have REMOTE access to a friend's MAC (using VNC), and have started discussing this with him ... He has done the XCode Tools and XWindows (X11) install in one machine, but must now install the same in the MAC Server I have access to ...

    And then we must turn to try to build Tidy. How did you handle this? That is, how did you 'install' a TidyService?

    One source I found was the 'Tidy Darwin Source', from :-

    http://www.opensource.apple.com/darwinsource/Current/

    Here we could download tidy-5.tar.gz which has a Makefile to do the make and install in the MAC ... this source is a few years old - about 2005 - but should be ok for a start ...

    IS THIS WHAT YOU USED?

    If NOT, please explain how you 'installed' Tidy as a Tidy Service - in some detail if possible because my friend is NOT a programmer ;=(( and I really only know windows and some linux/unix ...

    With this Tidy Service running I may be able to tell more about your log entry : TidyService[241] Attempt to run diagnostics failed : 1, but at the moment have no idea ...

    Looking forward to your early reply.

    Regards,

    Geoff.

    PS: If you do not mind, you could give me a direct email address, and we could correspond directly on this rather than through this bug tracker board ...

    I know I can use someuser at users dot sourceforge dot net, but this is also quite indirect ...

    A direct email address for me is 'geoff air', one word, no space, and of course, no inverted commas, at hotmail dot com, if you want to send to me directly.

    As you like ...

    EOF - cyrillic-03.doc

     
  • Arnaud Desitter

    Arnaud Desitter - 2007-09-10
    • milestone: --> Current - Mac OS specific
     
  • Arnaud Desitter

    Arnaud Desitter - 2007-09-10

    Logged In: YES
    user_id=566665
    Originator: NO

    The example provided works fine with command line version of tidy.
    It is most likely a problem with tidyservice (http://www.pixelfreak.net/tidy_service/).

    Geoff, would you like to contact the author of tidyservice using the email in
    http://www.pixelfreak.net/tidy_service/readme.html ?

     
  • Arnaud Desitter

    Arnaud Desitter - 2007-09-10
    • status: open --> open-invalid
     
  • Geoff

    Geoff - 2007-09-10

    Logged In: YES
    user_id=1408861
    Originator: NO

    Sunday, September 09, 2007.

    Hi Arnaud,

    Thanks for your pointers. Sometimes I really MUST remember to check a simple Yahoo! search - 'TidyService' yields pixelfreak as the top two hits ;=() but I had already spent some 6+ hours on my VNC connection, getting to 'know' the MAC in general - it really does not 'feel' much different to windows ...

    Below is what I sent to Scott ...

    re: http://www.pixelfreak.net/tidy_service/readme.html

    Hi Scott,

    We have a TidyService user having problems with cyrillic (UTF-8) characters. You can read about this on :-

    https://sourceforge.net/tracker/?func=detail&atid=390963&aid=1790572&group_id=27659

    As you can also read there, I am setting up a 'test' in a REMOTE MAC of a friend, and may be able to report more when this is done, but the remote VNC interface is VERY SLOW, and VERY TRYING ;=)). I can see I was way off base with the 'Tidy Darwin Source', but things should progress once I install your service for a trial.

    It really seems the TidyService is NOT reading or obeying the :-
    input-encoding: utf8
    in his TidyService.conf file, or maybe he 'missed' on the clear case sensitive note in your read me!

    This bug report contains an example.html for you to try.

    Regards,

    Geoff.

    EOF - cyrillic-04.doc

     
  • Geoff

    Geoff - 2007-09-10

    Logged In: YES
    user_id=1408861
    Originator: NO

    Oops, that top date should be Monday, September 10, 2007. Only missed by a day ;=))

     
  • Geoff

    Geoff - 2007-09-11

    Logged In: YES
    user_id=1408861
    Originator: NO

    Tuesday, September 11, 2007.

    Hi Scott, Edge, et al ...

    Ok, I was able to download, and install the 'Tidy Service' into the (remote) MAC, by manually copying it to /Library/Services, and re-booting. And I created a TidyService.conf file in my $HOME folder, with at least :-
    input-encoding: utf8
    output-encoding: utf8

    I had some initial problems with this, since the default save in the TextEdit text editor was in RTF. Since I could not seem to find a way to turn this off, I eventually created this file in my home folder using the 'pico' editor in a Terminal (shell) Window ...

    For the loading of the downloaded example.html into TextEdit, I _WAS_ able to turn off the loading as HTML. If not, you only see the cyrillic text, like in a browser. Actually the option I checked said something like 'not as RTF'???

    But when loaded with this OFF, then I was able to 'see' the FULL TEXT, beginning with <!DOCTYPE ... etc ... to </html>, but then TextEdit did NOT show the cyrillic text correctly ... which I think is ok in this 'raw' mode ...

    Selecting ALL, and then Services -> Tidy to XHTML, it ran WITHOUT any errors or warnings ... but then the resultant file, saved as example3.html, still did not display correctly ... just rubbish!

    I then REMOVED the output-encoding: utf8 entry, and tried it again. This time I got a file, I called example2.html, that correctly displays cyrillic in a browser, but of course all the characters are encoded as entities. That is in the form &#nnnn; ...

    I then tried the Service -> Tidy Markup, saving the file as example4.html, AND EVERYTHING WORKED PERFECTLY ;=)) Phew!

    There were no warnings or errors. Essentially there was no difference between example.html, the original, and example4.html, the Tidy Service output! So in this part, the Tidy Service is working great. Thanks.

    When looking at just the 6 cyrillic chars in the <title> ... </title> tag, of course, example2.html has :-
    &#1055;&#1088;&#1080;&#1084;&#1077;&#1088; - 6 chars, which in hex is -
    0x041F 0x0440 0x0438 0x043C 0x0435 0x0440

    These same hex values can be seen if the output encoding is set to UTF-16 - output-encoding: utf16 - namely -
    04 1F 04 40 04 38 04 3C 04 35 04 40
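    The relationship between these representations can be checked with a few lines of Python. This is a sketch, assuming the title is the word "Пример", which is what the six numeric entities above decode to; `bytes.hex` with a separator needs Python 3.8 or later.

```python
title = "Пример"  # inferred from the numeric entities quoted above

# Decimal character references, as in example2.html.
entities = "".join(f"&#{ord(ch)};" for ch in title)

# The same code points in hex.
code_points = " ".join(f"0x{ord(ch):04X}" for ch in title)

# Big-endian UTF-16 stores each of these code points as two bytes.
utf16_bytes = title.encode("utf-16-be").hex(" ").upper()

# UTF-8 stores each of these Cyrillic letters as a two-byte sequence.
utf8_bytes = title.encode("utf-8").hex(" ").upper()

print(entities)     # &#1055;&#1088;&#1080;&#1084;&#1077;&#1088;
print(code_points)  # 0x041F 0x0440 0x0438 0x043C 0x0435 0x0440
print(utf16_bytes)  # 04 1F 04 40 04 38 04 3C 04 35 04 40
print(utf8_bytes)   # D0 9F D1 80 D0 B8 D0 BC D0 B5 D1 80
```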

    So what is the 'problem' when output is utf8 and 'Tidy to XHTML' is used? Why does example3.html NOT display correctly? To look at the character coding values - again just the title - and comparing the byte values :-

    example.html - D0 9F D1 80 D0 B8 D0 BC D0 B5 D1 80 (6 chars)
    example3.html - E2 80 93 C3 BC E2 80 94 C3 84 E2 80 93 E2 88 8F etc...

    EXTREMELY DIFFERENT!!! WHAT IS THIS??? Why does example3.html have this strange set of values? Maybe someone who understands character encoding more will make more sense of this than I can ;=))

    When I used my windows command line Tidy, built from CVS, and giving it a config file with input-encoding: utf8 and output-encoding: utf8, I get EXACTLY the same output codes as the example.html input file, and the same as example4.html when using 'Tidy Markup'. That is - D0 9F D1 80 D0 B8 D0 BC D0 B5 D1 80 ...

    In fact, with my windows tidy, even if I ADD a -asxml command, or --output-xml yes, or --output-xhtml yes, which I assume are the same as 'Tidy to XHTML', I still get exactly the same file, which displays fine ... Is there some other option I should try?

    So there only seems to be a problem with the Tidy Service when 'Tidy to XHTML' is chosen. What settings are passed to the tidy library when this is chosen? Maybe this is an invalid option choice since the file is already in XML format, with an XHTML Strict DOCTYPE header? Or I have done something else wrong?

    The FULL SET of hex values between <title>...</title> is :-
    E2 80 93 C3 BC E2 80 94 C3 84 E2 80 93 E2 88 8F E2 80 93 C2 BA E2 80 93 C2 B5 E2 80 94 C3 84

    Knowing, by 'seeing', that the 2nd and last of the 6 cyrillic characters in the title are the SAME allows me to extrapolate something like the following sets :-
    1 - E2 80 93 C3 BC
    2 - E2 80 94 C3 84
    3 - E2 80 93 E2 88 8F
    4 - E2 80 93 C2 BA
    5 - E2 80 93 C2 B5
    6 - E2 80 94 C3 84

    But this is just a BIG GUESS! ANYWAY, WHAT IS THIS OUTPUT? Why does each seem to begin with E2 80? Maybe this is some special MAC encoding? But it does NOT display cyrillic in a browser, although I tried a few of the MAC 'encoding' options, but none seemed to work - that is none display cyrillic ...
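    One reading that is consistent with these bytes can be tested in Python: take the original UTF-8 bytes, decode them as if they were MacRoman (one character per byte), and encode the result as UTF-8 again. The sketch below is an illustration under that assumption, with a hypothetical helper name and the title assumed to be "Пример". It shows that the byte pattern matches; it does not establish where in the 'Tidy to XHTML' path such a conversion would happen.

```python
def double_encode(text, wrong_encoding="mac_roman"):
    """UTF-8 bytes misread as a single-byte encoding, then written as UTF-8."""
    return text.encode("utf-8").decode(wrong_encoding).encode("utf-8")


# The full set of hex values between <title>...</title> in example3.html.
observed = bytes.fromhex(
    "E2 80 93 C3 BC E2 80 94 C3 84 E2 80 93 E2 88 8F "
    "E2 80 93 C2 BA E2 80 93 C2 B5 E2 80 94 C3 84"
)

title = "Пример"  # assumed title text, from the numeric entities in example2.html

# Does the MacRoman round trip reproduce the observed bytes?
print(double_encode(title) == observed)

# Reversing the two steps recovers the Cyrillic text from the observed bytes.
repaired = observed.decode("utf-8").encode("mac_roman").decode("utf-8")
print(repaired == title)
```

    Under this reading, the groups guessed above line up with the six letters: each two-byte UTF-8 sequence D0 xx or D1 xx becomes an en dash or em dash (E2 80 93 or E2 80 94) followed by whatever MacRoman character sits at byte xx, which would explain why every group begins with E2 80.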

    Anyway, it seems Tidy Service is fine if 'Tidy Markup' is used. Only when 'Tidy to XHTML' is chosen is there a problem. And no amount of fiddling with my windows command line version let me reproduce the character stream created by the MAC Tidy to XHTML Service ...

    HELP NEEDED TO UNDERSTAND THIS! But some of the 'problem' seems solved ;=))

    Regards,

    Geoff.

    PS: I also downloaded the latest CVS Tidy code, and compiled it in the MAC with no problems. Only when I finished did I note there was a tidy binary in /usr/bin, that comes with the OS, or maybe from the XCode Tools/SDK install!

    However this is a 1 December 2004 version!! At least in this MAC I have now updated that to 15 August 2007 ;=)) But it also means Edge, and other MAC users, should have a 2004 command line tidy to at least try for comparison ...

    Does anyone know where this binary can be sent to get Apple to 'update' their 'release' of the tidy binary?

    EOF - cyrillic-06.doc

     
  • Arnaud Desitter

    Arnaud Desitter - 2007-09-17
    • status: open-invalid --> pending-invalid
     
  • SourceForge Robot

    • status: pending-invalid --> closed-invalid
     
  • SourceForge Robot

    Logged In: YES
    user_id=1312539
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 30 days (the time period specified by
    the administrator of this Tracker).

     
