#103 HighlightLineFilter assumes the wrong encoding on WIN32

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2009-11-11
Created: 2009-11-11
Creator: Tobias Mieth
Private: No

The _filter() method of HighlightLineFilter assumes that the stdout data read from Codestriker::execute_command is encoded in UTF-8. However, because Codestriker::execute_command returns this data encoded as latin1 ('iso-8859-1'), decode_utf8() fails and garbles every character that is not valid UTF-8. The bug can be reproduced with a source file that contains, for example, German umlauts (http://en.wikipedia.org/wiki/Germanic_umlaut) in its comments.

e.g. "int i = 0; // Test mit Umlauten ÄÖÜüü"
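
The effect described above can be reproduced in isolation. This is a minimal sketch, not Codestriker code: the variable names are made up for illustration, and the latin1 bytes are produced with encode() to stand in for whatever the external tool emits.

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8 encode);

# The umlauts from the sample comment, as a Perl character string.
my $text = "Test mit Umlauten \x{C4}\x{D6}\x{DC}";

# Assumption: simulate a tool that emits the text as latin1 bytes.
my $latin1_bytes = encode('iso-8859-1', $text);

# Decoding latin1 bytes as UTF-8: the umlaut bytes are not valid UTF-8,
# so the decoder substitutes U+FFFD replacement characters.
my $wrong = decode_utf8($latin1_bytes);

# Decoding with the matching encoding restores the original text.
my $right = decode('iso-8859-1', $latin1_bytes);

print "decode_utf8 produced replacement characters\n" if $wrong =~ /\x{FFFD}/;
print "decode('iso-8859-1', ...) restored the text\n" if $right eq $text;
```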

Tested with Version 1.9.10.

Excerpt from HighlightLineFilter::_filter

...
Codestriker::execute_command($read_stdout_fh, undef, $self->{highlight}, @args);
$read_data = decode_utf8($read_data);
...

From my point of view there are two possible solutions for this issue:

1.) Fix Codestriker::execute_command so that it passes all data (stdout, stderr, etc.) through unmodified.

In the example above, highlight should return UTF-8 encoded data (as requested by: push @args, '-u'; push @args, 'UTF-8'; in HighlightLineFilter), but latin1 encoded data is returned instead.

2.) Handle this issue in HighlightLineFilter and use decode('latin1', $read_data) on the WIN32 platform.

Currently I use this solution as a workaround (HighlightLineFilter.pm):

...

# Wrap the command in an eval in case highlight fails running over the file - for
# example if it is an unknown file type.
eval {
    Codestriker::execute_command($read_stdout_fh, undef, $self->{highlight}, @args);
    if (is_utf8($read_data)) {
        $read_data = decode_utf8($read_data);
    }
    else {
        $read_data = decode('iso-8859-1', $read_data);
    }
};

...
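
Note that is_utf8() reports Perl's internal string flag rather than whether the bytes are valid UTF-8. As a hedged alternative to the workaround above, one could attempt a strict UTF-8 decode and fall back to latin1 only when that fails. The helper name decode_output is hypothetical and not part of Codestriker:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode bytes as strict UTF-8, falling back to
# latin1 when the bytes are not valid UTF-8.
sub decode_output {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $bytes untouched so it can be reused for the fallback.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;

    # Every byte sequence is valid latin1, so this cannot fail.
    return decode('iso-8859-1', $bytes);
}

my $utf8_bytes   = "\xC3\x84";    # U+00C4 encoded as UTF-8
my $latin1_bytes = "\xC4";        # U+00C4 encoded as latin1
print "both inputs decode to the same character\n"
    if decode_output($utf8_bytes) eq decode_output($latin1_bytes);
```

The caveat of this approach is that latin1 text which happens to form valid UTF-8 byte sequences would be decoded as UTF-8; for typical Western European text that is rare.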

Cheers Tobias

Discussion

  • Tobias Mieth
    2009-11-13

    I have dug a bit deeper into this topic and found that Codestriker::execute_command is all right. This brought me back to HighlightLineFilter.pm, where I found the following: the data (although it is already in utf8) is written to the temporary file encoded as latin1, which in turn causes the garbled characters. Although I haven't tested it under Linux, I guess this problem occurs only on the Windows platform.

    To work around this issue, I marked the filehandle as utf8. This ensures that the data is written as utf8 and then highlighted correctly.

    HighlightLineFilter.pm

    # Convert tabs to the appropriate number of &nbsp; entities.
    sub _filter {
        my ($self, $text, $extension) = @_;

        # Create a temporary file which will contain the delta text to highlight.
        my ($input_text_fh, $input_filename) = tempfile(SUFFIX => $extension);
        binmode($input_text_fh, ":utf8");
        print $input_text_fh $text;
        close $input_text_fh;
        ...
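
The difference the binmode() call makes can be checked outside Codestriker. The following is a small sketch under stated assumptions: bytes_written is a hypothetical helper, and the three umlauts stand in for the delta text.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "\x{C4}\x{D6}\x{DC}";    # three umlauts as a character string

# Hypothetical helper: write $text to a temporary file, optionally with
# an I/O layer applied, and return the raw bytes that ended up on disk.
sub bytes_written {
    my ($text, $layer) = @_;
    my ($fh, $filename) = tempfile(UNLINK => 1);
    binmode($fh, $layer) if defined $layer;
    print $fh $text;
    close $fh;

    # Read the file back without any decoding.
    open(my $in, '<:raw', $filename) or die "Cannot open $filename: $!";
    local $/;
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

printf "no layer: %d bytes\n", length bytes_written($text);
printf ":utf8:    %d bytes\n", length bytes_written($text, ':utf8');
```

With default I/O settings, the handle without a layer should write one latin1 byte per umlaut, while the :utf8 handle should write two bytes per umlaut, which is what a tool invoked with a UTF-8 option expects to read. The ':encoding(UTF-8)' layer is a stricter alternative to ':utf8'.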

    I am not that fluent in Perl, so there might be another solution, such as defining the text output globally as utf8. Furthermore, I noticed that DownloadTopic.pm had a similar issue, so I applied the same solution there. I have attached both modified files and would be thankful if someone could check them, for example on another platform.

    Tobias

     
  • Tobias Mieth
    2009-11-13

    DownloadTopic.pm

     
    Attachments
  • Tobias Mieth
    2009-11-13

    HighlightLineFilter.pm