#103 HighlightLineFilter assumes the wrong encoding on WIN32


The _filter() method of HighlightLineFilter assumes that the stdout data read from the executed command is encoded in UTF-8. However, because Codestriker::execute_command returns this data encoded as latin1 ('iso-8859-1'), decode_utf8() fails and garbles every character whose latin1 bytes are not valid UTF-8. The bug can be reproduced with a source file that contains, for example, German umlauts (http://en.wikipedia.org/wiki/Germanic_umlaut) in its comments.

e.g. "int i = 0; // Test mit Umlauten ÄÖÜüü"
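
The failure mode described above can be reproduced outside Codestriker. The following is a minimal sketch, not code from Codestriker: it assumes only the core Encode module, simulates a tool handing back latin1 bytes, and compares decoding them as UTF-8 with decoding them as latin1. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8 encode);

# The sample line from the report, written as a Perl character string
# (\x{C4} \x{D6} \x{DC} \x{FC} are the umlauts).
my $text = "int i = 0; // Test mit Umlauten \x{C4}\x{D6}\x{DC}\x{FC}\x{FC}";

# Simulate a command whose output arrives as latin1 bytes.
my $latin1_bytes = encode('iso-8859-1', $text);

# Decoding latin1 bytes as UTF-8 substitutes U+FFFD for the invalid bytes.
my $wrong = decode_utf8($latin1_bytes);

# Decoding with the encoding the bytes really use restores the text.
my $right = decode('iso-8859-1', $latin1_bytes);

# Count how many replacement characters the wrong decode produced.
my $replaced = () = $wrong =~ /\x{FFFD}/g;

print "replacement characters after decode_utf8: $replaced\n";
print "latin1 decode restores the text: ", ($right eq $text ? "yes" : "no"), "\n";
```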

Tested with Version 1.9.10.

Excerpt from HighlightLineFilter::_filter

Codestriker::execute_command($read_stdout_fh, undef, $self->{highlight}, @args);
$read_data = decode_utf8($read_data);

From my point of view there are 2 solutions for this issue:

1.) Fix Codestriker::execute_command so that all data (stdout, stderr, etc.) is not modified.

In the above stated example highlight should return UTF-8 encoded data (indicated by: push @args, '-u'; push @args, 'UTF-8'; in HighlightLineFilter) but instead latin1 encoded data is returned.

2.) Handle this issue in HighlightLineFilter and use decode('latin1', $read_data) on the WIN32 platform.

Currently I use this solution as a workaround (HighlightLineFilter.pm):


# Wrap the command in an eval in case highlight fails running over the file - for
# example if it is an unknown file type.
eval {
    Codestriker::execute_command($read_stdout_fh, undef, $self->{highlight}, @args);
    if (is_utf8($read_data)) {
        $read_data = decode_utf8($read_data);
    } else {
        $read_data = decode('iso-8859-1', $read_data);
    }
};

Cheers Tobias


  • Tobias Mieth

    Tobias Mieth - 2009-11-13

    I have dug a bit deeper into this topic and found that Codestriker::execute_command is fine. That brought me back to HighlightLineFilter.pm, where I found the following: the data (although it is already UTF-8) is written to the temporary file encoded as latin1, which in turn causes the garbled characters. Although I haven't tested it under Linux, I guess this problem occurs only on the Windows platform.

    To work around this issue, I marked the filehandle as utf8. This ensures that the data is written as UTF-8 and is then highlighted correctly.


    # Convert tabs to the appropriate number of &nbsp; entities.
    sub _filter {
    my ($self, $text, $extension) = @_;

    # Create a temporary file which will contain the delta text to highlight.
    my ($input_text_fh, $input_filename) = tempfile(SUFFIX => $extension);
    binmode( $input_text_fh, ":utf8" );
    print $input_text_fh $text;
    close $input_text_fh;
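
The effect of the binmode() call above can be checked in isolation. The sketch below is an illustration, not Codestriker code: bytes_written is a hypothetical helper that writes a character string to a temporary file, with or without an output layer, and reads the raw bytes back. Without a layer, Perl is expected to write a character such as "ü" as a single latin1 byte; with the ":utf8" layer it is written as two UTF-8 bytes. The stricter ":encoding(UTF-8)" layer produces the same output bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to a temporary file, optionally with
# an output layer, and return the raw bytes that ended up on disk.
sub bytes_written {
    my ($text, $layer) = @_;

    my ($fh, $filename) = tempfile(SUFFIX => '.txt', UNLINK => 1);
    binmode($fh, $layer) if defined $layer;
    print $fh $text;
    close $fh;

    # Read the file back without any decoding.
    open(my $in, '<:raw', $filename) or die "Cannot read $filename: $!";
    local $/;
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

my $text = "\x{FC}";    # "u" with diaeresis as a character string

my $without_layer = bytes_written($text);
my $with_layer    = bytes_written($text, ':utf8');

# Print the bytes as hex so the difference is visible on any terminal.
printf "no layer:    %s\n", unpack('H*', $without_layer);
printf ":utf8 layer: %s\n", unpack('H*', $with_layer);
```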

    I am not that fluent in Perl, so there might be another solution, such as defining the text output globally as utf8. Furthermore, I noticed that DownloadTopic.pm had a similar issue, so I applied the same solution there. I have attached both modified files and would be thankful if someone could check them, for example on another platform.


  • Tobias Mieth

    Tobias Mieth - 2009-11-13


