Continuing the discussion from the debian bug report...
As mentioned by Willi, it is not clear why this is a logwatch issue, as anyone can send email with any declared encoding.
The Wikipedia entry shows UTF-8 support for email clients. The only ones listed as either not supporting UTF-8 or with unknown support are no longer in development (I think all of them ceased development at least nine years ago).
Also, it is not clear to me how someone would introduce an exploit. The examples listed could result in an incorrectly displayed string, but not in executing malicious code.
Last edit: Bjorn 2017-02-06
I mangled the link in the previous message. The correct one is:
https://en.wikipedia.org/wiki/Comparison_of_email_clients#Features
While investigating this issue, I assembled some simple commands for experimenting with UTF-8 and for getting corrupt UTF-8 and ANSI control sequences logged:
    # ANSI escape sequence
    # (sequence from stderr output of `reset`)
    echo "$(printf '\x1b\x63\x1b\x5d\x31\x30\x34\x07\x1b\x5b\x21\x70\x1b\x5b\x3f\x33\x3b\x34\x6c\x1b\x5b\x34\x6c\x1b\x3e\x0d')"

    # Valid UTF-8 (inverted exclamation point)
    echo "$(printf '\xc2\xa1')"

    # Invalid UTF-8 (same character, missing a continuation bit on second byte)
    echo "$(printf '\xc2\x21')"

    # A UTF-8 string
    echo "$(printf '\xc3\xbc\x62\x65\x72\x73\xc3\xa4\x74')"

    # See what ISO-8859-1-only people see
    # (by converting the ISO-8859-1 view of this string to the UTF-8 display equivalent)
    echo "$(printf '\xc3\xbc\x62\x65\x72\x73\xc3\xa4\x74')" | iconv -f ISO-8859-1 -t UTF-8
These are easy to get into a log by running things like sudo echo "$(...)". This can impact logwatch because the output of journalctl --no-pager --output=cat will contain both the unmodified invalid UTF-8 sequences and the ANSI control sequences.

This appears to be easy to sanitize, as perl's "utf-8" (as opposed to "utf8") offers some interesting options for only returning valid UTF-8 character sequences.

See: http://search.cpan.org/~dankogai/Encode-2.92/Encode.pm#UTF-8_vs._utf8_vs._UTF8
See: https://web.archive.org/web/20170814012503/http://search.cpan.org/~dankogai/Encode-2.92/Encode.pm#UTF-8_vs._utf8_vs._UTF8
See: http://search.cpan.org/~dankogai/Encode-2.92/Encode.pm#Handling_Malformed_Data
See: https://web.archive.org/web/20170814012503/http://search.cpan.org/~dankogai/Encode-2.92/Encode.pm#Handling_Malformed_Data

On a side note:
- per RFC 3164 and RFC 5424, syslog should be in UTF-8.
- per the systemd journalctl man page, output is 'by default "utf-8", if the invoking terminal is determined to be UTF-8 compatible'.
- illegal characters should be noted, escaped, truncated or dropped.

It seems that perl does not inherit an encoding:
> no encoding;
> Unsets the script encoding. The layers of STDIN, STDOUT are reset to ":raw" (the default unprocessed raw stream of bytes).
A configuration option for default input encoding might be needed, but each service script will certainly need to handle its own input encoding and output UTF-8.
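A rough sketch of what that per-service handling could look like (the encoding knob and its environment-variable source below are invented for illustration; this is not existing logwatch code):

    use strict;
    use warnings;
    use Encode ();

    # Hypothetical per-service input encoding; a real script would get
    # this from its own configuration rather than an env variable.
    my $input_encoding = $ENV{LOGWATCH_INPUT_ENCODING} || 'UTF-8';

    binmode(STDIN);             # read raw bytes
    binmode(STDOUT, ':raw');    # write raw bytes

    while (my $line = <STDIN>) {
        # Decode from the declared input encoding; with the default
        # check mode, malformed bytes are replaced with U+FFFD.
        my $text = Encode::decode($input_encoding, $line);
        # Always emit UTF-8.
        print Encode::encode('UTF-8', $text);
    }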
I propose to use iconv to ignore any non-UTF-8 characters. The following patch inserts the iconv command in the pre-processing stage, when the filtered logs are created. The iconv command appears to be very efficient; I don't see much of a difference in overall execution time with or without the patch.
The patch is for the scripts/logwatch.pl file. Let us know if you have any issues with it. If there are no issues, I'll incorporate it into the repository at a future date.
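The patch itself is not reproduced in this thread; purely as an illustration of the idea, a sketch of inserting iconv into a filter pipeline (the variable names are invented, not logwatch.pl's actual internals):

    use strict;
    use warnings;

    # Illustrative stand-ins for logwatch.pl's real pipeline pieces.
    my $logfile          = '/var/log/syslog';
    my $existing_filters = '';   # e.g. " | <service-specific filter>"

    # Append iconv so the filtered logs are created from valid UTF-8
    # only; -c silently drops bytes that cannot be converted.
    my $filter_cmd = "cat '$logfile'" . $existing_filters
                   . " | iconv -c -f UTF-8 -t UTF-8";

    open(my $fh, '-|', $filter_cmd)
        or die "cannot run filter pipeline: $!";
    while (my $line = <$fh>) {
        print $line;   # filtered, valid-UTF-8 log lines
    }
    close($fh);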
That works, but not well enough.

Last edit: Jason Pyeron 2017-08-20

Yes, iconv would drop the characters (based on the earlier "illegal characters should be noted, escaped, truncated or dropped"). But I do agree that escaping them would be better. I don't know how to get that out of the Encode perl modules, however. Can you or anyone else help out?
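For reference, a minimal sketch of escaping (rather than dropping) with the core Encode module; the FB_PERLQQ check mode renders malformed bytes as \xHH escapes instead of discarding them:

    use strict;
    use warnings;
    use Encode ();

    binmode(STDIN);
    binmode(STDOUT, ':raw');

    while (my $line = <STDIN>) {
        # FB_PERLQQ escapes malformed sequences (e.g. "\xC2") instead
        # of dropping them, so the evidence stays in the report.
        my $text = Encode::decode('UTF-8', $line, Encode::FB_PERLQQ);
        print Encode::encode('UTF-8', $text);
    }

Piping a log through this filter leaves valid UTF-8 untouched and makes each bad byte visible in the report.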
I would very much like it to be done in the logwatch pre-processing stage, so it would be done only once for each log file. I certainly don't want us to change every service script to have to handle it independently. So it would still look like a separate perl invocation through which the log files are piped.
As an aside, I still don't see how to inject these non-UTF-8 characters into the logs. The example given above (using sudo) does not work; the characters are escaped before writing to the log. I think syslog represents strings internally as UTF-8 as well, but I am not sure that is true of all implementations out there. So while the security threat aspect still appears to be conjecture, I am willing to address it provided the cost in time and code is reasonable.
> I don't know how to get that out of the Encode perl modules, however. Can you or anyone else help out?
Sure. Will provide patch.
> So it would still look like a separate perl invocation through which the log files are piped.
As I intend the patch, the encoding options passed when the file is opened will make this transparent.
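A minimal sketch of that approach, assuming a PerlIO :encoding() layer on the filtered log (the path below is hypothetical):

    use strict;
    use warnings;

    my $filtered_log = '/tmp/filtered.log';   # hypothetical path

    # The :encoding() layer decodes transparently as the service
    # script reads; callers never see the raw malformed bytes
    # (by default the layer escapes them and warns).
    open(my $fh, '<:encoding(UTF-8)', $filtered_log)
        or die "cannot open $filtered_log: $!";
    while (my $line = <$fh>) {
        print $line;   # already decoded text
    }
    close($fh);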
> As an aside, I still don't see how to inject these non-UTF-8 characters into the logs. The example given above (using sudo) does not work; the characters are escaped before writing to the log.
It might be a pager or a different logging system. I was able to recreate this on Debian Stretch (9) by watching the log with journalctl -b -f -l --no-pager --output=cat (similar to the note in https://sourceforge.net/p/logwatch/git/ci/master/tree/scripts/shared/journalctl). It looks like --output=cat is necessary to get anything besides "[75B blob data]". Looking around, I now see there is no automatic usage of --output=cat, so this method of injecting UTF-8 won't automatically affect logwatch; it will only affect service configurations that have explicitly asked for --output=cat, as shown in that file.
> I think syslog represents strings internally as UTF-8 as well, but I am not sure that is true of all implementations out there.
This is a good example of a reason not to assume/enforce UTF-8 input. By that I do not mean we should ignore invalid UTF-8 input; rather, I think making logwatch locale-aware is worth consideration.
For example:
1. Specifications that say a given log is supposed to be in UTF-8, such as RFC 3164, RFC 5424, and the journalctl manual, are a good reason for logwatch to default to UTF-8 input when processing those files, regardless of system locale.
   - Logically equivalent to iconv -c -f UTF-8 -t UTF-8
   - Invalid UTF-8 characters are "corrected"
2. Examples of noncompliant systems are a good reason to allow a given log's default to be overridden.
   - $format set with a service configuration file
   - Logically equivalent to iconv -c -f $format -t UTF-8
   - Invalid UTF-8 characters are not necessarily invalid $format characters, and can be converted to valid UTF-8 characters
   - Example of converting a valid ISO-8859-15 byte (but invalid UTF-8) to the valid UTF-8 representation of the same character: printf '\xf8' | iconv -f ISO-8859-15 -t UTF-8
3. All else should probably default to the system locale (a sketch of discovering the locale's codeset follows this list).
   - Same notes as 2.: invalid UTF-8 is not necessarily invalid in the proper locale, and can be converted to UTF-8's representation of the same characters.
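For item 3, a sketch of how the system locale's character set could be discovered in Perl, using the core POSIX and I18N::Langinfo modules:

    use strict;
    use warnings;
    use POSIX qw(setlocale LC_CTYPE);
    use I18N::Langinfo qw(langinfo CODESET);

    # Adopt the environment's locale, then ask for its codeset,
    # e.g. "UTF-8" or "ISO-8859-15"; that would become the default
    # input encoding when no per-log rule applies.
    setlocale(LC_CTYPE, '');
    my $codeset = langinfo(CODESET);
    print "default input encoding: $codeset\n";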
Added CharEncoding variable to declare input encoding. See repository's logwatch.conf file.
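Such a setting might look something like the following in logwatch.conf (the value shown is illustrative; the repository's file is authoritative):

    # Character encoding of the input log files (illustrative value)
    CharEncoding = UTF-8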