#54 replacement of illegal UTF-8 sequences

Unstable (w- patch)
closed-fixed
None
5
2014-12-05
2008-07-02
No

This is the second of two related patches. The first, patches item #2008377 (https://sourceforge.net/tracker/?func=detail&atid=468023&aid=2008377&group_id=52781), deals with legal UTF-8 sequences that are illegal XML. This patch deals with sequences of bytes, coming from the application rather than the network, which are illegal UTF-8.

"Why is your application promising to give gSOAP UTF-8 and then violating its promise?" would be a good question. The application in question has had UTF-8 support retrofitted and a SOAP veneer placed on top. We are going through the legacy code, adding type-safe UTF-8 strings to prevent new illegal UTF-8 sequences from creeping in, but it's a big job. Even when that's done, we're still left with another big job in cleaning up all the status and configuration information that predated the changes.

In the meantime, we have a problem caused by gSOAP's perfectly reasonable behavior when we give it illegal UTF-8 sequences. gSOAP simply transmits them, verbatim. This would be fine for our purposes if only the recipient of our SOAP missives were based on gSOAP. It is, however, based on JAX, which takes a strict approach to its XML parsing. JAX discards the whole request or response for the trifling offense of a single illegal UTF-8 sequence. This can render a whole feature unmanageable because one of its configuration items contains an illegal UTF-8 sequence. It's a bit like refusing to read a whole book because page 76 contains a single word in French - not entirely unreasonable but really quite inconvenient.

gSOAP is uniquely well positioned to clean up our strings for us. If gSOAP replaces illegal sequences with the Unicode replacement character, as intended by this patch, then the feature that was previously unmanageable becomes fully manageable again apart from the one configuration item that contains the illegal sequence. This graceful degradation is something we can easily live with.

gSOAP already provides us with the option to call mbtowc on outbound strings. The most expedient implementation, then, seemed to be to check the value returned from mbtowc and to replace the offending byte rather than allowing it to be transmitted verbatim.
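
To illustrate the intended effect, here is a minimal sketch in C. It is not the patch itself and not gSOAP code. The function names utf8_seq_len and utf8_scrub are hypothetical, and the sketch uses a hand-written decoder in place of mbtowc so that the result does not depend on the current locale. Each byte that cannot be part of a well-formed UTF-8 sequence is replaced by U+FFFD (the bytes EF BF BD), and everything else is copied unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length (1..4) of the well-formed UTF-8 sequence that
   starts at s, or 0 if the bytes at s do not start one. Requires n >= 1. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char c = s[0];
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the 2nd byte */
    size_t len, i;

    if (c < 0x80)
        return 1;                        /* plain ASCII */
    if (c >= 0xC2 && c <= 0xDF) {
        len = 2;
    } else if (c >= 0xE0 && c <= 0xEF) {
        len = 3;
        if (c == 0xE0) lo = 0xA0;        /* reject overlong forms */
        if (c == 0xED) hi = 0x9F;        /* reject UTF-16 surrogates */
    } else if (c >= 0xF0 && c <= 0xF4) {
        len = 4;
        if (c == 0xF0) lo = 0x90;        /* reject overlong forms */
        if (c == 0xF4) hi = 0x8F;        /* reject values above U+10FFFF */
    } else {
        return 0;                        /* stray continuation, C0, C1, F5..FF */
    }

    if (n < len)
        return 0;                        /* sequence truncated by end of input */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}

/* Hypothetical helper: copy in[0..n) to out, replacing every offending byte
   with U+FFFD. out needs room for 3 * n + 1 bytes. Returns the number of
   bytes written, not counting the terminating NUL. */
static size_t utf8_scrub(const char *in, size_t n, char *out)
{
    static const char repl[] = "\xEF\xBF\xBD";  /* U+FFFD encoded as UTF-8 */
    const unsigned char *s = (const unsigned char *)in;
    size_t i = 0, o = 0;

    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0) {
            memcpy(out + o, repl, 3);    /* drop the bad byte, emit U+FFFD */
            o += 3;
            i += 1;
        } else {
            memcpy(out + o, in + i, len);
            o += len;
            i += len;
        }
    }
    out[o] = '\0';
    return o;
}
```

Because the sketch replaces one byte at a time, a multi-byte illegal sequence comes out as several replacement characters rather than one. The offending bytes themselves are never copied to the output.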

As with the other related patch, I haven't attempted to add a configuration option to enable this behavior. I wanted to keep the first patch minimal until I find out whether such a configuration option would be palatable upstream. Obviously, we'd rather see our changes merged upstream, so I'd be happy to receive suggestions.

Discussion

  • Martin Dorey

    Martin Dorey - 2008-07-02
     
  • Robert van Engelen

    Logged In: YES
    user_id=354274
    Originator: NO

    I understand the reason, but can you explain why you prefer to use the xFFFD + char approach?

     
  • Robert van Engelen

    • assigned_to: nobody --> engelen
    • status: open --> pending
     
  • Martin Dorey

    Martin Dorey - 2008-07-07

    Logged In: YES
    user_id=1180368
    Originator: YES

    "xFFFD + char"? Uh-oh - perhaps my patch doesn't do what I think it does. My hope was that the illegal sequence would be entirely replaced by one or more instances of the Unicode replacement character. The effect, as seen by the end user, would be that illegal sequences show up in the browser as little black diamonds with question marks in (or whatever the browser deems appropriate) rather than preventing the whole page from being displayed. One line in the table then says "hello, I contain an illegal sequence - you probably want to clean me up" rather than the whole table being hidden due to some error that's only explained by groveling in the logs. But I think you're telling me that I'm outputting the illegal sequence as well as the Unicode replacement character?

     
  • Martin Dorey

    Martin Dorey - 2008-07-07
    • status: pending --> open
     
  • Robert van Engelen

    • status: open --> closed-fixed
    • Group: --> Unstable (example)
     
