
#5 Add support for UTF8

Status: open
Owner: nobody
Labels: UTF8 (2)
Priority: 7
Updated: 2015-01-23
Created: 2015-01-19
Creator: rasjv
Private: No

Hello.

I have uriparser v0.8.1. It's cool!
But when I tried to parse a URL from Google, for example:
https://www.google.com/search?q=%D1%80%D0%B0%D0%B7%D0%B1%D0%BE%D1%80+URL+%D0%BD%D0%B0+%D0%BF%D0%B0%D1%80%D0%B0%D0%BC%D0%B5%D1%82%D1%80%D1%8B+C%2B%2B&ie=utf-8&oe=utf-8#q=parse+URL+C%2B%2B

I see that uriparser treats some text in the query as UTF-16 rather than UTF-8. That's why I have to convert from UTF-16 to UTF-8 manually.
It would be very useful if you added support for UTF-8.

Regards.

Discussion

  • Sebastian Pipping

    Hello rasjv,

    Are you using uriParseUriA/char or uriParseUriW/wchar_t? All that uriparser knows about encoding is single-byte or double-byte characters. If I'm not mistaken, all characters I see in the URI above have the same single-byte encoding in both ASCII and UTF-8. In that sense, UTF-8 is supported already. Please help me understand what you are asking for.

    Best, Sebastian

     
  • rasjv

    rasjv - 2015-01-20

    Hello, Sebastian.

    I use uriParseUriW. The whole code is:

    UriParserStateW state = {0};
    UriUriW uri = {0};

    state.uri = &uri;
    if (uriParseUriW(&state, L"https://www.google.com/search?q=%D1%80%D0%B0%D0%B7%D0%B1%D0%BE%D1%80+URL+%D0%BD%D0%B0+%D0%BF%D0%B0%D1%80%D0%B0%D0%BC%D0%B5%D1%82%D1%80%D1%8B+C%2B%2B&ie=utf-8&oe=utf-8#q=parse+URL+C%2B%2B\x0") != URI_SUCCESS)
    {
        /* Failure */
        uriFreeUriMembersW(&uri);
    }
    // success
    // do something with uri

    UriQueryListW * queryList = 0;
    int itemCount;
    if (uriDissectQueryMallocW(&queryList, &itemCount, uri.query.first,
            uri.query.afterLast) != URI_SUCCESS)
    {
        /* Failure */
    }
    // success
    // do something with queryList
    const wchar_t * query1;
    query1 = queryList->value;

    uriFreeQueryListW(queryList);
    uriFreeUriMembersW(&uri);

    UTF-8 characters are not double-byte characters; they can be from 1 to 6 bytes long.
    UTF-16 (wchar_t) characters are double-byte in most cases (there are rare special cases where a UTF-16 character takes more than two bytes, but that is off-topic).

    So, I'll show you a screenshot so you can understand it more clearly.
    I have a query in UTF-8, but escaped in the URL:
    %D1%80%D0%B0%D0%B7%D0%B1%D0%BE%D1%80+URL+%D0%BD%D0%B0+%D0%BF%D0%B0%D1%80%D0%B0%D0%BC%D0%B5%D1%82%D1%80%D1%8B+C%2B%2B
    This is a UTF-8 string, not UTF-16! It means: разбор URL на параметры C++ (Russian for "parsing a URL into parameters C++")

    If you treat it as UTF-16, you will get what you can see on the screenshot, highlighted in red:
    [screenshot: query1]

    So I need to manually convert the query1 bytes to a UTF-8 string with this code (for Windows):

    wchar_t query1_utf16[256];
    char query1_corrected[256];
    int len, rez, i;
    len = wcslen(query1);
    for (i = 0; i < len; i++)
        query1_corrected[i] = (char)*((char *)query1 + 2 * i);  /* take the low byte of each wchar_t */
    rez = MultiByteToWideChar(CP_UTF8, 0, query1_corrected, len, query1_utf16, 256);
    query1_utf16[rez] = 0;

    And we see exactly what we must see:
    [screenshot: query1_corrected]

    If I use your char functions (A-ending), it's still the same, except that I don't need any corrections and can convert the multibyte characters of "query1" to UTF-16 directly. The "query1" string must be treated as a UTF-8 string if you say "UTF-8 is supported already". And it must be converted to UTF-16 on Windows, because that is the default Unicode format for that OS.

    It would also be very useful if you provided the size in bytes of such a UTF-8 string (it can be calculated during parsing), so that no additional calculations are needed. So I think a new member named size should be added to the UriQueryList struct, next to the existing key, value, and next:

    int size;

     

    Last edit: rasjv 2015-01-20
  • Sebastian Pipping

    My understanding is that

    • you have a string in a wchar_t array

    • containing a percent-encoded UTF-8 string.

    Your options are:

    a) Convert the string into a char array, picking every second byte. If the URI is valid, that's a lossless operation. You then parse the URI using uriParseUriA, run uriDissectQueryMallocExA to dissect the query, and run uriUnescapeInPlaceExA on the query parts. That should give valid UTF-8 if it was valid initially (a rough sketch follows below).

    b) Keep the string in the wchar_t array, use uriParseUriW, then use uriDissectQueryMallocExW, copy the query parts into a char array picking every second byte (again lossless), and run uriUnescapeInPlaceExA on those; again, that's valid UTF-8 if it was valid initially.
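
    For illustration, here is a minimal sketch of route a). The helper names (narrowAscii, demo) are made up for this example, and it assumes uriDissectQueryMallocA percent-decodes the key/value pairs the same way the W variant did in your snippet above; if it does not in your version, run uriUnescapeInPlaceA on each part afterwards:

    #include <uriparser/Uri.h>
    #include <stdlib.h>
    #include <wchar.h>

    /* Narrow a wchar_t URI that holds only ASCII code points down to char,
     * i.e. "pick every second byte"; lossless while every unit is < 0x80. */
    static char * narrowAscii(const wchar_t * wide)
    {
        const size_t len = wcslen(wide);
        char * narrow = (char *)malloc(len + 1);
        size_t i;
        if (narrow == NULL)
            return NULL;
        for (i = 0; i <= len; i++)
            narrow[i] = (char)wide[i];
        return narrow;
    }

    static void demo(const wchar_t * wideUri)  /* wideUri: your L"https://..." input */
    {
        char * uriString = narrowAscii(wideUri);
        UriParserStateA state;
        UriUriA uri;
        UriQueryListA * queryList = NULL;
        int itemCount = 0;

        if (uriString == NULL)
            return;
        state.uri = &uri;
        if ((uriParseUriA(&state, uriString) == URI_SUCCESS)
                && (uri.query.first != NULL)  /* the URI actually has a query */
                && (uriDissectQueryMallocA(&queryList, &itemCount,
                        uri.query.first, uri.query.afterLast) == URI_SUCCESS))
        {
            /* queryList->value now holds the decoded bytes of the first
             * parameter, i.e. the UTF-8 you are after */
        }
        uriFreeQueryListA(queryList);  /* NULL-safe */
        uriFreeUriMembersA(&uri);
        free(uriString);
    }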

    About adding length to UriQueryList: member "value" is a zero terminated string so it is carrying its length around implicitly.

    Best, Sebastian

     
  • rasjv

    rasjv - 2015-01-22

    Your options are...

    So currently you don't have support for UTF-8, at least not for Windows. And I must manually convert from UTF-8 bytes to UTF-16 as I did.

    About adding length to UriQueryList: member "value" is a zero terminated string so it is carrying its length around implicitly.

    A zero-terminated "value" will not work for a UTF-8 string.
    A UTF-8 string can contain zeroes! That's why the "size" field is needed.

     

    Last edit: rasjv 2015-01-22
  • Sebastian Pipping

    Hello again,

    A UTF-8 string can contain zeroes! That's why the "size" field is needed.

    The only null byte possible in a UTF-8 string is an actual null character.
    Please check the table at https://en.wikipedia.org/wiki/UTF-8#Description .
    So UTF-8 can contain null bytes to the very same degree as ASCII.
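
    Just to illustrate with plain C (nothing uriparser-specific; the byte values below are simply your example word "разбор" in UTF-8): because a decoded UTF-8 string contains no embedded zero bytes, strlen() already yields the byte length that a separate "size" field would carry:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char * utf8 = "\xD1\x80\xD0\xB0\xD0\xB7\xD0\xB1\xD0\xBE\xD1\x80";  /* "разбор" in UTF-8 */
        printf("%u\n", (unsigned)strlen(utf8));  /* prints 12: 12 bytes for 6 characters */
        return 0;
    }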

    More importantly, the string in field "value" is not a full UTF-8 string; it only uses single-byte characters shared with ASCII. UTF-8 is what you have after the conversion, in another buffer.

    So currently you don't have support for UTF-8, at least not for Windows.

    It's the same for Linux.

    And I must manually convert from UTF-8 bytes to UTF-16 as I did.

    One way or another, an additional call to a converter function is needed. uriparser could ship with a UTF-8 to UTF-16 conversion function, but I do not consider that to be uriparser's job. There are other libraries for that, which you can easily use together with uriparser with more or less the same level of convenience.

    I'm happy to have a quick Skype/Mumble/phone/Jitsi call about it some time, if you feel that could help. In that case, please contact me off-list about a time and the medium of choice.

    Best, Sebastian

     
  • rasjv

    rasjv - 2015-01-23

    The only null byte possible in a UTF-8 string is an actual null character.
    Please check the table at https://en.wikipedia.org/wiki/UTF-8#Description .
    So UTF-8 can contain null bytes to the very same degree as ASCII.

    OK. But to exclude any potential error while converting or otherwise operating on it, it would be very useful to have the size of the buffer that I will convert to UTF-16 or treat as a UTF-8 string. It's simple to add, isn't it? You already parse the query, so adding the size is not very difficult, but it would be very useful.

     
  • Sebastian Pipping

    If you are speaking of the length of the verbatim content in field UriQueryListStructA.value, that's stored implicitly (as mentioned before).

    If you are speaking of the length of the content in field UriQueryListStructA.value after decoding to UTF-8, that would require internal UTF-8 decoding in uriparser, knowledge of the encoding used there, etc.

    Also, please note that adding fields to structures breaks ABI compatibility with prior releases so that's something library authors need to think twice about.

    If you aim at storing the length to know space requirements up front, what might help is a (safe) heuristic (a code sketch follows after this list):

    • A character in UTF-8 may take 1 to 4 bytes
    • A character in UTF-16 may take 2 to 4 bytes (see https://en.wikipedia.org/wiki/UTF-16#Examples for four-byte examples)
    • If the input was four-byte UTF-8 characters only, the UTF-16 output would take x1/2 to x1 the space in bytes, or "strlen(...) / 2 + 1" wchar_t elements at worst.
    • If the input was all single-byte UTF-8 characters, the UTF-16 output would take x2 to x4 (worsened on purpose) the space in bytes, or "strlen(...) * 2 + 1" wchar_t elements at worst.
    • So "strlen(...) * 2 + 1" makes a safe worst case wchar_t character space calculation for later conversion to UTF-16.
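
    For illustration only, here is a minimal sketch of that worst-case allocation on Windows, reusing MultiByteToWideChar from your snippet above; the helper name utf8ToUtf16 is made up for this example:

    #include <windows.h>
    #include <stdlib.h>
    #include <string.h>

    /* Convert a percent-decoded UTF-8 value to UTF-16, sizing the buffer with
     * the deliberately generous bound from above: strlen(utf8) * 2 + 1 wchar_t. */
    static wchar_t * utf8ToUtf16(const char * utf8)
    {
        const size_t len = strlen(utf8);
        const size_t capacity = len * 2 + 1;
        wchar_t * wide = (wchar_t *)malloc(capacity * sizeof(wchar_t));
        int written = 0;
        if (wide == NULL)
            return NULL;
        if (len > 0)
            written = MultiByteToWideChar(CP_UTF8, 0, utf8, (int)len,
                    wide, (int)capacity);
        wide[written] = L'\0';
        return wide;  /* caller releases with free() */
    }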

    Is that what you are looking for?

     
