search() in tld.c should use case insensitive domain comparison
Extract the TLD of any, world wide, URI.
Brought to you by:
alexis_wilke,
dooglio2
strncasecmp() won't work because:
1) That function only works for ASCII characters and not UTF-8. If we are to do it correctly, we probably want to support UTF-8 properly.
2) That function checks 'n' characters and, if they match, reports a match whether or not the end of string 'a' was reached. If you look closely, both strings have to be equal up to 'n' characters and then the end of string 'a' has to be reached as well. If not, the cmp() function returns false.
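To illustrate the second point, here is a minimal sketch of a comparison that also requires string 'a' to end after 'n' characters. The name equals_n_nocase() is hypothetical and this is not the library's cmp() function; it only handles ASCII case, so the first point still applies.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strncasecmp(), POSIX */

/* Hypothetical helper (not part of libtld): returns 1 when the
 * NUL-terminated string `a` is exactly the `n` characters found at `b`,
 * ignoring ASCII case; returns 0 otherwise. */
static int equals_n_nocase(const char *a, const char *b, size_t n)
{
    /* strncasecmp() alone accepts "company" against "COM" with n == 3,
     * so the length of `a` has to be checked as well. */
    if (strlen(a) != n)
    {
        return 0;
    }
    return strncasecmp(a, b, n) == 0;
}
```

With n set to 3, "com" matches the start of "COM/path" but "company" does not, even though strncasecmp() returns 0 in both cases.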
I will implement the UTF-8 conversion. That being said, if you call tld_check_uri() with a string that is already in lowercase, it works with the current version.
This brings us to the use of strncasecmp() in that other function... it is probably wrong there since it is not UTF-8 aware.
Last edit: Alexis Wilke 2015-08-21
Ah... actually the tolower() and strncasecmp() are used for the protocol, which is limited to ASCII letters (a-z, A-Z), digits, and underscore. So that's all proper in the check function.
Okay, I see that the problem is a tad bit more complicated. URIs are expected to be passed with the %XX characters still in place (i.e. the URL should not be decoded first.)
If you look at the data (in tld_data.c) and search for %, you will find that all URIs are defined with %XX and not with UTF-8 characters. Also, I write the %XX in lowercase, which is not conventional. So I guess I have two problems to resolve.
Okay, I did not change the cmp() or tld() functions. The reason is that those functions are heavily optimized. Adding a "str case cmp" in the middle would mean transforming the user data to lowercase hundreds of times instead of just once. Not only that, the data is saved using the %XX syntax (as opposed to direct UTF-8 bytes), which makes changing the cmp() function very complicated.
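As a rough illustration of the "lowercase once, up front" idea, here is an ASCII-only sketch. The name ascii_lowercase_copy() is hypothetical, and it is a simplification, not the library's implementation: it lowercases ASCII letters, including the hex digits of any %XX sequence, but it does not change the case of the non-ASCII character that a %XX sequence encodes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (ASCII only): returns a newly allocated lowercase
 * copy of `domain`, or NULL if the allocation fails. The caller is
 * responsible for calling free() on the result. */
static char *ascii_lowercase_copy(const char *domain)
{
    size_t len = strlen(domain);
    char *out = malloc(len + 1);
    if (out == NULL)
    {
        return NULL;
    }
    /* i <= len so the terminating NUL is copied too */
    for (size_t i = 0; i <= len; ++i)
    {
        char c = domain[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
    }
    return out;
}
```

The copy is made a single time before the search, so the optimized comparison code can keep doing plain byte comparisons.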
So instead I have a function named tld_domain_to_lowercase(), which allocates a new buffer and generates the lowercase version of the domain name. It may fail if the input is not encoded (i.e. does not use %XX) because the output would end up being too large. If most of your input is ASCII domains (English), then it won't matter.
I uploaded version 1.5.0. Let me know how that plays out for you. I updated example.c to show how the function gets used. Make sure to free() the string it returns or you'll have a leak, but remember that you cannot free the string while you are still using the result of tld() on it, since the pointers in the tld_info structure will point into that string.
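The ordering constraint on free() can be sketched with stand-ins. Everything below is hypothetical: struct label_info and find_last_label() only mimic the idea of a result structure holding a pointer into the searched string, and they are not the libtld API (see example.c for the real usage).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a result structure: f_label points INTO the
 * string that was searched, it is not a copy. */
struct label_info
{
    const char *f_label;
};

/* Hypothetical stand-in for the search: points f_label at the last '.'
 * of `domain`; returns 1 on success, 0 if there is no '.'. */
static int find_last_label(const char *domain, struct label_info *info)
{
    info->f_label = strrchr(domain, '.');
    return info->f_label != NULL;
}

/* Returns 1 if the last label of `domain` equals `expected` (such as
 * ".com") ignoring ASCII case, 0 if not, -1 on allocation failure. */
static int last_label_is(const char *domain, const char *expected)
{
    size_t len = strlen(domain);
    char *lower = malloc(len + 1);
    if (lower == NULL)
    {
        return -1;
    }
    for (size_t i = 0; i <= len; ++i)
    {
        char c = domain[i];
        lower[i] = (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
    }

    struct label_info info;
    int result = find_last_label(lower, &info)
              && strcmp(info.f_label, expected) == 0;

    /* info.f_label points into `lower`, so free() comes only after the
     * last use of `info`. */
    free(lower);
    return result;
}
```

Freeing the lowercase buffer before the comparison would leave info.f_label dangling, which is the situation the warning above describes.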
Sorry, I missed adding the new source in 1.5.0 so I published 1.5.1 today. You probably want to use that newer version to test the new lowercase feature.
You could pre-calculate the size of the buffer so that it doesn't overflow when applying url-encoding.
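A minimal sketch of that pre-calculation, assuming every byte of 0x80 or above is written as a three-character %XX sequence while ASCII bytes (including any existing %XX sequences) are copied unchanged. The name encoded_size() is hypothetical and not part of the library.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes, including the terminating NUL,
 * needed to hold `domain` once each non-ASCII byte is written as %XX. */
static size_t encoded_size(const char *domain)
{
    size_t size = 1; /* terminating NUL */
    for (const unsigned char *p = (const unsigned char *) domain; *p != '\0'; ++p)
    {
        /* a byte >= 0x80 becomes '%' plus two hex digits */
        size += (*p >= 0x80) ? 3 : 1;
    }
    return size;
}
```

This gives the size for encoding the input as-is. If lowercasing a character can change the length of its UTF-8 form, a real implementation would have to measure the lowercase output instead.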