search() in tld.c should use case insensitive domain comparison

Extract the TLD of any, world wide, URI.

Brought to you by: alexis_wilke, dooglio2

#1 search() in tld.c should use case insensitive domain comparison

Milestone: 1.0

Status: open

Owner: nobody

Labels: None

Updated: 2015-08-24

Created: 2015-08-21

Creator: Dzmitry

Private: No

As per RFC 3986, the host name is case-insensitive.
Instead of the custom (and case-sensitive) cmp(), you could use strncasecmp(), which you already use for the scheme in tld_check_uri().

Discussion

Alexis Wilke - 2015-08-21

strncasecmp() won't work because:

1) That function only works for ASCII characters, not UTF-8. If we are to do it correctly, we probably want to support UTF-8 properly.
2) That function checks 'n' characters and, if they match, reports a match whether or not the end of string 'a' was reached. If you look closely at cmp(), both strings have to be equal up to 'n' characters and then

*a == '\0'

has to be true. If not, then the cmp() function returns false.
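
The second point can be illustrated with a small sketch. The helper names `label_equals` and `prefix_equals_nocase` are hypothetical and are not libtld's cmp(); the sketch only assumes that `a` is a NUL-terminated table entry and `b` is a slice of `n` bytes taken from the user's URI.

```c
#include <stddef.h>
#include <strings.h>  /* strncasecmp() (POSIX) */

/* Hypothetical helper, not libtld's cmp(): returns 1 only when the
 * NUL-terminated entry `a` is exactly the n bytes starting at `b`. */
static int label_equals(const char *a, const char *b, size_t n)
{
    for (; n > 0; --n, ++a, ++b) {
        /* `a` ended early or the bytes differ: no match */
        if (*a == '\0' || *a != *b) {
            return 0;
        }
    }
    /* all n bytes matched; `a` must also end here */
    return *a == '\0';
}

/* What a bare strncasecmp() call would answer: only the first n bytes
 * are compared, so a longer entry `a` still counts as a match. */
static int prefix_equals_nocase(const char *a, const char *b, size_t n)
{
    return strncasecmp(a, b, n) == 0;
}
```

With these definitions, `label_equals("community", "com", 3)` is 0 while `prefix_equals_nocase("community", "COM", 3)` is 1, which is the false positive described in point 2.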

I will implement the UTF-8 conversion. That being said, if you were to call the tld_check_uri() with a string in lowercase already, it would work with the current version.

This brings us to the use of strncasecmp() in that other function... it is probably wrong there since it is not UTF-8 aware.

Last edit: Alexis Wilke 2015-08-21

Alexis Wilke - 2015-08-21

Ah... actually, tolower() and strncasecmp() are used for the protocol, which is limited to ASCII letters (a-z, A-Z), digits, and underscore. So that's all proper in the check function.

Alexis Wilke - 2015-08-21

Okay, I see that the problem is a tad more complicated. URIs are expected to be passed with the %XX characters still in place (i.e. the URL should not be decoded first).

If you look at the data (in tld_data.c) and search for %, you will find that all URIs are defined with % and not with UTF-8 characters. Also, I write the %XX in lowercase, which is not conventional. So I guess I have two problems to resolve.

Alexis Wilke - 2015-08-22

Okay, I did not change the cmp() or tld() functions. The reason is that those functions are heavily optimized. Adding a "str case cmp" in the middle would mean transforming the user data to lowercase hundreds of times instead of just once. Not only that, the data is saved using the %XX syntax (as opposed to direct UTF-8 bytes), which makes changing the cmp() function very complicated.

So instead I have a function named tld_domain_to_lowercase(), which allocates a new buffer and generates the lowercase version of the domain name. It may fail if the input is not encoded (i.e. does not use %XX) because the output would end up being too large. If most of your inputs are ASCII domains (English), then it won't matter.

I uploaded version 1.5.0. Let me know how that plays out for you. I updated example.c to show how the function gets used. Make sure to free() the string it returns or you'll have a leak, but remember that you cannot free the string while you are still using the result of tld() on it, since the pointers in the tld_info structure point into that string.
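
The ownership rule above can be sketched without linking against libtld. Both helpers below are hypothetical stand-ins: `lowercase_dup()` is an ASCII-only substitute for tld_domain_to_lowercase() and `last_label()` substitutes for a lookup such as tld() that returns a pointer into its input. Neither is the library's code.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical ASCII-only stand-in for tld_domain_to_lowercase():
 * returns a malloc()'ed lowercase copy, or NULL if allocation fails.
 * Non-ASCII bytes are copied unchanged in the default "C" locale. */
static char *lowercase_dup(const char *domain)
{
    size_t len = strlen(domain);
    char *copy = malloc(len + 1);
    if (copy == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < len; ++i) {
        copy[i] = (char) tolower((unsigned char) domain[i]);
    }
    copy[len] = '\0';
    return copy;
}

/* Hypothetical stand-in for a lookup whose result points INTO its
 * argument: the last dot-separated label, including the dot. */
static const char *last_label(const char *domain)
{
    const char *dot = strrchr(domain, '.');
    return dot != NULL ? dot : domain;
}
```

A caller would lowercase the domain once, run the lookup on the copy, finish using the returned pointer, and only then free() the copy. Freeing earlier would leave the lookup result dangling.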

Alexis Wilke - 2015-08-22

Sorry, I missed adding the new source in 1.5.0 so I published 1.5.1 today. You probably want to use that newer version to test the new lowercase feature.

Dzmitry - 2015-08-24

"It may fail if the input is not encoded (i.e. use %XX) because the output would end up being too large."

You could pre-calculate the size of the buffer so that it doesn't overflow when applying url-encoding.
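
The suggestion can be sketched as a two-pass encoder: the first pass counts the output bytes, the second writes into a buffer of exactly that size. This is not libtld's code. The names `is_unreserved`, `encoded_length`, and `percent_encode` are hypothetical, the input is assumed to be raw (not yet percent-encoded) bytes, and lowercase hex digits are used only because the thread says the data in tld_data.c is written that way.

```c
#include <stdlib.h>

/* RFC 3986 "unreserved" characters are copied as-is; every other
 * byte is written as %xx (three output bytes). */
static int is_unreserved(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
}

/* Pass 1: number of bytes the encoded form needs, excluding the NUL. */
static size_t encoded_length(const char *s)
{
    size_t len = 0;
    for (; *s != '\0'; ++s) {
        len += is_unreserved((unsigned char) *s) ? 1 : 3;
    }
    return len;
}

/* Pass 2: encode into a buffer sized by pass 1, so it cannot overflow.
 * Returns a malloc()'ed string, or NULL if allocation fails. */
static char *percent_encode(const char *s)
{
    static const char hex[] = "0123456789abcdef";
    char *out = malloc(encoded_length(s) + 1);
    char *p = out;
    if (out == NULL) {
        return NULL;
    }
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char) *s;
        if (is_unreserved(c)) {
            *p++ = (char) c;
        } else {
            *p++ = '%';
            *p++ = hex[c >> 4];
            *p++ = hex[c & 15];
        }
    }
    *p = '\0';
    return out;
}
```

Because a literal `%` is itself encoded, input that already contains %XX sequences would be double-encoded. A real implementation would have to decide how to treat mixed input.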
