Sorry to take so long to get around to this.
My take on this is that you're solving a specific problem (allowing the slug
field to have more characters in it) with a general solution (changing the
definition of valid::url). As I see it, AndyS's point is that this may have
unexpected consequences throughout the app.
Andy, it's unclear to me how your solution works since the validation is
performed in Item_Model::valid_slug and you're changing valid::url -- can
you explain what I'm missing there?
Either way, I think we should expand Item_Model::valid_slug to accept a
broader set of Unicode characters that users are likely to include in
their URLs. That would restrict the impact to the slug field, which is
what matters to the user, and should be a relatively simple change that
accomplishes what 80% of our users need...
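To make that concrete, here is a rough sketch of the kind of check I have
in mind. It is written in Python purely for illustration (the real change
would live in Item_Model::valid_slug, in PHP), and the function name and
the exact character class are my assumptions, not the current code:

```python
import re

# Hypothetical stand-in for the slug check; NOT the actual Gallery code.
# For str patterns in Python 3, \w matches Unicode letters and digits
# plus the underscore, so this accepts non-ASCII word characters and "-".
_SLUG_RE = re.compile(r"[\w-]+")


def valid_slug(slug: str) -> bool:
    """Return True if slug is non-empty and contains only Unicode
    word characters (letters, digits, underscore) or hyphens."""
    return _SLUG_RE.fullmatch(slug) is not None


if __name__ == "__main__":
    for candidate in ["my-photo_1", "été-2011", "写真", "a/b", "a b", ""]:
        print(repr(candidate), valid_slug(candidate))
```

The point is that only the slug rule gets looser; slashes, spaces and dots
are still rejected, and valid::url is left alone. A real change would still
need to decide on things like normalization and combining marks, which
this sketch ignores.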
On Sun, Apr 3, 2011 at 3:10 PM, Andy Lindeman <alindeman@...> wrote:
> Hi all--
> As a part of my "get to know Gallery 3.x by taking low hanging fruit
> tickets," I ran across this one:
> The gist of it is that the validation routines for URLs that
> Gallery/Kohana/PHP use do not allow non-ASCII characters. Therefore,
> Gallery does not allow non-ASCII characters in URL fields, even though
> most (all?) modern browsers handle them appropriately.
> This behavior of non-ASCII characters is a documented (weakness?) of
> PHP's `filter_var`:
> Maybe being slightly naive, I borrowed a regular expression from a
> Ruby/Rails plugin and overrode Kohana's validation routine:
> Andy brings up some good points with my proposed fix, though, which
> are worth discussing:
> Changes in validation code need a very thorough review especially for
> consequences in terms of security, but also for user experience. And
> that's especially true if you make a validation check more lenient /
> relaxed compared to what it has been before.
> In this case, we're opening the gate from a subset of ASCII characters
> (not all ASCII characters are allowed in the host part) to (almost)
> all Unicode characters. That's clearly wrong.
> If there's not a well-vetted library to handle this instead of
> filter_var, then I'd probably stick with filter_var for now and push
> hard on PHP core maintainers to provide a solution for IDNA.
> A few use cases to consider:
> Where do $url strings come from and where do they go? Where do they
> appear in the UI, emails, feeds, do we send HTTP / email or other
> network requests out using these strings as addresses?
> Is it better to be more lenient at the input and then to realize that
> our networking functions can't deal with IDNA and fail? UX is worse in
> these cases.
> Is it acceptable to accept URLs with confusable characters in the
> domain name and to be a service which shows phishing links in comments
> / elsewhere in the UI? (consider paypal.com vs. look-a-likes)
> What other security concepts rely on the fact that URLs only have a
> subset of ASCII characters in the domain name (and other requirements
> for other parts for the URLs)?
> I don't have an answer to all of these questions. But if you want to
> make a change to this core component in input filtering, I'd ask you
> to investigate these questions, starting a discussion so we can come
> to a conclusion together.
> For now, my impression is that if we can't find a well vetted library
> which takes care of IDNA, and if we can't show that other parts of the
> Gallery application don't have implicit preconditions that URLs aren't
> IDNA, then we have to punt on more lenient / more internationalized
> URL validation for the time being.
> Andy Lindeman
> __[ g a l l e r y - d e v e l ]_________________________
> [ list info/archive --> http://gallery.sf.net/lists.php ]
> [ gallery info/FAQ/download --> http://gallery.sf.net ]