Re: [Dbpedia-discussion] URIs with "<" in them confusing Virtuoso and Jena

Chris, Ted,

On 25 Aug 2008, at 23:44, Chris Schumacher wrote:
> An old URI RFC (section 2.4.3) states that angle brackets are  
> illegal in URIs, but the current spec and RFC 3986 seem to allow  
> them(!).

No, it does not allow them. Angle brackets are illegal in URIs. The
URI grammar in RFC 3986 explicitly lists all characters that are
allowed in a URI. If a character is not listed, it is not allowed.
Angle brackets are not listed and thus not allowed. Check the handy
summary of the URI grammar in Appendix A ("Collected ABNF for URI");
it makes it easy to trace which characters are allowed where.

(As with any character that is not allowed, it can be %-encoded, and  
the resulting triplets (%3C, %3E) can be included in a URI.)
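
As a rough illustration, here is a minimal Python sketch of that rule. It is not code from Virtuoso, Jena, or the DBpedia extraction; the function names `illegal_chars` and `encode_illegal` are made up for this example. It only checks characters against the sets named in the RFC 3986 grammar (unreserved, gen-delims, sub-delims, plus "%" for percent-encoded triplets). It does not check that each character sits in a component where the grammar permits it.

```python
import string
from urllib.parse import quote

# Character sets from the RFC 3986 grammar. "%" is included because it
# introduces a percent-encoded triplet such as %3C.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
URI_CHARS = UNRESERVED | GEN_DELIMS | SUB_DELIMS | {"%"}


def illegal_chars(uri):
    """Return the characters in `uri` that cannot appear anywhere in a URI.

    Character-level check only: it does not validate component structure.
    """
    return sorted(set(uri) - URI_CHARS)


def encode_illegal(uri):
    """Percent-encode every character outside the RFC 3986 character sets.

    Existing percent-encoded triplets are left alone because "%" is 'safe'.
    """
    return quote(uri, safe="".join(sorted(URI_CHARS)))


if __name__ == "__main__":
    sample = "http://www.sample.com<dogs"
    print(illegal_chars(sample))   # the offending characters
    print(encode_illegal(sample))  # the same string with them %-encoded
```

Percent-encoding the bracket makes the string syntactically legal, but it does not make it the URI the page author meant; the trailing `<!--` and `<br` fragments in the dump are extraction errors and should simply be cut off.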

On 26 Aug 2008, at 02:26, Ted Thibodeau Jr wrote:
> It looks to me like, although these URIs are all valid

Wrong. They are not valid.

> it contains the left angle bracket
> "<" -- which *is* permitted in a general sense

Wrong. It is *not* permitted, anywhere, in a URI.

> * Richard Cyganiak [2008/08/20 09:29 AM +0100] wrote:
>> Ampersands are allowed in URIs, so the Yago URIs are perfectly
>> fine according to all the specs. (We *might* still want to
>> %-encode the ampersand in those URIs, but just for consistency
>> with our other URIs, not because the specs require it. That's
>> a separate question.)
>
> Absolute statements can be dangerous.  On lists like these,
> statements such as the above can become quoted authority,
> even when incorrect ... as now.

Ted, I did not make an absolute statement. I said “Ampersands are  
allowed in URIs”, which is true. I did not say that ampersands are  
allowed *everywhere* in URIs.

I'm well aware of the details of the relevant specifications, but in  
the context of the original message there was no need to go into that  
much detail. For the issue at hand (Virtuoso didn't entity-escape the  
ampersand, as required by the XML spec), all that matters is the fact  
that ampersands can occur in a URI at all.
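
To illustrate the distinction, here is a small Python sketch. The URI and the helper name `rdf_description` are hypothetical, and this is not how Virtuoso serializes its output. The point is only that the ampersand stays as-is in the URI itself and is entity-escaped when the URI is written into XML.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical URI with an ampersand in the query component. The URI itself
# is fine; only its XML serialization needs the "&" written as "&amp;".
uri = "http://example.org/resource?a=1&b=2"


def rdf_description(uri):
    """Build an RDF/XML-style element with `uri` as an attribute value.

    quoteattr() escapes "&", "<" and ">" and adds the surrounding quotes.
    """
    return "<rdf:Description rdf:about=%s/>" % quoteattr(uri)


if __name__ == "__main__":
    print(escape(uri))           # escaped for use in XML character data
    print(rdf_description(uri))  # escaped and quoted as an attribute value
```

An XML parser reading that element hands the application the original URI back, with a plain "&" in it.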

Please be careful when accusing other people of being incorrect. It  
looks rather clueless when you do this while not getting your own  
facts straight.

That being said, I appreciate your effort to educate yourself on the  
relevant specs. Just don't fall into the trap of assuming that you are  
infallible just because you read a few pages of RFCs.

Thanks,
Richard

On 26 Aug 2008, at 02:26, Ted Thibodeau Jr wrote:

> Hi, Chris --
>
> * Chris Schumacher [2008/08/25 03:44 PM -0700] wrote:
>> This is similar to the recent ampersand issue.
>>
>> An old URI RFC [1] (section 2.4.3) states that angle brackets are
>> illegal in URIs, but the current spec [2] and RFC 3986 [3] seem to
>> allow them(!).  The dbpedia3.1 externallinks_en.nt file has several
>> URIs with "<" which is leading to confusion for both Virtuoso and
>> Jena.
>>
>> For example, at <http://dbpedia.org/snorql/> the following query
>> will confuse virtuoso:
>>
>>   SELECT * WHERE {
>>   <http://www.sample.com<dogs> ?p ?o
>>   }
>>
>> remain in light,
>> cws
>>
>> [1] <http://www.faqs.org/rfcs/rfc2396.html>
>> [2] <http://www.w3.org/Addressing/URL/uri-spec.html>
>> [3] <http://www.ietf.org/rfc/rfc3986.txt>
>
>
> Well...
>
> First thing...
>
> I've just dug into the file in question, and there are 8 URIs
> causing this sort of trouble, all in the ?o position, each in
> a single triple.
>
>   <http://www.youtube.com/watch?v=vgKWDwRw_DE<!--> .
>
>   <http://www.nytimes.com/2008/03/14/business/media/14adco.html?_r=1&oref=slogin<br> .
>
>   <http://links.jstor.org/sici?sici=0891-3609%28192712%2F192801%2923%3A117%3C19%3AASOTEV%3E2.0.CO%3B2-4&size=LARGE&origin=JSTOR-enlargePage<!--> .
>
>   <http://links.jstor.org/sici?sici=0891-3609%28192912%2925%3A130%3C24%3ATSCOCS%3E2.0.CO%3B2-T<!--> .
>
>   <http://www.youtube.com/watch?v=uVcje0t3-nE<YouTube> .
>
>   <http://www.royalsportal.de/forum/index.php?showtopic=22787&hl=Hesse<!--> .
>
>   <http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=text&uid=6345791&dopt=Abstract<!--> .
>
>   <http://www.up-rs.si/up-rs/uprs.nsf/dokumentiweb/46F76AD2AEA5E1A6C125746A004692FA?OpenDocument<!--> .
>
> It looks to me like, although these URIs are all valid, they won't
> have the effect one might expect, when dereferenced.  The snippets
> from the enclosed left-angle-bracket to the end of the URI are (to
> me) clearly erroneous -- 6 of them are the comment-starting "<!--";
> one is a "<br", and the last is the literal text, "<YouTube".
>
> These are errors, and should be tidied up in the sources.
>
>
> That said -- while this has some surface resemblance to the ampersand
> issue, in that resolution is found by reading RFCs, including the RFC
> 3986 you cite --
>
>   [1] <http://www.rfc.net/rfc3986.html>.
>
> -- the sample URI in your query is broken, and I would expect it to
> get an error back (so there does seem to be some error handling to
> be added to some tools).
>
> I was writing a followup to Richard's assertion that "Ampersands
> are allowed in URIs", but as usual, a proper followup is rather
> full of details and detours, so I wasn't done yet.
>
> But clearly, it's needed now, so I'll include it here in its
> current state, modified somewhat.
>
>
> In the current case, your sample URI --
>
>   <http://www.sample.com<dogs>
>
> -- is invalid, not simply because it contains the left angle bracket
> "<" -- which *is* permitted in a general sense -- but because of
> *where* that character is found.
>
> This is the key piece of the URI syntax which your sample breaks --
>
>   [2] <http://www.rfc.net/rfc3986.html#s3.2.>
>
>   The authority component is preceded by a double slash ("//")
>   and is terminated by the next slash ("/"), question mark ("?"),
>   or number sign ("#") character, or by the end of the URI.
>
>      authority   = [ userinfo "@" ] host [ ":" port ]
>
> As there is no "@" and no ":", host is the sub-segment that matters.
> The following further analysis comes from Appendix A, the Collected
> ABNF for URI --
>
>   [3] <http://www.rfc.net/rfc3986.html#sA.>
>
>   host          = IP-literal / IPv4address / reg-name
>
> I think we can agree that the host value is neither an IP-literal
> nor an IPv4address; so it must be reg-name.
>
>   reg-name      = *( unreserved / pct-encoded / sub-delims )
>
>   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
>
>   pct-encoded   = "%" HEXDIG HEXDIG
>
>   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>                 / "*" / "+" / "," / ";" / "="
>
> As the left-angle-bracket is not specifically included in either
> the unreserved or sub-delims character sets, it *must* be percent-
> encoded to be included in this segment.
>
>
>
> Interestingly, Appendix C of RFC 3986 includes most of what was
> in RFC 2396 Sec 2.4.3 -- but there is nothing in the current RFC
> which makes angle-brackets or double-quotes illegal in URIs, and
> thus I have to question whether this section remains accurate --
>
>   [4] <http://www.rfc.net/rfc3986.html#sD.>
>
>   [...] In such cases, it is important to be able to delimit
>   the URI from the rest of the text, and in particular from
>   punctuation marks that might be mistaken for part of the URI.
>
>   In practice, URIs are delimited in a variety of ways, but
>   usually within double-quotes "http://example.com/", angle
>   brackets <http://example.com/>, or just by using whitespace:
>
>      http://example.com/
>
>   These wrappers do not form part of the URI.
>
> I'm left wondering whether the omission of angle-brackets from
> the reserved list was intentional or accidental.
>
>
> That said, to what I was already writing --
>
>
> * Richard Cyganiak [2008/08/20 09:29 AM +0100] wrote:
>> Ampersands are allowed in URIs, so the Yago URIs are perfectly
>> fine according to all the specs. (We *might* still want to
>> %-encode the ampersand in those URIs, but just for consistency
>> with our other URIs, not because the specs require it. That's
>> a separate question.)
>
> Absolute statements can be dangerous.  On lists like these,
> statements such as the above can become quoted authority,
> even when incorrect ... as now.
>
> Ampersands are allowed in *some* components of *some* URIs, and
> those *do* include the Yago URIs, so far as I can tell.
>
>  [5] <http://www.rfc.net/rfc3986.html#s2.2.>
>
>   2.2. Reserved Characters
>
>   URIs include components and subcomponents that are delimited
>   by characters in the "reserved" set.  These characters are
>   called "reserved" because they may (or may not) be defined as
>   delimiters by the generic syntax, by each scheme-specific
>   syntax, or by the implementation-specific syntax of a URI's
>   dereferencing algorithm.  If data for a URI component would
>   conflict with a reserved character's purpose as a delimiter,
>   then the conflicting data must be percent-encoded before the
>   URI is formed.
>
>      reserved    = gen-delims / sub-delims
>
>      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
>
>      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
>                  / "*" / "+" / "," / ";" / "="
>
>   The purpose of reserved characters is to provide a set of
>   delimiting characters that are distinguishable from other
>   data within a URI. URIs that differ in the replacement of
>   a reserved character with its corresponding percent-encoded
>   octet are not equivalent.  Percent-encoding a reserved
>   character, or decoding a percent-encoded octet that
>   corresponds to a reserved character, will change how the
>   URI is interpreted by most applications.  Thus, characters
>   in the reserved set are protected from normalization and
>   are therefore safe to be used by scheme-specific and
>   producer-specific algorithms for delimiting data
>   subcomponents within a URI.
>
>   A subset of the reserved characters (gen-delims) is used
>   as delimiters of the generic URI components described in
>   Section 3.  A component's ABNF syntax rule will not use
>   the reserved or gen-delims rule names directly; instead,
>   each syntax rule lists the characters allowed within that
>   component (i.e., not delimiting it), and any of those
>   characters that are also in the reserved set are "reserved"
>   for use as subcomponent delimiters within the component.
>   Only the most common subcomponents are defined by this
>   specification; other subcomponents may be defined by a URI
>   scheme's specification, or by the implementation-specific
>   syntax of a URI's dereferencing algorithm, provided that
>   such subcomponents are delimited by characters in the
>   reserved set allowed within that component.
>
>   URI producing applications should percent-encode data octets
>   that correspond to characters in the reserved set unless
>   these characters are specifically allowed by the URI scheme
>   to represent data in that component.  If a reserved character
>   is found in a URI component and no delimiting role is known
>   for that character, then it must be interpreted as
>   representing the data octet corresponding to that character's
>   encoding in US-ASCII.
>
>
> Note that the ampersand is included in the "sub-delims" portion
> of the "reserved" set.  Note that the reserved status of "&" in
> HTTP URIs has *changed* as the HTTP URI scheme RFC has evolved --
> but these changes have not always been properly documented!
>
>
> Appendix D of RFC 3986 [6] <http://www.rfc.net/rfc3986.html#sD.>
> is supposed to show "Changes from RFC 2396" -- but it left out
> the following, which is key here.
>
> The current RFC shows --
>
>   [7] <http://www.rfc.net/rfc3986.html#s3.4.>
>
>   3.4. Query
>
>   The query component contains non-hierarchical data that,
>   along with data in the path component (Section 3.3), serves
>   to identify a resource within the scope of the URI's scheme
>   and naming authority (if any).  The query component is
>   indicated by the first question mark ("?") character and
>   terminated by a number sign ("#") character or by the end
>   of the URI.
>
>      query       = *( pchar / "/" / "?" )
>
>   The characters slash ("/") and question mark ("?") may
>   represent data within the query component.  Beware that
>   some older, erroneous implementations may not handle such
>   data correctly when it is used as the base URI for relative
>   references (Section 5.1), apparently because they fail to
>   distinguish query data from path data when looking for
>   hierarchical separators.  However, as query components are
>   often used to carry identifying information in the form of
>   "key=value" pairs and one frequently used value is a
>   reference to another URI, it is sometimes better for
>   usability to avoid percent-encoding those characters.
>
> -- while the RFC it obsoleted shows --
>
>   [8] <http://www.rfc.net/rfc2396.html#s3.4.>
>
>   3.4. Query Component
>
>   The query component is a string of information to be
>   interpreted by the resource.
>
>      query         = *uric
>
>   Within a query component, the characters ";", "/", "?",
>   ":", "@", "&", "=", "+", ",", and "$" are reserved.
>
> Appendix D *does* say --
>
>   Section 2, on characters, has been rewritten to explain
>   what characters are reserved, when they are reserved,
>   and why they are reserved, even when they are not used
>   as delimiters by the generic syntax.  The mark characters
>   that are typically unsafe to decode, including the
>   exclamation mark ("!"), asterisk ("*"), single-quote ("'"),
>   and open and close parentheses ("(" and ")"), have been
>   moved to the reserved set in order to clarify the distinction
>   between reserved and unreserved and, hopefully, to answer
>   the most common question of scheme designers.  Likewise, the
>   section on percent-encoded characters has been rewritten,
>   and URI normalizers are now given license to decode any
>   percent-encoded octets corresponding to unreserved characters.
>   In general, the terms "escaped" and "unescaped" have been
>   replaced with "percent-encoded" and "decoded", respectively,
>   to reduce confusion with other forms of escape mechanisms.
>
> -- but there's no discussion of the substantial changes to
> Section 3.4, which I flagged above...
>
>
> So, while it should be clear that the ampersand has historically
> been reserved in *part* of the HTTP URI, it seems that this is no
> longer true -- but older implementations and authors who learned
> from the older RFC may well still treat it so -- and may well
> percent-encode it in components of the URI other than the Query
> Component, for a variety of reasons (not least being simple
> confusion as to when percent-encoding is required and when not --
> this aspect of the spec remains rather unclear to the average
> reader, thanks to the utter lack of examples covering such tricky
> scenarios as an ampersand in the "path-absolute").
>
>
> Maybe we need some URI validation tools, to start with...
>
>
> Be seeing you,
>
> Ted
>
> -- 
> A: Yes.                      http://www.guckes.net/faq/attribution.html
> | Q: Are you sure?
> | | A: Because it reverses the logical flow of conversation.
> | | | Q: Why is top posting frowned upon?
>
> Ted Thibodeau, Jr.           //               voice +1-781-273-0900 x32
> Evangelism & Support         //         mailto:tth...@op...
> OpenLink Software, Inc.      //              http://www.openlinksw.com/
>                                 http://www.openlinksw.com/weblogs/uda/
> OpenLink Blogs              http://www.openlinksw.com/weblogs/virtuoso/
>                               http://www.openlinksw.com/blog/~kidehen/
>    Universal Data Access and Virtual Database Technology Providers