#410 URL parsing not quite right

Linking
closed
Joel Uckelman
5
2012-10-11
2004-12-23
Dan F
No

I had the following URL in a page

[Twin Cities search entry |
http://twincities.citysearch.com/profile/35716516?cslink=roundup_name_cust&ulink=rounduproundupentity2-5_1_profile_2_1]

The page itself says:

BAD URL -- remove all of <, >, "

There are no <, >, or ". I remove the query stuff to leave

[Twin Cities search entry |
http://twincities.citysearch.com/profile/35716516]

and it doesn't warn anymore.

Either the parser or the error message should change.

I would track it down, but I don't have time this minute.

Dan

Discussion

  • Dan F
    2004-12-23

    Logged In: YES
    user_id=417594

    This is based on 1.3.9, by the way.

     
  • Joel Uckelman
    2005-01-14

    Logged In: YES
    user_id=245140

    I know what's going on here. The pairs of double underscores
    in your url are being interpreted as bold tags in the wiki
    markup language, and the url is parsed for markup prior to
    the check for whether it's a safe url (i.e., whether it
    contains any of <, >, "). Since the markup parser has
    inserted a '<strong>' tag, the url now contains <, and so
    it's rejected.

    So, that's a diagnosis of the problem. I haven't had a
    chance to dig any deeper to see where to correct it. I'll
    see about that in the next few days.

    (This bug is still in the version in CVS right now, so has
    been carried along from at least 1.3.9.)
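
A minimal sketch of the ordering problem diagnosed in this comment, not PhpWiki's actual code. The two functions and the example.com URL are hypothetical stand-ins, and the sketch assumes the markup pass rewrites paired double underscores into a `<strong>` element before the URL safety check runs:

```php
<?php
// Hypothetical stand-ins for illustration; these are not PhpWiki functions.

// Markup pass: rewrites __text__ as <strong>text</strong>.
function convert_bold_sketch(string $text): string {
    return preg_replace('/__(.*?)__/', '<strong>\1</strong>', $text);
}

// Safety check: a URL is rejected if it contains any of <, > or ".
function is_safe_url_sketch(string $url): bool {
    return !preg_match('/[<>"]/', $url);
}

// Made-up URL with a pair of double underscores in its query string.
$url = 'http://example.com/profile?ulink=a__b__c';

var_dump(is_safe_url_sketch($url));                       // bool(true): raw URL is clean
var_dump(is_safe_url_sketch(convert_bold_sketch($url)));  // bool(false): markup pass added "<"
```

Running the check on the raw URL, before any markup conversion touches it, would avoid the false "BAD URL" warning in this sketch.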

     
  • Joel Uckelman
    2005-01-14

    Logged In: YES
    user_id=245140

    Here's what I came up with as a solution:

    Look at ConvertOldMarkup() in lib/stdlib.php. Immediately
    before the line

          $subs["links"] = array($orig, $repl);
    

    try adding the following:

         $orig[] = '/\[(.*?)__(.*?)__(.*?)\]/';
         $repl[] = '[\1%5F%5F\2%5F%5F\3]';
    
         $orig[] = '/\[(.*?)\'\'(.*?)\'\'(.*?)\]/';
         $repl[] = '[\1%27%27\2%27%27\3]';
    

    and let me know if this appears to work for you. I'm not
    very familiar with this part of the codebase; it works for
    my test case, but it might break things of which I'm not aware.

    One side-effect this has is that it changes the appearance
    of unnamed external links containing pairs of "__" or pairs
    of "''" (single quotes). Right now they don't work at all,
    so this is probably an improvement, nonetheless.

    Also, I'm not sure that this does the proper thing with
    internal links containing paired underscores or single
    quotes. E.g., [foo__bar__] ends up as 'foo%5F%5Fbar%5F%5F'.
    Actually, I'm not sure whether it's even legal to have a
    page with that name.
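
To try the two substitutions outside of PhpWiki, something like the following should work. It is only a sketch: the helper name and the example.com link are made up for illustration, and the patterns are applied with a plain `preg_replace()` call rather than through `ConvertOldMarkup()` itself:

```php
<?php
// The two pattern/replacement pairs proposed in the comment above.
$orig = array();
$repl = array();

$orig[] = '/\[(.*?)__(.*?)__(.*?)\]/';
$repl[] = '[\1%5F%5F\2%5F%5F\3]';

$orig[] = '/\[(.*?)\'\'(.*?)\'\'(.*?)\]/';
$repl[] = '[\1%27%27\2%27%27\3]';

// Hypothetical helper, used only to exercise the patterns in isolation.
function escape_link_markup_sketch(string $text, array $orig, array $repl): string {
    return preg_replace($orig, $repl, $text);
}

// External link: the underscore pair inside the brackets is percent-encoded.
echo escape_link_markup_sketch('[Example | http://example.com/p?x=a__b__c]', $orig, $repl), "\n";
// prints: [Example | http://example.com/p?x=a%5F%5Fb%5F%5Fc]

// Internal link: the page name is rewritten too, the side effect noted above.
echo escape_link_markup_sketch('[foo__bar__]', $orig, $repl), "\n";
// prints: [foo%5F%5Fbar%5F%5F]
```

Text outside square brackets is left alone, so ordinary bold and italic markup elsewhere on the page should be unaffected by these two patterns.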

     
  • Dan F
    2005-01-14

    Logged In: YES
    user_id=417594

    Joel,

    Thanks for your suggestion. I will not have a chance to try
    it out for a few weeks, probably, but I would like to try it.

    A couple of thoughts:

    It would be quite valuable to clarify exactly what pagenames
    are legal. Actually, we are trying to allow any characters
    in our pagenames. We import Amazon book titles, and they
    have spaces, :s, etc. etc. Thus, I hope Phpwiki ends up
    supporting any characters.

    The way this is normally done is with quoting schemes (e.g.,
    "foo~~_bar~~_" or "q(foo__bar__)"). This can be ugly, but
    makes things possible.

    In general, I've found Phpwiki's parsing to be fragile and
    ad hoc. Apologies to whoever wrote it; I'm sure it was a lot
    of work.
    Your suggestion, I'm afraid, looks a lot like a patch on top
    of a shaky parser. Please correct me if I'm wrong.

    Dan

     
  • Joel Uckelman
    2005-01-14

    patch for double underscores in link URLs (old markup)

     
    Attachments
  • Joel Uckelman
    2005-01-14

    Logged In: YES
    user_id=245140

    1. I'm attaching a patch which is a more robust solution to
      the problem than the one I suggested earlier. The affected
      part of lib/stdlib.php is identical between 1.3.9 and
      current CVS, so it should just work. If you can't get it to
      apply, email me.

    2. The underscore problem you're having shows up only with
      the old markup syntax. The link you give as an example works
      as is with the new markup syntax. In general, I'd suggest
      using the new markup syntax.

    3. You might want to ask on the talk list for clarification
      of what pagenames are legal. I'm not sure myself---I was
      able to get most punctuation to work in pagenames, but not
      all (e.g., colons don't work, but spaces and underscores
      do). Maybe open a new bug on that point, as well.