Re: [Python-markdown-discuss] Escape character and backtick bug.


Blake Winton wrote:
> Hey, what if the escape character turned the following character into 
> its hex-escape, as a pre-transformation?  Something along these lines:
> In [4]: def hexesc(m):
>    ...:     return "&#x%x;" % ord(m.group(1))
>    ...:
> 
> In [5]: re.sub( r"\\(.)", hexesc, "abc \\  def" )
> Out[5]: 'abc &#x20; def'
> 
> and then take that string, and run it through the patterns?

Well, I started in on this, and I think I've got something that's at
least proof-of-concept material...

In [2]: print markdown.markdown( r"\``\`abc\``\`d&ef&#x20;" ).strip()
<p>`<code>`abc`</code>`d&amp;ef&#x20;
</p>

I'm sure there are bugs (because it's software ;) and the duplication 
of the unescape method is a little ugly, but hopefully someone more in 
tune with the codebase can take the patch and make it prettier and more 
correct.  (One of the bugs I found before I sent this was that bare &s 
weren't getting translated into &amp;.  Fortunately, it was easy enough 
to fix.)

For those who are interested, here's the explanation of the patch, hunk 
by hunk:

----------------------
@@ -220,6 +220,10 @@
      attrRegExp = re.compile(r'\{@([^\}]*)=([^\}]*)}') # {@id=123}

      def __init__ (self, text) :
+        def hexunesc(m):
+            return "%c" % chr(int(m.group(0)[3:-1],16))
+        unescapeChars = r"&#x[0-9A-Fa-f]+;"
+        text = re.sub( unescapeChars, hexunesc, text )
          self.value = text

      def attributeCallback(self, match) :
----------------------

This is TextNode's init, where it is being passed escaped characters, 
and so it unescapes them.
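
For anyone who wants to poke at the unescaping step outside of TextNode, 
here's a minimal standalone sketch.  It's written in Python 3 syntax 
rather than what the patch targets, and the names unescape_hex and 
UNESCAPE_RE are made up for illustration; they aren't in the patch.

```python
import re

# Same shape as unescapeChars in the hunk, with a group around the hex digits
# so we don't have to slice the "&#x" and ";" off by hand.
UNESCAPE_RE = re.compile(r"&#x([0-9A-Fa-f]+);")


def unescape_hex(text):
    """Turn every &#xNN; hex escape back into the character it stands for."""
    return UNESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(unescape_hex("abc&#x20;def&#x60;"))  # abc def`
```

Text without any hex escapes passes through unchanged, so calling it on 
every TextNode should be harmless.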

----------------------
@@ -488,8 +492,8 @@

  Also note that all the regular expressions used by inline must
  capture the whole block.  For this reason, they all start with
-'^(.*)' and end with '(.*)!'.  In case with built-in expression
-Pattern takes care of adding the "^(.*)" and "(.*)!".
+'^(.*)' and end with '(.*)$'.  In case with built-in expression
+Pattern takes care of adding the "^(.*)" and "(.*)$".

  Finally, the order in which regular expressions are applied is very
  important - e.g. if we first replace http://.../ links with <a> tags
----------------------

Just some typo fixes I thought I'd throw in there.

----------------------
@@ -518,9 +522,8 @@
          + (NOBRACKET+ r'\])*'+NOBRACKET)*6
          + NOBRACKET + r')\]' )

-BACKTICK_RE = r'\`([^\`]*)\`'                    # `e= m*c^2`
-DOUBLE_BACKTICK_RE =  r'\`\`(.*)\`\`'            # ``e=f("`")``
-ESCAPE_RE = r'\\(.)'                             # \<
+BACKTICK_RE = r'`([^`]*)`'                       # `e= m*c^2`
+DOUBLE_BACKTICK_RE =  r'``(.*?)``'               # ``e=f("`")``
  EMPHASIS_RE = r'\*([^\*]*)\*'                    # *emphasis*
  STRONG_RE = r'\*\*(.*)\*\*'                      # **strong**
  STRONG_EM_RE = r'\*\*\*([^_]*)\*\*\*'            # ***strong***
----------------------

Since we are handling escapes at a different level, we don't need the 
regex for them anymore.  Also, ` isn't a special character in regexes, 
so we don't need to \-escape it.  Finally, I've made the double-backtick 
regular expression non-greedy (by adding the ? in the (.*?)), since I 
think that ``abc`` def ``ghi`` should probably be two code blocks.
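
To see the difference the ? makes, here's a quick comparison of the old 
and new double-backtick regexes on that same input (Python 3 syntax, 
using plain re.findall rather than the Pattern classes, so it's only an 
approximation of what markdown.py does with them):

```python
import re

text = "``abc`` def ``ghi``"

# Old, greedy version: (.*) swallows everything up to the last pair of backticks.
greedy = re.findall(r"``(.*)``", text)

# New, non-greedy version: (.*?) stops at the first closing pair.
lazy = re.findall(r"``(.*?)``", text)

print(greedy)  # ['abc`` def ``ghi']
print(lazy)    # ['abc', 'ghi']
```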

----------------------
@@ -540,16 +543,16 @@
  IMAGE_REFERENCE_RE = r'\!' + BRK + '\s*\[([^\]]*)\]' # ![alt text][2]
  NOT_STRONG_RE = r'( \* )'                        # stand-alone * or _
  AUTOLINK_RE = r'<(http://[^>]*)>'                # <http://www.123.com>
-AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>'               # <me...@ex...>
-#HTML_RE = r'(\<[^\>]*\>)'                        # <...>
+AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>'             # <me...@ex...>
+#HTML_RE = r'(\<[^\>]*\>)'                       # <...>
  HTML_RE = r'(\<[a-zA-Z/][^\>]*\>)'               # <...>
-ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)'                # &amp;
+ENTITY_RE = r'(&#x26;[\#a-zA-Z0-9]*;)'           # &amp;

  class Pattern:

      def __init__ (self, pattern) :
          self.pattern = pattern
-        self.compiled_re = re.compile("^(.*)%s(.*)$" % pattern, re.DOTALL)
+        self.compiled_re = re.compile("^(.*?)%s(.*?)$" % pattern, re.DOTALL)

      def getCompiledRegExp (self) :
          return self.compiled_re
----------------------

Fix the spacing on the comments for AUTOMAIL and #HTML.
Also, since we're escaping things by turning them into hex entities like 
&#x12F4;, we need to make sure that pattern can't be entered by the 
user, so we escape any &s into &#x26;, which means that entities entered 
by the user will be (in escaped form) &#x26;amp;
The hunk also makes the (.*) groups that Pattern wraps around each regex 
non-greedy.
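
As a rough illustration of why ENTITY_RE had to change, here's what 
user-entered text looks like once every & has been escaped, and what the 
new regex picks out of it.  This is Python 3 syntax, and 
escape_ampersands is a made-up helper that only does the & half of the 
escaping; the sample string is mine, too.

```python
import re

# The new ENTITY_RE from the hunk above.
ENTITY_RE = r'(&#x26;[\#a-zA-Z0-9]*;)'


def escape_ampersands(text):
    """Hypothetical helper: replace each & with its hex escape."""
    return text.replace("&", "&#x26;")


escaped = escape_ampersands("AT&T &copy; 2007")
print(escaped)                         # AT&#x26;T &#x26;copy; 2007
print(re.findall(ENTITY_RE, escaped))  # ['&#x26;copy;']
```

The bare & in AT&T isn't followed by a name and a semicolon, so only the 
real entity matches.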

----------------------
@@ -700,7 +703,6 @@
          el.setAttribute('href', mailto)
          return el

-ESCAPE_PATTERN          = SimpleTextPattern(ESCAPE_RE)
  NOT_STRONG_PATTERN      = SimpleTextPattern(NOT_STRONG_RE)

  BACKTICK_PATTERN        = BacktickPattern(BACKTICK_RE)
----------------------

We don't need the escape pattern, as mentioned above.

----------------------
@@ -767,6 +769,10 @@

             @param html: an html segment
             @returns : a placeholder string """
+        def hexunesc(m):
+            return "%c" % chr(int(m.group(0)[3:-1],16))
+        unescapeChars = r"&#x[0-9A-Fa-f]+;"
+        html = re.sub( unescapeChars, hexunesc, html )
          self.rawHtmlBlocks.append(html)
          placeholder = HTML_PLACEHOLDER % self.html_counter
          self.html_counter += 1
----------------------

So, in the HtmlStash, we need to unescape things before we stash them, 
because they won't be part of TextNodes (which is the other place that 
does the unescaping).

----------------------
@@ -946,8 +952,7 @@

          self.inlinePatterns = [ DOUBLE_BACKTICK_PATTERN,
                                  BACKTICK_PATTERN,
-	                        ESCAPE_PATTERN,
-	                        IMAGE_LINK_PATTERN,
+                                IMAGE_LINK_PATTERN,
                                  IMAGE_REFERENCE_PATTERN,
                                  REFERENCE_PATTERN,
                                  LINK_ANGLED_PATTERN,
----------------------

We don't need the escape pattern.  And fix the spacing on the IMAGE_LINK 
while we're here.

----------------------
@@ -1043,6 +1048,14 @@
          # Split into lines and run the preprocessors that will work with
          # self.lines

+        def hexesc(m):
+            if m.group(1):
+                return "&#x%x;" % ord(m.group(1))
+            else:
+                return "&#x%x;" % ord('&')
+
+        escapeChars = r"\\([\\`*_{}\[\]()#+.!-])|&"
+        text = re.sub( escapeChars, hexesc, text )
          self.lines = text.split("\n")

          # Run the pre-processors on the lines
----------------------

And finally, in _transform, before we split into lines, we should 
translate the various escape characters into their escaped form.
The regex limits the escape characters to the ones listed in 
http://daringfireball.net/projects/markdown/syntax#backslash specifically:
\   backslash
`   backtick
*   asterisk
_   underscore
{}  curly braces
[]  square brackets
()  parentheses
#   hash mark
+   plus sign
-   minus sign (hyphen)
.   dot
!   exclamation mark

and we also escape & (at the end of the regex), for reasons mentioned above.
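
If you want to try the whole round trip without applying the patch, 
here's a small self-contained sketch of the escape step plus the 
matching unescape.  It uses the same escapeChars regex as the hunk 
above, but it's Python 3 syntax, and the escape/unescape function names 
and the sample string are mine, not part of markdown.py.

```python
import re

# Same regex as escapeChars in the patch: a backslash followed by one of the
# characters Markdown lets you escape, or a bare ampersand.
ESCAPE_RE = re.compile(r"\\([\\`*_{}\[\]()#+.!-])|&")
UNESCAPE_RE = re.compile(r"&#x([0-9A-Fa-f]+);")


def escape(text):
    """Replace backslash-escaped characters and bare &s with hex escapes."""
    def hexesc(m):
        # group(1) is None when the match was the bare & alternative.
        ch = m.group(1) if m.group(1) else "&"
        return "&#x%x;" % ord(ch)
    return ESCAPE_RE.sub(hexesc, text)


def unescape(text):
    """Turn the hex escapes back into literal characters."""
    return UNESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


source = r"\*not emphasis\* & \`not code\`"
escaped = escape(source)
print(escaped)            # &#x2a;not emphasis&#x2a; &#x26; &#x60;not code&#x60;
print(unescape(escaped))  # *not emphasis* & `not code`
```

A backslash in front of anything not on the list (say, \a) is left 
alone, since the character class doesn't match it.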

Uh, thanks for reading this far.  I hope it all made sense.  Please let 
me know if you found it helpful, or if I was totally wasting my time.  ;)

Later,
Blake.