From: Blake W. <bw...@la...> - 2008-02-17 00:46:04
Blake Winton wrote:
> Hey, what if the escape character turned the following character into
> its hex-escape, as a pre-transformation?  Something along these lines:
> In [4]: def hexesc(m):
>    ...:     return "&#x%x;" % ord(m.group(1))
>    ...:
>
> In [5]: re.sub( r"\\(.)", hexesc, "abc \\ def" )
> Out[5]: 'abc &#x20;def'
>
> and then take that string, and run it through the patterns?

Well, I started in on this, and I think I've got something that's at
least proof-of-concept material...

In [2]: print markdown.markdown( r"\``\`abc\``\`d&ef " ).strip()
<p>`<code>`abc`</code>`d&ef  </p>

I'm sure there are bugs, (because it's software ;) and the duplication
of the unescape method is a little ugly, but hopefully someone more in
tune with the codebase can take the patch and make it pretty and more
correct.  (One of the bugs I found before I sent this was that bare &s
weren't getting translated into &amp;.  Fortunately, it was easy enough
to fix.)

For those who are interested, here's the explanation of the patch, hunk
by hunk:

----------------------
@@ -220,6 +220,10 @@
     attrRegExp = re.compile(r'\{@([^\}]*)=([^\}]*)}') # {@id=123}

     def __init__ (self, text) :
+        def hexunesc(m):
+            return "%c" % chr(int(m.group(0)[3:-1],16))
+        unescapeChars = r"&#x[0-9A-Fa-f]+;"
+        text = re.sub( unescapeChars, hexunesc, text )
         self.value = text

     def attributeCallback(self, match) :
----------------------

This is TextNode's init, where it is being passed escaped characters,
and so it unescapes them.

----------------------
@@ -488,8 +492,8 @@
 Also note that all the regular expressions used by inline must
 capture the whole block.  For this reason, they all start with
-'^(.*)' and end with '(.*)!'.  In case with built-in expression
-Pattern takes care of adding the "^(.*)" and "(.*)!".
+'^(.*)' and end with '(.*)$'.  In case with built-in expression
+Pattern takes care of adding the "^(.*)" and "(.*)$".

 Finally, the order in which regular expressions are applied is very
 important - e.g.
 if we first replace http://.../ links with <a> tags
----------------------

Just some typo fixes I thought I'd throw in there.

----------------------
@@ -518,9 +522,8 @@
         + (NOBRACKET+ r'\])*'+NOBRACKET)*6
         + NOBRACKET + r')\]' )

-BACKTICK_RE = r'\`([^\`]*)\`' # `e= m*c^2`
-DOUBLE_BACKTICK_RE = r'\`\`(.*)\`\`' # ``e=f("`")``
-ESCAPE_RE = r'\\(.)' # \<
+BACKTICK_RE = r'`([^`]*)`' # `e= m*c^2`
+DOUBLE_BACKTICK_RE = r'``(.*?)``' # ``e=f("`")``
 EMPHASIS_RE = r'\*([^\*]*)\*' # *emphasis*
 STRONG_RE = r'\*\*(.*)\*\*' # **strong**
 STRONG_EM_RE = r'\*\*\*([^_]*)\*\*\*' # ***strong***
----------------------

Since we are handling escapes at a different level, we don't need the
regex for them anymore.  Also, ` isn't a special character in regexes,
so we don't need to \-escape it.  Finally, I've made the double-backtick
regular expression non-greedy (by adding the ? in the (.*?)), since I
think that ``abc`` def ``ghi`` should probably be two code blocks.

----------------------
@@ -540,16 +543,16 @@
 IMAGE_REFERENCE_RE = r'\!' + BRK + '\s*\[([^\]]*)\]' # ![alt text][2]
 NOT_STRONG_RE = r'( \* )' # stand-alone * or _
 AUTOLINK_RE = r'<(http://[^>]*)>' # <http://www.123.com>
-AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>' # <me...@ex...>
-#HTML_RE = r'(\<[^\>]*\>)' # <...>
+AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>' # <me...@ex...>
+#HTML_RE = r'(\<[^\>]*\>)' # <...>
 HTML_RE = r'(\<[a-zA-Z/][^\>]*\>)' # <...>
-ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)' # &amp;
+ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)' # &amp;

 class Pattern:

     def __init__ (self, pattern) :
         self.pattern = pattern
-        self.compiled_re = re.compile("^(.*)%s(.*)$" % pattern, re.DOTALL)
+        self.compiled_re = re.compile("^(.*?)%s(.*?)$" % pattern, re.DOTALL)

     def getCompiledRegExp (self) :
         return self.compiled_re
----------------------

Fix the spacing on the comments for AUTOLINK and #HTML.
Also, since we're escaping things by turning them into hex escapes like
&#xF4;, we also need to make sure that pattern can't be entered by the
user, so we escape any &s into &#x26;, which means that entities entered
by the user will be (in escaped form) &#x26;amp;

----------------------
@@ -700,7 +703,6 @@
         el.setAttribute('href', mailto)
         return el

-ESCAPE_PATTERN = SimpleTextPattern(ESCAPE_RE)
 NOT_STRONG_PATTERN = SimpleTextPattern(NOT_STRONG_RE)

 BACKTICK_PATTERN = BacktickPattern(BACKTICK_RE)
----------------------

We don't need the escape pattern, as mentioned above.

----------------------
@@ -767,6 +769,10 @@
         @param html: an html segment
         @returns : a placeholder string
         """
+        def hexunesc(m):
+            return "%c" % chr(int(m.group(0)[3:-1],16))
+        unescapeChars = r"&#x[0-9A-Fa-f]+;"
+        html = re.sub( unescapeChars, hexunesc, html )
         self.rawHtmlBlocks.append(html)
         placeholder = HTML_PLACEHOLDER % self.html_counter
         self.html_counter += 1
----------------------

So, in the HtmlStash, we need to unescape things before we stash them,
because they won't be part of TextNodes (which is the other place that
does the unescaping).

----------------------
@@ -946,8 +952,7 @@
         self.inlinePatterns = [ DOUBLE_BACKTICK_PATTERN,
                                 BACKTICK_PATTERN,
-                                ESCAPE_PATTERN,
-                                IMAGE_LINK_PATTERN,
+                                IMAGE_LINK_PATTERN,
                                 IMAGE_REFERENCE_PATTERN,
                                 REFERENCE_PATTERN,
                                 LINK_ANGLED_PATTERN,
----------------------

We don't need the escape pattern.  And fix the spacing on the
IMAGE_LINK while we're here.

----------------------
@@ -1043,6 +1048,14 @@
         # Split into lines and run the preprocessors that will work with
         # self.lines

+        def hexesc(m):
+            if m.group(1):
+                return "&#x%x;" % ord(m.group(1))
+            else:
+                return "&#x%x;" % ord('&')
+
+        escapeChars = r"\\([\\`*_{}\[\]()#+.!-])|&"
+        text = re.sub( escapeChars, hexesc, text )
         self.lines = text.split("\n")

         # Run the pre-processors on the lines
----------------------

And finally, in _transform, before we split into lines, we should
translate the various escape characters into their escaped form.
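To see the escape step and the two unescape steps working together
outside of markdown.py, here is a rough standalone sketch.  It is
written for Python 3 (the patch itself targets Python 2), and the names
hex_escape and hex_unescape are made up for this illustration; they are
not part of the patch or of the markdown module.  The two regexes are
the ones from the hunks above.

```python
import re

# Same patterns as escapeChars / unescapeChars in the patch.
ESCAPE_CHARS = re.compile(r"\\([\\`*_{}\[\]()#+.!-])|&")
UNESCAPE_CHARS = re.compile(r"&#x[0-9A-Fa-f]+;")


def hex_escape(text):
    """Hypothetical helper: turn backslash-escapes and bare & into &#xNN;."""
    def repl(m):
        # group(1) is the character after the backslash; it is None
        # when the match was a bare "&".
        ch = m.group(1) if m.group(1) else "&"
        return "&#x%x;" % ord(ch)
    return ESCAPE_CHARS.sub(repl, text)


def hex_unescape(text):
    """Hypothetical helper: turn every &#xNN; back into its character."""
    def repl(m):
        # Strip the leading "&#x" and the trailing ";" and parse as hex.
        return chr(int(m.group(0)[3:-1], 16))
    return UNESCAPE_CHARS.sub(repl, text)


escaped = hex_escape(r"\*not emphasis\* & <b>")
print(escaped)                # &#x2a;not emphasis&#x2a; &#x26; <b>
print(hex_unescape(escaped))  # *not emphasis* & <b>
```

Because every & is escaped as well, a hex escape typed by the user (say
a literal &#x2a;) first becomes &#x26;#x2a; and so comes back out
unchanged after unescaping, instead of being mistaken for one of ours.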
The regex limits the escape characters to the ones listed in
http://daringfireball.net/projects/markdown/syntax#backslash
specifically:

\   backslash
`   backtick
*   asterisk
_   underscore
{}  curly braces
[]  square brackets
()  parentheses
#   hash mark
+   plus sign
-   minus sign (hyphen)
.   dot
!   exclamation mark

and we also escape & (at the end of the regex), for reasons mentioned
above.

Uh, thanks for reading this far.  I hope it all made sense.  Please let
me know if you found it helpful, or if I was totally wasting my time.  ;)

Later,
Blake.