From: Blake W. <bw...@la...> - 2008-02-17 00:46:04
Blake Winton wrote:
> Hey, what if the escape character turned the following character into
> its hex-escape, as a pre-transformation? Something along these lines:
> In [4]: def hexesc(m):
> ...: return "&#x%x;" % ord(m.group(1))
> ...:
>
> In [5]: re.sub( r"\\(.)", hexesc, "abc \\ def" )
> Out[5]: 'abc &#x20;def'
>
> and then take that string, and run it through the patterns?
Well, I started in on this, and I think I've got something that's at
least proof-of-concept material...
In [2]: print markdown.markdown( r"\``\`abc\``\`d&ef " ).strip()
<p>`<code>`abc`</code>`d&ef 
</p>
I'm sure there are bugs (because it's software ;), and the duplication
of the unescape method is a little ugly, but hopefully someone more in
tune with the codebase can take the patch and make it prettier and more
correct. (One of the bugs I found before I sent this was that bare &s
weren't getting translated into &amp;. Fortunately, it was easy enough
to fix.)
For those who are interested, here's the explanation of the patch, hunk
by hunk:
----------------------
@@ -220,6 +220,10 @@
attrRegExp = re.compile(r'\{@([^\}]*)=([^\}]*)}') # {@id=123}
def __init__ (self, text) :
+ def hexunesc(m):
+ return "%c" % chr(int(m.group(0)[3:-1],16))
+ unescapeChars = r"&#x[0-9A-Fa-f]+;"
+ text = re.sub( unescapeChars, hexunesc, text )
self.value = text
def attributeCallback(self, match) :
----------------------
This is TextNode's init, where it is being passed escaped characters,
and so it unescapes them.
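A minimal standalone sketch of that unescaping step, written in Python 3 syntax and outside the patch itself. The wrapper name `unescape` is made up for illustration and does not exist in markdown.py; the pattern is the same as unescapeChars above, with a capture group added so the hex digits can be read directly.

```python
import re

# Same shape as unescapeChars in the hunk above: "&#x", hex digits, ";".
UNESCAPE_RE = re.compile(r"&#x([0-9A-Fa-f]+);")

def unescape(text):
    """Turn every &#xNN; hex escape back into the character it stands for."""
    return UNESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(unescape("&#x60;abc&#x60; and &#x26;amp;"))  # `abc` and &amp;
```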
----------------------
@@ -488,8 +492,8 @@
Also note that all the regular expressions used by inline must
capture the whole block. For this reason, they all start with
-'^(.*)' and end with '(.*)!'. In case with built-in expression
-Pattern takes care of adding the "^(.*)" and "(.*)!".
+'^(.*)' and end with '(.*)$'. In case with built-in expression
+Pattern takes care of adding the "^(.*)" and "(.*)$".
Finally, the order in which regular expressions are applied is very
important - e.g. if we first replace http://.../ links with <a> tags
----------------------
Just some typo fixes I thought I'd throw in there.
----------------------
@@ -518,9 +522,8 @@
+ (NOBRACKET+ r'\])*'+NOBRACKET)*6
+ NOBRACKET + r')\]' )
-BACKTICK_RE = r'\`([^\`]*)\`' # `e= m*c^2`
-DOUBLE_BACKTICK_RE = r'\`\`(.*)\`\`' # ``e=f("`")``
-ESCAPE_RE = r'\\(.)' # \<
+BACKTICK_RE = r'`([^`]*)`' # `e= m*c^2`
+DOUBLE_BACKTICK_RE = r'``(.*?)``' # ``e=f("`")``
EMPHASIS_RE = r'\*([^\*]*)\*' # *emphasis*
STRONG_RE = r'\*\*(.*)\*\*' # **strong**
STRONG_EM_RE = r'\*\*\*([^_]*)\*\*\*' # ***strong***
----------------------
Since we are handling escapes at a different level, we don't need the
regex for them anymore. Also, ` isn't a special character in regexes,
so we don't need to \-escape it. Finally, I've made the double-backtick
regular expression non-greedy (by adding the ? in the (.*?)), since I
think that ``abc`` def ``ghi`` should probably be two code blocks.
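To see the difference the ? makes, here is a small sketch that runs the old and new double-backtick regexes on their own, in Python 3. It uses the bare patterns only, without the "^(.*?)" and "(.*?)$" that Pattern wraps around them; the OLD_/NEW_ names are just labels for this example.

```python
import re

OLD_DOUBLE_BACKTICK_RE = r'\`\`(.*)\`\`'   # greedy, before the patch
NEW_DOUBLE_BACKTICK_RE = r'``(.*?)``'      # non-greedy, after the patch

text = "``abc`` def ``ghi``"

greedy = re.findall(OLD_DOUBLE_BACKTICK_RE, text)
lazy = re.findall(NEW_DOUBLE_BACKTICK_RE, text)

print(greedy)  # ['abc`` def ``ghi']  (one span swallowing everything)
print(lazy)    # ['abc', 'ghi']       (two separate spans)
```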
----------------------
@@ -540,16 +543,16 @@
IMAGE_REFERENCE_RE = r'\!' + BRK + '\s*\[([^\]]*)\]' # ![alt text][2]
NOT_STRONG_RE = r'( \* )' # stand-alone * or _
AUTOLINK_RE = r'<(http://[^>]*)>' # <http://www.123.com>
-AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>' # <me...@ex...>
-#HTML_RE = r'(\<[^\>]*\>)' # <...>
+AUTOMAIL_RE = r'<([^> \!]*@[^> ]*)>' # <me...@ex...>
+#HTML_RE = r'(\<[^\>]*\>)' # <...>
HTML_RE = r'(\<[a-zA-Z/][^\>]*\>)' # <...>
-ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)' # &
+ENTITY_RE = r'(&[\#a-zA-Z0-9]*;)' # &
class Pattern:
def __init__ (self, pattern) :
self.pattern = pattern
- self.compiled_re = re.compile("^(.*)%s(.*)$" % pattern, re.DOTALL)
+ self.compiled_re = re.compile("^(.*?)%s(.*?)$" % pattern, re.DOTALL)
def getCompiledRegExp (self) :
return self.compiled_re
----------------------
Fix the spacing on the comments for AUTOMAIL and #HTML.
Also, since we're escaping things by turning them into &#xF4;, we also
need to make sure that pattern can't be entered by the user, so we
escape any &s into &#x26;, which means that entities entered by the user
will be (in escaped form) &#x26;amp;
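A rough sketch of why the & has to be escaped too, in Python 3 and outside markdown.py. It round-trips some text that contains a hex escape the user typed literally, once with the escapeChars regex from the last hunk of the patch and once with a hypothetical variant (WITHOUT_AMP, invented for this example) that leaves & alone. The round_trip helper is also made up here; in the real patch the inline patterns run between the two steps.

```python
import re

def hexesc(m):
    # Group 1 is the backslash-escaped character; it is None when the
    # alternative that matched was the bare "&".
    return "&#x%x;" % ord(m.group(1) or "&")

def hexunesc(m):
    return chr(int(m.group(1), 16))

UNESCAPE_RE = r"&#x([0-9A-Fa-f]+);"
WITH_AMP = r"\\([\\`*_{}\[\]()#+.!-])|&"     # escapeChars from the patch
WITHOUT_AMP = r"\\([\\`*_{}\[\]()#+.!-])"    # hypothetical: "&" left alone

def round_trip(escape_re, text):
    """Escape, then immediately unescape (no patterns run in between)."""
    return re.sub(UNESCAPE_RE, hexunesc, re.sub(escape_re, hexesc, text))

user_text = r"literal &#x2a; and escaped \*"
print(round_trip(WITHOUT_AMP, user_text))  # literal * and escaped *
print(round_trip(WITH_AMP, user_text))     # literal &#x2a; and escaped *
```

Without the & escape, the text the user typed gets rewritten by the unescape step; with it, only the backslash escape is affected.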
----------------------
@@ -700,7 +703,6 @@
el.setAttribute('href', mailto)
return el
-ESCAPE_PATTERN = SimpleTextPattern(ESCAPE_RE)
NOT_STRONG_PATTERN = SimpleTextPattern(NOT_STRONG_RE)
BACKTICK_PATTERN = BacktickPattern(BACKTICK_RE)
----------------------
We don't need the escape pattern, as mentioned above.
----------------------
@@ -767,6 +769,10 @@
@param html: an html segment
@returns : a placeholder string """
+ def hexunesc(m):
+ return "%c" % chr(int(m.group(0)[3:-1],16))
+ unescapeChars = r"&#x[0-9A-Fa-f]+;"
+ html = re.sub( unescapeChars, hexunesc, html )
self.rawHtmlBlocks.append(html)
placeholder = HTML_PLACEHOLDER % self.html_counter
self.html_counter += 1
----------------------
So, in the HtmlStash, we need to unescape things before we stash them,
because they won't be part of TextNodes (which is the other place that
does the unescaping).
----------------------
@@ -946,8 +952,7 @@
self.inlinePatterns = [ DOUBLE_BACKTICK_PATTERN,
BACKTICK_PATTERN,
- ESCAPE_PATTERN,
- IMAGE_LINK_PATTERN,
+ IMAGE_LINK_PATTERN,
IMAGE_REFERENCE_PATTERN,
REFERENCE_PATTERN,
LINK_ANGLED_PATTERN,
----------------------
We don't need the escape pattern. And fix the spacing on the IMAGE_LINK
while we're here.
----------------------
@@ -1043,6 +1048,14 @@
# Split into lines and run the preprocessors that will work with
# self.lines
+ def hexesc(m):
+ if m.group(1):
+ return "&#x%x;" % ord(m.group(1))
+ else:
+ return "&#x%x;" % ord('&')
+
+ escapeChars = r"\\([\\`*_{}\[\]()#+.!-])|&"
+ text = re.sub( escapeChars, hexesc, text )
self.lines = text.split("\n")
# Run the pre-processors on the lines
----------------------
And finally, in _transform, before we split into lines, we should
translate the various escape characters into their escaped form.
The regex limits the escape characters to the ones listed in
http://daringfireball.net/projects/markdown/syntax#backslash specifically:
\ backslash
` backtick
* asterisk
_ underscore
{} curly braces
[] square brackets
() parentheses
# hash mark
+ plus sign
- minus sign (hyphen)
. dot
! exclamation mark
and we also escape & (at the end of the regex), for reasons mentioned above.
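For anyone who wants to poke at the escaping step by itself, here is a small sketch in Python 3 using the same regex and hexesc as the hunk above. The escape wrapper is a name invented for this example; in the patch the re.sub call sits directly in _transform.

```python
import re

# escapeChars and hexesc as in the hunk above.
ESCAPE_RE = r"\\([\\`*_{}\[\]()#+.!-])|&"

def hexesc(m):
    if m.group(1):
        return "&#x%x;" % ord(m.group(1))
    return "&#x%x;" % ord("&")

def escape(text):
    """Replace backslash escapes and bare ampersands with hex escapes."""
    return re.sub(ESCAPE_RE, hexesc, text)

print(escape(r"\*not emphasis\*"))  # &#x2a;not emphasis&#x2a;
print(escape(r"a \q b"))            # a \q b   (q is not in the list)
print(escape("R&D"))                # R&#x26;D
```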
Uh, thanks for reading this far. I hope it all made sense. Please let
me know if you found it helpful, or if I was totally wasting my time. ;)
Later,
Blake.