Re: [Python-markdown-discuss] Markdown escape function?

Hi Waylan,

thanks for your continued input.

Some more context from my side might help here:

I take the 'contact' (an arbitrary untrusted string) from a backend:
https://onionoo.torproject.org/details?fields=contact
and produce Markdown:
https://raw.githubusercontent.com/nusenu/OrNetStats/master/maincwfamilies.md
which Jekyll then uses to produce HTML.
Final output (currently some characters are stripped, so it does not
exactly match the input; the goal is a displayed string that matches
the input exactly):
https://nusenu.github.io/OrNetStats/maincwfamilies

> Something to keep in mind is that there is no such thing as invalid
> Markdown.

I hope I didn't say something that would contradict that.

> And then there is the many years that users have been using Markdown
> (over a decade). There is a certain expectation regarding behavior
> that exists today. Most everyone knows and expects that `**foo**`
> will result in bold text. And it is not surprising if some service
> disallows that, but then the expectation is that you will just get
> `foo` back. Getting back `**foo**` would be surprising.
[..]
>
> It is with these long-standing expectations of users in mind that us
> long-time users of Markdown say that the only way to sanitize
> Markdown is by sanitizing the HTML output.

In my case the data source does not have any Markdown expectations (the
source does not even know there is Markdown in the middle or what
Markdown is). The source expects literal output, so

obfuscated**dot**emailaddress**dot**tld
should not become:
obfuscateddotemailaddressdottld

In these examples I used "*", but this is not limited to that
character: it applies to all metachars plus the pipe sign (since I use
the table extension).

> As a practical matter, to sanitize the Markdown text before passing
> it to the parser would require writing another parser, just one that
> removes/escapes the disallowed markup. If that is what you really
> want

Yes :)

> then just use Python-Markdown extension API to remove the
> “strongPattern” from the parser. 

Is it possible to use your API to do Markdown escaping? (the initial
question)

input -> output examples:

**foo** -> \*\*foo\*\*
dot*foo -> dot*foo (no backslash)
1. 	-> 1\.
example.com -> example.com (no backslash)
...(and all other meta chars)

> If that is the behavior
> you really want, that is the easiest way to get it. But I expect your
> users will very much dislike it.

As stated above - no worries here - since the data source does not know
about Markdown - and therefore has no expectations.

Below you will find my conversation with python-help. It is relevant
here and it is the reason I'm asking again: I need a Markdown-aware
escape function, not a simple search/replace.
------------

Matt (python-help) wrote:
> However, I also think that having an escape-the-markup
> function in a markup library makes perfect sense.

>> (simply replacing all metachars with \metachar does not work)
>
> But if you can be more specific about how that doesn't work, we may
> be able to help suggest something.

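I wrote a very simple function that replaces all Markdown metachars to
test that approach:

def escape_md(text):
    # Blindly backslash-escape every metachar. The backslash itself goes
    # first so the backslashes added below are not escaped a second time.
    text = text.replace('\\', '\\\\')
    text = text.replace('`', '\\`')
    text = text.replace('*', '\\*')
    ...
    return text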

The problem with this approach is that it does not care about the context.

So this works fine for
"**foo**" -> HTML: **foo**
but it does not work here:
"d*ot" -> HTML: d\*ot

The same happens with other chars: .([`-
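
To make the context problem concrete, here is a minimal sketch that runs
the blanket replacement on both inputs. It only covers the three
characters shown above, and the name escape_all is just for illustration:

```python
def escape_all(text):
    # Backslash first, so the backslashes added later are not doubled.
    for ch in ('\\', '`', '*'):
        text = text.replace(ch, '\\' + ch)
    return text

print(escape_all('**foo**'))  # \*\*foo\*\*
print(escape_all('d*ot'))     # d\*ot
```

The replacement cannot tell that the single "*" in the second string
has no partner to form emphasis with, so it adds a backslash there too.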

So in the end the escape function has to be Markdown-aware, and at that
point I guess it is no longer a simple search-and-replace thing and
should be part of a Markdown library (which is already aware of the
syntax anyway).
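
A rough sketch of what such a context-aware function could look like,
using only the standard re module. This is not a Python-Markdown API.
The name escape_md_aware and its rules are assumptions chosen to match
the input -> output examples above: escape "*" and "_" only when the
character occurs at least twice, and escape list markers only at the
start of a line. It covers only a small subset of the syntax:

```python
import re

# Characters escaped everywhere: backslash, backtick, and the pipe
# (the pipe matters because of the table extension).
ALWAYS = re.compile(r'([\\`|])')
# "1." only starts an ordered list at the start of a line, before whitespace.
ORDERED = re.compile(r'^([ \t]*\d+)\.(?=\s)', re.MULTILINE)
# "*", "+" or "-" only start a bullet list in the same position.
BULLET = re.compile(r'^([ \t]*)([*+-])(?=\s)', re.MULTILINE)


def escape_md_aware(text):
    """Hypothetical helper: escape only where markup could be triggered."""
    text = ALWAYS.sub(r'\\\1', text)
    # Assumption: a delimiter that occurs only once cannot form emphasis.
    for ch in '*_':
        if text.count(ch) >= 2:
            text = text.replace(ch, '\\' + ch)
    # List markers are handled last. Markers escaped by the loop above now
    # start with a backslash and no longer match, so nothing is escaped twice.
    text = ORDERED.sub(r'\1\\.', text)
    text = BULLET.sub(r'\1\\\2', text)
    return text


for sample in ('**foo**', 'dot*foo', '1. first', 'example.com'):
    print(sample, '->', escape_md_aware(sample))
```

The occurrence count is only a heuristic. Deciding properly whether a
character is "live" needs the parser's own rules, which is why this
seems to belong in the library.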