Re: [Python-markdown-discuss] Markdown escape function?

Something to keep in mind is that there is no such thing as invalid Markdown. A Markdown parser must be able to take any text input and not raise an error. At worst, the output would be meaningless, but no error should ever be raised. I once received a bug report from someone who was using the Complete Works of Shakespeare (from Project Gutenberg, which uses its own custom plain text format) as input and complained that it caused the parser to crash with a maximum recursion depth error. My knee-jerk reaction was to protest that it wasn’t Markdown, so of course it didn’t work. But he was right: the parser should never raise such an error (the internals have since been completely refactored).

If you go back and look at the reasoning and motivations of the creator of Markdown, he was looking to create a format which could accept any plain text email (even one from years earlier), pass it into Markdown, and get reasonably readable output out the other side. Sure, some of the details would be different than the author intended, but it should still be human readable. Like I said, there is no such thing as invalid Markdown, just poorly formatted Markdown.

And then there are the many years that users have been using Markdown (over a decade). There is a certain expectation regarding behavior that exists today. Most everyone knows and expects that `**foo**` will result in bold text. It is not surprising if some service disallows that, but then the expectation is that you will just get `foo` back. Getting back `**foo**` would be surprising. Personally, I would assume that something other than a Markdown parser is being used, perhaps one of the other competing lightweight markup languages.

It is with these long-standing expectations of users in mind that we long-time users of Markdown say that the only way to sanitize Markdown is by sanitizing the HTML output.
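To make "sanitizing the HTML output" concrete, here is a minimal sketch of an allowlist filter applied to the HTML string a Markdown parser returns. It uses only the Python standard library so that it is self-contained; the names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `SAFE_SCHEMES`, `AllowlistSanitizer`, and `sanitize_html` are made up for this illustration, and the allowlist itself is an arbitrary assumption. It is not a vetted sanitizer; a maintained library such as Bleach (mentioned in the quoted message below) is the kind of tool to use for real input.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration only; a real policy needs more care.
ALLOWED_TAGS = {"p", "em", "strong", "code", "pre", "ul", "ol", "li", "blockquote", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class AllowlistSanitizer(HTMLParser):
    """Re-emit only allowlisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; any text inside is kept as plain text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # e.g. onclick, style
            value = value or ""
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always re-escaped, so it can never turn back into markup.
        self.out.append(escape(data, quote=False))


def sanitize_html(html):
    """Return html with everything outside the allowlist removed or escaped."""
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

In use, the filter would sit after the parser, for example `sanitize_html(html_from_markdown)`, where `html_from_markdown` stands for whatever string the Markdown parser produced. The Markdown source itself is never touched, so `**foo**` still renders as bold text.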

As a practical matter, sanitizing the Markdown text before passing it to the parser would require writing another parser, just one that removes or escapes the disallowed markup. If that is what you really want, then just use the Python-Markdown extension API to remove the “strongPattern” from the parser. Now the parser won’t parse and convert `**foo**` but will leave it intact. If that is the behavior you really want, that is the easiest way to get it. But I expect your users will very much dislike it.
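A rough sketch of such an extension follows. It assumes Python-Markdown 3.2 or later, where (as far as I know) asterisk and underscore emphasis are registered as the inline processors `em_strong` and `em_strong2`; older releases use different names and a different registration API, so check the inline pattern names for your version. The class name `NoEmphasisExtension` is made up for this example. Note that removing these processors disables `*foo*` and `_foo_` as well as `**foo**`.

```python
import markdown
from markdown.extensions import Extension


class NoEmphasisExtension(Extension):
    """Hypothetical extension: stop the parser from handling emphasis markup."""

    def extendMarkdown(self, md):
        # Names assumed from Python-Markdown 3.2+; they differ in older versions.
        md.inlinePatterns.deregister('em_strong')   # *foo*, **foo**, ***foo***
        md.inlinePatterns.deregister('em_strong2')  # _foo_, __foo__, ___foo___


html = markdown.markdown('**foo**', extensions=[NoEmphasisExtension()])
print(html)
```

With the processors removed, the asterisks pass through as literal text inside the paragraph instead of becoming a `<strong>` element.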

Waylan

On May 24, 2017, 1:40 PM -0400, nusenu <nus...@ri...>, wrote:
> > If you need to filter input from an untrusted source, then you
> > should not filter the Markdown input, but the HTML output instead.
> > For a detailed explanation of why, see this article:
> > https://michelf.ca/blog/2010/markdown-and-xss/
>
> Thank you for the pointer, but I disagree with the main conclusion that
> there is "no other choice":
>
> > So the conclusion is that, if you want real security, you need to
> > filter Markdown’s output, not the input. **There’s no other choice.**
>
> but I would _like_ to be proven wrong so I can improve [1] (maybe with
> an example XSS payload that bypasses [1]).
>
> Why do I disagree?
> The blog post shows an example with a (poorly written) "XSS filter".
>
> The problem with "filter the HTML output not the Markdown input" is:
> I'm not in the position to choose. I have to provide Markdown output not
> HTML.
> Also: There must be a reason for Markdown to provide escape
> possibilities. [0]
>
> I claim that it is possible to write a filter that makes untrusted
> input, to be used in Markdown output, XSS-safe. The question is
> - Is there a known implementation?
> - If not: How invasive does such a filter have to be?
>
> In my current approach [1] I simply consider whitelisted characters only
> (the rest gets discarded) but I'm unhappy with that - because it is
> probably **not safe** and the displayed string is no longer the one
> provided by the untrusted source - so I'm looking for something better.
>
> "better" is: output string **looks** (after Markdown got converted to
> HTML) exactly like input string _and_ is XSS-safe
>
> > In Python I recommend Bleach with this whitelist as a good starting
> > place: https://github.com/yourcelf/bleach-whitelist
> > https://github.com/mozilla/bleach
>
> Yes, I saw your recommendation when reading your documentation [2].
> Bleach is for HTML, I need something for Markdown.
>
> thanks for your help,
> nusenu
>
>
>
> [0] https://daringfireball.net/projects/markdown/syntax#backslash
> [2] https://pythonhosted.org/Markdown/reference.html#safe_mode
> [1]
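> import html
> import re
>
> def strip_md(input):
>     # html.escape(..., quote=False) matches the old cgi.escape default
>     # (cgi.escape was removed in Python 3.8)
>     input = html.escape(input, quote=False)
>     # replace brackets, braces, parentheses and underscores with spaces
>     input = re.sub(r'[\{\}\[\]()_]', ' ', input)
>     # "." and "-" are Markdown metachars!
>     whitelist = r'[0-9a-zA-Z"$%&/\',\.:;=?@\^\- ]'
>     # keep whitelisted characters only; everything else is discarded
>     input = "".join(re.findall(whitelist, input))
>     input = input.strip()
>     # collapse runs of whitespace into a single space
>     input = re.sub(r'\s+', ' ', input)
>     return input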