From: nusenu <nus...@ri...> - 2017-05-22 09:22:37
Attachments:
signature.asc
|
Hi,

as far as I have seen Python-Markdown does not offer an escape function for input that should not be interpreted as Markdown.
Example: **foo** should be displayed as **foo** not <strong>foo...

Do you know of a library that escapes input for use in Markdown?
**foo** -> \*\*foo\*\*
and all the other characters:
https://daringfireball.net/projects/markdown/syntax#backslash
+ the pipe character "|" (tables).

thanks,
nusenu
|
From: <way...@ic...> - 2017-05-22 12:19:13
|
The backslash escape as documented on daringfireball is fully supported by default. Is there a specific input that is not working for you? Thanks, Waylan Limberg |
From: nusenu <nus...@ri...> - 2017-05-22 12:44:16
Attachments:
signature.asc
|
way...@ic...:
> The backslash escape as documented on daringfireball is fully supported by default. Is there a specific input that is not working for you?

Maybe my question was not clear.

I'll try again:

    print escape_md('**foo**')
    \*\*foo\*\*

I'm looking for _that_ "escape_md()" function.

Is there one in Python-Markdown? Do you know one elsewhere?

thanks,
nusenu
|
From: <way...@ic...> - 2017-05-22 16:20:51
|
No, I am not aware of any such function. I've never seen one (that I recall) and never had a need or request for one either. There is certainly no such function in Python-Markdown. Thanks, Waylan Limberg |
From: nusenu <nus...@ri...> - 2017-05-24 11:59:12
Attachments:
signature.asc
|
way...@ic...:
> No, I am not aware of any such function. I've never seen one (that I
> recall) and never had a need or request for one either.

Is no one using untrusted input in their markdown files?
|
From: <way...@ic...> - 2017-05-24 12:25:11
|
If you need to filter input from an untrusted source, then you should not filter the Markdown input, but the HTML output instead. For a detailed explanation of why, see this article:
https://michelf.ca/blog/2010/markdown-and-xss/

In Python I recommend Bleach with this whitelist as a good starting place:
https://github.com/yourcelf/bleach-whitelist
https://github.com/mozilla/bleach

Waylan
|
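To make the "filter the HTML output, not the Markdown input" advice concrete, here is a minimal sketch of a tag whitelist built only on the standard library's `html.parser`. It is a toy stand-in for what a library such as Bleach does far more thoroughly, and should not be used in its place. The names `ALLOWED_TAGS`, `WhitelistSanitizer` and `sanitize_html` are made up for this illustration, and dropping every attribute is a simplifying assumption.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist for illustration; a real one would be chosen per site.
ALLOWED_TAGS = {"p", "em", "strong", "code", "ul", "ol", "li"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags; everything else is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # All attributes are discarded (no href, no onerror, ...).
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is re-escaped, including text found inside dropped tags.
        self.out.append(escape(data, quote=False))


def sanitize_html(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The sanitizer runs on whatever HTML the Markdown converter produced, so it does not need to know anything about Markdown syntax; that is the point of filtering the output side.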
From: nusenu <nus...@ri...> - 2017-05-24 17:40:17
Attachments:
signature.asc
|
> If you need to filter input from an untrusted source, then you
> should not filter the Markdown input, but the HTML output instead.
> For a detailed explanation of why, see this article:
> https://michelf.ca/blog/2010/markdown-and-xss/

Thank you for the pointer, but I disagree with the main conclusion that there is "no other choice":

> So the conclusion is that, if you want real security, you need to
> filter Markdown’s output, not the input. **There’s no other choice.**

but I would _like_ to be proven wrong so I can improve [1] (maybe with an example XSS payload that bypasses [1]).

Why do I disagree?
The blog post shows an example with a (poorly written) "XSS filter".

The problem with "filter the HTML output not the Markdown input" is:
I'm not in the position to choose. I have to provide Markdown output, not HTML.
Also: There must be a reason for Markdown to provide escape possibilities. [0]

I claim that it is possible to write a filter that makes untrusted input, to be used in Markdown output, XSS-safe. The question is
- Is there a known implementation?
- If not: How invasive does such a filter have to be?

In my current approach [1] I simply consider whitelisted characters only (the rest gets discarded) but I'm unhappy with that - because it is probably **not safe** and the displayed string is no longer the one provided by the untrusted source - so I'm looking for something better.

"better" is: output string **looks** (after Markdown got converted to HTML) exactly like input string _and_ is XSS-safe

> In Python I recommend Bleach with this whitelist as a good starting
> place: https://github.com/yourcelf/bleach-whitelist
> https://github.com/mozilla/bleach

Yes, I saw your recommendation when reading your documentation [2].
Bleach is for HTML, I need something for Markdown.

thanks for your help,
nusenu

[0] https://daringfireball.net/projects/markdown/syntax#backslash
[2] https://pythonhosted.org/Markdown/reference.html#safe_mode
[1]
    import html
    import re

    def strip_md(input):
        # cgi.escape was removed in Python 3.8; html.escape(quote=False)
        # matches its old default behavior
        input = html.escape(input, quote=False)
        # replace brackets, braces, parentheses and underscores with spaces
        input = re.sub(r'[\{\}\[\]()_]', ' ', input)
        # "." and "-" are Markdown metachars!
        whitelist = r'[0-9a-zA-Z"$%&/\',\.:;=?@\^\- ]'
        input = "".join(re.findall(whitelist, input))
        input = input.strip()
        input = re.sub(r'\s+', ' ', input)
        return input
|
From: nusenu <nus...@ri...> - 2017-05-24 18:17:57
Attachments:
signature.asc
|
nusenu: > - Is there a known implementation? Since you replied to this one already I reached out to the guys at python-help. thanks, nusenu |
From: <way...@ic...> - 2017-05-25 12:31:04
|
Something to keep in mind is that there is no such thing as invalid Markdown. A Markdown parser must be able to take any text input and not raise an error. At worst, the output would be meaningless, but no error should ever be raised. I once received a bug report from someone who was using the Complete Works of Shakespeare (from the Gutenberg project, which uses its own custom plain text format) as input and complained that it caused the parser to crash with a maximum recursion depth error. My knee-jerk reaction was to protest that that wasn’t Markdown so of course it didn’t work. But he was right: the parser should never raise such an error (the internals have since been completely refactored).

If you go back and look at the reasoning and motivations of the creator of Markdown, he was looking to create a format which could accept any plain text email (even from years earlier), pass it into Markdown, and get a reasonably readable output out the other side. Sure, some of the details would be different than the author intended, but it should still be human readable. Like I said, there is no such thing as invalid Markdown, just poorly formatted Markdown.

And then there are the many years that users have been using Markdown (over a decade). There is a certain expectation regarding behavior that exists today. Most everyone knows and expects that `**foo**` will result in bold text. And it is not surprising if some service disallows that, but then the expectation is that you will just get `foo` back. Getting back `**foo**` would be surprising. Personally, I would assume that something other than a Markdown parser is being used. Perhaps one of the other competing lightweight markup languages.

It is with these long-standing expectations of users in mind that we long-time users of Markdown say that the only way to sanitize Markdown is by sanitizing the HTML output.

As a practical matter, to sanitize the Markdown text before passing it to the parser would require writing another parser, just one that removes/escapes the disallowed markup. If that is what you really want, then just use the Python-Markdown extension API to remove the “strongPattern” from the parser. Now the parser won’t parse and convert `**foo**` but will leave it intact. If that is the behavior you really want, that is the easiest way to get it. But I expect your users will very much dislike it.

Waylan
|
From: nusenu <nus...@ri...> - 2017-05-25 16:49:37
Attachments:
signature.asc
|
Hi Waylan,

thanks for your continued input. Some more context from my side might help here:

I take the 'contact' (an arbitrary untrusted string) from a backend:
https://onionoo.torproject.org/details?fields=contact

and produce Markdown:
https://raw.githubusercontent.com/nusenu/OrNetStats/master/maincwfamilies.md

which Jekyll uses to produce HTML based on these files.

Final output (the current output shows stripped output and does not exactly match the input, but the goal would be a displayed string that matches):
https://nusenu.github.io/OrNetStats/maincwfamilies

> Something to keep in mind is that there is no such thing as invalid
> Markdown.

I hope I didn't say something that would contradict that.

> And then there are the many years that users have been using Markdown
> (over a decade). There is a certain expectation regarding behavior
> that exists today. Most everyone knows and expects that `**foo**`
> will result in bold text. And it is not surprising if some service
> disallows that, but then the expectation is that you will just get
> `foo` back. Getting back `**foo**` would be surprising. [..]
>
> It is with these long-standing expectations of users in mind that we
> long-time users of Markdown say that the only way to sanitize
> Markdown is by sanitizing the HTML output.

In my case the data source does not have any Markdown expectations (the source does not even know there is Markdown in the middle or what Markdown is).

The source expects literal output

obfuscated**dot**emailaddress**dot**tld

should not become:

myobfuscateddotemailaddressdottld

In these examples I used the "*" but this is not limited to this character, this is about any metachars + the pipe sign (since I use the table extension).

> As a practical matter, to sanitize the Markdown text before passing
> it to the parser would require writing another parser, just one that
> removes/escapes the disallowed markup. If that is what you really
> want

Yes :)

> then just use Python-Markdown extension API to remove the
> “strongPattern” from the parser.

Is it possible to use your API to do Markdown escaping? (the initial question)

input -> output examples:

**foo** -> \*\*foo\*\*
dot*foo -> dot*foo (no backslash)
1. -> 1\.
example.com -> example.com (no backslash)
...(and all other meta chars)

> If that is the behavior
> you really want, that is the easiest way to get it. But I expect your
> users will very much dislike it.

As stated above - no worries here - since the data source does not know about Markdown - and therefore has no expectations.

Below you find my conversation with python-help - because it is also relevant and the reason I'm asking again here, since I need a Markdown-aware escape function, not a simple search/replace:

------------
Matt (python-help) wrote:
> However, I also think that having an escape-the-markup
> function in a markup library makes perfect sense.

>> (simply replacing all metachars with \metachar does not work)
>
> But if you can be more specific about how that doesn't work, we may
> be able to help suggest something.

I wrote a very simple function that replaces all Markdown metachars to test that approach

    def escape_md(input):
        input=input.replace('\\','\\\\')
        input=input.replace('`', '\\`')
        input=input.replace('*', '\\*')
        ...
        return input

The problem with this approach is that it does not care about the context. So this works fine against

"**foo**" -> HTML: **foo**

but it does not work here:

"d*ot" -> HTML: d\*ot

similarly with other chars .([`-

So in the end the escape function has to be Markdown-aware, and at that point I guess it is no longer a simple search-and-replace thing and should be part of a Markdown library (that is already aware of the syntax anyway).
|
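The four input -> output examples in the message above can be read as context rules. The following is a rough sketch of that reading, not a known implementation and not claimed to be XSS-safe or complete. The function name `escape_md_sketch`, the `ALWAYS_ESCAPE` set, and the two heuristics (a `*` or `_` is escaped only if it occurs at least twice on the line; a dot is escaped only after leading digits at the start of a line) are assumptions made for this illustration, not behavior confirmed anywhere in the thread.

```python
import re

# Characters this sketch always escapes (an assumed, incomplete set).
ALWAYS_ESCAPE = set("\\`|")


def escape_md_sketch(text):
    escaped_lines = []
    for line in text.split("\n"):
        pieces = []
        for ch in line:
            if ch in ALWAYS_ESCAPE:
                pieces.append("\\" + ch)
            elif ch in "*_" and line.count(ch) >= 2:
                # Assumption: emphasis needs an opening and a closing
                # delimiter, so a lone "*" or "_" is left untouched.
                pieces.append("\\" + ch)
            else:
                pieces.append(ch)
        line = "".join(pieces)
        # "1." is only escaped at the beginning of a line (ordered list).
        line = re.sub(r"^(\s*\d+)\.(?=\s|$)", r"\1\\.", line)
        # A leading "*", "+" or "-" followed by whitespace is a bullet.
        line = re.sub(r"^(\s*)([*+-])(?=\s)", r"\1\\\2", line)
        escaped_lines.append(line)
    return "\n".join(escaped_lines)
```

The heuristics are line-based and know nothing about links, code spans, headings or inline HTML, which is exactly the gap a real Markdown-aware escaper would have to close.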
From: <way...@ic...> - 2017-05-25 23:17:29
|
On May 25, 2017, 12:49 PM -0400, wrote:
>
> In my case the data source does not have any Markdown expectations (the
> source does not even know there is Markdown in the middle or what
> Markdown is).

It sounds to me like you would benefit from a parser which generates an abstract syntax tree (AST). The AST is then fed to a renderer which renders the output. You get the benefits of an already existing parser and can easily provide your own renderer which outputs whatever format you want. This is how Pandoc works to convert between so many different formats. If you are looking for a Python lib, you might want to look at mistune (haven’t used it myself but I like what I see).

It is sort-of possible to do that with Python-Markdown. You can provide your own serializer. But we use an ElementTree instance rather than an AST. So the serializer would need to build your output from an HTML-like object representation, which is less than ideal for anything but HTML. And then some of our postprocessors would need to be rewritten, as the default ones expect HTML.

Waylan
|
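The parser -> tree -> renderer split described above can be shown with a deliberately tiny toy. This is not mistune's, Pandoc's or Python-Markdown's API; the function names `parse_inline`, `render_html` and `render_literal_markdown` are hypothetical, and the "grammar" only knows `**...**`. The point is only that one parse result can feed different renderers, one of which writes the markup back out escaped.

```python
import re
from html import escape


def parse_inline(text):
    """Split text into ("strong", s) and ("text", s) nodes (toy grammar)."""
    nodes = []
    pos = 0
    for match in re.finditer(r"\*\*(.+?)\*\*", text):
        if match.start() > pos:
            nodes.append(("text", text[pos:match.start()]))
        nodes.append(("strong", match.group(1)))
        pos = match.end()
    if pos < len(text):
        nodes.append(("text", text[pos:]))
    return nodes


def render_html(nodes):
    """Usual rendering: strong nodes become <strong> elements."""
    parts = []
    for kind, value in nodes:
        if kind == "strong":
            parts.append("<strong>%s</strong>" % escape(value, quote=False))
        else:
            parts.append(escape(value, quote=False))
    return "".join(parts)


def render_literal_markdown(nodes):
    """Alternative rendering: write the markup back, backslash-escaped."""
    parts = []
    for kind, value in nodes:
        if kind == "strong":
            parts.append("\\*\\*%s\\*\\*" % value)
        else:
            parts.append(value)
    return "".join(parts)
```

Because the second renderer only escapes what the parser actually recognized as markup, a lone `*` as in `dot*foo` passes through without a backslash.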