Thread: [Python-markdown-discuss] Escaping HTML instead of removing it ?

[Python-markdown-discuss] Escaping HTML instead of removing it ?

From: Herbert P. <her...@gm...> - 2007-06-12 21:32:05

Hi,

Is it somehow possible to not remove HTML but instead escape it with
HTML entities? This seems to be a much more user-friendly way for
wikis to deal with HTML.

I tried to simply put in a preprocessor, but had no luck yet,
basically because I'm not sure I fully understand how the current
implementation removes HTML (for example, why HTML is escaped in code
blocks and not fully removed).

Is there an easy way to do this?


thanks & cu,
  herbert


P.S.: @Yuri Takhteyev: I guess you don't really care any more since
you've already put up a wiki, but anyway: http://sct.sphene.net/
is my wiki based on python-markdown (and Django).

Re: [Python-markdown-discuss] Escaping HTML instead of removing it ?

From: Yuri T. <qar...@gm...> - 2007-06-13 03:02:05

You should be able to do this with a preprocessor by simply
pre-escaping all HTML, no?  Alternatively, if you want a quick and
dirty hack, look for the line that says:

    if self.safeMode and html != "<hr />" and html != "<br />":
        html = HTML_REMOVED_TEXT

I do agree though that perhaps escaping html would be a better
default.  (Please do file a bug on sourceforge so that I don't forget
to make this change later.)  In the long term, perhaps, the new and
more flexible way of managing pre-post-etc-processors would solve this
problem as well.
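
A minimal sketch of what the quick hack could look like if raw HTML
were escaped rather than replaced. `sanitize_html_block`, its
parameters, and the placeholder value are hypothetical names chosen for
illustration, not part of the module, and `html.escape` is the current
standard-library escaper rather than anything the module used at the
time:

```python
import html

HTML_REMOVED_TEXT = "[HTML_REMOVED]"  # placeholder value, assumed for illustration


def sanitize_html_block(raw, safe_mode=True, escape=True):
    """Return raw HTML untouched, escaped, or replaced by a placeholder."""
    # <hr /> and <br /> are let through, as in the line quoted above.
    if not safe_mode or raw in ("<hr />", "<br />"):
        return raw
    if escape:
        # Show the markup as literal text instead of dropping it.
        return html.escape(raw, quote=True)
    return HTML_REMOVED_TEXT
```

With `escape=False` the function falls back to the replace-with-placeholder
behavior of the original line.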

> implementation removes HTML (for example, why HTML is escaped in code
> blocks and not fully removed).

An oversight on my part...

> P.S.: @Yuri Takhteyev: I guess you don't really care any more since
> you've already put up a wiki, but anyway: http://sct.sphene.net/
> is my wiki based on python-markdown (and Django).

I will stick with what I installed, but I do _care_ - it's good to
have a wiki based on this module. Please add your project to the wiki
under "Related Projects".

  - yuri

-- 
http://www.freewisdom.org/

Re: [Python-markdown-discuss] Escaping HTML instead of removing it ?

From: Waylan L. <wa...@gm...> - 2007-11-05 05:23:55

I've just committed a patch to svn (r53) that provides a nice middle
ground to the escaping vs. removing html issue. The old behavior is
still the default, but escaping is provided as an option. Currently,
the global variable `HTML_REMOVED_TEXT` holds the text that is used
for replacement. I set it up so that if that string is empty (or
otherwise evaluates to `False` in python) then the html is escaped
instead. In other words, you turn escaping on in the same way that you
change the replacement text. Here's an example:

    >>> import markdown
    >>> markdown.HTML_REMOVED_TEXT = ''
    >>> md = markdown.Markdown(safe_mode=True)
    >>> md.convert('<a href="foo">foo</a> bar.')
    '<p>&lt;a href=&quot;foo&quot;&gt;foo&lt;/a&gt; bar.\n</p>'

I left the default as the old behavior, but that could easily be
switched. I also considered adding a new global (perhaps
`ESCAPE_HTML`) which would simply hold a True/False value, but
couldn't see adding an additional variable. If anyone feels otherwise,
let me know.

I see one potential problem with my solution which I hadn't considered
until just now (after committing my patch). One could already have
code that sets `HTML_REMOVED_TEXT` to an empty string so that all html
is stripped and replaced with nothing. Some may prefer such a
behavior. This makes that impossible to do. Is anyone doing this?
Adding `ESCAPE_HTML` would address this issue, if it is one.

Another solution would be to change the expected values of the
`safe_mode` parameter for Markdown() to one of 'strip', 'escape', or
None rather than True/False. But that could get complicated/confusing.

Oh, and obviously, the value of `HTML_REMOVED_TEXT` can be changed in
the source file if one will always want that behavior. That can become
a headache on upgrading to a new version though. It's usually better to
future-proof your code IMO.

I should also mention that I moved the code that does the
escaping/removing from the convert method to a text-post-processor. It
makes more sense there regardless of this change IMO and simplifies
the process of making your own extension to change the behavior.
Extensions would be another way to address the issues I mention above.
Perhaps we could just leave it at that.

The escaping is very basic. Any improvements are welcome. Anyone know
of a method already available in the python standard lib?

Any objections, comments, suggestions are welcome.

-- 
----
Waylan Limberg
wa...@gm...

Re: [Python-markdown-discuss] Escaping HTML instead of removing it ?

From: Trent M. <tr...@gm...> - 2007-11-06 05:37:11

> The escaping is very basic. Any improvements are welcome. Anyone know
> of a method already available in the python standard lib?

>>> import cgi
>>> cgi.escape("<a href='blah'>foo & bar</a>")
"&lt;a href='blah'&gt;foo &amp; bar&lt;/a&gt;"

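On current Python versions `cgi.escape` is no longer available (it was
deprecated and then removed in Python 3.8); `html.escape` is the
standard-library replacement. A short sketch of the equivalent call,
passing `quote=False` so that quotes are left alone as in the output
above:

```python
import html

# quote=False leaves ' and " untouched, matching cgi.escape's default.
escaped = html.escape("<a href='blah'>foo & bar</a>", quote=False)
print(escaped)
```

By default `html.escape` also escapes both quote characters, which is
the safer choice when the result ends up inside an attribute value.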

Trent

-- 
Trent Mick
tr...@gm...

Re: [Python-markdown-discuss] Escaping HTML instead of removing it ?

From: Yuri T. <qar...@gm...> - 2007-11-05 06:08:37

> until just now (after committing my patch). One could already have
> code that sets `HTML_REMOVED_TEXT` to an empty string so that all html
> is stripped and replaced with nothing. Some may prefer such a
> behavior. This makes that impossible to do. Is anyone doing this?

This does seem like a reasonable thing to allow.  Why not use None
instead of an empty string as the code for escaping, testing for
isinstance(HTML_REMOVED_TEXT, str)?

> Another solution would be to change the expected values of the
> `safe_mode` parameter for Markdown() to one of 'strip', 'escape', or
> None rather than True/False. But that could get complicated/confusing.

This is actually quite reasonable, except that we could make it more
backwards compatible by saying that safe_mode = None would turn
it off, safe_mode = "escape" would escape the HTML, and "remove" or
any other non-false value would replace HTML with the value of
HTML_REMOVED_TEXT.  I think for the documentation we should tell
people to put "replace", but the actual code should treat any true
value other than "escape" as meaning "removed".
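
The rule proposed here can be sketched roughly as follows. This is an
illustration of the proposal only, not the code that was committed;
`handle_raw_html` and the placeholder value are hypothetical, and
`html.escape` stands in for whatever escaping the module performs:

```python
import html

HTML_REMOVED_TEXT = "[HTML_REMOVED]"  # assumed placeholder text


def handle_raw_html(raw, safe_mode=None):
    """Apply the proposed safe_mode rules to one chunk of raw HTML."""
    if not safe_mode:
        # None (or any other false value): safe mode is off.
        return raw
    if safe_mode == "escape":
        # Show the markup as literal text.
        return html.escape(raw)
    # "replace", True, or any other true value: swap in the placeholder.
    return HTML_REMOVED_TEXT
```

Because every true value other than "escape" falls through to the last
branch, existing callers that pass safe_mode=True keep the old
behavior.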

> I should also mention that I also moved the code that does the
> escaping/removing from the convert method to a text-post-processor. It
> makes more sense there regardless of this change IMO and simplifies
> the process of making your own extension to change the behavior.
> Extensions would be another way to address the issues I mention above.
> Perhaps we could just leave it at that.

I am glad you did it, but it would be nice to have a simpler solution
that does not depend on grokking extensions.

Thanks for all the work!  When do you think we should make a release of 1.7?

  - yuri

-- 
Yuri Takhteyev
Ph.D. Candidate, UC Berkeley School of Information
http://takhteyev.org/, http://www.freewisdom.org/

Re: [Python-markdown-discuss] Escaping HTML instead of removing it ?

From: Waylan L. <wa...@gm...> - 2007-11-05 14:15:28

On 11/5/07, Yuri Takhteyev <qar...@gm...> wrote:
> > until just now (after committing my patch). One could already have
> > code that sets `HTML_REMOVED_TEXT` to an empty string so that all html
> > is stripped and replaced with nothing. Some may prefer such a
> > behavior. This makes that impossible to do. Is anyone doing this?
>
> This does seem like a reasonable thing to allow.  Why not use None
> instead of an empty string as the code for escaping, testing for
> isinstance(HTML_REMOVED_TEXT, str)?

After sending this message last night, I realized this isn't that big
of a problem. I'm currently testing with `if HTML_REMOVED_TEXT:`,
so `False`, 0, an empty string, and `None` all result in
escaping. What I missed last night is that a string containing a single
space evaluates to True and triggers replacing rather than escaping.
Since a lone space is effectively invisible in HTML anyway, this seems
like a reasonable solution.
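
A small sketch of the truthiness test described above; `will_escape` is
a hypothetical helper used only to show which values select which
behavior:

```python
def will_escape(html_removed_text):
    """Mirror the `if HTML_REMOVED_TEXT:` test: falsy values select escaping."""
    return not html_removed_text


# A single space is truthy, so it selects replacement, not escaping.
for value in (False, 0, "", None, " ", "[HTML_REMOVED]"):
    action = "escape" if will_escape(value) else "replace"
    print(repr(value), "->", action)
```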

>
> > Another solution would be to change the expected values of the
> > `safe_mode` parameter for Markdown() to one of 'strip', 'escape', or
> > None rather than True/False. But that could get complicated/confusing.
>
> This is actually quite reasonable, except that we could make it more
> backwards compatible by saying that safe_mode = None would turn
> it off, safe_mode = "escape" would escape the HTML, and "remove" or
> any other non-false value would replace HTML with the value of
> HTML_REMOVED_TEXT.  I think for the documentation we should tell
> people to put "replace", but the actual code should treat any true
> value other than "escape" as meaning "removed".

The more I think about it, the more I'm inclined to want a way to turn
escaping on as a parameter, so I think I'll leave things the way they
are, except that if safe_mode == "escape" we force escaping regardless
of the value of HTML_REMOVED_TEXT.

That seems to allow the most possibilities without extensions.

>
> > I should also mention that I also moved the code that does the
> > escaping/removing from the convert method to a text-post-processor. It
> > makes more sense there regardless of this change IMO and simplifies
> > the process of making your own extension to change the behavior.
> > Extensions would be another way to address the issues I mention above.
> > Perhaps we could just leave it at that.
>
> I am glad you did it, but it would be nice to have a simpler solution,
> that does not depend on grokking extensions.
>
> Thanks for all the work!  When do you think we should make a release of 1.7?

I should update the escaping tonight from this discussion, and don't
have anything else for the immediate future, so whenever you're ready.
I'll let you make those unicode changes that were discussed. You seem
to understand that better than me anyway. Or was that just a
documentation issue?

>
>   - yuri
>
> --
> Yuri Takhteyev
> Ph.D. Candidate, UC Berkeley School of Information
> http://takhteyev.org/, http://www.freewisdom.org/
>

-- 
----
Waylan Limberg
wa...@gm...

Re: [Python-markdown-discuss] Escaping HTML instead of removing it ?

From: Yuri T. <qar...@gm...> - 2007-11-05 16:09:53

> I should update the escaping tonight from this discussion, and don't
> have anything else for the immediate future, so whenever you're ready.
> I'll let you make those unicode changes that were discussed. You seem
> to understand that better than me anyway. Or was that just a
> documentation issue?

Ok, I'll make them and update the documentation.

  - yuri

-- 
Yuri Takhteyev
Ph.D. Candidate, UC Berkeley School of Information
http://takhteyev.org/, http://www.freewisdom.org/

Re: [Python-markdown-discuss] Escaping HTML instead of removing it ?

From: Waylan L. <wa...@gm...> - 2007-11-05 21:02:14

I've finished my updates. I've even updated the change_log for you.
Feel free to release anytime.

I should note that I decided to drop escaping via an empty
HTML_REMOVED_TEXT, since one would have to set safe_mode anyway.
That seemed redundant once I started writing documentation.

Btw, I did some work on the documentation [1]. If you like the format,
I'll do the same for the other pages.

For a full rundown of the new safe_mode functionality see that page.
The italicized note can be removed upon release (or I can remove the
section now and add it back upon release if preferred)

[1]: http://www.freewisdom.org/projects/python-markdown/Using_as_a_Module

-- 
----
Waylan Limberg
wa...@gm...

Re: [Python-markdown-discuss] Escaping HTML instead of removing it ?

From: Waylan L. <wa...@gm...> - 2007-11-05 21:38:04

Oh, I almost forgot to add escaping to the command line interface.
It's there now, but I'm not sure I like it. I rarely, if ever, (except
maybe when testing) use the command line interface, so if anyone else
has any input, let me know.

-- 
----
Waylan Limberg
wa...@gm...