Re: [Python-markdown-discuss] AtomicString

I haven't added to this discussion yet as I wasn't sure what position
to take. Here are my thoughts, observations and an almost-working solution:

Everyone seems to be going back and forth on which random string
generator is better. Personally I'm wondering what all the fuss is
about. What we want is a unique string that identifies said string as
a placeholder for a specific item in a stash. We have 2 stashes
(rawHtml and inline) so we also need to identify which stash. The
thing is, the "start" and "end" chars give us the uniqueness that
identifies the string as a placeholder. If we only had one stash, all
we would need is the id number.

So the question, then, is how do we identify which stash a
placeholder belongs to? Currently, each stash's placeholder contains
either the string "inline" or "html" (there are a couple of other
subtle differences, but they're easily removable). Now, as the
current wikilink bug demonstrates, using real words that could
legitimately appear in the document, and that other patterns might
even match against, causes problems. So we need 2 strings that will
never (or at least are very unlikely to) be matched by any other pattern.

The popular solutions in this discussion thus far seem to generate a
string of random chars at import time. Depending on the generation
method used, there will be a 1-in-x chance of a collision with a real,
valid string. Obviously, the higher x is, the better - or so it seems.
Suppose I am serving a document via a CGI script, which will cause an
import and a new, different random string on each page view. On
average I only get about x page views before there is a collision.
Whoops! Now try debugging that!
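
To put a rough number on that, here is a back-of-the-envelope sketch. It
assumes every import is an independent draw that collides with probability
1/x; the function name is made up for this example.

```python
def collision_probability(x, views):
    """Chance of at least one collision across `views` imports.

    Assumes each import independently collides with probability 1/x,
    which is a simplification of the scenario described above.
    """
    return 1.0 - (1.0 - 1.0 / x) ** views
```

Under that assumption the chance is small for any single page view but only
ever grows as page views accumulate, and a failure cannot be reproduced
because the next import uses a different string.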

Therefore, I propose that we select 2 strings of random chars (using
whatever method you desire) and **hardcode** those 2 strings into
markdown.py. That way, on each import (each page view in the above
scenario) the placeholder strings will be the same and debugging will
be consistent.  What we really want is a string that will never be
matched by another inline pattern's regex. We just need a string of
all same-case chars between a-z of length n. As long as it does not
contain any known words or abbreviations it works for me.
Additionally, if the string is consistent, that makes it easier for an
extension author to write regexes for inline patterns that will not
match the placeholder string.
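
As a minimal sketch of what hardcoding might look like: the start/end chars,
the two prefix strings and the helper names below are all invented for
illustration and are not taken from markdown.py.

```python
import re

# Hypothetical values, picked once and hardcoded (not what markdown.py uses).
START = '\x02'             # assumed "start" char
END = '\x03'               # assumed "end" char
INLINE_PREFIX = 'qzjxvk'   # made-up lowercase strings, no real words
HTML_PREFIX = 'wpvqzj'

PLACEHOLDER_RE = re.compile(
    '%s(%s|%s):(\\d+)%s' % (START, INLINE_PREFIX, HTML_PREFIX, END))

def make_placeholder(prefix, index):
    """Build the placeholder for item number `index` of one stash."""
    return '%s%s:%d%s' % (START, prefix, index, END)

def parse_placeholder(text):
    """Return (prefix, index) for the first placeholder in text, or None."""
    m = PLACEHOLDER_RE.search(text)
    if m is None:
        return None
    return m.group(1), int(m.group(2))
```

Because the strings are the same on every import, ordinary text that merely
contains the words "inline" or "html" is never mistaken for a placeholder,
and a failing document fails the same way every time.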

I have committed an *almost* working branch [1] that has everything
except the random strings (it still uses "inline" & "html"). I say
"almost working" because the output includes a lot of extra,
unnecessary whitespace. The problem is not creating the placeholder,
but replacing the placeholder with the real content later - at least
the way Artem's code works. Based upon docstrings, I determined that I
needed to refactor ``InlineStash.extractId``, which I did. However, it
seems that Artem's code was jumping through an awful lot of hoops and
I haven't fully grokked what ``Markdown._processPlaceholders`` is doing
when it calls ``InlineStash.extractId``. Wouldn't it be better if we
simply used the indexes ``m.start()`` and ``m.end()`` from a regex match
rather than the string manipulation hoops it's doing now?
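
Roughly what I have in mind, as a sketch only: the placeholder format, the
function name and the use of a plain list as the stash are assumptions for
this example, not the code in the branch.

```python
import re

# Hypothetical placeholder format for this sketch: \x02inline:<id>\x03
PLACEHOLDER_RE = re.compile('\x02inline:(\\d+)\x03')

def split_on_placeholders(text, stash):
    """Return a list of plain-text pieces and stashed items, in order."""
    parts = []
    pos = 0
    for m in PLACEHOLDER_RE.finditer(text):
        if m.start() > pos:
            # Plain text between the previous placeholder and this one.
            parts.append(text[pos:m.start()])
        # The captured id indexes straight into the stash.
        parts.append(stash[int(m.group(1))])
        pos = m.end()
    if pos < len(text):
        # Trailing text after the last placeholder.
        parts.append(text[pos:])
    return parts
```

The match object already carries both the id and the exact span of the
placeholder, so no separate string slicing is needed to find where it ends.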

Once we get that worked out, I'll replace the strings "inline" &
"html" with something more random. Here's the output of a few simple
tests:

    >>> markdown.markdown('foo *bar* baz')
    u'<p>foo <em>bar</em> baz</p>'

    >>> markdown.markdown('foo *bar __blah__* baz')
    u'<p>foo <em>bar <strong>blah</strong>\n  </em> baz</p>'

What's up with the newline and space between the closing tags
``</strong>`` & ``</em>``? Is that from the (IMO unnecessary)
``IndentTree`` function or something in
``Markdown._processPlaceholders``? I'm not sure.

Any thoughts?

[1]: http://gitorious.org/projects/python-markdown/repos/mainline/commits/ab57ff93b5b2750c082c87072ced774881190744

On Tue, Aug 26, 2008 at 9:22 AM, David Wolever <wo...@cs...> wrote:
> On 25-Aug-08, at 7:55 PM, Artem Yunusov wrote:
>> Yes, I agree, it's not necessary here. But thanks David, I didn't know
>> about it before, and used to use md5 for such things.
> Ah, well, I'm glad I could be helpful anyway :)
>
> And, re: more readable:
>> "abcdefghijklmnopqrstuvwxyz"[random.randint(1,26)]
> I agree, that's pretty nice :)

-- 
----
Waylan Limberg
wa...@gm...