From: Neale P. <ne...@wo...> - 2008-08-21 22:11:57
|
I was the one who suggested the AtomicString change that was recently committed. It's probably time to move the discussion to the mailing list.

I'm writing a wiki, and what prompted the change was that this:

    <http://example.com/CamelCase/foo>

would turn into this:

    <a href="http://example.com/CamelCase/foo">http://example.com/<a href="CamelCase.wki">CamelCase</a>/foo</a>

AtomicString fixes that by creating a new string class (it inherits from unicode, actually) that won't receive further processing. Then it's a simple matter of setting `e.text = AtomicString(whatever)`.

I now have a new problem in my WikiLinkPattern class. Here is some debugging output:

    handlematch u'If you are new to wiki, check out WikiHelp.'
    handlematch u'If you are new to wiki, check out \x02inline:WikiLinkPattern:0000\x03.'
    handlematch u'If you are new to wiki, check out \x02inline:\x02inline:WikiLinkPattern:0001\x03:0000\x03.'

ad infinitum. My WikiLink regex is matching the inline placeholder.

I know the upstream developers are at least pondering removing inline entirely, possibly using a mechanism like AtomicString. Should I hold off on fixing this locally?

Neale
|
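The AtomicString idea Neale describes can be sketched in a few lines. This is a minimal stand-in, not the actual Python-Markdown implementation: the real class subclassed `unicode` (this thread predates Python 3; here `str` is the equivalent base), and `process_inline` is a hypothetical example of the "further processing" an inline pattern would do.

```python
# Hedged sketch of the AtomicString marker-class idea. The subclass
# adds no behavior of its own; it only carries a type that processors
# can check for and skip.
class AtomicString(str):
    """A string that inline processors should leave untouched."""
    pass

def process_inline(text):
    # Hypothetical inline processor: skip strings marked atomic,
    # process everything else.
    if isinstance(text, AtomicString):
        return text
    return text.replace("CamelCase",
                        '<a href="CamelCase.wki">CamelCase</a>')

plain = "http://example.com/CamelCase/foo"
print(process_inline(plain))                # link substituted
print(process_inline(AtomicString(plain)))  # left untouched
```

Setting `e.text = AtomicString(whatever)` then works because later processing stages check the type before touching the text.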
From: Waylan L. <wa...@gm...> - 2008-08-22 19:25:26
|
Neale,

I'm getting the same bug with the included wikilink extension. However, as far as I can tell, the extension is not the problem. It's the way the inline placeholders work that is tripping up these camelcase link generators. I've filed ticket #14 [1] for it. Note in particular the last example I included in that ticket:

    In [4]: markdown.markdown('[markdownlink](/markdownlink)', ['wikilink'])
    Out[4]: u'<p>\x02inline:<a class="wikilink" href="/LinkPattern/">LinkPattern</a>:0000\x03</p>'

There are no camelcase words in the text we pass in, so no wikilinks should be generated. However, it appears that the placeholder for the "markdownlink" is inserted into the text as ``\x02LinkPattern:000\x03``. Normally, this would later be replaced by the html for the link. But before that happens, the wikilink extension looks for a match, finds one for "LinkPattern", and replaces it with a link. Whoops. Obviously, markdown can't use camelcase placeholders.

[1]: http://www.freewisdom.org/projects/python-markdown/Tickets/000014

On Thu, Aug 21, 2008 at 6:12 PM, Neale Pickett <ne...@wo...> wrote:
[snip]
> I have a new problem in my WikiLinkPattern class. Here is some
> debugging output:
>
> handlematch u'If you are new to wiki, check out WikiHelp.'
> handlematch u'If you are new to wiki, check out \x02inline:WikiLinkPattern:0000\x03.'
> handlematch u'If you are new to wiki, check out \x02inline:\x02inline:WikiLinkPattern:0001\x03:0000\x03.'
>
> ad infinitum. My WikiLink regex is matching the inline thingy.
[snip]

--
----
Waylan Limberg
wa...@gm...
|
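The failure mode Waylan describes can be reproduced in isolation. The regex below is a simplified stand-in for the wikilink extension's camelcase pattern, not the extension's actual regex: the point is only that a placeholder built from a meaningful CamelCase name is itself a valid match.

```python
import re

# A placeholder built from a meaningful name, as the old code did.
STX, ETX = "\u0002", "\u0003"
placeholder = STX + "inline:LinkPattern:0000" + ETX
text = "no camelcase words here, just %s." % placeholder

# Simplified stand-in for a camelcase wikilink pattern.
wikilink_re = re.compile(r"\b([A-Z][a-z]+(?:[A-Z][a-z]+)+)\b")

# The placeholder itself gets matched, even though the user's
# input contained no camelcase words at all.
print(wikilink_re.findall(text))  # ['LinkPattern']
```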
From: Waylan L. <wa...@gm...> - 2008-08-22 19:39:58
|
On Fri, Aug 22, 2008 at 3:25 PM, Waylan Limberg <wa...@gm...> wrote:
[snip]
> There is absolutely no camelcase words in the text we pass in, so no
> wikilinks should be generated. However, it appears that the
> placeholder for the "markdownlink" is inserted into the text as
> ``\x02LinkPattern:000\x03``.
[snip]

Sorry, that should have been ``\x02inline:LinkPattern:0000\x03``. Regardless, the point remains.

--
----
Waylan Limberg
wa...@gm...
|
From: Yuri T. <qar...@gm...> - 2008-08-22 19:49:21
|
> There is absolutely no camelcase words in the text we pass in, so no
> wikilinks should be generated. However, it appears that the
> placeholder for the "markdownlink" is inserted into the text as
> ``\x02LinkPattern:000\x03``. Normally, this would later be replaced by

Yeah, that's a whoops. I think rather than just avoiding camelcase in placeholders, we should avoid anything meaningful in them at all, apart from the STX and ETX codes. We used to have some random combination of characters. Adding STX and ETX around it made it safer against us trying to replace an occurrence of the placeholder in the original text. However, switching from a random combination to meaningful things like "LinkPattern" creates the possibility of users messing with our placeholders via extensions. So, I think we should do both: use a meaningless combination of letters (without any punctuation), and then wrap it with characters that users aren't allowed to put in the input (STX and ETX). E.g.:

    STX = u'\u0002'  # Use STX ("Start of text") for start-of-placeholder
    ETX = u'\u0003'  # Use ETX ("End of text") for end-of-placeholder
    HTML_PLACEHOLDER_PREFIX = STX + "wyxhzde38k"
    HTML_PLACEHOLDER = HTML_PLACEHOLDER_PREFIX + "%d" + ETX
    INLINE_PLACEHOLDER_PREFIX = STX + "0ix2bavflj"
    INLINE_PLACEHOLDER_SUFFIX = ETX
    AMP_SUBSTITUTE = STX + "k75lziz62a" + ETX

Actually, come to think of it, perhaps even that %d is not a good idea.

(I am not checking this in, since Waylan seems to be actively working on the file.)

- yuri

--
http://sputnik.freewisdom.org/
|
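Rerunning the earlier collision check against Yuri's proposed constants shows why a meaningless token helps. The regex is again a simplified stand-in for a camelcase wikilink pattern, and the token values are taken from the proposal above.

```python
import re

STX = "\u0002"  # start-of-placeholder
ETX = "\u0003"  # end-of-placeholder
INLINE_PLACEHOLDER_PREFIX = STX + "0ix2bavflj"  # value from the proposal above
placeholder = INLINE_PLACEHOLDER_PREFIX + "0000" + ETX
text = "check out %s." % placeholder

# The same simplified camelcase pattern that bit the wikilink
# extension now finds nothing: the token has no uppercase letters,
# no words, nothing for a pattern to latch onto.
wikilink_re = re.compile(r"\b([A-Z][a-z]+(?:[A-Z][a-z]+)+)\b")
print(wikilink_re.findall(text))  # []
```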
From: Waylan L. <wa...@gm...> - 2008-08-22 21:03:09
|
On Fri, Aug 22, 2008 at 3:49 PM, Yuri Takhteyev <qar...@gm...> wrote:
[snip]
> Actually, come to think of it, perhaps even that %d is not a good idea.
>
> (I am not checking this in, since Waylan seems to be actively working
> on the file.)

Go ahead, Yuri. I'm working on something else right now.

--
----
Waylan Limberg
wa...@gm...
|
From: Artem Y. <ne...@gm...> - 2008-08-22 20:12:51
|
Yuri Takhteyev wrote:
[snip]
> So, I we should do both: use a
> meaningless combination of letters (without any punctuation), and then
> wrap it with characters that users aren't allowed to put in the input
> (STX and ETX).
[snip]
> Actually, come to think of it, perhaps even that %d is not a good idea.

Maybe we should use some random hashes, like `md5.new(str(random.random())).hexdigest()`?

I don't think users will have to deal with the placeholders, as long as everything works fine. In preprocessors they'll be given the plain input; in postprocessors they'll receive an ElementTree that has already been processed with the inline patterns.
|
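Artem's suggestion, written out with `hashlib` (the standalone `md5` module was the Python 2 spelling of the same thing). This is only an illustration of the idea, not project code:

```python
import hashlib
import random

# A 32-character hex token derived from a random float, per the
# suggestion above. The exact value varies on every run.
token = hashlib.md5(str(random.random()).encode("utf-8")).hexdigest()
print(token)
```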
From: Waylan L. <wa...@gm...> - 2008-08-27 22:16:43
|
I haven't added to this discussion yet as I wasn't sure what position to take. Here are my thoughts, observations, and an almost-working solution.

Everyone seems to be going back and forth on which random string generator is better. Personally, I'm wondering what all the fuss is about. What we want is a unique string that identifies said string as a placeholder for a specific item in a stash. We have two stashes (rawHtml and inline), so we also need to identify which stash. The thing is, the "start" and "end" chars give us the uniqueness that identifies the string as a placeholder. If we only had one stash, all we would need is the id number. So the question, then, is how do we identify which stash this placeholder is for?

Currently, each stash's placeholder contains either the string "inline" or "html" (there are currently a couple of other subtle differences, but they're easily removable). Now, as the current wikilink bug demonstrates, using real words that could legitimately appear in the document, and perhaps even have patterns matching against them, causes problems. So, we need two strings that will never (or at least very rarely) be matched by any other pattern.

The popular solutions in this discussion thus far generate a string of random chars at import time. Depending on the generation method used, each generated string carries some chance of colliding with a real, valid string. Now suppose I am serving documents via a cgi script, which causes a fresh import, and therefore a new, different random string, on each page view. Every page view is another roll of the dice, so given enough page views a collision becomes inevitable. Whoops! Now try debugging that!

Therefore, I propose that we select two strings of random chars (using whatever method you desire) and **hardcode** those two strings into markdown.py. That way, on each import (each page view in the above scenario) the placeholder strings will be the same and debugging will be consistent. What we really want is a string that will never be matched by another inline pattern's regex. We just need a string of same-case chars between a-z of length n. As long as it does not contain any known words or abbreviations, it works for me. Additionally, if the string is consistent, that makes it easier for an extension author to write a regex for inline patterns that will not match the string in the placeholder.

I have committed an *almost* working branch [1] that has everything except the random strings (it still uses "inline" & "html"). I say "almost working" because the output includes a lot of extra, unnecessary whitespace. The problem is not creating the placeholder, but replacing the placeholder with the real content later, at least the way Artem's code works. Based on the docstrings, I determined that I needed to refactor ``InlineStash.extractId``, which I did. However, it seems that Artem's code was jumping through an awful lot of hoops, and I haven't fully grokked what ``Markdown._processPlaceholders`` is doing when it calls ``InlineStash.extractId``. Wouldn't it be better if we simply used the indexes ``m.start`` and ``m.end`` from a regex match rather than the string-manipulation hoops it's jumping through now?

Once we get that worked out, I'll replace the strings "inline" & "html" with something more random. Here's the output of a few simple tests:

    >>> markdown.markdown('foo *bar* baz')
    u'<p>foo <em>bar</em> baz</p>'
    >>> markdown.markdown('foo *bar __blah__* baz')
    u'<p>foo <em>bar <strong>blah</strong>\n </em> baz</p>'

What's up with the newline and space between the closing tags ``</strong>`` & ``</em>``? Is that from the (IMO unnecessary) ``IndentTree`` function or something in ``Markdown._processPlaceholders``? I'm not sure. Any thoughts?

[1]: http://gitorious.org/projects/python-markdown/repos/mainline/commits/ab57ff93b5b2750c082c87072ced774881190744

On Tue, Aug 26, 2008 at 9:22 AM, David Wolever <wo...@cs...> wrote:
[snip]

--
----
Waylan Limberg
wa...@gm...
|
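Waylan's ``m.start``/``m.end`` suggestion could look something like the sketch below. This is hypothetical, not the actual ``Markdown._processPlaceholders`` code: the stash and the placeholder format are simplified to the bare minimum needed to show the index-based splicing.

```python
import re

STX, ETX = "\u0002", "\u0003"
stash = {0: "<em>bar</em>"}  # hypothetical inline stash
PLACEHOLDER_RE = re.compile(STX + r"inline:(\d+)" + ETX)

def replace_placeholders(text):
    # Splice stashed content back in using the match's start/end
    # indexes, instead of repeated string searching and slicing.
    out, pos = [], 0
    for m in PLACEHOLDER_RE.finditer(text):
        out.append(text[pos:m.start()])     # text before the placeholder
        out.append(stash[int(m.group(1))])  # the stashed html
        pos = m.end()
    out.append(text[pos:])                  # trailing text
    return "".join(out)

print(replace_placeholders("foo \u0002inline:0\u0003 baz"))
# foo <em>bar</em> baz
```

Because the regex yields exact byte offsets, no whitespace is added or lost around the replaced span, which is the kind of bookkeeping error that produces stray newlines.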
From: Yuri T. <qar...@gm...> - 2008-08-22 21:22:57
|
> Maybe we should use some random hashes, like
> `md5.new(str(random.random())).hexdigest()` ?
> I don't think that users will be handle with placeholders, in case if
> everything works fine.

We could do that, though it might be better to avoid any actual randomness by using a fixed seed:

    def reset_placeholders(self, seed=DEFAULT_SEED):
        self.seed = seed
        random.seed(seed)
        self.html_placeholder_prefix = STX + "%x" % (random.random()*1000000000)
        self.html_placeholder = self.html_placeholder_prefix + "%x" + ETX
        ...

and later generate HTML placeholders with:

    random.seed(self.seed + i)
    self.html_placeholder % (random.random()*1000000000)

Or something more like:

    def get_random_token(n=10):
        return "".join(["abcdefghijklmnopqrstuvwxyz"[int(random.random()*26)]
                        for i in range(n)])

    def reset_placeholders(self, seed=DEFAULT_SEED):
        random.seed(seed)
        self.html_placeholder_prefix = STX + get_random_token()
        self.html_placeholder = self.html_placeholder_prefix + "%s" + ETX

- yuri

--
http://sputnik.freewisdom.org/
|
From: Artem Y. <ne...@gm...> - 2008-08-24 19:58:35
|
Yuri Takhteyev wrote:
[snip]
> def reset_placeholders(seed = DEFAULT_SEED) :
> random.seed(seed)
> self.html_placeholder_prefix = STX + get_random_token()
> self.html_placeholder = self.html_placeholder_prefix + "%s" + ETX
[snip]

I agree, using a seed is the better idea. What about something like this:

    RND_SEED = 789654

    def getRandomToken(n=10, seed=RND_SEED):
        random.seed(seed)
        return "".join([chr(random.randint(97, 122)) for i in range(n)])

    STX = u'\u0002'  # Use STX ("Start of text") for start-of-placeholder
    ETX = u'\u0003'  # Use ETX ("End of text") for end-of-placeholder
    HTML_PLACEHOLDER_PREFIX = STX + getRandomToken()
    HTML_PLACEHOLDER = HTML_PLACEHOLDER_PREFIX + "%s" + ETX
    INLINE_PLACEHOLDER_PREFIX = STX + getRandomToken()
    INLINE_PLACEHOLDER = INLINE_PLACEHOLDER_PREFIX + "%s" + ETX

Or maybe wrap it in a class, as you did.
|
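One caveat worth flagging in Artem's helper: because it reseeds the generator with the same seed on every call, every call returns the identical token, so the html and inline prefixes would collide with each other. A variant that seeds once and lets subsequent calls continue the sequence (still only a sketch, not project code) keeps tokens reproducible across runs while giving each prefix its own value:

```python
import random

RND_SEED = 789654

def get_random_token(n=10):
    # Assumes random.seed() was called once beforehand; each call
    # then draws the next n letters from the same deterministic
    # sequence. chr(97)..chr(122) is 'a'..'z'.
    return "".join(chr(random.randint(97, 122)) for _ in range(n))

random.seed(RND_SEED)  # seed once, not inside the helper
html_token = get_random_token()
inline_token = get_random_token()
print(html_token, inline_token)  # same pair on every run
```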
From: Waylan L. <wa...@gm...> - 2008-09-03 15:31:51
|
On Wed, Sep 3, 2008 at 5:09 AM, Yuri Takhteyev <qar...@gm...> wrote:
[snip]
>> Therefore, I propose that we select 2 strings of random chars (using
>> whatever method you desire) and **hardcode** those 2 strings into
>> markdown.py.
[snip]
>
> That's exactly what I suggested in my first email to this thread. :)

And that's exactly what I plan to use once I get the whitespace issues worked out (I think I got it last night; it just needs a little more testing).

--
----
Waylan Limberg
wa...@gm...
|
From: Waylan L. <wa...@gm...> - 2008-09-03 18:56:46
|
This is done [1]. Btw, if anyone is concerned that the diff removed AtomicString: it was being defined twice for some reason, so I removed one instance.

[1]: http://gitorious.org/projects/python-markdown/repos/mainline/commits/c26816e831df0f8123cd24bd72f352f9f3909ce6

On Wed, Sep 3, 2008 at 11:31 AM, Waylan Limberg <wa...@gm...> wrote:
[snip]
> And that's exactly what I plan to use once I got the whitespace issues
> work out (I think I got it last night - just need a little more
> testing).

--
----
Waylan Limberg
wa...@gm...
|
From: Yuri T. <qar...@gm...> - 2008-08-25 18:19:41
|
> def getRandomToken(n=10, seed=RND_SEED):
> random.seed(seed)
> return "".join([chr(random.randint(97, 122)) for i in range(n)])

Personally, I tend to use a static string with all the letters in cases like this, just to save the reader from having to remember or guess which characters lie between chr(97) and chr(122). Sometimes a "dumber" solution is just easier to understand. Either way, there should be no magic numbers in the code, so if we want to use chr() then the proper thing to do would be to define constants:

    LOWERCASE_ASCII_A = 97
    LOWERCASE_ASCII_Z = 122

and then later:

    return "".join([chr(random.randint(LOWERCASE_ASCII_A, LOWERCASE_ASCII_Z))
                    for i in range(n)])

And then suddenly "abcdefghijklmnopqrstuvwxyz"[random.randint(0, 25)] doesn't look so verbose. :)

- yuri

--
http://sputnik.freewisdom.org/
|
From: Artem Y. <ne...@gm...> - 2008-08-25 22:54:39
|
Yuri Takhteyev wrote:
[snip]
> And then suddenly "abcdefghijklmnopqrstuvwxyz"[random.randint(1,26)]
> doesn't look so verbose. :)

Yes, in terms of readability your variant looks better; let's decide in favour of it.
|
From: David W. <wo...@cs...> - 2008-08-25 20:54:28
|
On 25-Aug-08, at 3:19 PM, Yuri Takhteyev wrote:
[snip]
> return "".join([chr(random.randint(LOWERCASE_ASCII_A,
> LOWERCASE_ASCII_Z)) for i in range(n)])
> And then suddenly "abcdefghijklmnopqrstuvwxyz"[random.randint(1,26)]
> doesn't look so verbose. :)

Now, if we're going to be arguing about generating universally unique identifiers... I feel like I should chime in with one module I've been making heavy use of recently: uuid!

    >>> from uuid import uuid4
    >>> uuid4().hex
    '6908a7924f4d46bf915da19876c93368'
    >>>
|
From: Yuri T. <qar...@gm...> - 2008-08-25 21:17:22
|
> Now, if we're going to be arguing about generating universally unique
> identifiers... I feel like I should chime in with one module I've been
> making heavy use of recently: uuid!

This is only included with Python 2.5 and later. We _could_ do what we do now with other things, i.e., wrap it and fall back on something else if the uuid module cannot be imported. Or we could tell people to install an extra uuid module if they are using older versions of Python. The question is: do we get any benefit from real uniqueness?

(Before we get back into the question of whether anyone still uses old versions of Python: Django can run on 2.3, and as long as they can, I think so should we.)

- yuri

--
http://sputnik.freewisdom.org/
|
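The wrap-and-fall-back approach Yuri mentions might look like the sketch below. This is hypothetical: `make_token` is an invented name, and the fallback branch (a random 32-letter string) is just one possible substitute, not what the project adopted.

```python
try:
    # uuid is in the standard library from Python 2.5 onward.
    import uuid

    def make_token():
        return uuid.uuid4().hex  # 32 hex characters

except ImportError:
    # Older Pythons without uuid: fall back to a random token of
    # the same length built from lowercase letters.
    import random

    def make_token():
        return "".join(random.choice("abcdefghijklmnopqrstuvwxyz")
                       for _ in range(32))

print(len(make_token()))  # 32
```

Whether the real uniqueness of uuid4 buys anything here is exactly the question Yuri raises; the fallback is "unique enough" for placeholder purposes either way.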
From: Artem Y. <ne...@gm...> - 2008-08-25 22:56:00
|
Yuri Takhteyev wrote:
[snip]
> This is only included with python2.5. We _could_ do something like
> what we do now with other things, i.e., wrap it and fall back on
> something else if uuid module cannot be imported.
[snip]

Yes, I agree, it's not necessary here. But thanks, David, I didn't know about it before; I used to use md5 for such things.
|
From: David W. <wo...@cs...> - 2008-08-26 13:27:13
|
On 25-Aug-08, at 7:55 PM, Artem Yunusov wrote:
> Yes, I agree, it's not necessary here. But thanks David, I didn't know
> about it before, and used to use md5 for such things.

Ah, well, I'm glad I could be helpful anyway :)

And, re: more readable:
> "abcdefghijklmnopqrstuvwxyz"[random.randint(1,26)]
I agree, that's pretty nice :)
|
From: Yuri T. <qar...@gm...> - 2008-09-03 09:09:17
|
> Therefore, I propose that we select 2 strings of random chars (using
> whatever method you desire) and **hardcode** those 2 strings into
> markdown.py. That way, on each import (each page view in the above
> scenario) the placeholder strings will be the same and debugging will
> be consistent.
[snip]

That's exactly what I suggested in my first email to this thread. :)

- yuri

--
http://sputnik.freewisdom.org/
|