From: Jeff D. <da...@da...> - 2000-07-17 16:47:27
Attachments:
new_wiki_transform.php3
|
Here's a current snapshot of my thoughts on the new transform code. This is currently in the form of a drop-in replacement for wiki_transform. However, if I were to insert this into PhpWiki now, most of it would go into wiki_stdlib, some would go into new custom-feature module files, and only a skeleton would remain in wiki_transform.

Here are some random thoughts, in order of increasing entropy:

Currently this only implements wiki_transform. However, it should be clear that class WikiRenderer can also be used as the basis for a modular replacement for GeneratePage().

The main thing that I'm not completely happy with (and which is not yet complete) is how the order of the WikiTransforms is specified. (It is clear that some sort of 'order' or 'precedence' parameter is required --- that's easy, I just haven't done it yet.) The hard part is handling the following issues in an efficient, clean, clear way (these issues are handled by this snapshot, but I'm not sure I'm happy with the implementation):

o Some transforms are "final". When they are matched, they terminate the page rendering.
o Some transforms (might) need to be applied repeatedly. Consider constructs like "__''bold-italic''__".

Another issue is that putting the logic to handle these details into (what is now) the inner loop (over transforms) is slow. I think I'll try reversing the order of the loops (e.g. make the loop over lines the inner loop) and see if that helps.

Comments welcome.
Jeff |
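The architecture Jeff describes --- an ordered list of transform objects applied line by line, with "final" transforms that stop rendering --- can be sketched roughly as follows. This is an illustrative Python sketch, not the actual PHP code; the class and field names (WikiTransform, precedence, final) are guesses at the design under discussion, not PhpWiki's real API.

```python
import re

class WikiTransform:
    """One markup rule. Field names (precedence, final) are guesses
    at the design under discussion, not PhpWiki's actual API."""
    def __init__(self, pattern, replacement, precedence=50, final=False):
        self.regexp = re.compile(pattern)
        self.replacement = replacement
        self.precedence = precedence  # lower values run first
        self.final = final            # a match ends rendering of this line

class WikiRenderer:
    def __init__(self):
        self.transforms = []

    def register(self, transform):
        # sort once at registration time, not once per rendered line
        self.transforms.append(transform)
        self.transforms.sort(key=lambda t: t.precedence)

    def render_line(self, line):
        for t in self.transforms:
            line, n = t.regexp.subn(t.replacement, line)
            if n and t.final:  # a "final" transform matched: stop here
                break
        return line

r = WikiRenderer()
r.register(WikiTransform(r"^-{4,}$", "<hr>", precedence=10, final=True))
r.register(WikiTransform(r"__(.*?)__", r"<strong>\1</strong>"))
print(r.render_line("----"))             # <hr>
print(r.render_line("__strong__ text"))  # <strong>strong</strong> text
```

The sort-at-registration detail is one way to keep the per-line loop cheap, which is the concern Jeff raises about the inner loop.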
From: Arno H. <aho...@in...> - 2000-07-18 18:56:09
|
Jeff, I had a look at your new wiki_transform. Overall impressive work. In some places it seems a little bit awkward; actually, I had trouble understanding how it works at first.

I'm not sure I like the split of the transform objects into groups. The distinction final/normal/recursive seems necessary, but I'm sure it can be solved in a different way. See below (we can do away with recursive tokenization, and the final/normal distinction can be dealt with by one easy if-clause in render_line() instead of having groups and two different loops).

Random thoughts:

- The class interface (functions and variables) looks ok. Some functions will have to be added in order to make it usable for extracting links when using it from wiki_savepage (e.g. some way to access the array in WikiTokenizer()).

- The regexps are too complex in some places (which makes the overall rendering slower than necessary). Take for example:

  /__ ( (?: [^_] (?:[^_]|_[^_])* )? ) __/x

  which renders __strong__ emphasis. Apparently this regexp ensures two things: no "_" after "__" and no "__" inside the emphasis. How about /__(.*?)__/ instead? ".*?" is non-greedy, and thus "__" cannot appear inside the emphasis. Also, why forbid "_" after "__"? In your case "___text__" is rendered as "_<strong>text</strong>"; in my case it's rendered as "<strong>_text</strong>". What's the difference?

- Also, I don't think all that "(?" extended regex syntax is really necessary. It may be in some places, where it's important to have a proper \0 match-value. But in all other places it adds complexity without any benefit (and makes the regexp slower, no?)

- Ok, I don't like the groups. But groupTransforms() is plain ugly. I understand that this stems from your goal of combining as many transforms into a single $group as possible. I don't understand the benefit of this approach - the only difference is that the inner loops of render_line() are executed more often than the outer for-loop. So what?
- Maybe you are trying too hard with the concept of tokenizing a line. E.g. is it really necessary to tokenize emphasis like "__" and "'''"/"''"? Why not generate the HTML directly (<strong><b><i>)? All you have to do is make sure that later transforms don't mess with the inserted HTML tags. By ordering the transforms (as you plan to do anyway) this can be achieved easily. This would also solve your problem of recursive transforms. Take the easy route first. If we ever come across a markup that requires recursive stuff, then we can add recursive transforms. Right now I don't see the need for them.

So my suggestions:

- get rid of groups - implement a priority sort order instead
- get rid of recursive markup - right now it's only needed for emphasis. Insert the HTML tags instead.
- final transforms can be dealt with by one if-clause like: if($t->final) break;
- make your regexps simpler. And if one regexp becomes too complex, split it into two transforms.

Again, very promising start. Good work.

/Arno |
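Arno's claim about the two regexps is easy to check directly. A quick sketch, using Python's `re` for convenience (the originals are PHP preg_* patterns; the whitespace from the /x modifier has been stripped out of the complex one):

```python
import re

# Jeff's original pattern (with the /x modifier's whitespace removed)
complex_re = re.compile(r"__((?:[^_](?:[^_]|_[^_])*)?)__")
# Arno's proposed non-greedy replacement
simple_re = re.compile(r"__(.*?)__")

def strong(rx, line):
    return rx.sub(r"<strong>\1</strong>", line)

# Both agree on the common case:
print(strong(complex_re, "__text__"))  # <strong>text</strong>
print(strong(simple_re, "__text__"))   # <strong>text</strong>

# They differ only when a "_" directly follows the opening "__":
print(strong(complex_re, "___text__"))  # _<strong>text</strong>
print(strong(simple_re, "___text__"))   # <strong>_text</strong>
```

As Arno says, the only observable difference is where the stray underscore lands; neither rendering is obviously more correct than the other.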
From: Jeff D. <da...@da...> - 2000-07-18 20:10:43
|
In message <147...@da...>, Arno Hollosi writes:

>I had a look at your new wiki_transform.
>Overall impressive work.

Thanks!

>- the class interface (functions and variables) looks ok.
>  Some functions will have to be added in order to make it usable for
>  extracting links when using it from wiki_savepage.
>  (e.g. some way to access the array in WikiTokenizer())

It's already there (mostly). $page_renderer->wikilinks[] gets populated with all the WikiLinks found in the page. All that's needed is a somewhat more elegant API to get at the list.

>- the regexps are too complex in some places (which makes the
>  overall rendering slower than necessary):
>  Take for example: /__ ( (?: [^_] (?:[^_]|_[^_])* )? ) __/x
>  which renders __strong__ emphasis. Apparently this regexp ensures
>  two things: no "_" after "__" and no "__" inside the emphasis.
>  How about: /__(.*?)__/ instead? ".*?" is non-greedy and thus
>  "__" cannot appear inside the emphasis. Also, why forbid "_" after
>  "__"? In your case "___text__" is rendered as "_<strong>text</strong>";
>  in my case it's rendered as "<strong>_text</strong>". What's the
>  difference?

Okay, okay. So I'm paranoid. Yes, the regexps should be cleaned up. My guess is that (at least in most cases) the speed differences are negligible --- I readily admit the regexps could be more readable.

(Speaking of which: it would probably be possible to avoid the use of the Perl regexps altogether, in favor of PHP's ereg_'s. Is this worth considering? How many PHPs are out there without PCRE support?)

>- Also, I don't think that all those "(?" extended regex syntax
>  is really necessary. It may be in some places, where it's important
>  to have a proper \0 match-value. But in all other places it adds
>  complexity without any benefit (and makes the regexp slower, no?)

Okay, okay already! :-)

>- Ok, I don't like the groups. But groupTransforms() is plain ugly.
>  I understand that this stems from your goal of combining as many
>  transforms into a single $group as possible. I don't understand
>  the benefit of this approach - the only difference is that the
>  inner loops of render_line() are executed more often than the
>  outer for-loop. So what?

The point was to do as much of the looping logic as possible (the grouping) only once, rather than once per line. It does make a speed difference. It is butt-ugly. I don't like it either.

>- Maybe you are trying too hard with the concept of tokenizing a
>  line. E.g. is it really necessary to tokenize emphasis like "__"
>  and "'''"/"''"? Why not generate the HTML directly (<strong><b><i>)?
>  All you have to do is make sure that later transforms don't mess
>  with the inserted HTML tags. By ordering the transforms (as you plan
>  to do anyway) this can be achieved easily. This would also solve
>  your problem of recursive transforms. Take the easy route first.
>  If we ever come across a markup that requires recursive stuff,
>  then we can add recursive transforms. Right now I don't see the
>  need for them.

The tokenization is not really necessary in all cases, but it is needed (I think) for the various links (or else the regexps get horrendous). If you accept that the tokenization code is needed, then it makes little difference (in complexity or time) whether <b>'s and <i>'s are tokenized or not. Tokenizing (I think) is safer --- less chance of some not-quite-completely-well-conceived custom WikiTransform mucking things up.

As for recursiveness: I don't really see how direct substitution of the HTML gets around the root of the problem. How do you deal with "__''Bold-italic'' and bold__" (or "''__Bold-italic__ and italic''")? (Or should we just punt on that?)

>So my suggestions:
>
>- get rid of groups - implement a priority sort order instead

Yes, we need some sort of priority sorting anyhow, so that the WikiTransforms don't have to be register()ed in a specific order. The groups stuff is there to deal with the recursable stuff --- you haven't yet convinced me that the recursable stuff is unnecessary.

>- get rid of recursive markup - right now it's only needed for
>  emphasis. Insert the HTML tags instead.

Again, I don't yet see how this helps.

>- final transforms can be dealt with by one if-clause like:
>  if($t->final) break;

Yes, that's the way I did it before I added the code to deal with the recursive stuff. (But then ''__Bold-italic__'' was broken.)

>- make your regexps simpler. And if one regexp becomes too
>  complex, split it into two transforms.

Okay, already! As a footnote, though: I'm pretty sure that in most cases one transform with a complex regexp is faster than two transforms with simple regexps.

Okay, so I guess my main counter-response is either:

a) Convince me that the recursable stuff really is not needed, or
b) Suggest a cleaner way to deal with the recursable stuff.

Jeff |
From: Arno H. <aho...@in...> - 2000-07-18 20:46:00
|
> (Speaking of which: it would probably be possible to avoid the use
> of the Perl regexps altogether, in favor of PHP's ereg_'s. Is this
> worth considering? How many PHPs are out there without PCRE support?)

Some Windows PHPs don't have the preg_* functions. You can do without them in most places, but there are some where you absolutely need them. So if there's no way around it, you can use them throughout.

> As a footnote though: I'm pretty sure that in most cases one transform
> with a complex regexp is faster than two transforms with simple regexps.

Point taken.

> The groups stuff is there to deal with the recursable stuff --- you haven't
> yet convinced me that the recursable stuff is unnecessary.

Ok, trying to convince you :o)

We need tokenization at least for links and such - that's for sure. But do we need it for emphasis markup and the like? Right now, recursive transforms are only used for '', ''', and __. Please correct me if I'm wrong.

Suppose the following line:

  "__Bold and ''bold italics''__"

Transforms are registered in this order:

1. __
2. '''
3. ''

Instead of tokenizing $line, you directly substitute the HTML into $line. So in step 1, $line is changed to:

  "<strong>Bold and ''bold italics''</strong>"

Step 2 does nothing, and step 3 executes without nesting (no tokens in $line):

  "<strong>Bold and <i>bold italics</i></strong>"

Voila :o)

If there's something like "Look at __WikiLink__" it becomes:

  "Look at __$token$__"
  "Look at <strong>$token$</strong>"
  "Look at <strong><a href="...">WikiLink</a></strong>"

Problem solved. Only use tokens where they are absolutely necessary. I don't see the need to tokenize emphasis markup or things like '%%%' and '^-{4,}'. By ensuring that transforms are executed in the right order, the freshly inserted HTML tags won't interfere with later transformations. E.g. it's important to do links and the '&<>' transform before doing the rest.

Did I convince you?

Sure, the new architecture is then a mixture of tokens and HTML-in-place - compared to your tokens-only approach. But it's much simpler - less complexity. And I don't think it's too ugly from a structural point of view either.

/Arno |
From: Jeff D. <da...@da...> - 2000-07-18 21:25:40
|
In message <147...@da...>, Arno Hollosi writes:

>Some Windows PHPs don't have the preg_* functions.
>You can do without them in most places, but there are some where you
>absolutely need them.

Not that I doubt you, but, out of curiosity: where?

>Instead of tokenizing $line, you directly substitute the HTML into $line.
>So in step 1, $line is changed to
>"<strong>Bold and ''bold italics''</strong>"
>Step 2 does nothing, and step 3 executes without nesting (no tokens
>in $line):
>"<strong>Bold and <i>bold italics</i></strong>"
>
>Voila :o)

Okay, I get it now. The one drawback I see offhand is that it's possible for (invalid?) wiki markup to generate invalid HTML. E.g.: "''__'' ''__''" becomes "<i><b></i> <i></b></i>". Perhaps we can live with that?

>Problem solved. Only use tokens where they are absolutely necessary.
>I don't see the need to tokenize emphasis markup or things like
>'%%%' and '^-{4,}'

Yes, you could tokenize the <br> and <hr> or not --- since the tokenizing mechanism is already in place (and must remain so for the links, at least), it really makes no difference in readability or complexity, and a negligible difference in run time.

My thinking was that by tokenizing anything containing HTML markup, the HTML is protected from being mangled by subsequent transforms. As long as each transform individually produces complete (and correct) HTML entities, the proper nesting of the final HTML output is guaranteed.

This helps to minimize the sensitivity to the ordering of the transforms. I view this as somewhat important, since it will make the writing of (well-behaved) transforms in (as yet unimagined) future extension modules simpler.

I suppose we could eliminate the recursable logic, while keeping the tokenization, by applying each of the currently recursed transformations twice:

1. Transform "''"s
2. Transform "'''"s
3. Transform "__"s
4. Transform "''"s again
5. Transform "'''"s again

This, I think, handles everything that your method does (while eliminating the possibility of invalid HTML output).

Jeff |
From: Steve W. <sw...@wc...> - 2000-07-18 22:05:48
|
On Tue, 18 Jul 2000, Jeff Dairiki wrote:

> >You can do without them in most places, but there are some where you
> >absolutely need them.
>
> Not that I doubt you, but, out of curiosity: where?

Oh, bugger... where was that? Arno's right though; there are places where the preg_* functions are the only solution.

> The one drawback I see offhand is that it's possible for (invalid?) wiki
> markup to generate invalid HTML.
>
> E.g.: "''__'' ''__''" becomes "<i><b></i> <i></b></i>".
>
> Perhaps we can live with that?

At some point you have to decide the user is sane and has some intelligence... we can concoct pathological situations all day and develop workarounds, but I don't think that would make for a fun project. :-)

> Yes, you could tokenize the <br> and <hr> or not --- since the tokenizing
> mechanism is already in place (and must remain so for the links, at least),
> it really makes no difference in readability or complexity, and a negligible
> difference in run time.

Probably true...

> My thinking was that by tokenizing anything containing HTML markup,
> the HTML is protected from being mangled by subsequent transforms.
> As long as each transform individually produces complete (and correct)
> HTML entities, the proper nesting of the final HTML output is guaranteed.
>
> This helps to minimize the sensitivity to the ordering of
> the transforms. I view this as somewhat important, since it will
> make the writing of (well-behaved) transforms in (as yet unimagined)
> future extension modules simpler.

I agree; in a way this is a variation on the argument for storing all links in a separate table and storing the pages in a semi-state. What will the long-term benefits be? In this case you could eliminate line-by-line processing entirely, but that would also require changes to the markup language. (For plain text, you'd have to have some substitute for the <pre> tag instead of indenting with spaces like we do now; lists would be a nightmare; and we'd reinvent HTML, something I've repeatedly told users I have no intention of doing.) (Implementing XHTML might be worthwhile, though. Mind you, I'm not suggesting this for 1.2 or even 1.4 (2.0?) but just speculating.)

> I suppose we could eliminate the recursable logic, while keeping the
> tokenization, by applying each of the currently recursed transformations
> twice:
>
> 1. Transform "''"s
> 2. Transform "'''"s
> 3. Transform "__"s
> 4. Transform "''"s again
> 5. Transform "'''"s again
>
> This, I think, handles everything that your method does (while eliminating
> the possibility of invalid HTML output).

Not having read the code yet, I'm not sure what the fuss is about... I did solve the whole issue of order-of-transformations in wiki_transform.php3 ages ago.

Also, being performance-minded is a good thing, but don't let it corner you into writing 10x the amount of code, or seriously complex code, just to gain small benefits. Wikis do not scale. Wikis cannot scale. They can grow a lot wider, but there is a low limit on how many people can edit a given topic before lost updates create confusion and frustration. Do not write bubble sorts; do not write loops that call external programs; but don't be afraid to use Perl regular expressions or to make deep copies of objects, because we have the room to do it.

sw
...............................ooo0000ooo.................................
Hear FM quality freeform radio through the Internet: http://wcsb.org/
home page: www.wcsb.org/~swain |
From: Arno H. <aho...@in...> - 2000-07-18 22:21:03
|
> >Some Windows PHPs don't have the preg_* functions.
> >You can do without them in most places, but there are some where you
> >absolutely need them.
>
> Not that I doubt you, but, out of curiosity: where?

The one place I can think of right now is the use of preg_match_all() in wiki_transform. Also, eregs don't have non-greedy matches. I can't remember which one, but I recall that there is at least one match which needs non-greediness.

> The one drawback I see offhand is that it's possible for (invalid?) wiki
> markup to generate invalid HTML.
>
> E.g.: "''__'' ''__''" becomes "<i><b></i> <i></b></i>".

This is indeed invalid HTML. But the other way around (with tokens), the inner '' will have no effect at all (effectively: <i><i></i><i>) if __ is processed before '', or it becomes "<i>__</i> <i>__</i>" if __ is processed after ''. So the actual behaviour is not immediately apparent from the markup but depends on the implementation. Not much difference.

> Perhaps we can live with [invalid HTML]?

I can, because the above case will not appear very often, will it?

> My thinking was that by tokenizing anything containing HTML markup,
> the HTML is protected from being mangled by subsequent transforms.
> As long as each transform individually produces complete (and correct)
> HTML entities, the proper nesting of the final HTML output is guaranteed.

A valid point.

> This helps to minimize the sensitivity to the ordering of
> the transforms. I view this as somewhat important, since it will
> make the writing of (well-behaved) transforms in (as yet unimagined)
> future extension modules simpler.

Ordering will always play a role. Though I have to agree that hiding the HTML removes one conflict point for those "yet unimagined" future extension modules.

By the way, as your FIXME states, the recursive logic does not work as advertised: "__''word''__" renders ok, but "''__word__''" is not rendered - instead, __ is inserted verbatim. Just looking at the code it becomes clear where the "fault" lies: you are always processing $line. Real recursion means processing the created tokens. (I guess you are aware of that already.) Oddly enough, replacing __ with ''' makes it work in both cases, but that is due to the regexp and not because of the recursion.

> I suppose we could eliminate the recursable logic, while keeping the
> tokenization, by applying each of the currently recursed transformations
> twice:
>
> 1. Transform "''"s
> 2. Transform "'''"s
> 3. Transform "__"s
> 4. Transform "''"s again
> 5. Transform "'''"s again

Apart from having to do ''' before '' (otherwise "'''word'''" becomes "'<i>word</i>'"), it does not immediately solve the problem: you need to transform the tokens, and not $line as you do right now.

So my conclusion is: recursion adds complexity (while having its benefits). Let's start with HTML-in-place right now, and once some time has passed and the dust has settled, we can do the recursion stuff - we will then have a better understanding of the issue. [Or you write a functioning and beautiful recursion right away ;o)]

/Arno |
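Arno's point --- that real recursion means transforming the created tokens, not just $line --- can be made concrete. Below is a speculative Python sketch (not PhpWiki code; the rule table, $n$ token syntax, and helper names are all illustrative) that keeps Jeff's tokenization but applies each emphasis rule to the accumulated tokens as well as to the line, with ''' ordered before '' as Arno requires:

```python
import re

# ''' must run before '' (otherwise "'''word'''" becomes "'<i>word</i>'")
RULES = [
    (re.compile(r"'''(.*?)'''"), "<b>", "</b>"),
    (re.compile(r"''(.*?)''"), "<i>", "</i>"),
    (re.compile(r"__(.*?)__"), "<strong>", "</strong>"),
]

def render_emphasis(line):
    tokens = []

    def apply(s, rx, opening, closing):
        def repl(m):
            tokens.append(opening + m.group(1) + closing)
            return "$%d$" % (len(tokens) - 1)
        return rx.sub(repl, s)

    for rx, opening, closing in RULES:
        line = apply(line, rx, opening, closing)
        # The crucial step: transform the tokens too, not just $line,
        # so markup hidden inside an earlier token still gets rendered.
        for i in range(len(tokens)):
            tokens[i] = apply(tokens[i], rx, opening, closing)

    # Expand tokens repeatedly, since tokens may reference other tokens.
    while re.search(r"\$\d+\$", line):
        line = re.sub(r"\$(\d+)\$", lambda m: tokens[int(m.group(1))], line)
    return line

print(render_emphasis("__''Bold-italic''__"))
# <strong><i>Bold-italic</i></strong>
print(render_emphasis("''__Bold-italic__''"))
# <i><strong>Bold-italic</strong></i>
```

With the token-processing step in place, both nestings from the thread render correctly, and each rule still only needs a single pass.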
From: Steve W. <sw...@wc...> - 2000-07-27 03:35:37
|
Just catching up on the week's posts...

On Tue, 18 Jul 2000, Jeff Dairiki wrote:

> 1. Transform "''"s
> 2. Transform "'''"s
> 3. Transform "__"s
> 4. Transform "''"s again
> 5. Transform "'''"s again
>
> This, I think, handles everything that your method does (while eliminating
> the possibility of invalid HTML output).

Hmm. Don't forget ''''' :-)

sw |
From: Steve W. <sw...@wc...> - 2000-07-18 21:39:08
|
On Tue, 18 Jul 2000, Arno Hollosi wrote:

> Sure, the new architecture is then a mixture of tokens and
> HTML-in-place - compared to your tokens-only approach.
> But it's much simpler - less complexity. And I don't think it's
> too ugly from a structural point of view either.

The minor drawback is that it's line-by-line processing, and if you want to have successive lines in italics in preformatted text, every line must start and end with:

  ''here is my preformatted text in italics''

Line-by-line processing is inherited from 1.0, which is how most Wikis do things.

Just a minor point,
sw |
From: Jeff D. <da...@da...> - 2000-07-18 22:11:20
|
In message <Pin...@bo...>, Steve Wainstead writes:

>The minor drawback is that it's line-by-line processing, and if you want
>to have successive lines in italics in preformatted text, every line must
>start and end with:
>
>  ''here is my preformatted text in italics''
>
>Line-by-line processing is inherited from 1.0, which is how most Wikis do
>things.

Do we want to get away from line-by-line processing? It can be done. Now's the time to do it. I think it might be faster that way, besides.

However, I kind of like the line-by-line processing. It keeps goofs in one line from hosing the whole page. |
From: Steve W. <sw...@wc...> - 2000-07-18 22:28:25
|
On Tue, 18 Jul 2000, Jeff Dairiki wrote:

> Do we want to get away from line-by-line processing?
> It can be done.
> Now's the time to do it.
> I think it might be faster that way, besides.
>
> However, I kind of like the line-by-line processing. It keeps goofs in one
> line from hosing the whole page.

True... in a browser it's easy to see where you goofed by using View Source; not quite as practical here, though.

Would search be impacted? Probably not... we can still iterate through lines by explode()ing the text. Would storage be impacted? Again, I think not... these are separated for a reason...

I don't know. I think it's not a necessary change right now, and it creates even more work because certain markup would have to be changed too. (Unless we're only talking about <b>, <i> and friends; then it's a minor point. To do all of them (<hr>, <pre>, etc.) is too major a change, especially for 1.2.)

Just thinking out loud again because I don't want to work on work,
sw |
From: Arno H. <aho...@in...> - 2000-07-18 22:35:20
|
> >Line-by-line processing is inherited from 1.0, which is how most Wikis do
> >things.
>
> Do we want to get away from line-by-line processing?

I don't. Keep the line-by-line approach. As you said: errors don't spill over into the rest of the page. That makes wiki more fun to experiment with.

/Arno |