From: David G. <go...@py...> - 2003-07-13 01:03:25
Kirill Lapshin wrote:
> Ok, there was quite a lot of interest to the topic, which proves
> that it is rather important feature.

It proves that the topic is controversial. :-)  It may be that there
is no universally-acceptable solution, or that we haven't found it
yet. It could be that the best solution is, as Beni put it, "get a
unicode editor".

> Let me try to summarize and elaborate.

Thanks, this is useful.

> Users ...
> 3. Those who *really* care.
>
> The third group will prefer to do it manually, unless automatic
> procedure is 100% accurate, which is quite hard to achieve. On the
> other hand I would expect that this group is rather small.

... because those people probably wouldn't be using Docutils to begin
with.

> if the conversion is activated by default).

I would be very wary of activating such conversion "by default". That
would be surprising behavior (see item 4, "Unsurprising", of
<http://docutils.sf.net/spec/rst/introduction.html#goals>). Authors
who *do* use Unicode editors (or editors that support some superset of
ASCII) shouldn't be forced to accept unnecessary text conversions.

> Conversion Spec ...
> I would say that we don't need much flexibility on
> reST side, rather we need presentation flexibility.

Please explain what you mean by "flexibility on reST side" and
"presentation flexibility".

> In other words users should not be able to decide whether "--" is
> ndash or mdash.

Why not?

> This should be fixed (preferably in reST spec). Otherwise we'll end
> up having too many reST flavors.

I'm not convinced that the text transforms should be fixed in the
spec, and I don't see how "too many reST flavors" would follow.

> However, if user want to input two dashes in regular text, s/he
> should be able to do so, by either escaping or using entities, or
> whatever.

If the author isn't aware of text conversion taking place, they
wouldn't know to escape text. That's why it has to be explicitly
"installed".

> For instance consider following text: "It was a 3.5" floppy
> disk". Here the quote in the middle is actually not a quote but
> rather an inch symbol. Should it be converted to |Prime|?

Which character is correct for "inch"? Is it ″, or are straight
double quotes (") or right double quotes (”) correct? I can't find a
definitive reference.

> ... our approach should be flexible enough to allow people to write
> some fancy fuzzy logic algos or whatsoever.

This is getting overly complex. I could imagine allowing regular
expressions as search patterns, but I think arbitrary code is going
too far.

> Let's revise rules one more time. We'll say that quote following a
> number is a Prime, but user can use |aquote| (automatic quote) or
> |alquote|/|arquote| (automatic left/right quote) to explicitly tell
> the role of the symbol.

How exactly would the user do this? Please give examples.

> How often a real quote is following a digit? Probably not often at
> all. It is Ok to let user take care of such exceptions.

If handling such exceptions (which will happen, more often than you
think) is too painful, it may outweigh the usefulness of the new
functionality.

> Current proposed set of rules:
>
>   (?<!-)--(?!-)               - ndash
>   (?<!-)---(?!-)              - mdash
>   (?<!\w)"(?=\w)              - left quote
>   (?<=\w)(?<!\d)"(?!\w)       - right quote
>   (?<=\d)"                    - Prime
>   (?<=\d)'                    - prime
>   (?<!\.)\.{3}(?!\.)          - hellip
>   (?<!\w)\([cC]\)(?!\w)       - copy
>   (?<!\w)\([rR]\)(?!\w)       - reg
>   (?<!\w)\([tT][mM]\)(?!\w)   - TM

These are useful, but some are debatable, at least. If these are
implemented in code, the code will have to be updated frequently,
because once this functionality is enabled, new cases are inevitable.
That's why I think that a data-driven approach would be much better.

> These transforms are not applied to literal blocks, shell block,
> python doctest blocks, left column of option lists.

Among others. :-)  Let's call these "markup and literal contexts"
rather than listing every case.

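To make the data-driven idea concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions, not
Docutils code: the names ``RULES`` and ``apply_rules`` are
hypothetical, the patterns are transcribed from the proposed rules
quoted above, and the replacements are written as Unicode characters
rather than substitution references. It makes no attempt to skip
markup and literal contexts.

```python
import re

# Hypothetical rule table, transcribed from the proposed rules quoted
# above. Each entry is (regex pattern, replacement character), so
# adding or dropping a rule means editing data, not code.
RULES = [
    (r'(?<!-)---(?!-)', '\u2014'),             # mdash
    (r'(?<!-)--(?!-)', '\u2013'),              # ndash
    (r'(?<!\w)"(?=\w)', '\u201c'),             # left double quote
    (r'(?<=\w)(?<!\d)"(?!\w)', '\u201d'),      # right double quote
    (r'(?<=\d)"', '\u2033'),                   # Prime (double prime)
    (r"(?<=\d)'", '\u2032'),                   # prime
    (r'(?<!\.)\.{3}(?!\.)', '\u2026'),         # hellip
    (r'(?<!\w)\([cC]\)(?!\w)', '\u00a9'),      # copy
    (r'(?<!\w)\([rR]\)(?!\w)', '\u00ae'),      # reg
    (r'(?<!\w)\([tT][mM]\)(?!\w)', '\u2122'),  # TM
]


def apply_rules(text, rules=RULES):
    """Apply each (pattern, replacement) pair, in order, to plain text.

    Assumption: the caller passes only ordinary body text; this sketch
    does not know about literal blocks or other markup contexts.
    """
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text
```

Calling ``apply_rules('It was a 3.5" floppy disk')`` turns the quote
after the digit into a double prime, which is exactly the kind of
debatable outcome discussed above. Because the rules live in a plain
list, a document or configuration file could in principle supply its
own table without any change to the code.
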
> Implementation
> ==============
>
> I am going to implement a reST preprocessor for now.

That's a good idea for prototyping.

> Customizations
> ==============
>
> David suggests specifying transforms via directives in the reST
> file, e.g.
>
>   .. text-replace:: "--" "|ndash|"
>   .. text-replace:: ' "' ' |ldquo|'
>
> As I was pointing out above the conversion rules have to be more
> sophisticated, otherwise we'll get too many false conversions.

The "text-replace" directive could support regexps, or there could be
a complementary "regexp-replace" directive also.

> However he has a good point, it makes sense to give user opportunity
> to control conversions in reST file.

I think that in-the-document is *the* place to control conversions.

> How about something along these lines:
>
>   .. text-replace:: "en" //activate standard english style conversions
>      :lquote: "|lsquo|" //use single quotes instead of double
>      :rquote: "|rsquo|" //use single quotes instead of double
>      :tm: "off" //turn off (TM) conversion

I don't think so. Such a beast would quickly become a complex
monster. Instead, I would suggest something like this::

    .. include:: english-text-transforms.txt

IOW, a document explicitly loads the text transforms that it intends
to use. Flexibility through simplicity.

I had an idea though (just added to the to-do list):

    * Add an "--include file" command-line option (config setting
      too?), equivalent to ".. include:: file" as the first line of
      the doc text? Especially useful for character entity sets, text
      transform specs, boilerplate, etc.

In addition, if a text conversion system is implemented, a new
command-line option should be added, to turn it *off*.

> Note that some styles use different quote symbols for embedded
> quotes. I.e. in russian there are two quote styles <<this>> and
> ,,that''. If outer quotes are <<...>> then inner have to be
> ,,..''.

Are you suggesting that ``"`` be converted to ``<<``/``>>`` or
``,,``/``''`` depending on language and context?

If so, I disagree strongly. If you're writing in Russian, you should
be using proper Russian quotation marks. Text transforms could
convert ``<<`` to « etc., but that's as far as the transform
intelligence should go.

If not, what are you suggesting? :-)

> So it makes sense to expose at least two levels of left/right quotes
> for customization. I can hardly imagine text which has more than two
> levels of nesting quotes, so we should not probably care about this
> one.

I can imagine such text. It's dangerous to place arbitrary limits.
However, it shouldn't matter here because these text transforms should
be very limited in terms of state. IOW, I don't think that the
transforms should know or care about the text's nesting level.

> Question: should we report error if there are some unbalanced
> quotes?

No. That's going too far. Quotation marks are not markup. If a text
transformation system is implemented, it must not cause any errors.
It must ignore anything that looks like garbage, because it may not
be garbage.

**********

After all this, don't you think it would be easier just to use a
Unicode-aware editor?

-- 
David Goodger    http://starship.python.net/~goodger
For hire: http://starship.python.net/~goodger/cv
Docutils: http://docutils.sourceforge.net/
(includes reStructuredText: http://docutils.sf.net/rst.html)