Re: [Docutils-users] automatic typography

Kirill Lapshin wrote:
 > Ok, there was quite a lot of interest in the topic, which proves
 > that it is a rather important feature.

It proves that the topic is controversial. :-)  It may be that there
is no universally-acceptable solution, or that we haven't found it
yet.  It could be that the best solution is, as Beni put it, "get a
unicode editor".

 > Let me try to summarize and elaborate.

Thanks, this is useful.

 > Users
...
 > 3. Those who *really* care.
 >
 > The third group will prefer to do it manually, unless automatic
 > procedure is 100% accurate, which is quite hard to achieve. On the
 > other hand I would expect that this group is rather small.

... because those people probably wouldn't be using Docutils to begin
with.

 > if the conversion is activated by default).

I would be very wary of activating such conversion "by default".  That
would be surprising behavior (see item 4, "Unsurprising", of
<http://docutils.sf.net/spec/rst/introduction.html#goals>).  Authors
who *do* use Unicode editors (or editors that support some superset of
ASCII) shouldn't be forced to accept unnecessary text conversions.

 > Conversion Spec
...
 > I would say that we don't need much flexibility on
 > reST side, rather we need presentation flexibility.

Please explain what you mean by "flexibility on reST side" and
"presentation flexibility".

 > In other words users should not be able to decide whether "--" is
 > ndash or mdash.

Why not?

 > This should be fixed (preferably in reST spec). Otherwise we'll end
 > up having too many reST flavors.

I'm not convinced that the text transforms should be fixed in the
spec, and I don't see how "too many reST flavors" would follow.

 > However, if user want to input two dashes in regular text, s/he
 > should be able to do so, by either escaping or using entities, or
 > whatever.

If the author isn't aware of text conversion taking place, they
wouldn't know to escape text.  That's why it has to be explicitly
"installed".

 > For instance consider following text: "It was a 3.5" floppy
 > disk". Here the quote in the middle is actually not a quote but
 > rather an inch symbol. Should it be converted to |Prime|?

Which character is correct for "inch"?  Is it &Prime;, or are straight
double quotes (&quot;) or right double quotes (&rdquo;) correct?  I
can't find a definitive reference.

 > ... our approach should be flexible enough to allow people to write
 > some fancy fuzzy logic algos or whatsoever.

This is getting overly complex.  I could imagine allowing regular
expressions as search patterns, but I think arbitrary code is going
too far.

 > Let's revise rules one more time. We'll say that quote following a
 > number is a Prime, but user can use |aquote| (automatic quote) or
 > |alquote|/|arquote| (automatic left/right quote) to explicitly tell
 > the role of the symbol.

How exactly would the user do this?  Please give examples.

 > How often a real quote is following a digit?  Probably not often at
 > all. It is Ok to let user take care of such exceptions.

If handling such exceptions (which will happen, more often than you
think) is too painful, it may overwhelm the usefulness of the new
functionality.
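
To illustrate how easily such an exception turns up, here is a minimal
sketch that applies only the "quote after a digit is a Prime" rule.
The function name and the choice of U+2033 for Prime are my own
assumptions, not part of any existing code:

```python
import re

# Hypothetical stand-alone version of one proposed rule: a straight
# double quote directly after a digit becomes a double prime (U+2033).
PRIME_RULE = re.compile(r'(?<=\d)"')

def apply_prime_rule(text):
    """Replace every digit-preceded double quote with U+2033."""
    return PRIME_RULE.sub("\u2033", text)

# Intended case: the inch mark after "3.5" is converted.
inch = apply_prime_rule('It was a 3.5" floppy disk')

# Exception: the closing quote of a quoted number is converted too,
# leaving an unbalanced opening quote behind.
year = apply_prime_rule('The film "2001" opened in 1968')
```

Any quoted year, version number, or measurement ends the same way, so
the author has to notice and escape each one.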

 > Current proposed set of rules:
 >
 > (?<!-)--(?!-) - ndash
 > (?<!-)---(?!-) - mdash
 > (?<!\w)"(?=\w) - left quote
 > (?<=\w)(?<!\d)"(?!\w) - right quote
 > (?<=\d)" - Prime
 > (?<=\d)' - prime
 > (?<!\.)\.{3}(?!\.) - hellip
 > (?<!\w)\([cC]\)(?!\w) - copy
 > (?<!\w)\([rR]\)(?!\w) - reg
 > (?<!\w)\([tT][mM]\)(?!\w) - TM

These are useful, but some are debatable at least.  If these are
implemented in code, the code will have to be updated frequently
because once this functionality is enabled, new cases are inevitable.
That's why I think that a data-driven approach would be much better.
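
As a rough sketch of what "data-driven" could mean, the rules above
can live in a plain table that a small, stable loop applies in order.
The patterns are copied from the list; the replacement characters are
my reading of the entity names, and ``RULES`` and ``apply_rules`` are
hypothetical names:

```python
import re

# (pattern, replacement) pairs; the table is data and could just as
# well be loaded from a file that the document names explicitly.
RULES = [
    (r'(?<!-)---(?!-)', '\u2014'),           # mdash
    (r'(?<!-)--(?!-)', '\u2013'),            # ndash
    (r'(?<!\w)"(?=\w)', '\u201c'),           # left quote
    (r'(?<=\w)(?<!\d)"(?!\w)', '\u201d'),    # right quote
    (r'(?<=\d)"', '\u2033'),                 # Prime
    (r"(?<=\d)'", '\u2032'),                 # prime
    (r'(?<!\.)\.{3}(?!\.)', '\u2026'),       # hellip
    (r'(?<!\w)\([cC]\)(?!\w)', '\u00a9'),    # copy
    (r'(?<!\w)\([rR]\)(?!\w)', '\u00ae'),    # reg
    (r'(?<!\w)\([tT][mM]\)(?!\w)', '\u2122'),  # TM
]

COMPILED = [(re.compile(pattern), repl) for pattern, repl in RULES]

def apply_rules(text, rules=COMPILED):
    """Apply each rule in table order to the whole text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

Adding or dropping a case then means editing one line of the table,
not the code that applies it.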

 > These transforms are not applied to literal blocks, shell block,
 > python doctest blocks, left column of option lists.

Among others. :-)  Let's call these "markup and literal contexts"
rather than listing every case.
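
For a preprocessor prototype, a minimal sketch of skipping one such
context (inline literals only) might look like this.  The names are
hypothetical, and a real implementation would probably have to work on
the parsed document rather than on raw text to cover all the other
contexts:

```python
import re

# Inline literals: double backquotes around non-empty text.
INLINE_LITERAL = re.compile(r'(``.+?``)')

def transform_outside_literals(text, transform):
    """Apply `transform` only to the text between inline literals."""
    parts = INLINE_LITERAL.split(text)
    # Because the pattern has one capture group, the odd-indexed
    # items of `parts` are the literals; they are passed through
    # unchanged.
    return ''.join(
        part if index % 2 else transform(part)
        for index, part in enumerate(parts)
    )

def to_ndash(text):
    """Example transform: replace every double hyphen with an en dash."""
    return text.replace('--', '\u2013')

result = transform_outside_literals('use ``--help`` -- please', to_ndash)
```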

 > Implementation
 > ==============
 >
 > I am going to implement a reST preprocessor for now.

That's a good idea for prototyping.

 > Customizations
 > ==============
 >
 > David suggests specifying transforms via directives in the reST
 > file, e.g.
 >
 > .. text-replace:: "--"  "|ndash|"
 > .. text-replace:: ' "'  ' |ldquo|'
 >
 > As I was pointing out above the conversion rules have to be more
 > sophisticated, otherwise we'll get too many false conversions.

The "text-replace" directive could support regexps, or there could be
a complementary "regexp-replace" directive also.
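
A minimal sketch of how a prototype could read both forms, assuming
the two-quoted-argument syntax from the example above.  Neither
directive exists yet, and ``parse_replace_directive`` is a made-up
name:

```python
import re
import shlex

# Matches the hypothetical "text-replace" and "regexp-replace" lines.
DIRECTIVE = re.compile(r'^\.\. (text|regexp)-replace::\s+(.*)$')

def parse_replace_directive(line):
    """Return (compiled_pattern, replacement), or None for other lines."""
    match = DIRECTIVE.match(line)
    if match is None:
        return None
    kind, arguments = match.groups()
    # shlex handles both quoting styles used in the examples.
    search, replacement = shlex.split(arguments)
    if kind == 'text':
        # Plain text is matched literally, so escape regexp syntax.
        search = re.escape(search)
    return re.compile(search), replacement

plain = parse_replace_directive('.. text-replace:: "--"  "|ndash|"')
regexp = parse_replace_directive(
    '.. regexp-replace:: "(?<!-)--(?!-)" "|ndash|"')
```

With the literal form, ``---`` would be caught by the ``--`` rule;
the regexp form can exclude it.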

 > However he has a good point, it makes sense to give user opportunity
 > to control conversions in reST file.

I think that in-the-document is *the* place to control conversions.

 > How about something along these lines:
 >
 > .. text-replace:: "en"   //activate standard english style conversions
 >     :lquote: "|lsquo|" //use single quotes instead of double
 >     :rquote: "|rsquo|" //use single quotes instead of double
 >     :tm: "off" //turn off (TM) conversion

I don't think so.  Such a beast would quickly become a complex
monster.  Instead, I would suggest something like this::

     .. include:: english-text-transforms.txt

IOW, a document explicitly loads the text transforms that it intends
to use.  Flexibility through simplicity.

I had an idea though (just added to the to-do list):

* Add an "--include file" command-line option (config setting too?),
   equivalent to ".. include:: file" as the first line of the doc text?
   Especially useful for character entity sets, text transform specs,
   boilerplate, etc.

In addition, if a text conversion system is implemented, a new
command-line option should be added, to turn it *off*.

 > Note that some styles use different quote symbols for embedded
 > quotes.  I.e. in russian there are two quote styles <<this>> and
 > ,,that''. If outer quotes are <<...>> then inner have to be
 > ,,..''.

Are you suggesting that ``"`` be converted to ``<<``/``>>`` or
``,,``/``''`` depending on language and context?  If so, I disagree
strongly.  If you're writing in Russian, you should be using proper
Russian quotation marks.  Text transforms could convert ``<<`` to
&laquo; etc., but that's as far as the transform intelligence should
go.

If not, what are you suggesting? :-)

 > So it makes sense to expose at least two levels of left/right quotes
 > for customization. I can hardly imagine text which has more than two
 > levels of nesting quotes, so we should not probably care about this
 > one.

I can imagine such text.  It's dangerous to place arbitrary limits.
However, it shouldn't matter here because these text transforms should
be very limited in terms of state.  IOW, I don't think that the
transforms should know or care about the text's nesting level.

 > Question: should we report error if there are some unbalanced
 > quotes?

No.  That's going too far.  Quotation marks are not markup.  If a
text transformation system is implemented, it must not cause any
errors.  It must ignore any garbage it sees; it may not be garbage.

**********

After all this, don't you think it would be easier just to use a
Unicode-aware editor?

-- 
David Goodger    http://starship.python.net/~goodger
For hire: http://starship.python.net/~goodger/cv
Docutils: http://docutils.sourceforge.net/
(includes reStructuredText: http://docutils.sf.net/rst.html)