Re: [Docutils-users] Dash-transformation

[Felix Wiemann]
 > With the current implementations, some documents are specifically
 > written for the LaTeX writer (because they rely on the
 > dash-transformation) and some are written specifically for the HTML
 > writer (because they rely on multiple dashes not to be transformed).

That's bad.

 > So we have a problem which needs to be solved.

Yes.  IMO, it's a bug that the LaTeX writer implicitly performs any
dash transformation at all.  It's a dangerous convenience.

 > A somewhat radical but nonetheless simple and effective solution
 > might be to deactivate the transformation in the LaTeX writer.

+1

 > However, then it should be possible to easily enter en-/em-dashes
 > with ASCII characters.
 >
 > * I'd suggest adding built-in substitution definitions for "|--|" to
 >   en-dash and "|---|" to em-dash.

I don't know about inserting a set of predefined substitution
definitions into the parser.  But we could certainly include a set of
substitution files in Docutils.  Then the author could do:

     .. include:: <dashes.txt>

See <http://docutils.sf.net/docs/dev/todo.html#misc.include>; more
below.

 > * And it would be necessary to write em-dashes without spaces around.

Are you saying that substitution references should not require any
delimiters?  That won't work.  Substitution references are like any
other reST inline markup; the start-string and end-string recognition
rules must apply in order to avoid ambiguity
(http://docutils.sf.net/docs/ref/rst/restructuredtext.html#inline-markup).

This is the best we can do right now:

$ quicktest.py
foo\ |---|\ bar
<document source="<stdin>">
     <paragraph>
         foo
         <substitution_reference refname="---">
             ---
         bar
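
To make the recognition rules concrete, here is a rough Python sketch
of the start-string and end-string checks for substitution references.
It is only an approximation built on assumptions: the character sets
are my reading of the inline markup section of the spec, the names
SUBST_REF and find_refs are made up for illustration, and the real
parser handles more cases (escapes, nesting, other markup).

```python
import re

# Approximation only: a start-string must begin the text or follow
# whitespace or one of a few opening/punctuation characters ...
_START_BEFORE = r"""(?:^|(?<=[\s'"(\[{<\-/:]))"""
# ... and an end-string must end the text or be followed by whitespace,
# closing punctuation, or a backslash (as in "\ ").
_END_AFTER = r"""(?=$|[\s'")\]}>\-/:.,;!?\\])"""

# The reference text itself may not start or end with whitespace.
SUBST_REF = re.compile(
    _START_BEFORE + r"\|([^|\s](?:[^|]*[^|\s])?)\|" + _END_AFTER
)


def find_refs(text):
    """Return the names of substitution references recognized in text."""
    return [match.group(1) for match in SUBST_REF.finditer(text)]


print(find_refs("foo |---| bar"))      # delimited by spaces
print(find_refs(r"foo\ |---|\ bar"))   # delimited by escaped spaces
print(find_refs("foo|---|bar"))        # no delimiters at all
```

Under these assumptions the first two inputs yield one reference named
"---", while the undelimited third input yields none, which is the
ambiguity-avoiding behavior described above.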

 > IMO the trailing space should be made omittable.

We'd still need a leading space.  With an omissible trailing space,
the best we'd be able to do would be

     foo\ |---|bar

That isn't much better than the current "foo\ |---|\ bar".  Certainly
not worth the ambiguity and effort.

But this gave me an idea.  In conjunction with a change to the
"unicode" directive, substitutions could become context-sensitive.  We
could add a "trim" option to the "unicode" directive, as follows:

     .. |--| unicode:: U+02013 .. EN DASH
        :trim:
     .. |---| unicode:: U+02014 .. EM DASH
        :trim:

Then this input:

     foo |---| bar

could become this output:

     foo&mdash;bar
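
The intended effect of the proposed "trim" option can be sketched in
plain Python.  This is not Docutils code; the TRIMMED table and the
expand_trimmed function are hypothetical stand-ins for substitution
definitions that carry ":trim:", and the sketch assumes that all
whitespace adjacent to the reference is removed.

```python
import re

# Hypothetical stand-in for definitions such as
#   .. |---| unicode:: U+02014 .. EM DASH
#      :trim:
TRIMMED = {"--": "\u2013", "---": "\u2014"}

# A reference plus any whitespace on either side of it.
_REF = re.compile(r"\s*\|([^|\s]+)\|\s*")


def expand_trimmed(text, table=TRIMMED):
    """Replace known references and swallow the surrounding whitespace."""
    def repl(match):
        name = match.group(1)
        if name not in table:
            # Unknown reference: leave it (and its whitespace) untouched.
            return match.group(0)
        return table[name]
    return _REF.sub(repl, text)


print(expand_trimmed("foo |---| bar"))
print(expand_trimmed("pages 3 |--| 5"))
```

The point of the sketch is that the spaces stay in the source, where
the markup recognition rules need them, and disappear only from the
output.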

Characters other than spaces could also serve as the trimmed
delimiters; hyphens, for example.  Alternative substitution
definitions I'm thinking of include:

     .. |M| unicode:: U+02014 .. EM DASH
        :trim: -
     .. |N| unicode:: U+02013 .. EN DASH
        :trim: -
     .. |?| unicode:: U+000AD .. SOFT HYPHEN
        :trim: -
     .. |!| unicode:: U+02011 .. NON-BREAKING HYPHEN
        :trim: -
     .. |#| unicode:: U+02012 .. FIGURE DASH
        :trim: -

So an em-dash could be written like this, similar to the proofreaders'
mark:

     foo-|M|-bar

and would produce (the equivalent of) this:

     foo&mdash;bar
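
A similar Python sketch covers the ":trim: -" variant.  Again this is
only an illustration: DEFS and expand_delimited are invented names,
and the sketch assumes that at most one delimiter character is
removed from each side of the reference, which the proposal above
does not yet pin down.

```python
import re

# Hypothetical definitions: name -> (replacement character, trim character)
DEFS = {
    "M": ("\u2014", "-"),  # EM DASH
    "N": ("\u2013", "-"),  # EN DASH
}


def expand_delimited(text, defs=DEFS):
    """Replace each reference, dropping one adjacent trim character per side."""
    for name, (char, trim) in defs.items():
        delim = "(?:%s)?" % re.escape(trim)
        pattern = delim + re.escape("|%s|" % name) + delim
        text = re.sub(pattern, lambda _match, c=char: c, text)
    return text


print(expand_delimited("foo-|M|-bar"))
print(expand_delimited("1990-|N|-1995"))
```

Spaces are left alone here because the trim character is a hyphen, so
"foo |M| bar" would keep its spaces around the dash.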

Alternatively, XML entity names (|mdash|) could be used instead of the
cryptic symbols above (|M|).

Many space characters could also be defined:

     .. |emsp| unicode:: U+02003 .. EM SPACE
        :trim:
     .. |ensp| unicode:: U+02002 .. EN SPACE
        :trim:
     .. |puncsp| unicode:: U+02008 .. PUNCTUATION SPACE
        :trim:
     .. |numsp| unicode:: U+02007 .. FIGURE SPACE
        :trim:
     .. |thinsp| unicode:: U+02009 .. THIN SPACE
        :trim:
     .. |hairsp| unicode:: U+0200A .. HAIR SPACE
        :trim:
     .. |0sp| unicode:: U+0200B .. ZERO WIDTH SPACE
        :trim:
     .. |zwnj| unicode:: U+0200C .. ZERO WIDTH NON-JOINER
        :trim:
     .. |zwj| unicode:: U+0200D .. ZERO WIDTH JOINER
        :trim:
     .. |nbsp| unicode:: U+000A0 .. NO-BREAK SPACE
        :trim:

In fact, all of the character entity files in the add-on package
(http://docutils.sourceforge.net/tmp/charents.tgz, which should come
standard with Docutils) could have space-trimmed alternatives.
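
Such space-trimmed alternatives could be generated rather than
written by hand.  The following is a small Python sketch, assuming the
proposed ":trim:" option exists and taking (name, character) pairs as
input; the function name trimmed_definitions is made up, and nothing
here reads the actual charents files.

```python
import unicodedata


def trimmed_definitions(entities):
    """Build substitution definitions with ":trim:" for (name, char) pairs."""
    lines = []
    for name, char in entities:
        lines.append(
            ".. |%s| unicode:: U+%05X .. %s"
            % (name, ord(char), unicodedata.name(char))
        )
        lines.append("   :trim:")
    return "\n".join(lines)


print(trimmed_definitions([
    ("mdash", "\u2014"),
    ("ndash", "\u2013"),
    ("nbsp", "\u00a0"),
]))
```

The comment after each code point comes from Python's unicodedata
module, so the generated names follow the Unicode character database.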

Discussion welcome.

-- 
David Goodger <http://python.net/~goodger>