Thread: Re: [Docutils-users] literal inclusions: fileformat handling

Re: [Docutils-users] literal inclusions: fileformat handling

From: Alan G I. <ai...@am...> - 2009-10-21 23:19:50

> On 2009-10-19, Alan G Isaac wrote:
>> But here is another perspective.
>> A writer writes a document with a certain convention for EOL markers.

>> The platform used to create the included files should not determine
>> the EOL marker written by the writer.



On 10/21/2009 4:36 PM, Guenter Milde wrote:
> A human writer or a Docutils writer program.



A Docutils writer program.

John Thywissen says XHTML should use LF for end of line:
http://john.thywissen.org/encodings.html
That makes some sense, as it is the XML normalization:
http://www.w3.org/TR/xml/#sec-line-ends

Alan Isaac

Re: [Docutils-users] literal inclusions: fileformat handling

From: Guenter M. <mi...@us...> - 2009-10-23 14:45:08

On 2009-10-21, Alan G Isaac wrote:
>> On 2009-10-19, Alan G Isaac wrote:

>>> But here is another perspective. A writer writes a document with a
>>> certain convention for EOL markers.

>>> The platform used to create the included files should not determine
>>> the EOL marker written by the writer.

No, the writer should use the EOL marker of the platform it runs on
(if allowed by the target format, which is the case for HTML and TeX).

This facilitates viewing/editing of the generated output files with
tools native to this platform.

> John Thywissen says XHTML should use LF for end of line:
> http://john.thywissen.org/encodings.html

This is an informative but not an authoritative source. It might even be
wrong:

* Thywissen writes: 

    HTML  .html ... CR LF (Windows convention)

  while my LF line-broken files were validated as correct HTML 4.01 by
  the W3C.

* AFAIK, a file can be both, valid HTML *and* XHTML - what line ending
  should this file have?

> That makes some sense, as it is the XML normalization:
> http://www.w3.org/TR/xml/#sec-line-ends

This is an authoritative source. However, it says

  ... the XML processor MUST behave as if it normalized all line breaks
  in external parsed entities (including the document entity) on input,
  before parsing, by translating both the two-character sequence #xD #xA
  and any #xD that is not followed by #xA to a single #xA character.

in other words: the XML processor MUST accept all three: LF, CR, and
CR+LF as line endings.
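
The normalization rule quoted above can be written down in a few lines.
This is only a sketch of the rule as quoted, not code from any XML
processor; the function name ``xml_normalize_line_ends`` is made up for
illustration.

```python
import re


def xml_normalize_line_ends(text):
    """Translate CR LF pairs and lone CR characters to a single LF.

    Hypothetical helper: mirrors the wording of the XML rule quoted
    above, where #xD #xA and any #xD not followed by #xA become #xA.
    """
    # "\r\n?" matches a CR optionally followed by LF, so a CR LF pair is
    # replaced as one unit and a lone CR is replaced on its own.
    return re.sub(r"\r\n?", "\n", text)


sample = "unix\nmac\rdos\r\nend"
print(repr(xml_normalize_line_ends(sample)))  # 'unix\nmac\ndos\nend'
```

A processor that follows this rule sees the same text whichever of the
three conventions the file used.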

However, maybe the source of your problem is the mixing of different
line-ending styles in one file:

  LF for the non-verbatim, Docutils generated part,
  CR+LF for the verbatim inclusion.

In my understanding of the w3.org source, the browser's XML processor
should normalize the line endings to LF even in this case, but in
practice it might expect a uniform line-ending convention per file.

Maybe you can do some experiments with mixed and non-mixed line endings.
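
For such experiments it may help to check which conventions a generated
file actually contains. The following is a rough sketch; the helper name
``count_line_endings`` and the sample bytes are assumptions made for
illustration, not part of Docutils.

```python
import re
from collections import Counter


def count_line_endings(data):
    """Count CR LF, lone CR and lone LF occurrences in a bytes object.

    Hypothetical helper for experiments with mixed line endings.
    """
    # The alternation tries CR LF first, so a pair is counted once and
    # not as a separate CR and LF.
    return Counter(re.findall(rb"\r\n|\r|\n", data))


# LF in the generated part, CR LF in the verbatim inclusion.
mixed = b"generated part\nincluded line 1\r\nincluded line 2\r\n"
counts = count_line_endings(mixed)
print(counts)
print("mixed" if len(counts) > 1 else "uniform")
```

Reading the output file in binary mode (``open(path, 'rb')``) before
passing it to the helper avoids any translation by Python itself.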

Günter

Re: [Docutils-users] literal inclusions: fileformat handling

From: Alan G I. <ai...@am...> - 2009-10-23 15:14:07

> On 2009-10-21, Alan G Isaac wrote:
>>>> The platform used to create the included files should not determine
>>>> the EOL marker written by the writer.
  
On 10/23/2009 10:44 AM, Guenter Milde wrote:
> No, the writer should use the EOL marker of the platform it runs on
> (if allowed by the target format, which is the case for HTML and TeX).
>
> This facilitates viewing/editing on the generated output files with
> tools native to this platform.


I do not see the contradiction.
(Do not assume the included files are created on
the platform that is running the writer.)

Alan

Re: [Docutils-users] literal inclusions: fileformat handling

From: Guenter M. <mi...@us...> - 2009-10-23 20:39:14

On 2009-10-23, Alan G Isaac wrote:
>> On 2009-10-21, Alan G Isaac wrote:
>>>>> The platform used to create the included files should not determine
>>>>> the EOL marker written by the writer.

> On 10/23/2009 10:44 AM, Guenter Milde wrote:
>> No, the writer should use the EOL marker of the platform it runs on
>> (if allowed by the target format, which is the case for HTML and TeX).

>> This facilitates viewing/editing on the generated output files with
>> tools native to this platform.


> I do not see the contradiction.
> (Do not assume the included files are created on
> the platform that is running the writer.)

Now, after careful re-reading I see your point.

Indeed it might be good to "normalize" the line-ending convention so that
we do not have a mix in the output file.

Günter

Re: [Docutils-users] literal inclusions: fileformat handling

From: Guenter M. <mi...@us...> - 2009-10-25 12:03:18

On 2009-10-25, David Goodger wrote:
> On Sat, Oct 24, 2009 at 17:44, Guenter Milde <mi...@us...> wrote:

>> Problem
>> -------

>> ... mixed line endings inside one output file if the input consists of
>> files with different line ending conventions ...
   as literal included files keep the original line-endings.

>> Solutions
>> ---------

>> a) open files in text mode ('rU') (my suggested patch),

> +0, assuming this works.

...

> Docutils handles line endings in the main file in a simple way: line
> endings are ignored completely -- they are stripped out. 
...

I'd still call this a conversion, as the individual lines are joined with
newlines again later. E.g. in states.Text.literal_block ::

        data = '\n'.join(indented)
        literal_block = nodes.literal_block(data, data)

However, against my earlier assumption, line end normalization is done
also for literal blocks, so alternatives b) and c) become:

b) explicitly normalize line endings of literal included text files:

--- misc.py	(Revision 6182)
+++ misc.py	(Arbeitskopie)

         if 'literal' in self.options:
+            # normalize line endings:
+            text = '\n'.join(rawtext.splitlines())


c) explicitly normalize line endings of literal inclusions with
   docutils.statemachine.string2lines(). 

   > (Note that docutils.statemachine.string2lines
   > also converts hard tabs to spaces
   
   and strips trailing whitespace,
   
   > which was discussed recently in the "release 0.6" thread. I'm not
   > convinced that keeping hard tabs is anything but a kludge. Keeping hard
   > tabs as in r6135 may be a mistake.)
   
   We had this discussion already. 
   
   I repeat that there is a use case for keeping hard tabs in literal
   inclusions:
   
   * It is possible to highlight tabs in the output (e.g. with the LaTeX
     writer's ``literal-env=listings`` option), so keeping them intact
     will ensure the proper highlighting.
     
     E.g. a makefile that must have hard tabs instead of spaces, so the
     correct highlighting matters!
     
   * Keeping hard tabs, it is possible to configure tab expansion in a
     style sheet (maybe not in HTML/CSS but in other output formats).
   
> I don't think any conversion is necessary. Either handle with
> universal-newlines text mode (as in (a) above) or process the same way
> as the main document. I think the latter may be the better solution.


My preference is a) 
because it uses the standard Python way to normalize line endings (fast
and clean).

I can also live with b) 
(patch is ready) 

or c), 

if we agree to let ``tab_width < 0`` signify "keep-tabs" and maybe also
"keep trailing (non line-ending) whitespace" in
docutils.statemachine.string2lines(). 
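
To make alternative c) concrete, here is a rough sketch of a line
splitter in which a negative ``tab_width`` means "keep tabs". It is not
the Docutils implementation: the function name ``split_lines`` and the
separate ``keep_trailing_whitespace`` flag are assumptions made for
illustration only.

```python
def split_lines(text, tab_width=8, keep_trailing_whitespace=False):
    """Split text into lines, dropping the line endings.

    Hypothetical variant of a string-to-lines helper:
    a negative tab_width leaves hard tabs untouched, and
    keep_trailing_whitespace (an assumed, separate flag) controls
    whether trailing blanks are stripped.
    """
    lines = text.splitlines()  # handles \n, \r and \r\n alike
    if tab_width >= 0:
        lines = [line.expandtabs(tab_width) for line in lines]
    if not keep_trailing_whitespace:
        lines = [line.rstrip() for line in lines]
    return lines


makefile = "all:\r\n\tcc -o demo demo.c  \r\n"
print(split_lines(makefile))                # tabs expanded, blanks stripped
print(split_lines(makefile, tab_width=-1))  # hard tab kept
print(split_lines(makefile, tab_width=-1, keep_trailing_whitespace=True))
```

Keeping the two switches apart means that "keep tabs" can be accepted
without also committing to "keep trailing whitespace".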


Günter

Re: [Docutils-users] literal inclusions: fileformat handling

From: David G. <go...@py...> - 2009-10-25 15:50:11

> On 2009-10-25, David Goodger wrote:
>   > (Note that docutils.statemachine.string2lines
>   > also converts hard tabs to spaces

On Sun, Oct 25, 2009 at 08:02, Guenter Milde <mi...@us...> wrote:
>   and strips trailing whitespace,

You're not going to argue that there's a use case for trailing
whitespace, are you?
(0.5 ;-)

>   > which was discussed recently in the "release 0.6" thread. I'm not
>   > convinced that keeping hard tabs is anything but a kludge. Keeping hard
>   > tabs as in r6135 may be a mistake.)
>
>   We had this discussion already.

I know. I mentioned so ("which was discussed recently" above).

>   I repeat that there is a use case for keeping hard tabs in literal
>   inclusions:

The use case seems to me to be flimsy at best -- it is grasping for
some/any possible use of hard tabs. Keeping the hard tabs feels wrong
to me somehow. For example, what happens to the hard tabs when the
Writer (output format) cannot handle them properly? But since it
requires an explicit action (:tab-width: < 0), (ab)users of this
(mis)feature would only have themselves to blame for any consequences.

>> I don't think any conversion is necessary. Either handle with
>> universal-newlines text mode (as in (a) above) or process the same way
>> as the main document. I think the latter may be the better solution.
>
> My preference is a)
> because it uses the standard Python way to normalize line endings (fast
> and clean).
>
> I can also live with b)
> (patch is ready)
>
> or c),

+0 on (a).

> if we agree to let ``tab_width < 0`` signify "keep-tabs"

-0. I don't care enough to argue further.

> and maybe also
> "keep trailing (non line-ending) whitespace" in
> docutils.statemachine.string2lines().

... so no use case yet, but you are going to argue in that direction.

-1. No way. That's hypergeneralization, and is just wrong.

-- 
David Goodger <http://python.net/~goodger>

Re: [Docutils-users] literal inclusions: fileformat handling

From: Guenter M. <mi...@us...> - 2009-10-23 21:19:14

On 2009-10-23, Guenter Milde wrote:
> On 2009-10-23, Alan G Isaac wrote:
>>> On 2009-10-21, Alan G Isaac wrote:

>>>>>> The platform used to create the included files should not determine
>>>>>> the EOL marker written by the writer.

> Indeed it might be good to "normalize" the line-ending convention so that
> we do not have a mix in the output file.

Does the following patch solve your problem?

alltests.py runs fine here with Python 2.5 on Unix. Can you test on Windows?

Günter

Exec: svn 'diff' 'io.py' 2>&1
Dir: /home/milde/Code/Python/docutils-svn/docutils/docutils/

Index: io.py
===================================================================
--- io.py	(Revision 6182)
+++ io.py	(Arbeitskopie)
@@ -207,7 +207,7 @@
 
     def __init__(self, source=None, source_path=None,
                  encoding=None, error_handler='strict',
-                 autoclose=1, handle_io_errors=1):
+                 autoclose=1, handle_io_errors=1, mode='rU'):
         """
         :Parameters:
             - `source`: either a file-like object (which is read directly), or
@@ -218,6 +218,9 @@
             - `autoclose`: close automatically after read (boolean); always
               false if `sys.stdin` is the source.
             - `handle_io_errors`: summarize I/O errors here, and exit?
+            - `mode`: how the file is to be opened (see standard function
+              `open`). The default 'rU' provides universal newline support
+              for text files.
         """
         Input.__init__(self, source, source_path, encoding, error_handler)
         self.autoclose = autoclose
@@ -225,7 +228,7 @@
         if source is None:
             if source_path:
                 try:
-                    self.source = open(source_path, 'rb')
+                    self.source = open(source_path, mode)
                 except IOError, error:
                     if not handle_io_errors:
                         raise
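
The effect of universal-newline mode can be tried outside Docutils. The
patch above targets Python 2.5, where mode ``'rU'`` enables it; the
sketch below assumes Python 3 instead, where text mode with
``newline=None`` (the default) gives the same translation and the ``U``
flag was removed in 3.11. The helper names and sample contents are made
up for illustration.

```python
import os
import tempfile

# One sample per convention (contents are invented for the demo).
samples = {
    "unix-child.txt": b"one\ntwo\n",
    "dos-child.txt": b"one\r\ntwo\r\n",
    "mac-child.txt": b"one\rtwo\r",
}


def read_binary(path):
    """Read the raw bytes, as open(path, 'rb') did before the patch."""
    with open(path, "rb") as f:
        return f.read()


def read_text(path):
    """Read in text mode with universal newlines (Python 3 default)."""
    with open(path, encoding="utf-8", newline=None) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    for name, data in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        # Binary mode shows the original endings, text mode only "\n".
        print(name, repr(read_binary(path)), repr(read_text(path)))
```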

Re: [Docutils-users] literal inclusions: fileformat handling

From: Alan G I. <ai...@am...> - 2009-10-24 14:12:47

Sorry, this was a bit unclear to me:
did you already apply this patch somewhere,
or did you want me to apply it to my
existing docutils?

Alan

Re: [Docutils-users] literal inclusions: fileformat handling

From: David G. <go...@py...> - 2009-10-24 15:25:57

On Sat, Oct 24, 2009 at 10:12, Alan G Isaac <ai...@am...> wrote:
> Sorry, this was a bit unclear to me:
> did you already apply this patch somewhere,
> or did you want me to apply it to my
> existing docutils?

Günter's patch was not applied yet. He wants you to apply it on your
code and test it.

-- 
David Goodger <http://python.net/~goodger>

Re: [Docutils-users] literal inclusions: fileformat handling

From: Guenter M. <mi...@us...> - 2009-10-26 08:08:26

On 2009-10-25, Alan G Isaac wrote:
> On 10/25/2009 5:04 PM, Guenter Milde wrote:
>> Alan, could you test with your sample case?

> Yes, that fixes the literal-inclusion eol issue.
> I tested for dos, mac, and unix fileformats.
> (Current SVN.)

Fine.

> PS Literal inclusions always end with a superfluous
> empty line. 

This is the trailing newline in your source file.

> If I use :begin-after:, they also begin
> with a superfluous empty line.

Also, if the begin-after text ends in the middle of a line?

Günter

Re: [Docutils-users] literal inclusions: fileformat handling

From: David G. <go...@py...> - 2009-10-26 14:07:30

On Mon, Oct 26, 2009 at 04:07, Guenter Milde <mi...@us...> wrote:
> On 2009-10-25, Alan G Isaac wrote:
>> If I use :begin-after:, they also begin
>> with a superfluous empty line.
>
> Also, if the begin-after text ends in the middle of a line?

Take the text after the begin-after marker, .strip() it, and if that's
an empty string, don't include it.
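
A minimal sketch of this suggestion, assuming the included text is
already a single string with ``\n`` line endings. The function name
``text_after_marker`` is hypothetical and not taken from the Docutils
sources.

```python
def text_after_marker(text, marker):
    """Return the text following marker.

    If the remainder of the marker's own line is blank, that remainder
    is dropped, so the result does not start with an empty line.
    Hypothetical helper for illustration only.
    """
    index = text.find(marker)
    if index < 0:
        raise ValueError("marker %r not found" % marker)
    rest = text[index + len(marker):]
    first_line, _newline, remainder = rest.partition("\n")
    if not first_line.strip():
        return remainder  # marker ended its line: skip the empty rest
    return rest           # marker sat in the middle of a line: keep all


print(repr(text_after_marker("# start\nbody\n", "# start")))
print(repr(text_after_marker("x = 1  # start y = 2\nbody\n", "# start")))
```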

-- 
David Goodger <http://python.net/~goodger>

Re: [Docutils-users] literal inclusions: fileformat handling

From: Guenter M. <mi...@us...> - 2009-10-24 21:45:29

On 2009-10-24, David Goodger wrote:
> On Sat, Oct 24, 2009 at 10:12, Alan G Isaac <ai...@am...> wrote:
>> Sorry, this was a bit unclear to me:
>> did you already apply this patch somewhere,
>> or did you want me to apply it to my
>> existing docutils?

> Günter's patch was not applied yet. He wants you to apply it on your
> code and test it.

Exactly.

Also, I would like David's opinion on the assumption 

  By default, input files are supposed to be text files (open with
  mode='rU').

Problem
-------

The current setting is open as binary which can lead to mixed line
endings inside one output file if the input consists of files
with different line ending conventions, e.g.

Example: 
  main.txt includes three files with

    .. include:: unix-child.txt

    .. include:: dos-child.txt

    .. include:: mac-child.txt

Multi-line literal parts of the child documents will have the original
line-endings.

Solutions
---------

a) open files in text mode ('rU') (my suggested patch),

b) explicitly convert line endings of included text files.
   (not only literal inclusions, but also RST files, as these could
   contain literal blocks),

c) explicitly convert line endings of literal blocks and literal
   inclusions.

Günter

Re: [Docutils-users] literal inclusions: fileformat handling

From: David G. <go...@py...> - 2009-10-25 03:41:51

On Sat, Oct 24, 2009 at 17:44, Guenter Milde <mi...@us...> wrote:
> Also, I would like David's opinion on the assumption
>
>  By default, input files are supposed to be text files (open with
>  mode='rU').
>
> Problem
> -------
>
> The current setting is open as binary which can lead to mixed line
> endings inside one output file if the input consists of files
> with different line ending conventions, e.g.
>
> Example:
>  main.txt includes three files with
>
>    .. include:: unix-child.txt
>
>    .. include:: dos-child.txt
>
>    .. include:: mac-child.txt
>
> Multi-line literal parts of the child documents will have the original
> line-endings.
>
> Solutions
> ---------
>
> a) open files in text mode ('rU') (my suggested patch),

+0, assuming this works.

> b) explicitly convert line endings of included text files.
>   (not only literal inclusions, but also RST files, as these could
>   contain literal blocks),
>
> c) explicitly convert line endings of literal blocks and literal
>   inclusions.

I don't think any conversion is necessary. Either handle with
universal-newlines text mode (as in (a) above) or process the same way
as the main document. I think the latter may be the better solution.

Docutils handles line endings in the main file in a simple way: line
endings are ignored completely -- they are stripped out. See the
docutils.statemachine.string2lines function. It uses the .splitlines
string method to split a block of text into a list of individual
lines. Experimentation shows that all of \n, \r, and \r\n are handled
by this method.

The "include" directive currently splits the input text into lines in
a different way, using file.readlines(). This leaves line endings
intact, resulting in the current problem. Better to strip out the line
endings consistently. (Note that docutils.statemachine.string2lines
also converts hard tabs to spaces, which was discussed recently in the
"release 0.6" thread. I'm not convinced that keeping hard tabs is
anything but a kludge. Keeping hard tabs as in r6135 may be a
mistake.)
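
The difference between the two ways of splitting can be seen with a
small experiment. This is only an illustration on an in-memory buffer,
not the code of the "include" directive; the sample data is made up.

```python
import io

# One buffer mixing all three conventions.
data = b"unix\ndos\r\nmac\rend"

# readlines() on a binary stream splits after "\n" only and keeps the
# line endings, so "\r\n" survives and a lone "\r" does not split at all.
kept = io.BytesIO(data).readlines()

# str.splitlines() recognizes \n, \r and \r\n and drops them.
stripped = data.decode("ascii").splitlines()

print(kept)      # [b'unix\n', b'dos\r\n', b'mac\rend']
print(stripped)  # ['unix', 'dos', 'mac', 'end']
```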

-- 
David Goodger <http://python.net/~goodger>

Re: [Docutils-users] literal inclusions: fileformat handling

From: Guenter M. <mi...@us...> - 2009-10-26 15:16:33

On 2009-10-26, David Goodger wrote:
> On Mon, Oct 26, 2009 at 04:07, Guenter Milde <mi...@us...> wrote:
>> On 2009-10-25, Alan G Isaac wrote:
>>> If I use :begin-after:, they also begin
>>> with a superfluous empty line.

>> Also, if the begin-after text ends in the middle of a line?

> Take the text after the begin-after marker, .strip() it, and if that's
> an empty string, don't include it.

How about stripping all blank lines from the beginning and end of the
included file or file section?

This would correspond to the way a literal block is handled::

  this

is equal to::

  this

although there are empty lines.

As nodes.literal_block expects (and we have) the text as a string (rather
than a list of lines), using a regexp seems the way.
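
One possible shape for such a regexp, as a sketch only: it assumes the
text already uses ``\n`` line endings, and the names
``strip_blank_lines``, ``LEADING_BLANK`` and ``TRAILING_BLANK`` are made
up for illustration.

```python
import re

# Blank lines (optionally holding spaces or tabs) at the very start ...
LEADING_BLANK = re.compile(r"\A(?:[ \t]*\n)+")
# ... and line ends followed only by blanks at the very end.
TRAILING_BLANK = re.compile(r"(?:\n[ \t]*)+\Z")


def strip_blank_lines(text):
    """Remove blank lines from the beginning and end of a string.

    Indentation of the first non-blank line and blank lines inside the
    text are left alone. Hypothetical helper for illustration only.
    """
    text = LEADING_BLANK.sub("", text)
    return TRAILING_BLANK.sub("", text)


print(repr(strip_blank_lines("\n\n  this\n\n")))  # '  this'
```

Unlike a plain ``text.strip()``, this keeps the indentation of the first
line, which matters for literal text.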

Günter

Re: [Docutils-users] literal inclusions: fileformat handling

From: David G. <go...@py...> - 2009-10-26 15:26:27

On Mon, Oct 26, 2009 at 11:15, Guenter Milde <mi...@us...> wrote:
> How about stripping all blank lines from begin and end of the included file
> or file section?

Sure.

-- 
David Goodger <http://python.net/~goodger>

Re: [Docutils-users] literal inclusions: fileformat handling

From: Guenter M. <mi...@us...> - 2009-10-27 07:58:04

On 2009-10-26, David Goodger wrote:
> On Mon, Oct 26, 2009 at 11:15, Guenter Milde <mi...@us...> wrote:
>> How about stripping all blank lines from begin and end of the included file
>> or file section?

> Sure.

It's now on the todo.txt list.

Günter

Re: [Docutils-users] literal inclusions: fileformat handling

From: Guenter M. <mi...@us...> - 2009-10-25 21:05:01

On 2009-10-25, David Goodger wrote:
>> On 2009-10-25, David Goodger wrote:
>>   > (Note that docutils.statemachine.string2lines
>>   > also converts hard tabs to spaces

> On Sun, Oct 25, 2009 at 08:02, Guenter Milde <mi...@us...> wrote:
>>   and strips trailing whitespace,

> You're not going to argue that there's a use case for trailing
> whitespace, are you?
> (0.5 ;-)

Of course I am. For me, literal means literal. I want a faithful
representation of the included file, so I will not introduce any changes
without need.

Trailing whitespace can be made visible if the background of the running
text is set differently from that of the literal block in a style sheet...

...

>>   I repeat that there is a use case for keeping hard tabs in literal
>>   inclusions:

> The use case seems to me to be flimsy at best -- it is grasping for
> some/any possible use of hard tabs. Keeping the hard tabs feels wrong
> to me somehow. For example, what happens to the hard tabs when the
> Writer (output format) cannot handle them properly? 

My point of view is the other way round: what happens when the Writer
(output format) renders hard tabs and spaces differently?
The latex2e writer is a real example, while I don't know of any
writer that cannot handle them properly.

> But since it
> requires an explicit action (:tab-width: < 0), (ab)users of this
> (mis)feature would only have themselves to blame for any consequences.

This was our consensus and I hope we can stick to it.

...

>> My preference is a)
>> because it uses the standard Python way to normalize line endings (fast
>> and clean).

>> I can also live with b)
>> (patch is ready)

>> or c),

> +0 on (a).

>> if we agree to let ``tab_width < 0`` signify "keep-tabs"

> -0. I don't care enough to argue further.

>> and maybe also
>> "keep trailing (non line-ending) whitespace" in
>> docutils.statemachine.string2lines().

> ... so no use case yet, but you are going to argue in that direction.

> -1. No way. That's hypergeneralization, and is just wrong.

So a) won by a margin of 2*\epsilon. 

The patch is applied. 

I also tried to add a test case (literal inclusion of file with CRLF) but
I fear that SVN did normalize the line endings thus rendering it useless.
(Maybe we could define the test file include_literal.txt as binary.)

Alan, could you test with your sample case?

Günter

Re: [Docutils-users] literal inclusions: fileformat handling

From: Alan G I. <ai...@am...> - 2009-10-25 21:33:29

On 10/25/2009 5:04 PM, Guenter Milde wrote:
> Alan, could you test with your sample case?

Yes, that fixes the literal-inclusion eol issue.
I tested for dos, mac, and unix fileformats.
(Current SVN.)

Thanks!
Alan

PS Literal inclusions always end with a superfluous
empty line.  If I use :begin-after:, they also begin
with a superfluous empty line.

PPS On another topic, here's a use case for retaining all white space
literally  in literal includes:
http://en.wikipedia.org/wiki/Whitespace_%28programming_language%29