From: Alan G I. <ai...@am...> - 2009-10-21 23:19:50
|
> On 2009-10-19, Alan G Isaac wrote: >> But here is another perspective. >> A writer writes a document with a certain convention for EOL markers. >> The platform used to create the included files should not determine >> the EOL marker written by the writer. On 10/21/2009 4:36 PM, Guenter Milde wrote: > A human writer or a Docutils writer program. A Docutils writer program. John Thywissen says XHTML should use LF for end of line: http://john.thywissen.org/encodings.html That makes some sense, as it is the XML normalization: http://www.w3.org/TR/xml/#sec-line-ends Alan Isaac |
From: Guenter M. <mi...@us...> - 2009-10-23 14:45:08
|
On 2009-10-21, Alan G Isaac wrote: >> On 2009-10-19, Alan G Isaac wrote: >>> But here is another perspective. A writer writes a document with a >>> certain convention for EOL markers. >>> The platform used to create the included files should not determine >>> the EOL marker written by the writer. No, the writer should use the EOL marker of the platform it runs on (if allowed by the target format, which is the case for HTML and TeX). This facilitates viewing/editing on the generated output files with tools native to this platform. > John Thywissen says XHTML should use LF for end of line: > http://john.thywissen.org/encodings.html This is an informative but not an authoritative source. It might even be wrong: * Thywissen writes: HTML .html ... CR LF (Windows convention) while my LF line-broken files were validated as correct HTML 4.01 by the W3C. * AFAIK, a file can be both valid HTML *and* XHTML - what line ending should this file have? > That makes some sense, as it is the XML normalization: > http://www.w3.org/TR/xml/#sec-line-ends This is an authoritative source. However, it says ... the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character. In other words: the XML processor MUST accept all three: LF, CR, and CR+LF as line endings. However, maybe the source of your problem is the mixing of different line-ending styles in one file: LF for the non-verbatim, Docutils-generated part, CR+LF for the verbatim inclusion. In my understanding of the w3.org source, the browser's XML processor should even in this case normalize the line endings to LF, but in practice it might expect a uniform line-ending convention per file. Maybe you can do some experiments with mixed and non-mixed line endings. Günter |
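The XML rule quoted in the message above (translate #xD #xA and any lone #xD to a single #xA) can be sketched in a few lines of Python. This is only an illustration of the rule as quoted, not code from Docutils or from any XML processor; the function name `normalize_xml_line_ends` is made up for this example.

```python
import re


def normalize_xml_line_ends(text):
    """Translate CR LF pairs and lone CRs to a single LF.

    Hypothetical helper: mirrors the normalization rule quoted from
    the XML specification, nothing more.
    """
    # "\r\n?" matches a CR optionally followed by LF, so the CR LF
    # pair is consumed as one unit and never becomes two LFs.
    return re.sub(r"\r\n?", "\n", text)


# A string mixing Unix, DOS and old Mac line endings.
mixed = "unix\ndos\r\nmac\rend"
print(repr(normalize_xml_line_ends(mixed)))  # 'unix\ndos\nmac\nend'
```

A file with mixed endings is therefore still well-formed input for a conforming XML processor; whether a given browser or editor copes with it in practice is a separate question, as the message says.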
From: Alan G I. <ai...@am...> - 2009-10-23 15:14:07
|
> On 2009-10-21, Alan G Isaac wrote: >>>> The platform used to create the included files should not determine >>>> the EOL marker written by the writer. On 10/23/2009 10:44 AM, Guenter Milde wrote: > No, the writer should use the EOL marker of the platform it runs on > (if allowed by the target format, which is the case for HTML and TeX). > > This facilitates viewing/editing on the generated output files with > tools native to this platform. I do not see the contradiction. (Do not assume the included files are created on the platform that is running the writer.) Alan |
From: Guenter M. <mi...@us...> - 2009-10-23 20:39:14
|
On 2009-10-23, Alan G Isaac wrote: >> On 2009-10-21, Alan G Isaac wrote: >>>>> The platform used to create the included files should not determine >>>>> the EOL marker written by the writer. > On 10/23/2009 10:44 AM, Guenter Milde wrote: >> No, the writer should use the EOL marker of the platform it runs on >> (if allowed by the target format, which is the case for HTML and TeX). >> This facilitates viewing/editing on the generated output files with >> tools native to this platform. > I do not see the contradiction. > (Do not assume the included files are created on > the platform that is running the writer.) Now, after careful re-reading I see your point. Indeed it might be good to "normalize" the line-ending convention so that we do not have a mix in the output file. Günter |
From: Guenter M. <mi...@us...> - 2009-10-25 12:03:18
|
On 2009-10-25, David Goodger wrote: > On Sat, Oct 24, 2009 at 17:44, Guenter Milde <mi...@us...> wrote: >> Problem >> ------- >> ... mixed line endings inside one output file if the input consists of >> files with different line ending conventions ... as literal included files keep the original line-endings. >> Solutions >> --------- >> a) open files in text mode ('rU') (my suggested patch), > +0, assuming this works. ... > Docutils handles line endings in the main file in a simple way: line > endings are ignored completely -- they are stripped out. ... I'd still call this a conversion, as the individual lines are joined with newlines again later. E.g. in states.Text.literal_block :: data = '\n'.join(indented) literal_block = nodes.literal_block(data, data) However, contrary to my earlier assumption, line-end normalization is also done for literal blocks, so alternatives b) and c) become: b) explicitly normalize line endings of literal included text files: --- misc.py (Revision 6182) +++ misc.py (Arbeitskopie) if 'literal' in self.options: + # normalize line endings: + text = '\n'.join(rawtext.splitlines()) c) explicitly normalize line endings of literal inclusions with docutils.statemachine.string2lines(). > (Note that docutils.statemachine.string2lines > also converts hard tabs to spaces and strips trailing whitespace, > which was discussed recently in the "release 0.6" thread. I'm not > convinced that keeping hard tabs is anything but a kludge. Keeping hard > tabs as in r6135 may be a mistake.) We had this discussion already. I repeat that there is a use case for keeping hard tabs in literal inclusions: * It is possible to highlight tabs in the output (e.g. with the LaTeX writer's ``literal-env=listings`` option), so keeping them intact will ensure the proper highlighting, e.g. for a makefile that must have hard tabs instead of spaces.
* Keeping hard tabs, it is possible to configure tab expansion in a style sheet (maybe not in HTML/CSS but in other output formats). > I don't think any conversion is necessary. Either handle with > universal-newlines text mode (as in (a) above) or process the same way > as the main document. I think the latter may be the better solution. My preference is a) because it uses the standard Python way to normalize line endings (fast and clean). I can also live with b) (patch is ready) or c), if we agree to let ``tab_width < 0`` signify "keep-tabs" and maybe also "keep trailing (non line-ending) whitespace" in docutils.statemachine.string2lines(). Günter |
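Alternative b) from the message above boils down to a single expression, `'\n'.join(rawtext.splitlines())`. The sketch below wraps it in a function to show what it does to a makefile fragment with DOS line endings and a hard tab. The function name and the sample text are assumptions made for this illustration; only the join/splitlines expression comes from the patch fragment in the thread.

```python
def normalize_line_endings(rawtext):
    """Rejoin the lines of ``rawtext`` with plain LF.

    Hypothetical wrapper around the expression from alternative b).
    Hard tabs and other in-line whitespace are left untouched.
    """
    return "\n".join(rawtext.splitlines())


# A makefile rule must start its recipe line with a hard tab.
makefile = "all:\r\n\tcc -o prog prog.c\r\n"
normalized = normalize_line_endings(makefile)

print(repr(normalized))                # 'all:\n\tcc -o prog prog.c'
print(repr(normalized.expandtabs(8)))  # 'all:\n        cc -o prog prog.c'
```

Two side effects are worth knowing: the final line ending is dropped, because `splitlines()` does not yield an empty last element, and on Python 3 `str.splitlines()` also splits on a few other characters (form feed, vertical tab, and some Unicode separators), which may or may not be wanted for literal inclusions. The second `print` shows what is lost once tabs are expanded: the makefile would no longer be valid if copied from the output.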
From: David G. <go...@py...> - 2009-10-25 15:50:11
|
> On 2009-10-25, David Goodger wrote: > > (Note that docutils.statemachine.string2lines > > also converts hard tabs to spaces On Sun, Oct 25, 2009 at 08:02, Guenter Milde <mi...@us...> wrote: > and strips trailing whitespace, You're not going to argue that there's a use case for trailing whitespace, are you? (0.5 ;-) > > which was discussed recently in the "release 0.6" thread. I'm not > > convinced that keeping hard tabs is anything but a kludge. Keeping hard > > tabs as in r6135 may be a mistake.) > > We had this discussion already. I know. I mentioned so ("which was discussed recently" above). > I repeat that there is a use case for keeping hard tabs in literal > inclusions: The use case seems to me to be flimsy at best -- it is grasping for some/any possible use of hard tabs. Keeping the hard tabs feels wrong to me somehow. For example, what happens to the hard tabs when the Writer (output format) cannot handle them properly? But since it requires an explicit action (:tab-width: < 0), (ab)users of this (mis)feature would only have themselves to blame for any consequences. >> I don't think any conversion is necessary. Either handle with >> universal-newlines text mode (as in (a) above) or process the same way >> as the main document. I think the latter may be the better solution. > > My preference is a) > because it uses the standard Python way to normalize line endings (fast > and clean). > > I can also live with b) > (patch is ready) > > or c), +0 on (a). > if we agree to let ``tab_width < 0`` signify "keep-tabs" -0. I don't care enough to argue further. > and maybe also > "keep trailing (non line-ending) whitespace" in > docutils.statemachine.string2lines(). ... so no use case yet, but you are going to argue in that direction. -1. No way. That's hypergeneralization, and is just wrong. -- David Goodger <http://python.net/~goodger> |
From: Guenter M. <mi...@us...> - 2009-10-23 21:19:14
|
On 2009-10-23, Guenter Milde wrote: > On 2009-10-23, Alan G Isaac wrote: >>> On 2009-10-21, Alan G Isaac wrote: >>>>>> The platform used to create the included files should not determine >>>>>> the EOL marker written by the writer. > Indeed it might be good to "normalize" the line-ending convention so that > we do not have a mix in the output file. Does the following patch solve your problem? alltests.py runs fine here with Python 2.5 on Unix. Can you test on Windows? Günter Exec: svn 'diff' 'io.py' 2>&1 Dir: /home/milde/Code/Python/docutils-svn/docutils/docutils/ Index: io.py =================================================================== --- io.py (Revision 6182) +++ io.py (Arbeitskopie) @@ -207,7 +207,7 @@ def __init__(self, source=None, source_path=None, encoding=None, error_handler='strict', - autoclose=1, handle_io_errors=1): + autoclose=1, handle_io_errors=1, mode='rU'): """ :Parameters: - `source`: either a file-like object (which is read directly), or @@ -218,6 +218,9 @@ - `autoclose`: close automatically after read (boolean); always false if `sys.stdin` is the source. - `handle_io_errors`: summarize I/O errors here, and exit? + - `mode`: how the file is to be opened (see standard function + `open`). The default 'rU' provides universal newline support + for text files. 
""" Input.__init__(self, source, source_path, encoding, error_handler) self.autoclose = autoclose @@ -225,7 +228,7 @@ if source is None: if source_path: try: - self.source = open(source_path, 'rb') + self.source = open(source_path, mode) except IOError, error: if not handle_io_errors: raise |
From: Alan G I. <ai...@am...> - 2009-10-24 14:12:47
|
Sorry, this was a bit unclear to me: did you already apply this patch somewhere, or did you want me to apply it to my existing docutils? Alan |
From: David G. <go...@py...> - 2009-10-24 15:25:57
|
On Sat, Oct 24, 2009 at 10:12, Alan G Isaac <ai...@am...> wrote: > Sorry, this was a bit unclear to me: > did you already apply this patch somewhere, > or did you want me to apply it to my > existing docutils? Günter's patch was not applied yet. He wants you to apply it on your code and test it. -- David Goodger <http://python.net/~goodger> |
From: Guenter M. <mi...@us...> - 2009-10-26 08:08:26
|
On 2009-10-25, Alan G Isaac wrote: > On 10/25/2009 5:04 PM, Guenter Milde wrote: >> Alan, could you test with your sample case? > Yes, that fixes the literal-inclusion eol issue. > I tested for dos, mac, and unix fileformats. > (Current SVN.) Fine. > PS Literal inclusions always end with a superfluous > empty line. This is the trailing newline in your source file. > If I use :begin-after:, they also begin > with a superfluous empty line. Also, if the begin-after text ends in the middle of a line? Günter |
From: David G. <go...@py...> - 2009-10-26 14:07:30
|
On Mon, Oct 26, 2009 at 04:07, Guenter Milde <mi...@us...> wrote: > On 2009-10-25, Alan G Isaac wrote: >> If I use :begin-after:, they also begin >> with a superfluous empty line. > > Also, if the begin-after text ends in the middle of a line? Take the text after the begin-after marker, .strip() it, and if that's an empty string, don't include it. -- David Goodger <http://python.net/~goodger> |
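One possible reading of the suggestion above, sketched in Python: look at what remains of the marker's line after the marker, and drop that remainder only if it is blank. This is an interpretation, not the Docutils implementation; the function name `text_after_marker` is hypothetical, and the sketch assumes line endings were already normalized to LF.

```python
def text_after_marker(rawtext, marker):
    """Return the text following ``marker``.

    Hypothetical helper. If the rest of the marker's line is blank,
    that remainder (including its line ending) is skipped so the
    result does not begin with an empty line.
    """
    index = rawtext.find(marker)
    if index < 0:
        raise ValueError("marker not found: %r" % marker)
    rest = rawtext[index + len(marker):]
    first_line, _, remainder = rest.partition("\n")
    if not first_line.strip():
        # Marker ended its line: start with the following line.
        return remainder
    # Marker sat in the middle of a line: keep the rest of that line.
    return rest


print(repr(text_after_marker("# START\nbody\n", "# START")))
# 'body\n'
print(repr(text_after_marker("x = 1  # START y = 2\nbody\n", "# START")))
# ' y = 2\nbody\n'
```

The second call covers the "marker ends in the middle of a line" case raised in the quoted question: nothing is discarded there, because the remainder of the line is not blank.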
From: Guenter M. <mi...@us...> - 2009-10-24 21:45:29
|
On 2009-10-24, David Goodger wrote: > On Sat, Oct 24, 2009 at 10:12, Alan G Isaac <ai...@am...> wrote: >> Sorry, this was a bit unclear to me: >> did you already apply this patch somewhere, >> or did you want me to apply it to my >> existing docutils? > Günter's patch was not applied yet. He wants you to apply it on your > code and test it. Exactly. Also, I would like David's opinion on the assumption: By default, input files are supposed to be text files (open with mode='rU'). Problem ------- The current setting is to open files as binary, which can lead to mixed line endings inside one output file if the input consists of files with different line ending conventions. Example: main.txt includes three files with .. include unix-child.txt .. include dos-child.txt .. include mac-child.txt Multi-line literal parts of the child documents will have the original line-endings. Solutions --------- a) open files in text mode ('rU') (my suggested patch), b) explicitly convert line endings of included text files. (not only literal inclusions, but also RST files, as these could contain literal blocks), c) explicitly convert line endings of literal blocks and literal inclusions. Günter |
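Solution a) relies on Python's universal-newlines text mode. The thread uses the Python 2 spelling `'rU'`; in current Python 3 the `'U'` flag no longer exists (it was removed in 3.11) and the same behaviour is the default of `open()` in text mode, i.e. `newline=None`. The sketch below recreates the three child files from the example with Unix, DOS and old Mac endings and reads them back. The helper names, the file contents and the temporary directory are assumptions for this illustration; nothing here is taken from docutils/io.py.

```python
import os
import tempfile


def read_text_universal(path, encoding="utf-8"):
    """Read a text file with universal newline translation.

    Hypothetical helper. With newline=None (the default), "\r\n" and
    "\r" are translated to "\n" while reading.
    """
    with open(path, "r", encoding=encoding, newline=None) as f:
        return f.read()


def demo():
    """Write three files with different line endings, read them back."""
    samples = {
        "unix": b"a\nb\n",
        "dos": b"a\r\nb\r\n",
        "mac": b"a\rb\r",
    }
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, data in samples.items():
            path = os.path.join(tmp, name + "-child.txt")
            # Binary mode, so the bytes land on disk unchanged.
            with open(path, "wb") as f:
                f.write(data)
            results[name] = read_text_universal(path)
    return results


for name, text in demo().items():
    print(name, repr(text))  # every file reads back as 'a\nb\n'
```

Because the translation happens at read time, the rest of the pipeline never sees a CR, which is why this option needs no separate conversion step.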
From: David G. <go...@py...> - 2009-10-25 03:41:51
|
On Sat, Oct 24, 2009 at 17:44, Guenter Milde <mi...@us...> wrote: > Also, I would like David's opinion on the assumption > > By default, input files are supposed to be text files (open with > mode='rU'). > > Problem > ------- > > The current setting is open as binary which can lead to mixed line > endings inside one output file if the input consists of files > with different line ending conventions, e.g. > > Example: > main.txt includes three files with > > .. include unix-child.txt > > .. include dos-child.txt > > .. include mac-child.txt > > multi-line literal parts of the child documents will have the original > line-endings. > > Solutions > --------- > > a) open files in text mode ('rU') (my suggested patch), +0, assuming this works. > b) explicitly convert line endings of included text files. > (not only literal inclusions, but also RST files, as these could > contain literal blocks), > > c) explicitly convert line endings of literal blocks and literal > inclusions. I don't think any conversion is necessary. Either handle with universal-newlines text mode (as in (a) above) or process the same way as the main document. I think the latter may be the better solution. Docutils handles line endings in the main file in a simple way: line endings are ignored completely -- they are stripped out. See the docutils.statemachine.string2lines function. It uses the .splitlines string method to split a block of text into a list of individual lines. Experimentation shows that all of \n, \r, and \r\n are handled by this method. The "include" directive currently splits the input text into lines in a different way, using file.readlines(). This leaves line endings intact, resulting in the current problem. Better to strip out the line endings consistently. (Note that docutils.statemachine.string2lines also converts hard tabs to spaces, which was discussed recently in the "release 0.6" thread. I'm not convinced that keeping hard tabs is anything but a kludge. 
Keeping hard tabs as in r6135 may be a mistake.) -- David Goodger <http://python.net/~goodger> |
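The difference described above between `.splitlines()` and `file.readlines()` on a file opened in binary mode can be reproduced without Docutils. The sketch below uses `io.BytesIO` as a stand-in for a file opened with `'rb'`; both function names are hypothetical and chosen only to label the two approaches.

```python
import io


def split_like_readlines(data):
    """Split bytes the way readlines() does on a binary file.

    Hypothetical helper: binary readlines() breaks only after b"\n"
    and keeps the line endings in the result.
    """
    return io.BytesIO(data).readlines()


def split_like_splitlines(data, encoding="ascii"):
    """Split decoded text with str.splitlines().

    Hypothetical helper: the method named in the message as the one
    string2lines uses; line endings are dropped from the result.
    """
    return data.decode(encoding).splitlines()


data = b"one\r\ntwo\rthree\n"
print(split_like_readlines(data))   # [b'one\r\n', b'two\rthree\n']
print(split_like_splitlines(data))  # ['one', 'two', 'three']
```

With binary `readlines()` the CR characters survive, and a lone CR does not even end a line, which matches the mixed-ending output reported earlier in the thread.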
From: Guenter M. <mi...@us...> - 2009-10-26 15:16:33
|
On 2009-10-26, David Goodger wrote: > On Mon, Oct 26, 2009 at 04:07, Guenter Milde <mi...@us...> wrote: >> On 2009-10-25, Alan G Isaac wrote: >>> If I use :begin-after:, they also begin >>> with a superfluous empty line. >> Also, if the begin-after text ends in the middle of a line? > Take the text after the begin-after marker, .strip() it, and if that's > an empty string, don't include it. How about stripping all blank lines from begin and end of the included file or file section? This would correspond to the way a literal block is handled:: this is equal to:: this although there are empty lines. As nodes.literal_block expects (and we have) the text as a string (rather than a list of lines), using a regexp seems the way. Günter |
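Stripping blank lines from both ends of a string with a regular expression, as proposed above, might look like the following. This is a sketch of the idea, not the code that ended up in Docutils; the function name is made up, and the input is assumed to have LF line endings already. Leading indentation on the first non-blank line is kept on purpose, since it is significant in literal text.

```python
import re

# One or more whitespace-only lines at the very start of the string.
_LEADING_BLANK = re.compile(r"\A(?:[ \t]*\n)+")
# One or more line breaks, each followed only by spaces or tabs,
# at the very end of the string.
_TRAILING_BLANK = re.compile(r"(?:\n[ \t]*)+\Z")


def strip_blank_lines(text):
    """Remove blank lines from the beginning and end of ``text``.

    Hypothetical helper; blank lines in the middle are preserved.
    """
    text = _LEADING_BLANK.sub("", text)
    text = _TRAILING_BLANK.sub("", text)
    return text


sample = "\n  \n  indented\n\nmore\n\n \n"
print(repr(strip_blank_lines(sample)))  # '  indented\n\nmore'
```

A plain `text.strip()` would not do here, because it would also remove the indentation of the first line.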
From: David G. <go...@py...> - 2009-10-26 15:26:27
|
On Mon, Oct 26, 2009 at 11:15, Guenter Milde <mi...@us...> wrote: > How about stripping all blank lines from begin and end of the included file > or file section? Sure. -- David Goodger <http://python.net/~goodger> |
From: Guenter M. <mi...@us...> - 2009-10-27 07:58:04
|
On 2009-10-26, David Goodger wrote: > On Mon, Oct 26, 2009 at 11:15, Guenter Milde <mi...@us...> wrote: >> How about stripping all blank lines from begin and end of the included file >> or file section? > Sure. It's now on the todo.txt list. Günter |
From: Guenter M. <mi...@us...> - 2009-10-25 21:05:01
|
On 2009-10-25, David Goodger wrote: >> On 2009-10-25, David Goodger wrote: >> > (Note that docutils.statemachine.string2lines >> > also converts hard tabs to spaces > On Sun, Oct 25, 2009 at 08:02, Guenter Milde <mi...@us...> wrote: >> and strips trailing whitespace, > You're not going to argue that there's a use case for trailing > whitespace, are you? > (0.5 ;-) Of course I am. For me, literal means literal. I want a faithful representation of the included file, so I will not introduce any changes without need. Trailing whitespace can be made visible if the background of the running text is set differently from that of the literal block in a style sheet... ... >> I repeat that there is a use case for keeping hard tabs in literal >> inclusions: > The use case seems to me to be flimsy at best -- it is grasping for > some/any possible use of hard tabs. Keeping the hard tabs feels wrong > to me somehow. For example, what happens to the hard tabs when the > Writer (output format) cannot handle them properly? My point of view is the other way round: what happens when the Writer (output format) renders hard tabs and spaces differently? The latex2e writer is a real example, while I don't know of any writer that cannot handle them properly. > But since it > requires an explicit action (:tab-width: < 0), (ab)users of this > (mis)feature would only have themselves to blame for any consequences. This was our consensus and I hope we can stick to it. ... >> My preference is a) >> because it uses the standard Python way to normalize line endings (fast >> and clean). >> I can also live with b) >> (patch is ready) >> or c), > +0 on (a). >> if we agree to let ``tab_width < 0`` signify "keep-tabs" > -0. I don't care enough to argue further. >> and maybe also >> "keep trailing (non line-ending) whitespace" in >> docutils.statemachine.string2lines(). > ... so no use case yet, but you are going to argue in that direction. > -1. No way. That's hypergeneralization, and is just wrong. 
So a) won by a margin of 2*\epsilon. The patch is applied. I also tried to add a test case (literal inclusion of file with CRLF) but I fear that SVN normalized the line endings, thus rendering it useless. (Maybe we could define the test file include_literal.txt as binary.) Alan, could you test with your sample case? Günter |
From: Alan G I. <ai...@am...> - 2009-10-25 21:33:29
|
On 10/25/2009 5:04 PM, Guenter Milde wrote: > Alan, could you test with your sample case? Yes, that fixes the literal-inclusion eol issue. I tested for dos, mac, and unix fileformats. (Current SVN.) Thanks! Alan PS Literal inclusions always end with a superfluous empty line. If I use :begin-after:, they also begin with a superfluous empty line. PPS On another topic, here's a use case for retaining all white space literally in literal includes: http://en.wikipedia.org/wiki/Whitespace_%28programming_language%29 |