From: <ms...@ma...> - 2002-11-11 17:29:17

Hi,

I need to convert plain text (with some formatting) to HTML, and,
fortunately, this module provides this functionality (among others). :)

Having downloaded the 0.2 version, I tried to write a simple program
to convert from text to HTML. It turned out to be not that easy.
Fortunately, in the mailing list archives I found a suggestion to
download the snapshot and proceed with the publish_string helper. This
works great, thank you.

I have two questions, however.

It looks like I cannot get only the body of the text (what is located
between <body> ... </body>) without some additional programming, nor
is it possible to get rid of the use of stylesheets at all. What would
be the best way to accomplish this?

The second comes from my rather extensive use of Outlook (yes, a
Microsoft product) "highlighting". When a path or file name contains
spaces, it's very convenient to just enclose the whole construction in
angle brackets (like <schema://some path/with/spaces/and a file>), and
you do not really have to worry about converting the spaces to %20.
What would you say about such a feature?

Best Regards,
--
Misha
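
For the first question, one stopgap is to post-process the rendered page. The sketch below is not part of Docutils: `extract_body` is a hypothetical helper, and it assumes the HTML produced by publish_string is already available as a text string. It simply returns whatever sits between the <body> and </body> tags.

```python
import re

# Hypothetical helper, not a Docutils API: match the first <body ...> element
# and capture everything up to the closing </body> tag.
_BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(html):
    """Return the inner content of <body>, or the input unchanged if none is found."""
    match = _BODY_RE.search(html)
    return match.group(1).strip() if match else html


# Stand-in for a rendered page; real Writer output will be more elaborate.
sample = """<html>
<head><title>Demo</title></head>
<body class="document">
<p>Hello, <em>world</em>.</p>
</body>
</html>"""

print(extract_body(sample))
```

A regular expression is enough here because a page has a single <body> element; it is a workaround on the output side and does not replace a dedicated body-only Writer.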
From: David G. <go...@py...> - 2002-11-12 01:52:24

Mikhail Sobolev wrote:
> Having downloaded the 0.2 version, I tried to write a simple program
> to convert from text to HTML. It turned out to be not that easy.
> Fortunately, in the mailing list archives I found a suggestion to
> download the snapshot and proceed with the publish_string helper. This
> works great, thank you.

Thank you for checking the archives!

> I have two questions, however.
>
> It looks like I cannot get only the body of the text (what is located
> between <body> ... </body>) without some additional programming,

Correct. You'll need a specialized Writer component. Take a look at
the files in http://docutils.sf.net/sandbox/oliverr/ht/ . This seems
to be a common requirement for people, so a custom HTML-body-only
Writer could be useful. I don't know what to do about the DocTitle
transform in this case though (in docutils/transforms/frontmatter.py).

> nor is it possible to get rid of the use of stylesheets at all.

I'm not sure what you mean by this or what you want. Please
elaborate.

The html4css1.py Writer is designed to use a stylesheet, as
recommended by the latest HTML specs. If you want HTML that doesn't
require a stylesheet at all, a new Writer would be needed.

> The second comes from my rather extensive use of Outlook (yes, a
> Microsoft product) "highlighting". When a path or file name contains
> spaces, it's very convenient to just enclose the whole construction
> in angle brackets (like <schema://some path/with/spaces/and a file>),
> and you do not really have to worry about converting the spaces to
> %20. What would you say about such a feature?

According to RFC 2396 "Uniform Resource Identifiers (URI): Generic
Syntax", spaces are not valid URI/URL characters. It does say this:

    In some cases, extra whitespace (spaces, linebreaks, tabs, etc.)
    may need to be added to break long URI across lines. The whitespace
    should be ignored when extracting the URI.
    ...
    Using <> angle brackets around each URI is especially recommended
    as a delimiting style for URI that contain whitespace.

The syntax you propose would conflict with this, especially if the
MS-style URL were to break across lines:

    <http://www.example.com/a/very/long/
    path/broken/across/lines>

Is the whitespace after "long/" significant or not? The RFC says it's
not. The reStructuredText parser also joins long multi-line URLs in
targets. I wouldn't mind adding the ability to join broken URLs in
free text as well, if surrounded by brackets.

So the answer to your question is, I think I'd say no thanks.
Whitespace in URLs is a pain; I think it's better just to avoid it.

--
David Goodger <go...@py...>
Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
  (includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/
From: <ms...@ma...> - 2002-11-12 10:18:13

On Mon, Nov 11, 2002 at 08:53:01PM -0500, David Goodger wrote:
> > It looks like I cannot get only the body of the text (what is located
> > between <body> ... </body>) without some additional programming,
>
> Correct. You'll need a specialized Writer component. Take a look at
> the files in http://docutils.sf.net/sandbox/oliverr/ht/ . This seems
> to be a common requirement for people, so a custom HTML-body-only
> Writer could be useful. I don't know what to do about the DocTitle
> transform in this case though (in docutils/transforms/frontmatter.py).

I believe the best approach is to just ignore it. :) Those who really
need it could access it through the document instance.

> > nor is it possible to get rid of the use of stylesheets at all.
>
> I'm not sure what you mean by this or what you want. Please
> elaborate.

The current code does produce HTML elements with classes referencing
a stylesheet. I'd say that the rendering without a stylesheet seems to
be OK for me, so I'd like to specify None as the stylesheet name, and
in this case I'd expect to get HTML text without class references in
HTML elements.

> The html4css1.py Writer is designed to use a stylesheet, as
> recommended by the latest HTML specs. If you want HTML that doesn't
> require a stylesheet at all, a new Writer would be needed.

Such a behaviour does not seem to be very complicated, so maybe it
could be possible to add this functionality to the current code?

> [ URLs with spaces ]
>
> According to RFC 2396 "Uniform Resource Identifiers (URI): Generic
> Syntax", spaces are not valid URI/URL characters. It does say this:
>
>     In some cases, extra whitespace (spaces, linebreaks, tabs, etc.)
>     may need to be added to break long URI across lines. The whitespace
>     should be ignored when extracting the URI.
>     ...
>     Using <> angle brackets around each URI is especially recommended
>     as a delimiting style for URI that contain whitespace.
>
> The syntax you propose would conflict with this, especially if the
> MS-style URL were to break across lines:
>
>     <http://www.example.com/a/very/long/
>     path/broken/across/lines>
>
> Is the whitespace after "long/" significant or not? The RFC says it's
> not. The reStructuredText parser also joins long multi-line URLs in
> targets. I wouldn't mind adding the ability to join broken URLs in
> free text as well, if surrounded by brackets.
>
> So the answer to your question is, I think I'd say no thanks.
> Whitespace in URLs is a pain; I think it's better just to avoid it.

Hmm. The current code does not seem to follow the quoted RFC 2396
then. I did specify

    <http://www.example.com/an url with spaces>

(which seems to be correct according to this RFC) and as a result got

    <<a href="http://www.example.com/an">http://www.example.com/an</a> url with spaces>

which seems to be incorrect, right?

--
Misha
From: David G. <go...@py...> - 2002-11-13 01:28:49

[Mikhail]
>>> nor is it possible to get rid of the use of stylesheets at all.

[David]
>> I'm not sure what you mean by this or what you want. Please
>> elaborate.

[Mikhail]
> The current code does produce HTML elements with classes referencing
> a stylesheet.

Actually, it's the other way around. The HTML file does reference a
stylesheet in its <link rel="stylesheet" ... /> element, but it's the
styles (in the stylesheet) which reference the "class" attributes on
elements in the HTML files. So if there's no stylesheet referenced,
the "class" attributes have no effect.

> I'd say that the rendering without a stylesheet seems to be OK for
> me, so I'd like to specify None as the stylesheet name,

I've altered the HTML Writer so that if both settings.stylesheet
(--stylesheet) and settings.stylesheet_path (--stylesheet-path) are
None or "", there will be no <link rel="stylesheet" ... /> added to
the output. Note that if you use the standard config file in
tools/docutils.conf, it does set settings.stylesheet_path, so you'll
have to override explicitly.

> and in this case I'd expect to get HTML text without class
> references in HTML elements.
...
> Such a behaviour does not seem to be very complicated, so maybe it
> could be possible to add this functionality to the current code?

If that's what you want, you'll have to supply the code. There's no
harm having ``class="whatever"`` attributes on HTML elements when
there's no stylesheet. It would be easy to add as a setting/option,
but I'll leave it to you because I don't think it's useful. I'll be
happy to accept a patch.

> Hmm. The current code does not seem to follow the quoted RFC 2396
> then. I did specify
>
>     <http://www.example.com/an url with spaces>
>
> (which seems to be correct according to this RFC) and as a result got
>
>     <<a href="http://www.example.com/an"
>     >http://www.example.com/an</a> url with spaces>
>
> which seems to be incorrect, right?

Note that according to the RFC, your example should be interpreted as:

    http://www.example.com/anurlwithspaces

Which is *not* what you asked for. I wrote:

>> The reStructuredText parser also joins long multi-line URLs in
>> targets.

This applies to the "target" construct only::

    .. _target: http://www.example.com/a/very/long/
       path/broken/across/lines

My comment does not apply to standalone URLs in text, with or without
angle brackets (which have no special meaning now).

As for "The current code does not seem to follow the quoted RFC 2396",
that's true. However, please realize that the quoted text comes from
Appendix E, "Recommendations for Delimiting URI in Context". A
recommendation, not a specification. The first sentence reads:

    URI are often transmitted through formats that do not provide a
    clear context for their interpretation.

reStructuredText *does* provide a clear context for the interpretation
of URIs, via the "target" construct. The appendix goes on to say:

    For robustness, software that accepts user-typed URI should
    attempt to recognize and strip both delimiters and embedded
    whitespace.

And I wrote:

>> I wouldn't mind adding the ability to join broken URLs in free text
>> as well, if surrounded by brackets.

This is the first time this issue has come up. If this feature is
important to you, I would be pleased to accept a patch that implements
it. But the patch should implement the behavior described in the RFC,
*not* the ad-hoc behavior witnessed in MS Outlook. The ambiguous and
non-standard MS Outlook behavior will *not* be supported.

--
David Goodger <go...@py...>
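
The RFC behavior described in this message (recognize the angle-bracket delimiters, then drop embedded whitespace) can be sketched in a few lines. This is an illustration only, not the reStructuredText parser's code: `extract_bracketed_uris` is a hypothetical name, and the scheme check is a simplifying assumption used to skip bracketed text that is not a URI.

```python
import re

# Text wrapped in angle brackets, possibly spanning several lines.
_BRACKETED = re.compile(r"<([^<>]+)>")
# A URI scheme followed by a colon, e.g. "http:" (assumed filter, see lead-in).
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def extract_bracketed_uris(text):
    """Return URIs found inside <...>, with all embedded whitespace removed."""
    uris = []
    for match in _BRACKETED.finditer(text):
        # Whitespace inside the delimiters is not significant, so remove it.
        candidate = re.sub(r"\s+", "", match.group(1))
        if _SCHEME.match(candidate):
            uris.append(candidate)
    return uris


text = """See <http://www.example.com/a/very/long/
    path/broken/across/lines> and
<http://www.example.com/an url with spaces>."""

for uri in extract_bracketed_uris(text):
    print(uri)
```

The second result comes out as http://www.example.com/anurlwithspaces, which matches the interpretation given above and shows why the Outlook-style reading (spaces kept as part of the path) cannot coexist with it.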
From: <ms...@ma...> - 2002-11-13 03:07:20

On Tue, Nov 12, 2002 at 08:29:35PM -0500, David Goodger wrote:

[ stylesheet / class usage discussion skipped ]

I believe the question I raised comes from my desire to use the
generated HTML code as a part of a bigger document. Since I may not
know what stylesheet is used in the "parent" document, these
"unnecessary" references to a non-existent stylesheet may lead to
problems. However, I got your point and will try to supply the code
which implements this behaviour.

> > Hmm. The current code does not seem to follow the quoted RFC 2396
> > then. I did specify
> >
> >     <http://www.example.com/an url with spaces>
> >
> > (which seems to be correct according to this RFC) and as a result got
> >
> >     <<a href="http://www.example.com/an"
> >     >http://www.example.com/an</a> url with spaces>
> >
> > which seems to be incorrect, right?
>
> Note that according to the RFC, your example should be interpreted as:
>
>     http://www.example.com/anurlwithspaces
>
> Which is *not* what you asked for.

Which I _originally_ asked for. I understood your explanation and the
reference to the RFC, and the way the RFC suggests these whitespace
characters are interpreted.

> >> The reStructuredText parser also joins long multi-line URLs in
> >> targets.
>
> This applies to the "target" construct only::
>
>     .. _target: http://www.example.com/a/very/long/
>        path/broken/across/lines
>
> My comment does not apply to standalone URLs in text, with or without
> angle brackets (which have no special meaning now).

I see. I missed the word "targets", which actually has a special
meaning.

> This is the first time this issue has come up. If this feature is
> important to you, I would be pleased to accept a patch that implements
> it. But the patch should implement the behavior described in the RFC,
> *not* the ad-hoc behavior witnessed in MS Outlook. The ambiguous and
> non-standard MS Outlook behavior will *not* be supported.

That I understand, and I have no intention to insist on any
non-standard behaviour whatsoever.

--
Misha
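
For the embedding case raised at the top of this message, where the parent document's stylesheet is unknown, one workaround outside the Writer is to drop the class attributes from the generated fragment afterwards. The sketch below is an assumption-laden illustration, not the patch discussed in the thread: `strip_class_attributes` is a hypothetical helper, and it assumes quoted attribute values that contain no ">" character.

```python
import re

# Any tag, opening or closing; class attributes are only edited inside these.
_TAG = re.compile(r"<[^<>]+>")
# A quoted class attribute together with the whitespace in front of it.
_CLASS_ATTR = re.compile(r"""\s+class\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_class_attributes(html):
    """Remove class="..." from every tag, leaving text content untouched."""
    return _TAG.sub(lambda m: _CLASS_ATTR.sub("", m.group(0)), html)


fragment = '<div class="section"><p class="first">Text with class="x" in prose.</p></div>'
print(strip_class_attributes(fragment))
```

Restricting the substitution to the inside of tags keeps literal text such as class="x" in the prose intact, which a plain search-and-replace over the whole string would not.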