[Docutils-develop] Re: reST includes first cut (long...)

I'd already begun a follow-up to my last message, which I'll
interweave into this reply.

[Dethe]
> Thanks for the great feedback.  I hope you don't mind me taking this
> back online, I think there's some valuable stuff here.

I don't mind at all.  Actually, I prefer the discussions online; it
opens up the possibility for more input.

[David]
>> I just hope I don't scare you away.
> 
> Not a bit of it.

Glad to hear it.  In the message I'd already begun, I wrote:

    If this small task has grown too big or time-consuming or
    onerous for you to handle, please let me know and I'll take over,
    with gratitude for work done and ideas inspired.

So you've got an escape clause if you change your mind. ;-)

>> I'm looking at the "utils.cheapDirective" function you wrote.  I
>> think it's a very good idea to provide a generic directive parser
>> function, but to be truly useful, this function needs to be even
>> *more* generic.  Please bear with me here.
>
> More generic would be good, I was working out from the images
> directive and trying to factor what was specific to that directive
> from what was common to all directives as I went.

And a very fruitful side-trip it has become!  It's a step I hadn't
taken.  Every new directive began with a copy & paste from the most
similar existing one, but I hadn't looked at the patterns yet.

>> From the markup spec, there are 3 physical parts to a directive:
>> 
>> 1. Directive type (identifier before the "::")
>> 2. Directive data (remainder of the first line)
>> 3. Directive block (indented text following the first line)
> 
> Except that there is a fourth part: The attribute list.

That's a *logical* part, not physical (at least, not by the original
definition).  No matter; I think we'll dispense with the old
"physical" terms as obsolete.

>> I think the "include" directive should be reStructuredText-only,
>> and the "raw" directive should have an optional second argument, a
>> path.  They definitely should not duplicate each other's
>> functionality (TOOWTDI).  In the terms defined above,
> 
> I'm not disagreeing, but I got a bit lost in the debate over :raw:
> vs. :include: and wanted a) some context to help me form an
> opinion, and b) to test the generalized directive parsing by using
> it for more than one directive.

Sure.  That's cool.

>> - "include" should have one argument, a path. (1)
>> 
>> - "raw" should have one or two arguments -- a format ("html" etc.)
>>   and an optional path -- and content, but only if there was no
>>   path argument.  So this would be either a (1) or a (1,3)
>>   directive.  So that forces us to split the parse in two, perhaps
>>   into "parse_directive" (which parses the arguments & attributes),
>>   plus "get_directive_content".
> 
> Not necessarily.  Is there any reason we can't parse a directive and
> return a 3-tuple (arguments, options, content).  It would then be up
> to the individual directive to test that these do have values, if
> required.

Originally, I was thinking that if a directive didn't *need* a content
block, it shouldn't consume any that happened to be there.  That
would allow a directive to be followed by a block quote, but it makes
directives conceptually more difficult than they need to be.  While
updating the markup and directive specs as per my last message, I
thought some more about the logical parts of a directive, and I now
believe that the original thinking was wrong.  Instead, all indented
text following a directive *should* be consumed by the directive.  The
"parse_directive" function should signal an error if there *is*
content, but the directive doesn't ask for it.  If a block quote
should follow a directive, an empty comment inserted between them will
do the trick.  Docs updated accordingly.

And yes, you're right about returning a uniform set of data.  But we
actually need a 4-tuple; read on.

>> Note that there are no attributes/options now.  If an attribute is
>> required, it shouldn't be an attribute.  It's analogous to
>> command-line options: "required option" is an oxymoron.  I'm thinking
>> of changing the terminology in the spec from "attribute" to "option"
>> to help reinforce this.

(As I threatened here, I've changed terminology from "directive
attributes" to "directive options", in the docs and in the parser
code.  See the latest CVS or snapshot.)

> OK, but this means sub-parsing the arguments instead of pulling data
> out of the attributes.  My preference would be to have a standard
> way to specify not only the types of attributes, but default values
> for them.  Then there are no optional attributes, just default or
> explicitly set.  This is a more XML-ish way to go.

The concept of "optional" is useful.  I'm using the XML idea of
#IMPLIED attributes, rather than defaults in the DTD (or in the
attribute parsing code, in our case).  I've always found it more
flexible for the downstream parts of the processing chain to make the
decisions; keep your options open.  If you put in default values
early, you lose the information that the attribute just *wasn't
specified*, and that information can be valuable.
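
To make the point concrete, here is a minimal sketch.  It is not
existing Docutils code; "parse_options", the option names, and the
fallback rule are all hypothetical.  Only the options the author
actually wrote are converted, so downstream code can still see what
was left out:

```python
# Hypothetical sketch: none of these names exist in Docutils.

def parse_options(pairs, option_spec):
    """Convert only the options actually written; add no defaults."""
    options = {}
    for name, value in pairs:
        convert = option_spec[name]   # KeyError for an unknown option
        options[name] = convert(value)
    return options

option_spec = {"height": int, "width": int, "scale": int}
options = parse_options([("height", "100")], option_spec)

# "Not specified" is still distinguishable from "specified":
width = options.get("width")   # None, because no width was given
if width is None:
    # Downstream code (a writer, say) is free to pick its own fallback.
    width = 2 * options["height"]
```

Had the parser filled in a default width, the fallback decision would
already have been made by the time the writer saw the data.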

Modelling directives on shell commands works well.  Let's go with it.

>> Looking at the directives in the test1.rst file in order, first
>> there's::
>> 
>>     .. include:: file:test2.rst
>> 
>> I'm not comfortable with URL syntax here.  I really think it's
>> YAGNI, and may open up a big can of worms.  So that one should
>> become::
>> 
>>     .. include:: test2.rst
> 
> Well, I think we *do* need URIs if we want includes to be useful to
> ZReST (which is not file-system based, but lives in an
> object-oriented database accessible by URI).  Obviously, we don't
> need a file: URI for such a trivial (and file-based) example, that's
> just a way to test that URIs work in general.

OK, good use case.  But I'm still uncomfortable with the "openAny"
function you sent::

    def openAny(path):
        try:
          # is it a file?
          return open(path)
        except :
          try:
            # is it a url?
            return urlopen(path)
          except (URLError, ValueError):
            # treat as a string
            return StringIO(path)

Especially the final StringIO part; the function should simply fail.
Should we check if "path" is a URL first, to avoid the "open(path)"
failure?  Or is this a case of "look before you leap" vs. "it's easier
to ask forgiveness than permission"?  Can a URI look like a filesystem
path?  Seems a bit ambiguous to me.
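
For comparison, a "look before you leap" version might look like the
sketch below.  The names "is_url" and "open_source" are made up for
illustration, and the rule that a one-letter scheme is really a
Windows drive letter is my assumption, not anything from a spec.
There is no string fallback; a bad path simply fails:

```python
from urllib.parse import urlparse
from urllib.request import urlopen

def is_url(source):
    """Guess whether `source` is a URL by looking for a scheme."""
    scheme = urlparse(source).scheme
    # Assumption: a one-letter "scheme" is a drive letter ("C:\..."),
    # so only longer schemes count as URLs.
    return len(scheme) > 1

def open_source(source):
    """Open a URL or a filesystem path; let any failure propagate."""
    if is_url(source):
        return urlopen(source)
    return open(source)
```

This still leaves the ambiguity in place for a relative file name
containing a colon, which is part of why I'm uneasy about guessing.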

Out of curiosity, what would a Zope/ZReST URL look like?

>> The next one is an "include"/"raw" hybrid::
>> 
>>     .. include:: test3.rst
>>        :raw: 
>>        :format: html
> 
> Well, there were a few possibilities tossed around in email.  I
> couldn't remember why the format was even needed (it's *raw*, what
> else do we need?), but then my data wouldn't show up using html.py
> and I had to dig through the writer code to figure out that raw
> nodes are tested for format type.

Yes, it wouldn't do much good to insert raw HTML in the middle of PDF.
The "raw" directive is meant to be a solution of last resort anyhow;
it's not portable.

>> It should be::
>> 
>>     .. raw:: html test3.rst
> 
> OK, if that's the correct way to do it in reST, I'll do that.  I
> still prefer the default attributes method discussed above rather
> than multiple arguments because it makes the names and types of the
> argument/option/attribute explicit.

Discussed further below.

>> I noticed that you've got the attributes *after* the content in the
>> next ones::
>> 
>>     .. include::
>>        This is a <super>Test</super> of the <sub>Emergency</sub>
>>        <strike>Broadcasting</strike> &quot;System&quot;
>>        :raw:
>>        :format: html
> 
> Yes, that was my mis-parsing of the directive (because I was working
> from images.py, not the spec).

Aha!

> The way I was grabbing (or failing to grab?) the content threw an
> exception if I put the attributes first, but worked OK if I put them
> after.  I didn't like this either.

The spec is there for a reason. :-) But it's not immutable.  It can be
changed when there's good reason.  The code too; there's a *reason*
we're not at release 1.0 yet!  This is a learning experience.

>> Attributes always come *before* the content (who's to say that
>> ":format: html" isn't valid raw data in some format?).  In any case,
>> that directive should become::
>> 
>>     .. raw:: html
> 
> OK.  But the same arguments apply.  Who's to say :format: html isn't
> valid raw data at the beginning of the data just as easily as at
> the end?  I agree that it's aesthetically and functionally better to
> have the attributes first, just not because the string ":format:
> html" might appear in the data.

That's why the blank line is necessary between options and content.
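
As a rough sketch of that rule (the function name is hypothetical, and
the block is assumed to be a list of already-dedented lines),
everything before the first blank line is option text and everything
after it is content, whatever it looks like:

```python
def split_options_and_content(block):
    """Split a directive block (list of lines) at its first blank line."""
    for index, line in enumerate(block):
        if not line.strip():
            # Lines before the blank are options; the rest is content.
            return block[:index], block[index + 1:]
    # No blank line at all: the whole block is option text.
    return block, []

block = [":width: 100", "", ":format: html"]
option_lines, content = split_options_and_content(block)
# The second ":format: html" is content here, not an option.
```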

> Also, the problem with using multiple arguments comes up.
> Arguments to :raw: are
> 
> path format
> or
> format
> 
> But it seems more intuitive to me to put the required argument
> first: so the first argument to :raw: would always be format and we
> don't have to run tests or special cases.  If there's a second
> argument, then it's path.

You mis-read.  It *is* ".. raw:: format [path]" (the *second*
argument, "path", is optional).

> This still isn't as clear and explicit (to me) as using attributes,
> but better than having argument position be dependent on number of
> arguments.

It could be::

    .. raw:: format
       :source: path/URL

Nothing wrong with that, I suppose.  Since it is an *optional*
argument, it does fit into the "option" mold.  And in fact, it could
be even more explicit (and remove my misgivings at the same time) if
we made the option more specific::

    .. raw:: format
       :file: path

    .. raw:: format
       :url: URL

Although better option names may exist.

>> Next::
>> 
>>     .. raw::
>>        This is <strong>RAW</strong>.  Really, <em>really,</em> raw.
>>        :format: html
>> 
>> Should become::
>> 
>>     .. raw:: html
>> 
>>        This is <strong>RAW</strong>.  Really, <em>really,</em> raw.
>> 
>> (Note the blank line.)
> 
> Oh, there's a blank line *before* the content.  I didn't get that.

Yes.  The directives.txt file was mistaken for the description of
"raw" (although if you didn't *read* it, that shouldn't have mattered
;-).  It's fixed now, and the specs are much more explicit.  See
http://docutils.sf.net/spec/rst/reStructuredText.html#directives, and
http://docutils.sf.net/spec/rst/directives.html.

> I thought the directive *ends* with a blank line.

There was some ambiguity about that, gone now.  The rule is, a
directive block ends with the end of indentation.  That's it.  If a
directive doesn't need a content block, it should be empty, otherwise
it's an error.
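
In code, the "end of indentation" rule could be sketched like this.
It is a simplification working on a plain list of lines, not the
parser's actual state machine, and the function name is hypothetical.
Blank lines inside the block are kept; trailing blank lines are not:

```python
def get_indented_block(lines):
    """Return (block, rest), where block ends at the end of indentation."""
    end = 0
    for index, line in enumerate(lines):
        if line.strip() and not line.startswith(" "):
            break                # first unindented text ends the block
        if line.strip():
            end = index + 1      # last indented, non-blank line so far
    return lines[:end], lines[end:]

lines = ["   <b>one</b>", "", "   <b>two</b>", "", "Next paragraph."]
block, rest = get_indented_block(lines)
```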

> Yikes.  Now I see why you want get_directive_content().  Doesn't
> that introduce all sorts of possible ambiguities?  Shouldn't there
> be one format for directives in all cases so that the users (typing
> in raw text with reST getting in their way as little as possible)
> don't have to remember the special cases for different directives?

Yes and yes, and I came to the same conclusion, as discussed above.
Just "parse_directive()" will be sufficient.

>> The last "raw" directive was::
>> 
>>     .. raw::
>>        :include: test3.rst
>>        :format: html
>> 
>> And should become::
>> 
>>     .. raw:: html test3.rst
>> 
>> (This is the same as the second directive, therefore redundant.)
> 
> The redundancy is deliberate.  I was testing two different
> possibilities and wanted to be sure they both worked.  They do the
> same thing, therefore they work.

Sorry, I should have said, "(This is *now* the same as the second
directive, therefore redundant.)". :-)

> So what you'd like to see is:
> 
> :raw: [filepath] format
> 
> :include: filepath (which must be reST)
> 
> and functions for directives/__init__.py
> 
> parse_directive()
> 
> and 
> 
> get_directive_content()
> 
> right?

Not quite.  Just to be clear, let's summarize:

* ".. include:: filepath" (must be reStructuredText).

* ".. raw:: format" + either:

  - a second, optional "filepath" argument, or
  - a "source" (or equivalent) option, or
  - (perhaps best?) *two* options, one for a filesystem path source,
    the other for a URL source (and we can't use both in one
    directive)

  If the "external source" argument or option is specified (in
  whatever form), there can be no directive content.  If there is,
  it's an error.

* A single function for directives/__init__.py, "parse_directive".
  (Plus any auxiliary functions required, of course.)

* An exception for directives/__init__.py::

     class DirectiveParseError(docutils.ApplicationError): pass

  (It doesn't need an "__init__" method, but it doesn't hurt much.)
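
To illustrate the "external source or content, never both" rule for
the two-option variant, here is a small sketch.  The "file" and "url"
option names are only the candidates floated above, "check_raw" is a
hypothetical helper, and the exception is a local stand-in derived
from plain "Exception" so that the example is self-contained:

```python
class DirectiveParseError(Exception):
    """Local stand-in for the proposed exception class."""

def check_raw(options, content):
    """Return the source option used by a "raw" directive, or None."""
    sources = [name for name in ("file", "url") if name in options]
    if len(sources) > 1:
        raise DirectiveParseError('"file" and "url" cannot be combined')
    if sources and content:
        raise DirectiveParseError(
            'no content allowed with the "%s" option' % sources[0])
    return sources[0] if sources else None
```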

The signature for "parse_directive" could be something like this::

    def parse_directive(match, type_name, state, state_machine,
                        option_presets, arguments=None,
                        option_spec={}, content=None):
        """
        Parameters:

        - `match`, `type_name`, `state`, `state_machine`, and
          `option_presets`: See `docutils.parsers.rst.directives.__init__`.
        - `arguments`: A 2-tuple of the number of ``(required,
          optional)`` whitespace-separated arguments to parse, or
          ``None`` if no arguments (same as ``(0, 0)``).  If an
          argument may contain whitespace (multiple words), specify
          only one argument (either required or optional); the client
          code must do any context-sensitive parsing.
        - `option_spec`: A dictionary, mapping known option names to
          conversion functions such as `int` or `float`.  ``None`` or
          an empty dict implies no options to parse.
        - `content`: A boolean; true if content is allowed.  Client
          code must handle the case where content is required but not
          supplied (an empty content list will be returned).

        Returns a 4-tuple: list of arguments, dict of options, list of
        strings (content block), and a boolean (blank finish).

        Or raises `DirectiveParseError` with arguments: node (system
        message), boolean (blank finish).
        """
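
As a rough illustration of the `arguments` parameter only, the sketch
below works on the plain text of the first line rather than on the
real `match` object.  "split_arguments" is a hypothetical helper, and
the exception is a simplified stand-in that carries a message instead
of a system message node and blank-finish flag:

```python
class DirectiveParseError(Exception):
    """Simplified stand-in for the proposed exception."""

def split_arguments(data, arguments=None):
    """Split first-line directive data per a (required, optional) spec."""
    required, optional = arguments or (0, 0)
    text = data.strip()
    if required + optional == 1:
        # A lone argument may contain whitespace; keep it in one piece.
        words = [text] if text else []
    else:
        words = text.split()
    if len(words) < required:
        raise DirectiveParseError(
            "%d argument(s) required, %d supplied"
            % (required, len(words)))
    if len(words) > required + optional:
        raise DirectiveParseError(
            "at most %d argument(s) allowed, %d supplied"
            % (required + optional, len(words)))
    return words
```

With this, "raw" would pass ``(1, 1)`` under the two-argument variant,
and "include" would pass ``(1, 0)``, which also lets a path with
embedded spaces through as a single argument.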

Once "parse_directive" is ready, we'll be able to convert all existing
directives to use it.  As a side-effect, we will be able to drop the
"data" parameter from the directive function signature.  The end
result will be much simpler directive code.  Great result!

-- 
David Goodger  <go...@us...>  Open-source projects:
  - Python Docutils: http://docutils.sourceforge.net/
    (includes reStructuredText: http://docutils.sf.net/rst.html)
  - The Go Tools Project: http://gotools.sourceforge.net/