|
From: David G. <go...@py...> - 2002-12-05 03:16:31
|
I have begun work on a Python source Reader component for Docutils. I
expect the work to go slowly, as there is a lot to absorb, much earlier
work to study and learn from, and little spare time to devote. I'm
trying to keep it as simple as possible, mostly for my own benefit
(lest my brain explode).

I've looked over the HappyDoc code and Tony "Tibs" Ibbs' PySource
prototype. HappyDoc uses the stdlib "parser" module to parse Python
modules into abstract syntax trees (ASTs), but that seems difficult and
fragile, the ASTs being so low-level. Tibs' prototype uses the much
higher-level ASTs built by the stdlib "compiler" module, which are much
easier to understand. I've decided to use the "compiler" module also.

My first stumbling block is in parsing assignments. I want to extract
the right-hand side (RHS) of assignments straight from the source. In
his prototype, Tibs rebuilds the RHS from the AST, but that seems
rather roundabout, and the results may not match the source perfectly
(equivalent, but not character-for-character). I think using the
"tokenize" module in parallel with "compiler" may allow the code to
extract the raw RHS text, as well as other raw text that doesn't
survive verbatim into the AST.

So, is there any prior art out there? Any pointers or advice?

--
David Goodger <go...@py...> Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
  (includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/
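For reference, a minimal sketch of the issue being described (Python 2
stdlib; illustrative only): compiler.parse() keeps structure and line
numbers but discards the raw source text, while tokenize sees every raw
character of the same statement::

    # Illustrative sketch: the AST keeps structure, tokenize keeps text.
    import compiler, tokenize, StringIO

    source = "a = (1 +\n     2)\n"

    # The RHS comes back as a tree of nodes, not the original characters:
    print compiler.parse(source).node.nodes[0]
    # -> Assign([AssName('a', 'OP_ASSIGN')], Add((Const(1), Const(2))))

    # The token stream has the raw text of every token, with (row, col)
    # start/end positions:
    for tok in tokenize.generate_tokens(StringIO.StringIO(source).readline):
        toktype, tokstring, start, end, line = tok
        print tokenize.tok_name[toktype], repr(tokstring), start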
|
From: Brett C. <ba...@OC...> - 2002-12-05 07:04:45
|
[David Goodger]
> So, is there any prior art out there? Any pointers or advice?

How does PyChecker do it? I would guess by reading the bytecode, but
you never know.

I would guess using regexes would be best if you just want to read the
source. The ``tokenize`` module has all the regexes, and they might be
available independently from the methods in the module.

-Brett
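For what it's worth, a quick sketch of that idea: the regex fragments
are module-level strings in Python 2's tokenize.py (e.g. tokenize.Name,
tokenize.Number) and can be reused with re directly::

    # Sketch: reuse tokenize's regex building blocks with re.
    import re, tokenize

    name = re.compile(tokenize.Name)      # r'[a-zA-Z_]\w*'
    number = re.compile(tokenize.Number)  # ints, longs, floats, imaginaries
    print name.match('spam = 1').group()    # -> spam
    print number.match('3.14159').group()   # -> 3.14159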
|
From: <gr...@us...> - 2002-12-05 09:17:41
|
On Wed, 4 Dec 2002, Brett Cannon wrote:
> [David Goodger]
> > So, is there any prior art out there? Any pointers or advice?
>
> How does PyChecker do it? I would guess by reading the bytecode, but
> you never know.
>
> I would guess using regexes would be best if you just want to read
> the source. The ``tokenize`` module has all the regexes, and they
> might be available independently from the methods in the module.

From ancient times, when I did cross-referencing of Modula and later
pretty-printing of NewtonScript, I have two things in mind:

* when using regexes, multiline statements are not so easy;

* when using a lexer, the association with the source lines might get
  lost (the lexer would probably merge the lines).

But whom am I telling this? David, you should know more than I ever
will after parsing reST. What is the intention:

* read docstrings
* pretty-print
* check the code
* reformat the code

--
BINGO: innovative solutions ---
Engelbert Gruber -------+
SSG Fintl,Gruber,Lassnig /
A6410 Telfs Untermarkt 9 /
Tel. ++43-5262-64727 ----+
|
From: David G. <go...@py...> - 2002-12-06 02:46:52
|
Engelbert Gruber wrote:
> What is the intention:
The intention is to summarize the code and docstrings of Python
modules in order to auto-document the code. Specifically, the
intention is to extract some pieces of raw source code (RHS of
assignments and default argument values, perhaps others) that don't
survive intact past compiler.parse().
See my reply to Doug Hellmann, and
http://docutils.sf.net/pep-0258.html#python-source-reader
--
David Goodger <go...@py...> Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
(includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/
|
|
From: Doug H. <do...@he...> - 2002-12-05 14:06:08
|
On Wednesday 04 December 2002 10:16 pm, David Goodger wrote:
> I've looked over the HappyDoc code and Tony "Tibs" Ibbs' PySource
> prototype. HappyDoc uses the stdlib "parser" module to parse Python
> modules into abstract syntax trees (ASTs), but that seems difficult
> and fragile, the ASTs being so low-level. Tibs' prototype uses the
> much higher-level ASTs built by the stdlib "compiler" module, which
> are much easier to understand. I've decided to use the "compiler"
> module also.

I'm pretty sure HappyDoc was written before the compiler module was
generally available, but I'm not sure. I've only had to make a few
minor modifications to it in the past, since the language syntax hasn't
evolved that far.

I'm working on a major overhaul of HappyDoc anyway, so now might be the
time to rewrite the parsing stuff to use the compiler module. If you're
interested in collaborating, let me know.

Doug
|
From: David G. <go...@py...> - 2002-12-06 02:44:48
|
Doug Hellmann wrote:
> I'm pretty sure HappyDoc was written before the compiler module was
> generally available
I suspected as much. Either that, or you're a glutton for punishment
;-)
> I've only had to make a few minor modifications to it in the past,
> since the language syntax hasn't evolved that far.
That's good to know. Still, the parser.suite() approach seems a lot
harder.
> I'm working on a major overhaul of HappyDoc anyway, so now might be
> the time to rewrite the parsing stuff to use the compiler module.
> If you're interested in collaborating, let me know.
I am, definitely. What I'd like to do is to take a module, read in
the text, run it through the module parser (using compiler.py and
tokenize.py) and produce a high-level AST full of nodes that are
interesting from an auto-documentation standpoint. For example, given
this module (x.py)::
    # comment

    """Docstring"""

    """Additional docstring"""

    __docformat__ = 'reStructuredText'

    a = 1
    """Attribute docstring"""

    class C(Super):

        """C's docstring"""

        class_attribute = 1
        """class_attribute's docstring"""

        def __init__(self, text=None):
            """__init__'s docstring"""
            self.instance_attribute = (text * 7
                                       + ' whaddyaknow')
            """instance_attribute's docstring"""

    def f(x, y=a*5, *args):
        """f's docstring"""
        return [x + item for item in args]

    f.function_attribute = 1
    """f.function_attribute's docstring"""
The module parser should produce a high-level AST, something like this
(in pseudo-XML_)::
    <Module filename="x.py">
        <Comment lineno=1>
            comment
        <Docstring lineno=3>
            Docstring
        <Docstring lineno=...>    (I'll leave out the lineno's)
            Additional docstring
        <Attribute name="__docformat__">
            <Expression>
                'reStructuredText'
        <Attribute name="a">
            <Expression>
                1
            <Docstring>
                Attribute docstring
        <Class name="C" inheritance="Super">
            <Docstring>
                C's docstring
            <Attribute name="class_attribute">
                <Expression>
                    1
                <Docstring>
                    class_attribute's docstring
            <Method name="__init__" argnames=['self', ('text', 'None')]>
                <Docstring>
                    __init__'s docstring
                <Attribute name="instance_attribute" instance=True>
                    <Expression>
                        (text * 7
                         + ' whaddyaknow')
                    <Docstring>
                        instance_attribute's docstring
        <Function name="f" argnames=['x', ('y', 'a*5'), 'args']
                  varargs=True>
            <Docstring>
                f's docstring
            <Attribute name="function_attribute">
                <Expression>
                    1
                <Docstring>
                    f.function_attribute's docstring
compiler.parse() provides most of what's needed for this AST. I think
that "tokenize" can be used to get the rest, and all that's left is to
hunker down and figure out how. We can determine the line number from
the compiler.parse() AST, and a get_rhs(lineno) method would provide
the rest.
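For illustration, a rough sketch of how tokenize could back such a
method (build_rhs_map is a hypothetical helper; the crude join below
loses the original spacing, which real code would preserve using the
token coordinates)::

    import tokenize, StringIO

    def build_rhs_map(source):
        """Map each assignment's first line number to its raw RHS."""
        rhs_map = {}
        logical = []  # tokens of the current logical line
        gen = tokenize.generate_tokens(StringIO.StringIO(source).readline)
        for toktype, tokstring, start, end, line in gen:
            if toktype == tokenize.NEWLINE:  # a logical line just ended
                depth = 0
                for i in range(len(logical)):
                    tstring = logical[i][1]
                    if tstring in '([{':
                        depth = depth + 1
                    elif tstring in ')]}':
                        depth = depth - 1
                    elif tstring == '=' and depth == 0:
                        rhs = ' '.join([t[1] for t in logical[i + 1:]])
                        rhs_map[logical[0][2][0]] = rhs
                        break
                logical = []
            elif toktype not in (tokenize.COMMENT, tokenize.NL):
                logical.append((toktype, tokstring, start, end, line))
        return rhs_map

    print build_rhs_map("a = 1\nb = (2 *\n     3)\n")
    # -> {1: '1', 2: '( 2 * 3 )'}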
The Docutils Python reader component will transform this AST into a
Python-specific doctree, and then a `stylist transform`_ would further
transform it into a generic doctree. Namespaces will have to be
compiled for each of the scopes, but I'm not certain at what stage of
processing.
It's very important to keep all docstring processing out of this, so
that it's completely generic and not tool-specific.
For an overview see:
http://docutils.sf.net/pep-0258.html#python-source-reader
For very preliminary code see:
http://docutils.sf.net/docutils/readers/python/moduleparser.py
For tests and example output see:
http://docutils.sf.net/test/test_readers/test_python/test_parser.py
I have also made some simple scripts to make "compiler", "parser", and
"tokenize" output easier to read. They use input from the
test_parser.py module above. See showast, showparse, and showtok in:
http://docutils.sf.net/test/test_readers/test_python/
.. _pseudo-XML: http://docutils.sf.net/spec/doctree.html#pseudo-xml
.. _stylist transform:
http://docutils.sf.net/spec/pep-0258.html#stylist-transforms
--
David Goodger <go...@py...> Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
(includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/
|
|
From: Doug H. <do...@he...> - 2002-12-06 13:28:57
|
On Thursday 05 December 2002 9:45 pm, David Goodger wrote:
> Doug Hellmann wrote:
> > I'm pretty sure HappyDoc was written before the compiler module was
> > generally available
>
> I suspected as much. Either that, or you're a glutton for punishment
> ;-)

Well, I didn't say that wasn't true. :-) I actually started with some
sample code included in the Python source distribution, so it wasn't
too hard to extend it and come up with a useful parser.

> > I've only had to make a few minor modifications to it in the past,
> > since the language syntax hasn't evolved that far.
>
> That's good to know. Still, the parser.suite() approach seems a lot
> harder.

If you're starting from scratch, I would definitely recommend trying
the compiler module first.

> > I'm working on a major overhaul of HappyDoc anyway, so now might be
> > the time to rewrite the parsing stuff to use the compiler module.
> > If you're interested in collaborating, let me know.
>
> I am, definitely. What I'd like to do is to take a module, read in
> the text, run it through the module parser (using compiler.py and
> tokenize.py) and produce a high-level AST full of nodes that are
> interesting from an auto-documentation standpoint. For example,
> given this module (x.py)::

[...]

> compiler.parse() provides most of what's needed for this AST. I
> think that "tokenize" can be used to get the rest, and all that's
> left is to hunker down and figure out how. We can determine the line
> number from the compiler.parse() AST, and a get_rhs(lineno) method
> would provide the rest.

Does compiler include comments? I had to write a separate parser to
pull comments out.

> The Docutils Python reader component will transform this AST into a
> Python-specific doctree, and then a `stylist transform`_ would
> further transform it into a generic doctree. Namespaces will have to
> be compiled for each of the scopes, but I'm not certain at what
> stage of processing.

Why perform all of those transformations? Why not go from the AST to
a generic doctree? Or, even from the AST to the final output?

> It's very important to keep all docstring processing out of this, so
> that it's completely generic and not tool-specific.

Definitely.

Doug
|
From: Michael H. <mw...@py...> - 2002-12-06 13:42:15
|
Doug Hellmann <do...@he...> writes:
> If you're starting from scratch, I would definitely recommend trying the
> compiler module first.
Amen.
[...]
> Does compiler include comments?
No. tokenize.py does, though.
I don't know how hard it would be to turn the output of tokenize.py
into something like the output of compiler/transformer.py, but with
comments. SPARK may be your friend...
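For example, tokenize reports the comments with their positions (a
minimal sketch, straight stdlib)::

    # tokenize reports COMMENT tokens, which compiler.parse() discards.
    import tokenize, StringIO

    source = "# module comment\nx = 1  # trailing comment\n"
    gen = tokenize.generate_tokens(StringIO.StringIO(source).readline)
    for toktype, tokstring, (srow, scol), end, line in gen:
        if toktype == tokenize.COMMENT:
            print srow, tokstring
    # -> 1 # module comment
    # -> 2 # trailing comment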
Cheers,
M.
--
Two things I learned for sure during a particularly intense acid
trip in my own lost youth: (1) everything is a trivial special case
of something else; and, (2) death is a bunch of blue spheres.
-- Tim Peters, 1 May 1998
|
|
From: David G. <go...@py...> - 2002-12-07 02:47:25
|
Doug Hellmann wrote:
> Does compiler include comments? I had to write a separate parser to
> pull comments out.
As Michael said, no. That's another reason for using compiler and
tokenize in parallel.
>> The Docutils Python reader component will transform this AST into a
>> Python-specific doctree, and then a `stylist transform`_ would
>> further transform it into a generic doctree. Namespaces will have
>> to be compiled for each of the scopes, but I'm not certain at what
>> stage of processing.
>
> Why perform all of those transformations? Why not go from the AST
> to a generic doctree? Or, even from the AST to the final output?
I want the docutils.readers.python.moduleparser.parse_module()
function to produce a standard documentation-oriented AST that can be
used by any tool. We can develop it together without having to
compromise on the rest of our design (i.e., HappyDoc doesn't have to
be made to work like Docutils, and vice-versa). It would be a
higher-level version of what compiler.py provides.
The Python reader component transforms this generic AST into a
Python-specific doctree (it knows about modules, classes, functions,
etc.), but this is specific to Docutils and cannot be used by HappyDoc
or others. The stylist transform does the final layout, converting
Python-specific structures ("class" sections, etc.) into a generic
doctree using primitives (tables, sections, lists, etc.). This
generic doctree does *not* know about Python structures any more. The
advantage is that this doctree can be handed off to any of the output
writers to create any output format we like.
The latter two transforms are separate because I want to be able to
have multiple independent layout styles (multiple runtime-selectable
"stylist transforms"). Each of the existing tools (HappyDoc, pydoc,
epydoc, Crystal, etc.) has its own fixed format. I personally don't
like the tables-based format produced by these tools, and I'd like to
be able to customize the format easily. That's the goal of stylist
transforms, which are independent from the Reader component itself.
One stylist transform could produce HappyDoc-like output, another
could produce output similar to module docs in the Python library
reference manual, and so on.
It's for exactly this reason:
>> It's very important to keep all docstring processing out of this,
>> so that it's completely generic and not tool-specific.
... but it goes past docstring processing. It's also important to
keep style decisions and tool-specific data transforms out of this
module parser.
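To make the idea of runtime-selectable stylists concrete, a
hypothetical sketch (class and method names invented for illustration,
not actual Docutils API)::

    class Stylist:
        """Convert a Python-specific doctree into a generic doctree."""
        def transform(self, doctree):
            raise NotImplementedError

    class TableStylist(Stylist):
        """HappyDoc-like layout: classes rendered as summary tables."""
        def transform(self, doctree):
            return doctree  # placeholder: would emit table primitives

    class LibRefStylist(Stylist):
        """Layout modeled on the Python library reference manual."""
        def transform(self, doctree):
            return doctree  # same input, a different generic layout

    stylists = {'tables': TableStylist, 'libref': LibRefStylist}

    def apply_stylist(doctree, style):
        # the writers downstream only ever see the generic doctree
        return stylists[style]().transform(doctree)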
--
David Goodger <go...@py...> Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
(includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/
|
|
From: Doug H. <do...@he...> - 2002-12-07 13:47:37
|
On Friday 06 December 2002 9:47 pm, David Goodger wrote:
> Doug Hellmann wrote:
> >
> > Why perform all of those transformations? Why not go from the AST
> > to a generic doctree? Or, even from the AST to the final output?
>
> I want the docutils.readers.python.moduleparser.parse_module()
> function to produce a standard documentation-oriented AST that can be
> used by any tool. We can develop it together without having to
> compromise on the rest of our design (i.e., HappyDoc doesn't have to
> be made to work like Docutils, and vice-versa). It would be a
> higher-level version of what compiler.py provides.
That part makes sense.
> The Python reader component transforms this generic AST into a
> Python-specific doctree (it knows about modules, classes, functions,
> etc.), but this is specific to Docutils and cannot be used by HappyDoc
> or others. The stylist transform does the final layout, converting
> Python-specific structures ("class" sections, etc.) into a generic
> doctree using primitives (tables, sections, lists, etc.). This
> generic doctree does *not* know about Python structures any more. The
> advantage is that this doctree can be handed off to any of the output
> writers to create any output format we like.
Ah. I handled that differently in HappyDoc. Instead of building another
data structure, I set up the API for the formatters to have methods that do
things like start/end a (sub)section, start/end a list, etc. The primary
implementation is an HTML formatter that produces tables, but there are other
formatters. The docset is then responsible for calling the right formatter
method when it wants it. Having the docset and formatter separate makes
things more complicated than I expected, so in HappyDoc 3.0 there will just
be one plugin system.
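A hypothetical sketch of that kind of formatter API (names invented for
illustration, not actual HappyDoc code)::

    import sys

    class HTMLTableFormatter:
        """Renders sections as HTML tables; other formatters would
        implement the same start/end methods differently."""
        def __init__(self, output):
            self.output = output
        def start_section(self, title):
            self.output.write('<h2>%s</h2>\n<table>\n' % title)
        def end_section(self):
            self.output.write('</table>\n')
        def start_list(self):
            self.output.write('<ul>\n')
        def end_list(self):
            self.output.write('</ul>\n')

    # The docset decides when to call each method:
    formatter = HTMLTableFormatter(sys.stdout)
    formatter.start_section('Classes')
    formatter.end_section()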
There is a new scanner which walks the input directory building a tree of
scanned files, doing optional special processing for each based on mimetype.
For text/x-python files, the file is parsed and information about classes,
etc. is extracted. The output formatter walks the resulting tree, also
doing mimetype-based processing for each file. HTML and image files will be
copied from input to output. Text files are converted using the docstring
converter, and the parse results from Python modules are used to generate new
HTML output files.
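A sketch of that mimetype-based dispatch (the two handler functions are
hypothetical placeholders; mimetypes.guess_type is stdlib)::

    import mimetypes, shutil

    def parse_and_document(path, outpath):
        pass  # would parse the module and generate an HTML doc page

    def convert_docstring_text(path, outpath):
        pass  # would run the docstring converter over the text file

    def process(path, outpath):
        mimetype, encoding = mimetypes.guess_type(path)
        if mimetype == 'text/x-python':
            parse_and_document(path, outpath)
        elif mimetype in ('text/html', 'image/gif', 'image/png'):
            shutil.copy(path, outpath)  # copied from input to output
        elif mimetype == 'text/plain':
            convert_docstring_text(path, outpath)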
I've got the scanner done, and am working on the output formatter code now.
Doug
|