From: Zeth <the...@gm...> - 2008-06-05 23:22:06
Hello guys,

I am new to Docutils so I don't really know what I am doing yet. However, my aim is to read the bibliographic fields out of a reStructuredText document into a Python dictionary. I achieved it by just poking it (see code below), but this seems a bit meandrous, what is the correct way to do this please?

Cheers,
Best Wishes,
Zeth

"""Trying to use docutils."""
import docutils.core

def get_docinfo(document):
    """Get the bibliographic fields out of the reStructuredText document."""
    docinfo = {}
    for i in [x for x in docutils.core.publish_doctree(document).children \
              if x.tagname == 'docinfo'][0].children:
        docinfo[i.tagname] = str(i.children[0])
    return docinfo

def main():
    """Little example."""
    webpage = 'http://docutils.sourceforge.net/docs/user/rst/quickstart.txt'
    import urllib2
    document = ''.join(urllib2.urlopen(webpage).readlines())
    return get_docinfo(document)

# start the ball rolling
if __name__ == "__main__":
    print main()
From: David G. <go...@py...> - 2008-06-09 20:02:05
On Thu, Jun 5, 2008 at 19:22, Zeth <the...@gm...> wrote:
> I am new to Docutils so I don't really know what I am doing yet.
> However, my aim is to read the bibliographic fields out of a
> reStructuredText document into a Python dictionary. I achieved it by
> just poking it (see code below), but this seems a bit meandrous, what
> is the correct way to do this please? Cheers,

What you did works, but this is probably the most straightforward:

    doctree = docutils.core.publish_doctree(source_text)
    docinfos = doctree.traverse(docutils.nodes.docinfo)

This returns a list of all nodes of class docutils.nodes.docinfo.

> """Trying to use docutils."""
> import docutils.core
>
> def get_docinfo(document):
>     """Get the bibliographic fields out of the reStructuredText document."""
>     docinfo = {}
>     for i in [x for x in docutils.core.publish_doctree(document).children \
>               if x.tagname == 'docinfo'][0].children:
>         docinfo[i.tagname] = str(i.children[0])

This won't work with custom fields, which are stored in "field" elements. For those cases, you'll have to extract the contents of the "field_name" element, which is the first child of "field" (and extract the contents of the "field_body" element which follows).

There can be duplicate fields (standard or custom). You may want to accumulate field bodies in a list (e.g. using "docinfo = collections.defaultdict(list)" and "docinfo[fieldname].append(body)" [defaultdict is new in Python 2.5]).

Also, field bodies are not always simple strings. First, using str() will break if there's any non-ASCII text in there. And applying str() to anything but a Text node will give you the XML representation of the element or subtree.
-- David Goodger

>     return docinfo
>
> def main():
>     """Little example."""
>     webpage = 'http://docutils.sourceforge.net/docs/user/rst/quickstart.txt'
>     import urllib2
>     document = ''.join(urllib2.urlopen(webpage).readlines())
>     return get_docinfo(document)
>
> # start the ball rolling
> if __name__ == "__main__":
>     print main()
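
The points in the reply above (custom "field" elements, duplicate fields, and avoiding str() on nodes) can be combined into one function. What follows is only a minimal sketch and not code posted in this thread. It assumes a current Python and a Docutils release where nodes offer astext(); the fallback from findall() to traverse() is an assumption made to cope with releases that deprecate traverse(). The function name get_docinfo is reused from the original question for illustration.

```python
from collections import defaultdict

import docutils.core
from docutils import nodes


def get_docinfo(source_text):
    """Collect bibliographic fields as {field name: [body text, ...]}."""
    doctree = docutils.core.publish_doctree(source_text)
    # Assumption: prefer findall() where it exists, else use traverse().
    find = getattr(doctree, "findall", None) or doctree.traverse
    docinfo = defaultdict(list)
    for info in find(nodes.docinfo):
        for child in info.children:
            if isinstance(child, nodes.field):
                # Custom field: first child is field_name, next is field_body.
                name = child[0].astext()
                body = child[1].astext()
            else:
                # Standard field (author, date, ...): the tag is the name.
                name = child.tagname
                body = child.astext()
            # Lists keep every occurrence of a duplicated field.
            docinfo[name].append(body)
    return dict(docinfo)


if __name__ == "__main__":
    sample = (
        "Title\n=====\n\n"
        ":Author: Jane Doe\n"
        ":Reviewer: Alice\n"
        ":Reviewer: Bob\n"
    )
    print(get_docinfo(sample))
```

Using astext() gives the plain text of a field body without markup, so a body that contains more than a single Text node is flattened rather than rendered as XML. If the structure of a body matters, keeping the field_body node itself in the list is the alternative.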
From: Zeth <the...@gm...> - 2008-06-09 21:46:26
Thank you David for your reply, it works great. The traverse method is certainly more direct, sorry for not finding it myself.

Thanks also for the advice about duplicate fields and non-string fields; I hadn't considered that. While I am not using them in my data, it is certainly good to write my code in a general way that can handle them.

I had not come across defaultdict before. It is really interesting, as I have used a similar approach before (e.g. a dictionary with a list for each value).

Somewhat off topic, but I have to admit that I still tend to code for Python 2.3 or above. I think I should maybe consider revising that, now that Red Hat and Apple ship newer releases.

Best Wishes,
Zeth
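
For comparison, the "dictionary with a list for each value" approach mentioned above can be written with dict.setdefault, which does not need defaultdict. This is a minimal sketch with made-up sample pairs; the function names are hypothetical and only serve to put the two variants side by side.

```python
from collections import defaultdict


def group_with_setdefault(pairs):
    """Group bodies by field name using a plain dict."""
    grouped = {}
    for name, body in pairs:
        # setdefault() inserts an empty list the first time a name is seen.
        grouped.setdefault(name, []).append(body)
    return grouped


def group_with_defaultdict(pairs):
    """Group bodies by field name using collections.defaultdict."""
    grouped = defaultdict(list)
    for name, body in pairs:
        grouped[name].append(body)
    return dict(grouped)


if __name__ == "__main__":
    pairs = [("author", "A"), ("reviewer", "B"), ("reviewer", "C")]
    print(group_with_setdefault(pairs) == group_with_defaultdict(pairs))
```

Both variants build the same mapping; the setdefault form is the one to reach for when defaultdict is not available.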