Re: [pyblio] lost in the devel--1.3 branch


Frédéric Gobry wrote:

>> At this point I can only assume I am on the wrong track and that I
>> must be missing something obvious. Any ideas? I'm using Python
>> 2.3.4... nothing fancy as far as I know.
>
>
> You're not missing something obvious: core-api contains a few files
> resulting from long discussions with Peter, which led to two
> separate attempts at some functions. Therefore, there are a few
> classes and their associated unit tests that are not
> correctly integrated into the framework. Unfortunately, I did not
> maintain them, and now they produce unit test errors.
>
Ok, no problem, I won't dwell too long on the tests then. Even though
core-api is in its early days and is likely to be a moving target for
some time yet, I really like the look of it so far and have decided to
give it a red hot go. I expect to blunder blindly for a while and to ask
some pretty dumb questions, so please bear with me.

In order to become better acquainted with core-api, I started knocking
together a short Python script to import a bibtex database and dump the
bib info on stdout. What I came up with was:

> #!/usr/bin/python
>
> import os, sys, string
>
> from Pyblio.Importers import BibTeX
> from Pyblio import Store, Schema
>
> bibFile = "test1.bib"
> xmlFile = "test1.xml"
>
> dbFormat = Store.get("file")  # the core-api database format?
> dbSchema = Schema.Schema("./myschema.xml")  # the core-api database schema?
>
> db = dbFormat.dbcreate(xmlFile, dbSchema)
>
> doc = Store.TxoItem()
> doc.names['C'] = "article"
> db.txo["doctype"].add(doc)  # only support article citations for now
>
> parser = BibTeX.Importer()
>
> parser.parse(open(bibFile), db)
>
> # dump every field of every imported entry, one blank line between entries
> for cite in db.entries.itervalues():
>     for key in cite.keys():
>         print "%s : %s" % (key, cite[key])
>     print
>
> # remove the temporary XML database again
> Store.get("file").dbdestroy(xmlFile, nobackup = True)

I started with simple.bib and schema.xml from tests/ut_bibtex/ and this
worked ok. I then tried it on a small bibtex database containing a
couple of my actual @article citations. Of course it failed miserably.
I added entries to myschema.xml for the volume = {}, pages = {} and
month = {} fields in the bibtex citations:

> :
> :
> <attribute id="volume" type="text">
>  <name>Volume</name>
> </attribute>
>
> <attribute id="pages" type="text">
>  <name>Pages</name>
> </attribute>
>
> <attribute id="month" type="text">
>  <name>Month</name>
> </attribute>
> :
> :
>
and this seemed to take care of things and my citation data was dumped
on stdout.
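
For what it's worth, a quick way to see which attribute ids a schema fragment like this declares is to parse it outside of core-api. The sketch below uses only the standard library's xml.etree.ElementTree (which is newer than Python 2.3). The enclosing <schema> root element and the attribute_ids() helper are assumptions made for illustration; they are not taken from core-api or from the real layout of schema.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: the real root element of myschema.xml is not shown
# above, so <schema> is only an assumption that makes the fragment parseable.
SCHEMA_FRAGMENT = """
<schema>
  <attribute id="volume" type="text">
    <name>Volume</name>
  </attribute>
  <attribute id="pages" type="text">
    <name>Pages</name>
  </attribute>
  <attribute id="month" type="text">
    <name>Month</name>
  </attribute>
</schema>
"""


def attribute_ids(xml_text):
    """Return a dict mapping each <attribute> id to its declared type."""
    root = ET.fromstring(xml_text)
    return dict((attr.get("id"), attr.get("type"))
                for attr in root.iter("attribute"))


declared = attribute_ids(SCHEMA_FRAGMENT)
```

Comparing the keys of `declared` with the field names used in a .bib file would show up front which fields still need an entry in the schema.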

Having accomplished this, I have the first inklings of a picture in
my mind of how the core-api is probably intended to be used. If you have
read this far, then I will really push my luck and run a few things past
you for confirmation or correction.

It seems that:

1. The key entities in the core-api are these database objects. All
bibliographic citations exist as records in a database.
2. The fields in a given database are determined by the schema with
which it was created (via the dbcreate() method).
3. The schema is typically read from an xml document.
4. Databases may be saved on disk in a number of formats (I used "file",
which produced xml, but garlic uses "bsddb").
5. The core-api also defines an abstract Record class which simplifies
the manipulation of individual records, irrespective of the
corresponding database schema or the format of the underlying
database.

Am I on the right track here? I haven't really looked into things much
further yet, but presumably an application could contain more than one
database simultaneously, individual records could be moved from one
to another, whole databases could be merged etc. Is it up to individual
applications to provide the database schema(s) and the concrete Record
classes?

Getting back to my simple example script above, is there a way to have
the bibtex importer simply skip fields in the bibtex entries for which
there is no corresponding entry in the core-api database schema? For
example, if I wanted to compile some statistics based on a collection of
bibtex citations, I might only be interested in say the first author,
the title and the year of publication of each citation. I could define a
schema containing only those fields and the bibtex importer would
(should?) silently discard everything else, like journal names, volume
and page numbers etc.
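
To make the behaviour I have in mind concrete, here is a minimal sketch in plain Python. It does not use core-api at all: filter_fields(), SCHEMA_FIELDS and the sample entry are hypothetical names and data invented for illustration, and I don't know whether the real importer can be made to work this way.

```python
# Plain-Python sketch; none of these names come from Pyblio's core-api.

def filter_fields(entry, schema_fields):
    """Split a parsed entry into the fields the schema knows and the rest.

    Returns (kept, dropped): kept is a dict of the retained fields,
    dropped is a sorted list of the field names that were discarded.
    """
    kept = {}
    dropped = []
    for name, value in entry.items():
        key = name.lower()  # bibtex field names are case-insensitive
        if key in schema_fields:
            kept[key] = value
        else:
            dropped.append(key)
    dropped.sort()
    return kept, dropped


# Hypothetical schema holding only the fields needed for the statistics.
SCHEMA_FIELDS = set(["author", "title", "year"])

# Hypothetical parsed @article entry.
entry = {
    "author": "A. Author and B. Writer",
    "title": "An Example Title",
    "journal": "Journal of Examples",
    "Volume": "12",
    "pages": "1--10",
    "year": "2004",
}

kept, dropped = filter_fields(entry, SCHEMA_FIELDS)
```

Returning the dropped names as well, rather than discarding them completely silently, would make it easy for a caller to log what was skipped.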

Along a similar line, what happens when the schema defines an attribute
for which there is no bibtex analog? For example, if I wanted to sort my
citations into categories, I would be tempted to add a category
attribute to the schema, something like that used by garlic. Any
citations I import from a bibtex database obviously don't include a
category field, so I would like them to be placed into a default
"unsorted" category. Maybe the way to do this is with a <default> tag in
the database schema?
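
Again as a plain-Python sketch of the effect I am after, not of how core-api does (or would) implement it: apply_defaults(), DEFAULTS and the sample records below are hypothetical, and whether a <default> tag exists in the schema format is exactly what I am asking.

```python
# Plain-Python sketch; none of these names come from Pyblio's core-api.

def apply_defaults(record, defaults):
    """Return a copy of record with missing attributes filled from defaults.

    Values already present in record always win over the defaults,
    and neither input dict is modified.
    """
    filled = dict(defaults)
    filled.update(record)
    return filled


# Hypothetical schema-level defaults.
DEFAULTS = {"category": "unsorted"}

# A record as it might come out of a bibtex import: no category field.
imported = {"author": "A. Author", "title": "An Example Title"}

# A record that has already been filed by hand.
already_sorted = {"title": "Another Title", "category": "methods"}

filled_imported = apply_defaults(imported, DEFAULTS)
filled_sorted = apply_defaults(already_sorted, DEFAULTS)
```

With this approach a freshly imported citation ends up in "unsorted", while a citation that already carries a category keeps it.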

That's plenty for now,

e.