[pyblio] architecture proposal

Hi,

I haven't read all the references from the recent discussion in depth,
and I think I'll rather rely on your expertise on this topic, as I don't
even have access to many of those databases. Anyway, I would like to
share how I currently imagine the architecture of pyblio 1.2: please
tell me if you see any problems or limitations regarding features that
are important to you in pyblio.

1. Storage

Pyblio will have its native storage format. This format will have to
support the storage of entries taken from another format, with no
loss. This format will also support all the extra information that
pyblio will be able to manage: keywords, lists of journal names,
cross-references,...

I propose not to rely on an existing format, as the proposed ones seem
either too complicated, or not able to fit this scheme. So let's go and
define our own, so that we master it completely. Since this is not one
of the areas where XML is a bad choice, let's use it, especially as
recent Python distributions ship with XML support.

I can imagine something like:

<pybliodb>

 <topic> ... </topic>

 <common>

   <person id="gobry">
     <name>Gobry</name>
     <surname>Véronique</surname>
     <initials>V.</initials>
   </person>

   <text id="jacs">Journal of American Chemistry Society</text>

 </common>

 <entry id="GBC+00" type="article">

   <field name="author">
     <person ref="gobry"/>
     <person>
       <name>...
     </person>
   </field>
   <field name="title"><text>The subject</text></field>
   <field name="journal"><text ref="jacs"/></field>

   <original type="bibtex">
     @string{jacs = "Journal of American Chemistry Society"}

     @article{GBC+00,
       author  = {Gobry, Véronique},
       title   = "The Subject",
       journal = jacs
     }
   </original>

 </entry>
</pybliodb>

The rationale behind duplicating information in the <original/> tag is
to make it possible to return an entry in its original form as long as
it has not been modified in pyblio, while at the same time providing a
consolidated view of the data (parsed names, dates,...) to the rest of
the application.
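
To make the consolidated view more concrete, here is a minimal sketch in
Python of how a reader could resolve the shared <common/> values when
loading entries. It relies on the standard xml.etree.ElementTree module;
the function name load_entries and the choice to flatten a person into a
"name, surname" string are only assumptions for illustration, not a
settled design.

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the proposed format (no <original/> block).
SAMPLE = """\
<pybliodb>
 <common>
   <person id="gobry">
     <name>Gobry</name>
     <surname>Véronique</surname>
     <initials>V.</initials>
   </person>
   <text id="jacs">Journal of American Chemistry Society</text>
 </common>
 <entry id="GBC+00" type="article">
   <field name="author"><person ref="gobry"/></field>
   <field name="title"><text>The subject</text></field>
   <field name="journal"><text ref="jacs"/></field>
 </entry>
</pybliodb>
"""


def load_entries(xml_source):
    """Return {entry id: {field name: [values]}} with refs resolved."""
    root = ET.fromstring(xml_source)
    # Index every shared value found under <common/> by its id attribute.
    common = {el.get("id"): el for el in root.findall("./common/*")}

    def value_of(el):
        # A ref attribute points to a shared element; otherwise the
        # element carries its own content.
        if el.get("ref") is not None:
            el = common[el.get("ref")]
        if el.tag == "person":
            return "%s, %s" % (el.findtext("name"), el.findtext("surname"))
        return el.text

    entries = {}
    for entry in root.findall("entry"):
        fields = {}
        for field in entry.findall("field"):
            fields[field.get("name")] = [value_of(child) for child in field]
        entries[entry.get("id")] = fields
    return entries


entries = load_entries(SAMPLE)
```

Every field maps to a list, so a field with several authors and a field
with a single title are handled the same way by the caller.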

I also don't think it is a good idea to create tags for specific parts
of a description (for instance, a <title> or <author> tag), as that
makes it cumbersome to customize a database with specific fields. I
prefer having a few base types (a person, a date, a date range,...) and
possibly a description of what is correct and what is not (a journal
entry must have a journal name, an author, a title,...). BTW, do you
think such a description should be placed in the database itself?
This is good for file exchange, but it is maybe a bit overkill for
everyday use?
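
Such a description could be as simple as a table of required fields per
entry type. The sketch below is only an illustration: the REQUIRED table
and the missing_fields helper are hypothetical names, and the list of
fields for "article" just repeats the example given above.

```python
# Hypothetical description of what makes an entry correct:
# the set of field names each entry type must provide.
REQUIRED = {
    "article": {"author", "title", "journal"},
}


def missing_fields(entry_type, fields):
    """Return the sorted list of required fields absent from an entry.

    Entry types without a description are accepted as they are.
    """
    required = REQUIRED.get(entry_type, set())
    return sorted(required - set(fields))
```

Whether this table lives in the database file or in a separate
configuration only changes where REQUIRED is loaded from.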

2. Data types

What are the elementary data types that pyblio must understand to do
its job fully?

	- person description (are last name, middle name, first name and
	  lineage enough for everybody?)

	- date

	- date range

	- number

	- number range

	- rich text (titles need to handle superscripts,
	  subscripts,...)

	- simple text

Text must be stored as Unicode so that pyblio can handle languages
beyond those covered by Latin-1.
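
As a rough sketch of a few of these types in Python (the class and
attribute names are assumptions of mine, nothing is decided), the person
and the ranges could look like this. The small parse_number_range helper
shows how a BibTeX-style page range such as "123--130" would fit the
number range type.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Person:
    last: str
    first: Optional[str] = None
    middle: Optional[str] = None
    lineage: Optional[str] = None  # for instance "Jr."


@dataclass(frozen=True)
class DateRange:
    start: date
    end: date

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("date range ends before it starts")


@dataclass(frozen=True)
class NumberRange:
    start: int
    end: int

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("number range ends before it starts")


def parse_number_range(text):
    """Parse "42", "123-130" or "123--130" into a NumberRange."""
    first, _, last = text.partition("-")
    last = last.lstrip("-").strip()
    start = int(first)
    # A single number is treated as a range of length one.
    end = int(last) if last else start
    return NumberRange(start, end)
```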

3. Internal manipulation

Once parsed, any format must fit a single representation (as opposed to
now, where every format can behave a bit differently), which is close
to the native format. To reconcile the needs of people who manipulate
small databases with those of people who have large sets of entries
shared by many users, it might be worthwhile to use a relational
database as the actual processing backend: a lightweight temporary
database like gadfly for people who don't care, and the ability to plug
the system into PostgreSQL, for instance, on larger configurations.

I think that the current queries and the proposed data types are
suitable for efficient processing in a real DB, but this is yet to be
tested.
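
To give an idea of what this could look like, here is a minimal sketch
using an in-memory SQLite database from the Python standard library as a
stand-in for the lightweight backend (it is not gadfly, and the table
layout and function names are assumptions). One generic table of fields
keeps the schema independent of the entry types.

```python
import sqlite3


def open_backend(path=":memory:"):
    """Create the (hypothetical) two-table layout and return a connection."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE entry (id TEXT PRIMARY KEY, type TEXT NOT NULL);
        CREATE TABLE field (
            entry_id TEXT NOT NULL REFERENCES entry(id),
            name     TEXT NOT NULL,
            position INTEGER NOT NULL,
            value    TEXT NOT NULL
        );
    """)
    return db


def add_entry(db, entry_id, entry_type, fields):
    """Store one entry; fields maps a field name to a list of values."""
    db.execute("INSERT INTO entry VALUES (?, ?)", (entry_id, entry_type))
    for name, values in fields.items():
        for position, value in enumerate(values):
            db.execute("INSERT INTO field VALUES (?, ?, ?, ?)",
                       (entry_id, name, position, value))


def entries_by_author(db, author):
    """Return the ids of the entries listing exactly this author."""
    rows = db.execute(
        "SELECT DISTINCT entry_id FROM field "
        "WHERE name = 'author' AND value = ? ORDER BY entry_id",
        (author,))
    return [row[0] for row in rows]


db = open_backend()
add_entry(db, "GBC+00", "article",
          {"author": ["Gobry, Véronique"], "title": ["The subject"]})
```

Switching to a larger server would then mostly mean replacing the
connection, as long as the queries stay within plain SQL.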

4. Front-ends

The text-based interface has been left behind during the development of
the GUI. Maybe it's time to see if a proper abstraction could be
written so that multiple front-ends can be developed with a minimum of
rewriting. The minimum should be a curses and a Gnome front-end, to be
extended according to the interests of the people involved in the
development.
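
One possible shape for such an abstraction, sketched in Python: the
application logic only talks to a small interface, and each front-end
implements it. All the names here (FrontEnd, Application,
RecordingFrontEnd) are hypothetical, and the recording class merely
stands in for a real curses or Gnome implementation.

```python
from abc import ABC, abstractmethod


class FrontEnd(ABC):
    """What the application core expects from any user interface."""

    @abstractmethod
    def show_entries(self, keys):
        """Display the entries identified by keys."""

    @abstractmethod
    def report(self, message):
        """Display a short message to the user."""


class Application:
    """Front-end independent logic; entries maps a key to a title."""

    def __init__(self, frontend, entries):
        self.frontend = frontend
        self.entries = entries

    def search(self, word):
        matches = [key for key, title in sorted(self.entries.items())
                   if word.lower() in title.lower()]
        if matches:
            self.frontend.show_entries(matches)
        else:
            self.frontend.report("no match for %r" % word)
        return matches


class RecordingFrontEnd(FrontEnd):
    """Stand-in front-end that only records what it is asked to do."""

    def __init__(self):
        self.calls = []

    def show_entries(self, keys):
        self.calls.append(("show_entries", keys))

    def report(self, message):
        self.calls.append(("report", message))
```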

5. Filters / Web queries

I need some feedback on how to make the development of filters and
external query mechanisms easier. There is certainly a lot of redundancy
to remove, but I haven't looked at it yet.

6. Formatting

There used to be a "Format" feature that aimed at doing the same work
as bibtex. I think it is still an important feature, but in its current
form it is not particularly well suited to the following tasks:

  - easy creation of new formats for specific journals for instance
  - connection with word processors

Here again, I need some feedback from people who have good experience
with commercial software in this area, so that we can find out what
must be done.

Roadmap:

1. discussing the previous points to check that nothing has been left
out

2. maybe starting by modifying the internal data types and creating a
first draft of the native file format (without the <common/> values for
instance)

3. migrating toward the database system with the use of real references

This is a request for comments, so don't hesitate!

Frédéric