Thread: [Pybliographer-devel] Load bibtex and then save bibtex - problem

[Pybliographer-devel] Load bibtex and then save bibtex - problem

From: Samuel J. <mail@SamuelJohn.de> - 2007-08-14 15:30:18

Hi there!

I use pybliographer 1.3.3 as it can be downloaded from sourceforge and
just want to try to load a bibtex file and save it again. But I am not
able to get it working. Perhaps someone here can help. My code:


------------------------------------

import sys, os

in_f, temp_f, out_f = sys.argv[1:4]

from Pyblio.Parsers.Semantic import BibTeX
from Pyblio import Store, Registry, Adapter

# Get the default parser, though I do not know why this is needed and what
# are the other options...
Registry.parse_default()

# Get the schema associated with the default bibtex parser
bibtex_schema = Registry.getSchema("org.pybliographer/bibtex/0.1")

# We need to ensure the file does not exist yet.
try: os.unlink(out_f)
except OSError: pass

# Create a new db using this schema. I'd like to use an in-memory storage but
# somehow then the parsing does not work.
db = Store.get('file').dbcreate(temp_f, bibtex_schema)

# Import the content of the bibtex file into it
reader = BibTeX.Reader()
reader.parse( open(in_f), db )


# To save it as BibTeX, we get an
# adapter from PubMed to BibTeX
bibtex = Adapter.adapt_schema(db, 'org.pybliographer/bibtex/0.1')

# Now we have a "virtual" bibtex database, that we can actually
# save as a BibTeX file
from Pyblio.Parsers.Semantic.BibTeX import Writer

w = Writer()
w.write(open(out_f, 'w'), bibtex.entries, bibtex)

------------------------------

Is there an easier way to accomplish this?
And more important has anyone a clue why I get the following error:


    w.write(open(out_f, 'w'), bibtex.entries, bibtex)
AttributeError: 'NoneType' object has no attribute 'entries'



Some questions arise:
- Why do I need to call parse_default() ?
- Why does the parsing of bibtex fail when using a 'memory' database?
(that is the reason why I use the temp_f file.)
- Are new fields in the bibtex file supported? For example there is a
key called "doi"
  that stands for "digital object identifier" in some of my test
bibtex files that cannot be loaded due to an unknown "doi" element.


best regards
 Samuel

-- 
 Dipl. Inf. Samuel John
 Ph.D. student at
   Faculty of Technology, Neuroinformatics Group,
   Bielefeld University, 33501 Bielefeld, Germany
 in cooperation with
   HONDA Research Institute Europe GmbH
   Carl-Legien-Straße 30, D-63073 Offenbach/Main, Germany

Re: [Pybliographer-devel] Load bibtex and then save bibtex - problem

From: <go...@pu...> - 2007-08-14 16:28:35

> # Get the default parser, though I do not know why this is needed and what
> # are the other options...
> Registry.parse_default()

This is needed to bootstrap the registry with the default set of
schemas, adapters, formatters,... In an application, you could have
specific directories holding these definitions, which you would load
explicitly. The parse_default() method loads from predefined
locations (Pyblio/RIP/ mainly). I don't like the idea of doing this
unconditionally when Pyblio is imported (it's an application-level
behavior, not a library-level one).

> # Get the schema associated with the default bibtex parser
> bibtex_schema = Registry.getSchema("org.pybliographer/bibtex/0.1")
>
> # We need to ensure the file does not exist yet.
> try: os.unlink(out_f)
> except OSError: pass
>
> # Create a new db using this schema. I'd like to use an in-memory storage but
> # somehow then the parsing does not work.
> db = Store.get('file').dbcreate(temp_f, bibtex_schema)

The memory store should work fine with the fix below.

> # Import the content of the bibtex file into it
> reader = BibTeX.Reader()
> reader.parse( open(in_f), db )
>
>
> # To save it as BibTeX, we get an
> # adapter from PubMed to BibTeX
> bibtex = Adapter.adapt_schema(db, 'org.pybliographer/bibtex/0.1')

You only need to adapt from one schema to another, which is not the
case here. I'll fix the code so that it's a no-op to call
adapt_schema() with the same schema.

> # Now we have a "virtual" bibtex database, that we can actually
> # save as a BibTeX file
> from Pyblio.Parsers.Semantic.BibTeX import Writer
>
> w = Writer()
> w.write(open(out_f, 'w'), bibtex.entries, bibtex)

This code works for me if I get rid of the adapter (and then I can use
the memory store too).

> Is there an easier way to accomplish this?

There was already a short discussion about providing a simpler api for
simple tasks. I'm completely ok with this, it's just that I don't have
the time to work on it right now. Patches welcome :) Doug (cc'ed), did
you have an opportunity to work on this? Typical tasks should be
defined first, but this comes with usage. Opening a file of a specific
format looks like a good candidate, and this layer could arguably run
parse_default() when it is imported.

>     w.write(open(out_f, 'w'), bibtex.entries, bibtex)
> AttributeError: 'NoneType' object has no attribute 'entries'

That was caused by adapt_schema returning None (I've fixed that: it now
returns db if no conversion is needed, and raises an exception in
case of error).
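
A minimal sketch of the behaviour described above, for illustration only.
It is not Pyblio's actual implementation: the Database class, the ADAPTERS
table and the AdaptationError exception are hypothetical stand-ins. The
point is simply that the function never returns None: it hands back the
same db when source and target schema match, and raises otherwise.

```python
class AdaptationError(Exception):
    """Hypothetical error raised when no conversion is available."""


class Database(object):
    """Stand-in for a database object; it only carries a schema id."""

    def __init__(self, schema_id):
        self.schema_id = schema_id


# Hypothetical lookup table: (source schema, target schema) -> converter
ADAPTERS = {}


def adapt_schema(db, target_schema_id):
    # Same schema on both sides: nothing to convert, return db itself.
    if db.schema_id == target_schema_id:
        return db
    try:
        convert = ADAPTERS[(db.schema_id, target_schema_id)]
    except KeyError:
        # Fail loudly instead of silently returning None.
        raise AdaptationError(
            "no adapter from %r to %r" % (db.schema_id, target_schema_id))
    return convert(db)
```

With this shape, the caller's bibtex.entries access either works or the
call fails with a message that names both schemas.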

> - Are new fields in the bibtex file supported? For example there is a
> key called "doi"
>   that stands for "digital object identifier" in some of my test
> bibtex files that cannot be loaded due to an unknown "doi" element.

This requires a multi-layered answer :) Yes, you can simply add "doi"
to the bibtex schema, and support it in the reader and writer classes.
This is the correct thing to do for "regular" bibtex fields, i.e. fields
that are generic enough for everybody (doi probably deserves this
status). For less generic tags, the correct idea is to have a more
specific schema, and derive your own parser that handles your specific
tags.

BibTeX has the particularity (when compared with other formats like
RIS,...) that everybody can add their own tags. This is very convenient
locally, but can be annoying when exchanging data. So far, pyblio-1.3
does not support arbitrary tags very well, because I did not come up
with a nice way to support them without losing the advantages of
having a schema.

-- 
Frédéric

Re: [Pybliographer-devel] Load bibtex and then save bibtex - problem

From: Samuel J. <mail@SamuelJohn.de> - 2007-08-15 09:09:53

Hello!

Ok, the need for Registry.parse_default() is understandable and ok, I
just wish there were some help/documentation that tells me what and why.
Now that you have explained it, I fully agree. It's just that the name
"init_default" or "load_defaults" may perhaps be better suited for the
developer.


> > # Get the schema associated with the default bibtex parser
> The memory store should work fine with the fix below

Ok, the memory store works for me too, now.


> You only need to adapt from one schema to another, which is not the
> case here. I'll fix the code so that it's a no-op to call
> adapt_schema() with the same schema.

Yep, alright. That was just because I did not know better and was not
able to figure it out on my own.


> > w = Writer()
> > w.write(open(out_f, 'w'), bibtex.entries, bibtex)
>
> This code works for me if I get rid of the adapter (and then I can use
> the memory store too).

Well, sadly I still get an error, but a different one (see my code at the
bottom of this mail as reference. I use the sample.bib you provided
with the distribution):
  File "bibtexnorm.py", line 25, in ?
    w.write(open(out_f, 'w'), db.entries, db)
  File "/home/sjohn/lib/python/pybliographer-1.3.3-py2.4.egg/Pyblio/Parsers/Syntax/BibTeX/__init__.py", line 610, in write
    self.record_begin ()
  File "/home/sjohn/lib/python/pybliographer-1.3.3-py2.4.egg/Pyblio/Parsers/Syntax/BibTeX/__init__.py", line 569, in record_begin
    self.key = str (self.record ['id'] [0])

I use python 2.4.1 here on my system (I don't have admin rights and
have to wait until 2.5 is installed, but I think 2.4.x should do it).
I don't know how to fix this. The dependencies (cElementTree, NumPy)
are working fine and their self-tests are ok.




> > Is there an easier way to accomplish this?
>
> There was already a short discussion about providing a simpler api for
> simple tasks. I'm completely ok with this, it's just that I don't have
> the time to work on it right now. Patches welcome :) Doug (cc'ed), did
> you have an opportunity to work on this? Typical tasks should be
> defined first, but this comes with usage. Opening a file of a specific
> format looks like a good candidate, and this layer could arguably run
> parse_default() when it is imported.

That would be a nice addition but I am fine with the way it is now
since I don't think it is too complicated, once you know what you have
to write. The steps are clear: 1. choose a schema, 2. setup the
database, 3. parse some content 4. use the writer to write to a file.


>
> >     w.write(open(out_f, 'w'), bibtex.entries, bibtex)
> > AttributeError: 'NoneType' object has no attribute 'entries'
>
> That was caused by adapt_schema returning None (I've fixed that,
> returning db if no conversion is needed, and raising an exception in
> case of Error)

Yes, you are right. This one is fixed now that I use the db instead
of the superfluous adapter. But (see above) another error shows up.


>
> > - Are new fields in the bibtex file supported? For example there is a
> > key called "doi"
> >   that stands for "digital object identifier" in some of my test
> > bibtex files that cannot be loaded due to an unknown "doi" element.
>
> This requires a multi-layered answer :) Yes, you can simply add "doi"
> to the bibtex schema, and support it in the reader and writer classes.

I will definitely need this, so I need to find out how to do it.


> This is the correct thing to do for "regular" bibtex fields, ie fields
> that are generic enough for everybody (doi probably deserves this
> status). For less generic tags, the correct idea is to have a more
> specific schema, and derive your own parser that handles your specific
> tags.

Hmmm ...

>
> BibTeX has the particularity (when compared with other formats like
> RIS,...) that everybody can add his own tags.

Yes, I know, and we use this for some additions.

[...]
> can be annoying when exchanging data. So far, pyblio-1.3
> does not support arbitrary tags very well, because I did not come up
> with a nice way to support them without losing the advantages of
> having a schema.

I agree that conversion to and from other formats will be
hard/impossible(?) with arbitrary fields. But I have to find a
solution for this, otherwise I cannot use pybliographer for our
needs... :-(



Any ideas on this and the bug from above?





Here follows the code as it is right now. I load the sample.bib as
in_f and just want to write it as bibtex into an output file.

----------------------------------
import sys, os

in_f, out_f = sys.argv[1:3]

from Pyblio.Parsers.Semantic import BibTeX
from Pyblio import Store, Registry
from Pyblio.Parsers.Semantic.BibTeX import Writer

Registry.parse_default()
bibtex_schema = Registry.getSchema("org.pybliographer/bibtex/0.1")
db = Store.get('memory').dbcreate(None, bibtex_schema)

# We need to ensure the file does not exist yet.
try: os.unlink(out_f)
except OSError: pass


# Import the content of the bibtex file (in_f) into it
reader = BibTeX.Reader()
reader.parse( open(in_f), db )


w = Writer()
w.write(open(out_f, 'w'), db.entries, db)
----------------------------------



Thanks & cheers
 Samuel

Re: [Pybliographer-devel] Load bibtex and then save bibtex - problem

From: <go...@pu...> - 2007-08-15 15:31:49

> Ok the need for Registry.parse_default() is understandable and ok, I
> just wish there were some help/docu that tells me what and why. Now as
> you explained it, I fully agree. It's just that the name
> "init_default" or "load_defaults" may perhaps be better suited for the
> developer.

I suck at naming :) (which is a problem when writing APIs...) I've
renamed it, as this is better done now than later:

  load(directory)
  load_default_directories()

> Yep, alright. That was just because I did not know better and was not
> able to figure it out on my own.

I haven't checked the user documentation for a while, I probably need
to revisit it with these new classes.

>
>
> > > w = Writer()
> > > w.write(open(out_f, 'w'), bibtex.entries, bibtex)
> >
> > This code works for me if I get rid of the adapter (and then I can use
> > the memory store too).
>
> Well, sadly I have still an error, but another one (see my code at the
> bottom of this mail as reference. I use the sample.bib you provided
> with the distribution):
>   File "bibtexnorm.py", line 25, in ?
>     w.write(open(out_f, 'w'), db.entries, db)
>   File "/home/sjohn/lib/python/pybliographer-1.3.3-py2.4.egg/Pyblio/Parsers/Syntax/BibTeX/__init__.py", line 610, in write
>     self.record_begin ()
>   File "/home/sjohn/lib/python/pybliographer-1.3.3-py2.4.egg/Pyblio/Parsers/Syntax/BibTeX/__init__.py", line 569, in record_begin
>     self.key = str (self.record ['id'] [0])
>
> I use python 2.4.1 here on my system (I don't have admin rights and
> have to wait until 2.5 is installed, but I think 2.4.x should do it).

Yes, I try to be a bit conservative wrt python versions, and even 2.3
should do it. The trace is probably cut, I don't see the actual
exception being raised. In any case, the code you attached works fine
with sample.bib here. Do you have problems when you run pyblio's
testsuite?

> > > - Are new fields in the bibtex file supported? For example there is a
> > > key called "doi"
> > >   that stands for "digital object identifier" in some of my test
> > > bibtex files that cannot be loaded due to an unknown "doi" element.
> >
> > This requires a multi-layered answer :) Yes, you can simply add "doi"
> > to the bibtex schema, and support it in the reader and writer classes.
>
> I will definitely need this, so I need to find out how to do it.
>
>
> > This is the correct thing to do for "regular" bibtex fields, ie fields
> > that are generic enough for everybody (doi probably deserves this
> > status). For less generic tags, the correct idea is to have a more
> > specific schema, and derive your own parser that handles your specific
> > tags.
>
> Hmmm ...
>
> >
> > BibTeX has the particularity (when compared with other formats like
> > RIS,...) that everybody can add his own tags.
>
> Yes, I know, and we use this for some additions.
>
> [...]
> > can be annoying when exchanging data. So far, pyblio-1.3
> > does not support arbitrary tags very well, because I did not come up
> > with a nice way to support them without losing the advantages of
> > having a schema.
>
> I agree that conversion to and from other formats will be
> hard/impossible(?) with arbitrary fields. But I have to find a
> solution for this, otherwise I cannot use pybliographer for our
> needs... :-(

So your workflow implies managing bibtex entries with completely
arbitrary fields that you need to "properly" handle? Is it good enough
to assume these fields are all of type Text? Do you need to actually
touch these fields or is it enough not to drop them?

-- 
Frédéric

Re: [pyblio-devel] [Pybliographer-devel] Load bibtex and then save bibtex - problem

From: Samuel J. <mail@SamuelJohn.de> - 2007-08-15 16:50:46

On 8/15/07, Frédéric Gobry <go...@pu...> wrote:
>   load(directory)
>   load_default_directories()

Though this reflects what is done, I (as a user of pyblio) still
cannot guess why to load any directory first. load_settings(fromDir)
and load_default_settings() could be an option...
Then it would be intuitively clear to load the settings first and then
perform other actions...
just my 2 cents.

> I haven't checked the user documentation for a while, I probably need
> to revisit it with these new classes.

I guess that would be great!

> The trace is probably cut, I don't see the actual
> exception being raised.

That trace was all I got on the console; no further
exception is shown.

> Do you have problems when you run pyblio's
> testsuite?

./testsuite.sh
'ut_crossref.py': missing dependency No module named twisted.web
warning: only testing the file store: bsddb is too old ((4, 3, 0, 0, 0) instead of (4, 3, 3, 0, 0))
'ut_pubmed.py': missing dependency No module named twisted.trial
warning: only testing the file store: store 'bsddb' is not available
'ut_wok.py': missing dependency No module named twisted.web
unittest:
........................................................................
........................................................................
...................
unittest: [163 tests in 1.517s]
unittest: OK

That seems ok besides the fact that I don't have (and don't need)
twisted and bsddb. My code just uses the plain file store.
Tomorrow I'll try that arch program to get the latest dev branch.

> So your workflow implies managing bibtex entries with completely
> arbitrary fields that you need to "properly" handle? Is it good enough
> to assume these fields are all of type Text? Do you need to actually
> touch these fields or is it enough not to drop them?

Text as field type is fine. It would be ok if these fields
 a) do not produce an error on reading a bibtex file and
 b) show up again when saving to bibtex.

I do _not_ need to access (and modify) them from within pyblio.

best regards
 Samuel

-- 
 Dipl. Inf. Samuel John
 Ph.D. student at
   Faculty of Technology, Neuroinformatics Group,
   Bielefeld University, 33501 Bielefeld, Germany
 in cooperation with
   HONDA Research Institute Europe GmbH
   Carl-Legien-Straße 30, D-63073 Offenbach/Main, Germany

Re: [pyblio-devel] [Pybliographer-devel] Load bibtex and then save bibtex - problem

From: <go...@pu...> - 2007-08-15 17:42:04

> Though this reflects what is done, I (as a user of pyblio) still
> cannot guess why to load any directory first. load_settings(fromDir)
> and load_default_settings() could be an option...
> Then it would be intuitively clear to load the settings first and then
> perform other actions...
> just my 2 cents.

Adopted :)

> > I haven't checked the user documentation for a while, I probably need
> > to revisit it with these new classes.
>
> I guess that would be great!

Feel free to comment / update if you notice problems; as a newcomer
you're in a good position to detect issues there.

> > The trace is probably cut, I don't see the actual
> > exception being raised.
>
> That trace was all I get on the console, no further
> exception is raised.

What I mean is that there is usually the name of the actual exception
that happened (AttributeError, KeyError,...) at the end of the trace.
It's weird that you don't have it.
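
One way to make sure the exception name is not lost is to catch the error
and print the formatted traceback yourself. This is only a sketch:
failing_step is a hypothetical stand-in for the real w.write(...) call,
and the KeyError it raises is an arbitrary example, not a claim about what
actually went wrong in the script above.

```python
import traceback


def failing_step():
    # Stand-in for the real call (e.g. w.write(...)); the KeyError raised
    # here is only an example failure.
    record = {}
    return record['id'][0]


def run_and_report(step):
    """Run step(); return the full traceback text if it raised, else None."""
    try:
        step()
    except Exception:
        # format_exc() includes the final line naming the exception type.
        return traceback.format_exc()
    return None


report = run_and_report(failing_step)
if report is not None:
    print(report)
```

The last line of the printed report carries the exception type and
message, which is the part missing from the trace quoted above.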

> That seems ok besides the fact that I don't have (and don't need)
> twisted and bsddb. My code just uses the plain file store.

Fair enough

> Tomorrow I'll try that arch program to get the latest dev branch.

Yes please, that will make things more comparable.

> > So your workflow implies managing bibtex entries with completely
> > arbitrary fields that you need to "properly" handle? Is it good enough
> > to assume these fields are all of type Text? Do you need to actually
> > touch these fields or is it enough not to drop them?
>
> Text as field type is fine. It would be ok if these fields
>  a) do not produce an error on reading a bibtex file and
>  b) show up again, when saving to bibtex.
>
> I do _not_ need to access (and modify) them from within pyblio.

Ok. There is a Blob type I planned to introduce, which might be a
better match for this (there will be no attempt to index & search
them).

As this is a common request, I'll create a derived BibTeX Reader and
Writer, with a different schema that is basically the standard schema
+ a blob field, and the Reader & Writers will use the content of the
blob to store the extra fields. I'll keep you posted.
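
To make the idea concrete, here is a rough sketch of how a reader could
set aside unknown fields and a writer could put them back. It is an
assumption about the general approach, not the planned Pyblio code:
KNOWN_FIELDS is an illustrative subset rather than the real schema, and
split_fields / merge_fields are hypothetical helper names.

```python
# Illustrative subset only; the real bibtex schema defines many more fields.
KNOWN_FIELDS = set(["author", "title", "year", "journal"])


def split_fields(fields):
    """Separate schema fields from extra ones (the would-be blob content)."""
    known, extra = {}, {}
    for name, value in fields.items():
        key = name.lower()
        if key in KNOWN_FIELDS:
            known[key] = value
        else:
            # Kept as opaque text: not validated, indexed or searched.
            extra[key] = value
    return known, extra


def merge_fields(known, extra):
    """Recombine both parts when writing the entry back out."""
    merged = dict(known)
    merged.update(extra)
    return merged


entry = {"Author": "Tom", "title": "A title", "doi": "10.1000/example"}
known, extra = split_fields(entry)
restored = merge_fields(known, extra)
```

Here the "doi" field ends up in extra on reading and reappears in
restored on writing, which matches requirements a) and b) quoted above.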

-- 
Frédéric

Re: [pyblio-devel] [Pybliographer-devel] Load bibtex and then save bibtex - problem

From: Doug B. <dou...@gm...> - 2007-08-16 13:04:00

Samuel's original code and confusion were almost identical to mine from a
month or so ago.

I think the main problem is a general use case that is quite different from
the way pybliographer was originally designed. Basically, the pattern is
very simple:

1) open a bib file of some type (regardless of schema) in memory
2) do something to it
3) write it out in same or different format

Many people use, say, bibtex as a sort of XML: they make up fields to store
values. This doesn't work too well with a paradigm based on schemas. If there
was an easy to use API for the above, that would be great.

In the end, I only used Pyblio.Parsers.Syntax.BibTeX.Parser and then store
common fields in a general database. The results (still in alpha) can be
seen here:

http://myro.roboteducation.org/~dblank/reference/

-Doug

Re: [pyblio-devel] [Pybliographer-devel] Load bibtex and then save bibtex - problem

From: Samuel J. <mail@SamuelJohn.de> - 2007-08-16 13:37:28

Hi Doug,




> Many people use, say, bibtex as a sort of XML: they make up fields to store
> values. This doesn't work too well with a paradigm based on schemas. If there
> was an easy to use API for the above, that would be great.

My use case is indeed very simple - in theory.

Multiple people can open a bibtex file that is under version control
(subversion).
Different GUI clients introduce different mark-up and sorting of the
entries (think of line breaks and so on). These changes do not affect
the content itself but will lead to a lot of changes in the subversion
versioning system, potentially leading to conflicts.
My idea is: before the svn commit I run a script that normalizes
the whole bibtex file so that only changes in the content are
reflected. If everyone uses that script before check-in, there should
be no such superfluous changes due to markup stuff.

Consider this
@ARTICLE{tom2007, ... }
and this
@article {tom2007,
              ... }

both are the same entry and are valid bibtex, but subversion makes a
difference here.
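
As a small illustration of this kind of normalization, here is a sketch
that only rewrites the entry header (entry type and key). It is a naive
regular-expression approach and an assumption on my side, not something
Pyblio provides: normalize_entry_heads is a hypothetical name, and field
layout, sorting and line breaks inside the entry would still need a real
parser.

```python
import re

# Matches "@TYPE {key," with arbitrary case and spacing. Blocks such as
# @string{name = "..."} do not match because no comma follows the name.
_ENTRY_HEAD = re.compile(r"@\s*([A-Za-z]+)\s*\{\s*([^\s,{}]+)\s*,")


def normalize_entry_heads(text):
    """Lower-case entry types and drop the spacing around '{' and the key."""
    def fix(match):
        return "@%s{%s," % (match.group(1).lower(), match.group(2))
    return _ENTRY_HEAD.sub(fix, text)


first = normalize_entry_heads("@ARTICLE{tom2007, title = {X} }")
second = normalize_entry_heads("@article {tom2007,\n   title = {X} }")
```

After this step both variants start with the same "@article{tom2007,"
header; the remaining difference is the whitespace inside the entry body.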


>
> In the end, I only used Pyblio.Parsers.Syntax.BibTeX.Parser
> and then store common fields in a general database. The results (still in
> alpha) can be seen here:

I thought about that, too ...

>
>  http://myro.roboteducation.org/~dblank/reference/

I do find a lot of things there but where exactly is the bibtex stuff?

cheers
 Samuel