Re: [Obo-discuss] A few questions on the format specification

Marco Antoniotti wrote:

> I am writing a parser for the OBO format for our GOALIE application and
> I have a few remarks to make about the format specification
> 'http://www.geneontology.org/GO.format.shtml#oboflat'.  It is small
> fry, but it would help if these issues were clarified.  Please don't
> take this message as bad criticism: it is well meant and written with
> good intentions.

Thanks for your input, Marco.

I agree with Chris Mungall's earlier reply: if you are writing your parser in 
Java or Perl, you might want to look at the parsing libraries we've made 
available. There is an easy-to-use, event-based parser for Java that might 
interest you...

Something to keep in mind as I go through your individual examples: The OBO 
1.2 specification clarifies a number of general parsing questions left open 
by the 1.0 spec. If there is a general parsing issue that is discussed in OBO 
1.2 that is not discussed in the OBO 1.0 spec, assume that the OBO 1.2 
specification applies to OBO 1.0 as well. Note that this is ONLY true for 
GENERAL parsing issues, not specific tag and stanza questions.

> Issue number 1:
> Comments.
> The spec is unclear whether comments must occupy an entire line or
> whether inline comments are acceptable.  Several .obo files 'de facto'
> accept inline comments in tags like
>
> 	is_a: GO:0009055 ! electron carrier activity

Inline comments are part of the official OBO 1.2 specification. They are also 
supported by all OBO 1.0 parser implementations that I know of.
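For illustration only, here is a minimal Python sketch of how a parser might drop a trailing comment. The function name strip_inline_comment is hypothetical, and the sketch assumes that any unescaped '!' starts a comment while a backslash-escaped '!' is a literal character; check both assumptions against the spec before relying on them.

```python
def strip_inline_comment(line: str) -> str:
    """Return the line with any trailing '! comment' removed.

    Assumption: a character preceded by a backslash is literal, so an
    escaped '!' does not start a comment.
    """
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "!":
            # Everything from the '!' onward is comment text.
            return line[:i].rstrip()
        i += 1
    return line.rstrip()


tag_line = strip_inline_comment("is_a: GO:0009055 ! electron carrier activity")
```

With this sketch a line that consists only of a comment comes back as an empty string, which the caller can skip.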

> Issue number 2:
> Date formatting.
> The spec seems to be clear about the format of dates
>
> 	dd:mm:yyyy hh:mm
>
> However, in the example just below the date is formatted in a different
> way as 'April 18th, 2003'.  Either the spec is wrong or the example is.
> Incidentally, a better format for dates (happily forgetting all the
> time-zone issues) would be
>
> 	yyyy:mm:dd[:hh:mm]
>
> with the hours and minutes optional.

The example file you saw must have been generated by a buggy file dumper. The 
format specification is correct, and OBO-Edit implements the specification 
correctly.

Note that you may implement your parser so that it ignores badly formatted 
dates. It is still a correct implementation of the spec as long as you warn 
the user that a bad date was loaded and ignored.
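A rough Python sketch of that "warn and ignore" behaviour follows. The names parse_obo_date and OBO_DATE_FORMAT are made up for this example, and the format string is my reading of dd:mm:yyyy hh:mm (day first, 24-hour clock); treat it as an assumption rather than an official mapping.

```python
import warnings
from datetime import datetime
from typing import Optional

# Assumed strptime equivalent of the spec's "dd:mm:yyyy hh:mm".
OBO_DATE_FORMAT = "%d:%m:%Y %H:%M"


def parse_obo_date(value: str) -> Optional[datetime]:
    """Parse a date tag value; warn and return None if it is malformed."""
    try:
        return datetime.strptime(value.strip(), OBO_DATE_FORMAT)
    except ValueError:
        # The file still loads; the user is told the date was dropped.
        warnings.warn(f"Ignoring badly formatted date: {value!r}")
        return None
```

A value such as 'April 18th, 2003' would produce a warning and None here, and the rest of the file would load normally.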

I will talk to Chris about making the time information optional in the OBO 1.2 
spec. I personally think it is dangerous, because many ontologies go through 
several revisions a day, and any application that uses the date stamp would 
not know which of those revisions came first. However, I'll see what Chris 
thinks...

> Issue number 3:
> Strings.
> The spec would benefit from a more precise definition of 'string'.
> This should also resolve problems with quoting issues in strings.  In
> dbxref formatting there is an example like
>
> 	GO:ma "Term written by Michael\, without reference to other sources"
>
> The quoting of the comma inside the string is pleonastic.

You're right, that example is broken. The comma should not be escaped. I'll 
fix this in the OBO 1.2 specification.

This issue is covered (albeit obliquely) in section 1.5 of the OBO 1.2 spec, 
"Escape characters":

	Escaped characters should only be used when a literal character is needed
	(that is, a character that the parser should not interpret as having a 
	special meaning when parsing). Some tag values may contain unescaped colons, 
	brackets, quotes, etc. that have meaning in decoding the tag value. Unescaped 
	spaces between the separator colon and the start of the value tag are 
	discarded.

This means that inside a quoted string, the only characters that should be 
escaped are double quotes and backslashes, because they are the only 
characters that have special meaning within the context of a quoted string. 
(The special meaning of double-quotes is "end this string" and the special 
meaning of backslash is "escape character coming up").
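To make the rule concrete, here is a small Python sketch of a quoted-string reader. The name read_quoted_string is hypothetical. As a simplification, it treats a backslash as "take the next character literally", so it also tolerates a needlessly escaped comma like the one in the broken example. It does not handle any escape sequences that stand for whitespace.

```python
from typing import Tuple


def read_quoted_string(text: str, start: int = 0) -> Tuple[str, int]:
    """Read a double-quoted string beginning at text[start].

    Returns the unescaped contents and the index just past the closing
    quote. Simplification: every backslash escape is taken literally.
    """
    if start >= len(text) or text[start] != '"':
        raise ValueError("expected opening double quote")
    out = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if i + 1 >= len(text):
                raise ValueError("dangling backslash at end of input")
            out.append(text[i + 1])  # keep the escaped character as-is
            i += 2
        elif ch == '"':
            # An unescaped double quote ends the string.
            return "".join(out), i + 1
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated quoted string")
```

The returned index lets the caller continue parsing whatever follows the string on the same line.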

And thanks for teaching me the word "pleonastic".

> Issue number 4:
> Parser Batch Processing.
> The notion of 'files batch' is never defined.  Moreover, the meaning
> that can be inferred from the references to this notion prefigures a
> certain way of using the parser, i.e. something along the lines of
>
> 	files = [ obo_file1, obo_file2, ..., obo_fileN ]
> 	internal_obo_representation = obo_parser(files)
>
> This is not necessarily the way a sophisticated application would work.
>   E.g. an application may decide to actually work incrementally and load
> one obo file at a time and amend its internal state as needed.
> In order to solve this issue, the spec should either take the easy way
> out and state that each obo file must be self-contained or that
> incremental processing is allowed, thus amending the references to
> 'batch processing' of obo files.

I think that the specification is correct here.

We can't say that all OBO files are self-contained. That requirement would 
greatly diminish the flexibility of the format and would render OBO "library" 
files (like obo_rel.obo) useless.

I think the notion of batch processing is useful in the parser specification. 
The concept of batch processing tries to capture the idea that an application 
may need to load multiple interdependent files. The "file batch" idea is a 
way of specifying what information MUST be available for a parse to be 
successful, and what information can be left dangling after all the files are 
loaded. This is important, because some dangling references are allowable 
(dangling references to Term objects) and some dangling references are NOT 
(dangling references to Property types within a "relationship" tag). Hence 
the utility of specifying what information needs to be in place after a 
"batch" of files has been processed.

Note that these references to file batches shouldn't impact your ability to do 
incremental processing at all. The specification doesn't say anything about 
how dangling references may be resolved, so there's no reason why your 
application couldn't load files one at a time, and resolve dangling 
references as new information becomes available. However, you are still 
working with file batches that contain a single file, and at the end of each 
of these single-file batches, certain information needs to be in place (i.e., no 
dangling property references).
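Here is one way that incremental, one-batch-at-a-time loading could look in Python. Everything in it is hypothetical: the OntologyState class, the DanglingPropertyError exception, and the pre-parsed stanza dictionaries (with "kind", "id" and "relationships" keys) are invented for the example and are not part of the spec or of any existing library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


class DanglingPropertyError(Exception):
    """A relationship uses a type that no loaded batch has defined."""


@dataclass
class OntologyState:
    terms: Set[str] = field(default_factory=set)
    typedefs: Set[str] = field(default_factory=set)
    dangling_terms: Set[str] = field(default_factory=set)

    def load_batch(self, stanzas: List[Dict]) -> None:
        """Load one batch: the stanzas of one or more files."""
        used_types: Set[str] = set()
        referenced: Set[str] = set()
        for stanza in stanzas:
            if stanza["kind"] == "Typedef":
                self.typedefs.add(stanza["id"])
            elif stanza["kind"] == "Term":
                self.terms.add(stanza["id"])
                for rel_type, target in stanza.get("relationships", []):
                    used_types.add(rel_type)
                    referenced.add(target)
        # End of batch: every property reference must resolve by now.
        missing = used_types - self.typedefs
        if missing:
            raise DanglingPropertyError(sorted(missing))
        # Term references may stay dangling; later batches can resolve them.
        self.dangling_terms = (self.dangling_terms | referenced) - self.terms
```

Loading a relations "library" batch first and the term files afterwards then works one file at a time. A term reference left dangling by one batch simply disappears from dangling_terms once a later batch defines the term.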

If you were trying to do something very fancy, like implementing a 
lazy-loading system where a file's contents were only loaded as needed, that 
goes beyond the scope of the specification. Such a system would require 
complex rules about how to report missing dependencies during a user session, 
and careful protocols about how to recover from such failures. It seems to me 
that this set of rules would be highly application-specific, and therefore 
well beyond the scope of a general specification for parsing OBO files.

I may have missed the point here, though. Please let me know if there's 
something I'm not getting...

Thank you for your careful reading of the specification and insightful 
comments. I will get to work on amending the specification immediately.

	-John