From: John Day-R. <joh...@ay...> - 2005-10-31 21:48:59
Marco Antoniotti wrote:
> I am writing a parser for the OBO format for our GOALIE application and
> I have a few remarks to make about the format specification
> 'http://www.geneontology.org/GO.format.shtml#oboflat'. It is small
> fry, but it would help if these issues were clarified. Please don't
> take this message as bad criticism: it is well meant and written with
> good intentions.

Thanks for your input, Marco. I agree with Chris Mungall's early reply:
if you are writing your parser in Java or Perl, you might want to look
at the parsing libraries we've made available. There is an easy-to-use,
event-based parser for Java that might interest you...

Something to keep in mind as I go through your individual examples: the
OBO 1.2 specification clarifies a number of general parsing questions
left open by the 1.0 spec. If a general parsing issue is discussed in
the OBO 1.2 spec but not in the OBO 1.0 spec, assume that the OBO 1.2
specification applies to OBO 1.0 as well. Note that this is ONLY true
for GENERAL parsing issues, not for specific tag and stanza questions.

> Issue number 1:
> Comments.
> The spec is unclear whether comments must occupy an entire line or
> whether inline comments are acceptable. Several .obo files 'de facto'
> accept inline comments in tags like
>
> is_a: GO:0009055 ! electron carrier activity

Inline comments are part of the official OBO 1.2 specification. They
are also supported by all OBO 1.0 parser implementations that I know of.

> Issue number 2:
> Date formatting.
> The spec seems to be clear about the format of dates
>
> dd:mm:yyyy hh:mm
>
> However, in the example just below the date is formatted in a different
> way as 'April 18th, 2003'. Either the spec is wrong or the example is.
> Incidentally, a better format for dates (happily forgetting all the
> time-zone issues) would be
>
> yyyy:mm:dd[:hh:mm]
>
> with the hours and minutes optional.

The example file you saw must have been generated by a buggy file dumper.
The format specification is correct, and OBO-Edit implements the
specification correctly. Note that you may implement your parser so
that it ignores badly formatted dates. It is still a correct
implementation of the spec as long as you warn the user that a bad date
was loaded and ignored.

I will talk to Chris about making the time information optional in the
OBO 1.2 spec. I personally think that would be dangerous, because many
ontologies go through several revisions a day, and any application that
uses the date stamp would not know which of those revisions came first.
However, I'll see what Chris thinks...

> Issue number 3:
> Strings.
> The spec would benefit from a more precise definition of 'string'.
> This should also resolve problems with quoting issues in strings. In
> dbxref formatting there is an example like
>
> GO:ma "Term written by Michael\, without reference to other sources"
>
> The quoting of the comma inside the string is pleonastic.

You're right, that example is broken. The comma should not be escaped.
I'll fix this in the OBO 1.2 specification.

This issue is covered (albeit obliquely) in section 1.5 of the OBO 1.2
spec, "Escape characters":

    Escaped characters should only be used when a literal character is
    needed (that is, a character that the parser should not interpret
    as having a special meaning when parsing). Some tag values may
    contain unescaped colons, brackets, quotes, etc. that have meaning
    in decoding the tag value. Unescaped spaces between the separator
    colon and the start of the value tag are discarded.

This means that inside a quoted string, the only characters that should
be escaped are double quotes and backslashes, because they are the only
characters that have special meaning within the context of a quoted
string. (The special meaning of a double quote is "end this string" and
the special meaning of a backslash is "escape character coming up".)

And thanks for teaching me the word "pleonastic".

> Issue number 4:
> Parser Batch Processing.
> The notion of 'files batch' is never defined. Moreover, the meaning
> that can be inferred from the references to this notion prefigures a
> certain way of using the parser, i.e. something along the lines of
>
> files = [ obo_file1, obo_file2, ..., obo_fileN ]
> internal_obo_representation = obo_parser(files)
>
> This is not necessarily the way a sophisticated application would work.
> E.g. an application may decide to actually work incrementally and load
> one obo file at a time and amend its internal state as needed.
> In order to solve this issue, the spec should either take the easy way
> out and state that each obo file must be self-contained or that
> incremental processing is allowed, thus amending the references to
> 'batch processing' of obo files.

I think that the specification is correct here. We can't say that all
OBO files are self-contained. That requirement would greatly diminish
the flexibility of the format and would render OBO "library" files
(like obo_rel.obo) useless.

I think the notion of batch processing is useful in the parser
specification. The concept of batch processing tries to capture the
idea that an application may need to load multiple interdependent
files. The "file batch" idea is a way of specifying what information
MUST be available for a parse to be successful, and what information
can be left dangling after all the files are loaded. This is important,
because some dangling references are allowable (dangling references to
Term objects) and some are NOT (dangling references to Property types
within a "relationship" tag). Hence the utility of specifying what
information needs to be in place after a "batch" of files has been
processed.

Note that these references to file batches shouldn't impact your
ability to do incremental processing at all.
The specification doesn't say anything about how dangling references
may be resolved, so there's no reason why your application couldn't
load files one at a time and resolve dangling references as new
information becomes available. However, you are then still working with
file batches that each contain a single file, and at the end of each of
these single-file batches, certain information needs to be in place
(i.e., no dangling property references).

If you were trying to do something very fancy, like implementing a
lazy-loading system where a file's contents were only loaded as needed,
that goes beyond the scope of the specification. Such a system would
require complex rules about how to report missing dependencies during a
user session, and careful protocols about how to recover from such
failures. It seems to me that this set of rules would be highly
application-specific, and therefore well beyond the scope of a general
specification for parsing OBO files.

I may have missed the point here, though. Please let me know if there's
something I'm not getting...

Thank you for your careful reading of the specification and insightful
comments. I will get to work on amending the specification immediately.

-John
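
As a rough illustration of the batch rule discussed above, here is a
minimal Python sketch of a loader that accepts files one at a time and
checks, when the batch is declared complete, that no relationship type
is left dangling while dangling term references are merely reported.
Everything in it is an assumption made for illustration: the Batch
class, the strip_comment helper, and the choice to treat a [Typedef]
stanza as the definition of a relationship type are hypothetical and
are not taken from any existing OBO library. It handles only the "id",
"is_a" and "relationship" tags and ignores quoted strings.

```python
def strip_comment(line):
    """Remove a trailing '!' comment, leaving backslash-escaped characters alone.

    Simplified assumption: quoted strings are not given special treatment.
    """
    out, i = [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            out.append(line[i:i + 2])  # keep the escaped pair untouched
            i += 2
        elif ch == "!":
            break                      # the rest of the line is a comment
        else:
            out.append(ch)
            i += 1
    return "".join(out).strip()


class Batch:
    """State accumulated over one file batch (hypothetical helper)."""

    def __init__(self):
        self.terms = set()      # ids defined by [Term] stanzas
        self.typedefs = set()   # ids defined by [Typedef] stanzas
        self.term_refs = set()  # term ids referenced from [Term] stanzas
        self.type_refs = set()  # relationship type ids referenced

    def load(self, text):
        """Add the contents of one file to the batch; may be called repeatedly."""
        stanza = None
        for raw in text.splitlines():
            line = strip_comment(raw)
            if not line:
                continue
            if line.startswith("[") and line.endswith("]"):
                stanza = line[1:-1]
                continue
            # Split at the first colon only, so ids such as GO:0000001 survive.
            tag, _, value = line.partition(":")
            tag, value = tag.strip(), value.strip()
            if tag == "id" and stanza == "Term":
                self.terms.add(value)
            elif tag == "id" and stanza == "Typedef":
                self.typedefs.add(value)
            elif tag == "is_a" and stanza == "Term":
                self.term_refs.add(value)
            elif tag == "relationship" and stanza == "Term":
                parts = value.split()
                if len(parts) >= 2:
                    self.type_refs.add(parts[0])
                    self.term_refs.add(parts[1])

    def check(self):
        """Call at the end of a batch.

        Raises ValueError for dangling relationship types; returns the
        sorted list of dangling term references, which are tolerated.
        """
        missing = sorted(self.type_refs - self.typedefs)
        if missing:
            raise ValueError("undefined relationship types: " + ", ".join(missing))
        return sorted(self.term_refs - self.terms)


if __name__ == "__main__":
    library = "[Typedef]\nid: part_of\n"
    ontology = (
        "[Term]\n"
        "id: GO:0000001\n"
        "is_a: GO:0009055 ! electron carrier activity\n"
        "relationship: part_of GO:0000002\n"
    )
    batch = Batch()
    batch.load(library)   # files can arrive one at a time
    batch.load(ontology)
    print("dangling term references:", batch.check())
```

Calling check() after every load() corresponds to the single-file
batches described above; calling it once after several load() calls
corresponds to a multi-file batch.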