From: Steven B. <sb...@cs...> - 2002-10-18 15:32:06
Gilles Sadowski <gi...@ha...> wrote:
> Now that I've the library compiled, I'm going to start asking another
> type of questions :-).
> Thanks again for your swift and useful help in resolving the problem!
>
> I read in Bird's and Liberman's article ("A formal framework for
> linguistic annotation <http://arXiv.org/abs/cs/0010033>") that the
> content of a 'Feature' is simply character data (i.e. #PCDATA, in the
> DTD file "ag.dtd"), although preferably structured values, as proposed
> by Dublin Core.

That's right - we said that the AG formalism didn't specify any structure or semantics for the content of an arc label (or "feature").

> The problem I see with that scheme is that, in the case there exist
> particular constraints on the 'Feature' contents, we would need a
> special piece of software to enforce them and validate the contents of
> the produced documents.

Right, but the C++ library could contain a collection of standard validation functions to support application developers (extending the Validation.cc functions I wrote back in June).

> If 'Feature' would be allowed to contain other
> XML elements, we could benefit from the XML Schema, for example, to
> define the constraints, and then "standard" validation tools could
> validate the documents.

So far, our demonstration applications have all been for *creating* annotation data. In this context, any validation of feature content is best done at the time the data is being entered, rather than later once it is exported to XML. (Then there are the further problems that XML Schema validators differ in their coverage of the standard and in their supported platforms, and that XML Schema cannot express some common kinds of constraints over the representation we've chosen, e.g. "if feature_x1=y1 then feature_x2=y2".)

> I'd like to have your opinion about that. Does it make sense? Even in
> the affirmative, it might have a too great impact on the structure of
> the AG library to be implemented...
> Is it worth it or are there strong
> disadvantages to this approach?

Basically, the issue of how best to manage special-purpose content models is a research question. We've approached the problem in two ways: defining a high-level API on top of AGLIB, which only permits well-formed structures and features to be created by applications; OR hard-coding the constraints into the applications themselves.

In the last six months or so, we've realized that both of these approaches need to be implemented in a way that is "type-safe". I.e. every time a constraint is tested on some aspect of graph structure or content, it is only tested on the annotations of a specified type. Thus, we don't touch other annotations in the AG that might have been created by some other tool that this tool knows nothing about. You'll notice that the validation functions in Validation.cc all require a type argument.

An extension to the above is to add some declarations to the metadata, which inform the application about validation details (e.g. in the interlinear text tool).

An ongoing research activity is to define an AG query language which can be used to make declarations of well-formedness, then compile this into SQL for efficient execution. This would make it possible for any AG data file to document the details of its structure (by including a query expression in its header, say), and likewise for AG tools to document the assumptions they make about AG data. Going a step further, we might be able to automatically determine what tools could be applied to what AG data. Unfortunately this is all pie in the sky right now.

I think the expedient approach is to do validation in software at runtime, and to put widely-useful validation functions in AGLIB.

Steven Bird

-- 
Steven Bird   Email: <sb...@cs...>   Web: http://www.cs.mu.oz.au/~sb/
A/Prof, Dept of Computer Science, University of Melbourne, Vic 3010, AUSTRALIA
Senior Research Assoc, Linguistic Data Consortium, University of Pennsylvania
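P.S. A minimal sketch of the type-safe runtime validation idea discussed above, using toy stand-in structures rather than AGLIB's actual API (the `Annotation` struct and `CheckImplication` function here are hypothetical illustrations, not functions from Validation.cc). It checks the kind of conditional constraint that XML Schema cannot express ("if feature_x1=y1 then feature_x2=y2"), and, like the real validation functions, takes a type argument so that annotations of other types are never touched:

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for an AGLIB annotation -- the real library's types differ.
struct Annotation {
    std::string type;                             // e.g. "Word", "Phone"
    std::map<std::string, std::string> features;  // arc label content
};

// Type-safe constraint check: only annotations of the given type are
// inspected, so annotations created by other tools are left alone.
// Constraint: if featureA == valueA, then featureB must equal valueB.
bool CheckImplication(const std::vector<Annotation>& ag,
                      const std::string& type,
                      const std::string& featureA, const std::string& valueA,
                      const std::string& featureB, const std::string& valueB) {
    for (const Annotation& ann : ag) {
        if (ann.type != type) continue;  // skip other annotation types
        auto a = ann.features.find(featureA);
        if (a == ann.features.end() || a->second != valueA) continue;
        auto b = ann.features.find(featureB);
        if (b == ann.features.end() || b->second != valueB) return false;
    }
    return true;
}
```

A tool would run such checks at data-entry time, as suggested above, rather than deferring them to a post-hoc XML validation pass.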