From: Steven B. <sb...@cs...> - 2002-10-18 15:32:06
Gilles Sadowski <gi...@ha...> wrote:
> Now that I've the library compiled, I'm going to start asking another
> type of questions :-).
> Thanks again for your swift and useful help in resolving the problem!
>
> I read in Bird's and Liberman's article ("A formal framework for
> linguistic annotation <http://arXiv.org/abs/cs/0010033>") that the
> content of a 'Feature' is simply character data (i.e. #PCDATA, in the
> DTD file "ag.dtd"), although preferably structured values, as proposed
> by Dublin Core.

That's right - we said that the AG formalism didn't specify any structure or semantics for the content of an arc label (or "feature").

> The problem I see with that scheme is that, in the case there exist
> particular constraints on the 'Feature' contents, we would need a
> special piece of software to enforce them and validate the contents of
> the produced documents.

Right, but the C++ library could contain a collection of standard validation functions to support application developers (extending the Validation.cc functions I wrote back in June).

> If 'Feature' would be allowed to contain other
> XML elements, we could benefit from the XML Schema, for example, to
> define the constraints, and then "standard" validation tools could
> validate the documents.

So far, our demonstration applications have all been for *creating* annotation data. In this context, any validation of feature content is best done at the time the data is being entered, rather than later once it is exported to XML. (Then there are the further problems that XML Schema validators differ in their coverage of the standard and in their supported platforms, and that XML Schema cannot express some common kinds of constraints over the representation we've chosen, e.g. "if feature_x1=y1 then feature_x2=y2".)

> I'd like to have your opinion about that. Does it make sense? Even in
> the affirmative, it might have a too great impact on the structure of
> the AG library to be implemented...
> Is it worth it or are there strong
> disadvantages to this approach?

Basically, the issue of how best to manage special-purpose content models is a research question. We've approached the problem in two ways: defining a high-level API on top of AGLIB, which only permits well-formed structures and features to be created by applications; OR hard-coding the constraints into the applications themselves.

In the last six months or so, we've realized that both of these approaches need to be implemented in a way that is "type-safe". I.e. every time a constraint is tested on some aspect of graph structure or content, it is only tested on the annotations of a specified type. Thus, we don't touch other annotations in the AG that might have been created by some other tool that this tool knows nothing about. You'll notice that the validation functions in Validation.cc all require a type argument.

An extension to the above is to add some declarations to the metadata, which inform the application about validation details (e.g. in the interlinear text tool).

An ongoing research activity is to define an AG query language which can be used to make declarations of well-formedness, then compile this into SQL for efficient execution. This would make it possible for any AG data file to document the details of its structure (by including a query expression in its header, say), and likewise for AG tools to document the assumptions they make about AG data. Going a step further, we might be able to automatically determine what tools could be applied to what AG data. Unfortunately this is all pie in the sky right now.

I think the expedient approach is to do validation in software at runtime, and to put widely-useful validation functions in AGLIB.

Steven Bird

-- 
Steven Bird   Email: <sb...@cs...>   Web: http://www.cs.mu.oz.au/~sb/
A/Prof, Dept of Computer Science, University of Melbourne, Vic 3010, AUSTRALIA
Senior Research Assoc, Linguistic Data Consortium, University of Pennsylvania
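P.S. A minimal sketch of the type-safe runtime validation idea discussed above, using toy stand-in structures rather than AGLIB's actual API (the `Annotation` struct and `CheckImplication` function here are hypothetical illustrations, not functions from Validation.cc). It checks the kind of conditional constraint that XML Schema cannot express ("if feature_x1=y1 then feature_x2=y2"), and, like the real validation functions, takes a type argument so that annotations of other types are never touched:

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for an AGLIB annotation -- the real library's types differ.
struct Annotation {
    std::string type;                             // e.g. "Word", "Phone"
    std::map<std::string, std::string> features;  // arc label content
};

// Type-safe constraint check: only annotations of the given type are
// inspected, so annotations created by other tools are left alone.
// Constraint: if featureA == valueA, then featureB must equal valueB.
bool CheckImplication(const std::vector<Annotation>& ag,
                      const std::string& type,
                      const std::string& featureA, const std::string& valueA,
                      const std::string& featureB, const std::string& valueB) {
    for (const Annotation& ann : ag) {
        if (ann.type != type) continue;  // skip other annotation types
        auto a = ann.features.find(featureA);
        if (a == ann.features.end() || a->second != valueA) continue;
        auto b = ann.features.find(featureB);
        if (b == ann.features.end() || b->second != valueB) return false;
    }
    return true;
}
```

A tool would run such checks at data-entry time, as suggested above, rather than deferring them to a post-hoc XML validation pass.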