#39 CQPweb: Annotate query

TODO-4.0
open
Andrew Hardie
CQPweb (34)
1
2011-08-01
2010-03-09
Andrew Hardie
No

CQPweb: Annotate query

These are basic design notes for a proposed “Annotate query” function which will extend (and possibly ultimately subsume) the existing “Categorise query” functionality. As such, it will be one of the rare CQPweb features that is not a

Comments on the proposal on this SourceForge thread are welcome, although it’s not currently a high priority.

Current situation: Categorise query allows you to define a set of categories and then “annotate” each line of a given concordance by assigning one of those categories to it. Categories are effectively values of a single attribute, where the attribute values are limited to a closed set. But potentially, we might want to annotate free values. Below is an example of why.

Problem:

Say you are doing a Gries-style collostruction analysis of the BE GOING TO + VERB construction, and you want to know what collexemes are in the VERB slot.

So a search for _VVGK is the starting point.

Then you need to annotate your head verb, which appears in an inconsistent position. You need its lemma, and that can’t be done automatically. You want to assign a label (the lemma) to each concordance line, but the set of possible labels is open-ended.

Current solution: download and analyse in Excel. Unsatisfactory – innovation in tools is a necessity, and it also avoids stagnation of widely-used methodologies.

We want the tools to do it for us – no download, we might want to reupload (which is currently possible, but

Solution:

(1) Add extra menu option, “Label query” << cos I will use the word “annotate” for something else later.
(2) Like categorise query, you get to name the query. But you don’t specify values. Instead, you get an empty text box to type whatever you want.
(3) You can save this, just like a categorise query. Realised as a database, like Categorise Query.
(4) Limited to \w and 0x20, for safety. Use a regex filter.
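
As a rough illustration of point (4), the filter could look something like the sketch below. It is written in Python purely for illustration, and the function names are hypothetical. It assumes that “\w and 0x20” means word characters plus the plain space character, and that an empty label is acceptable; note that Python’s \w also matches non-ASCII letters, which may or may not be what is wanted.

```python
import re

# Assumption: a label may contain only word characters (\w) and the space (0x20).
# An empty label is accepted here; that is a choice made for this sketch.
LABEL_PATTERN = re.compile(r"[\w ]*")


def is_safe_label(label):
    """Return True if the whole label consists of word characters and spaces."""
    return LABEL_PATTERN.fullmatch(label) is not None


def sanitise_label(label):
    """Drop every character that is not a word character or a space."""
    return re.sub(r"[^\w ]", "", label)
```

Whether to reject an unsafe label outright or silently strip the offending characters is a design decision the notes above leave open; the sketch shows both options.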

Then, you can search the “label” column to extract subsets. It can’t be a straight “split” like with categorise or else you’d get too many subsets. Instead, specify a regular expression: any instance which matches that regex goes into a new query.
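
A minimal sketch of that regex-based extraction, in Python for illustration only. The LabelledHit record and its field names are hypothetical, and the sketch assumes the regex has to match the whole label (as CQP does for attribute values) rather than any substring of it.

```python
import re
from typing import NamedTuple


class LabelledHit(NamedTuple):
    """One concordance line plus its free-text label (hypothetical layout)."""
    match_start: int  # corpus position where the hit begins
    match_end: int    # corpus position where the hit ends
    label: str


def subset_by_label(hits, pattern):
    """Return the hits whose label matches the regex in full; these form the new query."""
    regex = re.compile(pattern)
    return [hit for hit in hits if regex.fullmatch(hit.label)]


hits = [
    LabelledHit(10, 12, "give"),
    LabelledHit(40, 42, "go"),
    LabelledHit(90, 92, "get"),
]
new_query = subset_by_label(hits, "g[eo].*")
```

Here new_query would hold the “go” and “get” lines but not “give”, since the second letter of the label has to be e or o.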

Alternatively, you can get a frequency breakdown of the contents of that

This would be the data you’d need for the collostruction analysis I gave as an example: a list of lemma labels, with frequencies, in the verb-slot of that construction.
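
The frequency breakdown itself is simple to sketch. The following Python fragment is only an illustration, with a hypothetical function name; it assumes the labels have already been read out of the label column as a list of strings.

```python
from collections import Counter


def label_frequencies(labels):
    """Return (label, frequency) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(labels)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


breakdown = label_frequencies(["go", "be", "go", "do", "be", "go"])
```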

=====
Of course, this raises the question: shouldn’t we go further and allow multiple annotation fields?

E.g. for Gries/Divjak-style behavioural profiles: every example is annotated with multiple attribute-value pairs; the results are then the input to exploratory statistics (hierarchical cluster analysis in this case). One attribute would identify the groups you are trying to cluster (e.g. senses of one word, or which of two near-synonyms it is). The other attributes would need to be what G/D call ID Tags.

Why shouldn’t it be possible to have multiple manually-adjustable attributes in CQPweb? Why should people have to download, annotate, reupload?

In this case, the procedure would probably be as follows:

1) You can define an annotation SCHEME. The scheme specifies a list of attributes, and whether each one is a free-text label or a closed list. If it is a closed list, all possible values are listed too. (This saves redefining multiple lists of attributes and values at the time of creating your annotated query.)
2) A separate table for this, something like manual_annot_schemes
3) You have an “Annotate query” option which allows you to link your query to one of the schemes you have defined
4) Schemes can be public across a corpus or installation – allowing, for example, teachers to set up the categories that they then give to students to apply to a concordance.
5) Query + scheme = shape of database. A record in saved_manual_annots keeps track of it; the actual data is stored in a separate table of corpus positions for the hit plus as many fields as necessary.
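
To make “query + scheme = shape of database” concrete, here is a rough sketch using Python and an in-memory SQLite database. Only the table names manual_annot_schemes and saved_manual_annots come from the notes above; every column name, the JSON encoding of the scheme, and the example scheme itself are assumptions made for illustration, not a proposed implementation.

```python
import json
import sqlite3


def create_annotation_table(conn, table_name, attributes):
    """Create a data table: hit positions plus one text column per scheme attribute."""
    if not attributes:
        raise ValueError("a scheme needs at least one attribute")
    # Identifiers cannot be bound as SQL parameters, so restrict them to plain names.
    for name in [table_name, *attributes]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    columns = ", ".join(f'"{attr}" TEXT' for attr in attributes)
    conn.execute(
        f'CREATE TABLE "{table_name}" '
        f"(match_start INTEGER, match_end INTEGER, {columns})"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE manual_annot_schemes ("
    "scheme_id INTEGER PRIMARY KEY, name TEXT, is_public INTEGER, definition TEXT)"
)
conn.execute(
    "CREATE TABLE saved_manual_annots ("
    "query_name TEXT, scheme_id INTEGER, data_table TEXT)"
)

# A scheme lists its attributes; closed-list attributes also list their values.
scheme = {
    "sense": {"type": "closed", "values": ["literal", "figurative"]},
    "head_lemma": {"type": "label"},
}
cursor = conn.execute(
    "INSERT INTO manual_annot_schemes (name, is_public, definition) VALUES (?, ?, ?)",
    ("going_to_profile", 1, json.dumps(scheme)),
)
scheme_id = cursor.lastrowid

# Linking a query to the scheme fixes the shape of its data table.
create_annotation_table(conn, "annot_going_to", list(scheme))
conn.execute(
    "INSERT INTO saved_manual_annots VALUES (?, ?, ?)",
    ("going_to", scheme_id, "annot_going_to"),
)
column_names = [row[1] for row in conn.execute("PRAGMA table_info(annot_going_to)")]
```

In this sketch the data table ends up with the two position columns followed by one column per attribute in the scheme.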

It might well be possible to have R in the background so that the cluster analysis, or other exploratory statistics, could be applied automatically.

Or to compare the results of applying a scheme to one query, to the results of applying the scheme to another query.

(There would of course need to be a “sophisticated” web interface to managing all of this and manipulating the results of annotating a query).

Now note. The “categorise query” and “label query” functions become special cases (single-column annotation schemes) of “annotate query”.

They should probably be kept for compatibility, however (with on-the-fly automatic creation of annotation schemes when you define categories relevant to a specific query).
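
The “special case” idea can be sketched as follows, again in Python for illustration. The class names, the column names “category” and “label”, and the “auto_” prefix are all hypothetical; the only point is that a categorise query becomes a one-column closed-list scheme and a label query becomes a one-column free-text scheme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attribute:
    name: str
    values: Optional[List[str]] = None  # None means a free-text label


@dataclass
class Scheme:
    name: str
    attributes: List[Attribute]


def scheme_from_categories(query_name, categories):
    """Build the single-column closed-list scheme implied by a categorise query."""
    return Scheme(f"auto_{query_name}", [Attribute("category", list(categories))])


def scheme_from_label_query(query_name):
    """Build the single-column free-text scheme implied by a label query."""
    return Scheme(f"auto_{query_name}", [Attribute("label")])


categorised = scheme_from_categories("going_to", ["future", "motion", "unclear"])
labelled = scheme_from_label_query("going_to_verbs")
```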

Now, the questions: would this be a useful feature? Should it work as described, or otherwise?

Discussion

  • Andrew Hardie
    2010-03-11

    Other things:
    -- make columns interconvertible between labels and closed-list (i.e. free-text and "levels")

     
  • Andrew Hardie
    2011-08-01

    Yannick suggests adding the possibility of pre-populating fields in a form governed by CQP syntax (and so this may be linked to the issue of subqueries via CQPweb...)

    >>>>
    I would picture myself writing bits of code for the case you describe, but
    you could probably also make this easier for non-programming users if
    you allow them to pre-populate a field with, say, something like
    "[p-attr] lemma of next token to the [right/left]right, within [number]10 words,
    where [p-attr] POS is [regex] V.*"
    which would reduce the annotation effort somewhat.
    <<<<
    Other suggestions/requests for this feature very welcome...
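
One possible reading of the pre-population rule quoted above, sketched in Python for illustration only. The token representation (a list of dicts keyed by attribute name), the function name, and the CLAWS-style tags in the example are assumptions; the sketch also assumes the regex must match the whole attribute value, as in CQP.

```python
import re


def prepopulate_field(tokens, anchor, attr="lemma", direction="right",
                      window=10, cond_attr="pos", cond_regex="V.*"):
    """Return `attr` of the nearest token within `window` whose `cond_attr` matches.

    `anchor` is the index of the token the search starts from (not itself tested).
    Returns None if no token in the window satisfies the condition.
    """
    step = 1 if direction == "right" else -1
    regex = re.compile(cond_regex)
    for offset in range(1, window + 1):
        position = anchor + step * offset
        if position < 0 or position >= len(tokens):
            break  # ran off the edge of the available context
        if regex.fullmatch(tokens[position][cond_attr]):
            return tokens[position][attr]
    return None


tokens = [
    {"word": "is", "pos": "VBZ", "lemma": "be"},
    {"word": "going", "pos": "VVGK", "lemma": "go"},
    {"word": "to", "pos": "TO", "lemma": "to"},
    {"word": "really", "pos": "RR", "lemma": "really"},
    {"word": "help", "pos": "VVI", "lemma": "help"},
    {"word": "us", "pos": "PPIO2", "lemma": "we"},
]
suggested = prepopulate_field(tokens, anchor=1)
```

With the hit anchored on “going”, the suggested label would be the lemma of “help”, which the annotator could then accept or correct by hand.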

     
  • Andrew Hardie
    2011-08-01

    • milestone: --> TODO-4.0