#3 Exclude some text nodes according to parent tag value

v1.0_(example)
open
nobody
None
5
2013-10-04
2013-10-01
Djak
No

Hello,
I'm trying to extract text nodes from regions while excluding the nodes inside a specific subregion. For example, in:

<text id=42 lang="English"> <s>An easy example.</s><s> Another <i>very</i> easy example.</s> <s><b>O</b>nly the <b>ea</b>siest ex<b>a</b>mples!</s></text>

I'd like to extract all the words except those between <i> and </i>. I tried the "diff" command approach:
A = /region[text];
B = /region[i];
C = diff A B;

but "diff" is a set operation, so it doesn't work as I expected and returns the same set as A.
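
The behaviour described above can be pictured with a small Python model. This is only an illustrative sketch, not CQP itself: it assumes that a named query result behaves like a set of (start, end) token ranges, that a set difference only drops ranges that are identical in both sets, and that the example is tokenized into 14 tokens with punctuation split off (all hypothetical choices made for the illustration).

```python
# Toy model: a query result is a set of (start, end) token ranges.
# Assumed tokenization of the example (positions 0..13):
# An easy example . Another very easy example . Only the easiest examples !
text_regions = {(0, 13)}  # A: the single <text> region covering every token
i_regions = {(5, 5)}      # B: the <i> region around "very"

# A set difference removes only ranges that occur identically in both sets.
# No <text> range equals an <i> range, so nothing is removed from A.
difference = text_regions - i_regions

print(difference == text_regions)
```

Under these assumptions the difference is the same set as A, which matches what the post reports.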

Anyone could tell me if what I'd like to do is possible and how I could do it ?

Thanks in advance.

Discussion

  • Stefan Evert
    2013-10-01

    I'm not sure I understand exactly what you want to do. If you want a list of all tokens that occur in <text> regions but are not within an <i> element, the query would be:

    A = [text & !i];

    If you want one match per region, but "cut out" the tokens marked <i>, that's not possible. Matches in CQP are always contiguous ranges of tokens. I'm not aware of any other query tool that supports such "ranges with holes".
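
The difference between a token-level filter and a single "range with a hole" can be sketched in Python. This is a rough model under stated assumptions, not CQP code: the token list, the position of the <i> region and the helper `contiguous_ranges` are all hypothetical names and values chosen for the illustration.

```python
# Assumed tokenization of the first example; position 5 ("very") is inside <i>.
tokens = ["An", "easy", "example", ".", "Another", "very", "easy",
          "example", ".", "Only", "the", "easiest", "examples", "!"]
inside_i = {5}

# Rough analogue of a token-level test such as [text & !i]:
# every token is checked on its own, giving one single-token match each.
outside = [pos for pos in range(len(tokens)) if pos not in inside_i]


def contiguous_ranges(positions):
    """Group sorted token positions into maximal contiguous (start, end) ranges."""
    ranges = []
    for pos in positions:
        if ranges and pos == ranges[-1][1] + 1:
            # Extend the current range by one token.
            ranges[-1] = (ranges[-1][0], pos)
        else:
            # A gap was found, so a new range has to start here.
            ranges.append((pos, pos))
    return ranges


ranges = contiguous_ranges(outside)
print(ranges)
```

In this model the tokens outside <i> can only be covered by two separate contiguous ranges, one on each side of "very". A single range over the whole region would have to include the excluded token.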

    You'll probably get better help on the CWB mailing list, which you can join here:

    http://devel.sslmit.unibo.it/mailman/listinfo/cwb
    

    Best,
    Stefan

     
  • Djak
    2013-10-03

    Thank you very much. That's exactly what I was looking for. I hadn't seen this syntax (!TAGNAME) in the manual, so I need to recheck it.
    Cheers,
    Djak

     
  • Djak
    2013-10-04

    Hello,
    actually, I didn't fully understand the second part of your answer. I see that:

    A = [text & !i];

    retrieves a list of token positions.

    If I take the example:

    <text id=42 lang="English"><s>An easy example.</s><s> Another <i>very easy</i> example.</s> <s>Only the easiest examples!</s></text>

    Do you mean by "ranges with holes" that there is no way to retrieve all the words outside <i></i> and then run a sequential query such as "[][]" that wouldn't return the "very easy" sequence?

    Thanks in advance for your help.
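
One possible reading of this question can be sketched in Python. It is a model only, not a confirmed CQP answer: it assumes the per-token test from the reply above is applied to each of the two positions of the sequence (something like "[!i][!i]", which is an untested assumption), and it reuses a hypothetical tokenization in which "very easy" occupies positions 5 and 6.

```python
# Assumed tokenization of the second example; "very easy" is inside <i>.
tokens = ["An", "easy", "example", ".", "Another", "very", "easy",
          "example", ".", "Only", "the", "easiest", "examples", "!"]
inside_i = {5, 6}

# Two-token sequences in which BOTH tokens lie outside <i>.
# Each match is still a contiguous range (pos, pos + 1), so it has no hole.
bigrams = [
    (tokens[pos], tokens[pos + 1])
    for pos in range(len(tokens) - 1)
    if pos not in inside_i and pos + 1 not in inside_i
]

for pair in bigrams:
    print(pair)
```

In this model ("very", "easy") is never returned, and neither is ("Another", "example"), because those two tokens are not adjacent in the corpus. Sequences that skip over the excluded tokens would need the kind of "range with holes" that the reply above rules out.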