#7 Corpus info output

Milestone: v1.0_(example)
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2023-10-21
Created: 2023-10-14
Creator: ram
Private: No

Hi! I have a test corpus with this registry:

##
## registry entry for corpus PRUEBA
##

# long descriptive name for the corpus
NAME "Una pruebáñ"
# corpus ID (must be lowercase in registry!)
ID   prueba
# path to binary data files
HOME /opt/cwb/data/prueba
# optional info file (displayed by "info;" command in CQP)
INFO /opt/cwb/data/prueba/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # character encoding of corpus data
##:: language = "??"     # insert ISO code for language (de, en, fr, ...)


##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE FORM
ATTRIBUTE LEMMA
ATTRIBUTE TAG
ATTRIBUTE SHORT_TAG
ATTRIBUTE MSD
ATTRIBUTE NEC
ATTRIBUTE SENSE
ATTRIBUTE SYNTAX
ATTRIBUTE DEPHEAD
ATTRIBUTE DEPREL
ATTRIBUTE COREF
ATTRIBUTE TOKENID


##
## s-attributes (structural markup)
##

# <text id=".."> ... </text>
STRUCTURE text
STRUCTURE text_id              # [annotations]

# <p> ... </p>
STRUCTURE p

# <s> ... </s>
STRUCTURE s


# Yours sincerely, the Encode tool.

And when I do info PRUEBA I get this output:

Size:    21
Charset: utf8
Properties:
        language = '??'
        charset = 'utf8'

No further information available about PRUEBA

So I wonder two things:

  1. Why is the ID in the registry in lowercase, while for info it has to be in uppercase?
  2. Why doesn't info output more information, such as the ATTRIBUTE and STRUCTURE declarations?

Thanks

Discussion

  • Stephanie Evert

    Stephanie Evert - 2023-10-14
    1. Because that's how the original developer decided to do things in 1994. It's a quirk that we live with for the sake of backwards compatibility. Note that the filename of the registry file also has to be in lowercase, while corpus IDs are to be specified in all caps everywhere else.
    2. You can get the list of attributes with show cd or using cwb-describe-corpus -s on the command line.
    3. Canonical attribute names (both positional and structural) should be all lowercase and only use ASCII characters. While your all-caps p-attributes are accepted for backwards compatibility, some tools may stumble over them.
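
As a rough illustration of point 3, the following Python sketch reads registry text and flags attribute names that are not all-lowercase ASCII. It is not part of CWB; the function names `parse_registry` and `non_canonical` are hypothetical, and the parser only understands the `ID`, `ATTRIBUTE` and `STRUCTURE` lines seen in the registry above.

```python
def parse_registry(text):
    """Collect the corpus ID, p-attributes and s-attributes from registry text."""
    entry = {"id": None, "p_attributes": [], "s_attributes": []}
    for raw in text.splitlines():
        # Drop comments (everything after '#') and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) < 2:
            continue
        keyword, value = parts[0], parts[1]
        if keyword == "ID":
            entry["id"] = value
        elif keyword == "ATTRIBUTE":
            entry["p_attributes"].append(value)
        elif keyword == "STRUCTURE":
            entry["s_attributes"].append(value)
    return entry


def non_canonical(names):
    """Return the names that are not all-lowercase ASCII."""
    return [n for n in names if not (n.isascii() and n == n.lower())]


# Excerpt of the registry from the question (shortened for the example).
SAMPLE = """
ID   prueba
ATTRIBUTE word
ATTRIBUTE FORM
ATTRIBUTE LEMMA
STRUCTURE text
STRUCTURE text_id              # [annotations]
STRUCTURE s
"""

entry = parse_registry(SAMPLE)
print(non_canonical(entry["p_attributes"]))  # ['FORM', 'LEMMA']
print(non_canonical(entry["s_attributes"]))  # []
```

Only the all-caps p-attributes are reported; `word` and the structural attributes already follow the convention.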
     
  • ram

    ram - 2023-10-14

    Thanks for your response! cwb-describe-corpus -s works perfectly.

    For show cd I still get incomplete information:

    ===Context Descriptor=======================================

    left context:     25 characters
    right context:    25 characters
    corpus position:  shown
    target anchors:   not shown

    Positional Attributes:    <none>

    Structural Attributes:    <none>

    Aligned Corpora:          <none>

    ============================================================
    
     
  • Stephanie Evert

    Stephanie Evert - 2023-10-14

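    You seem to have forgotten to activate the corpus. Entering

    info PRUEBA;

    prints the corpus information but does not activate the corpus. Activate it first, then ask for the context descriptor:

    PRUEBA;
    show cd;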
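
The two conventions in this thread (lowercase ID in the registry, uppercase ID when activating the corpus in CQP) can be sketched in a few lines of Python. This is only an illustration; `cqp_show_cd_commands` is a hypothetical helper, not a CWB function, and it merely builds the command text shown above without running CQP.

```python
def cqp_show_cd_commands(registry_id):
    """Build the CQP commands that activate a corpus and show its context descriptor.

    registry_id is the lowercase ID from the registry file (e.g. "prueba");
    CQP expects the same ID in uppercase.
    """
    corpus = registry_id.upper()
    return [f"{corpus};", "show cd;"]


print("\n".join(cqp_show_cd_commands("prueba")))
# PRUEBA;
# show cd;
```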
    
     
    • ram

      ram - 2023-10-21

      Thanks, sorry for the mistake

       
