OPTIMA cidoc-crm Semantic Annotation Wiki

Semantic annotation of archaeology reports with respect to CIDOC-CRM

Status: Alpha

Brought to you by: avlachid

Preprocess Phase

ex:JAPE Grammar
rdfs:label "Preprocess 01";
ex:pattern
({SpaceToken.kind==control})+
({SpaceToken.kind==space})*
({Token}):Begin
;
rdfs:comment "The grammar annotates as Begin the first Token that follows one or more (+) line breaks and then zero or more (*) spaces".
.
ex:JAPE Grammar
rdfs:label "Preprocess 02";
ex:pattern
({Token}):End
({SpaceToken.kind==space})*
({SpaceToken.kind==control})+
;
rdfs:comment "The grammar annotates as End the Token immediately preceding zero or more (*) spaces followed by one or more (+) line breaks".
.
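The Begin and End rules of Preprocess 01 and 02 can be tried outside GATE with a small sketch. The tokenizer below is a toy stand-in for the ANNIE Tokenizer and is an assumption: newlines play the role of control SpaceTokens, blanks and tabs the role of space SpaceTokens. The names tokenize, begin_tokens and end_tokens are hypothetical helpers, not part of the grammars.

```python
import re

# Toy stand-in for the ANNIE Tokenizer (an assumption, not the real component):
# newlines act as SpaceToken.kind == control, blanks and tabs as
# SpaceToken.kind == space, and each word, number or punctuation mark is a Token.
TOKEN_RE = re.compile(r"(?P<control>[\r\n])|(?P<space>[ \t])|(?P<token>\w+|[^\w\s])")


def tokenize(text):
    """Return (kind, string) pairs in document order."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def _kind_past_spaces(tokens, start, step):
    """Skip space tokens from `start` in direction `step`; return the kind reached, or None."""
    i = start
    while 0 <= i < len(tokens) and tokens[i][0] == "space":
        i += step
    return tokens[i][0] if 0 <= i < len(tokens) else None


def begin_tokens(tokens):
    """Tokens preceded by zero or more spaces that follow a line break (Preprocess 01)."""
    return [s for i, (k, s) in enumerate(tokens)
            if k == "token" and _kind_past_spaces(tokens, i - 1, -1) == "control"]


def end_tokens(tokens):
    """Tokens followed by zero or more spaces and then a line break (Preprocess 02)."""
    return [s for i, (k, s) in enumerate(tokens)
            if k == "token" and _kind_past_spaces(tokens, i + 1, +1) == "control"]
```

In this sketch the first Token of the text is not a Begin and the last Token is not an End unless a line break actually surrounds them, which mirrors the patterns as written.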
ex:JAPE Grammar
rdfs:label "Preprocess 03";
ex:pattern
{Sentence contains VG}
;
rdfs:comment "The grammar matches a Sentence (annotated by the ANNIE Sentence Splitter) that contains at least one verb group (VG). The Sentence is re-annotated as Rich_Sentence (pseudo-line)".
.
ex:JAPE Grammar
rdfs:label "Preprocess 04";
ex:pattern
{Rich_Sentence}
({Sentence}):match
{Rich_Sentence}
;
rdfs:comment "The grammar matches a Sentence wrapped between two Rich_Sentence annotations".
.
ex:JAPE Grammar
rdfs:label "Preprocess 05";
ex:pattern
{Token.length < 4, BL contains EL}
{Token.orth == "lowercase", BL contains EL}
;
rdfs:comment "The constraint BL (Beginning of Line) contains EL (End of Line) restricts matching to single-word Lines. The Token annotations result from the ANNIE Tokenizer".
.
ex:JAPE Grammar
rdfs:label "Preprocess 06";
ex:pattern
({BL, Token.kind == number, Token.length <= 2}
(({Token.string == "."})?
({SpaceToken.kind == space})?
({Token.kind == number, Token.length <= 2})?)
({SpaceToken.kind == space})+
({Token.orth !="lowercase", Token.kind == word})
({Token.kind == word}|{Token.kind == number}|
{Token.kind == punctuation}|
{SpaceToken.kind == space}|{Dots})*
{EL}):match
;
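rdfs:comment "The grammar matches headings that begin with a number. The rule matches phrases that start with numbering such as 1, 1. or 1.1, followed by a non-lowercase word Token, which is then followed by any number of Tokens, including sequences of Dots (previously identified), up to the end-of-line EL Token".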
.
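As a rough illustration, the Preprocess 06 pattern can be approximated with a regular expression applied to one line of plain text. This is only a sketch under stated assumptions: the line boundaries stand in for BL and EL, the trailing group is read as repeatable, and the name is_numbered_heading is hypothetical.

```python
import re

# Approximation of the Preprocess 06 pattern on a single line of text.
NUMBERED_HEADING = re.compile(
    r"\d{1,2}"                         # leading number of at most two digits
    r"(?:\.(?: ?\d{1,2})?| \d{1,2})?"  # optional ".", optional space, optional second number
    r" +"                              # one or more spaces
    r"(?=[A-Za-z]*[A-Z])[A-Za-z]+"     # a word that is not all lowercase
    r".*"                              # assumption: any remaining tokens up to the end of the line
)


def is_numbered_heading(line):
    """Return True if the whole (stripped) line looks like a numbered heading."""
    return NUMBERED_HEADING.fullmatch(line.strip()) is not None
```

A line such as "1.1 Site location" is accepted, while "12 trenches were dug" is rejected because the first word is lowercase, and "2021 Season" is rejected because the leading number has more than two digits.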
ex:JAPE Grammar
rdfs:label "Preprocess 07";
ex:pattern
{BL, Token.kind == number}
({Token.kind != word})[0,10]
{EL, Token.kind != word}
;
rdfs:comment "The grammar matches Lines of up to 12 Tokens that begin with a number and do not contain any Tokens of the kind word".
.
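The Preprocess 07 pattern (a number at BL, up to ten further non-word Tokens, and a non-word Token at EL) can be sketched as a check on one line. The tokenizer here is a toy classifier and an assumption; it ignores spaces, as the pattern only refers to Token annotations. The names token_kinds and is_wordless_line are hypothetical.

```python
import re


def token_kinds(line):
    """Toy stand-in for the ANNIE Tokenizer: classify each non-space token of a line."""
    kinds = []
    for tok in re.findall(r"[A-Za-z]+|\d+|[^\w\s]", line):
        if tok.isalpha():
            kinds.append("word")
        elif tok.isdigit():
            kinds.append("number")
        else:
            kinds.append("punctuation")
    return kinds


def is_wordless_line(line):
    """True for lines of 2 to 12 tokens that start with a number and contain no word."""
    kinds = token_kinds(line)
    return 2 <= len(kinds) <= 12 and kinds[0] == "number" and "word" not in kinds
```

The lower bound of two tokens follows from the pattern having separate BL and EL elements; the upper bound of twelve is the BL Token, the ten optional Tokens and the EL Token.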
ex:JAPE Grammar
rdfs:label "Preprocess 08";
ex:pattern
{Heading}
{Heading}
{Heading}
({Heading})+
;
rdfs:comment "The grammar annotates a TOC (table of contents) by matching four or more Headings in a row. At least four are required in order to avoid annotating runs of consecutive Headings within the document body (empirically two to three) that are not part of a TOC".
.
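The four-in-a-row threshold of Preprocess 08 can be illustrated with a short sketch over a list of per-line labels. The label strings and the function name find_toc_runs are hypothetical and only serve the example.

```python
from itertools import groupby


def find_toc_runs(line_labels, min_run=4):
    """Return (start, end) pairs, end exclusive, for runs of at least
    `min_run` consecutive "Heading" labels."""
    runs = []
    index = 0
    for label, group in groupby(line_labels):
        length = len(list(group))
        if label == "Heading" and length >= min_run:
            runs.append((index, index + length))
        index += length
    return runs
```

With the default threshold, a run of two or three Headings inside the body of a report is ignored, while a longer run is reported as a TOC candidate.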
ex:JAPE Grammar
rdfs:label "Preprocess 09";
ex:pattern
{Heading contains Lookup.type =="Summary"}
// Re-annotate Headings as Summary
{Heading.type=="Summary"}
{Heading}
;
rdfs:comment "The grammar matches a Heading containing any Lookup of the type Summary (this kind of Lookup originates from a gazetteer that contains the terms summary, abstract and overview) and re-annotates it as Summary".
.
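The gazetteer test behind Preprocess 09 can be sketched as a simple membership check. The three terms come from the description above; the case-insensitive whole-word matching and the name is_summary_heading are assumptions made for this example.

```python
import re

# Terms listed for the Summary gazetteer in the description above.
SUMMARY_TERMS = {"summary", "abstract", "overview"}


def is_summary_heading(heading):
    """True if any word of the heading is one of the Summary gazetteer terms."""
    words = re.findall(r"[A-Za-z]+", heading.lower())
    return any(word in SUMMARY_TERMS for word in words)
```

A Heading such as "Non-technical summary" is re-labelled as Summary in this sketch, while "2. Introduction" is left unchanged.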
