The GateProcess class is a reduce-heavy component that carries out all
of USFD's linguistic analysis for Workpackages 3 and 4.
The class has one nested class, GateReducer.
Its reduce method uses an instance of USFD's OfflineProcessing tool to
carry out linguistic and related analysis.
At the beginning of each reduce call, the GATE offline analysis
component creates a new, empty GATE corpus. Each
ResultResourceWrapper in the value list is turned into a GATE document
using its content and the mime type recorded in HBase, its key as a
URI (which becomes the WebResource URI in the generated RDF), and
its encoding (determined with com.ibm.icu.text.CharsetDetector).
The HTTP headers and status code are also passed to the GATE
component.
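
The per-call flow described above can be sketched in plain Java. This is a minimal illustration, not the project's code: FetchedResource, DocumentInput, EncodingDetector and buildCorpus are hypothetical stand-ins for ResultResourceWrapper, the GATE document, the ICU-based detection step and the corpus construction, and a plain list plays the role of the GATE corpus.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: every type here is a hypothetical stand-in. */
final class ReduceCorpusSketch {

    /** Hypothetical stand-in for a ResultResourceWrapper from the value list. */
    static final class FetchedResource {
        final String key;                  // used as the document URI
        final byte[] content;              // raw fetched bytes
        final String mimeType;             // as recorded in HBase
        final int statusCode;              // HTTP status code
        final Map<String, String> headers; // HTTP headers

        FetchedResource(String key, byte[] content, String mimeType,
                        int statusCode, Map<String, String> headers) {
            this.key = key;
            this.content = content;
            this.mimeType = mimeType;
            this.statusCode = statusCode;
            this.headers = headers;
        }
    }

    /** Hypothetical hook for encoding detection (the text uses ICU's CharsetDetector). */
    interface EncodingDetector {
        String detect(byte[] content);
    }

    /** Hypothetical stand-in for a GATE document plus the metadata passed with it. */
    static final class DocumentInput {
        final String uri;
        final String text;
        final String mimeType;
        final String encoding;
        final int statusCode;
        final Map<String, String> headers;

        DocumentInput(String uri, String text, String mimeType, String encoding,
                      int statusCode, Map<String, String> headers) {
            this.uri = uri;
            this.text = text;
            this.mimeType = mimeType;
            this.encoding = encoding;
            this.statusCode = statusCode;
            this.headers = headers;
        }
    }

    /** Builds a fresh corpus for one reduce call, one document per resource. */
    static List<DocumentInput> buildCorpus(List<FetchedResource> values,
                                           EncodingDetector detector) {
        List<DocumentInput> corpus = new ArrayList<>(); // new and empty on every call
        for (FetchedResource resource : values) {
            // Detect the encoding from the raw bytes, then decode with it.
            String encoding = detector.detect(resource.content);
            String text = new String(resource.content, Charset.forName(encoding));
            corpus.add(new DocumentInput(resource.key, text, resource.mimeType,
                    encoding, resource.statusCode, resource.headers));
        }
        return corpus;
    }
}
```

Keeping detection behind a small interface is a choice made for the sketch only: it lets the example run without the ICU library, while a real detector could be plugged in at the same point.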
The offline analysis component then runs four GATE applications
(TermRaider with Named Entity Recognition; event detection; opinion
detection; and analysis for YIS) over the corpus and collects the
results, which are returned to the reduce method and then stored as
follows:
termbanks are converted to sets of RDF triples, which are uploaded
to the TripleStoreConnector;
named entities, events, and opinions are also expressed as RDF
triples, which are uploaded to the TripleStoreConnector;
an XML representation of a basic linguistic analysis of each
document is stored in HBase to help YIS with topic detection.
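
As an illustration of the first storage step, the sketch below turns a small termbank into a set of RDF triples serialised as N-Triples lines. It is a rough sketch under stated assumptions: the http://example.org/vocab# predicates, the blank-node layout and the toTriples helper are all hypothetical, since the actual vocabulary and the TripleStoreConnector upload call are not described here. Scores are assumed to be finite.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only: vocabulary and triple layout are hypothetical. */
final class TermbankTriplesSketch {

    // Hypothetical vocabulary namespace, not the project's real one.
    static final String VOCAB = "http://example.org/vocab#";
    static final String XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double";

    /** Escapes the characters that must not appear raw in an N-Triples literal. */
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    /** Converts term scores for one document into N-Triples lines. */
    static Set<String> toTriples(String documentUri, Map<String, Double> termScores) {
        Set<String> triples = new LinkedHashSet<>();
        int index = 0;
        for (Map.Entry<String, Double> entry : termScores.entrySet()) {
            String termNode = "_:term" + index++; // one blank node per term
            triples.add("<" + documentUri + "> <" + VOCAB + "hasTerm> " + termNode + " .");
            triples.add(termNode + " <" + VOCAB + "termString> \""
                    + escapeLiteral(entry.getKey()) + "\" .");
            triples.add(termNode + " <" + VOCAB + "score> \""
                    + entry.getValue() + "\"^^<" + XSD_DOUBLE + "> .");
        }
        return triples;
    }
}
```

Each term yields three triples in this layout: one linking the document to the term node, one for the term string, and one for its score.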