From: <ian...@us...> - 2014-05-15 16:59:11
Revision: 17998
          http://sourceforge.net/p/gate/code/17998
Author:   ian_roberts
Date:     2014-05-15 16:59:08 +0000 (Thu, 15 May 2014)
Log Message:
-----------
2.4 changelog

Modified Paths:
--------------
    gcp/trunk/doc/batch-def.tex
    gcp/trunk/doc/introduction.tex

Modified: gcp/trunk/doc/batch-def.tex
===================================================================
--- gcp/trunk/doc/batch-def.tex 2014-05-15 16:36:54 UTC (rev 17997)
+++ gcp/trunk/doc/batch-def.tex 2014-05-15 16:59:08 UTC (rev 17998)
@@ -172,6 +172,7 @@
 assumes that the document ID is the path of an entry in the ZIP file.
 
 \subsection{The {\tt ARCInputHandler} and {\tt WARCInputHandler}}
+\label{sec:batch-def:arc}
 
 These two input handlers read documents out of ARC- and WARC format web archive
 files as produced by the Heritrix web crawler and other similar tools. They

Modified: gcp/trunk/doc/introduction.tex
===================================================================
--- gcp/trunk/doc/introduction.tex 2014-05-15 16:36:54 UTC (rev 17997)
+++ gcp/trunk/doc/introduction.tex 2014-05-15 16:59:08 UTC (rev 17998)
@@ -134,6 +134,26 @@
 
 This section summarises the main changes between releases of GCP
 
+\subsection{2.4 (May 2014)}
+
+\bit
+\item Now depends on GATE Embedded 8.0
+\item Added input handler for WARC format archives, to complement the existing
+  ARC handler (section~\ref{sec:batch-def:arc}).
+\item ARC and WARC handlers can optionally load individual records from
+  remotely hosted archives using HTTP requests with a ``Range'' header. This
+  facility can be used with publicly-hosted data sets such as Common
+  Crawl\footnote{\url{http://www.commoncrawl.org}}. To support this
+  functionality, document identifiers in a batch definition can now take XML
+  attributes as well as the actual string identifier (exactly how such
+  attributes are used is up to the handler implementations).
+\item Added output handler to save documents in a JSON format modelled on that
+  used by Twitter to represent ``entities'' (e.g. username mentions) in Tweets.
+\item Efficiency improvements in the M\'{i}mir output handler, to send
+  documents to the server in batches rather than opening a new HTTP connection
+  for every document.
+\eit
+
 \subsection{2.3 (November 2012)}
 
 \bit
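
For readers unfamiliar with the ``Range'' mechanism mentioned in the 2.4 changelog, the following is a minimal Python sketch of fetching a single record from a remotely hosted archive. It is not GCP's implementation (GCP is a Java tool and its handler code is not shown in this commit). The function names range_header and fetch_record are hypothetical, and the sketch assumes that each record is stored as its own gzip member and that its byte offset and length are already known, for example from an index.

```python
import gzip
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Byte ranges are inclusive at both ends, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(url: str, offset: int, length: int) -> bytes:
    """Fetch one record from a remote archive and decompress it.

    Assumption: the record is an independent gzip member, so the byte
    slice can be decompressed without the rest of the archive.
    """
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the server honoured the Range header.
        if response.status != 206:
            raise RuntimeError(
                f"expected 206 Partial Content, got {response.status}"
            )
        return gzip.decompress(response.read())
```

Because only the requested bytes are transferred, a batch can process selected records without downloading whole archive files. The offset and length would have to travel with each document identifier, which is presumably the kind of per-document information the new XML attributes are meant to carry.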