[SIGLEX-MWE] [PARSEME-ST] Training data released

Dear all,

We are happy to announce the release of the training data for the PARSEME
shared task <http://multiword.sourceforge.net/sharedtask2017> on the
automatic detection of verbal multiword expressions (VMWEs):

https://gitlab.com/parseme/sharedtask-data/

We provide full training sets for 15 languages: Bulgarian (BG), Czech (CS),
German (DE), Greek (EL), French (FR), Hebrew (HE), Hungarian (HU), Italian
(IT), Lithuanian (LT), Maltese (MT), Polish (PL), Brazilian Portuguese
(PT), Romanian (RO), Slovene (SL), Turkish (TR).

For two languages, Spanish (ES) and Swedish (SV), we intend to provide only
test data, but the trial data can be used for training. The release for one
language, Farsi (FA), has been postponed.

Most datasets include two files: a *train.parsemetsv* file (conforming to
the parseme-tsv format
<http://typo.uni-konstanz.de/parseme/index.php/2-general/184-parseme-shared-task-format-of-the-final-annotation>)
containing the VMWE annotations, and a companion file in a CoNLL-U
<http://universaldependencies.org/format.html>-compatible format with
morphosyntactic and/or syntactic information. Both can be used in the
closed track. The companion file is provided for all languages except BG,
ES, HE and LT. Most annotations are available under open licenses, notably
various flavors of the Creative Commons license.
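For participants who want a quick look at the annotations, here is a minimal Python sketch for reading a *train.parsemetsv* file. It is an illustration, not an official tool. It assumes the layout described on the format page linked above: one token per line with four tab-separated columns (rank, surface form, no-space flag, MWE codes), blank lines between sentences, "_" for an empty value, and MWE codes such as "1:LVC" on the first token of an expression and a bare "1" on its other tokens, joined by ";" when a token belongs to several expressions. The names read_sentences and count_vmwes are hypothetical. Please check the format page before relying on these assumptions.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Assumed column layout: rank, surface form, no-space flag, MWE codes.
Token = Tuple[str, ...]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of 4-column tokens (hypothetical helper)."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence (assumption).
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Skip comment lines, if any are present (assumption).
            continue
        cols = line.split("\t")
        if len(cols) != 4:
            raise ValueError(f"expected 4 tab-separated columns, got: {line!r}")
        sentence.append(tuple(cols))
    if sentence:
        yield sentence


def count_vmwes(sentence: List[Token]) -> Counter:
    """Count VMWEs per category; a category is only given on the first token."""
    counts: Counter = Counter()
    for _rank, _form, _nsp, mwe in sentence:
        if mwe == "_":
            continue
        for code in mwe.split(";"):
            if ":" in code:
                _mwe_id, category = code.split(":", 1)
                counts[category] += 1
    return counts
```

A typical use would be to open the file with encoding="utf-8", pass the file object to read_sentences, and sum the counters returned by count_vmwes over all sentences.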

In total, we provide 225,008 sentences representing 4,394,338 tokens and
containing 49,340 annotated verbal multiword expressions. The table below
summarizes the sizes of the datasets per language.

[image: Screenshot from 2017-01-06 21:33:15.png]

We hope that this highly multilingual dataset will foster the development
of language-independent and cross-lingual MWE identification systems. We
remind you that the shared task evaluation phase will take place from
January 20 to 27.

This has been a tremendous collective effort, possible only thanks to the
strong commitment of many annotators, language leaders, organizers and
technical support staff. We would like to thank all contributors for the
time and enthusiasm they invested in the creation of this amazing resource.

All the best,

Silvio, Agata, Carlos and the whole shared task team

A note for shared task participants: Small trial datasets were previously
released via the same link <https://gitlab.com/parseme/sharedtask-data/>
for most languages. We cannot guarantee that no part of these data will
be included in the test data released on 20 January. Therefore, we kindly
ask participants not to use the trial.parsemetsv files for any language
(except ES and SV) when training the final versions of their systems.
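
The rule above can be written down as a small Python sketch, for participants who script their training pipeline. The function name training_files and the constant TRIAL_ALLOWED are hypothetical; only the file names train.parsemetsv and trial.parsemetsv and the ES/SV exception come from this message. The sketch does not check whether a language code is actually part of the release.

```python
from typing import List

# Languages for which only trial data is available for training
# (as stated in this announcement).
TRIAL_ALLOWED = {"ES", "SV"}


def training_files(lang: str) -> List[str]:
    """Return the annotation file names to train on for a language code."""
    if lang.upper() in TRIAL_ALLOWED:
        return ["trial.parsemetsv"]
    # For all other languages, leave trial.parsemetsv out of training.
    return ["train.parsemetsv"]
```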