|
From: Silvio R. C. <sil...@gm...> - 2017-01-06 23:30:39
|
Dear all,

We are happy to announce the release of the training data for the PARSEME shared task <http://multiword.sourceforge.net/sharedtask2017> on the automatic detection of verbal multiword expressions (VMWEs):

https://gitlab.com/parseme/sharedtask-data/

We provide full training sets for 15 languages: Bulgarian (BG), Czech (CS), German (DE), Greek (EL), French (FR), Hebrew (HE), Hungarian (HU), Italian (IT), Lithuanian (LT), Maltese (MT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Slovene (SL), Turkish (TR). For 2 languages, we intend to provide only test data, but the trial data is available for training: Spanish (ES), Swedish (SV). One language's release has been postponed: Farsi (FA).

Most datasets include two files: a *train.parsemetsv* file (conforming to the parseme-tsv format <http://typo.uni-konstanz.de/parseme/index.php/2-general/184-parseme-shared-task-format-of-the-final-annotation>) containing the VMWE annotations, and a companion file in a CoNLL-U <http://universaldependencies.org/format.html>-compatible format with morphosyntactic and/or syntactic information. Both can be used in the closed track. The companion file is provided for all languages except BG, ES, HE and LT.

Most annotations are available under open licenses, notably various flavors of the Creative Commons license.

In total, we provide 225,008 sentences representing 4,394,338 tokens and containing 49,340 annotated verbal multiword expressions. The table below summarizes the sizes of the datasets per language.

[image: Screenshot from 2017-01-06 21:33:15.png]

We hope that this highly multilingual dataset will foster the development of language-independent and cross-lingual MWE identification systems.

We remind you that the shared task evaluation phase will take place from January 20 to 27.

This has been a tremendous collective effort, possible only thanks to the strong commitment of many annotators, language leaders, organizers and technical support. We would like to thank all contributors for the time and enthusiasm they invested in the creation of this amazing resource.

All the best,
Silvio, Agata, Carlos and the whole shared task team

A note for shared task participants: Small trial datasets were previously released via the same link <https://gitlab.com/parseme/sharedtask-data/> for most languages. We cannot fully guarantee that no part of these data will be included in the test data released on 20 January. Therefore, we kindly ask participants not to use the trial.parsemetsv files for any language (except ES and SV) when training the final versions of their systems.
|
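For readers who want to check the sentence, token and VMWE counts of a downloaded train.parsemetsv file, the following is a minimal Python sketch and not part of the official shared task tooling. It assumes a layout that the announcement itself does not spell out: four tab-separated columns (token rank, surface form, no-space flag, VMWE codes), blank lines between sentences, `#` for comment lines, and VMWE codes of the form `1:LVC` on the first token of an expression and a bare `1` on its other tokens, with several codes joined by `;`. The function name `summarize_parsemetsv` and the sample sentences are hypothetical; please check the linked format description before relying on the counts.

```python
from collections import Counter
from typing import Dict, Iterable

# Invented sample in the ASSUMED four-column layout (not taken from the real data).
SAMPLE = [
    "1\tHe\t_\t_\n",
    "2\ttook\t_\t1:LVC\n",
    "3\ta\t_\t_\n",
    "4\twalk\t_\t1\n",
    "\n",
    "1\tShe\t_\t_\n",
    "2\tgave\t_\t1:VPC\n",
    "3\tup\tnsp\t1\n",
    "4\t.\t_\t_\n",
]


def summarize_parsemetsv(lines: Iterable[str]) -> Dict[str, object]:
    """Count sentences, token rows and VMWEs in parseme-tsv-like lines."""
    sentences = 0
    tokens = 0
    by_category: Counter = Counter()
    in_sentence = False

    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Assumed comment syntax; skipped entirely.
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True

        # Every remaining row is counted as one token (no special
        # handling of contraction ranges, which is a simplification).
        tokens += 1
        cols = line.split("\t")
        codes = cols[3] if len(cols) > 3 else "_"
        for code in codes.split(";"):
            # Only the first token of a VMWE is assumed to carry "id:CATEGORY",
            # so each expression is counted exactly once.
            if ":" in code:
                by_category[code.split(":", 1)[1]] += 1

    return {
        "sentences": sentences,
        "tokens": tokens,
        "vmwes": sum(by_category.values()),
        "by_category": dict(by_category),
    }


if __name__ == "__main__":
    print(summarize_parsemetsv(SAMPLE))
```

To run it on real data, pass an open file handle instead of the sample, for example `summarize_parsemetsv(open("train.parsemetsv", encoding="utf-8"))`, where the path is whatever location the repository was downloaded to.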