Name | Modified | Size | Downloads / Week |
---|---|---|---|
Old-Data | 2014-08-22 | | |
README | 2023-07-25 | 596 Bytes | |
LTI-LangID-rel5.txz | 2023-07-25 | 753.8 MB | |
LTI-LangID-rel4.txz | 2020-06-12 | 669.8 MB | |
LTI-LangID-rel3.txz | 2018-02-22 | 461.3 MB | |
LTI-LangID-rel2.txz | 2014-11-21 | 395.5 MB | |
LTI-LangID-rel1.txz | 2014-08-22 | 373.4 MB | |
Totals: 7 Items | | 2.7 GB | 9 |

The files in this directory contain the various releases of the LTI LangID Corpus. Release 1 contains 781 "core" languages and 1091 overall, and is the version to use if you wish to replicate the EMNLP 2014 experiments. Release 3 contains 970 "core" languages and 1279 overall. Release 4 contains 1152 "core" languages and 1547 overall (note that the 00README inside the archive accidentally omits one Wikipedia language from its count). Release 5 contains 1266 "core" languages and 1706 overall, and includes scripts to download non-redistributable text for more than 1000 additional languages.
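
The releases are distributed as `.txz` files, that is, xz-compressed tar archives. As a minimal sketch of one way to inspect and unpack such an archive, the following uses Python's standard `tarfile` module. The helper names `list_members` and `extract_archive` are hypothetical and are not part of the corpus or its scripts; nothing here assumes a particular layout inside the archive.

```python
import tarfile
from pathlib import Path


def list_members(archive_path):
    """Return the names of all entries in an xz-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:xz") as archive:
        return archive.getnames()


def extract_archive(archive_path, dest_dir):
    """Unpack an xz-compressed tar archive into dest_dir and return that path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:xz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Python versions with extraction filters can reject unsafe paths.
            archive.extractall(dest, filter="data")
        else:
            # Older Python versions: plain extraction without a filter.
            archive.extractall(dest)
    return dest
```

For example, `extract_archive("LTI-LangID-rel5.txz", "langid-rel5")` would unpack the latest release into a directory named `langid-rel5`; the destination name is only an illustration. Listing the members first with `list_members` avoids unpacking several hundred megabytes just to find the 00README.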