This directory provides a Java interface to zpar for Chinese word segmentation, POS tagging and dependency parsing, using the POS tag set and dependency tag set of the Peking University Multi-view Treebank (PMT), on 64-bit Windows and Linux (Fedora).
The DLL files were compiled with the cross-platform, open-source IDE "Code::Blocks" (version 13.12), using the MinGW64 C++ compiler on Windows 8.1; the Linux builds were made on Fedora.
You can use the DLLs with the Java interface directly, without compiling anything yourself.
Using zpar in 64-bit Java (V1.7) on Windows:
(1)Trained models: One trained model for parsing, "model_parse_pmt1", and two trained models for joint segmentation and tagging, "model_tag_pfr1" and "model_tag_pfr6", are in the dir "model/".
(2)Dlls: Copy the DLLs "cn_nlp_Parser.dll" and "cn_nlp_Tagger.dll", together with the two runtime DLLs "libgcc_s_seh-1.dll" and "libstdc++-6.dll", from "dll/64/" to your Java project directory.
(3)Examples: Examples of using ZParser and ZTagger are given in the dir "src/cn". You can use the parser and tagger separately or jointly; refer to the usage shown in the examples.
(4)User dict for word segmentation and POS tagging: You can supply a user dictionary in the file "userdict.txt". In this file (UTF-8 encoding), each line contains a word and a POS tag separated by a tab. If you do not have a proper POS tag for a word, you may use the default tag "n" for it.
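The user dictionary format in (4) can be produced from Java. The following is only a minimal sketch under stated assumptions: the class "UserDictWriter" and its methods are hypothetical helpers, not part of zpar or of this interface, and the tag "nt" is only an illustrative value (use tags from the tag set of the model you load). Words are written as Unicode escapes so that the source file compiles regardless of the platform's default source encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of zpar): writes "word<TAB>tag" lines in UTF-8.
final class UserDictWriter {
    // Default tag suggested in this README for words without a proper POS tag.
    static final String DEFAULT_TAG = "n";

    // Builds one dictionary line: the word, a tab, then the POS tag.
    static String formatEntry(String word, String tag) {
        if (word == null || word.isEmpty()
                || word.contains("\t") || word.contains("\n")) {
            throw new IllegalArgumentException("word must be non-empty, without tabs or newlines");
        }
        String t = (tag == null || tag.trim().isEmpty()) ? DEFAULT_TAG : tag.trim();
        return word + "\t" + t;
    }

    // Writes all entries, one per line, using UTF-8 as required for userdict.txt.
    static void write(Path file, Map<String, String> entries) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            lines.add(formatEntry(e.getKey(), e.getValue()));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> entries = new LinkedHashMap<>();
        // "Peking University", with an illustrative tag (an assumption).
        entries.put("\u5317\u4eac\u5927\u5b66", "nt");
        // "syntax", with no tag given, so the default tag "n" is used.
        entries.put("\u53e5\u6cd5", null);
        write(Paths.get("userdict.txt"), entries);
    }
}
```

Running the main method writes "userdict.txt" to the current working directory; move it to wherever the tagger examples in "src/cn" expect it.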
Note: the tagger and parser cannot process files whose names contain Chinese characters.
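To catch this restriction before calling the tagger or parser, a file name can be checked from Java first. This is only a sketch: the class "HanPathCheck" is a hypothetical helper, not part of zpar, and it assumes that "Chinese characters" means code points in the Unicode Han script. Whether directory names in the path are also affected is not stated here, so the sketch conservatively checks whatever string it is given.

```java
// Hypothetical helper (not part of zpar): detects Han characters in a path string.
final class HanPathCheck {
    // Returns true if any code point in the string belongs to the Han script.
    static boolean containsHan(String path) {
        for (int i = 0; i < path.length(); ) {
            int cp = path.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                return true;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return false;
    }

    public static void main(String[] args) {
        // Pass candidate input file names as command-line arguments.
        for (String p : args) {
            String verdict = containsHan(p)
                    ? "contains Chinese characters; rename before use"
                    : "ok";
            System.out.println(p + ": " + verdict);
        }
    }
}
```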
About the models:
(1)The model "model_parser_arceager_mvt_origin_autopos" can be used for Chinese dependency parsing.
If you use it, please cite the following paper:
@InProceedings{qiu-EtAl:2014:Coling2,
author = {Qiu, Likun and Zhang, Yue and Jin, Peng and Wang, Houfeng},
title = {Multi-view Chinese Treebanking},
booktitle = {Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers},
month = {August},
year = {2014},
address = {Dublin, Ireland},
publisher = {Dublin City University and Association for Computational Linguistics},
pages = {257--268},
url = {http://www.aclweb.org/anthology/C14-1026}
}
(2)The models "model_tag_science" and "model_tag_pfr6" are trained on the People's Daily Corpus of January 1998 plus a few sentences from the scientific domain, and on the People's Daily Corpus of January to June 2000, respectively.
If you use them in your paper, please cite the following papers:
@InProceedings{qiu-EtAl:2014:Coling2,
author = {Qiu, Likun and Zhang, Yue and Jin, Peng and Wang, Houfeng},
title = {Multi-view Chinese Treebanking},
booktitle = {Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers},
month = {August},
year = {2014},
address = {Dublin, Ireland},
publisher = {Dublin City University and Association for Computational Linguistics},
pages = {257--268},
url = {http://www.aclweb.org/anthology/C14-1026}
}
@article{yu2003specification,
title={Specification for corpus processing at Peking University: Word segmentation, {POS} tagging and phonetic notation},
author={Yu, Shiwen and Duan, Huiming and Zhu, Xuefeng and Swen, Bin and Chang, Baobao},
journal={Journal of {Chinese} Language and Computing},
volume={13},
number={2},
pages={121--158},
year={2003}
}