ARAbic INtrinsic plagiarism detection Corpus (InAra Corpus 2013)
I. SYNOPSIS
II. DESCRIPTION
III. PURPOSE
IV. BUILDING METHODS
V. LANGUAGE AND ENCODING
VI. SOURCES OF TEXTS
VII. COPYRIGHT AND AVAILABILITY
VIII. HOW TO CITE THE CORPUS ?
IX. WARNING
X. CONTACT US
I. SYNOPSIS
The InAra corpus comprises 1024 documents, 80% of which contain passages
borrowed from other documents in order to simulate documents that
contain plagiarized fragments.
II. DESCRIPTION
This corpus consists mainly of two datasets: text files and XML files.
The text files are the suspicious documents, i.e. the documents that
contain artificial plagiarism. The XML files are the plagiarism
annotations: for each plagiarized passage, they give its starting offset
in the suspicious document and its length (both offset and length are
expressed in bytes, not in characters). A suspicious document file and
its plagiarism annotation file share the same name. Additional XML files
are provided in the folder test-annotation; they contain the document
metadata but no plagiarism information. The results of the method under
evaluation can be inserted into these files to ease their comparison
with the files in the plagiarism-annotation folder.
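Because offsets and lengths are expressed in bytes, a passage has to be
sliced from the raw bytes of the file before decoding; Arabic letters
take two bytes each in UTF-8, so character indices would point at the
wrong place. The following is a minimal Python sketch of that idea. The
function names are illustrative only and are not part of the corpus, and
no assumption is made here about the XML element or attribute names.

```python
from typing import Tuple


def extract_passage(raw: bytes, offset: int, length: int) -> str:
    """Return the passage located at a byte offset/length in UTF-8 bytes."""
    # Slice the bytes first, then decode: the annotation counts bytes.
    return raw[offset:offset + length].decode("utf-8")


def byte_span_to_char_span(raw: bytes, offset: int, length: int) -> Tuple[int, int]:
    """Convert (byte offset, byte length) to (char offset, char length)."""
    # Decoding the prefix gives the number of characters before the passage.
    char_offset = len(raw[:offset].decode("utf-8"))
    char_length = len(raw[offset:offset + length].decode("utf-8"))
    return char_offset, char_length
```

In practice, raw would be the content of a suspicious document opened in
binary mode, and offset and length would come from its annotation file.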
III. PURPOSE
The purpose of InAra corpus is to evaluate automatic plagiarism
detection methods, notably methods of the intrinsic approach. This
approach consists in uncovering the plagiarized passages on the basis of
the writing style inconsistency in a given suspicious document. As
opposed to the external approach, the intrinsic approach does not
necessitate any comparison of the suspicious document against the
potential sources of plagiarism. It should be noted that InAra corpus is
not appropriate for the evaluation of external plagiarism detection,
because the plagiarism cases were taken from the sources and inserted
directly into the suspicious documents without undergoing any
obfuscation. Detecting verbatim plagiarism is no longer a challenging
problem, so the plagiarism in InAra corpus would be very easy to detect
using the external approach.
IV. BUILDING METHODS
The documents that compose InAra corpus do not contain actual plagiarism
cases. They are artificial suspicious documents in which plagiarism was
created automatically by a program that takes fragments of text from one
or more source documents and inserts them into another document
according to a set of parameters, namely the percentage of plagiarism
and the lengths of the plagiarized passages. This is the same building
method that was used to construct the PAN 2009-2011 plagiarism detection
corpora (see http://pan.webis.de for more information on the PAN
competition and its corpora).
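The insertion step described above can be illustrated with a short
Python sketch. This is not the software used to build the corpus: the
function name, the choice of word boundaries as insertion points and the
seed parameter are all assumptions made for illustration, and the
parameters that control the percentage of plagiarism and the passage
lengths are left out.

```python
import random
from typing import List, Tuple


def insert_fragments(host: str, fragments: List[str],
                     seed: int = 0) -> Tuple[str, List[Tuple[int, int]]]:
    """Insert fragments at random word boundaries of host.

    Returns the new text and one (byte_offset, byte_length) pair per
    fragment, in the same order as the fragments list.
    """
    rng = random.Random(seed)
    words = host.split()
    # One random slot (index between words) per fragment, in ascending order.
    slots = sorted(rng.randint(0, len(words)) for _ in fragments)

    # Interleave host words and fragments, remembering which is which.
    units = []
    next_fragment = iter(fragments)
    slot_index = 0
    for i in range(len(words) + 1):
        while slot_index < len(slots) and slots[slot_index] == i:
            units.append((next(next_fragment), True))
            slot_index += 1
        if i < len(words):
            units.append((words[i], False))

    # Join with single spaces while tracking the position in bytes.
    parts, annotations, byte_pos = [], [], 0
    for k, (text, inserted) in enumerate(units):
        if k > 0:
            parts.append(" ")
            byte_pos += 1  # a space is one byte in UTF-8
        size = len(text.encode("utf-8"))
        if inserted:
            annotations.append((byte_pos, size))
        parts.append(text)
        byte_pos += size
    return "".join(parts), annotations
```

Recording the offsets while the text is assembled means the annotation
never has to be recovered by searching for the fragment afterwards.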
V. LANGUAGE AND ENCODING
All the text documents of this corpus are written in Arabic and encoded
in UTF-8 without BOM. The XML documents contain titles and author names
in Arabic, so they are also encoded in UTF-8 without BOM.
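The absence of a BOM matters when working with byte offsets: a UTF-8 BOM
is three bytes long, so a tool that added one when re-saving a file
would shift every annotated offset by three. A minimal Python sketch of
a guard against that, with an illustrative function name that is not
part of the corpus:

```python
import codecs


def decode_corpus_bytes(raw: bytes) -> str:
    """Decode UTF-8 bytes, rejecting input that starts with a BOM."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if raw.startswith(codecs.BOM_UTF8):
        raise ValueError("unexpected UTF-8 BOM: byte offsets would be shifted")
    return raw.decode("utf-8")
```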
VI. SOURCES OF TEXTS
The texts used to build this corpus, both the suspicious documents and
the inserted passages, are taken mainly from the open library Arabic
Wikisource (http://ar.wikisource.org), one of the Wikimedia Foundation
projects. A few documents were taken from other websites, namely:
Create your own country blog: http://diycountry.blogspot.com
Corpus of Classical Arabic (KSUCCA): http://ksucorpus.ksu.edu.sa
Islamic book web site: http://www.islamicbook.ws
VII. COPYRIGHT AND AVAILABILITY
We were very careful to build the corpus from copyright-free texts only,
so that it can be made publicly available without any problems with the
owners of the texts.
VIII. HOW TO CITE THE CORPUS ?
If you publish a paper about your experimentation using InAra corpus,
please cite the following paper:
Bensalem, I., Rosso, P., Chikhi, S.: A New Corpus for the Evaluation of
Arabic Intrinsic Plagiarism Detection. CLEF 2013, Valencia, Spain,
September 23-26, Springer. (to appear)
Additional information on the building of the corpus is available in
the paper:
Bensalem, I., Rosso, P., Chikhi, S.: Building Arabic Corpora from
Wikisource. 10th ACS/IEEE International Conference on Computer Systems
and Applications (AICCSA'13), Fes/Ifran, Morocco, May 27-30.
IX. WARNING
It should be noted that the Arabic texts may contain quotations from the
Qur'an and the Hadith; and due to the fact that text insertion is
automatic and in random positions, it is possible that the plagiarized
text is inserted unintentionally between Quranic verses or sentences of
a Hadith cited in a document. Moreover, the inserted passages may alter
the meaning of the original text. For these reasons, this corpus must
not be used outside the purpose for which it was built. Examples of
inappropriate use include using the corpus documents as a source of
knowledge or distributing them without mentioning that they contain
borrowed texts. If you are not interested in plagiarism detection and
you are keeping the corpus because it contains books you want to read,
then this corpus is not the right source. Please refer instead to the
sources mentioned in Section VI, where you can find the original content
of the books you are looking for. We emphasize that we are not
responsible for the results of any use of this corpus other than the
evaluation of intrinsic plagiarism detection methods.
X. CONTACT US
We will be happy to hear from you about your experience in using InAra
corpus. Please do not hesitate to contact us at the following email
address: bens.imene@gmail.com
Imene Bensalem¹, Paolo Rosso², Salim Chikhi¹
¹MISC Lab. Constantine 2 university, Algeria
²NLE Lab. – EliRF, Universitat Politècnica de València, Spain