Name              Modified    Size     Downloads / Week
InAra-Corpus.zip  2014-01-23  35.3 MB
______.txt        2013-07-16  9.0 kB
readme.txt        2013-07-16  5.9 kB
Totals: 3 items               35.3 MB  0

ARAbic INtrinsic plagiarism detection Corpus (InAra Corpus 2013) 

	I.    SYNOPSIS
	II.   DESCRIPTION
	III.  PURPOSE
	IV.   BUILDING METHODS
	V.    LANGUAGE AND ENCODING
	VI.   SOURCES OF TEXTS
	VII.  COPYRIGHT AND AVAILABILITY
	VIII. HOW TO CITE THE CORPUS?
	IX.   WARNING
	X.    CONTACT US

I. SYNOPSIS 

The InAra corpus comprises 1024 documents; 80% of them contain passages
borrowed from other documents, simulating documents that contain
plagiarized fragments.

II. DESCRIPTION 

This corpus consists mainly of two datasets: text files and XML files.
The text files are the suspicious documents, i.e. the documents that
contain artificial plagiarism. The XML files are the plagiarism
annotations, i.e. they give, for each plagiarized passage, its starting
offset in the suspicious document and its length (both offset and
length are expressed in bytes, not in characters). A suspicious document
file and its plagiarism annotation file share the same name. Additional
XML files are provided in the folder test-annotation; they contain the
document metadata but not the plagiarism information. The results of
the method under evaluation can be inserted into these files to
facilitate comparison with the files in the plagiarism-annotation
folder.
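
Because offsets and lengths are given in bytes, a tool that works on
decoded strings has to convert them to character positions first. The
following is a minimal Python sketch of that conversion, assuming the
suspicious document has been read as raw UTF-8 bytes. The function name
byte_span_to_char_span is hypothetical and is not part of the corpus.

```python
from typing import Tuple


def byte_span_to_char_span(raw: bytes, byte_offset: int, byte_length: int) -> Tuple[int, int]:
    """Convert a (byte offset, byte length) span over UTF-8 data
    into a (character offset, character length) span."""
    # Characters that precede the passage: decode everything before the offset.
    prefix = raw[:byte_offset].decode("utf-8")
    # Characters inside the passage itself.
    passage = raw[byte_offset:byte_offset + byte_length].decode("utf-8")
    return len(prefix), len(passage)


# Each Arabic letter below takes 2 bytes in UTF-8, so byte and
# character positions diverge after the first Arabic character.
sample = "abc مرحبا xyz".encode("utf-8")
char_offset, char_length = byte_span_to_char_span(sample, 4, 10)
```

If an offset falls in the middle of a multi-byte character, decode()
raises UnicodeDecodeError, which is a useful signal that the offset was
mistaken for a character position.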

III. PURPOSE 

The purpose of the InAra corpus is to evaluate automatic plagiarism
detection methods, notably methods of the intrinsic approach. This
approach uncovers plagiarized passages on the basis of writing style
inconsistencies within a given suspicious document. As opposed to the
external approach, the intrinsic approach does not require comparing
the suspicious document against the potential sources of plagiarism.
Note that the InAra corpus is not appropriate for evaluating external
plagiarism detection, because the plagiarism cases were taken from the
sources and inserted directly into the suspicious documents without any
obfuscation. Detecting verbatim plagiarism is no longer a challenging
problem, so the plagiarism in the InAra corpus would be very easy to
detect using the external approach.
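
To illustrate the idea of the intrinsic approach, here is a toy Python
sketch that flags windows whose character distribution differs strongly
from that of the whole document. It is not a method proposed by the
corpus authors; the feature (character frequencies), the window size,
the threshold, and all function names are assumptions chosen for
brevity. Offsets here are in characters, not bytes.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def char_profile(text: str) -> Dict[str, float]:
    """Relative frequency of each character in the text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def cosine_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """1 minus the cosine similarity of two non-empty frequency profiles."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


def flag_outlier_windows(text: str, window: int = 200,
                         threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (offset, length) for each window whose profile is far
    from the profile of the whole document."""
    doc_profile = char_profile(text)
    flagged = []
    for start in range(0, len(text), window):
        chunk = text[start:start + window]
        if cosine_distance(char_profile(chunk), doc_profile) > threshold:
            flagged.append((start, len(chunk)))
    return flagged
```

Real intrinsic detectors use richer stylistic features, but the
structure is the same: no source document is consulted, only the
suspicious document itself.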

IV. BUILDING METHODS 

The documents that compose the InAra corpus do not contain actual
plagiarism cases. They are artificial suspicious documents in which
plagiarism was created automatically by software that takes fragments
of text from one or more source documents and inserts them into another
document according to a set of parameters, namely the percentage of
plagiarism and the lengths of the plagiarized passages. This building
method is the same as the one used to construct the PAN 2009-2011
plagiarism detection corpora (see http://pan.webis.de for more
information on the PAN competition and its corpora).
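
The insertion step can be sketched in Python as follows. This is a
simplified illustration, not the software used to build the corpus: it
takes already selected passages, inserts them at random positions, and
records each passage's byte offset and byte length in the resulting
document. The function name insert_passages is hypothetical, and the
selection of passages by plagiarism percentage and passage length is
left out.

```python
import random
from typing import List, Tuple


def insert_passages(host: str, passages: List[str],
                    rng: random.Random) -> Tuple[str, List[Tuple[int, int]]]:
    """Insert each passage at a random position of the host text.

    Returns the new text and one (byte offset, byte length) pair per
    passage, measured on the UTF-8 encoding of the new text."""
    # Pick one insertion point per passage in the original host,
    # sorted so the document can be rebuilt left to right.
    points = sorted(rng.randrange(len(host) + 1) for _ in passages)
    pieces: List[str] = []
    annotations: List[Tuple[int, int]] = []
    byte_pos = 0  # running byte position in the output document
    prev = 0      # end of the host segment already copied
    for point, passage in zip(points, passages):
        segment = host[prev:point]
        pieces.append(segment)
        byte_pos += len(segment.encode("utf-8"))
        passage_bytes = len(passage.encode("utf-8"))
        annotations.append((byte_pos, passage_bytes))
        pieces.append(passage)
        byte_pos += passage_bytes
        prev = point
    pieces.append(host[prev:])
    return "".join(pieces), annotations
```

Passing a seeded random.Random instance makes the generated document
reproducible.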

V. LANGUAGE AND ENCODING 

All the text documents of this corpus are written in Arabic and encoded
in UTF-8 without BOM. The XML documents contain titles and author names
in Arabic, so they are also encoded in UTF-8 without BOM.
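
A minimal Python sketch for reading corpus data in line with this
encoding: it decodes strictly as UTF-8 and rejects data that starts
with a byte order mark, since a BOM would shift every byte offset by
three. The function name decode_utf8_no_bom is hypothetical.

```python
import codecs


def decode_utf8_no_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, refusing input that starts with a BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        raise ValueError("unexpected UTF-8 byte order mark")
    # Strict decoding: invalid byte sequences raise UnicodeDecodeError.
    return raw.decode("utf-8")
```

Reading files in binary mode and decoding explicitly, rather than
relying on the platform default encoding, keeps byte offsets meaningful.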

VI. SOURCES OF TEXTS 

The texts used to build this corpus, whether suspicious documents or
inserted passages, are taken mainly from the open library Arabic
Wikisource (http://ar.wikisource.org), one of the Wikimedia Foundation
projects. A small number of documents were taken from other websites,
namely:
Create your own country blog: http://diycountry.blogspot.com 
Corpus of Classical Arabic (KSUCCA): http://ksucorpus.ksu.edu.sa 
Islamic book web site: http://www.islamicbook.ws 

VII. COPYRIGHT AND AVAILABILITY 

We were very careful to build the corpus from copyright-free texts only,
so that it can be made publicly available without any problems with the
owners of the texts.

VIII. HOW TO CITE THE CORPUS?

If you publish a paper on experiments that use the InAra corpus,
please cite the following paper:

Bensalem, I., Rosso, P., Chikhi, S.: A New Corpus for the Evaluation of
Arabic Intrinsic Plagiarism Detection. CLEF 2013, Valencia, Spain,
September 23-26, Springer. (to appear)

Additional information on building the corpus is in the paper:

Bensalem, I., Rosso, P., Chikhi, S.: Building Arabic Corpora from 
Wikisource. 10th ACS/IEEE International Conference on Computer Systems 
and Applications (AICCSA’13), Fes/Ifran, Morocco, May 27-30. 


IX. WARNING 

Note that the Arabic texts may contain quotations from the Qur'an and
the Hadith. Because text insertion is automatic and at random
positions, plagiarized text may have been inserted unintentionally
between Quranic verses or between sentences of a Hadith cited in a
document. Moreover, the inserted passages may alter the meaning of the
original text. For these reasons, this corpus must not be used outside
the purpose for which it was built. Examples of inappropriate use
include using the corpus documents as a source of knowledge or
distributing them without mentioning that they contain borrowed texts.
If you are not interested in plagiarism detection and are keeping the
corpus because it contains books you want to read, then this corpus is
not the right source. Please refer instead to the sources mentioned in
Section VI, where you can find the original content of the books you
are looking for. We emphasize that we are not responsible for the
results of any use of this corpus other than the evaluation of
intrinsic plagiarism detection methods.

X. CONTACT US

We will be happy to hear about your experience using the InAra
corpus. Please do not hesitate to contact us at the following email
address: bens.imene@gmail.com

Imene Bensalem¹, Paolo Rosso², Salim Chikhi¹
¹MISC Lab., Constantine 2 University, Algeria
²NLE Lab. – EliRF, Universitat Politècnica de València, Spain 

Source: readme.txt, updated 2013-07-16