Text Denoising Tool Wiki
Brought to you by:
rushdishams
1. Overview of Text Denoising Tool:
Text Denoising Tool removes the insignificant and/or unwanted
sentences of a text according to their reading difficulty score (which in this case
is the Fog Index).
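As a rough illustration of the scoring step, the sketch below computes the standard Gunning Fog Index, 0.4 × (average sentence length + percentage of words with three or more syllables), for a single sentence. The function names `count_syllables` and `fog_index` are hypothetical, the syllable counter is a crude vowel-group heuristic, and scoring one sentence at a time (sentence count fixed at 1) is an assumption; the tool's own implementation may differ.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting groups of consecutive vowels (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fog_index(sentence: str) -> float:
    """Gunning Fog score of one sentence (assumption: sentence count is 1)."""
    words = re.findall(r"[A-Za-z]+", sentence)
    if not words:
        return 0.0
    # "Complex" words are those with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    percent_complex = 100.0 * len(complex_words) / len(words)
    return 0.4 * (len(words) + percent_complex)
```

Under this sketch, a sentence made of long technical terms scores higher (is harder to read) than a short sentence of one-syllable words.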
Changes made since version 2.1:
First, according to [1], for the task of
biomedical relation mining, the 30% most difficult-to-read sentences are treated as
denoised text and the rest of the text is considered noise text. According to [2], however,
for the task of keyphrase indexing, the 70% most difficult-to-read sentences are treated as
denoised text and the rest of the text is considered noise text. Because the denoising threshold (the percentage of
sentences to keep as the important part of a text) varies across tasks and datasets,
the user can now choose the desired threshold (between 0% and 100%).
Second, the user can now set an option
to remove short sentences (sentences of length less than 5).
Third, the tool can now be run with command-line arguments.
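The thresholding described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_by_threshold` is hypothetical, sentence length is measured in whitespace-separated words, short sentences are dropped before ranking, and the number of sentences to keep is rounded up. None of these details are confirmed by the tool's documentation.

```python
import math


def split_by_threshold(sentences, scores, threshold_pct,
                       remove_short=False, min_words=5):
    """Split sentences into (denoised, noise) by reading-difficulty score.

    The threshold_pct percent of sentences with the highest scores are
    kept as denoised text; the remainder is noise text.
    """
    if not 0 <= threshold_pct <= 100:
        raise ValueError("threshold_pct must be between 0 and 100")
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")

    items = list(zip(sentences, scores))
    if remove_short:
        # Assumption: "length" means word count; short sentences are discarded.
        items = [(s, sc) for s, sc in items if len(s.split()) >= min_words]

    # Assumption: round the number of kept sentences up.
    keep = math.ceil(len(items) * threshold_pct / 100)

    # Rank positions by score, hardest first, then keep original text order.
    ranked = sorted(range(len(items)), key=lambda i: items[i][1], reverse=True)
    kept = set(ranked[:keep])
    denoised = [items[i][0] for i in range(len(items)) if i in kept]
    noise = [items[i][0] for i in range(len(items)) if i not in kept]
    return denoised, noise
```

With a threshold of 30 this mirrors the setting reported in [1], and with 70 the setting reported in [2].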
Note:
To get the best result with the tool, pre-process your text files by
removing various parts of a paper like:
2. Prepare your dataset:
3. Run the Text Denoising Tool:
Legend:
option1: The source directory
option2: The denoised text directory
option3: The noise text directory
option4: Text denoising threshold (from 0 to 100)
option5: Short sentence removal (true for removing sentences of length less than 5, false to run the tool on the full text as it is)
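To make the five options concrete, here is a hypothetical Python sketch of how such arguments could be parsed and validated. It is not the tool's actual command line, which is not shown on this page; the argument names (`source_dir`, `denoised_dir`, `noise_dir`, `threshold`, `remove_short`) are invented for illustration and follow the order of the legend.

```python
import argparse


def parse_threshold(value: str) -> float:
    """Accept a number from 0 to 100 (option4)."""
    number = float(value)
    if not 0 <= number <= 100:
        raise argparse.ArgumentTypeError("threshold must be between 0 and 100")
    return number


def parse_bool(value: str) -> bool:
    """Accept the literal strings 'true' or 'false' (option5)."""
    lowered = value.lower()
    if lowered not in ("true", "false"):
        raise argparse.ArgumentTypeError("expected 'true' or 'false'")
    return lowered == "true"


def build_parser() -> argparse.ArgumentParser:
    """Five positional arguments, in the same order as the legend."""
    parser = argparse.ArgumentParser(description="Hypothetical denoising front end")
    parser.add_argument("source_dir", help="option1: the source directory")
    parser.add_argument("denoised_dir", help="option2: the denoised text directory")
    parser.add_argument("noise_dir", help="option3: the noise text directory")
    parser.add_argument("threshold", type=parse_threshold,
                        help="option4: denoising threshold (0 to 100)")
    parser.add_argument("remove_short", type=parse_bool,
                        help="option5: remove sentences of length less than 5")
    return parser
```

Calling `build_parser().parse_args(["in", "out", "noise", "30", "true"])` would yield a threshold of 30.0 with short sentence removal enabled.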
4. Other Documentation:
If you want to see how Text Denoising works and/or to use the tool for your research, please refer
to the works listed below:
[1] Rushdi Shams and Robert E. Mercer, "Extracting Connected Concepts from Biomedical Texts using Fog Index",
12th Conference of the Pacific Association for Computational Linguistics (PACLING 2011), Kuala Lumpur,
Malaysia, July 19-21, 2011.
[2] Rushdi Shams and Robert E. Mercer, "Investigating Keyphrase Indexing with Text Denoising", 12th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2012), Washington DC, USA, June 10-14, 2012.