BlogTEX: Blog posts extraction for TREC. - Browse Files at SourceForge.net


Name	Modified	Size	Downloads / Week
README.txt	2011-12-15	4.2 kB	0
blogtex.zip	2011-12-15	7.7 MB	1
Totals: 2 Items		7.7 MB	1

========================
README FOR BlogTEX
========================

BlogTEX was used for TREC Blog08, but it should also work with previous TREC Blog versions provided
the same directory hierarchy is maintained. Here are a few things to consider.

*** Unzip the BlogTEX archive file.

1. CONFIGURATION FILE
===================
You must open the configuration file located in the "config" sub-directory of the blogtex directory (blogtex/config).

2. COLLECTION DIRECTORY
===================
This is the directory containing your "raw" TREC Blog dataset. Within the parent directory, you should have sub-directories
for different dates. Each sub-directory contains a number of permalinks GZ archive files, starting from "permalinks-0". Ideally,
each permalinks GZ file contains a number of blog posts. You do not need to go through the directories yourself; BlogTEX
does everything for you.
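
The layout described above can be sketched in a few lines of Python. This is only an illustration of the expected hierarchy, not BlogTEX's own (Java) code. The function name, the "permalinks-*.gz" file pattern and the date directory name used in the demo are assumptions made for the example.

```python
import gzip
import tempfile
from pathlib import Path


def find_permalink_archives(collection_dir):
    """Return every permalinks-*.gz file one level below collection_dir, sorted."""
    root = Path(collection_dir)
    # Assumed pattern: <collection>/<date sub-directory>/permalinks-<n>.gz
    return sorted(root.glob("*/permalinks-*.gz"))


if __name__ == "__main__":
    # Build a tiny throwaway collection so the sketch runs without real data.
    with tempfile.TemporaryDirectory() as tmp:
        day = Path(tmp) / "20080114"  # hypothetical date sub-directory name
        day.mkdir()
        with gzip.open(day / "permalinks-0.gz", "wt") as fh:
            fh.write("placeholder blog post\n")
        for archive in find_permalink_archives(tmp):
            print(archive.parent.name, archive.name)
```

Running a check like this before starting BlogTEX is a quick way to confirm that the collection path in the configuration file points at the parent directory and not at one of the date sub-directories.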

3. FEED ARCHIVES
===============
Please remove all feed archive files from the directories that contain the permalinks. We do not support feeds!

4. ENGLISH DOC DIRECTORY
============
In the configuration file, specify the path to a directory to store your extracted blog docs. BlogTEX creates
the final directory if it does not exist.

5. BASEPATH
===========
In the configuration file, specify the base path to the config directory. By default, this resides within the BlogTEX directory.

6. TEMP OUTPUT
===============
In the configuration file, specify the path to the temp.txt file. By default, this resides within the BlogTEX directory.

7. FUNCTION WORDS PATH
=======================
In the configuration file, specify the path to the function word file. BlogTEX uses English function words by default. If you are
interested in extracting blogs in other languages, you must include the appropriate function word file and specify its path in the
config file accordingly.

8. MINIMUM FUNCTION WORDS
=========================
Depending on your preference, specify the number of English function words that must be present in a blog post before it is selected
as an English document. A higher value gives better selection, but it may also increase the computational cost. The default minimum of 10
function words has shown credible performance.
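
A minimal Python sketch of this kind of function-word threshold is shown below. It is not BlogTEX's implementation. The short word list, the tokenisation and the choice to count every occurrence (rather than distinct words) are assumptions made for illustration; BlogTEX reads its list from the configured function word file.

```python
import re

# A tiny illustrative list; the real list comes from the function word file.
FUNCTION_WORDS = {"the", "of", "and", "a", "to", "in",
                  "is", "that", "it", "for", "with", "on"}


def count_function_words(text, function_words=FUNCTION_WORDS):
    """Count tokens in text that are function words (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in function_words)


def looks_english(text, minimum=10, function_words=FUNCTION_WORDS):
    """Return True when text contains at least `minimum` function words."""
    return count_function_words(text, function_words) >= minimum


if __name__ == "__main__":
    post = ("The cat sat on the mat and it is happy to be in the house "
            "with a friend of the family for the day")
    print(count_function_words(post), looks_english(post))
```

Raising `minimum` makes the filter stricter, so short English posts are more likely to be rejected, while lowering it lets more non-English posts through.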

9. NUMBER OF BLOG DOCUMENTS REQUIRED
=================
In the configuration file, specify the number of blog documents to be extracted.

10. SENTENCE MODEL
====================
The sentence model performs sentence boundary identification. Researchers working at the sentence level may want to turn this component on.
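
To show what sentence boundary identification means in practice, here is a deliberately naive Python sketch. It is not the LingPipe-based model that BlogTEX uses; it simply splits after ".", "!" or "?" when whitespace follows, and the function name is made up for the example.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop the empty string that results from empty input.
    return [part for part in parts if part]


if __name__ == "__main__":
    for sentence in split_sentences("I went home. It was late! Was it?"):
        print(sentence)
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is one reason a trained sentence model is worth enabling for sentence-level work.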

11. OPTIMIZED SENTENCE MODEL
=======================
This identifies sentences better than the ordinary sentence model above. However, it may slow down your PC. See the Operating
System section for details.

12. BREAK LONG SENTENCE
==================
Depending on your research/requirement, you may decide to break long sentences. An average sentence contains around 15-20 words.
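
One simple way to break a long sentence is to cut it into chunks of at most a fixed number of words. The Python sketch below assumes that approach and a default limit of 20 words, taken from the average quoted above. How BlogTEX itself chooses the break points is not stated here, so treat this only as an illustration.

```python
def break_long_sentence(sentence, max_words=20):
    """Split a sentence into consecutive chunks of at most max_words words."""
    words = sentence.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


if __name__ == "__main__":
    long_sentence = " ".join(f"word{i}" for i in range(45))
    for chunk in break_long_sentence(long_sentence):
        print(len(chunk.split()), chunk)
```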


13. SPELLING CORRECTION
======================
Sorry, this component is not yet supported.


14. OPERATING SYSTEM
====================
BlogTEX is written in Java. It was tested on Windows (XP and 7) and Linux (CentOS 5.6). However, for efficiency, you may need to tune your
OS. Also remember to specify directory paths according to your OS. The default paths are for Linux.

Linux:
- The open-files limit must be set to 20,000 or above (search for how to increase the open-files limit on your Linux distro)
- Grant the necessary sudo permissions
- Minimum of 4GB RAM
- Increase JVM memory to 2GB minimum
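
The open-files requirement above can be checked before a run. The Python sketch below uses the standard `resource` module (available on Linux and other Unix systems, not on Windows) to read the current limit. The helper name and the way "unlimited" is handled are choices made for this example.

```python
import resource

REQUIRED_OPEN_FILES = 20_000  # figure taken from the Linux requirements above


def limit_is_sufficient(soft_limit, required=REQUIRED_OPEN_FILES):
    """Return True when the soft open-files limit meets the requirement."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # an unlimited setting always satisfies the requirement
    return soft_limit >= required


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} hard={hard} sufficient={limit_is_sufficient(soft)}")
```

If the check reports that the limit is too low, raise it using the method documented for your distribution before running BlogTEX.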

Windows:
- Minimum of 4GB RAM
- Increase JVM memory to 2GB minimum

15. LICENCE
=============
BlogTEX is published under the GPL licence without any warranty whatsoever. It is to be used freely for research/academic purposes only.
Please see the additional licence statement for the (LingPipe) sentence model used.


16. REFERENCE
============
If you were directed here by any of our research papers, please encourage us by citing the paper accordingly.


17. CONTACT
============
sylvester.orimaye@monash.edu

OR

soori1@monash.edu

Source: README.txt, updated 2011-12-15
