TweetSieve Code

Status: Beta

Brought to you by: syssoev

File	Date	Author	Commit
demo	2009-09-14	syssoev	[r1] initial revision
example	2009-09-14	syssoev	[r1] initial revision
lib	2009-09-14	syssoev	[r1] initial revision
src	2009-09-14	syssoev	[r1] initial revision
build.xml	2009-09-14	syssoev	[r1] initial revision
readme.txt	2009-09-14	syssoev	[r1] initial revision

Read Me

Starting the TweetSieve demo requires the following steps:

1. Create index
First, create a full-text index from the messages (tweets) you want to analyse.

a. Initialize index
This step creates a Lucene index (a folder containing the index files). Run the following command:

java -Xmx1000m -cp "lib/lucene-core-2.4.1.jar;lib/htmlparser.jar;dist/lib/texterra.jar" ru.ispras.texterra.demo.twitter.index.Indexer create LUCENE_INDEX INIT_FILE

Be careful when specifying Java's classpath: the entries above are separated with ";", which is the Windows convention; on Linux and macOS use ":" instead. The -Xmx1000m option gives Java more memory, which is sometimes necessary.
LUCENE_INDEX is the name of the folder where the index files will be stored. The folder does not have to exist before the program runs.
INIT_FILE is a file with tweets that is used to initialize the index. Suppose you have already parsed your tweets and stored them in a
database. It is then easy to extract them from this database and write them to a file. INIT_FILE is a plain text file with two or three tab-separated
fields. The first field is the author (optional), the second is the tweet text, and the third is the tweet date. An example file can be found in example/tweet_from_db.txt
If you have no parsed tweets, you can provide an empty file as INIT_FILE.
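
To make the INIT_FILE layout concrete, here is a minimal Java sketch of how one line could be split into fields. It is not the code used by the Indexer. The class name InitFileLine is hypothetical, and the sketch assumes that a two-field line contains the tweet text and the date (author omitted) and that each tweet occupies one line.

```java
// Hypothetical helper, not part of TweetSieve: models one line of an INIT_FILE.
final class InitFileLine {
    final String author; // null when the optional author field is absent
    final String text;   // tweet text
    final String date;   // kept as a raw string; the date format is not specified in this readme

    InitFileLine(String author, String text, String date) {
        this.author = author;
        this.text = text;
        this.date = date;
    }

    // Splits a line on tab characters.
    // Assumption: 3 fields = author, text, date; 2 fields = text, date.
    static InitFileLine parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length == 3) {
            return new InitFileLine(fields[0], fields[1], fields[2]);
        }
        if (fields.length == 2) {
            return new InitFileLine(null, fields[0], fields[1]);
        }
        throw new IllegalArgumentException(
                "expected 2 or 3 tab-separated fields, got " + fields.length);
    }
}
```

Check example/tweet_from_db.txt for the actual field contents before relying on this layout.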

b. Append data to index
You can add tweets downloaded from Twitter to the created index. This can be done with

java -Xmx1000m -cp "lib/lucene-core-2.4.1.jar;lib/htmlparser.jar;dist/lib/texterra.jar" ru.ispras.texterra.demo.twitter.index.Indexer create LUCENE_INDEX APPEND_FILE_OR_DIR

LUCENE_INDEX is the name of the folder with the Lucene index files.
APPEND_FILE_OR_DIR is a file, or a directory of files, downloaded from Twitter using its Streaming API. If it is a directory, then all files in
its subfolders are also appended to the Lucene index.
An example of such a file can be found in example/tweet_raw.txt
Such a file can be downloaded with

curl http://stream.twitter.com/1/statuses/sample.xml\?delimited=length -uUSERNAME:PASSWORD > tweet_raw.txt

When downloading such data from Twitter, the "delimited" parameter must be specified.
USERNAME and PASSWORD are the login and password of your Twitter account.
Download the Twitter data in XML format (not JSON).

You can read more about downloading Twitter data at http://apiwiki.twitter.com/Streaming-API-Documentation
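
To illustrate what the "delimited=length" parameter implies for a reader of the downloaded file, here is a minimal Java sketch. It is not the parser used by the Indexer. The class name DelimitedReader is hypothetical, and the sketch assumes that every record is preceded by a line holding its length in bytes as a decimal number, and that blank lines between records can be ignored.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TweetSieve: reads length-delimited records.
final class DelimitedReader {

    // Assumed layout: "<byte count>\n" followed by exactly that many bytes, repeated.
    static List<String> readAll(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> records = new ArrayList<String>();
        while (true) {
            String header = readLine(data);
            if (header == null) {
                break;                      // end of stream
            }
            header = header.trim();
            if (header.isEmpty()) {
                continue;                   // skip blank lines between records
            }
            byte[] body = new byte[Integer.parseInt(header)];
            data.readFully(body);           // throws EOFException if the record is truncated
            records.add(new String(body, StandardCharsets.UTF_8));
        }
        return records;
    }

    // Reads bytes up to the next '\n'; returns null when the stream is exhausted.
    private static String readLine(InputStream in) throws IOException {
        StringBuilder line = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return line.toString();
            }
            line.append((char) b);
        }
        return line.length() == 0 ? null : line.toString();
    }
}
```

Counting bytes rather than characters matters here, because tweet text may contain multi-byte UTF-8 characters.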

c*. Index speedup
A Lucene index is rather slow for this application, so we implemented some ad hoc speedups. They may be a bit difficult to use at first,
so it is recommended to skip this step while getting acquainted with the application.

This optional step generates additional files for the Lucene index, which are then used at step 2. To create these files, run

java -Xmx4000m -cp "lib/lucene-core-2.4.1.jar;lib/htmlparser.jar;dist/lib/texterra.jar" ru.ispras.texterra.demo.twitter.sigir.OfflineWordIds LUCENE_INDEX

This step requires considerably more memory for Java (note the -Xmx4000m option above).

2*. Start ad hoc speedup service
Perform this step only if you have done step 1.c.

Architecture
A Lucene index is rather slow when reading large amounts of data from the hard drive. To cope with this problem, we try to keep this data in main memory.
One machine (the service machine) stores the data and is called by the main machine, which runs the web application itself.

The following should be done only on the service machine:

a. Create a settings.dat file and add a mapping from the machine name to the path to this file.
An example settings.dat is in example/. The format of the file is described inside it. The file may contain the following fields:

index_dir - path to the Lucene index on the service machine
rmi_port - port on which the RMI service listens on the service machine
server_port - port on which the service listens on the service machine
server_machine - the machine to connect to in order to access the service. It may be the service machine itself, or another machine if you need to use port forwarding.

You must modify the source code so that the application knows where you have put your settings.dat file. To do this, add a mapping from the machine name to the
path to settings.dat in ru.ispras.texterra.demo.twitter.sigir.SettingsManager. Most likely, you just need to put

machine2settingsFile.put("your_machine_name", "full_path_to_settings.dat");

into the static initialization block.
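
To show where such a line sits, here is a minimal Java sketch of a class with a static initialization block and a lookup by machine name. It is not the real SettingsManager source. The class name SettingsManagerSketch and the methods settingsFileFor and localMachineName are hypothetical, and resolving the machine name through InetAddress is only an assumption about how the name might be obtained.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ru.ispras.texterra.demo.twitter.sigir.SettingsManager.
final class SettingsManagerSketch {
    // Maps a machine name to the full path of its settings.dat file.
    private static final Map<String, String> machine2settingsFile =
            new HashMap<String, String>();

    // Static initialization block: runs once, when the class is loaded.
    static {
        machine2settingsFile.put("your_machine_name", "full_path_to_settings.dat");
    }

    // Returns the registered settings.dat path or fails with a clear message.
    static String settingsFileFor(String machineName) {
        String path = machine2settingsFile.get(machineName);
        if (path == null) {
            throw new IllegalStateException(
                    "no settings.dat registered for machine " + machineName);
        }
        return path;
    }

    // Assumption: the machine name is the local host name reported by the JVM.
    static String localMachineName() throws UnknownHostException {
        return InetAddress.getLocalHost().getHostName();
    }
}
```

If the lookup fails at startup, compare the name printed by localMachineName() with the key used in the mapping.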

b. Start the RMI registry on the port specified in your settings.dat file, then start the service itself.
An example script for this is example/service.sh.

3. Deploying web application
a. Tune settings.dat in the same way as at step 2.a, this time on the web application machine (index_dir is then the path to the index on the web application machine).
b. Deploy the web application.
Copy the contents of the ./demo/ folder into APACHE-TOMCAT/webapps/SOMENAME (this folder should also contain META-INF and WEB-INF; see the Apache Tomcat manual) and start Apache Tomcat.
Your web application should then be available at http://your_machine_name/SOMENAME/
