File Date Author Commit
 demo 2009-09-14 syssoev [r1] initial revision
 example 2009-09-14 syssoev [r1] initial revision
 lib 2009-09-14 syssoev [r1] initial revision
 src 2009-09-14 syssoev [r1] initial revision
 build.xml 2009-09-14 syssoev [r1] initial revision
 readme.txt 2009-09-14 syssoev [r1] initial revision

Read Me

Starting up the TweetSieve demo requires the following steps:

1. Create index
  First of all, you should create a full-text index from the messages (tweets) you want to analyse.

  a. Initialize index
     At this step you create a Lucene index (a folder with some files). You can do it with the following command:

        java -Xmx1000m -cp "lib/lucene-core-2.4.1.jar;lib/htmlparser.jar;dist/lib/texterra.jar" ru.ispras.texterra.demo.twitter.index.Indexer create LUCENE_INDEX INIT_FILE
     
     Be careful when specifying Java's classpath: the entries above are separated with ';', as on Windows; on Unix-like systems the separator is ':'.
     Sometimes it is necessary to give Java more memory (with -Xmx1000m).
     LUCENE_INDEX is the name of the folder where the index files will be stored. It is not necessary to create the folder before running the program.
     INIT_FILE is a file with tweets used to initialize your index. Suppose you have already parsed your tweets and you store them in some
     database. Then it is rather easy to extract them from this DB and put them into a file. INIT_FILE is a simple text file. It may have two or three tab-separated
     fields. The first field is the author (optional), the second is the tweet text, the third is the tweet date. An example file can be found in example/tweet_from_db.txt
     If you have no parsed tweets, you can provide an empty file as INIT_FILE.
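
     The INIT_FILE layout described above can be sketched in Java as follows. This is only an illustration: the class InitFileLines, its methods,
     the output file name and the UTF-8 encoding are assumptions and are not part of TweetSieve. It also assumes one tweet per line, and it leaves the
     date as an opaque string, because the expected date format is not stated here (check example/tweet_from_db.txt for it).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TweetSieve: builds tab-separated INIT_FILE lines.
final class InitFileLines {

    // Replaces tabs and line breaks inside a field with a single space,
    // so that a field cannot break the tab-separated, one-tweet-per-line layout.
    static String clean(String field) {
        return field.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    // Three-field line: author, tweet text, tweet date.
    static String line(String author, String text, String date) {
        return clean(author) + "\t" + clean(text) + "\t" + clean(date);
    }

    // Two-field line: the optional author is omitted.
    static String line(String text, String date) {
        return clean(text) + "\t" + clean(date);
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        // Sample values only; the date format is an assumption.
        lines.add(line("some_author", "first tweet text", "2009-09-14"));
        lines.add(line("tweet without an author", "2009-09-14"));
        // "init_file.txt" is a placeholder name; UTF-8 is assumed.
        Files.write(Paths.get("init_file.txt"), lines, StandardCharsets.UTF_8);
    }
}
```

     The resulting file could then be passed as INIT_FILE to the Indexer command shown above.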

  b. Append data to index
     You can add tweets, downloaded from Twitter, to the created index. This may be done with

        java -Xmx1000m -cp "lib/lucene-core-2.4.1.jar;lib/htmlparser.jar;dist/lib/texterra.jar" ru.ispras.texterra.demo.twitter.index.Indexer create LUCENE_INDEX APPEND_FILE_OR_DIR

     LUCENE_INDEX is the name of the folder with the Lucene index files.
     APPEND_FILE_OR_DIR is a file, or a directory with files, downloaded from Twitter using its Streaming API. If it is a directory, then all files from
     its subfolders are also appended to the Lucene index.
     An example of such a file can be found in example/tweet_raw.txt
     Such a file can be downloaded with

        curl http://stream.twitter.com/1/statuses/sample.xml\?delimited=length -uUSERNAME:PASSWORD > tweet_raw.txt

     When downloading such data from Twitter the "delimited" parameter should be specified.
     USERNAME and PASSWORD are login and password for your Twitter account.
     Download Twitter data in XML format (not JSON).

     You can read more about downloading Twitter data at http://apiwiki.twitter.com/Streaming-API-Documentation
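
     To illustrate what the "delimited" parameter changes: with delimited=length, each status in the stream is expected to be preceded by a line holding
     its length in bytes. The sketch below reads such a stream under that assumption. The class DelimitedStreamSketch and its method are hypothetical
     and are not part of TweetSieve; the Indexer's own parser may work differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for a length-delimited stream, not part of TweetSieve.
final class DelimitedStreamSketch {

    // Reads records of the form: "<byte count>" newline, then exactly that many bytes.
    // Blank lines between records are skipped; a truncated final record is dropped.
    static List<String> readRecords(InputStream in) {
        List<String> records = new ArrayList<>();
        try {
            while (true) {
                String header = readLine(in);
                if (header == null) {
                    break;                      // end of stream
                }
                header = header.trim();         // also removes a trailing '\r'
                if (header.isEmpty()) {
                    continue;                   // blank line between records
                }
                int length = Integer.parseInt(header);
                byte[] body = new byte[length];
                int offset = 0;
                while (offset < length) {
                    int n = in.read(body, offset, length - offset);
                    if (n < 0) {
                        return records;         // stream ended inside a record
                    }
                    offset += n;
                }
                records.add(new String(body, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    // Reads bytes up to the next '\n'; returns null if the stream is already at its end.
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return sb.toString();
            }
            sb.append((char) b);                // header lines are plain ASCII digits
        }
        return sb.length() == 0 ? null : sb.toString();
    }
}
```

     Without the length prefix a reader would have to guess where one status ends and the next begins, which is presumably why the parameter is required here.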


  c*. Index speedup
     The Lucene index is rather slow for this application, so we implemented some ad hoc speedups. They may be a bit difficult to use at first,
     so it is recommended to skip this step when getting acquainted with the application.

     At this step we generate additional files for the Lucene index, which are then used at step 2 (an optional step). To create these files, run
     
       java -Xmx4000m -cp "lib/lucene-core-2.4.1.jar;lib/htmlparser.jar;dist/lib/texterra.jar" ru.ispras.texterra.demo.twitter.sigir.OfflineWordIds LUCENE_INDEX
    
     This step requires considerably more memory for Java (hence -Xmx4000m).


2*. Start ad hoc speedup service
     This step should be executed only if you have done 1.c.

    Architecture
      The Lucene index is rather slow when reading lots of data from the hard drive. To cope with this problem we try to keep this data in main memory.
      One machine (the service machine) stores the data and is invoked by the main machine, which runs the web application itself.

  This should be done only on the service machine:

  a. Create the file settings.dat and add a mapping from the machine name to the path to this file.
     An example of settings.dat is in example/. The format of this file is described inside it. This file may contain the following fields:

        index_dir - path to the Lucene index on the service machine
        rmi_port  - port on which the RMI service listens on the service machine
        server_port - port on which the service listens on the service machine
        server_machine - the machine to connect to in order to access the service. Usually this is the service machine itself, but it may be another machine if you need to use port forwarding.

     You should modify the source code so that the application knows where you have put your settings.dat file. To do this, add a mapping from the machine name to the
     path to settings.dat in ru.ispras.texterra.demo.twitter.sigir.SettingsManager. Most likely you just need to put

         machine2settingsFile.put("your_machine_name", "full_path_to_settings.dat");
  
     into the static initialization block.
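
     For readers unfamiliar with static initialization blocks, the following is a minimal sketch of what such a mapping might look like. It is not the
     actual SettingsManager source: the class SettingsManagerSketch and its methods are hypothetical, and resolving the machine name through
     InetAddress is an assumption about how the application identifies the machine.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for SettingsManager; the real class may look different.
final class SettingsManagerSketch {

    private static final Map<String, String> machine2settingsFile = new HashMap<>();

    static {
        // Placeholder values from the readme; replace with your machine name and path.
        machine2settingsFile.put("your_machine_name", "full_path_to_settings.dat");
    }

    // Returns the settings.dat path registered for the given machine name.
    static String settingsFileFor(String machineName) {
        String path = machine2settingsFile.get(machineName);
        if (path == null) {
            throw new IllegalStateException("No settings.dat registered for machine: " + machineName);
        }
        return path;
    }

    // Assumption: the machine name is the local host name as Java reports it.
    static String localMachineName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            throw new IllegalStateException("Cannot resolve the local host name", e);
        }
    }
}
```

     If the lookup fails on your machine, printing the value of localMachineName() may help you find the name to register.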

  b. Start the RMI registry service on the port specified in your settings.dat file. Then start the service itself.
     An example script that does this is example/service.sh.


3. Deploying web application
  a. Tune settings.dat in the same way as at step 2.a, this time on the web application machine (index_dir is then the path to the index on the web application machine).
  b. Deploying web application.
      Copy the contents of the ./demo/ folder into APACHE-TOMCAT/webapps/SOMENAME and start Apache Tomcat. This folder should also contain META-INF and WEB-INF; see the Apache Tomcat manual for details.
      Your web application should then be available at http://your_machine_name/SOMENAME/