File                     Date        Author    Commit
FeedAggregator           2008-11-30  mluntzel  [r17] testing munin plugin capability
FeedCheck                2008-11-29  mluntzel  [r13] 1. integrated Rome FeedFetcher into Feedcheck
FeedScraper              2008-11-29  mluntzel  [r15] still not ignored. poop.
admin                    2008-11-27  root      [r1] initial import
cron                     2008-11-29  mluntzel  [r13] 1. integrated Rome FeedFetcher into Feedcheck
images                   2008-11-27  root      [r1] initial import
lib                      2008-11-29  mluntzel  [r13] 1. integrated Rome FeedFetcher into Feedcheck
src                      2008-12-01  mluntzel  [r20] comments
.classpath               2008-12-01  mluntzel  [r20] comments
.project                 2008-12-01  mluntzel  [r20] comments
README.txt               2008-11-29  mluntzel  [r8] modified the instructions
addTerm.jsp              2008-11-27  root      [r1] initial import
config.jsp               2008-11-27  root      [r1] initial import
deleteTerm.jsp           2008-11-27  root      [r1] initial import
editTerm.jsp             2008-11-27  root      [r1] initial import
env.sh                   2008-11-29  mluntzel  [r13] 1. integrated Rome FeedFetcher into Feedcheck
feedfilter-04-28-08.sql  2008-11-27  root      [r1] initial import
index.html               2008-11-27  root      [r1] initial import
login_error.jsp          2008-11-27  root      [r1] initial import
logout.jsp               2008-11-27  root      [r1] initial import
munin.jsp                2008-12-01  mluntzel  [r20] comments
my_feeds.jsp             2008-11-29  mluntzel  [r16] merged changes from my_test_feeds to my_feeds
my_test_feeds.jsp        2008-11-29  mluntzel  [r4] some fixes, testing
mysql_clean.sql          2008-11-27  root      [r1] initial import
output.xml               2008-11-27  root      [r1] initial import
security_check.jsp       2008-11-27  root      [r1] initial import
session_timeout.html     2008-11-27  root      [r1] initial import
settings.jsp             2008-11-27  root      [r1] initial import

Read Me

INSTRUCTIONS FOR INSTALLATION

Make sure the following are installed:

x. Tomcat 5.x
x. Apache 2.x
x. MySQL 5.x

On Debian, find this entry in /etc/default/tomcat5.5 and set it to 'no':

TOMCAT5_SECURITY=no

This turns off the Java security manager for Tomcat, which allows the webapp to talk to MySQL. Without it the web application cannot reach the database.
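
For a scripted install, the edit above could be sketched as a small shell function. This is only an illustration: the name set_tomcat_security is made up here and is not part of the archive, and the function assumes the file uses plain KEY=value lines. Run it as root against /etc/default/tomcat5.5 and restart Tomcat afterwards.

```shell
#!/bin/sh
# Sketch: force TOMCAT5_SECURITY=no in a Debian-style defaults file.
# set_tomcat_security is a hypothetical helper name, not part of the archive.
set_tomcat_security() {
    conf_file=$1
    tmp_file=$(mktemp) || return 1
    if grep -q '^TOMCAT5_SECURITY=' "$conf_file"; then
        # An active entry exists: rewrite its value.
        sed 's/^TOMCAT5_SECURITY=.*/TOMCAT5_SECURITY=no/' "$conf_file" > "$tmp_file"
    else
        # No active entry found: keep the file as it is and append one.
        cat "$conf_file" > "$tmp_file"
        echo 'TOMCAT5_SECURITY=no' >> "$tmp_file"
    fi
    # Copy back instead of mv so the original owner and mode are kept.
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}

# Example:
#   set_tomcat_security /etc/default/tomcat5.5
```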

Next, the top level of the archive contains a SQL script with the database structure. It is called feedfilter-04-28-08.sql, and is simply a structure dump of the running database. Once the feedfilter database has been created from it, you'll have to do the following two things:

While in a mysql shell:

mysql> grant all on feedfilter.* to feeder@localhost identified by 'feeder';

(This account is used internally by the helper programs.)

mysql> use feedfilter;
mysql> insert into user (id, fname, lname, username, password) values (1, 'admin', 'admin', 'admin', MD5('YOUR_PASSWORD'));

where YOUR_PASSWORD is the password you desire for the administrative account on the web application. This is all you need to do inside of mysql.
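
The same two statements can also be generated from the command line and piped into mysql. The following is a minimal sketch, not something shipped in the archive: emit_setup_sql is a made-up helper name, and the commented usage assumes a MySQL root account that is allowed to create databases and grant privileges.

```shell
#!/bin/sh
# Sketch: print the two setup statements so they can be piped into mysql.
# emit_setup_sql is a hypothetical helper name, not part of the archive.
emit_setup_sql() {
    # Escape backslashes and single quotes so the password stays a valid
    # MySQL string literal under the default SQL mode.
    escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/''/g")
    cat <<EOF
GRANT ALL ON feedfilter.* TO feeder@localhost IDENTIFIED BY 'feeder';
USE feedfilter;
INSERT INTO user (id, fname, lname, username, password)
    VALUES (1, 'admin', 'admin', 'admin', MD5('$escaped'));
EOF
}

# Example (assumes a MySQL root account; adjust to your setup):
#   mysqladmin -u root -p create feedfilter
#   mysql -u root -p feedfilter < feedfilter-04-28-08.sql
#   emit_setup_sql 'YOUR_PASSWORD' | mysql -u root -p
```

Keep in mind that a password passed as a command-line argument ends up in your shell history.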

There is a directory, cron, in the top level of the archive. It contains example scripts to run the FeedScraper. This program is the heart of the application. The script "scrapeRunner.sh" sets some environment variables (CLASSPATH) and then runs the FeedScraper application. FeedScraper polls the feeds listed in the database, inserts new entries, and handles duplicates. You will have to customize the script for your environment, as the directory paths to the Java libraries reference an absolute path of my own. The libraries you'll need are contained in the "lib" directory.
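
One way to avoid hard-coded library paths when customizing the runner is to build CLASSPATH from whatever jars are in the lib directory. The following is a sketch of that idea, not the contents of the shipped scrapeRunner.sh: build_classpath is a made-up helper name, and the paths and main class in the commented example are placeholders.

```shell
#!/bin/sh
# Sketch: join every jar in a directory into a colon-separated classpath.
# build_classpath is a hypothetical helper name, not part of the archive.
build_classpath() {
    lib_dir=$1
    cp=
    for jar in "$lib_dir"/*.jar; do
        # Skip the unexpanded pattern when the directory holds no jars.
        [ -e "$jar" ] || continue
        # Add a ':' separator only when cp already has an entry.
        cp="${cp:+$cp:}$jar"
    done
    printf '%s\n' "$cp"
}

# Example (paths and class name are placeholders, adjust to your setup):
#   CLASSPATH="$(build_classpath /path/to/archive/lib):/path/to/FeedScraper/classes"
#   export CLASSPATH
#   java YourFeedScraperMainClass
```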

Inside the cron directory is also a file named "crontab", which is an example of how often to run scrapeRunner.sh. I currently have it set to run every 10 minutes.
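
For reference, a 10-minute schedule in crontab syntax looks like the line printed by the sketch below. The cron_entry helper is made up for illustration and the script path is a placeholder; the "crontab" file in the archive remains the actual example.

```shell
#!/bin/sh
# Sketch: print a crontab line that runs a script every N minutes.
# cron_entry is a hypothetical helper name, not part of the archive.
cron_entry() {
    minutes=$1
    script=$2
    # Fields: minute hour day-of-month month day-of-week command
    printf '*/%s * * * * %s\n' "$minutes" "$script"
}

# Example: append the entry to the current user's crontab.
#   (crontab -l 2>/dev/null; cron_entry 10 /path/to/cron/scrapeRunner.sh) | crontab -
```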

The aggRunner.sh script belongs to FeedAggregator, another program I developed after I left the CoolState project. It takes advantage of the ROME library to create customized RSS feeds: it pushes out an XML file containing the search results for each user's search parameters. The scraper does not depend on it in any fashion, so feel free to use it or ignore it.

So here is a quick explanation of what is in the archive:

*.jsp: files for the web application (obviously)

admin/*.jsp: files for administrative web interface

env.sh: shell script to add environment variables to enable running the applications from the command line

FedoraSub: unfinished application that would have inserted search results into a Fedora repository.

FeedAggregator: As mentioned above, an RSS feed generator

FeedCheck: A quick little program I cooked up to check the validity of any RSS feeds that may be causing exceptions in the main applications. I was planning on expanding this to check a feed when the admin adds its URL, but never got around to it.

feedfilter-04-28-08.sql: As mentioned above, a SQL dump of the database structure

FeedScraper: heart of the feedfilter program

I also worked on a little mechanism to limit the size of the database, which previously grew unchecked. The script below keeps only the rows of the feeddata table whose storyid is within 100,000 of the highest storyid, which limits the table to about 100,000 records. This number is fairly arbitrary: as more feeds are added, the number of records worth keeping will naturally increase. I would rather have done a date check, but I never got around to converting the date field in the feeddata table to something more useful than "Mar 06 08 14:32:44 PDT 2008" (just a varchar).

mysql_clean.sql:

use feedfilter;
CREATE TABLE tmp LIKE feeddata;
INSERT INTO tmp SELECT DISTINCT * FROM feeddata WHERE storyid > ((SELECT MAX(storyid) FROM feeddata) - 100000);
DROP TABLE feeddata;
RENAME TABLE tmp TO feeddata;


I run this every week or two. How often to run it depends primarily on how fast the database grows and how much data we want to keep.
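
Since the 100,000 figure will probably need to change as feeds are added, one option is to generate the cleanup statements with the limit as a parameter. This is a sketch under that assumption: emit_clean_sql is a made-up helper name, not part of the archive, and the commented usage relies on the feeder account created during installation.

```shell
#!/bin/sh
# Sketch: print the cleanup statements with a configurable record limit.
# emit_clean_sql is a hypothetical helper name, not part of the archive.
emit_clean_sql() {
    limit=${1:-100000}
    # Refuse anything that is not a plain positive integer, so a typo
    # cannot end up emptying the table.
    case $limit in
        *[!0-9]*) echo "limit must be a positive integer" >&2; return 1 ;;
    esac
    [ "$limit" -gt 0 ] || { echo "limit must be a positive integer" >&2; return 1; }
    cat <<EOF
USE feedfilter;
CREATE TABLE tmp LIKE feeddata;
INSERT INTO tmp SELECT DISTINCT * FROM feeddata WHERE storyid > ((SELECT MAX(storyid) FROM feeddata) - $limit);
DROP TABLE feeddata;
RENAME TABLE tmp TO feeddata;
EOF
}

# Example: keep the newest 250,000 storyid values.
#   emit_clean_sql 250000 | mysql -u feeder -pfeeder
```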

That should do it. Let me know if I've missed anything or if anything needs further clarification.