FeedFilter Code
Status: Alpha
Brought to you by: mluntzel
            INSTRUCTIONS FOR INSTALLATION
Make sure the following are installed:
1. Tomcat 5.x
2. Apache 2.x
3. MySQL 5.x
On Debian, find this entry in /etc/default/tomcat5.5 and set it to 'no':
TOMCAT5_SECURITY=no
This disables the Tomcat security manager, which would otherwise block the webapp from talking to MySQL.
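If you prefer to script that change, a sed one-liner does it. The sketch below runs against a throwaway copy of the file so it is safe to try anywhere; point it at /etc/default/tomcat5.5 (as root) for the real edit.

```shell
# Demonstrated on a scratch copy; substitute CONF=/etc/default/tomcat5.5
# (and run as root) to edit the real file.
CONF=$(mktemp)
echo 'TOMCAT5_SECURITY=yes' > "$CONF"

# Rewrite the TOMCAT5_SECURITY line in place, whatever its current value.
sed -i 's/^TOMCAT5_SECURITY=.*/TOMCAT5_SECURITY=no/' "$CONF"
cat "$CONF"
```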
Next, set up the database. The SQL script feedfilter-04-28-08.sql in the top level is simply a structure dump of the running database. Once the feedfilter database is created, you'll have to grant access to it in the following two ways:
while in a mysql shell:
mysql> grant all on feedfilter.* to feeder@localhost identified by 'feeder';
(this is used internally by the helper programs)
mysql> insert into user (id, fname, lname, username, password) values (1, 'admin', 'admin', 'admin', MD5('YOUR_PASSWORD'));
where YOUR_PASSWORD is the password you desire for the administrative account on the web application. This is all you need to do inside of mysql.
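Put together, the whole bootstrap inside a mysql shell might look like the sketch below. The CREATE DATABASE name matches the grant above; SOURCE assumes you started mysql from the directory containing the dump.

```sql
CREATE DATABASE feedfilter;
USE feedfilter;
SOURCE feedfilter-04-28-08.sql;
GRANT ALL ON feedfilter.* TO feeder@localhost IDENTIFIED BY 'feeder';
INSERT INTO user (id, fname, lname, username, password)
  VALUES (1, 'admin', 'admin', 'admin', MD5('YOUR_PASSWORD'));
```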
There is a directory, cron, in the top level of the archive. It contains example scripts to run the FeedScraper, the program at the heart of the application. The script "scrapeRunner.sh" sets some environment variables (CLASSPATH) and then runs the FeedScraper application, which polls the database for the feeds to be scraped, inserts entries, and handles duplicates. You will have to customize this for your environment, as the directory paths to the Java libraries reference an absolute path of my own. The libraries you'll need are contained in the "lib" directory.
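For reference, here is a hypothetical reconstruction of the pattern scrapeRunner.sh follows: build a CLASSPATH from every jar in lib and launch FeedScraper. The real script ships in cron/ with my own absolute paths; this demo uses a scratch directory with dummy jars so the loop can be seen working.

```shell
# Hypothetical sketch of the scrapeRunner.sh pattern. A scratch directory
# with dummy jars stands in for your actual install location.
APP_HOME=$(mktemp -d)
mkdir "$APP_HOME/lib"
touch "$APP_HOME/lib/rome.jar" "$APP_HOME/lib/mysql-connector.jar"

# Start from the compiled classes, then append every jar in lib/.
CLASSPATH="$APP_HOME/classes"
for jar in "$APP_HOME"/lib/*.jar; do
    CLASSPATH="$CLASSPATH:$jar"
done
export CLASSPATH
echo "$CLASSPATH"

# java FeedScraper    # the real script launches the scraper here
```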
The cron directory also contains a file named "crontab", an example schedule for the scrapeRunner; I currently have it set to run every 10 minutes.
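That 10-minute schedule would look something like the entry below (the install path is my placeholder; the shipped example is in cron/crontab):

```
# m h dom mon dow command -- every 10 minutes, matching the setting above.
# /opt/feedfilter is a placeholder install path; adjust for your system.
*/10 * * * * /opt/feedfilter/cron/scrapeRunner.sh >> /var/log/feedfilter-scrape.log 2>&1
```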
The aggRunner.sh script is part of another program I developed after I left the CoolState project, named FeedAggregator. It takes advantage of the ROME library to create customized RSS feeds: it pushes out an XML file containing the search results from each individual user's search parameters. The scraper does not depend on it in any fashion, so feel free to use it or ignore it.
So here is a quick explanation of what is in the archive:
*.jsp: files for the web application (obviously)
admin/*.jsp: files for administrative web interface
env.sh: shell script to add environment variables to enable running the applications from the command line
FedoraSub: unfinished application that would have inserted search results into a Fedora repository.
FeedAggregator: As mentioned above, an RSS feed generator
FeedCheck: A quick little program I cooked up to check the validity of any rss feeds that may be causing exceptions in the main applications. I was planning on expanding this to check the feed when the admin added a url, but never got around to it.
feedfilter-04-28-08.sql: As mentioned above, a SQL dump of the database structure
FeedScraper: heart of the feedfilter program
I also worked on a little mechanism to limit the size of the database, which previously grew unchecked. I created this little script to cap the records at 100,000. The number is somewhat arbitrary; as more feeds are added, the number of records worth keeping would naturally increase. I would rather have done a date check, but I never got around to converting the date field in the feeddata table to something more useful than "Mar 06 08 14:32:44 PDT 2008" (it's just a varchar).
mysql_clean.sql:
use feedfilter;
CREATE TABLE tmp LIKE feeddata;
INSERT INTO tmp SELECT DISTINCT * FROM feeddata WHERE storyid > ((SELECT MAX(storyid) FROM feeddata) - 100000);
DROP TABLE feeddata;
RENAME TABLE tmp TO feeddata;
I run this every week or two; the right interval depends primarily on how fast the database grows and how much data we want to keep.
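For what it's worth, those varchar dates can be coaxed into something sortable without touching the Java side. A sketch using GNU date, picking the month, day, time, and trailing year out of the stored string (the third token's meaning is ambiguous in the stored format, so it's skipped, as is the zone abbreviation):

```shell
# The feeddata date field holds strings like the one below (just a varchar).
raw="Mar 06 08 14:32:44 PDT 2008"

# Split on whitespace: $1=month $2=day $3=?? $4=time $5=zone $6=year.
# GNU date can't parse this layout whole, so rebuild a parseable string,
# skipping the zone abbreviation and the ambiguous third token.
set -- $raw
iso=$(date -u -d "$1 $2 $6 $4" +"%Y-%m-%d %H:%M:%S")
echo "$iso"    # 2008-03-06 14:32:44
```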
That should do it, let me know if I've missed anything or if anything needs further clarification.