FeedFilter Code
Status: Alpha
Brought to you by: mluntzel
            INSTRUCTIONS FOR INSTALLATION
Make sure the following are installed:
1. Tomcat 5.x
2. Apache 2.x
3. MySQL 5.x
On Debian, find this entry in /etc/default/tomcat5.5 and set it to 'no':
TOMCAT5_SECURITY=no
This disables the Tomcat security manager, which would otherwise block the webapp from talking to MySQL.
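If you prefer to script that change, a sed one-liner does it. The sketch below runs against a throwaway copy of the file so it is safe to try anywhere; point it at /etc/default/tomcat5.5 (as root) for the real edit.

```shell
# Demonstrated on a scratch copy; substitute CONF=/etc/default/tomcat5.5
# (and run as root) to edit the real file.
CONF=$(mktemp)
echo 'TOMCAT5_SECURITY=yes' > "$CONF"

# Rewrite the TOMCAT5_SECURITY line in place, whatever its current value.
sed -i 's/^TOMCAT5_SECURITY=.*/TOMCAT5_SECURITY=no/' "$CONF"
cat "$CONF"
```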
Next, set up the database. The SQL script feedfilter-04-28-08.sql in the top level is simply a structure dump of the running database. Once the feedfilter database is created, you'll have to grant access to it in the following two ways:
while in a mysql shell:
mysql> grant all on feedfilter.* to feeder@localhost identified by 'feeder';
(this is used internally by the helper programs)
mysql> insert into user (id, fname, lname, username, password) values (1, 'admin', 'admin', 'admin', MD5('YOUR_PASSWORD'));
where YOUR_PASSWORD is the password you desire for the administrative account on the web application. This is all you need to do inside of mysql.
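Put together, the whole bootstrap inside a mysql shell might look like the sketch below. The CREATE DATABASE name matches the grant above; SOURCE assumes you started mysql from the directory containing the dump.

```sql
CREATE DATABASE feedfilter;
USE feedfilter;
SOURCE feedfilter-04-28-08.sql;
GRANT ALL ON feedfilter.* TO feeder@localhost IDENTIFIED BY 'feeder';
INSERT INTO user (id, fname, lname, username, password)
  VALUES (1, 'admin', 'admin', 'admin', MD5('YOUR_PASSWORD'));
```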
There is a directory, cron, in the top level of the archive. It contains example scripts to run the FeedScraper, the program at the heart of the application. The script "scrapeRunner.sh" sets some environment variables (CLASSPATH) and then runs the FeedScraper application, which polls the database for the feeds to be scraped, inserts entries, and handles duplicates. You will have to customize this for your environment, as the directory paths to the Java libraries reference an absolute path of my own. The libraries you'll need are contained in the "lib" directory.
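For reference, here is a hypothetical reconstruction of the pattern scrapeRunner.sh follows: build a CLASSPATH from every jar in lib and launch FeedScraper. The real script ships in cron/ with my own absolute paths; this demo uses a scratch directory with dummy jars so the loop can be seen working.

```shell
# Hypothetical sketch of the scrapeRunner.sh pattern. A scratch directory
# with dummy jars stands in for your actual install location.
APP_HOME=$(mktemp -d)
mkdir "$APP_HOME/lib"
touch "$APP_HOME/lib/rome.jar" "$APP_HOME/lib/mysql-connector.jar"

# Start from the compiled classes, then append every jar in lib/.
CLASSPATH="$APP_HOME/classes"
for jar in "$APP_HOME"/lib/*.jar; do
    CLASSPATH="$CLASSPATH:$jar"
done
export CLASSPATH
echo "$CLASSPATH"

# java FeedScraper    # the real script launches the scraper here
```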
The cron directory also contains a file named "crontab", an example schedule for the scrapeRunner; I currently have it set to run every 10 minutes.
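That 10-minute schedule would look something like the entry below (the install path is my placeholder; the shipped example is in cron/crontab):

```
# m h dom mon dow command -- every 10 minutes, matching the setting above.
# /opt/feedfilter is a placeholder install path; adjust for your system.
*/10 * * * * /opt/feedfilter/cron/scrapeRunner.sh >> /var/log/feedfilter-scrape.log 2>&1
```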
The aggRunner.sh script is part of another program I developed after I left the CoolState project, named FeedAggregator. It takes advantage of the ROME library to create customized RSS feeds: it pushes out an XML file containing the search results from each individual user's search parameters. The scraper does not depend on it in any fashion, so feel free to use it or ignore it.
So here is a quick explanation of what is in the archive:
*.jsp: files for the web application (obviously)
admin/*.jsp: files for administrative web interface
env.sh: shell script to add environment variables to enable running the applications from the command line
FedoraSub: unfinished application that would have inserted search results into a Fedora repository.
FeedAggregator: As mentioned above, an RSS feed generator
FeedCheck: A quick little program I cooked up to check the validity of any rss feeds that may be causing exceptions in the main applications. I was planning on expanding this to check the feed when the admin added a url, but never got around to it.
feedfilter-04-28-08.sql: As mentioned above, a SQL dump of the database structure
FeedScraper: heart of the feedfilter program
I also worked on a little mechanism to limit the size of the database, which previously grew unchecked. I created this little script to cap the records at 100,000. The number is somewhat arbitrary; as more feeds are added, the number of records worth keeping would naturally increase. I would rather have done a date check, but I never got around to converting the date field in the feeddata table to something more useful than "Mar 06 08 14:32:44 PDT 2008" (it's just a varchar).
mysql_clean.sql:
use feedfilter;
CREATE TABLE tmp LIKE feeddata;
INSERT INTO tmp SELECT DISTINCT * FROM feeddata WHERE storyid > ((SELECT MAX(storyid) FROM feeddata) - 100000);
DROP TABLE feeddata;
RENAME TABLE tmp TO feeddata;
I run this every week or two; the right interval depends primarily on how fast the database grows and how much data we want to keep.
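For what it's worth, those varchar dates can be coaxed into something sortable without touching the Java side. A sketch using GNU date, picking the month, day, time, and trailing year out of the stored string (the third token's meaning is ambiguous in the stored format, so it's skipped, as is the zone abbreviation):

```shell
# The feeddata date field holds strings like the one below (just a varchar).
raw="Mar 06 08 14:32:44 PDT 2008"

# Split on whitespace: $1=month $2=day $3=?? $4=time $5=zone $6=year.
# GNU date can't parse this layout whole, so rebuild a parseable string,
# skipping the zone abbreviation and the ambiguous third token.
set -- $raw
iso=$(date -u -d "$1 $2 $6 $4" +"%Y-%m-%d %H:%M:%S")
echo "$iso"    # 2008-03-06 14:32:44
```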
That should do it, let me know if I've missed anything or if anything needs further clarification.