Synopsis
===============================================================================
RSSamantha is a command-line RSS/Atom feed aggregator/creator written in Java.
It is designed to subscribe to a batch of feeds, merge their items into new
feeds and write them as RSS 2.0 and/or plain text to hard disk.
Additionally it can download the contents of podcast feeds, filter feed
attributes by regular expressions, preprocess the configuration file for
rather unwieldy search terms, serve channels on request via HTTP GET and
add/remove items from external processes via HTTP POST.
License:
GPL Copyright (C) 2011 David Schröer <tengcomplexATgmail.com>
Usage
===============================================================================
-------------------
Installation
-------------------
A Java runtime environment is required. Version 1.5 should work, but I
recommend Java version 1.6 or higher.
Unzip rssamantha.zip and edit the configuration file as needed.
The example wrapper script rssamantha.bsh and the example configuration
file feeds.opml can serve as a starting point.
-------------------
General Usage
-------------------
java args -jar rssamantha.jar config.opml
For the full list of arguments call:
java -jar rssamantha.jar --help
-------------------
Example start
-------------------
This example includes arguments for an HTTP proxy. Omit the standard
http.proxy* arguments if you have direct internet access.
java -Dcom.drinschinz.rssamantha.loglevel=INFO \
-Dcom.drinschinz.rssamantha.showlimit=100 \
-Dcom.drinschinz.rssamantha.itemstoragefile=../rssfeedcreatoritems.dat \
-Dhttp.proxyHost=localhost \
"-Dcom.drinschinz.rssamantha.preprocessconfig.repl_1=hobo+OR+shotgun" \
-Dhttp.proxyPort=8118 \
-jar rssamantha.jar configfile.opml
-------------------
Adding/removing external items and requesting data via HTTP
-------------------
If an itemacceptor thread is configured, external sources can add items to or
remove items from a channel via HTTP POST.
The related properties are:
* com.drinschinz.rssamantha.itemacceptor
* com.drinschinz.rssamantha.itemacceptorport
* com.drinschinz.rssamantha.acceptorlist
POST examples:
Supported keys:
channel (String, the channel name as defined in config.name)
ix (Integer, the channel index)
title (String)
description (String)
created (Long, milliseconds since midnight, January 1, 1970 UTC)
link (String)
remove (Integer, 1 means true, default false)
Add item:
wget --post-data='title=testtitle&description=testdescription&ix=0' http://host:port/ -O /dev/null
wget --post-data="title=testtitle&description=testdescription&channel=tengtest&created=$(($(date +%s%N)/1000000))" http://host:port/ -O /dev/null
Remove item:
wget --post-data="title=testtitle&description=testdescription&ix=0&created=$CREATED&remove=1" http://host:port/ -O /dev/null
wget --post-data="title=testtitle&description=testdescription&channel=tengtest&created=$CREATED&remove=1" http://host:port/ -O /dev/null
GET examples:
http://$HOST:$PORT/channel=$CHANNELNAME
http://$HOST:$PORT/status
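The POST requests shown above can also be sent from Java. The following is a minimal sketch using only JDK classes. The class ItemPoster and its methods are hypothetical and not part of RSSamantha; only the form keys come from the list of supported keys. It assumes the item acceptor takes the same form-encoded body that wget sends; whether percent-encoded values are decoded on the server side is not stated here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of RSSamantha.
class ItemPoster {

    /** Builds a form-encoded body (key=value pairs joined by '&'). */
    static String formBody(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    /** Sends the body via HTTP POST and returns the HTTP response code. */
    static int post(String url, String body) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        OutputStream out = con.getOutputStream();
        try {
            out.write(body.getBytes("UTF-8"));
        } finally {
            out.close();
        }
        return con.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> item = new LinkedHashMap<String, String>();
        item.put("title", "testtitle");
        item.put("description", "testdescription");
        item.put("channel", "tengtest");
        // created is given in milliseconds since January 1, 1970 UTC
        item.put("created", String.valueOf(System.currentTimeMillis()));
        String body = formBody(item);
        System.out.println(body);
        // Only contact a server if a URL such as http://host:port/ is given.
        if (args.length > 0) {
            System.out.println("HTTP " + post(args[0], body));
        }
    }
}
```

To remove the item again, send the same title, description and created values plus remove=1.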
Configuration
===============================================================================
Settings and channels are read from an OPML configuration file.
-------------------
Supported subscription types
-------------------
rss (using the DOM parser from the Sun libraries)
simplerss (using the parser from the qdxml classes, thanks to Steven R. Brandt, see
http://www.javaworld.com/javatips/jw-javatip128.html?page=1)
rsstwitter
rssidentica
atom
podcast (Note that -Dknowndownloadsfile=filename must be set up, otherwise the
downloadcontrol thread is not started)
Written or HTTP requested channels are published in RSS 2.0.
-------------------
Feed configuration
-------------------
Example:
<source title="title" feedtype="type" feedUrl="url" delay="60000" matchpattern_key="patternA" matchpattern_key_name="patternB" dayofweek="x,y,z" hourofday="x,y" appenddescription="true|false" translate="from->to"/>
feedUrl:
The actual URL of the feed we want to subscribe to.
delay:
In ms. Once a day means 86400000, once an hour 3600000.
dayofweek/hourofday:
Comma-separated list of reading days/hours.
See http://download-llnw.oracle.com/javase/6/docs/api/constant-values.html#java.util.Calendar.LONG
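Assuming the dayofweek values are the day constants of java.util.Calendar, as the link above suggests (Calendar.SUNDAY is 1, Calendar.SATURDAY is 7), and hourofday presumably follows Calendar.HOUR_OF_DAY (0 to 23), the attribute values can be derived as in the sketch below. The class ScheduleValues is hypothetical and only illustrates the arithmetic.

```java
import java.util.Calendar;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of RSSamantha.
class ScheduleValues {

    /** Joins int values into the comma-separated form used by the attributes. */
    static String join(int... values) {
        StringBuilder sb = new StringBuilder();
        for (int v : values) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(v);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // delay is given in milliseconds
        System.out.println("delay=\"" + TimeUnit.HOURS.toMillis(1) + "\"");
        System.out.println("delay=\"" + TimeUnit.DAYS.toMillis(1) + "\"");
        // Monday to Friday, using the Calendar day-of-week constants
        System.out.println("dayofweek=\"" + join(Calendar.MONDAY, Calendar.TUESDAY,
                Calendar.WEDNESDAY, Calendar.THURSDAY, Calendar.FRIDAY) + "\"");
        // 8 and 9 o'clock in the morning
        System.out.println("hourofday=\"" + join(8, 9) + "\"");
    }
}
```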
appenddescription:
If true, the full description is appended to the title of an item.
matchpattern_key[_patternname]:
A pattern can be applied to the title and/or another part such as the category (see Simple example).
If more than one pattern is defined, the item is discarded as soon as one pattern does not match.
So if you want to match any one of several conditions, let's say you want to monitor a source for "amy wong naked"
in title --or-- in description, you have to set this up as two sources.
If you want to match titles that do not start with "Bender", do not start with "Fry" and do not contain
"Homer", you can do so with multiple patterns on the same attribute.
Example:
* matchpattern_title="^(?!.*(Homer)).*$"
* matchpattern_title_nobender="^(?!Bender).*"
* matchpattern_title_nofry="^(?!Fry).*"
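The effect of combining the three patterns above can be tried out with plain java.util.regex. This is only a sketch of the "all patterns must match" rule described here; the class TitleFilter and the sample titles are hypothetical and this is not RSSamantha's actual matching code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RSSamantha.
class TitleFilter {

    // The three title patterns from the example above.
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("^(?!.*(Homer)).*$"),
            Pattern.compile("^(?!Bender).*"),
            Pattern.compile("^(?!Fry).*"));

    /** Returns true only if the title matches every pattern (logical AND). */
    static boolean accept(String title) {
        for (Pattern p : PATTERNS) {
            if (!p.matcher(title).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] titles = {
            "Leela saves the day",
            "Bender bends",
            "Fry sleeps",
            "Zoidberg meets Homer"
        };
        for (String t : titles) {
            System.out.println(t + " -> " + accept(t));
        }
    }
}
```

Only the first sample title passes all three patterns; each of the others is rejected by one of them.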
Matchpattern is based on Java regular expressions. Example usage:
a. Find strings that 'start with' "STR": use "^STR.*"
b. Find strings that 'do not start with' "STR": use "^(?!STR).*"
c. Find strings that 'end with' "STR": use ".*STR$"
d. Find strings that 'do not end with' "STR": use ".*(?<!STR)$"
e. Find strings that 'contain' "STR" in the middle: use ".+STR.+"
f. Find strings that 'contain' "STR" somewhere: use ".*STR.*"
g. Find strings that 'do not contain' "STR": use "^(?!.*(STR)).*$"
h. Find strings that are 'equal to' "STR": use "^STR$"
i. Find strings that are 'not equal to' "STR": use "^(?!STR$).*"
Logical operators
j. Find strings that 'start with' "STRA" or 'start with' "STRB": use "^STRA.*|^STRB.*"
See http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
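The patterns above are written to cover the whole string, which fits the note in the changes for version 0.789 that item matching uses matches() instead of find(). The sketch below contrasts the two calls with java.util.regex; the class MatchPatternDemo and the sample title are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of RSSamantha.
class MatchPatternDemo {

    /** Whole-string match, as Matcher.matches() does. */
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** Substring search, as Matcher.find() does. */
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String title = "Traffic jam on A43 near Bochum";
        System.out.println("a. starts with:       " + matches("^Traffic.*", title));
        System.out.println("c. ends with:         " + matches(".*Bochum$", title));
        System.out.println("d. does not end with: " + matches(".*(?<!Bochum)$", title));
        System.out.println("f. contains:          " + matches(".*A43.*", title));
        System.out.println("g. does not contain:  " + matches("^(?!.*(A44)).*$", title));
        // A bare substring pattern behaves differently under the two calls.
        System.out.println("matches(\"A43\"): " + matches("A43", title));
        System.out.println("find(\"A43\"):    " + find("A43", title));
    }
}
```

With matches(), a bare "A43" only accepts a title that consists of exactly "A43", so a 'contains' condition needs the ".*STR.*" form from item f.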
-------------------
Channel output feed values
-------------------
[rsschannel.title=title]
[rsschannel.link=link]
[rsschannel.description=description]
[rsschannel.comment=comment]
[rsschannel.yourvalue=value]
-------------------
Channel config values
-------------------
[config.rsswritesleep=sleep{ms}(3600000)]
[config.txtwritesleep=sleep{ms}(3600000)]
[config.showlimit=limit]
[config.storelimit=limit]
[config.rssfilename=rssfilename]
[config.txtfilename=txtfilename]
[config.downloadfolder=folder]
-------------------
Simple example
-------------------
<opml version="1.0">
<head>
<title>RSS Feed Creator Configuration</title>
</head>
<body>
<outline
config.name="myfeed"
config.rsswritesleep="65000"
config.showlimit="100"
config.storelimit="2000"
config.rssfilename="/home/username/myfeed.rss"
config.downloadfolder="/media/podcasts/"
config.adddownloaditems="5"
rsschannel.title="tengnews_international"
rsschannel.link="http://localhost:8080/"
rsschannel.description="We hate international news."
rsschannel.comment="Don't expect useful information here."
rsschannel.ttl="5"
>
<source title="ria novosti" feedtype="rss" feedUrl="http://en.rian.ru/export/rss2/index.xml" delay="980000"/>
<source title="The Register" feedtype="atom" feedUrl="http://www.theregister.co.uk/headlines.atom" delay="48754747"/>
<source title="oe-r" feedtype="podcast" feedUrl="http://static.orf.at/podcast/oe1/oe1_digitalleben.xml"/>
<source title="Niels Ruf" feedtype="rsstwitter" feedUrl="http://twitter.com/statuses/user_timeline/22873141.rss" delay="19999999"/>
<source title="wdr verkehr" feedtype="rss" feedUrl="http://www.wdr.de/verkehrslage/rss.xml" delay="60000" matchpattern_title="A43|A44|A45" dayofweek="2,3,4,5,6" hourofday="8,9" appenddescription="true"/>
<source title="el universal" feedtype="rss" feedUrl="http://www.eluniversal.com/rss/pol_avances.xml"/>
<source title="Internet Archive" feedtype="rss" feedUrl="http://www.archive.org/services/collection-rss.php" matchpattern_title="^(?!SFGTV).*" matchpattern_category="^movies.*"/>
<source title="Googlenews ISIN" feedtype="rss" feedUrl="http://news.google.de/news?pz=1&cf=all&ned=de&hl=de&q=${repl_isinscan_1}&cf=all&output=rss" delay="48000000"/>
</outline>
</body>
</opml>
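The last source in the example contains a ${...} placeholder, and the example start passes a com.drinschinz.rssamantha.preprocessconfig.repl_1 property. How exactly RSSamantha substitutes such placeholders is not described here. The sketch below shows one plausible reading, under the assumption that ${name} is replaced by the value of the property with that prefix plus name. The class ConfigPreprocessor and the URL are hypothetical.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the substitution rule is an assumption, not RSSamantha's code.
class ConfigPreprocessor {

    static final String PREFIX = "com.drinschinz.rssamantha.preprocessconfig.";
    static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${name} for which the property PREFIX + name is defined. */
    static String preprocess(String text, Properties props) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(PREFIX + m.group(1));
            // Unknown placeholders are left untouched.
            String replacement = value != null ? value : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PREFIX + "repl_1", "hobo+OR+shotgun");
        System.out.println(preprocess(
                "http://news.example.org/news?q=${repl_1}&output=rss", props));
    }
}
```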
Changes
===============================================================================
-------------------
Version: 0.791
Release: 20120108
-------------------
- Added feed request via http GET, resulting in on the fly XML RSS 2.0 response.
- Added item remove via HTTP POST.
- Added enhancement regarding logging statistic data.
- Renamed Itemacceptor.BrowserClientThread to Itemacceptor.ClientThread.
- Read CDATA coalescing, ignore comments and elementcontent whitespace when
reading xml.
- Added optional setting "config.txtdatetimeformat", default "HH:mm:ss".
- Added dateformat "E',' dd. MMM yyyy HH:mm:ss Z".
-------------------
Version: 0.79
Release: 20111204
-------------------
- Deactivated translation support.
NOTE: Google translate API v1 was shut down.
The Google translation lib now supports version 2 of the API, which is
available as a paid service.
See:
http://code.google.com/apis/language/translate/v2/getting_started.html
http://code.google.com/apis/language/translate/v2/pricing.html
If you are willing to pay for translation or have an easy alternative,
let me know.
- Minor optimizations. (ChannelReader.DateIndex)
- Various renamings.
-------------------
Version: 0.789
Release: 20111001
-------------------
- Introduced application.properties.
- Item matching now uses matches() instead of find().
- Rename to rssamantha.
-------------------
Version: 0.788
Release: 20110916
-------------------
- Added dateformat "E MMM dd HH:mm:ss Z yyyy".
- Added dateformat "E',' dd MMM yyyy".
- Added dateformat "E','dd MMM yyyy HH:mm:ss Z".
-------------------
Version: 0.787
Release: 20110618
-------------------
- Support multiple matchpattern for the same attribute by optional suffixes.
-------------------
Version: 0.786
Release: 20110613
-------------------
- ItemAcceptor.readRequest() now using StringBuilder.
- Outsourced Statistics, improved Control.RssFeddCreatorStatistics.
- Just collect the hashcode of Items with foundrsscreated=false in ItemData.
- TxtFileHandler using a DateFormat for the timestamps.
- Minor improvements.
- Fixed exception before trying to translate empty title.
- No longer using toolwit library.
- Upgrade to google translate API version 0.95.
-------------------
Version: 0.785
Release: 20110122
-------------------
- Added support for Spanish dateformat "E',' dd MMM yyyy HH:mm:ss Z".
- Make ChannelReader.dateformats final.
-------------------
Version: 0.784
Release: 20110116
-------------------
- Added dateformat "MM/dd/yyyy hh:mm:ss a".
-------------------
Version: 0.783
Release: 20101231
-------------------
- Introduced Control.AddItemResult enum.
- Bugfix, we always returned "already known item" if two items have the exact
same created timestamp. Now we compare title as well in Item.compareTo
if created timestamps are equal.
- Bugfix, sort podcast items after reading them so that the newest
"config.adddownloaditems" items are added.
- Itemacceptor returns more information.
- Main prints welcome message when starting on stdout.
- Loglevel changed to FINE in FileHandler.hasChanged if nothing has actually
changed.
- Upgrade to google-api-translate-java-0.94.jar lib.
-------------------
Version: 0.74
Release: 20101225
-------------------
- Minor improvements.
-------------------
Version: 0.72
Release: 20101106
-------------------
- Minor cleanups.
-------------------
Version: 0.72
Release: 20101104
-------------------
- Optional preprocessing of the opml configfile.
- Logging improvements.
- HTML decode improvements.
-------------------
Version: 0.71
Release: 20101024
-------------------
- Accept link via http.
- Minor improvements in ItemAcceptor.
-------------------
Version: 0.70
Release: 20101017
-------------------
- Accept created via http.
-------------------
Version: 0.69
Release: 20100905
-------------------
- Configuration in one single opml file.
- Support multiple matchingpattern.(i.e. matchpattern_key).
- Added translation support.
- All system properties now start with the central package name.
- Various optimizations and better errorhandling.
-------------------
Version: 0.68
Release: 20100530
-------------------
- Replaced startswith and startsnotwith by matching pattern.
- Trim description before appending it, clean it from html and divide
it by a "|" character.
-------------------
Version: 0.67
Release: 20100420
-------------------
- Minor optimizations.
-------------------
Version: 0.66
Release: 20100322
-------------------
- Added system property futuredump; if true, we don't write items published
in the future.
- Minor optimizations.
-------------------
Version: 0.65
Release: 20100318
-------------------
- Just start downloadcontrol if -Dknowndownloadsfile is defined.
- Minor optimizations.
-------------------
Version: 0.64
Release: 20100314
-------------------
- Added type podcastfeed and downloadmanager.
-------------------
Version: 0.62
Release: 20100307
-------------------
- Added http itemacceptor.
-------------------
Version: 0.61
Release: 20100303
-------------------
- Added datetime support for "yyyy-MM-dd".
- Fixed title removal.
-------------------
Version: 0.6
Release: 20100215
-------------------
- Supporting multiple channels.
-------------------
Version: 0.5
Release: 20100130
-------------------
- Added hours, changed order of config items.
- Just writing items if hashcode has changed.
-------------------
Version: 0.4
Release: 20100116
-------------------
- Added startsnotwith.
TODO
===============================================================================
- Add an example .bat wrapper script runnable on windows systems.
- Jun 27, 2011 6:35:19 AM com.drinschinz.rssfeedcreator.ChannelReader getCreated SEVERE: Error reading http://feeds.feedburner.com/francetv-sports?format=xml Couldn't parse dim 26 juin 2011 22:13:51 +0100
Abbreviated French day names such as "dim" do not seem to be supported.
- Jul 11, 2011 6:11:15 AM com.drinschinz.rssfeedcreator.ChannelReader getCreated SEVERE: Error reading http://rss.nrg.co.il/news/ Couldn't parse Sun,10 Jul 2011 18:36:05 +0200
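The timestamp in the second log entry appears to fit the dateformat "E','dd MMM yyyy HH:mm:ss Z" listed under version 0.788. A quick check, assuming the pattern is used with java.text.SimpleDateFormat and an English locale (neither is confirmed here); the class DateFormatCheck is hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical helper, not part of RSSamantha.
class DateFormatCheck {

    /** Parses text with the given pattern and returns milliseconds since 1970 UTC. */
    static long parseMillis(String pattern, String text) {
        try {
            return new SimpleDateFormat(pattern, Locale.ENGLISH).parse(text).getTime();
        } catch (ParseException ex) {
            throw new IllegalArgumentException("Couldn't parse " + text, ex);
        }
    }

    public static void main(String[] args) {
        long created = parseMillis("E','dd MMM yyyy HH:mm:ss Z",
                "Sun,10 Jul 2011 18:36:05 +0200");
        System.out.println(created);
    }
}
```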