A new version of craigslistTools+search is in the repository. This version has been broken into modules, is configured via an xml file and uses a sqlite database as its internal cache.
NAME: craigslistTools
PURPOSE: allow an individual to easily search craigslist.org for interesting posts
CONFIGURATION: copy search.xml and tailor this file for your particular search (contains many useful comments)
set the locations of the 5 files: cache, index, excluded, included and skipped
set the type of each of the 4 output files: xhtml, atom or rss2
you will need to look up the city prefix (the craigslist.org subdomain)
you will also need to look up the category suffix (the path after the craigslist.org domain)
you need to understand regular expressions and the purpose of an xml cdata section
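The reason for the cdata section is that regular expressions often contain characters such as < and & that xml would otherwise require you to escape. The following is a minimal sketch of reading such a pattern, not the actual code in config.py. The element names (search, city, category, include) and the sample values are assumptions made for illustration; the comments in search.xml describe the real layout.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical configuration. The element names and values are assumptions,
# not the real layout of search.xml.
SAMPLE_CONFIG = r"""<search>
  <city>sfbay</city>
  <category>sya</category>
  <include><![CDATA[(?i)\bthinkpad\b.*<\s*\$\d+]]></include>
</search>"""


def load_include_pattern(xml_text):
    """Parse the xml text and compile the regex held in the cdata section."""
    root = ET.fromstring(xml_text)
    # The parser hands back the cdata content verbatim, so the "<" in the
    # pattern needs no escaping inside the xml file.
    return re.compile(root.findtext("include"))


pattern = load_include_pattern(SAMPLE_CONFIG)
print(bool(pattern.search("ThinkPad X200 < $300 obo")))  # True
print(bool(pattern.search("Dell laptop $300")))          # False
```

Without the cdata wrapper, the same pattern would have to be written with &lt; in place of <, which makes long expressions hard to read.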
USAGE: python search.py mysearch.xml
(it will take a few minutes to run, printing simple diagnostics about what it is doing)
(you can run it once an hour, once a day, or once a week)
LIMITATIONS: you can search up to 20 cities
you can search up to 5 categories
text is converted to ascii for searching
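The ascii limitation matters when writing patterns: accented characters in a post will not be seen by your regular expressions in their original form. The source does not say how the conversion is performed, so the sketch below shows one common approach (unicode decomposition followed by dropping whatever is not ascii) and is an assumption, not the project's code. The function name to_ascii is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Approximate unicode text in plain ascii (one possible approach)."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" then drops the marks and any other non-ascii character.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"Caf\u00e9 cr\u00e8me"))  # Cafe creme
```

Under this approach, characters with no ascii decomposition are silently removed, so a pattern should be written against the simplified text.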
WARNINGS: do not modify this script to "suck down" all the data from craigslist.org
do not run this script more than once per hour (once per day is usually sufficient)
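If you schedule the script, a small guard can help you respect the once-per-hour limit. The sketch below is not part of craigslistTools; the stamp file and the function names may_run and record_run are hypothetical. It records the time of the last run as the modification time of an empty file.

```python
import os
import tempfile
import time

MIN_INTERVAL_SECONDS = 60 * 60  # one hour, per the warning above


def may_run(stamp_path, now=None):
    """Return True if at least an hour has passed since the last recorded run."""
    now = time.time() if now is None else now
    if not os.path.exists(stamp_path):
        return True  # no run has been recorded yet
    return now - os.path.getmtime(stamp_path) >= MIN_INTERVAL_SECONDS


def record_run(stamp_path):
    """Create the stamp file if needed and set its modification time to now."""
    with open(stamp_path, "a"):
        os.utime(stamp_path, None)


# Demonstration in a throwaway directory.
stamp = os.path.join(tempfile.mkdtemp(), "last_run")
first = may_run(stamp)    # nothing recorded yet
record_run(stamp)
second = may_run(stamp)   # just ran, so not allowed again
third = may_run(stamp, now=time.time() + MIN_INTERVAL_SECONDS + 1)
print((first, second, third))
```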
INTERFACE: command-line control with output to local files
INTERPRETER: Python 2.6
DATABASE: sqlite3 database used for internal data storage
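To give a sense of what an internal sqlite cache can do, the sketch below remembers which posts have already been fetched so they are not requested again. It is an illustration only: the table name, the columns and the function names are assumptions, not the schema used by cache.py, and the url is a placeholder.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database with a hypothetical single-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS post ("
        " url TEXT PRIMARY KEY,"
        " title TEXT,"
        " body TEXT)"
    )
    return conn


def remember(conn, url, title, body):
    """Store a post; return True if it was new, False if it was already cached."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO post (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the url already existed.
    return cur.rowcount == 1


conn = open_cache()
print(remember(conn, "https://example.org/post/1", "bike", "red bike"))  # True
print(remember(conn, "https://example.org/post/1", "bike", "red bike"))  # False
```

Keying the table on the post url lets repeated runs skip posts that were seen before, which keeps the load on craigslist.org low.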
STAGE: this project is in alpha
TESTING: this project does not yet have unit testing
all testing to date is by field usage
MODULES: search.py the controlling script for craigslistTools+search
config.py reads the xml search configuration file
cache.py manages the internal sqlite cache database
get.py fetches listings and posts from craigslist.org
filter.py filters posts to exclude, include and skip
report.py generates output files in various web-friendly formats
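The three outcomes produced by the filter step can be sketched as follows. The source does not state the precedence between the outcomes, so the order used here (exclude patterns are checked first, then include patterns, and anything that matches neither is skipped) is an assumption, and the function name classify is hypothetical rather than taken from filter.py.

```python
import re


def classify(text, exclude_patterns, include_patterns):
    """Sort a post into "excluded", "included" or "skipped" (assumed precedence)."""
    for pattern in exclude_patterns:
        if re.search(pattern, text):
            return "excluded"
    for pattern in include_patterns:
        if re.search(pattern, text):
            return "included"
    return "skipped"


exclude = [r"(?i)\bbroken\b"]
include = [r"(?i)\bbike\b"]
print(classify("Red bike, barely used", exclude, include))  # included
print(classify("Broken bike for parts", exclude, include))  # excluded
print(classify("Oak dining table", exclude, include))       # skipped
```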
FILES: search.xml master copy of the xml search configuration file
_search.db testing copy of the sqlite3 database used for the internal cache
_index.xml testing copy of index containing links to the three report files
_excluded.xml testing copy of postings that have been excluded, sorted by date, city, category
_included.xml testing copy of postings that have been included, sorted by date, city, category
_skipped.xml testing copy of postings that have been skipped, sorted by date, city, category