GetHeaders

Shaun

This tool downloads a set of headers from a newsgroup and keeps it updated. It can download a fixed block of headers between a start date and an end date, or maintain a rolling block of headers covering a set time period. Given a start date, it can download new headers progressively on each run, or it can try to fetch all the headers in one go. These options support many different usage scenarios and approaches, depending on the circumstances.

The headers are saved to disk in TSV (tab-separated values) format. The location and name of the saved file are determined by the DataPath, FileNamePrefix and Group settings in the config file. The data is plain text and can be opened in any standard text editor or in a spreadsheet program such as Excel.
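
Because the output is plain TSV, it can also be read from a script. The following is a minimal sketch, assuming UTF-8 text with one header per line; the column layout is not documented here, so the sample data and the `read_tsv` helper are hypothetical.

```python
import csv
import io

# Hypothetical sample data: the real column layout is not documented here.
sample = (
    "1001\tFirst subject\tposter@example.com\n"
    "1002\tSecond subject\tother@example.com\n"
)

def read_tsv(stream):
    """Return each TAB-separated line of the stream as a list of strings."""
    # QUOTE_NONE keeps quote characters in subjects as literal text.
    return list(csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE))

rows = read_tsv(io.StringIO(sample))
print(len(rows), rows[0])
```

For a real file, the same function can be given a file object opened with `open(path, newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.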

When downloading the most recent headers from a group there can be gaps, because not all of the newest headers have replicated to every server yet. This is a problem if the tool only fetches headers newer than the last message ID held locally: headers missing further back would never be retrieved, even though they may become available 10 minutes later. To work around this, on each run GetHeaders identifies any gaps in the message IDs within its current block of downloaded headers and tries to fill them.
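
The gap detection described above can be pictured with a short sketch. It assumes the message IDs are sequential integers (as NNTP article numbers are); the `find_gaps` function is hypothetical and is not taken from the GetHeaders source.

```python
from typing import Iterable, List, Tuple

def find_gaps(article_numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first_missing, last_missing) ranges that lie
    between the lowest and highest numbers already held."""
    ordered = sorted(set(article_numbers))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps

print(find_gaps([100, 101, 104, 105, 109]))
# prints [(102, 103), (106, 108)]
```

Each returned range could then be requested from the server again on the next run; ranges that are still missing simply show up as gaps again.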

If compressed headers (xzver) are available on the server, NntpClientLib uses them automatically. Not all servers support compressed headers, but where they are available the downloaded headers can compress by 80%-90%, which means 80%-90% less data downloaded and transferred. This helps if your news server has a download quota or if you are using a metered network connection.
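
To put the saving in concrete terms, here is a small worked calculation. The 500 MB figure is a hypothetical example, not a measurement, and `compressed_size` is an illustrative helper.

```python
def compressed_size(raw_bytes: int, saving_percent: int) -> int:
    """Bytes transferred after a given percentage saving (integer arithmetic)."""
    return raw_bytes * (100 - saving_percent) // 100

raw = 500_000_000  # hypothetical: 500 MB of uncompressed headers
print(compressed_size(raw, 80), compressed_size(raw, 90))
# prints 100000000 50000000
```

Under these assumptions, 500 MB of headers would cost roughly 50 to 100 MB of quota.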

Scenario 1

In this scenario, get the latest 6 hours of headers for a group, to be used for release extraction with the [ReleasesExtractor] tool.

This requires GetHeaders to be run on a schedule, say every 10 minutes. Each run trims headers older than the TimeBufferSize, identifies missing headers and downloads them, and then downloads any newly available headers.

The options for this scenario are as follows:

StartDate=
EndDate=
TimeBufferSize=6
RetreiveBlockSize=50000
FileNamePrefix=live
DataPath=your output path
DownloadAll=true
TrimOld=true
Group=the group you want
Server=your server details
Password=your server details
Username=your server details

In this scenario you need to run the tool on a schedule, every 10 minutes, to keep the headers updated. After each run, run [ReleasesExtractor] to extract releases from the new data.
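
The trimming step in this scenario can be sketched as follows. This is an illustration only: the `(article number, post date)` record shape and the `trim_old` function are hypothetical, and whether the real tool keeps a header that sits exactly on the cutoff is an assumption.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical record shape: (article number, post date).
Header = Tuple[int, datetime]

def trim_old(headers: List[Header], now: datetime, time_buffer_hours: int) -> List[Header]:
    """Keep only headers posted within the last time_buffer_hours hours."""
    cutoff = now - timedelta(hours=time_buffer_hours)
    # Assumption: a header exactly on the cutoff is kept.
    return [header for header in headers if header[1] >= cutoff]

now = datetime(2012, 6, 1, 12, 0)
headers = [
    (1, datetime(2012, 6, 1, 5, 0)),    # 7 hours old: dropped
    (2, datetime(2012, 6, 1, 6, 0)),    # exactly 6 hours old: kept
    (3, datetime(2012, 6, 1, 11, 50)),  # 10 minutes old: kept
]
print([number for number, _ in trim_old(headers, now, 6)])
# prints [2, 3]
```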

Scenario 2

Start at a date and work forwards progressively, RetreiveBlockSize headers at a time.
This allows backfilling. The tool works progressively through a group, downloading and trimming data while keeping a TimeBufferSize block of headers saved locally, and it needs to be run repeatedly to progress through the headers. Each run can be followed by the [ReleasesExtractor] tool, so that releases are produced progressively and efficiently as the group is processed.

The options for this scenario are as follows:

StartDate=
EndDate=
TimeBufferSize=6
RetreiveBlockSize=50000
FileNamePrefix=backfill
DataPath=your output path
DownloadAll=false
TrimOld=true
Group=the group you want
Server=your server details
Password=your server details
Username=your server details

In this scenario you need to run the tool on a schedule, every 10 minutes, to keep the headers updated. After each run, run [ReleasesExtractor] to extract releases from the new data.
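
The block-at-a-time progression can be pictured with a small sketch. It assumes the backfill is expressed as a range of sequential article numbers, whereas the real tool starts from a date; the `plan_runs` function and the numbers used are hypothetical.

```python
from typing import Iterator, Tuple

def plan_runs(first_article: int, last_article: int, block_size: int) -> Iterator[Tuple[int, int]]:
    """Yield one inclusive (start, end) article range per run."""
    start = first_article
    while start <= last_article:
        end = min(start + block_size - 1, last_article)
        yield (start, end)
        start = end + 1

print(list(plan_runs(1, 120000, 50000)))
# prints [(1, 50000), (50001, 100000), (100001, 120000)]
```

With RetreiveBlockSize=50000, a span of 120,000 headers would take three runs under these assumptions.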

Scenario 3

Get all headers for a particular day. This is a single run that gets all the data.

The options for this scenario are as follows:

StartDate=2012-06-01
EndDate=2012-06-02
TimeBufferSize=
RetreiveBlockSize=50000
FileNamePrefix=backfill_12-06-01
DataPath=your output path
DownloadAll=true
TrimOld=false
Group=the group you want
Server=your server details
Password=your server details
Username=your server details

This is a single-run case: when it finishes, all data for the date should be available. If it is stopped before it finishes, it picks up where it left off the next time it is started.

Config file and parameters

The default config file name is getheaders.conf and it contains the following:

StartDate=         Start Date in format (yyyy-mm-dd)
EndDate=           End Date (yyyy-mm-dd)
TimeBufferSize=    Time Buffer Size (in hours)
RetreiveBlockSize= Retrieve Block Size (number of headers to get per run)
FileNamePrefix=    Filename Prefix (string to prefix to output file)
DataPath=          Data Path (output path for data files)
DownloadAll=       Download All (download all headers or in batches of BlockSize)
TrimOld=           Trim Old (drop old headers older than time buffer size)
Group=             Group (group to download headers for)
Server=            Server (your nntp server news.myserver.com)
Password=          Password (your news server password)
Username=          Username (your news server user name)

Each of the above can be overridden using one of the following command-line parameters:

-sd   Start Date in format (yyyy-mm-dd)
-ed   End Date (yyyy-mm-dd)
-tbs  Time Buffer Size (in hours)
-rbs  Retrieve Block Size (number of headers to get per run)
-fp   Filename Prefix (string to prefix to output file)
-dp   Data Path (output path for data files)
-g    Group (group to download headers for)
-s    Server (your nntp server news.myserver.com)
-u    Username (your news server user name)
-p    Password (your news server password)
-c    Config File (filename of config file)
-da   Download All (download all headers or in batches of BlockSize)
-to   Trim Old (drop old headers older than time buffer size)

Example:
GetHeaders -c "config file" -g "group"
This will use the settings from the config file but override the group.
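
The override behaviour can be sketched as a two-step merge: read the key=value pairs, then let command-line values replace them. This is an illustration only and not the GetHeaders source; the flag-to-key mapping follows the two lists above, while `parse_config` and `apply_overrides` are hypothetical helpers.

```python
from typing import Dict, List

# Mapping of command-line flags to config keys, following the lists above.
FLAG_TO_KEY = {
    "-sd": "StartDate", "-ed": "EndDate", "-tbs": "TimeBufferSize",
    "-rbs": "RetreiveBlockSize", "-fp": "FileNamePrefix", "-dp": "DataPath",
    "-g": "Group", "-s": "Server", "-u": "Username", "-p": "Password",
    "-da": "DownloadAll", "-to": "TrimOld",
}

def parse_config(text: str) -> Dict[str, str]:
    """Parse key=value lines; blank lines and lines without '=' are skipped."""
    settings = {}
    for line in text.splitlines():
        key, separator, value = line.strip().partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings

def apply_overrides(settings: Dict[str, str], argv: List[str]) -> Dict[str, str]:
    """Return a copy of settings with command-line values taking priority."""
    merged = dict(settings)
    arguments = iter(argv)
    for flag in arguments:
        key = FLAG_TO_KEY.get(flag)
        if key is not None:
            merged[key] = next(arguments, "")
    return merged

config_text = "Group=alt.test\nTimeBufferSize=6\n"  # hypothetical config content
merged = apply_overrides(parse_config(config_text), ["-c", "getheaders.conf", "-g", "alt.other"])
print(merged["Group"], merged["TimeBufferSize"])
# prints alt.other 6
```

The -c flag is left out of the mapping because it selects the config file rather than overriding a setting within it.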


Related

Wiki: Getting Started
Wiki: Home
Wiki: How to Index NNTP Usenet Groups
Wiki: ReleasesExtractor
