<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to Home</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>Recent changes to Home</description><atom:link href="https://sourceforge.net/p/nntpit/wiki/Home/feed" rel="self" type="application/rss+xml"/><language>en</language><lastBuildDate>Tue, 05 Mar 2013 12:22:45 -0000</lastBuildDate><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v21
+++ v22
@@ -6,7 +6,7 @@

 Download Now
 ------------
-Latest version : [https://sourceforge.net/projects/nntpit/files/]
+Latest version : 

 Getting Started
 ---------------
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Tue, 05 Mar 2013 12:22:45 -0000</pubDate><guid>https://sourceforge.nete94d0ded39b3a81bb661e4e05a8d56be55691f91</guid></item><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v20
+++ v21
@@ -3,6 +3,10 @@
 NNTP Indexing Toolkit is a collection of tools that allow the download, indexing and creation of NZB files from a Usenet server using the NNTP protocol. These tools are designed for use with binary newsgroups, and in particular with multi-file collections, which throughout this document will be referred to as releases.

 Written in C#, the tools have been tested on Windows with the native .NET runtimes and on Ubuntu with the Mono .NET runtimes.
+
+Download Now
+------------
+Latest version : [https://sourceforge.net/projects/nntpit/files/]

 Getting Started
 ---------------
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Fri, 22 Feb 2013 01:12:43 -0000</pubDate><guid>https://sourceforge.neta2eb7823ad5b7d767c24cd697cbb03e587955a49</guid></item><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v19
+++ v20
@@ -1,8 +1,12 @@
 NNTP Indexing Toolkit
-=====================
+---------------------
 NNTP Indexing Toolkit is a collection of tools that allow the download, indexing and creation of NZB files from a Usenet server using the NNTP protocol. These tools are designed for use with binary newsgroups, and in particular with multi-file collections, which throughout this document will be referred to as releases.

 Written in C#, the tools have been tested on Windows with the native .NET runtimes and on Ubuntu with the Mono .NET runtimes.
+
+Getting Started
+---------------
+See how to start indexing the most recent releases submitted to a group [Getting Started]

 NNTP Indexing - How it works
 ----------------------------
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Fri, 22 Feb 2013 00:50:56 -0000</pubDate><guid>https://sourceforge.netc8039f1d5925eb588ed7d7f4af4e4f9fd1882bd5</guid></item><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v18
+++ v19
@@ -10,97 +10,6 @@

 [How to Index NNTP Usenet Groups]

-
-
-
-
-NNTP Indexing - How it works
-----------------------------
-The main problem with binary Usenet groups is that the system was never intended to be used for large file releases. Posts are limited to a few hundred KB of text-only content. To allow posting of large multi-file releases to binary newsgroups, a large amount of processing of the files is needed. This includes splitting files up into parts, encoding the binary data into a text-based format, then uploading these messages to the group. This means some files can be split across hundreds of posts, and with multi-file releases this adds another level of complexity to the problem.
-
-For more details on Usenet and how it works see http://en.wikipedia.org/wiki/Usenet_newsgroup
-
-Because of the limitations outlined above and the need to split files across multiple messages, a way was required to put all the parts back together to recreate the original file and, in the case of multi-file releases, to group all the files into one release. The only way to do this within the Usenet system was to use the message subject.
-
-Each message of a multi-part file post has enough information in the subject to merge the data from the messages back into a single file.
-
-Example Message Subject:
-
-**Some release name (1/2) "file.zip" yEnc (1/4)**
-
-As you can see, the subject has enough information to tell you which part of which file this message is from. In the above case, this message holds data for part 1 of the 4 parts that make up the file called file.zip. This file is 1 of 2 files in the release, and the release is called "Some release name". The final thing to notice is that the message body is encoded with yEnc, a binary-to-text encoding mechanism.
-
-http://en.wikipedia.org/wiki/YEnc
-
-The problem, then, is how to reliably group all these message posts into a data structure from which a tool can download the release. The simple answer is that you can't for everything, as some posts do not follow anything close to the above format. But for most posts to binary groups you can, by following some simple parsing and validation rules. This is what this toolset tries to do: implement a simple set of parsing, grouping and validation rules to build NZB files for releases on binary Usenet groups.
-
-Step 1 - The headers
---------------------
-Message headers are just metadata about the posts that give you enough information to use the messages; this includes things like message ID, date, subject and size.
-
-The first step in building indexes of messages is to download the message headers: not all of them, but enough to extract a full release's worth of data and create an NZB file.
-
-There are a few ways to do this. You can start at the beginning, i.e. the first (oldest) message the group contains, and work forward to the latest (newest) message. Some groups are big and can go back a long way, in some cases years, with millions of messages, so getting all the headers at once and trying to process them would be very time and storage consuming.
-
-A better way is to start at a certain point in time, or at a particular message ID, and work forwards, keeping a block of headers for a fixed time period, e.g. 6 hours. You would download a number of headers, process them, download some more, drop headers older than 6 hours before the most recent header, and repeat over and over. This allows processing a 6-hour window of headers into releases. The advantage is that you only need to store 6 hours of headers locally, just the headers you are currently looking at, thus minimizing storage needs. This approach also allows starting back at the beginning of a group and processing every message in the group over time.
-
-To process live, you just keep the last 6 hours of data and update and process it every 10 minutes.
-
-This process and approach have been implemented in the tool [GetHeaders].
-
-Step 2 - Grouping posts into releases
--------------------------------------
-Once you have a block of headers, the next step is to group the messages together to try to build collections of messages that make up a release. As mentioned above, you use the subject to loosely group the messages. I say loosely, as the subject is free-form and there is no specification a poster needs to adhere to when posting. It is true that a majority of posts do conform to a loose format, but there will be posts that, due to a lack of coherency and detail, are useless for building multi-file releases.
-
-The general idea is to pull all the information from the subject and then use this information to construct, validate and check a release. Let's go back to our example subject above.
-
-**Some release name (1/2) "file.zip" yEnc (1/4)**
-
-From this we can pull the following meta information:
-
-Meta|Value
-----|------
-Release Name|Some release name
-File Number|1
-Total Files|2
-File Name|file.zip
-Part Number|1
-Total Parts|4
-
-From this we can see it is relatively easy to build a data structure to hold this data and allow easy completeness validation. All we need is 3 levels of storage.
-
-Collection[]--&gt;CollectionFile[]--&gt;CollectionFilePart[]
-
-As you parse headers and populate the above data schema, you can check for completeness as it fills up by making sure the correct number of parts is present for each file and the correct number of files is present for the release. When you have finished parsing your 6-hour block of headers, you can check each release for completeness and, if all files and parts are accounted for, write out the NZB file.
-
-In an ideal world that would be all you need to do; unfortunately, the Usenet universe is not ideal. You will have missing parts, duplicate parts, multiple releases with the same name but different file numbers, releases with no name, and any number of variations and combinations of all of these anomalies. So how do you handle them all? The simple answer is you don't. It is always a compromise: how strict you are with the rules you use will affect how valid and usable your extracted releases are. If you are very loose, you will get lots of releases, in some cases down to single-file releases. And if you are very strict, you will only get multi-file releases with good-length release names, but may miss some.
-
-Compiling the downloaded headers into releases using the above process is implemented in [ReleasesExtractor].
-
-Step 3 - Validating and verifying
----------------------------------
-Now you have an NZB, but is it valid? As stated above, releases can be checked for completeness using the total number of files and the total number of parts per file. Checking that all the parts and files are available indicates that there is enough data to complete the release, but it does not indicate whether the release itself is valid; checking things like file names and types, file counts, subject name and poster names can help filter out bogus releases.
-
-The next thing to try is to actually download parts of the release to check file contents. Downloading .nfo files can help identify correct release names and content. Also downloading just parts of certain files can reveal a lot of information about the release. 
-
-The best example of this is downloading the first segment of the .RAR files. From just the first segment, which is usually just a few hundred KB, you can check for the password header, run it through the unrar tool to extract the file list (which will also indicate any passworded files in the RAR with a * in front of the name), and finally try to extract the files. You won't get the full file or files, but you will get enough to check the header if the file is another RAR file.
-
-The number of tests and the level of verification and checking that can be done is large, and with each test invalid and low-quality releases can be excluded.
-
-An NZB validation tool, [ReleasesValidator], is included in the package; this tool has the following testing and validation implemented:
-
-- Release must not contain just a single file that is an NZB file
-- Check release subject for blacklisted RegEx matches
-- Check RAR files for password
-    - Download first segment
-    - Check for password header
-    - Extract file list and look for names starting with *
-    - Extract files and check for rars, if any check them for passwords
-
-Step 4 - Viewing compressed NZBs
---------------------------------
-The above extractor saves any NZB files it creates as compressed files in .gz format. You can extract the contents using WinZip or 7-Zip, but this writes the content to a new file. To view information about an NZB without extracting it first, use the included tool [ViewNzb], which can extract title and file list info from the saved NZB.
-
 Tools
 -----
 For information on how to use any of the command-line tools, check the tool pages below for usage and examples.
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Thu, 21 Feb 2013 23:04:50 -0000</pubDate><guid>https://sourceforge.net352098a630466dbe2419904487cc09d334b10130</guid></item><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v17
+++ v18
@@ -4,8 +4,18 @@

 Written in C#, the tools have been tested on Windows with the native .NET runtimes and on Ubuntu with the Mono .NET runtimes.

-The problem
------------
+NNTP Indexing - How it works
+----------------------------
+This project aims to solve the problem of indexing complex releases posted to newsgroups and building NZB files to allow easy processing of releases.
+
+[How to Index NNTP Usenet Groups]
+
+
+
+
+
+NNTP Indexing - How it works
+----------------------------
 The main problem with binary Usenet groups is that the system was never intended to be used for large file releases. Posts are limited to a few hundred KB of text-only content. To allow posting of large multi-file releases to binary newsgroups, a large amount of processing of the files is needed. This includes splitting files up into parts, encoding the binary data into a text-based format, then uploading these messages to the group. This means some files can be split across hundreds of posts, and with multi-file releases this adds another level of complexity to the problem.

 For more details on Usenet and how it works see http://en.wikipedia.org/wiki/Usenet_newsgroup
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Thu, 21 Feb 2013 23:04:03 -0000</pubDate><guid>https://sourceforge.netc49dcd7880fecd507c2383c202dc2afbd1304bad</guid></item><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v16
+++ v17
@@ -68,6 +68,7 @@
 Compiling the downloaded headers into releases using the above process is implemented in [ReleasesExtractor].

 Step 3 - Validating and verifying
+---------------------------------
 Now you have an NZB, but is it valid? As stated above, releases can be checked for completeness using the total number of files and the total number of parts per file. Checking that all the parts and files are available indicates that there is enough data to complete the release, but it does not indicate whether the release itself is valid; checking things like file names and types, file counts, subject name and poster names can help filter out bogus releases.

 The next thing to try is to actually download parts of the release to check file contents. Downloading .nfo files can help identify correct release names and content. Also downloading just parts of certain files can reveal a lot of information about the release. 
@@ -86,13 +87,17 @@
     - Extract file list and look for names starting with *
     - Extract files and check for rars, if any check them for passwords

+Step 4 - Viewing compressed NZBs
+--------------------------------
+The above extractor saves any NZB files it creates as compressed files in .gz format. You can extract the contents using WinZip or 7-Zip, but this writes the content to a new file. To view information about an NZB without extracting it first, use the included tool [ViewNzb], which can extract title and file list info from the saved NZB.
+
 Tools
 -----
-
+For information on how to use any of the command-line tools, check the tool pages below for usage and examples.

 - [GetHeaders]
 - [ReleasesExtractor]
 - [ReleasesValidator]
-- ViewNzb
+- [ViewNzb]

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Thu, 21 Feb 2013 12:16:50 -0000</pubDate><guid>https://sourceforge.nete909d7a5100a11f87254eaf207b7ff3c631f6b3c</guid></item><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v15
+++ v16
@@ -63,15 +63,36 @@

 As you parse headers and populate the above data schema, you can check for completeness as it fills up by making sure the correct number of parts is present for each file and the correct number of files is present for the release. When you have finished parsing your 6-hour block of headers, you can check each release for completeness and, if all files and parts are accounted for, write out the NZB file.

-In an ideal world that would be all you need to do; unfortunately, the Usenet universe is not ideal. You will have missing parts, duplicate parts, multiple releases with the same name but different file numbers, releases with no name, and any number of variations and combinations of all of these anomalies. So how do you handle them all? The simple answer is you don't. It is always a compromise: how strict you are with the rules you use will affect how valid and usable your extracted releases are. If you are very loose, you will get lots of releases, in some cases down to single-file releases. And if you are very strict, you will only get multi-file releases with good-length release names.
+In an ideal world that would be all you need to do; unfortunately, the Usenet universe is not ideal. You will have missing parts, duplicate parts, multiple releases with the same name but different file numbers, releases with no name, and any number of variations and combinations of all of these anomalies. So how do you handle them all? The simple answer is you don't. It is always a compromise: how strict you are with the rules you use will affect how valid and usable your extracted releases are. If you are very loose, you will get lots of releases, in some cases down to single-file releases. And if you are very strict, you will only get multi-file releases with good-length release names, but may miss some.
+
+Compiling the downloaded headers into releases using the above process is implemented in [ReleasesExtractor].
+
+Step 3 - Validating and verifying
+Now you have an NZB, but is it valid? As stated above, releases can be checked for completeness using the total number of files and the total number of parts per file. Checking that all the parts and files are available indicates that there is enough data to complete the release, but it does not indicate whether the release itself is valid; checking things like file names and types, file counts, subject name and poster names can help filter out bogus releases.
+
+The next thing to try is to actually download parts of the release to check file contents. Downloading .nfo files can help identify correct release names and content. Also downloading just parts of certain files can reveal a lot of information about the release. 
+
+The best example of this is downloading the first segment of the .RAR files. From just the first segment, which is usually just a few hundred KB, you can check for the password header, run it through the unrar tool to extract the file list (which will also indicate any passworded files in the RAR with a * in front of the name), and finally try to extract the files. You won't get the full file or files, but you will get enough to check the header if the file is another RAR file.
+
+The number of tests and the level of verification and checking that can be done is large, and with each test invalid and low-quality releases can be excluded.
+
+An NZB validation tool, [ReleasesValidator], is included in the package; this tool has the following testing and validation implemented:
+
+- Release must not contain just a single file that is an NZB file
+- Check release subject for blacklisted RegEx matches
+- Check RAR files for password
+    - Download first segment
+    - Check for password header
+    - Extract file list and look for names starting with *
+    - Extract files and check for rars, if any check them for passwords

 Tools
 -----

-- GetHeaders
-- ReleasesExtractor
-- ReleasesValidator
+- [GetHeaders]
+- [ReleasesExtractor]
+- [ReleasesValidator]
 - ViewNzb

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Thu, 21 Feb 2013 12:04:23 -0000</pubDate><guid>https://sourceforge.netebd45fdd5cc518056c76e8340ea183e5ed89e4c8</guid></item><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v14
+++ v15
@@ -57,6 +57,13 @@
 Part Number|1
 Total Parts|4

+From this we can see it is relatively easy to build a data structure to hold this data and allow easy completeness validation. All we need is 3 levels of storage.
+
+Collection[]--&gt;CollectionFile[]--&gt;CollectionFilePart[]
+
+As you parse headers and populate the above data schema, you can check for completeness as it fills up by making sure the correct number of parts is present for each file and the correct number of files is present for the release. When you have finished parsing your 6-hour block of headers, you can check each release for completeness and, if all files and parts are accounted for, write out the NZB file.
+
+In an ideal world that would be all you need to do; unfortunately, the Usenet universe is not ideal. You will have missing parts, duplicate parts, multiple releases with the same name but different file numbers, releases with no name, and any number of variations and combinations of all of these anomalies. So how do you handle them all? The simple answer is you don't. It is always a compromise: how strict you are with the rules you use will affect how valid and usable your extracted releases are. If you are very loose, you will get lots of releases, in some cases down to single-file releases. And if you are very strict, you will only get multi-file releases with good-length release names.

 Tools
 -----
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Mon, 18 Feb 2013 22:27:02 -0000</pubDate><guid>https://sourceforge.net2b0a35960aef0a5d37344729316ba92a6219977a</guid></item><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v13
+++ v14
@@ -16,7 +16,7 @@

 Example Message Subject:

-Some release name (1/2) "file.zip" yEnc (1/4)
+**Some release name (1/2) "file.zip" yEnc (1/4)**

 As you can see, the subject has enough information to tell you which part of which file this message is from. In the above case, this message holds data for part 1 of the 4 parts that make up the file called file.zip. This file is 1 of 2 files in the release, and the release is called "Some release name". The final thing to notice is that the message body is encoded with yEnc, a binary-to-text encoding mechanism.

@@ -44,16 +44,18 @@

 The general idea is to pull all the information from the subject and then use this information to construct, validate and check a release. Let's go back to our example subject above.

-Some release name (1/2) "file.zip" yEnc (1/4)
+**Some release name (1/2) "file.zip" yEnc (1/4)**

 From this we can pull the following meta information:

-Release Name : Some release name
-File Number : 1
-Total Files : 2
-File Name : file.zip
-Part Number : 1
-Total Parts : 4
+Meta|Value
+----|------
+Release Name|Some release name
+File Number|1
+Total Files|2
+File Name|file.zip
+Part Number|1
+Total Parts|4

 Tools
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Mon, 18 Feb 2013 21:56:39 -0000</pubDate><guid>https://sourceforge.net4cb5f7aad7037077323a3f1465fe21e6f858defd</guid></item><item><title>WikiPage Home modified by Shaun</title><link>https://sourceforge.net/p/nntpit/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v12
+++ v13
@@ -40,6 +40,20 @@

 Step 2 - Grouping posts into releases
 -------------------------------------
+Once you have a block of headers, the next step is to group the messages together to try to build collections of messages that make up a release. As mentioned above, you use the subject to loosely group the messages. I say loosely, as the subject is free-form and there is no specification a poster needs to adhere to when posting. It is true that a majority of posts do conform to a loose format, but there will be posts that, due to a lack of coherency and detail, are useless for building multi-file releases.
+
+The general idea is to pull all the information from the subject and then use this information to construct, validate and check a release. Let's go back to our example subject above.
+
+Some release name (1/2) "file.zip" yEnc (1/4)
+
+From this we can pull the following meta information:
+
+Release Name : Some release name
+File Number : 1
+Total Files : 2
+File Name : file.zip
+Part Number : 1
+Total Parts : 4

 Tools
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Shaun</dc:creator><pubDate>Mon, 18 Feb 2013 21:51:28 -0000</pubDate><guid>https://sourceforge.nete942b4dc4af936b3721218a3c4228e5187fd5c3d</guid></item></channel></rss>