File Date Author Commit
 META-INF 2014-12-24 James Frost [550728] Updated documentation + Added license
 src 2015-01-22 James Frost [a5a15c] Tweaked how access denied is handled
 .gitignore 2014-12-24 James Frost [b07bd0] Added .gitignore
 LICENSE 2014-12-24 James Frost [550728] Updated documentation + Added license
 README.md 2015-01-02 James Frost [ae85f9] Update README.md

Read Me

robots.io

Robots.io is a Java library designed to make parsing a website's robots.txt file easy.

How to use

The RobotsParser class provides all the functionality to use robots.io.

The Javadoc for Robots.io can be found here.

Examples

Connecting

To parse the robots.txt for Google with the User-Agent string "test":

RobotsParser robotsParser = new RobotsParser("test");
robotsParser.connect("http://google.com");

Alternatively, to parse with no User-Agent, use the constructor with no arguments.

You can also pass a URL that includes a path.

robotsParser.connect("http://google.com/example.htm"); //This would also be valid
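Because a robots.txt file lives at the root of a site, a URL with a path can be reduced to its scheme and host before the file is fetched. The sketch below illustrates that idea with the standard java.net.URI class. It is not taken from the robots.io source; the class name RobotsLocationSketch and the method robotsTxtFor are hypothetical names used only for illustration.

```java
import java.net.URI;

final class RobotsLocationSketch {

    // Hypothetical helper: reduce any absolute URL to the location of the
    // robots.txt file for that site.
    static String robotsTxtFor(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getAuthority() == null) {
            throw new IllegalArgumentException("Expected an absolute URL: " + url);
        }
        // Keep only scheme and authority; path, query and fragment are dropped.
        return uri.getScheme() + "://" + uri.getAuthority() + "/robots.txt";
    }

    public static void main(String[] args) {
        // Both calls resolve to the same robots.txt location.
        System.out.println(robotsTxtFor("http://google.com"));
        System.out.println(robotsTxtFor("http://google.com/example.htm"));
    }
}
```

Under this assumption, connecting with a bare domain and connecting with a page URL on that domain would read the same rules.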

Note: All methods accept a domain either as a String or as a URL object.

Querying

To check if a URL is allowed:

robotsParser.isAllowed("http://google.com/test"); //Returns true if allowed

Or, to get all the rules parsed from the file:

robotsParser.getDisallowedPaths(); //This will return an ArrayList of Strings

The parsed results are cached in the robotsParser object until connect() is called again, which overwrites the previously parsed data.
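To make the two queries above concrete, here is a minimal, self-contained sketch of how Disallow rules could be collected from a robots.txt body and then used for a prefix check. It is an illustration only, not the robots.io implementation: the class RobotsRulesSketch and its methods are hypothetical, and the sketch deliberately ignores Allow lines and wildcard patterns and treats the `*` group as applying alongside any group that names the agent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RobotsRulesSketch {

    // Collect Disallow paths from every group that names userAgent or "*".
    static List<String> disallowedPaths(String robotsTxt, String userAgent) {
        List<String> paths = new ArrayList<>();
        boolean applies = false;      // does the current group apply to this agent?
        boolean inAgentLines = false; // inside a run of consecutive User-agent lines?
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean match = value.equals("*") || value.equalsIgnoreCase(userAgent);
                // Consecutive User-agent lines share one group of rules.
                applies = inAgentLines ? (applies || match) : match;
                inAgentLines = true;
            } else {
                inAgentLines = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    paths.add(value);
                }
            }
        }
        return paths;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Keeping the parsed list separate from the check mirrors the caching described above: the file is parsed once, and each later query only scans the stored rules.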

Politeness

In the event that all access is denied, a RobotsDisallowedException will be thrown.
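One plausible reading of "all access is denied" is a rule of `Disallow: /` for the current User-Agent, which blocks every path on the site. The sketch below shows that check and how a caller might react. It makes several assumptions: AccessDeniedException is a hypothetical stand-in for the library's RobotsDisallowedException (whose package, and whether it is checked, are not stated here), and requireSomeAccess is an illustrative helper, not a robots.io method.

```java
import java.util.Arrays;
import java.util.List;

final class PolitenessSketch {

    // Hypothetical stand-in for the library's RobotsDisallowedException.
    static final class AccessDeniedException extends Exception {
        AccessDeniedException(String message) {
            super(message);
        }
    }

    // Throws if any rule is exactly "/", which disallows every path.
    static void requireSomeAccess(List<String> disallowRules) throws AccessDeniedException {
        for (String rule : disallowRules) {
            if (rule.trim().equals("/")) {
                throw new AccessDeniedException("robots.txt disallows all access");
            }
        }
    }

    public static void main(String[] args) {
        try {
            requireSomeAccess(Arrays.asList("/private/", "/"));
            System.out.println("Crawling permitted");
        } catch (AccessDeniedException e) {
            // A polite crawler stops here instead of requesting any page.
            System.out.println("Skipping site: " + e.getMessage());
        }
    }
}
```

Signalling the condition with an exception, rather than returning an empty result, makes it hard for a crawler to ignore a site-wide ban by accident.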

URL Normalisation

Domains passed to RobotsParser are normalised to always end in a forward slash.
Disallowed paths returned by the parser never begin with a forward slash.
This is so that URLs can easily be constructed. For example:

robotsParser.getDomain() + robotsParser.getDisallowedPaths().get(0); // http://google.com/example.htm
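The two normalisation rules can be sketched as a pair of small string helpers. This is an illustration of the stated behaviour, not the library's code; NormalisationSketch, normaliseDomain and normalisePath are hypothetical names.

```java
final class NormalisationSketch {

    // Ensure the domain ends with exactly the slash that joins it to a path.
    static String normaliseDomain(String domain) {
        return domain.endsWith("/") ? domain : domain + "/";
    }

    // Remove any leading slashes so the path can be appended to the domain.
    static String normalisePath(String path) {
        return path.replaceFirst("^/+", "");
    }

    public static void main(String[] args) {
        String url = normaliseDomain("http://google.com") + normalisePath("/example.htm");
        System.out.println(url); // http://google.com/example.htm
    }
}
```

With the slash owned by the domain and never by the path, plain concatenation cannot produce a doubled or missing separator.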

Licensing

Robots.io is distributed under the GPL.