I am just beginning to use the parser (great, by the way), but I want to limit my parsing to English-only pages. Ideally I would like to receive an error when I pass a URL pointing to a page in any other language.
Given the functionality in the parser, I am guessing there is an easy way to do this and I am just missing it...
Thanks in advance.
You could look for a lang="en" attribute on the <HTML> tag (where it is most often declared) or on the <HEAD> or, more specifically, the <BODY> tags, but my guess is that the vast majority of pages in the wild don't specify this.
So it's not an easy determination. One indicator would be the character encoding, but it isn't a very good one, because a number of other languages (French, German and Spanish, for example) can also use the ISO-8859-1 encoding, which is the default for English (and HTML).
As a rough approximation, to eliminate obviously non-English pages such as Chinese and Russian ones, check the encoding after opening the connection and preferably after parsing the header, which may specify the encoding via the <META> tag:
if ("ISO-8859-1".equalsIgnoreCase (parser.getEncoding ()))
{
    // ... do your processing
}
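To combine the two hints above (the lang attribute when it is present, the encoding otherwise), something like the following might work. This is only a sketch: the class EnglishPageFilter and the method isLikelyEnglish are hypothetical names, not part of the parser, and the rule "trust lang first, then fall back to ISO-8859-1" is an assumption about what counts as good enough.

```java
import java.util.Locale;

// Hypothetical helper, not part of the parser library.
final class EnglishPageFilter {
    private EnglishPageFilter() {}

    /**
     * Heuristic only: returns true if the page is probably English.
     * langAttr is the value of a lang attribute (may be null if absent);
     * encoding is the page's character encoding (may be null).
     */
    static boolean isLikelyEnglish(String langAttr, String encoding) {
        if (langAttr != null && !langAttr.trim().isEmpty()) {
            // Compare only the primary subtag, so "en", "en-US" and "EN_gb" all match.
            String primary = langAttr.trim().toLowerCase(Locale.ROOT).split("[-_]", 2)[0];
            return primary.equals("en");
        }
        // No lang attribute: fall back to the rough encoding check.
        // This still lets other ISO-8859-1 languages (French, German, ...) through.
        return encoding != null && encoding.trim().equalsIgnoreCase("ISO-8859-1");
    }
}
```

You would call it as isLikelyEnglish (langValue, parser.getEncoding ()), where langValue is whatever lang value you managed to pull from the page (or null), and raise your own error when it returns false.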