Simple Question - Parse English only pages

Help
mfkilgore
2006-02-25
2013-04-27
  • mfkilgore

    mfkilgore - 2006-02-25

    I am just beginning to use the parser (great, by the way), but I want to limit my parsing to English-only pages. Ideally I would like to receive an error when I pass a URL pointing to a page in any other language.

    Given the functionality in the parser, I am guessing there is an easy way to do this and I am just missing it...

    Thanks in advance.

     
    • Derrick Oswald

      Derrick Oswald - 2006-02-26

      You could look for a lang="en" attribute on the <HEAD> or, more specifically, the <BODY> tags, but my guess is that 99% of the pages in the wild don't specify this.
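
      For what it's worth, the lang check could be sketched roughly like this. It is a stdlib-only illustration, not HTML Parser API: the class and method names are made up, it works on the raw page text with a regular expression, and it only handles quoted attribute values. A page with no lang attribute is reported as "not declared English", which by the guess above will be most pages.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the HTML Parser library.
final class LangAttributeCheck {
    // Finds a quoted lang (or xml:lang) attribute inside an opening
    // <html>, <head> or <body> tag and captures its value.
    private static final Pattern LANG = Pattern.compile(
        "<(?:html|head|body)\\b[^>]*?\\s(?:xml:)?lang\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared language code in lower case, if any.
    static Optional<String> declaredLanguage(String html) {
        Matcher m = LANG.matcher(html);
        return m.find()
            ? Optional.of(m.group(1).trim().toLowerCase(Locale.ROOT))
            : Optional.empty();
    }

    // True only when the page explicitly declares "en" or a regional
    // variant such as "en-US"; a missing attribute counts as false.
    static boolean isDeclaredEnglish(String html) {
        return declaredLanguage(html)
            .map(lang -> lang.equals("en") || lang.startsWith("en-"))
            .orElse(false);
    }
}
```

      Treating a missing attribute as a separate "unknown" case (via declaredLanguage) is probably more useful than rejecting outright, since so few pages declare a language at all.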

      So it's not an easy determination. One indicator would be the character encoding, but it wouldn't be very good, because a number of languages can use the ISO-8859-1 encoding, which is the default for English (and HTML).

      As a rough approximation, to eliminate obviously non-English pages such as Chinese and Russian ones, check the encoding after opening the connection, and preferably after parsing the header, which may specify the encoding via a <META> tag:

      if ("ISO-8859-1".equalsIgnoreCase (parser.getEncoding ()))
      {
          // ... do your processing
      }
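
      One caveat with comparing the name as a literal string: the same encoding can be reported under aliases such as "latin1". A minimal sketch of a more tolerant comparison, using only java.nio.charset, might look like the following. The class and method names are hypothetical, and the string passed in is assumed to be whatever the parser reports as the page encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of the HTML Parser library.
final class EncodingFilter {
    // True when the given encoding name resolves to ISO-8859-1,
    // including aliases and any letter case. Unknown, malformed or
    // null names are treated as "not ISO-8859-1".
    static boolean isLatin1(String encodingName) {
        if (encodingName == null) {
            return false;
        }
        try {
            return Charset.forName(encodingName.trim())
                          .equals(StandardCharsets.ISO_8859_1);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return false;
        }
    }
}
```

      This is still only the coarse filter described above: French or German pages in ISO-8859-1 pass it, and English pages served as UTF-8 do not.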

       
