Hi Megan,

We ran into the following while using some of the FLOSSmole scrape data:

It looks like the data extraction tools used in the FLOSSmole project have a bug related to the source page's encoding.

As an example:
On SF.net, for the project "cemu" (or csharpemu), the text description on the page http://sourceforge.net/projects/csharpemu is:
        Interpreter języka C# w języku JAVA Interpreter of language C# in JAVA

But the FLOSSmole data has this as:
        csharpemu       Interpreter j?zyka C# w j?zyku JAVA Interpreter of language C# in JAVA        28      2006-08-07 14:21:24

It looks like the source encoding of the page, UTF-8, is not being honored (for the string "j?zyka", the two bytes following the 'j' show up as "?" and are 0xC43F, which is invalid UTF-8).
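
To make the byte-level claim concrete, here is a small Python sketch. The first two parts only restate how UTF-8 works. The last part is an assumption on my side: it shows one decode/re-encode path (Windows-1252 in, Latin-1 out) that happens to produce exactly these bytes, and I have not confirmed that this is what the FLOSSmole tools actually do.

```python
# "ę" (U+0119) takes two bytes in UTF-8: 0xC4 0x99.
good = "j\u0119zyka".encode("utf-8")
print(good)  # b'j\xc4\x99zyka'

# The bytes seen in the FLOSSmole data: 0xC4 followed by "?" (0x3F).
# 0xC4 is a UTF-8 lead byte and must be followed by a continuation
# byte in the range 0x80-0xBF, so this sequence cannot be decoded.
bad = b"j\xc4?zyka"
try:
    bad.decode("utf-8")
    bad_is_valid_utf8 = True
except UnicodeDecodeError:
    bad_is_valid_utf8 = False
print(bad_is_valid_utf8)  # False

# ASSUMPTION (illustrative only): if the UTF-8 bytes are misread as
# Windows-1252 and then written out as Latin-1, 0x99 becomes the
# trademark sign, which Latin-1 cannot represent, so it turns into "?".
mangled = good.decode("cp1252").encode("latin-1", errors="replace")
print(mangled == bad)  # True
```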

Doing a curl -v on the URL above shows that the response content-type is text/html with no charset information. Inside of the HTML you get:

<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<meta http-equiv="content-type" content="text/html; charset=utf-8" />

In FLOSSmole at:

...you'll see that the HTML is read via a readStream() call, which (I think) is going to assume the default locale encoding, or ISO-8859-1.

In Nutch you'll see the code here:


...that sniffs for the charset pattern inside of a meta tag with a content-type attribute.

So something similar would be needed for the scraper.
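
A rough sketch of what that sniffing could look like, written in Python purely for illustration. This is neither the Nutch code nor the FLOSSmole code: the names sniff_charset and decode_page, the 2048-byte scan window, and the ISO-8859-1 fallback are all assumptions of mine.

```python
import re

# Hypothetical pattern: a <meta> tag with http-equiv="content-type"
# whose content attribute carries a charset=... value.
META_CHARSET = re.compile(
    rb"""<meta[^>]*http-equiv\s*=\s*["']?content-type["']?[^>]*charset\s*=\s*["']?([\w.:-]+)""",
    re.IGNORECASE,
)


def sniff_charset(raw, header_charset=None, default="iso-8859-1", scan_bytes=2048):
    """Pick a charset: HTTP header first, then the meta tag, then a default."""
    if header_charset:
        return header_charset
    # Only look near the top of the page, where the meta tag should be.
    match = META_CHARSET.search(raw[:scan_bytes])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def decode_page(raw, header_charset=None):
    """Decode raw page bytes with the sniffed charset."""
    charset = sniff_charset(raw, header_charset)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: Latin-1 can decode any byte.
        return raw.decode("iso-8859-1")
```

For a page like the one above, where the HTTP response carries no charset, sniff_charset(raw) would return "utf-8" from the meta tag, and decode_page(raw) would then keep the "ę" intact instead of mangling it.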

-- Ken
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"