Hi Megan,

We ran into the following while using some of the FLOSSmole scrape data:

It looks like the data extraction tools used in the FLOSSmole project have a bug related to the source page's encoding.

As an example:
In SF.net for the project "cemu" (or csharpemu), the text description on the page http://sourceforge.net/projects/csharpemu is:
        Interpreter języka C# w języku JAVA Interpreter of language C# in JAVA


But the FLOSSmole data has this as:
        csharpemu       Interpreter j?zyka C# w j?zyku JAVA Interpreter of language C# in JAVA        28      2006-08-07 14:21:24

It looks like the source encoding of the page, UTF-8, is not being honored. In UTF-8 the character "ę" is the two bytes 0xC4 0x99, but in the string "j?zyka" the two bytes following the 'j' are 0xC4 0x3F (0x3F being '?'), which is invalid UTF-8.
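
To make the byte-level argument concrete, here is a small Java sketch. The class and method names are mine and purely illustrative; this is not FLOSSmole code. It only shows that "ę" is two bytes in UTF-8 and that decoding those bytes as ISO-8859-1 yields two characters instead of one. How the second character ended up as '?' in the stored data is something I have not traced.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows what happens when UTF-8 bytes are decoded as ISO-8859-1.
class EncodingMismatchDemo {
    // Formats bytes as space-separated upper-case hex, e.g. "C4 99".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text as UTF-8, then decodes it with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String word = "j\u0119zyka"; // "języka", 6 characters
        // The single character U+0119 is two bytes in UTF-8.
        System.out.println(hex("\u0119".getBytes(StandardCharsets.UTF_8))); // C4 99
        // Decoded as ISO-8859-1, each byte becomes its own character.
        System.out.println(misdecode(word).length()); // 7
    }
}
```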

Doing a curl -v on the URL above shows that the response content-type is text/html with no charset information. Inside of the HTML you get:

<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />

In FLOSSmole at:
 http://krugle.com/kse/files/cvs/cvs.sourceforge.net/ossmole/OSSmoleJava/ossmole-tools/src/net/sf/ossmole/tools/scraper/NormalScraper.java

...you'll see that the HTML is read via a readStream() call, which (I think) is going to assume the default locale encoding, or ISO-8859-1.

In Nutch you'll see the code here:

http://krugle.com/kse/files/svn/svn.apache.org/lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java

...that sniffs for the charset pattern inside of a meta tag with a content-type attribute.

So something similar would be needed for the scraper.
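
As a rough idea of what that could look like, here is a minimal sketch. It is not the Nutch implementation and not a patch against NormalScraper.java; the class name, method names, regex and the 2048-byte sniff window are all my own assumptions. It reads the first part of the page as ISO-8859-1 (which never fails and keeps ASCII tag text intact), looks for a charset inside a meta tag, and decodes the whole page with that charset, falling back to a caller-supplied default. If the HTTP Content-Type header did carry a charset, that should normally win; in the case above it does not carry one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class MetaCharsetSniffer {
    // Only inspect the start of the page; the meta tag lives in <head>.
    private static final int SNIFF_LIMIT = 2048;

    // Matches charset=NAME anywhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in a meta tag, or the fallback if none is usable.
    static Charset sniff(byte[] page, Charset fallback) {
        int len = Math.min(page.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so this decode cannot fail.
        String head = new String(page, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    // Decodes the full page using the sniffed charset.
    static String decode(byte[] page, Charset fallback) {
        return new String(page, sniff(page, fallback));
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"content-type\" "
            + "content=\"text/html; charset=utf-8\" /></head>"
            + "<body>Interpreter j\u0119zyka C#</body></html>")
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(page, StandardCharsets.ISO_8859_1));
        System.out.println(decode(page, StandardCharsets.ISO_8859_1).contains("j\u0119zyka"));
    }
}
```

The scraper would need to hold the raw bytes (rather than an already-decoded String) for this to work, since the damage happens at the point where bytes are turned into characters.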

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"