#124 WebCrawler encoding problems on youtube

1.5.0 - bugs

A user has reported a problem with encoding support in the WebCrawler.

He extracted the content of the following website


with the following code:

InputStream stream = new URL(uri).openStream();
ApertureRuntime ar = new ApertureRuntime();
RDFContainer container = ar.extractFrom(stream, uri);

and later wanted to use the title string


which had broken encoding (as seen in the Eclipse debugger).


  • Antoni Mylka

    Antoni Mylka - 2010-03-12

    Fixed in rev 2295. The issue is that YouTube does not declare the encoding in the HTML itself, only in the HTTP headers. This works for browsers and for the WebCrawler, which uses HttpAccessor, which in turn knows that the charset from the HTTP headers should be passed on to the HtmlExtractor. The problem was that the extractFrom(InputStream) method had no way of knowing about the HTTP headers.

    I added a new method, extractFrom(String url), which uses the accessor and works correctly in this case.


    fixed, but use

    RDFContainer container = ar.extractFrom(uri);
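
    For anyone who has to keep working with a raw stream, the idea behind the fix can be sketched in plain Java. This is only an illustration of reading the charset from a Content-Type header value and using it to decode the body; it is not the Aperture code from rev 2295. The class name HeaderCharsetSketch and the method charsetFromContentType are hypothetical. With java.net.URLConnection the header value is available from getContentType().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Aperture.
class HeaderCharsetSketch {

    /**
     * Returns the charset named in a Content-Type header value,
     * or the fallback if the header is missing, has no charset
     * parameter, or names a charset this JVM does not support.
     */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Simulated response: a UTF-8 body whose encoding is declared
        // only in the header, not in the HTML.
        String header = "text/html; charset=utf-8";
        byte[] body = "<title>\u00dcbung</title>".getBytes(StandardCharsets.UTF_8);

        Charset fromHeader = charsetFromContentType(header, StandardCharsets.ISO_8859_1);
        Charset withoutHeader = charsetFromContentType(null, StandardCharsets.ISO_8859_1);

        System.out.println("with header:    " + new String(body, fromHeader));
        System.out.println("without header: " + new String(body, withoutHeader));
    }
}
```

    The ISO-8859-1 fallback is an assumption made for this sketch; the second line printed by main shows the kind of garbled title that results when the header charset is ignored.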

  • Antoni Mylka

    Antoni Mylka - 2010-03-12
    • milestone: --> 1.5.0 - bugs
    • assigned_to: nobody --> mylka
    • status: open --> closed-fixed
