#124 WebCrawler encoding problems on youtube

1.5.0 - bugs
closed-fixed
None
5
2010-03-12
2010-03-12
No

A user has reported a problem with encoding support in the WebCrawler.

He extracted the content of the following web page

http://www.youtube.com/watch?v=C9zS67ZCkw8

with the following code

InputStream stream = new URL(uri).openStream();
ApertureRuntime ar = new ApertureRuntime();
RDFContainer container = ar.extractFrom(stream, uri);

Later he wanted to use the title string

container.getString(NIE.title);

which had broken encoding (as seen in the Eclipse debugger).

Discussion

  • Antoni Mylka

    Antoni Mylka - 2010-03-12

    Fixed in rev 2295. The issue is that YouTube does not declare the encoding in the HTML itself, only in the HTTP headers. This works fine for browsers and for the WebCrawler: the WebCrawler uses HttpAccessor, which knows that the charset received in the HTTP headers has to be passed on to the HtmlExtractor. The problem was that the extractFrom(InputStream) method only sees the raw stream and knows nothing about HTTP headers.

    I added a new method, extractFrom(String url), which uses the accessor and works correctly in this case.

    So:

    fixed, but use

    RDFContainer container = ar.extractFrom(uri);

     
  • Antoni Mylka

    Antoni Mylka - 2010-03-12
    • milestone: --> 1.5.0 - bugs
    • assigned_to: nobody --> mylka
    • status: open --> closed-fixed
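
For readers unfamiliar with the underlying mechanism, here is a minimal sketch of why the HTTP header matters when a page does not declare its encoding in the HTML. It is not Aperture code: the class CharsetFromHeader and its methods are hypothetical names used for illustration, and the ISO-8859-1 fallback is an assumption made for this example. It uses only the Java standard library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Aperture: shows how a charset taken from
// an HTTP Content-Type header changes the way the same bytes are decoded.
public class CharsetFromHeader {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=utf-8", or the fallback if none is usable.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" parameter name.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decodes a response body using the header charset if present.
    // Assumption for this sketch: fall back to ISO-8859-1 otherwise.
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetFromContentType(contentType, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        byte[] body = "Caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // With the header, the text is decoded as intended.
        System.out.println(decode(body, "text/html; charset=utf-8"));
        // Without the header, the same bytes come out garbled.
        System.out.println(decode(body, null));
    }
}
```

A caller that only has the raw InputStream is in the position of the second call: the header is gone by the time the bytes are decoded, which matches the behaviour described in this ticket.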
     
