#1623 missing } after property list error when getPage

2.15
closed
None
1
2015-06-12
2014-07-09
yunyan
No

The error stack is:

Caused by: com.gargoylesoftware.htmlunit.ScriptException: missing } after property list (http://diviner.jd.com/diviner?sku=733425&p=102004&lid=1&lim=6&uuid=685364669&ec=utf-8&c1=9847&c2=9848&c3=9866&callback=jsonp1404896164207&_=1404896171731#1)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:705)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:620)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:545)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadJavaScriptFromUrl(HtmlPage.java:1167)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1055)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:393)
at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:274)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:750)

Here is the case to reproduce the problem:

String url = "http://item.jd.com/351851.html";
final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
webClient.getCurrentWindow().setInnerHeight(60000);
webClient.waitForBackgroundJavaScript(5000);
WebRequest webRequest = new WebRequest(new URL(url));
final HtmlPage page = webClient.getPage(webRequest);

Discussion

  • Ahmed Ashour

    Ahmed Ashour - 2014-07-09
    • status: open --> accepted
    • assigned_to: Ahmed Ashour
     
  • Ahmed Ashour

    Ahmed Ashour - 2014-07-11

    Help is needed! I fail to understand the browsers' behavior.

    With a string like "'黄'" (UTF-8 bytes read as "GBK"), Java and other tools convert the Chinese character to two characters, which is correct, but they don't include the trailing apostrophe.

    Chrome and FF, however, return the trailing apostrophe, so the result has 4 characters.

    I am not sure how to proceed with that.

    Please have a look at XMLHttpRequestTest:
    - .overrideMimeType_charset_all()
    - .java_encoding()
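
    The situation described above can be reproduced outside HtmlUnit with a few lines of plain Java. This is only a minimal sketch: the class and method names are made up for illustration, and `\u9EC4` is the escaped form of 黄. It encodes the string as UTF-8 and then decodes those bytes as GBK, printing each resulting code point. Whether the trailing apostrophe survives depends on the JRE in use, as the final comment in this ticket notes, so no expected output is given.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HtmlUnit.
class GbkMisdecodeDemo {

    /** Encodes the text as UTF-8, which is what the server in this ticket sends. */
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes raw bytes as GBK; malformed sequences are replaced by the decoder. */
    static String decodeAsGbk(byte[] bytes) {
        return new String(bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        byte[] bytes = utf8Bytes("'\u9EC4'");   // apostrophe, 黄, apostrophe
        String decoded = decodeAsGbk(bytes);
        System.out.println("UTF-8 byte count: " + bytes.length);
        System.out.println("Decoded length:   " + decoded.length());
        for (char c : decoded.toCharArray()) {
            // Print code points rather than glyphs so the console encoding does not matter.
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

    Comparing the decoded length across JRE versions is a quick way to see whether the decoder swallows the apostrophe.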

     
  • Ahmed Ashour

    Ahmed Ashour - 2014-07-11
    • assigned_to: Ahmed Ashour --> nobody
     
  • Marc Guillemot

    Marc Guillemot - 2014-07-11

    @Ahmed
    Is it possible that in the test overrideMimeType_charset_all, FF and Chrome detect that the response is UTF-8 encoded and ignore overrideMimeType?

     
  • Ahmed Ashour

    Ahmed Ashour - 2014-07-11

    @Marc

    Good question.

    I added other test cases before, to make sure we have the correct assumption.

    Please look into overrideMimeType_charset_empty(), and other .overrideMimeType_charsetXYZ sibling tests.

     
  • Hartmut Arlt

    Hartmut Arlt - 2014-07-11

    @Ahmed
    I took a look at the wrapped response, and its data is definitely UTF-8 encoded (and includes the trailing apostrophe), which causes trouble when it is read as a GBK-encoded string.
    The server returns the data UTF-8 encoded, which is totally right, but I guess that HtmlUnit needs to wrap the input stream when the mime-type is overridden and convert the data from UTF-8 to GBK.
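
    The kind of stream wrapping suggested here could look roughly like the following. This is a sketch under stated assumptions, not HtmlUnit's implementation: `TranscodingStreams`, `transcode` and `transcodeBytes` are hypothetical names, and the whole response is buffered in memory for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper; not an HtmlUnit class.
class TranscodingStreams {

    /** Reads 'in' as text in charset 'from' and returns a stream of the same text encoded in 'to'. */
    static InputStream transcode(InputStream in, Charset from, Charset to) throws IOException {
        Reader reader = new InputStreamReader(in, from);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, to);
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // push any buffered characters into 'out'
        return new ByteArrayInputStream(out.toByteArray());
    }

    /** Convenience wrapper around transcode() for byte arrays. */
    static byte[] transcodeBytes(byte[] data, Charset from, Charset to) throws IOException {
        InputStream result = transcode(new ByteArrayInputStream(data), from, to);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = result.read()) != -1) {
            out.write(b);
        }
        return out.toByteArray();
    }
}
```

    A production version would more likely transcode lazily instead of buffering the full body.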

     
  • Ahmed Ashour

    Ahmed Ashour - 2014-07-11

    Thanks Hartmut for looking into this.

    Yes, the server returns UTF-8, but without a BOM, and the browsers convert it to GBK (since the originating page is GBK).

    The problem is that if that UTF-8 is converted to GBK in Java (and other third-party tools), the apostrophe is gone.

    Please look into the cases committed in XMLHttpRequestTest.

    I suspect that UTF-8 encodes each Chinese character as 3 bytes (Arabic characters take 2 bytes, for example), while GBK expects 2-byte boundaries.

    I will look deeper; I hope we don't have to override the GBK encoding :((

     
  • Hartmut Arlt

    Hartmut Arlt - 2014-07-11

    @Ahmed
    What I was trying to explain is the following:
    Assume you have an arbitrary string str and now you call str.getBytes("UTF-8") on it. Then, the resulting byte sequence has to be decoded with UTF-8 again in order to get the input string str back.

    In the context of this issue, the server's response has to be decoded as UTF-8 first (since the server encoded it as UTF-8) and encoded as GBK afterwards in order to get the correct byte sequence.

    BTW, you're right about the encoded Chinese character: it consumes 3 bytes in UTF-8 encoding and 2 bytes in GBK encoding.
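
    The round trip and the byte counts mentioned above can be checked with a short, self-contained sketch. The class and method names are illustrative only, and `\u9EC4` is the escaped form of 黄.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are for illustration only.
class TranscodeDemo {

    private static final Charset GBK = Charset.forName("GBK");

    /** Number of bytes the text occupies in the given charset. */
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** Decodes UTF-8 bytes to a String first, then re-encodes that String as GBK. */
    static byte[] utf8ToGbk(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(GBK);
    }

    public static void main(String[] args) {
        String huang = "\u9EC4";
        System.out.println("UTF-8 bytes: " + byteCount(huang, StandardCharsets.UTF_8));
        System.out.println("GBK bytes:   " + byteCount(huang, GBK));

        // Decoding with the charset that was used for encoding returns the original string.
        String quoted = "'" + huang + "'";
        byte[] gbk = utf8ToGbk(quoted.getBytes(StandardCharsets.UTF_8));
        System.out.println("Round trip ok: " + new String(gbk, GBK).equals(quoted));
    }
}
```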

     
  • Ahmed Ashour

    Ahmed Ashour - 2014-08-01

    I guess I couldn't find a way.

    Have a look at:

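        import java.io.Reader;
        import java.io.StringReader;
        import java.nio.ByteBuffer;
        import java.nio.CharBuffer;
        import java.nio.charset.Charset;
        import java.nio.charset.MalformedInputException;

        // The wrapper class name is arbitrary; it only makes the snippet compilable.
        public class GbkDecodeAttempt {
            public static void main(String[] args) throws Exception {
                Reader reader = new StringReader("黄'");
                Charset originalCharset = Charset.forName("UTF-8");
                Charset charset = Charset.forName("GBK");
                int ch;
                // Handle one character at a time: encode it as UTF-8, then decode those bytes as GBK.
                while ((ch = reader.read()) != -1) {
                    CharBuffer buff = CharBuffer.allocate(5);
                    buff.append((char) ch);
                    buff.flip();

                    ByteBuffer byteBuffer = originalCharset.newEncoder().encode(buff);
                    CharBuffer output;
                    try {
                        output = charset.newDecoder().decode(byteBuffer);
                    }
                    catch (MalformedInputException e) {
                        // Retry with one extra zero byte appended, so the byte count becomes even.
                        byteBuffer.position(0);
                        ByteBuffer newByteBuffer = ByteBuffer.allocate(byteBuffer.capacity() + 1);
                        newByteBuffer.put(byteBuffer);
                        byteBuffer = newByteBuffer;
                        byteBuffer.position(0);
                        output = charset.newDecoder().decode(byteBuffer);
                    }
                    // Note: array() returns the whole backing array of the decoded buffer.
                    for (char c : output.array()) {
                        System.out.println("Found " + c);
                    }
                }
            }
        }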
    
     
  • Ahmed Ashour

    Ahmed Ashour - 2015-06-12
    • status: accepted --> closed
    • assigned_to: Ahmed Ashour
     
  • Ahmed Ashour

    Ahmed Ashour - 2015-06-12

    "If you wait by the river long enough, the bodies of your enemies will float by." - ancient proverb.

    This has been fixed with Java 1.7.0_80 (sun.nio.cs.ext.DoubleByte.Decoder).

     
