Problem with Chinese in nextXPath URL

Help
2008-01-07
2012-09-04
  • sunrunner777

    sunrunner777 - 2008-01-07

    Hi,

    I am enjoying using Web-Harvest so far, it is a great tool! Great work!

    I have a problem with download-multipage-list when the URL returned by nextXPath contains Chinese characters, such as

    http://www.somesite.com/test?a=美国&b=英国

    This causes it to fail to find the next page. However, when I URL-encode the Chinese characters, it works:

    http://www.somesite.com/test?a=%BC%56%12%CA&b=%23%BD%F2%CB

    How can I force the program to URL-encode the link before visiting it?

    Thanks in advance!

    Sunrunner777
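
For illustration only, here is a minimal Python sketch of the encoding step asked about above. It is not part of Web-Harvest, and `encode_query` is a hypothetical helper name. It assumes the target site expects GB2312 bytes in the query string (the thread does not confirm which charset the site uses) and it leaves existing `%XX` escapes untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_query(url, charset="gb2312"):
    """Percent-encode non-ASCII characters in the query string of `url`.

    `charset` is an assumption: use whatever encoding the target site expects.
    """
    parts = urlsplit(url)
    encoded_pairs = []
    for pair in parts.query.split("&"):
        key, sep, value = pair.partition("=")
        # safe="%" keeps sequences that are already percent-encoded as they are
        encoded_pairs.append(
            quote(key, safe="%", encoding=charset)
            + sep
            + quote(value, safe="%", encoding=charset)
        )
    return urlunsplit(parts._replace(query="&".join(encoded_pairs)))


if __name__ == "__main__":
    print(encode_query("http://www.somesite.com/test?a=美国&b=英国"))
```

The same idea applies in any language: split the query into names and values, encode each piece in the charset the server expects, and reassemble the URL before requesting it.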

     
    • Nonsun

      Nonsun - 2008-01-15

      I am not sure whether I ran into the same problem as you, but I ran into a similar encoding problem. I use the XPath functions "substring-after" and "substring-before" with some Chinese characters to locate the data I want. Here is what I found:
      1. The Web-Harvest configuration file must be declared as GB2312-encoded (or any compatible encoding), such as:
      <?xml version="1.0" encoding="gb2312"?>
      2. The "config" element should have its "charset" attribute set to "gb2312", i.e.: <config charset="gb2312">
      3. The configuration file (XML) itself should be saved in GB2312, not UTF-8, Unicode, or anything else.
      4. The HTTP response should be declared as GB2312-encoded. This is usually true, I believe. :-)

      Then an XPath expression such as the following can be used in the configuration:
      <xpath expression="substring-before(substring-after(//h6, '折扣:'), '折')">

      I didn't verify whether all of the conditions above must be satisfied, but I can say that with all of them in place it works in my configuration.
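
As a rough sketch of conditions 1 and 3 above, the following Python snippet re-encodes a configuration as GB2312 and rewrites the XML declaration to match. It is an external helper, not a Web-Harvest feature; `to_gb2312_config` and the file names in the usage note are hypothetical.

```python
import re

GB2312_DECLARATION = '<?xml version="1.0" encoding="gb2312"?>'


def to_gb2312_config(xml_text):
    """Return the config as GB2312 bytes with a matching XML declaration.

    Raises UnicodeEncodeError if the text contains characters outside GB2312.
    """
    # Drop any existing declaration so the declared and actual encodings agree
    body = re.sub(r'^\s*<\?xml[^>]*\?>\s*', '', xml_text, count=1)
    return (GB2312_DECLARATION + "\n" + body).encode("gb2312")
```

A possible use, assuming the original file was saved as UTF-8: read it with `open("config.xml", encoding="utf-8").read()`, pass the text to `to_gb2312_config`, and write the returned bytes to a new file opened in `"wb"` mode.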

      By the way, it seems that the Web-Harvest GUI editor always saves the configuration file as UTF-8, regardless of the encoding declared in the XML header. It may also hang when you try to type Chinese characters in its editor or when it reads a UTF-8 encoded configuration file. I guess the problem comes from XML parsing. In any case, I use an external editor such as UltraEdit-32 and then load the configuration file into Web-Harvest.

       
      • Vladimir Nikic

        Vladimir Nikic - 2008-01-15

        You are right, the Web-Harvest IDE uses the default encoding when reading the XML configuration and UTF-8 when saving it. That should be fixed in the next release, which should use the charset specified in the XML declaration instead (if one is specified).

        Vladimir.

         
