WebHarvest - web data extraction tool / Discussion / Help: Problem with Chinese in nextXPath URL

sunrunner777 - 2008-01-07

Hi,

I am enjoying using Web-Harvest so far, it is a great tool! Great work!

I have a problem with download-multipage-list when the nextXPath URL contains Chinese characters, such as:

http://www.somesite.com/test?a=美国&b=英国

This causes it to fail to find the next page. However, when I URL-encode those Chinese characters, it works:

http://www.somesite.com/test?a=%BC%56%12%CA&b=%23%BD%F2%CB
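To make the transformation concrete, here is a minimal Python sketch of percent-encoding the non-ASCII parts of such a URL. It runs outside Web-Harvest and is not Web-Harvest's own mechanism; `encode_url` is a hypothetical helper name, and the default charset of GB2312 is only an assumption about what the target site expects.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str, charset: str = "gb2312") -> str:
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Hypothetical helper: the charset must match what the server expects,
    which this sketch cannot know and simply takes as a parameter.
    """
    parts = urlsplit(url)
    # Keep URL delimiters and any existing %XX escapes untouched;
    # everything else outside the unreserved set is percent-encoded.
    path = quote(parts.path, safe="/%", encoding=charset)
    query = quote(parts.query, safe="=&%", encoding=charset)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    print(encode_url("http://www.somesite.com/test?a=美国&b=英国"))
    print(encode_url("http://www.somesite.com/test?a=美国&b=英国", "utf-8"))
```

The same characters produce different escapes under different charsets, so the encoded link only works if the charset matches the one the site uses.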

How can I force the program to URL-encode the link before visiting it?

Thanks in advance!

Sunrunner777

- Nonsun - 2008-01-15
  
  I am not sure whether I hit the same problem as you, but I ran into a similar encoding problem. I use the XPath functions "substring-after" and "substring-before" with some Chinese characters to locate the data I want. Here is what I found:
  1. The Web-Harvest configuration file must be declared as GB2312 encoded (or any compatible encoding), for example:
  <?xml version="1.0" encoding="gb2312"?>
  2. The "config" element should have its "charset" attribute set to "gb2312", i.e.: <config charset="gb2312">
  3. The configuration file (XML) itself should be saved in GB2312, not in UTF-8, Unicode, or any other encoding.
  4. The HTTP stream should be declared as GB2312 encoded. This is usually true, I believe. :-)
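A quick way to check points 1 and 3 together is to confirm that the bytes of the file really decode under the encoding named in its XML declaration. The following is a minimal Python sketch under that assumption; `declared_encoding` and `matches_declaration` are hypothetical helper names, and files that start with a byte order mark are not handled.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default if none."""
    m = _DECL.match(data)
    return m.group(1).decode("ascii") if m else default


def matches_declaration(data: bytes) -> bool:
    """True if the bytes decode cleanly under the declared encoding."""
    try:
        data.decode(declared_encoding(data))
        return True
    except (UnicodeDecodeError, LookupError):
        # Either the bytes are not valid in that encoding,
        # or Python does not know the encoding name.
        return False
```

A file declared as gb2312 but actually saved as UTF-8 with Chinese text in it will normally fail this check, which is the mismatch described in point 3.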
  
  Then an XPath expression such as the following can be used in the configuration:
  <xpath expression="substring-before(substring-after(//h6, '折扣：'), '折')">
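For readers unfamiliar with the two XPath string functions, here is a small Python sketch that mirrors their XPath 1.0 semantics and shows what the nested expression extracts. The heading text "折扣：8折" is a made-up sample, not taken from any real page.

```python
def substring_after(s: str, sep: str) -> str:
    """Mimic XPath 1.0 substring-after: text after the first sep, or '' if absent."""
    if sep == "":
        return s
    _, found, tail = s.partition(sep)
    return tail if found else ""


def substring_before(s: str, sep: str) -> str:
    """Mimic XPath 1.0 substring-before: text before the first sep, or '' if absent."""
    if sep == "":
        return ""
    head, found, _ = s.partition(sep)
    return head if found else ""


heading = "折扣：8折"  # hypothetical content of the h6 element
discount = substring_before(substring_after(heading, "折扣："), "折")
print(discount)  # prints 8
```

If the literal in the expression is read in the wrong encoding, the separator no longer matches the page text and both functions return an empty string, which is why the encoding conditions matter here.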
  
  I did not verify whether all of the conditions above are required, but I can say that with all of them in place, it works in my configuration.
  
  By the way, it seems that the Web-Harvest GUI editor always saves the configuration file as UTF-8, regardless of the encoding declared in the XML header. It may also hang when you try to type Chinese characters in its editor, or when it reads a UTF-8 encoded configuration file. I guess the problem comes from the XML parsing. In any case, I edit the configuration in an external editor such as UltraEdit-32 and then load the file into Web-Harvest.
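If the GUI has already saved a file as UTF-8, one possible repair outside Web-Harvest is to re-encode it and update the XML declaration at the same time. This is only a sketch: `transcode_config` is a hypothetical helper, and it assumes every character in the file exists in the target charset (otherwise the encode step raises an error).

```python
import re


def transcode_config(data: bytes, src: str = "utf-8", dst: str = "gb2312") -> bytes:
    """Re-encode a configuration file and rewrite the encoding in its XML declaration."""
    text = data.decode(src)
    # Replace only the encoding value of the first XML declaration, if there is one.
    text = re.sub(
        r'(<\?xml[^>]*?encoding\s*=\s*["\'])[^"\']+',
        lambda m: m.group(1) + dst,
        text,
        count=1,
    )
    return text.encode(dst)
```

Reading the file in binary mode, passing its bytes through this function, and writing the result back in binary mode keeps the declaration and the actual bytes consistent.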
  
  - Vladimir Nikic - 2008-01-15
    
    You are right, the Web-Harvest IDE uses the default encoding when reading the XML configuration and UTF-8 when saving it. That should be fixed in the next release, so that the charset specified in the XML declaration is used instead (if specified).
    
    Vladimir.
    