I'm using web-harvest to extract data, and I need to use POST to submit a form.
The content type is set to application/x-www-form-urlencoded, and the HTTP field
"Line-based text data: application/x-www-form-urlencoded" needs to be filled in. For some reason, special characters such as '=' and '&' are always translated into %3D, etc.
The server then rejects my POST.
How can I disable this translation and put the following string into the field unchanged?
"center_name=&ddlbCounty=ALL&city_name=&zip_code=&star_level=All&type_facility=both&ebt_storesbyzip=Search+for+Child+Care"
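To illustrate the symptom outside web-harvest, here is a minimal Python sketch (standard library only). It assumes the space-separated value "Search for Child Care" is what the '+' signs in the string above stand for. Encoding each name and value separately gives the body the server expects, while encoding the already assembled string as if it were a single value escapes the '=' and '&' separators.

```python
from urllib.parse import quote_plus, urlencode

# Form fields taken from the post, with unencoded values.
fields = [
    ("center_name", ""),
    ("ddlbCounty", "ALL"),
    ("city_name", ""),
    ("zip_code", ""),
    ("star_level", "All"),
    ("type_facility", "both"),
    ("ebt_storesbyzip", "Search for Child Care"),
]

# Encode each name and value on its own, then join with '=' and '&'.
body = urlencode(fields)

# Encode the assembled body again as one opaque value:
# the separators '=' and '&' are now escaped as %3D and %26.
encoded_as_one_value = quote_plus(body)

print(body)
print(encoded_as_one_value)
```

If the second form is what goes on the wire, the server sees one long parameter name with no value, which would explain the rejection.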
I'm not quite sure where the problem is.
Can you post the part of your configuration XML that makes the POST request?
Vladimir.
Actually, it happens with both GET and POST. I tried to download this page:
http://www.dss.virginia.gov/facility/search/licensed.cgi?rm=Search;search_require_client_code-2106=1;search_require_client_code-2102=1;search_require_client_code-2101=1;Start=26
It works fine in IE. When I use web-harvest, the page cannot be downloaded.
I used Wireshark to capture the HTTP requests and found the difference.
The GET request line sent by web-harvest:
GET /facility/search/licensed.cgi?rm=Search%3Bsearch_require_client_code-2106%3D1%3Bsearch_require_client_code-2102%3D1%3Bsearch_require_client_code-2101%3D1%3BStart%3D26 HTTP/1.1\r\n
The GET request line sent by IE6:
GET /facility/search/licensed.cgi?rm=Search;search_require_client_code-2106=1;search_require_client_code-2102=1;search_require_client_code-2101=1;Start=26 HTTP/1.1\r\n
I found that this behavior has something to do with URL encoding. Here is a reference:
http://www.w3schools.com/tags/ref_urlencode.asp
Is there any way to disable this encoding?
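As a quick check that the two captured request lines differ only by percent-encoding, here is a small Python sketch using the standard library. The two strings are copied from the captures above; the variable names are just for illustration.

```python
from urllib.parse import quote, unquote

# Value of the "rm" parameter as sent by IE6 (unencoded).
ie_value = (
    "Search;search_require_client_code-2106=1;"
    "search_require_client_code-2102=1;"
    "search_require_client_code-2101=1;Start=26"
)

# Value of the "rm" parameter as sent by web-harvest.
wh_value = (
    "Search%3Bsearch_require_client_code-2106%3D1%3B"
    "search_require_client_code-2102%3D1%3B"
    "search_require_client_code-2101%3D1%3BStart%3D26"
)

# Percent-encoding every reserved character turns ';' into %3B and '=' into %3D.
same_after_encoding = quote(ie_value, safe="") == wh_value

# Decoding the web-harvest value gives back exactly what IE6 sent.
same_after_decoding = unquote(wh_value) == ie_value

print(same_after_encoding, same_after_decoding)
```

So the data is the same in both requests; only the ';' and '=' characters after "rm=" are escaped in the web-harvest version.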
The strange thing is that when I use the same code to download a page from Google,
it works fine, with no encoding:
===============
<?xml version="1.0" encoding="UTF-8"?>
<config>
<var-def name="gurl">
<![CDATA[http://maps.google.com/maps?f=l&hl=en&geocode=&q=daycare&near=oregon&ie=UTF8&z=7&om=0]]>
</var-def>
<var-def name="page">
<http url="${gurl}"></http>
</var-def>
<file action="write" path="gtest.xml">
<html-to-xml>
<var name="page"></var>
</html-to-xml>
</file>
</config>
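One possible explanation for why the Google URL is untouched, sketched in Python. This is only a guess at the behavior, not something confirmed from the web-harvest source: the hypothetical function `reencode_query` below splits the query string on '&' and percent-encodes each name and value. Under that assumption, a URL with '&' separators and plain values comes out unchanged, while a URL with ';' separators is treated as one parameter "rm" whose whole value gets encoded.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def reencode_query(url):
    """Hypothetical model: split the query on '&' and encode each name and value."""
    parts = urlsplit(url)
    pairs = []
    for piece in parts.query.split("&"):
        # Everything after the first '=' is treated as the value.
        name, sep, value = piece.partition("=")
        pairs.append(quote(name, safe="") + sep + quote(value, safe=""))
    return urlunsplit(parts._replace(query="&".join(pairs)))


google_url = (
    "http://maps.google.com/maps?f=l&hl=en&geocode=&q=daycare"
    "&near=oregon&ie=UTF8&z=7&om=0"
)
dss_url = (
    "http://www.dss.virginia.gov/facility/search/licensed.cgi?"
    "rm=Search;search_require_client_code-2106=1;"
    "search_require_client_code-2102=1;"
    "search_require_client_code-2101=1;Start=26"
)

print(reencode_query(google_url) == google_url)  # '&' separated: unchanged
print(reencode_query(dss_url))                   # ';' and '=' inside the value are escaped
```

If this model is right, the difference comes from the ';' separators in the Virginia URL, not from anything specific to Google.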