WebHarvest - web data extraction tool / Discussion / Open Discussion: Login Nytimes with webharvest

Hello,

I am an intern and I am trying to use WebHarvest to log in to a wiki and
extract information from it.

As a first step, I tried to follow the same code pattern as the nytimes
example, connecting to an https page, that is, one that requires
authentication.

Here is my configuration file:

<http method="post" url="http://www.nytimes.com/auth/login">

<http-param name="is_continue">true</http-param>

<http-param name="URI">http://www.nytimes.com/pages/todayspaper/index.html
</http-param>

<http-param name="OQ"></http-param>

<http-param name="OP"></http-param>

<http-param name="USERID">web-harvest</http-param>

<http-param name="PASSWORD">web-harvest</http-param>

</http>

<var-def name="startUrl">https://myaccount.nytimes.com/membercenter/myaccount.html
</var-def>

<var-def name="scrappedContent">

</template>

<xq-param name="doc">

<html-to-xml>

</html-to-xml>

</xq-param>

<xq-expression><![CDATA[

let $title := data($doc//h1)

let $text := data($doc//div)

return

{data($title)} <author>{data($author)}</author> {data($text)}

]]></xq-expression>

</xquery>

</file>

</var-def>

</config>

Here is my Test.java:

import java.io.FileNotFoundException;
import java.io.UnsupportedEncodingException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class Test {

    public static void main(String[] args) throws UnsupportedEncodingException {
        try {
            // String url = "http://portail.groupeonepoint.com/PROJETS_CLIENTS/SFR/Ulysse_Services/Wiki/" + URLEncoder.encode("TPE_DEPLOY.aspx", "UTF-8");
            // String url = "http://portail.groupeonepoint.com/PROJETS_CLIENTS/SFR/Ulysse_Services/Wiki/C_MAJ_Hispeed.aspx";
            // String url = "http://www.nytimes.com/pages/todayspaper/index.html";

            // Load the scraper configuration and set the working directory.
            ScraperConfiguration config =
                new ScraperConfiguration("/workspace/webharvest2b1/examples/nytimes.xml");
            Scraper scraper = new Scraper(config, "/webharvest2b1/resultat");
            // scraper.addVariableToContext("url", url);
            scraper.setDebug(true);
            scraper.execute();

            // Read the variable defined by <var-def name="scrappedContent"> in the config.
            Variable varScrappedContent =
                (Variable) scraper.getContext().getVar("scrappedContent");
            // (Variable) scraper.getContext().getVar("MyContent");

            // Printing the scraped data here
            System.out.println(varScrappedContent.toString());
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}

I always get this response:

<newyourk_times date="27.06.2011">

Please Log In <author/>

</newyourk_times>

In other words, it does not manage to log in to the nytimes site.

I tried debugging, but I could not find where the configuration XML file is
read.
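Before stepping through the login itself, it may be worth ruling out a configuration file that does not parse: the config as pasted in this post looks incomplete (some closing tags have no visible opening tag). The following is a minimal sketch, using only the JDK's XML parser and not WebHarvest's own loader, that checks whether a piece of XML is well-formed. The names `ConfigCheck` and `isWellFormed` are hypothetical and chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of WebHarvest): checks XML well-formedness
// with the JDK parser before the text is handed to any scraper.
final class ConfigCheck {

    /** Returns true if the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A well-formed file is only a precondition; it says nothing about whether the login request itself succeeds.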

Please, does anyone know why it fails to log in?

In Firebug I looked at the URI
("http://www.nytimes.com/pages/todayspaper/index.html").

Do you have a working example that manages to log in?

I am adding the WebHarvest plugin to my Java J2EE website and I want to scrape
a wiki that requires authentication.

I tried to work from the nytimes example given in the API, which allows
authentication, but it doesn't work.

It can't log in with this code:

<http method="post" url="http://www.nytimes.com/auth/login">
<http-param name="is_continue">true</http-param>
<http-param name="URI">http://www.nytimes.com/pages/todayspaper/index.html</http-param>
<http-param name="OQ"></http-param>
<http-param name="OP"></http-param>
<http-param name="USERID">web-harvest</http-param>
<http-param name="PASSWORD">web-harvest</http-param>
</http>

When I want to scrape:

https://myaccount.nytimes.com/membercenter/myaccount.html

Do you have any idea or an example?
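For comparison, here is a rough sketch of the same two-step flow (POST the login form, then fetch the protected page) written against the plain JDK `java.net.http` client (Java 11+) rather than WebHarvest. It can help separate a WebHarvest configuration problem from a problem with the login request itself. The class and method names (`FormLoginSketch`, `formBody`, `sessionClient`, `loginThenFetch`) are hypothetical, and it is an assumption that the site accepts the form fields shown in the config above and keeps the session in a cookie.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not WebHarvest code: form login with a shared cookie store.
final class FormLoginSketch {

    /** Encodes parameters as application/x-www-form-urlencoded, in map order. */
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    /** Builds a client that stores cookies, so the login session carries over. */
    static HttpClient sessionClient() {
        return HttpClient.newBuilder()
            .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
    }

    /** Posts the login form, then fetches the protected page with the same client. */
    static String loginThenFetch(HttpClient client, String loginUrl,
                                 Map<String, String> form, String protectedUrl)
            throws IOException, InterruptedException {
        HttpRequest login = HttpRequest.newBuilder(URI.create(loginUrl))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(formBody(form)))
            .build();
        // The response body is not needed; only the cookies it sets matter.
        client.send(login, HttpResponse.BodyHandlers.discarding());

        HttpRequest page = HttpRequest.newBuilder(URI.create(protectedUrl)).GET().build();
        return client.send(page, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

The form map would hold the same fields as the `<http-param>` elements (`is_continue`, `URI`, `OQ`, `OP`, `USERID`, `PASSWORD`). If the page returned here also says "Please Log In", the cause is more likely the login request (URL, field names, or credentials) than the scraper configuration.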
