Login Nytimes with webharvest

imen
2011-06-27
2012-09-04
  • imen

    imen - 2011-06-27

    Hello,

    I am an intern and I am trying to use webharvest to log in to a wiki and
    extract information from it.

    As a first step, I tried to follow the same code pattern as the nytimes
    example, connecting to an https page, that is, one that requires
    authentication.

    Here is my configuration file:

    <config charset="UTF-8">
        <http method="post" url="http://www.nytimes.com/auth/login">
            <http-param name="is_continue">true</http-param>
            <http-param name="URI">http://www.nytimes.com/pages/todayspaper/index.html</http-param>
            <http-param name="OQ"></http-param>
            <http-param name="OP"></http-param>
            <http-param name="USERID">web-harvest</http-param>
            <http-param name="PASSWORD">web-harvest</http-param>
        </http>

        <var-def name="startUrl">https://myaccount.nytimes.com/membercenter/myaccount.html</var-def>

        <var-def name="scrappedContent">
            <file action="write" path="nytimes/nytimes${sys.date()}.xml" charset="UTF-8">
                <template>
                </template>
                <xquery>
                    <xq-param name="doc">
                        <html-to-xml>
                            <http url="${startUrl}"/>
                        </html-to-xml>
                    </xq-param>
                    <xq-expression><![CDATA[
                        let $title := data($doc//h1)
                        let $text := data($doc//div)
                        return
                        {data($title)} <author>{data($author)}</author> {data($text)}
                    ]]></xq-expression>
                </xquery>
            </file>
        </var-def>
    </config>

    Here is my Test.java:

    import java.io.FileNotFoundException;
    import java.io.UnsupportedEncodingException;

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;
    import org.webharvest.runtime.variables.Variable;

    public class Test {
        public static void main(String[] args) throws UnsupportedEncodingException {
            try {
                // String url = "http://portail.groupeonepoint.com/PROJETS_CLIENTS/SFR/Ulysse_Services/Wiki/"
                //         + URLEncoder.encode("TPE_DEPLOY.aspx", "UTF-8");
                // String url = "http://portail.groupeonepoint.com/PROJETS_CLIENTS/SFR/Ulysse_Services/Wiki/C_MAJ_Hispeed.aspx";
                // String url = "http://www.nytimes.com/pages/todayspaper/index.html";

                // Load the scraper configuration and set the working directory
                ScraperConfiguration config =
                        new ScraperConfiguration("/workspace/webharvest2b1/examples/nytimes.xml");
                Scraper scraper = new Scraper(config, "/webharvest2b1/resultat");
                // scraper.addVariableToContext("url", url);
                scraper.setDebug(true);
                scraper.execute();

                Variable varScrappedContent =
                        (Variable) scraper.getContext().getVar("scrappedContent");
                // (Variable) scraper.getContext().getVar("MyContent");

                // Printing the scraped data here
                System.out.println(varScrappedContent.toString());
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }
    }

    I always get the following response:

    <newyourk_times date="27.06.2011">

    Please Log In <author/>

    </newyourk_times>

    In other words, it fails to log in to the nytimes site.

    I tried debugging, but I could not find where the XML configuration file
    is read.

    Does anyone know why it fails to log in?

    I checked the URI in Firebug
    ("http://www.nytimes.com/pages/todayspaper/index.html").

    Do you have a working example that manages to log in?

     
  • Alex Wajda

    Alex Wajda - 2011-06-27

    Unfortunately, I don't know your language, dear sir, so I can't read what
    you wrote :(

    Please, use English.

     
  • imen

    imen - 2011-06-28

    I am adding the webharvest plugin to my Java J2EE website and I want to
    scrape a wiki that requires authentication.

    I tried to follow the nytimes example given in the API, which performs
    authentication, but it doesn't work.

    It can't log in with this code:

    <http method="post" url="http://www.nytimes.com/auth/login">
        <http-param name="is_continue">true</http-param>
        <http-param name="URI">http://www.nytimes.com/pages/todayspaper/index.html</http-param>
        <http-param name="OQ"></http-param>
        <http-param name="OP"></http-param>
        <http-param name="USERID">web-harvest</http-param>
        <http-param name="PASSWORD">web-harvest</http-param>
    </http>

    When I want to scrape:

    https://myaccount.nytimes.com/membercenter/myaccount.html

    Do you have any idea or an example?
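
    To make it easier to compare with what Firebug shows for a browser login,
    here is a rough sketch of the form body that the <http> element above
    describes. It assumes that webharvest sends the <http-param> children of a
    POST as ordinary URL-encoded form fields, which I have not verified in its
    source. The class and method names (LoginFormBody, loginParams, encodeForm)
    are made up for this illustration and are not part of webharvest.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of webharvest: builds the URL-encoded form
// body for the login parameters listed in the config above.
class LoginFormBody {

    // The parameters exactly as they appear in the <http-param> elements,
    // kept in declaration order.
    static Map<String, String> loginParams() {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("is_continue", "true");
        params.put("URI", "http://www.nytimes.com/pages/todayspaper/index.html");
        params.put("OQ", "");
        params.put("OP", "");
        params.put("USERID", "web-harvest");
        params.put("PASSWORD", "web-harvest");
        return params;
    }

    // Joins name=value pairs with '&', URL-encoding both names and values.
    static String encodeForm(Map<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Print the body so it can be compared with the POST data in Firebug.
        System.out.println(encodeForm(loginParams()));
    }
}
```

    If the field names or values printed here differ from what the browser
    posts to the login page, that difference would be the first thing to check.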

     
