Help harvesting Google Scholar

Help
2006-12-20
2012-09-04
  • Rodrigo Rech

    Rodrigo Rech - 2006-12-20

    Hello! I'm trying to harvest Google Scholar (http://scholar.google.com).

    The data I'm interested in: the title of each reference and its number of citations.

    The problem is that the title of a reference is sometimes a link and sometimes plain text, without any HTML tags around it.

    The code below shows the XML version generated by Web-Harvest of the HTML sent by Google Scholar:

    <p class="g">
    <span class="w">
    <a
    href="/url?sa=U&amp;q=http://www8.org/w8-papers/2b-customizing/user/user.html">
    User Adaptable Multimedia Presentations for the WWW // THIS IS THE TITLE!!!!!
    </a>
    </span>
    ...
    </p>

    To harvest the title, I'm using the following XPath expression:
    let $title := data($item//span[@class='w']/a)

    It works fine in this situation!

    But sometimes the title is at another place:

    <p class="g">
    <font size="-2">
    <b>[CITATION]</b>
    </font>
    Modeling of Courses through Workflow using the standard SVG/XML // THIS IS THE TITLE NOW!!!!!
    <font size="-1">
    ...
    </p>

    There are no HTML tags around the title, and I can't figure out an XPath expression that would grab it.

    Any ideas?

    I can send the full Web-Harvest configuration file I'm using; just send me a message here. Of course, the file can be distributed with future versions of Web-Harvest, no problem!

    Thanks a lot!!!
    Rodrigo - rorech@inf.ufrgs.br

     
    • Rodrigo Rech

      Rodrigo Rech - 2006-12-22

      Still can't figure out an XPath expression to harvest the title of the reference below :(

      <p class="g">
      <font size="-2">
      <b>[CITATION]</b>
      </font>
      Modeling of Courses through Workflow using the standard SVG/XML
      <font size="-1">

      ...

      </font>

      </p>

      I have tried many expressions; the best so far is:
      let $title2 := data($item//text()[2])

      But then I can't use {normalize-space($title2)}: an exception is thrown saying that there is more than one item.

      Thanks for any help!!!
      Rodrigo
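
      One possible reading of that exception, offered as a sketch rather than a confirmed fix: `$item//text()[2]` applies the `[2]` to every element under `$item`, so it returns the second text node of each element that has one, not the second text node overall. Selecting only the direct text children of the `<p class="g">` element and wrapping the step in parentheses before indexing should yield a single item. The sketch below assumes that the title is the first non-blank text node directly under the paragraph in the [CITATION] layout, and that the linked layout from the first post is checked first; `$linked` and `$plain` are names introduced here for illustration.

```xquery
(: Sketch only: assumes $item holds one <p class="g"> result, as in the configuration in this thread. :)
let $linked := $item//span[@class='w']/a
(: Direct text children of the paragraph, skipping whitespace-only nodes; :)
(: the parentheses make [1] pick a single node overall. :)
let $plain := ($item/descendant-or-self::p[@class='g']/text()[normalize-space(.) != ''])[1]
(: Prefer the link text when present, otherwise fall back to the bare text. :)
let $title := if (exists($linked)) then string($linked[1]) else string($plain)
return
    <title>{normalize-space($title)}</title>
```

      In the Web-Harvest configuration, these lines would take the place of the single `let $title := ...` line inside the `xq-expression` block.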

       
    • Rodrigo Rech

      Rodrigo Rech - 2006-12-20

      Example of a configuration file I'm using to harvest Google Scholar:

      <?xml version="1.0" encoding="UTF-8"?>

      <config charset="UTF-8">

      <include path="functions.xml"/>

      <var-def name="url">
          <template>http://scholar.google.com/scholar?q=valdeni&amp;hl=en&amp;lr=&amp;btnG=Search</template>
      </var-def>

      <var-def name="referencias">
          <call name="download-multipage-list">
              <call-param name="pageUrl"><var name="url"/></call-param>
              <call-param name="nextXPath">//td[.='Next']/a/@href</call-param>
              <call-param name="itemXPath">//p[@class="g"]</call-param>
              <call-param name="maxloops">5</call-param>
          </call>
      </var-def>

      <file action="write" path="scholar/scholar.xml" charset="UTF-8">
          <![CDATA[ <?xml version="1.0" encoding="UTF-8"?>
                    <scholar>
                    ]]>
          <loop item="item" index="i">
              <list><var name="referencias"/></list>
              <body>
                  <xquery>
                      <xq-param name="item"><var name="item"/></xq-param>
                      <xq-expression><![CDATA[
                              let $title := data($item//span[@class='w']/a)
                              let $citacoes := data($item//font[@size='-1']/font[@color='#7777CC']/a[contains(@href, 'cites=')])
                                  return
                                      <referencia>
                                          <title>{normalize-space($title)}</title>
                                          <citacoes>{normalize-space($citacoes)}</citacoes>
                                      </referencia>
                      ]]></xq-expression>
                  </xquery>
              </body>
          </loop>
          <![CDATA[ 
                    </scholar> ]]>
      </file>
      

      </config>

       
    • Rodrigo Rech

      Rodrigo Rech - 2006-12-20

      Another configuration file, which just generates an XML file representing the HTML of Google Scholar:

      <?xml version="1.0" encoding="UTF-8"?>

      <config charset="UTF-8">

      <var-def name="url">
          <template>http://scholar.google.com/scholar?q=valdeni&amp;hl=en&amp;lr=&amp;btnG=Search</template>
      </var-def>

      <file action="write" path="scholar/layoutScholar.xml" charset="UTF-8">
          <html-to-xml>
              <http url="${url}" charset="UTF-8"/>
          </html-to-xml>
      </file>
      

      </config>

       
