Hello! I'm trying to harvest Google Scholar (http://scholar.google.com).
The data I'm interested in: the title of each reference and its number of citations.
The problem is that sometimes the text of a reference's title is a link, and sometimes it is just plain text, without any HTML tags around it.
The code below shows the XML version, generated by Web-Harvest, of the HTML sent by Google Scholar:
<p class="g">
<span class="w">
<a
href="/url?sa=U&q=http://www8.org/w8-papers/2b-customizing/user/user.html">
User Adaptable Multimedia Presentations for the WWW // THIS IS THE TITLE!!!!!
</a>
</span>
...
</p>
To harvest the title, I'm using the following XPath expression:
let $title := data($item//span[@class='w']/a)
It works fine in this situation!
But sometimes the title is in another place:
<p class="g">
<font size="-2">
<b>[CITATION]</b>
</font>
Modeling of Courses through Workflow using the standard SVG/XML // THIS IS THE TITLE NOW!!!!!
<font size="-1">
...
</p>
There are no HTML tags around the title... and I can't figure out the XPath expression that would grab it.
Any ideas?
I can send the full Web-Harvest configuration file I'm using; just send me a message here. Of course, the file can be distributed with future versions of Web-Harvest, no problem!
Thanks a lot!!!
Rodrigo - rorech@inf.ufrgs.br
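One way to look at the second layout: the title is a text node that is a direct child of the `<p class="g">` element, so an expression along the lines of `normalize-space(($item/text()[normalize-space()])[1])` may select it (this is standard XPath 2.0, but it has not been checked against Web-Harvest here). The sketch below shows the same idea in Python with the standard library's `xml.etree.ElementTree`, falling back to the bare text only when the link is missing. The function name `extract_title` is made up for illustration, and the sample snippets are trimmed copies of the excerpts above with the `...` placeholders left out and `&` escaped as `&amp;`.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two layouts from the post (illustrative samples).
LINKED = """<p class="g">
<span class="w">
<a href="/url?sa=U&amp;q=http://www8.org/w8-papers/2b-customizing/user/user.html">
User Adaptable Multimedia Presentations for the WWW
</a>
</span>
</p>"""

PLAIN = """<p class="g">
<font size="-2">
<b>[CITATION]</b>
</font>
Modeling of Courses through Workflow using the standard SVG/XML
<font size="-1">
</font>
</p>"""


def extract_title(item):
    """Return the title of one <p class="g"> element, or None if not found."""
    # Layout 1: the title is the text of the link inside <span class="w">.
    link = item.find(".//span[@class='w']/a")
    if link is not None:
        return " ".join("".join(link.itertext()).split())
    # Layout 2: the title is a bare text node directly under <p>.
    # ElementTree stores such nodes as item.text and as the .tail of each child.
    direct_text = [item.text] + [child.tail for child in item]
    for chunk in direct_text:
        if chunk and chunk.strip():
            return " ".join(chunk.split())
    return None


print(extract_title(ET.fromstring(LINKED)))
print(extract_title(ET.fromstring(PLAIN)))
```

Checking the link first keeps the existing behaviour for linked titles, and skipping whitespace-only chunks avoids depending on the exact position of the text node.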
I still can't figure out an XPath expression to harvest the title of the reference below :(
<p class="g">
<font size="-2">
<b>[CITATION]</b>
</font>
Modeling of Courses through Workflow using the standard SVG/XML
<font size="-1">
...
</font>
</p>
I have tried many expressions; the best so far is:
let $title2 := data($item//text()[2])
But then I can't use {normalize-space($title2)}: an exception is thrown saying that there is more than one item.
Thanks for any help!!!
Rodrigo
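A possible reason for the "more than one item" exception: `$item//text()[2]` is shorthand for `$item/descendant-or-self::node()/text()[2]`, so it asks for the second text child of every element under `$item`, not the second text node overall. If whitespace-only text nodes are kept in the generated XML (an assumption), both `<p>` and the first `<font>` have a second text child, and `normalize-space()` refuses a sequence of two items. The Python sketch below imitates that selection with `xml.etree.ElementTree`; the helper names are hypothetical and the `...` placeholder from the excerpt is left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt above (illustrative sample).
SNIPPET = """<p class="g">
<font size="-2">
<b>[CITATION]</b>
</font>
Modeling of Courses through Workflow using the standard SVG/XML
<font size="-1">
</font>
</p>"""


def text_children(elem):
    """Text nodes that are direct children of elem, in document order."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [n for n in nodes if n]


def second_text_of_every_element(root):
    """Imitate //text()[2]: the 2nd text child of each element that has one."""
    return [text_children(e)[1] for e in root.iter() if len(text_children(e)) >= 2]


def first_nonblank_direct_text(root):
    """Imitate normalize-space((root/text()[normalize-space()])[1])."""
    for node in text_children(root):
        if node.strip():
            return " ".join(node.split())
    return ""


item = ET.fromstring(SNIPPET)
matches = second_text_of_every_element(item)
print(len(matches))                      # more than one match for this sample
print(first_nonblank_direct_text(item))  # only the direct, non-blank text of <p>
```

Restricting the step to direct children of `$item` (a single `/` instead of `//`) and taking the first non-blank match yields one item, which `normalize-space()` can accept.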
Example of a configuration file I'm using to harvest Google Scholar:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
</config>
Another configuration file, which just generates an XML file representing the HTML of Google Scholar:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
</config>