I am creating a configuration to extract a number of simple html pages. Their
structure is like:
I want to collect the name/etc. items together with their parent topic, but
XPath seems to be insufficient, as I cannot pick a particular item from the
resulting list (I know that the second result for "name" is under "topic 2",
but there is no structural attribute for this). The result will be written
into a db table (the first name goes into a different column than the second).
So I want to select all "name"s first and then pick them manually with BeanShell.
The question is: if I define a variable like this:
how can I manipulate it in a <script> part, and what kind of object is it?
Is it possible at all, or is there a better idea?
what you need is to use XPath axes, and to review the part of the syntax
concerning indices in square brackets.
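To make the point about indices in square brackets concrete, here is a minimal sketch in plain Java (javax.xml.xpath), not a Web-Harvest configuration. The markup in SAMPLE and the class name PositionalIndex are made up for illustration, since the real page is not shown in this thread. It shows that //name[2] and (//name)[2] are not the same thing: the first asks for a name that is the second name child of its own parent, the second asks for the second name in the whole document.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PositionalIndex {
    // Hypothetical, well-formed stand-in for the page (assumption, not the real source).
    static final String SAMPLE =
        "<page>"
        + "<topic><title>topic 1</title><name>first</name></topic>"
        + "<topic><title>topic 2</title><name>second</name></topic>"
        + "</page>";

    // Number of nodes the expression selects in the given XML.
    static int count(String expr, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    // String value of the first node the expression selects ("" if it selects nothing).
    static String first(String expr, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Index applied per parent: each topic has only one name, so nothing matches.
        System.out.println(count("//name[2]", SAMPLE));
        // Index applied to the whole result list: the second name in document order.
        System.out.println(first("(//name)[2]", SAMPLE));
        // Selecting by the enclosing topic instead of by position.
        System.out.println(first("//topic[title='topic 2']/name", SAMPLE));
    }
}
```

The last expression only works when each topic encloses its own items, which is the assumption behind this sample.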
Do you mean e.g. td ? The position is variable, so this wouldn't help. Do you
have an example?
it's not a matter of position, but of enclosing tags.
for example, if your source looks like this:
some other junk...
you just need an xpath like
for the first instance of topic.
if you provide an excerpt of your source, I could be of more help.
Oh, you're right, but I didn't explain it clearly enough, sorry.
The "variable content" consists of the same tags, so it could be
<h2>junk topic 1</h2>
<h2>junk topic 2</h2>
<h2>maybe another junk topic</h2>
<p>bla5 or even bla6
The actual source is but that doesn't make a difference of course.
Is there any light at the end of the tunnel?
thanks - werner
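For a flat layout like the one described above, where headings and paragraphs are siblings and nothing encloses a topic, the sibling axes can still tie each paragraph to the heading before it. The following is a rough sketch in plain Java (javax.xml.xpath) rather than a Web-Harvest configuration. SAMPLE is an invented, already well-formed version of the excerpt, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SiblingTopics {
    // Hypothetical cleaned-up page: h2 headings and p items are all siblings (assumption).
    static final String SAMPLE =
        "<body>"
        + "<h2>topic 1</h2><p>bla1</p><p>bla2</p>"
        + "<h2>topic 2</h2><p>bla3</p><p>bla4</p>"
        + "<h2>another topic</h2><p>bla5</p>"
        + "</body>";

    // Text of every p whose nearest preceding h2 sibling has the given text.
    // Assumes the topic text contains no apostrophe, since it is pasted into the expression.
    static List<String> paragraphsUnder(String topic, String xml) {
        // preceding-sibling is a reverse axis, so h2[1] is the closest earlier h2.
        String expr = "//p[preceding-sibling::h2[1][normalize-space(.) = '" + topic + "']]";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(paragraphsUnder("topic 1", SAMPLE));
        System.out.println(paragraphsUnder("topic 2", SAMPLE));
    }
}
```

The same expression should be usable wherever an XPath 1.0 expression is accepted, as long as the HTML has been converted to well-formed XML first.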
if you use the xpath expression
you will get a LIST variable with ALL the topics.
same goes for tables. you just need to find the commonality.
if you want, please post the ACTUAL HTML so I can give you a few pointers with
a real example...
I want the
's next to topic1 and topic2, but only one of them at a time.
I think that's the problem right there... you cannot get "one at a time" with
maybe you can process this better if you rethink this and use the xpath list
result inside a loop (provided by webharvest)?
I recommend you have a look at the examples in the documentation... there's
plenty of relevant information there.
okay, here is the actual html:
You find there the topics "Motor" and "Propeller" which both have a
"Manufacturer" sub topic. I want to collect the manufacturers for either motor
Hi again and thanks for your responses,
I am using the loop operation for other purposes already, but in this case the
selected items have a different meaning.
I assume I should use a conditional within this loop asking for the current
I will try - anyway, could I use a web harvest within a script block?
thanks a lot for your help
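The "conditional within the loop" idea can be sketched as follows, again in plain Java (javax.xml.xpath) and not as a Web-Harvest configuration. Everything concrete here is an assumption: the table layout in SAMPLE, the manufacturer values, and the class name ManufacturerByTopic are invented, because the actual HTML did not survive in this thread. The loop walks the rows in document order, remembers the most recent topic heading, and records the "Manufacturer" value under that topic.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ManufacturerByTopic {
    // Hypothetical layout: a th row starts a topic, td rows hold label/value pairs.
    static final String SAMPLE =
        "<table>"
        + "<tr><th>Motor</th></tr>"
        + "<tr><td>Manufacturer</td><td>ExampleMotorCo</td></tr>"
        + "<tr><td>Power</td><td>80 hp</td></tr>"
        + "<tr><th>Propeller</th></tr>"
        + "<tr><td>Manufacturer</td><td>ExamplePropCo</td></tr>"
        + "</table>";

    // Maps each topic heading to the value of its "Manufacturer" row.
    static Map<String, String> manufacturers(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(
                "//tr", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            Map<String, String> result = new LinkedHashMap<>();
            String currentTopic = null;
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                // A non-empty th means this row starts a new topic.
                String heading = xpath.evaluate("th", row).trim();
                if (!heading.isEmpty()) {
                    currentTopic = heading;
                    continue;
                }
                // Otherwise it is a label/value row belonging to the current topic.
                String label = xpath.evaluate("td[1]", row).trim();
                if (currentTopic != null && label.equals("Manufacturer")) {
                    result.put(currentTopic, xpath.evaluate("td[2]", row).trim());
                }
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Each entry could then go into its own db column (Motor vs. Propeller).
        System.out.println(manufacturers(SAMPLE));
    }
}
```

Keeping the topic in a variable between iterations is what lets two items with the same label end up in different columns.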