Xpath not enough, how to pass var into script

Help
werner001
2010-06-13
2012-09-04
  • werner001
    2010-06-13

    Hello,

    I am creating a configuration to extract data from a number of simple HTML
    pages. Their structure is like:

    • topic 1
      • name
      • etc
    • topic 2
      • name
      • etc

    I want to collect the name/etc items together with their parent topic, but
    XPath seems to be insufficient, as I cannot pick a particular item from the
    resulting list (I know that the second result for "name" is under "topic 2",
    but there is no structural attribute for this). The result will be written
    into a db table (the first name goes into a different column than the second
    one).

    So, I want to select all "name"s first and then pick them manually with
    BeanShell or JavaScript (XQuery seems to be no help either).

    The question is: if I define a variable like this:

    <var-def name="list">
      <xpath expression="//tr/td/../following-sibling::tr/td/following-sibling::td"></xpath>
    </var-def>

    How can I manipulate it in a <script> part, and what kind of object is this?
    Is that possible at all, or is there a better idea?

    thank you

    werner

     
  • Dan
    2010-06-14

    what you need is to use xpath axes and review the syntax part concerning
    indices in square brackets

     
  • werner001
    2010-06-14

    Do you mean e.g. td ? The position is variable, so this wouldn't help. Do you
    have an example?

    thanks

    Werner

     
  • Dan
    2010-06-14

    it's not a matter of position, but of enclosing tags.

    for example, if your source looks like this:

    <h2>topic1</h2>
    <p>name...
    
    some other junk...
    <h2>topic2</h2>
    <p>name 2

    you just need an xpath like

     //h2[1]/text()
    

    for the first instance of topic.

    if you provide an excerpt of your source i could be of more help.
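
To make the square-bracket index concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. That module supports only a small subset of XPath (there is no text() step, so the text is read from the element instead). The sample markup is a hypothetical, well-formed version of the excerpt above: the closing tags, the <body> wrapper and the <div> around the junk text are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the excerpt in the post.
SOURCE = """
<body>
  <h2>topic1</h2>
  <p>name 1</p>
  <div>some other junk...</div>
  <h2>topic2</h2>
  <p>name 2</p>
</body>
"""

root = ET.fromstring(SOURCE)

# Indices in square brackets are 1-based: h2[1] is the first <h2> child.
first_topic = root.find("h2[1]").text
second_topic = root.find("h2[2]").text

# Without an index, the same path yields every match as a list.
all_topics = [h2.text for h2 in root.findall("h2")]

print(first_topic, second_topic, all_topics)
```

The index counts matching elements only, so it stays stable as long as no extra <h2> appears before the one you want.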

     
  • werner001
    2010-06-15

    Oh, you're right, but I didn't explain it clearly enough, sorry.

    The "variable content" consists of the same tags, so it could be:

    <h2>topic1</h2>
    <p>name...
    
    <h2>junk topic 1</h2> 
    <p>bla
    
    <h2>junk topic 2</h2> 
    <p>bla2
    
    <h2>maybe another junk topic</h2> 
    <p>bla3 
    <p>bla4
    <p>bla5 or even bla6
    
    <h2>maybe not</h2> 
    <p>bla4
    
    <h2>topic2</h2> 
    <p>name
    

    The actual source uses table markup (tr/td), but that doesn't make a
    difference of course.

    Is there any light at the end of the tunnel?

    thanks - werner
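
One way to cope with a variable number of junk sections is to key on the heading text instead of the position. In full XPath this can usually be written as //h2[.='topic1']/following-sibling::p[1], assuming the headings and paragraphs are siblings. The sketch below shows the same idea in plain Python with the standard library; the markup is a shortened, well-formed stand-in for the example above, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed and shortened version of the example markup.
SOURCE = """
<body>
  <h2>topic1</h2>
  <p>name 1</p>
  <h2>junk topic 1</h2>
  <p>bla</p>
  <h2>maybe another junk topic</h2>
  <p>bla3</p>
  <p>bla4</p>
  <h2>topic2</h2>
  <p>name 2</p>
</body>
"""


def paragraphs_by_heading(root):
    """Map each <h2> text to the <p> texts that follow it, up to the next <h2>."""
    groups = {}
    current = None
    for element in root:  # direct children, in document order
        if element.tag == "h2":
            current = element.text.strip()
            groups[current] = []
        elif element.tag == "p" and current is not None:
            groups[current].append(element.text.strip())
    return groups


groups = paragraphs_by_heading(ET.fromstring(SOURCE))

# Keep only the two topics of interest; the junk sections are ignored.
wanted = {topic: groups[topic] for topic in ("topic1", "topic2")}
print(wanted)
```

Because the lookup is by heading text, it does not matter how many junk headings or paragraphs sit between the two topics.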

     
  • Dan
    2010-06-15

    example:

    if you use the xpath expression

     //h2/text()
    

    you will get a LIST variable with ALL the topics.

    same goes for tables. you just need to find the commonality.

    if you want, please post the ACTUAL HTML so I can give you a few pointers with
    a real example...

    cheers

     
  • werner001
    2010-06-15

    addition...

    I want the <p>'s next to topic1 and topic2, but only one of them at a time.

     
  • Dan
    2010-06-15

    i think that's the problem right there... you cannot get "one at a time" with
    xpath.

    maybe you can process this better if you rethink this and use the xpath list
    result inside a loop operation (provided by webharvest)?

    I recommend you have a look at the examples in the documentation... there's
    plenty of relevant information there.

     
  • werner001
    2010-06-15

    okay, here is the actual html:

    http://www.aeromarkt.net/offer_detail_print.php?lfz_id=3191&lng=1

    You find there the topics "Motor" and "Propeller" which both have a
    "Manufacturer" sub topic. I want to collect the manufacturers for either motor
    or propeller.
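
For a table-based page, the same "remember the current section" idea might look like the sketch below. The table layout is purely hypothetical (a one-cell row per section, then label/value rows), the manufacturer values are made up, and the real page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout and made-up values; not taken from the real page.
SOURCE = """
<table>
  <tr><td>Motor</td></tr>
  <tr><td>Manufacturer</td><td>Engine Maker A</td></tr>
  <tr><td>Hours</td><td>120</td></tr>
  <tr><td>Propeller</td></tr>
  <tr><td>Manufacturer</td><td>Prop Maker B</td></tr>
</table>
"""

SECTIONS = ("Motor", "Propeller")


def manufacturers_by_section(table):
    """Return {section name: manufacturer} for the sections listed in SECTIONS."""
    result = {}
    section = None
    for row in table.iter("tr"):
        cells = [(td.text or "").strip() for td in row.findall("td")]
        if len(cells) == 1 and cells[0] in SECTIONS:
            section = cells[0]  # a section header row starts a new block
        elif len(cells) == 2 and cells[0] == "Manufacturer" and section:
            result[section] = cells[1]
    return result


manufacturers = manufacturers_by_section(ET.fromstring(SOURCE))
print(manufacturers)
```

Each value ends up under the section it belongs to, so it can go into its own db column without relying on list positions.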

     
  • werner001
    2010-06-15

    Hi again and thanks for your responses,

    I am using the loop operation for other purposes already, but in this case the
    selected items have a different meaning.

    I assume I should use a conditional within this loop asking for the current
    index?

    I will try - anyway, could I use a Web-Harvest variable within a script block?

    thanks a lot for your help

    Werner
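
The conditional on the current index could be sketched roughly as follows, in plain Python rather than Web-Harvest syntax. The list stands in for an XPath list result in document order; the values and the column names are hypothetical.

```python
# Stand-in for the XPath list result (made-up values, document order).
names = ["Engine Maker A", "Prop Maker B"]

row = {}
for index, value in enumerate(names):
    if index == 0:
        # First "Manufacturer" hit belongs to the Motor section.
        row["motor_manufacturer"] = value
    elif index == 1:
        # Second hit belongs to the Propeller section.
        row["propeller_manufacturer"] = value

print(row)
```

This only works while the order and number of hits are fixed; if a section can be missing, matching on the section name is the safer choice.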