Read list of pages to parse from file

2019-09-23
2019-12-25
  • tholeunseen

    tholeunseen - 2019-09-23

    Hello,

    I stumbled upon this very interesting tool while trying to solve a problem.
    Basically, I have a text file containing 2 columns: column (1) contains a URL, column (2) a token to search for.

    I would like to parse the source of the webpage specified in column (1) for URLs that contain the token specified in column (2) and, if any are found, output them to a file.
    Is this possible with this tool? I could not see an option to read the list of webpages from a file.

    To make it simpler, I could skip reading the token from column (2), keep only the list of URLs in column (1), and pass the token or regex on the command line.

    Thanks for any hint!

     
  • Reino

    Reino - 2019-09-27

    Could you post an example of this text-file?

     
    • tholeunseen

      tholeunseen - 2019-09-27

      Sure,

      the file is a csv file that looks like this:

      urltoparse,tokentoparse
      http://intranet/readingarea/pages/service/personalpages/D41952.aspx,dappl
      http://intranet/readingarea/pages/service/personalpages/D41959.aspx,dappl
      

      I was hoping that Xidel could connect to and parse the source code of each webpage in the "urltoparse" column, find in that webpage all URL references that contain the string in "tokentoparse", and output the entire URL.

      I can also use a solution where the file only contains URLs:

      urltoparse,
      http://intranet/readingarea/pages/service/personalpages/D41952.aspx
      http://intranet/readingarea/pages/service/personalpages/D41959.aspx
      

      and the token to be searched for is passed on the command line as a regular expression.

      The whole idea is to be able to parse a long list of webpages and, for each of them, output a list of URLs containing the token. So the output should be something like:

      parsedurl,matchedurl
      http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img1.jpeg
      http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img2.jpeg
      http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img3.jpeg
      

      and so on

      or one line per parsed URL, with the matched URLs separated by semicolons:

      parsedurl,matchedurls
      http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img1.jpeg;http://otherserver/images/dappl/img2.jpeg;http://otherserver/images/dappl/img3.jpeg
      

      Thanks!

       
  • Reino

    Reino - 2019-09-28

    There are different ways to accomplish this, but this is how I would do it.
    I'm going to assume 'input.csv':

    urltoparse,tokentoparse
    http://intranet/readingarea/pages/service/personalpages/D41952.aspx,dappl
    http://intranet/readingarea/pages/service/personalpages/D41959.aspx,dappl
    

    I'm also going to assume the URLs with the tokens are inside an <img src="*.jpg"> node.

    To ignore the first line in 'input.csv' I'd create a sequence of every new line and select line #2 and onward:

    xidel -s input.csv -e "x:lines($raw)[position()>1]"
    http://intranet/readingarea/pages/service/personalpages/D41952.aspx,dappl
    http://intranet/readingarea/pages/service/personalpages/D41959.aspx,dappl
    

    (x:lines($raw) is a shorthand for tokenize($raw,'\r\n?|\n'))
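
    For readers who want to see what that line-splitting pattern does outside Xidel, here is a minimal Python sketch of the same idea. It is only an illustration, not part of Xidel: the `split_lines` helper and the inline sample text (which stands in for 'input.csv') are assumptions made for this example.

```python
import re

# Sample text standing in for 'input.csv'; mixed line endings on purpose.
raw = (
    "urltoparse,tokentoparse\r\n"
    "http://intranet/readingarea/pages/service/personalpages/D41952.aspx,dappl\n"
    "http://intranet/readingarea/pages/service/personalpages/D41959.aspx,dappl"
)


def split_lines(text):
    # Same pattern as in the post: CR LF, a lone CR, or a lone LF ends a line.
    return re.split(r"\r\n?|\n", text)


# Skip the header line, the analogue of the [position()>1] predicate.
data_lines = split_lines(raw)[1:]
for line in data_lines:
    print(line)
```

    The slice `[1:]` plays the role of the position filter: it keeps everything from the second line onward.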

    For every line, tokenize on the comma and put the resulting sequence in a variable:

    xidel -s input.csv --xquery "for $x in x:lines($raw)[position()>1] let $a:=tokenize($x,',') return ($a[1],$a[2])"
    http://intranet/readingarea/pages/service/personalpages/D41952.aspx
    dappl
    http://intranet/readingarea/pages/service/personalpages/D41959.aspx
    dappl
    

    (From this query onward --xquery is required instead of -e)

    So the first element, $a[1], contains the url and the second element, $a[2], contains the token.
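
    The same split-and-pick step can be sketched in Python, purely as an illustration. The `Row` type and `parse_row` function are hypothetical names for this example; a missing second column is mapped to an empty token here, which is a choice made for the sketch.

```python
from typing import NamedTuple


class Row(NamedTuple):
    url: str
    token: str


def parse_row(line: str) -> Row:
    # Split on every comma, then take the first two items,
    # like $a[1] and $a[2] in the query above.
    parts = line.split(",")
    token = parts[1] if len(parts) > 1 else ""
    return Row(url=parts[0], token=token)


row = parse_row(
    "http://intranet/readingarea/pages/service/personalpages/D41952.aspx,dappl"
)
print(row.url)
print(row.token)
```

    One caveat of splitting on every comma: a URL that itself contains a comma would be cut short, so this approach assumes comma-free URLs like the ones in 'input.csv'.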

    Next open/parse the url and extract the image-urls containing the token:

    xidel -s input.csv --xquery "for $x in x:lines($raw)[position()>1] let $a:=tokenize($x,',') return doc($a[1])//img[contains(@src,$a[2])]/@src"
    http://otherserver/images/dappl/img1.jpeg
    http://otherserver/images/dappl/img2.jpeg
    http://otherserver/images/dappl/img3.jpeg
    [...]
    

    Finally join the image-urls separated by a ;, prepend the url and start the output with the string parsedurl,matchedurl:

    xidel -s input.csv --xquery "'parsedurl,matchedurl',for $x in x:lines($raw)[position()>1] let $a:=tokenize($x,',') return concat($a[1],',',join(doc($a[1])//img[contains(@src,$a[2])]/@src,';'))"
    parsedurl,matchedurl
    http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img1.jpeg;http://otherserver/images/dappl/img2.jpeg;http://otherserver/images/dappl/img3.jpeg
    http://intranet/readingarea/pages/service/personalpages/D41959.aspx,[...]
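
    As a rough cross-check of the final query's logic, the same steps (collect the src of each img, keep those containing the token, join with ;, prepend the page URL) can be sketched in standard-library Python. Everything named here is an assumption for illustration: `ImgSrcCollector`, `matching_images`, `report`, and the `SAMPLE_PAGES` dictionary, which holds made-up HTML in place of a real download.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def matching_images(html, token):
    # Roughly //img[contains(@src, token)]/@src
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return [src for src in collector.sources if token in src]


def report(rows, fetch):
    # 'fetch' maps a URL to its HTML text; header line first, then one line per page.
    lines = ["parsedurl,matchedurl"]
    for url, token in rows:
        lines.append(url + "," + ";".join(matching_images(fetch(url), token)))
    return lines


# Made-up page content, standing in for the intranet page.
PAGE = "http://intranet/readingarea/pages/service/personalpages/D41952.aspx"
SAMPLE_PAGES = {
    PAGE: (
        "<html><body>"
        '<img src="http://otherserver/images/dappl/img1.jpeg">'
        '<img src="http://otherserver/images/other/logo.png">'
        '<img src="http://otherserver/images/dappl/img2.jpeg"/>'
        "</body></html>"
    ),
}

rows = [(PAGE, "dappl")]
for line in report(rows, SAMPLE_PAGES.__getitem__):
    print(line)
```

    Passing the download step in as `fetch` keeps the sketch runnable offline; for real pages it could be swapped for a small function built on `urllib.request.urlopen`.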
    
     
    • tholeunseen

      tholeunseen - 2019-09-30

      Hello Reino,

      Thanks a million for your suggestion. I am speechless. I will try it today.

       
    • tholeunseen

      tholeunseen - 2019-10-01

      It worked perfectly,

      Thanks a lot!

       
  • Reino

    Reino - 2019-10-01

    That's good to hear!
    You're welcome.

     
  • Anonymous

    Anonymous - 2019-12-25

    xidel 0.9.8 does not work with MSYS2 on Windows: no results are returned. It does work with Windows cmd:

    Windows MSYS2:

    $ xidel https://google.com -e //title
    **** Retrieving (GET): https://google.com ****
    **** Processing: https://www.google.com/ ****
    

    Windows cmd:

    > xidel https://google.com -e //title
    **** Retrieving (GET): https://google.com ****
    **** Processing: https://www.google.com/ ****
    Google
    

    I can run almost any command in MSYS2 on Windows except xidel.

     
  • Reino

    Reino - 2019-12-25

    This discussion was already finished, so I suggest you start a new discussion, instead of hijacking this one.

     
    • Anonymous

      Anonymous - 2019-12-25

      Thanks

       
