Xidel / Discussion / Help: Read list of pages to parse from file

tholeunseen - 2019-09-23

Hello,

I have stumbled upon this very interesting tool while trying to solve a problem.
Basically I have a text file containing 2 columns: column (1) contains a url, column (2) a token to search for.

I would like to parse the source of the webpage specified in column (1) for urls that contain the token specified in column (2) and, if any are found, output those urls to a file.
Is this possible with this tool? I could not see an option to read the list of webpages from a file.

To make it simpler, I could skip reading the token from column (2), keep only the list of urls in column (1), and pass the token or regex on the command line.

Thanks for any hint!

Reino - 2019-09-27

Could you post an example of this text-file?

- tholeunseen - 2019-09-27
  
  Sure,
  
  the file is a csv file that looks like this:
  
  urltoparse,tokentoparse
  http://intranet/readingarea/pages/service/personalpages/D41952.aspx,dappl
  http://intranet/readingarea/pages/service/personalpages/D41959.aspx,dappl
  
  I was hoping that Xidel could connect to each webpage in the "urltoparse" column, parse its source code, find all url references in that webpage that contain the string in "tokentoparse", and output the entire url.
  
  I can also use a solution where the file only contains url:
  
  urltoparse,
  http://intranet/readingarea/pages/service/personalpages/D41952.aspx
  http://intranet/readingarea/pages/service/personalpages/D41959.aspx
  
  and the token to be searched for is passed on the command line as a regular expression.
  
  The whole idea is to be able to parse a long list of webpages and, for each of them, output a list of urls containing the token. So the output should be something like:
  
  parsedurl,matchedurl
  http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img1.jpeg
  http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img2.jpeg
  http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img3.jpeg
  
  and so on
  
  or one line per parsed url, where the matched urls are separated by semicolons:
  
  parsedurl,matchedurls
  http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img1.jpeg;http://otherserver/images/dappl/img2.jpeg;http://otherserver/images/dappl/img3.jpeg
  
  Thanks!
  
Reino - 2019-09-28

There are different ways to accomplish this, but this is how I would do it.
I'm going to assume 'input.csv':

urltoparse,tokentoparse
http://intranet/readingarea/pages/service/personalpages/D41952.aspx,dappl
http://intranet/readingarea/pages/service/personalpages/D41959.aspx,dappl

I'm also going to assume the urls with the tokens are inside a <img src="*.jpg"> node.

To ignore the first line in 'input.csv' I'd create a sequence with one item per line and select line #2 and onward:

xidel -s input.csv -e "x:lines($raw)[position()>1]"
http://intranet/readingarea/pages/service/personalpages/D41952.aspx,dappl
http://intranet/readingarea/pages/service/personalpages/D41959.aspx,dappl

(x:lines($raw) is a shorthand for tokenize($raw,'\r\n?|\n'))
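
For readers who want to see the same step outside Xidel, here is a rough Python sketch of the line splitting. It only mirrors the regular expression quoted above; the function name data_lines is made up for this illustration, and dropping empty lines is an extra choice that the Xidel query does not necessarily make.

```python
import re


def data_lines(raw):
    """Split raw text on CRLF, CR or LF and drop the first (header) line."""
    lines = re.split(r"\r\n?|\n", raw)
    # XPath positions are 1-based, so [position()>1] corresponds to [1:] here.
    # Filtering out empty lines is an assumption made for this sketch only.
    return [line for line in lines[1:] if line]
```

Called on the contents of 'input.csv', this should return the two data lines without the header.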

For every line, tokenize on the , and put that sequence in a variable:

xidel -s input.csv --xquery "for $x in x:lines($raw)[position()>1] let $a:=tokenize($x,',') return ($a[1],$a[2])"
http://intranet/readingarea/pages/service/personalpages/D41952.aspx
dappl
http://intranet/readingarea/pages/service/personalpages/D41959.aspx
dappl

(From this query onward --xquery is required instead of -e)

So the first element, $a[1], contains the url and the second element, $a[2], contains the token.

Next open/parse the url and extract the image-urls containing the token:

xidel -s input.csv --xquery "for $x in x:lines($raw)[position()>1] let $a:=tokenize($x,',') return doc($a[1])//img[contains(@src,$a[2])]/@src"
http://otherserver/images/dappl/img1.jpeg
http://otherserver/images/dappl/img2.jpeg
http://otherserver/images/dappl/img3.jpeg
[...]

Finally join the image-urls separated by a ;, prepend the url and start the output with the string parsedurl,matchedurl:

xidel -s input.csv --xquery "'parsedurl,matchedurl',for $x in x:lines($raw)[position()>1] let $a:=tokenize($x,',') return concat($a[1],',',join(doc($a[1])//img[contains(@src,$a[2])]/@src,';'))"
parsedurl,matchedurl
http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img1.jpeg;http://otherserver/images/dappl/img2.jpeg;http://otherserver/images/dappl/img3.jpeg
http://intranet/readingarea/pages/service/personalpages/D41959.aspx,[...]
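
As a cross-check of the logic in that final query, the same steps can be sketched in plain Python. This is not how Xidel works internally, just an illustration under a few assumptions: the names ImgSrcCollector and build_report are hypothetical, the page source is supplied by a caller-provided fetch function instead of being downloaded, and every data line is assumed to contain exactly one comma.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> element in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def build_report(csv_text, fetch):
    """Return the 'parsedurl,matchedurl' report as one string.

    fetch is a hypothetical callable that maps a url to its HTML source.
    """
    rows = ["parsedurl,matchedurl"]
    # Skip the header line, as [position()>1] does in the query.
    for line in re.split(r"\r\n?|\n", csv_text)[1:]:
        if not line.strip():
            continue  # ignore blank lines (a choice made for this sketch)
        url, token = line.split(",")[:2]
        collector = ImgSrcCollector()
        collector.feed(fetch(url))
        # Keep only the image urls that contain the token.
        matches = [src for src in collector.sources if token in src]
        rows.append(url + "," + ";".join(matches))
    return "\n".join(rows)
```

In real use, fetch could be a small wrapper around urllib.request.urlopen; passing it in as a parameter keeps the sketch testable without network access.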

- tholeunseen - 2019-09-30
  
  Hello Reino,
  
  Thanks a million for your suggestion. I am speechless. I will try it today.
  
- tholeunseen - 2019-10-01
  
  It worked perfectly,
  
  Thanks a lot!
  
Reino - 2019-10-01

That's good to hear!
You're welcome.

Anonymous - 2019-12-25

xidel 0.9.8 does not work with MSYS2 on Windows (no results are returned), but it works with Windows cmd:

Windows MSYS2:

$ xidel https://google.com -e //title
**** Retrieving (GET): https://google.com ****
**** Processing: https://www.google.com/ ****

Windows cmd:

> xidel https://google.com -e //title
**** Retrieving (GET): https://google.com ****
**** Processing: https://www.google.com/ ****
Google

I can run almost any command in MSYS2 on Windows except xidel.

Reino - 2019-12-25

This discussion was already finished, so I suggest you start a new discussion, instead of hijacking this one.

- Anonymous - 2019-12-25
  
  Thanks
  
Xidel is a cli webpage scraping tool supporting XPath/XQuery 3 and CSS