(x:lines($raw) is shorthand for tokenize($raw,'\r\n?|\n').)
For every line, tokenize on the , and put that sequence in a variable:

    xidel -s input.csv --xquery "for $x in x:lines($raw)[position()>1] let $a:=tokenize($x,',') return ($a[1],$a[2])"
    http://intranet/readingarea/pages/service/personalpages/D41952.aspx
    dappl
    http://intranet/readingarea/pages/service/personalpages/D41959.aspx
    dappl
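For readers who want to see the splitting logic outside Xidel, here is a minimal Python sketch of the same two steps: break the raw text into lines with the pattern quoted above, skip the header line, then split each line on the comma. This is an analogue, not Xidel itself. The function name `parse_rows` and the sample urls are made up for illustration, and skipping blank or short lines is an extra assumption the XQuery above does not make.

```python
import re


def parse_rows(raw):
    """Return (url, token) pairs from raw CSV text, skipping the header line."""
    # Same pattern the thread gives for x:lines: \r\n, a lone \r, or \n.
    lines = re.split(r"\r\n?|\n", raw)
    rows = []
    for line in lines[1:]:  # comparable to [position()>1]
        cols = line.split(",")  # comparable to tokenize($x, ',')
        if len(cols) < 2:
            continue  # assumption: ignore blank or incomplete lines
        rows.append((cols[0], cols[1]))
    return rows


# Hypothetical sample data; the header names are the ones mentioned in the thread.
sample = (
    "urltoparse,tokentoparse\r\n"
    "http://example.com/a.html,dappl\n"
    "http://example.com/b.html,other\n"
)
for url, token in parse_rows(sample):
    print(url, token)
```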
(From this query onward, --xquery is required instead of -e.)
So the first element, $a[1], contains the url and the second element, $a[2], contains the token.
Next, open and parse the url and extract the image urls containing the token:

    xidel -s input.csv --xquery "for $x in x:lines($raw)[position()>1] let $a:=tokenize($x,',') return doc($a[1])//img[contains(@src,$a[2])]/@src"
    http://otherserver/images/dappl/img1.jpeg
    http://otherserver/images/dappl/img2.jpeg
    http://otherserver/images/dappl/img3.jpeg
    [...]
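The filter in that query, //img[contains(@src,$a[2])]/@src, keeps the src of every img element whose src contains the token. A rough Python equivalent using only the standard library might look like the sketch below. It works on an HTML string rather than fetching a page, and the names `ImgSrcCollector` and `matching_img_srcs` as well as the sample markup are assumptions for illustration only.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> whose src contains a token."""

    def __init__(self, token):
        super().__init__()
        self.token = token
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and self.token in src:
            self.matches.append(src)


def matching_img_srcs(html, token):
    collector = ImgSrcCollector(token)
    collector.feed(html)
    collector.close()
    return collector.matches


# Hypothetical page source for demonstration.
page = """
<html><body>
<img src="http://otherserver/images/dappl/img1.jpeg">
<img src="http://otherserver/images/other/img9.jpeg"/>
<img alt="no source">
<img src="http://otherserver/images/dappl/img2.jpeg">
</body></html>
"""
for src in matching_img_srcs(page, "dappl"):
    print(src)
```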
Finally, join the image urls separated by a ;, prepend the url, and start the output with the string parsedurl,matchedurl:

    xidel -s input.csv --xquery "'parsedurl,matchedurl',for $x in x:lines($raw)[position()>1] let $a:=tokenize($x,',') return concat($a[1],',',join(doc($a[1])//img[contains(@src,$a[2])]/@src,';'))"
    parsedurl,matchedurl
    http://intranet/readingarea/pages/service/personalpages/D41952.aspx,http://otherserver/images/dappl/img1.jpeg;http://otherserver/images/dappl/img2.jpeg;http://otherserver/images/dappl/img3.jpeg
    http://intranet/readingarea/pages/service/personalpages/D41959.aspx,[...]
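The shape of that final output (a header line, then one line per parsed url with the matches joined by semicolons) can be sketched in Python as follows. The function name `format_report` and the input structure, a list of (page url, list of matched urls) pairs, are assumptions chosen for this illustration.

```python
def format_report(results):
    """Build the report text: header first, then one line per parsed url."""
    lines = ["parsedurl,matchedurl"]
    for page_url, matched in results:
        # Comparable to concat($a[1], ',', join(..., ';')) in the query.
        lines.append(page_url + "," + ";".join(matched))
    return "\n".join(lines)


# Hypothetical results for demonstration.
results = [
    ("http://example.com/a.html", ["http://img.example.com/dappl/1.jpeg",
                                   "http://img.example.com/dappl/2.jpeg"]),
    ("http://example.com/b.html", []),
]
print(format_report(results))
```

A page with no matches still gets a line, with an empty second column, which mirrors what concat and join would produce for an empty sequence.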
Hello,
I stumbled upon this very interesting tool while trying to solve a problem.
Basically, I have a text file containing 2 columns: column (1) contains a url, column (2) a token to search for.
I would like to parse the source of the webpage specified in column (1) for urls that contain the token specified in column (2) and, if any are found, output those urls to a file.
Is this possible with this tool? I could not see an option to read the list of webpages from a file.
To make it simpler, I could skip reading the token from column (2), have only the list of urls in column (1), and pass the token or regex on the command line.
Thanks for any hint!
Could you post an example of this text-file?
Sure,
the file is a csv file that looks like this:
I was hoping that Xidel could connect to each webpage in the "urltoparse" column, parse its source code, find all url references in that webpage that contain the string in "tokentoparse", and output each entire url.
I can also use a solution where the file only contains urls:
and the token to be searched for is passed on the command line as a regex.
The whole idea is to be able to parse a long list of webpages and, for each of them, output a list of urls containing the token. So the output should be something like:
and so on
or one line per parsed url where the matched urls are separated by semicolon
Thanks!
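As a side note on the simpler variant mentioned in this post, where the file holds only urls and the token is given as a regex, the matching step could be sketched in Python like this. It only shows the filtering of already extracted urls. The function name `urls_matching` and the sample values are hypothetical, and this is not a Xidel feature.

```python
import re


def urls_matching(urls, pattern):
    """Keep the urls in which the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [u for u in urls if regex.search(u)]


# Hypothetical urls found on one page.
found = [
    "http://otherserver/images/dappl/img1.jpeg",
    "http://otherserver/images/other/img2.jpeg",
]
for url in urls_matching(found, r"/dappl/"):
    print(url)
```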
There are different ways to accomplish this, but this is how I would do it.
I'm going to assume 'input.csv':
I'm also going to assume the urls with the tokens are inside an <img src="*.jpg"> node.
To ignore the first line in 'input.csv', I'd create a sequence of every line and select line #2 and onward:
(x:lines($raw) is shorthand for tokenize($raw,'\r\n?|\n').)
For every line, tokenize on the , and put that sequence in a variable:
(From this query onward, --xquery is required instead of -e.)
So the first element, $a[1], contains the url and the second element, $a[2], contains the token.
Next, open and parse the url and extract the image urls containing the token:
Finally, join the image urls separated by a ;, prepend the url, and start the output with the string parsedurl,matchedurl:
Hello Reino,
Thanks a million for your suggestion. I am speechless. I will try it today.
It worked perfectly,
Thanks a lot!
That's good to hear!
You're welcome.
xidel 0.9.8 does not work with MSYS2 on Windows (no results are returned), but it works with Windows cmd:
Windows MSYS2:
Windows cmd:
I can run almost any command in MSYS2 on Windows except xidel.
This discussion was already finished, so I suggest you start a new discussion, instead of hijacking this one.
Thanks