Is it hard to implement XPath support in the Content tab? I'm asking as a noob, because I really don't know. Let me explain:
I'm trying to watch the content in the blue square in the screenshot (website: http://www.fumarc.com.br/concursos/detalhe/escrivao-de-policia-i-/138#aba-Pub).
If I use the HTML finder, I need to put in as the start tag:
<div class="aba-concursos-internas">
and the end tag should be:
</div>
Problem: between the desired start tag and the end tag there are multiple
</div>
occurrences. I'm trying to watch for any change in the whole square (any new "topic" or link).
If I used XPath (as given by Chrome/Edge) it would be:
//*[@id="site"]/div/div[3]/div[1]/div[2]/div[2]
jsPath:
document.querySelector("#site > div > div.main > div.content-internas > div.bloco-info-int > div.aba-concursos-internas")
and Full Xpath:
/html/body/div[1]/div/div[3]/div[1]/div[2]/div[2]
Those "paths" would be much easier to use (since you can get them from any common browser) and would be more precise. I don't know if it's easier to implement (my guess: it isn't).
Anyway, can anyone give me a solution for my problem for now? I'm currently using a full-page check, since that seems to be the only way, but other websites might give me problems.
Thank you for your work.
Last edit: GABRIEL OLIVEIRA DINIZ 2022-01-05
In your case I can capture the content with these two tags:
<h2>Documentação do concurso</h2>
...and:
<!--Links-->
...or:
<div class="aba-concursos-internas">
...and:
<div id="sidebar-internas">
You don't need to exactly match the inner content to track changes, did you consider that?
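The idea above - anchoring on two unique surrounding markers instead of parsing the HTML - can be sketched like this (a minimal illustration only, not WCM's actual implementation):

```cpp
#include <string>

// Minimal illustration: grab whatever sits between two unique marker
// strings. The markers (taken from the page's own HTML) do not need to
// be syntactically meaningful - they only need to be unique on the page.
std::string between(const std::string& page,
                    const std::string& start,
                    const std::string& end) {
    const std::size_t s = page.find(start);
    if (s == std::string::npos) return {};
    const std::size_t from = s + start.size();
    const std::size_t e = page.find(end, from);
    if (e == std::string::npos) return {};
    return page.substr(from, e - from);
}
```

For example, `between(page, "<h2>Documentação do concurso</h2>", "<!--Links-->")` would yield the watched block even though the inner content keeps changing.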
Regarding your second question:
Currently, WCM was deliberately designed not to need to know about the language it parses (HTML/XML or JSON, for example). This keeps the complexity low. But I agree that there are cases that would be easier to specify with XPath. On the other hand, most users don't know XPath, so it would be for professionals only, and with the many filter options you have, you can easily break the syntax of the underlying language so that a parser could no longer understand the content.
However, you are not the first one to ask for a parser. So I'll see what I can do and whether there are tiny XML libraries that support XPath or something similar. I had to work with libxml2 a lot, and it (for example) is too fat to be considered. I like TinyXML2, but it comes without XPath. So if you have a suggestion for a well-supported but tiny library, let me know...
Anonymous - 2022-01-07
You don't need to exactly match the inner content to track changes, did you consider that?
No, I did not, and that makes a lot of sense. Thank you, my friend, for the help; now I get how it works. If I capture the content BETWEEN two other points that are not exactly what I'm looking for, I'm still watching over what I want.
That was a lot of help; sorry for all the trouble.
I just found out it's hard on forums like SourceForge, since things like *11 hours ago* make it difficult to track (that text changes every hour, even without an actual update).
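One way around such volatile fragments (a hypothetical pre-filter sketch, not a built-in WCM feature) is to normalize relative timestamps before comparing two snapshots of the captured content:

```cpp
#include <regex>
#include <string>

// Hypothetical pre-filter (not a built-in WCM feature): replace relative
// timestamps such as "11 hours ago" with a fixed placeholder so that text
// which changes every hour does not trigger a false "page changed" alert.
std::string strip_relative_times(const std::string& text) {
    static const std::regex relative(
        R"(\b\d+\s+(second|minute|hour|day)s?\s+ago\b)");
    return std::regex_replace(text, relative, "<time>");
}
```

Two snapshots that differ only in such timestamps then compare equal after filtering.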
I was digging into it a little and found pugixml, which would actually be fine: a light-weight XML parser with XPath support. But now comes the actual issue: HTML is not valid XML. So regular XML parsers (including pugixml) will fail, because many opening tags do not have a closing one (e.g. meta).
So what would be needed is either a fully compliant HTML parser with XPath or a converter that turns HTML into XHTML on the fly. The latter is expensive for large home pages - and there are many. So as you can see: it's at least not that easy...
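To illustrate why the on-the-fly conversion is more involved than it sounds, here is a deliberately naive sketch that handles only the unclosed-tag problem mentioned above (self-closing the HTML void elements such as meta); a real converter would need far more:

```cpp
#include <array>
#include <string>
#include <string_view>

// Deliberately naive sketch: self-close the HTML "void" elements (tags that
// never get a closing tag) so a strict XML parser no longer chokes on them.
// A real HTML->XHTML converter would also have to handle unquoted attribute
// values, implicitly closed elements (<p>, <li>, ...), <script> contents, etc.
std::string close_void_tags(std::string html) {
    static constexpr std::array<std::string_view, 6> kVoids =
        {"meta", "br", "hr", "img", "input", "link"};
    for (std::string_view name : kVoids) {
        const std::string open = "<" + std::string(name);
        std::size_t pos = 0;
        while ((pos = html.find(open, pos)) != std::string::npos) {
            const std::size_t after = pos + open.size();
            const char next = after < html.size() ? html[after] : '\0';
            if (next != ' ' && next != '>' && next != '/') {
                pos = after;                  // e.g. "<metadata", not a match
                continue;
            }
            const std::size_t gt = html.find('>', pos);
            if (gt == std::string::npos) break;
            if (html[gt - 1] != '/')          // not already self-closed
                html.insert(gt, "/");         // <meta ...> becomes <meta .../>
            pos = gt + 1;
        }
    }
    return html;
}
```

Even this toy version needs care about partial tag-name matches and already self-closed tags, which hints at the real cost of doing it correctly for arbitrary pages.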
If you still need such functionality, please open a feature request and point to this thread in the forums. But maybe the hint above is enough for you...
Anonymous - 2022-01-07
I really can't trouble you this much; your help was enough to make me understand how to do it.