WCM has many options to check the content of items and offers powerful filter functions. For clarification, the order in which they are applied is explained here:
1.) apply start/stop tag (i.e. remove content before the start tag and after the stop tag)
2.) apply one of: RegEx/XPath/JSON as a filter to assess a specific sub-set of the content
3.) apply ignore patterns as set up in the options (note: you can set this up differently per item, i.e. apply all, some or no ignore patterns at all)
4.) calculate the CRC checksum on which the check result is based, i.e. whether the content is updated or not
5.) run the interpreter (special case) to actually convert the content into some number or string you can compare to a reference
Modifying and/or filtering the content of items is applied only if it is set up/enabled in the item options! So the actual procedure might only be a sub-set of the above, including no modification/filtering at all.
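The order described above can be illustrated with a small Python sketch. This is not WCM's actual code: the function name `checksum_of_item`, its parameters and the sample strings are assumptions made up for illustration, and the interpreter step is left out. It only shows why content removed by an earlier step cannot influence the checksum.

```python
import re
import zlib

def checksum_of_item(content, start_tag=None, stop_tag=None,
                     filter_regex=None, ignore_patterns=()):
    """Apply the optional steps in the order listed above and return a CRC32.

    Hypothetical helper for illustration only, not part of WCM.
    """
    # Start/stop tag: keep only what lies between the two tags.
    if start_tag and start_tag in content:
        content = content.split(start_tag, 1)[1]
    if stop_tag and stop_tag in content:
        content = content.split(stop_tag, 1)[0]
    # Filter: reduce the content to the parts matched by the RegEx.
    if filter_regex:
        content = "\n".join(m.group(0) for m in re.finditer(filter_regex, content))
    # Ignore patterns: remove everything they match.
    for pattern in ignore_patterns:
        content = re.sub(pattern, "", content)
    # Checksum: this value decides whether the item counts as changed.
    return zlib.crc32(content.encode("utf-8"))

# Made-up sample content: only the issue date differs between the two checks.
ignores = [r"Issued: \w+ \d+"]
old = checksum_of_item("<b>Issued: Feb 29</b> Kp 2.67", ignore_patterns=ignores)
new = checksum_of_item("<b>Issued: Mar 01</b> Kp 2.67", ignore_patterns=ignores)
print("changed" if old != new else "unchanged")
```

Since every step is optional, calling the helper with no options simply checksums the raw content, which corresponds to "no modification/filtering at all".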
Since release 23.05 there is an option to apply the ignore/replace patterns two times. Therefore, the order changes as follows:
1.) apply start/stop tag (i.e. remove content before the start tag and after the stop tag)
2.) apply ignore patterns as set up in the options (note: you can set this up differently per item, i.e. apply all, some or no ignore patterns at all), if this option is enabled
3.) apply one of: RegEx/XPath/JSON as a filter to assess a specific sub-set of the content
4.) apply ignore patterns as set up in the options (note: you can set this up differently per item, i.e. apply all, some or no ignore patterns at all), if this option is enabled (which is the default)
5.) calculate the CRC checksum on which the check result is based, i.e. whether the content is updated or not
6.) run the interpreter (special case) to actually convert the content into some number or string you can compare to a reference
When is the post-process executed? What I want is a post-process that runs somewhere before step 6 (2023-05-21) and may modify the downloaded data before the comparison of %old to %new. How can I achieve that?
I am sorry about the late response - somehow I did not get notified about this post.
Interesting question: So far, you can't. The post-process (as the name suggests) is intended to be run after the information about the state of the item is computed.
It may be an interesting feature you request, though...
...however, may I ask what you would do to the content in between? Because WCM has very powerful capabilities to filter/parse the content which actually might already do what you have in mind.
A general issue I see with calling out of WCM to modify the content is that this will massively impact the speed of checking items, and parallelism will be slowed down. It's not impossible though...
There are multiple sites for which I could use this feature. Here is a simple example: URL https://services.swpc.noaa.gov/text/3-day-geomag-forecast.txt produces a text output that includes a 3-day forecast of geomagnetic storms. Before saving as .new and comparing .new vs .old, I would like most items stripped with the following "sed" script
For this site there is a lot of data that reflects the date, which I don't want to trigger as a change. Also, I don't want a change notice if the geomagnetic Kp index is less than 5.
Original content of URL
:Product: Geomagnetic Forecast
:Issued: 2024 Feb 29 2205 UTC
# Prepared by the U.S. Dept. of Commerce, NOAA, Space Weather Prediction Center
#
NOAA Ap Index Forecast
Observed Ap 28 Feb 006
Estimated Ap 29 Feb 008
Predicted Ap 01 Mar-03 Mar 008-012-012
NOAA Geomagnetic Activity Probabilities 01 Mar-03 Mar
Active 15/35/35
Minor storm 01/15/15
Moderate storm 01/01/01
Strong-Extreme storm 01/01/01
NOAA Kp index forecast 01 Mar - 03 Mar
Mar 01 Mar 02 Mar 03
00-03UT 1.67 2.67 3.67
03-06UT 1.67 2.00 3.67
06-09UT 2.67 2.00 3.00
09-12UT 1.67 2.33 2.67
12-15UT 2.67 2.33 2.00
15-18UT 2.00 2.33 1.67
18-21UT 1.67 3.33 1.67
21-00UT 2.67 3.67 2.00
Maybe this should be considered a "preprocess" feature. I thought of post process as after web download and before comparison. No wonder it doesn't work. What is the value of post process then?
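The filtering asked for in this post (ignore the dates, only react to Kp values of 5 or more) can be sketched outside WCM in Python. This is not the sed script referred to above, which is not reproduced here; the function `kp_alerts`, the row pattern and the sample text with its 5.33 value are assumptions made up for illustration.

```python
import re

# A table row starts with a time slot such as "03-06UT" (assumed layout).
ROW = re.compile(r"^\s*(\d{2}-\d{2}UT)\s+(.*)$")

def kp_alerts(text, threshold=5.0):
    """Return (time slot, day column, value) for every Kp value >= threshold."""
    alerts = []
    for line in text.splitlines():
        row = ROW.match(line)
        if not row:
            continue  # header, date and probability lines are skipped
        values = re.findall(r"\d+\.\d+", row.group(2))
        for day, value in enumerate(values, start=1):
            if float(value) >= threshold:
                alerts.append((row.group(1), day, float(value)))
    return alerts

# Made-up sample in the style of the forecast file; 5.33 is not real data.
sample = """NOAA Kp index forecast 01 Mar - 03 Mar
Mar 01 Mar 02 Mar 03
00-03UT 1.67 2.67 3.67
03-06UT 1.67 5.33 3.67
"""
print(kp_alerts(sample))
```

An empty result means nothing worth a notification, regardless of how the dates or the lower values changed.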
So maybe for clarification: To detect changes of a webpage (in default download mode), WCM downloads the page, then applies several filters (if set up) and then calculates a CRC based on the remaining content. If this CRC has changed, the item is considered changed. Before the post-process runs, the content of the previous check and the content of the current check are saved to files so that they can be used by the tools that are called. One example is sending an email with both files (or a diff) as an attachment.
What you want to do is actually possible, however, you would need to make the diff yourself. Here is how it should go:
1.) Create a batch file for post processing
2.) Call the batch file on a detected change, passing both files to the call
3.) Process both files as you wish
4.) Compute the diff, e.g. using a diff tool
5.) React accordingly.
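Steps 3 and 4 of this list could look roughly like the following Python sketch, which a batch file might call with the two saved files. The helper names, the row pattern and the sample texts are assumptions for illustration; how WCM passes the file names to the batch file is not shown here.

```python
import difflib
import re

def keep_table_rows(text):
    """Step 3: keep only the 'hh-hhUT ...' rows of a saved file."""
    return [line for line in text.splitlines()
            if re.match(r"\d{2}-\d{2}UT\b", line)]

def diff_processed(old_text, new_text):
    """Step 4: diff the processed versions of both files."""
    return list(difflib.unified_diff(keep_table_rows(old_text),
                                     keep_table_rows(new_text),
                                     fromfile="old", tofile="new",
                                     lineterm=""))

# Made-up contents of the previous and the current check:
# only the issue date differs, the table row is identical.
old_text = ":Issued: 2024 Feb 28 2205 UTC\n00-03UT 1.67 2.67 3.67\n"
new_text = ":Issued: 2024 Feb 29 2205 UTC\n00-03UT 1.67 2.67 3.67\n"

changes = diff_processed(old_text, new_text)
# Step 5: react only if the processed content differs.
print("relevant change" if changes else "nothing of interest changed")
```

In a real setup the two texts would be read from the files handed to the batch file rather than from string literals.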
Alternatively, which may be even simpler:
Set up the RegEx as you've used it in the sed command inside WCM to filter the downloaded content. Then let WCM compute the checksum based on the filtered result. In that case the checksum will only change if your "content of interest" has changed.
To do so: Edit the item, go to the "Content" tab, enable the RegEx filter, set it up as desired and inspect the result in the "RegEx result" tab below until it fits your needs.
Let me know if that is of help. Otherwise I can try myself and provide you with a setup accordingly.
...and not to forget: You can also set up multiple RegEx replaces / ignores in the options:
Menu "Tools"-"Configuration" - section "Ignores/Replace"
Make sure you've enabled applying ignores in the item settings and, if needed, selected these specific ones for just that item.
These will also be pre-processed before the item is checked for changes.
Your first suggestion of a batch file operation after a detected change means I'd be running it manually every day, completely negating the auto check design criteria of WCM.
Adding global Ignores/Replace would probably change data unexpectedly for some sites that are set as enabled. In other words targeting would benefit the section of most interest, but corrupt other data in that same webpage.
I am pretty competent with sed and grep. In this case I cannot seem to get a line-by-line regex operation in the item RegEx filter. It doesn't seem to support many regex features such as beginning-of-line, end-of-line, multi-patterns and search/replace (sed feature).
For instance the RegEx "^[0-2][0-8].*[0-9]" (without quotes) produces the following from the URL example:
Regular expression is not applicable ... no matches found
whereas grep with that command on a text file from 3/2/24 10 PM PST produces
So you likely have a great deal more experience with WCM than I. Perhaps you have the time to solve this URL for me and therefore I and others can learn how to configure WCM for other sites needing special handling.
From my original post, my sed script neatly throws out the geomagnetic data of any of the daily 7 time samples IF activity is less than 5.0. There are several sites in my list and some I have given up on that would greatly benefit from such "pre-processing" after download but before comparison.
Thank You
WRT the ignore/replace items: Of course, you can create sets of such items that only apply to dedicated items and not all in general. This should be explained in the settings. So still, this option might be what you are looking for.
However, keep in mind that WebChangeMonitor is there to detect changes - what you want to do is deep (numerical) data analysis on the content. This is not what WCM was primarily intended for. However, meanwhile quite a lot of the functionality offered goes in that direction and maybe you found a use-case that could be simplified with a new option:
Currently, there is the option for a Replace-RegEx which works globally on the content. What you need is a line-by-line RegEx (or multiple) to filter the content beforehand. There is also a global RegEx flag to enable RegEx to support line-feeds. I implemented that once on request but never really used it myself. This might already work for you, but an easier way would be to support a line-by-line RegEx operation, which would actually be easy to implement. Would that solve your issue?
BTW: The RegEx has PCRE syntax, and a RegEx that works for me on the content is e.g.: "[0-9]+\-[0-9]+[A-Z]{2}\s*[0-9+\.0-9+\s*]*"
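The difference between a RegEx applied to the whole content and one applied per line can be illustrated in Python. Note the assumptions: Python's `re` module is not PCRE and WCM's engine and flags may behave differently; the sample text is a shortened, made-up excerpt. The pattern is the `^`-anchored one from the earlier post.

```python
import re

text = """NOAA Kp index forecast 01 Mar - 03 Mar
00-03UT 1.67 2.67 3.67
09-12UT 1.67 2.33 2.67
21-00UT 2.67 3.67 2.00"""

pattern = r"^[0-2][0-8].*[0-9]"

# Applied to the whole content, ^ only matches at the very start of the text,
# which begins with "NOAA", so nothing is found.
whole = re.findall(pattern, text)

# With a multi-line flag, ^ matches at the start of every line.
per_line = re.findall(pattern, text, flags=re.MULTILINE)

# A line-by-line filter keeps each line that matches the pattern.
line_by_line = [line for line in text.splitlines() if re.search(pattern, line)]

print(whole)
print(per_line)
print(line_by_line)
```

The "09-12UT" row is dropped in both per-line variants because its second character, 9, is outside `[0-8]`.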
Last edit: Morten MacFly 2024-03-03
Although I think it is possible already with the options present, it is definitely not easy to set up.
So... in the meantime I've implemented a new feature that should do what you want:
It allows setting up a RegEx as a filter that is applied either to the whole content (and only those lines remain that match the RegEx) or even line-by-line.
I see you have done quite a bit in the new coding. I'd love to give it a try. Unfortunately, I am in some transitions and don't have appropriate compilers set up for Windows or Ubuntu. When you have an Alpha or Beta I'd be happy to give it a go.
I tried enabling the EOL and the RegEx you provided and entered the text and RegEx in the "test bed". The attachment shows my results.
Sure thing: I've uploaded an ALPHA version (32 and 64 bit in compatibility mode which should definitely work somehow, although it's not compiled for speed/performance).
But be careful: It's really just alpha - I've drafted the code and not yet tested it myself. Not sure whether this is going to work at all in this stage. But any feedback is welcome!
What I would recommend to try:
1.) Create and set up an "Ignore" in the options (Type: "RegEx Filter") with the pattern I've provided above and enable "line-by-line" (the pattern could be tuned later on, as it may not be perfect)
2.) Make an "Ignore group" that just contains this very item and give it a name
3.) For the address you provided to monitor: Enable "Apply ignores" in the item's options, then select "Select/setup ignores to apply" and enable just that ignore group and nothing else
This will result in only this single RegEx being applied for this single item out of all your items in the item list.
I don't have time today to try it myself, but if it works for you, that would be great feedback.
Thank you. I installed the fixed new 64-bit code, your RegEx Filter for ignores, applied RegEx line-by-line, gave it a group name, enabled the group name for ignores in the item config and made a run of just this item. The results contained in *.new are partially successful. The RegEx stripped all the stuff before the time-data set, but the columnar floating point values less than 5.0 remain. The hours in the time-data will stay the same tomorrow, but the data itself will change. I'd like to know when that data >= 5.0 for any of the 3 columns. Can that be done in this RegEx experiment?
Last edit: MMcVeigh 2024-03-05
So first of all, I did introduce a bug which I had to fix and which may affect you, too (that happens with an ALPHA version...).
The effect would be that some default values for ignores are set in a wrong way.
If you have many ignore patterns set up, you should check all of their settings in the options dialog for any wrong flags. If you spot any, the best way to fix this would be to export the ignore patterns into a CSV file, change them back to what they were before (using a spreadsheet app like Excel, for example) and import from CSV again.
WRT your question: I am not sure if you can do this in a single RegEx, but you could of course just implement a second one. Ignore / filter patterns are applied in the order of their index and as many as are applicable. The first pattern would reduce the content to the lines you see and the second one (again line-by-line) would look for any number above 5.0 and remove those below that value (if I got you right and that is what you want).
If you can't do that yourself, I would need to know which number exactly should be above 5.0. I don't really understand the content behind it, and there are many numbers in several columns/rows. Do you mean all of the numbers in all columns/rows must be above 5.0?
I will download and install Alpha3 x64. Thank you.
I just have the one global RegEx filter so I'll remove and retry plus add another for the test of removing any columnar value < 5.0. These columns are forecast data for the next 3 days. The row headings are GMT times of the day. I only want change notification if any GMT time on any of the 3 days is 5.0+.
This test is working now with Alpha3. Today's data from the website to check contains a couple of value entries > 4.0. First I created a RegEx Filter [0-9]+-[0-9]+[A-Z]{2}\s.+[[1-9].[0-9]{2}\s]* with group name "NOAA Geo RegEx Filter". Next I created a RegEx Replace [0-3].[0-9]{2} replaced with 4 spaces and group name "NOAA Geo RegEx Replace". I assigned these to my item and reran a "check now". Success! I got a table with the GMT row headers and only the two data points > 4.0. Now I will edit my RegEx Replace to only look for > 5.0.
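The two-stage idea (first a line-by-line filter, then a replace that blanks out low values) can be sketched in Python as follows. The patterns here are simplified variants written for this sketch with a 5.0 threshold, not the exact patterns quoted in the post, and Python's `re` may differ from WCM's PCRE engine; the sample rows are made up.

```python
import re

# Stage 1: keep only lines that start with a time slot such as "03-06UT".
ROW_FILTER = re.compile(r"^[0-9]+-[0-9]+[A-Z]{2}\s.*$")
# Stage 2: match values from 0.00 to 4.99 that stand on their own.
BELOW_5 = re.compile(r"(?<![\d.])[0-4]\.[0-9]{2}(?![\d.])")

def filter_then_replace(text):
    """Apply the filter first, then blank out every value below 5.0."""
    kept = [line for line in text.splitlines() if ROW_FILTER.match(line)]
    return [BELOW_5.sub("    ", line) for line in kept]

# Made-up rows; 5.33 is not real forecast data.
sample = "Mar 01 Mar 02 Mar 03\n00-03UT 1.67 2.67 3.67\n03-06UT 4.67 5.33 3.67"
result = filter_then_replace(sample)
for line in result:
    print(line)
```

As long as all values stay below 5.0, the remaining text consists only of the fixed row headers, so a checksum over it would not change from one day to the next.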
OK, pretty cool. So this feature will be officially introduced with the next release. Thank you for testing! If you see any deficits in the meantime, let me know. Currently, the next release is scheduled to be 24.04 - obviously sometime in April.
For instance the RegEx "^[0-2][0-8].*[0-9]" (without quotes) produces the following from the URL example:
Regular expression is not applicable ... no matches found
whereas grep with that command on a text file from 3/2/24 10 PM PST produces
00-03UT 3.67 2.67 1.67
03-06UT 3.67 3.00 1.33
06-09UT 2.67 2.33 1.33
12-15UT 1.67 2.00 1.33
15-18UT 1.67 2.00 1.33
18-21UT 2.67 2.33 1.67
21-00UT 2.67 2.67 1.67
You can follow or comment on the new RegEx filter feature here:
https://sourceforge.net/p/webchangemon/feature-requests/268/
Links to the ALPHA version:
32 bit Windows version:
https://sourceforge.net/projects/webchangemon/files/Windows/WebChangeMonitor_24_XX-32bit-Win7-NoOpt-ALPHA.zip/download
64 bit Windows version:
https://sourceforge.net/projects/webchangemon/files/Windows/WebChangeMonitor_24_XX-64bit-Win7-NoOpt-ALPHA.zip/download
OK, the first one had a bug that prevented creating an ignore of the type "RegEx Filter". Version ALPHA2 has that fixed:
New links:
32 bit Windows version:
https://sourceforge.net/projects/webchangemon/files/Windows/WebChangeMonitor_24_XX-32bit-Win7-NoOpt-ALPHA2.zip/download
64 bit Windows version:
https://sourceforge.net/projects/webchangemon/files/Windows/WebChangeMonitor_24_XX-64bit-Win7-NoOpt-ALPHA2.zip/download
( As I've said: No promises that in this stage the feature already works properly... ;-) )
Last edit: Morten MacFly 2024-03-03
Oops, here are the results from *.new:
00-03UT 1.67 1.67 1.67
03-06UT 1.33 1.33 1.33
06-09UT 1.33 1.33 1.33
09-12UT 1.33 1.33 1.33
12-15UT 1.33 1.33 1.33
15-18UT 1.33 1.33 1.33
18-21UT 1.67 1.67 1.67
21-00UT 1.67 1.67 1.67
Last edit: MMcVeigh 2024-03-05
Oh, btw: It would have been cool to have this conversation in a related ticket. Next time, if you miss a feature, you could also use the feature tracker:
https://sourceforge.net/p/webchangemon/feature-requests/
Actually, I thought I just didn't know how best to use the tool. Thank you for all your support. Much appreciated.