
Order of processing items and item content

2023-04-15
2024-03-15
  • Morten MacFly

    Morten MacFly - 2023-04-15

    WCM has many options to check the content of items and offers powerful filter functions. For clarification, the order in which they are applied is explained here:

    1. apply start/stop tag (i.e. remove content before the start tag and after the stop tag)
    2. apply one of: RegEx/XPath/JSON as a filter to assess a specific sub-set of the content
    3. apply ignore patterns as set up in the options (note: you can set this up differently per item, i.e. apply all, some or no ignore patterns at all)
    4. remove white-space (i.e. excessive white-space, multiple line-feeds)
    5. calculate the CRC checksum on which the check result is based, i.e. whether the content is updated or not
    6. run the interpreter > special case to actually convert the content into some number or string you can compare to a reference

    Modifying and/or filtering the content of items is applied only if it is set up/enabled in the item options! So the actual procedure might only be a sub-set of the above, including no modification/filtering at all.

     
  • Morten MacFly

    Morten MacFly - 2023-05-21

    Since release 23.05 there is an option to apply the ignore/replace patterns twice. Therefore, the order changes as follows:

    1. apply start/stop tag (i.e. remove content before the start tag and after the stop tag)
    2. apply ignore patterns as set up in the options (note: you can set this up differently per item, i.e. apply all, some or no ignore patterns at all), if this option is enabled
    3. apply one of: RegEx/XPath/JSON as a filter to assess a specific sub-set of the content
    4. apply ignore patterns as set up in the options (note: you can set this up differently per item, i.e. apply all, some or no ignore patterns at all), if this option is enabled (which is the default)
    5. remove white-space (i.e. excessive white-space, multiple line-feeds)
    6. calculate the CRC checksum on which the check result is based, i.e. whether the content is updated or not
    7. run the interpreter > special case to actually convert the content into some number or string you can compare to a reference
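The order above can be sketched roughly in Python. This is only an illustration of the sequence, not WCM's code: the function name `process_content`, all parameter names, the use of `zlib.crc32`, and the choice to drop the start/stop tags themselves are assumptions. The interpreter step is left out.

```python
import re
import zlib


def process_content(raw, start_tag=None, stop_tag=None, content_filter=None,
                    ignore_patterns=(), ignore_before_filter=False,
                    ignore_after_filter=True, strip_whitespace=True):
    """Return (processed_text, checksum), applying only the enabled steps."""
    text = raw
    # 1. Drop everything before the start tag and after the stop tag.
    if start_tag and start_tag in text:
        text = text.split(start_tag, 1)[1]
    if stop_tag and stop_tag in text:
        text = text.split(stop_tag, 1)[0]
    # 2. Optional early ignore pass (the option described for release 23.05).
    if ignore_before_filter:
        for pattern in ignore_patterns:
            text = re.sub(pattern, "", text)
    # 3. Content filter; a RegEx stands in for RegEx/XPath/JSON here.
    if content_filter:
        text = "\n".join(m.group(0) for m in re.finditer(content_filter, text))
    # 4. Regular ignore pass (enabled by default).
    if ignore_after_filter:
        for pattern in ignore_patterns:
            text = re.sub(pattern, "", text)
    # 5. Collapse runs of blanks and empty lines.
    if strip_whitespace:
        text = re.sub(r"[ \t]+", " ", text)
        text = re.sub(r"\n\s*\n+", "\n", text).strip()
    # 6. The checksum decides whether the item counts as changed.
    return text, zlib.crc32(text.encode("utf-8"))
```

Two downloads that differ only in an ignored date would then yield the same checksum, for example with `ignore_patterns=[r"\d{4}-\d{2}-\d{2}"]`.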
     
  • MMcVeigh

    MMcVeigh - 2024-02-19

    When is post process executed? What I want is a post process that runs somewhere before step 6 (2023-05-21) and may modify the downloaded data before %old is compared to %new. How can I achieve that?

     
    • Morten MacFly

      Morten MacFly - 2024-03-01

      I am sorry about the late response - somehow I did not get notified about this post.
      Interesting question: So far, you can't. The post-process (as the name suggests) is intended to be run after the information about the state of the item is computed.
      It may be an interesting feature you request, though...
      ...however, may I ask what you would do to the content in between? Because WCM has very powerful capabilities to filter/parse the content which actually might already do what you have in mind.
      A general issue I see with calling out of WCM to modify the content is that this will massively impact the speed of checking items and parallelism will be slowed down. It's not impossible though...

       
  • MMcVeigh

    MMcVeigh - 2024-03-01

    There are multiple sites for which I could use this feature. Here is a simple example: URL https://services.swpc.noaa.gov/text/3-day-geomag-forecast.txt produces a text output that includes a 3-day forecast of geomagnetic storms. Before saving as .new and comparing .new vs .old, I would like most items stripped with the following "sed" script:

    C:\apps\utils\sed.exe --in-place "s/^ *[A-Z:#].*//g;s/[1-4]\.[0-9][0-9]//g" 
    

    For this site there is a lot of data that reflects the date, which I don't want to trigger as a change. Also, I don't want a change notice if the geomagnetic Kp index is less than 5.
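For readers less used to sed, the two substitutions in the script above can be written out in Python as a rough equivalent. The function name `strip_like_sed` is made up for this sketch, and trimming trailing blanks is an addition that sed itself would not perform.

```python
import re

# The two sed substitutions, applied to every line in this order.
LINE_RULES = [
    # Blank out lines that start (after optional spaces) with A-Z, ':' or '#'.
    (re.compile(r"^ *[A-Z:#].*"), ""),
    # Remove values from 1.00 to 4.99; 0.xx and 5.00+ are left alone.
    (re.compile(r"[1-4]\.[0-9][0-9]"), ""),
]


def strip_like_sed(text):
    """Apply the substitutions line by line, as sed does."""
    result = []
    for line in text.split("\n"):
        for pattern, replacement in LINE_RULES:
            line = pattern.sub(replacement, line)
        # Not part of the sed script: drop trailing blanks for readability.
        result.append(line.rstrip())
    return "\n".join(result)
```

Note that the second pattern only covers values starting with 1 to 4, so a value such as 0.67 would survive the script.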

    Original content of URL

    :Product: Geomagnetic Forecast
    :Issued: 2024 Feb 29 2205 UTC
    # Prepared by the U.S. Dept. of Commerce, NOAA, Space Weather Prediction Center
    #
    NOAA Ap Index Forecast
    Observed Ap 28 Feb 006
    Estimated Ap 29 Feb 008
    Predicted Ap 01 Mar-03 Mar 008-012-012
    
    NOAA Geomagnetic Activity Probabilities 01 Mar-03 Mar
    Active                15/35/35
    Minor storm           01/15/15
    Moderate storm        01/01/01
    Strong-Extreme storm  01/01/01
    
    NOAA Kp index forecast 01 Mar - 03 Mar
                 Mar 01    Mar 02    Mar 03
    00-03UT        1.67      2.67      3.67      
    03-06UT        1.67      2.00      3.67      
    06-09UT        2.67      2.00      3.00      
    09-12UT        1.67      2.33      2.67      
    12-15UT        2.67      2.33      2.00      
    15-18UT        2.00      2.33      1.67      
    18-21UT        1.67      3.33      1.67      
    21-00UT        2.67      3.67      2.00      
    

    Processed URL output

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    00-03UT
    03-06UT
    06-09UT
    09-12UT
    12-15UT
    15-18UT
    18-21UT
    21-00UT
    
     
  • MMcVeigh

    MMcVeigh - 2024-03-01

    Maybe this should be considered a "preprocess" feature. I thought of post process as after web download and before comparison. No wonder it doesn't work. What is the value of post process then?

     
    • Morten MacFly

      Morten MacFly - 2024-03-02

      So maybe for clarification: To detect changes of a webpage (in default download mode), WCM downloads the page, then applies several filters (if set up) and then calculates a CRC based on the remaining content. If this CRC has changed, the item is considered changed. Before the post-process, the content of the previous check and the content of the current check are saved to files so that they can be used by the tools that are called. One example is sending an email with both files (or a diff) as an attachment.

      What you want to do is actually possible, however, you would need to make the diff yourself. Here is how it should go:

      1.) Create a batch file for post processing
      2.) Call the batch file on a detected change, passing both files to the call
      3.) Process both files as you wish
      4.) Compute the diff, e.g. using a diff tool
      5.) React accordingly.
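As an illustration of steps 3 to 5, a post-process script could look roughly like the following Python sketch (called from the batch file). How the two file names reach the script, the argument order, and the row pattern `ROW` are assumptions made for this example, not something WCM prescribes.

```python
import difflib
import re
import sys

# Hypothetical rule for the "content of interest": rows like "00-03UT  1.67 ...".
ROW = re.compile(r"\d{2}-\d{2}UT\b")


def relevant_lines(text):
    """Step 3: reduce a saved page to the lines worth comparing."""
    return [line.rstrip() for line in text.splitlines() if ROW.match(line)]


def relevant_diff(old_text, new_text):
    """Step 4: diff the reduced content; an empty list means no real change."""
    return list(difflib.unified_diff(
        relevant_lines(old_text), relevant_lines(new_text),
        fromfile="old", tofile="new", lineterm=""))


def main(old_path, new_path):
    with open(old_path, encoding="utf-8") as handle:
        old_text = handle.read()
    with open(new_path, encoding="utf-8") as handle:
        new_text = handle.read()
    diff = relevant_diff(old_text, new_text)
    if diff:
        # Step 5: react; printing stands in for a mail or notification call.
        print("\n".join(diff))
    return 1 if diff else 0


# Only runs when exactly two paths are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

The drawback discussed in this thread remains: WCM has already flagged the item as changed before such a script runs.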

      Alternatively, which may be even simpler:
      Set up the RegEx as you've used it in the sed command inside WCM to filter the downloaded content. Then let WCM compute the checksum based on the filtered result. In that case the checksum will only change if your "content of interest" has changed.
      To do so: Edit the item, go to the "Content" tab, enable the RegEx filter, set it up as desired and inspect the result in the "RegEx result" tab below until it fits your needs.

      Let me know if that is of help. Otherwise I can try myself and provide you with a setup accordingly.

       
      • Morten MacFly

        Morten MacFly - 2024-03-02

        ...and not to forget: You can also set up multiple RegEx replaces/ignores in the options:
        Menu "Tools"-"Configuration" - section "Ignores/Replace"
        Make sure you've enabled applying ignores in the item settings and, if needed, selected these specific ones for just that item.

        These will also be pre-processed before the item is checked for changes.

         
  • MMcVeigh

    MMcVeigh - 2024-03-03

    Your first suggestion of a batch file operation after a detected change means I'd be running it manually every day, completely negating the auto check design criteria of WCM.

    Adding global Ignores/Replace would probably change data unexpectedly for some sites that are set as enabled. In other words targeting would benefit the section of most interest, but corrupt other data in that same webpage.

    I am pretty competent with sed and grep. In this case I cannot seem to get line by line regex operation in the item RegEx filter. It doesn't support many regex features such as beginning-of-line, end-of-line, multi-patterns and search/replace (sed feature).
    For instance the RegEx "^[0-2][0-8].*[0-9]" (without quotes) produces the following from the URL example:

    Regular expression is not applicable ... no matches found

    whereas grep with that pattern on a text file from 3/2/24 10 PM PST produces

    00-03UT 3.67 2.67 1.67
    03-06UT 3.67 3.00 1.33
    06-09UT 2.67 2.33 1.33
    12-15UT 1.67 2.00 1.33
    15-18UT 1.67 2.00 1.33
    18-21UT 2.67 2.33 1.67
    21-00UT 2.67 2.67 1.67
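The difference between the two results can be reproduced with Python's `re` module, which treats `^` like PCRE for this purpose. Whether WCM's filter anchors `^` only at the start of the whole content by default is an assumption here; the sample text is shortened from the forecast shown earlier.

```python
import re

PATTERN = r"^[0-2][0-8].*[0-9]"

SAMPLE = """:Product: Geomagnetic Forecast
NOAA Kp index forecast 01 Mar - 03 Mar
00-03UT        1.67      2.67      3.67
03-06UT        1.67      2.00      3.67
06-09UT        2.67      2.00      3.00
09-12UT        1.67      2.33      2.67
"""

# Without a multi-line flag, '^' only matches at the very start of the content,
# and the content starts with ':', so nothing is found.
whole = re.findall(PATTERN, SAMPLE)

# With a multi-line flag, '^' matches at the start of every line, like grep.
per_line = re.findall(PATTERN, SAMPLE, flags=re.MULTILINE)

print(whole)
print(per_line)
```

The 09-12UT row is absent in both cases, just as in the grep output above, because its second character `9` is outside `[0-8]`.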

    So you likely have a great deal more experience with WCM than I. Perhaps you have the time to solve this URL for me and therefore I and others can learn how to configure WCM for other sites needing special handling.

    From my original post, my sed script neatly throws out the geomagnetic data of any of the daily 7 time samples IF activity is less than 5.0. There are several sites in my list and some I have given up on that would greatly benefit from such "pre-processing" after download but before comparison.

    Thank You

     
    • Morten MacFly

      Morten MacFly - 2024-03-03

      WRT the ignore/replace items: Of course, you can create sets of such items that only apply to dedicated items and not all in general. This should be explained in the settings. So still, this option might be what you are looking for.

      However, keep in mind that WebChangeMonitor is there to detect changes - what you want to do is deep (numerical) data analysis on the content. This is not what WCM was primarily intended for. However, meanwhile quite a lot of the functionality offered goes in that direction and maybe you found a use-case that could be simplified with a new option:

      Currently, there is the option for a Replace-RegEx which works globally on the content. What you need is a line-by-line RegEx (or multiple) to filter the content beforehand. Currently, there is a global RegEx flag to enable RegEx to support line-feeds. I've implemented that once on request but never really used it myself. This might already work for you, but an easier way would be to support a line-by-line RegEx operation which would actually be easy to implement. Would that solve your issue?

      BTW: The RegEx has PCRE syntax and a RegEx that works for me on the content is e.g.: "[0-9]+\-[0-9]+[A-Z]{2}\s*[0-9+\.0-9+\s*]*"

       

      Last edit: Morten MacFly 2024-03-03
  • Morten MacFly

    Morten MacFly - 2024-03-03

    Although I think it is already possible with the options present, it is definitely not easy to set up.

    So... in the meantime I've implemented a new feature that should do what you want:
    It allows you to set up a RegEx as a filter that is applied either to the whole content (and only those lines remain that match the RegEx) or even line-by-line.

    You can follow/comment on the new feature here:
    https://sourceforge.net/p/webchangemon/feature-requests/268/
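The line-by-line variant of such a filter can be pictured with the following Python sketch. It is not the implementation from the feature request; `filter_lines` is a name invented here, and the pattern is the RegEx quoted earlier in this thread.

```python
import re

# RegEx quoted earlier in the thread for the time rows of the forecast.
ROW = re.compile(r"[0-9]+\-[0-9]+[A-Z]{2}\s*[0-9+\.0-9+\s*]*")


def filter_lines(content, pattern):
    """Keep only those lines in which the pattern finds a match."""
    return "\n".join(line for line in content.splitlines()
                     if pattern.search(line))
```

Applied to the forecast text, only the rows starting with a time range such as `00-03UT` would remain; header lines and the Ap index lines contain no match and are dropped.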

     
  • MMcVeigh

    MMcVeigh - 2024-03-03

    I see you have done quite a bit in the new coding. I'd love to give it a try. Unfortunately, I am in some transitions and don't have appropriate compilers set up for Windows or Ubuntu. When you have an Alpha or Beta I'd be happy to give it a go.

    I tried enabling the EOL and the RegEx you provided and entered the text and RegEx in the "test bed". The attachment shows my results.

     
  • MMcVeigh

    MMcVeigh - 2024-03-05

    Thank you. I installed the fixed new 64-bit code, your RegEx Filter for ignores, applied RegEx line-by-line, gave it a group name, enabled the group name for ignores in the item config and made a run of just this item. The results contained in *.new are partially successful. The RegEx stripped all the stuff before the time-data set, but the columnar floating point values less than 5.0 remain. The hours in the time-data will stay the same tomorrow, but the data itself will change. I'd like to know when that data >= 5.0 for any of the 3 columns. Can that be done in this RegEx experiment?

     

    Last edit: MMcVeigh 2024-03-05
    • Morten MacFly

      Morten MacFly - 2024-03-08

      So first of all, I did introduce a bug which I had to fix and which may affect you, too (that happens with an ALPHA version...).
      The effect would be that some default values for ignores are set incorrectly.
      If you have many ignore patterns set up, you should check all of their settings in the options dialog for wrong flags. If you spot any, the best way to fix this would be to export the ignore patterns into a CSV file, change it back to what it was before (using a spreadsheet app like Excel, for example) and import from CSV again.

      WRT your question: I am not sure if you can do this in a single RegEx, but you could of course just implement a second one. Ignore / filter patterns are applied in the order of their index and as many as are applicable. The first pattern would reduce the content to the lines you see and the second one (again line-by-line) would look for any number above 5.0 and remove those below that value (if I got you right and that is what you want).

      If you can't do that yourself, I would need to know which number exactly should be above 5.0. I don't really understand the content behind it and there are many numbers in several columns/rows. Do you mean all of the numbers in all columns/rows must be above 5.0?

       
  • MMcVeigh

    MMcVeigh - 2024-03-05

    Oops, here are the results from *.new

    00-03UT 1.67 1.67 1.67
    03-06UT 1.33 1.33 1.33
    06-09UT 1.33 1.33 1.33
    09-12UT 1.33 1.33 1.33
    12-15UT 1.33 1.33 1.33
    15-18UT 1.33 1.33 1.33
    18-21UT 1.67 1.67 1.67
    21-00UT 1.67 1.67 1.67

     

    Last edit: MMcVeigh 2024-03-05
  • MMcVeigh

    MMcVeigh - 2024-03-08

    I will download and install Alpha3 x64. Thank you.

    I just have the one global RegEx filter so I'll remove and retry plus add another for the test of removing any columnar value < 5.0. These columns are forecast data for the next 3 days. The row headings are GMT times of the day. I only want change notification if any GMT time on any of the 3 days is 5.0+.

     
  • MMcVeigh

    MMcVeigh - 2024-03-08

    This test is working now with Alpha3. Today's data from the website to check contains a couple of value entries > 4.0. First I created a RegEx Filter [0-9]+-[0-9]+[A-Z]{2}\s.+[[1-9].[0-9]{2}\s]* with group name "NOAA Geo RegEx Filter". Next I created a RegEx Replace [0-3].[0-9]{2} replaced with 4 spaces and group name "NOAA Geo RegEx Replace". I assigned these to my item and reran a "check now". Success! I got a table with the GMT row headers and only the two data points > 4.0. Now I will edit my RegEx Replace to only look for > 5.0.
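The two-step setup described above (a line filter followed by a replace) can be mimicked in Python as a rough sketch. The patterns are simplified variants, not the exact ones entered in WCM: the dot is escaped, the threshold is already moved to 5.0, and trailing blanks are trimmed. The name `storm_rows` is invented for this example.

```python
import re

# Step 1 (filter): keep rows such as "00-03UT   1.67 ..." (simplified pattern).
ROW_FILTER = re.compile(r"[0-9]+-[0-9]+[A-Z]{2}\s.+")
# Step 2 (replace): blank out values from 0.00 to 4.99 with four spaces.
BELOW_FIVE = re.compile(r"[0-4]\.[0-9]{2}")


def storm_rows(content):
    """Return the time rows with every value below 5.0 blanked out."""
    rows = [line for line in content.splitlines() if ROW_FILTER.search(line)]
    # rstrip is an addition so that rows without any value >= 5.0 stay constant.
    return "\n".join(BELOW_FIVE.sub("    ", row).rstrip() for row in rows)
```

With this, the remaining text only differs between two checks when a value of 5.0 or more appears, disappears or changes, which is what drives the checksum.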

     
    • Morten MacFly

      Morten MacFly - 2024-03-14

      OK, pretty cool. So this feature will be officially introduced with the next release. Thank you for testing! If you see any deficits in the meantime, let me know. Currently, the next release is scheduled to be 24.04 - obviously sometime in April.

      Oh, btw: It would have been cool to have this conversation in a related ticket. Next time, if you miss a feature, you could also use the feature tracker:
      https://sourceforge.net/p/webchangemon/feature-requests/

       
  • MMcVeigh

    MMcVeigh - 2024-03-15

    Actually, I thought I just didn't know how best to use the tool. Thank you for all your support. Very appreciated.

     
