When WCM was new, it would create a Content subfolder for its data. That folder contained .new and .old files. Everything was self-explanatory.
Now, even with the WCM change database disabled, WCM creates both a Content subfolder and a Contentpages subfolder for downloaded data. What is the purpose of the two different folders?
Within either of those folders, .dump files may now also be present. What is the purpose of those .dump files, and why are they needed in addition to the .new and .old files?
TIA.
There are several reasons that might cause a content check to fail: encoding errors, start/stop tag errors, regex errors, XML XPath errors, JSON errors. In such cases, a dump file is generated so that no information is lost, and these files are named accordingly. The same is true if you explicitly enabled creating a dump file every time the CRC changes. I've now implemented the ability to enable/disable the creation of these files individually. You'll find the corresponding settings in the next release. That should make it more transparent (and the default setting will be off).
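For readers who want to picture the behaviour described above, here is a minimal Python sketch of the idea. It is not WCM's code: the function names, the `dump_on_error` switch, and the choice of CRC-32 via `zlib` are assumptions made purely for illustration, and only two of the listed failure types (encoding and regex errors) are covered.

```python
import re
import zlib
from pathlib import Path
from typing import Optional, Tuple


def crc_changed(content: bytes, previous_crc: Optional[int]) -> Tuple[bool, int]:
    """Return (changed, new_crc) by comparing CRC-32 checksums (assumed algorithm)."""
    new_crc = zlib.crc32(content)
    return (previous_crc is None or new_crc != previous_crc), new_crc


def check_content(raw: bytes, pattern: str, dump_path: Path,
                  dump_on_error: bool = True) -> Optional[str]:
    """Decode and filter content; if the check fails, keep the raw bytes in a dump file."""
    try:
        text = raw.decode("utf-8")         # raises UnicodeDecodeError on bad input
        match = re.search(pattern, text)   # raises re.error on an invalid pattern
        return match.group(0) if match else None
    except (UnicodeDecodeError, re.error):
        if dump_on_error:                  # hypothetical per-error-type setting
            dump_path.write_bytes(raw)     # preserve the download unmodified
        return None
```

With `dump_on_error=False` the failure is still reported (the function returns `None`), but no dump file is written, which mirrors the "default off" setting mentioned for the next release.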
Thanks. I see some that were created due to encoding errors. To WCM, what defines an encoding error?
An encoding error occurs if you try to read content from a web page that is non-ASCII and WCM is unable to detect the encoding properly. This should happen only rarely. But WCM is only as good as the encoding detector it uses, which is a mixture of Google's Compact Encoding Detection (CED) and wxWidgets methods (primarily as a fallback). That's also why the dump file is written: it should contain the downloaded content "as-is" so you can find out what's going wrong. It could also be an issue with a misconfigured server, e.g. encoding errors will definitely happen if binary content is not marked as such by the server. In that case the content cannot be downloaded correctly, because the server already provides it in the wrong format.
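To make "unable to detect the encoding" concrete, here is a small Python sketch of a detect-then-fall-back decode. It does not use CED or wxWidgets; the function `decode_or_none` and its fixed candidate list are assumptions standing in for a real detector, so treat it only as an illustration of when an encoding error would be reported.

```python
from typing import Optional, Sequence


def decode_or_none(raw: bytes,
                   candidates: Sequence[str] = ("ascii", "utf-8")) -> Optional[str]:
    """Try each candidate encoding in order; None signals an encoding error."""
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next candidate
    return None


page = "Grüße".encode("utf-8")       # non-ASCII text in a known encoding
png_header = b"\x89PNG\r\n\x1a\n"    # binary data served as if it were text

print(decode_or_none(page))          # decoded by the utf-8 fallback
print(decode_or_none(png_header))    # None: no candidate fits
```

The second case corresponds to the misconfigured-server scenario: the bytes are not text in any candidate encoding, so the only useful thing left to do is keep them unchanged for inspection.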
BTW: There should be only one folder, and it should be named "pages". I don't know where "Content" is coming from. It looks to me like a setting you made. Please check the paths you set up in the configuration and the command line parameters.
Ah, this is apparently a bug in WCM. I'll take a deeper look and file a bug report when I have more details.
I am not entirely convinced yet that it is really a bug, but let's see how the ticket you've created goes...