OK. I've decided we are going to do v0.21.0 the week of March 1. To that end I need to hear about any outstanding problems with the current source code, and I'll be fixing up the test suite so that we are back to 100% pass.
Please hold off any changes to CVS until I have OKed them so that we minimize change.
John.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
They get the basic job done, but there are still a few things I am unhappy with.
Here's what I've done:
New pseudo-words for css font sizes and css coloring. This means (for now) that spammer colors/sizes that POPFile learns about from the <font> tag won't tell it anything about CSS.
This is flexible, though. I can make the CSS color pseudo-word the same as our other color pseudo-word (or use both, which would also be interesting).
For font-sizes, I decided not to do the research necessary to create a single, unified font-size pseudo-word scale. I did do some looking, and it looks like a mess that would take some time and experimentation to sort out. So for now, the CSS font-size pseudo-words just include the units (eg: html:cssfontsize12em).
Foreground and background colors are sucesfully parsed, however (and this is the only thing I'm really unhappy about) since CSS can occur in almost any tag (ugh) knowing when the CSS is no longer in scope is impossible (or very hard) without building a DOM tree for the document.
I havn't solved this problem yet. A solution I will try tomorrow is to save the tagname that set the CSS and clear font information when that closing tag is seen. This won't be 100% reliable (eg: <span style=foo><span></span> text with the style </span>) but will catch simple cases for now.
DOM is on my mental parser roadmap (for after the refactor).
This has all been done in a way that will let us easily deal with other CSS properties that become interesting in the future.
Tests to come tomorrow or later tonight.
I also have a few more parser tweaks to check in. The bigger changes I will hold off on until approved, but I found some bugs. The simple ones I will just fix (eg, one regexp was truncating HTML tag attributes in some cases), but I will detail the bigger changes in another post to get your OK on them.
Regards,
Sam
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I have added the display property to the CSS to be pseudo-worded.
For now, and until we do have a DOM of parsed HTML we are processing that is about the best we can do.
Our mechanism for detecting hidden text isn't really generalized (it triggers on colors and the limited number of tags non-CSS color attributes can exist in) so I didn't have anything to use that I could trust, and generalizing the idea that text can be hidden was a bit beyond what I was trying to acheive for a first run at CSS.
Regards,
Sam
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I ended up needing to turn off the front/back color when encountering a closing tag for the opening tag. Styles were just being left in effect way too long.
It's not perfect, but I'm pleased with the results.
Regards,
Sam
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Woops, found a small issue with my CSS code while groking the output of MailParse.pm with $self->{debug__} = 1 for some CSS test messages.
When detecting a closing tag to disable style settings I wasn't re-computing the html color distance. This would leave a bunch of extra html:colordistanceN pseudo-words in html parsed with CSS coloring.
I will check this fix in along with the body-less email fix if John approves the other fix. If not, I will commit my CSS fixes seperately tomorrow.
Regards,
Sam
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
This was due to update_tag getting non-alphanumeric information passed in to $tag. I had assumed this was impossible. But occasionally in messages with <'s and the right stuff around them (nothing that looks like valid HTML to the human eye), this would result in non-alpha (and potentially regexp-active) characters. I made the mistake of using $tag inside a regexp. This has been corrected
Post 0.21.0 I will hunt down the regexp that was permitting this.
Sorry about the bother for those of you impacted by the unstable CVS.
The checkin also fixes my oversight in not calling compute_html_color_distance in all appropriate areas of the CSS parsing. This resulted in incorrect CSS pseudo-words.
Tests seem to be running individually as expected for me on Win98. I still can't do a full "make test" -- it always seems to hang at some random place (I've tried eliminating tests using TESTARGS but the failure seems to move around between a few different places we block in some of the tests).
Regards,
Sam
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
1) regexp for html attributes would incorrectly truncate attributes containing whitespace (bad!). I will check this fix in as inline CSS usually has whitespace which would trigger this glitch.
The final (non bug) change I wanted to make (erk, looking at my checkin, I forgot to back it out before checking in) was to change how we stem domain-names.
Currently users.sourceforge.net results in:
users.sourceforge.net
sourceforge.net
being identified as words. The change that a few users have asked for (and I agree with, I'm not sure what I was thinking initially) is to have users.sourceforge.net result in:
users.sourceforge.net
sourceforge.net
.net
being identified as words.
This would give POPFile some ability to learn from TLD's (eg .biz *g*)
Regards,
Sam
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I'm not sure if these are "problems with the current source code" or if they are merely "features":
(1) The POPFILE_USER environment variable makes it easy to place the configuration data anywhere, even on a different drive from that used for the POPFile program files. One of the side effects of this flexibility is that the <current log file> link on the UI's Configuration page stops working because incorrect HTML is generated for the link. [I've mentioned this a few times in the past]
(2) The UI's Advanced page lists lots of configuration parameters and has an Update button but there is no indication of which popfile.cfg file will be changed when the button is clicked. There is no obvious way to tell from the UI which set of configuration data is in use. If the user's popfile.cfg file uses the default ./ value for the log directory then the broken <current log file> link can help here (but it is not very user friendly).
Brian
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
2) I got the following error message when I start 0.21.0 test 5 on Japanese mode. bytes.pm is not installed. The problem can be fixed by placing bytes.pm under /lib directory.
A copy of POPFile appears to be running.
Attempting to signal the previous copy.
The other POPFile (1992) failed to signal back, starting new copy (1128)
config logger}
Can't locate bytes.pm in @INC (@INC contains: C:\data\popfile C:/Data/popfile/li
b .) at C:/Data/popfile/lib/Encode.pm line 285.
BEGIN failed--compilation aborted at C:/Data/popfile/lib/Encode.pm line 285.
Compilation failed in require at C:\data\popfile/Classifier/MailParse.pm line 19
51.
{interface:
Junya
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I haven't forgotten about Japanese users and I know that there are a number of patches for me to merge... I'll get to it as soon as the test suite is at 100%.
Be excellent to one another,
John.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I know you are very busy. Let me prioritize the all patches/bugs regarding Japanese mode so that you can decide what can/cannot be remained unfixed in 0.21.0 for Japanese users.
- Patch ID 902183 is most critical. It makes POPFile no use for Windows user with Active Perl.
- Patch ID 902513 is medium priority. Nice to have it fixed so that user can use X-POPFile-Link header with no hassle.
- Patch ID 860172 is low priority.
- Patch ID 860105. I noticed that this has already merged in Test5 version. Maybe closed?
- Patch ID 848766. I don't recommend this to be merged to 0.21.0 now. This makes big change in the code and it is risky to implement at final stage. It will be very nice if you can merge it right after 0.21.0 is released.
- Bug ID 844689. Very minor and it is not Japanese specific problem, but general problem.
Hope this will help.
Junya
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
OK. I've decided we are going to do v0.21.0 the week of March 1. To that end I need to hear about any outstanding problems with the current source code, and I'll be fixing up the test suite so that we are back to 100% pass.
Please hold off any changes to CVS until I have OKed them so that we minimize change.
John.
CSS changes are coming along nicely.
It's a deceptively simple syntax, on the surface, but then you get down to specific properties and things get wild.
Regards,
Sam
Ok, checking in changes for CSS.
They get the basic job done, but there are still a few things I am unhappy with.
Here's what I've done:
New pseudo-words for css font sizes and css coloring. This means (for now) that spammer colors/sizes that POPFile learns about from the <font> tag won't tell it anything about CSS.
This is flexible, though. I can make the CSS color pseudo-word the same as our other color pseudo-word (or use both, which would also be interesting).
For font-sizes, I decided not to do the research necessary to create a single, unified font-size pseudo-word scale. I did do some looking, and it looks like a mess that would take some time and experimentation to sort out. So for now, the CSS font-size pseudo-words just include the units (eg: html:cssfontsize12em).
Foreground and background colors are sucesfully parsed, however (and this is the only thing I'm really unhappy about) since CSS can occur in almost any tag (ugh) knowing when the CSS is no longer in scope is impossible (or very hard) without building a DOM tree for the document.
I havn't solved this problem yet. A solution I will try tomorrow is to save the tagname that set the CSS and clear font information when that closing tag is seen. This won't be 100% reliable (eg: <span style=foo><span></span> text with the style </span>) but will catch simple cases for now.
DOM is on my mental parser roadmap (for after the refactor).
This has all been done in a way that will let us easily deal with other CSS properties that become interesting in the future.
Tests to come tomorrow or later tonight.
I also have a few more parser tweaks to check in. The bigger changes I will hold off on until approved, but I found some bugs. The simple ones I will just fix (eg, one regexp was truncating HTML tag attributes in some cases), but I will detail the bigger changes in another post to get your OK on them.
Regards,
Sam
Have you done anything about visibility:hidden and display:none? Because I would think those are pretty obvious for hiding random text.
Nothing has been done to prevent words within those particular styles from being detected, but I have done this:
# CSS visibility
if (defined($style->{'visibility'})) {
$self->update_pseudoword( 'html', "cssvisibility" . $style->{'visibility'}, $encoded, $original );
}
I have added the display property to the CSS to be pseudo-worded.
For now, and until we do have a DOM of parsed HTML we are processing that is about the best we can do.
Our mechanism for detecting hidden text isn't really generalized (it triggers on colors and the limited number of tags non-CSS color attributes can exist in) so I didn't have anything to use that I could trust, and generalizing the idea that text can be hidden was a bit beyond what I was trying to acheive for a first run at CSS.
Regards,
Sam
CSS parsing and tests are in and done.
I ended up needing to turn off the front/back color when encountering a closing tag for the opening tag. Styles were just being left in effect way too long.
It's not perfect, but I'm pleased with the results.
Regards,
Sam
Woops, found a small issue with my CSS code while groking the output of MailParse.pm with $self->{debug__} = 1 for some CSS test messages.
When detecting a closing tag to disable style settings I wasn't re-computing the html color distance. This would leave a bunch of extra html:colordistanceN pseudo-words in html parsed with CSS coloring.
I will check this fix in along with the body-less email fix if John approves the other fix. If not, I will commit my CSS fixes seperately tomorrow.
Regards,
Sam
Happy for you to commit these changes, as long as the test suite is updated and still passes.
John.
I will update all side-effects that impact tests and put in tests that will catch regressions on this one.
Regards,
Sam
Ok, this is my final CSS commit going in right now.
It fixes:
http://sourceforge.net/tracker/index.php?func=detail&aid=906958&group_id=63137&atid=502956
This was due to update_tag getting non-alphanumeric information passed in to $tag. I had assumed this was impossible. But occasionally in messages with <'s and the right stuff around them (nothing that looks like valid HTML to the human eye), this would result in non-alpha (and potentially regexp-active) characters. I made the mistake of using $tag inside a regexp. This has been corrected
Post 0.21.0 I will hunt down the regexp that was permitting this.
Sorry about the bother for those of you impacted by the unstable CVS.
The checkin also fixes my oversight in not calling compute_html_color_distance in all appropriate areas of the CSS parsing. This resulted in incorrect CSS pseudo-words.
Tests seem to be running individually as expected for me on Win98. I still can't do a full "make test" -- it always seems to hang at some random place (I've tried eliminating tests using TESTARGS but the failure seems to move around between a few different places we block in some of the tests).
Regards,
Sam
I need to add a mention that my last checkin included the bodyless email fix and tests. I forgot to mention it in my commit.
Regards,
Sam
So will a color set using color: in a style work with our existing Camouflage and Invisible Ink detection?
John.
Yes, absolutely. I will go through some of the CSS emails I have had sent to me and send you a few for inclusion in our tests.
CSS will now set {htmlbackcolor__}, {htmlfontcolor__} and {htmlbodycolor__} wherever appropriate.
The only incompatibility with our "regular" color detection code is the use of different pseudo-words.
Regards,
Sam
Ok, other parser bugs I located:
1) regexp for html attributes would incorrectly truncate attributes containing whitespace (bad!). I will check this fix in as inline CSS usually has whitespace which would trigger this glitch.
2) In bodiless email the parser will ignore the last header. An example message is attached to this bug: https://sourceforge.net/tracker/?func=detail&atid=502956&aid=899447&group_id=63137
I will prepare a fix for 2) to check in tomorrow if you approve it.
Regards,
Sam
I have a fix for the last header in body-less messages being ignored ready and a TestMailParse029.msg candidate containing this type of message.
I will get the files for TestMailParse030-033.msg (or so) containing various CSS fun ready as well.
Regards,
Sam
The final (non bug) change I wanted to make (erk, looking at my checkin, I forgot to back it out before checking in) was to change how we stem domain-names.
Currently users.sourceforge.net results in:
users.sourceforge.net
sourceforge.net
being identified as words. The change that a few users have asked for (and I agree with, I'm not sure what I was thinking initially) is to have users.sourceforge.net result in:
users.sourceforge.net
sourceforge.net
.net
being identified as words.
This would give POPFile some ability to learn from TLD's (eg .biz *g*)
Regards,
Sam
Sam,
Commit that, it seems worthwhile to me.
John.
I'm not sure if these are "problems with the current source code" or if they are merely "features":
(1) The POPFILE_USER environment variable makes it easy to place the configuration data anywhere, even on a different drive from that used for the POPFile program files. One of the side effects of this flexibility is that the <current log file> link on the UI's Configuration page stops working because incorrect HTML is generated for the link. [I've mentioned this a few times in the past]
(2) The UI's Advanced page lists lots of configuration parameters and has an Update button but there is no indication of which popfile.cfg file will be changed when the button is clicked. There is no obvious way to tell from the UI which set of configuration data is in use. If the user's popfile.cfg file uses the default ./ value for the log directory then the broken <current log file> link can help here (but it is not very user friendly).
Brian
Number (2) is done, I'm working in number (1).
John.
I got 2 problems when using 0.21.0 test 5 on Japanese mode.
1) POPFile crashes when I classify or re-classify Japanese emails. Patch is uploaded.
http://sourceforge.net/tracker/index.php?func=detail&aid=902183&group_id=63137&atid=502958
2) I got the following error message when I start 0.21.0 test 5 on Japanese mode. bytes.pm is not installed. The problem can be fixed by placing bytes.pm under /lib directory.
A copy of POPFile appears to be running.
Attempting to signal the previous copy.
The other POPFile (1992) failed to signal back, starting new copy (1128)
config logger}
Can't locate bytes.pm in @INC (@INC contains: C:\data\popfile C:/Data/popfile/li
b .) at C:/Data/popfile/lib/Encode.pm line 285.
BEGIN failed--compilation aborted at C:/Data/popfile/lib/Encode.pm line 285.
Compilation failed in require at C:\data\popfile/Classifier/MailParse.pm line 19
51.
{interface:
Junya
Thanks for the report. I'll add bytes.pm to the list of modules installed when Japanese is selected.
The Test_6 installer will appear soon with this fix and (I hope) fixes for the other bugs found in Test_5.
Brian
Another bug on Japanese mode.
I found that Japanese page appears corrupted when page is navigated by using jump_to_message URL.
Patch is uploaded to:
http://sourceforge.net/tracker/index.php?func=detail&aid=902513&group_id=63137&atid=502958
I haven't forgotten about Japanese users and I know that there are a number of patches for me to merge... I'll get to it as soon as the test suite is at 100%.
Be excellent to one another,
John.
Thanks for your response.
I know you are very busy. Let me prioritize the all patches/bugs regarding Japanese mode so that you can decide what can/cannot be remained unfixed in 0.21.0 for Japanese users.
- Patch ID 902183 is most critical. It makes POPFile no use for Windows user with Active Perl.
- Patch ID 902513 is medium priority. Nice to have it fixed so that user can use X-POPFile-Link header with no hassle.
- Patch ID 860172 is low priority.
- Patch ID 860105. I noticed that this has already merged in Test5 version. Maybe closed?
- Patch ID 848766. I don't recommend this to be merged to 0.21.0 now. This makes big change in the code and it is risky to implement at final stage. It will be very nice if you can merge it right after 0.21.0 is released.
- Bug ID 844689. Very minor and it is not Japanese specific problem, but general problem.
Hope this will help.
Junya
0.21.0 test 6 also crashes in Japanese mode when I receive e-mails.
The problem has solved by this patch. Thanks.
naoki