Menu

Final week for v0.21.0

2004-02-21
2004-03-01
1 2 > >> (Page 1 of 2)
  • John Graham-Cumming

    OK.  I've decided we are going to do v0.21.0 the week of March 1. To that end I need to hear about any outstanding problems with the current source code, and I'll be fixing up the test suite so that we are back to 100% pass.

    Please hold off any changes to CVS until I have OKed them so that we minimize change.

    John.

     
    • Sam Schinke

      Sam Schinke - 2004-02-22

      CSS changes are coming along nicely.

      It's a deceptively simple syntax, on the surface, but then you get down to specific properties and things get wild.

      Regards,
      Sam

       
      • Sam Schinke

        Sam Schinke - 2004-02-22

        Ok, checking in changes for CSS.

        They get the basic job done, but there are still a few things I am unhappy with.

        Here's what I've done:

        New pseudo-words for css font sizes and css coloring. This means (for now) that spammer colors/sizes that POPFile learns about from the <font> tag won't tell it anything about CSS.

        This is flexible, though. I can make the CSS color pseudo-word the same as our other color pseudo-word (or use both, which would also be interesting).

        For font-sizes, I decided not to do the research necessary to create a single, unified font-size pseudo-word scale. I did do some looking, and it looks like a mess that would take some time and experimentation to sort out. So for now, the CSS font-size pseudo-words just include the units (eg: html:cssfontsize12em).

        Foreground and background colors are sucesfully parsed, however (and this is the only thing I'm really unhappy about) since CSS can occur in almost any tag (ugh) knowing when the CSS is no longer in scope is impossible (or very hard) without building a DOM tree for the document.

        I havn't solved this problem yet. A solution I will try tomorrow is to save the tagname that set the CSS and clear font information when that closing tag is seen. This won't be 100% reliable (eg: <span style=foo><span></span> text with the style </span>) but will catch simple cases for now.

        DOM is on my mental parser roadmap (for after the refactor).

        This has all been done in a way that will let us easily deal with other CSS properties that become interesting in the future.

        Tests to come tomorrow or later tonight.

        I also have a few more parser tweaks to check in. The bigger changes I will hold off on until approved, but I found some bugs. The simple ones I will just fix (eg, one regexp was truncating HTML tag attributes in some cases), but I will detail the bigger changes in another post to get your OK on them.

        Regards,
        Sam

         
        • Jan-Erik Finnberg

          Have you done anything about visibility:hidden and display:none? Because I would think those are pretty obvious for hiding random text.

           
          • Sam Schinke

            Sam Schinke - 2004-02-22

            Nothing has been done to prevent words within those particular styles from being detected, but I have done this:

                        # CSS visibility
                        if (defined($style->{'visibility'})) {
                            $self->update_pseudoword( 'html', "cssvisibility" . $style->{'visibility'}, $encoded, $original );
                        }

            I have added the display property to the CSS to be pseudo-worded.

            For now, and until we do have a DOM of parsed HTML we are processing that is about the best we can do.

            Our mechanism for detecting hidden text isn't really generalized (it triggers on colors and the limited number of tags non-CSS color attributes can exist in) so I didn't have anything to use that I could trust, and generalizing the idea that text can be hidden was a bit beyond what I was trying to acheive for a first run at CSS.

            Regards,
            Sam

             
        • Sam Schinke

          Sam Schinke - 2004-02-23

          CSS parsing and tests are in and done.

          I ended up needing to turn off the front/back color when encountering a closing tag for the opening tag. Styles were just being left in effect way too long.

          It's not perfect, but I'm pleased with the results.

          Regards,
          Sam

           
          • Sam Schinke

            Sam Schinke - 2004-02-27

            Woops, found a small issue with my CSS code while groking the output of MailParse.pm with $self->{debug__} = 1 for some CSS test messages.

            When detecting a closing tag to disable style settings I wasn't re-computing the html color distance. This would leave a bunch of extra html:colordistanceN pseudo-words in html parsed with CSS coloring.

            I will check this fix in along with the body-less email fix if John approves the other fix. If not, I will commit my CSS fixes seperately tomorrow.

            Regards,
            Sam

             
            • John Graham-Cumming

              Happy for you to commit these changes, as long as the test suite is updated and still passes.

              John.

               
              • Sam Schinke

                Sam Schinke - 2004-02-27

                I will update all side-effects that impact tests and put in tests that will catch regressions on this one.

                Regards,
                Sam

                 
              • Sam Schinke

                Sam Schinke - 2004-03-01

                Ok, this is my final CSS commit going in right now.

                It fixes:

                http://sourceforge.net/tracker/index.php?func=detail&aid=906958&group_id=63137&atid=502956

                This was due to update_tag getting non-alphanumeric information passed in to $tag. I had assumed this was impossible. But occasionally in messages with <'s and the right stuff around them (nothing that looks like valid HTML to the human eye), this would result in non-alpha (and potentially regexp-active) characters. I made the mistake of using $tag inside a regexp. This has been corrected

                Post 0.21.0 I will hunt down the regexp that was permitting this.

                Sorry about the bother for those of you impacted by the unstable CVS.

                The checkin also fixes my oversight in not calling compute_html_color_distance in all appropriate areas of the CSS parsing. This resulted in incorrect CSS pseudo-words.

                Tests seem to be running individually as expected for me on Win98. I still can't do a full "make test" -- it always seems to hang at some random place (I've tried eliminating tests using TESTARGS but the failure seems to move around between a few different places we block in some of the tests).

                Regards,
                Sam

                 
            • Sam Schinke

              Sam Schinke - 2004-03-01

              I need to add a mention that my last checkin included the bodyless email fix and tests. I forgot to mention it in my commit.

              Regards,
              Sam

               
        • John Graham-Cumming

          So will a color set using color: in a style work with our existing Camouflage and Invisible Ink detection?

          John.

           
          • Sam Schinke

            Sam Schinke - 2004-02-24

            Yes, absolutely. I will go through some of the CSS emails I have had sent to me and send you a few for inclusion in our tests.

            CSS will now set {htmlbackcolor__}, {htmlfontcolor__} and {htmlbodycolor__} wherever appropriate.

            The only incompatibility with our "regular" color detection code is the use of different pseudo-words.

            Regards,
            Sam

             
    • Sam Schinke

      Sam Schinke - 2004-02-22

      Ok, other parser bugs I located:

      1) regexp for html attributes would incorrectly truncate attributes containing whitespace (bad!). I will check this fix in as inline CSS usually has whitespace which would trigger this glitch.

      2) In bodiless email the parser will ignore the last header. An example message is attached to this bug: https://sourceforge.net/tracker/?func=detail&atid=502956&aid=899447&group_id=63137

      I will prepare a fix for 2) to check in tomorrow if you approve it.

      Regards,
      Sam

       
      • Sam Schinke

        Sam Schinke - 2004-02-27

        I have a fix for the last header in body-less messages being ignored ready and a TestMailParse029.msg candidate containing this type of message.

        I will get the files for TestMailParse030-033.msg (or so) containing various CSS fun ready as well.

        Regards,
        Sam

         
    • Sam Schinke

      Sam Schinke - 2004-02-22

      The final (non bug) change I wanted to make (erk, looking at my checkin, I forgot to back it out before checking in) was to change how we stem domain-names.

      Currently users.sourceforge.net results in:
      users.sourceforge.net
      sourceforge.net
      being identified as words. The change that a few users have asked for (and I agree with, I'm not sure what I was thinking initially) is to have users.sourceforge.net result in:
      users.sourceforge.net
      sourceforge.net
      .net
      being identified as words.

      This would give POPFile some ability to learn from TLD's (eg .biz *g*)

      Regards,
      Sam

       
      • John Graham-Cumming

        Sam,

        Commit that, it seems worthwhile to me.

        John.

         
    • Brian Smith

      Brian Smith - 2004-02-22

      I'm not sure if these are "problems with the current source code" or if they are merely "features":

      (1) The POPFILE_USER environment variable makes it easy to place the configuration data anywhere, even on a different drive from that used for the POPFile program files. One of the side effects of this flexibility is that the <current log file> link on the UI's Configuration page stops working because incorrect HTML is generated for the link. [I've mentioned this a few times in the past]

      (2) The UI's Advanced page lists lots of configuration parameters and has an Update button but there is no indication of which popfile.cfg file will be changed when the button is clicked. There is no obvious way to tell from the UI which set of configuration data is in use. If the user's popfile.cfg file uses the default ./ value for the log directory then the broken <current log file> link can help here (but it is not very user friendly).

      Brian

       
      • John Graham-Cumming

        Number (2) is done, I'm working in number (1).

        John.

         
    • Junya Ishihara

      Junya Ishihara - 2004-02-22

      I got 2 problems when using 0.21.0 test 5 on Japanese mode.

      1) POPFile crashes when I classify or re-classify Japanese emails. Patch is uploaded.

      http://sourceforge.net/tracker/index.php?func=detail&aid=902183&group_id=63137&atid=502958

      2) I got the following error message when I start 0.21.0 test 5 on Japanese mode. bytes.pm is not installed. The problem can be fixed by placing bytes.pm under /lib directory.

      A copy of POPFile appears to be running.
      Attempting to signal the previous copy.

      The other POPFile (1992) failed to signal back, starting new copy (1128)
      config logger}
      Can't locate bytes.pm in @INC (@INC contains: C:\data\popfile C:/Data/popfile/li
      b .) at C:/Data/popfile/lib/Encode.pm line 285.
      BEGIN failed--compilation aborted at C:/Data/popfile/lib/Encode.pm line 285.
      Compilation failed in require at C:\data\popfile/Classifier/MailParse.pm line 19
      51.
               {interface:

      Junya

       
      • Brian Smith

        Brian Smith - 2004-02-22

        Thanks for the report. I'll add bytes.pm to the list of modules installed when Japanese is selected.

        The Test_6 installer will appear soon with this fix and (I hope) fixes for the other bugs found in Test_5.

        Brian

         
      • Junya Ishihara

        Junya Ishihara - 2004-02-23

        Another bug on Japanese mode.

        I found that Japanese page appears corrupted when page is navigated by using jump_to_message URL.

        Patch is uploaded to:

        http://sourceforge.net/tracker/index.php?func=detail&aid=902513&group_id=63137&atid=502958

         
        • John Graham-Cumming

          I haven't forgotten about Japanese users and I know that there are a number of patches for me to merge... I'll get to it as soon as the test suite is at 100%.

          Be excellent to one another,
          John.

           
          • Junya Ishihara

            Junya Ishihara - 2004-02-24

            Thanks for your response.

            I know you are very busy. Let me prioritize the all patches/bugs regarding Japanese mode so that you can decide what can/cannot be remained unfixed in 0.21.0 for Japanese users.

            - Patch ID 902183 is most critical. It makes POPFile no use for Windows user with Active Perl.

            - Patch ID 902513 is medium priority. Nice to have it fixed so that user can use X-POPFile-Link header with no hassle.

            - Patch ID 860172 is low priority.

            - Patch ID 860105. I noticed that this has already merged in Test5 version. Maybe closed?

            - Patch ID 848766. I don't recommend this to be merged to 0.21.0 now. This makes big change in the code and it is risky to implement at final stage. It will be very nice if you can merge it right after 0.21.0 is released.

            - Bug ID 844689. Very minor and it is not Japanese specific problem, but general problem.

            Hope this will help.

            Junya

             
      • naoki iimura

        naoki iimura - 2004-02-26

        0.21.0 test 6 also crashes in Japanese mode when I receive e-mails.
        The problem has solved by this patch. Thanks.

        naoki

         
1 2 > >> (Page 1 of 2)

Log in to post a comment.