
POPFile v0.22.6?

2007-08-29
2013-04-15
(Page 1 of 2)
  • John Graham-Cumming

    Folks,

    Between Brian's changes for Vista support and Naoki's performance improvement it feels like a 0.22.6 might be in order.  Anything else we'd want to slip into it?

    John.

     
    • naoki iimura

      naoki iimura - 2007-08-31

      John,

      I've posted three patches in the bug and patch sections today.

      1) The new option for Japanese users (new feature)

      Japanese users will be able to choose the Japanese parser.

      http://sourceforge.net/tracker/index.php?func=detail&aid=1785492&group_id=63137&atid=502958

      2) Some e-mail addresses are not treated as e-mail addresses (bug)

      http://sourceforge.net/tracker/index.php?func=detail&aid=1785487&group_id=63137&atid=502956

      3) Yet another fix for the history tab in Nihongo mode (bug)

      http://sourceforge.net/tracker/index.php?func=detail&aid=1785488&group_id=63137&atid=502956

      These patches have been tested for weeks in the Japanese forum.

      I'd like to merge them into POPFile v0.22.6. May I?

      Naoki

       
    • John Graham-Cumming

      Yes.  They seem very interesting.  Could you explain a little more about the different parsers you are using?

      John.

       
      • naoki iimura

        naoki iimura - 2007-08-31

        >> Yes. They seem very interesting. Could you explain a little more about the different parsers you are using? <<

        OK.

        Japanese text consists of several kinds of characters.

        The first is called "Kanji" and originates from Chinese.
        The second is called "Kana" and was created by simplifying Kanji.
        Kana has two forms called "Hira-gana" (Japanese syllabary characters) and
        "Kata-kana".

        Words are made up of one or more characters, but there are no spaces
        between words, so we have to split the text into words ourselves.

        Every Kanji has a Yomi (pronunciation), but most Kanji have several
        different pronunciations. So we need a dictionary that lists Kanji
        words together with their Yomi.

        (1) Kakasi

        Kakasi was originally developed as a Kanji-Kana converter.
        (Kakasi is an abbreviation of KAnji KAna Simple Inverter.)
        So Kakasi has no information about "Hira-gana" and "Kata-kana" words.

        Kakasi's wakachi-gaki (word-splitting) function is based on that
        conversion; Kakasi does not analyze Japanese text morphologically.
        For this reason Kakasi is not as accurate as MeCab at Japanese parsing.

        For more information about Kakasi, see:
        http://kakasi.namazu.org/index.html.en

        (2) MeCab

        MeCab is a morphological analyzer.

        MeCab's wakachi-gaki function is based on morphological analysis.
        So it needs a large dictionary, but it is accurate.

        MeCab is released under the GPL, the LGPL, or the BSD license.

        For more information about MeCab, see:
        http://sourceforge.net/projects/mecab

        (3) The simple parser

        The simple parser splits Japanese text by character type
        (Kanji, Hira-gana, Kata-kana and symbols). It is very fast since it does
        not use dictionaries, but it is not accurate at Japanese parsing.
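
        As a rough illustration of splitting by character type, here is a minimal Python sketch. It is not POPFile's parser (POPFile is written in Perl); the function names and the Unicode ranges chosen for each class are assumptions made for this example only.

```python
import itertools
from typing import List


def char_class(ch: str) -> str:
    """Return a coarse character class for one character (illustrative ranges)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:          # CJK Unified Ideographs
        return "kanji"
    if 0x3041 <= code <= 0x3096:          # Hiragana letters
        return "hiragana"
    if 0x30A1 <= code <= 0x30FA or code == 0x30FC:  # Katakana letters and the long-vowel mark
        return "katakana"
    if ch.isspace():
        return "space"
    if ch.isalnum():
        return "alnum"
    return "symbol"


def split_by_class(text: str) -> List[str]:
    """Start a new token wherever the character class changes; drop whitespace."""
    tokens = []
    for cls, group in itertools.groupby(text, key=char_class):
        if cls != "space":
            tokens.append("".join(group))
    return tokens


print(split_by_class("私はメールを読む"))
# ['私', 'は', 'メール', 'を', '読', 'む']
```

        Note that the verb 読む is cut into 読 and む because it mixes Kanji and Hira-gana. That is the kind of inaccuracy a dictionary-based parser avoids.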

        (4) The comparison between the parsers

        Speed (fastest first):
        Simple -> MeCab -> Kakasi

        Accuracy in Japanese parsing (most accurate first):
        MeCab -> Kakasi -> Simple

        Data size (smallest first):
        Simple -> Kakasi -> MeCab

        POPFile classification accuracy:
        MeCab == Kakasi == Simple

        Naoki

         
        • naoki iimura

          naoki iimura - 2007-09-07

          I committed all of my patches to the b0_22_2 branch.

          Naoki

           
    • Texas Fett

      Texas Fett - 2007-08-31

      I am wondering about the performance improvements.  A couple people wrote in that it really helped.  We need to make sure it doesn't somehow make it slower in some unusual case.  Is there any possibility it could be operating system or Perl version dependent?

      We had a speedup that involved deleting but had to remove it because it didn't support Win98.  It seems to have been work done on the 0.23 branch, but seems simple enough that it could be backported if we can work around the 98 problem.  Is there any way we can test for 98 and use the old code only in that case?

      http://sourceforge.net/forum/forum.php?thread_id=1454786&forum_id=230652

      Some other patches that have been around for a while:
      https://sourceforge.net/tracker/index.php?func=detail&aid=1091590&group_id=63137&atid=502958
      https://sourceforge.net/tracker/index.php?func=detail&aid=928685&group_id=63137&atid=502958
      https://sourceforge.net/tracker/index.php?func=detail&aid=1202342&group_id=63137&atid=502958
      http://sourceforge.net/tracker/index.php?func=detail&aid=1537724&group_id=63137&atid=502958

      Other than that, I can't think of anything.  But the improved Vista support is pretty valuable and that performance improvement sounds great so even if that is it, we should release it.

       
      • Texas Fett

        Texas Fett - 2007-08-31
         
        • Manni

          Manni - 2007-08-31

          Thanks for the heads-up. I haven't looked at the tracker for quite some time.

          I've fixed the IMAP bug and that last one.

          Manni

           
      • naoki iimura

        naoki iimura - 2007-09-07

        Joseph,

        >> I am wondering about the performance improvements. <<

        My patch changes one regex and removes a loop.
        I don't think it can slow POPFile down.

        I wrote a benchmark script to measure the effect of my change.
        http://amatubu.skr.jp/POPFile/patch_performance_tets.pl

        Here are the results in several environments:

        (1) Windows XP Pro SP2 / Celeron 1.06GHz / ActivePerl 5.8.8 build 822

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 455 wallclock secs (438.45 usr +  0.05 sys = 438.50 CPU) @ 114.03/s (n=50000)
        test_0225_patch: 27 wallclock secs (26.47 usr +  0.00 sys = 26.47 CPU) @ 1889.07/s (n=50000)

        (2) Windows 98 / Pentium 166MHz / ActivePerl 5.8.4 build 810

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 5201 wallclock secs (5201.01 usr +  0.00 sys = 5201.01 CPU) @  9.61/s (n=50000)
        test_0225_patch: 228 wallclock secs (227.55 usr +  0.00 sys = 227.55 CPU) @ 219.73/s (n=50000)

        (3) Windows 2000 Pro / Pentium II 450MHz / ActivePerl 5.8.8 build 822

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 2570 wallclock secs (1673.22 usr +  0.42 sys = 1673.64 CPU) @ 29.88/s (n=50000)
        test_0225_patch: 105 wallclock secs (92.49 usr +  0.01 sys = 92.50 CPU) @ 540.52/s (n=50000)

        (4) Windows XP Home SP2 / Pentium II 450MHz / ActivePerl 5.8.8 build 822

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 1632 wallclock secs (1603.69 usr +  0.09 sys = 1603.78 CPU) @ 31.18/s (n=50000)
        test_0225_patch: 93 wallclock secs (91.36 usr +  0.01 sys = 91.37 CPU) @ 547.23/s (n=50000)

        (5) Mac OS X 10.3.9 / PowerPC G3 500MHz / Perl 5.8.1

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 1181 wallclock secs (920.41 usr +  6.02 sys = 926.43 CPU) @ 53.97/s (n=50000)
        test_0225_patch: 59 wallclock secs (54.59 usr +  0.18 sys = 54.77 CPU) @ 912.91/s (n=50000)

        (6) Mac OS X 10.4.9 / PowerPC G4 1.25GHz / Perl 5.8.6

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 697 wallclock secs (358.87 usr +  8.80 sys = 367.67 CPU) @ 135.99/s (n=50000)
        test_0225_patch: 37 wallclock secs (23.12 usr +  0.27 sys = 23.39 CPU) @ 2137.67/s (n=50000)

        (7) Windows 2000 Pro / Virtual PC / ActivePerl 5.8.8 build 820

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 1241 wallclock secs (1063.74 usr +  1.18 sys = 1064.92 CPU) @ 46.95/s (n=50000)
        test_0225_patch: 85 wallclock secs (59.67 usr +  0.16 sys = 59.83 CPU) @ 835.74/s (n=50000)

        (8) Windows 98 / Virtual PC / ActivePerl 5.8.8 build 820

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 1863 wallclock secs (1862.97 usr +  0.00 sys = 1862.97 CPU) @ 26.84/s (n=50000)
        test_0225_patch: 110 wallclock secs (109.25 usr +  0.00 sys = 109.25 CPU) @ 457.67/s (n=50000)

        (9) Debian GNU/Linux 4.0 x86 / SF.jp compile farm / Perl 5.8.8

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 268 wallclock secs (260.61 usr +  0.00 sys = 260.61 CPU) @ 191.86/s (n=50000)
        test_0225_patch: 16 wallclock secs (15.53 usr +  0.00 sys = 15.53 CPU) @ 3219.58/s (n=50000)

        (10) Debian GNU/Linux 4.0 amd64 / SF.jp compile farm / Perl 5.8.8

        Benchmark: timing 50000 iterations of test_0225, test_0225_patch...
        test_0225: 142 wallclock secs (141.98 usr +  0.00 sys = 141.98 CPU) @ 352.16/s (n=50000)
        test_0225_patch: 10 wallclock secs ( 9.79 usr +  0.00 sys =  9.79 CPU) @ 5107.25/s (n=50000)

        In every case, the patched version is faster than the current version.
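
        To put numbers on that, the short Python sketch below divides the CPU seconds of test_0225 by those of test_0225_patch for each environment. The figures are copied from the Benchmark output above; the variable and function names are just for this example. The ratios come out between roughly 14 and 23.

```python
# CPU seconds (usr + sys) copied from the Benchmark output in this post:
# environment number -> (test_0225, test_0225_patch)
CPU_SECONDS = {
    1: (438.50, 26.47),
    2: (5201.01, 227.55),
    3: (1673.64, 92.50),
    4: (1603.78, 91.37),
    5: (926.43, 54.77),
    6: (367.67, 23.39),
    7: (1064.92, 59.83),
    8: (1862.97, 109.25),
    9: (260.61, 15.53),
    10: (141.98, 9.79),
}


def speedup(before: float, after: float) -> float:
    """How many times faster the patched code ran (ratio of CPU seconds)."""
    return before / after


SPEEDUPS = {env: speedup(*times) for env, times in CPU_SECONDS.items()}

for env, factor in sorted(SPEEDUPS.items()):
    print(f"({env}) {factor:.1f}x faster")
```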

        Naoki

         
    • Brian Smith

      Brian Smith - 2007-08-31

      >> Brian's changes for Vista support <<

      Although I've committed some Vista-related changes, I've still got to fix the problem where "standard" users running some of the NSIS-based utilities will see a message box asking them to shut down POPFile and click a button to continue even if POPFile is not running. I may be able to do some work on this problem at the weekend.

      >> Anything else we'd want to slip into it? <<

      What about Naoki's "transaction" patch that should avoid the "database is locked" errors that some users have been reporting with 0.22.4 and 0.22.5? I have been using the patch for a while without noticing any problems but then my system never displayed any "database is locked" messages before I applied the patch (perhaps because I don't get a lot of mail?).
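
      The patch itself is not shown in this thread. As a general illustration of what grouping writes into one transaction looks like, here is a minimal Python sketch using the standard sqlite3 module. The table, column and function names are hypothetical and are not POPFile's schema or code.

```python
import sqlite3
from typing import Dict


def store_word_counts(conn: sqlite3.Connection, counts: Dict[str, int]) -> None:
    """Write all rows in one transaction (hypothetical table, not POPFile's)."""
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the rows are written together and never left half-written.
    with conn:
        conn.executemany(
            "INSERT INTO words (word, times) VALUES (?, ?)",
            counts.items(),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, times INTEGER)")
store_word_counts(conn, {"meeting": 5, "offer": 3})
```

      Holding the write lock once per batch instead of once per row is the usual reason this technique reduces lock contention; whether that is how the patch works is an assumption.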

      A newer version of ActivePerl has been released but I have not upgraded to it yet (too many other things to worry about at the moment).

      >> I am wondering about the performance improvements. A couple people wrote in that it really helped. We need to make sure it doesn't somehow make it slower in some unusual case. <<

      I'm no Perl expert but it looks to me like the patch removes a loop so I don't see how it can slow things down.

      >> We had a speedup that involved deleting but had to remove it because it didn't support Win98. It seems to have been work done on the 0.23 branch, but seems simple enough that it could be backported if we can work around the 98 problem. Is there any way we can test for 98 and use the old code only in that case? <<

      I'll need to check the link you quoted but if I remember correctly that patch was a long time ago. The minimal Perl has been changed a few times since then so I think it might be worth taking another look at it. The installer already does some checks on the Windows version so it could do something at install-time (or later) to adjust things to make the code work on Win98 (I still have a working Win98 system).

      >> Japanese users will be able to choose the Japanese parser. <<

      If the MeCab option is going to have its own installer, do you want the POPFile installer to offer to download it in a way similar to the way it downloads the SSL support files?

      If you don't want the POPFile installer to download the new MeCab installer then if Nihongo is selected the POPFile installer could simply display a link to the web page about the MeCab installer or even open the page in the default browser.

      Brian

       
      • naoki iimura

        naoki iimura - 2007-09-07

        Brian,

        >> What about Naoki's "transaction" patch that should avoid the "database is locked" errors that some users have been reporting with 0.22.4 and 0.22.5? <<

        I have never seen the 'database is locked' message either, so I have no idea
        whether my patch can solve the problem. But I think it's worth trying.

        Naoki

         
        • Marc Bejarano

          Marc Bejarano - 2007-10-15

          hi, amatubu.  i was seeing those locked messages and they went away with the patch, but popfile still locks up :(

          see http://sourceforge.net/forum/forum.php?thread_id=1768349&forum_id=213100

          thanks,
          marc

           
          • Marc Bejarano

            Marc Bejarano - 2007-10-15

            now that i look closer using procexp, i see that popfileif.exe isn't totally locked up.  eudora times out waiting for it and trying to pull up the Web UI never succeeds, but popfile is trying to do something because it has never-ending IO and CPU usage.

            if there's anything you guys can think of to collect more info to get to the bottom of this, let me know.

             
            • Texas Fett

              Texas Fett - 2007-10-15

              Marc, have you tried to disable POPFile's tray icon?  It is known to cause POPFile to freeze on some computers.

               
              • Marc Bejarano

                Marc Bejarano - 2007-10-19

                like i said, it wasn't really frozen, just unresponsive.  i used to run without the tray icon, but have been using it for a while with no apparent problems.  just turned it off.  we'll see what happens.

                thanks,
                marc

                 
          • naoki iimura

            naoki iimura - 2007-10-16

            Hi Marc

            Thank you for your report.
            I'm sorry that my patch could not solve your whole problem.
            But if it solves part of the problem, I'd like to merge it into the next version of POPFile.

            I'll commit it to CVS later.

            Naoki

             
            • Marc Bejarano

              Marc Bejarano - 2007-10-19

              sounds good.  it certainly hasn't made things worse and seems like The Right Thing :)

              i had a new error today:
              Use of uninitialized value in string eq at C:\PROGRA~1\POPFile/Classifier/Bayes.pm line 1547, <GEN29> line 6094.

              popfile kept chugging right along after it, though.

              btw: what do all those errors like:
              --
              Day too small - -134774 > 0
              Sec too small - -134774 < 0
              Day too small - -134774 > 0
              Sec too small - -134774 < 0
              Day too small - -134774 > 0
              Sec too small - -134774 < 0
              Day too big - 45217 > 24855
              Sec too big - 45217 > 11647
              --
              mean?  if they're harmless, why are we printing them?

               
              • Brian Smith

                Brian Smith - 2007-10-19

                >> btw: what do all those errors like ... Sec too big - 45217 > 11647 ... mean? <<

                These messages come from a bug in Time::Local.pm v1.11 which is what the minimal Perl uses. This came up in a long forum discussion ages ago (I won't go into details). When I checked CPAN I found that this bug had been fixed some time ago (e.g. Time::Local v1.17 does not show these "too big" etc messages) but the ActiveState repository does not seem to have a version with the bug-fix in it.

                You can get Time::Local v1.17 from another repository, such as bribes.
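
                As a side note, the limits quoted in those messages (24855 days and 11647 seconds) match what you get by splitting the largest signed 32-bit integer into whole days plus leftover seconds. The Python sketch below only checks that arithmetic; it is an observation about the numbers, not a reading of the Time::Local source.

```python
INT32_MAX = 2**31 - 1      # largest signed 32-bit value: 2147483647
SECS_PER_DAY = 24 * 60 * 60

# Whole days that fit in INT32_MAX seconds, and the seconds left over.
max_day, max_sec = divmod(INT32_MAX, SECS_PER_DAY)

print(max_day, max_sec)    # 24855 11647
```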

                Brian

                 
      • naoki iimura

        naoki iimura - 2007-09-07

        Brian,

        >> If the MeCab option is going to have its own installer, do you want the POPFile installer to offer to download it in a way similar to the way it downloads the SSL support files? <<

        That sounds good.

        To use MeCab, we need to install the MeCab perl binding, dictionaries and
        the configuration files. And we also need to set an environment variable.

        (1) The MeCab perl binding

        Since the PPD of the MeCab perl binding has not been released, I've created
        one. You can get it from here:
        http://idisk.mac.com/amatubu/Public/MeCab/MeCab.ppd

        (2) The dictionaries and the configuration file

        I zipped the dictionaries, the configuration file and the MeCab documentation.
        You can download the archive here:
        http://idisk.mac.com/amatubu/Public/MeCab/mecab-ipadic.zip

        The archive should be unzipped in the POPFile folder.

        (3) The environment variable

        We have to set an environment variable called "MECABRC" to point to the
        "mecabrc" configuration file.
        The configuration file is in the "etc" folder in the "mecab" folder.
        If POPFile is installed in "C:\Program Files\POPFile", the "MECABRC"
        environment variable should be "C:\Program Files\POPFile\mecab\etc\mecabrc".
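
        The path rule above can be sketched in a few lines of Python. This is only an illustration of how the value is derived from the POPFile folder; the function names are made up for this example, and a real installer would set the variable persistently rather than for a single process.

```python
import ntpath   # Windows path rules, usable on any platform
import os


def mecabrc_path(popfile_root: str) -> str:
    """Location of the mecabrc file inside the POPFile folder."""
    return ntpath.join(popfile_root, "mecab", "etc", "mecabrc")


def set_mecabrc(popfile_root: str, environ=os.environ) -> str:
    """Set MECABRC in the given environment mapping and return the value."""
    path = mecabrc_path(popfile_root)
    environ["MECABRC"] = path   # affects this process and its children only
    return path


env = {}
print(set_mecabrc(r"C:\Program Files\POPFile", env))
# C:\Program Files\POPFile\mecab\etc\mecabrc
```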

        The perl binding and the dictionaries are now hosted on my rental server.
        I'm wondering whether they should be on the SF.net web server.

        Naoki

         
        • Brian Smith

          Brian Smith - 2007-09-08

          Naoki,

          >> To use MeCab, we need to install the MeCab perl binding, dictionaries and the configuration files. And we also need to set an environment variable. <<

          Thanks for the links to those files. I've downloaded them all and managed to add the MeCab binding to my Perl installation.

          Once I've fixed the problem with the installer and other NSIS-based utilities not working properly when run by standard (i.e. non-admin) users, I'll update the installer to offer MeCab as an option.

          Downloading and installing MeCab should be easy to do because I can modify the code used to download and install the SSL support files. The installer already creates two environment variables for the Kakasi package so adding support for the new MECABRC variable should also be easy.

          The only completely new code that MeCab will require is the code that offers MeCab as an installation option.

          The current installer always installs Kakasi when "Nihongo" is selected as the installer language.

          Since the new popfile.cfg parameter is used to switch between Kakasi and MeCab, am I correct in assuming you want the installer to always install Kakasi and then offer MeCab as an option?

          If the MeCab option is selected the MeCab Perl binding and the large MeCab dictionary package will be downloaded and installed.

          If the MeCab option is selected the installer will also offer to use the new popfile.cfg parameter to make POPFile use the MeCab parser instead of Kakasi. The installer already changes some popfile.cfg settings so this would only be a minor change.

          These new MeCab options could be shown on the components page but if that page does not have enough room to describe the benefits MeCab offers then a separate "MeCab" page could be shown in the installer with checkboxes to select the options.

          It will be about a week or so before I can start work on these MeCab changes. If you have any ideas about how the installer should be changed to handle the MeCab installation let me know. For example, if MeCab is selected then the installer could skip the Kakasi installation because MeCab is more accurate.

          >> The perl binding and the dictionaries are now hosted on my rental server. I'm wondering whether they should be on the SF.net web server. <<

          I cannot answer this question. I'm not sure how much space we have left on the SF.net server anyway!

          For test purposes I can make the installer download the MeCab files from a personal server (that's how I started testing the new SSL patch system used for the 0.22.5 release).

          Brian

           
          • naoki iimura

            naoki iimura - 2007-09-10

            Brian,

            >>Once I've fixed the problem with the installer and other NSIS-based
            utilities not working properly when run by standard (i.e. non-admin) users,
            I'll update the installer to offer MeCab as an option. <<

            Thanks. I'm looking forward to the new installer.

            >>Since the new popfile.cfg parameter is used to switch between Kakasi and
            MeCab, am I correct in assuming you want the installer to always install
            Kakasi and then offer MeCab as an option?<<

            No. Since users who want to use MeCab don't need Kakasi, I don't think
            we need to install Kakasi when the user chooses the MeCab option.

            >>These new MeCab options could be shown on the components page but if that
            page does not have enough room to describe the benefits MeCab offers then a
            separate "MeCab" page could be shown in the installer with checkboxes to
            select the options.

            It will be about a week or so before I can start work on these MeCab changes.
            If you have any ideas about how the installer should be changed to handle
            the MeCab installation let me know. For example, if MeCab is selected then
            the installer could skip the Kakasi installation because MeCab is more
            accurate.<<

            Thanks for your suggestion.

            I think it would be best to create a new option page for choosing the
            Japanese parser.
            On this page, users can choose which parser to use: 'Kakasi' (default),
            'MeCab' or 'Internal'.

            If the 'Kakasi' option is chosen, the installer will install Kakasi and the
            dictionaries and popfile.cfg will be configured to use Kakasi.
            If the 'MeCab' option is chosen, the installer will download and install
            MeCab and the dictionaries and popfile.cfg will be configured to use MeCab.
            If the 'Internal' option is chosen, the installer will install no additional
            modules and popfile.cfg will be configured to use the internal parser.

            (When a user is upgrading POPFile, the parser selected in the previous
            version should be selected by default.)
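
            The rules above amount to a small decision table. Here is a Python sketch of it (the real installer is written in NSIS). The field names and the config values "kakasi", "mecab" and "internal" are placeholders for this example, since the actual popfile.cfg parameter is not named in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParserPlan:
    install_bundled: bool   # install files shipped inside the installer
    download: bool          # fetch files from the Internet
    config_value: str       # placeholder value written to popfile.cfg


PLANS = {
    "kakasi": ParserPlan(install_bundled=True, download=False, config_value="kakasi"),
    "mecab": ParserPlan(install_bundled=False, download=True, config_value="mecab"),
    "internal": ParserPlan(install_bundled=False, download=False, config_value="internal"),
}
DEFAULT_PARSER = "kakasi"


def default_choice(previous: Optional[str]) -> str:
    """On upgrade, preselect the parser used before; otherwise Kakasi."""
    return previous if previous in PLANS else DEFAULT_PARSER


def plan_for(choice: str) -> ParserPlan:
    """Look up what the installer should do for the selected parser."""
    return PLANS[choice]
```

            For a fresh install, default_choice(None) gives "kakasi"; for an upgrade from a MeCab setup, default_choice("mecab") keeps "mecab" selected.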

            In addition, I'd like the installer to show brief information about the
            parsers so that users can decide which one to use.

            Here's my mock-up of the option page:

            (Japanese version)
            http://idisk.mac.com/amatubu/Public/POPFile/parser_selector_ja.png
            (English version)
            http://idisk.mac.com/amatubu/Public/POPFile/parser_selector_en.png
            (NSIS source code)
            http://idisk.mac.com/amatubu/Public/POPFile/parser_selector_src.zip

            Here is a text version of the option page:

            ---
            Please choose the Japanese wakachi-gaki (word-splitting) parser program:

            (For POPFile to analyze Japanese e-mail, the message body has to be split
            into words (wakachi-gaki).)

            (o) Kakasi - KAnji KAna Simple Inverter (Recommended)

            This program has been used by POPFile 0.22.5 and earlier.
            Its wakachi-gaki accuracy is poorer than MeCab's (because Kakasi has no
            information about words written in Hira-gana or Kata-kana), but Kakasi
            uses smaller dictionaries (about 2MB).
            The POPFile installer contains Kakasi and its dictionaries.

            ( ) MeCab - Yet Another Part-of-Speech and Morphological Analyzer

            Its wakachi-gaki accuracy is better than Kakasi's, but MeCab uses larger
            dictionaries (about 40MB).
            The POPFile installer does not contain MeCab. It will be downloaded from
            the Internet.

            ( ) The internal parser - splitting by the kinds of characters

            Instead of using an external program, this parser splits text by character
            type (e.g. Kanji, Hira-gana or Kata-kana).
            Its wakachi-gaki accuracy is poorer than that of programs which use
            dictionaries, but because it uses no dictionaries it is faster.

            Note:

            * Wakachi-gaki accuracy does not directly determine POPFile's
              classification accuracy. In an experiment, POPFile's accuracy was not
              affected by the choice of program.
            * Changing the wakachi-gaki program may temporarily reduce POPFile's
              classification accuracy.
            * You can change the wakachi-gaki program after installation. For more
              information, please see: http://popfile.sourceforge.net/wiki/jp:faq:mecab
            ---

            Naoki

             
            • Brian Smith

              Brian Smith - 2007-09-10

              Naoki,

              Thanks for all of your feedback.

              >> I think it is the best to create a new option page for choosing Japanese parsers. <<

              Your screenshots and NSIS code will make it much easier for me to update the installer.

              Thanks also for your detailed explanation of what you'd like the installer to do when Nihongo is selected.

              >> I'm looking forward to the new installer. <<

              I think I'll be able to start work on it at the weekend.

              Brian

               