From: <jgr...@us...> - 2003-07-09 18:18:47
Update of /cvsroot/popfile/engine/Classifier
In directory sc8-pr-cvs1:/tmp/cvs-serv2831/Classifier
Modified Files:
Bayes.pm MailParse.pm
Log Message:
PERFORMANCE CHANGES
Bayes.pm: Added new add_messages_to_bucket API to add multiple messages to a
bucket at the same time with a single read/write of the appropriate
corpus table for speed.
New write_line__ method writes a line to a MSG file and optionally
passes it to the parse_line API of MailParse.pm. We now write a file
to disk and parse it in the same pass, without reloading the MSG file
from disk, for speed.
The MSG file gets a temporary name until the CLS file is written, so
that a history reload in the middle of a download does not end up
with a message that has a class file error.
classify_file becomes classify and can classify either from a file
or from the preparsed information in the parser
classify_and_modify returns the name of the file where the message
was stored in addition to the classification.
HTML.pm: Use the add_messages_to_bucket API to speed up reclassification.
Use the new classify method in Bayes.pm to classify a file after it
has been digested by the parser for colorization and get the word
scores. This means we only load the MSG file once (used to be
twice) and hence double the speed of viewing a colorized message.
New methods load_disk_cache__ and save_disk_cache__ are used to
keep a copy of the history cache on disk between sessions so that
session start up is as fast as possible. There will be no need
to parse messages for header information on start up if the
history_cache file is present.
Removed the boundary feature because it is incompatible with the
concept of a "download" since we now send new history file messages
asynchronously through the MQ.
Load the history cache progressively as files are written. The proxies
send the message NEWFL and the method new_history_file__ adds the
file to the history. This is done so that when the user hits the
History tab button after a mail download the history cache is
already loaded and there should be no delay in displaying the
history page.
MailParse.pm: Renamed parse_stream to parse_file since that's a better name
New start_parse, stop_parse and parse_line APIs so that a file can
be parsed line by line.
MQ.pm: Defined a new message type NEWFL which is used to indicate that
a file has been added to the history cache. NEWFL's message
is the name of the file (the MSG file) that was added.
POP3.pm: Send the NEWFL message through the pipe to the parent so that
the history is aware of new messages.
SMTP.pm:
NNTP.pm: Send CLASS and NEWFL messages through the pipe to the parent.
insert.pl: Updated to use new parse_file API
bayes.pl: Updated to use the new classify API instead of classify_file.
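
The write-while-parsing flow described for Bayes.pm can be sketched roughly as
follows. This is only an illustration under stated assumptions: TinyParser,
store_message and the example.msg/example.cls names are hypothetical stand-ins
and are not POPFile code. The sketch mirrors the start_parse / parse_line /
stop_parse calling order and the rename of the .tmp file once the class file
exists.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for the mail parser: it only counts lowercase words.
package TinyParser;

sub new
{
    my ( $class ) = @_;
    return bless { words => {} }, $class;
}

# Reset state; must be called before the first parse_line
sub start_parse
{
    my ( $self ) = @_;
    $self->{words} = {};
}

# Count every run of letters found on one line
sub parse_line
{
    my ( $self, $line ) = @_;
    $self->{words}{ lc( $1 ) }++ while ( $line =~ /([A-Za-z]+)/g );
}

# Nothing to flush in this toy parser
sub stop_parse { }

package main;

# Write each line to NAME.tmp and feed it to the parser in the same pass,
# write the class file, and only then give the message its final name.
sub store_message
{
    my ( $parser, $msg_file, $cls_file, @lines ) = @_;

    $parser->start_parse();
    open my $tmp, '>', "$msg_file.tmp" or die "Cannot write $msg_file.tmp: $!";
    foreach my $line (@lines) {
        print $tmp $line;
        $parser->parse_line( $line );
    }
    close $tmp;
    $parser->stop_parse();

    # Toy "classification" (an assumption of this sketch): the most frequent word
    my $words = $parser->{words};
    my ( $class ) = sort { $words->{$b} <=> $words->{$a} || $a cmp $b } keys %$words;

    open my $cls, '>', $cls_file or die "Cannot write $cls_file: $!";
    print $cls "$class\n";
    close $cls;

    # The class file exists, so the message can now take its real name
    rename "$msg_file.tmp", $msg_file or die "Cannot rename $msg_file.tmp: $!";
    return $class;
}

my $dir    = tempdir( CLEANUP => 1 );
my $parser = TinyParser->new();
my $class  = store_message( $parser, "$dir/example.msg", "$dir/example.cls",
                            "Subject: spam spam eggs\n", "\n", "more spam\n" );
print "classified as $class\n";
```

The message is read only once, as it is written, and anything scanning the
directory for .msg files never sees a message without its class file.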
TEST SUITE CHANGES
tests.pl: New test_assert_regexp function for doing fuzzy matching of
test results.
Returns 0 if all tests run successfully, and 1 if there are
any errors
TestLogger.tst: New file for testing POPFile::Logger functionality.
Makefile: The test target has a variable TESTARGS that can be set to the
specific module (or modules, using glob patterns) to run.
For example: gmake test TESTARGS='TestLogger'
There's a new coverage target to run the test suite and output
code coverage information for the modules used.
TestCoverage.pm: New module that provides line coverage information for
the test suite. Executed as a Perl debugger using the -d
switch and outputs code coverage information for all
POPFile files tested.
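
As an illustration of the fuzzy matching and exit status described for
tests.pl, here is a minimal sketch. The argument order of test_assert_regexp,
the counters and the sample log line are assumptions made for this example;
the real tests.pl may differ.

```perl
use strict;
use warnings;

my ( $test_count, $test_failures ) = ( 0, 0 );

# Hypothetical helper: passes when $got matches the pattern $re
sub test_assert_regexp
{
    my ( $got, $re, $context ) = @_;

    $test_count += 1;
    return 1 if ( $got =~ /$re/ );

    $test_failures += 1;
    print "Test fail: '$got' does not match /$re/" .
        ( defined( $context ) ? " ($context)" : '' ) . "\n";
    return 0;
}

# Fuzzy match: the timestamp differs on every run, the message text does not
my $log_line = localtime() . ': 1234: test message';
test_assert_regexp( $log_line, '\d+:\d+:\d+ \d{4}: \d+: test message$', 'logger output' );
test_assert_regexp( 'bucket spam', '^bucket \w+$' );

print "$test_count tests, $test_failures failures\n";

# 0 when everything passed, 1 otherwise
my $status = $test_failures ? 1 : 0;
```

A runner built this way would finish with exit( $status ), so that gmake sees a
non-zero status when any test fails.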
Index: Bayes.pm
===================================================================
RCS file: /cvsroot/popfile/engine/Classifier/Bayes.pm,v
retrieving revision 1.161
retrieving revision 1.162
diff -C2 -d -r1.161 -r1.162
*** Bayes.pm 6 Jul 2003 07:59:30 -0000 1.161
--- Bayes.pm 9 Jul 2003 18:18:11 -0000 1.162
***************
*** 483,495 ****
if ( open WORDS, '<' . $self->config_( 'corpus' ) . "/$bucket/table" ) {
! while (<WORDS>) {
! if ( /__CORPUS__ __VERSION__ (\d+)/ ) {
! if ( $1 != $self->{corpus_version__} ) {
! print "Incompatible corpus version in $bucket\n";
! return;
! }
!
! next;
}
s/[\r\n]//g;
--- 483,497 ----
if ( open WORDS, '<' . $self->config_( 'corpus' ) . "/$bucket/table" ) {
! my $first = <WORDS>;
! if ( $first =~ s/^__CORPUS__ __VERSION__ (\d+)// ) {
! if ( $1 != $self->{corpus_version__} ) {
! print "Incompatible corpus version in $bucket\n";
! return;
}
+ } else {
+ return;
+ }
+
+ while ( <WORDS> ) {
s/[\r\n]//g;
***************
*** 563,569 ****
# ---------------------------------------------------------------------------------------------
#
! # classify_file
#
! # $file The name of the file containing the text to classify
# $ui Reference to the UI used when doing colorization
#
--- 565,572 ----
# ---------------------------------------------------------------------------------------------
#
! # classify
#
! # $file The name of the file containing the text to classify (or undef to use
! # the data already in the parser)
# $ui Reference to the UI used when doing colorization
#
***************
*** 572,578 ****
#
# ---------------------------------------------------------------------------------------------
! sub classify_file
{
! my ($self, $file, $ui) = @_;
my $msg_total = 0;
--- 575,581 ----
#
# ---------------------------------------------------------------------------------------------
! sub classify
{
! my ( $self, $file, $ui ) = @_;
my $msg_total = 0;
***************
*** 580,584 ****
$self->{magnet_detail__} = '';
! $self->{parser__}->parse_stream($file);
# Check to see if this email should be classified based on a magnet
--- 583,589 ----
$self->{magnet_detail__} = '';
! if ( defined( $file ) ) {
! $self->{parser__}->parse_file( $file );
! }
# Check to see if this email should be classified based on a magnet
***************
*** 927,930 ****
--- 932,956 ----
}
+ # ---------------------------------------------------------------------------------------------
+ #
+ # write_line__
+ #
+ # Writes a line to a file and parses it
+ #
+ # $file File handle for file to write line to
+ # $line The line to write
+ # $class The current classification
+ #
+ # ---------------------------------------------------------------------------------------------
+ sub write_line__
+ {
+ my ( $self, $file, $line, $class ) = @_;
+
+ print $file $line;
+
+ if ( $class eq '' ) {
+ $self->{parser__}->parse_line( $line );
+ }
+ }
# ---------------------------------------------------------------------------------------------
***************
*** 944,948 ****
# $echo - 1 to echo to the client, 0 to supress, defaults to 1
#
! # Returns a classification if it worked, otherwise returns an empty string
#
# ---------------------------------------------------------------------------------------------
--- 970,975 ----
# $echo - 1 to echo to the client, 0 to supress, defaults to 1
#
! # Returns a classification if it worked and the name of the file where the message
! # was saved
#
# ---------------------------------------------------------------------------------------------
***************
*** 988,992 ****
my $class_file = $self->history_filename($dcount,$mcount, ".cls",0);
! open TEMP, ">$temp_file";
while ( <$mail> ) {
--- 1015,1028 ----
my $class_file = $self->history_filename($dcount,$mcount, ".cls",0);
! # If we don't yet know the classification then start the parser
! if ( $class eq '' ) {
! $self->{parser__}->start_parse();
! }
!
! # We append .TMP to the filename for the MSG file so that if we are in
! # middle of downloading a message and we refresh the history we do not
! # get class file errors
!
! open TEMP, ">$temp_file.tmp";
while ( <$mail> ) {
***************
*** 1024,1028 ****
if ( !( $line =~ /^(\r\n|\r|\n)$/i ) ) {
$message_size += length $line;
! print TEMP $fileline;
# If there is no echoing occuring, it doesn't matter what we do to these
--- 1060,1064 ----
if ( !( $line =~ /^(\r\n|\r|\n)$/i ) ) {
$message_size += length $line;
! $self->write_line__( \*TEMP, $fileline, $class );
# If there is no echoing occuring, it doesn't matter what we do to these
***************
*** 1053,1057 ****
}
} else {
! print TEMP "\n";
$message_size += length $eol;
$getting_headers = 0;
--- 1089,1093 ----
}
} else {
! $self->write_line__( \*TEMP, "\n", $class );
$message_size += length $eol;
$getting_headers = 0;
***************
*** 1060,1064 ****
$message_size += length $line;
$msg_body .= $line;
! print TEMP $fileline;
}
--- 1096,1100 ----
$message_size += length $line;
$msg_body .= $line;
! $self->write_line__( \*TEMP, $fileline, $class );
}
***************
*** 1075,1082 ****
close TEMP;
# Do the text classification and update the counter for that bucket that we just downloaded
# an email of that type
! $classification = ($class ne '')?$class:$self->classify_file($temp_file);
my $modification = $self->config_( 'subject_mod_left' ) . $classification . $self->config_( 'subject_mod_right' );
--- 1111,1123 ----
close TEMP;
+ # If we don't yet know the classification then stop the parser
+ if ( $class eq '' ) {
+ $self->{parser__}->stop_parse();
+ }
+
# Do the text classification and update the counter for that bucket that we just downloaded
# an email of that type
! $classification = ($class ne '')?$class:$self->classify(undef);
my $modification = $self->config_( 'subject_mod_left' ) . $classification . $self->config_( 'subject_mod_right' );
***************
*** 1177,1184 ****
if ( !$nosave ) {
! $self->history_write_class($class_file, undef, $classification, undef, ($self->{magnet_used__}?$self->{magnet_detail__}:undef))
}
! return $classification;
}
--- 1218,1231 ----
if ( !$nosave ) {
! $self->history_write_class($class_file, undef, $classification, undef, ($self->{magnet_used__}?$self->{magnet_detail__}:undef));
!
! # Now rename the MSG file, since the class file has been written it's safe for the mesg
! # file to have the correct name. If the history cache is reloaded then we wont have a class
! # file error since it was already written
!
! rename "$temp_file.tmp", $temp_file;
}
! return ( $classification, $nopath_temp_file );
}
***************
*** 1374,1380 ****
$self->{parser__}->{color__} = 1;
$self->{parser__}->{bayes__} = bless $self;
! my $result = $self->{parser__}->parse_stream($file);
$self->{parser__}->{color__} = 0;
- $self->{parser__}->{words__} = {};
return $result;
--- 1421,1426 ----
$self->{parser__}->{color__} = 1;
$self->{parser__}->{bayes__} = bless $self;
! my $result = $self->{parser__}->parse_file( $file );
$self->{parser__}->{color__} = 0;
return $result;
***************
*** 1450,1464 ****
# ---------------------------------------------------------------------------------------------
#
! # add_message_to_bucket
#
! # Parses a mail message and updates the statistics in the specified bucket
#
- # $file Name of file containing mail message to parse
# $bucket Name of the bucket to be updated
#
# ---------------------------------------------------------------------------------------------
! sub add_message_to_bucket
{
! my ( $self, $file, $bucket ) = @_;
my %words;
--- 1496,1510 ----
# ---------------------------------------------------------------------------------------------
#
! # add_messages_to_bucket
#
! # Parses mail messages and updates the statistics in the specified bucket
#
# $bucket Name of the bucket to be updated
+ # @files List of file names to parse
#
# ---------------------------------------------------------------------------------------------
! sub add_messages_to_bucket
{
! my ( $self, $bucket, @files ) = @_;
my %words;
***************
*** 1489,1496 ****
}
! $self->{parser__}->parse_stream( $file );
! foreach my $word (keys %{$self->{parser__}->{words__}}) {
! $words{$word} += $self->{parser__}->{words__}{$word};
}
--- 1535,1544 ----
}
! foreach my $file (@files) {
! $self->{parser__}->parse_file( $file );
! foreach my $word (keys %{$self->{parser__}->{words__}}) {
! $words{$word} += $self->{parser__}->{words__}{$word};
! }
}
***************
*** 1508,1511 ****
--- 1556,1576 ----
# ---------------------------------------------------------------------------------------------
#
+ # add_message_to_bucket
+ #
+ # Parses a mail message and updates the statistics in the specified bucket
+ #
+ # $file Name of file containing mail message to parse
+ # $bucket Name of the bucket to be updated
+ #
+ # ---------------------------------------------------------------------------------------------
+ sub add_message_to_bucket
+ {
+ my ( $self, $file, $bucket ) = @_;
+
+ $self->add_messages_to_bucket( $bucket, $file );
+ }
+
+ # ---------------------------------------------------------------------------------------------
+ #
# remove_message_from_bucket
#
***************
*** 1547,1551 ****
}
! $self->{parser__}->parse_stream( $file );
foreach my $word (keys %{$self->{parser__}->{words__}}) {
--- 1612,1616 ----
}
! $self->{parser__}->parse_file( $file );
foreach my $word (keys %{$self->{parser__}->{words__}}) {
Index: MailParse.pm
===================================================================
RCS file: /cvsroot/popfile/engine/Classifier/MailParse.pm,v
retrieving revision 1.141
retrieving revision 1.142
diff -C2 -d -r1.141 -r1.142
*** MailParse.pm 29 Jun 2003 21:02:47 -0000 1.141
--- MailParse.pm 9 Jul 2003 18:18:14 -0000 1.142
***************
*** 259,265 ****
$self->{ut__} .= $to . ' ';
}
- } else {
- $self->increment_word( $mword );
}
}
--- 259,265 ----
$self->{ut__} .= $to . ' ';
}
}
+
+ $self->increment_word( $mword );
}
***************
*** 301,307 ****
$self->{ut__} .= "<font color=\"$color\">$word<\/font> ";
}
- } else {
- increment_word( $self, $mword );
}
}
}
--- 301,307 ----
$self->{ut__} .= "<font color=\"$color\">$word<\/font> ";
}
}
+
+ $self->increment_word( $mword );
}
}
***************
*** 930,956 ****
# ---------------------------------------------------------------------------------------------
#
! # parse_stream
#
! # Read messages from a file stream and parse into a list of words and frequencies
#
# $file The file to open and parse
#
# ---------------------------------------------------------------------------------------------
! sub parse_stream
{
! my ($self, $file) = @_;
# This will contain the mime boundary information in a mime message
! my $mime = '';
# Contains the encoding for the current block in a mime message
! my $encoding = '';
# Variables to save header information to while parsing headers
! my $header = '';
! my $argument = '';
# Clear the word hash
--- 930,1001 ----
# ---------------------------------------------------------------------------------------------
#
! # parse_file
#
! # Read messages from file and parse into a list of words and frequencies, returns a colorized
! # HTML version of message if color__ is set
#
# $file The file to open and parse
#
# ---------------------------------------------------------------------------------------------
! sub parse_file
{
! my ( $self, $file ) = @_;
!
! $self->start_parse();
!
! open MSG, "<$file";
! binmode MSG;
!
! # Read each line and find each "word" which we define as a sequence of alpha
! # characters
!
! while (<MSG>) {
! $self->parse_line( $_ );
! }
!
! $self->{colorized__} .= $self->clear_out_base64();
! close MSG;
!
! $self->stop_parse();
! $self->{in_html_tag__} = 0;
!
! if ( $self->{color__} ) {
! $self->{colorized__} .= $self->{ut__} if ( $self->{ut__} ne '' );
!
! $self->{colorized__} .= "</tt>";
! $self->{colorized__} =~ s/(\r\n\r\n|\r\r|\n\n)/__BREAK____BREAK__/g;
! $self->{colorized__} =~ s/[\r\n]+/__BREAK__/g;
! $self->{colorized__} =~ s/__BREAK__/<br \/>/g;
!
! return $self->{colorized__};
! } else {
! return '';
! }
! }
!
! # ---------------------------------------------------------------------------------------------
! #
! # start_parse
! #
! # Called to reset internal variables before parsing. This is automatically called when using
! # the parse_file API, and must be called before the first call to parse_line.
! #
! # ---------------------------------------------------------------------------------------------
! sub start_parse
! {
! my ( $self ) = @_;
# This will contain the mime boundary information in a mime message
! $self->{mime__} = '';
# Contains the encoding for the current block in a mime message
! $self->{encoding__} = '';
# Variables to save header information to while parsing headers
! $self->{header__} = '';
! $self->{argument__} = '';
# Clear the word hash
***************
*** 958,965 ****
$self->{content_type__} = '';
- # Used to return a colorize page
-
- my $colorized = '';
-
# Base64 attachments are loaded into this as we read them
--- 1003,1006 ----
***************
*** 993,1006 ****
$self->{first20count__} = 0;
! $colorized .= "<tt>" if ( $self->{color__} );
! open MSG, "<$file";
! binmode MSG;
! # Read each line and find each "word" which we define as a sequence of alpha
! # characters
! while (<MSG>) {
! my $read = $_;
# For the Mac we do further splitting of the line at the CR characters
--- 1034,1079 ----
$self->{first20count__} = 0;
! # Used to return a colorize page
! $self->{colorized__} = '';
! $self->{colorized__} .= "<tt>" if ( $self->{color__} );
! }
! # ---------------------------------------------------------------------------------------------
! #
! # stop_parse
! #
! # Called at the end of a parse job. Automatically called if parse_file is used, must be
! # called after the last call to parse_line.
! #
! # ---------------------------------------------------------------------------------------------
! sub stop_parse
! {
! my ( $self ) = @_;
! # If we reach here and discover that we think that we are in an unclosed HTML tag then there
! # has probably been an error (such as a < in the text messing things up) and so we dump
! # whatever is stored in the HTML tag out
!
! if ( $self->{in_html_tag__} ) {
! $self->add_line( $self->{html_tag__} . ' ' . $self->{html_arg__}, 0, '' );
! }
! }
!
! # ---------------------------------------------------------------------------------------------
! #
! # parse_line
! #
! # Called to parse a single line from a message. If using this API directly then be sure
! # to call start_parse before the first call to parse_line.
! #
! # $line Line of file to parse
! #
! # ---------------------------------------------------------------------------------------------
! sub parse_line
! {
! my ( $self, $read ) = @_;
!
! if ( $read ne '' ) {
# For the Mac we do further splitting of the line at the CR characters
***************
*** 1016,1024 ****
if (!$self->{in_html_tag__}) {
! $colorized .= $self->{ut__};
$self->{ut__} = '';
}
! $self->{ut__} .= splitline($line, $encoding);
}
--- 1089,1097 ----
if (!$self->{in_html_tag__}) {
! $self->{colorized__} .= $self->{ut__};
$self->{ut__} = '';
}
! $self->{ut__} .= splitline($line, $self->{encoding__});
}
***************
*** 1034,1042 ****
# Parse the last header
! ($mime,$encoding) = $self->parse_header($header,$argument,$mime,$encoding);
# Clear the saved headers
! $header = '';
! $argument = '';
$self->{ut__} .= splitline( "\015\012", 0 );
--- 1107,1115 ----
# Parse the last header
! ($self->{mime__},$self->{encoding__}) = $self->parse_header($self->{header__},$self->{argument__},$self->{mime__},$self->{encoding__});
# Clear the saved headers
! $self->{header__} = '';
! $self->{argument__} = '';
$self->{ut__} .= splitline( "\015\012", 0 );
***************
*** 1055,1064 ****
# Parse the last header
! ($mime,$encoding) = $self->parse_header($header,$argument,$mime,$encoding) if ($header ne '');
# Save the new information for the current header
! $header = $1;
! $argument = $2;
next;
}
--- 1128,1137 ----
# Parse the last header
! ($self->{mime__},$self->{encoding__}) = $self->parse_header($self->{header__},$self->{argument__},$self->{mime__},$self->{encoding__}) if ($self->{header__} ne '');
# Save the new information for the current header
! $self->{header__} = $1;
! $self->{argument__} = $2;
next;
}
***************
*** 1067,1071 ****
if ( $line =~ /^([\t ].*?)(\r\n|\r|\n)/ ) {
! $argument .= "\015\012" . $1;
}
next;
--- 1140,1144 ----
if ( $line =~ /^([\t ].*?)(\r\n|\r|\n)/ ) {
! $self->{argument__} .= "\015\012" . $1;
}
next;
***************
*** 1074,1082 ****
# If we are in a mime document then spot the boundaries
! if ( ( $mime ne '' ) && ( $line =~ /^\-\-($mime)(\-\-)?/ ) ) {
# approach each mime part with fresh eyes
! $encoding = '';
if (!defined $2) {
--- 1147,1155 ----
# If we are in a mime document then spot the boundaries
! if ( ( $self->{mime__} ne '' ) && ( $line =~ /^\-\-($self->{mime__})(\-\-)?/ ) ) {
# approach each mime part with fresh eyes
! $self->{encoding__} = '';
if (!defined $2) {
***************
*** 1097,1101 ****
my $temp_mime;
! foreach my $aboundary (split(/\|/,$mime)) {
if ($boundary ne $aboundary) {
if (defined $temp_mime) {
--- 1170,1174 ----
my $temp_mime;
! foreach my $aboundary (split(/\|/,$self->{mime__})) {
if ($boundary ne $aboundary) {
if (defined $temp_mime) {
***************
*** 1107,1113 ****
}
! $mime = ($temp_mime || '');
! print "MIME boundary list now $mime\n" if $self->{debug__};
$self->{in_headers__} = 0;
}
--- 1180,1186 ----
}
! $self->{mime__} = ($temp_mime || '');
! print "MIME boundary list now $self->{mime__}\n" if $self->{debug__};
$self->{in_headers__} = 0;
}
***************
*** 1128,1132 ****
# for decoding
! if ( $encoding =~ /base64/i ) {
$line =~ s/[\r\n]//g;
$line =~ s/!$//;
--- 1201,1205 ----
# for decoding
! if ( $self->{encoding__} =~ /base64/i ) {
$line =~ s/[\r\n]//g;
$line =~ s/!$//;
***************
*** 1146,1150 ****
# Decode quoted-printable
! if ( $encoding =~ /quoted\-printable/i ) {
$line = decode_qp( $line );
$line =~ s/[\r\n]+$//g;
--- 1219,1223 ----
# Decode quoted-printable
! if ( $self->{encoding__} =~ /quoted\-printable/i ) {
$line = decode_qp( $line );
$line =~ s/[\r\n]+$//g;
***************
*** 1155,1182 ****
}
}
-
- # If we reach here and disover that we think that we are in an unclosed HTML tag then there
- # has probably been an error (such as a < in the text messing things up) and so we dump
- # whatever is stored in the HTML tag out
-
- if ( $self->{in_html_tag__} ) {
- add_line( $self, $self->{html_tag__} . ' ' . $self->{html_arg__}, 0, '' );
- }
-
- $colorized .= clear_out_base64( $self );
- close MSG;
-
- $self->{in_html_tag__} = 0;
-
- if ( $self->{color__} ) {
- $colorized .= $self->{ut__} if ( $self->{ut__} ne '' );
-
- $colorized .= "</tt>";
- $colorized =~ s/(\r\n\r\n|\r\r|\n\n)/__BREAK____BREAK__/g;
- $colorized =~ s/[\r\n]+/__BREAK__/g;
- $colorized =~ s/__BREAK__/<br \/>/g;
-
- return $colorized;
- }
}
--- 1228,1231 ----
***************
*** 1204,1208 ****
$decoded = decode_base64( $self->{base64__} );
! parse_html( $self, $decoded, 1 );
print "Decoded: " . $decoded . "\n" if ($self->{debug__});
--- 1253,1257 ----
$decoded = decode_base64( $self->{base64__} );
! $self->parse_html( $decoded, 1 );
print "Decoded: " . $decoded . "\n" if ($self->{debug__});
***************
*** 1212,1216 ****
if ( $self->{color__} ) {
if ( $self->{ut__} ne '' ) {
! $colorized = $self->{ut__};
$self->{ut__} = '';
}
--- 1261,1265 ----
if ( $self->{color__} ) {
if ( $self->{ut__} ne '' ) {
! $colorized = $self->{ut__};
$self->{ut__} = '';
}