See http://marc.info/?l=spamassassin-users&m=130885200600613
See http://verchick.com/mecham/public_html/spam/awl.update.sql for update to AWL schema for 3.3

The Bayes database should be wiped so that any problematic garbage is cleaned out.
Something like this should do (after stopping amavisd-new or Maia, and selecting
the database with USE):

DELETE from awl;
DELETE from bayes_expire;
DELETE from bayes_global_vars;
INSERT INTO bayes_global_vars VALUES ('VERSION','3');
DELETE from bayes_seen;
DELETE from bayes_token;
DELETE from bayes_vars; 
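
If you prefer to script the wipe, the statements can be run inside a single transaction so a failure leaves the data intact. A minimal Python sketch (shown against an in-memory SQLite database with stand-in tables purely to illustrate the transaction handling; against MySQL you would connect with a MySQL DB-API driver instead):

```python
# Illustrative only: replays the wipe statements through Python's DB-API.
# The stand-in tables below are NOT the real schema (see SpamAssassin's
# sql/ directory for that); they exist only so the statements can run.
import sqlite3

WIPE_STATEMENTS = [
    "DELETE FROM awl",
    "DELETE FROM bayes_expire",
    "DELETE FROM bayes_global_vars",
    "INSERT INTO bayes_global_vars VALUES ('VERSION', '3')",
    "DELETE FROM bayes_seen",
    "DELETE FROM bayes_token",
    "DELETE FROM bayes_vars",
]

def wipe_bayes(conn):
    """Run the whole wipe in one transaction; roll back on any failure."""
    cur = conn.cursor()
    try:
        for stmt in WIPE_STATEMENTS:
            cur.execute(stmt)
        conn.commit()
    except Exception:
        conn.rollback()
        raise

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for table in ("awl", "bayes_expire", "bayes_seen", "bayes_token", "bayes_vars"):
    cur.execute("CREATE TABLE %s (dummy TEXT)" % table)
cur.execute("CREATE TABLE bayes_global_vars (variable TEXT, value TEXT)")
cur.execute("INSERT INTO bayes_global_vars VALUES ('VERSION', '2')")
wipe_bayes(conn)
print(conn.execute("SELECT value FROM bayes_global_vars").fetchone()[0])
```

Running the statements by hand in the mysql client is equivalent; the script only adds the all-or-nothing transaction.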

ALTER TABLE bayes_token MODIFY token BINARY(5) NOT NULL DEFAULT '';
ALTER TABLE bayes_token DROP PRIMARY KEY;
ALTER TABLE bayes_token ADD PRIMARY KEY (id, token);

ALTER TABLE awl MODIFY email VARBINARY(255) NOT NULL DEFAULT '';
ALTER TABLE awl DROP PRIMARY KEY;
ALTER TABLE awl ADD PRIMARY KEY (username, email, signedby, ip);

If using MySQL 5.1.20 or later:

MySQL 5.1.20 and later contain a bug which misreports the number of
rows updated with this operation.  SpamAssassin can INSERT
new tokens into its Bayes database, in other words, but can never
UPDATE them later.  In short, the Bayes training mechanism is
completely broken, so only new tokens can be added.  Perhaps
worst of all, SpamAssassin does not report the UPDATE failure,
so this bug has gone unnoticed for almost two years.

The SpamAssassin devs have opted to work around the bug with a small
patch to BayesStore/MySQL.pm which restores the proper UPDATE functionality.
The patch is applicable to SpamAssassin 3.3.

I placed a copy of the following at:
http://verchick.com/mecham/public_html/spam/sa.mysql.patch1.txt

--- lib/Mail/SpamAssassin/BayesStore/MySQL.pm	(revision 1138970)
+++ lib/Mail/SpamAssassin/BayesStore/MySQL.pm	(working copy)
@@ -840,14 +840,28 @@
       return 0;
     }
 
+    # With ON DUPLICATE KEY UPDATE, the affected-rows value per row is 1 if
+    # the row is inserted as a new row and 2 if an existing row is updated.
+    #
+    # Due to a MySQL server bug a value of 3 can be seen.
+    # See: http://bugs.mysql.com/bug.php?id=46675
+    #   When executing the INSERT ... ON DUPLICATE KEY UPDATE statement
+    #   and checking the rows return count:
+    #   mysql_client_found_rows = 0: The second INSERT returns a row count
+    #                                of 2 in all MySQL versions.
+    #   mysql_client_found_rows = 1: The second INSERT returns this row count:
+    #     Before MySQL 5.1.20: 2
+    #     MySQL 5.1.20: undef on Mac OS X, 139775481 on Linux (garbage?)
+    #     MySQL 5.1.21 and up: 3
+    #
     my $num_rows = $rc;
 
     $sth->finish();
 
-    if ($num_rows == 1 || $num_rows == 2) {
+    if ($num_rows == 1 || $num_rows == 2 || $num_rows == 3) {
       my $token_count_update = '';
       
-      $token_count_update = "token_count = token_count + 1," if ($num_rows == 1);
+      $token_count_update = "token_count = token_count + 1," if $num_rows == 1;
       $sql = "UPDATE bayes_vars SET
                      $token_count_update
                      newest_token_age = GREATEST(newest_token_age, ?),
@@ -872,7 +886,11 @@
     }
     else {
       # $num_rows was not what we expected
-      dbg("bayes: _put_token: Updated an unexpected number of rows.");
+      my $token_displ = $token;
+      $token_displ =~ s/(.)/sprintf('%02x',ord($1))/egs;
+      dbg("bayes: _put_token: Updated an unexpected number of rows: %s, ".
+          "id: %s, token (hex): %s",
+          $num_rows, $self->{_userid}, $token_displ);
       $self->{_dbh}->rollback();
       return 0;
     }
@@ -987,8 +1005,24 @@
       else {
 	my $num_rows = $rc;
 
-	$need_atime_update_p = 1 if ($num_rows == 1 || $num_rows == 2);
-	$new_tokens++ if ($num_rows == 1);
+        # With ON DUPLICATE KEY UPDATE, the affected-rows value per row is 1 if
+        # the row is inserted as a new row and 2 if an existing row is updated.
+        # But see MySQL bug (as above): http://bugs.mysql.com/bug.php?id=46675
+
+        if ($num_rows == 1) {
+          $new_tokens++;
+          $need_atime_update_p = 1;
+        } elsif ($num_rows == 2 || $num_rows == 3) {
+          $need_atime_update_p = 1;
+        } else {
+          # $num_rows was not what we expected
+          my $token_displ = $token;
+          $token_displ =~ s/(.)/sprintf('%02x',ord($1))/egs;
+          dbg("bayes: _put_tokens: Updated an unexpected number of rows: %s, ".
+              "id: %s, token (hex): %s",
+              $num_rows, $self->{_userid}, $token_displ);
+          $error_p = 1;
+        }
       }
     }
 
@@ -1026,10 +1060,10 @@
       }
     }
     else {
-      # $num_rows was not what we expected
-      dbg("bayes: _put_tokens: Updated an unexpected number of rows.");
-      $self->{_dbh}->rollback();
-      return 0;
+      info("bayes: _put_tokens: no atime updates needed?  Num of tokens: %d",
+           scalar keys %{$tokens});
+#     $self->{_dbh}->rollback();
+#     return 0;
     }
   }
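
The essence of the patch above can be paraphrased in Python (illustrative only; the function name is mine, not SpamAssassin's). For INSERT ... ON DUPLICATE KEY UPDATE, MySQL reports 1 affected row for a fresh insert and 2 for an update of an existing row; servers hit by bug #46675 report 3 for an update, which the unpatched code treated as an error:

```python
def classify_affected_rows(num_rows):
    """Interpret the affected-rows count for one upserted Bayes token."""
    if num_rows == 1:
        return "inserted"      # new token: token_count must be bumped too
    if num_rows in (2, 3):     # 3 is tolerated to work around MySQL bug #46675
        return "updated"
    return "error"             # unexpected count: log and roll back

assert classify_affected_rows(1) == "inserted"
assert classify_affected_rows(3) == "updated"  # pre-patch: treated as error
```

The real code additionally rolls back the transaction and logs the offending token in hex whenever the count is unexpected.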

Notes from the Maia Devel mailing list (Robert LeBlanc):
Build [1556] was motivated by a curious problem (#563) that seems to arise when MySQL is configured to use UTF8 as its character set. The problem is that Unicode characters are several bytes in length rather than just one, so if there's a sequence of characters stored in the database as, for instance, 'abc', that could be seen as three 1-byte characters, or as a single 3-byte Unicode character. In most cases we don't operate on the contents at such a low level so it doesn't much matter to us as long as we use the same encoding throughout. But some parts of SpamAssassin do operate at that level, most notably the Bayes database and its bayes_token table.

SpamAssassin's "tokens" are small chunks of characters sampled from the contents of the email (which may itself be encoded). SpamAssassin makes no effort to decode the mail for this purpose; it just operates on it as if it were a literal string of bytes. As such, bayes_token.token is meant to be character-set and encoding agnostic. This has always been an issue for MySQL users; the PostgreSQL schema has always correctly defined bayes_token.token as a string of bytes. The only reason it hasn't gained more attention over the years is that MySQL has traditionally shipped with a LATIN1 encoding, where bytes and characters are effectively the same. Newer distros are shipping MySQL with UTF8 as its default character set and encoding, however, so new installations are encountering this.
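
The mismatch is easy to see in a few lines of Python (illustrative; the token derivation here is simplified, but SpamAssassin's Bayes tokens are likewise 5 raw bytes taken from a SHA1 digest):

```python
# Illustrative: a Bayes token is 5 raw bytes of a SHA1 digest.  Raw digest
# bytes are binary data, not text, so they are frequently not valid UTF-8.
import hashlib

def token_bytes(word):
    """5 raw bytes derived from a word, in the spirit of SA's Bayes tokens."""
    return hashlib.sha1(word.encode("utf-8")).digest()[:5]

tok = token_bytes("viagra")
assert len(tok) == 5            # always exactly 5 *bytes*

def is_valid_utf8(b):
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Stored in a UTF8 CHAR column, MySQL interprets these bytes as multi-byte
# characters, so what comes back no longer matches what SA computed:
samples = [token_bytes(w) for w in ("alpha", "bravo", "charlie", "delta")]
print(sum(1 for s in samples if not is_valid_utf8(s)), "of", len(samples),
      "sample tokens are not even valid UTF-8")
```

A BINARY column sidesteps all of this by storing and comparing the bytes untouched, which is exactly what the ALTER below does.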

The impact of such a little thing is enormous, however. Because SpamAssassin is expecting to be looking at raw bytes when it performs its token matching in the Bayes database, what it receives from MySQL is in fact garbage, because MySQL is interpreting the token data as multi-byte characters, grabbing three of them at a time to return the single UTF8 character they represent. Consequently token-matching is completely broken, and any results from Bayes tests would end up being spurious noise. The fix is trivial but vitally important:

ALTER TABLE bayes_token MODIFY token BINARY(5) NOT NULL DEFAULT '';
ALTER TABLE bayes_token DROP PRIMARY KEY;
ALTER TABLE bayes_token ADD PRIMARY KEY (id, token);

It has been suggested on SpamAssassin-Dev that it might be a good precaution to treat the awl.email column similarly, since it performs its matching against literal email addresses and might fail if it's trying to match literal bytes against multi-byte representations in the AWL table:

ALTER TABLE awl MODIFY email VARBINARY(255) NOT NULL DEFAULT '';
ALTER TABLE awl DROP PRIMARY KEY;
ALTER TABLE awl ADD PRIMARY KEY (username, email, signedby, ip);

Note that in both cases I've included steps to drop and recreate the primary keys. This forces MySQL to rebuild its indexes now that we've changed the way it interprets those BINARY/VARBINARY columns.

Builds [1557] and [1558] address an even more serious problem (#565) with MySQL and SpamAssassin, and like the previous issue, it was discovered less than an hour after SpamAssassin 3.3.2 was released, and there's no clear plan yet for a 3.3.3 release which would correct it. Essentially the problem is this: SpamAssassin makes use of MySQL's "ON DUPLICATE KEY UPDATE" to update tokens which already exist in its Bayes database, but MySQL 5.1.20 and later contain a bug which misreports the number of rows updated with this operation. SpamAssassin can INSERT new tokens into its Bayes database, in other words, but can never UPDATE them later. In short, the Bayes training mechanism is completely broken, so only new tokens can be added. Perhaps worst of all, SpamAssassin does not report the UPDATE failure, so this bug has gone unnoticed for almost two years.

MySQL has an open ticket (#46675) regarding this issue, but it has been largely ignored since 2009, so it's not clear if/when it will be solved at the source. The SpamAssassin devs have opted to work around the bug with a small patch to BayesStore/MySQL.pm which restores the proper UPDATE functionality. The problem, of course, is that while they committed this fix to their 3.4 branch and may decide to include it in the 3.3 branch as well, it may be months before we see SpamAssassin 3.3.3 now that they've just released 3.3.2. This forces us, in turn, to work around SpamAssassin if we're to provide a working Bayes implementation for users of MySQL.

The broader question is whether this MySQL bug means that a lot of us have defective, learning-impaired Bayes databases, and how to deal with this if that's the case. My own Ubuntu Server (11.04) includes MySQL 5.1.54, for example, and I'm confident that I've been running versions >= 5.1.20 for at least a year, so by all accounts my Bayes database should be suffering for it. I still see a lot of BAYES_99 hits these days, but it's possible those could arise from a lucky combination of one-time tokens arising in a single email, rather than a reflection of a "mature" database with extensive reputation data about each token. To be clear, the data in my Bayes database is all good data, not gibberish, so this doesn't mean I should wipe the database and rebuild from scratch. It simply means that (after applying the patch and restarting maiad) I can expect those tokens to begin gaining the reputation data I've been trying to feed it all this time with auto-learning and process-quarantine.