Migrating our Debian Anti-Spam Anti-Virus Gateway Email Server's Bayes database to MySQL


Notes: Before you begin: in my experience MySQL will use at least 42MB of RAM, so make sure you can spare at least that much free memory. During the migration, sa-learn will use about 20MB of RAM, so if you don't have 64MB of free memory available during the migration (and after), the process may run very slowly. If you don't have 100MB of free memory immediately after a reboot, I'm not sure I would install MySQL.

This migration procedure is extremely CPU intensive. You can expect the system to slow to a crawl while sa-learn is backing up and restoring the Bayes data. This may cause problems on a busy box, so I would take measures to limit mail traffic as much as possible.

These instructions assume you have built a Debian spamfilter box according to the instructions here: http://verchick.com/mecham/public_html/spam/ and that (like me) you have no prior experience with MySQL. The disclaimer for this document is also located there. I assume you are running amavisd-new, but I do not require it; some aspects of this HOWTO may differ considerably if you are not.

If you are building a new system, I strongly recommend you first process at least a couple hundred messages before converting Bayes to MySQL, because some parts of this document will make more sense if you already have some data in your Bayes and auto-whitelist files.
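
The memory guidance in the notes above can be checked from the shell before you start. This is only a sketch: the function names free_mb and enough_memory are my own, and it reads the MemFree line of /proc/meminfo, which does not count memory the kernel could reclaim from buffers and cache, so treat the figure as pessimistic.

```shell
#!/bin/sh
# Sketch only: free_mb and enough_memory are illustrative names,
# not part of any package.

# Print free memory in whole MB, read from a meminfo-style file
# (defaults to /proc/meminfo, where MemFree is reported in kB).
free_mb() {
    awk '/^MemFree:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if at least $1 MB are free.
# Optional $2 is an alternate meminfo-style file (handy for testing).
enough_memory() {
    have_mb=$(free_mb "${2:-/proc/meminfo}")
    [ "${have_mb:-0}" -ge "$1" ]
}
```

For example, `enough_memory 100 || echo "less than 100MB free"` applies the after-reboot threshold mentioned above.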

Items in bold are variables and will be provided by you. If you would like to customize this document, save it to your system, then open it in Notepad, WordPad or your favorite plain text HTML editor and do a search and replace on the bold items. When you open the document in your editor, you will find instructions at the top of the page. Note that the name of the database (sa_bayes) and the user name (sa_user) could also be changed. The MySQL reference manual: http://dev.mysql.com/doc/refman/4.1/en/index.html.

Assuming you are running Debian lenny, install MySQL 5.0:
apt-get install mysql-server-5.0

You should be prompted to enter a password for root (twice). Note that the Debian installation provided the Perl modules SpamAssassin needs.
(See 'Requirements' at http://spamassassin.apache.org/full/3.0.x/dist/sql/README.bayes)
Once MySQL is installed, log in:

mysql -p

Now create the database we will use to store Bayes and auto-whitelist data:
create database sa_bayes;

We will now create a user called 'sa_user', create a password for 'sa_user' and allow 'sa_user' to do what it needs to the 'sa_bayes' database:
GRANT SELECT, INSERT, UPDATE, DELETE ON sa_bayes.* TO 'sa_user'@'localhost' IDENTIFIED BY 'sa_user_password';

Refresh privileges:
FLUSH PRIVILEGES;

Let's take a look at what we have done:
SELECT Host, User FROM mysql.user;

Note that there is one user the system uses: debian-sys-maint. We are done for the moment, so:
quit


Navigate to where our Bayes data is currently stored:
cd /var/lib/amavis/.spamassassin

As the 'amavis' user, we need to create a backup file of our Bayes data that will later be restored to our SQL database. This process may take a while:
su amavis -c 'sa-learn --sync --force-expire'
su amavis -c 'sa-learn --backup > backup.txt'
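
Before going further it may be worth a quick sanity check that backup.txt actually contains data. The sketch below assumes the tab-separated layout that sa-learn --backup writes as I understand it ("v" lines for variables, "t" lines for tokens, "s" lines for seen message IDs); the function name backup_summary is my own.

```shell
#!/bin/sh
# Sketch only: backup_summary is an illustrative name.
# Assumes a tab-separated backup file whose first field is
# "v" (variable), "t" (token) or "s" (seen message ID).
backup_summary() {
    awk -F '\t' '
        $1 == "v" { vars++ }
        $1 == "t" { tokens++ }
        $1 == "s" { seen++ }
        END { printf "vars=%d tokens=%d seen=%d\n", vars, tokens, seen }
    ' "$1"
}
```

Running `backup_summary backup.txt` prints the three counts. The token count should be roughly the number of rows that end up in the bayes_token table after the restore.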

There are five files we need to grab, four of them from the source code:
wget http://spamassassin.apache.org/full/3.0.x/dist/tools/convert_awl_dbm_to_sql
wget http://spamassassin.apache.org/full/3.0.x/dist/sql/bayes_mysql.sql
wget http://spamassassin.apache.org/full/3.0.x/dist/sql/awl_mysql.sql
wget http://spamassassin.apache.org/gtube/gtube.txt
wget http://verchick.com/mecham/public_html/spam/awl.update.sql


These commands create the tables we need, in the sa_bayes database.
mysql -u root -p sa_bayes < bayes_mysql.sql

mysql -u root -p sa_bayes < awl_mysql.sql


This one updates the awl table for use with SpamAssassin 3.3.0 or later.
mysql -u root -p sa_bayes < awl.update.sql

Now we will configure SpamAssassin to use MySQL. Once we do this, don't restart amavisd-new until we are completely finished. If you have to restart amavisd-new before the process is complete, set local.cf back the way it was. Note that Debian places local.cf in the /etc/spamassassin directory while many other OSs place it in /etc/mail/spamassassin. Also note that we will comment out bayes_path and auto_whitelist_path:
cp /etc/spamassassin/local.cf /etc/spamassassin/local.cf-pre-sql
vi /etc/spamassassin/local.cf


Insert text below, and comment out items as noted:
bayes_store_module              Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn                   DBI:mysql:sa_bayes:localhost
bayes_sql_username              sa_user
bayes_sql_password              sa_user_password

auto_whitelist_factory          Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn                    DBI:mysql:sa_bayes:localhost
user_awl_sql_username           sa_user
user_awl_sql_password           sa_user_password

#this next part will store all data in just one user's table, no matter who runs sa-learn
bayes_sql_override_username amavis

#auto_whitelist_path /var/lib/amavis/.spamassassin/auto-whitelist
#bayes_path  /var/lib/amavis/.spamassassin/bayes
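
Since a typo in local.cf is easy to make, here is a rough check that the settings above are present and that the two old path settings are no longer active. It is only a sketch: check_local_cf is my own name, and it looks for the setting names, not for correct values.

```shell
#!/bin/sh
# Sketch only: check_local_cf is an illustrative name.
# Succeeds if the given local.cf has the SQL settings active and the
# old file-based paths commented out (or absent).
check_local_cf() {
    cf=${1:-/etc/spamassassin/local.cf}
    for key in bayes_store_module bayes_sql_dsn bayes_sql_username \
               bayes_sql_password auto_whitelist_factory user_awl_dsn \
               user_awl_sql_username user_awl_sql_password; do
        # An active setting starts the line (optionally indented)
        grep -Eq "^[[:space:]]*${key}[[:space:]]" "$cf" || {
            echo "missing: ${key}"
            return 1
        }
    done
    if grep -Eq '^[[:space:]]*(bayes_path|auto_whitelist_path)[[:space:]]' "$cf"; then
        echo "still active: bayes_path or auto_whitelist_path"
        return 1
    fi
    echo "local.cf looks ready for SQL"
}
```

Run it as `check_local_cf /etc/spamassassin/local.cf`. Running `spamassassin --lint` afterwards will also catch outright syntax errors.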

We must initialize the database by learning a message:
su amavis -c 'sa-learn --spam gtube.txt'

We are now going to load in the data from our old auto-whitelist file. Run these commands to show what arguments are required to run the special program convert_awl_dbm_to_sql. You should still be in the /var/lib/amavis/.spamassassin directory where we downloaded this script:
chmod +x convert_awl_dbm_to_sql
./convert_awl_dbm_to_sql


Now that you have read and understood what is needed, you may run the program. Edit if necessary; make sure you select, copy and paste the entire text:
./convert_awl_dbm_to_sql --username amavis --dsn DBI:mysql:sa_bayes:localhost --dbautowhitelist /var/lib/amavis/.spamassassin/auto-whitelist --sqlusername sa_user --sqlpassword sa_user_password --ok

Now log in to MySQL and issue these commands so we can see if it worked:
mysql -u root -p

USE sa_bayes;
SELECT email,count,totscore FROM awl;


Assuming you have been using auto-whitelist for at least a little while, you should see some data. Make a note of how many rows are in the set. Now we can quit:
quit

OK, now let's load in the Bayes data. This may take 5 minutes or more than an hour, and don't be surprised if your CPU load increases substantially during the process:
su amavis -c 'sa-learn --restore backup.txt'

On a healthy system that has been in use for a while there will be between 100,000 and 150,000 rows of data in the Bayes data tables. If you would like to view our progress, open another PuTTY session, log into MySQL, and monitor the progress:
mysql -u root -p

USE sa_bayes;
SELECT COUNT(*) spam_count FROM bayes_token;


You can use the [up-arrow] to recall commands. While you wait, here are some commands that provide interesting data:
SHOW DATABASES;
SHOW TABLES;
DESCRIBE bayes_token;
DESCRIBE bayes_vars;
SELECT username FROM bayes_vars;

(should show 'amavis', and possibly 'root')

Once the restore is complete (our first PuTTY session has returned to the shell prompt), we will change all tables from MyISAM to InnoDB. Note that the 5.x version of MySQL we are using does not require any configuration changes to use InnoDB; 3.x versions do however. Log back into MySQL if necessary (and use the sa_bayes database), then issue these commands to convert our data:
mysql -u root -p

USE sa_bayes;
ALTER TABLE awl ENGINE=InnoDB;
ALTER TABLE bayes_expire ENGINE=InnoDB;
ALTER TABLE bayes_global_vars ENGINE=InnoDB;
ALTER TABLE bayes_seen ENGINE=InnoDB;
ALTER TABLE bayes_token ENGINE=InnoDB;
ALTER TABLE bayes_vars ENGINE=InnoDB;
ANALYZE TABLE awl;
ANALYZE TABLE bayes_expire;
ANALYZE TABLE bayes_global_vars;
ANALYZE TABLE bayes_seen;
ANALYZE TABLE bayes_token;
ANALYZE TABLE bayes_vars;


If everything went well, we can quit:
quit
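
If you would rather not type the twelve conversion statements by hand, they could be generated from the list of table names. This is a sketch under my own naming (innodb_sql and TABLES are not part of any package); it only prints SQL and does not run it. It uses the ENGINE= keyword; older MySQL documentation also shows TYPE= as a synonym.

```shell
#!/bin/sh
# Sketch only: innodb_sql and TABLES are illustrative names.
# The function prints SQL to stdout; it does not touch the database.
TABLES="awl bayes_expire bayes_global_vars bayes_seen bayes_token bayes_vars"

innodb_sql() {
    # First convert every table, then refresh the index statistics
    for t in $TABLES; do
        printf 'ALTER TABLE %s ENGINE=InnoDB;\n' "$t"
    done
    for t in $TABLES; do
        printf 'ANALYZE TABLE %s;\n' "$t"
    done
}
```

To apply the output, pipe it to the client: `innodb_sql | mysql -u root -p sa_bayes`. Afterwards, `SHOW TABLE STATUS;` inside MySQL lists the engine each table uses.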


The data conversion is complete. Now we can test:
wget http://spamassassin.apache.org/gtube/gtube.txt
spamassassin -D <gtube.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'


You should see:
debug: bayes: Using username: amavis
debug: bayes: Database connection established
debug: bayes: found bayes db version 3

and if use_auto_whitelist is enabled (rather, is not disabled):
dbg: auto-whitelist: sql-based connected to DBI:mysql:sa_bayes:localhost
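
Rather than reading the debug output by eye, the two key Bayes lines listed above can be checked with a small filter. This is a sketch only: check_bayes_debug is my own name, and the match is case-insensitive because the exact capitalization and the "debug:"/"dbg:" prefix appear to vary between SpamAssassin versions.

```shell
#!/bin/sh
# Sketch only: check_bayes_debug is an illustrative name.
# Reads SpamAssassin debug output on stdin and succeeds if it shows
# that the Bayes SQL connection was made and a Bayes database was found.
check_bayes_debug() {
    out=$(cat)
    for want in 'database connection established' 'found bayes db version'; do
        printf '%s\n' "$out" | grep -iq "$want" || {
            echo "not seen: $want"
            return 1
        }
    done
    echo "bayes SQL connection looks OK"
}
```

Usage would be `spamassassin -D <gtube.txt 2>&1 | check_bayes_debug`.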

If everything looks OK, reload amavisd-new:
amavisd-new reload

Our password file changed, so update the copy Postfix uses:
cp /etc/passwd /var/spool/postfix/etc/passwd
postfix reload



Now we will create a script to keep our auto-whitelist data trim. If you are currently using my /etc/cron.weekly/trim_whitelist_weekly script, it will need to be replaced. First let's create a new script:
vi /usr/sbin/trim-awl

and insert (always make sure you hit [Enter] at the end of the last line):
#!/bin/sh
/usr/bin/mysql -usa_user -psa_user_password < /etc/trim-awl.sql
exit 0


Save and exit the file, then make the script executable:
chmod 0750 /usr/sbin/trim-awl

Now create a file that the script will use for input:
vi /etc/trim-awl.sql

and insert:
USE sa_bayes;
DELETE FROM awl WHERE count = 1;


Save and exit, then create a cron job to run the script:
cd /etc/cron.weekly
rm trim_whitelist_weekly
vi trim-sql-awl-weekly


and insert:
#!/bin/sh
#
#  Weekly maintenance of auto-whitelist for Debian with amavisd-new and 
#  SpamAssassin using MySQL
test -e /usr/sbin/trim-awl && test -e /usr/sbin/amavisd-new && {
        /usr/sbin/trim-awl >/dev/null
}
exit 0

Save and exit, then make it executable:
chmod +x trim-sql-awl-weekly

Then, to test it, run it:
./trim-sql-awl-weekly

Assuming you have been using auto_whitelist for at least a few days, log back in to MySQL, note how many rows of data are in the awl table, and compare that to the number I asked you to make a note of earlier. There should be fewer if the script is working (unless you previously did not have any entries with a 'count' of '1'):
mysql -u root -p

USE sa_bayes;
SELECT email,count,totscore FROM awl;

quit

Do you have any free memory?
free

Is this maybe an opportunity to reboot?


Optional: If you would like finer control of auto-whitelist maintenance:
Log in to mysql, add a new field "lastupdate" to the awl table, and then populate it (initialize it) with today's date:

mysql -u root -p

USE sa_bayes;
ALTER TABLE awl ADD lastupdate timestamp default CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
UPDATE awl SET lastupdate = NOW( ) WHERE lastupdate < 1;
quit


Now, edit the trim-awl.sql file:
vi /etc/trim-awl.sql

and replace the simple-minded:
DELETE FROM awl WHERE count = 1;
with more sophisticated statements:
DELETE FROM awl WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 6 MONTH);
DELETE FROM awl WHERE count = 1 AND lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 15 DAY);


Feel free to modify the time intervals as you see fit. Each time a record is created or modified, the lastupdate field will reflect the current date and time. These statements not only get rid of "single hit" records 15 days or older, but also records with 6 months or more of inactivity.
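
If you expect to experiment with the intervals, the statements could be generated with the two values as parameters. This is only a sketch: awl_trim_sql is my own name, it assumes the lastupdate column from the optional step above already exists, and it prints SQL without running it.

```shell
#!/bin/sh
# Sketch only: awl_trim_sql is an illustrative name.
# Prints the trim statements to stdout; it does not touch the database.
#   $1 = months of inactivity before any record is removed (default 6)
#   $2 = days before a single-hit record is removed (default 15)
awl_trim_sql() {
    months=${1:-6}
    days=${2:-15}
    cat <<EOF
USE sa_bayes;
DELETE FROM awl WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL $months MONTH);
DELETE FROM awl WHERE count = 1 AND lastupdate <= DATE_SUB(SYSDATE(), INTERVAL $days DAY);
EOF
}
```

For example, `awl_trim_sql 3 30` prints a version that drops records after 3 months of inactivity and single-hit records after 30 days. Redirecting the output to /etc/trim-awl.sql replaces the whole file, so any other statements you keep there would need to be added again.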

Optional: The bayes_seen table grows forever and should be cleaned up once in a while. Log in to mysql, add a new field "lastupdate" to the bayes_seen table, and delete all entries so we start clean:
mysql -u root -p

USE sa_bayes;
DELETE FROM bayes_seen;
ALTER TABLE bayes_seen ADD lastupdate timestamp default CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
OPTIMIZE TABLE bayes_seen;
quit


Now, edit the trim-awl.sql file:
vi /etc/trim-awl.sql

and insert this at the bottom:
DELETE FROM bayes_seen WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 20 DAY);


If you would like to make a backup of the Bayes data:
cd /var/lib/amavis/.spamassassin
su amavis -c 'sa-learn --sync'
su amavis -c 'sa-learn --backup' > bayes_backup.txt
chown amavis:amavis bayes_backup.txt


If you would like to see how many messages have been learned and how many tokens are in the database:
su amavis -c 'sa-learn --dump magic'



Suggested reading:
http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/
http://www.mysql.com/news-and-events/newsletter/2003-11/a0000000269.html
Out of the box, MySQL is poorly tuned for use with InnoDB tables. I suggest at least configuring innodb_buffer_pool_size and innodb_log_file_size (in /etc/mysql/my.cnf). For example, simply increasing innodb_buffer_pool_size from the default of 8M to a conservative 128M (innodb_buffer_pool_size = 128M) can make a huge difference in performance. This assumes you have at least 768MB of RAM. I do not set this to more than 25% of physical RAM.
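
The 25% ceiling mentioned above is easy to work out for a given box. The sketch below (pool_cap_mb is my own name) reads the MemTotal line of /proc/meminfo and prints a quarter of it in whole MB; it only reports a number and changes nothing in my.cnf.

```shell
#!/bin/sh
# Sketch only: pool_cap_mb is an illustrative name.
# Print 25% of physical RAM in whole MB, read from a meminfo-style file
# (defaults to /proc/meminfo, where MemTotal is reported in kB).
pool_cap_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 4) }' "${1:-/proc/meminfo}"
}
```

On a machine with exactly 768MB of RAM this works out to 192, so the 128M example sits comfortably under the ceiling.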

References:
http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeBayes?highlight=%28mysql%29
http://people.apache.org/~parker/presentations/
http://www.paulstimesink.com/index.php?op=ViewArticle&articleId=163&blogId=2
http://www.paulstimesink.com/index.php?op=ViewArticle&articleId=167&blogId=2
http://dev.mysql.com/doc/mysql/en/index.html
http://infocenter.guardiandigital.com/archive/amavis/2004/Dec/0319.html
http://spamassassin.apache.org/full/3.0.x/dist/sql/README.bayes
http://spamassassin.apache.org/full/3.0.x/dist/sql/README
https://www.maiamailguard.com/maia/wiki/SpamAssassin3SQLBayes

mr88talent at yahoo dot com 2/7/2010