Fighting image spam on our Debian Sarge spamfilter with FuzzyOcr 3.5.1 and ImageInfo plugins


Absolutely no warranty, use entirely at your own risk. See the disclaimer at http://verchick.com/mecham/public_html/spam/

Image spam is rather difficult to deal with, and FuzzyOcr has helped me do so. Running an OCR scanner on images contained in messages will slow down the spam scanning process (considerably in some cases). For me this is not a problem, but be aware of it. I document my installation of FuzzyOcr 3.5.1 on Debian sarge here. I will assume you are able to compile programs from source. You need SpamAssassin version 3.1.4 or greater to use this plugin. If you are trying to keep your system stable (and you currently use the Debian spamassassin package), consider installing SpamAssassin from sarge-backports. If you add the sarge-backports source to /etc/apt/sources.list you will likely also need to increase the apt cache limit. To do so, include this setting in a file called /etc/apt/apt.conf:
APT::Cache-Limit "25165824";
Then you could simply run:
apt-get -t sarge-backports install spamassassin
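The apt.conf change can be made repeatable, so that running it twice does not duplicate the setting. This is only a minimal sketch: the function name ensure_cache_limit is hypothetical, and the file path is passed as an argument so you can try it on a scratch copy first.

```shell
# Hypothetical helper: append the APT::Cache-Limit setting to an apt.conf
# file unless a Cache-Limit line is already present.
ensure_cache_limit() {
    conf="$1"
    line='APT::Cache-Limit "25165824";'
    # Create the file if it does not exist yet.
    [ -e "$conf" ] || : > "$conf"
    if ! grep -q 'APT::Cache-Limit' "$conf"; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Usage (as root):
#   ensure_cache_limit /etc/apt/apt.conf
```

Note that an existing Cache-Limit line with a different value is left alone; check the file by hand in that case.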
As an alternative you can possibly install SpamAssassin from 'testing' without upgrading to the testing versions of libc6, Perl or the kernel by using the form 'apt-get install spamassassin/testing'. Using this form should install dependencies from 'stable'. Simulate it first with 'apt-get -s install spamassassin/testing'. To prevent accidental upgrades when running 'apt-get upgrade' you might also consider putting the package on hold if you use this method:
echo "spamassassin hold" | dpkg --set-selections
If you are running an older version of FuzzyOcr you need to uninstall it. You can probably simply move all the related files to a safe location and remove the FuzzyOcr loadplugin directive in v310.pre or init.pre (if you created one). Then run 'spamassassin --lint' to check for errors, reload spamd or amavisd (if that's what you are running), and make sure everything is still running OK after your changes.
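Moving the old files aside might look something like the sketch below. The function name stash_old_fuzzyocr and both directory arguments are hypothetical, and it assumes every related file or directory has a name starting with FuzzyOcr. It does not touch the loadplugin directive, which you still have to remove by hand.

```shell
# Hypothetical helper: move FuzzyOcr* files and directories out of the
# SpamAssassin configuration directory into a backup directory.
stash_old_fuzzyocr() {
    confdir="$1"
    backup="$2"
    mkdir -p "$backup" || return 1
    for f in "$confdir"/FuzzyOcr*; do
        # If nothing matches, the glob is left unexpanded; skip it.
        [ -e "$f" ] || continue
        mv "$f" "$backup"/
    done
}

# Usage (as root), assuming the old files live in /etc/mail/spamassassin:
#   stash_old_fuzzyocr /etc/mail/spamassassin /root/fuzzyocr-old
#   spamassassin --lint
```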


To prevent the gifsicle program from installing undesirable dependencies, we install dependencies first. This gives us more control of the situation. Note: "x11-common: Conflicts: xfree86-common". Sarge is using xfree86-common and etch is using x11-common (but I suppose you might be using xfree86-common on an etch system if you are running a mixed system). Use whichever one is compatible with the way your system is currently configured:
apt-get update

Sarge:
apt-get install lsb-base libice6 libx11-6 xlibs-data libsm6 xfree86-common

Etch:
apt-get install lsb-base libice6 libx11-6 xlibs-data libsm6 x11-common

Lenny:
apt-get install lsb-base libice6 libx11-6 libsm6 x11-common

If you are running sarge, you will need to install a couple of programs from sarge-backports to prevent problems. These programs are not available in the sarge version of Debian. Follow the instructions above to add the sarge-backports source to sources.list and modify apt.conf as noted, then run the following commands. If you are running etch you can of course leave out '-t sarge-backports'. If you are running sarge, however, I do not recommend installing these packages from testing or unstable:
Sarge:
apt-get -t sarge-backports install gifsicle ocrad

Etch or Lenny:
apt-get install gifsicle ocrad

You should not have both giflib-bin and libungif-bin installed. Simulate removing giflib-bin:
apt-get -s remove giflib-bin

If it's not installed, then you can move on. If it's the only thing that will be removed, then remove it. Note: an etch system may try to remove libungif-bin instead of giflib-bin. There is no need to remove libungif-bin:
apt-get remove giflib-bin
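To see at a glance what a simulated removal would take with it, the simulation output can be filtered. This sketch assumes apt-get marks simulated removals with lines beginning with "Remv"; the function name would_remove is hypothetical.

```shell
# Hypothetical helper: read simulated apt-get output on stdin and print the
# names of the packages that would be removed (lines starting with "Remv").
would_remove() {
    awk '$1 == "Remv" { print $2 }'
}

# Usage:
#   apt-get -s remove giflib-bin | would_remove
# Go ahead with the real removal only if the single line printed is
# giflib-bin. No output means nothing would be removed.
```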

Continue to install other required programs. If you are running a mix of stable and testing and these commands want to install a whole bunch of dependencies, you may want to attempt to install these from stable: apt-get -t stable install ...
apt-get install netpbm bzip2 libmldbm-perl libstring-approx-perl libmldbm-sync-perl

apt-get install liblog-agent-perl libdbi-perl libdbd-mysql-perl libtie-cache-perl

Download, extract, patch, compile and install libungif (even if you have libungif-bin installed):
cd /usr/local/src
wget http://internap.dl.sourceforge.net/sourceforge/giflib/libungif-4.1.4.tar.gz
tar xzvf libungif-4.1.4.tar.gz
cd libungif-4.1.4/util
wget http://users.own-hero.net/~decoder/fuzzyocr/giftext-segfault.patch
patch giftext.c < giftext-segfault.patch
cd ..
./configure --prefix=/usr && make && make install

Download, extract, compile and install gocr:
cd /usr/local/src
wget http://www-e.uni-magdeburg.de/jschulen/ocr/gocr-0.43.tar.gz
tar xzvf gocr-0.43.tar.gz
cd gocr-0.43
./configure --with-netpbm=/usr/lib --prefix=/usr && make && make install

At this point we should have all of these programs installed in /usr/bin:
which gifsicle
which giffix
which giftext
which gifinter
which giftopnm
which jpegtopnm
which pngtopnm
which bmptopnm
which tifftopnm
which ppmhist
which pamfile
which ocrad
which gocr
which pnmnorm
which pnminvert
which ppmtopgm
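Rather than reading sixteen lines of 'which' output, you could print only the programs that are missing. A minimal sketch (the function name missing_programs is hypothetical); note that it checks the whole PATH, not just /usr/bin.

```shell
# Hypothetical helper: print every named program that is not on the PATH.
missing_programs() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || printf '%s\n' "$prog"
    done
}

# Usage, with the program list from this section:
#   missing_programs gifsicle giffix giftext gifinter giftopnm jpegtopnm \
#       pngtopnm bmptopnm tifftopnm ppmhist pamfile ocrad gocr pnmnorm \
#       pnminvert ppmtopgm
```

No output means everything was found.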


Visit http://fuzzyocr.own-hero.net/wiki/
If you are using SpamAssassin 3.1.x, install FuzzyOcr 3.5.1 - this document is based on that version. If you are using SpamAssassin 3.2.x, do not use these commands, jump ahead to the next set of commands:

cd /usr/local/src
wget http://users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-3.5.1-devel.tar.gz
tar xzvf fuzzyocr-3.5.1-devel.tar.gz
cd FuzzyOcr-3.5.1

If you are using SpamAssassin 3.2.x, then install FuzzyOcr from SVN:
apt-get install subversion

cd /usr/local/src
test -e FuzzyOcr-3.5.1 && mv FuzzyOcr-3.5.1 FuzzyOcr-3.5.1-old
test -e devel && mv devel devel-old
svn -r 131 co https://svn.own-hero.net/fuzzyocr/trunk/devel

mv devel FuzzyOcr-3.5.1
cd FuzzyOcr-3.5.1
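The choice between the two download methods depends only on the SpamAssassin version, which can be written down as a small lookup. This is just a sketch of the rule as stated in this document; the function name fuzzyocr_source_for and the words it prints are hypothetical.

```shell
# Hypothetical helper: map a SpamAssassin version string to the FuzzyOcr
# source this document uses for it.
fuzzyocr_source_for() {
    case "$1" in
        3.1.[0-3]) echo "too-old" ;;   # below the 3.1.4 minimum
        3.1.*)     echo "tarball" ;;   # fuzzyocr-3.5.1-devel.tar.gz
        3.2.*)     echo "svn" ;;       # trunk/devel at revision 131
        *)         echo "unknown" ;;   # not covered by this document
    esac
}

# Usage, assuming 'spamassassin -V' prints a first line of the form
# "SpamAssassin version X.Y.Z":
#   fuzzyocr_source_for "$(spamassassin -V | awk '/^SpamAssassin version/ { print $3 }')"
```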


If you are using netpbm < 10.34 (Debian uses 10.0-10.1) you need to apply these patches. They disable some features only available in newer versions:
wget http://verchick.com/mecham/public_html/spam/FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch
wget http://verchick.com/mecham/public_html/spam/FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch2
wget http://verchick.com/mecham/public_html/spam/FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch3
patch -p0 < FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch
patch -p0 < FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch2
patch -p0 < FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch3
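The "netpbm < 10.34" test can be done as a numeric comparison of the major and minor version numbers. A minimal sketch: the function name netpbm_needs_patches is hypothetical, and it expects a plain upstream version such as 10.0, without a Debian epoch or revision suffix (strip those yourself from the Version field that 'dpkg -s netpbm' reports).

```shell
# Hypothetical helper: succeed (exit status 0) if the given netpbm version
# is older than 10.34, meaning the patches above are needed.
netpbm_needs_patches() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second component only
    [ "$major" -lt 10 ] || { [ "$major" -eq 10 ] && [ "$minor" -lt 34 ]; }
}

# Usage:
#   if netpbm_needs_patches 10.0; then echo "apply the patches"; fi
```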

Our Debian version of Netpbm (10.0) may not contain both of these:
which pamtopnm
which pamditherbw

Those programs are used together; therefore we need to disable any scansets that use them and we will also remove them from the preprocessors file. If you have both of these programs, you do not need to apply these patches. If you are missing either one of them, apply these patches:
wget http://verchick.com/mecham/public_html/spam/gary.3.5.0-rc1.old.netpbm.patch1
wget http://verchick.com/mecham/public_html/spam/gary.3.5.0-rc1.old.netpbm.patch2
wget http://verchick.com/mecham/public_html/spam/gary.3.5.0-rc1.old.netpbm.patch3
patch -p0 < gary.3.5.0-rc1.old.netpbm.patch1
patch -p0 < gary.3.5.0-rc1.old.netpbm.patch2
patch -p0 < gary.3.5.0-rc1.old.netpbm.patch3

Now that we are all patched up, we can place the files:
cp -r FuzzyOcr /etc/mail/spamassassin
cp FuzzyOcr.cf /etc/mail/spamassassin
cp FuzzyOcr.pm /etc/mail/spamassassin
cp FuzzyOcr.preps /etc/mail/spamassassin
cp FuzzyOcr.scansets /etc/mail/spamassassin
cp FuzzyOcr.words /etc/mail/spamassassin
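Before editing the configuration it may be worth confirming that all six items arrived. A small sketch (the function name missing_fuzzyocr_files is hypothetical) that prints whichever of them is absent from a directory:

```shell
# Hypothetical helper: print the FuzzyOcr items copied above that are
# missing from the given directory.
missing_fuzzyocr_files() {
    dir="$1"
    [ -d "$dir/FuzzyOcr" ] || echo "FuzzyOcr/"
    for ext in cf pm preps scansets words; do
        [ -f "$dir/FuzzyOcr.$ext" ] || echo "FuzzyOcr.$ext"
    done
}

# Usage:
#   missing_fuzzyocr_files /etc/mail/spamassassin
```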

Configure FuzzyOcr.cf:
vi /etc/mail/spamassassin/FuzzyOcr.cf

Set log level to 2 (only while we test):
#focr_verbose 3
focr_verbose 2

Enable logging by uncommenting this:
#focr_logfile /tmp/FuzzyOcr.log

Uncomment the timeout setting:
#focr_timeout 15

Some people think the default of 0.25 here is too fuzzy, so uncomment (and maybe set to 0.21 or 0.22):
#focr_threshold 0.20

Set focr_base_score to 3 (this is my personal choice):
#focr_base_score 5
focr_base_score 3

I change focr_add_score from the default of 1, to 0.5:
#focr_add_score 0.375
focr_add_score 0.5


I lower focr_corrupt_score:
#focr_corrupt_score 2.5
focr_corrupt_score 1.5


I lower focr_corrupt_unfixable_score:
#focr_corrupt_unfixable_score 5
focr_corrupt_unfixable_score 2.5
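If you would rather script these edits than make them in vi, something like the following sketch could work. The function name set_focr_option is hypothetical. It follows the pattern used above: the commented-out default stays in place for reference and the active setting is written as its own line. It assumes the value contains no '|' or '&' characters, which holds for the numbers and paths used here.

```shell
# Hypothetical helper: set "key value" in a FuzzyOcr.cf-style file. An
# existing uncommented line for the key is replaced; otherwise the setting
# is appended. Commented-out defaults are left untouched.
set_focr_option() {
    file="$1" key="$2" value="$3"
    if grep -q "^[[:space:]]*$key[[:space:]]" "$file"; then
        sed "s|^[[:space:]]*$key[[:space:]].*|$key $value|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s %s\n' "$key" "$value" >> "$file"
    fi
}

# Usage (as root, on a backed-up file):
#   cf=/etc/mail/spamassassin/FuzzyOcr.cf
#   set_focr_option "$cf" focr_verbose 2
#   set_focr_option "$cf" focr_base_score 3
```

Run 'spamassassin --lint' afterwards, just as you would after editing by hand.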


Save and exit the file, then we test. Start by linting spamassassin:
spamassassin --lint

If you get this error:
Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/share/perl/5.8/Exporter.pm line 65.
 at /usr/lib/perl/5.8/POSIX.pm line 19
it appears to be related somehow to Net::Ident. If you run spamd with the --auth-ident option then you need this module and will have to deal with the harmless error message (actually it may not be harmless if you depend on a clean --lint). If you don't need Net::Ident (you don't have any programs that use it), then I suggest you remove it:
apt-get remove libnet-ident-perl

Once you have resolved any (serious) lint errors, we do some more testing.
This assumes you are still in the /usr/local/src/FuzzyOcr-3.5.1 directory:

cd samples
spamassassin -tD < ocr-animated.eml


I got:
 5.0 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "price" in 1 lines
                            "company" in 1 lines
                            "alert" in 1 lines
                            "news" in 1 lines
                            (6 word occurrences found)
If you did not get something similar, check the log for the last error message (if any). On a low powered machine, for example, you may have to increase focr_timeout in /etc/mail/spamassassin/FuzzyOcr.cf. To view the log:

cat /tmp/FuzzyOcr.log

Continue on to the next test:
spamassassin -tD < ocr-gif.eml

I got:
 1.5 FUZZY_OCR_WRONG_CTYPE  BODY: Mail contains an image with wrong
                            content-type set
                            Image has format "GIF" but content-type is
                            "image/jpeg"
 2.5 FUZZY_OCR_CORRUPT_IMG  BODY: Mail contains a corrupted image
                            Corrupt image: GIF-LIB error: Image is
                            defective, decoding aborted.
 8.0 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "target" in 1 lines
                            "service" in 1 lines
                            "stock" in 2 lines
                            "price" in 2 lines
                            "company" in 1 lines
                            "recommendation" in 1 lines
                            (12 word occurrences found)
Continue on to the next test:
spamassassin -tD < ocr-jpg.eml

I got:
 5.0 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "levitra" in 1 lines
                            "cialis" in 1 lines
                            "viagra" in 2 lines
                            (6 word occurrences found)
Continue on to the next test:
spamassassin -tD < ocr-obfuscated.eml

I got:
 3.0 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "profit" in 1 lines
                            "profit" in 1 lines
                            (2 word occurrences found)
Continue on to the next test:
spamassassin -tD < ocr-png.eml

I got:
  15 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "buy" in 1 lines
                            "target" in 2 lines
                            "service" in 1 lines
                            "stock" in 1 lines
                            "investor" in 1 lines
                            "price" in 3 lines
                            "company" in 2 lines
                            "trade" in 1 lines
                            "software" in 1 lines
                            "recommendation" in 1 lines
                            "news" in 3 lines
                            (25.5 word occurrences found)
Continue on to the next test:
spamassassin -tD < ocr-wrongext.eml

I got:
 1.5 FUZZY_OCR_WRONG_CTYPE  BODY: Mail contains an image with wrong
                            content-type set
                            Image has format "GIF" but content-type is
                            "image/jpeg"
 1.5 FUZZY_OCR_WRONG_EXTENSION BODY: Mail contains an image with wrong
                            file extension
                            Image has format "GIF" but file extension is
                            "jpeg"
 2.5 FUZZY_OCR_CORRUPT_IMG  BODY: Mail contains a corrupted image
                            Corrupt image: GIF-LIB error: Image is
                            defective, decoding aborted.
 8.0 FUZZY_OCR_KNOWN_HASH   BODY: Mail contains an image with known hash
                            Words found:
                            "target" in 1 lines
                            "service" in 1 lines
                            "stock" in 2 lines
                            "price" in 2 lines
                            "company" in 1 lines
                            "recommendation" in 1 lines
                            (12 word occurrences found)
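The six sample tests can be run in one loop, keeping only the score and rule name of each FuzzyOcr hit. This is a sketch that assumes the report lines have the layout shown above (score, then rule name); the function name fuzzy_hits is hypothetical.

```shell
# Hypothetical helper: read a "spamassassin -t" report on stdin and print
# "score rule" for each FUZZY_OCR* rule that fired.
fuzzy_hits() {
    awk '$2 ~ /^FUZZY_OCR/ { print $1, $2 }'
}

# Usage, from the samples directory:
#   for f in ocr-*.eml; do
#       echo "== $f"
#       spamassassin -t < "$f" 2>/dev/null | fuzzy_hits
#   done
```

A sample that prints nothing under its name is the one to investigate in the log.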
Reload amavisd-new (or spamd, for those that use spamd) and send a test message through. You will have to give ownership of the log file to the amavis user (or whatever user is sending the message through spamd or spamassassin):
chown amavis:amavis /tmp/FuzzyOcr.log

You can grab host.gif from me and attach it to a message and send it through the spamfilter. Tail the log file as you send it through:
tail -f /tmp/FuzzyOcr.log

I got:
2006-12-25 19:50:47 [15341] Processing Message with ID 
 "<1499292447.20061225195047@example.net>" (Reporter 
  -> garyv@example.com)
2006-12-25 19:50:47 [15341] GIF: [192x361] host.gif (3696)
2006-12-25 19:50:47 [15341] Found: 1 images
2006-12-25 19:50:47 [15341] Found GIF header name="host.gif"
2006-12-25 19:50:47 [15341] Image is single non-interlaced...
2006-12-25 19:50:47 [15341] Image hashing disabled in configuration, skipping...
2006-12-25 19:50:47 [15341] Scanset Order: ocrad(0) ocrad-invert(0) gocr(0) gocr-180(0)
2006-12-25 19:50:47 [15341] Scanset "ocrad" found word "drugs" with fuzz of 0.0000
                      line: "ed drugs  "
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "cialis" with fuzz of 0.0000
                      line: "viagrd cialis letra       "
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "viagra" with fuzz of 0.1667
                      line: "viagrd cialis letra       "
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "price" with fuzz of 0.0000
                      line: "iowest onlie price garanteedi"
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "profit" with fuzz of 0.1667
                      line: "w guarantee oo topqalityofthe prodit we oi"
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "prescription" with fuzz of 0.0000
                      line: "uick here wo prescription re uiredi  "
2006-12-25 19:50:48 [15341] Scanset "ocrad" generates enough hits (6),
  skipping further scansets...
2006-12-25 19:50:48 [15341] Message is spam, score = 6.500
2006-12-25 19:50:48 [15341] Words found:
                      "drugs" in 1 lines
                      "cialis" in 1 lines
                      "viagra" in 1 lines
                      "price" in 1 lines
                      "profit" in 1 lines
                      "prescription" in 1 lines
                      (9 word occurrences found)
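When tuning focr_threshold it can help to list just the matched words with their fuzz values. A minimal sketch that assumes the log line layout shown above; the function name focr_matches is hypothetical.

```shell
# Hypothetical helper: read a FuzzyOcr log on stdin and print one
# "scanset word fuzz" line per match, with the quotes stripped.
focr_matches() {
    awk '$4 == "Scanset" && $6 == "found" && $7 == "word" {
        gsub(/"/, "", $5); gsub(/"/, "", $8)
        print $5, $8, $NF
    }'
}

# Usage:
#   focr_matches < /tmp/FuzzyOcr.log
```

Words that only match at fuzz values near your threshold are the ones most likely to turn up as false positives.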
Edit FuzzyOcr.cf, turn off verbose logging and set focr_autodisable_score back to a suitable level:
vi /etc/mail/spamassassin/FuzzyOcr.cf

focr_verbose 0

I set the focr_autodisable_score to the same value as my $sa_kill_level_deflt in amavisd.conf:
focr_autodisable_score 8

Once again reload amavisd-new (or spamd):
amavisd-new reload

And keep an eye on the mail.log for a while:
tail -f /var/log/mail.log

In FuzzyOcr.cf I suggest you do not configure an image hash database. If you decide to, remember that the database must be writable by the user running SA. For example, if you are running amavisd-new, the user (on a Debian system) would be 'amavis'. The easiest way to allow the amavis user to write to database files would be to place the files in the amavis home directory or subdirectory (/var/lib/amavis or /var/lib/amavis/db for example). This also applies to the log file. Do not leave the log file in the /tmp directory if you plan on doing any logging at all or you risk filling up the /tmp directory. Move it into /var/log or the home directory of the user running SA and keep an eye on it. If you want to rotate the log, study 'man logrotate' and any examples you might find in /etc/logrotate.d/.
focr_verbose 0
focr_logfile /var/log/FuzzyOcr.log

touch /var/log/FuzzyOcr.log
chown amavis:amavis /var/log/FuzzyOcr.log

This seems to work as the contents of /etc/logrotate.d/FuzzyOcr:
/var/log/FuzzyOcr.log {
     rotate 7
     daily
     compress
     delaycompress
     copytruncate
     notifempty
}

If using amavisd-new at log levels higher than 0, you may see something similar to this in your log file:
Feb  4 11:40:21 sfa amavis[13467]: (13467-01) extra modules loaded:
 /etc/spamassassin/FuzzyOcr.pm, /usr/lib/perl/5.8/auto/Storable/autosplit.ix,
 /usr/share/perl5/auto/Log/Agent/Priorities/autosplit.ix,
 /usr/share/perl5/auto/Log/Agent/autosplit.ix, FuzzyOcr/Config.pm,
 FuzzyOcr/Deanimate.pm, FuzzyOcr/Hashing.pm, FuzzyOcr/Logging.pm,
 FuzzyOcr/Misc.pm, FuzzyOcr/Preprocessor.pm, FuzzyOcr/Scanset.pm,
 FuzzyOcr/Scoring.pm, Log/Agent.pm, Log/Agent/Formatting.pm,
 Log/Agent/Message.pm, Log/Agent/Priorities.pm, MLDBM.pm, MLDBM/Sync.pm,
 MLDBM/Sync/SDBM_File.pm, SDBM_File.pm, Storable.pm,
 String/Approx.pm, Tie/Cache.pm
If you are using amavisd-new 2.4.3 or newer you can do something about this. Note that FuzzyOcr will work fine even if you don't do this. We can however get rid of the extra log entry and optimize loading of FuzzyOcr and other modules (which would be nice). From the amavisd-new RELEASE_NOTES:
 - added a global configuration variable @additional_perl_modules, which
  is a list of additional Perl module names or absolute file names that
  should be compiled/executed (by calling 'require') at a program startup
  time by a master parent process, before chroot-ing and before changing
  UID takes place. Its purpose is to pre-load additional non-standard
  SpamAssassin plugins and similar modules that a standard SpamAssassin
  initialization would miss, causing them to be loaded later by each
  child process, which is inefficient and may not work in a chrooted
  process. Example:
    @additional_perl_modules = qw(
      /usr/local/etc/mail/spamassassin/FuzzyOcr.pm
      /usr/local/etc/mail/spamassassin/ImageInfo.pm
      /usr/local/etc/mail/spamassassin/WebRedirect.pm
      String::Approx Net::HTTP Net::HTTP::Methods
      URI URI::http URI::_generic URI::_query URI::_server
      HTTP::Date HTTP::Headers HTTP::Message HTML::HeadParser
      HTTP::Request HTTP::Response HTTP::Status
      LWP LWP::Protocol LWP::Protocol::http
      LWP::UserAgent LWP::MemberMixin LWP::Debug
    );
  Make sure these files are owned by root and not writable by unprivileged
  users such as amavis!
Transferring the log entry nearly literally into amavisd.conf and then reloading amavisd-new seems to work:
@additional_perl_modules = qw(
 /etc/spamassassin/FuzzyOcr.pm
 /usr/lib/perl/5.8/auto/Storable/autosplit.ix
 /usr/share/perl5/auto/Log/Agent/Priorities/autosplit.ix
 /usr/share/perl5/auto/Log/Agent/autosplit.ix
 FuzzyOcr/Config.pm FuzzyOcr/Deanimate.pm
 FuzzyOcr/Hashing.pm FuzzyOcr/Logging.pm
 FuzzyOcr/Misc.pm FuzzyOcr/Preprocessor.pm
 FuzzyOcr/Scanset.pm FuzzyOcr/Scoring.pm
 Log/Agent.pm Log/Agent/Formatting.pm
 Log/Agent/Message.pm Log/Agent/Priorities.pm
 MLDBM.pm MLDBM/Sync.pm MLDBM/Sync/SDBM_File.pm
 SDBM_File.pm Storable.pm String/Approx.pm
 Tie/Cache.pm
);
I imagine it should (or at least could) be rephrased:
@additional_perl_modules = qw(
 /etc/spamassassin/FuzzyOcr.pm
 /usr/lib/perl/5.8/auto/Storable/autosplit.ix
 /usr/share/perl5/auto/Log/Agent/Priorities/autosplit.ix
 /usr/share/perl5/auto/Log/Agent/autosplit.ix
 FuzzyOcr::Config FuzzyOcr::Deanimate
 FuzzyOcr::Hashing FuzzyOcr::Logging
 FuzzyOcr::Misc FuzzyOcr::Preprocessor
 FuzzyOcr::Scanset FuzzyOcr::Scoring
 Log::Agent Log::Agent::Formatting
 Log::Agent::Message Log::Agent::Priorities
 MLDBM MLDBM::Sync MLDBM::Sync::SDBM_File
 SDBM_File Storable String::Approx
 Tie::Cache
);
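The rephrasing from file paths to module names is mechanical: drop the .pm suffix and turn each slash into '::'. A sketch of that conversion (the function name path_to_module is hypothetical); absolute paths are passed through unchanged, as in the list above.

```shell
# Hypothetical helper: turn relative module paths such as Log/Agent.pm into
# module names such as Log::Agent. Absolute paths are printed unchanged.
path_to_module() {
    for p in "$@"; do
        case "$p" in
            /*)   printf '%s\n' "$p" ;;
            *.pm) printf '%s\n' "$p" | sed -e 's|\.pm$||' -e 's|/|::|g' ;;
            *)    printf '%s\n' "$p" ;;
        esac
    done
}

# Usage:
#   path_to_module /etc/spamassassin/FuzzyOcr.pm FuzzyOcr/Config.pm MLDBM/Sync.pm
```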
One last note. If you run amavisd-new in debug mode, when SpamAssassin initializes you will see entries such as this:
[...] /usr/sbin/amavisd-new[13545]: SpamControl: initializing Mail::SpamAssassin
Subroutine new redefined at /etc/spamassassin/FuzzyOcr.pm line 48.
Subroutine dummy_check redefined at /etc/spamassassin/FuzzyOcr.pm line 59.
Subroutine fuzzyocr_check redefined at /etc/spamassassin/FuzzyOcr.pm line 63.
Subroutine fuzzyocr_do redefined at /etc/spamassassin/FuzzyOcr.pm line 101.
[...] /usr/sbin/amavisd-new[13545]: SpamControl: init_pre_fork done
I don't believe there is much significance to these somewhat unexpected messages.


mr88talent at yahoo dot com
25 JAN 2007