-
Notifications
You must be signed in to change notification settings - Fork 0
Compressor-based spam filter
License
arachsys/zf
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
zf == zf is a compressor-based spam filter which uses zstd compression to classify inbound email. Dictionaries are trained from samples of valid messages and junk mail respectively. Subsequent messages will compress best with the dictionary trained on samples they resemble most closely. This technique is straightforward and relatively inexpensive but nonetheless achieves state-of-the-art filtering accuracy with realistic training sets of a few thousand messages per category. Building dictionaries --------------------- The train program expects a list of files on stdin or as arguments, each containing a single message. It trains a dictionary from these samples using the zstd fast cover algorithm then writes it to stdout. By default, train will generate a 1M dictionary from the first 16k of each message using fast cover parameters d = 6, k = 100 and optimised for compression level 12. These defaults can be adjusted using command-line options. It is also possible to train on just message bodies or headers using the -B and -H options. All relevant parameters are recorded in a short header prepended to the dictionary so they can be reproduced exactly during classification. The examples/train script demonstrates how to train whole-message and header-only dictionaries from maildir archives of read and junk mail. It expects to live in the zf subdirectory of a maildir hierarchy and is a verbatim copy of my own daily job to rebuild classifiers. Classifying messages -------------------- The classify program loads the dictionary given as its first argument, then reports the size of each message specified as an argument or the message supplied on stdin when compressed with this dictionary. All parameters are set to the values used in training, including whole-message, header-only or body-only mode and truncation size. The examples/deliver script demonstrates how classify can be used in an MTA pipe alias to filter inbound junk mail using whole-message and header-only dictionaries. It expects to live in the zf subdirectory of a maildir hierarchy and is a verbatim copy of the script which delivers my own inbound email. A message is only filtered as junk if both the whole-message and header-only classifiers agree it is spam, making false positives vanishingly rare. Delivering to a maildir ----------------------- The deliver helper reads a message on stdin and safely writes it to a unique file named according to the maildir specification, reporting the filename on stdout. This file must be created in the tmp subdirectory of a maildir, then renamed into the new subdirectory. The examples/deliver script demonstrates how it can be used. Extracting and parsing message headers -------------------------------------- The header helper extracts the unfolded content of specified headers from messages given as arguments or from the message supplied on stdin. Optionally, it can parse date headers and convert them to script-friendly unix timestamps, or extract individual addresses from headers that contain RFC 5322 address lists. The latter feature is especially useful for implementing whitelisting: examples/whitelist/generate demonstrates how to compile a whitelist from archived mail and examples/whitelist/deliver shows how one can be integrated into filtering. This program is written in Go to avoid heavy-duty string wrangling of untrusted message text in C. It has no dependencies beyond the standard library. If a Go compiler is not available, the makefile will not attempt to build or install it. Building and installing ----------------------- Run 'make install' at the top of the source tree to install classify, deliver, header and train in /lib/zf. Alternatively, you can set DESTDIR and/or LIBDIR to install in a different location, or make, strip and copy the binaries into place manually. The programs should be portable to any reasonably modern POSIX system with the zstd library available, including Linux and BSD. The optional header helper also requires a Go compiler and the example scripts are written for Bash 4.x or later. Please report any problems or bugs to Chris Webb <[email protected]>. Copying ------- This software was written by Chris Webb <[email protected]> and is distributed as Free Software under the terms of the MIT license in COPYING.
About
Compressor-based spam filter