The aspam Anti-Spam Tool

Release: 2.6  6/15/04
Author: Stephen R. Whiteley aspam@wrcad.com
Whiteley Research Inc. www.wrcad.com
The aspam home page: http://www.wrcad.com/aspam/

Warning: This is a pre-production release. Although aspam is in use at Whiteley Research and successfully deals with the mountain of spam received every day, it has not been tested in other situations. It has been deployed on our FreeBSD and Red Hat Linux 9 mail servers only.

NOTE ADDED 5/20/08
Whiteley Research now uses industrial-strength spam removal tools. aspam was an interesting programming exercise, and may still be useful, or bits and pieces may be useful in other programs, so we will continue to make it available.

Contents

What is aspam?
aspam Features
Installation
Invoking and Running aspam
The aspamrc File
Interactive Mode
Effective Use of Interactive Mode
The Word Table and Probability Analysis
The asutil utility

What is aspam?

The aspam program is a tool for separating potential spam (unsolicited commercial) email messages from other messages. It works by applying pattern matching and other tests to each message found in the mail inbox, and assigning a score to each message. Messages with a high enough score are removed from the inbox and placed in a "spambox".

The aspam program is intended for use on Unix/Linux systems which provide mail delivery services. On such systems, mail is delivered to an "inbox" file for each user. The inbox file is simply a concatenation of the messages received for that user by the system. This file is read and manipulated by the user's mail client program, allowing the user to read, respond to, and dispose of individual messages in the file. In some cases, the entire file is transferred to another machine by a POP server, usually to support Windows users.

If run periodically or before the mail client or POP server is invoked, aspam will keep the inbox substantially free of spam messages, which have been placed in a "spambox" file. The user should check the spambox periodically for possible messages of interest that were misidentified as spam. If misidentified messages are found, there is a procedure described below which should be used to correctly update the internal tables. Otherwise, the spambox can be deleted immediately.

When a message is moved to the spambox, a block of text is added to the message body which contains a tabulation of the test results, and the score. If the message is subsequently read in again, for example while in interactive mode, aspam will ignore this block when retesting or printing. The interested user can determine from this block why aspam categorized the message as spam.

aspam Features

Installation

To build aspam, type the following commands:
./configure
make depend
make

If not using gcc, the Makefile produced by the configure script will probably have to be tweaked. If the build fails, please notify Whiteley Research (aspam@wrcad.com), and we may be able to help out; otherwise, consult with someone familiar with C/C++ programming. The build should work without modification on any FreeBSD, Linux or Solaris system with gcc installed.

The aspam file (the executable) should be moved to a place where the user keeps executable files, or to a system location such as /usr/local/bin.

Current releases (2.3 and later) differ considerably from earlier releases (2.2 and before) in the expected locations of the startup and database files. The procedures below cover a new installation and the two update cases.

New Installation

The aspamrc file should be modified with a text editor to provide initial customization. The existing example words and tests may not be appropriate for the user. Then, the following steps should be performed:
  1. Create a directory named ".aspam" in the user's home directory.

  2. Move the aspamrc file into the new .aspam directory.

This completes installation.

Updating an Existing Post-2.2 Installation

  1. Update your .aspam/aspamrc file to reflect use of new features and changes. See the supplied aspamrc file for information. The new entries can be copied to the existing file with a text editor.
This completes the update.

Updating an Existing Pre-2.3 Installation

The current releases use the same database files, but under new names. To retain past history, the user must move existing database files to the new location.

The existing startup file can be reused, but it too must be moved, and some editing is required to avoid warning messages. Perform the following steps to update the installation:

  1. Create a directory named ".aspam" in the user's home directory.

  2. Copy the present .aspamrc file into this directory as "aspamrc", i.e., no dot in the new name.

  3. Copy the present message database file (default name ".aspam_db1") into the new directory as "mesg_table".

  4. If the blocked address database is not shared with other users, copy the blocked address database file (default name ".aspam_db2") into the new directory as "addr_table".

  5. Copy the word database file (default name ".aspam_db3") into the new directory as "word_table".

  6. With a text editor, modify the .aspam/aspamrc file as follows:
    Deletions:
    1. Delete or comment out the line containing the MsgIdDatabase keyword.

    2. Delete or comment out the line containing the WordDatabase keyword.

    3. Unless the blocked address database is shared with other users, delete or comment out the line containing the BadAddrDatabase keyword. Otherwise, this must be a full path to the shared database file.

    4. Delete the GoodReplyPlaces keyword. The list of words associated with this keyword should be moved to the GoodPlaces list.
    Additions:
    To disable use of the databases, one can add the following keywords, which take no arguments:

    NoBadAddrDatabase don't use bad address table
    NoWordDatabase don't use word table
    NoGoodReturnDatabase don't use good return table

    Previously, these databases were disabled by giving the respective keyword without a file name argument. In the present release, the keywords above must be given to disable the feature.

This completes the update.

See the Unix manual pages for the cron and/or crontab commands to learn how to run aspam automatically. A typical installation may run aspam every half hour, for example. Then, only the spam messages received after the last aspam run will exist in the inbox.

Questions and feedback can be addressed to aspam@wrcad.com. Ideas for new tests and capabilities, and feedback about effectiveness, would be welcome.

Invoking and Running aspam

The aspam program will operate on the inbox file, or any file containing a concatenation of email messages, and may remove messages identified as spam to another file (the "spambox") which is also a list of messages in the same format as the inbox.

The program is invoked from the command line as follows:

aspam [-f inbox] [-s spambox] [-h home_dir] [-t] [-c | -d | -i]

The argument following -f is the inbox file. If not given, this defaults to the value given in the aspamrc file, and if not given there it defaults to the value of the MAIL environment variable, which is usually the user's system inbox. If the MAIL environment variable is not set, the default is a file named "INBOX" in the current directory.
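
The lookup order just described can be sketched as follows. This is only an illustration in Python, not aspam's own code; the function name resolve_inbox and its parameters are hypothetical.

```python
import os

def resolve_inbox(cli_path=None, rc_path=None, environ=None):
    """Pick the inbox path using the precedence described above.

    cli_path: value given with -f, if any (hypothetical parameter)
    rc_path:  value of the InBox keyword in aspamrc, if any
    environ:  environment mapping, defaults to os.environ
    """
    env = os.environ if environ is None else environ
    if cli_path:                 # 1. the -f command line argument
        return cli_path
    if rc_path:                  # 2. the InBox value from aspamrc
        return rc_path
    if env.get("MAIL"):          # 3. the MAIL environment variable
        return env["MAIL"]
    return "INBOX"               # 4. a file named INBOX in the current directory
```

For example, resolve_inbox(None, None, {}) falls through every step and yields "INBOX".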

If you just want to experiment and not touch your inbox, copy your inbox to another file, and use the -f option to work from that file.

The argument following -s is a path to a file used for collecting spam messages. This will override the value set in the aspamrc file, if any. If not given, a file named "SPAM" in the current directory is used. The file will be created if necessary.

The aspamrc files and database files are located in a directory named ".aspam" in the user's home directory. The -h option can be used to specify another directory to be used to locate the .aspam directory. The home_dir is a full path to the directory that contains the .aspam directory.

Ordinarily, each message is tested only once. When a message is tested, its unique message id is saved in a database, so future testing will be skipped. If the -t option is given, this database will be ignored, and all messages in the queue will be freshly tested. The databases will be read, but not updated in this mode.

If -c is given, aspam will check for corruption in the inbox and spambox, and perform repairs if necessary. After checking/repairing, aspam exits.

It was discovered (release 2.3 time frame) that corruption was occurring in the inbox and spambox files. What was observed was that occasionally (about one in a hundred messages) a message would get truncated by a few characters, so that the "From" line that starts the next message would not appear at the beginning of a line, preceded by a blank line. In this case, the second message would be hidden, acting as if it were part of the first. Messages were getting lost.
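
To see why a truncated message hides its successor, consider a simplified mailbox splitter. The sketch below is in Python and is an assumption about how such a reader typically works, not aspam's actual reader; the name split_mbox is hypothetical. A message is taken to start at a line beginning with "From " that is either the first line of the file or preceded by a blank line.

```python
def split_mbox(text):
    """Split mailbox text into messages at 'From ' separator lines."""
    messages, current = [], []
    prev_blank = True            # the start of the file counts as a boundary
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") and prev_blank:
            if current:
                messages.append("".join(current))
            current = [line]     # begin a new message
        else:
            current.append(line)
        prev_blank = (line.strip() == "")
    if current:
        messages.append("".join(current))
    return messages

good = "From a@x Mon\nSubject: one\n\nbody\n\nFrom b@y Tue\nSubject: two\n\nbody2\n"
# The first body lost its trailing newlines, so the next "From " is mid-line.
bad = "From a@x Mon\nSubject: one\n\nbodyFrom b@y Tue\nSubject: two\n\nbody2\n"
```

With these inputs, split_mbox(good) finds two messages while split_mbox(bad) finds only one, with the second message absorbed into the first.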

An audit of the aspam code and hours of experiments failed to reproduce the effect. It is possible that aspam was not causing the problem: if the local mail distribution program is writing a corrupted inbox, that corruption would be transferred to the spambox as the "hidden" messages tag along. Nevertheless, aspam was still suspect, being the only "new" code in the chain.

To address the problem, the functions that read/write the inbox and spambox were rewritten to be as paranoid as possible. Checking is now performed, and corruption, if found, is repaired and reported in a log file.

NOTE added: The problem was caused by null bytes in message body text, found in spam messages. Non-printing characters should not appear in messages, but for some reason on our system they do. I don't know if this is deliberate on the part of the spammer for some purpose, or an error. Non-printing characters are now converted to space characters when messages are processed in aspam, avoiding this problem.
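
A conversion of this kind might look like the following Python sketch. The exact set of bytes that aspam treats as non-printing is not stated here, so the choice below (control characters other than tab, line feed and carriage return, plus DEL) is an assumption, and the name sanitize is hypothetical.

```python
def sanitize(data: bytes) -> bytes:
    """Replace non-printing bytes with spaces.

    Assumption: tab, LF and CR are kept, and bytes >= 0x80 are left alone
    since they may be legitimate 8-bit text.
    """
    keep = {0x09, 0x0A, 0x0D}
    return bytes(
        b if (b in keep or b >= 0x20) and b != 0x7F else 0x20
        for b in data
    )
```

A null byte embedded in a body line, as in b"ab\x00cd\n", becomes a single space, so line and message boundaries are preserved.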

If -d is given, no files are modified, but the messages found in the inbox are listed, along with the "score" and test result details. The score is determined for each message according to the criteria set by the aspamrc file. Messages with a score of 20 or higher are considered spam. The message id database is ignored (as if -t was given), so that all messages in the list will be tested. Again, there is no update to anything when -d is used, and spam messages are not removed from the inbox.

If -i is given, aspam enters an interactive mode.

Although aspam can be run directly from the command line, once aspam is "trained", it is more effective to run aspam periodically through cron(1) or at(1). The user is only required to check the spambox file periodically for any misidentified messages and delete the file.

The effectiveness of aspam is largely dependent on the past history of messages tested, as the databases acquire more information. It is also dependent on the keyword tests defined in the aspamrc file, particularly before much past history is accumulated. These tests should be tuned by the user, as they depend specifically on the type of messages that the user receives. The aspamrc file supplied was written for the author's email, which probably is a bit different from yours. Thus, the user should expect to spend some time experimenting and tweaking the aspamrc file for best, or even adequate, results.

In the initial phase of using aspam, the user should modify the aspamrc file to catch spam messages initially not identified, and to prevent messages that are actually wanted from being identified as spam. After a few days of this "training", aspam should be effective at correctly identifying most spam, though occasional updates to the aspamrc file will probably be necessary as new types of spam arrive. As the databases acquire more information, aspam should become increasingly effective, with less attention required of the user.

The aspamrc File

The aspamrc file is read when aspam starts, and provides initialization. It is located in the .aspam directory, which is usually located in the user's home directory.

Aspam provides three main categories of tests that are performed on email messages, representing three "lines of defense" against spam intrusion.

  1. Word and pattern matching is performed for the message subject line and sender's user name and address. The aspamrc file contains lists of words and patterns that if a match is found indicate goodness (non-spam) or varying levels of badness.

  2. The sender's address is compared to a local database, and to external DNSBL databases. A match indicates that the sender is a known source of spam. The local database is built up from the history of spam messages received. The database provides partial matching capability, so that, for example, if the first three components of an address match the database, this is deemed suspicious and scored accordingly.

  3. A database of words obtained from the processed messages is maintained, and a probabilistic analysis of the words found in the message body using this database will indicate a probability that the message is spam.

The aspamrc file is used to set the parameters that control these tests, and other factors that control aspam operation. The aspamrc file found in the distribution is a prototype that can be used as a starting point. For effectiveness this file will be heavily customized by the user, as many of the tests depend on the exact nature of the email that the user receives. The example file is heavily commented, and can be modified with any text editor.

The format is rather simple. There are a number of keywords, which are followed by data that you supply. A keyword will only be recognized if it starts in the first column. Some keywords have a single data item that follows immediately on the same line, separated by a space or tab. Other keywords introduce a list of words, which appear in following lines. Each of these lines must start with a tab or space, and the words are separated by a tab or space. Blank lines, and lines starting with the '#' character, are ignored by aspam.

Lines in the aspamrc file can be continued using the backslash continuation method. If a line ends with a backslash ('\'), it will be joined to the following line (replacing the backslash) yielding a single logical line.
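
The format rules above can be illustrated with a small reader. This is a minimal Python sketch under the rules as stated, not aspam's parser; the name parse_aspamrc is hypothetical, quoted phrases are not handled, and a keyword that appears more than once simply accumulates its arguments.

```python
def parse_aspamrc(text):
    """Return a dict mapping each keyword to its list of data items."""
    # Join backslash-continued lines into single logical lines.
    logical, pending = [], ""
    for raw in text.splitlines():
        if raw.endswith("\\"):
            pending += raw[:-1]          # drop the backslash, keep joining
            continue
        logical.append(pending + raw)
        pending = ""
    if pending:
        logical.append(pending)

    settings, current = {}, None
    for line in logical:
        if not line.strip() or line.startswith("#"):
            continue                     # blank line or comment
        if line[0] in " \t":             # list line belonging to the last keyword
            if current is not None:
                settings[current].extend(line.split())
            continue
        keyword, *args = line.split()    # keyword starts in the first column
        settings.setdefault(keyword, []).extend(args)
        current = keyword
    return settings
```

A file containing "WordScore 15" followed by a GoodPlaces list spread over two continued lines would yield one entry for WordScore and one GoodPlaces entry holding all the listed words.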

The lists of words are the areas most likely to be updated by the user. The user is also free to fine-tune the numeric parameters as necessary. The numeric parameters control the weighting of associated tests. It should be remembered that if a message accumulates a score of 20 or more, the message is considered spam.
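
As a worked illustration of the threshold, the Python sketch below sums the scores contributed by individual tests and compares the total with 20. The dictionary keys are keyword names from this document used only as labels, and the helper names are hypothetical.

```python
SPAM_THRESHOLD = 20   # a total of 20 or more marks the message as spam

def total_score(results):
    """Sum the scores added by the individual tests."""
    return sum(results.values())

def is_spam(results):
    return total_score(results) >= SPAM_THRESHOLD

# Two moderate hits stay below the threshold (8 + 10 = 18) ...
two_hits = {"SuspiciousPlaceScore": 8, "NameLengthScore": 10}
# ... but a third small hit pushes the total over it (18 + 4 = 22).
three_hits = dict(two_hits, PlaceDiffScore=4)
```

This is why lowering a single score from its default can be enough to stop a borderline class of messages from being flagged.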

General and Database Keywords

InBox inbox
SpamBox spambox
These keywords define the default paths to the user's mailbox file, and the name for a file to be used for spam messages. These will be overridden by the -f and -s command line options, respectively. If InBox is not given, the file name defaults to the value of the MAIL environment variable, or "INBOX" if that is not found. If SpamBox is not given, the file name defaults to "SPAM".

There are four database files used by aspam. These are typically created and maintained by aspam in the .aspam directory located in the user's home directory.

These databases are:

The tested messages database
File: mesg_table
This contains a list of the unique message identifiers of messages in the inbox that have previously been tested. This avoids redundant testing, and (worse) redundant saving of message data in the other databases.
The bad address database
File: addr_table
This database contains the source IP address of all messages identified as spam. When a message is received, the source address is compared to this list, and a match indicates that the message will be considered spam. Furthermore, the database maintains a count of messages received from each address, so that the worst spam sources can be identified. Each entry is also given a time stamp which represents the last "hit". Note that this is the time that aspam processes the message, and not the time that the message was received.

The source IP address is not always the sender's IP address, as resolved from the sender's domain. The source address is determined from the "Received:" message headers, if possible.

Unlike the other databases, the bad address database can be shared between multiple users, thus if one user receives spam from some source, the other users will be protected from spam from that source.

The word database
File: word_table
This database contains a list of words, and counts of the number of occurrences of the word in spam and normal messages. This is used for probability analysis based on the words in a message, as to whether the message is spam or not.
The good return database
File: good_table
This database contains the return addresses of normal messages received. When a message is received, the sender's return address is checked against this database, and if a match is found the message is accepted without further testing. Unlike the other database files, this is a text file, and can be edited by the user (carefully!). It is of course bad news if a spam return address gets into this table, since all further spam with that return address will not be blocked. Using the '+' command in interactive mode will clear such entries.
The bad address, word, and good return databases can be disabled. When disabled, the testing associated with the database is simply skipped, and the associated database file is neither read nor updated. It is unlikely that the user would wish to eliminate these features, though the following keywords provide that capability. These keywords have no arguments.
NoBadAddrDatabase
NoWordDatabase
NoGoodReturnDatabase
Giving any of these keywords disables the respective database and features.
The bad address database is normally created and read from a file in the user's .aspam directory. If instead an "external" database is to be used, such as when the database is shared between several users, the following keyword should be given:
BadAddrDatabase full_path_to_file
This specifies that the given file is to be used as the bad address database, rather than the default addr_table file in the .aspam directory. The file will be created if it does not exist, but the parent directories must exist.

A locking scheme is employed if the BadAddrDatabase file name is given in the aspamrc file. In this case, it is assumed that several users are sharing the same database file. A file with the same name and path as the database file, but with a ".lock" extension performs the locking. This file will be created if necessary, and automatically updated by the programs. It contains the process id of the owning process when the lock is in force, or "0" or is empty when the lock is not in force. The file must be writable by all users. It may have to be initially created by root using "touch" if the containing directory does not give users write permission - i.e., users cannot create a file in the directory but can update an existing file.

If BadAddrDatabase is not given, the table is located in the user's .aspam directory and is assumed to not be shared, and no locking is done.

The file is locked when the database is read into memory, and unlocked when the file is rewritten from memory, or the program exits. When locked, it can not be opened by another copy of aspam. If aspam finds the file locked, it will go to sleep and try again every five seconds. Note that long sessions in interactive mode will lock out all other aspam processes for the duration.

The IP address of the sending machine, obtained from the headers, is checked against the bad address database, and optionally against external DNSBL databases of known spam sites. The local database contains a list of the addresses of the sending machines of all spam messages seen by aspam. Since much spam originates from the same sources, this effectively eliminates a great deal of spam.

AddressMatch2 score
AddressMatch3 score
The sending machine's IP address is checked against the bad address database. If the first two components match a database entry, this is deemed suspicious and the integer following AddressMatch2 is added. Likewise, if the first three components match, the integer following AddressMatch3 is added. If all four components match, the message is assumed to be spam. If the AddressMatch2 keyword is not given, the default score for this test is 4. Similarly, the default score for AddressMatch3 is 8.
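
The partial matching can be sketched in Python as below. This is an illustration only: the function name is hypothetical, a full match is represented by the spam threshold of 20, and taking the best partial match across all entries (rather than summing) is an assumption.

```python
def address_score(ip, bad_addrs, match2=4, match3=8, spam=20):
    """Score an IPv4 dot-quad against a list of known bad addresses."""
    parts = ip.split(".")
    best = 0
    for bad in bad_addrs:
        # Count how many leading components agree.
        n = 0
        for a, b in zip(parts, bad.split(".")):
            if a != b:
                break
            n += 1
        if n == 4:
            return spam                  # exact match: treated as spam
        if n == 3:
            best = max(best, match3)     # AddressMatch3
        elif n == 2:
            best = max(best, match2)     # AddressMatch2
    return best
```

With a database containing only 192.0.2.10, a message from 192.0.2.77 would pick up the AddressMatch3 score and one from 192.0.99.1 the AddressMatch2 score.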

It is possible to prevent addresses from being added to the bad address database.

OkSources
   list of IP dot-quads on following lines
The OkSources list contains IP addresses that will never be added to the address table. A 0 in the fourth position is a wildcard for this position; similarly, 0 in the third and fourth positions is a wildcard for these two positions. A match to this list will not change the score, but only prevent the address from being added to the block list under any circumstances. You may want to include your trusted relays here. This is useful if your ISP sends you important mail as well as advertisements, for example. You want to filter the advertisements without permanently blocking the source address.

Addresses already in the database that match this list will be removed from the database the next time it is rebuilt, i.e., the next time aspam is run normally.
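
The wildcard rule for OkSources entries can be sketched as follows in Python. The function name ok_source is hypothetical, and the handling of entries is inferred from the description above rather than taken from aspam's code.

```python
def ok_source(ip, ok_sources):
    """Return True if ip matches an OkSources entry.

    A trailing 0 in the fourth position matches any value there; 0 in both
    the third and fourth positions matches any value in those two positions.
    """
    parts = ip.split(".")
    for entry in ok_sources:
        e = entry.split(".")
        if e[3] == "0":
            n = 2 if e[2] == "0" else 3   # number of components to compare
        else:
            n = 4
        if parts[:n] == e[:n]:
            return True
    return False
```

For example, the entry 192.0.2.0 protects every address from 192.0.2.0 through 192.0.2.255, while 192.0.0.0 protects everything under 192.0.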

The BadAddrDatabase saves a time stamp for the latest hit for each address. In order to limit the growth of this database and to maintain efficiency, a mechanism is provided for purging old entries. The aspamrc file can contain lines of the form below.

BadAddrPurge count days
Up to five of these entries may appear (beyond five will be ignored). The count and days are integers. When the addresses are being saved back to the file, each one is tested against the purge specifications, and will be discarded if for any specification
  1. the count is 0 or the address hit count is less than or equal to count, and
  2. the address most recent hit date is before the current date minus days.

If no BadAddrPurge specifications appear, all addresses will be saved forever.
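
The purge rule can be written out as a short Python sketch. The function and parameter names are hypothetical; each specification is a (count, days) pair as given on a BadAddrPurge line.

```python
from datetime import datetime, timedelta

def should_purge(hits, last_hit, specs, now):
    """Return True if an address entry meets any purge specification."""
    for count, days in specs:
        few_hits = (count == 0 or hits <= count)       # condition 1
        stale = last_hit < now - timedelta(days=days)  # condition 2
        if few_hits and stale:
            return True
    return False

# Hypothetical settings: drop one-hit addresses after 30 days,
# and any address not seen for a year.
specs = [(1, 30), (0, 365)]
now = datetime(2004, 6, 15)
```

Under these example settings, an address with one hit last seen 45 days ago is discarded, while an address with five hits seen at the same time is kept until it has been quiet for a year.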

The following keywords apply to the testing associated with the word database.

WordScore score
If the WordDatabase is in use and has a sufficient number of entries, a probability analysis is performed on the text found in a message. If the test indicates that the message has spam characteristics, the WordScore value is added to the message score. By default, if this keyword is not given, the score added is 20, meaning that the message will be considered spam. Giving this keyword can be used to relax this interpretation, which may be necessary if the probability analysis gives too many false-positives.
WordAlgorithm index
This sets the algorithm used for probability analysis, the index being an integer. If not given or the index is 0, the Bayes probability is used. If the argument is nonzero, an alternative algorithm is used. The algorithms are described below.

Recipient Tests

The following keywords consider the addresses found in the "To:" and "Cc:" headers. These are only useful if the inbox supports multiple users or aliases. In our case, all mail to wrcad.com goes to a single inbox, so it is useful to filter made-up user names applied by spammers.

There is a big problem with use of these keywords, however. Mail that is delivered in response to a BCC (blind carbon-copy) field will fail the test, unless the "real" recipient is also listed.

Recipients
   list of recipient names
The words listed will be checked against the addresses found in the To: and Cc: headers. If no match is found, the BadRecipientScore is added. If this keyword is not given or the BadRecipientScore is 0, this test is not done. The words in the list are compared in a case-insensitive manner, but otherwise literally (no wildcards). Thus, all possible recipient tokens should be listed, for example:
stevew stevew@chaucer stevew@chaucer.wrcad.com stevew@wrcad.com
BadRecipientScore score
If the To: and Cc: headers do not contain a match to a word listed in the Recipients list, this score is added. If this keyword is not given, the default score is 8.
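
A sketch of the recipient test in Python follows. It uses the standard library's email.utils.getaddresses to pull addresses out of the header values; how aspam itself tokenizes the headers is not described here, so the exact comparison is an assumption and the function name is hypothetical.

```python
from email.utils import getaddresses

def recipient_score(to_cc_headers, recipients, bad_recipient_score=8):
    """Return the score to add for the To:/Cc: recipient test.

    to_cc_headers: list of raw To: and Cc: header values
    recipients:    words from the Recipients list
    """
    if not recipients or bad_recipient_score == 0:
        return 0                          # test disabled
    known = {r.lower() for r in recipients}
    for _name, addr in getaddresses(to_cc_headers):
        if addr.lower() in known:         # case-insensitive, literal match
            return 0
    return bad_recipient_score
```

Because the match is literal, a message addressed to stevew@wrcad.com passes only if that exact form appears in the Recipients list.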

Keywords for Sender's Address Tests

The first group of tests generally operate on the sender's address, or name@address. The "sender" is the name@address obtained from the initial From line of the message (before the headers).

The "sender" may not be the same address to which a reply to the message is directed. The message headers may provide a different return address from the sender's address. There may be occasions where the sender's address is flagged as a spam site, though messages to be returned to a particular user or site that originate at the sender's site may not be spam. This can happen if legitimate mail is sent through a bulk-email distributer that also sends out spam. The return address is the address found in the "Reply-To:" or "From:" headers (sought in that order).

Words in the lists of this section match as a suffix, i.e., "abc.com" will match "joe@spammer.abc.com", case-insensitive. If the word begins with '*', then a match will occur if the rest of the word is found anywhere in the string, e.g., "*offers" will match "fred@directoffers.com" or "bill@offers4u.spammer.com" or "bestoffers@yahoo.com". Forms like joe@yahoo.com are accepted as well, which can be useful in the GoodPlaces or VeryBadPlaces lists to always allow or not allow a particular sender.
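
The two matching modes can be expressed compactly. The Python sketch below follows the rules and examples in the paragraph above; the function name place_matches is hypothetical.

```python
def place_matches(word, address):
    """Match a list word against a sender address, case-insensitively."""
    word, address = word.lower(), address.lower()
    if word.startswith("*"):
        return word[1:] in address     # '*' prefix: match anywhere
    return address.endswith(word)      # default: match as a suffix
```

So "abc.com" matches joe@spammer.abc.com, and "*offers" matches fred@directoffers.com, bill@offers4u.spammer.com and bestoffers@yahoo.com.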

Unconditional acceptance always supersedes spam characteristics, i.e., unconditional acceptance tests are performed first, and if the message passes no further testing is done.

GoodPlaces
   list of words on following lines
If the sender's address matches a word in the GoodPlaces list, the message is accepted, without further scrutiny. Further, if there is a separate return address such as from a Reply-To: or From: header, that address will be checked as well. Sometimes, the return address is different from the sender's address, and this enables the return address from a trusted source to be identified, even if the actual source is a bulk mail distributor, for example.
The remaining tests apply only to the address obtained from the initial From line, and not a separate return address.
SuspiciousPlaces
   list of words on following lines
The SuspiciousPlaces list contains words that would indicate a probability that the message is spam. The SuspiciousPlaceScore is added if a match is found.
SuspiciousPlaceScore score
The integer following SuspiciousPlaceScore is added for matches in the SuspiciousPlaces list of words. The score defaults to 8 if this keyword is not given.
BadPlaces
   list of words on following lines
The BadPlaces list contains words that would indicate a high probability that the message is spam. The BadPlaceScore is added if a match is found.
BadPlaceScore score
The integer following BadPlaceScore is added for matches in the BadPlaces list of words. The score defaults to 12 if this keyword is not given.
VeryBadPlaces
   list of words on following lines
If a match is found to a word in the VeryBadPlaces, the message is considered to be spam with no further testing.
RemoveTo from_addr  folder_path
Unlike the keywords listed above, this keyword can appear multiple times. Any messages whose sender matches from_addr (in the manner of the GoodPlaces keyword) will be removed from the inbox and saved in folder_path, which is a full path to a file, which will be created if it doesn't exist. Messages from this sender, spam or not, will go to folder_path. If folder_path can't be opened, the message will be treated normally.

This enables messages from a particular sender to be routed to a special file. It solves a particular problem: an evil spammer discovers that an anti-spam tool is available on wrcad.com. So, the spammer tries to retaliate by using "wrcad.com" in the bogus return address applied to a gazillion spams sent to aol.com. Huge numbers of these bounce since the recipient address is often incorrect, and end up in the mailbox of wrcad.com, from "MAILER_DAEMON@aol.com". This keyword enables filtering of these into a separate file.

The GoodPlaces and VeryBadPlaces lists usually contain full addresses, and these can include the "name@", and will override other testing. The GoodPlaces test is performed before the other tests. The SuspiciousPlaces and BadPlaces lists provide two levels of "badness". The actual IP address of the sending machine is determined from the "Received:" headers, if possible. The IP address of the sender, obtained from a resolver query using the sender's domain name, is also obtained if possible. Both of these may be bogus in spam messages, so aspam tries to ensure that the results are reasonable.
BogusPlaceScore score
If the sender's domain can't be resolved through the local nameserver, it is assumed to be bogus and the integer following the BogusPlaceScore is added. The default, if this keyword is not given, is to tag the message as spam (score = 20).
PlaceDiffScore score
If the IP address resolved from the sender's domain is different from the IP address of the sending machine (obtained from the headers), this score will be added. This is suspicious, but legitimate messages may have this property. The default score is 4, if this keyword is not given.
The following keyword enables the use of external DNSBL blocklist queries for the sending machine's IP address.
Blocklist zone [score]
This keyword enables querying of DNSBL blocklists. Unlike other keywords, this keyword has no default and can be given any number of times, or not at all. The "zone" is the domain of the blocklist provider, such as "sbl.spamhaus.org". This is followed by an optional numeric score, which is added to messages whose source is in the blocklist. If not given, the score is 20, i.e., the message is considered to be spam.
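
For readers unfamiliar with DNSBL lookups, the usual convention is to reverse the octets of the IPv4 address, append the blocklist zone, and look up the resulting name; if the name resolves, the address is listed. The Python sketch below shows that convention. It is not taken from aspam's source, and the function names are hypothetical.

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip, zone):
    """Return True if the DNSBL has an entry for ip.

    Note: a failed lookup of any kind, including a network problem,
    is treated as "not listed" in this simplified sketch.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False
```

For the address 192.0.2.99 and the zone sbl.spamhaus.org, the name queried would be 99.2.0.192.sbl.spamhaus.org.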

The "name" is the user id field of the sender. There are tests that examine the name for spam characteristics.

BogusNameScore score
If no valid user name is provided in the sender's name@address, the integer following the BogusNameScore is added. The default, if this keyword is not given, is to tag the message as spam (score = 20).
NameLength length
NameLengthScore score
Many spam names are long. If the number of characters up to the first '.', or the total number of characters if there is no '.' is longer than NameLength characters, the NameLengthScore is added. One should be careful with this test, as it may be triggered by bulk mail or machine-generated mail of any sort. The default length is 14, and the default score is 10.
NameDigitFrac fraction
NameDigitFracScore score
Spam names often use a lot of digits. If the fraction of the name characters consisting of digits exceeds the NameDigitFrac, then NameDigitFracScore is added. The fraction is a floating point number between 0 and 1.0. Again, this can be triggered by legitimate machine-generated email. The default fraction is 0.3, and the default score is 10.
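
The two name tests can be sketched in Python using the default values given above. The function name and parameter names are hypothetical.

```python
def name_scores(name, name_length=14, length_score=10,
                digit_frac=0.3, digit_score=10):
    """Return the score added by the name length and digit fraction tests."""
    score = 0
    head = name.split(".", 1)[0]          # characters up to the first '.'
    if len(head) > name_length:
        score += length_score             # NameLengthScore
    if name:
        frac = sum(c.isdigit() for c in name) / len(name)
        if frac > digit_frac:
            score += digit_score          # NameDigitFracScore
    return score
```

A name such as "john.smith.the.third" passes the length test because only "john" is counted, whereas "x83927k1" fails the digit test since six of its eight characters are digits.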

Keywords for Subject Header Tests

There are several tests for words found in the "Subject:" field of the message. This section is where much of the specific characterization of the email received by the user is represented. The words in the lists should be quoted if they contain white space, e.g., "earn money". The default action is to look for each word or phrase in the subject line, and if found add a score. The test is case-insensitive by default. There are optional special characters that can be added to the beginning of a word to modify the default behavior. These can be in any order, inside or outside of any quote marks.
* Ordinarily a word given must match exactly a word in the string. If '*' precedes the word, then it is allowed to match part of a word in the string, e.g., "*free" would match "freedom" and "carefree".
^ The word must match the leading word in the subject string. If '*' is also given, the word must match the leading non-space character in the subject string and the characters that follow. The first character in the word should not be a space.
@ If '@' is found, the match testing will be case-sensitive.
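
As a rough illustration of how the three modifiers interact, the following Python sketch parses an entry and tests it against a subject line. The helper names parse_entry and matches are hypothetical, and the use of regular expressions with "word" meaning a run of letters, digits, or underscores is an assumption; aspam's own matcher may differ in detail.

```python
import re

def parse_entry(entry):
    """Split an entry such as '*^"free money"' into (flags, word)."""
    flags = set()
    text = entry.strip()
    # Modifiers may appear before the opening quote...
    while text and text[0] in "*^@":
        flags.add(text[0])
        text = text[1:]
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    # ...or just inside it.
    while text and text[0] in "*^@":
        flags.add(text[0])
        text = text[1:]
    return flags, text

def matches(entry, subject):
    """Return True if the entry matches the subject under its modifiers."""
    flags, word = parse_entry(entry)
    pattern = re.escape(word)
    if "*" not in flags:
        # Default: the entry must match whole words, not part of a word.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    if "^" in flags:
        # Anchor at the first non-space character of the subject.
        pattern = r"^\s*" + pattern
    re_flags = 0 if "@" in flags else re.IGNORECASE
    return re.search(pattern, subject, re_flags) is not None
```

For example, matches("*free", "Freedom rings") is true, while matches("free", "freedom rings") is false because the default requires a whole-word match.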

If the subject text is encoded via RFC 2047, a score may be added. Encoding is necessary for character sets that are not western European, but can also be used with western character sets. If the subject line looks like a mess when the message is viewed in a text editor or in a non-decoding mail client, then it is probably encoded. In many mail clients, the decoding is transparent to the user. Spammers like to encode the subject line to try to foil spam detection software. Some mail programs in the Microsoft world encode the subject line if the user so specifies, or perhaps by default.
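
An RFC 2047 encoded word has the form =?charset?B?text?= or =?charset?Q?text?=, so the character set can be read directly from the raw header. The Python sketch below shows one way the SubjEncEnglScore and SubjEncFrgnScore keywords could be applied. The function name and the WESTERN set are assumptions made for illustration; this manual does not list which character sets aspam treats as western European.

```python
import re

# Matches one RFC 2047 encoded word and captures its charset.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?[bBqQ]\?[^?\s]*\?=")

# Assumption: an illustrative set of "western European" charsets.
WESTERN = {"us-ascii", "iso-8859-1", "iso-8859-15", "windows-1252"}

def subject_encoding_score(subject, engl_score=4, frgn_score=0):
    """Return the score for an encoded subject, or 0 if it is not encoded."""
    charsets = {m.group(1).lower() for m in ENCODED_WORD.finditer(subject)}
    if not charsets:
        return 0                  # plain, unencoded subject
    if charsets <= WESTERN:
        return engl_score         # encoded, but with a western charset
    return frgn_score             # encoded with some other charset
```

The defaults of 4 and 0 mirror the keyword defaults; a user who never receives legitimate mail in other character sets could pass frgn_score=20.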

SubjEncEnglScore score
If the subject was encoded using a western European character set, this is deemed suspicious and the SubjEncEnglScore is added. The score defaults to 4, if the keyword is not given.
SubjEncFrgnScore score
If the subject was encoded using a character set that is not western European, the SubjEncFrgnScore is added. This value defaults to 0 so as not to affect users who receive legitimate mail with non-western characters. However, unless you get, for example, Asian or Cyrillic characters in legitimate mail, these messages are probably spam, in which case the score can be set to a high value such as 20.
NoAdvTest
If a message subject starts with "adv:" (case insensitive), it is assumed to be spam. If this keyword is given, this test is not performed.
GoodWords
   list of words on following lines
If a match is found for any of the words in this list, the message is accepted without further testing.
SuspiciousWords
   list of words on following lines
Matches to words in this list trigger the addition of the SuspiciousWordScore for each match.
SuspiciousWordScore score
The integer following the SuspiciousWordScore is added for each match found to words in the SuspiciousWords list. The default, if this keyword is not given, is 8.
BadWords
   list of words on following lines
Matches to words in this list trigger the addition of the BadWordScore, for each match. Matches to these words indicate a high probability that the message is spam.
BadWordScore score
The integer following the BadWordScore is added for each match found to words in the BadWords list. The default, if this keyword is not given, is 12.

Keywords for Additional Tests

Finally, much spam seems to be HTML-formatted. The following score is added to messages that are exclusively HTML-formatted. Users whose legitimate mail is also mostly HTML-formatted should probably disable this test (set the value to 0). The default value is 10.
HtmlScore score

During the initial "training" procedure, one can use aspam with the -d option, which will list the messages found in the inbox along with the score. Words can be added to the aspamrc file to more correctly recognize the spam, and to allow legitimate messages that might be incorrectly scored as spam (score is 20 or more). After the change, aspam -d can be run again to verify the results. It is probably not possible to obtain perfect accuracy, but a carefully crafted aspamrc file should be quite effective. After any changes, aspam can be run without -d, and the messages identified as spam will be removed from the inbox and appended to the spambox, which defaults to a file named "SPAM" in the current directory. After making sure that this file does not contain any legitimate messages, this file can be deleted. After a period of training, one can set up the Unix cron or at programs to run aspam periodically. This should keep the user's inbox substantially free of spam. The user should periodically check and delete the spambox.

Rule/Action Blocks

The aspamrc file can contain any number of Rule/Action blocks. These are supplemental tests which are applied to the messages, and if the test returns "true" an action is performed. These blocks have the following syntax, and can appear anywhere in the aspamrc file.
Rule expression
Action action [argument]

The construct exists on two sequential logical lines, but each logical line may consist of more than one physical line with backslash continuation used to logically join the lines. Both the Rule and Action logical lines must be present.

Any number of these constructs can appear in the aspamrc file. For each message, each expression is evaluated, and if true the corresponding action is performed.

The expression follows the keyword Rule, and must exist entirely on the same logical line. As with any line in the aspamrc file, physical lines can be continued with the backslash continuation method.

The expression is an algebraic form similar to an expression in C. The following tokens are understood.

The Action line must follow the Rule line, and consists of a keyword from the list below, following "Action", which may be followed by additional text for certain actions.

Action keywords:

Noop
This action does nothing.

Delete
This action means that the message will not be written to the spam box. There is no effect on non-spam messages, but spam messages will disappear forever.

AddScore score
The integer score will be added to the spam score of the message. This allows the score to be modified.

AppendFile filename
The message will be appended to the file filename. If written successfully, the message will not be written to the spam box, and is removed from the inbox. It will also be written to the RemoveTo destination if one applies.

Examples:

Rule DUPLICATE | VERY_BAD_PLACE | HITS > 25
Action Delete

If the message is a duplicate or has a match in the VeryBadPlaces list, or if we have seen more than 25 spams from the same source, it will not be saved in the spam box.

Rule SPAM > 40 & (IN_DB = 4 | BLOCKED)
Action AppendFile /home/user/crud

If the message was blocked and has score > 40, append it to the "crud" file.

Rule BAD_RECIP & BAD_PLACE
Action AddScore 20

Make sure that messages to a bad recipient from a bad place are tagged as spam.

Interactive Mode

Giving "-i" on the aspam command line places aspam in interactive mode. In interactive mode, aspam responds to a number of commands to perform various tasks. These are described below.

The commands generally operate on a message list, which is a list of the messages currently in the user's inbox or spam box. When aspam starts, it reads the inbox and creates a list of descriptors for the messages. It is possible that while using aspam, the system will append additional messages to the inbox. Such messages are not accessible to aspam until aspam is restarted. These messages will not be disturbed when aspam moves spam out of the inbox. However, if the inbox file changes between the time that aspam starts and interactive mode is exited, such as if another copy of aspam was run non-interactively and spam messages were removed, the update will not be done, and a message will indicate that the file changed asynchronously. This should be borne in mind if aspam is run periodically in the background.

Each message in the list is given a sequential message number starting with one. Several of the commands take a message number or range of message numbers as an "argument". The possibilities are
no range given The range used is the previous range given, or the range starts at the message following the last range given, depending on the command. If there was no range previously given, the range is the first message, in either case.
number This implies the message corresponding to the number.
number1-number2 This implies the messages in the range of number1 through number2. The number2 can be the dollar sign ('$') character to indicate the last message in the list.
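
The three range forms can be parsed as in the Python sketch below. The function name parse_range is hypothetical; because the behavior when no range is given depends on the command, the sketch simply returns a default supplied by the caller. Rejecting out-of-bounds ranges with an exception is also an assumption, as this manual does not say how aspam reports them.

```python
def parse_range(text, last, default=(1, 1)):
    """Return (first, last) message numbers, 1-based and inclusive.

    text    -- the range as typed: '', 'N', 'N-M' or 'N-$'
    last    -- number of the last message in the list
    default -- range to use when no range was typed
    """
    text = text.strip()
    if not text:
        return default
    if "-" in text:
        lo, hi = text.split("-", 1)
        start = int(lo)
        # '$' stands for the last message in the list.
        end = last if hi.strip() == "$" else int(hi)
    else:
        start = end = int(text)
    if not 1 <= start <= end <= last:
        raise ValueError("range out of bounds: " + text)
    return start, end
```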

The command prompt will indicate the current default range, i.e., the range that will be used or advanced if no range is given to a command.

Some of the commands allow output redirection. This enables the command output to be placed in a file, or processed by a system command. The default operation if no redirection is specified is to display the output on the screen, after processing with the system pager command. The system pager command is the value of the PAGER environment variable, or the command "more" if this is not set. The possible redirection forms, which must appear at the end of a command line, are
> filename Write the output to filename.
>> filename Append the output to filename, which will be created if it does not exist.
| command Pipe the output through command.
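
A command line with a trailing redirection can be split as in the following Python sketch. The function name split_redirection is hypothetical, and the sketch assumes that the first '>' or '|' on the line starts the redirection; aspam's own parser may handle unusual input differently.

```python
def split_redirection(line):
    """Return (command, mode, target); mode is None, '>', '>>' or '|'."""
    positions = [i for i in (line.find("|"), line.find(">")) if i >= 0]
    if not positions:
        return line.strip(), None, None
    pos = min(positions)
    if line.startswith(">>", pos):
        mode, rest = ">>", line[pos + 2:]
    else:
        mode, rest = line[pos], line[pos + 1:]
    return line[:pos].strip(), mode, rest.strip()
```

For example, "d25 >> badones" splits into the command "d25", the mode ">>", and the target "badones".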

In the table below, the square brackets imply that the enclosed item is optional. The range and redir are the constructs described above.

S
Set the context to a list of messages read from the spam box.
I
Set the context to a list of messages read from the inbox. This is the default.
In the listing commands below, if no range is given, the range used will consist of the number of messages in the previous range, starting at the first message following the previous range.
h[range] [redir]
Display the messages in the list according to the range.
f[range] [redir]
Display the messages as with h, but include specific information about the test results on the message. Test results are only available if the message was tested in the current aspam session, see the c command below.
l[range] [redir]
Display the messages as with h, but include all of the message headers in the listing.
[range]
Giving just a range, or just pressing Enter, will list the messages in the format of the last one of h, f, or l given, advancing the range. The initial command is assumed h.

In the listings, the first line gives the message number and the sender's user name and address (which are probably bogus for spam). The second line gives the return address obtained from the "Reply-To:" or "From:" header, if found, which may be different. The third line is a little different depending on whether the message was read from the inbox or the spam box. If from the spam box, the sender's IP address, if it can be determined from the message headers, is listed. If the message was read from the inbox, the third line gives the message score, possibly a short code string, and the originating IP address as determined from the message headers. If the address was found in the local block list, a D will appear before the address. If the address was determined to be a spam source from a DNSBL query, a B followed by an integer will appear ahead of the address. The integer indicates which of the DNSBL servers returned the indication, and is 1 or larger corresponding to the Blocklist entries in the aspamrc file. The fourth line gives the message subject text. The l command will show additional header lines, and the f command will show a breakdown of the test results.

For the following commands, if no range is given, the previous range will be used.

b[range] [redir]
Apply the DNSBL tests to the messages, and show the results.
c[range]
Test/retest the messages in the range. This will clear the test status of each message, and apply the tests. This is useful for messages in the spam list, or inbox messages that have been previously tested, and you want to know the detailed test results.
u[range] [redir]
For each message in the range, evaluate the expressions in the Rule/Action list and show the result. The actions are not performed.
U[range]
The user is prompted to type in an expression, as would be provided in a Rule/Action block. The expression is applied to each message in the range, and the result printed (there is no action). The process repeats until the user enters "q" for the expression, or no expression at all.
p[f][range] [redir]
The p command will display the message body, printed verbatim as plain text. In spambox context, the test result block, which aspam appends to each message in the spambox, is not printed. The pf variation, which is only applicable in spambox context (S given) will print only the test result block.
y[f][range] [redir]
The y command will display the message body, mime decoded. In spambox context, the test result block, which aspam appends to each message in the spambox, is not printed. The yf variation, which is only applicable in spambox context (S given) will print only the test result block, the same as pf.
+[range]
Set the messages in the range as spam. This applies only to messages read from the inbox. This will also remove the message return address from the good return database.
-[range]
Set the messages in the range as not spam. For messages in the spam list, this will remove the source address from the local database, and fix the word table.

The remaining commands do not operate on the message list and are thus insensitive to the current message range.

d[mincnt] [baktime] [redir]
Dump the sorted contents of the local block list database. This is a list of IP addresses from which received mail will be automatically tagged as spam. The first line of the dump is a magic header, which allows the redirected ASCII file to be read into aspam the same way as the addr_table file in the .aspam directory. Thus, it is possible to use a text editor to tweak this database: dump a file using 'd', edit as necessary, then replace the addr_table file. If you do this, be sure that the magic header is retained. The next time the file is updated by aspam, however, it will revert to the more efficient binary format. Every other line in the file must be in the form:
A . B . C . D [count [date [anything]]]
An IP-quad must appear on the line. This is followed by an optional hit count, which will be taken as 1 if not given. A date stamp can appear after the hit count. This is in the format returned by the Unix time(3) function (seconds since the epoch), and if not given the present time is assumed. Anything that follows is ignored; the dumped output has the humanized date string appended.
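
A line of this form can be read as in the Python sketch below, which may be handy for checking a hand-edited dump before installing it. The function name parse_block_line is hypothetical, and the sketch assumes that the address is written as an ordinary dotted quad without embedded spaces, as in the sample listing later in this section. The magic header line is not handled here.

```python
import time

def parse_block_line(line, now=None):
    """Parse 'A.B.C.D [count [date [anything]]]' into (address, count, stamp)."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    octets = fields[0].split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not an IP quad: " + fields[0])
    # The hit count defaults to 1 if it is not given.
    count = int(fields[1]) if len(fields) > 1 else 1
    # The date stamp defaults to the present time if it is not given.
    if len(fields) > 2:
        stamp = int(fields[2])
    else:
        stamp = int(time.time()) if now is None else now
    # Any remaining fields (the human-readable date) are ignored.
    return fields[0], count, stamp
```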

Up to two optional integers can follow the d. The first is a minimum hit count. If given, only addresses with this many hits or more will be listed. The second is a time value, which can be provided if a hit count is also given. If positive, it is the number of days before the present time within which the most recent hit must have been recorded. If negative, its magnitude is the number of minutes before the present time within which the most recent hit must have been recorded. If 0 or not given, there is no time comparison and all addresses that match the hit count constraint will be listed.

Examples:

d25 > badones
This will list, in the file "badones", the source addresses from which 25 or more spams have been received.
d0 1 > badones
This will list all addresses which had a hit in the last day.
d10 -30 > badones
The list will contain addresses with 10 hits or more with a hit in the last half hour.
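
The selection rule behind these examples can be written out as a short Python sketch. The function name passes_dump_filter and its argument names are hypothetical; the logic follows the description of the two optional integers given above.

```python
import time

def passes_dump_filter(count, last_hit, mincnt=0, baktime=0, now=None):
    """Return True if an address with this hit count and last-hit time is listed.

    count    -- number of hits recorded for the address
    last_hit -- time of the most recent hit, in seconds since the epoch
    mincnt   -- minimum hit count (first integer after d)
    baktime  -- days if positive, minutes if negative, no time test if 0
    """
    now = time.time() if now is None else now
    if count < mincnt:
        return False
    if baktime > 0:
        window = baktime * 86400      # days, converted to seconds
    elif baktime < 0:
        window = -baktime * 60        # minutes, converted to seconds
    else:
        return True                   # no time comparison
    return now - last_hit <= window
```

Under this reading, "d10 -30" corresponds to mincnt=10 and baktime=-30, which keeps only addresses with at least 10 hits whose latest hit is no more than 30 minutes old.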

One may find that the bulk of spam received comes from a few sources, or "spamhauses". These are the paragons of the "email marketing industry" created for the purpose of flooding your inbox with garbage. As their sleazy proprietors give heavily to the Republican party, they have been allowed to expand and own ranges of source addresses, each address disgorging spam as fast as their fiber-optic connections can handle. Aspam does not know about the range, but over time, the individual addresses in the range should appear in the block list. If you do a d to dump the block list, you can see this. Your listing might contain something like

123.213.132.1
123.213.132.2
123.213.132.4
123.213.132.7
123.213.132.8
and so on. There are places on the net, such as www.spamhaus.org, where these addresses can be traced to the company, about which information is provided. Enterprising users may wish to use this information to initiate lawsuits, or pass it along to cousin Guido the "enforcer".

The address database has partial matching capability, so that addresses "close" to a known bad address will be identified, and a score added.

dm [redir]
Dump the sorted contents of the message id table. These are the ids of the messages that have been tested on a previous run.
dw [redir]
Dump the sorted contents of the word database. This is a table of words extracted from messages that is used for Bayes probability analysis.
dg [redir]
Dump the sorted contents of the good return database. This is a list of the return addresses from non-spam messages, and any message with a return address found in this list will be accepted as a normal message. If a spam return address finds its way into this list, spam from that source will no longer be blocked, so it is important to keep this list clean. Unlike the other database files, the good_table file in the .aspam directory that contains this database is a text file, and can be edited.

The following commands must be followed by a file name. The file is a list of messages in the same format as the inbox and spam box. These permit seeding of the word table from existing message collections.

an filename
For each message in the file, the words are extracted and added to the word table as words from normal (non-spam) messages.
as filename
For each message in the file, the words are extracted and added to the word table as words from spam messages.

The following commands must be followed by an IP address in numerical quad form.

r address
Remove the address from the local database. Aspam tries to determine the IP address where the spam originated, by examining the message headers. This address is added to the database, and any future messages originating from that address will be tagged as spam. However, clever spammers can add bogus headers, or the spam may have originated from an otherwise legitimate site that may be a source of "good" email. The r command allows "good" addresses that have gotten into the block list to be removed.
a address
Add the address to the block list database.
z address
Test the address against the block list database, and print the degree of matching. The degree of matching is the number of components of the address, starting from the left, that match an address in the database.
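
The degree of matching described for z can be computed as in the Python sketch below. The function name match_degree is hypothetical, and the database is represented as a plain list of dotted-quad strings purely for illustration; aspam stores its table in a binary format.

```python
def match_degree(address, database):
    """Return the largest number of leading components shared with any entry."""
    parts = address.split(".")
    best = 0
    for known in database:
        n = 0
        # Count matching components from the left, stopping at the first mismatch.
        for a, b in zip(parts, known.split(".")):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best
```

With "123.213.132.1" in the database, the address "123.213.132.9" has degree 3, and an exact match has degree 4.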

Interactive mode must be exited with q or t for any changes made with the r or a commands to be reflected in the disk files.

q
Quit, performing the normal update operations:
  1. The database files are updated.
  2. The message list is evaluated for Delete and AppendFile rules, and messages dealt with accordingly.
  3. The RemoveTo operation is performed on matching messages.
  4. Spam messages are removed from the inbox and placed in the spam box, unless they were already appended to a file or deleted.
t
Quit, updating the message, good return, and the non-spam word tables only. All messages remain in the inbox. Messages that are not spam will not be tested on the next run, since they have been duly recorded. Spam messages are effectively ignored, to be dealt with on a subsequent run.
x
Quit, making no changes.

Any command line that starts with an unrecognized character will print a synopsis of the commands available.

Effective Use of Interactive Mode

Spam Misidentified as Normal Mail

One very important use for interactive mode is to assist aspam in identifying spam messages by using the + command. For effective spam filtering, the user should periodically run aspam in interactive mode, using + on the spam messages that have scores less than 20. This will add the originating IP address to the blocking database, eliminating future spam from that source, and correctly update the word database. Most importantly, though, this will remove the spam return address from the good return database. If this is not done, future spam with that return address will not be blocked.

Thus, instead of simply deleting spam messages that aspam misses, the user should instead use aspam to mark the spam with + and update (i.e., use q to exit aspam). This will add the source addresses to the database, update the word and good return databases, and remove the spam from the inbox.

Normal Mail Misidentified as Spam

It is likely that sooner or later a desirable message will be found in the spam box. The spam box should always be carefully examined before deletion, for this reason. No anti-spam tool is perfect, but one should try to find out why the message was determined to be spam, and if possible make changes to prevent this from happening in the future.

The most important thing to do is to remove the message's source address from the local blocking database. As it stands, all future messages from this source will be flagged as spam. The following procedure can be used to revert the tables. This procedure does not put the message back into the inbox, so that the message should be read or saved before deleting the spam file.

Start aspam in interactive mode, and give S on the command line. This sets the message list context to the spam box. Find the message number of the message you want to keep, and apply the '-' command to this message number. This will remove the source address from the bad address database, back the message out of the word database, and re-add the words as from a normal message.

The critical step is now done, and one may use q to exit aspam at this point. It is entirely possible that the update to the word table fixed the problem for that message.

Before exiting, one can try to determine how the message was flagged as spam in the first place. Releases 2.4 and later append a block of text to each message in the spambox, which provides a summary of the test results. This can be viewed in interactive mode with the pf command, which must be given in the spambox context (S given). This will indicate the tests that triggered the spam categorization.

An alternate method repeats the testing, which can be useful as the aspamrc file is tweaked. Give the c command for the message. This will repeat all of the tests applied to the message. Then, the f command can be used to display the results. Note that the state of the word table is now different from the original test, so the result may be different. You may have to adjust the scoring or other parameters in the aspamrc file. Note that the "good" lists override the other tests, so adding the "from" address to the GoodPlaces will ensure that all mail with that return address will get through.

Finally, you probably want a copy of the message. Aspam cannot move a message from the spam box back into the inbox, but the message text can be dumped to a file with the p command and redirection, e.g.,

p msg_number > filename
The q or t commands should be used to exit aspam, and not x, so that the database files are updated.

For example (a true story), conference announcements from computer.org were ending up in the spam list. The procedure above was applied, leading to the information that computer.org was blacklisted by one of the DNSBL sites. I had two choices: 1) disable that DNSBL site as too paranoid, or 2) add computer.org to the GoodPlaces list. I chose the second option.

The Word Table and Probability Analysis

The algorithm used for probability analysis can be set with the WordAlgorithm keyword in the aspamrc file. If not given or used with argument 0, the Bayes theorem is used. If nonzero, an alternative algorithm is used.

Bayes Analysis

This is a probabilistic technique employed successfully in a number of implementations for filtering spam. The implementation used in aspam is still a work in progress, but seems to be pretty accurate. This section describes the algorithm currently employed in aspam.

The body of each message is mime-decoded if necessary to provide plain text or html text. The program understands base64 and quoted-printable encoding, and multi-part formats. This text is then tokenized into words. If the text is html, the html commands are stripped, somewhat carefully as some html tags are considered as token separators and some are not. Spammers sometimes put dummy tags within words to try to foil word recognition, but aspam will recognize this (or at least it should). The words found in the message are filtered: tokens that are 4-16 characters long and start with an alpha character are retained, anything else is thrown away. Punctuation, white space, and certain html tags separate tokens. The list is also filtered to remove duplicates.
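
The filtering step can be sketched in Python as follows. This covers only the token filter applied to already-decoded plain text; mime decoding and html tag stripping are left out. The function name extract_words is hypothetical. Treating every non-alphanumeric character as a separator and keeping the case of each token unchanged are assumptions, since the text does not spell out either detail.

```python
import re

def extract_words(text):
    """Return the de-duplicated tokens that pass the length and first-character test."""
    seen = set()
    words = []
    # Assumption: any run of non-alphanumeric characters separates tokens.
    for token in re.split(r"[^A-Za-z0-9]+", text):
        if 4 <= len(token) <= 16 and token[0].isalpha() and token not in seen:
            seen.add(token)
            words.append(token)
    return words
```

For example, the text "Buy cheap pills now, cheap pills! 4ever" yields only "cheap" and "pills": the other tokens are too short or start with a digit, and the repeats are dropped.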

The word database keeps track of the following: the number of normal message word lists added, the number of spam message word lists added, and for each word the number of normal messages that contain the word and the number of spam messages that contain the word.

After it is determined whether or not the message is spam, the word list is added to the word database, so that either the normal or spam counts are incremented.

With a sufficient number of word lists in the database, the database has predictive power for identifying spam messages. The prediction is disabled until 250 messages (spam plus normal) are in the word database. This is a guess as to how many messages are needed for accuracy; in theory, the more messages that are included, the better the accuracy. The interactive mode of aspam provides a means to read in collections of known spam or non-spam messages to build up this database.

The word list for an unknown message is tested against the database in the following manner. Considering one of the words in the word list, the database provides the following parameters:
ms The number of spam messages in the database
mn The number of normal messages in the database
ws The number of spam messages that contain word w
wn The number of normal messages that contain word w

From this we obtain:
p(w|s) The probability that the message contains word w, given that it is spam
= ws/ms
p(w|n) The probability that the message contains word w, given that it is normal
= wn/mn
p(s) The probability that a message is spam
= ms/(ms+mn)
p(w) The probability that a message contains word w
= (ws+wn)/(ms+mn)

In probability theory, Bayes Theorem provides an answer to problems such as the following:

The probability that a basketball player is 7 feet tall is p. The probability that a man is 7 feet tall is q. The probability that a man is a basketball player is r. Given that a man is 7 feet tall, what is the probability that the man is a basketball player?

Bayes Theorem states that

p(basketball player given 7 feet tall) * p(7 feet tall) =
   p(7 feet tall given basketball player) * p(basketball player)

Applying this to our problem,

p(w|s)*p(s) = p(s|w)*p(w)
p(s|w) = p(w|s)*p(s)/p(w) = (ws/ms) * (ms/(ms+mn)) / ((ws+wn)/(ms+mn))
= ws/(ws+wn)
= The probability that the message is spam, given that it contains word w

Note that (somewhat remarkably) the message counts drop out, and only the word counts are needed. However, we use the message counts to set limits on the probability:

pmin = 1/(mn + ms)
pmax = 1 - pmin

The probability computed using Bayes Theorem is limited by these values, to account for uncertainty.
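
The per-word computation, including the limits, is small enough to show as a Python sketch. The function name word_spam_probability is hypothetical and the argument names follow the symbols defined above. Returning 0.5 for a word with no counts at all is included here only as a guard against division by zero.

```python
def word_spam_probability(ws, wn, ms, mn):
    """Return p(s|w) = ws/(ws+wn), limited to the range [pmin, pmax]."""
    pmin = 1.0 / (mn + ms)
    pmax = 1.0 - pmin
    if ws + wn == 0:
        return 0.5                    # word never seen: no information
    p = ws / (ws + wn)
    return min(pmax, max(pmin, p))
```

With 600 spam and 400 normal messages in the database, a word seen in 30 spam and 10 normal messages gets 0.75, while a word seen only in spam is limited to 0.999 rather than 1.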

The Bayes probability is computed for each word in the unknown message word list. If the word is not found in the database, or has a total count of less than 5, it is assigned a probability of 0.5. The words are then sorted in descending order of the absolute difference of this probability from 0.5, and only the first 15 are used for further processing (or all of the words, if there are fewer than 15). Thus, we choose the 15 most "interesting" words, in the sense that they almost always or almost never appear in spam messages.
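
The selection of the most interesting words can be sketched in Python as below. The function name select_words is hypothetical, and the word table is represented as a dictionary mapping each word to a (spam count, normal count) pair purely for illustration. The limiting of each probability by pmin and pmax is left out to keep the sketch short.

```python
def select_words(words, table, limit=15, min_total=5):
    """Return up to `limit` (word, probability) pairs, most interesting first."""
    scored = []
    for w in words:
        ws, wn = table.get(w, (0, 0))
        # Unknown or rarely seen words carry no information.
        p = 0.5 if ws + wn < min_total else ws / (ws + wn)
        scored.append((w, p))
    # Sort by distance from the neutral value 0.5, largest first.
    scored.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    return scored[:limit]
```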

Let Wi be the Bayes probability for word i. The proper way to combine the probabilities is with the expression
p = W1*...*Wn / ( W1*...*Wn + (1-W1)*...*(1-Wn) )
= the probability that the message is spam
where n = the number of words considered (15).

If the result p > 0.9, aspam categorizes the message as spam.
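
The combining expression and the 0.9 threshold translate directly into Python, as in the sketch below. The function names combine and is_spam are hypothetical. The sketch relies on each word probability having already been limited to lie strictly between 0 and 1, as described above; otherwise the denominator could be zero.

```python
def combine(probabilities):
    """Combine per-word probabilities W1..Wn into one message probability."""
    spam = 1.0
    normal = 1.0
    for p in probabilities:
        spam *= p            # W1 * ... * Wn
        normal *= 1.0 - p    # (1-W1) * ... * (1-Wn)
    return spam / (spam + normal)

def is_spam(probabilities, threshold=0.9):
    """Return True if the combined probability exceeds the threshold."""
    return combine(probabilities) > threshold
```

Note that a list of entirely neutral words (all 0.5) combines to exactly 0.5, so such a message is not categorized as spam by this test.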

Alternative Algorithm

In use, the Bayes analysis produces a few false positives, perhaps 1-2 percent at our site. Consequently, it may be useful to try different algorithms, since false positives are undesirable.

Our site receives far more spam than "good" messages. In this case, there is some doubt whether the Bayes approach is the best. In the Bayes approach, the past frequency of spam is "built in", meaning that for us, there is a built-in bias that a neutral message is spam. I'm not sure that this is a good thing. The alternative algorithm has no such bias.

Recall that the probability that a message is spam given the presence in the message of some word is

p = ws/(ws + wn)
where
ws = number of spam messages in the database containing word
wn = number of normal messages in the database containing word

Consider a word that appears with equal frequency in spam and normal messages. Then, if there are far more spam messages in the database, ws must be much larger than wn, so p would indicate spam, which may or may not be true.

The alternative algorithm is

p = ps/(ps + pn)
where
ps = ws/ms
pn = wn/mn
ms = total number of spam messages in database
mn = total number of normal messages in database

In this case, if the frequency of the word is equal in spam and normal messages, pn = ps and the test is inconclusive, as it really should be based on this one piece of information.

The word probabilities are clipped and combined in the manner described above for Bayes analysis.
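
The difference between the two per-word formulas is easiest to see with numbers. The Python sketch below uses hypothetical function names, writes the per-word spam and normal message counts as ws and wn, and omits the clipping step.

```python
def bayes_p(ws, wn):
    """Bayes form: depends only on the raw word counts."""
    return ws / (ws + wn)

def alternative_p(ws, wn, ms, mn):
    """Alternative form: word counts are first normalized by message counts."""
    ps = ws / ms
    pn = wn / mn
    return ps / (ps + pn)
```

Take a database of 900 spam and 100 normal messages, and a word that appears in 10 percent of each (ws = 90, wn = 10). Then bayes_p(90, 10) gives 0.9, which suggests spam, while alternative_p(90, 10, 900, 100) gives 0.5, which is inconclusive.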

The asutil Utility

The package includes an additional executable, asutil, which uses the innards of aspam to provide some useful utility functions.

Usage: asutil [-f mailbox] [-h homedir] -r | -s | filter

The arguments are as follows:

-f mailbox
This specifies the inbox folder. The inbox has the same default as for the aspam program.
-h homedir
This specifies the location of the directory containing the .aspam directory, if other than the user's home directory, as for aspam.
Only one of the following should be given.
-r
The inbox is rewritten so that the messages are in most-recent to least-recent time order. The message time is obtained from the initial "From" line in the header.
-s
The inbox is rewritten so that the messages are in least-recent to most-recent time order. The message time is obtained from the initial "From" line in the header.
filter
The argument is a string giving a shell command, quoted if it contains white space. The text of each message in the inbox will be given to the command as standard input as the command is executed. The messages in the inbox are not otherwise touched.


Copyright © Whiteley Research Inc. 2004