Warning: This is a pre-production release. Although aspam is in use at Whiteley Research and successfully deals with the mountain of spam received every day, it has not been tested in other situations. It has been deployed on our FreeBSD and Red Hat Linux 9 mail servers only.
NOTE ADDED 5/20/08
Whiteley Research now uses industrial-strength spam removal tools.
aspam was an interesting programming exercise,
and may still be useful, or bits and pieces may be useful in other programs,
so we will continue to make it available.
Contents
What is aspam?
aspam Features
Installation
Invoking and Running aspam
The aspamrc File
Interactive Mode
Effective Use of Interactive Mode
The Word Table and Probability Analysis
The asutil utility
What is aspam?
The aspam program is a tool for
separating potential spam (unsolicited commercial) email messages from
other messages. It works by applying pattern matching and other tests
to each message found in the mail inbox, and assigning a score to each
message. Messages with a high enough score are removed from the inbox
and placed in a "spambox".
The aspam program is intended for use on Unix/Linux systems which provide mail delivery services. On such systems, mail is delivered to an "inbox" file for each user. The inbox file is simply a concatenation of the messages received for that user by the system. This file is read and manipulated by the user's mail client program, allowing the user to read, respond to, and dispose of individual messages in the file. In some cases, the entire file is transferred to another machine by a POP server, usually to support Windows users.
If run periodically or before the mail client or POP server is invoked, aspam will keep the inbox substantially free of spam messages, which have been placed in a "spambox" file. The user should check the spambox periodically for possible messages of interest that were misidentified as spam. If misidentified messages are found, there is a procedure described below which should be used to correctly update the internal tables. Otherwise, the spambox can be deleted immediately.
When a message is moved to the spambox, a block of text is added to
the message body which contains a tabulation of the test results, and
the score. If the message is subsequently read in again, for example
while in interactive mode, aspam will
ignore this block when retesting or printing. The interested user can
determine from this block why aspam
categorized the message as spam.
aspam Features
./configure
make depend
make
If not using gcc, the Makefile produced by the configure script will probably have to be tweaked. If the build fails, please notify Whiteley Research (aspam@wrcad.com), and we may be able to help out; otherwise consult with someone familiar with C/C++ programming. The build should work without modification on any FreeBSD, Linux or Solaris system with gcc installed.
The aspam file (the executable) should be moved to a place where the user keeps executable files, or to a system location such as /usr/local/bin.
Current releases (2.3 and later) are quite different from earlier releases (2.2 and before) with respect to the expected locations of startup and database files. The procedure below can be applied to perform the update.
This completes installation.
The existing startup file can be reused, but it too must be moved, and some editing is required to avoid warning messages. Perform the following steps to update the installation:
| NoBadAddrDatabase | don't use bad address table |
| NoWordDatabase | don't use word table |
| NoGoodReturnDatabase | don't use good return table |
Previously, these databases were disabled by giving the respective keyword without a file name argument. In the present release, the keywords above must be given to disable the feature.
This completes the update.
See the Unix manual pages for the cron and/or crontab commands to learn how to run aspam automatically. A typical installation may run aspam every half hour, for example. Then, only the spam messages received after the last aspam run will exist in the inbox.
Questions and feedback can be addressed to aspam@wrcad.com.
Ideas for new tests and capabilities, and feedback about
effectiveness, would be welcome.
Invoking and Running aspam
The aspam program will operate on the inbox file, or any file containing a concatenation of email messages, and may remove messages identified as spam to another file (the "spambox") which is also a list of messages in the same format as the inbox.
The program is invoked from the command line as follows:
aspam [-f inbox] [-s spambox] [-h home_dir] [-t] [-c | -d | -i]
The argument following -f is the inbox file. If not given, this defaults to the value given in the aspamrc file, and if not given there it defaults to the value of the MAIL environment variable, which is usually the user's system inbox. If the MAIL environment variable is not set, the default is to a file named "INBOX" in the current directory.
If you just want to experiment and not touch your inbox, copy your inbox to another file, and use the -f option to work from that file.
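The lookup order for the inbox file described above can be summarized in a short Python sketch. The function name resolve_inbox and its parameters are hypothetical and are not part of aspam; the sketch only mirrors the documented precedence (the -f argument, then the aspamrc value, then the MAIL environment variable, then "INBOX").

```python
import os


def resolve_inbox(cli_value=None, rc_value=None, environ=None):
    """Return the inbox path using the documented precedence.

    cli_value: the argument given after -f, or None (hypothetical parameter)
    rc_value:  the inbox value from the aspamrc file, or None (hypothetical)
    environ:   mapping of environment variables, defaults to os.environ
    """
    env = os.environ if environ is None else environ
    if cli_value:
        return cli_value
    if rc_value:
        return rc_value
    if env.get("MAIL"):
        return env["MAIL"]
    # Last resort: a file named INBOX in the current directory.
    return "INBOX"
```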
The argument following -s is a path to a file used for collecting spam messages. This will override the value set in the aspamrc file, if any. If not given, a file named "SPAM" in the current directory is used. The file will be created if necessary.
The aspamrc files and database files are located in a directory named ".aspam" in the user's home directory. The -h option can be used to specify another directory to be used to locate the .aspam directory. The home_dir is a full path to the directory that contains the .aspam directory.
Ordinarily, each message is tested only once. When a message is tested, its unique message id is saved in a database, so future testing will be skipped. If the -t option is given, this database will be ignored, and all messages in the queue will be freshly tested. The databases will be read, but not updated in this mode.
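As an illustration only, the bookkeeping described above might look like the following Python sketch. The function name and the use of an in-memory set are assumptions; the actual on-disk format of the message id database is not described here.

```python
def select_and_record(message_ids, seen_ids, retest_all=False):
    """Return the message ids to test in this run.

    message_ids: ids of the messages currently in the queue
    seen_ids:    set of ids tested in earlier runs (stand-in for the database)
    retest_all:  True corresponds to the -t option
    """
    if retest_all:
        # -t: ignore the id database, test everything, and update nothing.
        return list(message_ids)
    to_test = [mid for mid in message_ids if mid not in seen_ids]
    # Remember these ids so that later runs skip them.
    seen_ids.update(to_test)
    return to_test
```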
If -c is given, aspam will check for corruption in the inbox and spambox, and perform repairs if necessary. After checking/repairing, aspam exits.
It was discovered (release 2.3 time frame) that corruption was occurring in the inbox and spambox files. What was observed was that occasionally (about one in a hundred messages) a message would get truncated by a few characters, so that the "From" line that starts the next message would not appear at the beginning of a line, preceded by a blank line. In this case, the second message would be hidden, acting as if it were part of the first. Messages were getting lost.
An audit of the aspam code and hours of experiments failed to reproduce the effect. It is possible that aspam was not causing the problem: if the local mail distribution program is writing a corrupted inbox, that corruption would be transferred to the spambox as the "hidden" messages tag along. Nevertheless, aspam was still suspect, being the only "new" code in the chain.
To address the problem, the functions that read/write the inbox and spambox were rewritten to be as paranoid as possible. Checking is now performed, and corruption, if found, is repaired and reported in a log file.
NOTE added: The problem was caused by null bytes in message body text, found in spam messages. Non-printing characters should not appear in messages, but for some reason on our system they do. I don't know if this is deliberate on the part of the spammer for some purpose, or an error. Non-printing characters are now converted to space characters when messages are processed in aspam, avoiding this problem.
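A minimal Python sketch of this kind of cleanup is shown below. It is not the aspam source. The exact set of characters that aspam treats as non-printing is not stated, so the sketch assumes ASCII control characters other than tab, line feed, and carriage return, and leaves bytes above 0x7F alone.

```python
# Control characters that are legitimate in message text (assumption).
KEEP = {0x09, 0x0A, 0x0D}


def sanitize(data: bytes) -> bytes:
    """Replace null bytes and other ASCII control characters with spaces."""
    return bytes(
        b if (b >= 0x20 and b != 0x7F) or b in KEEP else 0x20
        for b in data
    )
```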
If -d is given, no files are modified, but the messages found in the inbox are listed, along with the "score" and test result details. The score is determined for each message according to the criteria set by the aspamrc file. Messages with a score of 20 or higher are considered spam. The message id database is ignored (as if -t was given), so that all messages in the list will be tested. Again, there is no update to anything when -d is used, and spam messages are not removed from the inbox.
If -i is given, aspam enters an interactive mode.
Although aspam can be run directly from the command line, once aspam is "trained", it is more effective to run aspam periodically through cron(1) or at(1). The user is only required to check the spambox file periodically for any misidentified messages and delete the file.
The effectiveness of aspam is largely dependent on the past history of messages tested, as the databases acquire more information. It is also dependent on the keyword tests defined in the aspamrc file, particularly before much past history is accumulated. These tests should be tuned by the user, as they depend specifically on the type of messages that the user receives. The aspamrc file supplied was written for the author's email, which probably is a bit different from yours. Thus, the user should expect to spend some time experimenting and tweaking the aspamrc file for best, or even adequate, results.
In the initial phase of using aspam, the
user should modify the aspamrc file to catch spam messages
initially not identified, and to prevent messages that are actually
wanted from being identified as spam. After a few days of this
"training", aspam should be effective at
correctly identifying most spam, though occasional updates to the
aspamrc file will probably be necessary as new types of spam
arrive. As the databases acquire more information, aspam should become increasingly effective,
with less attention required of the user.
The aspamrc File
The aspamrc file is read when aspam starts, and provides
initialization. It is located in the .aspam directory, which
is usually located in the user's home directory.
Aspam provides three main categories of tests that are performed on email messages, representing three "lines of defense" against spam intrusion.
The aspamrc file is used to set the parameters that control these tests, and other factors that control aspam operation. The aspamrc file found in the distribution is a prototype that can be used as a starting point. For effectiveness this file will be heavily customized by the user, as many of the tests depend on the exact nature of the email that the user receives. The example file is heavily commented, and can be modified with any text editor.
The format is rather simple. There are a number of keywords, which are followed by data that you supply. A keyword will only be recognized if it starts in the first column. Some keywords have a single data item that follows immediately on the same line, separated by a space or tab. Other keywords introduce a list of words, which appear in following lines. Each of these lines must start with a tab or space, and the words are separated by a tab or space. Blank lines, and lines starting with the '#' character, are ignored by aspam.
Lines in the aspamrc file can be continued using the backslash continuation method. If a line ends with a backslash ('\'), it will be joined to the following line (replacing the backslash) yielding a single logical line.
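The format rules above are simple enough to sketch in a few lines of Python. This is an illustrative reader, not the aspam parser: the function names are hypothetical, and it makes the simplifying assumption that repeated keywords are merged into one list, which may not match how aspam treats keywords that can legitimately appear more than once.

```python
def join_continuations(text):
    """Join physical lines ending in a backslash into logical lines."""
    logical, pending = [], ""
    for line in text.splitlines():
        if line.endswith("\\"):
            pending += line[:-1]      # drop the backslash, keep the text
            continue
        logical.append(pending + line)
        pending = ""
    if pending:
        logical.append(pending)
    return logical


def parse_rc(text):
    """Return a dict mapping each keyword to its list of data words."""
    settings = {}
    current = None
    for line in join_continuations(text):
        if not line.strip() or line.startswith("#"):
            continue                  # blank lines and comments are ignored
        if line[0] in " \t":
            # An indented line continues the word list of the last keyword.
            if current is not None:
                settings[current].extend(line.split())
            continue
        # A keyword is only recognized in the first column.
        keyword, *rest = line.split()
        current = keyword
        settings.setdefault(keyword, []).extend(rest)
    return settings
```

Feeding this a file containing a GoodPlaces keyword followed by indented lines of domains would yield one list of all the domains under the key "GoodPlaces".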
The lists of words are the areas most likely updated by the user. The user is also free to fine-tune the numeric parameters as necessary. The numeric parameters control the weighting of associated tests. It should be remembered that if a message accumulates a score of 20 or more, the message is considered spam.
There are four database files used by aspam. These are typically created and maintained by aspam in the .aspam directory located in the user's home directory.
These databases are:
The source IP address is not always the sender's IP address, as resolved from the sender's domain. The source address is determined from the "Received:" message headers, if possible.
Unlike the other databases, the bad address database can be shared between multiple users, thus if one user receives spam from some source, the other users will be protected from spam from that source.
A locking scheme is employed if the BadAddrDatabase file name is given in the aspamrc file. In this case, it is assumed that several users are sharing the same database file. A file with the same name and path as the database file, but with a ".lock" extension performs the locking. This file will be created if necessary, and automatically updated by the programs. It contains the process id of the owning process when the lock is in force, or "0" or is empty when the lock is not in force. The file must be writable by all users. It may have to be initially created by root using "touch" if the containing directory does not give users write permission - i.e., users cannot create a file in the directory but can update an existing file.
If BadAddrDatabase is not given, the table is located in the user's .aspam directory and is assumed to not be shared, and no locking is done.
The file is locked when the database is read into memory, and unlocked when the file is rewritten from memory, or the program exits. When locked, it can not be opened by another copy of aspam. If aspam finds the file locked, it will go to sleep and try again every five seconds. Note that long sessions in interactive mode will lock out all other aspam processes for the duration.
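The lock file protocol described above can be sketched as follows. This is a simplified Python illustration with hypothetical function names, not the aspam implementation. In particular, the check and the write are separate steps here, so two processes could in principle both acquire the lock; how aspam guards against that is not described in this document.

```python
import os
import time


def read_owner(lock_path):
    """Return the pid recorded in the lock file, or 0 if unlocked."""
    try:
        with open(lock_path) as fh:
            text = fh.read().strip()
    except FileNotFoundError:
        return 0
    # An empty file or "0" means the lock is not in force.
    return int(text) if text.isdigit() else 0


def acquire(lock_path, pid=None, retry_seconds=5.0, sleep=time.sleep):
    """Wait until the lock is free, then record our pid in the lock file."""
    pid = os.getpid() if pid is None else pid
    while read_owner(lock_path) not in (0, pid):
        sleep(retry_seconds)          # locked by someone else: try again later
    with open(lock_path, "w") as fh:
        fh.write(str(pid))


def release(lock_path):
    """Mark the lock as not in force."""
    with open(lock_path, "w") as fh:
        fh.write("0")
```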
The IP address of the sending machine, obtained from the headers, is checked against the bad address database, and optionally against external DNSBL databases of known spam sites. The local database contains a list of the addresses of the sending machines of all spam messages seen by aspam. Since much spam originates from the same sources, this effectively eliminates a great deal of spam.
It is possible to prevent addresses from being added to the bad address database.
Addresses already in the database that match this list will be removed from the database the next time it is rebuilt, i.e., the next time aspam is run normally.
The BadAddrDatabase saves a time stamp for the latest hit for each address. In order to limit the growth of this database and to maintain efficiency, a mechanism is provided for purging old entries. The aspamrc file can contain lines of the form below.
If no BadAddrPurge specifications appear, all addresses will be saved forever.
The following keywords apply to the testing associated with the word database.
There is a big problem with use of these keywords, however. Mail that is delivered in response to a BCC (blind carbon-copy) field will fail the test, unless the "real" recipient is also listed.
stevew stevew@chaucer stevew@chaucer.wrcad.com stevew@wrcad.com
The first group of tests generally operate on the sender's address, or name@address. The "sender" is the name@address obtained from the initial From line of the message (before the headers).
The "sender" may not be the same address to which a reply to the message is directed. The message headers may provide a different return address from the sender's address. There may be occasions where the sender's address is flagged as a spam site, though messages to be returned to a particular user or site that originate at the sender's site may not be spam. This can happen if legitimate mail is sent through a bulk-email distributer that also sends out spam. The return address is the address found in the "Reply-To:" or "From:" headers (sought in that order).
Words in the lists of this section match as a suffix, i.e., "abc.com" will match "joe@spammer.abc.com", case-insensitive. If the word begins with '*', then a match will occur if the rest of the word is found anywhere in the string, e.g., "*offers" will match "fred@directoffers.com" or "bill@offers4u.spammer.com" or "bestoffers@yahoo.com". Forms like joe@yahoo.com are accepted as well, which can be useful in the GoodPlaces or VeryBadPlaces lists to always allow or not allow a particular sender.
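The matching rule just described can be written as a small Python function. The name place_matches is hypothetical. The sketch takes the description literally, so a plain suffix test is used; whether aspam additionally requires the match to start at a "." or "@" boundary is not stated.

```python
def place_matches(word, address):
    """Return True if a list word matches an address, case-insensitively.

    A leading '*' means the rest of the word may appear anywhere in the
    address; otherwise the word must match as a suffix.
    """
    word = word.lower()
    address = address.lower()
    if word.startswith("*"):
        return word[1:] in address
    return address.endswith(word)
```

With this sketch, place_matches("abc.com", "joe@spammer.abc.com") and place_matches("*offers", "bestoffers@yahoo.com") are both true.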
Unconditional acceptance always supersedes spam characteristics, i.e., unconditional acceptance tests are performed first, and if the message passes no further testing is done.
This enables messages from a particular sender to be routed to a special file. It solves a particular problem: an evil spammer discovers that an anti-spam tool is available on wrcad.com. So, the spammer tries to retaliate by using "wrcad.com" in the bogus return address applied to a gazillion spams sent to aol.com. Huge numbers of these bounce since the recipient address is often incorrect, and end up in the mailbox of wrcad.com, from "MAILER_DAEMON@aol.com". This keyword enables filtering of these into a separate file.
The "name" is the user id field of the sender. There are tests that examine the name for spam characteristics.
| * | Ordinarily a word given must match exactly a word in the string. If '*' precedes the word, then it is allowed to match part of a word in the string, e.g., "*free" would match "freedom" and "carefree". |
| ^ | The word must match the leading word in the subject string. If '*' is also given, the word must match the leading non-space character in the subject string and the characters that follow. The first character in the word should not be a space. |
| @ | If '@' is found, the match testing will be case-sensitive. |
If the subject text is encoded via RFC 2047, a score may be added. Encoding is necessary for non-Western-European character sets, but can also be used with Western character sets. If the subject line looks like a mess when the message is viewed in a text editor or in a non-decoding mail client, then it is probably encoded. In many mail clients, the decoding is transparent to the user. Spammers like to encode the subject line to try and foil spam detection software. Some mail programs in the Microsoft world encode the subject line if the user so specifies, or perhaps by default.
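For readers who want to see what an encoded subject looks like, the following Python sketch detects and decodes RFC 2047 "encoded words" using the standard library. It illustrates the encoding only and says nothing about how aspam itself performs the test; the function names are hypothetical.

```python
import re
from email.header import decode_header

# An encoded word has the form =?charset?B-or-Q?encoded-text?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def is_rfc2047_encoded(subject):
    """Return True if the subject contains at least one encoded word."""
    return ENCODED_WORD.search(subject) is not None


def decode_subject(subject):
    """Return the subject with any encoded words decoded to text."""
    parts = []
    for chunk, charset in decode_header(subject):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)
```

For example, the subject "=?utf-8?B?SGVsbG8=?=" is flagged as encoded and decodes to "Hello".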
During the initial "training" procedure, one can use aspam with the -d option, which will list the messages found in the inbox along with the score. Words can be added to the aspamrc file to more correctly recognize the spam, and to allow legitimate messages that might be incorrectly scored as spam (score is 20 or more). After the change, aspam -d can be run again to verify the results. It is probably not possible to obtain perfect accuracy, but a carefully crafted aspamrc file should be quite effective. After any changes, aspam can be run without -d, and the messages identified as spam will be removed from the inbox and appended to the spambox, which defaults to a file named "SPAM" in the current directory. After making sure that this file does not contain any legitimate messages, this file can be deleted. After a period of training, one can set up the Unix cron or at programs to run aspam periodically. This should keep the user's inbox substantially free of spam. The user should periodically check and delete the spambox.
Rule expression
Action action [argument]
The construct exists on two sequential logical lines, but each logical line may consist of more than one physical line with backslash continuation used to logically join the lines. Both the Rule and Action logical lines must be present.
Any number of these constructs can appear in the aspamrc file. For each message, each expression is evaluated, and if true the corresponding action is performed.
The expression follows the keyword Rule, and must exist entirely on the same logical line. As with any line in the aspamrc file, physical lines can be continued with the backslash continuation method.
The expression is an algebraic form similar to an expression in C. The following tokens are understood.
| Logical | |
|---|---|
| & | And |
| | | Or |
| ! | Not |
| Arithmetic | |
|---|---|
| + | Add |
| - | Subtract or unary |
| * | Multiply |
| / | Divide |
| % | Modulus |
| ^ | Power (not implemented) |
| Relational | |
|---|---|
| = | Equal |
| != | Not equal |
| > | Greater than |
| >= | Greater than or equal |
| < | Less than |
| <= | Less than or equal |
The following variable names are known. These are recognized case-independently.
| Name | Type | Description |
|---|---|---|
| | int | Message score |
| HITS | int | Number of previous spams from message source IP |
| TESTED | bool | Message was tested this run |
| PREV_TEST | bool | Message was tested in previous run |
| DUPLICATE | bool | Duplicate spam message |
| GOOD_PLACE_DB | bool | Sender found in good places database |
| GOOD_PLACE | bool | Keyword match to GoodPlaces |
| GOOD_WORD | bool | Keyword match to GoodWords |
| ADV | bool | Subject starts with "adv:" |
| IN_DB | int | 0,2,3,4 component match to bad address database |
| BLOCKED | int | 1-based block list affirmation if IN_DB < 4 |
| BOGUS_NAME | bool | Bad sender username |
| BOGUS_PLACE | bool | Bad sender domain |
| PLACE_DIFF | bool | Sender domain string is not IP address |
| VERY_BAD_PLACE | bool | Keyword match to VeryBadPlaces |
| BAD_PLACE | bool | Keyword match to BadPlaces |
| SUSPICIOUS_PLACE | bool | Keyword match to SuspiciousPlaces |
| BAD_RECIP | bool | Recipient not in list |
| NAME_LENGTH | bool | Sender username too long |
| NAME_DIGIT_FRAC | bool | Sender username has too many digits |
| BAD_WORD | int | Count of keyword matches to BadWords |
| SUSPICIOUS_WORD | int | Count of keyword matches to SuspiciousWords |
| HTML_FORMAT | bool | Message is in html format only |
| BOGUS_RELAY | bool | Received-from header spoofing detected |
| WORD_DB | bool | Word probability analysis indicates spam |
The Action line must follow the Rule line, and consists of a keyword from the list below, following "Action", which may be followed by additional text for certain actions.
Action keywords:
Examples:
Rule DUPLICATE | VERY_BAD_PLACE | HITS > 25
Action Delete
If the message is a duplicate or has a match in the VeryBadPlaces list, or if we have seen more than 25 spams from the same source, it will not be saved in the spam box.
Rule SPAM > 40 & (IN_DB = 4 | BLOCKED)
Action AppendFile /home/user/crud
If the message was blocked and has score > 40, append it to the "crud" file.
Rule BAD_RECIP & BAD_PLACE
Action AddScore 20
Make sure that messages to a bad recipient from a bad place are tagged
as spam.
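To make the evaluation model concrete, the three example rules above are restated here as Python predicates over a dictionary of variable values. This is only a sketch of the "evaluate each expression, collect the actions whose expression is true" idea. It assumes C-like precedence (relational operators bind tighter than & and |), and it does not model whether an AddScore action affects rules evaluated later. The names RULES and matching_actions are hypothetical.

```python
# Each entry pairs a predicate with an (action keyword, argument) tuple.
RULES = [
    # Rule DUPLICATE | VERY_BAD_PLACE | HITS > 25
    (lambda v: v["DUPLICATE"] or v["VERY_BAD_PLACE"] or v["HITS"] > 25,
     ("Delete", None)),
    # Rule SPAM > 40 & (IN_DB = 4 | BLOCKED)
    (lambda v: v["SPAM"] > 40 and (v["IN_DB"] == 4 or v["BLOCKED"]),
     ("AppendFile", "/home/user/crud")),
    # Rule BAD_RECIP & BAD_PLACE
    (lambda v: v["BAD_RECIP"] and v["BAD_PLACE"],
     ("AddScore", "20")),
]


def matching_actions(variables, rules=RULES):
    """Return the actions whose rule expression is true for one message."""
    return [action for test, action in rules if test(variables)]
```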
Interactive Mode
Giving "-i" on the aspam command line
places aspam in interactive mode. In
interactive mode, aspam responds to a
number of commands to perform various tasks. These are described
below.
The commands generally operate on a message list, which is a list of the messages currently in the user's inbox or spam box. When aspam starts, it reads the inbox and creates a list of descriptors for the messages. It is possible that while using aspam, the system will append additional messages to the inbox. Such messages are not accessible to aspam until aspam is restarted. These messages will not be disturbed when aspam moves spam out of the inbox. However, if the inbox file changes between the time that aspam starts and interactive mode is exited, such as if another copy of aspam was run non-interactively and spam messages were removed, the update will not be done, and a message will indicate that the file changed asynchronously. This should be borne in mind if aspam is run periodically in the background.
Each message in the list is given a sequential message number starting with one. Several of the commands take a message number or range of message numbers as an "argument". The possibilities are
| no range given | The range used is the previous range given, or the range starts at the message following the last range given, depending on the command. If there was no range previously given, the range is the first message, in either case. |
| number | This implies the message corresponding to the number. |
| number1-number2 | This implies the messages in the range of number1 through number2. The number2 can be the dollar sign ('$') character to indicate the last message in the list. |
The command prompt will indicate the current default range, i.e., the range that will be used or advanced if no range is given to a command.
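The range forms above could be parsed along the following lines. This Python sketch is an illustration with a hypothetical function name. Because the meaning of an omitted range depends on the command, the caller supplies the default range to use in that case.

```python
def parse_range(text, message_count, default):
    """Parse 'number', 'number1-number2', or 'number1-$' into (first, last).

    text:          the range argument as typed, possibly empty
    message_count: number of messages in the list ('$' maps to this)
    default:       (first, last) tuple to use when no range is given
    """
    text = text.strip()
    if not text:
        return default
    first, sep, last = text.partition("-")
    start = int(first)
    if not sep:
        end = start                   # a single message number
    elif last.strip() == "$":
        end = message_count           # '$' means the last message
    else:
        end = int(last)
    if not (1 <= start <= end <= message_count):
        raise ValueError("message range out of bounds: %r" % text)
    return (start, end)
```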
Some of the commands allow output redirection. This enables the command output to be placed in a file, or processed by a system command. The default operation if no redirection is specified is to display the output on the screen, after processing with the system pager command. The system pager command is the value of the PAGER environment variable, or the command "more" if this is not set. The possible redirection forms, which must appear at the end of a command line, are
| > filename | Write the output to filename. |
| >> filename | Append the output to filename, which will be created if it does not exist. |
| | command | Pipe the output through command. |
In the table below, the square brackets imply that the enclosed item is optional. The range and redir are the constructs described above.
In the listings, the first line gives the message number and the sender's user name and address (which are probably bogus for spam). The second line gives the return address obtained from the "Reply-To:" or "From:" header, if found, which may be different. The third line is a little different depending on whether the message was read from the inbox or the spam box. If from the spam box, the sender's IP address, if it can be determined from the message headers, is listed. If the message was read from the inbox, the third line gives the message score, possibly a short code string, and the originating IP address as determined from the message headers. If the address was found in the local block list, a D will appear before the address. If the address was determined to be a spam source from a DNSBL query, a B followed by an integer will appear ahead of the address. The integer indicates which of the DNSBL servers returned the indication, and is 1 or larger corresponding to the Blocklist entries in the aspamrc file. The fourth line gives the message subject text. The l command will show additional header lines, and the f command will show a breakdown of the test results.
For the following commands, if no range is given, the previous range will be used.
The remaining commands do not operate on the message list and are thus insensitive to the current message range.
A . B . C . D [count [date [anything]]]
An IP-quad must appear on the line. This is followed by an optional hit count, which will be taken as 1 if not given. A date stamp can appear after the hit count. This is in the format as returned from the Unix time(3) command, and if not given the present time is assumed. Anything that follows is ignored - the dumped output has the humanized date string appended.
Up to two optional integers can follow the d. The first is a minimum hit count. If given, only addresses with this many hits or more will be listed. The second is a time value, which can be provided if a hit count is also given. If positive, it represents the time in days prior to the present time that the most recent hit must have been recorded. If the value is negative, it represents the number of minutes prior to the present time that the most recent hit must have been recorded. If 0 or not given, there is no time comparison and all addresses that match the hit count constraint will be listed.
Examples:
d25 > badones
This will list the source addresses from which 25 or more spams have been received in file "badones".
d0 1 > badones
This will list all addresses which had a hit in the last day.
d10 -30 > badones
The list will contain addresses with 10 hits or more with a hit in the last half hour.
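The selection logic of the d command, as described above, can be sketched in Python. The data layout (a mapping from address to hit count and Unix time of the last hit) and the function name are assumptions made for the illustration, not the actual database format.

```python
import time


def select_for_dump(entries, min_hits=0, age=0, now=None):
    """Return the addresses that a dump with these arguments would list.

    entries:  mapping of address -> (hit_count, last_hit_unix_time)
    min_hits: first integer after d (minimum hit count)
    age:      second integer after d (days if positive, minutes if negative)
    """
    now = time.time() if now is None else now
    if age > 0:
        window = age * 86400          # positive value: days
    elif age < 0:
        window = -age * 60            # negative value: minutes
    else:
        window = None                 # 0 or omitted: no time comparison
    selected = []
    for address, (hits, last_hit) in entries.items():
        if hits < min_hits:
            continue
        if window is not None and now - last_hit > window:
            continue
        selected.append(address)
    return selected
```

In these terms, "d10 -30" corresponds to select_for_dump(entries, 10, -30).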
One may find that the bulk of spam received comes from a few sources, or "spamhauses". These are the paragons of the "email marketing industry" created for the purpose of flooding your inbox with garbage. As their sleazy proprietors give heavily to the Republican party, they have been allowed to expand and own ranges of source addresses, each address disgorging spam as fast as their fiber-optic connections can handle. Aspam does not know about the range, but over time, the individual addresses in the range should appear in the block list. If you do a d to dump the block list, you can see this. Your listing might contain something like
123.213.132.1
123.213.132.2
123.213.132.4
123.213.132.7
123.213.132.8
and so on. There are places on the net, such as www.spamhaus.org, where these addresses can be traced to the company, about which information is provided. Enterprising users may wish to use this information to initiate lawsuits, or pass it along to cousin Guido the "enforcer".
The address database has partial matching capability, so that addresses "close" to a known bad address will be identified, and a score added.
The following commands must be followed by a file name. The file is a list of messages in the same format as the inbox and spam box. These permit seeding of the word table from existing message collections.
The following commands must be followed by an IP address in numerical quad form.
Interactive mode must be exited with q or t for any changes made with the r or a commands to be reflected in the disk files.
Any command line that starts with an unrecognized character will print
a synopsis of the commands available.
Effective Use of Interactive Mode
Thus, instead of simply deleting spam messages that aspam misses, the user should instead use
aspam to mark the spam with + and update
(i.e., use q to exit aspam).
This will add the source addresses to the database, update the word
and good return databases, and remove the spam from the inbox.
Normal Mail Misidentified as Spam
It is likely that sooner or later a desirable message will be found in the spam box. The spam box should always be carefully examined before deletion, for this reason. No anti-spam tool is perfect, but one should try to find out why the message was determined to be spam, and if possible make changes to prevent this from happening in the future.
The most important thing to do is to remove the message's source address from the local blocking database. As it stands, all future messages from this source will be flagged as spam. The following procedure can be used to revert the tables. This procedure does not put the message back into the inbox, so that the message should be read or saved before deleting the spam file.
Start aspam in interactive mode, and give S on the command line. This sets the message list context to the spam box. Find the message number of the message you want to keep, and apply the '-' command to this message number. This will remove the source address from the bad address database, back the message out of the word database, and re-add the words as from a normal message.
The critical step is now done, and one may use q to exit aspam at this point. It is entirely possible that the update to the word table fixed the problem for that message.
Before exiting, one can try to determine how the message was flagged as spam in the first place. Releases 2.4 and later append a block of text to each message in the spambox, which provides a summary of the test results. This can be viewed in interactive mode with the pf command, which must be given in the spambox context (S given). This will indicate the tests that triggered the spam categorization.
An alternate method repeats the testing, which can be useful as the aspamrc file is tweaked. Give the c command for the message. This will repeat all of the tests applied to the message. Then, the f command can be used to display the results. Note that the state of the word table is now different from the original test, so that result may be different. You may have to adjust the scoring or other parameters in the aspamrc file. Note that the "good" lists override the other tests, so adding the "from" address to the GoodPlaces will ensure that all mail with that return address will get through.
Finally, you probably want a copy of the message. Aspam can not move a message from the spam box back into the inbox, but the message text can be dumped to a file with the p command and redirection, e.g.,
p msg_number > filename
The q or t commands should be used to exit aspam, and not x, so that the database files are updated.
For example (a true story), conference announcements from
computer.org were ending up in the spam list. The procedure
above was applied, leading to the information that
computer.org was blacklisted by one of the DNSBL sites. I
had two choices: 1) disable that DNSBL site as too paranoid, or 2)
add computer.org to the GoodPlaces list. I chose
the second option.
The Word Table and Probability Analysis
The algorithm used for probability analysis can be set with the
WordAlgorithm keyword in the aspamrc file.
If the keyword is not given, or is given with argument 0, Bayes Theorem is used.
If the argument is nonzero, an alternative algorithm is used.
The body of each message is mime-decoded if necessary to provide plain text or html text. The program understands base64 and quoted-printable encoding, and multi-part formats. This text is then tokenized into words. If the text is html, the html commands are stripped, somewhat carefully, as some html tags are considered token separators and some are not. Spammers sometimes put dummy tags within words to try to foil word recognition, but aspam will recognize this (or at least it should). The words found in the message are filtered: tokens that are 4-16 characters long and start with an alpha character are retained, and anything else is thrown away. Punctuation, white space, and certain html tags separate tokens. The list is also filtered to remove duplicates.
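The filtering rules above can be sketched in a few lines of Python. This is only an illustration, not the aspam code: the function name extract_words is hypothetical, the input is assumed to be already decoded plain text (mime decoding and html tag handling are omitted), and treating every character other than a letter or digit as a separator, and keeping case as found, are assumptions.

```python
import re

MIN_LEN, MAX_LEN = 4, 16  # token length limits stated in the text


def extract_words(text):
    """Return the retained tokens of a plain-text body, without duplicates."""
    # Assumption: any run of characters other than letters and digits
    # (punctuation, white space) separates tokens.
    tokens = re.split(r"[^A-Za-z0-9]+", text)
    seen = set()
    words = []
    for tok in tokens:
        # Keep tokens of 4-16 characters that start with a letter.
        if MIN_LEN <= len(tok) <= MAX_LEN and tok[0].isalpha():
            if tok not in seen:  # filter duplicates, keep first occurrence
                seen.add(tok)
                words.append(tok)
    return words


print(extract_words("Buy cheap pills now! Cheap pills, cheap 4you2 offer."))
```

Here "Buy" and "now" are too short, "4you2" does not start with a letter, and the repeated words are dropped, leaving cheap, pills, Cheap, and offer.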
The word database keeps track of the following: the number of normal message word lists added, the number of spam message word lists added, and for each word the number of normal messages that contain the word and the number of spam messages that contain the word.
After it is determined whether or not the message is spam, the word list is added to the word database, so that either the normal or spam counts are incremented.
With a sufficient number of word lists in the database, the database has predictive power for identifying spam messages. The prediction is disabled until 250 messages (spam plus normal) are in the word database. This is a guess as to how many messages are needed for accuracy. In theory, the more messages that are included, the better the accuracy. The interactive mode of aspam provides a means to read in collections of known spam or non-spam messages to build up this database.
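The bookkeeping described above might be modeled as follows. This is a minimal sketch under stated assumptions, not the aspam implementation: the class WordTable and its methods are hypothetical names, the on-disk format is ignored, and enabling prediction at exactly 250 messages (rather than above 250) is an assumption.

```python
from collections import defaultdict

MIN_MESSAGES = 250  # prediction threshold stated in the text


class WordTable:
    """Hypothetical in-memory model of the word database counters."""

    def __init__(self):
        self.ms = 0                  # spam message word lists added
        self.mn = 0                  # normal message word lists added
        self.ws = defaultdict(int)   # word -> spam messages containing it
        self.wn = defaultdict(int)   # word -> normal messages containing it

    def add(self, words, is_spam):
        """Add one message's word list as spam or as normal."""
        counts = self.ws if is_spam else self.wn
        if is_spam:
            self.ms += 1
        else:
            self.mn += 1
        for w in set(words):         # each word counts once per message
            counts[w] += 1

    def remove(self, words, is_spam):
        """Back a previously added word list out of the table."""
        counts = self.ws if is_spam else self.wn
        if is_spam:
            self.ms = max(0, self.ms - 1)
        else:
            self.mn = max(0, self.mn - 1)
        for w in set(words):
            counts[w] = max(0, counts[w] - 1)

    def can_predict(self):
        """Assumption: prediction is enabled once 250 messages are present."""
        return self.ms + self.mn >= MIN_MESSAGES


table = WordTable()
table.add(["cheap", "pills"], is_spam=True)
table.add(["meeting", "cheap"], is_spam=False)
print(table.ms, table.mn, table.ws["cheap"], table.wn["cheap"], table.can_predict())
```

Calling remove with is_spam=True followed by add with is_spam=False on the same word list corresponds to backing a misidentified message out of the table and re-adding it as a normal message.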
The word list for an unknown message is tested against the database in the following manner. Considering one of the words in the word list, the database provides the following parameters:
| ms | The number of spam messages in the database |
| mn | The number of normal messages in the database |
| ws | The number of spam messages that contain word w |
| wn | The number of normal messages that contain word w |
From this we obtain:
| p(w|s) | The probability that the message contains word w, given that it is spam = ws/ms |
| p(w|n) | The probability that the message contains word w, given that it is normal = wn/mn |
| p(s) | The probability that a message is spam = ms/(ms+mn) |
| p(w) | The probability that a message contains word w = (ws+wn)/(ms+mn) |
In probability theory, Bayes Theorem provides an answer to problems such as the following:
The probability that a basketball player is 7 feet tall is p. The probability that a man is 7 feet tall is q. The probability that a man is a basketball player is r. Given that a man is 7 feet tall, what is the probability that the man is a basketball player?
Bayes Theorem states that
p(basketball player given 7 feet tall) * p(7 feet tall) =
p(7 feet tall given basketball player) * p(basketball player)
Applying this to our problem,
p(w|s)*p(s) = p(s|w)*p(w)
| p(s|w) | = p(w|s)*p(s)/p(w) = (ws/ms) * (ms/(ms+mn)) / ((ws+wn)/(ms+mn)) = ws/(ws+wn) = The probability that the message is spam, given that it contains word w |
Note that (somewhat remarkably) the message counts drop out, and only the word counts are needed. However, we use the message counts to set limits on the probability:
pmin = 1/(mn + ms)
pmax = 1 - pmin
The probability computed using Bayes Theorem is limited by these values, to account for uncertainty.
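As a worked illustration of the per-word probability and its limits, here is a short Python sketch. The function name bayes_word_probability is hypothetical, and raising an error for a word or database with no counts is a choice made for this example, not something taken from aspam.

```python
def bayes_word_probability(ws, wn, ms, mn):
    """p(s|w) = ws/(ws+wn), limited to the range [pmin, pmax]."""
    if ws + wn == 0 or ms + mn == 0:
        # Assumption for this sketch: the caller handles unseen words.
        raise ValueError("word and message counts must be nonzero")
    p = ws / (ws + wn)
    pmin = 1.0 / (mn + ms)
    pmax = 1.0 - pmin
    return min(max(p, pmin), pmax)


# With 300 spam and 100 normal messages in the database:
print(bayes_word_probability(30, 10, 300, 100))  # word seen in both kinds
print(bayes_word_probability(40, 0, 300, 100))   # word seen only in spam
```

In the second call the raw value ws/(ws+wn) is 1, but it is limited to pmax = 1 - 1/400, so no single word is ever treated as absolute proof.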
The Bayes probability is computed for each word in the unknown message word list. If the word is not found in the database, or has a total count of less than 5, it is assigned a probability of 0.5. The words are then sorted in descending order of the absolute difference of this probability from 0.5, and only the first 15 are used for further processing (or all of the words, if there are fewer than 15). Thus, we choose the 15 most "interesting" words, in the sense that they almost always or almost never appear in spam messages.
Let Wi be the Bayes probability for word i. The proper way to combine the probabilities is with the expression
| p | = W1*...*Wn / ( W1*...*Wn + (1-W1)*...*(1-Wn) ) = the probability that the message is spam |
If the result p > 0.9, aspam categorizes the message as spam.
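The selection and combination steps can be sketched as follows in Python (3.8 or later, for math.prod). The function name combine and its parameters are hypothetical; the input is assumed to be the list of per-word probabilities already computed for the message, and returning 0.5 when the denominator is zero is an assumption added to keep the sketch safe.

```python
import math


def combine(word_probs, keep=15, threshold=0.9):
    """Combine per-word spam probabilities into a message probability."""
    # Keep the words whose probability is farthest from the neutral 0.5.
    ranked = sorted(word_probs, key=lambda w: abs(w - 0.5), reverse=True)
    chosen = ranked[:keep]
    spam = math.prod(chosen)                     # W1*...*Wn
    normal = math.prod(1.0 - w for w in chosen)  # (1-W1)*...*(1-Wn)
    if spam + normal == 0.0:
        return 0.5, False  # assumption: treat a degenerate case as neutral
    p = spam / (spam + normal)
    return p, p > threshold


p, is_spam = combine([0.99, 0.95, 0.5, 0.2])
print(round(p, 4), is_spam)
```

Note that a word with probability 0.5 contributes the same factor to both products, so it has no effect on the result.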
Our site receives far more spam than "good" messages. In this case, there is some doubt whether the Bayes approach is the best. In the Bayes approach, the past frequency of spam is "built in", meaning that for us, there is a built-in bias that a neutral message is spam. I'm not sure that this is a good thing. The alternative algorithm has no such bias.
Recall that the probability that a message is spam given the presence in the message of some word is
p = ws/(ws + wn)
where
ws = number of spam messages in the database containing word
wn = number of normal messages in the database containing word
Consider a word that appears with equal frequency in spam and normal messages. Then, if there are far more spam messages in the database, ws must be much larger than wn, so p would indicate spam, which may or may not be true.
The alternative algorithm is
p = ps/(ps + pn)
where
ps = ws/ms
pn = wn/mn
ws, wn = per-word message counts, as defined above
ms = total number of spam messages in database
mn = total number of normal messages in database
In this case, if the frequency of the word is equal in spam and normal messages, pn = ps and the test is inconclusive, as it really should be based on this one piece of information.
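The difference between the two algorithms can be seen with a small numerical example in Python. The function and parameter names are hypothetical, and the counts are made up for illustration: a database with 900 spam and 100 normal messages, and a word that appears in 10% of each.

```python
def bayes_p(word_spam, word_normal):
    """Bayes form: fraction of the messages containing the word that are spam."""
    return word_spam / (word_spam + word_normal)


def alternative_p(word_spam, word_normal, total_spam, total_normal):
    """Alternative form: compare per-class word frequencies instead of counts."""
    ps = word_spam / total_spam      # frequency of the word in spam
    pn = word_normal / total_normal  # frequency of the word in normal mail
    return ps / (ps + pn)


# Illustrative counts: the word occurs in 90 of 900 spam messages
# and in 10 of 100 normal messages, i.e. 10% of each.
print(bayes_p(90, 10))
print(alternative_p(90, 10, 900, 100))
```

The Bayes form gives 0.9 for this word simply because spam dominates the database, while the alternative form gives 0.5, the inconclusive result.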