---- Release 2.6 -----------------------------------------------------------

Our mail setup has changed radically.  The mail server is now a Red Hat
Linux 9 machine, over which we have full control.  Previously our mail
server was a shared FreeBSD machine, over which we had little control.

The new setup allows us to configure sendmail to our liking.  This has
had a very positive impact on our spam situation, for two reasons:

1) We can now reject mail that does not have a valid recipient.
   Previously, mail to any username at our domains would be accepted.
   Spammers were sending dozens of copies of each message, addressed to
   semi-random and generally nonexistent users.  Aspam easily culls the
   duplicates, but now such messages are never accepted by the mail
   server.

2) We use the dnsbl feature of sendmail, and reject nonresolvable
   domains.  This cuts way down on spam from major sources.

Still, one or two hundred messages per day have to be dealt with.

Our first experiment was to install SpamAssassin on the server, called
from a procmail delivery agent.  Although the thoroughness of
SpamAssassin is impressive, we found it unsuitable for our needs.

1) It is apparently compute intensive and seems very slow.  The first
   time I fed it a message to categorize, I thought it was hung.  It
   must have taken 15 seconds or so to process a single rather small
   message.  I don't know if this is normal, but it doesn't cut it.
   Aspam can process dozens of messages in the same time.

2) In our setup, a new process is forked for each message received.
   When I dumped the queue from the old server, the new server became
   unresponsive due to the load.

3) When the messages were finally processed, there were errors, both
   false positives and false negatives.  We don't view this with much
   alarm, as SpamAssassin had no past history.  Once the Bayes database
   has a significant sample, the accuracy would no doubt improve.

4) One worry is that SpamAssassin is too popular.  Spammers actually
   use it to tailor their headers.

So, for now we have removed SpamAssassin from the mail server.  Aspam is
our only spam filtering tool, and it is quite effective.

* Minor changes were made to get aspam operational under Linux.  Having
  been developed and used on FreeBSD, aspam had some minor porting
  issues when going to Linux.

* There is a new executable contained in the aspam package: asutil.
  This provides a couple of useful functions:
  1) It will sort a mail folder by date.
  2) It will iterate through a mail folder, directing the text of each
     message to an external program.

* These two new aspamrc keywords deal with encoded subject text.  If the
  subject is in Chinese, for example, it is encoded.

  SubjEncEnglScore (score)
    This is the score to add if the subject text is encoded using a
    western European character set.  This is deemed slightly suspicious,
    so the default score is 4.

  SubjEncFrgnScore (score)
    This is the score to add if the subject is encoded using a character
    set that is not western European.  The default is 0 so as to not
    affect users who receive such legitimate mail.

* The OkSources keyword in the aspamrc file gives a list of IP addresses
  that will never be added to the address table.  A 0 in the fourth
  position is a wildcard for that position; similarly, 0 in the third
  and fourth positions is a wildcard for those two positions.  A match
  to this list will not change the score, but only prevents the address
  from being added to the block list under any circumstances.  You may
  want to include your trusted relays here.  This is useful if your ISP
  sends you important mail as well as advertisements, for example.  You
  want to filter the advertisements without permanently blocking the
  source address.

  OkSources (list of IP quad addresses)

* The file format of the BadAddrDatabase has changed (again).  As
  before, aspam can read the previous format, but will write the new
  format.  In addition, aspam can now read the ascii format produced by
  the interactive mode 'd' command.

  The BadAddrDatabase now saves a time stamp for the latest hit for each
  address.  In order to limit the growth of this database and to
  maintain efficiency, a mechanism is provided for purging old entries.
  The aspamrc file can contain lines of the form

    BadAddrPurge count days

  Up to five of these entries may appear (any beyond five will be
  ignored).  The count and days are integers.  When the addresses are
  being saved back to a file, each one is tested against the purge
  specifications, and will be discarded if, for any specification,
  1) The count is 0 or the address hit count <= count, and
  2) The address last hit date is before the current date minus days.
  If no BadAddrPurge specifications appear, all addresses will be saved
  forever (as before).

* The interactive mode 'd' command now takes a second optional integer
  argument, which is the number of days in the past to include.  For
  example, "d 10 2" will dump the addresses with 10 or more hits whose
  most recent hit was recorded in the last 2 days.

---- Release 2.5 -----------------------------------------------------------

* Lines in the aspamrc file can be continued using the backslash
  continuation method.  If a line ends with a backslash ('\'), it will
  be joined to the following line (replacing the backslash) yielding a
  single logical line.

* There is a new construct recognized in the aspamrc file, which
  provides access to a new expression interpreter.

    Rule expression
    Action action_keyword [argument]

  For each message, if the expression is true, the action is performed.
  This construct can appear any number of times.  See the documentation
  for the syntax and list of symbols allowed in the expression, and for
  the keywords accepted in the Action line.

* New interactive mode command: u[range] [redir]
  For each message in the range, evaluate the expressions in the
  rule/action list and show the result.  The actions are not performed.
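
The backslash continuation rule for aspamrc lines, described in this
release, can be illustrated with a small sketch.  This is not aspam's
own code.  It assumes that the two lines are joined with nothing
inserted in place of the backslash, and that a backslash on the very
last line is simply dropped.  The function name and the GoodPlaces
values are made up for the example.

```python
def join_continued_lines(lines):
    """Join physical lines ending in a backslash with the line that follows."""
    logical = []
    pending = ""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.endswith("\\"):
            # Drop the backslash; the following line takes its place.
            pending += line[:-1]
        else:
            logical.append(pending + line)
            pending = ""
    if pending:
        # Assumption: a trailing backslash on the last line is dropped.
        logical.append(pending)
    return logical


# Hypothetical aspamrc fragment: three physical lines form one logical line.
example = [
    "GoodPlaces alpha.example \\\n",
    "beta.example \\\n",
    "gamma.example\n",
    "NoWordDatabase\n",
]
result = join_continued_lines(example)
for entry in result:
    print(entry)
```

Here the first three physical lines collapse into one GoodPlaces line,
and the NoWordDatabase line is passed through unchanged.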

* New interactive mode command: U[range]
  The user is prompted to type in an expression, as would be provided in
  a Rule/Action block.  The expression is applied to each message in the
  range, and the result printed (there is no action).  The process
  repeats until the user enters "q" for the expression, or no expression
  at all.

* The bad address table now includes a count of the number of spam
  messages received from each IP address in the database.  When an IP
  address is "added" to the database, and that address is already
  present, the count for that address is incremented.

  There are a couple of side effects:
  1) The file format of the addr_table database file in the .aspam
     directory has changed, and is not backward compatible.  The old
     format can be read, but the new format will be written.
  2) The addr_table file will always be fully updated when aspam exits
     (unless table updates are suppressed).  Previously, new entries
     could simply be appended to this file, but this is no longer true.
     In interactive mode, the 'u' in q[u] and t[u] no longer has
     meaning, since the database is always rebuilt.

  The number of spams received from the address is printed in the info
  block appended to spam messages, and printed with the 'f' command in
  interactive mode.  It is also printed in the second column of output
  from the 'd' interactive mode command.

* A locking scheme is now employed if the BadAddrDatabase file name is
  given in the aspamrc file.  In this case, it is assumed that several
  users are sharing the same database file.

  A file with the same name and path as the database file, but with a
  ".lock" extension, performs the locking.  This file will be created if
  necessary, and automatically updated by the programs.  It contains the
  process id of the owning process when the lock is in force, and
  contains "0" or is empty when the lock is not in force.  The file must
  be writable by all users.
  It may have to be initially created by root using "touch" if the
  containing directory does not give users write permission, i.e.,
  users cannot create a file in the directory but can update an existing
  file.

  If BadAddrDatabase is not given, the table is located in the user's
  .aspam directory and is assumed to not be shared, and no locking is
  done.

  The file is locked when the database is read into memory, and unlocked
  when the file is rewritten from memory, or the program exits.  When
  locked, it cannot be opened by another copy of aspam.  If aspam finds
  the file locked, it will go to sleep and try again every five seconds.
  Note that long sessions in interactive mode will lock out all other
  aspam processes for the duration.

* Modified interactive mode command: d[mincnt] [redir]
  This dumps a human-readable version of the address database.  An
  optional integer mincnt can follow the "d", which if given will cause
  only table elements with message counts greater than or equal to this
  number to be listed.

  Example: d25 > badones
  This will list, in file "badones", the source addresses from which 25
  or more spams have been received.

---- Release 2.4 -----------------------------------------------------------

IMPORTANT
---------
It has been discovered that corruption was occurring in the inbox and
spambox files.  What was observed was that occasionally (about one in a
hundred messages) a message would get truncated by a few characters, so
that the "From" line that starts the next message would not appear at
the beginning of a line, preceded by a blank line.  In this case, the
second message would be hidden, acting as if it were part of the first.
Messages were getting lost.

An audit of the aspam code and hours of experiments failed to reproduce
the effect.  It is possible that aspam is not causing the problem: if
the local mail distribution program is writing a corrupted inbox, that
corruption would be transferred to the spambox as the "hidden" messages
tag along.
Nevertheless, aspam was still suspect, being the only "new" code in the
chain.

To address the problem, the functions that read/write the inbox and
spambox have been rewritten to be as paranoid as possible.  Checking is
now performed, and corruption, if found, is repaired and reported in a
log file.

NOTE added:
The problem was caused by null bytes in message body text, found in spam
messages.  Non-printing characters should not appear in messages, but
for some reason on our system they do.  I don't know if this is
deliberate on the part of the spammer for some nefarious purpose, or
some error.  Non-printing characters are now converted to space
characters when messages are processed.

* There is a new "-c" command line option.  If -c is given, aspam will
  check for corruption (described above) in the inbox and spambox, and
  perform repairs if necessary.  After checking/repairing, aspam exits.

* There is a new error logging capability.  If there is a run-time
  error, aspam will write a message to stderr and also append the
  message to a file named "error_log" in the .aspam directory, which is
  created automatically.

* New aspamrc keywords:

    Recipients list of recipient names and names@addresses
    BadRecipientScore score (default 8)

  These keywords, if given, will cause checking that the "To:" or "Cc:"
  header contains a literal match to an entry in the given list.  If no
  match is found, the score is added.  Warning: mail that is bcc'ed will
  fail this test, unless the "real" recipient is in the list.  In our
  case, we almost never get bcc'ed messages, and all mail to our domain
  goes to one inbox.  This tests for made-up user names that are sent to
  our domain.

* The SuspiciousPlaces, BadPlaces, and VeryBadPlaces lists are now
  tested against the full sender@address, and not just the address.
  Thus, the name@ form can be used in the keywords in these lists (the
  GoodPlaces list already has this property).

* All non-suppressed tests are now run when a message is being tested.
  Previously, testing would stop when the accumulated score reached the
  spam score (20).  This takes a little more cpu horsepower but gives a
  full evaluation of each message.

* When a message is added to the spambox, a block of text describing the
  test results is added to the message body.  This block is ignored if
  the message is read in again, such as in interactive mode, for testing
  and display purposes.

* In interactive mode, the new pf and yf commands (they are equivalent)
  will print the test result block for the message if in spambox context
  (S given).  When printing spambox messages with p or y, the result
  block is not shown.

* The prompt in interactive mode now displays the current message range.

---- Release 2.3 -----------------------------------------------------------

* The startup file and database files are now located in $HOME/.aspam.

* The startup file, previously "$HOME/.aspamrc" or "./.aspamrc", must
  now be found as "$HOME/.aspam/aspamrc".  A previous startup file (if
  any) can be moved to the new location, but some editing is needed to
  avoid warning messages.

* The MsgIdDatabass keyword in the startup file is no longer accepted.
  This should be deleted or commented out of an old startup file.  The
  previous message id database file (default name "$HOME/.aspam_db1"),
  if any, can be moved to "$HOME/.aspam/mesg_table".

* The WordDatabass keyword in the startup file is no longer accepted.
  This should be deleted or commented out of an old startup file.  The
  previous word database file (default name "$HOME/.aspam_db2"), if any,
  can be moved to "$HOME/.aspam/word_table".

* The BadAddrDatabase keyword, if given, now must be a full path name to
  a file (can be any name).  This is to allow multiple users to share
  the bad address database.  If this database is not shared, this
  keyword should be deleted or commented out of the startup file, and
  the existing database (default name "$HOME/.aspam_db2") can be moved
  to "$HOME/.aspam/addr_table".
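
The file moves listed for this release can be scripted.  The following
is only a sketch, not part of aspam: the function name and the MOVES
table are made up for the example, and the old names are the defaults
quoted above, so a site that used other names would have to change
them.  The bad address database is deliberately left out, since what to
do with it depends on whether it is shared.

```python
import shutil
from pathlib import Path

# Old default name in $HOME -> new name inside $HOME/.aspam, as listed in
# the notes for this release.  The bad address database is not handled
# here: it may be shared, and the notes quote the same default name for
# it as for the word database.
MOVES = {
    ".aspamrc": "aspamrc",
    ".aspam_db1": "mesg_table",
    ".aspam_db2": "word_table",
}


def migrate(home):
    """Move pre-2.3 files under `home` into home/.aspam; return what moved."""
    home = Path(home)
    target_dir = home / ".aspam"
    target_dir.mkdir(exist_ok=True)
    moved = []
    for old_name, new_name in MOVES.items():
        src = home / old_name
        dst = target_dir / new_name
        # Skip files that are absent, and never overwrite an existing target.
        if src.is_file() and not dst.exists():
            shutil.move(str(src), str(dst))
            moved.append(new_name)
    return moved
```

Calling migrate(Path.home()) would move whichever of the old files
exist.  The moved startup file still needs the hand editing described
above (removing the keywords that are no longer accepted).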

* There is a new database of "good" return addresses.  This is created
  and maintained by aspam in the file "$HOME/.aspam/good_table".  Unlike
  the other database files, this is a text file, and can be edited by
  the user (carefully!) if necessary.

* The following keywords can be added to the startup file.  These should
  have no following text.

  Disable use of the bad address database:
    NoBadAddrDatabase
  Disable use of the word database:
    NoWordDatabase
  Disable use of the good return address database:
    NoGoodReturnDatabase

* The GoodReplyPlaces startup file keyword is no longer accepted.  The
  entries following this keyword should be added to the GoodPlaces
  list.  The entries in the GoodPlaces list are now checked against
  both the "From" address and the reply address.

* In interactive mode, the new "dg" command will dump the good return
  address database.

* On the command line, the "-h path" option will reset the "home"
  directory path, where the .aspam directory is located.

* The format of the word database has been changed to reduce file size.
  The old format can still be read, but the new format is written.

---- Release 2.2 -----------------------------------------------------------
Second release 11/30/03

* Bug fix - avoids infrequent crash

* Comment:
  The "alternative" word probability algorithm has been in use now for
  nearly one month.  So far, there have been no false positives.  There
  seem to be a few more spams that get through, usually one or two per
  day.  Total email load is 2-3 hundred per day, of which maybe 10 are
  non-spam.  The "alternative" algorithm is recommended, and will
  probably become the default algorithm in the next release.

---- Release 2.2 -----------------------------------------------------------

* New keyword in .aspamrc file: RemoveTo from_addr folder_path
  Any messages whose source matches from_addr (in the manner of the
  GoodPlaces keyword) will be removed from the inbox and saved in
  folder_path, which is a full path to a file, which will be created if
  it doesn't exist.  Messages from this sender, spam or not, will go to
  folder_path.  If folder_path can't be opened, the message will be
  treated normally.

  This solves a particular problem:  An evil spammer discovers that an
  anti-spam tool is available on wrcad.com.  So, the spammer tries to
  retaliate by using "wrcad.com" in the bogus return address applied to
  a gazillion spams sent to aol.com.  Huge numbers of these bounce since
  the recipient address is incorrect, and end up in the mailbox of
  wrcad.com, from "MAILER_DAEMON@aol.com".  This keyword enables
  filtering of these into a separate file.

* The "BayesScore" keyword is now officially "WordScore", but the old
  name is still recognized.

* New keyword in .aspamrc file: WordAlgorithm 0|1
  This changes the algorithm used to compute probability from the word
  table (if in use).  The argument is 0 or 1.  If this keyword is not
  given, or 0 is given as the argument, the Bayes algorithm is used (as
  in release 2.1).  If another value is given, an alternative algorithm
  is used.

  In use, the Bayes analysis produces a few false positives, maybe 1-2
  percent at our site.  Consequently, it may be useful to try different
  algorithms, since false positives are really bad.

  Our site receives far more spam than "good" messages.  In this case,
  there is some doubt whether the Bayes approach is the best.  In the
  Bayes approach, the past frequency of spam is "built in", meaning that
  for us, there is a built-in bias that a neutral message is spam.  I'm
  not sure that this is a good thing.  The alternative algorithm has no
  such bias.

  Recall that the probability that a message is spam given the presence
  in the message of some word is

    p = ns/(ns + nn)

  where
    ns = number of spam messages in the database containing word
    nn = number of normal messages in the database containing word

  Consider a word that appears with equal frequency in spam and normal
  messages.  Then, if there are far more spam messages in the database,
  ns must be much larger than nn, so p would indicate spam, which may
  or may not be true.

  The alternative algorithm is

    p = ps/(ps + pn)

  where
    ps = ns/Ns
    pn = nn/Nn
    Ns = total number of spam messages in database
    Nn = total number of normal messages in database

  In this case, if the frequency of the word is equal in spam and normal
  messages, pn = ps and the test is inconclusive, as it really should be
  based on this one piece of information.

  In both cases, we use the same formula to combine the probabilities of
  the 15 words whose individual probabilities differ most from 0.5.

    P = p1*...*p15/(p1*...*p15 + (1-p1)*...*(1-p15))

  If P > 0.9, the test indicates spam.

---- Release 2.1 -----------------------------------------------------------

* New keyword in aspamrc file: BayesScore N
  This score is added if the "Bayes" probability analysis fails.

* RFC 2047 encoded words in the subject header are now decoded.

* The text from the subject header is now included in the word table.
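
The two word probability formulas given under Release 2.2 (the
WordAlgorithm keyword) can be compared numerically.  The sketch below
is an illustration written from those formulas, not aspam's source.
The function names and the message counts are made up, and it does not
cover cases the notes do not describe, such as words missing from the
database or probabilities of exactly 0 or 1.

```python
from math import prod


def word_prob_bayes(ns, nn):
    """Bayes formula from the notes: p = ns/(ns + nn)."""
    return ns / (ns + nn)


def word_prob_alternative(ns, nn, Ns, Nn):
    """Alternative formula: p = ps/(ps + pn), ps = ns/Ns, pn = nn/Nn."""
    ps = ns / Ns
    pn = nn / Nn
    return ps / (ps + pn)


def combine(probs, keep=15):
    """Combine the `keep` probabilities that differ most from 0.5."""
    chosen = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = prod(chosen)
    normal = prod(1.0 - p for p in chosen)
    return spam / (spam + normal)


# Hypothetical database: 900 spam and 100 normal messages, and a word
# that is found in 10% of each kind.
Ns, Nn = 900, 100
ns, nn = 90, 10
p_bayes = word_prob_bayes(ns, nn)              # 90/(90 + 10) = 0.9
p_alt = word_prob_alternative(ns, nn, Ns, Nn)  # 0.1/(0.1 + 0.1) = 0.5
print(p_bayes, p_alt)
```

With these counts the first formula scores the word as spammy only
because the database holds nine times more spam, while the alternative
scores it as neutral, which is the bias discussed under Release 2.2.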