CSLab-provided spam-filtering options

This page is for people who need or want more explanation of how the CSLab-provided spam-filtering options work, and of why there is a complicated distinction between the rejection options and the discard options.

First of all, here is some background about how e-mail transport on the internet works.

A protocol called "SMTP", which stands for "Simple Mail Transfer Protocol", is used to transport the message between computers until it reaches its destination. In the simplest case, the e-mail message is initiated on one computer, whose e-mail system makes a direct SMTP connection to the computer on which the recipient is going to read the e-mail; and the e-mail message is transmitted over this SMTP connection. (In more complex cases, there may be multiple SMTP connections, and possibly other kinds of connections at the two ends.)

SMTP is not, actually, an extremely simple protocol. It has complications which derive from the complex realities of a large interconnected network. One concern of the SMTP protocol is reliability: suppose the transmitting computer crashes while sending? Suppose the receiving computer receives the complete message text but then crashes before its buffers are written to disk?

The SMTP protocol guarantees reliability by saying that during the SMTP conversation, after a message is transmitted, the receiver of the message does not acknowledge the message until it has been successfully written to disk. At that point, the receiving computer has responsibility for the message. At all times, at least one computer (for a brief instant it may be two) has complete responsibility for the message. So SMTP transport of a message basically consists of transferring responsibility for the message from one computer to the next.

You will have received "bounce" messages, error messages saying that a particular e-mail message could not be delivered for one reason or another. Suppose you compose an e-mail message on a computer named "A", addressed to "flaps@B", and suppose that this computer then transmits it to the computer "B". In the SMTP conversation, it will tell computer B that the message is for "flaps@B". If the computer B knows that there is no valid e-mail address "flaps@B", or has some other reason to refuse the message, it will reply with a result code which means this, and reject the attempt to transfer the responsibility for the message to the computer named B. At this point, traditionally the computer named A will send you a bounce message from MAILER-DAEMON.

Suppose, however, that the target computer doesn't know at this point whether the message is valid or not. For example, perhaps the machine B is not connected to the internet, and the SMTP conversation is in fact to a machine named C, which is listed as the "mail exchanger" for B. Then C will accept "flaps@B" because it looks plausible. At some later time it will attempt to transmit the message to B, which will reject it. But C has already accepted responsibility for the message! Therefore the computer C will generate a bounce message, addressed to the original user on machine A, and this message will be transported back to machine A using a new SMTP conversation. Sometimes this happens much later than the original transmission.
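
The hand-off of responsibility described above can be modelled in a few lines of Python. This is only an illustrative sketch: the function name "deliver" and its arguments are hypothetical, and a real mail transport agent queues messages on disk and retries rather than walking a list. The point is only that whichever host last accepted the message is the one that has to generate the bounce.

```python
def deliver(hops, recipient):
    """Pass a message along a chain of hosts.

    hops is a list of (name, accepts) pairs. accepts(recipient) returns
    True or False and stands in for that host's accept/reject reply.
    The first entry is the originating host; its accepts is never called.

    Returns (holder, delivered): the host left responsible for the
    message, and whether the message reached the end of the chain.
    """
    holder = hops[0][0]              # the originating host starts out responsible
    for name, accepts in hops[1:]:
        if not accepts(recipient):
            # A refusal leaves responsibility where it was, so the
            # current holder is the host that must report the failure.
            return holder, False
        holder = name                # an acknowledgement transfers responsibility
    return holder, True


def no_such_user(recipient):
    return False                     # B knows the address is invalid


def looks_plausible(recipient):
    return True                      # C cannot check, so it accepts


# Direct connection from A to B: A still holds the message, so A bounces it.
print(deliver([("A", None), ("B", no_such_user)], "flaps@B"))
# prints ('A', False)

# Via the mail exchanger C: C accepted, so C must generate the bounce.
print(deliver([("A", None), ("C", looks_plausible), ("B", no_such_user)], "flaps@B"))
# prints ('C', False)
```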

How do we filter out spam?

Suppose that during the SMTP conversation, the receiving computer knows that the message being transmitted is spam. Then it makes sense for it to use one of the "reject" codes, and refuse to accept responsibility for this message.

Maybe you think you have a program which when presented with an e-mail message, determines whether or not the message is spam. If you can run this program while the SMTP conversation is live, then you would know whether you should accept or reject the message.

But that's tricky to arrange in many cases. Maybe you run the program later, after the message has been accepted, and then you know that the message is spam. Should you generate a bounce message, since you have accepted responsibility for the message?

In fact, you should not bounce spam. First of all, the consensus has become pretty clear that despite the rules in the SMTP protocol standards, there's no obligation to treat spam "correctly", and it's best to spend as few resources on disposing of it as possible. But more than that, these days spam e-mail messages are almost always sent with forged "from" information. For example, if you and I are both in the spammer's e-mail address database, they might send you a message which says it's "from" me, and send me a message which says it's "from" someone else. (This will bypass any anti-spam rule you have which says only to accept messages from people you know, for example.) So, bouncing spam is bad, because then you are sending spam to random third parties... technically, you're a spammer too. Very bad.

So what do we do with an e-mail message which we've accepted, and thus taken responsibility for, but is spam? No problem -- we can simply delete it, forget about it.

But it's not that simple, because the program which you think identifies whether an input message is spam is actually not perfect; quite far from it. Typically it will have some number of "false positives", where it identifies a message as spam even though it isn't, and "false negatives", where it identifies a message as not being spam even though it is.

False negatives aren't so much of a problem -- the spam gets through. It's extremely annoying, but it's not fatal to the spam-filtering concept.

But false positives are a big problem. You can't simply drop the message because the program says it's spam. It might be a legitimate message. The easing of the rules about responsibility for the e-mail message applies only to actual spam. So you can't just drop ("discard") the message because some program, with a non-zero false-positive rate, identifies the message as spam.

Or can you? First of all, good spam-filtering programs† have a very low false-positive rate (although a zero false-positive rate is impossible without a very large false-negative rate, which would make the spam filter useless). Secondly, if the "you" in question is the addressee of the e-mail, it's your mail, you can do what you like with it, e.g. it's perfectly legitimate for a user to discard their own mail without reading it.
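
The trade-off between the two kinds of error can be seen in a small Python sketch. It assumes a filter that gives each message a spam score between 0 and 1 and flags it when the score reaches a threshold; the scores below are made up for illustration (and deliberately exaggerated), not measurements of any real filter, and the function name is hypothetical.

```python
def error_rates(samples, threshold):
    """Return (false_positive_rate, false_negative_rate).

    samples is a list of (score, is_spam) pairs and must contain at
    least one spam and one non-spam message. A message is flagged as
    spam when its score is at or above the threshold.
    """
    ham_scores = [score for score, is_spam in samples if not is_spam]
    spam_scores = [score for score, is_spam in samples if is_spam]
    false_pos = sum(score >= threshold for score in ham_scores)   # legitimate but flagged
    false_neg = sum(score < threshold for score in spam_scores)   # spam but let through
    return false_pos / len(ham_scores), false_neg / len(spam_scores)


# Made-up scores: four legitimate messages and four spam messages.
SAMPLES = [
    (0.10, False), (0.30, False), (0.60, False), (0.70, False),
    (0.40, True), (0.65, True), (0.80, True), (0.95, True),
]

print(error_rates(SAMPLES, 0.5))    # prints (0.5, 0.25)
print(error_rates(SAMPLES, 0.75))   # prints (0.0, 0.5)
```

Raising the threshold removes the false positives in this toy data, but only by letting more spam through.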

So under some circumstances, it's possible to discard spam automatically. But it's better to reject it during the SMTP conversation. If a rejected message was being sent by "spamware", the spamware probably ignores the result code, but we can't control what it does anyway. If it was a legitimate e-mail message being sent by a legitimate mail transport agent, then a proper bounce message will be generated, and the sender will see that their legitimate message was falsely rejected as spam and can take appropriate action.

† Not to be confused with spam-filtering systems such as hotmail's, which have a high false-positive rate.

What kinds of spam can CSLab's e-mail servers currently reject?

The most effective way to reject spam, if you can't run a probabilistic spam filter while the SMTP conversation is still going on, is to reject all e-mail from computers which are known to send spam. People out there compile big lists of such machines, and make them available for consultation in real-time over the internet.

First of all, e-mail addresses can be restricted to receiving e-mail from within the department, or to receiving e-mail from within U of T (these are two different options). If you ask for this for a particular e-mail address, then all other transmissions will be rejected.

But that's only useful for a small fraction of the e-mail addresses at CSLab. More generally:

CSLab currently can (if you ask for "spam rejection") consult various databases from "spamhaus.org" which list known spammers, residential computer address ranges (users there will send legitimate e-mail to the SMTP server their ISP tells them to use, and that SMTP server will send to CS, and it won't be blocked), and known compromised ms-windows machines which have been observed sending spam (which is where the majority of spam is injected from these days). Transmissions from those computers will be rejected.
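
Lists of this kind are conventionally consulted through ordinary DNS lookups: the octets of the connecting machine's IPv4 address are reversed, the list's zone name is appended, and the resulting name resolves only if the address is listed. The Python sketch below shows that convention. It is not CSLab's configuration: the zone "dnsbl.example" and the function names are placeholders, and real lists have their own usage policies and attach different meanings to the addresses they return.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a list zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")   # also validates the address
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the list has an entry for this address.

    resolve can be replaced with another function, which makes the
    logic testable without touching the network.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False     # the name does not resolve: not listed
    return True          # any answer means the address is listed


print(dnsbl_query_name("192.0.2.99", "dnsbl.example"))
# prints 99.2.0.192.dnsbl.example
```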

There's also a strange but surprisingly effective technique which will also be activated for your e-mail address(es) at CSLab if you ask for "spam rejection". It is known as "greylisting" (whereas the above database-consultation was originally known as "blacklisting", although people seem to prefer the term "blocklisting" these days).

"Greylisting" involves the use of a third category of SMTP result codes, other than reject or accept. This category is "temporary failure". This is the kind of result you would send for a "disk full" error, for example: the sender should not consider the responsibility for the message to have been transferred, but neither should it generate a bounce message. It should try again in a little while.

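The three categories correspond to the first digit of the numeric SMTP reply code: 2 for acceptance, 4 for temporary failure, and 5 for rejection. As a rough sketch in Python of how a sending system might act on a final reply (the function name and the wording of the returned strings are illustrative only):

```python
def sender_action(reply_code):
    """Map a final SMTP reply code to what the sender should do next."""
    if 200 <= reply_code < 300:
        return "done: the receiver is now responsible"
    if 400 <= reply_code < 500:
        return "keep the message and retry later"
    if 500 <= reply_code < 600:
        return "give up and bounce to the sender"
    return "not a final reply"       # e.g. 3xx codes mean "go on, send more"


print(sender_action(250))   # prints done: the receiver is now responsible
print(sender_action(452))   # prints keep the message and retry later
print(sender_action(550))   # prints give up and bounce to the sender
```

Code 452 is the standard "insufficient system storage" reply, which is the "disk full" case mentioned above.
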
"Greylisting" consists of sending a "temporary failure" result code, and then accepting the message when it's retransmitted. The point is that real mail transport agents will indeed retransmit... but spamware typically ignores the result code anyway (continuing to spam whether messages have been accepted or rejected or whatever) and thus will not retry and the message will never be delivered.

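A minimal Python sketch of the greylisting idea follows. It is not CSLab's implementation. It assumes, as many greylisting systems do, that a transmission is identified by the combination of connecting address, sender and recipient, and that a retry only counts after a minimum delay; the class name, the 300-second delay and the choice of reply codes 451 and 250 are all illustrative.

```python
import time


class Greylist:
    """Temporarily refuse the first attempt; accept a later retry."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay   # seconds a sender must wait before retrying
        self.clock = clock           # replaceable, so the logic can be tested
        self.first_seen = {}         # (client_ip, sender, recipient) -> first attempt time

    def check(self, client_ip, sender, recipient):
        """Return the SMTP reply code to give for this attempt."""
        key = (client_ip, sender, recipient)
        now = self.clock()
        first = self.first_seen.setdefault(key, now)   # remembers only the first attempt
        if now - first < self.min_delay:
            return 451               # temporary failure: "try again later"
        return 250                   # the sender came back, so accept
```

A real mail transport agent keeps the message queued and retries, so its second attempt is accepted. Software that never retries never gets its message through.
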
What other spam can be filtered?

As implied above, CSLab also has a probabilistic spam filter which examines an e-mail message and decides whether a message is probably spam, based on a number of characteristics (including message content) of both the message and of various kinds of spam. This cannot, currently, be run during the SMTP conversation, so you can't reject e-mail based on this. However, you can ask for e-mail to be discarded based on this.

If you ask for e-mail discarding, you should also ask for e-mail rejection. It's better to reject a message than to discard it, because of the spectre of false positives, and the overlap in messages caught by these two categories of spam-filtering techniques is considerable.

Rejecting isn't perfect after all

Spam rejection at CSLab won't cause any new problems at CSLab. However, it might cause new problems elsewhere, and your spam rejection elsewhere might cause new problems here.

A common case is where a CSLab user has a .forward file which forwards their e-mail elsewhere, and that other location rejects a lot of the e-mail which is forwarded in this way. One reason for this is spam rejection at that other site.

The e-mail has been transmitted to CSLab's mail server; it's for a valid address at CS; so CS's mail server accepts it.

Then it goes to forward it as your .forward specifies, which involves making a new SMTP connection to a new machine as specified by your .forward. Now suppose that during this new SMTP conversation, the new recipient rejects the message.

Now, since CS has accepted responsibility for the message, it has to generate a bounce message. But if it's spam, from a forged sender, then CS's e-mail system sends the bounce message, most of which is the original spam message, to that forged sender address, thus sending spam.

So you should not, if at all possible, forward spam to another mail server which is going to reject it. Or, in general, forward anything to a mail server which is going to reject it (e.g. it's also a bad idea to have an invalid address in your .forward).

You can also use CSLab's probabilistic spam filter to help with this situation. If instead of creating a ".forward" you create a file called ".forward-nonspam" (or rename your .forward to .forward-nonspam), then the CSLab probabilistic spam filter is run on your mail messages and they are forwarded only if it decides that they are not spam.
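
The difference between the two kinds of forwarding can be summarized in a short Python sketch. It models only the decisions described above; the function and parameter names are hypothetical, and it says nothing about what actually happens to a message that is not forwarded.

```python
def handle_forward(flagged_as_spam, downstream_accepts, filter_first):
    """Describe what happens to one message that has a forwarding address.

    filter_first=False models a plain .forward file;
    filter_first=True models .forward-nonspam.
    """
    if filter_first and flagged_as_spam:
        return "not forwarded"       # the filter stops it before any new SMTP connection
    if downstream_accepts:
        return "forwarded"
    # The message was already accepted here, so a refusal at the other
    # site produces a bounce, possibly to a forged sender address.
    return "bounce sent to the claimed sender"


# Spam that the other site would reject:
print(handle_forward(True, False, filter_first=False))   # prints bounce sent to the claimed sender
print(handle_forward(True, False, filter_first=True))    # prints not forwarded
```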