C.S. is a regular contributor to the Red Condor Security blog and a master at writing anti-spam rules—handwritten scripts that target new, slippery spam campaigns that automated methods such as session defenses, virus scans and static policy settings can’t detect. The Red Condor filters use up to 60,000 rules at any one time, with new rules being added in near real-time. Team Red Condor sat down with C.S. this week to ask him about this process.
Is rule-writing a science or an art?
Both. At a minimum, the science involves text processing and analysis, but it can get fairly elaborate, with machine learning and statistical methods aiding expert analysis. These methods are particularly useful for exposing the invariant components of a spam campaign beneath what would appear to the uninitiated as a mind-numbing degree of randomization.
And the art?
The art comes with years of hands-on experience and with learning to balance the many constraints spam reviewers must keep in mind. For example, we need to ensure that our rules are fast and efficient, that they block spam and never ham (good mail), and that they don’t target trivial aspects of a campaign that are unstable and likely to change, either during a campaign run or over longer time intervals as the spammers continue to randomize their approach. The rules need to be general so that we’re not writing them in a one-to-one ratio with spam samples, but not so general that they end up matching unanticipated features of legitimate email. And we need to write the rules as quickly as possible in response to real-time events.
Describe the rule-writing process, from detection of a new campaign to creating and activating the rule to block it.
The cycle looks something like this: A novel campaign emerges and seeps through our defenses, and we begin obtaining samples immediately. The samples are analyzed by our 24/7/365 review staff, who then describe features of the campaign to our filter stack, and the update is pushed within seconds to all deployed gateway security devices. Samples from the new campaign then stop arriving in the reviewers’ queue, because the breach has been dealt with. Rinse, repeat.
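As a rough illustration of that cycle, here is a toy sketch in Python. It is not Red Condor's filter stack: the `FilterStack` class, the `reviewer_queue` helper, the substring-style rules, and the sample messages are all assumptions made up for this example.

```python
class FilterStack:
    """Hypothetical filter stack holding simple byte-string rules."""

    def __init__(self):
        self.rules = []

    def add_rule(self, needle: bytes) -> None:
        # In this toy model, "describing a feature" means adding a substring rule.
        self.rules.append(needle)

    def blocks(self, message: bytes) -> bool:
        return any(rule in message for rule in self.rules)


def reviewer_queue(stack: FilterStack, messages: list) -> list:
    """Return the messages that slip past the stack and reach the reviewers."""
    return [m for m in messages if not stack.blocks(m)]


stack = FilterStack()
campaign = [
    b"Win big at lucky-pills.example today",
    b"lucky-pills.example has a deal for you",
]

queue_before = reviewer_queue(stack, campaign)  # the new campaign gets through
stack.add_rule(b"lucky-pills.example")          # reviewers describe a stable feature
queue_after = reviewer_queue(stack, campaign)   # the same campaign is now blocked

print(len(queue_before), len(queue_after))
```

Once the rule is in place, the campaign's samples no longer reach the queue, which is the signal that the cycle can start over with the next campaign.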
Are junk mail and spam rules different?
Yes. One of the complexities of using a spam filter is that I have a notion of what spam is, but my neighbor’s definition is likely different. There simply is no satisfactory definition of spam that covers everyone who uses email. Some people consider spam to be anything they receive that they didn’t want. Others may view any email that contains some sort of advertisement as spam, even if they have purchased items from the company in the past and specifically opted in to receive discounts and whatnot. Because Red Condor doesn’t model its filters at an individual mailbox granularity, we have to make some compromises, and Junk rules are part of how we accomplish this. Our notion of spam is meant to drive at the heart of an almost universal commonality in perspective on what spam is: spam either explicitly uses deceptive tactics to make its way past filters, or has content that is intended to deceive or harm the recipient in some way.
According to your bio, your favorite rules are “simple gems.” What do you mean by this?
Occasionally we detect annoyance campaigns, usually associated with probing activity, where there is no sensible payload to the spam. These can be completely randomized, with very little to “bite on.” This makes describing the campaigns more difficult (essentially tantamount to describing patterns in the decimal expansion of pi), which can result in rather convoluted-looking rules where several features have to be present for a match to take place. To me, these are just ugly beasts when compared to, say, a simple eight-byte string that has blocked millions of spam messages over several years without generating any false positives.
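To make the contrast concrete, here is a minimal Python sketch of the two rule shapes. The class names, the eight-byte marker, and the feature tests are all hypothetical and are not taken from Red Condor's actual rule language.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class SubstringRule:
    """A 'simple gem': matches when one fixed byte string is present."""
    needle: bytes

    def matches(self, message: bytes) -> bool:
        return self.needle in message


@dataclass(frozen=True)
class CompositeRule:
    """An 'ugly beast': every feature test must pass for a match."""
    features: Sequence[Callable[[bytes], bool]]

    def matches(self, message: bytes) -> bool:
        return all(feature(message) for feature in self.features)


# Made-up eight-byte marker standing in for a stable campaign artifact.
gem = SubstringRule(needle=b"xq7#Zk2!")

# Made-up feature tests; each is weak alone, so all must hold together.
beast = CompositeRule(features=(
    lambda m: m.count(b"\n") < 5,   # very few lines
    lambda m: b"http://" in m,      # contains a link
    lambda m: len(m) < 400,         # short message
))

sample = b"Subject: hi\n\nclick http://example.invalid xq7#Zk2!\n"
print(gem.matches(sample), beast.matches(sample))
```

The single-string rule is easier to read, cheaper to evaluate, and easier to reason about when checking it against legitimate mail, which is part of its appeal.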
How do you measure a rule’s worth?
Well, one standard would simply be the number of hits that a rule acquires over its lifetime. However, there is complexity here, too, related to volume. Just because a rule targets lower-volume campaigns doesn’t mean it’s worth less than a rule that targets high-volume spam. It’s just a matter of where the rule lies in the distribution. We have to have rules that cover all scales, or else customers will receive spam.
A better measure of worth is the false-positive rate. We have very strict standards and take reports of false positives very seriously, so it’s not uncommon for us to discard rules that generate just two or three false positives, even when they have blocked millions of spam messages. We always strive to do better.
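The two measures can be sketched in a few lines of Python. The `RuleStats` record, the `should_discard` helper, and the cutoff of one tolerated false positive are assumptions for illustration only; the interview does not give an exact policy.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    """Hypothetical lifetime counters for a single rule."""
    name: str
    hits: int             # messages the rule matched
    false_positives: int  # matched messages that were legitimate

    @property
    def false_positive_rate(self) -> float:
        return self.false_positives / self.hits if self.hits else 0.0


def should_discard(stats: RuleStats, max_false_positives: int = 1) -> bool:
    # Assumed policy: an absolute count, not a rate, triggers the discard,
    # so even a high-volume rule is dropped after a handful of mistakes.
    return stats.false_positives > max_false_positives


gem = RuleStats("eight-byte-string", hits=3_000_000, false_positives=0)
risky = RuleStats("broad-pattern", hits=2_000_000, false_positives=3)

for rule in (gem, risky):
    print(rule.name, rule.false_positive_rate, should_discard(rule))
```

In this sketch the second rule has a false-positive rate of roughly 1.5 per million hits and is still discarded, which mirrors the strictness described above.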