Automated Worm Fingerprinting
This paper is very much like another EarlyBird paper posted here recently:
Network worms are a clear and growing threat to the security of today's Internet-connected hosts and networks. The combination of the Internet's unrestricted connectivity and widespread software homogeneity allows network pathogens to exploit tremendous parallelism in their propagation. In fact, modern worms can spread so quickly, and so widely, that no human-mediated reaction can hope to contain an outbreak.
In this paper, we propose an automated approach for quickly detecting previously unknown worms and viruses based on two key behavioral characteristics: a common exploit sequence together with a range of unique sources generating infections and destinations being targeted. More importantly, our approach, called "content sifting," automatically generates precise signatures that can then be used to filter or moderate the spread of the worm elsewhere in the network.
Using a combination of existing and novel algorithms we have developed a scalable content sifting implementation with low memory and CPU requirements. Over months of active use at UCSD, our Earlybird prototype system has automatically detected and generated signatures for all pathogens known to be active on our network as well as for several new worms and viruses which were unknown at the time our system identified them. Our initial experience suggests that, for a wide range of network pathogens, it may be practical to construct fully automated defenses, even against so-called "zero-day" epidemics.
Source: Automated Worm Fingerprinting, Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage, appeared at OSDI 2004.
Quickly scanning this paper reveals some interesting algorithms and approaches. I'm concerned a bit about the algorithm described in Figure 1, however. It uses simple thresholds to detect the presence of multiple targets. However, this seems like it would be indistinguishable from P2P traffic, for example, or any sort of related activity which sees increasing interest. Some initial analysis must be done on the payloads prior to analyzing them in this fashion, or a high number of false positives is likely to be thrown for P2P requests. Furthermore, it seems important to periodically flush out the list of buffers, providing a sliding window of observations. Otherwise, these thresholds will eventually be hit simply because observations accumulate over time. And finally, it's important to keep a baseline of observations of payloads to avoid detecting normal traffic changes as new worms.
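To make the threshold idea concrete, here is a rough sketch of my reading of it, not the paper's implementation. It flags a payload only when it is seen often enough and from enough distinct sources to enough distinct destinations, and it expires old observations to give the sliding window mentioned above. The class name `ContentSifter`, the parameter names, and the default threshold values are all assumptions made up for illustration. It also keys on whole payloads and keeps exact per-payload lists, which is fine for a toy but is not how you would do this at line rate.

```python
from collections import defaultdict


class ContentSifter:
    """Toy sketch: flag content that is both prevalent and widely dispersed.

    All names and default thresholds here are illustrative assumptions.
    """

    def __init__(self, prevalence_min=3, sources_min=2, dests_min=2, window=60.0):
        self.prevalence_min = prevalence_min  # min sightings of the same payload
        self.sources_min = sources_min        # min distinct senders
        self.dests_min = dests_min            # min distinct receivers
        self.window = window                  # seconds of history to keep
        # payload -> list of (timestamp, src, dst) observations
        self.seen = defaultdict(list)

    def observe(self, now, src, dst, payload):
        """Record one payload; return True if it crosses all three thresholds."""
        obs = self.seen[payload]
        obs.append((now, src, dst))
        # Sliding window: drop observations older than `window` seconds.
        cutoff = now - self.window
        obs[:] = [o for o in obs if o[0] >= cutoff]
        sources = {s for _, s, _ in obs}
        dests = {d for _, _, d in obs}
        return (len(obs) >= self.prevalence_min
                and len(sources) >= self.sources_min
                and len(dests) >= self.dests_min)


if __name__ == "__main__":
    sifter = ContentSifter()
    print(sifter.observe(0, "a", "x", b"EXPLOIT"))
    print(sifter.observe(1, "a", "y", b"EXPLOIT"))
    print(sifter.observe(2, "b", "x", b"EXPLOIT"))
```

Note that requiring several distinct sources is what keeps an ordinary one-to-many server response from tripping the detector, and the window is what keeps counts from creeping up to the thresholds over days of normal traffic.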
Otherwise this is a worthwhile paper and it seems to have proven itself in field testing.
January 3, 2005 in papers | Permalink
Comments
> This paper is very much like another
> EarlyBird paper posted here recently:
Hmmm, while they're both similar in that they deal with worms, I think that's where the similarity ends. Barreno's paper focuses on using limited observables (src, dst, pkt size) to infer the presence of infection (presumably to drive the quarantine of infected hosts). By contrast, our paper focuses on inferring content signatures for new worms. Signatures inherently have more power (and thus more challenges) since they can be shared and then used independent of detection. Perhaps a better comparison point is Kim and Karp's Autograph paper, which has similar goals and approaches to our own. That said, the idea of automated signature extraction has been around in some form since Kephart and Arnold's work in the early '90s. Thus, it's likely that the biggest contribution of our work is the algorithms to do this at line rate in the network.
> I'm concerned a bit about the algorithm
> described in Figure 1, however. It uses
> simple thresholds to detect the presence
> of multiple targets.
Actually, this isn't to detect _targets_ per se, but filters for strings demonstrating epidemic behavior (seen at a variety of sources and destinations). That said, the paper does not discuss how these parameters should be automatically set or adjusted -- which is definitely an important point.
> However, this seems like it would be
> indistinguishable from P2P traffic, for
In practice, this is only true for p2p headers, not p2p content. This is because few pieces of p2p content truly exhibit disproportionate many-to-many transfers from a given network vantage point. The one key exception is BitTorrent which effectively replicates its content over short time scales -- and thus will cause false positives for the algorithm as described in the paper.
> example, or any sort of related activity
> which sees increasing interest.
Actually, this isn't the case in practice (note that the _increasing_ characteristic of the prevalent content is not used by the content sifting algorithm). In production use, there are really very few true false positives. While there are a large number of detections which are suppressed via whitelisting, the size of this class is only proportional to the number of very popular protocol headers (e.g. e-mail, Web, a few p2p protocols, etc.) and the whitelist is easily generated procedurally. The real (i.e. alerted) false positives are almost all relayed SPAM or BitTorrent. Now, while it's possible there are other sources of false positives that we haven't encountered, we have over a year of use in multiple large networks, so such issues at least aren't extremely common.
> Furthermore, it seems important to
> periodically flush out the list of
> buffers, providing a sliding window of
> observations.
Absolutely, although this is more for garbage collection purposes (i.e. resource consumption) than to prevent false positives. This is actually a more significant issue for extracting signatures for slow worms where you want to preserve state across longer time scales.
> And finally, it's important to keep a
> baseline of observations of payloads to
> avoid detecting new worms.
I don't think this is accurate (or at least I don't understand it).
- Stefan
Posted by: Stefan Savage | Jan 3, 2005 12:29:27 PM
"I don't think this is accurate (or at least I don't understand it)."
i misspoke, and i have edited the text to refer to what i meant to say. basically traffic changes over time can, in some circumstances, mimic worm attributes, albeit slowly. a dynamic baseline can help alleviate these false positives.
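as a rough illustration of what a dynamic baseline could look like, here is a minimal sketch that tracks an exponentially weighted moving average of per-interval counts and only flags a count that jumps well above it. this is an assumption about one way to do it, not anything taken from the paper; the class name `DynamicBaseline` and the `alpha` and `factor` parameters are made up for the example.

```python
class DynamicBaseline:
    """Toy sketch of a moving baseline (hypothetical names and parameters)."""

    def __init__(self, alpha=0.1, factor=3.0):
        self.alpha = alpha      # weight given to the newest interval
        self.factor = factor    # how far above baseline counts as anomalous
        self.baseline = None    # no history until the first interval

    def update(self, count):
        """Return True if `count` is far above the baseline, then fold it in."""
        if self.baseline is None:
            # First interval: nothing to compare against yet.
            self.baseline = float(count)
            return False
        anomalous = count > self.factor * self.baseline
        # Exponentially weighted moving average of the per-interval counts.
        self.baseline = (1 - self.alpha) * self.baseline + self.alpha * count
        return anomalous


if __name__ == "__main__":
    b = DynamicBaseline(alpha=0.5)
    for count in (10, 12, 14, 17, 60):
        print(count, b.update(count))
```

slow drift gets absorbed into the baseline, while a sudden jump stands out against it.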
Posted by: jose | Jan 3, 2005 5:51:52 PM
Does anyone here know where I can find the prototype code for Earlybird automated worm fingerprinting?
Posted by: William | Aug 25, 2009 10:40:39 PM