Rich: Adobe hosts a number of Open Source projects on SourceForge, on its Open@Adobe site. These projects are developed by Adobe employees, and I recently spoke with Karthik Raman, who has worked on a project called Malware Classifier. Malware Classifier is a set of machine learning algorithms for identifying malicious versus clean binaries on Win32 operating systems.
If you’d like to have your project featured on the SourceForge podcast, just drop me a note and we’ll schedule something.
Here’s my conversation with Karthik.
Rich: We’re talking about the Malware Classifier.
Karthik: Right. It’s a project up on the Adobe Open Source site, it’s called Malware Classifier.
Rich: Tell me something about the Malware Classifier. What is it trying to accomplish?
Karthik: It’s a tool that uses machine learning to try to quickly determine whether a binary under analysis – a Win32 binary – is possibly malware, or a clean file. It uses four machine learning algorithms that were generated by running certain classifiers against a data set of about 100,000 malicious programs and 16,000 clean programs. This is part of some research I did when I was a grad student at U.C. Irvine, and something I continued when I started working at Adobe about a year and a half ago. The tool released on SourceForge is the culmination of that research, and it incorporates distilled versions of four of the six classifiers that I used.
Rich: When you say it uses a machine learning algorithm, does it have a feedback loop where you tell it whether it’s correct or not, and then it learns further from that, or it’s just based on the data that it’s already got?
Karthik: It’s more the latter. There’s no learning happening within the source code itself; it’s the result of training that happened in advance, by running the classifiers against the data set that we discussed a minute ago. The results of that training were incorporated into the script. If you think about it, it really is simple. I’ve labelled the four algorithms – the classifiers – that I’m using. If you look at the Python source code, there’s a bunch of decision trees that incorporate the learning the algorithms picked up when they were trained on the data set. My hope is really that people will look at this stuff and, if they’re interested in machine learning and malware classification, either use the tool themselves or extend it by running their own machine learning algorithms – extending the current set of four classifiers, or writing their own.
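To make the idea of a pre-trained tree "baked into" a script concrete, here is a minimal Python sketch. The feature names (debug directory size, image version, import address table RVA) and the thresholds are hypothetical illustrations, not the actual features or values from the Malware Classifier source:

```python
# Sketch of a pre-trained decision tree hard-coded into a script.
# Features and thresholds are hypothetical, not the project's real values.

def classify(debug_size, image_version, iat_rva):
    """Label a Win32 binary from three PE header fields."""
    if debug_size <= 0:            # many malicious samples strip debug info
        if iat_rva <= 256:
            return "malicious"
        return "clean"
    if image_version <= 0:
        return "malicious"
    return "clean"

print(classify(0, 0, 128))         # prints: malicious
```

Once the tree is distilled into plain conditionals like this, classification needs no machine-learning library at runtime, which is what lets the released tool stay a small standalone script.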
Rich: I’m curious if your research gave you any idea of how the classification of “malicious” might evolve over time. Would a tool like this run on today’s software work on software from six years ago, or six years in the future, do you think? Do the “malicious” techniques tend to persist over time?
Karthik: I have to take a couple of steps back and talk about what it is these classifiers use to determine whether something is possibly malicious or clean. I used a technique called “feature reduction”. The end product of my research was that I identified seven features within the file format these binaries are compiled in – the P.E., or Portable Executable, format – and essentially the values of these seven features are compared in a large decision tree in each of the four classifiers. The second aspect of your question is whether the classifiers would be relevant to files from six years ago, or files compiled in the future. In general, a problem in machine learning is the possibility of training our models to fit the data at hand – over-fitting the problem – and I accept that’s a valid concern here as well because, for example, I only had 16,000 clean programs to train with against 100,000 malicious programs. You could argue that the algorithms learned that malicious programs all share the characteristics of those 100,000 files, and clean programs all share only the characteristics of the 16,000 files. So there’s the disparity in the size of the data set, and also the specificity of the files used in it. The overall purpose of this project was to evangelize the idea that one could use machine learning, with a limited number of features, to solve a problem within given parameters, with an established false positive and true positive rate. I don’t expect this program to be used commercially; it’s the idea that I’m trying to spread by the use of this tool.
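The class-imbalance concern can be made concrete with a quick back-of-the-envelope check using the dataset sizes mentioned above: a degenerate classifier that labels everything malicious already scores high on raw accuracy, which is why the false-positive and true-positive rates matter more than accuracy alone.

```python
# With 100,000 malicious vs 16,000 clean training samples, a classifier
# that always answers "malicious" already looks ~86% accurate - raw
# accuracy says little about a model trained on imbalanced data.
labels = ["malicious"] * 100_000 + ["clean"] * 16_000
predictions = ["malicious"] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"{accuracy:.1%}")  # prints: 86.2%
```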
Rich: Are you looking for a community of people to become involved in this project to move it forward, or is it pretty much done?
Karthik: I’ve given talks at a few conferences on this topic, and I’m hoping the community will look at that research and, if they’re interested, pick it up. I’ve outlined the methods and techniques and given the background on how someone could be introduced to machine learning and follow the line of research that I did myself. So, yes, I’m hopeful that other people look at this and build on it for their own environments. One example that comes to mind readily: a lot of people work in research or analysis for I.T. companies. They could look at the unknown binaries their environments receive and, if their antivirus programs are lagging behind, train their models over time with the binaries in their environment, and then extend the script so that it grows with their particular environment. I am hopeful that the community picks up on this idea and goes to town with it.
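As a hypothetical sketch of that kind of environment-specific retraining, one could re-fit even a single decision threshold from locally labelled binaries. The feature (debug directory size) and the helper below are illustrative assumptions, not part of the released script:

```python
# Hypothetical sketch: re-fit one decision threshold from binaries labelled
# in your own environment. 'debug_size' as the feature is illustrative only.

def best_cutoff(samples):
    """samples: (debug_size, label) pairs, label 'malicious' or 'clean'.
    Returns the cutoff maximizing accuracy on the local training set,
    predicting 'malicious' when debug_size <= cutoff."""
    def accuracy(cut):
        hits = sum(("malicious" if size <= cut else "clean") == label
                   for size, label in samples)
        return hits / len(samples)
    return max(sorted({size for size, _ in samples}), key=accuracy)

local = [(0, "malicious"), (0, "malicious"), (512, "clean"), (1024, "clean")]
print(best_cutoff(local))  # prints: 0
```

Re-running this whenever new local samples are labelled is the simplest form of the "train your models over time" loop described above; a fuller version would refit whole trees rather than one threshold.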
There’s one aspect that I covered when speaking about this research at conferences: you need to be a domain expert in whatever domain you’re trying to apply machine learning to. You have to understand what you’re doing when you apply machine learning to that domain. I have some experience being a malware analyst, and that helped me along the way as I was determining which features to use. I think in the end it comes back to looking at the research. There are technical papers, hundreds of published slides, and the source code is available. Machine learning is, in my opinion, an underutilized technique in computer security in general, so I’m really hopeful that people out there who are keen on it will find this research at the vanguard of what the community is interested in doing, look at the paper, the slides, and the tools, build on them, and help make security better for everyone.
Rich: That’s really an interesting point. I think that people not involved with this field of programming tend to assume that you just sic the computer on things and it figures stuff out. It’s interesting that you point that out.
Karthik: This isn’t a panacea. You can fit it to your problem, but you have to have some knowledge of the problem so that the solution can be brought to bear correctly.
Rich: Thanks so much for taking a few minutes to talk to me.
Karthik: My pleasure, Rich.