Machine learning (or artificial intelligence) is a must-have for scaling malware detection. But what type of machine learning should you look for, and how should it be applied?
This article is the second in our series looking at machine learning techniques. In the first, we explored supervised and unsupervised machine learning, how they differ and when to apply them. Here we consider what’s meant by Deep Learning and its application to cyber-security.
Deep Learning is a particular approach to machine learning based on deep neural networks, that is, networks composed of many layers and loosely inspired by biological neural systems. Artificial neural networks have recently been very successful in a number of applications, and are used to estimate functions that can depend on large numbers of inputs.
A common way to describe how Deep Learning functions is that the more layers that can be introduced, the more the system begins to resemble a ‘functioning brain’. This enables it to adapt and learn from history. However, the ‘brain’ analogy is somewhat overstretched. A more practical and technically accurate description is that Deep Learning stacks together thousands of elementary mathematical functions until the result is the complex function required to explain your data.
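To make the idea of stacking elementary functions concrete, here is a minimal sketch in plain Python. The helper names (`affine`, `relu`, `compose`) and the weights are our own illustrative choices and are not learned from any data; a real network would use vectors and matrices, far more layers, and a training procedure to find the weights.

```python
def affine(weight: float, bias: float):
    """Return the elementary function x -> weight * x + bias."""
    return lambda x: weight * x + bias


def relu(x: float) -> float:
    """Elementary nonlinearity: pass positive values, clamp negatives to zero."""
    return max(0.0, x)


def compose(*functions):
    """Stack functions so the output of each one feeds the next."""
    def stacked(x):
        for f in functions:
            x = f(x)
        return x
    return stacked


# Four simple layers, with hand-picked (not learned) weights.
model = compose(affine(2.0, -1.0), relu, affine(-1.0, 1.0), relu)

for x in (0.0, 0.75, 2.0):
    print(x, model(x))
```

None of the four pieces is interesting on its own, yet stacked together they produce a function that is flat, then falls, then is flat again. Adding more layers allows more elaborate shapes, which is the sense in which depth buys expressive power.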
Deep learning draws on fundamental concepts used in many different areas of machine learning and is complementary, not superior, to other machine learning approaches. To make a distinction between machine learning and Deep Learning is, in fact, somewhat artificial – the basic building blocks used in Deep Learning are essentially the same as in other machine learning approaches. However, Deep Learning does differ from ‘traditional’ machine learning in that its starting point is the raw data. It learns the features that represent the data, as opposed to being guided by human intervention, in order to build a model and draw conclusions about the data.
While Deep Learning is currently ‘state-of-the-art’ for many applied AI tasks, especially in the computer vision field, as with other machine learning techniques it has advantages and disadvantages when applied to malware detection.
An advantage of Deep Learning is the way it starts at the raw data and aims to avoid the need for explicit feature engineering (which is one of the most time-consuming parts of supervised machine learning). This approach works well in areas such as image classification where the data (a collection of pixels) is a reflection of the true nature of the sample that needs to be classified.
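The difference between engineered features and raw input can be sketched in a few lines of Python. This is an illustration only: the function names and the choice of features (file size and byte entropy) are our assumptions for the example, not a description of any particular product's feature set.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def engineered_features(data: bytes) -> dict:
    """Hand-designed summary: a human decided these numbers matter."""
    return {"size": len(data), "entropy": byte_entropy(data)}


def raw_byte_input(data: bytes, length: int = 16) -> list:
    """Raw-data view: the first `length` bytes, zero-padded and scaled to [0, 1].
    A Deep Learning model would be left to find its own features in this."""
    padded = data[:length].ljust(length, b"\x00")
    return [b / 255.0 for b in padded]


sample = b"MZ\x90\x00example bytes"
print(engineered_features(sample))
print(raw_byte_input(sample))
```

In the first case an analyst has already encoded some knowledge into the input. In the second, the model sees only a sequence of numbers, much as an image classifier sees only pixels.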
We have an abundance of data available in the malware industry, and we can use this data to train the classifier. This plays to the strength of Deep Learning, which is why it is sometimes used. However, the huge amount of training data required for Deep Learning systems becomes a challenge when data is obfuscated or encrypted. This is a common issue in polymorphic malware detection. We will look at the specific problem this creates, and how to fix it, in later blog posts.
Suffice it to say that Deep Learning requires a very large quantity of data to detect meaningful patterns. This is because, unlike other machine learning techniques, it works from the raw data and so cannot benefit from the intelligence gained from deeper levels of binary or behavioural analysis that come from emulation, unpacking or de-obfuscation.
Deep Learning systems are extremely computationally expensive to train. In other words, they need a lot of processing power to learn about the data. Even when the system trains on a modest amount of data, it can take weeks to build a Deep Learning model using hundreds of machines and powerful graphics processing units (GPUs). As a result, Deep Learning systems cannot be retrained quickly to adapt to new families of malware. The risk is that users are left vulnerable if the system misclassifies a piece of malware and cannot be corrected in time.
There’s no doubt that Deep Learning is a very powerful technique. However, there may be simpler, faster and more effective methods to identify malware than Deep Learning, or even machine learning in general. It’s important to understand the problem we’re trying to solve first, rather than committing to a single technology and applying it to every problem.
We use an ensemble of different machine learning techniques, ranging from linear models such as logistic regression to nonlinear models such as kernelized support vector machines and random forests. For problems where it is the best choice, we also use Deep Learning techniques such as convolutional neural networks. Each technique is chosen for the benefits it brings to the task at hand, for example malware detection, phishing detection or the detection of abnormal network flows. It all depends on the needs of the user and the problem we are trying to solve.
We are always looking to choose the optimal mechanism to identify malware. Sometimes this will simply be a lookup of a hash, because the file is already known. In other cases a powerful scanning engine can identify the malware, and sometimes it takes a combination of sandboxing and machine learning.
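The idea of trying the cheapest mechanism first can be sketched as a simple pipeline. Everything here is hypothetical and for illustration only: the blocklist contents, the `EVIL_MARKER` signature and the `ml_model` stand-in (which merely counts high-valued bytes) are assumptions, not real detection logic.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

# Hypothetical blocklist of SHA-256 digests for files already known to be bad.
KNOWN_BAD_HASHES = {hashlib.sha256(b"known-bad-sample").hexdigest()}


def hash_lookup(data: bytes) -> Optional[str]:
    """Cheapest check: is this exact file already known?"""
    digest = hashlib.sha256(data).hexdigest()
    return "malicious" if digest in KNOWN_BAD_HASHES else None


def signature_scan(data: bytes) -> Optional[str]:
    """Stand-in for a scanning engine: look for a made-up byte signature."""
    return "malicious" if b"EVIL_MARKER" in data else None


def ml_model(data: bytes) -> Optional[str]:
    """Placeholder for a trained model; a real one would be far more involved."""
    if not data:
        return None
    high_byte_ratio = sum(1 for b in data if b >= 0x80) / len(data)
    return "malicious" if high_byte_ratio > 0.5 else None


# Stages are ordered from cheapest to most expensive.
STAGES: List[Tuple[str, Callable[[bytes], Optional[str]]]] = [
    ("hash_lookup", hash_lookup),
    ("signature_scan", signature_scan),
    ("ml_model", ml_model),
]


def classify(data: bytes) -> Tuple[str, str]:
    """Return (stage that decided, verdict), stopping at the first verdict."""
    for name, stage in STAGES:
        verdict = stage(data)
        if verdict is not None:
            return name, verdict
    return "default", "clean"


print(classify(b"known-bad-sample"))
print(classify(b"header EVIL_MARKER payload"))
print(classify(b"hello world"))
```

The design point is the early exit: an already-known file never reaches the more expensive stages, so the heavier analysis is reserved for the samples that actually need it.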
We believe it’s the size of the toolbox that matters – and choosing the right tools for the job.
Look out for our upcoming blog series on the technologies that support the application of machine learning to malware analysis. In the meantime, take a look at our white papers for more detail on Applied AI.