Primer: How GANs can be used to create synthetic data.
In our previous article we introduced Generative Adversarial Networks (GANs). We spoke about their enormous potential to create new, realistic, synthetic data across many industries, and the benefits this will bring. Here we’ll look at the relationship between GANs and other forms of machine learning.
We have previously discussed some of the pros and cons of different forms of machine learning with respect to their application to cybersecurity. Let’s do a short recap here and see where GANs fit in.
Unsupervised machine learning is used to explore and find structure in data we know little about. It’s ideal for a first analysis when the data is not labelled. For instance, we can use it to cluster data based on similarity, or to help identify anomalies in the data. In cybersecurity, there is always an enormous amount of data to be handled, and the rate of increase in volume is unrelenting. This means that our learning methods need to be quick and able to scale. Much of the data we see in cybersecurity is unlabelled, which makes unsupervised machine learning an ideal tool.
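As a toy illustration of the anomaly-detection idea above, the following sketch flags values that sit far from the bulk of an unlabelled sample. The file sizes, the function name find_anomalies and the cut-off of 5 are assumptions invented for this example, and a median-based score is only one of many possible techniques.

```python
from statistics import median

def find_anomalies(values, cutoff=5.0):
    """Return values whose distance from the median exceeds cutoff * MAD."""
    centre = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        return []
    return [v for v in values if abs(v - centre) / mad > cutoff]

# Hypothetical, unlabelled file sizes in KB (assumed data, not real telemetry).
sizes_kb = [120, 130, 125, 128, 122, 127, 900, 124]
anomalies = find_anomalies(sizes_kb)
print(anomalies)
```

No labels are needed here: the sample itself defines what counts as unusual.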
We apply supervised machine learning when there is a specific property of the data that we care about and we want to make an analysis based on that property. For instance, in cybersecurity, supervised learning can be used to detect whether a file is malicious or not, to decide whether a web page is a phishing page, or to classify threats in an IoT setting.
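To make the supervised setting concrete, here is a minimal sketch of a nearest-centroid classifier trained on labelled samples. The two features (byte entropy and a count of suspicious API calls), the training values and the function names are all hypothetical, chosen only to show how labels drive the learning; a production detector would look very different.

```python
from math import dist

def fit_centroids(samples, labels):
    """Average the feature vectors of each class into one centroid per label."""
    grouped = {}
    for features, label in zip(samples, labels):
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical labelled training set: (byte entropy, suspicious API call count).
samples = [(4.1, 0), (4.8, 1), (5.0, 0), (7.6, 5), (7.9, 7), (7.2, 4)]
labels = ["benign", "benign", "benign", "malicious", "malicious", "malicious"]

centroids = fit_centroids(samples, labels)
print(predict(centroids, (7.5, 6)))
print(predict(centroids, (4.5, 1)))
```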
Semi-supervised learning is a class that lies between unsupervised and supervised learning. Here we have the best of both worlds, using both labelled and unlabelled data. Semi-supervised learning is particularly useful in the field of malware detection. There we can combine the benefits of labelled data annotated by expert analysts with additional information about the geometric structure of the data, which we can infer from the vast amount of available unlabelled samples.
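One simple way to picture how unlabelled samples contribute geometric structure is label propagation: each unlabelled point takes the label of its nearest already-labelled neighbour, starting with the closest pairs. The sketch below is a toy version under that assumption; the points, labels and the function name propagate_labels are invented for illustration and do not describe any particular product.

```python
from math import dist

def propagate_labels(labelled, unlabelled):
    """Spread labels from (point, label) pairs to unlabelled points.

    At each step the unlabelled point closest to any labelled point
    inherits that point's label and joins the labelled set.
    """
    known = list(labelled)
    pending = list(unlabelled)
    while pending:
        point, (_, label) = min(
            ((p, k) for p in pending for k in known),
            key=lambda pair: dist(pair[0], pair[1][0]),
        )
        known.append((point, label))
        pending.remove(point)
    return known

# Two analyst-labelled samples and a larger set of unlabelled ones (assumed data).
labelled = [((0.0, 0.0), "benign"), ((10.0, 0.0), "malicious")]
unlabelled = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0),
              (5.0, 0.0), (6.0, 0.0), (9.0, 0.0)]

result = dict(propagate_labels(labelled, unlabelled))
print(result[(6.0, 0.0)], result[(9.0, 0.0)])
```

In this toy data the point (6.0, 0.0) lies closer to the malicious seed than to the benign one, yet it ends up labelled benign because it is connected to the benign seed by a chain of nearby unlabelled points. That is the kind of structure labelled data alone would miss.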
Deep learning is a special class of machine learning. It can be supervised or unsupervised and is based on deep neural networks. The term ‘deep’ refers to the fact that the neural networks consist of multiple layers, each corresponding to a different representation of the data. We sometimes refer to methods that do not use deep learning as ‘shallow learning’.
Both unsupervised and supervised machine learning typically use special features of the data to develop an intermediate representation of the data in a higher-dimensional vector space. In our case this space describes the features of the file that make it – or just as importantly, do not make it – malware. The main distinguishing property between deep learning and traditional methods is how one acquires those features. Traditional methods rely on separate feature extraction processes based on handcrafted features. Deep learning starts with the raw data. It then derives the features as part of the training process in the first layers.
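To show what a handcrafted feature extraction step might look like, the sketch below turns the raw bytes of a file into a small feature vector. The three features chosen (size, byte entropy and share of printable ASCII) and the function names are assumptions for illustration only; real feature extraction engines use far richer, domain-specific features.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )

def extract_features(data: bytes) -> dict:
    """Map raw file contents to a few hand-picked numeric features."""
    printable = sum(32 <= b < 127 for b in data)
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "printable_ratio": printable / len(data) if data else 0.0,
    }

# Illustrative inputs: plain text versus uniformly spread byte values.
text_features = extract_features(b"hello world, this is plain text")
uniform_features = extract_features(bytes(range(256)))
print(text_features)
print(uniform_features)
```

A deep learning model would instead receive the raw bytes and learn its own internal features during training.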
This makes deep learning an ideal approach when we lack powerful feature extraction techniques upon which to train the model. Although best known for their use in computer vision, deep learning models are not uncommon in cybersecurity. While they avoid the investment required to develop powerful feature extraction engines, this benefit can come at the price of high false-positive rates and slow training times.
In practice, there is no one-size-fits-all approach. The best technique depends on the data and the problem at hand. Whether we use features hand-crafted by expert analysts or deep learning techniques such as convolutional neural networks, it is important to apply the right technique at the right time for the best result.
GANs in their most common form are a type of unsupervised learning because the data we start with is unlabelled. This is commonly a collection of images, video or binary files. The goal of the GAN is to learn about the structure of the data.
Although GANs are generally a form of unsupervised machine learning, they also incorporate aspects of supervised learning. Internally, the discriminator sets up a supervised learning problem. Its goal is to learn to distinguish between the two classes of ‘synthetic’ data and ‘original’ data. The generator then also considers this classification problem and tries to find adversarial examples, i.e. samples which will be misclassified by the discriminator.
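The adversarial game described above can be made concrete through the loss each player tries to minimize. The sketch below assumes the discriminator outputs the probability that a sample is original, and uses the binary cross-entropy losses found in many GAN implementations, including the commonly used ‘non-saturating’ generator loss. The function names are our own and the probabilities are invented for illustration.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Cross-entropy for classifying original samples as 1 and synthetic as 0.

    d_real: discriminator outputs on original samples, each in (0, 1).
    d_fake: discriminator outputs on synthetic samples, each in (0, 1).
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating loss: low when synthetic samples are scored as original."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs in two situations.
sharp_real, sharp_fake = [0.9, 0.8], [0.1, 0.2]   # discriminator is winning
fooled_fake = [0.9, 0.8]                          # generator is winning

print(discriminator_loss(sharp_real, sharp_fake))
print(generator_loss(sharp_fake), generator_loss(fooled_fake))
```

The two losses pull in opposite directions: outputs that lower the discriminator's loss raise the generator's, and vice versa.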
Recently there have also been GAN architectures which work in a semi-supervised learning setting, which means that in addition to the unlabelled data, they also incorporate labelled data into the training process.
To implement the discriminator and generator, we typically choose to use some form of deep neural network, e.g. a convolutional network or a simple multilayer perceptron. The basic structure of GANs (the adversarial game between discriminator and generator) works with any kind of machine learning model. It does not necessarily require a deep one. However, a deep neural network is essential to really unleash the potential of GANs. This is particularly true for complex data such as images, audio or executable files, because deep neural networks can represent very complex functions, which benefits both the discriminator and the generator when creating realistic synthetic data.
In cybersecurity, any machine learning system’s ability to identify malware is limited by the quality of its training set. Consequently, the dataset upon which the system trains should be representative of all the different types of malware that have been or could be detected. Regular updates to the training set are also required. This is why speed is of the essence, and slow retraining times can leave systems vulnerable. This is where GANs can help us.
GANs are capable of generating new samples that follow the same distribution as the initial data. Thus, in cybersecurity they are immensely useful in helping us learn about the structure of the data distribution for novel types of malware. They also help provide insight into the process used by malware authors to generate new instances of a malware family. GANs are able to perform this task without the need to have an explicit model of the probability distribution of the data beforehand. The GAN learns by simply sampling the provided data.
We then make our models more robust and resistant to adversarial attacks by feeding the newly generated data back into our learning processes. In this way, GANs help us to anticipate actions by malicious attackers. They help us develop a better training set on which we can build our machine learning models. Ultimately, the more information we have with which to train our machine learning models, the better we will become at predicting novel forms of malware.
We expect that both cybersecurity companies and malware authors will start to use GANs, or similar techniques, to improve their capabilities. The good guys will use GANs to extend and harden their training sets. The bad guys will use them to create data with which to attack the good guys. Of course, it’s not quite as simple as that. How we protect ourselves against this new form of malware production is the subject of our next blog.