How to detect malware using computer vision

Malware is still a growing problem, and most users and companies try to protect themselves with classic antivirus software. This means that antivirus vendors take care of finding and detecting every new piece of malware each day, calculating its fingerprints, and protecting their users by sending those fingerprints (signatures and heuristics) to the clients’ antivirus software. In this system, the clients’ antivirus software receives the fingerprints of all the bad boys out there daily.

This system worked out pretty well until attackers changed their strategy for using their freshly created malware. A decade ago, attackers spread their newly created malware across the web, trying to attack as many victims as possible. Antivirus vendors could easily find those viruses and protect their clients.

But back in 2013 (!), a cyber security analysis by FireEye [1] found that 82 percent of malware disappears after an hour, and 70 percent of malware only exists once. This short lifespan means just a small percentage of antivirus detection signatures catch active threats. The rest just hunt ghosts. Though some companies have introduced new strategies to combat these adaptations, they haven’t been enough to fully keep up with fast-moving threats.

To tackle this problem, big companies spend hundreds of thousands of dollars annually to employ full IT security teams that build and maintain file and malware analysis systems to analyze and classify every incoming file themselves. Besides being extremely expensive and requiring deep knowledge, those systems are not known for their performance. It is not unusual for a single analysis to take up to 15 minutes, especially in scenarios with hundreds or thousands of analyses per minute.

Our opinion is that there must be a more intelligent way to detect unknown malware. Talking about intelligence, Artificial Intelligence (AI) is still receiving tremendous hype. It feels like a new AI architecture is presented every month that puts the previous one in the shade. Sometimes it doesn’t even feel like research, but like an ongoing war for supremacy in distinguishing cats from dogs, or over which AI is better at deepfaking another celebrity. In specific image classification tasks, AI is sometimes very good at distinguishing one class from another, but on its own that does not help at all in detecting malware. The reason for this is pretty simple: those AI architectures aim to classify images, but image data is dramatically different from the byte streams that represent PDF, MS Office or PE files. Making those strong AI systems compatible with another type of data requires heavy adjustments to the architectures, which leads to bad results. That’s why others quickly rejected this approach, but we had a crazy idea.

We thought that instead of adapting the AI architectures to fit the problem of malware detection, we could try to adapt computer files so that they fit those strong AI architectures as they are.

The idea behind this is pretty simple, and human medicine offers a fitting analogy. If someone feels sick, they visit the doctor. The doctor takes an X-ray and tries to identify unusual patterns or structures in the shadows. From this, the doctor might not be able to make an exact diagnosis, but he or she can be sure that something is wrong at a certain spot.

Now that we have the idea, it is time to understand how AI is able to understand image data using computer vision.

Computer Vision

First things first. In order to apply an AI (artificial intelligence) to the generated images, some things have to be considered. First, we need to understand what computer vision is and how it works. At a high level, computer vision should enable the computer to understand what is visibly shown inside a picture. Without computer vision, the computer may know whether a file is an image or a text file, but it does not know what the subject of the image or the text is. As in the following animation, with computer vision the computer is able to identify persons as persons and cars as cars.

For several years now, computer vision has received a tremendous amount of attention from the research sector with regular releases of new and more and more complex state-of-the-art architectures.

But the general idea behind those architectures is fairly simple. All of them try to generalize complex information so that a minimal representation is achieved which retains as much expressiveness about the original image as possible.

For example, everyone knows Super Mario. We loved to play it as children, and still love to play it as adults. With every new release Mario changed slightly and got new details. Computer vision aims to bring complex Mario back to his roots. For this purpose, computer vision uses so-called “Convolutional Neural Networks” (CNNs) that consist of multiple layers. Each layer strips away more detail, keeping just the most relevant information.
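The detail-reduction idea can be illustrated with a toy sketch. A real CNN combines learned convolution filters with downsampling steps; the snippet below only shows the downsampling half (2x2 max pooling) on a hand-made grayscale grid. The function name `max_pool_2x2` and the sample patch are assumptions made purely for illustration.

```python
def max_pool_2x2(image):
    """Halve width and height by keeping the brightest pixel of each 2x2 block."""
    pooled = []
    for row in range(0, len(image) - 1, 2):
        pooled_row = []
        for col in range(0, len(image[row]) - 1, 2):
            block = (
                image[row][col], image[row][col + 1],
                image[row + 1][col], image[row + 1][col + 1],
            )
            pooled_row.append(max(block))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 grayscale patch (0 = dark, 9 = bright).
patch = [
    [1, 3, 0, 0],
    [2, 9, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 6, 7],
]

once = max_pool_2x2(patch)   # 4x4 -> 2x2
twice = max_pool_2x2(once)   # 2x2 -> 1x1
print(once)    # [[9, 1], [1, 7]]
print(twice)   # [[9]]
```

Each pass throws away three quarters of the pixels but keeps the strongest signal of every region, which is roughly what "bringing Mario back to his roots" means here.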

Reducing all this information makes it way easier for the neural network to learn to identify any version of Mario as Mario, as all the different versions of Mario will end up in very similar minimal representations. Now that we know how computer vision works, it is time to find a way to convert our malware data into images, so that we can apply computer vision to it.

Converting Data into images

Converting data into images is way easier than you might think, because in the end digital data is just a sequence of bits (0s and 1s). The easiest way is to convert 0s into white pixels and 1s into black pixels, followed by a line break after every X (e.g. 256) bits.

bitconversion

This conversion method will generate black and white images with the width of 256 pixels and a height depending on the input data size.
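The conversion described above can be sketched in a few lines of plain Python. This is a minimal sketch, not the tooling used for the images in this post: the function names, the most-significant-bit-first order and the white padding of the last row are all assumptions. The plain-text PBM format is used as output simply because it already defines 0 as white and 1 as black.

```python
def bytes_to_bit_rows(data, width=256):
    """Turn raw bytes into rows of bits (0 = white, 1 = black), `width` bits per row."""
    bits = []
    for byte in data:
        # Most significant bit first; the bit order is an assumption.
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    # Pad the last row with white pixels so every row has the same width.
    remainder = len(bits) % width
    if remainder:
        bits.extend([0] * (width - remainder))
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def rows_to_pbm(rows):
    """Serialize bit rows as a plain-text PBM (P1) image, where 1 means black."""
    width = len(rows[0]) if rows else 0
    lines = ["P1", f"{width} {len(rows)}"]
    for row in rows:
        text = "".join(str(bit) for bit in row)
        # Plain PBM recommends short text lines, so wrap long pixel rows.
        lines.extend(text[i:i + 64] for i in range(0, len(text), 64))
    return "\n".join(lines) + "\n"


# Tiny demo with a width of 8 so the result stays readable.
demo_rows = bytes_to_bit_rows(b"\xf0\x0f", width=8)
demo_pbm = rows_to_pbm(demo_rows)
print(demo_pbm)
```

For a real file, the same two calls would be applied to the bytes read from disk with the default width of 256, and the returned string written to a `.pbm` file that most image viewers can open.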

Now we can use this method to visualize binary data. Even with this very rudimentary conversion technique we can gain interesting insights about the data. Let’s take a look at the Linux tools id, sed and grep:

bit_by_bit-id-sed-grep

A few things are obvious now. The tool id is clearly smaller than the others, but the similarities between the images are clearly visible.

So even with this very simple encoding, even humans are able to recognize different files of the same file type.

With this in mind, we can now try to convert malware into images. Let’s take a look at some Microsoft Windows malware samples:

The differences between those images are clearly visible. Even more, with this method it was possible to sort >500GB of malware samples into their corresponding malware families (e.g. Dropper, Phishing, Ransomware, etc.) with an accuracy of over 99%.

At this point, one could think: if we are able to sort any file into its specific malware family, then we know whether the file is malware or not. Sadly, in reality it doesn’t work like this.

In real-life scenarios, not a single piece of malware comes “naked” to a client. The attacker always tries to make it as hard as possible for the victim to detect the malware. For this, attackers use tools like obfuscators [2] to make the file hard for humans to interpret, add unneeded noise such as known benign patterns, use cryptors to encrypt some parts of the file, or use protectors, different encodings, multiple compression stages via packers, … the possibilities are nearly infinite.

To get an idea of how malware operates and how to analyze it, check out our blog post “Exploit, steganography and Delphi: unpacking DBatLoader”.

And last but not least, over 90% of all attacks start with an email. This means that before the malware reaches the target, the file already needs to be accepted by a multitude of systems like Microsoft Exchange or Microsoft Outlook. There is a whole list of file types which are blocked by those systems by default [3], meaning that malware has to hide itself inside one of the allowed file types. The most commonly used file types are Microsoft Office and PDF documents. So in real-life scenarios, malware is spread throughout a host file (shown in the picture below, with malicious segments highlighted in red), using different encodings, encryption, compression, etc. This generates all kinds of different and unknown image structures, which makes it impossible to sort them into specific malware family classes.

This circumstance creates a new problem for using image recognition for malware detection, because the images don’t contain any recognizable objects. In other words, the generated images are not expressive enough to be classified correctly, because correlated information is not located next to each other.

To solve this problem, we need a different image encoding. So far we’ve just used a bit-by-bit encoding, generating images where each pixel is located at the exact same position as in the file. What we need is a conversion that is able to identify similar bytes and cluster them in the same region of the picture. There is an easy way to achieve this: we just need to convert our data into decimal values, create 3-grams and plot them onto a 3D scatter plot.
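A minimal sketch of the 3-gram step, under the assumption that a 3-gram is a sliding window of three consecutive bytes (the post does not say whether the windows overlap) and that each window becomes one (x, y, z) point. The function name `byte_trigrams` is hypothetical.

```python
def byte_trigrams(data):
    """Return one (x, y, z) point per overlapping 3-byte window."""
    values = list(data)  # every byte becomes a decimal value from 0 to 255
    return [tuple(values[i:i + 3]) for i in range(len(values) - 2)]


points = byte_trigrams(b"PDF-1.7")
print(points[:2])   # [(80, 68, 70), (68, 70, 45)]
```

Because identical byte sequences always map to the same coordinates, similar content ends up in the same region of the plot no matter where it sits in the file. The resulting points can be handed to any 3D scatter plotting tool.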

With this simple conversion we can plot similar data onto similar locations inside the scatter plot. Let’s take a look at the following data example.

By plotting this data we get the following nice 3D scatter plot:

With this plot we can finally interpret and detect similar information:

  • Area 1 = All points consisting of letters + special chars + numbers
  • Area 2 = All points consisting of letters only
  • Area 3 = All points consisting of numbers only
  • Area 4 = All points consisting of numbers + special characters
  • Area 5 = All points consisting of letters + special characters
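The five areas can also be expressed as a small lookup, which makes it clear why the points cluster: a 3-gram's region depends on which character classes it mixes. The sketch below assumes ASCII data and treats every byte that is neither a digit nor a letter as a special character. The names `char_class` and `area_of` are hypothetical, and combinations the list does not mention return `None`.

```python
# Area numbers as listed above, keyed by the set of character classes in a 3-gram.
AREAS = {
    frozenset({"letter", "special", "digit"}): 1,
    frozenset({"letter"}): 2,
    frozenset({"digit"}): 3,
    frozenset({"digit", "special"}): 4,
    frozenset({"letter", "special"}): 5,
}


def char_class(value):
    """Classify one byte value, assuming ASCII."""
    if 48 <= value <= 57:
        return "digit"
    if 65 <= value <= 90 or 97 <= value <= 122:
        return "letter"
    return "special"


def area_of(trigram):
    """Return the area number for a 3-gram, or None if the list does not cover it."""
    classes = frozenset(char_class(value) for value in trigram)
    return AREAS.get(classes)


print(area_of(b"abc"))   # 2
print(area_of(b"1.7"))   # 4
print(area_of(b"a-1"))   # 1
```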

Doing this with more data, adding some colors and rearranging the axes, we can get some very cool results and findings. The following animation shows two 3D plots of PDF documents. The one on the left is a benign document, the other one contains malware.

Here it is very obvious that the transformation of the right document looks pretty different from the other. We can see a high density of pixels at unusual locations, in contrast to the left document’s transformation, where the pixel density is pretty well balanced all over. So even without knowing anything about the payload, we can say for sure that the document on the right contains at least very strange data, maybe special encodings from obfuscation, compressed information from packers, etc.
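The comparison above was made by eye. One hypothetical way to put a number on "well balanced" versus "concentrated" is to bucket the points into coarse cubes and measure how evenly the cubes are filled. The cell size of 32, the normalized-entropy score and the function names are all assumptions for illustration; this is a sketch of the idea, not a malware detector.

```python
import math
from collections import Counter


def cell_counts(points, cell_size=32):
    """Count how many points fall into each cube of the 256x256x256 space."""
    return Counter(tuple(value // cell_size for value in point) for point in points)


def spread(points, cell_size=32):
    """Normalized Shannon entropy of the cube histogram: 0 = one cube, 1 = perfectly even."""
    counts = cell_counts(points, cell_size)
    total = sum(counts.values())
    if total == 0 or len(counts) <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2((256 // cell_size) ** 3)
    return entropy / max_entropy


# Synthetic data: one point per cube versus most points piled into a single cube.
even = [(x, y, z) for x in range(0, 256, 32)
        for y in range(0, 256, 32)
        for z in range(0, 256, 32)]
clumped = [(200, 10, 10)] * 500 + even[:12]

print(spread(even), spread(clumped))
```

A score close to 1 corresponds to the evenly filled plot on the left, while a score close to 0 means most points sit in a few spots. Where those spots are located still has to be judged separately.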

Now, if even humans are able to distinguish between usual PDF transformations and malicious PDF transformations, computer vision is too.

This blog post wasn’t meant to present a perfect malware classifier based on computer vision, but it has shown that it is very possible to detect malware by just doing static visual analysis, at least for PDF documents. With this in mind, the blog post ends here. Thx for reading and stay tuned for more stuff.

[1] https://www.fireeye.com/blog/executive-perspective/2014/05/ghost-hunting-with-anti-virus.html

[2] Obfuscation (software) - Wikipedia

[3] Blocked attachments in Outlook (microsoft.com)