A couple of months back I remember reading a post from Symantec about visualizing entropy to identify infected Microsoft documents. At the time it didn’t really dawn on me to visualize the PDF samples I had, but I did take a brief look at whether entropy could be useful in detecting malicious PDFs. I specifically looked at how entropy values (stream, non-stream and total) compared between a public dataset and a completely malicious dataset. During that period I found entropy to be quite useless for detecting malicious samples, as it never showed a pattern.
While looking back through some of my testing, I realized that the data mentioned above was always a single composite entropy value for the whole file, not entropy measured over the course of the file. I wanted the ability to see the randomness throughout the file, so I quickly hacked up two different entropy generators. The first was based on reading each line of the PDF, whereas the second used byte chunks. The interesting thing to note about the line-based entropy was that while it was no help in identifying anything malicious, it was able to find PDFs whose content matched everywhere except the payload. I thought this was pretty cool, and while the byte chunks proved to do the same thing, the result was not as well defined.
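To make the two approaches concrete, here is a minimal sketch of what a line-based and a byte-chunk entropy generator might look like. This is not the code I originally hacked up; the function names, the 256-byte default chunk size and the choice of Shannon entropy in bits per byte are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that occurs in the data.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def chunk_entropy(data: bytes, chunk_size: int = 256) -> list:
    """Entropy of each fixed-size chunk (chunk_size is an assumed default)."""
    return [
        shannon_entropy(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def line_entropy(data: bytes) -> list:
    """Entropy of each line, splitting on the usual line terminators."""
    return [shannon_entropy(line) for line in data.splitlines()]
```

Reading a sample with open(path, "rb").read() and passing the bytes to either function gives a list of values that can be plotted against position in the file. Note that compressed streams in a PDF can contain line terminators by accident, so the line-based version will slice binary data at arbitrary points.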
Visual Entropy using Bytes
As you can see from the results, entropy appears to be useful when comparing malicious PDFs with each other. I would like to investigate further how the visualizations compare between samples that share an exploit method but carry different payloads, to see if patterns can actually be deduced. As for using visual entropy to classify a PDF as known good or bad, I just haven’t found a great use for it yet. I suppose it is nice to know, but the visual entropy of a non-infected PDF looks just as random as that of a malicious one. Regardless of the lack of detection benefits, I plan on adding the byte chunk entropy data to the MalPdfObj tool and the chart representation to the web portal (not yet released).
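The kind of sample-to-sample comparison described above could be roughed out as follows. This is only a sketch under some loud assumptions: both files are compared chunk by chunk at the same offsets (so an inserted or removed byte early in one file would throw everything after it out of alignment), and the function names, the 256-byte chunk size and the 0.05 tolerance are hypothetical values picked for the example, not anything taken from MalPdfObj.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return sum(
        (n / total) * math.log2(total / n) for n in Counter(data).values()
    )


def entropy_profile(data: bytes, chunk_size: int = 256) -> list:
    """Per-chunk entropy values for one file."""
    return [
        shannon_entropy(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def differing_chunks(a: bytes, b: bytes, chunk_size: int = 256,
                     tolerance: float = 0.05) -> list:
    """Indexes of chunks whose entropy differs by more than tolerance.

    Only the overlapping part of the two profiles is compared; any extra
    chunks in the longer file are ignored.
    """
    profile_a = entropy_profile(a, chunk_size)
    profile_b = entropy_profile(b, chunk_size)
    return [
        i for i, (x, y) in enumerate(zip(profile_a, profile_b))
        if abs(x - y) > tolerance
    ]
```

For two samples that match everywhere except the payload, the returned indexes should cluster around the payload's position. Equal entropy does not mean equal bytes, though, so a chunk that is missing from the list is not guaranteed to be identical in both files.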