Back when I was looking at averages of information collected across random and malicious documents, I noted that file size seemed to help narrow down whether a file could be suspicious (see malicious filter). Ever since I began collecting object content, I have wanted to see whether there was a correlation, or an easy way to find the payload, using general data that is readily available.
In the case of PDF objects, it appeared that objects over a certain length tended to contain the payload or code assisting the exploit. Large objects are obviously the norm in clean PDF documents as well, but if a document is already suspected of being malicious, this filtering may speed up the time it takes to identify the payload and begin reverse engineering.
The following statistics are derived from 247 malicious PDF documents in my repository. Using a simple script (sketched after the list below), I queried my malicious object samples for the following:
- count of all objects in each document
- count of all objects over 750 bytes in each document
- id of each object
- length of each object (both our own calculation and the /Length value)
- encoded data of the object
- decoded data of the object
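Here is a minimal sketch of that collection step, assuming the PDF objects have already been extracted and decoded by a parser elsewhere; the function names, record fields, and output layout are illustrative, not the actual script:

```python
import json
import os

THRESHOLD = 750  # byte cutoff used throughout this post

def collect_stats(objects):
    # `objects` is a list of dicts with keys 'id', 'declared_length'
    # (the /Length value), 'encoded', and 'decoded' (raw bytes) --
    # illustrative names; extraction itself happens elsewhere.
    records = []
    for obj in objects:
        records.append({
            "id": obj["id"],
            "declared_length": obj["declared_length"],  # the /Length value
            "measured_length": len(obj["encoded"]),     # our own count
            "decoded_preview": obj["decoded"][:64].decode("latin-1", "replace"),
        })
    large = [r for r in records if r["measured_length"] > THRESHOLD]
    return len(records), large

def process(path, objects, summary_fh, breakout_dir="breakouts"):
    total, large = collect_stats(objects)
    # Part one: a single summary line per document (large count vs. total).
    summary_fh.write(f"{path}\t{len(large)}\t{total}\n")
    # Part two: a per-document breakout of each large object's details.
    out = os.path.join(breakout_dir, os.path.basename(path) + ".json")
    with open(out, "w") as fh:
        json.dump(large, fh, indent=2)
```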
This data was split into two parts. The first was a list pairing the number of large objects with the total object count for each PDF document. The second was a per-file breakout of all the large objects with the detailed data mentioned above. The goal of splitting the data this way was to determine whether there was value in looking at large objects first, and how accurate that approach was.
After running the script I was left with 245 files, each with a relatively small number of objects needing manual review to determine whether they were malicious or aided in the exploitation of the PDF document. Keep in mind that 2 documents did not contain any objects over 750 bytes and will need to be reviewed manually to identify how the exploit is structured. It should also be noted that some documents may contain large objects that do not actually hold exploit or payload data.
Of the 245 documents, 212 contained payloads exceeding 750 bytes, making object length useful when trying to find payloads within a suspicious document. Knowing that length shows some value in searching for payloads, it could also be helpful to sort the output of decoded objects by size, so that anything suspicious is easier to spot.
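Sorting by our measured length is a one-liner against the records collected above (field names are the illustrative ones from the earlier sketch):

```python
# Largest objects first, so anything suspicious floats to the top.
for rec in sorted(records, key=lambda r: r["measured_length"], reverse=True):
    print(f"obj {rec['id']}: {rec['measured_length']} bytes")
```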
As I went through all this data I kept in mind previous talks about the PDF specification. In Julia Wolf’s talk on her observations, she mentioned that /Length could be any number, or missing entirely, and the parser could somehow still make sense of everything. Others have mentioned similar issues with the specification, but in most cases, including length, these documents adhere to the rules and appear to carry an accurate length.
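That makes a quick declared-versus-measured check worthwhile; a sketch follows, with the caveat that a real parser must first resolve /Length values given as indirect references:

```python
def length_mismatch(declared_length, encoded):
    # Flag streams whose /Length disagrees with the bytes actually found
    # between `stream` and `endstream`. Most samples here matched, so a
    # mismatch (or a missing /Length) is itself worth a closer look.
    return declared_length is None or declared_length != len(encoded)
```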
Furthermore, after sorting the data collected, patterns emerged for certain exploits and structures based on the length (both versions) and the object ID. Take for example the LibTIFF exploit (CVE). In my analysis I found 13 documents that used this exploit as their primary attack. What is interesting is that every one of these files was different, yet they all used the object ID of "1". In 11 of the 13 cases the sizes stayed very close together, with an average of 2,079 bytes. Using this data, one could write a simple filter to flag any object with an ID of 1, a size close to 2,079 bytes, and the word TIF in its content.
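A minimal sketch of that filter, again using the illustrative record fields from earlier; the 200-byte tolerance window is my own guess, not something measured from the samples:

```python
def looks_like_libtiff(rec, decoded, avg=2079, tolerance=200):
    # Object ID 1, size near the 2,079-byte average, and "TIF" somewhere
    # in the full decoded data (all thresholds are tunable assumptions).
    return (rec["id"] == 1
            and abs(rec["measured_length"] - avg) <= tolerance
            and b"TIF" in decoded)
```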
Just by comparing the count of large objects against the count of all objects, you could begin to make assumptions about files being similar, and manual review later confirmed those assumptions to be correct. One example: malicious documents with a larger number of objects were likely to contain a complete duplicate of every object. This appeared to fall in line with how PDF updates are done, but it should be noted that the object version (generation number) did not appear to increase on objects that were the same. In some cases the object IDs did not match but the content was the same, suggesting that exploits may have been copied and pasted between samples.
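One way to surface those duplicates is to group objects by a hash of their decoded content, ignoring the IDs entirely; a sketch under the same assumed record layout:

```python
import hashlib
from collections import defaultdict

def duplicate_objects(objects):
    # Map sha256(decoded content) -> list of object IDs. Any group with
    # more than one member is a duplicate, whether or not the IDs match.
    groups = defaultdict(list)
    for obj in objects:
        digest = hashlib.sha256(obj["decoded"]).hexdigest()
        groups[digest].append(obj["id"])
    return {d: ids for d, ids in groups.items() if len(ids) > 1}
```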
One of the last observations was that malicious documents did appear to share exploitation structure with one another. Looking just at JavaScript-related exploits, there were roughly 20-30 unique exploitation styles. By this I mean it literally looked like some documents copied and pasted the code from one document to the next; the difference that created a new hash (even at the object level) was the payload itself. Although this is not surprising, given that exploit kits are likely the main creators of these documents, it is interesting to see how few combinations there are and how little effort attackers put into hiding their work.
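That observation suggests a crude way to group samples by style: hash the JavaScript with its string literals blanked out, so two documents differing only in the embedded payload collapse to the same fingerprint. This is my own heuristic, not how the grouping above was done:

```python
import hashlib
import re

def style_fingerprint(js_code: bytes) -> str:
    # Blank out quoted string literals (where payloads usually live),
    # then hash what remains of the code structure.
    stripped = re.sub(rb'"[^"]*"', b'""', js_code)
    stripped = re.sub(rb"'[^']*'", b"''", stripped)
    return hashlib.sha256(stripped).hexdigest()
```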
In conclusion, I feel that length (object size) does matter when dealing with PDF analysis. Had I not gone this route, I would have had to search through all 247 documents, with a total object count of 7,263. Instead, I was able to find the payload in 99% of the files by looking through just 1,347 objects. The task itself took around 6 hours, meaning it would have taken roughly 15 hours without the filter. With the data I have now collected, I can write filters to further reduce the time it takes to prune documents for suspicious content.