I wanted to hold off on releasing this until I had written some more queries, but that may take a few days, so here it is now. The night I got all my malicious files into my MongoDB instance, I started querying. One of the first things I wanted to tackle was identifying whether malicious PDFs reuse objects or payloads. To make that possible, I had made a point of storing a list of the hashes of all objects in my main PDF object.
After playing around with the query language in Mongo, I was able to get something that gave me what I wanted:
https://gist.github.com/764746
I am aware the JSON parsing is horrible, but until I write or find a recursive way of searching, it will have to do. In any case, the script queries Mongo in a drill-down fashion until it reaches the data it needs. After parsing the returned result, I store each hash in a list, mark whether it has been seen before, and associate a count with it. The clause at the end outputs any hash with a count greater than one, and after running it on my data, I saw nothing.
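For anyone curious what a recursive version of that search might look like, here is a rough sketch in Python. It is not the script from the gist. The field name `md5`, the `objects` list, and the sample documents are all made up for illustration, so swap in whatever your own schema uses. In practice the documents could come from something like a pymongo `find()` cursor, but plain dicts are used here so the snippet runs on its own.

```python
from collections import Counter


def iter_hashes(node, key="md5"):
    """Recursively yield every string stored under `key` in nested dicts/lists.

    `key` is a hypothetical field name; adjust it to match your documents.
    """
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, str):
                yield v
            else:
                yield from iter_hashes(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_hashes(item, key)


def reused_hashes(documents, key="md5"):
    """Return {hash: number_of_files} for hashes that appear in more than one file."""
    counts = Counter()
    for doc in documents:
        # set() so a hash repeated inside a single file is counted once per file
        counts.update(set(iter_hashes(doc, key)))
    return {h: n for h, n in counts.items() if n > 1}


# Made-up sample data standing in for documents pulled from Mongo.
docs = [
    {"filename": "a.pdf", "objects": [{"id": 1, "md5": "aaa"}, {"id": 2, "md5": "bbb"}]},
    {"filename": "b.pdf", "objects": [{"id": 1, "md5": "bbb"}, {"id": 2, "md5": "ccc"}]},
]

print(reused_hashes(docs))  # {'bbb': 2}
```

Counting once per file is my own choice here, since the question is about reuse across files. If you would rather count every occurrence, drop the `set()` call.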
I suppose this generally makes sense, as I have a small set of malicious files, but it does lead me to believe that these files may have come from different sources, or that their authors just don’t care enough to reuse anything. Given how well these files follow the specification, I would venture to say the authors don’t use a framework to pump these malicious files out (one may be worth making) and instead create each file from scratch when they need it.