In my last post I mentioned that I wanted to put together an API for my malpdfobj tool so sharing could be easier. The good news is that I have the RESTful API functioning, complete with interactive API documentation, Python interfaces, and the ability to take in samples. I also have new statistics and malware-related details collected on my sample set. The bad news is that the material itself has been submitted to a conference, and I will have to hold off on releasing it until I hear back on the acceptance decision. The latest this will be is March 20th, but I hope to hear back sooner. In the meantime, I plan on adding more features to the backend processing, so when I do release the tool it will be well-tested and full-featured.
One of the other things I wanted to reflect on was my original choice of MongoDB as my backend data store. Initially I used MySQL to store the details, but that quickly got old once I was parsing larger PDF documents and could not account for all the new data I was collecting. I explained why MongoDB sounded like the solution, but when I posted I was still unsure. It has been about a month and a half since I started playing with MongoDB, and I am impressed with how well it works for this project.
When building the API, I wanted to use PHP as the backend, but I needed the ability to query my MongoDB collection. Connecting to Mongo and getting my data was as easy as using the MySQLi connector in PHP. I was impressed by how seamless the transition was between the two interfaces and didn't have to waste much time reading documentation. Once I got my data from Mongo, I was able to parse it how I wanted and then package it back up into JSON to send back to the client. In some cases I didn't even need to parse the results because they were already packaged as JSON. Having the data stored in BSON made life easier and also made querying straightforward.
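Since the tool side of the project lives in Python, here is a minimal sketch of that same round trip using pymongo rather than the PHP driver; the database, collection, and field names are placeholders, not the real schema.

```python
import json

from bson import json_util  # handles BSON types like ObjectId during encoding
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["pdfs"]["malpdfobj"]  # hypothetical database/collection names


def fetch_report(md5):
    """Look up a stored report by sample hash and return it as a JSON string."""
    doc = collection.find_one({"hash": md5})
    if doc is None:
        return json.dumps({"error": "sample not found"})
    # The document comes back as a plain dict, so it re-encodes to JSON
    # with no extra parsing step in between.
    return json.dumps(doc, default=json_util.default)
```

The PHP side ends up doing the equivalent: run the query, take the decoded document, and json_encode it straight back to the client.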
Of course I did need data for the whole API service to work, so there had to be an easy way to get a parsed PDF into Mongo. I briefly touched on this in a few of my posts, but never went too far into detail. When adding data to Mongo, all I had to do was get my data into JSON format, pick my collection, and insert the record. It was literally as simple as that. Python has very good support for MongoDB, and using the Mongo option on the tool, I was able to bulk insert into Mongo without a single error.
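The insert path is short enough to show in full. This is only a sketch, under the assumption that the tool emits one JSON report per PDF; the names below are illustrative rather than the tool's actual interface.

```python
import json

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["pdfs"]["malpdfobj"]  # hypothetical database/collection names


def store_report(report_json):
    """Insert a single parsed-PDF report (a JSON string) as a Mongo document."""
    doc = json.loads(report_json)  # the report is already JSON, so this is the whole conversion
    collection.insert_one(doc)


def store_reports(reports_json):
    """Bulk insert a list of JSON reports in one call."""
    collection.insert_many([json.loads(r) for r in reports_json])
```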
Querying the data can be a little confusing at times, because the results you expect are not necessarily what gets returned. In some cases it appears that Mongo is not capable of returning a single object based on certain filtering clauses in a single query. Instead, multiple queries need to be made to obtain the exact object you want. While this can be a bit frustrating, it is not a big deal for this project, given that I usually find myself querying the whole dataset and then looping through the results for further parsing.
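In practice that looks something like the sketch below: pull the documents back with a broad find() and do the finer-grained filtering client-side. The field names here are assumptions made for the sake of the example.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["pdfs"]["malpdfobj"]  # hypothetical database/collection names

# Grab every report, then filter in Python rather than in the query itself,
# e.g. picking out the individual PDF objects that carry JavaScript.
for doc in collection.find():
    for obj in doc.get("objects", []):
        if obj.get("contains_js"):
            print(doc.get("hash"), obj.get("id"))
```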
I am currently in the process of porting bighands and dirtyhands over to support MongoDB insertion. The goal is to constantly reach out to the Internet for PDF documents, mine the data, and then store it in an object format. While most of these PDFs will be considered good rather than malicious, they will still help in finding the commonalities within public PDFs on the web. I intend to use MongoDB for this task as well, and it will ultimately plug into the released API. Expect to see a fully functional application go live very soon that will hopefully change how we share this kind of data.