I have been spending some time thinking about how to process pcap files ranging from several gigabytes to petabytes in size. While tools such as Wireshark do some pretty fantastic analysis, they do not scale well to files of that size. I quickly started thinking about some of the massive data processing architectures such as the Hadoop framework! I believe there may be a lot of potential in doing covert channel analysis of massive amounts of traffic by taking this type of approach. This could be useful in analyzing past traffic when new exploits and covert channels are discovered.
When I was researching tools that attempt to accomplish this (favoring anything on the free side), I came across a tool called xtractor. However, this tool is limited to files of up to one gigabyte. While I feel like they were on the right track, I was not sure about the limited flexibility of their web interface/CouchDB approach. I see a lot of potential in tailoring the HDFS/MapReduce/Hive projects to accommodate some really interesting calculations on enormous amounts of data in a distributed/clustered environment. While this will by no means provide real-time results, it will allow historical data to be scanned for new threats much more quickly than most processes that are currently in place.
One problem that must be dealt with is how to split pcap files appropriately for efficiency in HDFS/MapReduce. Pcap files are not formatted in a way that lets a large file be easily broken into several pieces: a file is one global header followed by variable-length packet records with no synchronization marker between them, so a reader dropped at an arbitrary byte offset cannot reliably find the start of the next packet (although the new PcapNg format under development should help!). One idea I thought was interesting would be to extract data from packets at various sensors, inserting the necessary information into an HBase table (perhaps organized by flows, etc). HBase could then be used as the input to the MapReduce queries at a later time.
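To make the splitting problem concrete, here is a minimal Python sketch that walks a classic pcap file record by record and picks byte offsets where a split could safely begin. It assumes the classic libpcap file layout (24-byte global header, 16-byte record headers), not PcapNg. The names record_offsets, split_points and target_bytes are made up for illustration, and this is not a Hadoop InputFormat, just a demonstration that packet boundaries are only known after a sequential pass.

```python
import io
import struct

PCAP_GLOBAL_HEADER_LEN = 24   # magic, version, thiszone, sigfigs, snaplen, linktype
PCAP_RECORD_HEADER_LEN = 16   # ts_sec, ts_frac, incl_len, orig_len

# Magic numbers as they appear on disk (microsecond and nanosecond variants).
LITTLE_ENDIAN_MAGICS = (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1")
BIG_ENDIAN_MAGICS = (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d")


def record_offsets(stream):
    """Yield (offset, captured_length) for each packet record.

    `stream` is a seekable binary file object positioned at the start
    of a classic pcap file.
    """
    header = stream.read(PCAP_GLOBAL_HEADER_LEN)
    if len(header) < PCAP_GLOBAL_HEADER_LEN:
        raise ValueError("truncated pcap global header")
    magic = header[:4]
    if magic in LITTLE_ENDIAN_MAGICS:
        endian = "<"
    elif magic in BIG_ENDIAN_MAGICS:
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")

    offset = PCAP_GLOBAL_HEADER_LEN
    while True:
        rec = stream.read(PCAP_RECORD_HEADER_LEN)
        if len(rec) < PCAP_RECORD_HEADER_LEN:
            break  # end of file (or a truncated trailing record)
        _ts_sec, _ts_frac, incl_len, _orig_len = struct.unpack(endian + "IIII", rec)
        yield offset, incl_len
        # The only way to find the next record is to skip this one's payload.
        stream.seek(incl_len, io.SEEK_CUR)
        offset += PCAP_RECORD_HEADER_LEN + incl_len


def split_points(stream, target_bytes):
    """Return record-aligned offsets, each at least target_bytes apart.

    target_bytes is a hypothetical tuning knob, e.g. an HDFS block size.
    """
    points = []
    split_start = None
    for offset, _incl_len in record_offsets(stream):
        if split_start is None or offset - split_start >= target_bytes:
            points.append(offset)
            split_start = offset
    return points
```

An index like this could be written once at capture time, for example by the sensor, so that later jobs can start reading at a known packet boundary instead of rescanning the whole file. Each split would still need the global header's byte order and link type to interpret its records.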
Really, I think there are many possible ways of combining the different Hadoop technologies into a structure that allows for much more powerful (and cheap) packet processing. Has anyone run into any other projects that successfully accomplish this?