How Cuckoo Filters Can Improve Existing Approximate Matching Techniques

If you have used VirusTotal, in additional information tab you will find a field for ssdeep. It is intriguing what this field represents among hashing functions SHA1, SHA256 and MD5. ssdeep is a approximate matching algorithm (AMA). NIST define approximate matching algorithms as:

"Approximate matching is a technique for identifying similarities between or among digital objects, such as files, storage media, or network streams, which do not match exactly".

In simple words, given two files, having certain degree of similarity, for example, two word files with second file being copy of first except a paragraph. Traditional hash functions can only tell about the exact similarity, but cannot provide degree or a confidence score out of 100 of how similar the files are.

AMA are used extensively in forensics investigation as they provide easy way to compare a hard disk against whitelisted or blacklisted corpus of files. Over past few years many AMA have been proposed. ssdeep was the first, followed by sdhash and mrsh-v2. In between, some other algorithms were also proposed, but they suffered with one shortcoming or other. All of the above algorithms use Bloom filters as their basic building block. In short, Bloom filter is a probabilistic data structure for performing set membership query efficiently. In this post I will not go into details of internal working of AMA nor of Bloom filter, but talk about an alternative of Bloom filter, namely Cuckoo filter. Cuckoo filter is modification of Cuckoo hashing approach to construct a data structure suitable for set membership determination. Cuckoo filter excel over Bloom filter on following parameters:

Cuckoo filter has better lookup performance
Cuckoo filter has better space efficiency than Bloom filter, when false positive rate desired is < 3%.
Unlike Bloom filter, Cuckoo filter supports deleting stored items dynamically.

In one of my recent works, I have used Cuckoo filter instead of Bloom filter in mrsh-v2 and was able to achieve significant performance gain. The final results can be summarized as:

In simple runtime performance, there is a gain of 37%.
Size of final fingerprints generated is halved.
Memory usage is 8th of approach with Bloom filter.

This work was presented at ICDF2C-2015, which took place on 6th October in Seoul, South Korea, titled How Cuckoo Filter Can Improve Existing Approximate Matching Techniques (pdf). The work was adjudged with best paper award by the committee.

In next coming posts I will try to cover what are approximate matching algorithms in detail and probably also look into each of these algorithms.

Keep hacking!!