Binary Hashing: Motivations and Algorithms

By Rylan O'Connell | November 26, 2019

As security researchers, we often spend a lot of time looking into the internals of libraries in the products we assess. This work comes with some common time sinks, such as identifying library versions. While library version identification is relatively straightforward on the surface, other tasks are clearly more challenging – such as applying signatures to stripped binaries, porting defined types across libraries, and matching functions across similar codebases.

For example, instead of upgrading an old, vulnerable library version, development teams sometimes choose to back-port a patch, attempting to properly insert it into the vulnerable version of the code library. Whatever the reasons for doing so, it is difficult to accurately and automatically detect what has changed and what remains vulnerable.

We have recently been working under DARPA’s SafeDocs program to develop new methods for understanding and simplifying complex document formats (e.g., PDF) into safer, clearly understandable, and “verification-friendly” subsets.1 One way we come to understand a format’s de facto standard is by analyzing the common parsers which consume the data. As part of solving this problem, we often look at how exactly a parsing library is used, including how an input file traverses that library when it is nested deep inside various other programs. Binary hashing is one technique which can help us apply knowledge that a human analyst generates against one instantiation of a parser (e.g., an annotation of a potentially exploitable function) to other instances where that parser is used.


Cryptographic hashing is common and familiar in the security industry. This involves taking in some input and producing a unique output that aims to minimize collisions (e.g., two separate inputs should have a very low probability of producing the same hash digest). This is critical for password comparisons, signature checking on software, or proof-of-work in a blockchain. However, in other instances we may want to comparatively maximize collisions. This includes applications such as audio/video fingerprinting, spam detection, matching text such as articles, and near-duplicate detection. In these cases, we leverage locality sensitive hashing which aims to put “similar” items into the same “bucket” together.
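
To make this contrast concrete, here is a quick Python illustration (with toy inputs) of why a cryptographic hash cannot surface similarity:

```python
import hashlib

# A single changed byte produces a completely different SHA-256 digest --
# exactly the property that makes cryptographic hashes useless for finding
# *similar* (rather than identical) inputs.
digest_a = hashlib.sha256(b"mov eax, 1; ret").hexdigest()
digest_b = hashlib.sha256(b"mov eax, 2; ret").hexdigest()
```

The two digests share essentially no structure, so comparing them tells us nothing about how close the inputs were.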

In applying this to binaries, we encounter a number of issues:

  • Instruction Order – straightforward to address; for example, instructions within a basic block can be treated as unordered.
  • Platform and Architectural Differences – harder to address than instruction order.
  • Compiler Optimizations – these present even more difficult challenges.

As a result of these and other challenges, there is no easy answer to binary hashing in the “real world” – each approach we will discuss has its own strengths and weaknesses. Frequently, several methods are combined to offset one another’s weaknesses; Diaphora, for example, uses almost 50 heuristics2. Simply put, binary hashing is often messy in practice.

In practice, prior art combines a number of simple algorithms – even items such as cryptographic hashes, which by their nature will not match fuzzily, are combined in different ways. For example, one heuristic is an MD5 of all non-relative bytes in the function; another compares the assembly code generated by IDA after ignoring auto-generated names. Others are fuzzier – using a DeepToad fuzzy match or a small-primes product of the mnemonics in the function. Finally, some methods look at the connectedness of the graph, such as looking for call-graph matches.

From our review of the current practice and literature, we break these approaches down into two components:

  1. Hash of the Basic Block (BB) – a hash of the instructions contained within each individual basic block
  2. Graph-Aware Measure – a secondary hash which accounts for the structure of a function’s CFG

Next, we’ll dive into these two categories in more detail.

Basic Block Hashing

We first provide an overview of a number of methods which are sometimes applied to basic block hashing, whether in implementations or in the academic literature. The following summarizes what we see as the features and downsides of each algorithm.

Small Primes Product (references: 3, 4, 5)
  Features:
    • Extremely simple to implement
    • Unaffected by instruction order
    • Low rate of false positives
  Downsides:
    • High rate of false negatives
    • Only accounts for function contents, not structure

MD-Index (references: 3, 4, 5)
  Features:
    • Accounts for function structure
  Downsides:
    • Fails to account for function contents
    • High rate of false positives

KOKA Hash (reference: 6)
  Features:
    • Very low rate of false positives
    • Highly efficient
    • Accounts for function contents and structure
  Downsides:
    • Only works with exact matches

Locality Sensitive Hash
  Features:
    • Lower rate of false negatives
    • Able to account for minor differences between binaries
    • Unaffected by instruction order
  Downsides:
    • Requires tuning to reduce false positives
    • Computationally expensive

For those interested, we include some high-level pseudocode for each algorithm discussed in an Appendix at the bottom of this article.

Locality Sensitive Hashing

Unlike the other algorithms described above, locality sensitive hashing algorithms are designed explicitly so that similar inputs collide, producing the same hash. One popular implementation relies on dividing an N-dimensional space with k hyperplanes such that an input’s “hash” can be derived from its position relative to each hyperplane.

We think the best way to learn this algorithm (7, 8) is through some visual depictions. We’ve attempted to simplify it into 2D space to share with you. You can click through the steps in the below diagrams:

In comparing this description with the other pseudocode in the Appendix, you’ll understand why this method is more computationally intensive – but it does yield a number of the benefits identified above.
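
For readers who prefer code to diagrams, here is a minimal sketch of random-hyperplane LSH, assuming inputs have already been embedded as fixed-length numeric vectors (e.g., mnemonic counts); the embedding step itself is out of scope here:

```python
import random

def make_hyperplanes(n_planes, dims, seed=1234):
    """Normals of random hyperplanes through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(n_planes)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = 0
    for normal in hyperplanes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing signature bits; similar vectors tend to be close."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(8, 4)
sig = lsh_signature([10.0, 2.0, 0.0, 5.0], planes)
```

Note that positively scaling a vector never changes which side of an origin-crossing hyperplane it lies on, so the signature depends only on the vector’s direction – a deliberate property of this scheme, not a bug.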

Graph Hashing

To support the graph-hashing component – i.e., capturing which basic blocks are connected – we will compare Graph Edit Distance and the Weisfeiler Lehman Isomorphism Test.

Graph Edit Distance
  Features:
    • Analog measure of similarity
  Downsides:
    • NP Complete9
    • Must be run on every pair of functions

Weisfeiler Lehman Isomorphism Test (reference: 8)
  Features:
    • Linear runtime complexity on CFGs
    • Run once, compare resulting hashes
  Downsides:
    • Binary detection of isomorphism

Our takeaway from studies in this area is that Weisfeiler Lehman is likely “good enough” for signature recognition, while Graph Edit Distance (GED) may be better suited for “bindiff”-type applications. Because GED gives an analog comparison of the similarity between two functions, capturing its full potential would require computing the edit distance between every single pair of functions in a binary. While this may be feasible on small binaries with relatively few functions, it quickly becomes unrealistic on more complex binaries. Weisfeiler Lehman, on the other hand, allows us to build up a unique “ID” for each CFG. As such, we can compute Weisfeiler Lehman once per function and use the resulting ID as a basic hash for comparison.

Weisfeiler Lehman Isomorphism Test

Using the Weisfeiler Lehman Isomorphism Test (WLI) allows us to perform a significant amount of pre-computation on data sets, and later perform comparisons on the results. This is crucial to being able to scale our analysis.

WLI operates through a series of rounds of neighbor discovery and relabeling. During neighbor discovery, the WLI algorithm simply gathers every neighbor of a node, and collects them into an (ordered) list. After every node has undergone neighbor discovery, the nodes are then relabeled to their neighborhood list. Finally, after several rounds of neighbor discovery and relabeling, a “hash” can be computed by visiting every node in order and building up a list of labels (similar to the neighbor discovery process).

This means that for N rounds, v vertices, and e edges, this algorithm’s runtime complexity is effectively O(N(v + e)). The keen-eyed among you may have realized that in a general graph (without multi-edges), |e| can grow as large as |v|², so this complexity could degrade to O(Nv²), which is no longer linear. However, because we are specifically concerned with running this algorithm on CFGs, where a basic block has at most 2 outgoing branches, we know |e| ≤ 2|v|, meaning that WLI will still run in linear time even on large functions with complex CFGs.
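
The rounds of neighbor discovery and relabeling described above can be sketched as follows. This simplified version seeds every block with the same starting label and uses only successor labels; a production version would seed each block with a content hash (e.g., one of the basic block hashes above):

```python
import hashlib

def wl_hash(cfg, rounds=3):
    """Weisfeiler Lehman-style hash of a digraph given as {node: [successors]}."""
    # Seed every node with the same label; a real implementation would seed
    # with a content hash of each basic block instead.
    labels = {node: "0" for node in cfg}
    for _ in range(rounds):
        relabeled = {}
        for node, succs in cfg.items():
            # Neighbor discovery: own label plus the sorted successor labels.
            neighborhood = labels[node] + "|" + ",".join(sorted(labels[s] for s in succs))
            relabeled[node] = hashlib.md5(neighborhood.encode()).hexdigest()[:8]
        labels = relabeled
    # Final hash summarizes the multiset of labels, independent of node names.
    return hashlib.md5(",".join(sorted(labels.values())).encode()).hexdigest()

# Two CFGs with the same diamond shape but different block names...
g1 = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
g2 = {"w": ["x", "y"], "x": ["z"], "y": ["z"], "z": []}
# ...and a straight-line CFG with the same number of blocks.
g3 = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
```

Isomorphic CFGs produce identical hashes, while the structurally different chain does not.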

Again, we shall turn to a visual demonstration of how this algorithm runs:


We hope that this has given you a brief introduction to the subject area. Keep an eye out for our next post, where we will release:

  1. Details of our approach, specifically targeting the problem of porting information across similar binaries
  2. Measurable metrics for algorithm performance
  3. An open-source tool for you to use and extend!

River Loop Security works heavily in applying program analysis techniques to solve real-world problems in cybersecurity, from research projects such as DARPA’s SafeDocs program to commercial penetration testing engagements where we audit source code or binaries to find vulnerabilities.

We encourage you to contact us if you have any questions or comments based on this post or have questions about applying binary analysis to solve your cybersecurity challenges.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0074. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).

Appendix: Algorithm Pseudocode

Small Primes Product
  1. Assign every mnemonic a unique prime
  2. For every function, multiply all mnemonic primes
  3. Every hash is unique to that combination of instructions
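
A minimal sketch of the steps above; the mnemonic-to-prime table is built lazily and shared across calls so that the same mnemonic always maps to the same prime:

```python
from itertools import count

def _primes():
    """Yield 2, 3, 5, ... by trial division -- fine for a mnemonic alphabet."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

_prime_gen = _primes()
_assignment = {}

def prime_for(mnemonic):
    """Lazily assign each distinct mnemonic its own prime, shared globally."""
    if mnemonic not in _assignment:
        _assignment[mnemonic] = next(_prime_gen)
    return _assignment[mnemonic]

def small_primes_product(mnemonics):
    """Multiplication commutes, so reordered instructions hash identically."""
    product = 1
    for m in mnemonics:
        product *= prime_for(m)
    return product

sig_a = small_primes_product(["mov", "mov", "add"])
sig_b = small_primes_product(["add", "mov", "mov"])  # reordered: same product
```
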
MD-Index

  1. For every edge in the CFG, construct a tuple with
    • t0 : # incoming edges in source block
    • t1 : # outgoing edges in source block
    • t2 : # incoming edges in destination block
    • t3 : # outgoing edges in destination block
    • t4 : “Position” in CFG
  2. edge_hash = t0 + t1√(2) + t2√(3) + t3√(5) + t4√(7)
  3. Function hash = Σ(edge_hash)
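
Transcribed into Python, with an assumed CFG representation (an edge list plus degree and position maps; how “position” is computed varies between descriptions of the algorithm, so a topological rank is used here purely for illustration):

```python
import math

def md_index(edges, in_deg, out_deg, position):
    """Sum the per-edge hashes described in the pseudocode above.

    edges: list of (src, dst) basic-block pairs in the CFG
    in_deg/out_deg: per-block edge counts; position: per-edge rank in the CFG
    """
    total = 0.0
    for src, dst in edges:
        t0, t1 = in_deg[src], out_deg[src]
        t2, t3 = in_deg[dst], out_deg[dst]
        t4 = position[(src, dst)]
        total += (t0 + t1 * math.sqrt(2) + t2 * math.sqrt(3)
                  + t3 * math.sqrt(5) + t4 * math.sqrt(7))
    return round(total, 6)  # rounding tames floating-point accumulation noise

# Diamond-shaped CFG: A -> {B, C} -> D
h = md_index(
    [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")],
    {"A": 0, "B": 1, "C": 1, "D": 2},
    {"A": 2, "B": 1, "C": 1, "D": 0},
    {("A", "B"): 0, ("A", "C"): 0, ("B", "D"): 1, ("C", "D"): 1},
)

# Renaming blocks leaves the hash unchanged -- it depends only on structure.
renamed = md_index(
    [("w", "x"), ("w", "y"), ("x", "z"), ("y", "z")],
    {"w": 0, "x": 1, "y": 1, "z": 2},
    {"w": 2, "x": 1, "y": 1, "z": 0},
    {("w", "x"): 0, ("w", "y"): 0, ("x", "z"): 1, ("y", "z"): 1},
)
```
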

KOKA Hash

Construct a “feature vector” containing:

  1. Number of basic blocks in each category
    • Normal
    • Entry points
    • Exit points
    • Traps
    • Self loops
    • Loop heads
    • Loop tails
  2. Number of edges in each category:
    • Basis edges
    • Forward edges
    • Back edges
    • Cross links
  3. Bitvector representing dominator tree
  4. Calls to the function
  5. Calls within the function
  6. Instruction histogram
  7. String histogram
  8. Markov lumping of CFG
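
A partial sketch computing just the first group of counts from an adjacency-list CFG; traps, loop heads/tails, edge classification, the dominator bitvector, and the histograms are omitted, and “normal” is simplified here to mean any block in no other category:

```python
def block_features(cfg):
    """Count a subset of the basic-block categories listed above.

    cfg: {block: [successor blocks]} for a single function's CFG.
    """
    preds = {block: 0 for block in cfg}
    for block, succs in cfg.items():
        for s in succs:
            preds[s] += 1
    entry = {b for b in cfg if preds[b] == 0}             # no incoming edges
    exits = {b for b, succs in cfg.items() if not succs}  # no outgoing edges
    loops = {b for b, succs in cfg.items() if b in succs} # self loops
    normal = set(cfg) - entry - exits - loops
    return {
        "entry_points": len(entry),
        "exit_points": len(exits),
        "self_loops": len(loops),
        "normal": len(normal),
    }

# A -> {B, C}; B loops on itself then falls through to D; C -> D; D returns.
feats = block_features({"A": ["B", "C"], "B": ["B", "D"], "C": ["D"], "D": []})
```
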
Graph Edit Distance (GED)

Sum the number of operations to convert G1 to G2:

  • Node insertion
  • Node deletion
  • Node label change
  • Edge insertion
  • Edge deletion
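
Computing the true minimum over all node mappings is NP-complete, but if we assume node labels already identify corresponding nodes (a fixed mapping), the operation count reduces to set differences and yields an upper bound on the true edit distance:

```python
def ged_upper_bound(nodes1, edges1, nodes2, edges2):
    """Count node and edge insertions/deletions under a fixed node mapping.

    Taking node labels as the correspondence, the symmetric differences give
    an upper bound on the true (NP-complete) graph edit distance. A label
    change appears here as one deletion plus one insertion.
    """
    node_ops = len(set(nodes1) ^ set(nodes2))
    edge_ops = len(set(edges1) ^ set(edges2))
    return node_ops + edge_ops

# Turning a 3-block chain into a 3-block loop takes one edge insertion.
d = ged_upper_bound(
    ["A", "B", "C"], [("A", "B"), ("B", "C")],
    ["A", "B", "C"], [("A", "B"), ("B", "C"), ("C", "A")],
)
```
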

References

  1. Bratus, S. Safe Documents. DARPA.
  2. Koret, J. (2018, January 24). Heuristics and their explanation.
  3. Flake, H., & Porst, S. (2010, April 26). ShaREing is Caring.
  4. Dullien, T., Carrera, E., Eppler, S.-M., & Porst, S. (2010). Automated Attacker Correlation for Malicious Code. NATO OTAN.
  5. Dullien, T., & Rolles, R. (n.d.). Graph-based comparison of Executable Objects.
  6. Kehagias, A., & Karamitas, C. (n.d.). Efficient Features for Function Matching between Binary Executables.
  7. Hari, S. (2018, July 5). Locality Sensitive Hashing for Similar Item Search.
  8. Weisfeiler-Lehman Graph Kernel for Binary Function Analysis. (2019, September 9).
  9. There are cases in which Graph Edit Distance can be efficiently computed, such as when the graph is a tree; however, these situations do not appear to apply to most control flow graphs.
  10. Abu-Aisheh, Z., Raveaux, R., Ramel, J.-Y., & Martineau, P. (n.d.). An Exact Graph Edit Distance Algorithm for Solving Pattern Recognition Problems.