Header background

Hash4j: a new hash library for Java

Since their invention in 1953 by Hans Peter Luhn, hash functions have become indispensable in software engineering, as they’re able to map keys of arbitrary types to binary numbers.

Their primary use case is hash tables, which enable lookups in constant time by using a hash of the key to address the bucket in which the associated value is stored. However, many more applications lead to two main classes of hash functions:

  • Cryptographic hash functions, which are used to verify passwords or the integrity of files and messages. For these applications, it is critical that one can’t find some input to the hash function that maps to a given hash. In other words, reversing the hash function must be impossible in practice. Some examples of state-of-the-art cryptographic hash functions are SHA-3 (introduced in 2016) or BLAKE3 (2020).
  • Non-cryptographic hash functions, which produce hashes that ideally can’t be distinguished from independent and uniformly distributed random values. This deterministic mapping to pseudo-random values allows for the application of statistical methods. It is essential for many probabilistic data structures and algorithms, like Bloom Filter, HyperLogLog, MinHash, Consistent hashing, Rendezvous hashing, or Count-min sketch, to name a few. Cryptographic hash functions are not suitable for this purpose. They’re relatively slow, and their hashes are not necessarily uniformly distributed, as this is not their primary design objective.

What are good non-cryptographic hash functions?

A good non-cryptographic hash function must be fast, and its output must be high-quality, meaning the hash values look like true random values. Unfortunately, there are hardly any mathematical guarantees for most, especially the fastest, non-cryptographic hash functions. Hash quality can only be verified empirically by a series of statistical tests. A popular test suite is SMhasher, which lists the recently developed hash functions Wyhash (final v3, 2021) and Komihash (v4.3, 2021) among the fastest without known quality issues.

Why create a new library?

A few Java libraries like Guava, Apache Codec, or Zero-Allocation-Hashing already provide implementations of some non-cryptographic hash functions. However, they did not fully meet our requirements for the following reasons:

  • We need 100% compatibility with the native reference implementation. Due to unresolved bugs in the Guava and the Zero-Allocation-Hashing libraries, we could not use their MurmurHash implementations.
  • There were no Java implementations of the latest high-quality hash functions such as Wyhash and Komihash. Zero-Allocation-Hashing has an implementation of Wyhash, but it is not the latest version.
  • We need a fast streaming interface that allows direct hashing of arbitrary Java objects without first serializing them into byte arrays. This saves memory allocations, especially if the serialization size is not known in advance due to variable lengths of string or array fields. To our knowledge, Guava was the only Java library with a streaming interface. Unfortunately, as our benchmarks have shown, it is relatively slow, particularly for MurmurHash, when processing byte array or string fields.

To overcome these shortcomings, we decided to develop hash4j, a new hash library for Java which we open-sourced on GitHub. To reduce the risk of bugs, the library has a 100% branch test coverage policy. It is extensively tested against test vectors generated using the native reference implementations of the corresponding hash algorithms. Unlike other libraries like Guava or Zero-Allocation-Hashing, which rely on the problematic Unsafe class, hash4j uses the VarHandle class introduced in Java 9 to realize fast memory access. This also makes the library more (future-)safe.

How do you use it?

The integration of hash4j into your Java project is straightforward, as all releases have been made available in the Maven Central Repository. The first step to hashing a Java object is to select a hash algorithm and create a corresponding hasher instance. Using this instance, Java objects can be hashed in two different but logically equivalent ways, as exemplified in the code snippet below for a small test class:

  • The first variant creates a new hash stream into which the individual fields are passed using corresponding put-methods. The hash computation instantly consumes these contributions such that large internal buffers are not needed. The terminal operation (getAsLong) finally returns the computed hash value.
  • The second variant uses a so-called hash funnel to define how all the fields of an object are put into a sink. The advantage of hash funnels is that they can be statically defined, and the actual statement for hashing the object is reduced to a simple method call.

Hashing of variable length fields

A common pitfall when hashing a class with multiple fields of variable lengths is to simply concatenate the binary representations of their values. This may result in poor hash quality. For example, consider a class for storing the first and the last name. The two data records

  • [first name = “ANDRE”, last name = “WRIGHT”]
  • [first name = “ANDREW”, last name = “RIGHT”].

would have the same hash value as the concatenated binary sequences of the field values [“ANDRE”, “WRIGHT”] and [“ANDREW”, “RIGHT”] would be the same in both cases.

A possibility to avoid such hash collisions would be the use of a dedicated separator symbol. However, this might require escape sequences if the separator symbol itself may be contained in any of the field values. Another solution that does not require modifying the original byte sequence is to simply append the length of every field to its value. Consequently, the two records of our example would be treated as

[“ANDRE”, 5, “WRIGHT”, 6]

and

[“ANDREW”, 6, “RIGHT”, 5],

respectively. Since the corresponding binary representations are different, the hash collision risk is minimized. In contrast to Guava, hash4j comes with methods that automatically append the length information to variable length fields, such as strings, arrays, or also Java’s optional data types, to avoid these kinds of problems. The code snippet below demonstrates the hash computation for the example above:

Hashing of ordered collections

Hash4j also supports ordered collections such as lists. As they can also vary in length, their size is added automatically. The following code snippet demonstrates the hash computation for lists of strings and shows how hash collisions are prevented by appending length information:

Although in all three cases a hash is calculated from the same three characters “A”, “B” and “C”, the resulting hash values are all different. This is because the three characters are grouped differently in strings and lists. By taking individual string lengths and collection sizes into account, hash collisions are automatically avoided.

Hashing of unordered collections

Hash4j can also be applied to collections without inherent ordering, such as sets. The hash of a set must always be the same and independent of the iteration order over its elements. Therefore, a different method is required than for ordered data. A common approach is to hash each collection element and combine the element hashes with an associative and commutative operation like the addition or bitwise XOR.

While this might work well for sets where each element is unique, this approach results in a poor hash quality for multisets where each element may be present multiple times. With bitwise XORing, the contribution of an element with even multiplicity cancels out. This means the multiset {AAB} would yield the same hash value as {B}. However, the use of addition is also problematic in this case. Since the hash value of A is added twice, the least significant bit of the hash value of {AAB} is equal to that of {B} and thus depends only on B. In general, an element whose multiplicity is a power of 2 does not contribute to the least significant k bits, with k being the exponent.

Therefore, hash4j combines the individual element hash values according to a different strategy. The hash contribution of an unordered collection corresponds to that of the sorted list of all element hash values. The following code snippet demonstrates hashing of an unordered collection:

What’s next?

A toolkit that allows fast mapping of arbitrary objects to high-quality hash values is the foundation of many modern data structures and algorithms, for which production-ready implementations are also planned in the library. For example, the latest release of hash4j includes UltraLogLog, a more space-efficient alternative to HyperLogLog for approximate distinct counting. Furthermore, implementations of MinHash and SuperMinHash are already available. These similarity hashing techniques enable the computation of fingerprints of sets, which can then be used for fast similarity estimation.

More hash-based algorithms will follow. Try hash4j out and stay tuned!