I learned about hash functions in the same way that I imagine many of you did: a side-note in the implementation of a Hash Table. I was in University, and we were studying data structures and alogrithms. Hash Tables appeared alongside arrays, linked lists, queues, stacks, trees, and more. We spent a lot of time learning about the remarkable amortized runtime perormance of Hash Tables, and my Professor gave us a short lesson describing how to build a Hash Table with an array, a linked list and a hash function (keep your eyes peeled for part 2 in this series if you are interested!).
It was amazing to see the impressive algorithmic applications of Hash Tables, but I failed to recognize that the hash function had other applications!
Twenty years later, and I look back at that moment with humour. I had seen the forest, but missed the trees (certainly not the first, nor the last time that I did that). Hash Tables are one of the most important tools that we have in algorithmic optimization, they power our Objects in JavaScript and dictionaries in both Ruby and Python. It’s all too easy to focus on the Hash Table, with all it’s glamour, and to overlook the humble hash function at the heart of it all.
Hash functions have broad applications in software development, but they are especially important in the field of security. From data-integrity checking, to cryptographic signing, to proof of work: hash functions are one of the three primitive operations in security!
I am fascinated by hash functions, and I truly believe that understanding them has made me a more thoughtful and potent developer.
In this blog post I want to introduce the hash function, play around with a couple simple ones, and hopefully set you on the journey of discovery that I almost missed all those years ago.
Hash functions have a simple, humble definition:
“A hash function is a function that maps an arbitrary input to a fixed-size value”
It is easy to see then, why they are so easy to overlook. Let’s take a second to make sure we understand what this definition means.
I assume that most readers have a strong understanding of what a function is. We pass some inputs into a function, and we expect zero or one values as output1. In the case of a hash function, we add a further restriction - the function must accept exactly one input, and must produce exactly one output. The input to a hash function may be a string, a number, an object, or some arbitrary binary data. We call this an arbitrary input, because there is generally no restriction on the size of an input - it is arbitrarily sized.
Perhaps the most mysterious thing about hash functions is how you can take an arbitrary sized input and produce a useful fixed size output. Hash functions are necessarily lossy functions: there are fewer possible outputs than there are possible inputs. Counter intuitively, it is this restriction on the outputs of hash functions that is the key to their flexibility. The choices we make when implementing a hash function is essentially the choice of what data is lost. These choices impact the characteristics of the functions. When building hash tables, hash functions with certain characteristics shine, while when using hash functions for cryptography, hash functions with different characteristics are prized. The rest of this series will attempt to highlight some of these characteristics, describe why they are valuable, and give you an idea how the underlying hash functions are implemented.
The simplest hash function that I can imagine is just to consider the first character in an input string. For example here is an implementation in TypeScript
function firstLetterHash(input: string): number {
return input.charCodeAt(0);
}
Thanks to the type annotations, we can see that our firstLetterHash
function accepts a string as an input, and returns to us a number as output. The input space for this hash function, therefore, is all strings. To produce our output, we use the String#charCodeAt
function, and it defines an output range of the integers from 0 - 65535.
Let’s consider a more complicated hash function: the so called “division method”.
const hashSpace: number = 100;
function divisionMethodHash(input: string): number {
/* To convert a string, to a single number:
* split on "" to get an array with each character
* map the "charCodeAt" function over the array
* to get the unicode value of each char
* use reduce to sum the ascii values
*/
const numericInput = input.split("")
.map(c => c.charCodeAt())
.reduce((accum, num) => accum+num, 0);
/*
* Use the modulo operator to calculate the hash value
* the modulo operator returns the remainder of dividing two integers
* it reduces the output space from "all numbers", to a fixed range.
* Since m % n returns a value from 0 .. n-1,
* 0 .. n-1 is the output range of our function
*/
return numericInput % hashSpace;
}
Again, we can see that our input and output types are the same. This time, however we define a hashSpace
variable that allows us to control the size of the output space. The hash value will be a number between 0 and hashSpace
.
The most complicated part of this hash function is converting the input string into a numeric value. In order to accomplish this conversion, we take a three-step approach.
First we convert the string into an array, because the string object does not implement the map
and reduce
functions. Second, we recognize that every character has a corresponding unicode value, and we map the String#charCodeAt
function across the array to access these values. Finally, we use the reduce
function to calculate the sum of all of these unicode values. We take advantage of the higher order map
and reduce
functions to provide a compact implementation of this behaviour (for more information on map and reduce checkout this link).
Once we’ve determined the numeric value of the input string, we use modular arithmetic to fit that value into the parameterized hashSpace (review modular arithmetic).
Let’s compare the result of hashing various strings using these two hash functions
input | firstLetterHash | divisionMethodHash |
---|---|---|
“Hello, World!” | 104 | 29 |
“Hello, Reader!” | 104 | 4 |
“Let’s Hash” | 108 | 67 |
“Let’s Hash!” | 108 | 0 |
“foo” | 102 | 24 |
“food” | 102 | 24 |
“foodd” | 102 | 24 |
“fooddd” | 102 | 24 |
As we investigate the results in this table, we can begin to see a handful of properties that hash functions may or may not display. Understanding the various properties of hash functions will make it much easier for us to evaluate the appropriateness of a given function for a given task in the next series of posts.
The first property that we can notice is that each function has a very different output space. They have a different range of possible return values. Our firstLetterHash
function has a defined range - the output space is defined as the space of unicode characters: 0-65535 (the possible return values from String#charCodeAt
).
Our divisionMethodHash
function on the other hand has a range of 0-100 - but that value is a parameter, it’s controlled by the value in the hashSpace
variable. It would be trivial to change it’s range to 0-1000000. That’s why we say that the divisionMethodHash
function has a variable range.
A hash function is considered uniform if it uniformly distributes the input values across the output space of the function. If we think about the functions that we’ve implemented above, and we consider any string to be a valid input, then indeed these functions are uniform in their distribution - each hash value is equally likely.
However, the property of uniformity can be considered not only across all inputs, but also across the subsets of those inputs that we consider significant. When we look at real world examples of strings, we discover that the distribution of first characters is not at all uniform. Languages don’t have uniform character distributions, and certain characters are more likely to appear in the first position. Additionally, many unicode characters aren’t alphanumeric characters at all! For a given language, or even across the most common languages, our input sets are only going to feature a fraction of the sixty-five thousand or so possible characters as a first character. When the firstLetterHash
function is applied to only these “likely” strings, then there are many hash values that will never be produced. The firstLetterHash
function has poor uniformity across common strings.
The divisionMethodHash
function on the other hand will have much better uniformity across sets of average strings, and could be considered a uniform hash function.
The avalanche effect describes a type of hash function, where small changes in an input will result in large changes to the output value. This is a useful property in security because it makes predicting a hash function’s behaviour more difficult.
As we can see from our results table, our firstLetterHash
function definitely fails to show the Avalanche Effect, since the only character considered is the first character. Only changes to that single character will result in changes to the hash value.
On the other hand, we can see that for our divisionMethodHash
, changes to any character in the string results in changes in the hash value. As such, we might be tempted to say that the divisionMethodHash
displays the avalanche effect. However, we can see in the last four results, that there is a character that betrays this behaviour: ‘d
’. This is because the unicode value of the character ‘d
’ is 100 – the same as our hash space size. Adding or removing a ‘d
’ character from the input string is an identity operation on the divisionMethodHash
function. We should say then, that the divisonMethodHash
only weakly demonstrates the Avalanche Effect.
Collision resistance is the property that describes how difficult it is to find an input that hashes to some desired value. This property is useful to prevent a malicious actor undermining the performance of a hash table with targeted workloads that contain many collisions. Additionally, it’s key that cryptographic hash functions are expensive to reverse, and good collision resistance is one technique that adds to that cost.
In both of our hash functions outlined above, this property is incredibly weak. All an attacker would need to do in order to construct a matching input from a hash value with our firstLetterHash
function, is to look up the character with the same unicode value as the hash value they wish to collide with. Any string that begins with that character will produce the desired hash value.
With our divisionMethodHash
function, the unpredictability is stronger. However, because the divisionMethodHash
is rather simple, it would be easy for a malicious actor to discover it’s implementation by feeding it a handful of values, and watching the resulting output. As soon as they discover that the ‘d
’ can be added and removed at will, the implementation will be understood, and finding collisions becomes trivial.
In this gentle introduction to hash functions, we introduced the notion of a hash function, we discussed why it is such a great idea to learn more about them, and we demonstrated two very simple hash functions that you could write yourself and discussed the various properties of these hash functions. In part two of this series, we will focus on hash tables in some depth, we will discuss the desirable properties for hash functions that are to be used with hash tables, and we will examine the so called “universal hashing” approach.
Some languages, like Go use a pattern where they return two values data, err
instead of raising exceptions. Python also allows you to return tuples from functions, essentially allowing multiple returns. These are the exceptions that prove the rule. ↩
Thoughts, Ideas and Lessons from the Common Sense Software team