Conntrack hash function comparisons II
The comparison of the candidate hash functions for conntrack is based
on Patrick Schaaf's cttest
program. All the results below can be reproduced by running 'mkstat.pl'
from the package.
The results of the first conntrack comparison can be found in the previous report.
The key of the hash functions for conntrack consists of the network
parameters identifying a connection:
- source IP address
- destination IP address
- protocol
- source port
- destination port
In the mathematical formulae below I use the following notation:
Unfortunately it is not trivial to decide which
hash function is the best:
- One could simply "look at" the graphs which show the number of
hash buckets for the different bucket lengths. If we have multiple input
datasets of the same sizes, just the maximum bucket lengths can be displayed
as well (see the maximum list length graphs below). The human eye is an
excellent tool and the trivially bad hashes can easily be spotted.
However it's hard to say which hash is better if they look similar,
or if one looks better here and the other looks better there...
- The probability of the longest bucket can be calculated:
and the results can be compared (see the probability of one longest bucket length
graphs below). In order to avoid the overflow of the factorial
calculations, the following formulae were used instead:
However, we face two problems here: first, if there are multiple longest
buckets, then another, more complicated formula should be used:
Second, this still just examines the longest bucket lengths,
as in the previous approach, only from another point of view.
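The overflow-safe formulae themselves are not reproduced in this text; as a minimal sketch of the technique (not the exact cttest formula), the factorials can be evaluated in logarithmic form with lgamma. Below, the binomial probability that one particular bucket receives exactly k of the n entries under a uniformly random hash:

```python
import math

def log_binom_prob(n, k, L):
    """Log of the probability that a given bucket receives exactly k of
    the n entries when each entry picks one of L buckets uniformly:
    C(n, k) * (1/L)**k * (1 - 1/L)**(n - k), evaluated via lgamma so
    the factorials never overflow."""
    log_c = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_c + k * math.log(1.0 / L) + (n - k) * math.log(1.0 - 1.0 / L)

# 100000 entries hashed into 65537 buckets; one bucket holding 10 entries
p = math.exp(log_binom_prob(100000, 10, 65537))
```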
- We could calculate the usual statistical parameters which describe
the distribution of the data in the hash table: mean, standard and mean absolute deviation
(see standard and mean absolute deviation graphs below), etc:
However these parameters describe how a data distribution itself behaves.
They are not much better suited to comparing two datasets than the previous
methods.
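For completeness, these distribution parameters are straightforward to compute from the bucket lengths; a minimal sketch (not the cttest code itself):

```python
import math

def bucket_stats(buckets):
    """Mean, standard deviation and mean absolute deviation of the
    bucket lengths of a hash table (buckets is a list of lengths)."""
    n = len(buckets)
    mean = sum(buckets) / n
    std = math.sqrt(sum((b - mean) ** 2 for b in buckets) / n)
    mad = sum(abs(b - mean) for b in buckets) / n
    return mean, std, mad
```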
- Chi-square is a much better statistic for qualifying a hash function:
The result is expected to have a value close to the hash size (L) itself. Chi-square
measures are usually reported in units of standard deviations, so the
chi-square graphs below actually display:
The results are expected to be between -3 and 3, so those lines are printed
on the graphs as 'min' and 'max' values.
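A sketch of the statistic, assuming the usual Pearson form with a uniform expectation, and, for the normalized version, that a chi-square with about L degrees of freedom has mean L and variance 2L (which matches the -3..3 band mentioned above):

```python
import math

def chi_square(buckets, n_entries):
    """Pearson chi-square of the bucket lengths against the uniform
    expectation of n_entries / L per bucket."""
    L = len(buckets)
    expected = n_entries / L
    return sum((b - expected) ** 2 / expected for b in buckets)

def chi_square_sigmas(buckets, n_entries):
    # A chi-square with roughly L degrees of freedom has mean L and
    # variance 2L, so this rescaling reports the result in standard
    # deviations; random-looking tables land between -3 and 3.
    L = len(buckets)
    return (chi_square(buckets, n_entries) - L) / math.sqrt(2 * L)
```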
- Chi-square can also be used
to answer the question whether two datasets are drawn
from the same distribution, or how much two datasets differ. On the chi-square
against random hash graphs below the hash functions are compared to
the random hash:
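The comparison formula is not reproduced here; one common chi-square statistic for two binned datasets of equal size, sketched purely as an illustration:

```python
def chi_square_two(r, s):
    """Chi-square statistic for whether two binned datasets (e.g. the
    bucket lengths produced by a candidate hash and by a random hash)
    are drawn from the same distribution; bins empty in both datasets
    are skipped."""
    return sum((ri - si) ** 2 / (ri + si)
               for ri, si in zip(r, s) if ri + si > 0)
```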
- We calculate the walking time of the produced hash table as well:
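The exact walking-time formula is likewise not reproduced here; a natural cost model, sketched as an assumption, is the total number of list nodes visited when every stored entry is looked up once:

```python
def walking_time(buckets):
    """Total nodes visited when each stored entry is looked up once:
    the entry at position j of its bucket costs j steps, so a bucket of
    length b contributes 1 + 2 + ... + b = b * (b + 1) / 2."""
    return sum(b * (b + 1) // 2 for b in buckets)

# The same number of entries walks faster when spread evenly:
# walking_time([2, 2]) == 6 versus walking_time([4, 0]) == 10
```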
The conclusions below were drawn from the chi-square and walking time graphs.
In order to get more data, I widened the artificial test used in the
previous conntrack hash function comparison.
This time the following input datasets were used: 50000 connections
(100000 entries to hash), each "connecting to" one fixed IP address at a given
port. The clients are connecting from 50000 different ports. In the different
datasets the number of clients (and thus the number of different source
IP addresses) is varied:
50000, 45000, 40000, ... 15000,
10000, 9000, 8000, ... 2000,
1000, 900, 800, ... 200,
100, 99, 98, ... 1
That means 126 input datasets for every hash, at every hash size.
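The dataset sizes above can be enumerated mechanically, which also confirms the count of 126:

```python
def client_counts():
    """Enumerate the numbers of distinct clients (source IP addresses)
    used for the input datasets, exactly as listed above."""
    counts = list(range(50000, 15000 - 1, -5000))  # 50000, 45000, ..., 15000
    counts += list(range(10000, 2000 - 1, -1000))  # 10000, 9000, ..., 2000
    counts += list(range(1000, 200 - 1, -100))     # 1000, 900, ..., 200
    counts += list(range(100, 0, -1))              # 100, 99, ..., 1
    return counts                                  # 126 datasets in total
```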
All hashes were tested at the following odd hash sizes:
8191 8209
16383 16411
32767 32771
65535 65537
131071 131101
262143 262147
The best hashes (abcd64 and the jenkins ones), which work for even hash sizes as well, were tested
additionally with the following hash sizes:
8192 16384 32768 65536 131072 262144
Hashes original, rt, crc32, abcd64
The original hash, compared to other hashes.
The chi-square tests show clearly that there are real problems with the original hash.
Hashes rt, rt_ab, rt_ab2, rt_new, abcd64
The rt hashes were derived from the IP routing code by Joakim Axelsson
and Martin Josefsson.
All of them can produce unsatisfactory results.
Hashes crc32, rusty, abcd64
The crc32 hash uses the crc32 function of zlib; the implementation
is by Patrick. The rusty hash is based on Rusty's
include/linux/hash.h from Linux 2.5.
The crc32 hash is quite good; alas, there are serious problems
with the rusty hash.
Hashes guilmette, pearson, abcd64
The guilmette hash was suggested by Ron Guilmette on comp.compilers,
13 Dec 1990 06:25:10. The pearson hash is the famous one :-).
The original versions of both hashes generate hash keys between 0 and 255;
I created the tested variants, which produce unsigned long hash keys.
There are problems with the guilmette hash but the pearson hash is quite good.
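The exact widened variants tested are not reproduced here; as an illustration, Pearson's own scheme for widening the 8-bit hash reruns it once per output byte with the first input byte perturbed (here by XOR). A hedged sketch, in which the permutation table is an arbitrary stand-in, not the one used in the tests:

```python
import random

# Any fixed permutation of 0..255 works as the lookup table; this one
# is an arbitrary stand-in, not the table of the tested variant.
rng = random.Random(1)
T = list(range(256))
rng.shuffle(T)

def pearson32(data):
    """8-bit Pearson hash widened to 32 bits: compute the byte hash
    four times, each pass starting from the first input byte XORed
    with the output-byte index, and concatenate the four results."""
    result = 0
    for i in range(4):
        h = T[(data[0] ^ i) & 0xFF]
        for byte in data[1:]:
            h = T[h ^ byte]
        result = (result << 8) | h
    return result
```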
Hashes abcd, abcd-1, abcdef, abcd_xor, abcd64
The abcd and abcdef hashes
were proposed by Don Cohen and Rich Schroeppel, the latter coded by Joakim Axelsson. The
abcd_xor hash
is by Joakim Axelsson and Martin Josefsson. The abcd-1 variant
(formerly abcdx) is the same as the abcd but with different coefficients.
The abcd64 hash uses 64bit random numbers and works for even hash sizes
as well.
Formerly the abcd hash was my favourite one. However the chi-square tests show that the
abcd and abcd-1 hashes are not perfect. The abcd64 hash seems to be OK.
Hashes abcd-1, abcd-2, abcd-3, abcd-4
Another run to examine the abcd hash with different, worse and worse
coefficients:
abcd-1 | 0x898092BB 0x1CE6487C 0xF219B787 0x677F6D40
       | 10001001100000001001001010111011
       | 00011100111001100100100001111100
       | 11110010000110011011011110000111
       | 01100111011111110110110101000000
abcd-2 | 0x0E3A9C80 0xE1338378 0xB4CC63E7 0x5BC57C1F
       | 00001110001110101001110010000000
       | 11100001001100111000001101111000
       | 10110100110011000110001111100111
       | 01011011110001010111110000011111
abcd-3 | 0x0000173C 0xF94471C8 0x1EBFC863 0xE7FBAE97
       | 00000000000000000001011100111100
       | 11111001010001000111000111001000
       | 00011110101111111100100001100011
       | 11100111111110111010111010010111
abcd-4 | 0x0000006A 0x03FD9B60 0xFDCFFE95 0xFE32659F
       | 00000000000000000000000001101010
       | 00000011111111011001101101100000
       | 11111101110011111111111010010101
       | 11111110001100100110010110011111
Unfortunately not much could be gained from the results except
the suspected one: there are naturally better and worse coefficients.
A closer look at the abcd hashes revealed that there is a flaw in these
hashes: an attacker could easily fill one hash bucket if he can guess or
find out the hash size. This is due to the algorithm behind the abcd hashes:
(coef0 * srcip + coef1 * sport
+ coef2 * dstip + coef3 * dport + proto) % hash_size
By keeping three input parameters fixed and using source
IP addresses srcip = const + n * hash_size, the hash function
becomes
(CONST0 + coef0 * n * hash_size) % hash_size
which gives a constant hash value for any n. Rehashing does
not stop the attack; it just moves the bucket being filled.
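The degenerate behaviour is easy to demonstrate with the formula above, here using the abcd-1 coefficients listed earlier and one of the tested odd hash sizes:

```python
HASH_SIZE = 32771  # one of the odd hash sizes tested above

# The abcd-1 coefficients from the table above
A, B, C, D = 0x898092BB, 0x1CE6487C, 0xF219B787, 0x677F6D40

def abcd(srcip, sport, dstip, dport, proto, hash_size=HASH_SIZE):
    # Modular multiplicative hash, as in the formula above
    return (A * srcip + B * sport + C * dstip + D * dport + proto) % hash_size

base = 0x0A000001  # 10.0.0.1, an arbitrary starting source address
fixed = dict(sport=1025, dstip=0x0A000002, dport=80, proto=6)

# srcip = const + n * hash_size: every connection hits the same bucket...
attack = {abcd(base + n * HASH_SIZE, **fixed) for n in range(1000)}
# ...while consecutive source addresses spread over many buckets.
normal = {abcd(base + n, **fixed) for n in range(1000)}
```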
The abcd64 variant fixes the problem by using 64-bit arithmetic and
mixing the upper and lower 32-bit parts of the partial results.
Hashes jenkins, jenkins2, jenkins2b
The jenkins hash
is by Bob Jenkins. The jenkins2 and jenkins2b variants are hashes optimized
for conntrack by me.
They are overall very good hashes.
Hashes abcd64, abcd64b, abcd64c, abcd64d
Comparison of different variants of the abcd64 hash: abcd64 has 64-bit
random coefficients while the b, c and d variants have 32-bit ones.
abcd64b and abcd64c differ only in coding, an attempt to gain more speed (in vain).
The abcd64d hash has the "best" coefficients in the sense that the four random
coefficients were chosen so that each has 16 ones and 16 zeros.
The chi-square graphs show that the abcd64 hashes are not perfect for
power-of-two hash sizes.
Hashes abcd64d, jenkins2b, jenkins64
This is the comparison of the best hashes found so far. (The jenkins64
hash is the 64-bit variant of the jenkins hash.)
Again, these look to be the best hashes, but the abcd64 hash has trouble
with even hash sizes.
The jenkins hashes work reliably in every case.
Bit tests
There was a separate test of the hashes, which probed whether
every input bit affects every output bit half the time.
Only the jenkins hashes passed the bit test.
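The bit-test code itself is not reproduced here; an avalanche-style sketch of the idea (names and thresholds are my own choices) flips each input bit over many random keys and checks that every output bit changes roughly half the time:

```python
import random

def bit_test(hash_func, in_bits, out_bits, samples=2000, tolerance=0.1):
    """Return True if flipping each input bit flips each output bit
    about half the time over random inputs (an avalanche criterion)."""
    random.seed(0)
    for i in range(in_bits):
        flips = [0] * out_bits
        for _ in range(samples):
            x = random.getrandbits(in_bits)
            diff = hash_func(x) ^ hash_func(x ^ (1 << i))
            for j in range(out_bits):
                flips[j] += (diff >> j) & 1
        if any(abs(f / samples - 0.5) > tolerance for f in flips):
            return False
    return True

# A multiplicative hash truncated to its low byte fails: high input
# bits can never influence the low output bits.
bad = lambda x: (x * 0x9E3779B9) & 0xFF
```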
Speed comparisons
Finally the speed comparison of the hashes from the fastest to the slowest:
It is clear that, taking speed into account, the best hashes are
the jenkins (jenkins2b) and the abcd64 (abcd64d) ones. There are pros and
cons for both:
- The abcd64d hash is faster than the jenkins2b hash.
- Both hashes work fine for odd hash sizes. For even
(power of two) hash sizes the abcd64 hashes cannot produce
stable results.
- The jenkins hash passed the bit test, while the abcd64 hash
did not.
In my opinion we should adopt the jenkins2b hash.
Any suggestions for improvements are welcome.
Jozsef Kadlecsik