Conntrack hash function comparisons II
The comparison of the candidate hash functions for conntrack is based
on Patrick Schaaf's cttest
program. All the results below can be reproduced by running 'mkstat.pl'
from the package.
The results of the first conntrack comparison can be found in the previous report.
The key of the hash functions for conntrack consists of the network
parameters identifying a connection:
- source IP address
- destination IP address
- protocol
- source port
- destination port
In the mathematical formulae below I use the following notation:
Unfortunately it is not trivial to decide which
hash function is the best:
- One could simply "look at" the graphs which show the number of
hash buckets for the different bucket lengths. If we have multiple input
datasets of the same sizes, just the maximum bucket lengths can be displayed
as well (see the maximum list length graphs below). The human eye is an
excellent tool and the trivially bad hashes can easily be spotted.
However it's hard to say which hash is better if they look similar,
or if one looks better here and the other looks better there...
- The probability of the longest bucket can be calculated:
and the results can be compared (see the probability of one longest bucket length
graphs below). In order to avoid the overflow of the factorial
calculations, the following formulae were used instead:
However, we face two problems here: first, if there are multiple longest
buckets, then another, more complicated formula should be used:
Second, this still just examines the longest bucket lengths,
as in the previous approach, only from another point of view.
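The overflow-safe formulae themselves are not reproduced in this text; as a minimal sketch of the technique (not the exact cttest formula), the factorials can be evaluated in logarithmic form with lgamma. Below, the binomial probability that one particular bucket receives exactly k of the n entries under a uniformly random hash:

```python
import math

def log_binom_prob(n, k, L):
    """Log of the probability that a given bucket receives exactly k of
    the n entries when each entry picks one of L buckets uniformly:
    C(n, k) * (1/L)**k * (1 - 1/L)**(n - k), evaluated via lgamma so
    the factorials never overflow."""
    log_c = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_c + k * math.log(1.0 / L) + (n - k) * math.log(1.0 - 1.0 / L)

# 100000 entries hashed into 65537 buckets; one bucket holding 10 entries
p = math.exp(log_binom_prob(100000, 10, 65537))
```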
- We could calculate the usual statistical parameters which describe
the distribution of the data in the hash table: mean, standard and mean absolute deviation
(see standard and mean absolute deviation graphs below), etc:
However these parameters describe how a data distribution itself behaves.
They are not much better suited to comparing two datasets than the previous
methods.
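For completeness, these distribution parameters are straightforward to compute from the bucket lengths; a minimal sketch (not the cttest code itself):

```python
import math

def bucket_stats(buckets):
    """Mean, standard deviation and mean absolute deviation of the
    bucket lengths of a hash table (buckets is a list of lengths)."""
    n = len(buckets)
    mean = sum(buckets) / n
    std = math.sqrt(sum((b - mean) ** 2 for b in buckets) / n)
    mad = sum(abs(b - mean) for b in buckets) / n
    return mean, std, mad
```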
- Chi-square is a much better statistic for qualifying a hash function:
The result is expected to have a value close to the hash size (L) itself. Chi-square
measures are usually reported in units of standard deviations, so the
chi-square graphs below actually display:
The results are expected to be between -3 and 3, so those lines are printed
on the graphs as 'min' and 'max' values.
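A sketch of the statistic, assuming the usual Pearson form with a uniform expectation, and, for the normalized version, that a chi-square with about L degrees of freedom has mean L and variance 2L (which matches the -3..3 band mentioned above):

```python
import math

def chi_square(buckets, n_entries):
    """Pearson chi-square of the bucket lengths against the uniform
    expectation of n_entries / L per bucket."""
    L = len(buckets)
    expected = n_entries / L
    return sum((b - expected) ** 2 / expected for b in buckets)

def chi_square_sigmas(buckets, n_entries):
    # A chi-square with roughly L degrees of freedom has mean L and
    # variance 2L, so this rescaling reports the result in standard
    # deviations; random-looking tables land between -3 and 3.
    L = len(buckets)
    return (chi_square(buckets, n_entries) - L) / math.sqrt(2 * L)
```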
- Chi-square can also be used
to answer the question whether two datasets are drawn
from the same distribution, or how much two datasets differ. On the chi-square
against random hash graphs below the hash functions are compared to
the random hash:
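The comparison formula is not reproduced here; one common chi-square statistic for two binned datasets of equal size, sketched purely as an illustration:

```python
def chi_square_two(r, s):
    """Chi-square statistic for whether two binned datasets (e.g. the
    bucket lengths produced by a candidate hash and by a random hash)
    are drawn from the same distribution; bins empty in both datasets
    are skipped."""
    return sum((ri - si) ** 2 / (ri + si)
               for ri, si in zip(r, s) if ri + si > 0)
```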
- We calculate the walking time of the produced hash table as well:
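The exact walking-time formula is likewise not reproduced here; a natural cost model, sketched as an assumption, is the total number of list nodes visited when every stored entry is looked up once:

```python
def walking_time(buckets):
    """Total nodes visited when each stored entry is looked up once:
    the entry at position j of its bucket costs j steps, so a bucket of
    length b contributes 1 + 2 + ... + b = b * (b + 1) / 2."""
    return sum(b * (b + 1) // 2 for b in buckets)

# The same number of entries walks faster when spread evenly:
# walking_time([2, 2]) == 6 versus walking_time([4, 0]) == 10
```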
The conclusions below were drawn from the chi-square and walking time graphs.
In order to get more data, I widened the artificial test used in the
previous conntrack hash function comparison.
This time the following input datasets were used: 50000 connections
(100000 entries to hash), each "connecting to" one fixed IP address at a given
port. The clients are connecting from 50000 different ports. In the different
datasets the number of clients (and thus the number of different source
IP addresses) is varied:
50000, 45000, 40000, ... 15000,
10000, 9000, 8000, ... 2000,
1000, 900, 800, ... 200,
100, 99, 98, ... 1
That means 126 input datasets for every hash, at every hash size.
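The dataset sizes above can be enumerated mechanically, which also confirms the count of 126:

```python
def client_counts():
    """Enumerate the numbers of distinct clients (source IP addresses)
    used for the input datasets, exactly as listed above."""
    counts = list(range(50000, 15000 - 1, -5000))  # 50000, 45000, ..., 15000
    counts += list(range(10000, 2000 - 1, -1000))  # 10000, 9000, ..., 2000
    counts += list(range(1000, 200 - 1, -100))     # 1000, 900, ..., 200
    counts += list(range(100, 0, -1))              # 100, 99, ..., 1
    return counts                                  # 126 datasets in total
```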
All hashes were tested at the following odd hash sizes:
8191 8209
16383 16411
32767 32771
65535 65537
131071 131101
262143 262147
The best hashes (abcd64 and the jenkins ones), which work for even hash sizes as well, were tested
additionally with the following hash sizes:
8192 16384 32768 65536 131072 262144
Hashes original, rt, crc32, abcd64
The original hash, compared to other hashes.
The chi-square tests show clearly that there are real problems with the original hash.
Hashes rt, rt_ab, rt_ab2, rt_new, abcd64
The rt hashes were derived from the IP routing code by Joakim Axelsson
and Martin Josefsson.
All of them can produce unsatisfactory results.
Hashes crc32, rusty, abcd64
The crc32 hash uses the crc32 function of zlib; the implementation
is by Patrick. The rusty hash is based on Rusty's
include/linux/hash.h from Linux 2.5.
The crc32 hash is quite good; alas, there are serious problems
with the rusty hash.
Hashes guilmette, pearson, abcd64
The guilmette hash was suggested by Ron Guilmette on comp.compilers,
13 Dec 1990 06:25:10. The pearson hash is the famous one :-).
The original versions of both hashes generate hash keys between 0 and 255;
I created the tested variants, which produce unsigned long hash keys.
There are problems with the guilmette hash but the pearson hash is quite good.
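The exact widened variants tested are not reproduced here; as an illustration, Pearson's own scheme for widening the 8-bit hash reruns it once per output byte with the first input byte perturbed (here by XOR). A hedged sketch, in which the permutation table is an arbitrary stand-in, not the one used in the tests:

```python
import random

# Any fixed permutation of 0..255 works as the lookup table; this one
# is an arbitrary stand-in, not the table of the tested variant.
rng = random.Random(1)
T = list(range(256))
rng.shuffle(T)

def pearson32(data):
    """8-bit Pearson hash widened to 32 bits: compute the byte hash
    four times, each pass starting from the first input byte XORed
    with the output-byte index, and concatenate the four results."""
    result = 0
    for i in range(4):
        h = T[(data[0] ^ i) & 0xFF]
        for byte in data[1:]:
            h = T[h ^ byte]
        result = (result << 8) | h
    return result
```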
Hashes abcd, abcd-1, abcdef, abcd_xor, abcd64
The abcd and abcdef hashes
were proposed by Don Cohen and Rich Schroeppel, the latter coded by Joakim Axelsson. The
abcd_xor hash
is by Joakim Axelsson and Martin Josefsson. The abcd-1 variant
(formerly abcdx) is the same as the abcd but with different coefficients.
The abcd64 hash uses 64bit random numbers and works for even hash sizes
as well.
Formerly the abcd hash was my favourite one. However the chi-square tests show that the
abcd and abcd-1 hashes are not perfect. The abcd64 hash seems to be OK.
Hashes abcd-1, abcd-2, abcd-3, abcd-4
Another run to examine the abcd hash with different, worse and worse
coefficients:
abcd-1 | 0x898092BB 0x1CE6487C 0xF219B787 0x677F6D40
       | 10001001100000001001001010111011
       | 00011100111001100100100001111100
       | 11110010000110011011011110000111
       | 01100111011111110110110101000000
abcd-2 | 0x0E3A9C80 0xE1338378 0xB4CC63E7 0x5BC57C1F
       | 00001110001110101001110010000000
       | 11100001001100111000001101111000
       | 10110100110011000110001111100111
       | 01011011110001010111110000011111
abcd-3 | 0x0000173C 0xF94471C8 0x1EBFC863 0xE7FBAE97
       | 00000000000000000001011100111100
       | 11111001010001000111000111001000
       | 00011110101111111100100001100011
       | 11100111111110111010111010010111
abcd-4 | 0x0000006A 0x03FD9B60 0xFDCFFE95 0xFE32659F
       | 00000000000000000000000001101010
       | 00000011111111011001101101100000
       | 11111101110011111111111010010101
       | 11111110001100100110010110011111
Unfortunately not much could be gained from the results except
the suspected one: there are naturally better and worse coefficients.
A closer look at the abcd hashes revealed that there is a flaw in these
hashes: an attacker could easily fill one hash bucket if he can guess or
find out the hash size. This is due to the algorithm behind the abcd hashes:
(coef0 * srcip + coef1 * sport
+ coef2 * dstip + coef3 * dport + proto) % hash_size
By keeping three input parameters fixed and using source
IP addresses srcip = const + n * hash_size, the hash function
becomes
(CONST0 + coef0 * n * hash_size) % hash_size
which gives a constant hash value for any n. Rehashing does
not stop the attack; it just moves the bucket being filled.
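The degenerate behaviour is easy to demonstrate with the formula above, here using the abcd-1 coefficients listed earlier and one of the tested odd hash sizes:

```python
HASH_SIZE = 32771  # one of the odd hash sizes tested above

# The abcd-1 coefficients from the table above
A, B, C, D = 0x898092BB, 0x1CE6487C, 0xF219B787, 0x677F6D40

def abcd(srcip, sport, dstip, dport, proto, hash_size=HASH_SIZE):
    # Modular multiplicative hash, as in the formula above
    return (A * srcip + B * sport + C * dstip + D * dport + proto) % hash_size

base = 0x0A000001  # 10.0.0.1, an arbitrary starting source address
fixed = dict(sport=1025, dstip=0x0A000002, dport=80, proto=6)

# srcip = const + n * hash_size: every connection hits the same bucket...
attack = {abcd(base + n * HASH_SIZE, **fixed) for n in range(1000)}
# ...while consecutive source addresses spread over many buckets.
normal = {abcd(base + n, **fixed) for n in range(1000)}
```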
The abcd64 variant fixes the problem by using 64-bit arithmetic and
mixing the upper and lower 32-bit parts of the partial results.
Hashes jenkins, jenkins2, jenkins2b
The jenkins hash
is by Bob Jenkins. The jenkins2 and jenkins2b variants are hashes optimized
for conntrack by me.
They are overall very good hashes.
Hashes abcd64, abcd64b, abcd64c, abcd64d
Comparison of different variants of the abcd64 hash: abcd64 has 64-bit
random coefficients while the b, c and d variants have 32-bit ones.
abcd64b and abcd64c differ only in coding, an attempt to gain more speed (in vain).
The abcd64d hash has the "best" coefficients in the sense that the four random
coefficients were chosen so that each has 16 ones and 16 zeros.
The chi-square graphs show that the abcd64 hashes are not perfect for
power-of-two hash sizes.
Hashes abcd64d, jenkins2b, jenkins64
This is the comparison of the best hashes found so far. (The jenkins64
hash is the 64-bit variant of the jenkins hash.)
Again, these look to be the best hashes, but the abcd64 hash has trouble
with even hash sizes.
The jenkins hashes work reliably in every case.
Bit tests
There was a separate test of the hashes, which probed whether
every input bit affects every output bit half the time.
Only the jenkins hashes passed the bit test.
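The bit-test code itself is not reproduced here; an avalanche-style sketch of the idea (names and thresholds are my own choices) flips each input bit over many random keys and checks that every output bit changes roughly half the time:

```python
import random

def bit_test(hash_func, in_bits, out_bits, samples=2000, tolerance=0.1):
    """Return True if flipping each input bit flips each output bit
    about half the time over random inputs (an avalanche criterion)."""
    random.seed(0)
    for i in range(in_bits):
        flips = [0] * out_bits
        for _ in range(samples):
            x = random.getrandbits(in_bits)
            diff = hash_func(x) ^ hash_func(x ^ (1 << i))
            for j in range(out_bits):
                flips[j] += (diff >> j) & 1
        if any(abs(f / samples - 0.5) > tolerance for f in flips):
            return False
    return True

# A multiplicative hash truncated to its low byte fails: high input
# bits can never influence the low output bits.
bad = lambda x: (x * 0x9E3779B9) & 0xFF
```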
Speed comparisons
Finally the speed comparison of the hashes from the fastest to the slowest:
It is clear that, taking speed into account, the best hashes are
the jenkins (jenkins2b) and the abcd64 (abcd64d) ones. There are pros and
cons for both:
- The abcd64d hash is faster than the jenkins2b hash.
- Both hashes work fine for odd hash sizes. For even
(power of two) hash sizes the abcd64 hashes cannot produce
stable results.
- The jenkins hash passed the bit test, while the abcd64 hash
did not.
In my opinion we should adopt the jenkins2b hash.
Any suggestions for improvements are welcome.
Jozsef Kadlecsik