Speed

Back when I was programming on an old IBM PC (not an XT, the original PC) there were certain rules you followed if you wanted your code to be fast:

  • Multiplication was slow.
  • Division was slow.
  • Floating point was really awful (no hardware support).
  • You didn’t even think about square roots.
  • XOR AX,AX was the fast way to clear a register (1 cycle faster!).
  • If you could use a shift instead of a multiplication or division, you did it (see the sketch after this list).
  • The 8086 string instructions were the way to get speed for bulk memory operations.

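To make the shift rule concrete, here is a minimal sketch in C of the kind of strength reduction we did by hand back then; the function names are just for illustration:

    /* Both functions compute n * 8.  On an 8086 a MUL cost dozens of
     * cycles, while a shift cost only a few, so you wrote the shift
     * yourself rather than trusting the compiler. */
    unsigned times_eight_mul(unsigned n)
    {
        return n * 8;   /* 8086: an expensive MUL */
    }

    unsigned times_eight_shift(unsigned n)
    {
        return n << 3;  /* 8086: a few cheap shifts */
    }
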
We all know it is different now, but how different? To answer this I sat down with a range of computers, with processors spanning the past eight years, and a simple test program that repeatedly ran various instructions. The results are below, all normalised to the speed of the one universally fast operation: addition. For comparison, results for three “historical” processors are included at the end; these are based on the manufacturers’ timing information rather than direct measurement.

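I won’t reproduce the test program here, but a minimal sketch of the idea in C (not the exact code) looks like this: time a long chain of dependent multiplies, time the same chain of adds, and take the ratio.

    #include <stdio.h>
    #include <time.h>

    #define ITERATIONS 100000000L

    int main(void)
    {
        /* k is volatile so the compiler cannot strength-reduce the
         * multiply or hoist it out of the loop; both loops pay the
         * same load cost, so it cancels out of the ratio. */
        volatile unsigned k = 7;
        unsigned x = 1;
        long i;
        clock_t t0, t1, t2;

        t0 = clock();
        for (i = 0; i < ITERATIONS; i++)    /* dependent multiplies */
            x = x * k;
        t1 = clock();
        for (i = 0; i < ITERATIONS; i++)    /* dependent adds */
            x = x + k;
        t2 = clock();

        /* Printing x keeps the loops from being optimised away. */
        printf("x=%u  imul/add: %.1f\n", x,
               (double)(t1 - t0) / (double)(t2 - t1));
        return 0;
    }
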
Arch     Imul   Idiv   Logical   Shift   Fadd   Fmul   Fdiv   sqrt   sqrt var   fsqrt   fsqrt var
Alpha     2.3    6.1     0.9      0.7     0.9    1.1    2.0   12.0      0.6       9.3      0.4
Athlon    1.2   24.0     1.1      1.2     1.2    1.2    4.0   47.1     14.2      28.4      8.5
PIII      1.1   16.1     1.2      1.1     1.2    1.3    3.5   34.5     14.9      28.5     14.1
P4        1.3   15.4     1.0      2.0     1.1    1.2   15.6   48.2      0.2      18.4      0.1
G3        1.0    7.6     1.1      1.0     1.5    1.7    8.5   46.4      0.1
G5        2.0   17.5     1.0      0.9     1.5    1.5   14.7   75.3      0.0      47.5      0.1
8086     40.0   50.0     1.0      0.7
68000     8.8   17.5     0.8      0.5
Z80                      1.0      2.0

(The Z80 has no hardware multiply or divide, so only its logical and shift timings appear.)

These results aren’t a comparison of raw speed: they highlight what various processors do well and what they don’t. Nor are they particularly accurate. The Alpha result is from a single machine that happened to be in use at the time (I’m not even sure what model it is; I suspect a 21164). The Athlon results are averaged across the entire family, and there are differences between the original “K7” and the Athlon XP. All the results probably have an error on the order of 20%, so the entire “Logical” column can probably be read as 1.0.

So what has changed since the 80s? Multiplication is no longer the dog it once was: in the worst case you are looking at a factor of two, and in most cases it is as fast as addition. Integer division is still slow, although both the Alpha and the G3 make it look almost respectable. The Athlon still hates division. Logical operations (&, ^, and |) and shifts are almost universally fast, although the P4 is mysteriously slow at shifting.

Floating point is fast these days. These results are averaged across both double and float operands; since floats are universally promoted to doubles in hardware, there is no difference for the basic operations. Floating point multiplication, like integer multiplication, is as fast as addition. Floating point division is still slower, but it is faster than integer division. The Alpha absolutely screams when doing division. Conversely, the G5 and the Pentium 4 are both significantly slower; in fact, both have floating point division performance almost as bad as their integer division performance.

Finally, I ran the square root function in both double and float forms and with different arguments. The “sqrt var” column reflects the spread of results between calls to sqrt with different arguments. As you can see, most implementations have a constant cost for a square root, but the Athlon and the PIII use an algorithm which adapts itself to the required precision. This can result in more than a factor of two difference in speed from case to case. The sqrt function also shows why you might still want to use floats rather than doubles: successive approximation algorithms simply have less work to do for a float.
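
In C that choice is just sqrt versus sqrtf; a minimal sketch (the function names are mine, link with -lm):

    #include <math.h>

    /* Single precision: sqrtf has fewer bits to converge on, which
     * is why the float square roots above are cheaper on the
     * adaptive implementations. */
    float dist_f(float x, float y)
    {
        return sqrtf(x * x + y * y);
    }

    /* Double precision: the same formula, more bits, more work. */
    double dist_d(double x, double y)
    {
        return sqrt(x * x + y * y);
    }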

So the new rules could be:

  • Multiplication is fast.
  • Floating point is fast.
  • Division is still slow.
  • Floating point division is faster than integer division.
  • Using shifts for division and multiplication is not really worth it any more: write readable code and the compiler will use shifts anyway (see the sketch after this list).
  • float types can be faster than doubles, but not for simple operations.
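
To illustrate the shift rule (this is an assumption about typical optimising compilers, not something measured above): when the divisor is a constant power of two, the compiler emits the shift for you.

    /* Readable version: an optimising compiler emits n >> 2 for a
     * division by a constant power of two (with a small sign fix-up
     * if n were signed). */
    unsigned quarter(unsigned n)
    {
        return n / 4;
    }

    /* Hand-written version: no faster, just harder to read. */
    unsigned quarter_by_hand(unsigned n)
    {
        return n >> 2;
    }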

These rules also have the advantage of portability.

