Finally we're at the point where we're ready to look into the techniques that can greatly improve the performance of a given piece of code. Most of them are not novel. Some of them have been around for so long that it would be difficult, if not impossible, to give proper credit. Others are ``obvious'' (once you know them...). But the key point here is that it is the characteristics of modern, and in particular Alpha-based, systems which make these techniques so important and worthwhile.

While the focus of this section is on the Alpha architecture, we would
also like to give an idea of how the same techniques work on other
architectures. The tricky part is that we want to do this without
igniting a war over which CPU architecture is the ``best'' or the
``fastest.'' Such terms are, to a good degree, meaningless since they
can be applied usefully only relative to a given problem. For this
reason, performance results are presented as follows: for the Alpha we
present both absolute and relative results. The absolute numbers are
useful to give a concrete feel for how fast the code is. The relative
results (i.e., speed ups) is what tells us how well a given technique
works. Where meaningful, we also list the speed up (but not the
absolute performance) achieved on a Pentium Pro based system. Only
listing speed up for the Pentium Pro case makes it impossible to tell
which system was faster on a given problem while still allowing us to
compare the relative benefits. (To avoid any misconception: this
arrangement was *not* chosen because the Alpha performed poorly;
suffice it to say that the author has been using Alphas for some time
now and is generally pleased with the performance level they achieve.)

Since an architecture per se doesn't perform at all, let us be a bit more specific about the systems used for the measurements:

- Alpha system:
- The Alpha system was an AlphaStation 600 5/333 (aka Alcor). It has a 333MHz 21164 processor with 4MB of third-level cache and 64MB of main-memory. While a nice (and very expensive) box, it is by no means the latest and greatest of the available Alpha systems. At the time of this writing, much faster and much cheaper 500MHz systems have been around for a while already.
- Pentium Pro system:
- The x86 system was a Gateway 2000 with a Pentium Pro processor with 32MB of main-memory and 256KB of second-level cache. For clarity, this system is referred to as ``P6'' during the remainder of this section.

It is also illustrative to compare `gcc` to commercial Alpha
compilers, such as Digital's GEM C compiler. The GEM C compiler usually
generates somewhat better code but every once in a while it creates
code much faster than `gcc`'s (this usually happens on floating
point intensive code). For this reason, some of the measurements also
include the results obtained with GEM C. This compiler was invoked as
`cc -migrate -O4 -tune ev5 -std1 -non_shared`. It's not clear what the
version of the compiler version was---it was whatever came with Digital
Unix version 3.2.

The Alpha architecture does not provide an integer division instruction. The rationale for this is that (a) such operations are relatively rare and (b) division is fundamentally of an iterative nature so implementing it in hardware is not all that much faster than a good software implementation [Alv91]. Nevertheless, there are important routines that depend on integer division. Hash-tables are a good example: computing a hash-table index typically involves dividing by an integer prime constant.

There are basically two ways to avoid integer division. Either, the
integer division is replaced by a floating point division or, if the
division is by a constant, it is possible to replace the division by a
multiplication with a constant, a shift, and a correction by one
(which isn't always necessary). Floating point division may sound
like a bad idea, but the Alpha has a very fast floating point unit and
since a 32 bit integer easily fits into a `double` without loss
of precision, it works surprisingly well. Replacing a division by a
constant with a multiplication by the inverse is certainly faster,
though it's also a bit tricky since care must be taken that the result
is always accurate (off-by-one errors are particularly common).
Fortunately, for compile time constants, `gcc` takes care of this
all by itself.

To illustrate the effect this can have, we measured how long it takes
to lookup all symbols in the standard C library (`libc.so`) using
the ELF hash-table lookup algorithm (which involves one integer
division by a prime constant). With integer division, roughly 1.2
million lookups per second can be performed. Using a `double`
multiplication by the inverse of the divisor instead brings this
number up to 1.95 million lookups per second (62% improvement).
Using integer multiplication instead gives the best performance of
2.05 million lookups per second (70% improvement). Since the
performance difference between the double and the integer
multiply-by-inverse version isn't all that big, it's usually better to
use the floating pointer version. This works perfectly well as long
as the operands fit in 52 bits and avoids having to worry about
off-by-one errors.

Two parts in modern CPUs that are easy to forget about are the data
and instruction translation lookaside buffers (TLBs). The TLBs hold
the most recently accessed page table entries. TLBs are necessary
since it would be far too slow to access the page tables on every
memory access. Since each TLB entry maps an entire page (8KB with the
current Alpha chips), the TLB is usually not the limiting factor to
performance. The catch is that when a program does suffer from
excessive TLB misses, performance will go down the drain fast. In
such a case, the slow down may easily be big enough that it is
worthwhile to switch to a completely different (normally slower)
algorithm that has a better TLB behavior. For example, on the Alpha
system reading one word from each page in an array of 63 pages (504KB)
takes about 15ns per access. But doing the same in an array of 64
pages takes 65ns per access---a slowdown of more than a factor of
four! Since the second-level cache in that system is 4MB large, this
jump in access time is purely due to the data TLB.^{1}

As said above, the TLB is not usually a first-order bottleneck but a small experiment did show that a hash-table that is 50% full and exceeds the data TLB size is no faster than a more compact search tree that needs two memory accesses per lookup (the hash-table needs just one memory access but since it exceeds the TLB size, that one access is slow). In general, the TLB may be the primary bottleneck when large data sets are accessed more or less randomly and sparsely.

On modern systems, memory accesses are bad and computation is good. In the ideal case we would like to completely replace memory accesses by computation. This obviously is not always possible; but where it is possible, the payoff can be big.

For example, let us consider the problem of reversing all the bits in each byte in a long. A byte is reversed as follows: bit 0 is swapped with bit 7, bit 1 with bit 6 and so on. Since we want to reverse all the bytes in a long, this algorithm is applied once for each byte in the long.

Why would anyone want to do this? As you may know, both the Alpha and the x86 architecture are little-endian, but when IBM designed the VGA graphics adapter, it chose to use a big-endian bit order for the pixels in the graphics memory. That is, bit 7 in a byte corresponds to the left-most pixel and bit 0 to the rightmost. This is backwards since the coordinate value for the left-most pixel is smaller. So, byte-reversal is indeed a relatively important operation for VGA X servers.

The traditional way of implementing byte-reversal might look something
like the code below. To conserve space, we show the code to reverse a
4-byte integer only. The 8-byte long case is a straight-forward
extension of this. The code assumes that array `byte_reversed`
has been initialized such that `byte_reversed[i]` is the
reversed value of `i`.

u_int byterev_naive (u_int L) { /* this is for sizeof(u_int)==4 */ return (((u_int)byte_reversed[(L >> 0) & 0xff] << 0) | ((u_int)byte_reversed[(L >> 8) & 0xff] << 8) | ((u_int)byte_reversed[(L >> 16) & 0xff] << 16) | ((u_int)byte_reversed[(L >> 24) & 0xff] << 24)); }

With this naive algorithm, each byte-reversal involves a table-lookup. Since the Alpha is a 64 bit architecture, this means that each long reversal involves eight memory accesses. How could we avoid these expensive memory accesses? Well, a simple and arguably more intuitive way to implement byte reversal is to use shifting and masking to swap the bits. The code that does this would look something like this:

Note that, except for the initialization ofu_long byterev_long (u_long L) { u_long mask = 0x0101010101010101; return (((L & (mask << 0)) << 7) | ((L >> 7) & (mask << 0)) | ((L & (mask << 1)) << 5) | ((L >> 5) & (mask << 1)) | ((L & (mask << 2)) << 3) | ((L >> 3) & (mask << 2)) | ((L & (mask << 3)) << 1) | ((L >> 1) & (mask << 3))); }

Now let's see how the implementations compare. The table below gives
the results for the Alpha, the Pentium Pro (P6), and a 120MHz Pentium
Notebook (P5). In addition to the two implementations shown above,
the table also includes a row that shows the performance of the naive
implementation when coded in x86 assembly code (implementation
`byterev_x86`). This routine comes straight from the
XFree86 v3.2 distribution.

Alpha | P6 | P5 | ||

Implementation | abs | rel | rel | rel |

byterev_naive |
55MB/s | 1.00 | 1.00 | 1.00 |

byterev_long |
72MB/s | 1.31 | 1.13 | 1.17 |

byterev_x86 |
n/a | n/a | 0.56 | 1.51 |

As the table shows, the byte-reversal routine that avoids memory accesses is over 30% faster on the Alpha. Interestingly, on the P6 this routine is also fastest. The benefit there is less (``only'' 13%), but given that it's more portable and more intuitive, there is no reason not to use it. What's stunning is the complete failure of the assembly version on the P6. That routine is only half as fast as the version that avoids memory accesses. This may be due the assembly routine's extensive use of byte accesses to CPU registers. For the Pentium, as shown in the P5 column, the assembly coded routine is the fastest version. But don't be misled by the relative performance numbers: in terms of absolute performance, the Pentium is just half as fast as the P6.

To summarize, not only is `byterev_long()` by far the fastest
version on the Alpha, but even on a P6 it appears like the right
solution.

One reason why `gcc` sometimes generates significantly worse code
than Digital's GEM compiler is that it does not perform
inter-procedural alias analysis. What this means is that `gcc`'s
alias analysis is sometimes unnecessarily conservative. To illustrate
the problem, consider the function below. It is a simple unrolled
loop that reverses all the bytes in array `src` and stores the
result in array `dst`.

void vecrev_naive (long *src, long *dst, long size) { long i; for (i = 0; i < size/sizeof(long); i += 4) { dst[i + 0] = byterev_long(src[i + 0]); dst[i + 1] = byterev_long(src[i + 1]); dst[i + 2] = byterev_long(src[i + 2]); dst[i + 3] = byterev_long(src[i + 3]); } }

The problem with this code is that the C compiler has no way of
knowing how the `src` and `dst` pointer relate to each
other. For all it knows, `dst` could point to the second element
of the array pointed to by `src`. When this happens, the two
pointers refer to overlapping regions of memory and they are said to
alias each other. For the compiler, it is important to know whether
the two pointers overlap since that determines the degree of freedom
it has in reordering instructions. For example, if the regions
overlap, then storing a value to `dst[i+0]` may affect the value
of `src[i+1]`. Thus, to be on the safe side, the compiler must
generate the loads and stores in the above function strictly in the
order in which they occur in the source code.

Now, if it is known that the two arrays passed to the function never
alias each other, then we can lend `gcc` a hand by explicitly
telling it so. We can do this by first reading all the values from
the memory, then doing all the computation and finally storing the
results back to memory. Thus, the above code would be transformed
into the code shown below.

void vecrev_sep (long *src, long *dst, long size) { unsigned long L0, L1, L2, L3; long i; for (i = 0; i < size/sizeof(long); i += 4) { /* loads: */ L0 = src[i + 0]; L1 = src[i + 1]; L2 = src[i + 2]; L3 = src[i + 3]; /* computation: */ L0 = byterev_long(L0); L1 = byterev_long(L1); L2 = byterev_long(L2); L3 = byterev_long(L3); /* stores: */ dst[i + 0] = L0; dst[i + 1] = L1; dst[i + 2] = L2; dst[i + 3] = L3; } }

Since the stores all happen at the end, `gcc` knows right away
that none of the stores may affect any of the preceding loads. This
provides it complete freedom in generating the best possible code for
the loads and computation (the assumption here is that
`byterev_long` is an inlined function).

On the Alpha and most other architectures with lots of registers
(e.g., most RISCs), this kind of code never (or at least very rarely)
hurts performance and usually improves performance for `gcc`.
Unfortunately, the same isn't true for the x86 architecture. The
problem there is that only few registers are available. So the code
that's better for RISCs is usually worse for the x86 due to additional
stores and loads that are necessary to access the temporaries that
ended up on the stack. To witness, performance numbers for the two
versions are given below:

Alpha | P6 | ||

Implementation |
abs | rel | rel |

vecrev_naive |
82MB/s | 1.00 | 1.00 |

vecrev_sep |
93MB/s | 1.13 | 0.73 |

By now you may be wondering where the promised order of magnitude improvements are. Let's consider a simple problem: matrix multiplication. While simple, it is typical of a problem that is considered floating-point intensive. In reality, most floating-point intensive problems are also very memory intensive. E.g., they process large vectors or matrices. Thus, the memory access pattern often plays a crucial role in achieving the best possible performance. The textbook implementation of matrix multiplication looks roughly like this:

Here, the matrix pointed to byvoid matmul0 (int dim, float *a, float *b, float *c) { int i, j, k; float dot; for (i = 0; i < dim; ++i) for (j = 0; j < dim; ++j) { dot = 0.0; for (k = 0; k < dim; ++k) dot += a[i*dim + k] * b[k*dim + j]; c[i*dim + j] = dot; } }

Rather than declaring the problem solved, let's think about the memory
access pattern for a minute. Each element in the result matrix
`c` is a dot product of a row in `a` and a column in
`b`. For example, the element `c[0][0]` is computed as the
dot product of the first row in `a` and the first column in
`b`. This is illustrated in Figure 2. In our
naive matrix multiply routine, this means that the accesses to
`a` causes a nice, dense, linear memory access pattern.
Unfortunately, things do not look quite as good for `b`. There
the memory access pattern is sparse: first, the element at offset 0 is
read, then the one at offset `dim` and so on. Such sparse access
patterns are bad for many reasons. Suffice it to say that it's
easiest to optimize a machine for dense, linear accesses, so it is
likely that those accesses will always be the fastest ones.
Fortunately, there is a simple trick that avoids the bad access
pattern for matrix `b`: before doing the actual matrix multiply,
we can simply transpose matrix `b`. Then, all memory accesses
are dense. Of course, transposing `b` causes extra work, but
since that matrix is accessed `dim` times, this may well be worth
the trouble.

So, let's change `matmul0` into `matmul1` by adding a matrix
transposition in front of the main loop. The code below assumes that
`tb` is an appropriately sized temporary variable to hold the
transposition of `b`.

If you thought 15MFlops is fast, think again:void matmul1 (int dim, float *a, float *b, float *c) { int i, j, k; float dot; /* transpose b: */ for (i = 0; i < dim; ++i) for (j = 0; j < dim; ++j) tb[i*dim + j] = b[j*dim + i]; ...rest like matmul0, except that b is replaced by tb... }

Was it worth the trouble? By all means yes:void matmul2 (int dim, float *a, float *b, float *c) { float dot0, dot1, dot2, dot3, a0, a1, a2, a3, b0, b1, b2, b3; float dot4, dot5, dot6, dot7, a4, a5, a6, a7, b4, b5, b6, b7; long I, J, i, j, k; ...transpose b into tb like in matmul1...; for (I = 0; I < dim*dim; I += dim) for (j = J = 0; j < dim; ++j, J += dim) { dot0 = dot1 = dot2 = dot3 = dot4 = dot5 = dot6 = dot7 = 0.0; for (k = 0; k < dim; k += 8) { a0 = a[I + k + 0]; b0 = tb[J + k + 0]; a1 = a[I + k + 1]; b1 = tb[J + k + 1]; a2 = a[I + k + 2]; b2 = tb[J + k + 2]; a3 = a[I + k + 3]; b3 = tb[J + k + 3]; a4 = a[I + k + 4]; b4 = tb[J + k + 4]; a5 = a[I + k + 5]; b5 = tb[J + k + 5]; a6 = a[I + k + 6]; b6 = tb[J + k + 6]; a7 = a[I + k + 7]; b7 = tb[J + k + 7]; dot0 += a0 * b0; dot1 += a1 * b1; dot2 += a2 * b2; dot3 += a3 * b3; dot4 += a4 * b4; dot5 += a5 * b5; dot6 += a6 * b6; dot7 += a7 * b7; } c[I + j] = dot0 + dot1 + dot2 + dot3 + dot4 + dot5 + dot6 + dot7; } }

The table below presents a summary of the performance results. For comparison, it also includes the results obtained when compiling the same code with Digital's GEM C compiler and the relative results for the P6.

Alpha | P6 | ||||

gcc |
GEM C | ||||

Implementation | abs | rel | abs | rel | rel |

matmul0 |
11MFlops | 1.00 | 13MFlops | 1.00 | 1.00 |

matmul1 |
47MFlops | 4.27 | 66MFlops | 5.08 | 2.53 |

matmul2 |
80MFlops | 7.27 | 105MFlops | 8.08 | 3.97 |

Multimedia applications are the rage these days. All mainstream CPU architectures (with the notable exception of the PowerPC) have so-called multimedia extensions. The Alpha is ideally suited for such applications since it has been a 64 bit architecture right from the start. In fact, the Alpha multimedia extension is completely trivial: it adds only four new instruction types (vector minimum/maximum, pixel error, pack and unpack). Since some of these instructions can operate on different data types (byte or word; signed or unsigned), the total number of instructions added is 13, which is much smaller than the corresponding number for other architectures. The reason the Alpha got away with so few additions is because its original instruction set already contains many of the instructions needed for multimedia applications. For example, there is an instruction that allows eight bytes to be compared in parallel---a seemingly simple instruction that can prove surprisingly powerful in a number of applications.

We illustrate this using `mpeg_play`, the Berkeley MPEG
decoder.^{2} Since there is not enough space
to illustrate all the optimizations that can be applied to this
program, we focus on one of the most important operations in the MPEG
decoder. This operation involves computing the average of two byte
vectors. This is a frequent operation since video often contains
images that can be represented as the average of an earlier and a
later image. The straight-forward way of averaging a byte vector is
shown below:

This loop executes at about 94ns per byte-average (iteration) when compiled withvoid byte_avg0 (u_long size, u_char *a, u_char *b, u_char *c) { int i; for (i = 0; i < size; ++i) c[i] = (a[i] + b[i]) / 2; }

To get even higher performance, we need to be a bit more aggressive.
Considering that the Alpha is a 64 bit architecture, it sure would be
nice if we could calculate the average of eight bytes in parallel.
Reformulating byte-oriented algorithms in such a data parallel format
is often trivial (see Section 4.3). For
byte-averaging, it's not quite as simple: the straight forward
implementation requires nine bits of precision, since 255+255=510.
But if we pack 8 bytes into a 64 bit word, there is no extra ninth
bit. How can we get around this? Obviously, we can divide the
operands by two *before* adding them. That way, the sum is at
most 127+127=254 which conveniently fits into 8 bits. But the catch
is that the result may be wrong: if both operands are odd, it will be
one too small. Fortunately, it's easy to correct for this: if bit 0
in both operands is set, a correction by one is necessary. In other
words, we can make space for that extra ninth bit by using an
additional long register that is used to hold the correction bits.
Since all intermediary results now fit into 8 bits, the obstacles to a
data-parallel implementation of byte-averaging have been removed. The
resulting code is shown below. For simplicity, it assumes that the
input vectors are long aligned and have a size that is an integer
multiple of the size of a long:

Note that macro#define CAT(v,x) ((u_long)(v)<<8 | (x)) #define VEC(x) CAT(CAT(CAT(CAT(CAT(CAT(CAT(x,x),x),x),x),x),x),x) void byte_avg2 (u_long size, u_char *ca, u_char *cb, u_char *cc) { u_long *a = (u_long*) ca; u_long *b = (u_long*) cb; u_long *c = (u_long*) cc; u_long A0 = a[0], B0 = b[0], A1, B1, CC, i, count = size / sizeof(u_long); for (i = 1; i < count; ) { A1 = a[i]; B1 = b[i]; ++i; /* read ahead */ CC = (A0 & B0) & VEC(0x01); A0 = (A0 >> 1) & VEC(0x7f); B0 = (B0 >> 1) & VEC(0x7f); c[i - 2] = A0 + B0 + CC; A0 = a[i]; B0 = b[i]; ++i; /* read ahead */ CC = (A1 & B1) & VEC(0x01); A1 = (A1 >> 1) & VEC(0x7f); B1 = (B1 >> 1) & VEC(0x7f); c[i - 2] = A1 + B1 + CC; } }

Despite its look, this code is actually very portable. For a 32-bit
architecture, all that needs to change is macro `VEC` (and even
that is necessary only to get rid of compiler warnings). Byte-order
is not an issue since even though the data is accessed a long at a
time, each byte is still processed individually. This is all nice and
fun, but lets not forget the real reason why we tried this:
performance. This data parallel version of the byte-averaging loop
runs at 5.3ns per byte-average---more than an order of magnitude
faster than the unrolled loop!

A summary of the three averaging routines is given in the table below.
The relative performance is in terms of throughput (number of
byte-averages per second) since that's both more intuitive and more
impressive. Results for the Alpha are presented both for `gcc`
and Digital's GEM C compiler; as usual, for the P6, `gcc` was
used.

Alpha | P6 | ||||

gcc |
GEM C | ||||

Implementation | abs | rel | abs | rel | rel |

byte_avg0 |
94ns/avg | 1.00 | 69ns/avg | 1.00 | 1.00 |

byte_avg1 |
60ns/avg | 1.57 | 56ns/avg | 1.23 | 0.82 |

byte_avg2 |
5ns/avg | 18.80 | 4ns/avg | 17.25 | 1.93 |

But how does all this affect performance of `mpeg_play`? This
is best illustrated by comparing the original Berkeley version with
the one optimized using the techniques described in this section
(particularly data-parallel processing and avoiding integer
divisions).^{3} Comparing MPEG
performance is a bit tricky since a large fraction of the time is
spent displaying images. This can be factored out by using option
```-dither none`'' which has the effect that nothing at all gets
displayed (while the MPEG stream is still decoded as usual). The
table below shows the result for this mode as well as when using an
ordered dither. The ordered dither itself was also optimized using
data-parallel processing, which resulted in a version called ordered4.
The exact set of options used for `mpeg_play` was
```-controls none -framerate 0 -dither D`''.

Version | Dither | Framerate |

Original | none |
45.3 frames/sec (1.00) |

Optimized | none |
139.8 frames/sec (3.09) |

Original | ordered |
26.9 frames/sec (1.00) |

Optimized | ordered |
58.9 frames/sec (2.19) |

Optimized | ordered4 |
98.2 frames/sec (3.65) |