Relative Speed of GMP Low-Level Functions (updated for GMP 4.1)
These diagrams show the approximate performance for the most
important low-level routines of GMP. These routines are the
work-horses of GMP, and for many applications nearly 100% of the run
time is spent therein.
For most CPU architectures, these routines are written in
hand-crafted assembly language.
These tables are normalized so that the fastest CPU gets 100.
Other CPUs' values are percentages of the fastest CPU's value. Values
are for the indicated CPU frequencies; scale them proportionally for
other CPU frequencies. First-level cache hits are assumed.
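The normalization and the frequency adjustment described above can be sketched in a few lines of C. This is only an illustration: the function names and the notion of a raw "rate" (for example, limbs processed per second) are assumptions made for the sketch and are not taken from the tables themselves. It also assumes that performance scales linearly with clock frequency for a given CPU design, which is reasonable only while operands stay in first-level cache.

```c
/* Hypothetical helper: normalize a raw throughput figure so that the
   fastest CPU in the table gets 100.  "rate" is an assumed unit such
   as limbs per second; any unit works if both arguments share it. */
static double normalized_value(double cpu_rate, double fastest_rate)
{
    return 100.0 * (cpu_rate / fastest_rate);
}

/* Hypothetical helper: rescale a table value that was measured at
   listed_mhz to the same CPU design running at target_mhz.
   Assumes linear scaling with clock frequency (L1 cache hits only). */
static double scale_to_frequency(double table_value,
                                 double listed_mhz,
                                 double target_mhz)
{
    return table_value * (target_mhz / listed_mhz);
}
```

For instance, a CPU listed with a value of 50 at 500 MHz would be expected to reach roughly 75 at 750 MHz under these assumptions.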
Grey bars indicate one of two things: either the numbers were not
measured but are instead based on vendor-supplied documentation,
or new code is being designed that is believed to give the
indicated performance once it is implemented.
Pink/lightgrey bars indicate that the performance is higher for certain
operand values. The base performance is shown in red/darkgrey.
Most GMP number crunching depends heavily on the speed of
mpn_addmul_1. On machines where mpn_addmul_N is implemented
for some N greater than 1, that routine becomes more
important than mpn_addmul_1. This is because multiplication,
division, modulo, gcd, etc., all boil down to calls to
mpn_addmul_1 or mpn_addmul_N.
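To make the role of mpn_addmul_1 concrete, here is a minimal portable C sketch of the operation it performs (multiply a limb vector by one limb, add the result into another vector, return the carry-out limb), followed by a schoolbook multiplication built from it. This is not GMP's code, which is hand-written assembly on most CPUs. The names limb_t, ref_addmul_1 and ref_mul_basecase are hypothetical, and the 32-bit limb size is an assumption chosen so that a 64-bit integer can hold each intermediate product.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumed limb size for this sketch */

/* rp[0..n-1] += up[0..n-1] * v, least significant limb first.
   Returns the carry-out limb.  Mirrors the documented contract of
   mpn_addmul_1, but is only a reference model. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t carry = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* Worst case (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1: no overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;      /* low limb stays in place */
        carry = t >> 32;        /* high limb moves to the next position */
    }
    return (limb_t)carry;
}

/* Schoolbook product: rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1].
   rp must not overlap the inputs.  One addmul_1 call per limb of vp. */
static void ref_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                             const limb_t *vp, size_t vn)
{
    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;
    for (size_t j = 0; j < vn; j++)
        rp[un + j] = ref_addmul_1(rp + j, up, un, vp[j]);
}
```

The inner loop of ref_addmul_1 is where essentially all the time goes, which is why the speed of a single multiply-and-accumulate step dominates the figures on this page.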
CPU manufacturers! You may want to improve the figures for your
CPU. Talk to gmp-devel@gmplib.org. This
list reaches GMP developers on the Net and at commercial companies,
some of whom may want to help you for a fee.
Some comments about how certain processors perform
-
UltraSPARC performance suffers badly from its poor integer multiply
support. The ISA doesn't provide any instructions for the upper half of
a product, and the current implementations have slow, non-pipelined
integer multiply units. Worse, UltraSPARC 1 & 2 even stall the entire
processor when the multiply unit is busy! GMP therefore has to resort to
floating-point, converting operands back and forth. (It seems Sun is
planning to address the ISA shortcomings; there are traces in the Solaris
8 assemblers of a new integer multiply instruction, umulxhi.)
-
Scientific code like GMP is ideal for the static Itanium
pipeline, but the code released with GMP 4 isn't well
optimized for Itanium. Swox has some experimental Itanium code that
runs really well, but it still needs a lot of work. Creating good
IA-64/Itanium code is a terrific challenge for humans and compilers
alike. But with some effort, the performance becomes awesome!
-
PowerPC 630 (aka POWER 3) has good multiply instructions, but IBM didn't
implement them with high-performance integer computations in mind; a full
128-bit product needs 18 cycles in the multiplier alone. But unlike
UltraSPARC, other parts of the processor continue executing. It would
therefore be possible to take advantage of the fast FPU, forming one or
even two 128-bit products with a number of floating-point operations while
waiting for the integer multiply units. Tests indicate that this could
quadruple the performance.
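Several of the comments above turn on the "upper half of a product" and on forming a full 128-bit product. As a rough illustration of what a CPU has to do when it lacks a direct instruction for the upper half, the following C sketch builds a 64x64 -> 128-bit product from four 32x32 -> 64-bit partial products. This is a generic integer technique and is not what GMP does on UltraSPARC, where it uses floating-point as described above. The function name mul_64x64_128 is hypothetical.

```c
#include <stdint.h>

/* Full product of a and b: *hi receives the upper 64 bits and *lo the
   lower 64 bits.  Uses only 32x32 -> 64-bit multiplies, as a machine
   without a "multiply high" instruction would have to. */
static void mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;

    uint64_t p00 = a0 * b0;   /* weight 2^0  */
    uint64_t p01 = a0 * b1;   /* weight 2^32 */
    uint64_t p10 = a1 * b0;   /* weight 2^32 */
    uint64_t p11 = a1 * b1;   /* weight 2^64 */

    /* Sum of everything landing in bits 32..63; at most 3*(2^32-1),
       so it cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);

    *lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

Four multiplies plus the carry handling replace what is a pair of instructions on an ISA with direct support, which gives a feel for why missing or slow high-half multiplies hurt so much.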
Please send comments about this page to
webmaster@gmplib.org
Copyright 2000-2003 Free Software Foundation
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.