GMP Speed Charts

Relative Speed of GMP Low-Level Functions (updated for GMP 4.1)

These diagrams show the approximate performance of the most important low-level routines of GMP. These routines are the workhorses of GMP, and for many applications nearly 100% of the run time is spent in them.

For most CPU architectures, these routines are written in hand-crafted assembly language.

These tables are normalized so that the fastest CPU gets 100. The values for the other CPUs are percentages of the fastest CPU's performance. Values apply at the indicated CPU frequencies; adjust them accordingly for other frequencies. A first-level cache hit is assumed.
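
As an illustration of the normalization described above, here is a minimal sketch in C. The function names normalize_scores and rescale_for_clock are hypothetical and not part of GMP, and the clock adjustment assumes that performance scales linearly with frequency, which is only plausible while operands stay in the first-level cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scale raw throughput figures so that the
   fastest entry becomes 100 and the rest become percentages of it. */
void normalize_scores(const double *raw, double *out, size_t n)
{
    assert(n > 0);
    double best = raw[0];
    for (size_t i = 1; i < n; i++)
        if (raw[i] > best)
            best = raw[i];
    assert(best > 0.0);
    for (size_t i = 0; i < n; i++)
        out[i] = 100.0 * raw[i] / best;
}

/* Hypothetical helper: adjust a charted score to another clock
   frequency, ASSUMING performance is proportional to frequency. */
double rescale_for_clock(double score, double charted_mhz, double actual_mhz)
{
    assert(charted_mhz > 0.0);
    return score * actual_mhz / charted_mhz;
}
```

Under that assumption, a CPU charted with a score of 40 at 500 MHz would be estimated at 60 when clocked at 750 MHz.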

Grey bars indicate one of two things: either the numbers were not measured but are based on vendor-supplied documentation, or new code is being developed that is believed to give the indicated performance once it is implemented.

Pink/light grey bars indicate that the performance is higher for certain operand values. The base performance is shown in red/dark grey.

Most GMP crunching depends most heavily on the speed of mpn_addmul_1. On machines where mpn_addmul_N is implemented for some N greater than 1, that routine becomes more important than mpn_addmul_1. This is because multiplication, division, modulo, gcd, etc., all boil down to calls to mpn_addmul_1 or mpn_addmul_N.
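
To make the role of mpn_addmul_1 concrete, here is a portable C sketch of the operation it performs: multiply an n-limb number by a single limb, add the result to an n-limb destination, and return the carry-out limb. This is not GMP's code. The names ref_addmul_1 and ref_mul_basecase are hypothetical, and 32-bit limbs are assumed so that a 64-bit integer can hold each intermediate result. The second function shows how a schoolbook multiplication reduces to one such call per limb of the multiplier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the addmul_1 operation with 32-bit limbs (an assumption):
   {rp,n} += {up,n} * v, returning the carry-out limb. */
uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    assert(n > 0);
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (uint32_t)t;           /* low half stays in place  */
        carry = (uint32_t)(t >> 32);   /* high half moves one limb up */
    }
    return carry;
}

/* Schoolbook multiply built on the routine above:
   {rp,un+vn} = {up,un} * {vp,vn}. rp must not overlap the inputs. */
void ref_mul_basecase(uint32_t *rp, const uint32_t *up, size_t un,
                      const uint32_t *vp, size_t vn)
{
    assert(un > 0 && vn > 0);
    for (size_t i = 0; i < un; i++)
        rp[i] = 0;
    for (size_t j = 0; j < vn; j++)
        rp[un + j] = ref_addmul_1(rp + j, up, un, vp[j]);
}
```

The inner loop does one multiply and a carry-propagating add per limb, which is why the speed of the multiply instruction dominates the figures shown in these charts.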

CPU manufacturers! You may want to improve the figures for your CPU. Talk to gmp-devel@gmplib.org. This list reaches GMP developers on the Net and at commercial companies, some of whom may want to help you for a fee.

Some comments about how certain processors perform

UltraSPARC performance suffers badly from its poor integer multiply support. The ISA doesn't provide any instructions for the upper half of a product, and the current implementations have slow, non-pipelined integer multiply units. Worse, UltraSPARC 1 & 2 even stall the entire processor when the multiply unit is busy! GMP therefore has to stick to using floating-point, converting operands back and forth. (It seems Sun is planning to address the ISA shortcomings; there are traces in the Solaris 8 assemblers of a new integer multiply instruction, umulxhi.)
Scientific code like GMP is ideal for the static Itanium pipeline, but the code released with GMP 4 isn't well optimized for Itanium. Swox has some experimental Itanium code that runs really well, but it still needs a lot of work. Creating good IA-64/Itanium code is a terrific challenge for humans and compilers alike. But with some effort, the performance becomes awesome!
PowerPC 630 (aka POWER 3) has good multiply instructions, but IBM didn't implement them with high-performance integer computations in mind; a full 128-bit product needs 18 cycles in the multiplier alone. But unlike UltraSPARC, other parts of the processor continue executing. It would therefore be possible to take advantage of the fast FPU, forming one or even two 128-bit products with a number of floating-point operations while waiting for the integer multiply units. Tests indicate that this could quadruple the performance.
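
To illustrate what the "upper half of a product" in the UltraSPARC comment costs when the ISA lacks an instruction for it, here is a sketch in C that builds the high 64 bits of a 64x64-bit product from four 32x32-bit partial products. The name umul_hi is hypothetical. This is not the approach GMP takes on UltraSPARC, which uses floating-point as described above; it only shows how much extra work a single missing instruction implies.

```c
#include <stdint.h>

/* Hypothetical helper: high 64 bits of the 128-bit product a * b,
   computed from 32-bit halves using only 64-bit arithmetic. */
uint64_t umul_hi(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;   /* contributes to bits   0..63  */
    uint64_t lh = a_lo * b_hi;   /* contributes to bits  32..95  */
    uint64_t hl = a_hi * b_lo;   /* contributes to bits  32..95  */
    uint64_t hh = a_hi * b_hi;   /* contributes to bits  64..127 */

    /* Sum everything that lands in bits 32..63; the part of this sum
       above 32 bits is the carry into the upper half. The sum is
       below 2^34, so it cannot overflow. */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```

Four multiplies and several shifts and adds replace what is a single instruction on many other architectures, and a routine like mpn_addmul_1 needs this result once per limb.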

mpn_addmul_1 and mpn_submul_1 performance

Please send comments about this page to webmaster@gmplib.org
Copyright 2000-2003 Free Software Foundation
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.