Some time back, I was working on some primality testing algorithms in the FLINT ulong_extras
module (the Brillhart-Lehmer-Selfridge test to be precise). It required computing the cube root of the integer being tested. While implementing it, I realized that the ulong_extras module was missing a cube root function. We couldn't have used the cbrt()
function available in math.c
, as it was introduced in C99 and FLINT is guaranteed to work with C90 (You'll notice that I've used unsigned long long int instead of uint64_t below, because of the same reason). We decided to use pow(n, 1./3.)
temporarily, however it was not reliable due to the way 1./3. is stored in memory. So I decided to write one for the ulong_extras module.
It did sound like a trivial task initially, given that we had to calculate only the floor of the cube root and not a floating point answer. However there was so many different algorithms to try and test against each other for efficiency, that it did end up taking quite a lot of time. One of them which I really liked involved a lot of bit level hacks, and is probably one of the most interesting pieces of code I have worked on.
The implementation can be found here
The implementation exploits the way an IEEE Double-precision floating-point is stored in memory. Do read up a bit on it before proceeding further.
This representation makes our task to calculate the cube root very simple. In this form, the integer n is stored as
$$
n = {(-1)}^{sign} * 1.{{fraction}} * 2^{(exponent - 1023)}
$$
The core idea is to remove a multiple of three from the exponent, which can be easily extracted using bit masks. This reduces our problem to computing a cube root on $[0.5, 1)$.
Since we are doing our computations on unsigned integers, we can stop worrying about the sign bit in the formula. The above expression can be rewritten as,
$$
n = {{residue}} * 2^{(exponent - 1022)}
$$
where residue is $(1.{fraction}) / 2$. Calculating the cube root,
$$ {\sqrt[3] n} = {{residue}}^{1/3}* 2^{{(exponent - 1022)}/3} $$
We can evaluate this value in a very efficient manner by using some lookup tables. There are two parts which need to be calculated, the exponent part and the cube root of the residue part.
Calculating the exponent part is quite easy. We calculate
$$x = 1 << {(exponent - 1022)}/{3} $$
and
$$y = 2^{(exponent - 1022)\%{3}} $$
y can have three possible values (since we're working mod 3 there), and hence we can store these three values in a lookup table. Let's call this factor_table. It comprises of the values 1^(1/3), 2^(1/3), 4^(1/3). The value of the cube root of the exponent part is $x * y$.
Calculating the cube root of the residue is slightly trickier, since obviously we want to avoid making a call to pow()
. We know that $ residue \in [0.5, 1)$. I split this interval into sixteen equal ones, of range 0.03125 each, and then used approximating polynomials to compute a good approximation of $ \sqrt[3] {residue}$.
I used mpmath's chebyfit function to compute approximation polynomials of degree 2 for each of the 16 intervals, and stored their coefficients in another lookup table. Let's call this table coefficient_table. Mpmath's chebyfit function uses the chebyshev approximation formula. I used the following call to compute the coefficients for each interval. i and j are the endpoints of the range.
mpmath.chebyfit(lambda x: mpmath.root(x,3), [i, j], 3, error = False)
Using this, I computed a 16 X 3 2-dimensional float array (16 intervals between $[0.5, 1)$ and 3 coefficients per interval). The tricky part was how to use these values. I could have used an if else ladder to decide which of the coefficients to use, depending upon the value of the $residue$. However introducing so many branches seemed like a very bad idea. Imagine an if-else ladder with 16 branches!
It took me a while to figure out a better solution. A much better way to do was to simply use a few initial bits of the residue. Since the residue is greater than or equal 0.5, the first bit is definitely set. The next 4 bits of the residue are sufficient to decide which set of values to use (or which interval the residue lies in).
For example if the residue was 0.58, then it would be represented as 100101000... in the double
. Using a bit mask, we can extract bits 2..5, which will give us a value between $[0, 16)$. In this case, it will be 0010, i.e. 2. So we know that we have to use the values present in $coefficient\_table[2]$. Let's call this value the table_index.
So $\sqrt[3] {residue}$ can now be calculated as following. Let this value be z.
We can save one floating point multiplication here (quite expensive), by using Estrin's scheme of evaluating polynomials.
Now we have solved both our subproblems, our cube root is simple the integral part of $x * y * z$, as calculated above.
Now that we have a way to calculate the cube root, the question is, how do we actually compute the values of exponent and residue? C does not allow bit level operations on a floating point data type.
>kush@kush:~/Desktop$ gcc -o a test.c -lflint -lm
>test.c: In function ‘main’:
>test.c:162:13: error: invalid operands to binary | (have ‘double’ and ‘long int’)
testDouble | 0xF;
^
Here I was trying to use the or
operation on a double
, and as expected, the code gave errors on compilation. This causes problems, as we cannot access the bits of the double
directly. The solution to this, was to use unions.
Unions let us store different data types, in the same memory location. The same 64 bits stored somewhere in the memory, can be accessed both as a double
, as well as an unsigned long long int
using this union!
We can declare a union comprising of a double
and an unsigned long long int
(an unsigned word), and perform the bit level operations on the unsigned long long int
to access the bits of the double
.
typedef union {
unsigned long long int uword_val;
double double_val;
} uni;
So basically, if we're trying to compute the cube root of an unsigned long long int
n, we take the following steps,
uni alias;
alias.doube_val = (double) n;
unsigned long long int wanted_bits = alias.uword_val & bit_mask;
Now we can access all the bits using alias.uword_val
.
You can have a look at the complete implementation of this method below. The mp_limb_t
data type used is the same as an unsigned word, or a 64 bit unsigned integer in most (64 bit) processors.
The implementation is accurate, and manages to beat pow(n, 1./3.) in all cases. I timed the two a lot of times for various sizes of n and plotted them. The timings are for a 1000 random numbers of the given size. The new method is 1.1 to 1.2 times faster than calling pow(n, 1./3.).
The actual cube root function present in FLINT currently is slightly more complex. I used Newton iterations and an algorithm designed by W. Kahan for smaller integers, as it was turning out to be more efficient. In case of larger integers (> 46 bits), the method discussed above is more efficient, and hence used. Check it out here,
Let me know what you think about it in the comments section.
Till next time,
Kush