I worked on a cube root function for the `ulong_extras` module. Earlier, the cube root was temporarily computed as `pow(n, 1./3.)` (rather than the `cbrt()` function available in `math.h`), however it was not reliable due to the way 1./3. is stored in memory. So I decided to write one for the ulong_extras module.
It did sound like a trivial task initially, given that **we had to calculate only the floor of the cube root** and not a floating point answer. However, there were so many different algorithms to try and test against each other for efficiency that it ended up taking quite a lot of time. One of them, which I really liked, involved a lot of bit level hacks, and is probably one of the most interesting pieces of code I have worked on.

The implementation can be found here

The implementation exploits the way an IEEE Double-precision floating-point is stored in memory. Do read up a bit on it before proceeding further.

This representation makes our task to calculate the cube root very simple. In this form, the integer **n** is stored as

$$
n = {(-1)}^{sign} * 1.{{fraction}} * 2^{(exponent - 1023)}

$$

**The core idea is to remove a multiple of three from the exponent, which can be easily extracted using bit masks. This reduces our problem to computing a cube root on $[0.5, 1)$.**

Since we are doing our computations on **unsigned integers**, we can stop worrying about the sign bit in the formula. The above expression can be rewritten as,

$$
n = {{residue}} * 2^{(exponent - 1022)}

$$

where residue is $(1.{fraction}) / 2$. Calculating the cube root,

$$ {\sqrt[3] n} = {{residue}}^{1/3}* 2^{{(exponent - 1022)}/3} $$

We can evaluate this value in a very efficient manner by using some *lookup tables*. There are two parts which need to be calculated, the **exponent** part and the cube root of the **residue** part.

Calculating the exponent part is quite easy. We calculate

$$x = 1 << {(exponent - 1022)}/{3} $$

and

$$y = 2^{((exponent - 1022)\%{3})/3} $$

y can have three possible values (since we're working mod 3 there), and hence we can store these three values in a lookup table. Let's call this **factor_table**. It comprises the values **1^(1/3), 2^(1/3), 4^(1/3)**. The cube root of the exponent part is then **$x * y$**.

Calculating the cube root of the residue is slightly trickier, since obviously we want to avoid making a call to `pow()`. We know that $residue \in [0.5, 1)$. I split this interval into sixteen equal ones, of range **0.03125** each, and then used **approximating polynomials** to compute a good approximation of $\sqrt[3] {residue}$.

I used **mpmath's chebyfit** function to compute approximation polynomials of degree 2 for each of the 16 intervals, and stored their coefficients in another lookup table. Let's call this table **coefficient_table**. mpmath's chebyfit function uses the **Chebyshev approximation formula**. I used the following call to compute the coefficients for each interval, where i and j are the endpoints of the range.

```
mpmath.chebyfit(lambda x: mpmath.root(x,3), [i, j], 3, error = False)
```

Using this, I computed a 16 X 3 2-dimensional float array (16 intervals between $[0.5, 1)$ and 3 coefficients per interval). The tricky part was **how** to use these values. I could have used an if else ladder to decide which of the coefficients to use, depending upon the value of the $residue$. However introducing so many branches seemed like a very bad idea. **Imagine an if-else ladder with 16 branches!**

It took me a while to figure out a better solution. A much better way was to **simply use a few initial bits of the residue**. Since the residue is greater than or equal to 0.5, the first bit is definitely set. The next 4 bits of the residue are sufficient to decide which set of values to use (or which interval the residue lies in).

For example, if the residue were 0.58, it would be represented as **100101000...** in the `double`. Using a bit mask, we can extract bits 2..5, which gives us a value in $[0, 16)$. In this case, it will be 0010, i.e. 2, so we know that we have to use the values present in $coefficient\_table[2]$. Let's call this value the **table_index**.

So $\sqrt[3] {residue}$ can now be calculated by evaluating the degree 2 approximating polynomial at the residue, using the coefficients stored in $coefficient\_table[table\_index]$. Let this value be **z**.

We can save one floating point multiplication here (quite expensive) by using **Estrin's scheme** of evaluating polynomials.

Now that we have solved both our subproblems, our cube root is simply the **integral part** of $x * y * z$, as calculated above.
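Putting the pieces together, here is a Python sketch of the scheme (the `struct` round trip mimics the C union trick; the `residue ** (1.0 / 3.0)` line is a stand-in for the table-driven polynomial step, and the final correction loops are just a safety net for this illustration, not part of the scheme described above):

```python
import struct

# factor_table: cube roots of 1, 2, 4, indexed by (exponent - 1022) % 3
FACTOR_TABLE = (1.0, 2.0 ** (1 / 3), 4.0 ** (1 / 3))

def floor_cbrt(n):
    # n >= 1. Reinterpret the double's bits as a 64-bit integer
    # (the C code does this with a union)
    bits = struct.unpack('<Q', struct.pack('<d', float(n)))[0]
    exponent = (bits >> 52) & 0x7FF               # biased exponent
    fraction = bits & ((1 << 52) - 1)             # 52 mantissa bits
    residue = (1.0 + fraction / 2.0 ** 52) / 2.0  # in [0.5, 1)
    e = exponent - 1022                           # n = residue * 2^e
    x = 1 << (e // 3)
    y = FACTOR_TABLE[e % 3]
    # Stand-in for the polynomial approximation of the real implementation
    z = residue ** (1.0 / 3.0)
    cbrt = int(x * y * z)
    # Guard against floating point rounding at the boundary
    while (cbrt + 1) ** 3 <= n:
        cbrt += 1
    while cbrt ** 3 > n:
        cbrt -= 1
    return cbrt
```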

Now that we have a way to calculate the cube root, the question is, **how do we actually compute the values of exponent and residue?** C does not allow bit level operations on a floating point data type.

```
kush@kush:~/Desktop$ gcc -o a test.c -lflint -lm
test.c: In function ‘main’:
test.c:162:13: error: invalid operands to binary | (have ‘double’ and ‘long int’)
     testDouble | 0xF;
                ^
```

Here I was trying to use the `|` (or) operation on a `double`, and as expected, the code gave errors on compilation. This causes problems, as we *cannot access the bits of the double directly*. The solution to this was to use a **union**.

**Unions** let us store different data types in the same memory location. The same 64 bits stored somewhere in memory can be accessed both as a `double` and as an `unsigned long long int` using this union!

We can declare a union comprising a `double` and an `unsigned long long int` (an unsigned word), and perform the bit level operations on the `unsigned long long int` to access the bits of the `double`.

```
typedef union {
    unsigned long long int uword_val;
    double double_val;
} uni;
```

So basically, if we're trying to compute the cube root of an `unsigned long long int` **n**, we take the following steps:

```
uni alias;
alias.double_val = (double) n;
unsigned long long int wanted_bits = alias.uword_val & bit_mask;
```

Now we can access all the bits using `alias.uword_val`.

You can have a look at the complete implementation of this method below. The `mp_limb_t` data type used is the same as an unsigned word, or a 64 bit unsigned integer on most (64 bit) processors.

The implementation is accurate, and manages to beat `pow(n, 1./3.)` in all cases. I timed the two many times for various sizes of n and plotted them. The timings are for 1000 random numbers of the given size. The new method is 1.1 to 1.2 times faster than calling `pow(n, 1./3.)`.

The actual cube root function present in FLINT currently is slightly more complex. I used **Newton iterations** and an algorithm designed by **W. Kahan** for smaller integers, as it was turning out to be more efficient. In case of larger integers (> 46 bits), the method discussed above is more efficient, and hence used. Check it out here,

Let me know what you think about it in the comments section.

Till next time,

Kush

I forked gabrielecirulli's 2048 repository and wrote the AI on top of it. This saved a lot of time, as I just had to write the AI, and not implement the whole game from scratch. **The source code for the AI is in** **this repository**.

2048 is played on a 4 X 4 grid, with four possible moves up, left, down and right. The objective of the game is to slide numbered tiles on the grid to combine them and **create a tile with the number 2048**.

There are **four possible moves** from which a player can choose at each turn. After each move, **a new tile is inserted into the grid at a random empty position**. The value of the tile can be either 2 or 4.

From the source code of the game, it was clear that the **value is 2 with probability 0.9**:

```
var value = Math.random() < 0.9 ? 2 : 4;
```

I used a **depth bounded expectimax** search to write the AI. At every turn, the AI explores all four possible directions and then decides the best one. The recurrence it follows alternates between max nodes and chance nodes.
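Written out, the recurrence is roughly the standard expectimax one (my notation: $s$ is a grid state, $move(s, a)$ the grid after the player plays move $a$, and $P(s' \mid s)$ the probability that the board turns $s$ into $s'$ by inserting a random tile):

$$ V(s) = \begin{cases} score(s) & \text{if } depth = 0 \\ \max_{a} V(move(s, a)) & \text{at a max node} \\ \sum_{s'} P(s' \mid s) \, V(s') & \text{at a chance node} \end{cases} $$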

The **max node** is the one in which the **player chooses a move** (out of the four directions), and the **chance node** is the one in which the **board inserts a random tile**. The max node decides the correct move by maximising the score of its children, which are the chance nodes.

Due to space and time restrictions, it is obviously not possible to explore the complete search space. The **search is bounded by a depth**, at which the AI **evaluates the score of the state** (the leaf node) using some heuristic.

The pseudocode is given below.

```
bestMove(grid, depth, agent):
    if depth is 0:
        return score(grid)
    elif agent is BOARD:
        score = 0
        for tile in grid.emptyTiles():
            newGrid = grid.clone()
            newGrid.insert(tile, 2)
            score += 0.9 * bestMove(newGrid, depth - 1, PLAYER)
            newGrid = grid.clone()
            newGrid.insert(tile, 4)
            score += 0.1 * bestMove(newGrid, depth - 1, PLAYER)
        return score / len(grid.emptyTiles())
    elif agent is PLAYER:
        score = 0
        for dir in [left, up, right, down]:
            newGrid = grid.clone()
            newGrid.move(dir)
            score = max(score, bestMove(newGrid, depth - 1, BOARD))
        return score
```

I used a combination of two heuristics to evaluate how **good** a state was.

In most of the games that I came close to winning, the bigger tiles were around the corner. Hence the first idea that I used was to **push the higher value tiles to one corner** of the grid. For this, I assigned different weights to different cells.

$$ W = \begin{pmatrix}

6 & 5 & 4 & 3 \\

5 & 4 & 3 & 2 \\

4 & 3 & 2 & 1 \\

3 & 2 & 1 & 0 \\

\end{pmatrix} $$

The score of a given 4 X 4 grid was calculated as follows:

```
score = 0
for i in range(0, 4):
    for j in range(0, 4):
        score += W[i][j] * grid[i][j].value
```

This had another advantage: the new random tiles tended to appear in the opposite corner, and hence were closer to the smaller valued ones. This **increased the chances of the newly generated tiles getting merged**.

It is obviously better for us if more tiles get merged. This happens if in a state two same valued tiles are present next to each other.

I calculate another value, a **penalty**, and subtract it from the score calculated from heuristic one. This penalty is calculated as follows:

```
penalty = 0
for each tile:
    for each neighbour of tile:
        penalty += absolute(tile.value - neighbour.value)
```

**This penalty becomes large when high value tiles are scattered across the grid**, hence indicating that that particular state is **bad**.

The heuristic function I'm using returns `score - penalty`, each calculated as mentioned above.
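The combined heuristic can be sketched in Python as follows (here the grid is assumed to be a plain 4 x 4 list of tile values with 0 for empty cells; the actual AI's representation differs):

```python
# Corner-weight matrix from heuristic one
W = [[6, 5, 4, 3],
     [5, 4, 3, 2],
     [4, 3, 2, 1],
     [3, 2, 1, 0]]

def heuristic(grid):
    # Heuristic one: weighted positional score, rewarding big tiles
    # near the top-left corner
    score = sum(W[i][j] * grid[i][j] for i in range(4) for j in range(4))
    # Heuristic two: penalty for large differences between neighbours
    penalty = 0
    for i in range(4):
        for j in range(4):
            for di, dj in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ni, nj = i + di, j + dj
                if 0 <= ni < 4 and 0 <= nj < 4:
                    penalty += abs(grid[i][j] - grid[ni][nj])
    return score - penalty
```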

As far as I know, there is no method to prune an expectimax search. The only way out is to avoid exploring branches that are highly improbable, however this has no use for us as each branch has an equal probability (the new tile has equal probability of popping up on any of the empty tiles).

Initially I was exploring nodes even if the move played by the PLAYER had no effect on the grid (i.e. the grid was stuck and could not move in a particular direction, which is a common occurrence during gameplay). Eliminating such branches did enhance the performance a bit.

The solver performs quite well. With a **search depth of 6**, it formed the 2048 tile **9 times out of 10**. With a **search depth of 8, it formed the 2048 tile every time** I tested it, and went on to 4096 every time as well.

For a better winning rate, and taking the time taken to search per move into consideration, I use an 8 ply lookahead if the number of empty tiles in the grid is less than 4, and a 6 ply lookahead otherwise. This combination leads to a win every time, and forms the 4096 tile **8 out of 10 times**.

**The AI achieved a best score of 177352, reaching the 8192 tile. The average score was around 40000, and the AI always won the game.**

I forgot to take the screenshot of the best run, but a closer one's here.

Each move takes anywhere from 10ms to 200ms, depending upon the search space and how complex the grid is at that moment. The version here has a timeout function, calling the AI to find the next move every 50 ms.

It can perform better if the search depth is increased, however that would make it quite slow. The fact that the code is in javascript doesn't make things any better :D. Have a look at the AI beat the game!

Writing the artificial intelligence was quite a lot of fun, and I learnt a lot during the process. Do let me know what you think about it in the comments section!

Till next time,

Kush

I got some free time last night, so I decided to work on it. I used the **Mechanize python library** to access a google form (at a given url) and all its control fields, and fill each one up with a random string / select a random option.

The current script works on forms that are single page and do not require a login. The code with the installation and execution details is in this repository. Let me know what you think!

The last time I came across such a thing, I was in the 10th grade and at that point, I couldn't do anything about it. However this time, powered with some knowledge about the **awesome** JavaScript (LOL), I decided to unblur the damn page.

After a quick inspection of the HTML code, I realized that Scribd isn't even trying -_- . I saw that Scribd was **actually sending the complete document to the client**, and not just the two unblurred pages. This is really dumb I believe, because once you send something to the browser, **any attempt to keep it secure/inaccessible is futile**.

All that had to be done, was change the style of each div which contained a part of the document (yes, Scribd inserts the document as unselectable text in the HTML!!), and increase the opacity of the image and voila, **the page is unblurred**. The script is in this repository.

```
// Remove the boxes that say you have to pay to view
$('.autogen_class_views_read2_page_blur_promo').remove();
// Removing irritating adverts
$('.between_page_ads').remove();
// Unblurring the text
$('.text_layer').css('text-shadow', '0px 0px 0px');
// Making images darker
$('.text_layer').css('color', '#000');
$('.absimg').css('opacity', '1.0');
```

The script requires jQuery to work.

I used this script to make a **chrome extension Unblur Scribd**. Install the extension, and as soon as you come across a blurred Scribd page, just click the extension to view the page.

One thing which I really liked about Scribd was that they **load their pages dynamically**. As soon as you reach a page x, an AJAX call to load page x + y is made. This saves so much overhead in the beginning when you open the URL, as you're actually loading just two pages of the document! However this makes unblurring the complete document impossible in one go, **without loading the complete document first**. The extension needs to be clicked every time the user sees a blurred page (that's about once every 5 to 6 pages).

The pencils down deadline for GSoC was **21st Aug, the previous Friday**. It has been quite a journey, working with FLINT for over three months. I've learnt quite a lot, not just math, but also how to write good quality code and tests.

Although I haven't been able to do everything I had mentioned in my proposal, I worked on a lot of stuff which I didn't mention. I worked extensively on the ECM, more than I had planned and ended up writing a very good stage I and II using mpn's. Apart from that I worked on the Pollard Rho Brent algorithm, and the MPQS. The MPQS has some problems and is not functional though.

This week I tried to debug the MPQS. The problem with MPQS is that the sieve value is not exceeding a limit, and hence does not get evaluated. No relations are being found. I believe this is because the roots of the polynomials are being computed incorrectly. I will be looking into this eventually, once I get a break in the semester.

Apart from the MPQS, I also worked on the `fmpz_factor_smooth()` function. It aims to find a factor of a number (around n^(1/3)), given that the number is not a semiprime. It will be a combination of trial division, Pollard Rho Brent, p + 1 and ECM. The profiling of Pollard Rho Brent is yet to be done. Once I do that, I can finish this off.

What's next? I have a lot of stuff which I plan to do. First is obviously to finish off the MPQS and add a double large prime version. Some other things I want to work on are improving the ECM stage II and the `fmpz_factor_smooth()` function. I will be working on them, however at a slower speed due to the course load.

I started working on a simple MPQS this week. Most of the code required (Knuth multiplier, factor base, linear algebra, square root functions) was already available in the SIQS module, written/ported by Nitin. I just had to make small changes to the code to make it compatible with the sieve I am writing. The part which I had to write again was the **polynomial code**. It wasn't that tough, as self initialization was not required.

The polynomial I am using in the sieve is `Q(x) = (Ax + B)^2`. Hence, we have to solve the equation `Q(x) - kN = 0`, i.e. `A^2x^2 + 2ABx + B^2 - kN = 0`.

While evaluating, we pull the factor **A** out, and try to factor `Ax^2 + 2Bx + (B^2 - kN)/A`. So for evaluating the sieve at different positions, we have to compute **A**, **B**, and **C** (C equals (B^2 - kN)/A).

I am using this paper as a reference. The ideal value of **A** is the square of a prime **d** near the value **(N/(2 * M^2))^1/4**. **B** is chosen as the **odd square root of kN (mod A)**. This is done by first calculating the square root mod d (which is a prime), and then using Hensel lifting to calculate the square root mod **A** (which is d^2). **C** can be computed easily by computing **(B^2 - kN)/A**.
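A Python sketch of this computation of B (the brute force square root stands in for a proper Tonelli-Shanks, and all names here are mine, not FLINT's):

```python
def sqrt_mod_prime(a, p):
    # Square root of a mod an odd prime p; brute force for clarity,
    # the real code would use Tonelli-Shanks
    for r in range(p):
        if r * r % p == a % p:
            return r
    return None

def odd_sqrt_mod_A(kN, d):
    # B is the odd square root of kN mod A, where A = d^2 and d is an
    # odd prime not dividing kN
    r = sqrt_mod_prime(kN % d, d)
    A = d * d
    # Hensel lift: write B = r + t*d, so that
    # (r + t*d)^2 = kN (mod d^2)  =>  t = (kN - r^2)/d * (2r)^(-1) (mod d)
    t = (kN - r * r) // d * pow(2 * r, -1, d) % d
    B = (r + t * d) % A
    # Exactly one of B and A - B is odd, since A is odd
    return B if B % 2 == 1 else A - B
```

(`pow(x, -1, m)` for the modular inverse needs Python 3.8+.)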


I have coded a couple of functions which implement selecting multiple polynomials and computing the associated data, the most important being the roots of `Q(x) = N (mod p)` for each prime p in the factor base:

- Functions to compute A, B & C
- Functions to compute the roots of Q(x) = N (mod p) for each prime p in the factor base
- Function to initialize
- Function to compute the next polynomial

The code can be found in this file.

I also ported the code from the existing SIQS module to the MPQS module. The current implementation has some problems: it does not factor. The control never reaches inside the `mpqs_evaluate_candidate()` function, hence **no relations are found**. I am looking into this problem currently.

Since I've missed out on some blog posts the last couple of weeks, I'll be summing up the stuff I've been up to in this post. The past month hasn't really been as productive as I would have wished, the reason being some wrong assumptions I made. I'll be giving a week wise summary below.

This week I started reading the existing SIQS code, which I pulled from Nitin's SIQS branch. It took me some time to understand the existing code, and to figure out where exactly I would start working to add the double large prime version.

While reading the code, I also found a small bug. In the `qsieve_evaluate_candidate()` function, all powers of 2 were being removed before the Knuth-Schroeppel multiplier was removed from the relation. **This would have slowed down factorisations in which the Knuth-Schroeppel multiplier was even.**

An issue was reported this week, concerning one of the functions I had written earlier (#150). I spent some time reading the `n_cbrt()` code, and noticed that I had made a terrible mistake earlier: **I had assigned wrong constants, which would have definitely caused overflows on 32 bit systems**. Although this did not solve the issue reported, it was definitely something which would have caused problems. I never noticed it earlier, as I never tested the code on a 32 bit machine.

This week I started coding the double large prime version, using *Factoring with two large primes* by A. K. Lenstra and M. S. Manasse as a reference. I started by reducing the cut off limit in `qsieve_evaluate_sieve()` to **produce more partials and pp relations**. Once I got these relations, I used `n_factor()` and `n_isprime()` in `qsieve_evaluate_candidate()` to check whether the remainder after pulling out the factor base primes is a prime / semiprime. If so, I wrote them to a file in the format `[type, number of factor base primes that divide it, factor base prime's index, factor base prime's exponent, large prime I, large prime II]`. In the case of a partial, large prime II was simply 1.

Here is when I started to face problems. **I simply couldn't manage to produce a lot of partials and pp's**. I tried to explore the code and check whether I was making any mistake, but couldn't figure out what to do to fix it. At this point I should have switched over to the tested FLINT2 code and extended it to a double large prime version, however I was under the impression that an SIQS was required before I could add a double large prime version. This was a wrong assumption.

I couldn't really do a lot this week, I had refresher modules and exams throughout the week at my institute (they take place before the beginning of the semester).

I did some minor work towards the FLINT 2.5 release this week. There was an issue (#32) reported some time back: it was required to check that in each call of a `*_si()` function, the si parameter is actually an **slong** and not a **ulong** or **mp_limb_t**. This had to be done for every call across the FLINT code base. I volunteered and checked some.

At the beginning of this week, I realised that it was too late to implement the double large prime MPQS completely. I decided to go back to the ECM code I had written and complete some of the work I had postponed till after the GSoC period. The main reason was that I knew what was going on in the ECM really well, and thought that I could do considerable work on it.

The ECM code I had written was doing well, **within a factor of 2 of GMP-ECM**. There were a couple of algorithmic enhancements I had in mind which could make it even faster and comparable to GMP-ECM. These included **multi point evaluation in stage II**, the **Montgomery PRAC algorithm in stage I** and maybe **Edwards coordinates**.

I started off by checking the code for any memory leaks and unnecessary memory allocations. I found unnecessary allocations in stage II, and cut them out using better indexing (in stage II). This **reduced the memory being used by almost half in the stage II code**. There were a couple of places where there were memory leaks. I resolved this as well.

I also fixed the hack I was using in `fmpz_factor_ecm_select_curve()`. I couldn't figure out how exactly to call `mpn_gcdext()` earlier on. I realised this week that I was passing an `mp_limb_t`, whereas I should have been passing an `mp_size_t` (**the latter is signed!**). So when the inverse was negative, I was facing problems (the function sets the "size" argument as negative if the mp_ptr is negative). Earlier on, I avoided using `mpn_gcdext()`: when it was required, I was converting back to `fmpz_t`'s, calling `fmpz_gcdinv()` and then converting back to mp_ptr's! Now I am making a direct call to `mpn_gcdext()`.

I also started to incorporate multi point evaluation in stage II this week. I had tried this earlier on, however contrary to my expectation, it was almost two times slower. I coded stage II again (thankfully I had the ECM fmpz code from some time back; coding directly using mpn's would have been very difficult). However, yet again, it didn't turn out to be faster than the existing code. Maybe I'm doing something wrong theoretically? However, in that case stage II should never be able to factor, whereas it definitely does.

I made some more minor changes to the ECM code and polished it, it seems good to be merged to me.

After a brief conversation with my mentor, I have decided to go back to MPQS. In the coming week, I will be working on a simple sieve, which uses multiple polynomials, without any self initialisation. A lot of code required for this already exists in the FLINT trunk.

This week I finished writing the ECM using the mpn functions, with reduction using a pre-computed inverse for integers larger than one word. The code is present in these files:

- ECM structure init
- ECM structure clear
- Addition on curve
- Doubling on curve
- Scalar multiplication of a point on curve (Montgomery Ladder)
- Function to select Curve Parameters
- ECM stage I
- ECM stage II
- ECM outer function

Apart from these, I also wrote a couple of **helper functions** for mpn arithmetic:

These are not regular functions, and cannot be used with all elliptic points. They assume that the "n" passed has its highest bit set. I did this to save a couple of additions. We are anyway working with **normalized** (shifted) mpn's, and hence this doesn't cause a problem.

Documentation and tests for `fmpz_factor_ecm()` are present here:

The current ECM is doing a good job: it takes around **0.1 to 0.2 seconds** on average to find a **50 bit factor** (~15 digits). I never got around to finding out why using multi point polynomial evaluation was slowing down stage II, whereas theoretically it should have been faster. I'll probably look at it later on. Also, I haven't profiled it and found optimum parameters (B1 & B2) for different sized factors. These two things ought to speed it up further.

Another problem, or rather a code inconsistency, exists in the `fmpz_factor_ecm_select_curve()` function. I was having some trouble using `mpn_gcdext()` at one point, so I am converting back to `fmpz_t`, using `fmpz_gcdinv()`, and then converting back to `mp_ptr`. This works, although it's quite hacky. I'll figure out the right way to use `mpn_gcdext()` and alter it.

The only serious problem is that the current ECM implementation **cannot find factors below 7 bits**. The `fmpz_factor_ecm_select_curve()` function fails to find a suitable curve for such integers.

Another thing I worked upon is `n_factor()` in the ulong_extras module. **Brent-Pollard Rho and ECM seem to do a better job than SQUFOF**, which is used currently. So instead of just calling `factor_one_line()` and SQUFOF, I changed it so that it calls Brent-Pollard Rho once, and then ECM for some tries. This gives almost a **4 to 5 times speedup** when factoring semiprimes, and 1.5 to 2 times for random cases. SQUFOF is still there in the code, so as to make sure that `n_factor()` doesn't fail (in case Brent-Pollard Rho and ECM do), however I never saw it being called in any of the random cases I tried.
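For reference, Brent's variant of Pollard Rho can be sketched in Python (a simplified single attempt, nothing like FLINT's mpn-based implementation; one attempt can fail, so callers retry with fresh random parameters):

```python
import math
import random

def pollard_rho_brent(n):
    # One attempt of Brent's variant of Pollard rho; returns a nontrivial
    # factor of odd composite n, or None if this attempt fails
    if n % 2 == 0:
        return 2
    y = random.randrange(1, n)
    c = random.randrange(1, n)
    m = 128                       # batch size for the gcd accumulation
    g = r = q = 1
    x = ys = y
    while g == 1:
        x = y
        for _ in range(r):
            y = (y * y + c) % n
        k = 0
        while k < r and g == 1:
            ys = y
            for _ in range(min(m, r - k)):
                y = (y * y + c) % n
                q = q * abs(x - y) % n
            g = math.gcd(q, n)
            k += m
        r *= 2
    if g == n:
        # The batched gcd overshot; backtrack one step at a time
        g = 1
        while g == 1:
            ys = (ys * ys + c) % n
            g = math.gcd(abs(x - ys), n)
    return g if g != n else None
```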

Including Brent-Pollard Rho and ECM did cause a problem though. **Both of them require random states** to be passed, and changing the function definition of `n_factor()` **will break existing FLINT code**. My mentor suggested that we should cache a `flint_rand_t` in the factor function itself. I am working on this right now.

This week, I plan to finish the small part remaining in `n_factor()` (caching of the `flint_rand_t`) and then go back to the double large prime MPQS. I am using the paper *MPQS with three large primes* as a reference. Although named three large primes, it also gives quite a lot of information on the double large prime version.

I spent the last week rewriting the ECM using **mpn's, pre-computed inverses, and ulongs**. Pollard Rho showed a significant speedup when I rewrote it using mpn's and pre-computed inverses, and I was hoping for the same here. Initially I was planning to write only stage I using the faster options available, however I ended up working on both stage I and stage II.

The flow which I had in mind was similar to many existing functions in FLINT. If the integer to be factorized fits in one word, use the ulong module else use fmpz's/mpn's. So I decided to rewrite ECM using the ulong module first.

- Group Addition of co-ordinates
- Group Double of co-ordinates
- Group Scalar Multiplication of co-ordinates
- Function to select Curve Parameters
- ECM Stage I
- ECM Stage II
- Outer wrapper function for ECM

Next I attempted to write ECM using **mpn's**. The code which I have written so far (using mpn's) works fine for single word integers, however causes problems for larger integers. It is still work in progress.

The current version of ECM still has a lot of room for improvement. This includes **tuning** it to find optimal values for B1 and B2. Also, I still haven't been able to pinpoint my error in the multipoint evaluation stage II I wrote a couple of weeks back. It works, however it is slightly slower than the initial stage II.

- Figure out what's causing problems for integers larger than one word in the mpn version of ECM and finish writing
- Time ECM against GMP
- Understand the double prime QS more thoroughly, go through the QS which Nitin (who is working on the SIQS) has written and start writing a double prime version using it.

Hi!

I missed out last week's blog, couldn't write it as I was traveling and had poor internet connection. In the last two weeks, I have been trying out different implementations of stage II, using **multi point evaluation**.

Mid term evaluations began on the 26th of June, and I have met most of the targets I had set. I have added an optimized version of pollard rho, and the ECM code should be completed by the end of this week. Although there are still a lot of things I'd like to try to make it faster, I don't want to deviate a lot from the timeline now. I'll probably get back to it after GSoC.

The current ECM timings are close to GMP-ECM, I'll be posting some timings next week, once I code it using mpn's. My main worry right now is the fact that **using multi point evaluation is slowing down ECM, contrary to the expected speedup**.

This week, I plan to find the reason why the multi point evaluation technique is not turning out to be faster, as expected. Also I would be rewriting ECM, at least the stage I, using mpn's.

The past week, I optimized the current stage II and tried out a new version of the stage II ECM. The main way to optimize stage II is by **reducing the number of points we compute**. My mentor pointed out a very nice way to do so. Unfortunately, it didn't give a significant speedup. I believe this is because the semiprimes we were factoring (at most 128 bits long) *did not need a lot of points to be computed in the first place*. The newer version definitely decreased the number of points being computed **(by more than half)**, but the number of points was not large enough in the first place to get a noticeable speedup.

The most time consuming step in stage II is when we compute and accumulate the product of differences of the x coordinates (in projective form, in which I am working, this is x1*z2 - x2*z1). This step involves a lot of modular reductions, and hence **a lot of inversions and divisions**. The efforts put into the new implementation of stage II will definitely be helpful when I write stage II using *mpn's and precomputed inverses*.

I had to write a new outer function too since the newer version had different pre computations than the older one.

This week, apart from the new ECM stage II, I also spent time cleaning up the way variables were being passed. I wrapped up most of the important variables in an **ecm structure**, passing a reference to it as an argument to the functions instead of a lot of variables. This struct also has **four fmpz_t's for temporary calculations**, which eliminate the need of initing and clearing new fmpz_t's in every `ecm_add()` and `ecm_mul()`. This struct also allows me to precompute the `GCD_table` (looks up whether `i and j are co-prime`) and `prime_table` (looks up whether `i * k + j is prime`) for all curves, instead of computing them each time for a curve.
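The two tables can be precomputed along these lines (a Python sketch; `SIZE` and `K` are illustrative values of mine, the real sizes depend on the stage II bounds B1 and B2):

```python
import math

def is_prime(n):
    # Trial division is fine here; the tables are small and built once
    if n < 2:
        return False
    return all(n % p for p in range(2, math.isqrt(n) + 1))

SIZE = 32   # illustrative table size
K = 30      # illustrative step k used in the i * k + j lookup

# GCD_table[i][j]: are i and j coprime?
GCD_table = [[math.gcd(i, j) == 1 for j in range(SIZE)] for i in range(SIZE)]

# prime_table[i][j]: is i * K + j prime?
prime_table = [[is_prime(i * K + j) for j in range(K)] for i in range(1, SIZE)]
```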

You can have a look at my recent commits here.

Currently I am trying to understand the FFT techniques used for ECM stage II. I found a lot of citations to the **Ph.D. thesis of Peter Montgomery**, which was on this topic. Initially I couldn't find it, but some helpful people on the flint-devel mailing list gave me the link. It is very interesting due to the math involved, but at the same time quite tough to follow.

This week I wasn't able to put in as much time as I would have liked. I usually work on the project all 7 days, but this weekend I wasn't available due to a hackathon I was attending. I will pick up the pace this week. My goals for this week are to implement the FFT stage II and to start coding the whole ECM with the mpn data types and functions.

I spent the greater part of last week studying elliptic curves and understanding possible stage II extensions, rather than coding. I got stuck in a lot of places and had to read a lot to understand; implementing stage II was certainly harder than stage I. More on that a bit later. First, I'll summarize this week's work for those who are in a hurry and cannot read the whole post.

In case someone is wondering where all the code I blogged about in the last post went, it was moved to the fmpz_factor module. My mentor suggested moving these functions there, as they are highly specialised and are not likely to be used in any way other than in ECM. This week I implemented not so optimized versions of the required stage II routines.

Apart from this, I also fixed the group operation functions **(double, add and multiply)**. I realised I was not taking care of corner cases (those involving the point O), which was causing a lot of bugs. You can have a look at my recent commits here.

It's quite weird that GitHub shows the author as unrecognised. I guess I forgot to set the global email in the git config while setting up git again (I messed up my computer some time back and had to reinstall OS X).

Once I was done with the stage I implementation, I started thinking about how to implement stage II. My first step was to implement a naive stage II, which *computes a scalar multiple of the stage I point for each prime between B1 and B2* (the two bounds). You can have a look at this algorithm in this paper (Algorithm 1). This was quite simple, as I already had all the required functions at my disposal (done in week 1). However, it turned out to be **quite inefficient**.
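The per-prime loop above needs the list of primes between B1 and B2. A minimal way to produce it (just a sketch; FLINT has its own prime iteration machinery) is a plain sieve of Eratosthenes, after which the naive stage II multiplies the stage I point by each listed prime in turn:

```c
#include <assert.h>
#include <stdlib.h>

/* Collect the primes p with B1 < p <= B2 via a simple sieve.
   Fills the caller-allocated `out` array and returns the count. */
static int primes_between(unsigned long B1, unsigned long B2, unsigned long *out)
{
    char *comp = calloc(B2 + 1, 1);   /* comp[i] != 0  =>  i is composite */
    int count = 0;
    for (unsigned long p = 2; p * p <= B2; p++)
        if (!comp[p])
            for (unsigned long m = p * p; m <= B2; m += p)
                comp[m] = 1;
    for (unsigned long p = (B1 < 2 ? 2 : B1 + 1); p <= B2; p++)
        if (!comp[p])
            out[count++] = p;
    free(comp);
    return count;
}
```

For realistic bounds (B2 in the millions), this list is exactly the "table of prime numbers" that the FFT continuation mentioned later avoids.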

To optimize this, the first algorithm I tried was from *Prime Numbers, A Computational Perspective (page 353)*. I decided to implement it primarily because it used Montgomery coordinates and seemed straightforward enough. However, after spending a day and a half on it, I couldn't get it to work. There seemed to be nothing wrong with my implementation, but it just wouldn't factor a number, *no matter how large B2 was*. So I started looking for papers which follow a different technique.

I came across another algorithm in [this paper](https://www.iacr.org/archive/ches2006/10/10.pdf) (Algorithm 4). I went ahead and implemented it, only to get bad news again: the algorithm never finished running. After spending quite some time debugging, **I realised there were mistakes in my group operation functions**. *Maybe this was also the reason my first implementation of stage II didn't work.*

When I sat down to recheck the group operations I had coded earlier, I came across a very weird thing. When I computed the point 4Q using the Montgomery ladder and the same point using addition (3Q + Q), my answers did not match. I found this quite strange, since those implementations seemed correct. I ended up asking a question on Math Stack Exchange, only to realize that the points were in fact the same! **What I had forgotten was that I am dealing with projective coordinates, so the absolute values of the coordinates need not match; only their ratio will always be the same.** I was stuck on a non-issue for more than a day :/
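The lesson above in code form (a single-word sketch): two projective x-coordinates (X1 : Z1) and (X2 : Z2) represent the same point exactly when X1/Z1 == X2/Z2 (mod p), which is tested without any inversion by cross-multiplying.

```c
#include <assert.h>
#include <stdint.h>

/* (X1 : Z1) == (X2 : Z2) as projective coordinates mod p
   iff X1*Z2 == X2*Z1 (mod p).  Comparing coordinate-wise is wrong.
   Assumes p < 2^63 so the 128-bit product does not overflow. */
static int proj_equal(uint64_t X1, uint64_t Z1,
                      uint64_t X2, uint64_t Z2, uint64_t p)
{
    return (__uint128_t)X1 * Z2 % p == (__uint128_t)X2 * Z1 % p;
}
```

For instance, (3 : 4) and (21 : 28) are the same point mod 101 even though the raw coordinates differ.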

My goal for next week is to optimize stage II further. I have been looking at techniques such as the **FFT continuation for ECM**, which completely avoids calculating a table of prime numbers. I haven't really looked into it yet; I will do so this week. Once I'm sure about what code I will be using for ECM, I'll code an mpn layer version of it.

This week I tried implementing the **Elliptic Curve Method of factorization**. I spent most of the time trying to understand how the algorithm works and going through existing implementations (GMP-ECM and pyecm). After spending quite some time, I decided to stick to **Montgomery elliptic curve coordinates** for stage I of the ECM. The reasons are that **scalar multiplication turns out to be faster** with Montgomery coordinates (no inversions required) and that **selecting a curve is quite easy** (Suyama parameterisation).

This week I implemented not so optimised versions of:

- Addition of two points on an EC
- Doubling of a point on an EC
- Scalar Multiplication (Montgomery Ladder)
- Selection of a random curve (Suyama Parameterization)
- Stage I ECM

Currently, scalar multiplication is carried out by the **Montgomery ladder algorithm**. However, I would like to try Montgomery's PRAC algorithm. It uses **addition-subtraction Lucas chains** and turns out to be faster than the Montgomery ladder. But this will be done once I am done with a stage II implementation.
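The ladder pattern itself is easy to show in isolation. In this sketch, ordinary integer addition stands in for the elliptic-curve group law: the two registers always satisfy R1 = R0 + G, and each bit of k costs exactly one "add" and one "double". On a Montgomery curve the same skeleton works because the difference R1 - R0 stays fixed at G, which is what the x-only addition formula needs.

```c
#include <assert.h>
#include <stdint.h>

/* Montgomery ladder skeleton, demonstrated on the additive group of
   integers: returns k*G using only add-of-distinct-values and doubling.
   (On a curve, + becomes point addition and 2* becomes point doubling.) */
static uint64_t ladder_scalar_mul(uint64_t k, uint64_t G)
{
    uint64_t R0 = 0, R1 = G;          /* identity and G */
    for (int i = 63; i >= 0; i--)
    {
        if ((k >> i) & 1)
        {
            R0 = R0 + R1;             /* "add": difference R1 - R0 stays G */
            R1 = 2 * R1;              /* "double" */
        }
        else
        {
            R1 = R0 + R1;
            R0 = 2 * R0;
        }
    }
    return R0;                        /* equals k*G */
}
```

The uniform one-add-one-double structure per bit is also what makes the ladder's running time independent of the bit pattern of k.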

Although I haven't timed the current stage I implementation properly, I did try to factor the 7th Fermat number `340282366920938463463374607431768211457`. Stage I ECM found the factor `59649589127497217` in **20.713978** seconds (B1 = 10000). Although this is better than the Pollard Rho function, it is still really slow, probably because the B1 I am using is very large. Stage II will make this factorization faster.

This week, I will study the ECM stage II algorithm and implement it. This will be a simple implementation using the fmpz module. If I manage to complete it before the end of the week, I will work on the PRAC algorithm for speeding up scalar multiplication.

The official coding period for GSoC begins tomorrow, and I plan to start off with ECM before beginning the MPQS implementation. I think it would be better to have both auxiliary factoring algorithms in hand before I proceed with the MPQS. I spent most of the community bonding period getting used to the *fmpz module* and *mpn layer functions* of **FLINT** and **GMP**, and implemented the Pollard Rho algorithm.

The previous week I finished the Pollard Rho implementation. Along with the mpn layer code I was working on, I also implemented it using `mp_limb_t`'s for **one word inputs**. It turns out to be **five to six times faster** for one word integers, compared to the fmpz_t code.

Currently, there is an `fmpz_factor_pollard_brent()` in the **fmpz_factor module**, which calls the ulong version of Pollard Brent (available in the **ulong_extras module**) for one word integers and the mpn code for bigger integers. There is also a function `fmpz_factor_pollard_brent_single()` in the fmpz_factor module, which lets the user pass the initial parameters (the constant of the polynomial and the initial value to start evaluation with) for Pollard Brent.
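To give a feel for the one-word case, here is a compact sketch of a Brent-variant rho on `uint64_t` (my own illustration of the standard algorithm, not FLINT's code). It takes the same caller-supplied parameters mentioned above: the starting value `y0` and the polynomial constant `c`, iterating x -> x^2 + c and batching |x - y| products to trade gcds for multiplications.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t mulmod_u64(uint64_t a, uint64_t b, uint64_t n)
{ return (uint64_t)((__uint128_t)a * b % n); }

static uint64_t gcd_u64(uint64_t a, uint64_t b)
{ while (b) { uint64_t t = a % b; a = b; b = t; } return a; }

/* Brent's rho for odd composite n < 2^63.  Returns a nontrivial factor,
   or 0 if this particular (y0, c) pair fails (the caller retries). */
static uint64_t pollard_brent_ul(uint64_t n, uint64_t y0, uint64_t c)
{
    uint64_t y = y0, g = 1, q = 1, x = 0, ys = 0;
    const uint64_t m = 64;            /* batch size between gcds */
    for (uint64_t r = 1; g == 1; r <<= 1)
    {
        x = y;                        /* snapshot for this power of two */
        for (uint64_t i = 0; i < r; i++)
            y = (mulmod_u64(y, y, n) + c) % n;
        for (uint64_t k = 0; k < r && g == 1; k += m)
        {
            ys = y;                   /* remember batch start for backtracking */
            uint64_t lim = (m < r - k) ? m : r - k;
            for (uint64_t i = 0; i < lim; i++)
            {
                y = (mulmod_u64(y, y, n) + c) % n;
                q = mulmod_u64(q, x > y ? x - y : y - x, n);
            }
            g = gcd_u64(q, n);        /* one gcd per batch of up to m steps */
        }
    }
    if (g == n)                       /* overshot: redo the batch one step at a time */
    {
        do {
            ys = (mulmod_u64(ys, ys, n) + c) % n;
            g = gcd_u64(x > ys ? x - ys : ys - x, n);
        } while (g == 1);
    }
    return (g == n) ? 0 : g;
}
```

The batching (one gcd per `m` iterations) is the essential trick: gcds are far more expensive than the modular multiplications they replace.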

This week I also read some basic **elliptic curve theory**, following *Prime Numbers, A Computational Perspective* by Crandall and Pomerance. I understood the various coordinate representations and the group addition operation; however, I still have some doubts regarding addition of points represented in the Montgomery form. I haven't gone through it thoroughly, but I noticed **Lucas chains** being mentioned, and this part confused me. The book has a good explanation of addition in affine and projective coordinates, but I couldn't find such a treatment of Montgomery coordinates.

This week, I plan to understand addition and scalar multiplication of Montgomery coordinates, and implement them. Once this is done, I can start off with phase I of the ECM. I also plan to go through the source code of pyecm and GMP-ECM. Although pyecm is in Python and not in C, it still might give me some ideas about the implementation.

I received the welcome package from Google yesterday, and the goodies are amazing!! The package had a sticker, a red diary and a pen. The sticker's awesome, they just keep getting better every year. Let the coding begin :D

This week I started working on my GSoC project. Before starting off with the code, I decided to spend three to four days understanding Pollard Rho and ECM once again. I was looking for books in the IIITD library when I came across **Prime Numbers, A Computational Perspective**. It was quite a pleasant surprise, as I had been looking for it for quite some time and was almost on the verge of ordering it. I have referred to this book quite a lot in the past, but had access only to a PDF version.

Another resource which I found quite useful, pointed out by my mentors, was a paper by **Richard Brent, An Improved Monte Carlo Factorization Algorithm**. It gives a detailed explanation of the modification Brent proposed to the Pollard Rho algorithm. Pollard Rho was initially based on Floyd's cycle finding algorithm; Brent improved upon it by **replacing that cycle finding algorithm with a better one**.
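Brent's cycle-finding idea can be shown on its own, separate from factoring: it finds the cycle length of the sequence x0, f(x0), f(f(x0)), ... by comparing a saved value against the next 1, 2, 4, 8, ... iterates, never moving the saved value backwards the way Floyd's tortoise does. (In the rho application, the equality test becomes a gcd.) The iteration map below is just a demo choice.

```c
#include <assert.h>
#include <stdint.h>

/* Brent's cycle detection: returns the cycle length lam of the
   eventually-periodic sequence x0, f(x0), f(f(x0)), ... */
static uint64_t brent_cycle_length(uint64_t (*f)(uint64_t), uint64_t x0)
{
    uint64_t power = 1, lam = 1;
    uint64_t tortoise = x0;
    uint64_t hare = f(x0);
    while (tortoise != hare)
    {
        if (power == lam)             /* start the next power of two */
        {
            tortoise = hare;          /* teleport instead of stepping */
            power <<= 1;
            lam = 0;
        }
        hare = f(hare);
        lam++;
    }
    return lam;
}

/* Demo iteration map: x -> x^2 + 1 mod 101. */
static uint64_t step_mod101(uint64_t x) { return (x * x + 1) % 101; }
```

The advantage over Floyd: only one value is advanced per step instead of three function evaluations per comparison, which is exactly what makes Brent's rho faster in practice.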

I started off by implementing Pollard Rho using standard FLINT big ints, *fmpz_t*'s. This was split into two functions: an inner one, which executes the algorithm with the polynomial constant and initial value as parameters, and an outer one, which assigns random values to the constant and initial value. The outer function calls the inner one a fixed number of times. The code can be found here.

**Timings for the factorization can be found in the below pastes:**

My only concern is that the **current code fails a lot of the time for smaller semiprimes**. This problem goes away if the inner function is called 2 to 3 times. Moreover, I am not really worried about this, as we will be using trial division for very small primes, which covers these cases.

Next, I tried implementing the same algorithm using `mpn`'s. Those who don't know about mpn's can read up on them **here**. They are the lowest level GMP functions, designed to be as fast as possible. However, getting used to them can be quite tricky.

**The mpn layer is bare bones.** It can be insanely fast, but making mistakes is very easy. I did make some mistakes while coding Pollard Rho using mpn's; probably the biggest was that I **wasn't handling carries** while using `mpn_add_n()`. Because of this, the polynomial evaluation was turning out to be wrong, and hence the algorithm was failing. The mpn code for Pollard Rho is currently in progress, but once I manage to implement it successfully, I think I will be quite at home with these functions. The code can be found **here**. In the next week, I plan to finish off Pollard Rho using mpn's, and then start off with ECM.
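The carry bug is worth spelling out. `mpn_add_n` *returns* the carry out of the top limb rather than storing it, so dropping the return value silently truncates the sum. The same contract, written in plain C with 64-bit limbs (a sketch, not GMP's implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Add two nlimbs-limb numbers limb by limb, propagating the carry.
   The final carry is returned, not stored: the caller must use it,
   which is exactly what mpn_add_n's return value is for. */
static uint64_t limb_add_n(uint64_t *rp, const uint64_t *ap,
                           const uint64_t *bp, int nlimbs)
{
    uint64_t carry = 0;
    for (int i = 0; i < nlimbs; i++)
    {
        uint64_t s = ap[i] + bp[i];
        uint64_t c1 = s < ap[i];          /* did a + b wrap? */
        rp[i] = s + carry;
        carry = c1 | (rp[i] < s);         /* did adding the carry wrap? */
    }
    return carry;   /* ignoring this loses the top bit of the sum */
}
```

With a full-limb input like 2^128 - 1 plus 1, the result limbs are all zero and the entire answer lives in the returned carry, which is precisely the case that broke the polynomial evaluation.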