Having recently stumbled across Facenet (back then), I was curious whether I could make use of it to develop something similar. Although Facenet doesn't really allow us to construct an image from scratch, it does allow us to change the facial attributes of an input image. It uses a Variational Autoencoder to learn an attribute vector for input images. By making changes to the attribute vector, we can change the facial attributes of an input image as required.

TL;DR In this post, we try to come up with a framework where a user can gradually improve upon a face image as they want (with respect to its attributes). We use a GA to improve upon the attribute vector which generates the image. At each iteration, we pick the attribute vectors (chromosomes) which generated the top-k images (closest to the target/intended target). In this exercise, we initially pick up a target image and compared the intermediatory images by taking a simple MSE. In case of a real-life scenario, a user could be simply selecting the top k images which fit the criterion. Of course, we can expose this vector directly to the user and ask them to make changes as they deem fit. However, it might be more easy for a user to pick some top-k closest images to their target, rather than changing attributes directly.

The python notebook (as an HTML) is available here. It's an extremely large file, sorry about that! I no longer have access to the server with the code, so cannot regenerate it with lesser images.

A Variational Autoencoder (VAE) is a variation (haha) of the autoencoder which tries to learn the distribution of the training input samples rather than just a dense representation. We try to map the input onto a distribution (a latent vector), rather than a fixed vector. A simple way to do is to couple the standard autoencoder reconstruction loss with and the deviation of the learnt representation from an expected (pre-decided) distribution. This loss can be computed using something like KL divergence.

The Facenet VAE employs a much fancier approach to compute the loss, which in fact allows for extremely granular control over features of the generated image. The learnt latent vector, in fact, allows modifying facial attributes of an image (and hence is appropriately named **attribute vector**).

I will skip explaining VAEs in detail in this post to save up on space. This blog by Lilian Weng does a great job at explaining autoencoders and other variants, including VAE. Would really recommend having a look if this piques your interest.

The genetic algorithm is a fascinating evolutionary algorithm inspired by natural selection. It is an iterative process, each iteration comprising of a population of individuals (called a generation). Each individual is represented by a vector of genes called chromosome. The goal is to search for a set of chromosomes which are the *best*, according to some criterion.

Each generation gives rise to a set of chromosomes in the next iteration of GA. This is done via two techniques, known as **cross-over** and **mutation**. A crossover takes place between two parents and generates a chromosome (child) with genes taken both the parents. This combination can be done randomly, or via selecting certain contiguous segments of the parents' chromosome. Mutation generates a child by taking a chromosome (parent) and randomly changing any one of its genes.

We generally restrict the population size at each generation. Not doing so would make the population grow exponentially, making it computationally impossible to simulate. This is done by using a fitness function and selecting those individuals with the highest fitness score. In the beginning, we randomly generate a set of chromosomes. At every generation, we expect to have a set of chromosomes that have a higher (or at least equal) fitness score compared to the chromosomes in the previous generation. This process allows us to gradually reach a target chromosome.

The description of GA above is very brief, would recommend this blog by Burak Kanber for a detailed write-up.

In our GA formulation, we consider the latent vector used to generate images in the Facenet autoencoder as the chromosome of an individual in the population. We make use of the pre-trained Facenet decoder, passing the chromosome as input to generate an image. Starting off from a population of randomly initialized chromosome attribute vector, we would expect to gradually improve upon the generated image until it becomes the target image.

At each step, we generate a set of images and ask the user to select the top-k best (this can be thought of as the fitness function). Using this set, we generate the next generation of chromosomes (attribute vectors). It is important to note that the fitness function is not applied directly on the individuals in a population (since there is no target latent vector, only a target image).

For our experiment, due to the absence of a user, we chose a target image and then for each chromosome in a population, we calculate the fitness as follows. The chromosome is first decoded into an image, and then the fitness is calculated by taking the inverse of the MSE between the image generated and the target image. Let `g(z)`

be the generator function (the Facenet decoder) which generates the image `I`

after taking `z`

as an input. If the target image is `T`

, we calculate the fitness as the inverse of `MSE(I, T)`

, i.e. the inverse of `MSE(I, g(z))`

.

Below are the images with the highest fitness function at different generations. We see how we slowly move towards a latent vector which generates the required smile.

In another scenario, a user can choose the top few images that he/she thinks is ideal, which can form the next generation of individuals. This way, the user can gradually improve upon how he/she wants the attributes in the image to look like.

This is a follow up to my previous post on fast DLA simulations. Please do read that before continuing further!

In this article, we try to come up with a model which can help us estimate the stickiness of a DLA simulation. Given an image from a DLA simulation (a binary matrix, 1 indicating a particle and 0 free space), we would like to estimate the value of `k`

used for that particular simulation. This was actually given to me as a part of a recruitment process, and for simplicity, the range of `k`

was set to [1e-3, 5e-2]

P.S. Python notebooks with the analysis done in this article can be found here. The notebook with estimation approach can be found here.

`k`

I briefly touched upon `k`

in the previous post, the stopping criterion of a particle undergoing Brownian motion depends on this parameter. As soon as a moving particle collides with a particle at rest, the moving particle stops with the probability `k`

. The higher the value, the higher the chance of a particle coming to rest (and hence, the stickier the particle). This parameter tends to change a DLA simulation quite a bit. Let's have a look at 3 simulations with 10,000 particles, but with varying stickiness.

Just by examining a few such simulations, it is clear that a low k value entails a more "compact" (or "crowded") image. If we can somehow come up with a nice property that can capture this crowding, we can think of modelling that as a function of the stickiness parameter. For my analysis, I used the total surface area of the image to capture the crowding. This was computationally fast to compute (order of \(n^2\)) and gave a good idea about the level of crowding of the image. We would expect the surface area to increase (less crowding) as k increases.

Given a DLA simulation, let's look at what all information we can extract

- The size of the canvas
- The number of particles
- The position of each particle in the canvas

Clearly, surface area can easily be calculated from the image. We just need to check each particle and see how many of its neighbouring cells are empty. For the rest of the article, I'll use `n`

to denote the total number of particles, and `S.A.`

to denote the surface area.

Before we start modelling anything, it was essential to first get the data. This was pretty straightforward since we had already written the DLA simulation code. We split the range of `k`

into equal intervals using `np.linspace(1e-3, 5e-2, 40)`

, and run 15 simulations of 20,000 particles each. For each run, we calculate and save the surface area after the addition of every 100 particles. This gave us 15 readings of surface area for each n in `range(100, 20001, 100)`

. For each value of n, we take the average surface area across the 15 runs. In total, we generated roughly 8000 `(n, k, S.A.)`

tuples, which we further use in our model.

All this data is available here.

`n`

and `k`

As a sanity check, we first plotted the value of surface area vs n, keeping k fixed. Obviously, this should have a strong positive correlation.

From the above plots, it's clear that the total surface area increases with n. We initially assumed that the graph would be strictly increasing, but later on realised that its wrong to claim this, as in some rare cases addition of a new particle can lead to a decrease in the surface area of the figure (think of a 3x3 square of particles with one of the edges having an empty space).

Next, we tried to see how surface area varies with the stickiness value (keeping n fixed). If this looks like something we can capture using a model, we don't need to look further.

`n`

As soon as I plotted these, I got really excited. This clearly looks somewhat like a log distribution! However the only downside, it's not bounded in a fixed range (look at the y-axis for different n). That should have been expected since the total surface area is bound to depend on `n`

as well. So we need to find a way to eliminate this from our model. A very naive way to this is to divide the surface area with `n`

.

`n`

Honestly, I didn't expect this to work since I couldn't come up with a proof that justified that dividing by `n`

(and not some power of `n`

) should normalize the range, but these plots looked really promising. The average surface area seems to be a better property to fit a curve on. Next, I checked how `surface area / n`

varies with `log(k)`

`n`

These plots looked pretty good to fit a curve on, so I stopped further alterations to the data. It looks like a straight line as `k`

increases, small values of `k`

don't look good though. So I tried with second-degree polynomials as well.

There were two important things to decide before beginning to fit a curve

- How should we split the data into train/test
- What metric should be used to evaluate the model

The data which we generated from the simulations were `(n, k, S.A.)`

tuples. I really wanted to eliminate `n`

completely, since training separate models for each value of `n`

just wasn't possible. Let's have a look at the scattering of surface area values for different ranges of `n`

, which might help us understand whether we can eliminate `n`

or not.

`n`

. Points from all simulationsWe notice that as the value of `n`

increases, the average surface area tends to scatter less. Let's explore further for smaller values of `n`

.

`n`

. Points from all simulationsTwo inferences can be drawn from the analysis until this point

- Values of surface area scatter less as n increases. Hence, the estimation will be more accurate for larger n.
- As n increases, this scattering becomes negligible. Furthermore, for large n (above 1000), the curves are extremely similar. Hence we can eliminate n completely from models which have a large number of particles.

So eliminating `n`

completely was not feasible, since it will tend to fail for smaller values. Another interesting idea here is to use some sort of a windowed curve fitting. For a given `n`

, instead of using complete data, we only use the data from simulations where the number of particles is in the range [n - c, n] for some c.

To check whether my intuition was correct, I simply partitioned the data into two parts and used `np.polyfit`

for curve fitting, checking for up to 3-degree polynomials. I trained models on data where the number of particles were in `range(100, 15000, 100)`

and tested it on data from simulations where the number of particles were in `range(15000, 20000, 100)`

.

This did seem like a good fit, the error in predicted `k`

was low (we're bounded in [1e-3, 5e-2]). Error % = 0.00166000318459 / (5e-2 - 1e-3) = 3.38 %

However, notice that the training error is high. This was expected because we saw high scattering for small `k`

. Let's try to see if using only data closer to our given `n`

can give a better fit. The next plots show predictions by a model trained in `range(10000, 15000, 100)`

, tested on `range(15000, 20000, 100)`

.

As expected, the training error went down in this case. However, it was really interesting to notice that **error on the test has also decreased!** E % = .00106134661616 / (5e-2 - 1e-3) = 2.16 %. Our estimate improved when we discarded data for n much smaller than the target n. This strongly suggests that we should be using a local window regression model.

The idea is to use only data from a fixed range of n smaller than the given n. Since training time is negligible, we can train the model in real time depending on the given `n`

in the DLA image. We already have simulation data for n up to 20,000 particles. The next obvious question that pops up is, what happens if `n`

is so large that we don't have simulation data. From the above analysis, we can safely say that the curve more or less remains the same for any number of particles above 10,000.

For each value of n in `range(1000, 20000, 1000)`

, we trained regression models on simulation data where the number of particles was in the range [n - 4000, n - 100].

For the 3-degree poly, the worst error percentage (0.0019) is better than the average train error in the previous fit. Furthermore, the training error also went down. Hence a local fit is definitely a better one.

So our strategy to predict stickiness is the following. For a given n, we take a window of n from the data and fit a 3rd-degree polynomial. We use this polynomial to predict the k for the given n. Code for this approach can be found here.

]]>Diffusion Limited Aggregation is a growing object consisting of particles that diffuse and aggregate to the object, much like how flocks of snow are created. Particles are introduced one after the other and made to undergo a random walk. Eventually, a newly introduced particle reaches an existing particle and remains there. This clustering of particles leads to a formation of aggregates of particles. This theory, proposed by T.A. Witten Jr. and L.M. Sander in 1981 is applicable to aggregation in any system where diffusion is the primary means of transport in the system. (André Offringa's report).

Another interesting description (and my favourite) involves a city square surrounded by taverns. Drunks leave the taverns and stagger randomly around the square until they finally trip over one their insensate companions at which time, lulled by the sounds of peaceful snoring, they lie down and fall asleep. The tendril-like structure is an aerial view of the sleeping crowd in the morning (Paul Bourke's blog).

P.S. Code for the simulations is available in this repository.

This article is the first of two parts on DLA. It's mostly about my implementation of a 2D DLA simulation with a point attractor, and an optimization I use to make these simulations faster. I plan to follow this up with another one on the stickiness factor estimation (more on stickiness below).

Generating these images is pretty straightforward.

- We begin by taking a blank square grid (all white pixels) with one black pixel in the centre
- Then we introduce a new particle (a black pixel) randomly at any point on the boundary.
- This new particle undergoes Brownian motion till it comes in contact with another black pixel on the canvas.
- As soon as the new pixel comes in contact with another black pixel, it stops moving and stays there forever.

The pixel undergoing Brownian motion can move to any of its eight neighbours with equal probability. This version of the DLA which we discussed has a point attractor (i.e. new particles get attracted towards a central point). This attractor itself can have a different locus (it could be a line or a circle), but we just focus on point attractors for the moment.

Another attribute which I incorporate in this model is called 'stickiness'. As soon as the new particle undergoing Brownian motion comes in contact with an existing particle at rest, instead of stopping, it stops with some pre-decided probability (stickiness factor) `k`

.

A python notebook of this simulation can be found here. It has a `DLA`

class which can be initialised with the size of the square grid and the stickiness factor `k`

. The `addPoint`

method can be used to add any number of particles to the canvas, the `printState`

method can be used to print the grid out (uses `matplotlib.pyplot.imshow`

).

As you would have guessed, adding a new particle is a very time-consuming process, especially as the number of particles increases. To accommodate a large number of particles, we need a larger grid. As a result, the area which a new particle explores before coming to rest increases. Of course, an easy way to quicken the whole process is to not use Python. Let's discuss a more mathematical approach.

At each iteration of DLA, we introduce a particle \(C_1\) at a randomly selected position \((x, y)\) on the border of our m x m canvas. This particle follows Brownian motion in the canvas until it meets the termination criterion. As m increases, the area in which \(C_1\) can move also increases, making simulations slow. In order to quicken the process, we would like to decrease the search space. Intuitively, we would like to spawn as close as possible to the current group of particles in the canvas.

We can do this by finding a group of points \(L_2\) such that the probability that a point that spawned on the square border reaches any of the points in \(L_2\) is the same. Formally, this would be a locus such that any two points \(l_{21}\) and \(l_{22}\) in it satisfy the condition

\[W_{reach}(a, b) = \frac{n(a, b)}{\sum_{c \in L_1}n(c, b)}\]

\(L_1\) is the set of points which make up the square boundary. \(W_{reach}(a, b)\) is the ratio of the number of ways of reaching \(b\) from \(a\) to the total number of ways of reaching \(b\) from all points in \(L_1\). We take a ratio since there are infinitely many ways of reaching \(b\) from any point in \(L_1\).

If the area covered by locus \(L_2\) is less than the area of the square boundary, then we can hope to see a faster DLA. Intuitively it seems \(L_2\) can be a circle. We shall be proving it below.

\(W_{total}\) of a point \(C_2\) on the circle is calculated by summing \(W_{reach}(c, C_2)\) for all points \(c\) on the square. \[ W_{total}(C_2) = \sum_{c \in side1}W_{reach}(c, C_2) \quad + \sum_{c \in side2}W_{reach}(c, C_2) \] \[ \quad \quad \quad \quad \; \; + \sum_{c \in side3}W_{reach}(c, C_2) \quad + \sum_{c \in side4}W_{reach}(c, C_2) \]

Now we shall show that \(W_{total}\) is equal for all points on the circle. There are two assumptions we make to simplify our analysis

- Instead of considering discrete coordinates, we shall consider all points on the 2D plane.
- We define \(W_{reach}(a, b) = k(|a_x - b_x| + |a_y - b_y|)\), where \(k\) is a constant.

Using our two assumptions,

Similarly,

Therefore,

\[ W_{total}(C_2) = \sum_{c \in side1}W_{reach}(c, C_2) \quad + \sum_{c \in side2}W_{reach}(c, C_2) \] \[ \quad \quad \quad \quad \; \; + \sum_{c \in side3}W_{reach}(c, C_2) \quad + \sum_{c \in side4}W_{reach}(c, C_2) \] \[ = 4\frac{S^2}{2} - kSR\cos\theta + kSR\sin\theta + kSR\cos\theta - kSR\sin\theta \] \[ \quad \quad \quad \quad + 2R^2\cos^2\theta + 2R^2\sin^2\theta + 2R^2\cos^2\theta + 2R^2\sin^2\theta \] \[ = 2S^2 + 4R^2 \]

Clearly \(W_{total}(C_2)\) is a constant, for every \(C_2\) on the circle with radius \(R\). Hence the probability of reaching any point on a circle with radius \(R\) is the same. **As a result, we can choose to spawn at any point on the circle with a radius \(R\) instead of the square boundary.**

The only constraint is that all the points outside the circle should be free. And obviously, we would like \(R\) to be as small as possible. Hence while spawning a new point, we chose the smallest circle that encloses the current image (minimum bounding circle).

A python notebook of DLA simulations using this optimization can be found here. Unfortunately I haven't been able to do any kind of benchmarking comparing the running times, however, I was able to go up to 10,000 particles (compared to 500 in the previous version) in a reasonable amount of time.

]]>This monsoon semester I picked up a really interesting course, Digital Audio (DES514 @ IIITD). Although it had been ages since I'd done anything related to music (seriously), the realization that it's my last semester made me have the whole "screw grades let's pick up something fun" talk in my head.

P.S. This also led me to pick up a philosophy course.

P.P.S. The semester didn't have a happy ending.

Anyway, coming back to the topic. Probably the most intense part of the course was the project, which Anant and I did together. We wanted to do something related to audio modulation (obviously), but not just stick to computational manipulations.

Before I delve further,

- TL;DR We created an interface which allows a user to move a source of sound around in a ring of speakers using his hand
- A quick demo video can be found here
- All code can be found in this repo

Although it is really tough to capture this project on camera (since we are experimenting with 8 channels, whereas a standard camera has mostly 2 input channels), you may be able to observe some changes in sound amplitude (mostly unwarrented due to the reverb in the small room hahaha).

We decided that we'd like to build an interface that would meet three broad goals

- It should be capable of modifying an input audio signal real-time.
- It should allow the user to have a high degree of control over how the input is signal is modified.
- The interface design should be minimal and it should feel natural (no one likes wires hanging from their body).

The first goal was easy to meet. We had been using SuperCollider (an open source platform for audio synthesis and algorithmic composition) throughout the semester, which can be used to modify an input audio signal in a few lines of code. An example would be passing the input signal through a low pass filter (LPF). And the best part, we can change parameters (like the max pass frequency of the LPF) and it gets reflected in the output signal **in real-time**.

The real challenge was merging this with our last two goals. We definitely did not want a wearable device, and at the same time wanted to allow the user to use his/her entire body to modify the input signal. As you would have guessed, we decided to move forward with a Kinect sensor. Although that meant coding in Visual Studio (and using Windows), it was perfect for our use case (especially since it came with its own gesture detection APIs).

For our project (since we just had a week to finish), we decided to move forward with just spatialization of the input sound signal in a circle using hand movements of the user. Once this was possible, porting it to include other effects (such as an LPF/HPF would be just grunt work).

Our interface would constitute of a server running the Kinect SDK, using which we detect the direction in which the user's hand is pointing. This information is sent to another server running supercollider. The angle is used to pan a given input signal across the speakers in such a way that it seems as if the source of sound is at the direction in which the user is pointing.

We had access to a brand new 8.1 surround system (big shout out to Prof. Timothy!), so panning the input signal across speakers was something which we could actually test. A rough schema of what we had in mind

It was a straightforward design which has 3 different components, each of them explained in detail below. Getting these three components together was the tricky part.

The Kinect sensor uses depth sensing and other computer vision approaches to estimate the skeleton of the person standing in front of it. The Kinect developer SDK allows us to use this information captured by the hardware. Using this SDK, we can capture the 3-D coordinates of the certain important points detected on the body (joints, fingers etc).

Since we can detect the coordinates for both the body centre and the fingers, we used these to calculate the x and y coordinate of the hand in the plane perpendicular to the body of the user. This allows us to understand two things

- The direction in which the user is pointing towards.
- How close is the user's hand from the center of his/her body.

Image is borrowed

We send this information to the superCollider server running on a different machine using OSC, a protocol running over UDP.

For computing the output signal streams for the eight different channels (each for a given speaker), we used the VBAP plugin for Supercollider3. This takes a signal and an angle from the median plane and redistributes the signal into a number of channels assuming the angle to be the source of the sound. This plugin is based on Vector base amplitude panning, more information on which can be found here.

For ease in testing, we create a GUI which displays the entire ring of speakers. There is a movable pointer, which is used to indicate the source of the sound. The pointer can be moved around in the circle as shown below.

This pointer is used to indicate the source of the sound. If we point towards the speaker C, it would sound as if the source of sound is the speaker C.

We also introduce a parameter ‘spread’, which allows us to widen the area from which the sound comes. This allows us to change the distance of the sound source (or how far the user perceives it to be).

We were able to modify the code for a 5.1 surround panner (in superCollider examples) and adapt it to the 8.1 surround setup which we had.

The pointer can be moved around using the mouse, the spread can be changed using a slider. However, our end goal was to link this with the user’s movements. We do so by using the information captured by the Kinect sensor.

We make use of Open Sound Control protocol to transfer detected coordinates from the server running Kinect to the server running supercollider. OSC makes use of the UDP, which can be used between two machines on the same local network. While coding from the C# environment, we had to import a SharpOSC library into Visual Studios to be able to create a SharpOSC object and send information over the server that 2 laptops were connected to. In supercollider, we simply had to open a port and start listening to whatever was being transmitted by the Kinect server to the supercollider server IP (on the same port).

After receiving the user coordinates (at the superCollider server) form the Kinect stream, we simply updated the coordinates of the pointer and the spread values based on the distance between the user’s center and his/her hands. This update was done every 10 microseconds (using the AppClock API), allowing us to make updates in almost real time.

]]>`ulong_extras`

module (the `cbrt()`

function available in `math.c`

, as it was introduced in `pow(n, 1./3.)`

temporarily, however it was not reliable due to the way 1./3. is stored in memory. So I decided to write one for the ulong_extras module.
It did sound like a trivial task initially, given that **we had to calculate only the floor of the cube root** and not a floating point answer. However there was so many different algorithms to try and test against each other for efficiency, that it did end up taking quite a lot of time. One of them which I really liked involved a lot of bit level hacks, and is probably one of the most interesting pieces of code I have worked on.

The implementation can be found here

The implementation exploits the way an IEEE Double-precision floating-point is stored in memory. Do read up a bit on it before proceeding further.

This representation makes our task to calculate the cube root very simple. In this form, the integer **n** is stored as

$$
n = {(-1)}^{sign} * 1.{{fraction}} * 2^{(exponent - 1023)}

$$

**The core idea is to remove a multiple of three from the exponent, which can be easily extracted using bit masks. This reduces our problem to computing a cube root on $[0.5, 1)$.**

Since we are doing our computations on **unsigned integers**, we can stop worrying about the sign bit in the formula. The above expression can be rewritten as,

$$
n = {{residue}} * 2^{(exponent - 1022)}

$$

where residue is $(1.{fraction}) / 2$. Calculating the cube root,

$$ {\sqrt[3] n} = {{residue}}^{1/3}* 2^{{(exponent - 1022)}/3} $$

We can evaluate this value in a very efficient manner by using some *lookup tables*. There are two parts which need to be calculated, the **exponent** part and the cube root of the **residue** part.

Calculating the exponent part is quite easy. We calculate

$$x = 1 << {(exponent - 1022)}/{3} $$

and

$$y = 2^{(exponent - 1022)\%{3}} $$

y can have three possible values (since we're working mod 3 there), and hence we can store these three values in a lookup table. Let's call this **factor_table**. It comprises of the values **1^(1/3), 2^(1/3), 4^(1/3)**. The value of the cube root of the exponent part is **$x * y$**.

Calculating the cube root of the residue is slightly trickier, since obviously we want to avoid making a call to `pow()`

. We know that $ residue \in [0.5, 1)$. I split this interval into sixteen equal ones, of range **0.03125** each, and then used **approximating polynomials** to compute a good approximation of $ \sqrt[3] {residue}$.

I used **mpmath's chebyfit** function to compute approximation polynomials of degree 2 for each of the 16 intervals, and stored their coefficients in another lookup table. Let's call this table **coefficient_table**. Mpmath's chebyfit function uses the **chebyshev approximation formula**. I used the following call to compute the coefficients for each interval. i and j are the endpoints of the range.

```
mpmath.chebyfit(lambda x: mpmath.root(x,3), [i, j], 3, error = False)
```

Using this, I computed a 16 X 3 2-dimensional float array (16 intervals between $[0.5, 1)$ and 3 coefficients per interval). The tricky part was **how** to use these values. I could have used an if else ladder to decide which of the coefficients to use, depending upon the value of the $residue$. However introducing so many branches seemed like a very bad idea. **Imagine an if-else ladder with 16 branches!**

It took me a while to figure out a better solution. A much better way to do was to **simply use a few initial bits of the residue**. Since the residue is greater than or equal 0.5, the first bit is definitely set. The next 4 bits of the residue are sufficient to decide which set of values to use (or which interval the residue lies in).

For example if the residue was 0.58, then it would be represented as **100101000...** in the `double`

. Using a bit mask, we can extract bits 2..5, which will give us a value between $[0, 16)$. In this case, it will be 0010, i.e. 2. So we know that we have to use the values present in $coefficient\_table[2]$. Let's call this value the **table_index**.

So $\sqrt[3] {residue}$ can now be calculated as following. Let this value be **z**.

We can save one floating point multiplication here (quite expensive), by using **Estrin's scheme** of evaluating polynomials.

Now we have solved both our subproblems, our cube root is simple the **integral part** of $x * y * z$, as calculated above.

Now that we have a way to calculate the cube root, the question is, **how do we actually compute the values of exponent and residue?** C does not allow bit level operations on a floating point data type.

```
>kush@kush:~/Desktop$ gcc -o a test.c -lflint -lm
>test.c: In function ‘main’:
>test.c:162:13: error: invalid operands to binary | (have ‘double’ and ‘long int’)
testDouble | 0xF;
^
```

Here I was trying to use the `or`

operation on a `double`

, and as expected, the code gave errors on compilation. This causes problems, as we *cannot access the bits of the double directly*. The solution to this, was to use

**Unions** let us store different data types, in the same memory location. The same 64 bits stored somewhere in the memory, can be accessed both as a `double`

, as well as an `unsigned long long int`

using this union!

We can declare a union comprising of a `double`

and an `unsigned long long int`

(an unsigned word), and perform the bit level operations on the `unsigned long long int`

to access the bits of the `double`

.

```
typedef union {
unsigned long long int uword_val;
double double_val;
} uni;
```

So basically, if we're trying to compute the cube root of an `unsigned long long int`

**n**, we take the following steps,

```
uni alias;
alias.doube_val = (double) n;
unsigned long long int wanted_bits = alias.uword_val & bit_mask;
```

Now we can access all the bits using `alias.uword_val`

.

You can have a look at the complete implementation of this method below. The `mp_limb_t`

data type used is the same as an unsigned word, or a 64 bit unsigned integer in most (64 bit) processors.

The implementation is accurate, and manages to beat pow(n, 1./3.) in all cases. I timed the two a lot of times for various sizes of n and plotted them. The timings are for a 1000 random numbers of the given size. The new method is 1.1 to 1.2 times faster than calling pow(n, 1./3.).

The actual cube root function present in FLINT currently is slightly more complex. I used **Newton iterations** and an algorithm designed by **W. Kahan** for smaller integers, as it was turning out to be more efficient. In case of larger integers (> 46 bits), the method discussed above is more efficient, and hence used. Check it out here,

Let me know what you think about it in the comments section.

Till next time,

Kush

I forked gabrielecirulli's 2048 repository and wrote the AI on top of it. This saved a lot of time, as I just had to write the AI, and not implement the whole game from scratch. **The source code for the AI is in** **this repository**.

2048 is played on a 4 X 4 grid, with four possible moves up, left, down and right. The objective of the game is to slide numbered tiles on the grid to combine them and **create a tile with the number 2048**.

There are **four possible moves** from which a player can choose at each turn. After each move, **a new tile is inserted into the grid at a random empty position**. The value of the tile can be either 2 or 4.

From the source code of the game, it was clear that the **value is 2 with a probability 0.9**. From the source code of the game,

```
var value = Math.random() < 0.9 ? 2 : 4;
```

I used a **depth bounded expectimax** search to write the AI. At every turn, the AI explores all the four possible directions and then decides the best one. The recurrence relation that it follows is

The **max node** is the one in which the **player chooses a move** (out of the four directions), and the **chance node** is the one in which the **board inserts a random tile**. The max node decides the correct move by maximising the score of its children, which are the chance nodes.

Due to space and time restrictions, it is obviously not possible to explore the complete search space. The **search is bounded by a depth**, at which the AI **evaluates the score of the state** (the leaf node) using some heuristic.

The pseudocode is given below.

```
bestMove(grid, depth, agent):
if depth is 0:
return score(grid)
elif agent is BOARD:
score = 0
for tile in grid.emptyTiles():
newGrid = grid.clone()
newGrid.insert(tile, 2)
score += 0.9 * bestMove(grid, depth - 1, PLAYER)
newGrid = grid.clone()
newGrid.insert(tile, 4)
score += 0.1 * bestMove(grid, depth - 1, PLAYER)
return score/len(grid.emptyTiles())
elif agent is PLAYER:
score = 0
for dir in [left, up, right, down]:
newGrid = grid.clone()
newGrid.move(dir)
score = max(score, bestMove(newGrid, depth - 1, BOARD)
return score
```

I used a combination of two heuristics to evaluate how **good** a state was.

In most of the games that I came close to winning, the bigger tiles were around the corner. Hence the first idea that I used was to **push the higher value tiles to one corner** of the grid. For this, I assigned different weights to different cells.

$$ W = \begin{pmatrix}

6 & 5 & 4 & 3 \\

5 & 4 & 3 & 2 \\

4 & 3 & 2 & 1 \\

3 & 2 & 1 & 0 \\

\end{pmatrix} $$

The score of a given 4 X 4 grid was calculated as following

```
score = 0
for i in range(0, 4):
for j in range(0, 4):
score += W[i][j] * grid[i][j].value
```

This had another advantage, the new randomly generated tiles got generated in the opposite corner, and hence were closer to the smaller value ones. This **increased the chances of the newly generated tiles getting merged**.

It is obviously better for us if more tiles get merged. This happens if in a state two same valued tiles are present next to each other.

I calculate another value, a **penalty**, and subtract it from the score calculated from heuristic one. This penalty is calculated as following:

```
penalty = 0
for each tile:
for each neighbour of tile:
penalty += absolute(tile.value - neighbour.value)
```

**This penalty becomes large when high value tiles are scattered across the grid**, hence indicating that that particular state is **bad**.

The heuristic function I'm using calculates returns `score - penalty`

, each calculated as mentioned above.

As far as I know, there is no method to prune an expectimax search. The only way out is to avoid exploring branches that are highly improbable, however this has no use for us as each branch has an equal probability (the new tile has equal probability of popping up on any of the empty tiles).

Initially I was exploring nodes even if the move played by the PLAYER had no effect on the grid (i.e. the grid was stuck and could not move in a particular direction, which is a common occurrence during gameplay). Eliminating such branches did enhance the performance a bit.

The solver performs quite well. With a **search depth of 6**, it formed the 2048 tile **9 times out of 10**. With a **search depth of 8, it formed the 2048 tile every time** I tested it, and went on to 4096 every time as well.

For a better winning rate, and taking the time taken to search per move into consideration, I keep a 8 ply lookahead if the number of empty tiles in the grid is less than 4, otherwise a 6 ply lookahead. This combination leads to a win every time, and **8 out of 10 times** forms the 4096 tile.

**The AI achieved a best score of 177352, reaching the 8192 tile. The average score was around 40000, and the AI always won the game.**

I forgot to take the screenshot of the best run, but a closer one's here.

Each move takes anywhere from 10ms to 200ms, depending upon the search space and how complex the grid is at that moment. The version here has a timeout function, calling the AI to find the next move every 50 ms.

It can perform better if the search depth is more, however that would make it quite slow. The fact that the code is in javascript doesn't makes things any better :D. Have a look at the AI beat the game!

Writing the artificial intelligence was quite a lot of fun, and I learnt a lot during the process. Do let me know what you think about it in the comments section!

Till next time,

Kush

The last time I came across such a thing, I was in the 10th grade and at that point, I couldn't do anything about it. However this time, powered with some knowledge about the **awesome** JavaScript (LOL), I decided to unblur the damn page.

After a quick inspection of the HTML code, I realized that Scribd isn't even trying -_- . I saw that Scribd was **actually sending the complete document to the client**, and not just the two unblurred pages. This is really dumb I believe, because once you send something to the browser, **any attempt to keep it secure/inaccessible is futile**.

All that had to be done, was change the style of each div which contained a part of the document (yes, Scribd inserts the document as unselectable text in the HTML!!), and increase the opacity of the image and voila, **the page is unblurred**. The script is in this repository.

```
// Remove the boxes that say you have to pay to view
$('.autogen_class_views_read2_page_blur_promo').remove();
// Removing irritating addverts
$('.between_page_ads').remove();
// Unbluring the text
$('.text_layer').css('text-shadow', '0px 0px 0px');
// Making images darker
$('.text_layer').css('color', '#000');
$('.absimg').css('opacity', '1.0');
```

The script requires jQuery to work.

I used this script to make a **chrome extension Unblur Scribd**. Install the extension, and as soon as you come across a blurred Scribd page, just click the extension to view the page.

One thing which I really liked about Scribd was that they **load their pages dynamically**. As soon as you reach a page x, an AJAX call to load page x + y is made. This saves so much overhead in the beginning when you open the URL, as you're actually loading just two pages of the document! However this makes unblurring the complete document impossible in one go, **without loading the complete document first**. The extension needs to be clicked every time the user sees a blurred page (that's about once every 5 to 6 pages).

Although I haven't been able to do everything

]]>The pencils down deadline for GSoC was **21st Aug, the previous Friday**. It has been quite a journey, working with FLINT for over three months. I've learnt quite a lot, not just math but how to write good quality code and testing.

Although I haven't been able to do everything I had mentioned in my proposal, I worked on a lot of stuff which I didn't mention. I worked extensively on the ECM, more than I had planned and ended up writing a very good stage I and II using mpn's. Apart from that I worked on the Pollard Rho Brent algorithm, and the MPQS. The MPQS has some problems and is not functional though.

This week I tried to debug the MPQS. The problem with MPQS is that the sieve value is not exceeding a limit, and hence does not get evaluated. No relations are being found. I believe this is because the roots of the polynomials are being computed incorrectly. I will be looking into this eventually, once I get a break in the semester.

Apart from the MPQS, I also worked on the `fmpz_factor_smooth()`

function. It aims to find a factor of a number, given that it is not semi prime (around n^1/3). It would be a combination of trial division, Pollard Rho Brent, p + 1 and ECM. The profiling of Pollard Rho Brent is yet to be done. Once I do that, I can finish this off.

What's next? I have a lot of stuff which I plan to do. First, is obviously to finish off MPQS and adding a double large prime version. Some other things which I want to work on is to improve the ECM stage II, and the `fmpz_factor_smooth()`

function. I will be working on them, however at a slower speed due to the course load.

I started working on a simple MPQS this week. Most of the code required (Knuth Mutiplier, factor base, linear algebra, square root functions) was already available in the SIQS module, written/ported by Nitin. I just had to make small changes to the code to make it compatible with the sieve I am writing. The part which I had to write again was the **polynomial code**. It wasn't that tough as self initialization was not required.

The polynomial I am using in the sieve is `Q(x) = (Ax + B)^2`

. Hence, we have to solve the equation `Q(x) - kN = 0`

, i.e. `A^2x^2 + 2ABx + B^2 - kN = 0`

.

While evaluating, we pull the factor **A** out, and try to factor `Ax^2 + 2Bx + (B^2 - kN)/A`

. So for evaluating the sieve at different positions, we have to compute **A**, **B**, and **C** (C equals B^2 - kn).

I am using this paper as a reference. The ideal value of **A** is the square of a prime **d** near the value **(N/(2 * M^2))^1/4**. **B** is chosen as the **odd square root of kN (mod A)**. This is done by first calculating the square root mod d (which is a prime), and then using Hensel lifting to calculate the square root mod **A** (which is d^2). **C** can be computed easily by computing **(B^2 - kN)/A**.

d^2). C can be computed easily by computing (B^2 - kN)/A.

I have coded a couple of functions which implement selecting multiple polynomials and computing associated data, the most important being the roots of `Q(x) = N (mod p)`

, for each prime p in the factor

- Functions to compute A, B &
- Functions to compute the Q(x) = N mod p for each prime p in the factor
- Function to initialize
- Function to compute the next polynomial

The code can be found in this file.

I also ported the code from the existing SIQS module for MPQS module. The current implementation has some problems, it does not factor. The control never reaches inside the `mpqs_evaluate_candidate()`

function. Hence **no relations are found**. I am looking into this problem currently.

Since I've missed out on some blog posts the last couple of weeks, I'll be summing up the stuff I've been up to in this post. The past month hasn't really been as productive as I would have wished, the reason being some wrong assumptions I made. I'll be giving week wise summary below.

This week I started reading the existing SIQS code, which I pulled from Nitin's SIQS branch. I took me some time to understand the existing code, and where exactly I would start working to add the double large prime version.

While reading the code, I also found a small bug in the code. In the `qsieve_evaluate_candidate()`

function, all powers of 2 were being removed before the Knuth Schroeppel multiplier is removed from the relation. **This would have slowed down factorisations in which the knuth schroeppel multiplier was even.**

An issue was reported this week, concerning one of the functions I had written earlier (#150). I spent some time reading the `n_cbrt()`

code, and noticed that I had made a terrible mistake earlier. **I had assigned wrong constants, which would have definitely caused overflows in 32 bit systems**. Although this did not solve the issue reported, it was definitely something which would have caused problems. I never noticed it earlier on as I never tested the code on a 32 bit machine.

This week I started coding the double large prime version. I was using Factoring with two large primes, A. K. Lenstra and M. S. Manasse as a reference. I started by reducing the cut off limit in `qsieve_evaluate_sieve()`

to **produce more partials and pp relations**. Once I got these relations, I used `n_factor()`

and `n_isprime()`

in `qsieve_evaluate_candidate()`

to check whether the remainder after pulling out the factor base primes is a prime / semi prime. If so, I wrote them to a file of the format `[type, number of factor base primes that divide it, factor base prime's index, factor base prime's exponent, large prime I, large prime II]`

. In case of a partial, the large prime II was simply 1.

Here is when I started to face problems. **I simply couldn't manage to produce a lot of partials and pp's**. I tried to explore the code and check whether I was making any mistake, but couldn't figure out what to do to fix it. At this point I should have switched over to the tested FLINT2 code, and extend it to a a double prime version, however I was under the impression that an SIQS was required before I could add a double large prime version. This was a wrong assumption I made.

I couldn't really do a lot this week, I had refresher modules and exams throughout the week at my institute (they take place before the beginning of the semester).

I did some minor work towards the FLINT 2.5 release this week. There was an issue #32 reported some time back. It was required to check that in each call of a function `*_si()`

, the si parameter is actually an **slong** and not a **ulong** or **mp_limb_t**. This had to be done for every call across the FLINT code base. I volunteered and checked some.

At the beginning of this week, I realised that it is too late to implement the double large prime MPQS completely. I decided to go back to the ECM code I had written and complete some of the work I had postponed till after the GSoC period. The main reason was that I knew what was going around in the ECM really well, and thought that I could do considerable work working on it.

The ECM code I had written was doing good, **under a factor of 2 compared to GMP-ECM**. There were a couple of algorithmic enhancements I had in mind, which could make it even faster and comparable to GMP ECM. These included **multi point evaluation in stage II**, the **Montgomery PRAC algorithm in stage I** and maybe **Edward coordinates**.

I started off by checking the code for any memory leaks and unnecessary memory allocations. I found unnecessary allocations in stage II, and cut them out using better indexing (in stage II). This **reduced the memory being used by almost half in the stage II code**. There were a couple of places where there were memory leaks. I resolved this as well.

I also fixed the hack I was using in `fmpz_factor_ecm_select_curve()`

. I couldn't figure out how exactly to call `mpn_gcdext()`

earlier on. I realised this week that I was passing an `mp_limb_t`

, whereas I should have been passing an `mp_size_t`

(**the latter is signed!**). So when the inverse was negative, I was facing problems (the function sets the "size" argument as negative if the mp_ptr is negative). Earlier on, I avoided using `mpn_gcdext()`

. When it was required, I was converting back to `fmpz_t's`

, calling `fmpz_gcdinv()`

and then converting back to mp_ptr's! Now I am making a direct call to `mpn_gcdext()`

.

I also started to incorporate multi point evaluation in stage II this week. I had tried this earlier on, however contrary to my expectation, it was almost two times slower. I coded the stage II again (thankfully I had the ECM fmpz code from some time back, coding directly using mpn's would have been very difficult). However yet again, it didn't turn out to be faster than than the existing code. Maybe I'm doing something wrong theoretically? However in such a case, stage II should never be able to factor, whereas it is definitely.

I made some more minor changes to the ECM code and polished it, it seems good to be merged to me.

After a brief conversation with my mentor, I have decided to go back to MPQS. In the coming week, I will be working on a simple sieve, which uses multiple polynomials, without any self initialisation. A lot of code required for this already exists in the FLINT trunk.

]]>This week I finished writing ecm using the mpn functions and reduction using pre-computed inverse for integers larger than one word. The code is present in these files:

- ECM structure init
- ECM structure clear
- Addition on curve
- Doubling on curve
- Scalar multiplication of a point on curve (Montgomery Ladder)
- Function to select Curve Parameters
- ECM stage I
- ECM stage II
- ECM outer function

Apart from these, I also wrote a couple of **helper functions** for mpn arithmetic:

These are not regular functions, and cannot be used with all elliptic points. These assume that the "n" passed has its highest bit set. I did this to save a couple of additions. We are anyway working with **normalized** (shifted) mpn's, and hence don't cause a problem.

Documentation and tests for `fmpz_factor_ecm()`

is present here:

The current ECM is doing a good job, it takes around **0.1 to 0.2 seconds** on an average to find a **50 bit factor** (~15 digits). I never got around to finding why using multi point polynomial evaluation was slowing down stage II, whereas theoretically it should have been faster. I'll probably look at it later on. Also, I haven't profiled it and found optimum parameters (B1 & B2) for different sized factors. These two things ought to speed it up further.

Another problem, or rather a code inconsistency exists in the `fmpz_factor_ecm_select_curve()`

function. I was having some trouble using `mpn_gcdext()`

at one point, so I am converting back to `fmpz_t`

, using `fmpz_gcdinv()`

, and then converting back to `mp_ptr`

. This works, although it's quite hacky. I'll figure out the right way to use `mpn_gcdext()`

and alter it.

The only serious problem, is that the current ECM implementation **cannot find factors below 7 bits**. The `fmpz_factor_ecm_select_curve()`

function fails to find a suitable curve for such integers.

Another thing I worked upon is `n_factor()`

in the ulong_extras module. **Brent-Pollard Rho and ECM seem to do a better job than SQUFOF**, which is used currently. So instead of just calling `factor_one_line()`

and SQUFOF, I changed it so that it calls Brent-Pollard Rho once, and then ECM for some tries. This gives almost a **4 to 5 time speedup** when factoring semi primes, and 1.5 to 2 times for random cases. SQUFOF is still there in the code, so as to make sure that n_factor doesn't fail (in case Brent Pollard Rho and ECM do), however I never saw it being called in all the random cases I tried.

Including Brent-Pollard Rho and ECM did cause a problem though. **Both of them require random states** to be passed, and changing the function definition of `n_factor()`

**will break existing FLINT code**. My mentor suggested that we should cache a `flint_rand_t`

in the factor function itself. I am working on this right now.

This week, I plan to finish the small part remaining in `n_factor()`

(caching of the `flint_rand_t`

) and then go back to double large prime MPQS. I am using the paper MPQS with three large primes as a reference. Although named three large primes, it also gives quite a lot of information on the double large prime version.

I spent the last week rewriting the ECM using **mpn's and pre-computed inverses, and ulongs**. Pollard Rho did show a significant speedup when I rewrote it using mpn's and pre-computed inverses, and I was hoping for the same here. Initially I was planning to write only till stage I using the faster options available, however I ended up working on both stage I and stage II.

The flow which I had in mind was similar to many existing functions in FLINT. If the integer to be factorized fits in one word, use the ulong module else use fmpz's/mpn's. So I decided to rewrite ECM using the ulong module first.

- Group Addition of co-ordinates
- Group Double of co-ordinates
- Group Scalar Multiplication of co-ordinates
- Function to select Curve Parameters
- ECM Stage I
- ECM Stage II
- Outer wrapper function for ECM

Next I attempted to write ECM using **mpn's**. The code which I have written so far (using mpn's) works fine for single word integers, however causes problems for larger integers. It is still work in progress.

The current version of ECM still has a lot of room for improvement. This includes **tuning** it so as to find optimal values for B1 and B2. Also I still haven't been able to pinpoint my error in the multipoint evaluation stage II I wrote a couple of weeks back. It works, however is slightly slower than than the initial stage II.

- Figure out what's causing problems for integers larger than one word in the mpn version of ECM and finish writing
- Time ECM against GMP
- Understand the double prime QS more thoroughly, go thorough the QS which Nitin (who is working on the SIQS) has written and start writing a double prime version using it.

I missed out last week's blog, couldn't write it as I was traveling and had poor internet connection. In the last two weeks, I have been trying out different implementations of stage II, using **multi point evaluation**.

Mid term evaluations began on the 26th of June, and I have

]]>Hi!

I missed out last week's blog, couldn't write it as I was traveling and had poor internet connection. In the last two weeks, I have been trying out different implementations of stage II, using **multi point evaluation**.

Mid term evaluations began on the 26th of June, and I have met most of the targets I had set. I have added an optimized version of pollard rho, and the ECM code should be completed by the end of this week. Although there are still a lot of things I'd like to try to make it faster, I don't want to deviate a lot from the timeline now. I'll probably get back to it after GSoC.

The current ECM timings are close to GMP-ECM, I'll be posting some timings next week, once I code it using mpn's. My main worry right now is the fact that **using multi point evaluation is slowing down ECM, contrary to the expected speedup**.

This week, I plan to find the reason why the multi point evaluation technique is not turning out to be faster, as expected. Also I would be rewriting ECM, at least the stage I, using mpn's.

]]>The past week, I optimized the current stage II, and tried out a new version of stage II ECM. The main way by which we can optimize stage II is by **reducing the number of points we compute**. My mentor pointed out a very nice way to do so. Unfortunately, it didn't have a significant speedup. I believe this is because the semi primes which we were factoring (at max 128 bits long), *did not need a lot of points to be computed in the first place*. The newer version definitely decreased the number of points being computed **(by more than a half)**, but the number points were not large enough in the first place to get noticeable speedup.

The most time consuming step in stage II is when we compute and accumulate the product of difference of the x coordinates (in projective form, in which I am working, it is x1*z2 - x2*z1). This step involves a lot of modular reductions, and hence **a lot if inversions and divisions**. The efforts put into the new implementation of stage II would definitely be helpful when I write stage II using *mpn's and precomputed inverses*.

I had to write a new outer function too since the newer version had different pre computations than the older one.

This week apart from the new ECM stage II, I also spent time on cleaning up the way variables were being passed. I wrapped up most of the important variables in an **ecm structure**, passing a reference to it as an argument to the functions instead of a lot of variables. This struct also has **four fmpz_t's for temporary calculations**, which eliminate the need of initing and clearing new fmpz_t's in every `ecm_add()`

and `ecm_mul()`

. This struct also allows me to precompute the `GCD_table`

(looks up whether `i and j are co-prime`

) and `prime_table`

(looks up whether `i * k + j is prime`

) for all curves instead of computing them each time for a curve.

You can have a look at my recent commits here.

Currently I am trying to understand the FFT techniques used for ECM stage II. I found quite a lot of citations to the **Ph.D. dissertation thesis of Peter Montgomery**, which was on this topic. Initially I couldn't find it, but some helpful people on the flint-devel mailing list pointed gave me the link. It is very interesting due to the the math involved, but at the same time quite tough to follow.

This week I wasn't able to give in as much time I would have liked. I usually work on all 7 days on the project, however this weekend I wasn't available due to a hackathon I was attending. I will pick up the speed this week. My goals for this week is to implement the FFT stage II and start to code the whole ECM with the mpn data types and functions.

]]>I spend the larger half of last week studying elliptic curves and understanding possible stage II extensions, rather than coding. I got stuck at a lot of places and had to read a lot to understand. Implementing stage II was certainly harder than stage I. More on that a bit later, first I'll summarize the work I have done this week for those who are in a hurry and cannot read the whole post.

In case someone is wondering where all the code I had blogged about in the last post went, it got shifted to the fmpz*factor module. My mentor siggested that it is better to move them to the fmpz*factor module, as these functions are highly specialised, and are not likely to be used in any way other than in ECM. This week I implemented not so optimized versions of:

Apart from this, I also edited the group operation functions **(double, add and multiply)**. I realised I was not taking care of corner cases (which involve the point O). This was causing a lot of bugs. You can have a look at my recent commits here.

It's quite weird Github shows the author to be unrecognised. I guess I forgot to set the global email in got config while setting up git again (I messed up my computer some time back, had to reinstall OSX).

Once I was done with the stage I implementation, I started thinking about how to implement the stage II. My first step was to implement a naive stage II, which *computes a scalar multiple of the point at the end of stage I for each prime between B1 and B2* (The two bounds). You can have a look at this algorithm in this paper (Algorithm 1). Now this was quite simple, as I already had all the functions required at my disposal (done in week 1). However this turned out to be **quite inefficient**.

To optimize this, the first algorithm that I tried out was in *Prime Numbers, A computational Perspective (page 353).* I decided to implement this primarily because it used Montgomery coordinates, and seemed straightforward enough. However, after spending a day and a half on it, I couldn't implement it correctly. There seemed nothing wrong with my implementation, but it just didn't factor a number, *no matter how large the B2 was*. So I decided to start looking for papers which follow a different technique.

I came across another algorithm in (this paper)[https://www.iacr.org/archive/ches2006/10/10.pdf] (Algorithm 4). I went forward an implemented it, only to get bad news again. The algorithm never finished running. After spending qutie some time on debugging, **I realised there were mistakes in my group operation functions**. *Maybe this was the reason for my first implementation of stage II not working too.*

When I sat down to recheck the group operations I had coded earlier, I came across a very weird thing. When I computed the point 4Q (using the Montgomery ladder) and the same point using addition (3Q + Q), my answers did not match. I found this quite weird, considering that my implementations were correct. I ended up asking a question on math exchange, only to realize that the points were in fact the same! **What I had forgotten was that I am dealing with projective coordinates. So the absolute value of the coordinates need not match, however their ratio will always be the same**. I was stuck at a non issue for more than a day :/

My goals for next week are to optimize the stage II further. I have been looking at some techniques such as the **FFT continuation for ECM**, which completely avoids calculating a table of prime numbers. I haven't really looked into it, I will be doing so in this week. Once I'm sure about what code I will be using for ECM, I'll code an mpn layer version for it.