Back in November 2019, we reported that Reliance Jio is able to block HTTPS traffic using a deep packet inspection (DPI) technique. In response, some readers messaged us saying that they ran our test and were able to reproduce similar behaviour on Airtel mobile networks. According to TRAI's Performance Indicators report for Jul-Sep 2019, Reliance Jio and Airtel serve roughly 52% and 23% of internet subscribers in India respectively. This essentially means that SNI inspection based censorship now impacts 3 out of every 4 internet connections in India.
Although the previous test was able to detect the presence of SNI inspection based censorship, it was not very insightful. In this post, we delve into a more informative test which not only confirms the presence of SNI inspection based censorship, but also helps us identify the exact mechanism. Furthermore, it also allows us to identify middleboxes which are actively inspecting SNI in TLS handshakes and censoring requests. Using this method, we were able to discover 25 different middleboxes registered to Airtel, which are actively censoring HTTPS traffic.
Quick links to different sections of this post:
1. Transport Layer Security
1.1 Server Name Indication
2. SNI Inspection based censorship
3. Iterative Network Tracing
4. Data Preparation
5. Methodology
6. Examining Airtel's behaviour
7. References
All the code for replicating this experiment, as well as the logs from our test runs can be found in this repository. Big shout-out to IPinfo for giving us access to their IP address dataset, and Gurshabad Grover for his suggestions while ideating the methodology and for editing this post.
Transport Layer Security (TLS) is a cryptographic protocol for providing communication confidentiality and authenticity, commonly used for encrypting web traffic (as done in HTTPS). Normally TLS is used over TCP, as it requires a reliable in-order data stream. A quick refresher on TLS by Cloudflare.
Server Name Indication (SNI), defined first in RFC4366 and then in RFC6066, is a TLS extension designed to facilitate the hosting of multiple HTTPS websites on the same IP address. While sending a ClientHello message (which initiates the establishment of a secure connection), the client is expected to fill in the SNI attribute with the hostname of the website it wishes to connect to. SNI, unfortunately, travels on the network in cleartext, i.e. network operators can not only see the websites you're visiting, but also filter traffic based on this information.
Since the SNI is present in cleartext, anyone on the network can inspect and filter traffic based on its value. As seen in other countries, ISPs can leverage this to deny access to certain websites. We can observe the same by attempting a TLS connection using openssl and monitoring packets to the host.
openssl s_client -state -connect 103.224.212.222:443 -servername fullhd720.com
[Figure: Attempting a TLS connection to 103.224.212.222, with SNI fullhd720.com. We observe a RST packet immediately after the ClientHello message containing the SNI is sent.]

For instance, using Airtel, we can see that the client receives a TCP RST packet when it tries to connect to the blocked website "fullhd720.com". The RST packet seems to originate from the actual host, and is received right after the ClientHello message containing the SNI is sent. PCAP.
To confirm that the connection termination was indeed due to the SNI, we can reattempt the connection with a different SNI which we don't expect to be blocked (in this case we use facebook.com).
openssl s_client -state -connect 103.224.212.222:443 -servername facebook.com
[Figure: Attempting a TLS connection to 103.224.212.222 with a different SNI, facebook.com. In this case, we observe a successful TLS handshake.]

This time we notice a successful connection, indicating that the RST in the previous attempt was indeed due to the specified SNI. PCAP.
Although this test does demonstrate the presence of SNI inspection based censorship, the packet dumps are not sufficient to prove that the RST packet was actually forged by a middlebox belonging to the ISP.
For a given host, let's call the minimum Time to Live (TTL) required for a packet to reach the host from the client min_ttl. Any packet with a TTL less than min_ttl would expire in transit, and never reach the host. Ideally, the router at which the TTL of the packet expires should respond with an ICMP Time Exceeded (ICMP message type 11) message. However, this is not guaranteed, and some routers are even configured to not send them (in order to hide the topology of the network).
So if the RST received is forged by a middlebox, we should receive it even when we send the ClientHello message with a TTL less than min_ttl. This approach, known as Iterative Network Tracing (INT), has previously been used to ascertain the presence of middleboxes which censor DNS and HTTP traffic in India [Yadav et al.] and China [Xu et al.]. Similar to these studies, we use INT to detect censorship of TLS traffic (explained further in the methodology section).
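The decision rule at the heart of INT can be stated in a couple of lines: a RST is only conclusive evidence of a middlebox when it arrives for a probe whose TTL was too low to have reached the real server. A minimal sketch (the function name is ours, for illustration):

```python
def rst_forged_by_middlebox(probe_ttl, min_ttl, received_rst):
    """A RST seen for a probe whose TTL was too low to reach the real host
    (probe_ttl < min_ttl) must have been injected by an on-path middlebox."""
    return received_rst and probe_ttl < min_ttl

# A RST at hop 4 when the host is 12 hops away implies a middlebox.
print(rst_forged_by_middlebox(probe_ttl=4, min_ttl=12, received_rst=True))   # True
# A RST at or beyond min_ttl could legitimately come from the host itself.
print(rst_forged_by_middlebox(probe_ttl=12, min_ttl=12, received_rst=True))  # False
```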
We run our tests using a list of potentially blocked websites (PBWs), curated from leaked court and government orders. The list and more information pertaining to it can be found here.
Using Google's DNS over HTTPS (DoH) service, each hostname was resolved to its correct IP address. Using DoH here is important, as it ensures that no DNS based censorship interferes with the test. This resulted in roughly 5000 (hostname, ip) pairs. Next, we selected a random subset and checked for TCP connectivity to port 443 for each of those IPs (since not all of them would support HTTPS traffic), filtering our list down to 1370 pairs.
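The port-443 reachability filter can be sketched as a plain TCP connect with a short timeout (the hostname, IP, and timeout value below are illustrative, not from the original scripts):

```python
import socket

def supports_tcp_443(ip, port=443, timeout=3):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

# Keep only (hostname, ip) pairs that accept TCP connections on port 443.
pairs = [("example.com", "93.184.216.34")]
reachable = [(h, ip) for h, ip in pairs if supports_tcp_443(ip)]
```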
For each of these test points, we establish a TCP connection with the resolved_ip, and send a TLS ClientHello with the SNI set to the correct_hostname. We sniff and save these ClientHello packets (just the SSL layer) for later use. Similarly, we save the ClientHello packet with the SNI set to facebook.com. These sniffed packets can be found here.
The input to the test is a 2-tuple, (correct_hostname, resolved_ip). We would like to understand the behaviour of a middlebox when it observes a ClientHello message containing an SNI for a website it wishes to block.

First, we calculate the min_ttl for a given test point. We begin by establishing a TCP connection with resolved_ip.
import socket
import random
from scapy.all import *

resolved_ip = "103.224.212.222"
dport = 443                          # TLS port
sport = random.randint(1024, 65535)  # Random source port

def create_connection(resolved_ip):
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
    s.bind(("usb0", 0))  # Was using a tethered mobile connection for the experiment
    IP_PACKET = IP(dst = resolved_ip)
    seq = random.randint(12345, 67890)  # Randomise initial sequence number
    SYN = TCP(sport = sport, dport = dport, flags = "S", seq = seq)
    SYNACK = sr1(IP_PACKET / SYN)  # Send SYN, wait for SYN-ACK
    ACK = TCP(sport = sport, dport = dport, flags = "A", seq = seq + 1, ack = SYNACK.seq + 1)
    send(IP_PACKET / ACK)  # Complete the three-way handshake
    return IP_PACKET, ACK
Note: when the Linux kernel receives a TCP packet destined for an unknown socket, it sends a RST back to the originator. Since we'll be creating our own raw sockets, we need to suppress these outbound RSTs from the kernel using iptables before running the experiments.
sudo iptables -A OUTPUT -p tcp --tcp-flags RST RST -j DROP
Once the TCP connection has been established, we send ClientHello messages (containing facebook.com in the SNI) after updating the TTL (probe_ttl) in the underlying IP header. We specify facebook.com in the SNI so that the middlebox doesn't attempt to terminate the connection. min_ttl is then the minimum TTL at which we receive a TLS ServerHello or TLS Alert from the host.
max_ttl = 35

def find_min_ttl(resolved_ip):
    # Load ClientHello with facebook.com in SNI (sniffed earlier, read Data Preparation)
    with open("tls_client_hellos/facebook.com", 'rb') as fp:
        tls_client_hello_facebook_com_sni = fp.read()
    for probe_ttl in range(1, max_ttl):
        IP_PACKET, ACK = create_connection(resolved_ip)
        IP_PACKET.ttl = probe_ttl
        del IP_PACKET.chksum  # Forces scapy to recalculate the checksum after the TTL update
        resp, _ = sr(IP_PACKET / ACK / tls_client_hello_facebook_com_sni, timeout = 2, retry = 0, multi = True)
        for _, ans_packet in resp:
            # A ServerHello or TLS Alert means the probe reached the actual host
            if ans_packet.haslayer(tls.TLSAlert) or ans_packet.haslayer(tls.TLSServerHello):
                return probe_ttl  # min_ttl found!
Next, we send ClientHello messages containing the correct_hostname in the SNI, with the TTL increasing from 1 to min_ttl - 1. If there is no middlebox interfering with the connection, all such requests should receive either an ICMP Time Exceeded in response or no response at all. If at any point we receive a RST packet which seems to originate from resolved_ip, we can say with certainty that the packet was forged by a middlebox.
min_ttl = find_min_ttl(resolved_ip)

with open("tls_client_hellos/fullhd720.com", 'rb') as fp:
    tls_client_hello_correct_sni = fp.read()

for probe_ttl in range(1, min_ttl):
    IP_PACKET, ACK = create_connection(resolved_ip)
    IP_PACKET.ttl = probe_ttl
    del IP_PACKET.chksum  # Forces scapy to recalculate the checksum after the TTL update
    resp, _ = sr(IP_PACKET / ACK / tls_client_hello_correct_sni, timeout = 2, retry = 0, multi = True)
    for _, ans_packet in resp:
        if ans_packet.haslayer(ICMP):
            print("Found ICMP message of type %d" % (ans_packet[ICMP].type))
            continue
        if ans_packet.haslayer(TCP) and ans_packet[TCP].flags & 0x04:  # RST flag set
            print("Found RST at hop %d" % (probe_ttl))
Using our methodology, we mine the following information for each point in our test list:
The script for running this test can be found here. Logs for each test run can be found here. Code for mining the information above from the logs is present in this python notebook.
From our list of 1370 potentially blocked websites, there were 1058 instances where we received RST packets in response to ClientHellos with the correct SNI. In all of these cases, the RST seemed to originate from resolved_ip, and min_correct_sni_RST was less than min_ttl. This implies the presence of a middlebox deliberately terminating connections.

Furthermore, in 170 cases, we also received ICMP Time Exceeded alerts at the same probe_ttl at which we received RST packets. On further analysis, we found these RST packets to originate from 25 unique middleboxes. Checking with ipinfo.io revealed that 16 of these were registered to airtel.com, and 9 to bhartitelesonic.com. More information regarding these middleboxes can be found in this python notebook.
In 290 cases, we received TLS ServerHellos / TLS Alerts in response, indicating no network interference. This is expected, since we started with a list of potentially blocked websites.
Apart from the above, there were a few test failures due to connectivity issues, which we did not probe further.
Update (11th April 2020): This paper has been accepted at ACM Web Science 2020. A preprint can be accessed on arXiv.
Nation states around the world engage in web censorship using a variety of legal and technical methods. India is no different in this regard: the Government of India can legally order internet service providers (ISPs) operating in its jurisdiction to block access to certain websites for its users. This makes the situation different from jurisdictions like Iran and China, where internet censorship is largely centralised. Legal provisions in India, namely Section 69A and Section 79 of the Information Technology (IT) Act, allow the Central Government and the various courts in the country to issue website-blocking orders that ISPs are legally bound to comply with. Most of these orders are not publicly available.
Recent events and the opaque nature of internet censorship in India motivated us at The Center for Internet and Society to study India's censorship mechanism in detail. We spent the last year trying to answer two questions pertaining to how internet users in India experience web censorship:
Our work is, so far, the largest study of web censorship in India, both in terms of the number of censorship mechanisms we test for and the number of potentially-blocked websites (PBWs).
We compiled a list of PBWs from three sources:
Collecting data from these sources led to a total of 9673 unique URLs, which yielded 5798 unique websites. To limit ourselves to active websites, we excluded all websites which we could not resolve via Tor circuits, culminating in a corpus of 4379 PBWs.
We designed four network tests that probe the existence of censorship at the DNS, TCP, HTTP, and TLS level. For the sake of brevity, I'll skip elaborating on the tests in this post; the details can be found in our preprint.
We run these tests for each website in our corpus from connections of six different ISPs (Jio, Airtel, Vodafone, MTNL, BSNL, and ACT), which together serve more than 98% of Internet users in India. Our findings not only confirm that ISPs are using different techniques to block websites, but also demonstrate that different ISPs are not blocking the same websites.
In terms of censorship methods, our results confirm that ISPs in India are at liberty to use any technical filtering mechanism they wish: there was, in fact, no single mechanism common across ISPs.
We observe ISPs to be using a melange of techniques for blocking access, such as DNS poisoning and HTTP host header inspection. Our tests also discern the use of SNI inspection by the largest ISP in India (Jio) to block HTTPS communication, which was previously undocumented in the Indian context.
Further, we notice that ISPs using multiple censorship mechanisms do not block the same websites with each mechanism. For instance, ACT uses only DNS censorship to block 233 websites, only HTTP censorship for 1873 websites, and both to block 1615 websites. Such irregularities are illustrated below.
Our study has recorded large inconsistencies in website blocklists of different Indian ISPs. From our list of 4379 PBWs, we find that 4033 are being blocked by at least one ISP’s blocklist. In terms of absolute numbers, we notice that ACT blocks the maximum number of websites (3721). Compared to ACT, Airtel blocks roughly half the number of websites (1892).
Perhaps most surprisingly, we find that only 1115 websites out of the 4033 (just 27.64%) are blocked by all six ISPs. Simply stated, we find conclusive proof that Internet users in India can have wildly different experiences of web censorship.
Analysing inconsistencies in blocklists also makes it clear that ISPs in India are:
This has important legal ramifications: India’s Net Neutrality regulations, codified in the license agreements that ISPs enter with the Government of India, explicitly prohibit such behaviour.
Our study also points to how the choice of technical methods used by ISPs to censor websites can decrease transparency about state-ordered censorship in India. While some ISPs were serving censorship notices, other ISPs made no such effort. For instance, Airtel responded to DNS queries for websites it wishes to block with NXDOMAIN. Jio used SNI-inspection to block websites, a choice which makes it technically impossible for them to serve censorship notices. Thus, the selection of certain technical methods by ISPs exacerbates the concerns created by the opaque legal process that allows the Government to censor websites.
Web censorship is a curtailment of the right to freedom of expression guaranteed to all Indians. There is an urgent need to reevaluate the legal and technical mechanisms of web censorship in India to make sure the curtailment is transparent, and the actors accountable.
The whimsical attitude towards web censorship from both ISPs and the Government necessitates the development of a crowdsourced tool to monitor and measure such censorship from different vantage points in the country. This will shed further light on the geographical variation of censorship practices by ISPs across India, which is still unclear.

To probe this further, we have ported our network tests into an Android application, and are looking for volunteers who are willing to run it on their mobile networks. The entire process is completely anonymous; we will not be collecting any user-specific information. If you live in India, please consider running Censorwatch.
Acks - This study was done in collaboration with Gurshabad Grover and Varun Bansal at The Center for Internet and Society, graciously supported by the MacArthur Foundation.
Having recently stumbled across Facenet, I was curious whether I could make use of it to develop something similar. Although Facenet doesn't really allow us to construct an image from scratch, it does allow us to change the facial attributes of an input image. It uses a Variational Autoencoder to learn an attribute vector for input images. By making changes to the attribute vector, we can change the facial attributes of an input image as required.
TL;DR In this post, we come up with a framework where a user can gradually improve upon a face image (with respect to its attributes). We use a GA to improve upon the attribute vector which generates the image. At each iteration, we pick the attribute vectors (chromosomes) which generated the top-k images (closest to the intended target). In this exercise, we pick a target image upfront and compare the intermediate images against it using a simple MSE. In a real-life scenario, a user could simply be selecting the top-k images which fit their criteria. Of course, we could expose this vector directly to the user and ask them to make changes as they deem fit. However, it is likely easier for a user to pick the top-k images closest to their target than to change attributes directly.
The python notebook (as an HTML) is available here. It's an extremely large file, sorry about that! I no longer have access to the server with the code, so I cannot regenerate it with fewer images.
A Variational Autoencoder (VAE) is a variation (haha) of the autoencoder which tries to learn the distribution of the training input samples rather than just a dense representation. We map the input onto a distribution over the latent space, rather than onto a fixed vector. A simple way to do this is to couple the standard autoencoder reconstruction loss with the deviation of the learnt representation from an expected (pre-decided) distribution. This deviation can be computed using something like the KL divergence.
The Facenet VAE employs a much fancier approach to computing the loss, which allows for extremely granular control over the features of the generated image. The learnt latent vector allows modifying the facial attributes of an image (and hence is appropriately named the attribute vector).
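For concreteness, the standard reconstruction-plus-KL objective mentioned above can be written in a few lines of numpy. This is a generic sketch, not the Facenet implementation; it uses the closed-form KL divergence between a diagonal Gaussian and a standard normal prior:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction (MSE) + KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions."""
    recon = np.sum((x - x_recon) ** 2)
    # Closed form for the KL divergence between a diagonal Gaussian and N(0, I).
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

x = np.array([1.0, 2.0])
# A perfect reconstruction with mu = 0, log_var = 0 gives zero loss.
print(vae_loss(x, x, np.zeros(2), np.zeros(2)))  # 0.0
```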
I will skip explaining VAEs in detail in this post to save up on space. This blog by Lilian Weng does a great job at explaining autoencoders and other variants, including VAE. Would really recommend having a look if this piques your interest.
The genetic algorithm (GA) is a fascinating evolutionary algorithm inspired by natural selection. It is an iterative process, each iteration comprising a population of individuals (called a generation). Each individual is represented by a vector of genes called a chromosome. The goal is to search for the best set of chromosomes according to some criterion.

Each generation gives rise to the set of chromosomes in the next iteration of the GA. This is done via two techniques, known as crossover and mutation. A crossover takes place between two parents and generates a chromosome (child) with genes taken from both parents. This combination can be done randomly, or by selecting certain contiguous segments of the parents' chromosomes. Mutation generates a child by taking a chromosome (parent) and randomly changing one of its genes.

We generally restrict the population size at each generation; not doing so would make the population grow exponentially, making it computationally impossible to simulate. This is done by using a fitness function and selecting the individuals with the highest fitness scores. In the beginning, we randomly generate a set of chromosomes. At every generation, we expect to have a set of chromosomes with a higher (or at least equal) fitness score compared to the previous generation. This process allows us to gradually approach a target chromosome.
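The loop described above can be sketched end-to-end in a few lines. Everything here (population size, single-point crossover, Gaussian mutation of one gene) is an illustrative choice, not the exact scheme used later in the post:

```python
import random

def evolve(target, pop_size=30, generations=200, k=10, seed=0):
    """Evolve a population of chromosomes towards `target` using
    top-k selection, single-point crossover, and one-gene mutation."""
    rng = random.Random(seed)
    n = len(target)
    fitness = lambda c: -sum((a - b) ** 2 for a, b in zip(c, target))
    # Random initial population of chromosomes.
    pop = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[:k]  # top-k selection
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n)         # single-point crossover
            child = p1[:cut] + p2[cut:]
            child[rng.randrange(n)] += rng.gauss(0, 0.1)  # mutate one gene
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = evolve([0.3, -0.7, 0.5])  # converges close to the target chromosome
```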
The description of GA above is very brief, would recommend this blog by Burak Kanber for a detailed write-up.
In our GA formulation, we consider the latent vector used to generate images in the Facenet autoencoder as the chromosome of an individual in the population. We make use of the pre-trained Facenet decoder, passing the chromosome as input to generate an image. Starting from a population of randomly initialised attribute vectors (chromosomes), we would expect to gradually improve upon the generated image until it matches the target image.
At each step, we generate a set of images and ask the user to select the top-k best (this can be thought of as the fitness function). Using this set, we generate the next generation of chromosomes (attribute vectors). It is important to note that the fitness function is not applied directly on the individuals in a population (since there is no target latent vector, only a target image).
For our experiment, in the absence of a real user, we chose a target image and calculated the fitness of each chromosome in a population as follows. The chromosome is first decoded into an image, and the fitness is then the inverse of the MSE between the generated image and the target image. Let g(z) be the generator function (the Facenet decoder) which generates the image I after taking z as input. If the target image is T, we calculate the fitness as the inverse of MSE(I, T), i.e. the inverse of MSE(g(z), T).
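As a sketch, the fitness computation with a stand-in decoder (the real g(z) is the Facenet decoder; the epsilon guard is our addition, to avoid dividing by zero on a perfect match):

```python
import numpy as np

def fitness(z, target, decode):
    """Fitness of chromosome z: inverse of the MSE between g(z) and the target image."""
    image = decode(z)
    mse = np.mean((image - target) ** 2)
    return 1.0 / (mse + 1e-8)  # epsilon avoids division by zero on a perfect match

# Stand-in decoder for illustration; the real decoder maps z to a face image.
decode = lambda z: np.tile(z, (4, 1))
target = decode(np.array([0.2, 0.8]))
# A chromosome matching the target scores higher than a distant one.
print(fitness(np.array([0.2, 0.8]), target, decode) >
      fitness(np.array([0.9, 0.1]), target, decode))  # True
```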
Below are the images with the highest fitness at different generations. We see how we slowly move towards a latent vector which generates the required smile.
In another scenario, a user can choose the top few images that they think are ideal, which then form the next generation of individuals. This way, the user can gradually shape how they want the attributes in the image to look.
This is a follow up to my previous post on fast DLA simulations. Please do read that before continuing further!
In this article, we try to come up with a model which can help us estimate the stickiness of a DLA simulation. Given an image from a DLA simulation (a binary matrix, 1 indicating a particle and 0 free space), we would like to estimate the value of k used for that particular simulation. This was actually given to me as part of a recruitment process, and for simplicity, the range of k was set to [1e-3, 5e-2].

P.S. Python notebooks with the analysis done in this article can be found here. The notebook with the estimation approach can be found here.
I briefly touched upon k in the previous post: the stopping criterion of a particle undergoing Brownian motion depends on this parameter. As soon as a moving particle collides with a particle at rest, the moving particle stops with probability k. The higher the value, the higher the chance of a particle coming to rest (and hence, the stickier the particle). This parameter tends to change a DLA simulation quite a bit. Let's have a look at 3 simulations with 10,000 particles, but with varying stickiness.
Just by examining a few such simulations, it is clear that a low k value entails a more "compact" (or "crowded") image. If we can somehow come up with a nice property that can capture this crowding, we can think of modelling that as a function of the stickiness parameter. For my analysis, I used the total surface area of the image to capture the crowding. This was computationally fast to compute (order of \(n^2\)) and gave a good idea about the level of crowding of the image. We would expect the surface area to increase (less crowding) as k increases.
Given a DLA simulation, let's look at what information we can extract. Clearly, the surface area can easily be calculated from the image: we just need to check each particle and see how many of its neighbouring cells are empty. For the rest of the article, I'll use n to denote the total number of particles, and S.A. to denote the surface area.
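Concretely, the surface-area computation is a single pass over the binary matrix: for every occupied cell, count its exposed sides. A sketch assuming 4-connectivity (the neighbourhood used isn't specified above, so treat that as an assumption), with cells beyond the border treated as empty:

```python
def surface_area(grid):
    """Total number of (particle cell, empty neighbour) adjacencies in a binary grid."""
    rows, cols = len(grid), len(grid[0])
    area = 0
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] != 1:
                continue
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                # A side facing the grid boundary or an empty cell is exposed.
                if not (0 <= ni < rows and 0 <= nj < cols) or grid[ni][nj] == 0:
                    area += 1
    return area

print(surface_area([[1]]))     # 4: a lone particle exposes all four sides
print(surface_area([[1, 1]]))  # 6: two adjacent particles hide one shared face each
```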
Before we start modelling anything, it was essential to first get the data. This was pretty straightforward, since we had already written the DLA simulation code. We split the range of k into equal intervals using np.linspace(1e-3, 5e-2, 40), and ran 15 simulations of 20,000 particles each for every value. For each run, we calculated and saved the surface area after the addition of every 100 particles. This gave us 15 readings of surface area for each n in range(100, 20001, 100). For each value of n, we take the average surface area across the 15 runs. In total, we generated roughly 8000 (n, k, S.A.) tuples, which we use later in our model.
All this data is available here.
As a sanity check, we first plotted the value of surface area vs n, keeping k fixed. Obviously, this should show a strong positive correlation.
From the above plots, it's clear that the total surface area increases with n. We initially assumed that the graph would be strictly increasing, but later realised that it's wrong to claim this, since in some rare cases the addition of a new particle can lead to a decrease in the surface area of the figure (think of a 3x3 square of particles with an empty space on one of its edges).
Next, we tried to see how surface area varies with the stickiness value (keeping n fixed). If this looks like something we can capture using a model, we don't need to look further.
[Figure: surface area vs k for different values of n.]
As soon as I plotted these, I got really excited: this clearly looks somewhat like a log distribution! The only downside is that it's not bounded in a fixed range (look at the y-axis for different n). That should have been expected, since the total surface area is bound to depend on n as well. So we need to find a way to eliminate n from our model. A very naive way to do this is to divide the surface area by n.
[Figure: surface area / n vs k for different values of n.]

Honestly, I didn't expect this to work, since I couldn't come up with a proof justifying that dividing by n (and not some power of n) should normalize the range, but these plots looked really promising. The average surface area seems to be a better property to fit a curve on. Next, I checked how surface area / n varies with log(k).

[Figure: surface area / n vs log(k) for different values of n.]
These plots looked good enough to fit a curve on, so I stopped further alterations to the data. The curve looks like a straight line as k increases, though small values of k don't look as clean. So I tried second-degree polynomials as well.
There were two important things to decide before beginning to fit a curve
The data which we generated from the simulations were (n, k, S.A.) tuples. I really wanted to eliminate n completely, since training separate models for each value of n just wasn't possible. Let's have a look at the scattering of surface area values for different ranges of n, which might help us understand whether we can eliminate n or not.
[Figure: scattering of average surface area for different ranges of n; points from all simulations.]

We notice that as the value of n increases, the average surface area tends to scatter less. Let's explore further for smaller values of n.
[Figure: scattering of average surface area for smaller values of n; points from all simulations.]

Two inferences can be drawn from the analysis up to this point
So eliminating n completely was not feasible, since the model would tend to fail for smaller values. Another interesting idea here is to use some sort of windowed curve fitting: for a given n, instead of using the complete data, we only use data from simulations where the number of particles is in the range [n - c, n], for some constant c.
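The windowed fit can be sketched with np.polyfit on synthetic (n, k, S.A.) tuples. The window bound c and the polynomial degree mirror choices made later in the post; the synthetic relation S.A./n ≈ 2 + 0.5·log(k) is invented purely for illustration:

```python
import numpy as np

def fit_window(data, n_target, c=4000, degree=3):
    """Fit S.A./n as a polynomial in log(k), using only rows with n in [n_target - c, n_target)."""
    rows = [(n, k, sa) for n, k, sa in data if n_target - c <= n < n_target]
    x = np.array([np.log(k) for n, k, sa in rows])
    y = np.array([sa / n for n, k, sa in rows])
    return np.poly1d(np.polyfit(x, y, degree))

# Synthetic data following S.A./n = 2 + 0.5 * log(k), for illustration only.
data = [(n, k, n * (2 + 0.5 * np.log(k)))
        for n in range(12000, 16000, 100)
        for k in np.linspace(1e-3, 5e-2, 10)]
model = fit_window(data, n_target=16000)
print(round(model(np.log(1e-2)), 3))  # ≈ 2 + 0.5 * log(0.01) ≈ -0.303
```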
To check whether my intuition was correct, I simply partitioned the data into two parts and used np.polyfit for curve fitting, checking polynomials of up to degree 3. I trained models on data where the number of particles was in range(100, 15000, 100), and tested on data from simulations where the number of particles was in range(15000, 20000, 100).
This did seem like a good fit; the error in the predicted k was low (we're bounded in [1e-3, 5e-2]): error % = 0.00166000318459 / (5e-2 - 1e-3) = 3.38%.

However, notice that the training error is high. This was expected, because we saw high scattering for small k. Let's try to see if using only data closer to our given n can give a better fit. The next plots show predictions by a model trained on range(10000, 15000, 100), tested on range(15000, 20000, 100).
As expected, the training error went down in this case. However, it was really interesting to notice that the error on the test set also decreased: error % = 0.00106134661616 / (5e-2 - 1e-3) = 2.16%. Our estimate improved when we discarded data for n much smaller than the target n. This strongly suggests that we should be using a local window regression model.
The idea is to use only data from a fixed window of n below the given n. Since training time is negligible, we can train the model in real time depending on the n of the given DLA image. We already have simulation data for n up to 20,000 particles. The next obvious question is: what happens if n is so large that we don't have simulation data? From the above analysis, we can safely say that the curve more or less remains the same for any number of particles above 10,000.
For each value of n in range(1000, 20000, 1000), we trained regression models on simulation data where the number of particles was in the range [n - 4000, n - 100].
For the 3-degree polynomial, the worst error (0.0019) is better than the average training error of the previous fit. Furthermore, the training error also went down. Hence a local fit is definitely the better one.
So our strategy to predict stickiness is the following: for a given n, we take a window of data just below n and fit a 3rd-degree polynomial, which we then use to predict k for the given image. Code for this approach can be found here.
Diffusion Limited Aggregation (DLA) is a growing object consisting of particles that diffuse and aggregate to the object, much like how flakes of snow are created. Particles are introduced one after the other and made to undergo a random walk. Eventually, a newly introduced particle reaches an existing particle and remains there. This clustering of particles leads to the formation of aggregates. This theory, proposed by T.A. Witten Jr. and L.M. Sander in 1981, is applicable to aggregation in any system where diffusion is the primary means of transport. (André Offringa's report).
Another interesting description (and my favourite) involves a city square surrounded by taverns. Drunks leave the taverns and stagger randomly around the square until they finally trip over one of their insensate companions, at which time, lulled by the sounds of peaceful snoring, they lie down and fall asleep. The tendril-like structure is an aerial view of the sleeping crowd in the morning (Paul Bourke's blog).
P.S. Code for the simulations is available in this repository.
This article is the first of two parts on DLA. It's mostly about my implementation of a 2D DLA simulation with a point attractor, and an optimization I use to make these simulations faster. I plan to follow this up with another one on the stickiness factor estimation (more on stickiness below).
Generating these images is pretty straightforward.
The pixel undergoing Brownian motion can move to any of its eight neighbours with equal probability. This version of DLA has a point attractor (i.e. new particles get attracted towards a central point). The attractor itself can have a different locus (it could be a line or a circle), but we focus on point attractors for the moment.
Another attribute which I incorporate in this model is called 'stickiness'. When the new particle undergoing Brownian motion comes in contact with an existing particle at rest, instead of always stopping, it stops with some pre-decided probability (the stickiness factor) k.
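A minimal Python sketch of this walk is below. The names are my own, and the toroidal wrap-around is a simplification I added so the sketch always terminates; the notebook mentioned below handles boundaries differently.

```python
import random

# The eight neighbours a diffusing particle can step to, each with
# equal probability.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def walk_until_stuck(occupied, x, y, k, m=50, rng=random):
    """Random-walk a particle over an m x m grid (wrapping at the
    edges) until it sticks next to an occupied cell with probability
    k per contact. Adds the final position to `occupied`."""
    while True:
        dx, dy = rng.choice(NEIGHBOURS)
        x, y = (x + dx) % m, (y + dy) % m
        near = any(((x + ax) % m, (y + ay) % m) in occupied
                   for ax, ay in NEIGHBOURS)
        if near and rng.random() < k:
            occupied.add((x, y))
            return x, y
```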
A Python notebook of this simulation can be found here. It has a DLA class which can be initialised with the size of the square grid and the stickiness factor k. The addPoint method can be used to add any number of particles to the canvas, and the printState method can be used to print the grid out (it uses matplotlib.pyplot.imshow).
As you would have guessed, adding a new particle is a very time-consuming process, especially as the number of particles increases. To accommodate a large number of particles, we need a larger grid. As a result, the area which a new particle explores before coming to rest increases. Of course, an easy way to quicken the whole process is to not use Python. Let's discuss a more mathematical approach.
At each iteration of DLA, we introduce a particle \(C_1\) at a randomly selected position \((x, y)\) on the border of our m x m canvas. This particle follows Brownian motion in the canvas until it meets the termination criterion. As m increases, the area in which \(C_1\) can move also increases, making simulations slow. In order to quicken the process, we would like to decrease the search space. Intuitively, we would like to spawn as close as possible to the current group of particles in the canvas.
We can do this by finding a group of points \(L_2\) such that the probability that a point spawned on the square border reaches any of the points in \(L_2\) is the same. Formally, this would be a locus such that any two points \(l_{21}\) and \(l_{22}\) in it satisfy \(\sum_{c \in L_1}W_{reach}(c, l_{21}) = \sum_{c \in L_1}W_{reach}(c, l_{22})\), where
\[W_{reach}(a, b) = \frac{n(a, b)}{\sum_{c \in L_1}n(c, b)}\]
\(L_1\) is the set of points which make up the square boundary. \(W_{reach}(a, b)\) is the ratio of the number of ways of reaching \(b\) from \(a\) to the total number of ways of reaching \(b\) from all points in \(L_1\). We take a ratio since there are infinitely many ways of reaching \(b\) from any point in \(L_1\).
If the area enclosed by the locus \(L_2\) is less than the area enclosed by the square boundary, then we can hope to see a faster DLA. Intuitively it seems \(L_2\) can be a circle. We shall prove this below.
\(W_{total}\) of a point \(C_2\) on the circle is calculated by summing \(W_{reach}(c, C_2)\) for all points \(c\) on the square. \[ W_{total}(C_2) = \sum_{c \in side1}W_{reach}(c, C_2) \quad + \sum_{c \in side2}W_{reach}(c, C_2) \] \[ \quad \quad \quad \quad \; \; + \sum_{c \in side3}W_{reach}(c, C_2) \quad + \sum_{c \in side4}W_{reach}(c, C_2) \]
Now we shall show that \(W_{total}\) is equal for all points on the circle. There are two assumptions we make to simplify our analysis:
Using our two assumptions,
Similarly,
Therefore,
\[ W_{total}(C_2) = \sum_{c \in side1}W_{reach}(c, C_2) \quad + \sum_{c \in side2}W_{reach}(c, C_2) \] \[ \quad \quad \quad \quad \; \; + \sum_{c \in side3}W_{reach}(c, C_2) \quad + \sum_{c \in side4}W_{reach}(c, C_2) \] \[ = 4\frac{S^2}{2} - kSR\cos\theta + kSR\sin\theta + kSR\cos\theta - kSR\sin\theta \] \[ \quad \quad \quad \quad + 2R^2\cos^2\theta + 2R^2\sin^2\theta + 2R^2\cos^2\theta + 2R^2\sin^2\theta \] \[ = 2S^2 + 4R^2 \]
Clearly \(W_{total}(C_2)\) is a constant, for every \(C_2\) on the circle with radius \(R\). Hence the probability of reaching any point on a circle with radius \(R\) is the same. As a result, we can choose to spawn at any point on the circle with a radius \(R\) instead of the square boundary.
The only constraint is that all the points outside the circle should be free. And obviously, we would like \(R\) to be as small as possible. Hence while spawning a new point, we choose the smallest circle that encloses the current image (the minimum bounding circle).
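A rough Python sketch of this spawn step is below. The centroid-plus-farthest-point circle here is a cheap approximation I'm using in place of an exact minimum-bounding-circle computation, and the margin keeps the spawn strictly outside the cluster.

```python
import math
import random

def spawn_on_enclosing_circle(occupied, margin=2.0, rng=random):
    """Pick a uniformly random point on a circle enclosing all
    occupied cells. Centre = centroid of the cluster, radius =
    distance to the farthest occupied cell plus a margin (a cheap
    stand-in for the true minimum bounding circle)."""
    cx = sum(x for x, _ in occupied) / len(occupied)
    cy = sum(y for _, y in occupied) / len(occupied)
    r = max(math.hypot(x - cx, y - cy) for x, y in occupied) + margin
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return cx + r * math.cos(theta), cy + r * math.sin(theta)
```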
A Python notebook of DLA simulations using this optimization can be found here. Unfortunately I haven't been able to benchmark the running times; however, I was able to go up to 10,000 particles (compared to 500 in the previous version) in a reasonable amount of time.
]]>This monsoon semester I picked up a really interesting course, Digital Audio (DES514 @ IIITD). Although it had been ages since I'd done anything related to music (seriously), the realization that it's my last semester made me have the whole "screw grades let's pick up something fun" talk in my head.
P.S. This also led me to pick up a philosophy course.
P.P.S. The semester didn't have a happy ending.
Anyway, coming back to the topic. Probably the most intense part of the course was the project, which Anant and I did together. We wanted to do something related to audio modulation (obviously), but not just stick to computational manipulations.
Before I delve further,
Although it is really tough to capture this project on camera (since we are experimenting with 8 channels, whereas a standard camera typically records only 2 channels), you may be able to observe some changes in sound amplitude (mostly unwarranted due to the reverb in the small room hahaha).
We decided that we'd like to build an interface that would meet three broad goals
The first goal was easy to meet. We had been using SuperCollider (an open source platform for audio synthesis and algorithmic composition) throughout the semester, which can be used to modify an input audio signal in a few lines of code. An example would be passing the input signal through a low pass filter (LPF). And the best part, we can change parameters (like the max pass frequency of the LPF) and it gets reflected in the output signal in real-time.
The real challenge was merging this with our last two goals. We definitely did not want a wearable device, and at the same time wanted to allow the user to use his/her entire body to modify the input signal. As you would have guessed, we decided to move forward with a Kinect sensor. Although that meant coding in Visual Studio (and using Windows), it was perfect for our use case (especially since it came with its own gesture detection APIs).
For our project (since we just had a week to finish), we decided to move forward with just spatialization of the input sound signal in a circle using the user's hand movements. Once this was possible, porting it to include other effects (such as an LPF/HPF) would be just grunt work.
Our interface consists of a server running the Kinect SDK, which we use to detect the direction in which the user's hand is pointing. This information is sent to another server running SuperCollider. The angle is used to pan a given input signal across the speakers in such a way that it seems as if the source of the sound is in the direction the user is pointing.
We had access to a brand new 8.1 surround system (big shout out to Prof. Timothy!), so panning the input signal across speakers was something which we could actually test. A rough schema of what we had in mind
It was a straightforward design which has 3 different components, each of them explained in detail below. Getting these three components together was the tricky part.
The Kinect sensor uses depth sensing and other computer vision approaches to estimate the skeleton of the person standing in front of it. The Kinect developer SDK allows us to use this information captured by the hardware. Using this SDK, we can capture the 3-D coordinates of the certain important points detected on the body (joints, fingers etc).
Since we can detect the coordinates for both the body centre and the fingers, we used these to calculate the x and y coordinate of the hand in the plane perpendicular to the body of the user. This allows us to understand two things
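These two quantities can be computed from the detected coordinates with basic trigonometry. A Python sketch is below; the actual project did this in C# on the Kinect server, so the function names here are my own.

```python
import math

def pointing_angle(center, hand):
    """Direction (in radians) of the hand relative to the body
    centre, in the plane facing the user: 0 is right, pi/2 is up."""
    return math.atan2(hand[1] - center[1], hand[0] - center[0])

def pointing_distance(center, hand):
    """How far the hand is stretched out from the body centre; this
    distance is what later drives the 'spread' parameter."""
    return math.hypot(hand[0] - center[0], hand[1] - center[1])
```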
Image is borrowed
We send this information to the SuperCollider server running on a different machine using OSC, a protocol that runs over UDP.
For computing the output signal streams for the eight different channels (each for a given speaker), we used the VBAP plugin for Supercollider3. This takes a signal and an angle from the median plane and redistributes the signal into a number of channels assuming the angle to be the source of the sound. This plugin is based on Vector base amplitude panning, more information on which can be found here.
For ease in testing, we create a GUI which displays the entire ring of speakers. There is a movable pointer, which is used to indicate the source of the sound. The pointer can be moved around in the circle as shown below.
This pointer is used to indicate the source of the sound. If we point towards the speaker C, it would sound as if the source of sound is the speaker C.
We also introduce a parameter ‘spread’, which allows us to widen the area from which the sound comes. This allows us to change the distance of the sound source (or how far the user perceives it to be).
We were able to modify the code for a 5.1 surround panner (from the SuperCollider examples) and adapt it to the 8.1 surround setup which we had.
The pointer can be moved around using the mouse, the spread can be changed using a slider. However, our end goal was to link this with the user’s movements. We do so by using the information captured by the Kinect sensor.
We make use of the Open Sound Control (OSC) protocol to transfer the detected coordinates from the server running Kinect to the server running SuperCollider. OSC runs over UDP, which can be used between two machines on the same local network. On the C# side, we had to import the SharpOSC library into Visual Studio to create a SharpOSC object and send information over the network that the 2 laptops were connected to. In SuperCollider, we simply had to open a port and start listening to whatever was being transmitted by the Kinect server to the SuperCollider server's IP (on the same port).
After receiving the user coordinates (at the SuperCollider server) from the Kinect stream, we simply updated the coordinates of the pointer and the spread values based on the distance between the user's centre and his/her hands. This update was done every 10 microseconds (using the AppClock API), allowing us to make updates in almost real time.
]]>I forked gabrielecirulli's 2048 repository and wrote the AI on top of it. This saved a lot of time, as I just had to write the AI, and not implement the whole game from scratch. The source code for the AI is in this repository.
2048 is played on a 4 X 4 grid, with four possible moves: up, left, down and right. The objective of the game is to slide numbered tiles on the grid to combine them and create a tile with the number 2048.
There are four possible moves from which a player can choose at each turn. After each move, a new tile is inserted into the grid at a random empty position. The value of the tile can be either 2 or 4.
From the source code of the game, it is clear that the value is 2 with probability 0.9:
var value = Math.random() < 0.9 ? 2 : 4;
I used a depth bounded expectimax search to write the AI. At every turn, the AI explores all the four possible directions and then decides the best one. The recurrence relation that it follows is
The max node is the one in which the player chooses a move (out of the four directions), and the chance node is the one in which the board inserts a random tile. The max node decides the correct move by maximising the score of its children, which are the chance nodes.
Due to space and time restrictions, it is obviously not possible to explore the complete search space. The search is bounded by a depth, at which the AI evaluates the score of the state (the leaf node) using some heuristic.
The pseudocode is given below.
bestMove(grid, depth, agent):
    if depth is 0:
        return score(grid)
    elif agent is BOARD:
        score = 0
        for tile in grid.emptyTiles():
            newGrid = grid.clone()
            newGrid.insert(tile, 2)
            score += 0.9 * bestMove(newGrid, depth - 1, PLAYER)
            newGrid = grid.clone()
            newGrid.insert(tile, 4)
            score += 0.1 * bestMove(newGrid, depth - 1, PLAYER)
        return score / len(grid.emptyTiles())
    elif agent is PLAYER:
        score = 0
        for dir in [left, up, right, down]:
            newGrid = grid.clone()
            newGrid.move(dir)
            score = max(score, bestMove(newGrid, depth - 1, BOARD))
        return score
I used a combination of two heuristics to evaluate how good a state was.
In most of the games that I came close to winning, the bigger tiles were around the corner. Hence the first idea that I used was to push the higher value tiles to one corner of the grid. For this, I assigned different weights to different cells.
The score of a given 4 X 4 grid was calculated as follows (W being the 4 X 4 weight matrix):
score = 0
for i in range(0, 4):
for j in range(0, 4):
score += W[i][j] * grid[i][j].value
This had another advantage, the new randomly generated tiles got generated in the opposite corner, and hence were closer to the smaller value ones. This increased the chances of the newly generated tiles getting merged.
It is obviously better for us if more tiles get merged. This happens if in a state two same valued tiles are present next to each other.
I calculate another value, a penalty, and subtract it from the score computed by the first heuristic. The penalty is calculated as follows:
penalty = 0
for each tile:
for each neighbour of tile:
penalty += absolute(tile.value - neighbour.value)
This penalty becomes large when high-value tiles are scattered across the grid, indicating that the state is bad.
The heuristic function I'm using returns score - penalty, each term calculated as described above.
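Put together, the evaluation function looks roughly like this in Python. The weight matrix W below is a made-up example of corner-biased weights, not the values used in the actual AI; also, my penalty counts each adjacent pair once, where the pseudocode above counts both directions, which only scales the penalty by two.

```python
def evaluate(grid, W):
    # Heuristic 1: weighted sum that rewards keeping large tiles
    # near the corner with the largest weights.
    score = sum(W[i][j] * grid[i][j] for i in range(4) for j in range(4))
    # Heuristic 2: penalise adjacent tiles with very different values.
    penalty = 0
    for i in range(4):
        for j in range(4):
            if j + 1 < 4:
                penalty += abs(grid[i][j] - grid[i][j + 1])
            if i + 1 < 4:
                penalty += abs(grid[i][j] - grid[i + 1][j])
    return score - penalty

# Example corner-biased weights (illustrative only)
W = [[16, 8, 4, 2],
     [8, 4, 2, 1],
     [4, 2, 1, 0],
     [2, 1, 0, 0]]
```

With these weights, a grid with an 8-tile in the weighted corner evaluates higher than one with the 8-tile in the middle, which is exactly the bias the heuristic is meant to encode.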
As far as I know, there is no method to prune an expectimax search. The only way out is to avoid exploring branches that are highly improbable, however this has no use for us as each branch has an equal probability (the new tile has equal probability of popping up on any of the empty tiles).
Initially I was exploring nodes even if the move played by the PLAYER had no effect on the grid (i.e. the grid was stuck and could not move in a particular direction, which is a common occurrence during gameplay). Eliminating such branches did enhance the performance a bit.
The solver performs quite well. With a search depth of 6, it formed the 2048 tile 9 times out of 10. With a search depth of 8, it formed the 2048 tile every time I tested it, and went on to 4096 every time as well.
For a better winning rate, and taking the time taken to search per move into consideration, I keep an 8-ply lookahead if the number of empty tiles in the grid is less than 4, and a 6-ply lookahead otherwise. This combination leads to a win every time, and 8 out of 10 times forms the 4096 tile.
The AI achieved a best score of 177352, reaching the 8192 tile. The average score was around 40000, and the AI always won the game.
I forgot to take the screenshot of the best run, but a closer one's here.
Each move takes anywhere from 10ms to 200ms, depending upon the search space and how complex the grid is at that moment. The version here has a timeout function, calling the AI to find the next move every 50 ms.
It can perform better if the search depth is more, however that would make it quite slow. The fact that the code is in javascript doesn't make things any better :D. Have a look at the AI beat the game!
Writing the artificial intelligence was quite a lot of fun, and I learnt a lot during the process. Do let me know what you think about it in the comments section!
Till next time,
Kush
ulong_extras module (the Brillhart-Lehmer-Selfridge test to be precise). It required computing the cube root of the integer being tested. While implementing it, I realized that the ulong_extras module was missing a cube root function. We couldn't use the cbrt() function available in math.h, as it was introduced in C99 and FLINT is guaranteed to work with C90 (you'll notice that I've used unsigned long long int instead of uint64_t below for the same reason). We decided to use pow(n, 1./3.) temporarily, however it was not reliable due to the way 1./3. is stored in memory. So I decided to write one for the ulong_extras module.
It sounded like a trivial task initially, given that we had to calculate only the floor of the cube root and not a floating-point answer. However, there were so many different algorithms to try and test against each other for efficiency that it ended up taking quite a lot of time. One of them, which I really liked, involved a lot of bit-level hacks, and is probably one of the most interesting pieces of code I have worked on.
The implementation can be found here.
The implementation exploits the way an IEEE Double-precision floating-point is stored in memory. Do read up a bit on it before proceeding further.
This representation makes our task of calculating the cube root very simple. In this form, a number n is stored as
$$
n = {(-1)}^{sign} * 1.{{fraction}} * 2^{(exponent - 1023)}
$$
The core idea is to remove a multiple of three from the exponent, which can be easily extracted using bit masks. This reduces our problem to computing a cube root on $[0.5, 1)$.
Since we are doing our computations on unsigned integers, we can stop worrying about the sign bit in the formula. The above expression can be rewritten as,
$$
n = {{residue}} * 2^{(exponent - 1022)}
$$
where residue is $(1.{fraction}) / 2$. Calculating the cube root,
$$ {\sqrt[3] n} = {{residue}}^{1/3}* 2^{{(exponent - 1022)}/3} $$
We can evaluate this value in a very efficient manner by using some lookup tables. There are two parts which need to be calculated, the exponent part and the cube root of the residue part.
Calculating the exponent part is quite easy. We calculate
$$x = 1 << {(exponent - 1022)}/{3} $$
and
$$y = \left(2^{(exponent - 1022)\%{3}}\right)^{1/3} $$
y can take three possible values (since we're working mod 3), and hence we can store them in a lookup table. Let's call this factor_table; it consists of the values 1^(1/3), 2^(1/3) and 4^(1/3). The cube root of the exponent part is then $x * y$.
Calculating the cube root of the residue is slightly trickier, since we obviously want to avoid making a call to pow(). We know that $ residue \in [0.5, 1)$. I split this interval into sixteen equal ones, of width 0.03125 each, and then used approximating polynomials to compute a good approximation of $ \sqrt[3] {residue}$.
I used mpmath's chebyfit function to compute approximating polynomials of degree 2 for each of the 16 intervals, and stored their coefficients in another lookup table. Let's call this table coefficient_table. mpmath's chebyfit uses Chebyshev approximation. I used the following call to compute the coefficients for each interval, where i and j are the endpoints of the interval:
mpmath.chebyfit(lambda x: mpmath.root(x,3), [i, j], 3, error = False)
Using this, I computed a 16 X 3 two-dimensional float array (16 intervals in $[0.5, 1)$ and 3 coefficients per interval). The tricky part was how to use these values. I could have used an if-else ladder to decide which set of coefficients to use, depending upon the value of the $residue$. However, introducing so many branches seemed like a very bad idea. Imagine an if-else ladder with 16 branches!
It took me a while to figure out a better solution: simply use a few of the initial bits of the residue. Since the residue is greater than or equal to 0.5, the first bit is definitely set. The next 4 bits of the residue are sufficient to decide which set of coefficients to use (i.e. which interval the residue lies in).
For example, if the residue was 0.58, it would be represented as 100101000... in the double. Using a bit mask, we can extract bits 2..5, which gives us a value in $[0, 16)$. In this case, it will be 0010, i.e. 2, so we know that we have to use the values present in $coefficient\_table[2]$. Let's call this value the table_index.
So $\sqrt[3] {residue}$ can now be calculated by evaluating the degree-2 approximating polynomial $c_2 \cdot residue^2 + c_1 \cdot residue + c_0$, with the coefficients taken from $coefficient\_table[table\_index]$. Let this value be z. We can save one floating-point multiplication here (quite expensive) by using Estrin's scheme for evaluating polynomials.
Now that we have solved both our subproblems, our cube root is simply the integral part of $x * y * z$, as calculated above.
Now that we have a way to calculate the cube root, the question is, how do we actually compute the values of exponent and residue? C does not allow bit level operations on a floating point data type.
>kush@kush:~/Desktop$ gcc -o a test.c -lflint -lm
>test.c: In function ‘main’:
>test.c:162:13: error: invalid operands to binary | (have ‘double’ and ‘long int’)
testDouble | 0xF;
^
Here I was trying to use the or operation on a double, and as expected, the code gave errors on compilation. This causes problems, as we cannot access the bits of the double directly. The solution was to use unions.
Unions let us store different data types in the same memory location. The same 64 bits stored somewhere in memory can be accessed both as a double and as an unsigned long long int! We declare a union comprising a double and an unsigned long long int (an unsigned word), and perform the bit-level operations on the unsigned long long int to access the bits of the double.
typedef union {
unsigned long long int uword_val;
double double_val;
} uni;
So basically, if we're trying to compute the cube root of an unsigned long long int n, we take the following steps:
uni alias;
alias.double_val = (double) n;
unsigned long long int wanted_bits = alias.uword_val & bit_mask;
Now we can access all the bits of the double using alias.uword_val.
You can have a look at the complete implementation of this method below. The mp_limb_t data type used is the same as an unsigned word, i.e. a 64-bit unsigned integer on most 64-bit processors.
The implementation is accurate, and manages to beat pow(n, 1./3.) in all cases. I timed the two many times for various sizes of n and plotted them. The timings are for 1000 random numbers of the given size. The new method is 1.1 to 1.2 times faster than calling pow(n, 1./3.).
The actual cube root function present in FLINT currently is slightly more complex. I used Newton iterations and an algorithm designed by W. Kahan for smaller integers, as it was turning out to be more efficient. In case of larger integers (> 46 bits), the method discussed above is more efficient, and hence used. Check it out here,
Let me know what you think about it in the comments section.
Till next time,
Kush
The last time I came across such a thing, I was in the 10th grade and at that point, I couldn't do anything about it. However this time, powered with some knowledge about the awesome JavaScript (LOL), I decided to unblur the damn page.
After a quick inspection of the HTML code, I realized that Scribd isn't even trying -_- . I saw that Scribd was actually sending the complete document to the client, and not just the two unblurred pages. This is really dumb I believe, because once you send something to the browser, any attempt to keep it secure/inaccessible is futile.
All that had to be done was to change the style of each div containing a part of the document (yes, Scribd inserts the document as unselectable text in the HTML!!) and increase the opacity of the images, and voila, the page is unblurred. The script is in this repository.
// Remove the boxes that say you have to pay to view
$('.autogen_class_views_read2_page_blur_promo').remove();
// Removing irritating adverts
$('.between_page_ads').remove();
// Unblurring the text
$('.text_layer').css('text-shadow', '0px 0px 0px');
// Darkening the text and restoring image opacity
$('.text_layer').css('color', '#000');
$('.absimg').css('opacity', '1.0');
The script requires jQuery to work.
I used this script to make a chrome extension Unblur Scribd. Install the extension, and as soon as you come across a blurred Scribd page, just click the extension to view the page.
One thing which I really liked about Scribd was that they load their pages dynamically. As soon as you reach a page x, an AJAX call to load page x + y is made. This saves so much overhead in the beginning when you open the URL, as you're actually loading just two pages of the document! However this makes unblurring the complete document impossible in one go, without loading the complete document first. The extension needs to be clicked every time the user sees a blurred page (that's about once every 5 to 6 pages).
]]>The pencils down deadline for GSoC was 21st Aug, the previous Friday. It has been quite a journey, working with FLINT for over three months. I've learnt quite a lot, not just math but how to write good quality code and testing.
Although I haven't been able to do everything I had mentioned in my proposal, I worked on a lot of stuff which I didn't mention. I worked extensively on the ECM, more than I had planned and ended up writing a very good stage I and II using mpn's. Apart from that I worked on the Pollard Rho Brent algorithm, and the MPQS. The MPQS has some problems and is not functional though.
This week I tried to debug the MPQS. The problem with MPQS is that the sieve value is not exceeding a limit, and hence does not get evaluated. No relations are being found. I believe this is because the roots of the polynomials are being computed incorrectly. I will be looking into this eventually, once I get a break in the semester.
Apart from the MPQS, I also worked on the fmpz_factor_smooth() function. It aims to find a factor of a number (around n^(1/3)), given that the number is not a semiprime. It will be a combination of trial division, Pollard-Rho-Brent, p + 1 and ECM. The profiling of Pollard-Rho-Brent is yet to be done; once I do that, I can finish this off.
What's next? I have a lot of stuff planned. First is obviously to finish off the MPQS and add a double large prime version. Some other things I want to work on are improving the ECM stage II and the fmpz_factor_smooth() function. I will be working on them, however at a slower pace due to the course load.
I started working on a simple MPQS this week. Most of the code required (Knuth multiplier, factor base, linear algebra, square root functions) was already available in the SIQS module, written/ported by Nitin. I just had to make small changes to the code to make it compatible with the sieve I am writing. The part which I had to write again was the polynomial code. It wasn't that tough, as self-initialization was not required.
The polynomial I am using in the sieve is Q(x) = (Ax + B)^2. Hence, we have to solve the equation Q(x) - kN = 0, i.e. A^2x^2 + 2ABx + B^2 - kN = 0.
While evaluating, we pull the factor A out and try to factor Ax^2 + 2Bx + (B^2 - kN)/A. So for evaluating the sieve at different positions, we have to compute A, B, and C (where C = (B^2 - kN)/A).
I am using this paper as a reference. The ideal value of A is the square of a prime d near the value (N/(2 * M^2))^1/4. B is chosen as the odd square root of kN (mod A). This is done by first calculating the square root mod d (which is a prime), and then using Hensel lifting to calculate the square root mod A (which is d^2). C can be computed easily by computing (B^2 - kN)/A.
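The computation of B can be sketched in Python as below. For simplicity this assumes d ≡ 3 (mod 4), so the square root mod d is a single modular exponentiation; the real code has to handle general primes (e.g. with Tonelli-Shanks), and the names here are my own.

```python
def sqrt_mod_d_squared(kN, d):
    """Square root of kN modulo A = d^2, for an odd prime d with
    d % 4 == 3 and kN a quadratic residue mod d (simplifying
    assumptions for this sketch)."""
    b = pow(kN, (d + 1) // 4, d)           # square root mod d
    A = d * d
    # Hensel lift: correct b so that b^2 = kN (mod d^2), using
    # b' = b - (b^2 - kN) * (2b)^(-1) mod d^2.
    b = (b - (b * b - kN) * pow(2 * b, -1, A)) % A
    return b
```

For instance, sqrt_mod_d_squared(2, 7) returns a value b with b^2 ≡ 2 (mod 49), which is exactly the kind of root needed before choosing the odd one of the pair ±b.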
I have coded a couple of functions which implement selecting multiple polynomials and computing the associated data, the most important being the roots of Q(x) = N (mod p) for each prime p in the factor base.
The code can be found in this file.
I also ported the code from the existing SIQS module to the MPQS module. The current implementation has some problems; it does not factor. The control never reaches inside the mpqs_evaluate_candidate() function, hence no relations are found. I am looking into this problem currently.
Since I've missed out on some blog posts the last couple of weeks, I'll be summing up the stuff I've been up to in this post. The past month hasn't really been as productive as I would have wished, the reason being some wrong assumptions I made. I'll be giving a week-wise summary below.
This week I started reading the existing SIQS code, which I pulled from Nitin's SIQS branch. It took me some time to understand the existing code, and where exactly I would start working to add the double large prime version.
While reading the code, I also found a small bug: in the qsieve_evaluate_candidate() function, all powers of 2 were being removed before the Knuth-Schroeppel multiplier was removed from the relation. This would have slowed down factorisations in which the Knuth-Schroeppel multiplier was even.
An issue was reported this week concerning one of the functions I had written earlier (#150). I spent some time reading the n_cbrt() code, and noticed that I had made a terrible mistake earlier: I had assigned wrong constants, which would definitely have caused overflows on 32-bit systems. Although this did not solve the issue reported, it was definitely something which would have caused problems. I never noticed it earlier as I never tested the code on a 32-bit machine.
This week I started coding the double large prime version, using Factoring with two large primes by A. K. Lenstra and M. S. Manasse as a reference. I started by reducing the cut-off limit in qsieve_evaluate_sieve() to produce more partials and pp relations. Once I got these relations, I used n_factor() and n_isprime() in qsieve_evaluate_candidate() to check whether the remainder after pulling out the factor base primes is a prime / semiprime. If so, I wrote them to a file of the format [type, number of factor base primes that divide it, factor base prime's index, factor base prime's exponent, large prime I, large prime II]. In the case of a partial, large prime II was simply 1.
Here is when I started to face problems. I simply couldn't manage to produce a lot of partials and pp's. I tried to explore the code and check whether I was making any mistake, but couldn't figure out what to do to fix it. At this point I should have switched over to the tested FLINT2 code and extended it to a double large prime version; however, I was under the impression that an SIQS was required before I could add a double large prime version. This was a wrong assumption.
I couldn't really do a lot this week, I had refresher modules and exams throughout the week at my institute (they take place before the beginning of the semester).
I did some minor work towards the FLINT 2.5 release this week. There was an issue (#32) reported some time back: it was required to check that in each call of a function *_si(), the si parameter is actually an slong and not a ulong or mp_limb_t. This had to be done for every call across the FLINT code base. I volunteered and checked some.
At the beginning of this week, I realised that it is too late to implement the double large prime MPQS completely. I decided to go back to the ECM code I had written and complete some of the work I had postponed till after the GSoC period. The main reason was that I knew what was going around in the ECM really well, and thought that I could do considerable work working on it.
The ECM code I had written was doing well, within a factor of 2 of GMP-ECM. There were a couple of algorithmic enhancements I had in mind which could make it even faster and comparable to GMP-ECM: multi-point evaluation in stage II, the Montgomery PRAC algorithm in stage I, and maybe Edwards coordinates.
I started off by checking the code for any memory leaks and unnecessary memory allocations. I found unnecessary allocations in stage II and cut them out using better indexing, which reduced the memory used by stage II by almost half. There were also a couple of memory leaks, which I resolved.
I also fixed the hack I was using in fmpz_factor_ecm_select_curve(). I couldn't figure out how exactly to call mpn_gcdext() earlier on. I realised this week that I was passing an mp_limb_t, whereas I should have been passing an mp_size_t (the latter is signed!). So when the inverse was negative, I was facing problems (the function sets the "size" argument as negative if the mp_ptr is negative). Earlier on, I avoided using mpn_gcdext(): when it was required, I was converting back to fmpz_t's, calling fmpz_gcdinv() and then converting back to mp_ptr's! Now I am making a direct call to mpn_gcdext().
I also started to incorporate multi point evaluation in stage II this week. I had tried this earlier on, but contrary to my expectation it was almost two times slower. I coded stage II again (thankfully I had the ECM fmpz code from some time back; coding directly using mpn's would have been very difficult). Yet again, it didn't turn out to be faster than the existing code. Maybe I'm doing something wrong theoretically? But in that case stage II should never be able to find a factor, whereas it definitely does.
I made some more minor changes to the ECM code and polished it; it looks ready to merge to me.
After a brief conversation with my mentor, I have decided to go back to MPQS. In the coming week, I will be working on a simple sieve, which uses multiple polynomials, without any self initialisation. A lot of code required for this already exists in the FLINT trunk.
This week I finished writing ECM using the mpn functions and reduction using a pre-computed inverse for integers larger than one word. The code is present in these files:
Apart from these, I also wrote a couple of helper functions for mpn arithmetic:
These are not general-purpose functions and cannot be used with arbitrary elliptic points. They assume that the n passed in has its highest bit set; I did this to save a couple of additions. Since we are anyway working with normalised (shifted) mpn's, this doesn't cause a problem.
Documentation and tests for fmpz_factor_ecm() are present here:
The current ECM is doing a good job: it takes around 0.1 to 0.2 seconds on average to find a 50 bit (~15 digit) factor. I never got around to finding out why multi point polynomial evaluation was slowing down stage II, whereas theoretically it should have been faster; I'll probably look at it later on. Also, I haven't profiled it to find optimum parameters (B1 and B2) for different sized factors. These two things ought to speed it up further.
Another problem, or rather a code inconsistency, exists in the fmpz_factor_ecm_select_curve() function. I was having some trouble using mpn_gcdext() at one point, so I am converting back to fmpz_t, using fmpz_gcdinv(), and then converting back to mp_ptr. This works, although it's quite hacky. I'll figure out the right way to use mpn_gcdext() and alter it.
The only serious problem is that the current ECM implementation cannot find factors below 7 bits: the fmpz_factor_ecm_select_curve() function fails to find a suitable curve for such small integers.
Another thing I worked on is n_factor() in the ulong_extras module. Brent-Pollard Rho and ECM seem to do a better job than SQUFOF, which is used currently. So instead of just calling factor_one_line() and SQUFOF, I changed it to call Brent-Pollard Rho once, and then ECM for some tries. This gives almost a 4 to 5 times speedup when factoring semiprimes, and 1.5 to 2 times for random cases. SQUFOF is still there in the code to make sure that n_factor() doesn't fail (in case Brent-Pollard Rho and ECM both do), but I never saw it being called in all the random cases I tried.
Including Brent-Pollard Rho and ECM did cause a problem though. Both of them require random states to be passed, and changing the function definition of n_factor() will break existing FLINT code. My mentor suggested that we cache a flint_rand_t in the factor function itself. I am working on this right now.
This week, I plan to finish the small part remaining in n_factor() (caching of the flint_rand_t) and then go back to the double large prime MPQS. I am using the paper "MPQS with Three Large Primes" as a reference; although it is named after three large primes, it also gives quite a lot of information on the double large prime version.
I spent the last week rewriting the ECM using mpn's, pre-computed inverses, and ulongs. Pollard Rho showed a significant speedup when I rewrote it using mpn's and pre-computed inverses, and I was hoping for the same here. Initially I was planning to rewrite only stage I using the faster options available, but I ended up working on both stage I and stage II.
The flow I had in mind was similar to many existing functions in FLINT: if the integer to be factorised fits in one word, use the ulong module; else use fmpz's/mpn's. So I decided to rewrite ECM using the ulong module first.
Next I attempted to write ECM using mpn's. The code which I have written so far (using mpn's) works fine for single word integers, however causes problems for larger integers. It is still work in progress.
The current version of ECM still has a lot of room for improvement. This includes tuning it to find optimal values for B1 and B2. Also, I still haven't been able to pinpoint my error in the multi point evaluation stage II I wrote a couple of weeks back; it works, but is slightly slower than the initial stage II.
Hi!
I missed out last week's blog, couldn't write it as I was traveling and had poor internet connection. In the last two weeks, I have been trying out different implementations of stage II, using multi point evaluation.
Mid-term evaluations began on the 26th of June, and I have met most of the targets I had set. I have added an optimized version of Pollard Rho, and the ECM code should be completed by the end of this week. Although there are still a lot of things I'd like to try to make it faster, I don't want to deviate from the timeline now. I'll probably get back to them after GSoC.
The current ECM timings are close to GMP-ECM, I'll be posting some timings next week, once I code it using mpn's. My main worry right now is the fact that using multi point evaluation is slowing down ECM, contrary to the expected speedup.
This week, I plan to find the reason why the multi point evaluation technique is not turning out to be faster, as expected. Also I would be rewriting ECM, at least the stage I, using mpn's.