Not all data is important -- Estimating stickiness of DLA simulations using local regressors

This is a follow up to my previous post on fast DLA simulations. Please do read that before continuing further!

In this article, we try to come up with a model which can help us estimate the stickiness of a DLA simulation. Given an image from a DLA simulation (a binary matrix, 1 indicating a particle and 0 free space), we would like to estimate the value of k used for that particular simulation. This was actually given to me as a part of a recruitment process, and for simplicity, the range of k was set to [1e-3, 5e-2]

P.S. Python notebooks with the analysis done in this article can be found here. The notebook with estimation approach can be found here.

The stickiness parameter -- k

I briefly touched upon k in the previous post, the stopping criterion of a particle undergoing Brownian motion depends on this parameter. As soon as a moving particle collides with a particle at rest, the moving particle stops with the probability k. The higher the value, the higher the chance of a particle coming to rest (and hence, the stickier the particle). This parameter tends to change a DLA simulation quite a bit. Let's have a look at 3 simulations with 10,000 particles, but with varying stickiness.


Simulations with stickiness value 1.0, 0.5 and 0.1 respectively

Just by examining a few such simulations, it is clear that a low k value entails a more "compact" (or "crowded") image. If we can somehow come up with a nice property that can capture this crowding, we can think of modelling that as a function of the stickiness parameter. For my analysis, I used the total surface area of the image to capture the crowding. This was computationally fast to compute (order of \(n^2\)) and gave a good idea about the level of crowding of the image. We would expect the surface area to increase (less crowding) as k increases.

Extracting information

Given a DLA simulation, let's look at what all information we can extract

  • The size of the canvas
  • The number of particles
  • The position of each particle in the canvas

Clearly, surface area can easily be calculated from the image. We just need to check each particle and see how many of its neighbouring cells are empty. For the rest of the article, I'll use n to denote the total number of particles, and S.A. to denote the surface area.

Generating data

Before we start modelling anything, it was essential to first get the data. This was pretty straightforward since we had already written the DLA simulation code. We split the range of k into equal intervals using np.linspace(1e-3, 5e-2, 40), and run 15 simulations of 20,000 particles each. For each run, we calculate and save the surface area after the addition of every 100 particles. This gave us 15 readings of surface area for each n in range(100, 20001, 100). For each value of n, we take the average surface area across the 15 runs. In total, we generated roughly 8000 (n, k, S.A.) tuples, which we further use in our model.
All this data is available here.

Variation of Surface Area with n and k

As a sanity check, we first plotted the value of surface area vs n, keeping k fixed. Obviously, this should have a strong positive correlation.

sa vs n

Surface area vs number of particles for different values of stickiness

From the above plots, it's clear that the total surface area increases with n. We initially assumed that the graph would be strictly increasing, but later on realised that its wrong to claim this, as in some rare cases addition of a new particle can lead to a decrease in the surface area of the figure (think of a 3x3 square of particles with one of the edges having an empty space).

Next, we tried to see how surface area varies with the stickiness value (keeping n fixed). If this looks like something we can capture using a model, we don't need to look further.

sa vs k

Surface area vs stickiness for different n

As soon as I plotted these, I got really excited. This clearly looks somewhat like a log distribution! However the only downside, it's not bounded in a fixed range (look at the y-axis for different n). That should have been expected since the total surface area is bound to depend on n as well. So we need to find a way to eliminate this from our model. A very naive way to this is to divide the surface area with n.

sa vs k

Surface area / n vs stickiness for different n

Honestly, I didn't expect this to work since I couldn't come up with a proof that justified that dividing by n (and not some power of n) should normalize the range, but these plots looked really promising. The average surface area seems to be a better property to fit a curve on. Next, I checked how surface area / n varies with log(k)

sa/n vs log(k)

Surface area / n vs log(stickiness) for different n

These plots looked pretty good to fit a curve on, so I stopped further alterations to the data. It looks like a straight line as k increases, small values of k don't look good though. So I tried with second-degree polynomials as well.

Fitting a curve

There were two important things to decide before beginning to fit a curve

  • How should we split the data into train/test
  • What metric should be used to evaluate the model

The data which we generated from the simulations were (n, k, S.A.) tuples. I really wanted to eliminate n completely, since training separate models for each value of n just wasn't possible. Let's have a look at the scattering of surface area values for different ranges of n, which might help us understand whether we can eliminate n or not.

sa/n vs log(k)

Surface area / n vs log(stickiness) for different n. Points from all simulations

We notice that as the value of n increases, the average surface area tends to scatter less. Let's explore further for smaller values of n.

sa/n vs log(k)

Surface area / n vs log(stickiness) for different n. Points from all simulations

Two inferences can be drawn from the analysis until this point

  • Values of surface area scatter less as n increases. Hence, the estimation will be more accurate for larger n.
  • As n increases, this scattering becomes negligible. Furthermore, for large n (above 1000), the curves are extremely similar. Hence we can eliminate n completely from models which have a large number of particles.

So eliminating n completely was not feasible, since it will tend to fail for smaller values. Another interesting idea here is to use some sort of a windowed curve fitting. For a given n, instead of using complete data, we only use the data from simulations where the number of particles is in the range [n - c, n] for some c.

To check whether my intuition was correct, I simply partitioned the data into two parts and used np.polyfit for curve fitting, checking for up to 3-degree polynomials. I trained models on data where the number of particles were in range(100, 15000, 100) and tested it on data from simulations where the number of particles were in range(15000, 20000, 100).

sa/n vs log(k)

Predictions on the test set, using complete training data

This did seem like a good fit, the error in predicted k was low (we're bounded in [1e-3, 5e-2]). Error % = 0.00166000318459 / (5e-2 - 1e-3) = 3.38 %

However, notice that the training error is high. This was expected because we saw high scattering for small k. Let's try to see if using only data closer to our given n can give a better fit. The next plots show predictions by a model trained in range(10000, 15000, 100), tested on range(15000, 20000, 100).

sa/n vs log(k)

Predictions on the test set, using partial training data closer to test set

As expected, the training error went down in this case. However, it was really interesting to notice that error on the test has also decreased! E % = .00106134661616 / (5e-2 - 1e-3) = 2.16 %. Our estimate improved when we discarded data for n much smaller than the target n. This strongly suggests that we should be using a local window regression model.

Training local models

The idea is to use only data from a fixed range of n smaller than the given n. Since training time is negligible, we can train the model in real time depending on the given n in the DLA image. We already have simulation data for n up to 20,000 particles. The next obvious question that pops up is, what happens if n is so large that we don't have simulation data. From the above analysis, we can safely say that the curve more or less remains the same for any number of particles above 10,000.

For each value of n in range(1000, 20000, 1000), we trained regression models on simulation data where the number of particles was in the range [n - 4000, n - 100].

sa/n vs log(k)

Predictions on the test set, using local regressors

For the 3-degree poly, the worst error percentage (0.0019) is better than the average train error in the previous fit. Furthermore, the training error also went down. Hence a local fit is definitely a better one.

So our strategy to predict stickiness is the following. For a given n, we take a window of n from the data and fit a 3rd-degree polynomial. We use this polynomial to predict the k for the given n. Code for this approach can be found here.

Show Comments