The Problem

Given a data set and a question to be answered, one generally tends to solve the problem in the following fashion -

  • Understand the data and its patterns
  • Hypothesize a model which will fit the data in the most relevant manner

A theme that overarches all such models is that they need to be fast, accurate and robust. The models need to be reasonably fast to provide actionable results in meaningful time, need to be accurate for the user to rely on and need to be robust in scenarios where the data is adversarial. The following discussion is particularly inclined towards solving for robustness while not significantly impacting the other two metrics.

Talking of adversarial data, it could come from error introduced during manual sampling or extrinsic factors like noise. In other cases, it could just be the nature of the data, making it less predictable. Data samples which don’t follow the general “trend” of the data are known as outliers and generally introduce an unwanted outcomes including results being inaccurate.

An Illustration

To understand why the outliers are a problem, let us consider the following data. For the purpose of this problem, consider the data points that come from

\[ y = f(x) = x \]

The points are represeted in the code below by variables \( x \) and \( y \) consisting of 100 randomly generated points. I’ve chosen one-dimensional data as it is easy to visualize and understand the implications of changes.

%matplotlib inline

from sklearn import linear_model
import numpy as np
import matplotlib.pyplot as plt

num_samples = 100
x = np.sort(np.random.rand(1, num_samples) * 500)
y = np.copy(x)

For the sake of this illustration, let us artificially introduce Gaussian noise (i.e. noise with the probability density function equal to the normal distribution) and add that to the original sample \( y \).

noise = np.random.normal(0, 1, (1, num_samples))
y_noise = np.copy(y) + noise * 30

Note that this noise will only bring minor jitters to the actual line. So, to make things worse, let us introduce some outliers manually. For instance, I chose to change all values of \( x <= 40 \) to be magnified by a factor of \( 10 \).

y_noise[np.where(x <= 40)] *= 10
y_noise[np.where((x >= 200) & (x <= 230))] *= -4
y_noise[np.where((x >= 400) & (x <= 430))] *= 4

Now, let us see how the final artifically constructed data looks as compared to the originally clean line $y=x$. Looks like we have done a fair amount of damage to the data.

original_data_plot, *od = plt.plot(x, y, 'g+')
noisy_data_ploy, *nd = plt.plot(x, y_noise, 'r1')
plt.legend([original_data_plot, noisy_data_ploy], ['Original Data', 'Noisy Data'])

Original Data v/s Noisy Data

Training a Linear Model

To make things simple, let us train a linear model to fit a line which best matches the data points. As with all general approaches, we’ll take a \( 60:40 \) train to test split ratio for the purpose of this model. By design of the problem we know that there is only one variable and hence the Linear Regressor will fit data to the model -

\[ y = mx + c \]

Remember that we added noise to data which had \( m=1 \) and \( c=0 \).

lm = linear_model.LinearRegression()
num_train = np.floor(num_samples * 0.6).astype(int)
x_train = x[0, :num_train].reshape(-1, 1)
y_train = y_noise[0, :num_train].reshape(-1, 1)
x_test = x[0, num_train:].reshape(-1, 1)
y_test = y_noise[0, num_train:].reshape(-1, 1), y_train)
print(lm.coef_, lm.intercept_)

Great! We have a coefficient and the intercept. That seems way off. Let us see how our predictions look as compared to original data.

original_data_plot, *od = plt.plot(x, y_noise, 'r1')
predicted_data_plot, *pd = plt.plot(x, x * lm.coef_ + lm.intercept_, 'b.')
plt.legend([original_data_plot, predicted_data_plot], ['Original Data', 'Predicted Data'])
plt.annotate('outliers', xy=(240, -800), xytext=(300,-600), arrowprops=dict(linewidth=0.1))
plt.annotate('outliers', xy=(400, 1600), xytext=(270,1400), arrowprops=dict(linewidth=0.1))

Original Data v/s Trained Data via Simple Linear Regression

Oops! That is an awfully skewed line and will not be an accurate predictor of the data we started with. As you might have guessed already, this problem aries because of the outliers labelled above.

The Solution

We need a way to de-sensitize our model to the outliers which are leading to the trained model in a bad shape seen above. One of the most popular approaches to outlier detection is RANSAC or Random Sample Consesus. The algorithm performs the following steps -


  1. Sample a subset of data uniformly at random (the minimum number of points needed to estimate the model)
  2. Estimate parameters for the model of choice using the sampled subset.
  3. Calculate the error for all remaining samples using an error function.
  4. Count number of inliers i.e., all samples below a given threshold error.
  5. If the number of inliers are above a given threshold, recompute the model using all inliers and hypothesis points
  6. Repeat the above to find the best model.

While there is nothing complex about the algorithm, it is an extremely elegant and efficient idea. It is important to understand that RANSAC is not a regressor itself but helps improve the robustness of a given model by non-deterministically sampling subsets of data and training on them. Think of it as a more efficient alternative to k-fold cross-validation.

Let us see how the RANSAC Regressor works for our problem.

rlm = linear_model.RANSACRegressor(linear_model.LinearRegression()), y_train)
original_data_plot, *od = plt.plot(x, y_noise, 'r1')
predicted_data_plot, *pd = plt.plot(x, x * rlm.estimator_.coef_ + rlm.estimator_.intercept_, 'b.')
plt.legend([original_data_plot, predicted_data_plot], ['Original Data', 'Predicted Data (RANSAC)'])

Original Data v/s Trained Data via Simple Linear Regression Augmented with RANSAC

Wow! That looks way better than the previous result and is generally moving in a correct direction. In essence, RANSAC works because it tries out different model hypotheses and returns the best possible match after a given number of trials. In that effort, the model hypothesis which would have contained the outliers would have been outright rejected in the error thresholding step.

Mathematical Analysis

Like all non-deterministic algorithms, we must show that this algorithms returns result with reasonable probabilistic bounds. This analysis will also expose us to the various parameters that go behind the scenes.

Let \( k \) be the number of trials we run for the algorithm above to select a subset of \( n \) good samples. Let \( w \) be the probability that a given sample is an inlier. Then the expected number of trials are,

\[ E(k) = \displaystyle\sum_{i=1}^\infty i p(i) = \displaystyle\sum_{i=1}^\infty i a^\text{i-1} b = b \displaystyle\sum_{i=1}^\infty i a^\text{i-1} \]

where \( b = w^n \) and \( a = 1 - b \). In plain words, \( p(i) \) represents the situation where first \( k - 1 \) trials have at least one bad sample and the \( k^\text{th} \) trial has all good samples. To solve the summation term of the expression above, lets expand and observe the pattern -

\[ \displaystyle\sum_{i=1}^\infty i a^\text{i-1} = 1 + 2a + 3a^2 + \mathellipsis + ia^\text{i-1} + \mathellipsis \]

Consider the identity,

\[ \frac{a}{1-a} = a + a^2 + a^3 + \mathellipsis + a^i + \mathellipsis \]

which when differentiated wrt. \( a \) and substituting in the above summation expressions gives us

\[ E(k) = b \frac{1}{1-a^2} = \frac{1}{b} = w^\text{-n} \]

An alternate analysis would be from the perspective where we want to ensure with a probability \( z \) that at least one of the random selection is outlier free. Hence,

\[ 1 - z = (1 - b)^k \] \[ k = \frac{log(1-z)}{log(1-b)} \]

These analyses can be used to estimate the number of trials \(k \) needed to obtain a guarantee as needed by the application on a case by case basis by simply plugging in the values.


RANSAC was first used to estimate the Location Determination Problem LDP (estimating points in space to an appropriate point in image). It has become a fairly common algorithm especially in Computer Vision due to the relatively low additional computational and storage requirements. It has found regular usage in problems pertaining to 2D Homography Estimation and Pose Estimation from images where the number of keypoints generated are fairly large and testing the complete sample space for solution is practically impossible in real-time.

The core idea behind RANSAC, though powerful, is still not enough for practical purposes. For instance, one of the biggest assumptions behind using the technique is the existence of outliers. Understandably, it is bound to fail when the inlier ratio \( w \) is low. Among other things, the error function to estimate whether the sample is an inlier or not will have a significant impact to the accuracy of the final model.

To solve the above problems, a family of RANSAC algorithms have been proposed which I’ll leave to be explored in further reading below.

Readings & References

  1. Fischler, Bolles - Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, 1981

  2. Choi, Kim, Yu - Performance Evaluation of RANSAC Family, 2009