A Comprehensive Introduction To Linear Regression

In the 5th post of this series taking you from a total beginner to being able to a pro at creating your machine learning algorithms, we will take an in-depth look at linear regression. By the end, you will have a comprehensive understanding of everything you need to implement it fully!

You may have seen it used in your calculator or on your favorite statistics software, but what is the theory behind how linear regression is implemented?

We will be using concepts discussed in the previous posts, so if you aren’t familiar with what’s going on, I’d recommend reading them before you begin with this one.

The Hypothesis

Well, the hypothesis in machine learning is not the same as the “hypothesis” you learned in science class.

It is the predicting function that will output a result for a given input.

We will train parameters, also called the weights, which are all represented by one variable, θ (Theta). Then, these parameters will be multiplied by the values of our inputs to generate outputs. Doesn’t make sense? The example below will help clarify this:

As this is linear regression, we will be generating a straight line hypothesis thus the format will be:

The parameters theta 0 and theta 1 will represent the slope and the y-intercept respectively. Hence, for linear regression we only have 2 parameters, but when we get to polynomial regressions we will have many more parameters.

For example, for quadratic regression, we will have 2; for cubic, we will have 3.

Now, as to the number of features, or the number of variables that make up a single input, for this section, we will be dealing with only 1: the feature that is multiplied by theta 1 above. Theta 0 is a constant value so it does not need an additional feature.

What our machine learning algorithm will do is generate these parameters and then we can use these parameters inside our hypothesis to predict values for future, unseen inputs.

The Cost Function

Something we didn’t talk about earlier was the the cost function as I had felt it would be easier to explain when we started implementing a specific machine learning algorithm.

The function, for a given set of parameters (Theta), calculates the accuracy of our hypothesis in relation to the training data set.

Basically, it calculates how far our predicted outputs were from the actual outputs. This represents the error or the “cost”.

Thus, the higher the cost function, the lower the accuracy of our hypothesis. The lower the cost function, the more accurate our hypothesis — the more our predicted outputs match the actual ones.

The cost function we will be using is the average squared error cost function. Basically, the difference between the predicted result and the actual value will be squared. This difference will be calculated for all of the training inputs and the differences will be summed together. Finally, we will take the average of the squared differences and then divide this average by two.

The mathematical formula below demonstrates this concisely:

From i = 1 (representing the first input) to i = m (the last input), sum the squared differences

How does the cost function help create a hypothesis?

If you learned about minimization and optimization functions in math, this is basically it.

What we will be trying to do here is find such a set of weights, Theta, that it will result in the smallest possible for the cost function. This is called the gradient descent algorithm, and this is pretty much its goal.

Image result for gradient descent algorithm
Image source

In the graph above, we can see that we have a starting set of weights (or parameters), the initial weights. What the algorithm does is try to follow a negative gradient because by following a gradient that goes downwards, we are decreasing the cost function.

We keep following this gradient until no negative gradient exists: we have reached the lowest possible point for the cost function we can reach. Hence, we have obtained the optimal set of weights for our hypothesis.

Note, you may have studied in your math class, that there can be multiple minimums in a graph as can be see in the 3D representation below.

Image result for gradient descent multiplem minimum
Image source

What we as machine learning developers can do is randomly initialize the set of starting weights and run our algorithm multiple times (each time the weights will be randomized) and then pick the hypothesis that yielded the lowest cost function. By running it multiple times, we aim to eliminate the gradient descent algorithms that got stuck on a local minimum as opposed to a global minimum.

However, for models requiring linear regression, there is usually one one minimum in the graph so we do not have to worry about this (for now).

The Actual Gradient Descent Algorithm

We have described it generally in visual terms, but how actually does the algorithm find and follow the negative gradient? Well, it involves a lot of technical math and calculus (yikes, partial derivatives), but it is basically as below:

Image source

Bear along with me. Theta j here refers to how we will do this for each one of our Theta values. Now, let’s say for Theta 1, what we will do is take the partial derivative of the cost function with respect to Theta 1 (though we will feed it both Theta values). The partial derivative basically is taking the derivative of one variable while treating the other variables as constant.

Then we we will multiply this partial derivative with a (alpha): the learning rate of the algorithm. Basically, how fast the algorithm moves to find the minimum (we will discuss this in greater in the next section).

Finally, we subtract this from the current value of Theta 1. Then, we do the same process for Theta 2. This process continues until we reach a minimum of the graph.

Through this process, our parameters of Theta reach the optimal value that will yield the lowest cost function.

Don’t worry about the math too much: you do not really need to understand it to implement the algorithm. Just remember that in calculus we find the derivatives in order to calculate the slope and that is how we are able to follow the downwards slope here. At each point, we find the steepest downward slope and we follow that for each value of Theta until we reach a minimum.

An interesting thing about Gradient Descent

It can be noted that gradient descent is used almost everywhere in Machine Learning, not just in linear regression. It is a generalized algorithm.

The reason it can be applied to other machine learning algorithms is because other algorithms have different cost functions. The algorithm above presents a cost function but does not specify the exact implementation of the algorithm. Each machine learning algorithm has its own implementation of its cost function and this cost function is fed to the generalized gradient descent.

In fact, Octave, which we looked at earlier, due to this cool feature has a built-in optimization function “fmincg“. All you need to do is define a function that implements the cost function and then feed it to this ready made function.

So, substituting the cost function we had defined a few sections above into this gradient descent algorithm, we get the following:

Image source

We have basically substituted the average squared error cost function in for J(Theta 0, Theta 1) into the gradient descent algorithm above. Then, we computed the partial derivative and simplified the result to get the above two equations for Theta 0 and Theta 1 respectively.

Now this algorithm is specific to linear regression due to this substitution. Instead of using fmincg, we can simply implement the above algorithm above ourselves in Octave.

What is alpha, the learning rate?

Let us consider what happens with the value is too low or too high.

If the learning rate is too low, our algorithm will run painfully slow to the point that it could take several hours for a simple set of data

If the learning rate is too high, our algorithm can ‘overshoot’ the minimum. Let’s say it reaches a point next to the minimum in the parabolic representation we saw above, and then since it’s learning rate is too high, it takes too big of a jump and then lands on the opposite side of the parabola. Then, it follows the new downward slope but it still overshoots the minimum. It keeps doing this on and on, never finding the convergent minimum point.

The learning rate is multiplied by the partial derivative to determine how big the strides, or the steps, the gradient descent algorithm takes.

Isn’t having a cost value as close to 0 the best for an algorithm? (Optional)

That’s a fair assumption to make. But, keep in mind, that this is not only impossible to do due to the variance of the data, but also it may not be an ideal thing to do in the first place. This brings us to the concepts of over-fitting and under-fitting.

These concepts do not apply to our machine learning algorithm yet as it is just a linear model. However, if you are interested, feel free to read on.

If you create a function that maps the existing data that you used to train the function extremely perfectly (each point lies on the hypothesis), new data in the future that is introduced may lie completely off the best fit line we have generated. This is because we created a too accurate function, a too flexible function: it maps the data too well to the point it does not generalize well. It is then said to be over-fitting.

On the other hand, if you do the opposite, if you make your function too biased (you apply linear regression to a data that requires a quadratic model, for example), it will not be accurate on the training data as well as on future, unseen data. It is then said to be under-fitting.

The diagram below demonstrates this well. The image on the far left is under-fitting while that on the far right is over-fitting; on the other hand, the middle figure describes the ideal hypothesis.

Image result for overfitting vs underfitting
Image source

Our next steps

Now that we have the entire theory of linear regression down, we are 100% ready to implement the algorithm in the next post. In fact, you have all the details you need to implement it on your own if you can’t wait to get your hands dirty.

In the next edition, we will go step by step, explaining each line of our linear regression model. On top of that, I’ll post the code to an open source platform for everyone to freely browse after we have implemented it.

As always, if you have any questions or comments on this article and series, please, without hesitation, do share them below!

If you liked this post and felt you learned a lot from it, feel free it to share it with your friends and colleagues.

To stay tuned and up to date with when the next post will arrive, follow us on Twitter

See you guys in the next post!

One thought on “A Comprehensive Introduction To Linear Regression

  1. Pingback: [S] How Statistics Software Generate Linear Regressions For Your Data - Nevin Manimala's Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s