Polynomial Regression
Machine learning: polynomial regression using numpy, with a training loop and weight updates via gradient descent.
Using only numpy, this code shows how polynomial regression works. The model takes the degree and the learning rate as inputs. Using the weights $w$ as the coefficients of the polynomial equation:
$$Y = x^{0}w_{0} + x^{1}w_{1} + x^{2}w_{2} + x^{3}w_{3}\ldots(1)$$
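As a quick illustration, here is a minimal numpy sketch of evaluating (1) for a single input; the helper name `predict` and the sample weights are illustrative choices, not taken from the model code described in this post.

```python
import numpy as np

def predict(x, w):
    """Evaluate the polynomial of (1): y = x^0*w0 + x^1*w1 + x^2*w2 + ..."""
    powers = np.arange(len(w))           # exponents 0, 1, ..., degree
    return np.sum((x ** powers) * w)     # weighted sum of the powers of x

# Example: a degree-3 polynomial with weights [w0, w1, w2, w3]
w = np.array([1.0, -2.0, 0.5, 0.1])
print(predict(2.0, w))                   # 1 - 4 + 2 + 0.8 = -0.2
```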
The model tries to arrive at the correct weights, which produce the same results as the actual results. It must be noted here that at each iteration the weights are updated using the gradient descent algorithm. The weight-updating logic is as follows:

$$w_{i} = w_{i} - \alpha\frac{\partial C}{\partial w_{i}}\ldots(2)$$

where $\alpha$ is the learning rate and $C$ is the cost.
The cost function used here is the Mean Squared Error (MSE), but it can be different based on the user's requirement. When a different cost function is used, the gradient of the cost function changes, and the code that computes it changes with it. However, the weight-updating logic in (2) remains the same.
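To make that separation concrete, the sketch below keeps the update of (2) independent of the cost: the gradient function is passed in, so swapping the cost only swaps that function. The name `gradient_descent_step` and the toy gradient are assumptions made for illustration.

```python
import numpy as np

def gradient_descent_step(w, grad_fn, lr):
    """One application of (2): move the weights against the gradient of the cost.

    grad_fn computes dC/dw for whatever cost function is chosen; changing the
    cost changes grad_fn, but this update rule stays the same.
    """
    return w - lr * grad_fn(w)

# Toy illustration: the gradient of C(w) = sum(w**2) is 2*w.
w = np.array([3.0, -1.0])
w = gradient_descent_step(w, lambda w: 2 * w, lr=0.1)
print(w)   # [ 2.4 -0.8]
```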
The intuition behind the weight update is simple. The gradient of a function is nothing but the partial derivative of the function, treating all other variables as constants except the one we are interested in. In our case, the gradient is taken with respect to a weight parameter: keeping that particular weight as the variable and treating the rest (the input x as well as the other weights w) as constants, the derivative of the cost function gives the gradient.
This gradient indicates the direction in which the cost function changes at the given combination of weights. Intuitively, we want the cost to move towards zero, since a cost of 0 indicates that the predicted output is the same as the required/actual output.
Since we have calculated the direction in which the cost function changes for the given combination of weights, we need to modify each weight such that the cost moves towards zero, or, to be clearer, modify the weights in the negative direction of the gradient.
When the weights are modified in the negative direction of the gradient, the new cost, computed with the updated weights, will be comparatively smaller.
It might naturally occur to us that we could get the right weights in a single step, by reducing or incrementing each weight by as much as the gradient, so that the next combination brings the cost to zero.
The important practical aspect that comes into the picture at this point is the learning rate. The gradient, even though it gives the direction in which the cost is moving, does not tell how far the cost is from zero. It must be noted that it gives only the direction of movement of the cost, not the amount by which the cost still has to move.
So, each weight must be updated by a small amount in the negative direction of the gradient. If the learning rate is too small, the number of iterations needed to bring the cost to zero will be too large. But, at the same time, if the learning rate is too high, the next combination will overshoot the optimal combination. That produces a gradient in the opposite direction, the next update again overshoots the zero-cost point, and the weights never converge. The cost keeps oscillating from one side to the other without ever reaching zero.
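A tiny numerical illustration of this trade-off, using the toy cost $C(w) = (w - 3)^2$ instead of the regression cost (an assumption made purely to keep the example one-dimensional):

```python
# Toy cost C(w) = (w - 3)^2, gradient 2*(w - 3); the minimum is at w = 3.
def run(lr, steps=20, w=0.0):
    for _ in range(steps):
        w = w - lr * 2 * (w - 3)   # update in the negative gradient direction
    return w

print(run(lr=0.01))   # too small: after 20 steps w is still far from 3
print(run(lr=0.3))    # reasonable: w ends up very close to 3
print(run(lr=1.1))    # too large: w overshoots back and forth and diverges
```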
The other practical aspect to note here is that the cost might not become perfectly zero. The ideal condition is for the cost to become zero, but in practice the cost may only reach a small non-zero minimum. The objective of the algorithm is therefore to find the point at which the cost becomes minimum.
It must be noted here that the point at which the cost is minimum is the point at which the slope of the cost curve becomes zero. Since the slope of a curve is nothing but its derivative, this implies that the gradient is also zero there. Our earlier intuition of updating the weights along the negative direction of the gradient matches this statement: when the slope is zero, the gradient is zero, and when the gradient is zero, the weights are no longer updated in (2).
Back to the mathematics. We define the error as the simple difference between the actual output and the predicted output:

$$E = Y - y\ldots(3)$$
The cost, MSE in our case, is defined as

$$C = \frac{1}{N}\sum E^{2}\ldots(4)$$
where N is the number of input samples. At the application level, after we have attained the optimal weights, N is also the number of samples we feed into the model to get predictions.
The predicted output $y$ is a scalar, a single floating-point number. The required/actual output $Y$ is also a single floating-point number. Thus, if we provide $N$ inputs to the model, $N$ errors will be produced. These are the errors whose mean square we are calculating in (4) as the cost function.
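In numpy, the N errors and the single scalar cost of (4) look roughly like this; the sample values are made up for illustration.

```python
import numpy as np

Y_actual    = np.array([1.2, 0.5, -0.3, 2.0])    # N = 4 actual outputs
y_predicted = np.array([1.0, 0.7, -0.1, 1.5])    # N = 4 predicted outputs

errors = Y_actual - y_predicted                  # one error per sample, as in (3)
cost = np.mean(errors ** 2)                      # MSE of (4): a single scalar
print(errors, cost)
```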
Now that the notations are clear, we will rewrite the equations, so as to match the above situation where we will provide N inputs to the model.
$$y_{j} = x_{j}^{0}w_{0} + x_{j}^{1}w_{1} + x_{j}^{2}w_{2} + x_{j}^{3}w_{3}\ldots(5)$$
where $x_{j}$ is the $j^{\text{th}}$ input sample and $y_{j}$ is the corresponding predicted output, for $j = 1, 2, \ldots, N$.
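A hedged sketch of computing $y_{j}$ for all N inputs at once: stacking the powers $x_{j}^{0} \ldots x_{j}^{degree}$ into a matrix turns (5) into a single matrix-vector product. The use of `np.vander` and the variable names are illustrative choices, not necessarily how the original code is written.

```python
import numpy as np

degree = 3
x = np.array([0.5, 1.0, 2.0])                     # N = 3 input samples
X = np.vander(x, degree + 1, increasing=True)     # row j holds x_j^0 .. x_j^degree
w = np.array([1.0, -2.0, 0.5, 0.1])               # [w0, w1, w2, w3]

y = X @ w                                         # y_j of (5) for every sample
print(y)
```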
Now we can rewrite (3) and (4):

$$E_{j} = Y_{j} - y_{j}\ldots(6)$$

$$C = \frac{1}{N}\sum_{j = 1}^{N}E_{j}^{2} = \frac{1}{N}\sum_{j = 1}^{N}\left( Y_{j} - y_{j} \right)^{2}\ldots(7)$$
It must be noted that the summation in (7) runs over all the $N$ inputs, so the cost $C$ is a single scalar value. Cost $C$ is calculated once for one set of inputs and the weights are updated once. When it comes to updating the weights, the gradient of this cost is calculated with respect to each of the weights and each weight is updated according to its calculated gradient. This process is indicated in (2).
The update in (2) is repeated thousands of times for each of the weights, so that the combination of weights converges to the optimal combination which produces the minimum cost. This repeated calculation and updating is called training the model.
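Putting the pieces together, a minimal training loop in this spirit might look as follows. The gradient expression anticipates equation (11) derived below; the function and variable names are illustrative.

```python
import numpy as np

def train(x, Y, degree, lr, iterations=10000):
    X = np.vander(x, degree + 1, increasing=True)   # X[j, i] = x_j^i
    w = np.zeros(degree + 1)                        # one weight per power of x
    N = len(x)
    for _ in range(iterations):
        error = Y - X @ w                           # E_j = Y_j - y_j, as in (6)
        grad = -2.0 / N * (X.T @ error)             # dC/dw_i for every weight
        w = w - lr * grad                           # the update of (2)
    return w

# Recover the coefficients of Y = 1 + 2x - x^2 from noiseless samples:
x = np.linspace(-1, 1, 50)
Y = 1 + 2 * x - x ** 2
print(train(x, Y, degree=2, lr=0.1))                # approaches [1, 2, -1]
```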
Getting back to the calculus: the gradient that we have been talking about since the beginning of this post has not yet been calculated. We will do that now. Expanding (7) using (5):
$$C = \frac{1}{N}\sum_{j = 1}^{N}\left( Y_{j} - \left( x_{j}^{0}w_{0} + x_{j}^{1}w_{1} + x_{j}^{2}w_{2} + x_{j}^{3}w_{3} \right) \right)^{2}\ldots(8)$$
Here, the expression inside the inner parentheses is the predicted output $y_{j}$ of (5), written out for a degree-3 polynomial.
This cost function must be differentiated (degree+1) times to get (degree+1) gradients, which are used to update (degree+1) weights. If the degree of the polynomial is 3, as shown in (8), then 4 gradients will be calculated and 4 weights will be updated for one set of N inputs. Note that there are (degree) coefficients or weights plus one bias parameter, which brings the total to (degree+1). This process of updating the 4 weights is repeated thousands of times so that all 4 of them arrive at the optimal combination which produces the minimum cost.
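For degree 3, the count of 4 weights and 4 gradients can be made explicit by computing one gradient per weight in a loop. This is only a sketch with made-up sample values; the gradient expression it uses is the one derived just below.

```python
import numpy as np

degree = 3
x = np.array([0.5, 1.0, 2.0, 3.0])                # N = 4 inputs
Y = np.array([1.0, 0.0, -1.0, 2.0])               # N = 4 actual outputs
w = np.zeros(degree + 1)                          # 4 weights for a degree-3 polynomial

X = np.vander(x, degree + 1, increasing=True)     # X[j, i] = x_j^i
error = Y - X @ w                                 # E_j for the current weights
grads = [-2.0 / len(x) * np.sum(error * X[:, i])  # one gradient per weight
         for i in range(degree + 1)]
print(len(grads))                                 # 4 = degree + 1
```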
For $w_{0}$, treating the inputs and the other weights as constants and differentiating (8):

$$\frac{\partial C}{\partial w_{0}} = K_{2}\sum_{j = 1}^{N}x_{j}^{0}\left( Y_{j} - \left( x_{j}^{0}w_{0} + x_{j}^{1}w_{1} + x_{j}^{2}w_{2} + x_{j}^{3}w_{3} \right) \right)\ldots(9)$$

where $K_{2}$ is the constant $-\frac{2}{N}$, produced by the power rule and the chain rule when (8) is differentiated. The other weights are constant with respect to $w_{0}$ and contribute nothing to the derivative. Here we can notice that the second term, the one inside the outer parentheses, is nothing but the error $E_{j}$ of the prediction, so (9) reduces to

$$\frac{\partial C}{\partial w_{0}} = -\frac{2}{N}\sum_{j = 1}^{N}E_{j}\,x_{j}^{0}$$

Similarly, the gradient with respect to the next weight $w_{1}$ is

$$\frac{\partial C}{\partial w_{1}} = K_{2}\sum_{j = 1}^{N}x_{j}^{1}\left( Y_{j} - \left( x_{j}^{0}w_{0} + x_{j}^{1}w_{1} + x_{j}^{2}w_{2} + x_{j}^{3}w_{3} \right) \right)$$

This equation again simplifies when we notice that the second term is nothing but the error of the prediction. The final gradient with respect to weight $w_{1}$ is

$$\frac{\partial C}{\partial w_{1}} = -\frac{2}{N}\sum_{j = 1}^{N}E_{j}\,x_{j}^{1}\ldots(10)$$
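As a quick sanity check on the derivation, the analytic gradient of (10) can be compared against a finite-difference estimate of the same partial derivative. The sample values and helper below are assumptions made purely for illustration.

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0])
Y = np.array([1.0, 0.0, -1.0, 2.0])
X = np.vander(x, 4, increasing=True)              # degree 3 -> 4 columns of powers
w = np.array([0.2, -0.1, 0.05, 0.01])

def cost(w):
    return np.mean((Y - X @ w) ** 2)              # MSE of (7)

# Analytic dC/dw_1 from the derived formula: -(2/N) * sum(E_j * x_j^1)
analytic = -2.0 / len(x) * np.sum((Y - X @ w) * X[:, 1])

# Finite-difference approximation of the same partial derivative
eps = 1e-6
w_plus = w.copy()
w_plus[1] += eps
numeric = (cost(w_plus) - cost(w)) / eps

print(analytic, numeric)                          # the two values agree closely
```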
Using the above method, we can calculate the gradient for all the weights and write a generalised equation, exclusively for polynomial regressions that use MSE as the cost function:

$$\frac{\partial C}{\partial w_{i}} = -\frac{2}{N}\sum_{j = 1}^{N}E_{j}\,x_{j}^{i}\ldots(11)$$

Here $i$ runs from 0 to the degree of the polynomial, and $E_{j}$ is the error of the $j^{\text{th}}$ prediction as defined in (6).
Finally, using equation (11) in our main weight-updating equation (2), we arrive at

$$w_{i} = w_{i} + \alpha\,\frac{2}{N}\sum_{j = 1}^{N}E_{j}\,x_{j}^{i}$$
This becomes our main equation, the one that does all the magic inside our regression model, and it is the one used in the code.
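As a closing sketch, the whole post can be condensed into a small model that takes the degree and the learning rate as inputs, as described at the top. This is an illustrative reconstruction under those assumptions, not the author's actual code; the class and method names are made up.

```python
import numpy as np

class PolynomialRegression:
    """Polynomial regression trained with the final update: w_i += lr * (2/N) * sum(E_j * x_j^i)."""

    def __init__(self, degree, lr):
        self.degree = degree
        self.lr = lr
        self.w = np.zeros(degree + 1)

    def predict(self, x):
        X = np.vander(x, self.degree + 1, increasing=True)   # x_j^i for i = 0..degree
        return X @ self.w

    def fit(self, x, Y, iterations=20000):
        X = np.vander(x, self.degree + 1, increasing=True)
        N = len(x)
        for _ in range(iterations):
            E = Y - X @ self.w                        # errors E_j of (6)
            self.w += self.lr * 2.0 / N * (X.T @ E)   # the final update equation
        return self

# Illustrative usage: fit a cubic to slightly noisy samples of y = 0.5 - x + 0.3x^3.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
Y = 0.5 - x + 0.3 * x ** 3 + rng.normal(0, 0.01, size=x.shape)

model = PolynomialRegression(degree=3, lr=0.1).fit(x, Y)
print(model.w)   # close to [0.5, -1.0, 0.0, 0.3]
```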