
Linear least squares lets you find the line that best fits a set of data points, but sometimes the points don't lie along a line. The second part of this two-part article explains how you can use polynomial least squares to find curves that fit your data.
As the first part of this article explained, sometimes you can use an advanced math concept -- linear least squares -- to predict the future. If you can find a line that closely fits your data, you can extend that line into the future to make predictions about what's to come. If your data represents such things as web site traffic, retail sales, or electricity usage over time, then a good linear least squares fit may help you predict future revenue and expenses.

But what if your data doesn't fall along a neat line? Take a look at the data points plotted in Figure 1. These points clearly don't lie in a line but they do form some sort of pattern. If you can find a curve that the points do fit, then you can use that curve to predict future values.

**Figure 1. Out of Line:** These points don't lie along a line but there is a clear pattern.

If the points fall along some polynomial curve, then you can use techniques similar to those described earlier in this article to find a polynomial least squares fit. In other words, you can find a polynomial A0 + A1 * x + A2 * x^2 + ... + Ad * x^d, with the values A0, A1, ..., Ad chosen to minimize the sum of the squares of the vertical distances between the points and the curve. In this equation, d is called the polynomial's degree.

Figure 2 shows a degree 2 polynomial least squares fit for the points shown in Figure 1. This curve fits the data pretty well.

**Figure 2. Polynomial Perfection:** This data fits closely to a degree 2 polynomial.

To find the least squares polynomial fit to the data, you need to use calculus that's very similar to the calculus you used to find the linear least squares fit. (For more information on that calculus, see the first part of this article.) You also need to use a little linear algebra. Neither of these is too hard, but if you don't want to work through them, you can skip the next few sections.

### A Little Calculus

Consider the general equation for a polynomial of degree d.

A0 + A1 * x + A2 * x^2 + ... + Ad * x^d

You need to find the values of A0, A1, A2, ..., Ad that minimize the sum of the squares of the vertical distances between the points and the curve.

For the ith data point, you can plug in the value xi to find the corresponding value on the curve. You can then subtract that value from the data point's y coordinate yi and square the result to get the ith error term.

Ei = (yi - (A0 + A1 * xi + A2 * xi^2 + ... + Ad * xi^d))^2

This is all analogous to the discussion of linear least squares in the first part of this article. In fact, if the polynomial has degree 1, this is the same equation you use for linear least squares.
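To make the error term concrete, here's a small Python sketch (the function name, sample coefficients, and sample point are my own, chosen for illustration):

```python
# Squared vertical error for one data point against a polynomial
# with coefficients [A0, A1, ..., Ad].
def error_term(coeffs, xi, yi):
    # Evaluate A0 + A1 * xi + A2 * xi^2 + ... + Ad * xi^d.
    value = sum(a * xi**j for j, a in enumerate(coeffs))
    return (yi - value) ** 2

# Example: the curve y = 1 + 2x + 3x^2 evaluated at x = 2 gives 17,
# so the point (2, 20) has error (20 - 17)^2 = 9.
print(error_term([1, 2, 3], 2, 20))  # 9
```

The total error E for a candidate polynomial is just the sum of this quantity over all of the data points.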

The next step is to take partial derivatives of this equation with respect to the values you want to find: A0, A1, A2, ... Ad. That may seem like a terrifying mess but it's really not too bad if you use the same three rules described in the first part of the article.
To take the partial derivative of a function E with respect to the variable A0, written ∂/∂A0(E) or ∂E/∂A0, you can use these three rules:

- The partial derivative of a sum is the sum of the partial derivatives.
- The partial derivative of a constant (something that doesn't have the variable A0 in it) is 0.
- The partial derivative of a function raised to a power, ∂/∂A0(F^K), equals K * F^(K-1) * ∂/∂A0(F).

If you use these three rules, you will find that the partial derivative with respect to Aj of the ith error term is:

∂Ei/∂Aj = 2 * (yi - (A0 + A1 * xi + ... + Ad * xi^d)) * (-xi^j)
= -2 * (yi * xi^j - A0 * xi^j - A1 * xi^(j+1) - ... - Ad * xi^(j+d))
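You can sanity-check that derivative formula numerically with a finite difference (a Python sketch; the function names, sample coefficients, and sample point are mine, not from the article):

```python
def dEi_dAj(coeffs, xi, yi, j):
    # The formula above: -2 * (yi - poly(xi)) * xi^j.
    value = sum(a * xi**k for k, a in enumerate(coeffs))
    return -2 * (yi - value) * xi**j

def finite_difference(coeffs, xi, yi, j, h=1e-6):
    # Approximate the derivative by nudging coefficient Aj a tiny amount.
    def error(cs):
        value = sum(a * xi**k for k, a in enumerate(cs))
        return (yi - value) ** 2
    bumped = coeffs[:]
    bumped[j] += h
    return (error(bumped) - error(coeffs)) / h

# For coefficients [1, 2], the point (3, 10), and j = 1,
# the polynomial gives 1 + 2*3 = 7, so the derivative is -2 * 3 * 3 = -18.
print(dEi_dAj([1.0, 2.0], 3.0, 10.0, 1))            # exact formula
print(finite_difference([1.0, 2.0], 3.0, 10.0, 1))  # close to it
```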

Adding up the terms for all of the data points, you get:

∂E/∂Aj = Σ ∂Ei/∂Aj = -2 * Σ (yi * xi^j - A0 * xi^j - A1 * xi^(j+1) - ... - Ad * xi^(j+d))

If you pull the A values out of the sums, you can rewrite this as:

∂E/∂Aj = -2 * (Σ yi * xi^j - A0 * Σ xi^j - A1 * Σ xi^(j+1) - ... - Ad * Σ xi^(j+d))

This equation is intimidating, but all of the xi and yi values are coordinates of the data points, so the sums are just numbers. That means the equation has the following form for some constant values S, S0, S1, ..., Sd.

∂E/∂Aj = -2 * (S - A0 * S0 - A1 * S1 - ... - Ad * Sd)

This is a fairly simple equation with d + 1 unknowns A0, A1, ... , Ad.

To minimize the total error E, you set each of these partial derivatives equal to 0. Taking the partial derivatives with respect to Aj as j varies from 0 to d and setting them to 0 gives you d + 1 equations with d + 1 unknowns. Now you can solve the equations to find the coefficients and you're done.
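As a sketch of how those d + 1 equations can be assembled in code (Python here; the function and variable names are mine, not from the article):

```python
def normal_equations(points, degree):
    """Build the (d+1) x (d+1) linear system whose solution gives the
    least squares coefficients A0..Ad.  Row j comes from setting
    dE/dAj = 0, i.e. sum over k of Ak * S(j+k) = sum of yi * xi^j."""
    d = degree
    # s[m] = sum over the points of xi^m, for m = 0 .. 2d.
    s = [sum(x**m for x, _ in points) for m in range(2 * d + 1)]
    # Right-hand sides: sum of yi * xi^j for j = 0 .. d.
    rhs = [sum(y * x**j for x, y in points) for j in range(d + 1)]
    # Coefficient matrix: entry (j, k) is s[j + k].
    matrix = [[s[j + k] for k in range(d + 1)] for j in range(d + 1)]
    return matrix, rhs
```

For example, for the two points (0, 1) and (1, 3) and degree 1, this produces the system 2*A0 + 1*A1 = 4 and 1*A0 + 1*A1 = 3, whose solution A0 = 1, A1 = 2 is the line y = 1 + 2x through both points. Solving the system in general is what the linear algebra below is for.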

For a linear least squares fit, solving 2 equations with 2 unknowns isn't too hard. For this more general case where the degree of the polynomial might be large, however, you need a more general solution. That's where the linear algebra comes in.

### A Little Linear Algebra

Suppose you have a series of n equations with n unknown values x1, x2, ... , xn:

A11 * x1 + A12 * x2 + ... + A1n * xn = C1

A21 * x1 + A22 * x2 + ... + A2n * xn = C2

...
An1 * x1 + An2 * x2 + ... + Ann * xn = Cn

(Note: To keep this discussion general and more consistent with what you'll see in other places, the variables here are x1, x2, and so on, and the known constants are A11, A12, and so forth. When we use this to find polynomial least squares fits, you'll be solving for the coefficients A0, A1, and so forth.)

One way to solve this system of equations is Gaussian elimination. (Actually I use a slightly modified version of Gaussian elimination that avoids the need for back substitution. For a more rigorous discussion of Gaussian elimination, look in Wolfram MathWorld or Wikipedia.)

It's a fairly simple method, but learning it from a description can be a bit tricky, so after the following description I'll work through an example that should make things clearer.

You can write the system of equations as an augmented matrix, where each row holds one equation's coefficients plus its constant:

    | A11  A12  ...  A1n | C1 |
    | A21  A22  ...  A2n | C2 |
    | ...                     |
    | An1  An2  ...  Ann | Cn |

The first step is to put a 1 in the matrix's upper left corner. That's easy: simply divide each of the entries in the first row by the value currently in the upper left corner. This is called normalizing the row. If you think about it, you'll realize that this doesn't change the truth of the first equation. All you're doing is dividing both sides of the equation by the same number, so the equation remains true.

Next you put 0s in the rest of the first column. For each of the other rows, subtract the first row multiplied by that row's first entry. Again this doesn't change the truth of the equations, because you're subtracting equal amounts from both sides of each equation.

Now you have an augmented matrix that looks like the following, where * means an entry whose value isn't important right now. You'll need those values later, but they don't change the basic shape of the augmented matrix.

    | 1  *  ...  * | * |
    | 0  *  ...  * | * |
    | ...              |
    | 0  *  ...  * | * |

Now you repeat these steps to place a 1 in the second row's second column and 0s in all of the other entries in that column. Continue the process until the matrix has this form:

    | 1  0  ...  0 | V1 |
    | 0  1  ...  0 | V2 |
    | ...               |
    | 0  0  ...  1 | Vn |

Now if you go back to the original meaning of this matrix, you can read out the xi values. Using the values in this matrix to reconstruct the first equation, you get x1 + 0 + 0 + ... + 0 = V1, so x1 = V1. Using the other rows in the matrix, you can see that xi = Vi for every i.

There's one odd special case that may throw a monkey wrench into this process. When you try to change the ith entry on row i into a 1, there's a chance that the entry will have value 0. In that case, you cannot normalize the row by dividing its entries by 0.

If that happens, simply swap that row with one of the following rows that does not have a 0 in this column and continue as before. (If there is no such row, then the system of equations does not have a unique solution.) If you think about what this swap means, you can see that it doesn't change the truth of the equations. This operation corresponds to switching the original order of the equations and that certainly doesn't change whether they are true or false.
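The whole procedure, including the row swap for a 0 pivot, can be sketched in Python (the function name and the worked example are mine, not from the article):

```python
def gauss_jordan(matrix, rhs):
    """Solve matrix * x = rhs by the elimination described above.
    Works on copies so the caller's data is left alone."""
    n = len(rhs)
    # Build the augmented matrix [matrix | rhs].
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for i in range(n):
        # Special case: if the pivot is 0, swap this row with a
        # later row that has a nonzero entry in this column.
        if aug[i][i] == 0:
            for k in range(i + 1, n):
                if aug[k][i] != 0:
                    aug[i], aug[k] = aug[k], aug[i]
                    break
            else:
                raise ValueError("no unique solution")
        # Normalize the row so the pivot becomes 1.
        pivot = aug[i][i]
        aug[i] = [v / pivot for v in aug[i]]
        # Zero out this column in every other row, which avoids
        # the need for back substitution.
        for k in range(n):
            if k != i:
                factor = aug[k][i]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[i])]
    # The last column now holds the solution values V1 .. Vn.
    return [row[n] for row in aug]

# Example: solve x + y = 3 and 2x - y = 0.
print(gauss_jordan([[1, 1], [2, -1]], [3, 0]))  # [1.0, 2.0]
```

Feeding this routine the d + 1 normal equations from the calculus section yields the least squares coefficients A0, A1, ..., Ad.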