If you do a little planning, it's easy to gather information about a business's past performance. You can easily track web site hits, retail sales, software downloads, or enrollment in your training courses over time. You can also track expenses such as new equipment purchases, electricity use, and overtime requests over time.
Once you have that data, however, what do you do with it? Just looking at the numbers you can probably tell whether the values are increasing, decreasing, or staying about the same over time. It's amazing how many companies leave it at that without considering questions such as, "How much
are sales increasing over time?" If you could answer that question, you might be able to predict how future costs and revenue.
For example, suppose the data points shown in Figure 1 represent enrollment in a training course over time. From this data it's clear that more people are enrolling in the course as time goes on. That's good to know but it would be even better if you could use the data to predict future enrollment. Then you could decide whether you need to hire more instructors, buy more supplies, or find more classroom space.
Figure 1. Disorganized Data:
It looks like this data falls roughly along a line, but which line?
One approach to making these sorts of predictions is to find the line that best fits the data. Then you can predict enrollment by extending the line into the future. Your prediction might not be perfect but it will give you an estimate that you can use for planning.
One way to find a line that fits the data is to consider the vertical distance between a line and the data points. Figure 2 shows five data points and a possible line to represent them. The red lines shows the vertical distances between the green line and the data points.
Figure 2. Error Apparent:
The red lines indicate the vertical distance between the green line and the data points.
To find a linear least squares fit, you find the line that minimizes the sum of the squares of the vertical distances between the line and the points. This is an easier calculation than you might think but it does require a little calculus. If you don't like calculus, skip the next section.
A Little Calculus
Suppose the best fit line has equation y = m * x + b for some values m and b. Our goal is to find m and b that minimize the sum of the squares of the distances between this line and the data points.
1. The partial derivative of a sum is the sum of the derivatives.
2. The partial derivative of a constant (something that doesn't have the variable m in it) is 0.
Plugging that into the earlier equation gives:
These equations are a lot simpler and you can solve them for m and b to get:
So the equation for this line is y = –2 * x + 3. If you plug in the x coordinates of the two data points (1, 1) and (4, –5), you’ll find that the equation gives the points’ y coordinates so the line exactly fits the two points.
The following code shows how the LinearLeastSquares example program shown in Figure 3 (which is available in the download area in C# and Visual Basic versions) calculates a linear least squares fit.
// Find the least squares linear fit.
private void FindLinearLeastSquaresFit(List<PointF> points,
out double m, out double b)
// Make sure we have at least two points.
if (points.Count < 2)
throw new ArgumentOutOfRangeException("points",
"FindLinearLeastSquaresFit: Parameter points " +
"must contain at least two points.");
// Perform the calculation.
// Find the values S1, Sx, Sy, Sxx, and Sxy.
double S1 = points.Count;
double Sx = 0;
double Sy = 0;
double Sxx = 0;
double Sxy = 0;
foreach (PointF pt in points)
Sx += pt.X;
Sy += pt.Y;
Sxx += pt.X * pt.X;
Sxy += pt.X * pt.Y;
// Solve for m and b.
m = (Sxy * S1 - Sx * Sy) / (Sxx * S1 - Sx * Sx);
b = (Sxy * Sx - Sy * Sxx) / (Sx * Sx - S1 * Sxx);
The code first checks that the input data contains at least two points. It then loops through the points calculating the S values. It finishes by plugging the S values into the equations for m and b. (Note that the code is a bit light on error handling. For example, if you enter two points with the same x coordinate, the program tries to divide by 0 and crashes.)
Figure 3. A Fine Line:
Program LinearLeastSquares finds a least squares fit for a set of data points.
Using these techniques, you can find the line that best fits a set of data but what if your data doesn't lie along a line? What if the data fits more closely to a quadratic curve a * x + b * x2 + c for constants a, b, and c? In that case you might get a better result if you use least squares methods to find the best polynomial that fits the data instead of using lines.
The second part of this article, Predicting the Future with Polynomial Least Squares, explains how you can use least squares techniques similar to those described here to let you find quadratic curves and polynomials of even higher degrees that fit your data closely.