Linear Regression

This applet explores fitting polynomials to data by least squares linear regression. The user can view the computed least squares fit, or guess a polynomial fit and compare it with the least squares fit. There are several methods for inputting or generating new data.

The data model
This applet uses a model that assumes that \(x\) is somehow given and that

\[y = P(x) + e,\]

where \(P(x)\) is a polynomial of a specified order and \(e\) is a random deviation from \(P(x)\) which has mean \(0\) and is independent of \(x\). We also assume that for each data point \(e\) was independently drawn from the same distribution.

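The goal of the least squares fit is to find the polynomial \(P_b(x)\) of the specified order that best fits the data, that is, the one that minimizes the sum of the squared errors \(\sum (y_j - P_b(x_j))^2\) over the data points \((x_j, y_j)\).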
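
As an illustration of what a least squares polynomial fit computes, here is a minimal sketch in plain Python. It is not the applet's own code: the function names `fit_polynomial` and `evaluate` are hypothetical, and solving the normal equations by Gaussian elimination is just one simple approach that works for the low orders used here.

```python
def fit_polynomial(xs, ys, m):
    """Return coefficients [b0, b1, ..., bm] of the order-m polynomial
    b0 + b1*x + ... + bm*x^m that minimizes sum((y - P(x))^2)."""
    size = m + 1
    # Normal equations A b = c with A[j][k] = sum(x^(j+k)), c[j] = sum(y * x^j).
    A = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    c = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(size)]

    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("not enough distinct x values for this order")
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, size):
            factor = A[r][col] / A[col][col]
            for k in range(col, size):
                A[r][k] -= factor * A[col][k]
            c[r] -= factor * c[col]

    # Back substitution.
    b = [0.0] * size
    for j in range(size - 1, -1, -1):
        known = sum(A[j][k] * b[k] for k in range(j + 1, size))
        b[j] = (c[j] - known) / A[j][j]
    return b


def evaluate(coeffs, x):
    """Evaluate b0 + b1*x + ... using Horner's rule."""
    result = 0.0
    for coef in reversed(coeffs):
        result = result * x + coef
    return result


# Example: data lying exactly on y = 1 + 2x + 3x^2 (no random deviation e).
xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
print(fit_polynomial(xs, ys, 2))  # coefficients close to 1, 2, 3
```

With noisy data the recovered coefficients will differ from the ones used to generate the data, which is the effect the applet lets you explore.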

Polynomials
The applet uses three polynomials:

\(P_b(x)\): the best fit polynomial. We label the order of \(P_b\) with \(m\). The order can be set using the \(m\)-slider. The coefficients of \(P_b\) are then determined by the data.
\(P_g(x)\): the user's guess for the best fit polynomial. The order is the same as that of \(P_b\). The coefficients are set using the \(b\)-sliders.
\(P_r(x)\): the polynomial used in generating random data. We label the order of \(P_r\) with \(r\). The order can be set using the \(r\)-slider. Once the order is set, \(P_r\) uses preset coefficients. You will need to use the applet to discover the values of these coefficients. Random data is generated by randomly picking an \(x\) value and then letting \(y = P_r(x) + e\), where \(e\) is randomly drawn from a normal distribution \(N(0,\sigma)\).*

Some elements are always available and their behavior is independent of the state of the applet.

The \(\boldsymbol{x}\) and \(\boldsymbol{y}\) readouts show the position of the mouse in the graph window.
The \(\boldsymbol{n}\) readout shows the current number of data points.
The \(\boldsymbol{m}\)-slider sets the order of both \(P_b\) and \(P_g\), which always have the same order.
The data is always shown as a scatter plot.
The Clear button removes all the data.
The \(\boldsymbol{b}\)-sliders control the coefficients of \(P_g\).

The other controls change the state of the applet. They determine what is shown in the graphs, which coefficients the b-sliders control and how new data is added.

Visibility of graphics and error readouts

When Best fit is checked:
- The graph of \(P_b(x)\) is shown in the same color as the text for the [Best fit] checkbox.
- The coefficients of \(P_b\) are shown in the same color next to the b-sliders.
- The average squared error for \(P_b\) is shown in the same color as well. (See below for details on the average squared error.)
- Because the curves are very sensitive to the values of the parameters, especially the higher order parameters, when Best fit is checked and the value of a b-slider is within a small distance of the best fit value, the applet sets the slider value to the exact best fit value.
When Guessed fit is checked:
- The graph of \(P_g\) is shown in the same color as the text for the [Guessed fit] checkbox.
- The average squared error for \(P_g\) is shown in the same color. (See below for details on the average squared error.)
- The Guessed fit errors checkbox is available and, when checked, vertical lines indicating the error are drawn from the guessed fit curve to the data points. The lines have the same color as the text in the [Guessed fit errors] checkbox.

Data generation

The applet allows new data to be created in several ways. Existing data points can also be removed.

When the Add point radio button is selected: The user can add a data point by clicking in the graphing window.
When the Remove point radio button is selected: Clicking the mouse in the graphing window will remove the nearest data point.
When the Random radio button is selected:
- The random generation controls are visible.
- The Generate data button generates random data using the following algorithm:
  - An \(x\) value is chosen randomly using a uniform distribution on \([-3,3]\).
  - Then we take \(y = P_r(x) + e\) where \(e\) is randomly chosen from a normal distribution \(N(0, \sigma)\).
  Note: because the data is generated randomly, some points may fall outside the graphing window.
- The \(\boldsymbol{N}\)-slider controls the number of random data points generated at a time.
- The \(\boldsymbol{r}\)-slider controls the order \(r\) of \(P_r\). For each order, the coefficients of \(P_r\) are preset within the applet.
- The \(\boldsymbol{\sigma}\)-slider controls the value of the standard deviation in the normal distribution \(N(0, \sigma)\).
When the Preset radio button is selected: the \(\boldsymbol{P}\)-slider is visible. Choosing a value of \(P\) produces one of the \(9\) prepared datasets in the applet.
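
The random generation algorithm described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `generate_data` and `evaluate` are hypothetical, the coefficients in the example are made up (the applet's preset coefficients for \(P_r\) are not given here), and the rejection step mentioned in the footnote is left out.

```python
import random


def evaluate(coeffs, x):
    """Evaluate the polynomial coeffs[0] + coeffs[1]*x + ... at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def generate_data(coeffs, n_points, sigma, rng=None):
    """Generate n_points pairs (x, y) with y = P_r(x) + e.

    x is uniform on [-3, 3]; e is drawn from a normal distribution
    with mean 0 and standard deviation sigma.
    """
    rng = rng or random.Random()
    data = []
    for _ in range(n_points):
        x = rng.uniform(-3.0, 3.0)
        e = rng.gauss(0.0, sigma)
        data.append((x, evaluate(coeffs, x) + e))
    return data


# Example with made-up coefficients for an order-2 polynomial (r = 2),
# N = 10 points and sigma = 0.5. A seeded generator makes runs repeatable.
points = generate_data([1.0, -2.0, 0.5], n_points=10, sigma=0.5,
                       rng=random.Random(0))
for x, y in points:
    print(f"{x:6.3f}  {y:6.3f}")
```

Here `n_points` plays the role of the \(N\)-slider, the length of `coeffs` minus one plays the role of \(r\), and `sigma` plays the role of the \(\sigma\)-slider.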

The average squared error is indicated by \(\sum \frac{e_j^2}{n}\). That is, \(\sum \frac{e_j^2}{n} = \sum\frac{(y_j - P(x_j))^2}{n}\), where \(P(x)\) is either \(P_b\) or \(P_g\) and the sum runs over all \(n\) data points \((x_j, y_j)\).
If the Best fit checkbox is checked, then the average squared error for the best fit is shown in the same color as the text in the checkbox. Likewise, if the Guessed fit checkbox is checked, then the average squared error for the guessed fit is shown in the same color as that checkbox.
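
To make the formula concrete, the following short Python sketch computes the average squared error of a polynomial on a data set. The function name `average_squared_error` and the sample data are hypothetical and chosen only for illustration.

```python
def average_squared_error(coeffs, data):
    """Return sum((y_j - P(x_j))^2) / n for P(x) = coeffs[0] + coeffs[1]*x + ..."""
    if not data:
        raise ValueError("need at least one data point")
    total = 0.0
    for x, y in data:
        p = sum(c * x ** k for k, c in enumerate(coeffs))
        total += (y - p) ** 2
    return total / len(data)


# Three points lying exactly on y = 1 + 2x.
data = [(0, 1), (1, 3), (2, 5)]
print(average_squared_error([1, 2], data))  # exact fit: 0.0
print(average_squared_error([0, 2], data))  # each error is 1, so the average is 1.0
```

For a fixed order, no guessed polynomial \(P_g\) can give a smaller value than the best fit \(P_b\), since \(P_b\) is defined as the minimizer of this quantity.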

*Technically this is not exactly true. If a randomly generated point is outside the range of the plot, then it is rejected and another point is generated. The polynomials and range of \(\sigma\) are chosen so that this is extremely rare. Thus, \(e\) is very close to normally distributed.