Linear Regression
We want to be able to predict the unseen outcomes <math>\widehat{Y}_{unseen}</math> that we do not yet know, given some new examples <math>X_{unseen}</math>.
The linear model can be expressed in mathematical terms as:
<math>\widehat{Y} = X \cdot \beta</math>
The simple case where every example <math>X</math> has only one dimension (<math>N=1</math>) and the output <math>Y</math> has a single regressor (<math>O=1</math>) is illustrated in the image above.
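As a minimal sketch in NumPy (the numbers are made up and only illustrate the shapes involved), the prediction is a single matrix product, assuming the weights <math>\beta</math> have already been learned:

```python
import numpy as np

# A minimal sketch of the prediction step Y_hat = X . beta, assuming the
# weights beta have already been learned. Shapes: X_unseen is (examples, N),
# beta is (N, O). All numbers here are made up for illustration.
X_unseen = np.array([[1.0], [2.0], [3.0]])   # three unseen examples, N = 1
beta = np.array([[0.5]])                     # learned weights, O = 1

Y_hat_unseen = X_unseen @ beta               # predictions, shape (3, 1)
print(Y_hat_unseen)
```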
There are a number of different methods for learning the weights. The simplest one is the Least Squares (LS) method. It minimizes the following error function <math>\epsilon</math>, which is equivalent to minimizing the average squared prediction error for every regressor <math>o</math>:
<math>\epsilon_o^2 = \sum_{i=1}^m (\widehat{Y}_{i,o}-Y_{i,o})^2</math>
where <math>m</math> is the number of examples; equivalently, <math>\epsilon_o^2</math> is the dot product of the residual vector <math>\widehat{Y}_o-Y_o</math> with itself.
In a nutshell, the straightforward learning equation is:
<math>\beta = (X^TX)^{-1}X^TY</math>
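The following is a sketch of that closed-form fit on synthetic data made up for illustration (in practice `np.linalg.lstsq`, or `np.linalg.solve` on the normal equations, is preferred over forming the explicit inverse):

```python
import numpy as np

# A sketch of the closed-form Least Squares fit beta = (X^T X)^{-1} X^T Y,
# on synthetic data. The explicit inverse mirrors the formula above, even
# though np.linalg.lstsq is numerically safer in practice.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # m = 100 examples, N = 3
true_beta = np.array([[2.0], [-1.0], [0.5]])
Y = X @ true_beta + 0.1 * rng.normal(size=(100, 1))    # noisy linear target, O = 1

beta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(beta)                                            # should be close to true_beta
```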
This model is simple, it is guaranteed to find a solution (provided <math>X^TX</math> is invertible), and it is also quite fast.
It is quite straightforward to assess the performance of a linear regressor (how well it generalizes), and there are a number of ways the goodness of fit of the model can be described. For instance, we could look at the squared error <math>\epsilon^2</math>. Other commonly used measures are the error variance or the R-squared value.
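As a small sketch of two of these measures on made-up arrays, for a single regressor (<math>O=1</math>):

```python
import numpy as np

# Two of the fit measures mentioned above: the squared error and the
# R-squared value, computed on small made-up arrays.
Y = np.array([1.0, 2.0, 3.0, 4.0])
Y_hat = np.array([1.1, 1.9, 3.2, 3.9])

eps_sq = np.sum((Y_hat - Y) ** 2)            # squared error
ss_tot = np.sum((Y - Y.mean()) ** 2)
r2 = 1.0 - eps_sq / ss_tot                   # R squared
print(eps_sq, r2)
```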
In the one-dimensional case (where <math>N=1</math> and <math>O=1</math>), we can easily interpret the scalar <math>\beta</math> by observing its sign, which tells us whether there is a positive or negative correlation between the input <math>X</math> and the output <math>Y</math>. In N-dimensional space, this interpretation becomes more complex.
The linear model explained above generates a hyperplane that passes through the origin of the <math>Y</math> axis. To generalize to the case where <math>Y</math> has a non-zero intercept, it is common to create a new <math>\widetilde{X}=(1|X)</math>, where <math>1 \in \mathbb{R}^{M \times 1}</math> is a column vector full of ones, and feed it to the model.
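A sketch of this intercept trick, with made-up numbers chosen so the true intercept is 1:

```python
import numpy as np

# Prepend a column of ones to X so the first entry of beta becomes the
# intercept: the fitted hyperplane no longer has to pass through the origin.
X = np.array([[2.0], [3.0], [5.0]])                   # M = 3 examples, N = 1
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])    # X_tilde = (1 | X)

Y = np.array([[5.0], [7.0], [11.0]])                  # Y = 1 + 2*X
beta = np.linalg.inv(X_tilde.T @ X_tilde) @ X_tilde.T @ Y
print(beta)                                           # roughly [[1.0], [2.0]]
```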
Categorical inputs are not handled well by a linear regressor, the reason being that it can only ingest numerical values. Simply changing the labels into numbers does not help either, because that column will be understood as a numerical attribute, so the arbitrary order and spacing you chose for the class labels will influence the results. The common trick is to create as many new variables as there are classes and turn them into 1-0 binary indicator columns (one-hot encoding).
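A sketch of that trick with an illustrative, made-up label column:

```python
import numpy as np

# Turn a categorical column into one 0/1 indicator column per class
# (one-hot encoding). The labels here are made up for illustration.
labels = np.array(["red", "green", "blue", "green"])
classes = np.unique(labels)                              # ['blue', 'green', 'red']

one_hot = (labels[:, None] == classes[None, :]).astype(float)
print(classes)
print(one_hot)
# Each row has a 1 in the column of its class and 0 elsewhere; these columns
# can then be appended to the numerical part of X.
```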
A common practice when the model does not predict the data well is to add extra columns to <math>X</math>, creating a new <math>\widetilde{X}</math> that has more features. Those extra columns are produced by applying non-linear transformations to one or more of the existing variables. The output is then no longer a linear combination of <math>X</math>, but it is a linear combination of <math>\widetilde{X}</math>. On the one hand this leads to a combinatorial explosion of possibilities, and on the other hand to an even greater explosion of computational time if we keep adding non-linear combinations. The rule of thumb is that unless we understand the nature of the input variables and the need for a specific non-linear transformation, adding non-linearities usually only results in a waste of time.
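A sketch of one such expansion, with a made-up target and a transformation (squaring) chosen purely for illustration:

```python
import numpy as np

# Append a transformed column (x^2) next to an intercept column, then fit
# the usual linear model on the enlarged X_tilde. The target is non-linear
# in X but linear in X_tilde.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_tilde = np.hstack([np.ones_like(X), X, X ** 2])   # (1 | x | x^2)

Y = 3.0 * X ** 2 + 1.0
beta = np.linalg.lstsq(X_tilde, Y, rcond=None)[0]
print(beta)                                         # roughly [[1.0], [0.0], [3.0]]
```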
As we can see in the example above, the model does not do a very good job of predicting one of the classes, even in areas where there is no evidence of ambiguity. In other words, the model is prone to underfitting.
This model cuts the <math>(X|Y)</math> hyperspace with a hyperplane, and all the 'dots' are supposed to lie close to its surface. The linear regressor expects the underlying model to itself be a linear combination of the inputs. If it is, then even if the observations carry some added noise, the prediction is the most accurate possible given the available data, as long as the noise has zero mean. This is, however, not the case in general.