11) Heat Map Correlation Matrix

This Blog entry is from the Linear Regression section in Learn R.

Multicollinearity refers to an Independent Variable that, while having a strong correlation to the Dependent Variable, also has an often-unhelpful correlation to another Independent Variable, with that variable itself being well correlated to the Dependent Variable.

Multicollinearity can cause several issues, the most significant being the understatement of coefficients for Independent Variables that would otherwise make a substantial contribution to the model.

Multicollinearity is identified with the help of a Correlation Matrix, which has hitherto been used to identify the relationship between the Independent Variables and the Dependent Variable only.

From the Blog entry on creating the correlation matrix, there exists a large correlation matrix:

1.png

The task is to use matrix logic to identify correlations which exceed 0.7 or fall below -0.7 (as both extremes of +1 and -1 are equally troubling in this example).  The statement will use the or operator (i.e. |) to create a new correlation matrix:

PearsonColinearity <- Pearson <= -0.7 | Pearson >= 0.7
2.png

Run the line of script to console:

3.png

It can be seen that a new matrix has been created in the environment pane:

4.png

Clicking the entry returns the matrix:

5.png

This matrix now shows, with a TRUE value, any variable combination which may suggest collinearity and requires further inspection.
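The same matrix logic can be reproduced on any correlation matrix. A minimal self-contained sketch, using the built-in mtcars data set in place of the larger matrix from the earlier Blog entry:

```r
# Build a Pearson correlation matrix from a handful of mtcars columns,
# then flag any correlation at or beyond +/- 0.7.  The diagonal is always
# TRUE, as every variable correlates perfectly with itself.
Pearson <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
PearsonColinearity <- Pearson <= -0.7 | Pearson >= 0.7
PearsonColinearity
```

Any TRUE off the diagonal flags a variable pair worth inspecting further.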

10) Create a Stepwise Linear Regression Model

This Blog entry is from the Linear Regression section in Learn R.

A stepwise Linear Regression model refers to adding independent variables in the order of their correlation strength in an effort to improve the overall predictive power of the model.  Referring to the output of the correlation analysis:

1.png

It can be seen that the next strongest Independent Variable, when taking a Pearson correlation, is Skew_3, followed by Range_2_PearsonCorrelation.  The process of forward stepwise linear regression is to add these variables to the model one by one, seeking improvement in the Multiple R while retaining good P values.  To create a multiple linear regression model of the strongest correlating independent variables:

MultipleLinearRegression <- lm(Dependent ~ Skew_3 + Range_2_PearsonCorrelation)
2.png

Run the line of script to console:

3.png

Write the summary out to observe the multiple R:

summary(MultipleLinearRegression)

4.png

Run the line of script to console:

5.png

Several statistics are of interest in the multiple linear regression.  The first is the P values relating to the overall model and to each Independent Variable; each is expressed in scientific notation, so it can be inferred that they are extremely small numbers, far below the conventional 0.05 cut-off.  Secondly, the Multiple R statistic is of interest, as it will be the target of improvement in subsequent iterations.
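These statistics can also be read programmatically from the summary object, which is convenient when iterating. A sketch using the built-in mtcars data set, with wt and hp standing in for the correlating variables above:

```r
# Fit a two-variable model and pull the statistics discussed above
# directly out of the summary object.
MultipleLinearRegression <- lm(mpg ~ wt + hp, data = mtcars)
ModelSummary <- summary(MultipleLinearRegression)

ModelSummary$r.squared                     # Multiple R-squared
ModelSummary$adj.r.squared                 # Adjusted R-squared
ModelSummary$coefficients[, "Pr(>|t|)"]    # P values per term
```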

The next step is to add the next strongest correlating independent variable, which is PointStep_5_PearsonCorrelation:

MultipleLinearRegression <- lm(Dependent ~ Skew_3 + Range_2_PearsonCorrelation + PointStep_5_PearsonCorrelation)

6.png

Run the line of script to console:

7.png

In this example, it can be seen that the R squared has increased, so it can be inferred that the model has improved, while the P values remain extremely small.  A more relevant value to pay attention to is the Adjusted R, which takes into account the number of Independent Variables and adjusts the Multiple R accordingly; as such, it is prudent to pay close attention to this value.

Repeat the procedure until the improvement in Multiple R plateaus or the P values deteriorate.
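For completeness, base R can automate this repeat-until-plateau loop with the step() function, albeit selecting by AIC rather than Multiple R. A sketch on the built-in mtcars data set:

```r
# Start from an intercept-only model and let step() add variables
# one at a time, keeping each addition only while it improves the AIC.
BaseModel <- lm(mpg ~ 1, data = mtcars)
ForwardModel <- step(BaseModel,
                     scope = ~ wt + hp + disp + qsec,
                     direction = "forward")
summary(ForwardModel)
```

The manual procedure described above remains useful when Multiple R and the P values, rather than AIC, are the criteria of interest.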

9) Identifying Confidence Intervals

This Blog entry is from the Linear Regression section in Learn R.

The confidence intervals can be thought of as the boundaries within which the coefficient, for a given Independent Variable, can be moved up and down while still maintaining statistical confidence.  Unusually for regression software, the confidence intervals are not written out by default; they need to be called by passing the linear regression model to the confint() function:

confint(LinearRegression, level = 0.95)
1.png

Run the line of script to console:

2.png

The confidence intervals for each of the values required to construct the linear regression formula have been written out.
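As a self-contained illustration on the built-in mtcars data set (the generic confint() dispatches to confint.lm for a linear model, and level = 0.95 is in fact the default):

```r
# Fit a one-way linear regression and request the 95% confidence
# intervals for the intercept and the wt coefficient.
LinearRegression <- lm(mpg ~ wt, data = mtcars)
confint(LinearRegression, level = 0.95)
```

Each row gives the lower (2.5%) and upper (97.5%) bound for one term of the regression formula.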

8) Using the predict function for a one way linear regression

This Blog entry is from the Linear Regression section in Learn R.

Deploying a linear regression model manually is rather simple; however, there is an even simpler method available in calling the predict() function, which takes a model and a data frame as its parameters, returning a prediction vector.

AutomaticLinearRegression <- predict(LinearRegression, FDX)
1.png

Run the line of script to console:

2.png

Add the newly created vector to the FDX data frame:

FDX <- mutate(FDX, AutomaticLinearRegression)
3.png

Run the line of script to console:

4.png

To view the last two columns of the data frame, containing a manually derived prediction and automatically derived prediction:

View(FDX[,203:204])
5.png

Run the line of script to console:

6.png

The manual and automatic predictions, shown side by side, are identical.  It follows that the automatic prediction is a much more concise means of executing a prediction based upon a linear regression model created in R:

7.png
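The equivalence can be demonstrated on any data set. A sketch using the built-in mtcars data set in place of FDX, comparing the manual formula against predict():

```r
# Fit a one-way model, deploy it manually from the coefficients,
# then deploy it with predict() and confirm the two vectors match.
LinearRegression <- lm(mpg ~ wt, data = mtcars)
Manual <- coef(LinearRegression)[1] + mtcars$wt * coef(LinearRegression)[2]
Automatic <- predict(LinearRegression, mtcars)
all.equal(unname(Manual), unname(Automatic))  # TRUE
```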

7) Deploying a One Way Linear Regression Manually with vector arithmetic

This Blog entry is from the Linear Regression section in Learn R.

The deployment formula for a linear regression model is quite straightforward and is simply a matter of taking the intercept then adding, in this example, the Median_4 value multiplied by the coefficient:

ManualLinearRegression <- 0.01758027 + (FDX$Median_4 * -0.05595731)
1.png

Run the line of script to console:

2.png
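Hard-coding the intercept and coefficient, as above, is fragile if the model is ever refitted. A sketch of extracting them with coef() instead, using the built-in mtcars data set in place of FDX:

```r
# coef() returns the fitted intercept and slope by name, so the
# deployment formula can be written without transcribing numbers.
LinearRegression <- lm(mpg ~ wt, data = mtcars)
b <- coef(LinearRegression)              # b[1] = intercept, b[2] = slope
ManualLinearRegression <- b[1] + mtcars$wt * b[2]
head(ManualLinearRegression)
```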

As vector arithmetic has been performed, the formula has been applied to every row of the data frame.  To add this vector to the FDX data frame, use the mutate() function of dplyr:

FDX <- mutate(FDX, ManualLinearRegression)
3.png

Run the line of script to console:

4.png

The mutate() function appends the vector to the FDX data frame.  To verify the column has been appended, view the FDX data frame:

View(FDX[,203])
5.png

Run the line of script to console:

6.png

The use of subsetting in the call to the View() function is less than ideal; it compensates for the inability of RStudio to display more than 100 columns in the grid.  In this example, there were 202 columns before calling the mutate() function and 203 afterwards:

7.png

The call to the View() function in this manner yields evidence that the column has been successfully added:

8.png