11) Heat Map Correlation Matrix

This Blog entry is from the Linear Regression section in Learn R.

Multicollinearity refers to an Independent Variable that, while having a strong correlation to the Dependent Variable, also has an often unhelpful correlation to another Independent Variable, with that variable itself being quite well correlated to the Dependent Variable.

Multicollinearity can cause several issues, the most significant being the understatement of Independent Variable coefficients that would otherwise make a meaningful contribution to a model.
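To make the effect concrete, consider a minimal simulated sketch (the variable names here are hypothetical and not part of the FDX dataset): when two independent variables are near copies of one another, their individual coefficients become unstable and their standard errors inflate:

set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # x2 is almost a copy of x1
y <- x1 + rnorm(100)
summary(lm(y ~ x1 + x2))  # note the inflated standard errors on x1 and x2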

Multicollinearity is identified with the help of a Correlation Matrix, which has hitherto been used to identify the relationship between the Independent Variables and the Dependent Variable only.

From the Blog entry on creating the correlation matrix, a large correlation matrix already exists:

1.png

The task is to use matrix logic to identify correlations which exceed 0.7 or fall below -0.7 (as both extremes of +1 and -1 are equally troubling in this example).  The statement will use the or operator (i.e. |) to create a new correlation matrix:

PearsonColinearity <- Pearson <= -0.7 | Pearson >= 0.7
2.png

Run the line of script to console:

3.png

It can be seen that a new matrix has been created in the environment pane:

4.png

A click returns the matrix:

5.png

This matrix now shows, with a TRUE value, any variable combination which may suggest collinearity and require further inspection.
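As a convenience, the flagged pairs can also be listed at the console rather than scanned by eye. A short sketch using base R, keeping only the upper triangle so that each pair is reported once and the trivially TRUE diagonal is excluded:

which(PearsonColinearity & upper.tri(PearsonColinearity), arr.ind = TRUE)  # row/column indices of flagged pairs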

10) Create a Stepwise Linear Regression Model

This Blog entry is from the Linear Regression section in Learn R.

A stepwise Linear Regression model refers to adding independent variables in the order of their correlation strength in an effort to improve the overall predictive power of the model.  Referring to the output of the correlation analysis:

1.png

It can be seen that the next strongest independent variable, when taking a Pearson correlation, is Skew_3, followed by Range_2_PearsonCorrelation.  The process of forward stepwise linear regression is to add these variables to the model one by one, seeking improvement in the multiple R while retaining good p values.  To create a multiple linear regression model of the strongest correlating independent variables:

MultipleLinearRegression <- lm(Dependent ~ Skew_3 + Range_2_PearsonCorrelation, data = FDX)
2.png

Run the line of script to console:

3.png

Write the summary out to observe the multiple R:

summary(MultipleLinearRegression)

4.png

Run the line of script to console:

5.png

Several statistics are of interest in the multiple linear regression.  The first is the p values relating to the overall model and to the individual independent variables; each is expressed in scientific notation, from which it can be inferred that they are extremely small numbers, far below the arbitrarily chosen 0.05 cut-off.  Secondly, the multiple R statistic is of interest, which will be the target of improvement in subsequent iterations.
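These statistics can also be pulled from the summary object directly, which is convenient when comparing iterations. A short sketch using base R (the overall model p value is derived from the stored F statistic, as summary() does not store it directly):

s <- summary(MultipleLinearRegression)
s$r.squared      # multiple R squared
s$adj.r.squared  # adjusted R squared
coef(s)          # coefficient table, including the p values
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)  # overall model p value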

The next step is to add the next strongest correlating independent variable, which is PointStep_5_PearsonCorrelation:

MultipleLinearRegression <- lm(Dependent ~ Skew_3 + Range_2_PearsonCorrelation + PointStep_5_PearsonCorrelation, data = FDX)

6.png

Run the line of script to console:

7.png

In this example, it can be seen that the R squared has increased, so it can be inferred that the model has improved, while the p values remain extremely small. A more relevant value to watch is the adjusted R squared, which takes into account the number of independent variables and adjusts the multiple R accordingly; as such it is prudent to pay close attention to this value.

Repeat the procedure until the improvement in multiple R plateaus or the p values deteriorate.
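For completeness, base R can automate this procedure with the step() function, although it selects variables by AIC rather than by multiple R, so it is a related but not identical approach. A sketch, assuming the FDX data frame from the earlier Blog entries:

NullModel <- lm(Dependent ~ 1, data = FDX)  # intercept-only starting point
FullModel <- lm(Dependent ~ ., data = FDX)  # all candidate independent variables
StepModel <- step(NullModel, scope = formula(FullModel), direction = "forward")
summary(StepModel)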

4) Ranking Correlation by Absolute Strength

This Blog entry is from the Linear Regression section in Learn R.

In the previous Blog entry a correlation matrix was created and the first column was subset into a matrix by the name PearsonDependent.  The PearsonDependent matrix holds the strength of the relationship between each of the independent variables and the dependent variable.

The first task is to order the variables by the strength of their ABSOLUTE correlation, as both -1 and +1 are equally interesting extremes.  The abs() function in R makes this transformation effortless:

PearsonDependentAbs <- abs(PearsonDependent)
1.png

Run the line of script to console:

2.png

It can be seen from the environment pane that a new matrix has been created:

3.png

In this instance, any negative number has been turned into a positive number, as observed by a single click in the environment pane:

4.png

The task remains to order the matrix from the highest value to the lowest.  This can be achieved with a simple click on the column header in the matrix viewer (click once for ascending, again for descending):

5.png

While there are methods to order a matrix in R, they can be convoluted, and the arrange() function from dplyr does not work here as the matrix is not a data frame (although it could be converted to a data frame by passing the matrix to the data.frame() function).

In view of this process being exploratory and not necessarily needing to be recreated, the manual ordering in the view pane is adequate.
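Should a scripted ordering be preferred nonetheless, base R's order() function can achieve it in one line without any conversion to a data frame; a sketch using the matrix created above:

PearsonDependentAbs[order(PearsonDependentAbs[, 1], decreasing = TRUE), , drop = FALSE]  # strongest correlation first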

3) Create a Correlation Matrix using Spearman and Pearson

This Blog entry is from the Linear Regression section in Learn R.

Correlation is a measure of the strength and direction of a relationship.  It is a single value that ranges from -1 to +1, signalling both the direction and the strength of the relationship.  Both -1 and +1 are, in their extremes, equally interesting.  A correlation matrix takes all the variables together and produces the correlation value, the strength of their relationship in one direction or another, between each pair of variables.
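A quick sketch at the console illustrates the two extremes with made-up vectors:

cor(c(1, 2, 3, 4), c(2, 4, 6, 8))  # returns 1, a perfect positive relationship
cor(c(1, 2, 3, 4), c(8, 6, 4, 2))  # returns -1, a perfect negative relationship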

The matrix will be the foundation for many of the techniques used in the following Blog entries. In R the cor() function is used to produce correlation matrices on data frames.  To create a Pearson correlation matrix:

Pearson <- cor(FDX, use = "complete.obs", method = "pearson")
1.png

It can be seen that the cor() function takes the FDX data frame as its source.  The method argument specifies which type of correlation calculation to perform, an alternative would be "spearman".

Lastly "use" argument tells the cor() function how to deal with missing or bad data,  whereby the default is to throw an error,  hence it is a good idea to specify "complete" when working with very large datasets else it is likely the entire matrix would be returned as "NA".

Run the line of script to console:

2.png

It can be seen that a matrix by the name of Pearson has been created and is available in the environment pane:

3.png

Clicking on the entry in the environment pane would expand a view panel and display a more visually satisfying correlation matrix:

4.png
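Although this walkthrough proceeds with Pearson, the heading also promises Spearman; a parallel call on the same FDX data frame would be as follows, with the remaining steps applying unchanged:

Spearman <- cor(FDX, use = "complete.obs", method = "spearman")  # rank-based alternative to Pearson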

As the Pearson correlation is a matrix object, it can be interacted with via subscripting.  While the correlation matrix is extremely useful for identifying collinearity, at this stage the main point of interest is the relationships to the dependent variable only. 

To return just the Dependent column:

PearsonDependent <- Pearson[, "Dependent", drop = FALSE]
5.png

In this example the matrix is being subset to bring back all rows, by leaving the first argument blank, while specifying only the "Dependent" column.  By default subsetting returns the simplest possible structure, and it cannot be assumed that this will be the same structure as the original matrix; hence the drop = FALSE argument is used to ensure that the structure is preserved (that is to say, a matrix of rows and columns).

Run the line of script to console:

6.png

It can be seen that a new matrix has been created in the environment pane:

7.png

Clicking on the new matrix titled PearsonDependent will expand into the script window:

8.png

It can be seen that only the first column has been returned, making the matrix far less unwieldy to work with in subsequent Blog entries.
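As a final check, the structure of the new object can be confirmed at the console, verifying that drop = FALSE preserved the matrix form:

class(PearsonDependent)  # "matrix" "array" in R 4.0+, confirming it is still a matrix
dim(PearsonDependent)    # number of rows by 1 column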