10) Create a Stepwise Linear Regression Model

This Blog entry is from the Linear Regression section in Learn R.

A stepwise Linear Regression model refers to adding independent variables in the order of their correlation strength in an effort to improve the overall predictive power of the model.  Referring to the output of the correlation analysis:

It can be seen that the next strongest independent variable, when taking a Pearson correlation, is Skew_3, followed by Range_2_PearsonCorrelation.  The process of forward stepwise linear regression adds these variables to the model one by one, seeking improvement in the multiple R-squared while retaining good p-values.  To create a multiple linear regression model of the strongest correlating independent variables:

MultipleLinearRegression <- lm(Dependent ~ Skew_3 + Range_2_PearsonCorrelation, data = FDX)

Run the line of script to console:

Write the summary out to observe the multiple R-squared:

summary(MultipleLinearRegression)

Run the line of script to console:

Several statistics are of interest in the multiple linear regression.  The first is the set of p-values for the overall model and for each independent variable.  Each of these is printed in scientific notation, so it can be inferred that they are extremely small numbers, far below the conventional (and somewhat arbitrary) 0.05 cut-off.  The second is the multiple R-squared statistic, which will be the target of improvement in subsequent iterations.
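
The statistics described above can also be read from the summary object rather than from the console printout.  The following is a minimal sketch on simulated data; the data frame Sim and its columns X1, X2 and Y are hypothetical stand-ins for FDX and its variables:

```r
# Simulated stand-in for FDX (hypothetical data and column names)
set.seed(42)
Sim <- data.frame(X1 = rnorm(200), X2 = rnorm(200))
Sim$Y <- 0.5 * Sim$X1 - 0.3 * Sim$X2 + rnorm(200)

Model <- lm(Y ~ X1 + X2, data = Sim)
ModelSummary <- summary(Model)

# p-values for the intercept and each independent variable
CoefficientPValues <- ModelSummary$coefficients[, "Pr(>|t|)"]

# Multiple R-squared and adjusted R-squared
RSquared <- ModelSummary$r.squared
AdjustedRSquared <- ModelSummary$adj.r.squared

# Overall model p-value, derived from the F statistic
FStat <- ModelSummary$fstatistic
ModelPValue <- pf(FStat[["value"]], FStat[["numdf"]], FStat[["dendf"]],
                  lower.tail = FALSE)

print(CoefficientPValues)
print(c(RSquared = RSquared, AdjustedRSquared = AdjustedRSquared,
        ModelPValue = ModelPValue))
```

Reading the values this way makes it easier to compare one iteration of the model with the next without scanning the printed summary each time.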

The next step is to add the next strongest correlating independent variable, which is PointStep_5_PearsonCorrelation:

MultipleLinearRegression <- lm(Dependent ~ Skew_3 + Range_2_PearsonCorrelation + PointStep_5_PearsonCorrelation, data = FDX)

Run the line of script to console:

In this example, it can be seen that the multiple R-squared has increased, so it can be inferred that the model has improved, while the p-values are still extremely small.  A more relevant value is the adjusted R-squared, which penalises the multiple R-squared for the number of independent variables in the model.  Because the plain multiple R-squared never decreases when a variable is added, it is prudent to pay close attention to the adjusted value.

Repeat the procedure until the improvement in adjusted R-squared plateaus or the p-values of the independent variables deteriorate.
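
The repeated procedure can be sketched as a loop.  This is only an illustration on simulated data, not the exact FDX workflow: the data frame Sim and the columns A, B, C and Noise are hypothetical, and the candidates are ranked by absolute Pearson correlation before being added one at a time:

```r
# Simulated stand-in for FDX (hypothetical data and column names)
set.seed(1)
n <- 300
Sim <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n), Noise = rnorm(n))
Sim$Dependent <- 1.0 * Sim$A + 0.6 * Sim$B + 0.3 * Sim$C + rnorm(n)

# Rank candidates by absolute Pearson correlation with the dependent variable
Candidates <- c("A", "B", "C", "Noise")
Correlations <- sapply(Candidates, function(v) cor(Sim[[v]], Sim$Dependent))
Ordered <- names(sort(abs(Correlations), decreasing = TRUE))

# Add variables one at a time, recording adjusted R-squared at each step
AdjRSquared <- numeric(length(Ordered))
for (i in seq_along(Ordered)) {
  Formula <- reformulate(Ordered[1:i], response = "Dependent")
  Fit <- lm(Formula, data = Sim)
  AdjRSquared[i] <- summary(Fit)$adj.r.squared
}

Results <- data.frame(Step = seq_along(Ordered), Added = Ordered,
                      AdjRSquared = AdjRSquared)
print(Results)
```

The step at which AdjRSquared stops improving is a reasonable place to stop adding variables.  R also ships with the step() function, which automates stepwise selection using AIC; that is a different criterion from the correlation ordering described here.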

9) Identifying Confidence Intervals.

The confidence intervals can be thought of as the range within which the true coefficient for a given independent variable is expected to lie at a chosen level of confidence (95% in this example).  Unusually for regression software, R does not write out the confidence intervals by default; they are obtained by passing the linear regression model to the confint() function:

confint(LinearRegression, level = 0.95)

Run the line of script to console:

The confidence intervals for each of the values required to construct the linear regression formula have been written out.
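
To show where these numbers come from, the sketch below reproduces the intervals by hand as the estimate plus or minus a t quantile multiplied by the standard error.  It runs on simulated data; Sim, X and Y are hypothetical names, not part of FDX:

```r
# Simulated stand-in for FDX (hypothetical data and column names)
set.seed(7)
Sim <- data.frame(X = rnorm(150))
Sim$Y <- 0.02 - 0.05 * Sim$X + rnorm(150, sd = 0.1)

Fit <- lm(Y ~ X, data = Sim)

# Intervals as returned by confint()
Intervals <- confint(Fit, level = 0.95)

# The same intervals by hand: estimate +/- t quantile * standard error
Coefs <- summary(Fit)$coefficients
TQuantile <- qt(0.975, df = df.residual(Fit))
Manual <- cbind(Coefs[, "Estimate"] - TQuantile * Coefs[, "Std. Error"],
                Coefs[, "Estimate"] + TQuantile * Coefs[, "Std. Error"])

print(Intervals)
print(all.equal(unname(Intervals), unname(Manual)))
```

An interval that spans zero suggests that the corresponding coefficient is not reliably different from zero at the chosen confidence level.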

7) Deploying a One Way Linear Regression Manually with vector arithmetic.

The deployment formula for a linear regression model is quite straightforward and is simply a matter of taking the intercept then adding, in this example, the Median_4 value multiplied by the coefficient:

ManualLinearRegression <- 0.01758027 + (FDX$Median_4 * -0.05595731)

Run the line of script to console:

As vector arithmetic has been performed, the formula has been applied to every row of the data frame.  To add this vector to the FDX data frame, use the mutate() function of the dplyr package:

FDX <- mutate(FDX, ManualLinearRegression)

Run the line of script to console:

The mutate() function appends the vector to the FDX data frame.  To verify the column has been appended, view the FDX data frame:

View(FDX[,203])

Run the line of script to console:

Subsetting in the call to the View() function is far from ideal; it compensates for the inability of RStudio to display more than 100 columns in the grid.  In this example, there were 202 columns prior to calling the mutate() function and 203 afterwards:

The call to the View() function in this manner yields evidence that the column has been successfully added:
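
As a cross-check on the manual deployment, the intercept and coefficient can be read from the fitted model instead of being typed in, and the result compared against predict().  This is a minimal sketch on simulated data; Sim is a hypothetical stand-in for FDX, and base R assignment is used in place of mutate() so that no packages are needed:

```r
# Simulated stand-in for FDX (hypothetical data, two columns only)
set.seed(3)
Sim <- data.frame(Median_4 = rnorm(100))
Sim$Dependent <- 0.0176 - 0.056 * Sim$Median_4 + rnorm(100, sd = 0.05)

Fit <- lm(Dependent ~ Median_4, data = Sim)

# Read the intercept and coefficient from the model rather than typing them in
Intercept <- coef(Fit)[["(Intercept)"]]
Slope <- coef(Fit)[["Median_4"]]

# Vector arithmetic applies the formula to every row at once
Sim$ManualLinearRegression <- Intercept + Sim$Median_4 * Slope

# predict() should give the same values as the manual formula
Predicted <- unname(predict(Fit, newdata = Sim))
print(all.equal(Sim$ManualLinearRegression, Predicted))
```

Reading the coefficients with coef() avoids the rounding and transcription errors that can creep in when numbers are copied from the console.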

5) Adding a Trend Line to a Scatter Plot.

In a previous procedure, a scatter plot was created comparing the dependent variable with the independent variable Median_4.  In the scatter plot a relationship was, just about, identified.  To better visualise this relationship, a trend line can be added based on a line of best fit through the points on the scatter plot.

Firstly, revisit previous Blog entries in this section to create the scatter plot using ggplot2 and the qplot() function:

Run the line of script to console:

The actual formula for linear regression, as created by the lm() function, is explained in more depth in subsequent procedures.  For the moment, lm is simply specified as the method argument of the stat_smooth() function of ggplot2:

qplot(FDX$Median_4, FDX$Dependent) + stat_smooth(method = "lm")

Run the line of script to console:

It can be seen that a plot has been created as before, yet this time with a trend line representing a linear regression model:

It can be seen that there is a very shallow downward trend and that this linear regression solution has some predictive power, albeit very weak in isolation (hence the importance of multiple linear regression, which is explained in later procedures).
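
To make explicit what the trend line represents, the sketch below fits the line with lm() and draws it over a scatter plot.  It deliberately uses base R graphics (plot() and abline()) rather than ggplot2 so that it runs without any packages, and it uses simulated data; Sim is a hypothetical stand-in for FDX:

```r
# Simulated stand-in for FDX with a weak downward relationship (hypothetical data)
set.seed(5)
Sim <- data.frame(Median_4 = rnorm(2000))
Sim$Dependent <- 0.0176 - 0.056 * Sim$Median_4 + rnorm(2000, sd = 0.3)

# The trend line is an ordinary one-variable linear regression
TrendFit <- lm(Dependent ~ Median_4, data = Sim)

# Scatter plot with the fitted line drawn on top
plot(Sim$Median_4, Sim$Dependent, xlab = "Median_4", ylab = "Dependent")
abline(TrendFit, col = "blue", lwd = 2)

# The slope of the line is the coefficient for Median_4
print(coef(TrendFit))
```

A negative coefficient for Median_4 corresponds to a line that slopes downward from left to right.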

1) Scanning Scatter Plots for Relationships.

R has a function called pairs() which is incredibly useful for visualizing the relationships existing between variables inside a data frame on a fairly exhaustive basis.  It is possible to simply pass the data frame as an argument to the pairs function for an exhaustive visualization to be produced:

pairs(FDX)

Run the line of script to console:

In this example, the data frame is far too large, having hundreds of columns, and would create a visualization many times larger than the RStudio plots pane.  It follows that the columns used in the visualization need to be chosen more selectively, which is a simple matter of subsetting the data frame by column name, using square brackets, in the argument to the pairs() function:

pairs(FDX[c("Dependent", "Median_1", "Median_1_PearsonCorrelation", "Median_1_ZScore", "Mode_1", "Mode_1_PearsonCorrelation", "Mode_1_ZScore")])

Run the line of script to console to produce a matrix of scatter plots:

In this example, the relationships between the dependent variable and the independent variables are of most interest.  At a glance it can be seen that several strong relationships exist.

This process would be repeated, always including the dependent variable, for several other groups of independent variables until a good feel for how the independent variables relate to the dependent variable has been obtained.  This process can help identify independent variables that correlate well with the dependent variable, so that they can be carried forward for the purposes of modeling.
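
The visual scan can be paired with a numeric one.  The following is a minimal sketch, assuming simulated data in place of FDX; the data frame Sim and the column Unrelated are hypothetical.  It draws the scatter plot matrix for one group of columns and then ranks the same columns by absolute Pearson correlation with the dependent variable:

```r
# Simulated stand-in for FDX (hypothetical data and column names)
set.seed(11)
n <- 250
Sim <- data.frame(Median_1 = rnorm(n), Mode_1 = rnorm(n), Unrelated = rnorm(n))
Sim$Dependent <- 0.8 * Sim$Median_1 - 0.4 * Sim$Mode_1 + rnorm(n)

# Scatter plot matrix for a chosen group of columns
Group <- c("Dependent", "Median_1", "Mode_1", "Unrelated")
pairs(Sim[Group])

# Numeric companion: correlation of each variable with the dependent variable
# (the first column, Dependent itself, is dropped)
Correlations <- cor(Sim[Group])["Dependent", -1]
print(sort(abs(Correlations), decreasing = TRUE))
```

Changing the Group vector and re-running is one way to work through the data frame a handful of columns at a time.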