8) Create a Heat Map Correlation Matrix

This Blog entry is from the Linear Regression section in Learn Palisade.

Multicollinearity refers to an Independent variable that while having a strong correlation to the Dependent Variable, also has an often unhelpful correlation to another variable, with that variable also being quite well correlated to the Dependent Variable. 

Multicollinearity can cause several issue, the most significant is the understatement of Independent Variable coefficients that would otherwise have a remarkable contribution to a model.

We identity Multicollinearity with the help of a Correlation Matrix, which has hitherto been used to identity the relationship between just the Independent Variable and the Dependent Variable.

Start by creating a correlation matrix using Pearson or Spearman:

1.png

Highlight the body of the correlation table, which is where the conditional formatting rules will be applied:

2.png

The heat map is created by using a Not Between conditional formatting rule.  Click on Conditional Formatting to present the menu:

3.png

Then select New Rule.

Select the option ‘Format only cells that contain’:

4.png

Select the function ‘Not Between’:

5.png

Enter the acceptable range, in this case -0.7 and 0.7:

6.png

Click on the Format button to specify the highlighting for the heat map:

7.png

In this example the highlighting is simple red text, so change the font colour to Red from its default of Automatic:

8.png

Click OK to commit the formatting:

9.png

Then click OK to commit the rule:

10.png

Then click OK to commit all rules together:

11.png

The correlation matrix will now highlight in red those correlations that unacceptably covary:

12.png

7) Create a Stepwise Linear Regression Model.

This Blog entry is from the Linear Regression section in Learn Palisade.

Linear Regression models are not just limited to as single independent variable, rather they can have many.  As a methodology, the goal is to keep adding independent variables and observe the model improving in performance, then stopping at the point the model begins to degrade as more independent variables are added.

Create a one-way Linear Regression model, yet in this Blog entry specify the very next strongest correlating value as discovered, which in this example is Point 100, as a further independent variable for the analysis (i.e. both Kurtosis and Point 100 are selected in the I column):

1.png
2.png

Clicking ok will produce the same analysis, retaining just the single constant (i.e. starting point) yet calculating two coefficients for each independent variable specified:

3.png

Note two values from the first model created, the Multiple R and the P Values:

4.png

By adding more Independent Variables as per this Blog entry, the goal is to get a higher multiple R value (similar in this regard to an absolute correlation) while maintaining P Values of less than 5% across the enter model (this would be values less than 0.05, although it is open to great subjective interpretation):

5.png

In this example, an improvement in the Multiple R has been observed, while the P values look to be infinitesimally small, thus procedural, it could be said that the new model is champion, having challenged the previous model.  Deploying the model would be identical to that described in Linear Regression, except the formula would be extended as below to include the new independent variable coefficient:

= 0.000284873 + (Kurtosis Value * 0.00017277) + (Point100 Value * 0.039631235)

Where Kurtosis Value is in cell N2 and Point100 Value is in cell W2 in this example, set out this formula in the same manner as Linear Regression:

6.png

Commit the formula, fill down and name the column Model_2:

7.png

Repeat this Block entry for the very next strongest correlation until such time that the Multiple R ceases to increase or the P Values increase beyond 5% (or an acceptable threshold as determined by the analyst), at which point stop with a view to improving the model using only analyst best judgement (i.e. adding and removing independent variables that make intuitive sense to garner improvement).  Using this Blog entry, it is expected that a model will improve to around five variables, but could be many more depending on the complexity (indeed creativity) of the abstraction process.

6) Deploying a One Way Linear Regression Model in Excel.

This Blog entry is from the Linear Regression section in Learn Palisade.

The one-way linear regression output contains, at this stage, two values to be concerned with for the purpose of deployment, the Constant value and the Coefficient for the value for the single independent variable:

1.png

The Constant represents a static value that will always be the starting point of a linear regression model, whereas the coefficient value for the Independent Variable would be a value that is added to the constant, as the multiplication of the Independent variables prevailing value. 

In this example, the one-way linear regression model will be deployed as a further abstracted variable in the original dataset (a precursor to ensemble modelling, including boosting).  Making note of the two values of interest in the Linear Regression output, navigate back to the dataset and select the cell in first row immediately after the last variable, in this example cell AD1:

2.png

The formula that will be built in this Blog entry takes the form:

= Constant + (Kurtosis Value * Kurtosis Coefficient)

Put differently, using the values output from this Blog entries Linear Regression model:

= 0.000244006 + (Kurtosis Value * 0.000147134)

Kurtosis Value, in this example, will be the dynamic value contained in the example, which in this case is located in cell N2.  To construct the formula, start by entering the formula in cell AD1:

= 0.000244006 + (

3.png

Then reference the variable Kurtosis in the exemplar, in this example cell N2:

=0.000244006 + (N2

4.png

Multiplying the Kurtosis Value by the coefficient are specified, indeed calculated, in the Linear Regression model:

=0.000244006 + (N2 * 0.000147134

5.png

Close the parenthesis to complete the formula, fill down, and name the column Model_1:

6.png

The model has now been deployed across all examples in the file.  Scientific notation in the variable, the E value, suggests that the changes are forecast to be very small, hence many decimal places. 

Scientific notation (the E) can be corrected by setting the number value in the column to halt at the fifth decimal place, for example, using the format cell functionality of Excel (a Blog entry detailed outside of the scope of explicit documentation in this guide):

7.png
8.png

5) Creating a One Way Linear Regression Model.

This Blog entry is from the Linear Regression section in Learn Palisade.

A one-way linear regression model is the first model created in this Blog that has predictive power.  In actuality, a linear regression model was brought to bear in drawing a trend line, however this Blog entry introduces a more detailed means to create a predictive model via linear regression.

Having ranked the correlation strength of the independent variables vs the dependent variable, note the strongest correlating value which is in this case the variable Kurtosis.  The strongest correlating variable will always be the starting point in the development of a predictive model in this methodology.

For the dataset, in the StatTools ribbon, select the menu Regression and Classification towards the centre right of the ribbon, then click on the sub menu Regression (referring to linear regression):

1.png

The regression window will appear, shifting from the nomenclature of x and Y (referring to the axis on a chart) to Dependent and Independent (D and I respectively):

2.png

The dependent variable for the purpose of these Blog entries is intuitively labelled Dependent in the datasets.  For the variable Dependent, ensure that the D column is selected (only one can be selected):

3.png

The Independent Variable, being the first and strongest correlating value in this example Kurtosis will be selected in the I column I:

4.png

The default parameters are adequate for the model, thus clicking OK will set about calculating a Linear Regression model.  StatTools will return a linear regression output:

5.png

4) Adding a Trend Line to a Scatter Plot.

This Blog entry is from the Linear Regression section in Learn Palisade.

The scatter plot example created is visually difficult to inspect, there is just about a relationship depicted, but it is shallow and the direction of relationship not immediately obvious. 

Excel can help by adding a trend line which, at this stage at least, will give a general indication of relationship and direction of that relationship.  In fact, this is an implementation of one-way linear regression, covered in more depth in this module.

 Hover over the chart created by StatTools, then right click:

1.png

Click the sub menu item Trend Line:

2.png

A trend line is drawn that shows a very slight relationship, however it would be reasonable to conclude that the relationship is slim to non-existent in this data.  Financial market data is however notoriously difficult to forecast, hence performing this analysis across a wider array of variables and bringing several variables together is required for the purposes of predictive modelling.

3) Ranking Correlation by Absolute Strength.

This Blog entry is from the Linear Regression section in Learn Palisade.

Correlations, and strength, is used predominantly to find the strongest correlating independent variables for the purposes of carrying that variable through to predictive analytics.  The variables with the strongest correlations, in one direction or another, are carried forward to predictive analytics techniques in subsequent Blog entries.

Firstly, select the Dependents variable and select all of the independent variables correlating, to whatever degree, to that Independent Variable.  In this example it would involve selecting the first two columns in the correlation matrix, where Dependent is across the top, and the independent variables are along the left-hand side, and of course the values to the right of the names.  Once selected, right click on this selection, then copy:

1.png

Create a new worksheet by clicking the tab with a plus sign at the base of the workbook currently in focus:

2.png

Once the new workbook is in focus, select cell A1, right click and paste the values:

3.png

A more succinct comparison between the Dependent and Independent variables is presented, although it is not ordered by the strength of correlation:

4.png

Both -1 and +1 are equally interesting for the purposes of variable selection, and so, we cannot simply order by the correlation without first levelling it to an absolute – this is positive – value.  Start by creating a new column called ABS, in this example select cell C1, setting the value to ‘ABS Value’, although the naming is quite unimportant:

5.png

For each of the Independent Variables, apply the Excel ABS Function to the value.  In this example, the first cell of interest is in C3, being the Mean.  In cell C3, start by typing the formula:

=ABS(

6.png

Select the Independent Variable correlation value, which is in cell B3 in this example:

=ABS(B3

7.png

Close the parenthesis to complete the formula, then fill down:

8.png

At this stage, as the Independent Variable correlations have been set to an absolute value, only values close to +1 are deemed interesting. It is helpful to order the absolute correlations from high to low, where the strongest correlations are at the top, the weakest at the bottom. 

Click cell A1 to reset the selection, if anything is selected, then navigate to the data ribbon, select the icon Sort towards the middle of the ribbon, which will infer the dataset:

9.png

Select the parameters Sort By ‘ABS Value’ then the order ‘Largest to Smallest’:

10.png

On clicking Ok, the values are ordered based on their strength of correlation:

11.png

It is extremely likely that those correlations towards the top of the table will be included in predictive modelling, while those correlations towards the bottom of the table will be disregarded. 

It is not, at this stage at least, advisable to make a determination of what variables should or should not be in a predictive analytics model, instead relying on stepwise approaches as covered in subsequent Blog entries.

2) Create a Correlation Matrix using Spearman and Pearson.

This Blog entry is from the Linear Regression section in Learn Palisade.

Correlation is a measure of relationship and direction of that relationship.  It is a single value that ranges from -1 to +1, which would signal the direction and strength of a relationship.  Both -1 and +1 in their extremes are equally interesting.  A correlation matrix takes all the variables together and produces the correlation value between each variable.  This matrix will be the bedrock, starting point, for many of the techniques used in the following Blog entries.

In the StatTools ribbon, towards the centre of the ribbon select summary statistics then the sub menu Correlation and Covariance:

1.png

The correlation and covariance window will open:

2.png

In this Blog entry perform correlation analysis against all abstracted variables, this is the Independent and Dependent variables created.  Start by selecting all variables.  Towards the top left of the variables grid right click, then click select all:

3.png

Upon having selected all the variables, deselect the raw variables in the file – this is all variables before the dependent variable in this example:

4.png
5.png

Notice that the Correlation analysis is set to Pearson Linear, while the Covariances is also selected.  Covariances are superfluous to this Blog entry and can be deselected:

6.png

Click OK to produce the Correlation matrix.  NOTE THAT THIS MAY TAKE SOME TIME IN EXTREMLY LARGE DATASETS.  The correlation matrix will return after some time processing:

7.png

1) Creating a Scatter Plot to inspect a potential relationship between Dependent and Independent Variable.

This Blog entry is from the Linear Regression section in Learn Palisade.

This Blog thus far has focused on performing analysis of just a single independent variable.  This would be referred to as Univariate Analysis.  Bivariate analysis would refer to bringing two values together to estimate their relationship.  In terms of visualisation of a relationship, this could be achieved by visually inspecting a scatter plot comparing one value to another as dots on a chart with two axes.

The Scatter Plot Analysis is available in the StatTools Ribbon, click the button Summary Graphs towards the centre of the ribbon, then ScatterPlot in the sub menu:

1.png

The ScatterPlot window will open which will ask for parameters for the X and Y:

2.png
3.png

Y is the vertical axis; X is the horizontal axis.  In this example, the Dependent Variable (titled Dependent, being the actual price change observed) is to be on the Y axis, with the Skew (this is the prices tending towards higher or lower) on the X axis.

4.png

At this stage it is not necessary to display the Correlation Coefficient, however, the default is acceptable, where Display Correlation Coefficient is checked.  Click OK to perform the analysis, a Scatter Plot will be drawn and returned:

5.png

The intention is to visually inspect the scatter plot to see if the mass tends with an increase. in the Skew. Furthermore, it shows how the values are clustered together, in some instances there may be distinct groups emerge which need to be split (i.e. maybe a different model for one cluster).

In this example, it is visually quite hard to draw any conclusion as the scatter plot is far too dense, having too many dots overlaying each other.  A solution is to take a random sample, to reduce the size of the dataset without losing any significance of meaning.  Creating a scatter plot on the smaller dataset makes the Scatter Plot much easier to understand.

6.png

In the above example while there are far fewer points on the scatter plot,  the overall shape is broadly the same.

Introduction to Linear Regression

This Blog entry is from the Linear Regression section in Learn Palisade.

Linear Regression is a modelling technique that can be used for numeric value prediction.  It is a feature of StatTools, as such loading StatTools as in Module 2 is a prerequisite.

The dataset that is used in these examples is available in \Training\Data\FX\EURUSD\EURUSD_Abstraction_Example_Clean.csv which is a clean \ finalised version of Desktop\Training\Data\FX\EURUSD\EURUSD_Abstraction_Example.xslx, which contains reference formulas for training purposes.

Open the file \Training\Data\FX\EURUSD\EURUSD_Abstraction_Example_Clean.csv and perform procedure 1 to load the file into the StatTools realm as a dataset:

1.png