4) Fitting a distribution

This Blog entry is from the Monte Carlo Model section in Learn Palisade.

A triangular distribution is extremely useful for modelling a variety of business scenarios, especially when its parameters are elicited from experts in the absence of data.

If data does exist, however, it is more useful to fit a distribution that reflects that data.

@Risk can handle a very wide range of distributions and has a means to perform statistical tests to see which distribution reflects the data most closely.

Start by opening a dataset for which you wish to perform fitting.  In this example the underlying dataset used to create the Regression Model will be used: the GBPUSD with 5M bars, available at \Data\FX\GBPUSD with file name GBPUSD_Abstraction_Example_Clean.csv:

1.png

Start by highlighting the column to fit:

J:J

Click the button Distribution Fitting in the Palisade @Risk Ribbon:

Then, on the sub menu, click Fit to open the fitting options:

4.png

Retaining the default options, click the Fit button:

5.png

The tool will attempt to fit the selected data against a series of distributions, before selecting the distribution with the best fit:

6.png

In this example the data appears to be distributed fairly symmetrically about the average, yet the tool has determined that a Weibull distribution is the best fit.  Scores are presented for several other distributions towards the left-hand side of the window; it may be appropriate to select one on a more subjective or intuitive basis:

7.png

In this example, the best-fitting distribution will be written out as an @Risk function which can be used natively in future simulations.

Click Write to Cell to review the distribution formula:

8.png

It is quite common that the dataset being used is not the same dataset that the @Risk model has been deployed to.  It follows that it is often easier to copy or make a note of the formula for continued and future use.  As such, highlight the formula in the text box:

9.png

=RiskWeibull(6.8331,0.049509,RiskShift(-0.045066),RiskName("Mean"))

An alternative approach is to click Next, where a target cell can be specified:

10.png

Selecting a cell to write the formula is a less flexible approach in practical use and thus the former method is recommended, keeping a journal of fitted distributions for future reference.
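The fitted RiskWeibull formula above can be reproduced outside @Risk for journaling and sanity-checking. A minimal sketch in pure Python (standard library only), using the shape, scale and shift parameters from the formula above to compute the distribution's analytic mean and to draw samples via the inverse CDF:

```python
import math
import random

# Parameters taken from the fitted @Risk formula:
# =RiskWeibull(6.8331,0.049509,RiskShift(-0.045066),RiskName("Mean"))
shape, scale, shift = 6.8331, 0.049509, -0.045066

# Analytic mean of a shifted Weibull: shift + scale * Gamma(1 + 1/shape)
mean = shift + scale * math.gamma(1 + 1 / shape)

def sample(rng: random.Random) -> float:
    """One draw via the inverse CDF: x = shift + scale * (-ln(1-u))**(1/shape)."""
    u = rng.random()
    return shift + scale * (-math.log(1 - u)) ** (1 / shape)

rng = random.Random(42)
draws = [sample(rng) for _ in range(50_000)]
empirical_mean = sum(draws) / len(draws)
print(mean, empirical_mean)
```

The empirical mean of the draws converges on the analytic mean, confirming the fitted parameters have been transcribed correctly into the journal.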

3) Performing Optimisation and Mitigation on Monte Carlo Simulation Data.

This Blog entry is from the Monte Carlo Model section in Learn Palisade.

To optimise and mitigate, we will need to review the simulation data.  Click on the Excel Reports button, then select Simulation Data (Combined):

1.png

On clicking OK a new spreadsheet will be created which will show each simulation:

Because this spreadsheet shows the inputs and the output for each simulation, we are able to filter it to identify optimal and/or troublesome scenarios.  In this example the filter will return all records where the forecast is less than -0.7, being a large move:

3.png

Using Summary Statistics we are now able to identify the scenarios that would cause a large move, which can be compared with other scenarios.

In the example below we can infer that the average skew for an optimal trade would be -23, versus the more typical value of -8, which would include losing trades.

4.png
5.png

It follows that in terms of optimisation we may determine to focus on trades where the skew is more pronounced.
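The filter-and-compare step above can be sketched outside Excel. A minimal pure-Python example using hypothetical simulation records (the field names and the -0.7 threshold mirror this Blog entry; the data itself is invented for illustration):

```python
from statistics import mean

# Hypothetical simulation records: one input (skew) and the model forecast
# per iteration, standing in for the Simulation Data (Combined) report.
simulations = [
    {"skew": -25.1, "forecast": -0.82},
    {"skew": -21.4, "forecast": -0.75},
    {"skew": -8.3,  "forecast": -0.31},
    {"skew": -7.9,  "forecast": -0.22},
    {"skew": -6.5,  "forecast": 0.05},
]

# Filter for scenarios where the forecast is less than -0.7 (a large move).
large_moves = [s for s in simulations if s["forecast"] < -0.7]

# Compare the average skew of the filtered scenarios with the full set.
skew_large = mean(s["skew"] for s in large_moves)
skew_all = mean(s["skew"] for s in simulations)
print(skew_large, skew_all)
```

With this invented data the filtered average skew is markedly more pronounced than the overall average, which is the pattern the Blog entry uses to direct optimisation.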

2) Running Monte Carlo Simulations.

This Blog entry is from the Monte Carlo Model section in Learn Palisade.

Set the simulation iterations to a suitably large number, for example 100,000:

1.png

Click Start Simulation to begin the simulation.  The probability density function will start to take shape:

2.png

In the above example we can determine that the market is, in all probability, going to move downward; in fact there is a 90% likelihood of a move of between 0 and -0.7.

The market is extremely unlikely to go up based on the last 8000 5M observations (although this could just as well indicate a correction). 

The shape of the distribution and its lack of normality would be cause for concern and would, in all likelihood, lead us to reconsider promoting this model and to revisit its validity.  It is fine for the purposes of our example, albeit with the caveat that the value of the optimisation and mitigation process is closely linked to the predictive performance of the model in the first instance.
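The simulation mechanics can be approximated outside @Risk. A minimal sketch using Python's standard library, drawing from the triangular distributions and applying the regression coefficients defined in the configuration Blog entry (the iteration count is reduced from 100,000 for speed; this is an illustration of the mechanism, not a substitute for the tool):

```python
import random

rng = random.Random(7)

# Triangular (min, most likely, max) parameters from the configuration Blog entry.
TRIANG = {
    "_4_Mode":      (-0.03038, -0.010,    0.00191),
    "_4_Median":    (-0.01913, -0.00078,  0.023948),
    "_4_Skew":      (-26.2921, -0.10329,  1.542245),
    "_4_HighPrice": (-0.03038, -0.01086,  0.00191),
}

# Regression coefficients and intercept from the same Blog entry.
COEF = {
    "_4_Mode": -0.000200686,
    "_4_Median": 0.03422778,
    "_4_Skew": -0.000148064,
    "_4_HighPrice": -0.070443585,
}
INTERCEPT = -0.000810676

def simulate_once() -> float:
    # random.triangular takes (low, high, mode).
    draws = {k: rng.triangular(lo, hi, ml) for k, (lo, ml, hi) in TRIANG.items()}
    return INTERCEPT + sum(COEF[k] * v for k, v in draws.items())

outputs = [simulate_once() for _ in range(10_000)]
avg = sum(outputs) / len(outputs)
print(min(outputs), avg, max(outputs))
```

Collecting the outputs in this way is what allows the probability density function to be plotted and percentiles to be read off, as @Risk does automatically.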

1) Configure a Linear Regression Model and Define Triangular Distributions.

This Blog entry is from the Monte Carlo Model section in Learn Palisade.

In this guide we have implemented a Linear Regression Model in an Excel Spreadsheet across four strongly correlating variables from the training dataset.  We are using intraday GBPUSD with 5M candlesticks whereby we have a chart configured with 700 of the most recent candlesticks. 

The model is comprised of the following independent variables, coefficients and intercept:

Name            Value
Intercept       -0.000810676
_4_Mode         -0.000200686
_4_Median        0.03422778
_4_Skew         -0.000148064
_4_HighPrice    -0.070443585

1.png
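The model above can be expressed as a simple function of its inputs. A minimal Python sketch using the intercept and coefficients from the table; the example input values are purely illustrative (they are the Most Likely values of the triangular distributions defined in this Blog entry):

```python
# Intercept and coefficients from the fitted Linear Regression model.
INTERCEPT = -0.000810676
COEFFICIENTS = {
    "_4_Mode": -0.000200686,
    "_4_Median": 0.03422778,
    "_4_Skew": -0.000148064,
    "_4_HighPrice": -0.070443585,
}

def predict(inputs: dict) -> float:
    """Intercept plus the sum of each input multiplied by its coefficient."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value
                           for name, value in inputs.items())

# Illustrative inputs: the Most Likely values of the triangular distributions.
example = {"_4_Mode": -0.010, "_4_Median": -0.00078,
           "_4_Skew": -0.10329, "_4_HighPrice": -0.01086}
print(predict(example))
```

This is the deterministic calculation that @Risk repeats many thousands of times, each time with inputs drawn from the assumed distributions rather than fixed values.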

The triangular distributions are determined as follows, based on the data:

Field           Max         Min         Most Likely
_4_Mode         0.00191     -0.03038    -0.010
_4_Median       0.023948    -0.01913    -0.00078
_4_Skew         1.542245    -26.2921    -0.10329
_4_HighPrice    0.00191     -0.03038    -0.01086

Although we are using an explainable model, the function could just as easily be a call to a Neural Network DLL (although in the example of the Jube Capital optimisation software this is performed automatically upon Neural Network retraining, as shown in the conclusion):

2.png

Launch @Risk and click on the value cells:

3.png

Click on Define Assumption, then click Triang when the dialog opens:

4.png

Click Select Distribution to be presented with the configuration.  Using the values as described in the table above, enter the Min, Max and Most Likely for each variable:

5.png

Click OK to place the distribution assumption and then repeat for each variable:

6.png

Click on the formula output cell in yellow and click on the Add Output button:

7.png

Specify an appropriate name or retain the default, then click OK.

2) Output Logistic Regression Model as Probability and set Threshold Function.

This Blog entry is from the Linear Regression section in Learn Palisade.

Logistic regression output is constructed in exactly the same manner as in Linear Regression, insofar as there is a constant representing the starting point, to which the independent variable values multiplied by their coefficients are added.  In this example:

= -0.899434346 + (High_Risk_Country * 2.926138785)

1.png

Fill down and name the column Model_1:

2.png

The output ranges from -5 to +5; however, it is not linear, rather it is logarithmic (the log-odds):

3.png

The output is substantially more intuitive if converted to a probability, which ranges from 0 to 100 (or 0 to 1 if represented as a pure probability).  The formula to convert the current output to a probability is:

P = exp(Output) / (1 + exp(Output))

The formula above uses the EXP function in Excel.  As when creating a model output, select the last cell in the spreadsheet, in this example AW2:

4.png

The function will reference the output of the first model, which in our example is cell AV2.  Start entering the formula:

=exp(

Select cell AV2 as the model output in its raw state:

5.png

Then complete the formula referencing the output in the same manner:

=exp(AV2) / (1+exp(AV2))

6.png

Fill down and label the column Model_1_P:

7.png
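The conversion above can be reproduced outside Excel. A minimal Python sketch using the constant and coefficient from this Blog entry's one-variable model:

```python
import math

def model_output(high_risk_country: int) -> float:
    """Log-odds output of the one-variable logistic regression model."""
    return -0.899434346 + high_risk_country * 2.926138785

def to_probability(output: float) -> float:
    """P = exp(output) / (1 + exp(output)), mirroring the Excel formula."""
    return math.exp(output) / (1 + math.exp(output))

print(to_probability(model_output(0)))  # ~0.289
print(to_probability(model_output(1)))  # ~0.884
```

A transaction from a high-risk country is thus assigned roughly an 88% probability of fraud by this model, versus roughly 29% otherwise.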

Unlike Linear Regression models, which simply output a numeric value for use, classification models, creating a score, probability or otherwise, rely on the setting of a threshold as an activation function to declare, in this example, fraud.  In our example, the threshold is an 80% probability of fraud for the prediction to be considered as such.  Once again, the IF function will be brought to bear for the purposes of creating an activation function.

Select the last cell in the spreadsheet, in this example AX2, and begin an IF function referencing the Model_1_P value in cell AW2:

=IF(AW2>

8.png

A probability is expressed between 0 and 1, therefore .8 would represent 80% likelihood.  It follows that the threshold value would be .8, which continues the IF function:

=IF(AW2>0.8

9.png

Enter the remaining parameters: the value returned if true as 1, then the false return value as 0:

=IF(AW2>0.8,1,0

10.png

Complete the formula by closing the parentheses, fill down and name the column Model_1_Is_Fraud.  Accordingly, any example with a value of 1 would be considered activated:

11.png
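The completed activation function =IF(AW2>0.8,1,0) has a direct equivalent as a function; a minimal Python sketch:

```python
def is_fraud(probability: float, threshold: float = 0.8) -> int:
    """Equivalent of =IF(AW2>0.8,1,0): returns 1 when the fraud
    classification activates, 0 otherwise."""
    return 1 if probability > threshold else 0

print(is_fraud(0.8836))  # 1: above the 80% threshold
print(is_fraud(0.2892))  # 0: below the threshold
```

Varying the threshold trades false positives against false negatives, which is why it is set as an explicit parameter rather than hard-coded.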

1) Forward Stepwise Logistic Regression

This Blog entry is from the Linear Regression section in Learn Palisade.

The Blog entry to create a Logistic Regression model is almost identical to that of creating a Linear Regression model, in that the default options suffice, while the concepts of Dependent and Independent variables are used for the purposes of creating the model, rather than the X and Y specifications that had previously been used in other analyses.

Logistic Regression is available by clicking the Regression and Classification menu on the StatTools ribbon, then clicking Logistic Regression on the sub menu:

1.png

The logistic regression window will open:

2.png

The concept of stepwise Logistic Regression exists in the same manner as it does in Linear Regression.  Although not explicitly mentioned, this Blog entry assumes that correlation analysis has been performed on all variables and that the variable with the strongest correlation is carried forward as the starting independent variable, in this case High_Risk_Country (a pivoted categorical variable):

3.png

The dependent variable in this dataset is titled Dependent and represents the transaction being fraudulent or not:

4.png
5.png

While it is the default option, it is important to ensure that the ‘Include Classification Summary’ option is selected, as this provides an important performance measure for stepwise Logistic Regression.

Clicking OK will produce the Logistic Regression output:

6.png

Stepwise Linear Regression will by now be familiar, and the same concepts exist in Logistic Regression.  The performance measures in Logistic Regression differ from those of Linear Regression: P-Values need to be optimised in the same way and should ideally never exceed 5%, while the further optimisation values relate to the classification accuracy of the logistic regression model, for which improvement should always be sought:

7.png

It follows that the Logistic Regression model should be improved by adding the next strongest correlating variable seeking improvement in the classification accuracy while maintaining good P-Values.

8) Create a Heat Map Correlation Matrix

This Blog entry is from the Linear Regression section in Learn Palisade.

Multicollinearity refers to an Independent Variable that, while having a strong correlation to the Dependent Variable, also has an often unhelpful correlation to another Independent Variable, with that variable also being quite well correlated to the Dependent Variable.

Multicollinearity can cause several issues, the most significant being the understatement of Independent Variable coefficients that would otherwise make a remarkable contribution to a model.

We identify Multicollinearity with the help of a Correlation Matrix, which has hitherto been used only to identify the relationship between the Independent Variables and the Dependent Variable.
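The idea can be sketched numerically. A pure-Python example computing Pearson correlations among hypothetical independent variables and flagging pairs falling outside the -0.7 to 0.7 range used later in this Blog entry (the variable names and data are invented):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical independent variables: b is almost a copy of a (multicollinear),
# while c is unrelated noise.
variables = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.1],
    "c": [3.0, -1.0, 4.0, -2.0, 1.0],
}

# Flag any pair whose correlation falls outside the acceptable -0.7 to 0.7 range.
names = list(variables)
flagged = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
           if not -0.7 <= pearson(variables[p], variables[q]) <= 0.7]
print(flagged)
```

This is exactly what the Not Between conditional formatting rule does visually: it draws the eye to pairs of independent variables that unacceptably covary.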

Start by creating a correlation matrix using Pearson or Spearman:

1.png

Highlight the body of the correlation table, which is where the conditional formatting rules will be applied:

2.png

The heat map is created by using a Not Between conditional formatting rule.  Click on Conditional Formatting to present the menu:

3.png

Then select New Rule.

Select the option ‘Format only cells that contain’:

4.png

Select the function ‘Not Between’:

5.png

Enter the acceptable range, in this case -0.7 and 0.7:

6.png

Click on the Format button to specify the highlighting for the heat map:

7.png

In this example the highlighting is simple red text, so change the font colour to Red from its default of Automatic:

8.png

Click OK to commit the formatting:

9.png

Then click OK to commit the rule:

10.png

Then click OK to commit all rules together:

11.png

The correlation matrix will now highlight in red those correlations that unacceptably covary:

12.png

7) Create a Stepwise Linear Regression Model.

This Blog entry is from the Linear Regression section in Learn Palisade.

Linear Regression models are not limited to a single independent variable, rather they can have many.  As a methodology, the goal is to keep adding independent variables and observe the model improving in performance, stopping at the point at which the model begins to degrade as more independent variables are added.

Create a one-way Linear Regression model as before, but in this Blog entry also specify the next strongest correlating variable, which in this example is Point100, as a further independent variable for the analysis (i.e. both Kurtosis and Point100 are selected in the I column):

1.png
2.png

Clicking OK will produce the same analysis, retaining just the single constant (i.e. starting point) yet calculating a coefficient for each of the independent variables specified:

3.png

Note two values from the first model created, the Multiple R and the P Values:

4.png

By adding more Independent Variables as per this Blog entry, the goal is to get a higher Multiple R value (similar in this regard to an absolute correlation) while maintaining P Values of less than 5% across the entire model (this would be values less than 0.05, although the threshold is open to subjective interpretation):

5.png

In this example, an improvement in the Multiple R has been observed, while the P values look to be infinitesimally small; thus, procedurally, it could be said that the new model is the champion, having challenged the previous model.  Deploying the model would be identical to that described in Linear Regression, except the formula would be extended as below to include the new independent variable coefficient:

= 0.000284873 + (Kurtosis Value * 0.00017277) + (Point100 Value * 0.039631235)

Where Kurtosis Value is in cell N2 and Point100 Value is in cell W2 in this example, set out this formula in the same manner as Linear Regression:

6.png

Commit the formula, fill down and name the column Model_2:

7.png

Repeat this Blog entry for the next strongest correlation until such time that the Multiple R ceases to increase or the P Values increase beyond 5% (or an acceptable threshold as determined by the analyst), at which point stop with a view to improving the model using only analyst best judgement (i.e. adding and removing independent variables that make intuitive sense to garner improvement).  Using this Blog entry, it is expected that a model will improve to around five variables, but could be many more depending on the complexity (indeed creativity) of the abstraction process.
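The extended two-variable model above can be written as a function. A minimal Python sketch of the formula, with hypothetical input values purely for illustration (in the spreadsheet these would come from cells N2 and W2):

```python
# Constant and coefficients from the two-variable Linear Regression output.
def model_2(kurtosis: float, point100: float) -> float:
    """= 0.000284873 + (Kurtosis Value * 0.00017277)
       + (Point100 Value * 0.039631235)"""
    return 0.000284873 + kurtosis * 0.00017277 + point100 * 0.039631235

# Hypothetical input values for one example row.
print(model_2(1.0, 0.01))
```

Each stepwise iteration extends this function by one more coefficient term, which is why the deployment formula grows mechanically with the model.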

6) Deploying a One Way Linear Regression Model in Excel.

This Blog entry is from the Linear Regression section in Learn Palisade.

The one-way linear regression output contains, at this stage, two values to be concerned with for the purposes of deployment: the Constant value and the Coefficient for the single independent variable:

1.png

The Constant represents a static value that will always be the starting point of a linear regression model, whereas the Coefficient for the Independent Variable is multiplied by the Independent Variable's prevailing value, with the product added to the constant.

In this example, the one-way linear regression model will be deployed as a further abstracted variable in the original dataset (a precursor to ensemble modelling, including boosting).  Making note of the two values of interest in the Linear Regression output, navigate back to the dataset and select the cell in the first row immediately after the last variable, in this example cell AD1:

2.png

The formula that will be built in this Blog entry takes the form:

= Constant + (Kurtosis Value * Kurtosis Coefficient)

Put differently, using the values output from this Blog entry's Linear Regression model:

= 0.000244006 + (Kurtosis Value * 0.000147134)

Kurtosis Value, in this example, will be the dynamic value contained in each example, which in this case is located in cell N2.  To construct the formula, start by entering the formula in cell AD1:

= 0.000244006 + (

3.png

Then reference the variable Kurtosis in the exemplar, in this example cell N2:

=0.000244006 + (N2

4.png

Multiply the Kurtosis Value by the coefficient specified, indeed calculated, in the Linear Regression model:

=0.000244006 + (N2 * 0.000147134

5.png

Close the parenthesis to complete the formula, fill down, and name the column Model_1:

6.png

The model has now been deployed across all examples in the file.  Scientific notation in the column (the E value) indicates that the forecast changes are very small, hence the many decimal places.

Scientific notation (the E) can be corrected by formatting the numbers in the column to stop at the fifth decimal place, for example, using the Format Cells functionality of Excel (detailed outside the scope of this guide):

7.png
8.png
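Deploying the model across all examples, together with the five-decimal-place formatting, can be sketched as follows (the kurtosis values standing in for column N are hypothetical):

```python
# Constant and coefficient from the one-way Linear Regression output.
def model_1(kurtosis: float) -> float:
    """= 0.000244006 + (Kurtosis Value * 0.000147134)"""
    return 0.000244006 + kurtosis * 0.000147134

# Hypothetical kurtosis values, standing in for column N.
kurtosis_column = [2.0, -1.5, 0.25]

# "Fill down": apply the model to every example, then format to five decimal
# places to suppress scientific notation, as with Excel's Format Cells.
outputs = [model_1(k) for k in kurtosis_column]
formatted = [f"{y:.5f}" for y in outputs]
print(formatted)
```

Note that the formatting is purely presentational: the underlying values retain their full precision, just as Format Cells leaves the cell value untouched in Excel.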

5) Creating a One Way Linear Regression Model.

This Blog entry is from the Linear Regression section in Learn Palisade.

A one-way linear regression model is the first model created in this Blog that has predictive power.  In actuality, a linear regression model was brought to bear in drawing a trend line; however, this Blog entry introduces a more detailed means to create a predictive model via linear regression.

Having ranked the correlation strength of the independent variables vs the dependent variable, note the strongest correlating value which is in this case the variable Kurtosis.  The strongest correlating variable will always be the starting point in the development of a predictive model in this methodology.

For the dataset, in the StatTools ribbon, select the menu Regression and Classification towards the centre right of the ribbon, then click on the sub menu Regression (referring to linear regression):

1.png

The regression window will appear, shifting from the nomenclature of X and Y (referring to the axes on a chart) to Dependent and Independent (D and I respectively):

2.png

The dependent variable for the purpose of these Blog entries is intuitively labelled Dependent in the datasets.  For the variable Dependent, ensure that the D column is selected (only one can be selected):

3.png

The Independent Variable, being the strongest correlating variable, in this example Kurtosis, will be selected in the I column:

4.png

The default parameters are adequate for the model, thus clicking OK will set about calculating a Linear Regression model.  StatTools will return a linear regression output:

5.png

4) Adding a Trend Line to a Scatter Plot.

This Blog entry is from the Linear Regression section in Learn Palisade.

The scatter plot example created is visually difficult to inspect; a relationship is just about depicted, but it is shallow and its direction is not immediately obvious.

Excel can help by adding a trend line which, at this stage at least, will give a general indication of relationship and direction of that relationship.  In fact, this is an implementation of one-way linear regression, covered in more depth in this module.

 Hover over the chart created by StatTools, then right click:

1.png

Click the sub menu item Trend Line:

2.png

A trend line is drawn that shows a very slight relationship, however it would be reasonable to conclude that the relationship is slim to non-existent in this data.  Financial market data is however notoriously difficult to forecast, hence performing this analysis across a wider array of variables and bringing several variables together is required for the purposes of predictive modelling.

3) Ranking Correlation by Absolute Strength.

This Blog entry is from the Linear Regression section in Learn Palisade.

Correlation, and its strength, is used predominantly to find the strongest correlating independent variables.  The variables with the strongest correlations, in one direction or another, are carried forward to predictive analytics techniques in subsequent Blog entries.

Firstly, select the Dependent variable and all of the independent variables correlating, to whatever degree, to that Dependent Variable.  In this example this involves selecting the first two columns in the correlation matrix, where Dependent is across the top, the independent variables are along the left-hand side, and of course the values are to the right of the names.  Once selected, right click on this selection, then copy:

1.png

Create a new worksheet by clicking the tab with a plus sign at the base of the workbook currently in focus:

2.png

Once the new worksheet is in focus, select cell A1, right click and paste the values:

3.png

A more succinct comparison between the Dependent and Independent variables is presented, although it is not ordered by the strength of correlation:

4.png

Both -1 and +1 are equally interesting for the purposes of variable selection, and so, we cannot simply order by the correlation without first levelling it to an absolute – this is positive – value.  Start by creating a new column called ABS, in this example select cell C1, setting the value to ‘ABS Value’, although the naming is quite unimportant:

5.png

For each of the Independent Variables, apply the Excel ABS Function to the value.  In this example, the first cell of interest is in C3, being the Mean.  In cell C3, start by typing the formula:

=ABS(

6.png

Select the Independent Variable correlation value, which is in cell B3 in this example:

=ABS(B3

7.png

Close the parenthesis to complete the formula, then fill down:

8.png

At this stage, as the Independent Variable correlations have been set to an absolute value, only values close to +1 are deemed interesting. It is helpful to order the absolute correlations from high to low, where the strongest correlations are at the top, the weakest at the bottom. 

Click cell A1 to reset the selection, if anything is selected, then navigate to the data ribbon, select the icon Sort towards the middle of the ribbon, which will infer the dataset:

9.png

Select the parameters Sort By ‘ABS Value’ then the order ‘Largest to Smallest’:

10.png

On clicking OK, the values are ordered based on their strength of correlation:

11.png

It is extremely likely that those correlations towards the top of the table will be included in predictive modelling, while those correlations towards the bottom of the table will be disregarded. 

It is not, at this stage at least, advisable to make a determination of what variables should or should not be in a predictive analytics model, instead relying on stepwise approaches as covered in subsequent Blog entries.
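The ABS-and-sort procedure above can be sketched in a few lines. A pure-Python example using hypothetical correlation values (the variable names echo those used elsewhere in this guide):

```python
# Hypothetical correlations of each independent variable with Dependent.
correlations = {"Kurtosis": 0.41, "Point100": -0.38, "Mean": 0.05, "Skew": -0.52}

# Level to absolute values, then order strongest first: -1 and +1 are
# equally interesting for the purposes of variable selection.
ranked = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked)
```

The strongly negative Skew correlation ranks first once levelled to an absolute value, exactly as it would after the ABS column and Largest to Smallest sort in Excel.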

2) Create a Correlation Matrix using Spearman and Pearson.

This Blog entry is from the Linear Regression section in Learn Palisade.

Correlation is a measure of relationship and direction of that relationship.  It is a single value that ranges from -1 to +1, which would signal the direction and strength of a relationship.  Both -1 and +1 in their extremes are equally interesting.  A correlation matrix takes all the variables together and produces the correlation value between each variable.  This matrix will be the bedrock, starting point, for many of the techniques used in the following Blog entries.

In the StatTools ribbon, towards the centre of the ribbon, select Summary Statistics then the sub menu Correlation and Covariance:

1.png

The correlation and covariance window will open:

2.png

In this Blog entry, perform correlation analysis against all abstracted variables, that is, the Independent and Dependent variables created.  Start by selecting all variables: towards the top left of the variables grid, right click, then click Select All:

3.png

Upon having selected all the variables, deselect the raw variables in the file – this is all variables before the dependent variable in this example:

4.png
5.png

Notice that the Correlation analysis is set to Pearson Linear, while Covariances is also selected.  Covariances are superfluous to this Blog entry and can be deselected:

6.png

Click OK to produce the Correlation matrix.  NOTE THAT THIS MAY TAKE SOME TIME IN EXTREMELY LARGE DATASETS.  The correlation matrix will return after some time processing:

7.png
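A correlation matrix of the kind produced above can be sketched in pure Python (Pearson linear correlation; the variable names and data are invented for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical variables, Dependent included, as selected in the grid.
data = {
    "Dependent": [0.1, 0.4, 0.3, 0.8, 0.6],
    "Kurtosis":  [1.0, 2.1, 1.8, 3.9, 3.0],
    "Mean":      [5.0, 4.0, 6.0, 5.5, 4.5],
}

# Build the matrix: the correlation of every variable against every other.
matrix = {a: {b: pearson(data[a], data[b]) for b in data} for a in data}
print(round(matrix["Dependent"]["Kurtosis"], 3))
```

As with the StatTools output, the matrix is symmetric with a diagonal of 1 (every variable correlates perfectly with itself); the off-diagonal values are what the subsequent ranking and multicollinearity Blog entries consume.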

Introduction to Monte Carlo Model

This Blog entry is from the Monte Carlo Model section in Learn Palisade.

Monte Carlo Simulation is a technique to create many random simulations based upon a random case (i.e. a transaction).  The random values can be forced to obey certain statistical assumptions, which in this example will be a triangular distribution.  Monte Carlo simulation is an enormous topic in its own right, yet these Blog entries are intended to give just a basic overview of the tool and allow for the simulation of models created in these Blog entries.

Simulation for Communication refers to being able to run models based on explainable statistical assumptions so as to facilitate expectation setting for the model's impact.  Furthermore, as millions of random simulations will be exposed to the model, with records of both the randomly generated inputs and the output retained, Monte Carlo simulation can help identify scenarios where there is potential for optimisation or risk mitigation.

Introduction to Logistic Regression

This Blog entry is from the Linear Regression section in Learn Palisade.

Logistic regression is extremely similar to Linear Regression in the manner in which the output is presented as Constants and Independent Variable coefficients, except its role is to classify rather than forecast numeric values; that is to say, it is looking to predict the likelihood of a binary dependent variable outcome rather than that of a continuous dependent variable.

The file to be used in this example is contained in \Training\Data\FraudRisk and is titled FraudRisk.xlsx.  Logistic Regression is a feature of StatTools and thus the starting point is to open the file, although do not at this stage load the file into the StatTools realm as a dataset.