There are lies, damned lies, and statistics, so the saying goes. So what does that make statistics for machine learning?
Many people have argued passionately on either side of the fence as to whether machine learning is just big data statistics.
I am not going to get into that today.
However, regardless of what you believe about machine learning’s origin story, if you want to practice this technique, you need to understand statistics.
Why do you need to understand statistics for machine learning?
For today, let’s ignore all the statistical calculations that go on as a machine learning algorithm is learning.
You do not need to know exactly what is happening inside a learner to implement an ML project.
However, understanding statistics becomes vital to machine learning when you are evaluating the model.
To understand whether or not your model is performing well, you need to look at the statistics.
The metrics I am going to talk about today are those most commonly used by data scientists.
These metrics are:
- Mean
- Standard deviation
- Accuracy
- Absolute Error
- Mean Squared Error (MSE)
- R^2 (r-squared)
- P-value*
- Precision
- Recall
* It is worth knowing that p-values aren't generally used for evaluating machine learning models. However, if you are working with business teams running multi-armed bandit A/B testing on new products or features, p-values are often used to evaluate results when making launch decisions. Therefore I have included them as a useful practical addition to your knowledge.
I recommend getting comfortable with the ones we cover here before moving on to broaden your skillset.
Before we get into the good stuff, however, I am going to share a critical step in the process of developing an effective machine learning algorithm.
You have to split your data.
Why you need both a training and a test set of data
When you start running projects in machine learning, you will see that one of the first steps is to split your data into a training set and a test set.
The reason you do this is so that you can check that your model is not over-optimized for the data you use to train it.
Without checking on test data, you run the risk the model will fail on new, real-world data.
This failure of the algorithm on new data is often caused by the algorithm ‘overfitting’ to your training data. To prevent overfitting during development, you first train the model on a portion of your data (say, 80%) and then check it works on a test set (the remaining 20%).
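Here is a minimal sketch of that 80/20 split using scikit-learn's train_test_split (the toy X and y below are stand-ins for your real features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for your real features (X) and labels (y)
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Hold out 20% of the rows as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 40 10
```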
What is overfitting?
Overfitting occurs when your algorithm is trained to work ‘too well’ on the training data set.
When the algorithm overfits, you will see high performance on the data it trained on, but this performance will rapidly decline when introduced to new data.
Essentially the algorithm is unable to generalize.
To evaluate the model for overfitting, you will need to use the metrics we will go through next.
You can identify overfitting by looking at the accuracy metric for your algorithm on the training and test set.
If accuracy is high on the training set but significantly lower on the test set, overfitting is the likely cause. A small difference is expected, but anything larger should be cause for concern.
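As an illustration, here is a sketch of that check using a deliberately overfit-prone decision tree on synthetic data (the dataset and model are assumptions chosen for demonstration, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a real classification dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained decision tree will happily memorize the training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
# A big gap (e.g. 1.00 on train vs 0.85 on test) is the classic sign of overfitting
```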
Ok, let’s move on to understanding the statistical metrics.
An Overview of Machine Learning Metrics
Most people are more comfortable with the term metrics than with statistics. However, most metrics leverage statistics.
The table below takes you through the vital statistical metrics you will use to evaluate machine learning algorithms.
| Metric | Definition | How to use it |
| --- | --- | --- |
| Mean | A way of calculating the average value | A useful metric to have when looking at raw data |
| Standard deviation | The average amount the data differs from the mean | A useful metric to have when looking at raw data |
| Accuracy | The percentage (or probability) that your predictions are correct | A simple metric to understand the performance of an algorithm on a test set |
| Absolute Error | The difference between your predicted and actual results | Gives you a more detailed understanding of how 'wrong' the predictions were on the data set |
| Mean Squared Error (MSE) | The average of the squared differences between your predictions and the actual results. Squaring removes negative values and gives bigger errors a higher weight. | Similar to absolute error, but the weighting makes it easier to use for comparisons |
| R^2 (r-squared) | The proportion of the variance in the actual values that your predictions explain. A value of 1 means a perfect fit; values near 0 (or negative) indicate a poor one. | Good to have alongside MSE to give you a greater understanding of the dataset |
| p-value* | The probability of seeing results at least as extreme as yours if they were purely due to chance. A p-value of 0.05 means there is only a 5% chance of seeing results this extreme by chance alone. | Helps you understand how statistically significant your results are |
| Precision | Of all the positive predictions your model made, the percentage that were actually correct | Good for evaluating the model's overall performance |
| Recall | Of all the actual positive cases, the percentage your model correctly identified | Used alongside precision to evaluate the model's overall performance |
What are the right metrics for your problem?
The simple answer to which metrics are right for your problem is: it depends.
The main thing to bear in mind is whether your problem is a classification problem or a regression problem.
For classification problems, metrics like accuracy, precision, and recall are essential to evaluate a model's performance.
For regression problems, you would use other error metrics such as MSE, R^2, and absolute error.
If you are working on a specific project for a client or a Kaggle competition, often they will give you the metric to use to evaluate the model.
This way is fantastic for you as you don’t have to think about it! 😀
Evaluating Different Statistical Metrics using Python
Now we are going to take a look at how to use each of the metrics we have discussed in Python.
Mean
- Overview: Used to understand the average data point
- Aim: Better understand how your data is structured
- Implementation: Available in Pandas and NumPy using .mean()
Standard deviation
- Overview: The average amount the data points differ from the average
- Aim: Better understand how your data is structured
- Implementation: Available in Pandas and NumPy using .std()
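A quick sketch showing both metrics in Pandas and NumPy on toy data (note that the two libraries default to different flavors of standard deviation):

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # toy data
s = pd.Series(values)

print(s.mean())         # 5.0
print(s.std())          # ~2.14 -- pandas defaults to the sample std (ddof=1)
print(np.mean(values))  # 5.0
print(np.std(values))   # 2.0 -- NumPy defaults to the population std (ddof=0)
```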
Shortcut to get all the statistics for your dataset!
Instead of calculating the different metrics for your dataset individually, you can use the .describe() method in pandas.
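For example, on a toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"height": [1.6, 1.7, 1.8, 1.9], "weight": [60, 70, 80, 90]})
print(df.describe())  # count, mean, std, min, quartiles, and max for every numeric column
```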
Accuracy
- Overview: The percentage (or probability) that your prediction is correct
- Aim: To understand how good your model is at making predictions on the training and test sets
- Implementation: Using the sklearn library: sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
Be aware, an accuracy value of greater than 95% may look good, but make sure that it is replicated on the test and validation sets. This check ensures your algorithm isn't overfitting.
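A minimal sketch with hand-made labels (the y_true and y_pred values are toy data):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1, 1]  # actual labels (toy data)
y_pred = [0, 1, 0, 0, 1, 1]  # model predictions (toy data)

print(accuracy_score(y_true, y_pred))  # 0.8333... -- 5 of 6 predictions correct
```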
Absolute Error
- Overview: The difference between your predicted and actual results. The mean absolute error takes the absolute value of each error (making them all positive) and averages them.
- Aim: Measures the actual difference between your predicted value and the correct value
- Implementation: Using sklearn library: sklearn.metrics.mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
Remember you are looking to minimize absolute error on both the training and test sets. It is wise not to look at absolute error alone but to combine it with other metrics.
Mean Squared Error (MSE)
- Overview: Determines the difference between your prediction and the actual results. It squares each error to remove negative values and to give more significant errors a higher weight.
- Aim: Measures the weighted difference between your predicted value and the correct value
- Implementation: Using sklearn: sklearn.metrics.mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
R^2 (r-squared)
- Overview: Another way of evaluating the gap between predicted and actual values. It tells you what proportion of the variance in the actual values your predictions explain. A value of 1 means a perfect fit; values near 0 (or negative) indicate a poor one.
- Aim: Measures how much of the variation in the data your model explains
- Implementation: Using sklearn: sklearn.metrics.r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
If your R^2 comes out negative (it can never be above 1), your model is doing worse than simply predicting the mean every time. Your algorithm is crap so you should start again – sorry!
Code to implement error metrics:
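Here is a minimal sketch covering all three regression error metrics on toy values:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]  # actual values (toy data)
y_pred = [2.5, 0.0, 2.0, 8.0]   # predicted values (toy data)

print(mean_absolute_error(y_true, y_pred))  # 0.5
print(mean_squared_error(y_true, y_pred))   # 0.375
print(r2_score(y_true, y_pred))             # ~0.949 -- close to 1, so a good fit
```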
P-value
- Overview: The probability of seeing results at least as extreme as yours if they were purely due to chance. A p-value of 0.05 means there is only a 5% chance of seeing results this extreme by chance alone.
- Aim: Used to judge whether a result is statistically significant rather than random noise.
- Implementation: You can set a target p-value (significance level) for your analysis up front. To calculate p-values in Python, you can code the test yourself, use a statistics library such as SciPy, or read them from the output of a feature-selection regression routine.
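As one hedged example, SciPy's two-sample t-test is a common way to get a p-value in Python (the simulated A/B data below is an assumption for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.10, scale=0.05, size=200)  # simulated metric for variant A
variant = rng.normal(loc=0.12, scale=0.05, size=200)  # simulated metric for variant B

t_stat, p_value = stats.ttest_ind(control, variant)
if p_value < 0.05:
    print(f"p = {p_value:.4f}: significant at the 0.05 level")
else:
    print(f"p = {p_value:.4f}: cannot rule out chance")
```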
Precision
- Overview: Of all the positive predictions your model made, the percentage that were actually correct
- Aim: Understand how trustworthy your model's predictions are when it does make a positive call.
- Implementation: Using sklearn: sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
Precision and recall are a package deal, so you should always look at them both when you want to use one of them.
Recall
- Overview: Of all the actual positive cases in the data, the percentage your model correctly identified
- Aim: Understand how many of the true positives your model actually catches
- Implementation: Using sklearn: sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
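A short sketch computing both on the same toy labels shows why they tell different stories:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0]  # 4 actual positives (toy data)
y_pred = [1, 1, 0, 0, 1, 0]  # 3 predicted positives, 2 of them correct

print(precision_score(y_true, y_pred))  # ~0.667 -- 2 of the 3 positive predictions were right
print(recall_score(y_true, y_pred))     # 0.5 -- only 2 of the 4 actual positives were found
```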
Onto evaluating models with statistics for machine learning!
So now you understand why it is vital to have a good grasp of statistics to evaluate your machine learning algorithm.
The important things to remember are:
- Always split your data into a training set and a test set to evaluate model performance
- There are a variety of metrics you can use depending on your problem
- Some of the most common statistical metrics used in machine learning are:
- Accuracy
- Absolute Error
- Mean Squared Error (MSE)
- R^2 (r-squared)
- Precision
- Recall
I hope this tutorial has been exciting for you, and you do not feel overwhelmed.
If you want to dive deeper into statistics, I have some great book recommendations here.
Good luck and happy modeling!