Applying Machine Learning to Sports Betting
Using regression and classification models to predict baseball (MLB) outcomes.
Machine learning has proven to be an extraordinarily powerful tool for predictive tasks such as forecasting revenue, predicting health outcomes, and even estimating the impact of geological events.
While there is ample evidence of this effectiveness, comparatively little has been written about its application to sports betting. That's the gap we'll fill here as we dive into the diverse ways machine learning can be applied to sports betting.
Classification Models
In machine learning, classification models are used to predict the class or category of an observation based on its features. When it comes to sports betting, classification models offer a fascinating approach to analyzing and predicting outcomes, enabling bettors to make more informed decisions.
One popular classification model that has strong potential to be used in sports betting is the random forest algorithm. The random forest algorithm constructs multiple decision trees, each using a subset of the available data and a random selection of features. These decision trees independently predict the outcome of the game based on their unique set of features. The algorithm then combines the predictions of all the decision trees to make a final classification decision.
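A minimal sketch of this idea in scikit-learn (both features and every number below are made up purely for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: six matchups described by two invented features
# (batter hit rate, pitcher hits allowed per game).
X = [
    [0.310, 9.1],
    [0.220, 6.5],
    [0.295, 8.7],
    [0.240, 5.9],
    [0.305, 9.4],
    [0.210, 6.0],
]
y = [1, 0, 1, 0, 1, 0]  # 1 = batter recorded a hit

# 100 trees, each fit on a bootstrap sample of the rows and a
# random subset of the features at every split.
model = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=42
)
model.fit(X, y)

# The forest's final answer aggregates the votes of all its trees.
print(model.predict([[0.300, 9.0]]))  # → [1]
```

Each tree alone may be a weak, noisy predictor; the averaging across many decorrelated trees is what gives the forest its robustness.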
So, how does the random forest algorithm work in the context of sports betting? Let's dive into a real-world example. In this example, we want to predict whether or not a batter will record a hit for a given game. But for the algorithm to generate predictions, we first need to select our features.
It will be useful to have historical data for both the pitchers and the batters, so for our pitchers we will pull the following data points:
- Games Played
- Runs Allowed
- Strikeouts Thrown
- Hits Allowed
- Their Name
Then, for our batters, we will pull the following data points:
- Games Played
- Runs Scored
- Strikeouts Recorded
- Hits Recorded
- RBIs Recorded
- Their Name
Finally, it will be helpful to consider other variables like the temperature and the stadium the game is played in, so we include those as well.
Here is what our training dataset will look like:
Each row represents a pitcher-batter matchup of a given game. The "batting hit recorded" column is set to a 1 if the given batter recorded a hit for that game, and it is given a 0 if the batter did not record a hit. This makes the dataset suitable for a binary classification model as the random forest will try to predict either a 0 or 1.
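One way such a training table might be assembled with pandas (the column names and rows here are illustrative stand-ins, not pulled from a real data source):

```python
import pandas as pd

# Illustrative rows; in practice these stats would come from a
# historical MLB data feed.
matchups = pd.DataFrame({
    "batter_name": ["Batter A", "Batter B"],
    "pitcher_name": ["Pitcher X", "Pitcher Y"],
    "batter_games_played": [120, 98],
    "batter_hits_recorded": [140, 85],
    "pitcher_games_played": [28, 31],
    "pitcher_hits_allowed": [160, 190],
    "temperature_f": [74, 61],
    "stadium": ["Fenway Park", "Wrigley Field"],
    "batting_hit_recorded": [1, 0],  # binary target column
})
print(matchups.shape)  # → (2, 9)
```

Categorical columns like the names and the stadium would need to be encoded (e.g. one-hot or label encoding) before being fed to the model.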
Let's see how accurately the model predicts whether or not a batter records a hit:
This summary output returns a wealth of information, but before drawing any insights from it, let's first break down the terminology of the summary statistics:
- Accuracy: Accuracy measures the proportion of correct predictions made by a classification model. It is calculated by dividing the number of correct predictions by the total number of predictions, providing an overall assessment of the model's performance.
- AUC (Area Under the Curve): AUC is a performance metric commonly used in binary classification tasks. It represents the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. AUC provides a measure of the model's ability to distinguish between positive and negative classes, with a higher AUC indicating better performance.
- Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of true positive predictions out of all actual positive instances. It highlights the model's ability to correctly identify positive cases, which is particularly important in scenarios where detecting positive instances is crucial.
- Precision: Precision calculates the proportion of true positive predictions out of all positive predictions made by the model. It focuses on the accuracy of positive predictions and helps assess the model's precision in correctly identifying positive cases, minimizing false positives.
- F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall, offering a comprehensive evaluation of a model's performance. The F1 score is useful when the dataset has imbalanced classes, as it considers both false positives and false negatives.
- Kappa: Kappa, also known as Cohen's kappa, is a statistic that measures the agreement between predicted and observed classifications, taking into account the possibility of agreement occurring by chance alone. It considers both the accuracy of the model and the expected agreement, providing a more robust evaluation metric.
- MCC (Matthews Correlation Coefficient): MCC is a correlation coefficient that takes into account true positive, true negative, false positive, and false negative predictions. It provides a balanced measure of a model's performance, accounting for imbalanced datasets and offering a reliable evaluation metric, especially in binary classification tasks.
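Given a model's predictions on a held-out test set, all of these metrics can be computed with scikit-learn. A sketch, where `y_test`, `y_pred`, and `y_proba` are stand-ins for real data (note that AUC is computed from predicted probabilities, not hard labels):

```python
from sklearn.metrics import (
    accuracy_score, roc_auc_score, recall_score,
    precision_score, f1_score, cohen_kappa_score, matthews_corrcoef,
)

# Stand-in labels, hard predictions, and predicted probabilities.
y_test  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]
y_proba = [0.8, 0.2, 0.4, 0.9, 0.3, 0.6, 0.7, 0.1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_proba))
print("Recall   :", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("Kappa    :", cohen_kappa_score(y_test, y_pred))
print("MCC      :", matthews_corrcoef(y_test, y_pred))
```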
We will focus primarily on the precision column, as it answers the question "when the model predicted the batter to record a hit, how often did they actually record one?" (precision is also known as the positive predictive value; the true positive rate is recall). As demonstrated, the model's precision comes in at an impressive 61%! This means that when the model predicted a batter to get a hit, they actually recorded one 61% of the time!
Fortunately for us, we can take this model even further.
Classification models generally output a probability rather than a hard label: logistic regression passes its output through a sigmoid function, while a random forest reports the fraction of its trees that voted for each class. A threshold then turns that probability into a prediction: if the output is greater than 0.50, the model assigns a prediction of 1; if it is less than or equal to 0.50, the model assigns a prediction of 0.
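This thresholding step can be sketched in a few lines (the probabilities below are hypothetical; with scikit-learn they would come from the model's `predict_proba` method):

```python
import numpy as np

# Hypothetical predicted probabilities that each batter records a hit.
probs = np.array([0.72, 0.44, 0.51, 0.50])

# Assign class 1 when the probability exceeds 0.50, else class 0.
preds = (probs > 0.50).astype(int)
print(preds)  # → [1 0 1 0]
```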
Having a probability output for each prediction is immensely useful: we can convert those probabilities into American sportsbook odds, which we can then compare to what the sportsbook is actually offering.
To do this, we take the implied probability of the prediction and plug it into the American odds formula:
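One common form of that conversion: for an implied probability p, favorites (p > 0.5) get negative odds of −100·p/(1−p), and underdogs get +100·(1−p)/p. A minimal sketch:

```python
def american_odds(p: float) -> int:
    """Convert an implied probability to (rounded) American odds."""
    if not 0 < p < 1:
        raise ValueError("probability must be strictly between 0 and 1")
    if p > 0.5:
        # Favorite: the amount you must risk to win $100.
        return round(-100 * p / (1 - p))
    # Underdog: the amount won on a $100 stake.
    return round(100 * (1 - p) / p)

print(american_odds(0.61))  # → -156
print(american_odds(0.40))  # → 150
```

If the model's fair odds are meaningfully better than the sportsbook's posted line, that gap is the potential edge a bettor is looking for.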
Applying this formula to the predictions allows us to calculate theoretical fair odds. Let's take a look:
To wrap things up, machine learning has shown great potential in the field of sports betting. Classification models, such as the random forest algorithm, can be employed to analyze and predict outcomes, empowering bettors to make more informed decisions. By selecting relevant features and utilizing historical data for pitchers and batters, along with additional variables like temperature and the particular stadium, we were able to train a robust classification model.
The model's performance can be evaluated using various metrics such as accuracy, AUC, recall, precision, F1 score, kappa, and MCC. In the example discussed, the precision of the model was found to be 61%, indicating that when the model predicted a batter to get a hit, they actually recorded one 61% of the time. This demonstrates the potential effectiveness of the model in predicting specific outcomes in sports betting.
By leveraging machine learning techniques and incorporating them into sports betting, individuals can enhance their decision-making process and potentially increase their chances of making profitable bets.
While we covered a lot of ground, there's no need to stop here! We've designed a structured curriculum that walks you through:
- Building an end-to-end machine learning workflow for MLB events
- Predicting outcomes for strikeouts, hits, home runs and more
- Choosing the optimal betting and bankroll management strategy
- And finally, live testing in a realistic environment before deploying into production!
So, if you're eager to supercharge your data science and machine learning skills, we encourage you to explore our course. Enroll today and see for yourself the power of machine learning in sports betting!