## Introduction

In our previous post, we described two different machine learning models developed in our office during March Madness. We explained the models themselves; now let’s turn to analysis.

## Evaluation & Model Selection

There are many different ways to evaluate models, depending on how the model works and what your goal is.

For Drew, model selection involved experimenting with different types of algorithms and measuring how accurate their predictions were, while ensuring the models were also robust to overfitting. Different classifiers were tried, including naive Bayes, random forest, and a neural net; in the end, logistic regression was selected as the best classifier. The model parameters were chosen using cross-validation, where applicable, on a 70% split of the data for training. The performance metrics were measured on the remaining 30% of the sample data.
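To make the split concrete, here is a minimal sketch of a 70/30 holdout with cross-validation folds carved out of the training portion. It is not Drew's actual code: the helper names, the fold count of five, and the stand-in list of 100 games are all assumptions for illustration.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n_rows, k=5):
    """Yield (fit_idx, validation_idx) pairs that together cover range(n_rows)."""
    indices = list(range(n_rows))
    for fold in range(k):
        validation = indices[fold::k]      # every k-th row, offset by the fold number
        held_out = set(validation)
        fit = [i for i in indices if i not in held_out]
        yield fit, validation

games = list(range(100))                   # stand-in for 100 historical games
train, test = train_test_split(games)      # 70 games to tune on, 30 held back
folds = list(k_fold_indices(len(train), k=5))
```

Parameters would be tuned by fitting on each `fit` subset and scoring on the matching `validation` subset; the 30% in `test` is touched only once, for the final metrics.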

The performance in testing was determined by looking at the accuracy of the predictions. In future iterations of the model, overall bracket performance might be the goal, but for now we wanted to measure only how well the model predicts when each game is treated as independent. While accuracy, or the percentage of correct predictions, was the main metric, recall and precision were also assessed to understand the strengths and weaknesses of the models. Recall measures the percentage of each class that the model recovers: are we better at finding all the upsets, or all the games where the higher-ranked team wins? Precision measures the accuracy of the predictions for each class: for example, a model might be right 80% of the time when it predicts an upset but only 60% of the time when it predicts no upset. Interestingly, the less sophisticated models were eager to predict upsets and recovered more of them (higher upset recall but lower precision), whereas the more sophisticated models were more conservative (lower upset recall but higher precision).
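The three metrics are easy to mix up, so here is a small sketch that computes them for the upset class. The function name and the ten made-up game labels are assumptions for illustration, not output from either model.

```python
def upset_metrics(actual, predicted, positive="upset"):
    """Return accuracy plus precision and recall for the `positive` class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Of the games we called upsets, how many really were?
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of the real upsets, how many did we find?
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }

# Made-up labels for ten games ("chalk" means the higher-ranked team won).
actual = ["upset", "upset", "chalk", "chalk", "chalk",
          "upset", "chalk", "chalk", "chalk", "chalk"]
predicted = ["upset", "chalk", "chalk", "upset", "chalk",
             "upset", "chalk", "chalk", "chalk", "chalk"]
metrics = upset_metrics(actual, predicted)
```

A model that calls more upsets will usually push recall up and precision down, which is the trade-off described above.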

Comparatively, all models produced fairly similar performance, with the naive Bayes model performing the worst with a 69% accuracy score, and logistic regression giving the highest accuracy at 80%. The random forest model and neural net were within 2% of the accuracy score of the logistic model, but given the slightly better performance and model simplicity, the logistic model is the most favorable choice.

Despite not being the best model for our team-based predictions, the random forest model provided useful insight into what contributes to a team win, because feature importances can be measured by the significance of each feature in fitting the decision trees that make up the model. It found the main contributing factors to be the simple rating system score, the average number of defensive rebounds, the team’s strength of schedule, two-point attempts, number of assists per game, three-point attempts, the opposition’s average points per game (a measure of defensive strength), and free-throw attempts. Not only is this a sensible result, but it confirms the old adage that “defense wins championships.”

Bryan evaluated his model on overall accuracy in a tournament, accuracy in the first round, and the mean absolute error for each player.
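As a rough sketch of these measures, the snippet below computes a pick accuracy and a per-player mean absolute error. It assumes the error is taken between predicted and actual points per game, which this post does not spell out; the function names, player labels, and numbers are all hypothetical.

```python
def accuracy(picked_winners, actual_winners):
    """Fraction of games where the picked winner matches the actual winner."""
    pairs = list(zip(picked_winners, actual_winners))
    return sum(pick == actual for pick, actual in pairs) / len(pairs)

def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical points per game for two players (not real data).
actual_points = {"player_a": [12, 20, 7], "player_b": [5, 9]}
predicted_points = {"player_a": [10, 23, 8], "player_b": [6, 9]}

mae_by_player = {
    name: mean_absolute_error(actual_points[name], predicted_points[name])
    for name in actual_points
}
first_round = accuracy(["A", "B", "C", "D"], ["A", "X", "C", "D"])
```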

In fact, none of these are perfect measures, and it would have been better to evaluate on the objective we are trying to maximize, that is, how we would score in the competition. The different bracket sites have different scoring mechanisms, but most are set up to provide an equal number of points during each round, with the exception of Sports Illustrated’s changeable bracket. (This means that games later in the tournament are worth more than games earlier in the tournament.) Some bracket games (Yahoo) give bonus points for upsets, but we ignore that option.
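Here is a minimal sketch of what scoring on that objective could look like, assuming a 64-team bracket and a hypothetical total of 32 points available in each round. The function names and single-letter team labels are made up, and no upset bonus is applied.

```python
# Number of games played in each round of a 64-team bracket.
GAMES_PER_ROUND = [32, 16, 8, 4, 2, 1]

def points_per_game(round_total=32):
    """Equal points per round means each game is worth more in later rounds."""
    return [round_total / games for games in GAMES_PER_ROUND]

def score_bracket(picks, results, round_total=32):
    """Score a bracket given, for each round, the set of picked and actual winners."""
    per_game = points_per_game(round_total)
    return sum(
        value * len(set(picked) & set(actual))
        for value, picked, actual in zip(per_game, picks, results)
    )

# Toy example covering only the first two rounds.
example_picks = [{"A", "B"}, {"A"}]
example_results = [{"A", "C"}, {"A"}]
example_score = score_bracket(example_picks, example_results)
```

With these settings a first-round game is worth 1 point and the championship game is worth 32, so a model tuned for per-game accuracy can still lose a pool by missing late-round games.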

Bryan evaluated many different variations of his model, including different features and whether only player similarity or team similarity should be used. These evaluations were done on tournaments from 2012 to 2018.

## What others are doing

Predicting the NCAA tournament has become a popular challenge for data scientists, and we are not the first to attempt this. One popular service that sells its predictions is TeamRankings, which uses a type of simulation known as a genetic algorithm to predict the winners of each matchup. Since the 2009 tournament, they have correctly predicted 72.1% of all games in March and April (TeamRankings does not make available specific accuracy of NCAA tournament picks that we have seen).

Kaggle, the machine learning competition site, has run an NCAA tournament competition since 2014. Since 2018, Google, which owns Kaggle, has sponsored the competition with financial rewards. This year the top place finisher gets $10,000. This competition is set up as a regression problem, where participants must predict the probability of a team winning each possible game. Most people do not publicize their methods, but the 2018 winner used random forests, a popular ensemble method for problems like these.

## NET

Before the tournament even starts, the selection committee has the difficult task of selecting the at-large bids as well as assigning each team a seed in the tournament. For years, the NCAA used a model called the Rating Percentage Index (RPI), which was designed to include not just the number of wins and losses a team had, but also how good the teams that they played were. For this season, the RPI was replaced by the NCAA Evaluation Tool (NET).

The NET is made up of five parts, and the NCAA has not clarified how these parts are combined into a final score as of this writing, except to say they are presented in order of their importance.

The five parts are:

**The Team Value Index**
An unknown algorithm that incorporates who the opponent is, where the game was played, and who won. In an article on NCAA.com, this was described as “If you beat a team that you’re expected to beat, then it doesn’t do as much for your ranking. Losing to teams that you were expected to beat will hurt your ranking.” This sounds suspiciously similar to a location-adjusted Elo rating.
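For readers who have not seen one, here is a sketch of a location-adjusted Elo update. To be clear, this is not the NCAA's algorithm, which is unpublished; the `k` factor of 20 and the 100-point home edge are hypothetical values chosen only to show the behavior described in the quote.

```python
def expected_score(rating, opponent_rating, home_edge=0.0):
    """Win probability implied by the ratings; home_edge is positive at home,
    negative on the road, and zero on a neutral court."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating - home_edge) / 400.0))

def update_rating(rating, opponent_rating, won, home_edge=0.0, k=20.0):
    """Move the rating toward the result, scaled by how surprising it was."""
    expected = expected_score(rating, opponent_rating, home_edge)
    return rating + k * ((1.0 if won else 0.0) - expected)

# A favorite (1600) beating an underdog (1400) gains little...
favorite_gain = update_rating(1600, 1400, won=True) - 1600
# ...but loses a lot if it drops that same game.
favorite_loss = update_rating(1600, 1400, won=False) - 1600
# Between equal teams, a road win is worth more than a home win.
home_win_gain = update_rating(1500, 1500, won=True, home_edge=100) - 1500
road_win_gain = update_rating(1500, 1500, won=True, home_edge=-100) - 1500
```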

The other four parts are more straightforward.

**The net efficiency rating**
Calculated as

$$\frac{\textrm{Points Team Scores}}{\textrm{Field Goals Attempted} - \textrm{Offensive Rebounds} + \textrm{Turnovers} + 0.475 \times \textrm{Free Throws Attempted}} - \frac{\textrm{Points Opponent Scores}}{\textrm{Opponent’s Field Goal Attempts} - \textrm{Opponent’s Offensive Rebounds} + \textrm{Opponent’s Turnovers} + 0.475 \times \textrm{Opponent’s Free Throw Attempts}}$$

**The winning percentage**
The number of games won divided by the number of games played.

**The adjusted winning percentage**
Takes into account the location of the game, rewarding away wins the most and penalizing home losses the most.

**The score margin**
Team score minus opponent score for each game, capped at 10 points per game to prevent teams from running up the score or any tomfoolery involving gambling.
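Two of these parts are simple enough to sketch directly. The snippet below assumes offensive rebounds are subtracted when estimating possessions (the standard possession estimate) and that the 10-point cap applies to losses as well as wins, which the description above does not state. The function names and box-score numbers are made up.

```python
def possessions(field_goal_attempts, offensive_rebounds, turnovers, free_throw_attempts):
    """Estimate possessions from box-score totals."""
    return (field_goal_attempts - offensive_rebounds + turnovers
            + 0.475 * free_throw_attempts)

def net_efficiency(team, opponent):
    """Points per possession scored minus points per possession allowed."""
    offense = team["points"] / possessions(
        team["fga"], team["oreb"], team["tov"], team["fta"])
    defense = opponent["points"] / possessions(
        opponent["fga"], opponent["oreb"], opponent["tov"], opponent["fta"])
    return offense - defense

def capped_margin(team_score, opponent_score, cap=10):
    """Score margin for one game, limited to the range [-cap, cap]."""
    return max(-cap, min(cap, team_score - opponent_score))

# Made-up box scores for a single game.
team = {"points": 80, "fga": 60, "oreb": 10, "tov": 12, "fta": 20}
opponent = {"points": 70, "fga": 58, "oreb": 8, "tov": 14, "fta": 16}
rating = net_efficiency(team, opponent)
```

Under the cap, a 30-point blowout counts the same as a 10-point win, so there is nothing to gain from leaving the starters in.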

The NET score is not used directly to decide a team’s seeding, but is presented to the selection committee members along with the team’s results against other teams in different ranges of the NET rankings, which the NCAA calls quadrants.

## Real Time Prediction

One of our favorite things to look at during the tournament is the real-time predictions offered by FiveThirtyEight. These predictions are trained on previous seasons of NCAA basketball, using logistic regression, which produces a value between 0 and 1.

For each play in a game, the following four parameters are used:

- Time remaining in the game
- Score difference
- Pregame win probabilities
- Which team has possession, with a special adjustment if the team is shooting free throws

For each game state representing these values, the model is trained to predict a 1 if the team went on to win the game and a 0 if it went on to lose. This produces a model that can then give a probability estimate based on previous games of how likely a team is to win. You can see that towards the end of close games, the probability can swing wildly back and forth, like the Duke-Michigan State game from the Elite Eight.
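The prediction side of such a model can be sketched in a few lines. This is not FiveThirtyEight's model: the feature encoding and the weights below are hypothetical, the free-throw adjustment is left out, and a real model would learn its weights from past games.

```python
import math

GAME_SECONDS = 2400  # a 40-minute college game

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def game_state_features(seconds_left, score_diff, pregame_prob, has_possession):
    """Turn one play's game state into a numeric feature row."""
    return [
        1.0,                                # intercept
        seconds_left / GAME_SECONDS,        # fraction of the game remaining
        float(score_diff),                  # team score minus opponent score
        pregame_prob,                       # pregame win probability
        1.0 if has_possession else 0.0,     # possession flag
    ]

def win_probability(features, weights):
    """Logistic regression prediction: sigmoid of the weighted feature sum."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

# Hypothetical weights for illustration only. Time alone favors neither team,
# so its weight is zero here; this sketch omits any time-by-score interaction.
HYPOTHETICAL_WEIGHTS = [-1.0, 0.0, 0.25, 2.0, 0.1]

tied = win_probability(game_state_features(1200, 0, 0.5, False), HYPOTHETICAL_WEIGHTS)
leading = win_probability(game_state_features(1200, 8, 0.5, False), HYPOTHETICAL_WEIGHTS)
trailing = win_probability(game_state_features(1200, -8, 0.5, False), HYPOTHETICAL_WEIGHTS)
```

Training would fit the weights so that game states from teams that went on to win score close to 1 and states from teams that went on to lose score close to 0.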

## Thoughts for the future

### Drew’s Thoughts

This first pass at the problem of building a better bracket was successful, but there is still ample room for improvement. For the supervised learning model, there are several ways in which the input data can be improved. The athletic performance of the teams in the tournament is the result of a choreographed development meant for them to peak right in time for the big dance. This means that the teams that play in the tournament are very different from the ones that started the season. Since Drew’s model used the yearly averages, it measures only the typical team play, so there is no way to account for variation over time. Ultimately, to improve the model we need to be able to account for both typical performance and the timing of performance increases. Additionally, taking a more granular approach to understand how a certain team plays against teams that are perceived to be stronger or weaker, something like Bryan’s approach, could help make the yearly statistics more informative. Finally, it is important to remember that we are observing one realization of a probabilistic system, so even the best model will never be perfectly accurate. In other words… to win your office pool you need a bit of luck. As we saw this year, with a few exceptions, most games were very close. A chalk bracket performed well in the first two rounds as the coin flip landed favorably, but blew up in the third round as it took a turn for the worse. Even if you have a wonderful model, producing a great bracket will require you to tempt fate and gamble on a few unpredictable upsets.

### Bryan’s Thoughts

Bryan initially developed his model to be based on player similarities, and added in the team comparisons at the last minute as a fix to some issues he was seeing. Because of this, not as much care was given to the features that were used to compare teams. Some features such as usage percentage, or other stats based on a player’s contribution to their team, don’t make sense on a team level and should most likely be dropped.

There are also team-specific stats like pace that cannot be calculated for individual players. These stats could be vital when comparing teams, and should be examined more closely.

Another big change would be to replace this hand-crafted model with a neural network that uses team and player statistics as input and predicts the number of points a player will score using supervised learning. If done correctly, this should learn player and team similarities and factor them in when making a prediction.

## Conclusion

The 2019 tournament was a wild one. As we recover from the excitement, and work off all of the chicken wings we ate, there are a few things on which we can reflect. While the models were developed just for fun, overall, they did pretty well and even picked up on a few Cinderella stories that we would have missed. For instance, Drew’s model had Auburn going to the final, which it barely missed in its Final Four game against Virginia, the eventual champions. Bryan’s model had a rocky start to the tournament. It really seemed to struggle with how rare upsets were in the first round, although it did accurately predict the Murray State and Ohio State victories, but it shone in the later rounds by accurately predicting Oregon going to the Sweet 16 and Michigan State defeating Duke. With further improvement these models could be very competitive and give you an edge in your office pool, but they would only improve your performance year after year as the statistical edge shows itself. If you want to truly have the opportunity to win your pool in a given year, you will have to gamble and go against a few of the model predictions, as the reality of the tournament often defies statistics and logic. Even though the last of the confetti may have just been swept up, we are already excited to improve our models for next year and eagerly awaiting the surprises that next year’s tournament will bring.