Every March, we huddle over empty NCAA tournament brackets with 63 games waiting for our enlightened predictions. Whether we analyze detailed game stats or blindly flip a coin to make our choices, there is always optimism that this is the year we defy the odds and have a perfect bracket. Usually by the end of the first round of the tournament, these hopes are dashed as upsets devastate our brackets and leave us with nothing to do except wonder, “Where did I go wrong?”

At Miner & Kasch we implement data science solutions to solve hard problems. As such, we wanted to see if there are ways we could use our skills to model the tournament outcomes and make better predictions, or if we are doomed to simply roll the dice and hope that fate is kind to us. Spoiler alert: it’s a little of both.

“Big data” — the ability to process large amounts of data efficiently and effectively — has transformed how we do business, especially when paired with advanced statistical methods like machine learning. For decades the world of sports has been acutely aware that using analytics to gauge the strengths, weaknesses, and strategy of competing teams provides insights that help create decisive advantages. There are rich statistics on how sports are played — in this case, basketball. We figured that there must surely be a way to use all this data with new machine learning tools to our advantage.

**The foundation**

First, we needed data. We were a bit crunched for time, so it had to be both useful and relatively easy to acquire. We also needed models to fit to that data.

We settled on two different approaches. Drew’s approach measured imbalances in team performance using season-level metrics: the average number of each type of shot taken, the percentage of each that were made, the number of offensive and defensive rebounds, the strength of schedule, and even the variances of the number of points scored and points allowed over the entire season. For this model, we need input data (stats to model) and targets to train the model on (game outcomes).

Bryan’s approach tried to predict the number of points each team will score given its players’ past performance against similar teams. For this model, we only need input data to make predictions.

The results of each game in the tournament are easily downloaded as a spreadsheet from “https://data.world/sports/ncaa-mens-march-madness”. The input data takes a bit more work, but www.sports-reference.com offers statistics going back decades for the models to ingest. We need statistics on all teams going back many years, a volume of data that is impractical to collect by hand. Instead, we used Python scripts to scrape the site and keep only the data that we need.

What do we do with this wealth of data? A naive baseline model always takes the higher-seeded team. History tells us this model will predict 68% of the NCAA tournament games correctly. This result doesn’t take into account the fact that upsets do not propagate forward in your bracket, but it still serves as an estimate. Our goal in building models is to determine which teams are mis-seeded and therefore represent likely upsets.
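The baseline is simple enough to sketch in a few lines. The game list below is made up purely for illustration (it is not the historical data behind the 68% figure), and the function names are our own.

```python
# Hypothetical results: (seed of team A, seed of team B, actual winner)
games = [
    (1, 16, "A"),
    (8, 9, "B"),
    (5, 12, "B"),
    (2, 15, "A"),
    (4, 13, "A"),
]

def pick_higher_seed(seed_a, seed_b):
    # A lower seed number means a higher seed; ties fall back to team A
    return "A" if seed_a <= seed_b else "B"

def baseline_accuracy(games):
    # Fraction of games where the higher seed actually won
    correct = sum(pick_higher_seed(a, b) == winner for a, b, winner in games)
    return correct / len(games)

print(baseline_accuracy(games))
```

Any model worth keeping should beat this number on real tournament results.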

**Drew’s Model**

Drew takes a top-level view of the two teams that are playing and employs supervised learning.

To make a predictive model of a game’s outcome from the statistics we scraped, we considered directly comparing the statistics of each team, but this approach has two major drawbacks. First, the feature set is sufficiently large that the variance of the feature vector could swamp the signal, making it difficult to build a robust model. Second, this approach does not really capture the comparison we want to model.

(For reference, for the team level data we gathered 35 relevant statistics, which become features for the model. The total statistics for each team in each matchup gives us 70 total features.)

Instead of a direct comparison, the vector of statistics of the lower ranked team is subtracted from that of the higher ranked team. This provides a measure of the imbalance, plus a proxy for mismatches in various aspects of skill and performance. It also helps level the playing field by removing value offsets, so that a matchup between two evenly matched teams looks the same to the model whether both teams are strong or both are weak.
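As a rough sketch of the differencing step, assuming each team’s season statistics are stored in a dictionary: the three stat names and all of the values are hypothetical stand-ins for the 35 statistics we actually scraped.

```python
# Hypothetical subset of the team-level statistics
STAT_NAMES = ["points_per_game", "rebounds_per_game", "three_point_pct"]

def matchup_features(higher_ranked, lower_ranked):
    # One feature per statistic: higher ranked team minus lower ranked team
    return [higher_ranked[name] - lower_ranked[name] for name in STAT_NAMES]

# Two strong teams separated by a small margin...
strong_pair = matchup_features(
    {"points_per_game": 82.0, "rebounds_per_game": 38.5, "three_point_pct": 39.0},
    {"points_per_game": 80.0, "rebounds_per_game": 37.0, "three_point_pct": 38.0},
)
# ...and two weak teams separated by the same margin
weak_pair = matchup_features(
    {"points_per_game": 66.0, "rebounds_per_game": 31.5, "three_point_pct": 33.0},
    {"points_per_game": 64.0, "rebounds_per_game": 30.0, "three_point_pct": 32.0},
)

print(strong_pair, weak_pair)
```

Both matchups produce the same feature vector, which is the offset removal described above. It also halves the feature count, from 70 to 35.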

The model is built as an ensemble of 100 logistic regressors, all trained on different portions of the data to help fight overfitting. The logistic regressor is a fairly simple model, but ubiquitous because it works well for binary classification problems. It’s also fairly easy to understand, which adds transparency to modeling efforts. Rather than fitting a line, as in linear regression, we fit the logistic function. This is a sigmoid of the form y=1/(1+e^(-z)), where y is our class prediction (the estimated probability of class 1) and z is the linear input function based on our input vector x: z=mx+b. We can see that the logistic function will asymptote to 0 as z approaches negative infinity and to 1 as z approaches infinity. The wins and losses of the higher ranked team are encoded as 1 and 0, respectively, for labels in our training and for the final predictions.
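To make the shape of the curve concrete, here is a small sketch of the standard logistic sigmoid, 1/(1+e^(-z)), applied to a linear input. The weights `m` and offset `b` are placeholder values, not fitted coefficients from our model.

```python
import math

def sigmoid(z):
    # Standard logistic function, written to avoid overflow for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, m, b):
    # z = m . x + b, then squash into the (0, 1) range
    z = sum(mi * xi for mi, xi in zip(m, x)) + b
    return sigmoid(z)

# Placeholder weights and a placeholder difference vector
m, b = [0.4, 0.1], 0.0
print(predict_proba([2.0, 1.5], m, b))   # above 0.5: favors the higher ranked team
print(predict_proba([-2.0, -1.5], m, b)) # below 0.5: favors the upset
```

A probability of at least 0.5 maps to class 1 (the higher ranked team wins) and anything below maps to class 0.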

To help the model generalize well, each logistic regressor was fitted on a bootstrapped sample of the data, that is, a sample drawn randomly with replacement. Each model saw 70% of the entire dataset, and the final class prediction was taken to be the most predicted class. For most games the selection was unanimous, but for about 10% there was slight deviation, and a few were split nearly 50/50, indicating a particularly uncertain prediction.
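The ensemble idea can be sketched with nothing but the standard library. This is a minimal illustration, not the code we ran: the gradient-descent fit, the one-feature toy data, and the reading of “70%” as a bootstrap sample 70% the size of the dataset are all assumptions.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=200):
    # Plain batch gradient descent on the log loss
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def fit_ensemble(X, y, n_models=100, sample_frac=0.7, seed=0):
    rng = random.Random(seed)
    n_draw = int(sample_frac * len(X))
    models = []
    for _ in range(n_models):
        # Bootstrap: draw row indices with replacement
        idx = [rng.randrange(len(X)) for _ in range(n_draw)]
        models.append(fit_logistic([X[i] for i in idx], [y[i] for i in idx]))
    return models

def vote_fraction(models, x):
    # Share of models predicting class 1 (higher ranked team wins)
    votes = sum(
        sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 for w, b in models
    )
    return votes / len(models)

def predict(models, x):
    return 1 if vote_fraction(models, x) >= 0.5 else 0

# Toy data: a single difference feature; positive values mean the favorite won
X = [[d] for d in (-3.0, -2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 3.0)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
models = fit_ensemble(X, y)
print(predict(models, [2.5]), vote_fraction(models, [2.5]))
```

A vote fraction close to 0.5 flags the kind of uncertain game mentioned above, while values near 0 or 1 correspond to near-unanimous picks.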

**Bryan’s Model**

Recent successes in many sports have been achieved by modeling player performance rather than team performance. A popular approach, such as the CARMELO method used by 538, is to find players with similar statistics at the same point in their careers and use their subsequent statistics to make a prediction. Applying this approach to college basketball is more challenging than applying it to professional sports, though. The season is shorter, and the longest a playing career can be is four years, with many of the best players playing only one season. Instead of predicting the career arc of any one player, we predict the number of points a player will score against a given team.

This estimate is a weighted sum made up of two parts:

- How similar players did against the opponent this year
- How the player did against similar opponents

## Part 1: How Similar Players Fared

To find similar players, we compare players at the same position (Guard, Center, or Forward), based on the mean and standard deviation of 40 different statistics over all regular season and conference tournament games. These include common box score statistics such as the number of three-pointers made and attempted and the number of steals and turnovers. They also include “percentage” statistics, which record the percentage of a particular statistic the player contributed in a game for their team. Offensive and defensive ratings per game are also recorded. sports-reference.com defines these statistics as an “estimate of […] points produced per 100 possessions” and an “estimate of points allowed per 100 possessions”, respectively. All statistics were min-max normalized.

The five most similar players who played against the upcoming opponent are found, using the Euclidean distance. The average number of points they scored against the team is used as the prediction of the number of points the player we are interested in will score.
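The similarity search can be sketched as follows. The players, the three-number feature vectors, and the helper names are hypothetical; the real comparison uses the mean and standard deviation of 40 statistics and keeps the five nearest players.

```python
import math

def min_max_normalize(rows):
    # Scale every column to the [0, 1] range
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has zero span; use 1.0 to avoid dividing by zero
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]

def nearest_players(target, candidates, k=5):
    # Rank candidates by Euclidean distance to the target player
    ranked = sorted((math.dist(target, feats), name) for name, feats in candidates.items())
    return ranked[:k]

# Hypothetical features: mean points, mean assists, mean minutes
raw = {
    "target": [11.4, 5.2, 34.9],
    "player_a": [10.0, 5.0, 33.0],
    "player_b": [20.0, 1.0, 25.0],
    "player_c": [12.0, 4.5, 35.0],
}
names = list(raw)
scaled = dict(zip(names, min_max_normalize([raw[n] for n in names])))
target = scaled.pop("target")
closest = nearest_players(target, scaled, k=2)
print(closest)
```

Normalizing first matters: without it, statistics with large raw values, such as minutes played, would dominate the distance.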

Let’s look at UMBC player KJ Maura’s 2018 performance against Virginia as an example. Full disclosure: Bryan is a proud UMBC alum.

Position | Mean Assist Percentage | Mean Assists | Mean Block Percentage | Mean Field Goals | Mean Free Throw Attempts Rate | Mean Free Throw Attempts | Mean Minutes | Mean Points | Std Field Goals | Std Free Throw Attempts | Std Points |
---|---|---|---|---|---|---|---|---|---|---|---|
Guard | 27.3226 | 5.22581 | 0 | 3.77419 | 0.36123 | 2.19355 | 34.871 | 11.3871 | 2.15576 | 3.01038 | 6.12469 |

With these numbers, we can find the five most similar players who faced Virginia this season.

Player | School | Distance | Points against UVA | Average Pts against UVA |
---|---|---|---|---|
Braxton Beverly | NC State | 0.662 | 4 | 4 |
Shelton Mitchell | Clemson | 0.716 | 0, 18 | 9 |
Zach Sellers | Savannah State | 0.748 | 3 | 3 |
Brandon Childress | Wake Forest | 0.750 | 10 | 10 |
Quentin Snider | Louisville | 0.750 | 3, 13, 5 | 7 |

Our prediction using similar players is 6.6 points against Virginia.
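The 6.6 figure can be reproduced from the table with a few lines. Note that each player’s games are averaged first, and then the five player averages are averaged; the variable names here are our own.

```python
from statistics import mean

# Points scored against Virginia, taken from the table above
points_vs_uva = {
    "Braxton Beverly": [4],
    "Shelton Mitchell": [0, 18],
    "Zach Sellers": [3],
    "Brandon Childress": [10],
    "Quentin Snider": [3, 13, 5],
}

# Average per player first, so players with several games do not count extra
per_player_avg = {name: mean(pts) for name, pts in points_vs_uva.items()}
similar_player_prediction = mean(list(per_player_avg.values()))
print(similar_player_prediction)
```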

## Part 2: Players’ performance against similar teams

Sometimes there aren’t that many similar players in a given season. In this case, we can look at a player’s performance against similar teams. We use the same statistics as before, but now aggregated into per-game totals for each team. Because a player faces far fewer teams during the season than there are players to compare against, we look at only the three teams most similar to a given opponent.

Some of Virginia’s stats for the 2018 season:

Mean Assist Percentage | Mean Assists | Mean Block Percentage | Mean Field Goals | Mean Free Throw Attempts Rate | Mean Free Throw Attempts | Mean Points | Std Field Goals | Std Free Throw Attempts | Std Points |
---|---|---|---|---|---|---|---|---|---|
112.2 | 13.727 | 27.439 | 25 | 2.927 | 13.424 | 67.546 | 3.873 | 6.237 | 8.885 |

Using these statistics, the most similar teams KJ Maura faced in the 2018 season were Binghamton, Colgate, and Hartford.

Opponent | Distance | Points Against |
---|---|---|
Binghamton | 1.658 | 16, 4 |
Hartford | 1.699 | 25, 18, 3 |
Colgate | 1.733 | 14 |

Maura’s average against these teams was 13.3 points per game.
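A quick check of this number from the table: the 13.3 figure matches pooling all six games together, rather than averaging each opponent separately first. The variable names are our own.

```python
from statistics import mean

# KJ Maura's points against the three most similar teams, from the table above
points_vs_similar_teams = {
    "Binghamton": [16, 4],
    "Hartford": [25, 18, 3],
    "Colgate": [14],
}

# Pool every game into one list, then take a single average
all_games = [pts for games in points_vs_similar_teams.values() for pts in games]
similar_team_prediction = mean(all_games)
print(round(similar_team_prediction, 1))
```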

## Part 3: Weighting similarity scores

Notice how much smaller the similarity scores are for similar players than for similar teams. This is usually, but not always, the case. To account for this, we add the two predicted point outcomes, weighting them by their similarities.

The average similarity score for players is 0.7252, while the average similarity score for teams is 1.6966. This means that 72% of the weight should be on the prediction from similar players, while the remaining 28% should come from the team-based prediction.
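As a sketch of the blending step for KJ Maura, the snippet below recomputes the two average distances from the tables and then combines the Part 1 and Part 2 predictions. The 72/28 split is taken as given from the text rather than derived in code, and the blended value it prints is our own arithmetic, not a figure quoted elsewhere in this post.

```python
from statistics import mean

# Distances from the two tables above
player_distances = [0.662, 0.716, 0.748, 0.750, 0.750]
team_distances = [1.658, 1.699, 1.733]
avg_player_distance = mean(player_distances)
avg_team_distance = mean(team_distances)

def blend(player_pred, team_pred, player_weight):
    # Weighted sum of the two predictions; the weights add up to 1
    return player_weight * player_pred + (1.0 - player_weight) * team_pred

# 6.6 and 13.3 are the Part 1 and Part 2 predictions; 0.72 is the stated weight
maura_prediction = blend(6.6, 13.3, player_weight=0.72)
print(avg_player_distance, avg_team_distance, maura_prediction)
```

Summing the blended predictions for every player on a roster gives the team’s predicted score.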

Repeating this exercise for all players on each team, we would predict a score of 72.768 for UMBC and 67.540 for Virginia. In the end, the game finished 74 for UMBC, 54 for Virginia.

**Next up**

In a follow-up post, we’ll evaluate these two models and talk about how Drew and Bryan selected the models they did. We’ll compare them to other models, and provide some further thoughts about how all the March Madness followers might do things differently next time.