Predicting the NBA MVP with Machine Learning
Thesis
Every season, basketball fans debate who deserves the MVP award, which is decided by a panel of voters at the end of the season. We built three machine learning models that attempt to answer that question using box score statistics.
Methodology
Each model is trained on every NBA season from 1974 to 2017. For each player season, it looks at nine statistics:
- Points, assists, blocks, defensive rebounds, and field goals per game: the core production numbers
- Win Shares (WS): an estimate of how many wins a player contributed to their team
- Value Over Replacement Player (VORP): how much better a player is than a replacement-level player
- Box Plus/Minus (BPM): a player's net impact per 100 possessions
- Usage Rate (USG%): what share of team plays run through that player
From those nine numbers, the model learns what a typical MVP season looks like versus a non-MVP season, then applies that knowledge to current players. Each model outputs an independent probability that a given player wins MVP, not a share of a single pool, so the values do not sum to 1. Think of it as each player's individual odds.
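To make the "independent probability" point concrete, here is a minimal Python sketch. The raw scores and player labels are invented for illustration and are not outputs of the models in this post. Each score is passed through a sigmoid on its own, so nothing forces the results to add up to 1. Dividing by the total is one optional way to turn them into shares of a single pool, which the models described here do not do.

```python
import math

def sigmoid(z: float) -> float:
    """Map a raw model score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores for three players (illustrative values only).
raw_scores = {"Player A": 1.2, "Player B": 0.4, "Player C": -2.0}

# Each probability is computed independently of the other players.
independent = {name: sigmoid(z) for name, z in raw_scores.items()}

# Optional post-processing: rescale so the values form shares of one pool.
total = sum(independent.values())
shares = {name: p / total for name, p in independent.items()}

for name in raw_scores:
    print(f"{name}: independent={independent[name]:.3f}, share={shares[name]:.3f}")
print(f"sum of independent probabilities = {total:.3f}")
```

With these made-up scores the independent values add up to well over 1, while the rescaled shares add up to exactly 1.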
Three Models, One Question
Rather than relying on a single approach, the system runs three different models and lets you compare:
Logistic Regression
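The simplest of the three. Each statistic gets a weight, and a player's score is the weighted sum of their stats, which is then converted to a probability. Geometrically, this amounts to drawing a straight line through the data. It's easy to interpret: a larger absolute coefficient means that stat matters more, provided the stats are on comparable scales.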
Win Shares (WS) is by far the most influential feature, with an absolute coefficient of ~1.85, nearly double the next most important feature. Box Plus/Minus (BPM) ranks second at ~1.0, followed by Defensive Rebounds per Game (DRBPG, ~0.85) and Assists per Game (ASTPG, ~0.70). VORP and Field Goals per Game (FGPG) contribute moderately at ~0.50. Blocks per Game (BLKPG), Points per Game (PTSPG), and Usage Rate (USG%) have minimal weight, all under 0.15.
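As a rough illustration of how such a weighted sum becomes a probability, here is a sketch in Python. The coefficient magnitudes are the approximate values quoted above. Everything else is an assumption: the signs (taken as positive, since only absolute values are given), the intercept, the idea that the inputs are standardized, and both example stat lines.

```python
import math

# Approximate coefficient magnitudes quoted in the text. Their signs are not
# given, so treating them all as positive is an assumption of this sketch.
# The three features with weights under 0.15 are left out for brevity.
WEIGHTS = {
    "WS": 1.85,
    "BPM": 1.00,
    "DRBPG": 0.85,
    "ASTPG": 0.70,
    "VORP": 0.50,
    "FGPG": 0.50,
}
INTERCEPT = -6.0  # hypothetical; makes an average player's probability tiny

def mvp_probability(z_scores: dict[str, float]) -> float:
    """Weighted sum of standardized stats, squashed by the logistic function."""
    linear = INTERCEPT + sum(WEIGHTS[k] * z_scores.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical stat lines, in standard deviations above the league mean.
average_player = {k: 0.0 for k in WEIGHTS}
star_player = {"WS": 3.0, "BPM": 2.5, "DRBPG": 1.0, "ASTPG": 1.5, "VORP": 2.5, "FGPG": 2.0}

print(f"average player: {mvp_probability(average_player):.4f}")
print(f"star player:    {mvp_probability(star_player):.4f}")
```

Because WS carries the largest weight, moving a player's WS by one standard deviation shifts the score more than the same move in any other stat.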
Random Forest
Builds hundreds of decision trees, each one asking a series of "is this stat above or below X?" questions, and averages their answers. It handles complex relationships between stats well and is less sensitive to any one unusual data point. Think of it as a large committee of simple rules voting together.
WS again dominates at ~0.31, accounting for roughly twice the importance of the next feature. VORP (~0.15) and BPM (~0.125) rank second and third. DRBPG (~0.10), PTSPG (~0.08), BLKPG (~0.07), FGPG (~0.065), and ASTPG (~0.06) contribute in a fairly tight mid-range band. USG% is the least important at ~0.05. Compared to logistic regression, the Random Forest spreads importance more evenly across features.
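The "committee of simple rules" picture can be sketched in a few lines of Python. This is a toy, not the model from this post: the rules, thresholds, and stat lines below are hand-written and hypothetical, whereas a real random forest learns much deeper trees from resampled training data.

```python
# Hand-written, hypothetical rules standing in for learned trees.
# Each rule is (stat name, threshold): it votes 1 if the stat exceeds it.
RULES = [("WS", 12.0), ("WS", 14.0), ("VORP", 6.0), ("BPM", 8.0), ("PTSPG", 27.0)]

def forest_score(player: dict[str, float]) -> float:
    """Average the 0/1 votes of every rule in the committee."""
    votes = [1 if player[stat] > threshold else 0 for stat, threshold in RULES]
    return sum(votes) / len(votes)

# Hypothetical stat lines, for illustration only.
candidate = {"WS": 15.0, "VORP": 7.5, "BPM": 9.0, "PTSPG": 26.0}
role_player = {"WS": 5.0, "VORP": 1.5, "BPM": 0.5, "PTSPG": 11.0}

print(forest_score(candidate))    # 4 of 5 rules vote yes
print(forest_score(role_player))  # no rule votes yes
```

Averaging many votes is what makes the output less sensitive to any single rule or data point.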
Gradient Boosting
Also uses decision trees, but builds them sequentially: each new tree focuses on correcting the mistakes the previous ones made.
This model is heavily concentrated on just two features: BPM (~0.47) and WS (~0.41) together account for roughly 88% of total feature importance. Four of the remaining features (PTSPG, VORP, ASTPG, and DRBPG) contribute ~0.02–0.03 each, while BLKPG, USG%, and FGPG are effectively unused (near zero). This suggests the gradient boosting model learned that BPM and WS alone are nearly sufficient to separate MVP candidates.
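The "each new tree corrects the previous ones" idea can be shown with a small from-scratch sketch in Python. It makes several simplifying assumptions that do not come from this post: a single input stat, one-split trees (stumps), squared-error residuals instead of the log-loss a classifier would normally use, and made-up toy data.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Add stumps one at a time, each fit to what the ensemble still gets wrong."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

# Hypothetical toy data: one stat (say BPM) and a 0/1 MVP label.
bpm = [1.0, 2.5, 4.0, 6.0, 8.5, 10.0, 11.5]
is_mvp = [0, 0, 0, 0, 1, 1, 1]
model = boost(bpm, is_mvp)
print([round(model(x), 2) for x in bpm])
```

Each round shrinks the remaining error by a constant factor here, so the predictions move steadily toward 0 for the non-MVP rows and 1 for the MVP rows.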
Historical Results
The models were trained on data through 2017, so every season from 2018 onward is a genuine out-of-sample test: the models have never seen these players or seasons before.
| Season | Actual MVP | LR | RF | GB |
|---|---|---|---|---|
| 2018 | James Harden | #2 | #2 | #1 ✓ |
| 2019 | Giannis Antetokounmpo | #1 ✓ | #1 ✓ | #1 ✓ |
| 2020 | Giannis Antetokounmpo | #1 ✓ | #1 ✓ | #1 ✓ |
| 2021 | Nikola Jokić | #1 ✓ | #1 ✓ | #1 ✓ |
| 2022 | Nikola Jokić | #1 ✓ | #1 ✓ | #1 ✓ |
| 2023 | Joel Embiid | #2 | #4 | #2 |
| 2024 | Nikola Jokić | #1 ✓ | #1 ✓ | #1 ✓ |
| 2025 | Shai Gilgeous-Alexander | #3 | #2 | #569 |

Top-1 accuracy: LR 5/8 · RF 5/8 · GB 6/8
Top-3 accuracy: LR 8/8 · RF 7/8 · GB 7/8
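The accuracy figures follow directly from the ranks each model gave the actual MVP. A short Python sketch of that arithmetic, with the ranks copied from the results above (the variable and function names are my own):

```python
# Rank each model gave the actual MVP, copied from the results above.
RANKS = {
    2018: {"LR": 2, "RF": 2, "GB": 1},
    2019: {"LR": 1, "RF": 1, "GB": 1},
    2020: {"LR": 1, "RF": 1, "GB": 1},
    2021: {"LR": 1, "RF": 1, "GB": 1},
    2022: {"LR": 1, "RF": 1, "GB": 1},
    2023: {"LR": 2, "RF": 4, "GB": 2},
    2024: {"LR": 1, "RF": 1, "GB": 1},
    2025: {"LR": 3, "RF": 2, "GB": 569},
}
MODELS = ("LR", "RF", "GB")

def top_k_hits(model: str, k: int) -> int:
    """Count seasons where the model ranked the actual MVP within its top k."""
    return sum(1 for ranks in RANKS.values() if ranks[model] <= k)

for k in (1, 3):
    summary = " · ".join(f"{m} {top_k_hits(m, k)}/{len(RANKS)}" for m in MODELS)
    print(f"Top-{k} accuracy: {summary}")
```

Running it reproduces the two accuracy lines quoted above.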
In five seasons (2019–2022 and 2024), all three models agreed on the same #1 pick and were right every time.
In 2023, every model ranked Nikola Jokić #1, and by the numbers, he arguably had the better season. Joel Embiid won the award anyway, the kind of outcome that may reflect voter narrative, voter fatigue, and team performance rather than pure statistics. In 2025, Gradient Boosting ranked Shai Gilgeous-Alexander outside the top 500, while Logistic Regression and Random Forest had him at #3 and #2 respectively. I have no idea why GB did this; it is likely a bug.
Future Direction
No model is perfect, and these have known blind spots. Team record is not included, even though MVP voters have historically punished players on losing teams regardless of individual stats. Injuries and narrative don't appear in a box score. And the training data skews toward an older era; the three-point revolution and the rise of players like SGA have introduced statistical profiles the 1970s–1990s data doesn't fully capture.
Current Season Predictions (2025–26)
| Rank | LR | RF | GB |
|---|---|---|---|
| #1 | Nikola Jokić | Shai Gilgeous-Alexander | Nikola Jokić |
| #2 | Shai Gilgeous-Alexander | Nikola Jokić | Victor Wembanyama |
| #3 | Victor Wembanyama | Victor Wembanyama | Giannis Antetokounmpo |
| #4 | Luka Dončić | Giannis Antetokounmpo | Kawhi Leonard |
| #5 | Jalen Johnson | Luka Dončić | Luka Dončić |

Two of the three models have Nikola Jokić as the frontrunner. Random Forest is the dissenter, putting Shai Gilgeous-Alexander ahead. Victor Wembanyama appears in all three top 3s in just his third season, which is notable. Before running the models, I expected him to be #1 for all of them, considering the way the models use advanced stats.
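For anyone who wants to compare the three lists programmatically, here is a small Python sketch that tallies the #1 picks and finds the players who make every model's top 3. The lists are copied from the predictions above; the tallying approach itself is just one simple way to summarize agreement, not part of the models.

```python
from collections import Counter

# Top-5 lists copied from the predictions above, best first.
TOP5 = {
    "LR": ["Nikola Jokić", "Shai Gilgeous-Alexander", "Victor Wembanyama",
           "Luka Dončić", "Jalen Johnson"],
    "RF": ["Shai Gilgeous-Alexander", "Nikola Jokić", "Victor Wembanyama",
           "Giannis Antetokounmpo", "Luka Dončić"],
    "GB": ["Nikola Jokić", "Victor Wembanyama", "Giannis Antetokounmpo",
           "Kawhi Leonard", "Luka Dončić"],
}

# How many models pick each player first.
first_place_votes = Counter(ranking[0] for ranking in TOP5.values())

# Players who appear in the top 3 of every model.
in_every_top3 = set.intersection(*(set(ranking[:3]) for ranking in TOP5.values()))

print(first_place_votes.most_common())
print(sorted(in_every_top3))
```

The tally matches the summary above: Jokić gets two of the three #1 picks, and he and Wembanyama are the only players in all three top 3s.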
Conclusion
Thank you for reading. I hope you found this interesting. Basketball Reference also has its own model if you would like to see a different result. Please do not gamble on my models!
13 votes -
Installing every* Firefox extension
59 votes -
US imports more from Taiwan than China for first time in decades
20 votes -
Rasmus Dahlin hasn't just improved his own game, he has dragged the entire team and altered the trajectory of the Buffalo Sabres franchise
3 votes -
When video games were brown
28 votes -
Opta removes all advanced statistical data from fbref.com
7 votes -
Luxury apartments reduced rent in some big US cities
22 votes -
Is vaping less harmful than smoking, and does it help people quit?
45 votes -
Why are so many pedestrians killed by cars in the US?
51 votes -
South Korea officially recognises same-sex couples in national census
32 votes -
Iceland to propose higher tourist tax following record-breaking number of visitors – 1.7 million international tourists in the first seven months of 2025
23 votes -
With nine goals in his first seven Premier League games, Erling Haaland has started the season on fire – in the early running for this season's Golden Boot
6 votes -
Norwegian striker Erling Haaland has taken his tally to 48 goals in 45 games for his country – what makes him such a phenomenal goalscorer?
5 votes -
The precarious "economy" of Fallout: New Vegas
23 votes -
What heritability actually means
14 votes -
How embryo selection exploits our flawed intuitions about risk
17 votes -
Question - how would you best explain how an LLM functions to someone who has never taken a statistics class?
My understanding of how large language models work is rooted in my knowledge of statistics. However, a significant number of people have never been to college, and statistics is a required course only for some degree programs.
How should ChatGPT and similar tools be explained to the public at large to avoid the worst problems that are emerging from widespread use?
37 votes -
From Brighton flop to hot property – Viktor Gyökeres is one of Europe's most prolific goalscorers, but can he do it at the very highest level?
2 votes -
The quiet revolutions that have prevented millions of cancer deaths
16 votes -
Farmers who don't farm: The curious rise of the zero-sales farmer (2017)
9 votes -
Swedish team Djurgårdens IF Fotboll are in the semifinals of a European competition for the first time in the club's 134-year history
4 votes -
Who will maintain Vim? A demo of Git Who
20 votes -
Erling Haaland becomes the fastest player to record 100 goal involvements (goals and assists) in the Premier League – also first to make it in fewer than 100 appearances
9 votes -
FK Bodø/Glimt beat Olympiacos FC over two legs in the last sixteen of the 2024-25 UEFA Europa League – set to face SS Lazio in the quarter-finals
6 votes -
Danish deposit system: 93% of bottles and cans are returned and of those, 99.7% recycled (translation in comment)
40 votes -
Overfitting to theories of overfitting
10 votes -
US voters were right about the economy. The data was wrong.
39 votes -
Why probability probably doesn't exist (but it's useful to act like it does)
11 votes -
More than a million people in the United States earn $500,000 or more
12 votes -
Global value of music copyright soars to $45.5bn, now worth more than cinema
11 votes -
Is the love song dying?
16 votes -
What’s behind the sudden surge in young Americans’ wealth?
21 votes -
Roads in Africa are among world’s deadliest despite few cars
9 votes -
New study shows that hurricanes lead to excess mortality long after the storm has passed
20 votes -
Weight loss drugs appear to be having an effect at the population level
24 votes -
Who migrates from developing countries?
15 votes -
Human drivers keep rear-ending Waymos
37 votes -
Eight basic rules for causal inference
9 votes -
Genomic prediction of IQ is modern snake oil
11 votes -
The rarest move in chess
5 votes -
Josh Gibson becomes MLB career and season batting leader as Negro Leagues statistics incorporated
13 votes -
The US maternal mortality crisis is a statistical illusion
31 votes -
Data show that the amount of sexual content in top films has sharply declined since 2000
33 votes -
Homicides are plummeting in most American cities
20 votes -
The Dunning-Kruger effect is autocorrelation
30 votes -
Fake grass, real injuries? Dissecting the NFL’s artificial turf debate.
14 votes -
Hugo voting data from Chengdu WorldCon raises suspicions of vote tampering and incorrect eligibility rulings
31 votes -
How a Kalman filter works, in pictures
17 votes -
Covid kills nearly 10,000 in a month as holidays fuel spread, World Health Organization says
63 votes