Fantasy NBA 2
from pprint import pprint
import numpy as np
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import scipy
from scipy.stats import expon, skewnorm, norm
import nba_api
from nba_api.stats.static import teams, players
from nba_api.stats.endpoints import shotchartdetail, playercareerstats, playergamelog
import ballDontLie
from ballDontLie.util.api_nba import find_player_id
from ballDontLie.util.fantasy import compute_fantasy_points
seasons_range = ['2018-19', '2017-18', '2016-17', '2015-16']
players_range = ['Anthony Davis', 'James Harden', 'Stephen Curry', 'Giannis Antetokounmpo', 'Karl-Anthony Towns',
'Nikola Jokic', 'Joel Embiid', 'Paul George', 'Kawhi Leonard', 'Damian Lillard', 'Jimmy Butler',
'LeBron James', "Bradley Beal"]
player_id_map = {a: find_player_id(a) for a in players_range}
For the various players and the various seasons, let's look at the distributions of some of their box stats
for player, player_id in player_id_map.items():
    fig, ax = plt.subplots(1,1)
    df = pd.read_csv('data/{}.csv'.format(player.replace(" ","")))
    df.hist(column=['FGM', 'FGA', 'FTM', 'FTA', "REB", 'AST',
                    'STL', 'BLK', "PTS"], ax=ax)
    fig.suptitle(player)
Judging by how these distributions roughly look across all the players, I'm assigning each stat a shape:
- AST: Skewed normal
- BLK: Exponential
- FGA: Normal
- FGM: Normal
- FTA: Skewed normal
- FTM: Skewed normal
- PTS: Normal
- REB: Skewed normal
- STL: Skewed normal
For all players, I'm going to model each box stat with the distribution listed above. Given the gamelog data (blue), fit the model to that data, generate some values with that model (orange), and compare them to the actual gamelog data.
Some comments:
For the "bigger" numbers like PTS, FGA, FGM, REB, the model distributions fit pretty well.
For the "smaller" numbers like BLK or STL (a player will usually have 0, 1, 2, 3, or maybe 4 of that stat), the values are more discrete than the "bigger" numbers. Points can land anywhere between 0 and 40, so the reported point totals behave more continuously since there is more variety.
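One way to make samples from a continuous model look like box-score counts is to round them and floor them at zero. The snippet below is only a sketch of that idea: the "observed" data is synthetic (drawn from a Poisson distribution as a stand-in for a STL-like stat), and the rounding step is an assumption of mine, not something the fits in this notebook do.

```python
import numpy as np
from scipy.stats import skewnorm

# Synthetic stand-in for a low-count stat such as STL (not real gamelog data)
rng = np.random.default_rng(0)
observed = rng.poisson(lam=1.5, size=300)

# Fit the continuous model, then draw one simulated 82-game season
params = skewnorm.fit(observed)
raw = skewnorm.rvs(*params, size=82, random_state=rng)

# Round to whole numbers and floor at zero so the samples resemble counts
counts = np.clip(np.rint(raw), 0, None).astype(int)
print(np.bincount(counts))
```

Rounding changes the sampled distribution slightly, so it is worth re-plotting the histogram against the gamelog data before trusting it.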
From earlier work with PyMC for Bayesian probability modeling, I could have used PyMC to sample parameters for each stat distribution rather than do a single fit. That would report a range of parameters for each stat distribution along with a sense of variation or uncertainty, but I don't think it's necessary to explore every distribution and parameter set that could fit each box stat; the fitting schemes in scipy seem to work well.
It's possible there are better models to fit some of the data - I can't say my brain-database of statistical models is extensive, so I just kinda perused through scipy.stats.
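If perusing scipy.stats by eye feels too loose, one option is to score a few candidate distributions on the same data. The sketch below uses the Akaike information criterion (AIC, lower is better) on synthetic right-skewed data; the `aic` helper and the data are illustrative assumptions, not part of this notebook or of ballDontLie.

```python
import numpy as np
from scipy.stats import norm, skewnorm, expon

# Synthetic right-skewed data standing in for one box stat (not real gamelogs)
rng = np.random.default_rng(1)
data = rng.gamma(shape=4.0, scale=2.0, size=400)

def aic(model, data):
    """AIC = 2k - 2 ln(L), using the maximum-likelihood fit from scipy."""
    params = model.fit(data)
    log_lik = np.sum(model.logpdf(data, *params))
    return 2 * len(params) - 2 * log_lik

candidates = {"norm": norm, "skewnorm": skewnorm, "expon": expon}
scores = {name: aic(model, data) for name, model in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

AIC only ranks the candidates you hand it, so a low score does not mean the fit is good in absolute terms; the overlaid histograms are still a useful check.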
Fitting a distribution helps formalize how much a player's game can vary (is he consistently a 20ppg player, or is he hot and cold between 10 and 30 ppg?). Furthermore, if a player is out (injured or some other reason), that implicitly gets captured by a gamelog of 0pts, 0reb, etc. This is definitely important in fantasy because some may value a more reliable/consistent player who will show up to 80/82 games over a glass weapon who could drop 50 points but will only play 40-50/82 games.
These distributions assume we can ignore: coaching changes, team roster changes, and maybe player development. For player development, a younger player between 2015-2019 will demonstrate huge variance in two ways - young players are inconsistent game-to-game, but young players can also develop rapidly season-by-season. At the very least, these distributions try to describe variance, which shows room where a young player could go off or bust on a given night. Factoring season-by-season improvement will be hard - one would need to try to forecast a player's future stats rather than draw samples from a "fixed" distribution based on previous stats
stat_model_map = {"AST": skewnorm, "BLK": expon, "FGA": norm, "FGM": norm,
"FTA": skewnorm, "FTM": skewnorm, "PTS": norm, "REB": skewnorm,
"STL": skewnorm}
for player, player_id in player_id_map.items():
    fig, axarray = plt.subplots(3,3)
    df = pd.read_csv('data/{}.csv'.format(player.replace(" ","")))
    for i, (stat, model) in enumerate(stat_model_map.items()):
        row = i // 3
        col = i % 3
        # Actual gamelog data
        axarray[row, col].hist(df[stat], alpha=0.3)
        axarray[row, col].set_title(stat)
        # Fit the model, then overlay values sampled from the fitted model
        params = model.fit(df[stat])
        axarray[row, col].hist(model.rvs(*params, size=len(df[stat])), alpha=0.3)
    fig.suptitle(player)
    fig.tight_layout()
At this point, for each player and box stat, we have a distribution that can describe their game-by-game performance. Maybe we can sample from this distribution 82 times (82 games per season) to get an idea of the fantasy points they'll yield (the fantasy points will depend on the league settings and how each league weights the box stats).
To simulate a season for a player, we will model the distribution for each box stat, and sample from it 82 times. This is our simulated season.
simulated_rows = []
for player, player_id in player_id_map.items():
    df = pd.read_csv('data/{}.csv'.format(player.replace(" ","")))
    simulated_player_log = {}
    for stat, model in stat_model_map.items():
        params = model.fit(df[stat])
        sample = model.rvs(*params, size=82)
        simulated_player_log[stat] = sample
    simulated_rows.append(pd.Series(data=simulated_player_log, name=player))
# One row per player; each cell holds that player's 82 simulated values for a stat
simulated_season = pd.DataFrame(simulated_rows)
In addition to getting an 82-list of ast, blk, fga, etc., we can compute an 82-list of fantasy points (point values will depend on the league, but the default args for compute_fantasy_points are pulled from ESPN head-to-head points league default categories).
simulated_season = compute_fantasy_points(simulated_season)
simulated_season
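For readers without ballDontLie installed, the idea behind a points league is just a weighted sum of box stats. The sketch below is not the library's compute_fantasy_points: the function name, the weights, and the toy gamelog are all hypothetical, and it works on a per-game frame (one row per game) rather than the one-row-per-player frame used above.

```python
import pandas as pd

# Hypothetical weights for illustration only; these are NOT the values used by
# ballDontLie's compute_fantasy_points or by any particular league.
EXAMPLE_WEIGHTS = {"PTS": 1.0, "REB": 1.0, "AST": 1.5, "STL": 2.0, "BLK": 2.0,
                   "FGM": 1.0, "FGA": -1.0, "FTM": 1.0, "FTA": -1.0}

def weighted_fantasy_points(game_log, weights=EXAMPLE_WEIGHTS):
    """Return per-game fantasy points as a weighted sum of box-stat columns."""
    cols = list(weights)
    # Multiply each stat column by its weight, then add across columns
    return game_log[cols].mul(pd.Series(weights)).sum(axis=1)

# Toy two-game log (made-up numbers)
toy_log = pd.DataFrame({"PTS": [30, 12], "REB": [10, 4], "AST": [5, 8],
                        "STL": [1, 0], "BLK": [2, 1], "FGM": [11, 5],
                        "FGA": [20, 13], "FTM": [6, 2], "FTA": [7, 2]})
fp = weighted_fantasy_points(toy_log)
print(fp.tolist())  # [43.5, 22.0]
```

Swapping in a different league's scoring is then a matter of passing a different `weights` dictionary.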
To make things simpler to read, we will compress the dataframe into totals for the entire season, including the total fantasy points for that season
simulated_totals = simulated_season.copy()
for col in simulated_totals.columns:
    # Each cell is an 82-game array; collapse it to a season total
    simulated_totals[col] = [sum(a) for a in simulated_totals[col]]
simulated_totals.sort_values('FP', ascending=False)
Generally speaking, this method is in line with many other fantasy predictions. James Harden, Anthony Davis, LeBron James, Karl-Anthony Towns, Steph Curry, Giannis, and Joel Embiid all top the list.
In this "simulation" our sample size was 82 to match a season. We could repeat this simulation multiple times (so 82 * n times). That effectively increases our sample size from 82 to much larger.
Sampling enough is always a question, so we'll address that by simulating multiple seasons. Discussion of the approach will follow later
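To get a feel for why one simulated season is not enough, it helps to look at how much a season total moves between repeats. The sketch below assumes a single stat that is normally distributed per game with made-up parameters (`mu`, `sigma`); under that assumption the season total has a standard deviation of sigma * sqrt(82).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma = 25.0, 8.0  # hypothetical per-game mean and standard deviation

def season_totals(n_seasons, games=82):
    """Simulate n_seasons seasons and return the total for each one."""
    samples = norm.rvs(mu, sigma, size=(n_seasons, games), random_state=rng)
    return samples.sum(axis=1)

totals = season_totals(1000)
expected_std = sigma * np.sqrt(82)  # spread of a sum of 82 independent draws
print(totals.mean(), totals.std(), expected_std)
```

Two players whose expected totals differ by less than that spread can easily swap ranks from one simulated season to the next, which is what repeating the simulation is meant to expose.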
def simulate_n_seasons(player_id_map, stat_model_map, n=5):
    # For a season, we just want the player, FP, and the rank
    # Initialize dictionary of dictionary of lists to store this information across "epochs"
    epoch_results = {}
    for player in player_id_map:
        epoch_results[player] = {'FP':[], 'rank':[]}
    for i in range(n):
        # Just copy-pasted code for convenience in a notebook
        # If this were a python script, I would probably put these functions in a module/library somewhere
        # Model the distribution of a player's box stats, simulate 82 times, compute fantasy points
        simulated_rows = []
        for player, player_id in player_id_map.items():
            df = pd.read_csv('data/{}.csv'.format(player.replace(" ","")))
            simulated_player_log = {}
            for stat, model in stat_model_map.items():
                params = model.fit(df[stat])
                sample = model.rvs(*params, size=82)
                simulated_player_log[stat] = sample
            simulated_rows.append(pd.Series(data=simulated_player_log, name=player))
        simulated_season = pd.DataFrame(simulated_rows)
        simulated_season = compute_fantasy_points(simulated_season)
        simulated_totals = simulated_season.copy()
        for col in simulated_totals.columns:
            simulated_totals[col] = [sum(a) for a in simulated_totals[col]]
        simulated_totals = simulated_totals.sort_values('FP', ascending=False)
        # Store the fantasy points and player rank (0-indexed, 0 is best) for that simulated season
        for player in player_id_map:
            epoch_results[player]['FP'].append(simulated_totals.loc[player, 'FP'])
            epoch_results[player]['rank'].append(simulated_totals.index.get_loc(player))
    return epoch_results

epoch_results = simulate_n_seasons(player_id_map, stat_model_map, n=10)
pprint(epoch_results)
To make things prettier, we can just summarize the player ranks over all the simulated seasons, providing us an estimated average rank and error
def summarize_epoch_results(epoch_results):
    summary_stats = {}
    for player in epoch_results:
        summary_stats[player] = {}
        # Mean and standard deviation of the player's rank across simulated seasons
        avg_rank = np.mean(epoch_results[player]['rank'])
        std_rank = np.std(epoch_results[player]['rank'])
        summary_stats[player]['rank'] = avg_rank
        summary_stats[player]['err'] = std_rank
    return summary_stats

summary_stats = summarize_epoch_results(epoch_results)
sorted(summary_stats.items(), key=lambda v: v[1]['rank'])
Room for improvement
- Is building a distribution from year 2015-onward a good idea?
- Pick better models to represent the distribution of a player's box stats?
- How do we account for player development? Forecasting player stats, not just modeling
- How do we account for roster/team changes?
- Can we account for hot streaks for a player?
- Is there a more robust way to deal with player injury rather than hoping for 0/0/0 in the gamelogs?
- Correlation between stats? If a player is on, they might end up playing better overall
- Can we try to time schedules? I.e. some NBA players will have 4-game weeks; can a fantasy manager use that, based on the fantasy schedule, to actually beat a given fantasy opponent?
- Is there a need to draft a player in reaction to other fantasy players' draft picks? This may depend on how specific your team roles have to be. If team roles are lax, then choose the best fantasy option. If you need to fill out a roster, then you have to start weighing your roster choices vs what opponents may end up drafting
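On the correlation question above, one possible starting point is to sample stats jointly instead of one at a time. The sketch below swaps the per-stat scipy fits for a multivariate normal, which preserves the covariance between stats but gives up the skewed and exponential shapes used earlier. The gamelog here is synthetic (PTS and AST both tied to made-up minutes played), purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic gamelog: PTS and AST both scale with minutes, so they are correlated
minutes = rng.normal(34, 5, size=300)
log = pd.DataFrame({
    "PTS": 0.7 * minutes + rng.normal(0, 4, size=300),
    "AST": 0.2 * minutes + rng.normal(0, 1.5, size=300),
})

# Estimate the joint mean and covariance, then draw one 82-game season jointly
mean = log.mean().to_numpy()
cov = log.cov().to_numpy()
season = pd.DataFrame(rng.multivariate_normal(mean, cov, size=82),
                      columns=log.columns)

print(log.corr().loc["PTS", "AST"], season.corr().loc["PTS", "AST"])
```

With independent per-stat sampling, a big PTS night and a big AST night are paired at random; joint sampling lets the "player is on" nights show up across stats at once.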