Fantasy basketball, building a roster around relative stats
Overview
Fantasy basketball comes in 2 formats: points leagues and category leagues. In points leagues, a player's counting stats combine to yield a fantasy points value (via scoring settings) that contributes to your fantasy team's success. In category leagues, a player's counting stats are considered separately: you count how many rebounds your team obtained, how many points, how many assists, etc.
In both formats, there are generally some roster criteria. For example, you can play 1 PG, 1 SG, 1 SF, 1 PF, 1 C, 1 G, 1 F, and 2-3 UTIL players each day. This only introduces a mild positional complexity to fantasy basketball. I say mild because, on any given day, only a few of your players will be playing, so you are unlikely to be constrained by the positions available to you. This is unlike fantasy football, where all your players play once a week, so positional constraints are strong. At the very least, however, there are some positional curveballs in fantasy basketball leagues. One such curveball: some leagues only allow up to 4 Cs per team.
In points leagues, evaluating players is somewhat 1-dimensional: who is projected to score more fantasy points as a function of their projected box score stats and league scoring settings?
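As a minimal sketch of that one-dimensional valuation, the snippet below collapses a projected stat line into a single fantasy points number. The scoring weights and the stat line are made up for illustration (they do not come from any particular platform or from the projections used later), so substitute your own league's settings.

```python
# Hypothetical scoring settings: fantasy points awarded per unit of each stat.
SCORING = {"PTS": 1.0, "TREB": 1.2, "AST": 1.5, "STL": 3.0, "BLK": 3.0, "TO": -1.0}


def fantasy_points(stat_line, scoring=SCORING):
    """Weighted sum of a projected stat line under the given scoring settings."""
    return sum(weight * stat_line.get(stat, 0.0) for stat, weight in scoring.items())


# Made-up per-game projection for a single player.
example_line = {"PTS": 25.0, "TREB": 10.0, "AST": 8.0, "STL": 1.0, "BLK": 1.0, "TO": 3.0}
example_points = fantasy_points(example_line)
```

Ranking players in a points league then amounts to sorting by this single number, which is why the rest of this post concentrates on the multi-dimensional category format.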
In category leagues, scoring is 9-dimensional (for 9 categories): you can try to optimize on all 9 dimensions, or you can deliberately concede ("punt") a couple of categories so that you can more easily succeed in the remaining ones.
This post will focus on category leagues, aiming to discuss a quantitative approach to punting and leveraging positional superiority.
Data exercise
NBA player projections are pulled from hashtagbasketball. In fact, for this exercise, you could pull NBA stats from anywhere you want and anyone you trust; we just need some box score stats per NBA player. These hashtagbasketball projections also include average draft position (ADP) from some common fantasy basketball platforms.
Data cleaning aside, we can see the top 10 players include two centers and a couple of wings, but mostly guards. We can also see a bunch of numbers for each box score stat, but they are hard to interpret and compare because each stat lives on a different scale:
- Is 6 AST a lot or a little?
- 10 REB seems like a lot for a center, but to what extent?
- How do we juxtapose that against 9 AST from a guard?
from IPython.display import display
import pandas as pd
from scipy.stats import zscore

STAT_COLS = ["FG%", "FT%", "3pm", "PTS", "TREB", "AST", "STL", "BLK", "TO"]
POSITIONS = ["PG", "SG", "SF", "PF", "C"]


def format_percentages(val):
    """ String formatting on the percentages columns """
    if "(" in val:
        # Keep only the number before the parenthesised part
        return float(val[0:val.index('(')])
    else:
        return float(val)


def encode_positions(val):
    """ Split comma-joined list of positions into separate columns """
    positions = {
        pos: False
        for pos in POSITIONS
    }
    for code in val.split(","):
        # strip() guards against stray whitespace around a position code
        positions[code.strip()] = True
    return pd.Series(positions)


def clean_df(df):
    """ Clean and format data """
    df["FG%"] = df["FG%"].apply(format_percentages)
    df["FT%"] = df["FT%"].apply(format_percentages)
    # One boolean column per position, aligned on the player index
    positions = df["POS"].apply(encode_positions)
    df = df.merge(positions, left_index=True, right_index=True)
    return df


df = (
    pd.read_csv("files/2023hashtagbasketballprojections.csv", index_col=2)
    .sort_values("ADP")
    .pipe(clean_df)
)
df.drop(columns=["PG", "SG", "SF", "PF", "C"]).head(10)
Z-scores to normalize and compare stats
The z-score is a statistical technique to scale data into more "common sense" ranges. Specifically, when you standardize a set of data, the average shifts to 0 and one standard deviation becomes +/-1. This is a useful tool to better understand how much better certain players are relative to other players.
Note: for the picky ones, applying the z-score does not require that your data follow a normal distribution. When you standardize data that do follow a normal distribution, you get the added benefit of percentile heuristics (a z-score of 2 sits at roughly the 98th percentile). NBA counting stats don't strongly follow a normal distribution, so we can't say that a player with an AST z-score of 2 is in the 98th percentile. If we did a deeper investigation into these distributions, we could identify these cumulative distribution properties, but that's for another time.
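For reference, the z-score of a value is (value - mean) / standard deviation. The sketch below computes it by hand on a made-up list of AST values (not real projections). It uses the population standard deviation, which matches the default ddof=0 of scipy.stats.zscore.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up assists-per-game numbers for eight hypothetical players.
toy_ast = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
toy_z = zscores(toy_ast)
```

In this toy pool the average is 5 AST with a standard deviation of 2, so the 9 AST player lands exactly two standard deviations above average.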
Continuing, we can take the example of Nikola Jokic. He has a REB z-score of 2.5, an AST z-score of 2.99, and a PTS z-score of 1.4. This corroborates the general consensus: not an outstanding scorer (though he still gets points), but a great passer and rebounder. On the other hand, Joel Embiid has a PTS z-score of 2.8 (we intuitively know Embiid is a stronger scorer than Jokic).
Looking at the top fantasy players, it becomes easy to pick out which players are crazy good in certain categories (roughly, those with z-scores > 2):
- AST: Jokic, Doncic, Haliburton, Ball, Harden, Young
- REB: Jokic, Embiid, Giannis, Davis
- 3PM: Tatum, Curry, Lillard, Ball, Mitchell
By doing this exercise, we can identify which players may be associated with certain punts and focuses for roster composition.
(
    df[STAT_COLS]
    .apply(zscore)
    .assign(
        ADP=df["ADP"],
        total_z=lambda df_: df_[STAT_COLS].sum(axis=1)
    )
    .head(50)
    [["ADP", *STAT_COLS, "total_z"]]
)
Choosing the set of data to compute z-scores
One key characteristic of the z-score is that it is always relative to the input data you provide. It becomes extremely versatile if you carefully choose what your input data are.
In the above example, we looked at the pool of all available players. This is certainly useful to understand the "landscape" of counting stats, but as top tier players get drafted, the remaining players all begin to appear muted and unimpressive with z-scores all close to 0.
We can remove players from our pool as they get drafted and re-tabulate z-scores to continually adjust and re-scale our data. For example, if we removed Jokic and Embiid from the dataset, the overall average REB goes down but still gets standardized to 0 (data below). Anyone who stands out relative to this new average REB will show up with z-scores above 1 or 2.
Admittedly, within a single category this re-scaling does nothing more than shift and stretch the numbers; it will not change the ultimate, qualitative trend of who produces more AST. In reality, you could do this z-score calculation over all players once, and work off that. Re-scaling simply helps to make numbers pop out more easily.
(
    df[STAT_COLS]
    .drop(index=["Nikola Jokic", "Joel Embiid"])
    .apply(zscore)
    .assign(
        ADP=df["ADP"],
        total_z=lambda df_: df_[STAT_COLS].sum(axis=1)
    )
    .head(50)
    [["ADP", *STAT_COLS, "total_z"]]
)
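The claim that re-scaling preserves the ordering within a category can be checked on toy data. This is only a sketch: the rebound numbers and player labels are placeholders, and zscore_series is a small helper written for this example rather than part of the analysis above.

```python
import pandas as pd


def zscore_series(s):
    """Population z-score (ddof=0), the same default scipy.stats.zscore uses."""
    return (s - s.mean()) / s.std(ddof=0)


# Made-up rebound projections; names are placeholders, not real players.
toy_reb = pd.Series(
    {
        "Player A": 12.5,
        "Player B": 11.0,
        "Player C": 8.0,
        "Player D": 6.5,
        "Player E": 4.0,
        "Player F": 3.0,
    }
)

z_all = zscore_series(toy_reb)
# Pretend the top two rebounders were drafted, then re-standardize the rest.
z_remaining = zscore_series(toy_reb.drop(index=["Player A", "Player B"]))

# Ordering among the remaining players is identical before and after re-scaling.
same_order = z_all.loc[z_remaining.index].rank().equals(z_remaining.rank())
```

Player C's z-score grows once the two best rebounders leave the pool, but nobody changes places. One thing to keep in mind: each category is re-scaled by a different amount, so a sum across categories (such as total_z) can still reorder players after the pool changes.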
Positional z-score comparisons
The other consideration in fantasy drafting is position. If you decided you want to build a team that focuses on REB and BLK and are willing to give up something like 3PM, you'd probably want to draft a bunch of Cs. However, due to positional constraints, you will inevitably have to draft some PGs/SGs.
We can, again, apply z-scores to compare players. In this situation, however, we restrict our input data to just the PGs (or another position). This way, we are asking ourselves, "among PGs, who gets rebounds really well?" Below, we have 5 tables comparing z-scores for each position.
Again, this ultimately doesn't change qualitative trends (we know Luka gets more rebounds than Steph), but in the world of PGs, we can see this gap is huge.
When searching for a good rebounding PG, you might observe that someone like Cade fits the role well, but he's somewhat far down in the draft (mid ADP). Unless you are extremely confident that someone like Cade will outperform his projections or fit your team composition extremely well, it's not a good idea to "reach" and draft him super early as your rebounding PG; there are likely some better options out there that could still fit your build. That said, with an ADP of 43.7, it may be reasonable to pick him 5-10 spots early based on role fit.
positions_zscores = {}
for pos in POSITIONS:
    positions_zscores[pos] = (
        df.loc[df[pos], STAT_COLS]
        .apply(zscore)
        .assign(
            ADP=df.loc[df[pos], "ADP"],
            total_z=lambda df_: df_[STAT_COLS].sum(axis=1)
        )
    )

for pos, subdf in positions_zscores.items():
    subdf.index.rename(pos, inplace=True)
    display(subdf.head(20)[["ADP", *STAT_COLS, "total_z"]])
    print("--")
Holistically comparing players and re-ranking players according to punts
In the data tables presented so far, I've included a total_z column. Roughly speaking, if you wanted to find the most general "best" player, you'd want the player with the highest z-scores across the board. This can be evaluated simply by summing up a player's z-scores.
Revisiting the data, we can see that a higher total z-score does not directly translate to an earlier ADP. ADP will generally factor in things like player availability (injuries) and season outlook, for which our z-scores have not accounted.
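One caveat when summing z-scores: in standard category leagues, fewer turnovers is better, so a high TO z-score should arguably count against a player rather than for him. The total_z columns in this post sum the categories as-is. The sketch below, on made-up numbers with placeholder player labels, shows one possible adjustment that flips the sign of TO before summing. LOWER_IS_BETTER is a name introduced here for illustration.

```python
import pandas as pd

# Made-up per-game projections for three placeholder players.
toy = pd.DataFrame(
    {"PTS": [28.0, 20.0, 12.0], "AST": [9.0, 5.0, 2.0], "TO": [4.5, 2.5, 1.0]},
    index=["High usage", "Balanced", "Low usage"],
)

# Assumption: turnovers are the only category where a lower number is better.
LOWER_IS_BETTER = ["TO"]

# Population z-scores per column (ddof=0, like scipy.stats.zscore's default).
z = (toy - toy.mean()) / toy.std(ddof=0)
naive_total = z.sum(axis=1)

# Flip the sign so that fewer turnovers yields a higher score, then re-sum.
z_adjusted = z.copy()
z_adjusted[LOWER_IS_BETTER] = -z_adjusted[LOWER_IS_BETTER]
adjusted_total = z_adjusted.sum(axis=1)
```

With the flip, the high-usage player's total drops and the low-usage player's total rises, which can matter when comparing ball-dominant guards against low-turnover bigs.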
As a team is constructed through the draft, you may begin to identify (perhaps by using z-scores) certain categories you want to punt. If you want to commit to the punt, you can re-evaluate the remaining players solely based on the categories on which you're focusing. For example, the table below examines a scenario where we focus on PTS, REB, AST, STL, and FT%. In this case, we pretend the other 4 categories don't exist and only examine these 5: compute the z-scores and sum across these 5 categories. By neglecting certain categories, the new "best" players can shift around, and this may help prioritize players for the punt.
As a counterpoint, some leagues might be more amenable to drafting the best player available. In this situation, you do not draft based on your team build, but instead try to draft the best players to maximize the amount of "draft capital" you have throughout the season. Drafting the best player available means the player will likely have good value to both you and opposing league managers. The team you draft might not have any notably strong or weak categories, but you have a lot of valuable trade pieces with which you can then re-construct your team (if your league trades a lot).
If you draft based on your punt, your valuation of a player will noticeably differ from another manager's valuation of that player. For example, our PTS, REB, AST, STL, FT% focus says Trae Young is the 8th best player in the league even though his ADP is 23.5. Spending your first-round pick on Trae Young means you likely miss out on some players that would be highly valued by your competition (like Embiid or Giannis). In my experience, the fact that there's no objective valuation for a player is what makes fantasy basketball trades so interesting and complex.
FOCUS_COLS = ["PTS", "TREB", "AST", "STL", "FT%"]
(
    df[FOCUS_COLS]
    .apply(zscore)
    .assign(
        ADP=df["ADP"],
        total_z=lambda df_: df_[FOCUS_COLS].sum(axis=1),
        new_rank=lambda df_: df_["total_z"].rank(ascending=False)
    )
    .sort_values("total_z", ascending=False)
    .head(50)
    [["ADP", "new_rank", *FOCUS_COLS, "total_z"]]
)
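To make the gap between a punt-based valuation and the market's valuation explicit, one option is to rank players both ways and take the difference. This is a sketch on made-up numbers with placeholder player labels; adp_rank and rank_gap are column names introduced here, not columns from the projections file.

```python
import pandas as pd

# Made-up ADP and punt-build z-score totals for placeholder players.
toy = pd.DataFrame(
    {"ADP": [2.0, 5.5, 9.0, 23.5], "total_z": [6.1, 7.4, 3.0, 5.2]},
    index=["Player A", "Player B", "Player C", "Player D"],
)

ranked = toy.assign(
    # 1 = drafted earliest by the market
    adp_rank=lambda df_: df_["ADP"].rank(),
    # 1 = most valuable for this particular build
    new_rank=lambda df_: df_["total_z"].rank(ascending=False),
    # Positive gap: the build values the player more than the market does
    rank_gap=lambda df_: df_["adp_rank"] - df_["new_rank"],
)
```

A large positive rank_gap flags a player your build likes far more than other managers do. Per the counterpoint above, that is a reason to wait on him rather than spend an early pick, since he will likely still be available later.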
A framework or starting point for evaluating trades
Even though I just said there's no objective valuation for fantasy basketball players, we can still try to establish some principles when it comes to conducting fair trades.
When trading, people quickly react with "X got fleeced". However, if it was so easy to identify X got fleeced, why would X have made the trade in the first place? It ultimately comes down to the values each player has on the new team.
For example, having a notoriously inefficient shooter like Fred VanVleet is bad if you need to keep a high FG%, but great if you're looking for AST and STL. If you're looking to add FVV to your AST/STL focus, his "value" comes from how much he could raise your team's AST/STL z-scores. If you don't care about FG%, you don't need to look into his abysmal FG% z-score.
Conversely, if you're trying to build a reasonable trade offer for FVV, it helps to understand the team composition of your trade counterpart. If your trade counterpart is looking to optimize on something like BLK and PTS, then examine who on your roster can add as many BLK/PTS z-scores as FVV adds AST/STL z-scores. In summary, valuations of players for trades come down to the dimensions/categories that are relevant for you (and then any personal opinions/availability outlook/player news that could affect projections).
Or, yes, you could try to fleece your opponent and trade a player with "value 10" for an unequivocally better player with "value 15", but this is probably not the best way to conduct fantasy basketball trades. If you are successful, then this is certainly one way to try and increase the "net worth" of your team without any consideration for winning categories.
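The idea of valuing a trade only in the categories each side cares about can be sketched as follows. Everything here is illustrative: the z-scores are made up, the player labels are placeholders, and trade_value is a hypothetical helper rather than part of the dashboard.

```python
import pandas as pd

# Made-up z-scores for placeholder players; not real projections.
z = pd.DataFrame(
    {
        "AST": [2.0, -0.5, 0.2],
        "STL": [1.5, -0.2, 0.1],
        "BLK": [-0.8, 2.2, 0.3],
        "PTS": [0.4, 1.1, 0.2],
        "FG%": [-1.8, 0.9, 0.0],
    },
    index=["Guard X", "Big Y", "Wing Z"],
)


def trade_value(z_table, incoming, outgoing, focus_cols):
    """Net z-score change, counted only in the categories a team cares about."""
    gained = z_table.loc[incoming, focus_cols].sum().sum()
    lost = z_table.loc[outgoing, focus_cols].sum().sum()
    return gained - lost


# Team 1 focuses on AST/STL and receives Guard X in exchange for Big Y.
team1_delta = trade_value(z, ["Guard X"], ["Big Y"], ["AST", "STL"])
# Team 2 focuses on BLK/PTS and receives Big Y in exchange for Guard X.
team2_delta = trade_value(z, ["Big Y"], ["Guard X"], ["BLK", "PTS"])
```

In this toy case both deltas are positive, so each manager comes out ahead by their own valuation even though neither player is "worth more" in any objective sense.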
Building an interactive tool
I've updated the Streamlit fantasy basketball dashboard to apply these ideas (GitHub repo here).