How to visualize hidden relationships in data with Python — analysing NBA assists

By JP Hwang

Manipulating & visualising data with interactive shot, bubble & Sankey charts for insights with Plotly (code & data in my GitLab repo)

A very basic goal of data analysis is to understand relationships in data. This is true at every level of analysis, from a simple Excel plot to machine learning work by data scientists. Whether it is in predicting a value (regression) or identifying a type (classification), the goal is to discover the nature of relationships between inputs and outputs.

As some of you know, I have been tinkering with basketball data analysis. One question that I’ve been investigating is what a particular statistic, the ‘assist’, indicates about a team in general. In this article, I share some of the analysis and outputs I used to gain insights into the relationship between assists and other indicators of performance.

Even though this example is specific to basketball, the general process and visualisations should be applicable to other fields and their datasets. Basketball is discussed only in the context of interpreting results, so don’t let a lack of domain knowledge worry you too much.

I included the code for this in my GitLab repo here (basketball_assists directory), so please feel free to download it and play with it / improve upon it.

Before we get started

I include the code and data in my repo, so you should be able to easily follow along by downloading/cloning the repo if you wish.

Packages

I assume you’re familiar with Python. Even if you’re relatively new, this tutorial shouldn’t be too tricky, though.

You’ll need pandas, plotly and statsmodels. Install each (in your virtual environment) with a simple pip install [PACKAGE_NAME].

Question: are assists… helpful?

In basketball, an ‘assist’ is credited to a player if their pass “leads to a score by field goal” by a teammate. By definition, therefore, an assist only captures those occasions when the teammate actually scores.

So, the question that I want to answer is this: are assists good?

Do assists capture how often a team gets good opportunities to score? What does the data tell us about it? (Note: some modern statistics have started to capture ‘potential’ assists for just this purpose, but we will ignore them for this discussion.)

Let’s get started. I began by reviewing how shot accuracy ties to assists. Before we get into it, maybe a little bit of recap / background would be handy.

Recap

In a previous article, I discussed generating shot charts using Plotly. In it, we generated charts that utilise coloured hexagons on the court to display shot locations, frequency and accuracy.

In this plot, marker location shows shot location, size corresponds to frequency, and colour to shot accuracy from that zone (the court is divided into blocks, or ‘zones’, to smooth out variance).

I have made minor changes, but include the code and underlying functions here so you won’t have to recreate them. (As always, I also include all data files used in this article.)

Could we similarly map assist rates on the floor?

Mapping assist rates

Thankfully, this can be done easily while recycling most of the code and replacing shot accuracy with assisted percentages, calculated as assisted_shots / total_made_shots.
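As a rough sketch of that calculation (not the repo's code), the ratio can be computed per zone with a pandas groupby. The tiny dataframe below is made up, and the zone and assisted column names are my own assumptions for illustration.

```python
import pandas as pd

# Made-up toy shots; "zone" and "assisted" are hypothetical column names
shots = pd.DataFrame({
    "zone": ["corner 3", "corner 3", "rim", "rim", "rim", "rim"],
    "shot_made": [1, 1, 1, 1, 0, 1],
    "assisted": [1, 1, 0, 1, 0, 0],
})

# Only made shots can carry an assist, so filter to makes first
makes = shots[shots["shot_made"] == 1]

# The mean of a 0/1 flag is assisted_shots / total_made_shots for each zone
assist_pct = makes.groupby("zone")["assisted"].mean()
print(assist_pct)
```

Filtering to made shots before grouping matters: dividing by all attempts would mix shot accuracy into the assist rate.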

I have pre-calculated it, so you can simply load the data and run the following:

Which generates:

NBA shot charts — coloured by assist rate

That’s interesting! Assist rates are the highest at the corners (at about 95%), and pretty low around the rim, especially as we start to get away from the basket, where it gets down to about 40%.

But this doesn’t really indicate correlation. In fact, a side by side observation would say that these rates are inversely correlated (take a look below). The areas with highest accuracies in fact have the lowest of the plotted assist rates. Is that causal, though? Are assists bad for the team?

Does this indicate inverse correlation? Or are we ignoring other factors?

This doesn’t seem to follow, intuitively or logically. The visuals are misleading. The reason is that these accuracies are affected by shot location more than by assist rates. What if, instead of looking at league-wide data, we looked at data from individual teams, to see what impact a change to assist rate has on a team’s shot accuracy?

Scatter plots to understand accuracy vs. assist rate

For this analysis, we need data from individual teams. Basketball-reference.com makes this incredibly easy with their collection of team, player and game statistics. As we will be dealing with shot data from the 18–19 NBA season, I collected the team data from basketball-reference here.

2018/2019 team per-100 stats (basketball-reference.com)

In using statistics, it’s important to use the right ones for the given goal or context. I used the per-100 possession stats here to normalise for the effects of a team’s pace, which could potentially skew our data a little (although not really, as we are looking at ratios).

Scatter plots are simple, and in my opinion still one of the best ways to get a good overview of data and any correlation. So let’s do just that, plotting accuracy (field goal %) vs assists per 100 (which captures the % of possessions ending with an assist).

As a reminder, we use the scatter function from Plotly Express, where the syntax is basically to pass a dataframe, and the various parameters are passed on as strings matching names of columns in that dataframe.

So px.scatter(per100_df, x='AST', y='FG%') lets Plotly know that the X data should come from the 'AST' column and Y from the 'FG%' column. The rest are pretty self-explanatory.

We also add a simple linear trendline here using Plotly (read more about it here). That’s what the statsmodels package was for.

Visualising correlations between accuracy & assists

Wow, that looks like a pretty well-correlated plot between assists/100 and FG%. This intuitively looks sensible, too. Because a ball moves faster than defenders do, good passes are better able to find a player who is not closely defended at the time. So, it makes sense that a team that records more assists is providing more good opportunities.

(If you’re wondering about the colours indicating points/100 stat, it’s due to the outliers (HOU/IND) shooting more/fewer 3s than others.)

What if we delved further into details, looking at the data, divided into different areas in the court?

Subdividing the data into zones

Remember how we said the court has been divided into areas? For this analysis, we’re using simplified areas, which are mostly based on distance, ignoring directions.

So, the court is divided into seven areas, or zones, like so:

Shot zones as defined for my analysis

By assigning each shot to one of these zones, the same analysis as above can be carried out to see what the impact of assists is in each zone. As the team data does not readily provide this breakdown, we will manipulate the dataframe and create it ourselves.

Starting with the database of individual shots, where I have already encoded each shot with a zone, we create a dataframe where each row encodes information from one zone, per team.

Here, I loop over each team name, collate that team's data by zone, and then loop over the zone names to create one entry per team and zone. Each entry is saved as a dictionary object into a list, which is then converted to a dataframe. (If you’re not sure what’s going on, here’s pandas’ documentation on the .groupby method and the pandas.DataFrame function.)
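The loop described above might look roughly like the sketch below. It is not the repo's code: the shots are made up, the assisted flag and shots_count column are my own assumptions, and I am assuming shots_pct means the percentage of shots made.

```python
import pandas as pd

# Made-up shots for two hypothetical teams; "assisted" is an assumed 0/1 flag
shots_df = pd.DataFrame({
    "team": ["AAA"] * 4 + ["BBB"] * 4,
    "shot_zone": ["rim", "rim", "corner 3", "corner 3"] * 2,
    "shot_made": [1, 0, 1, 1, 1, 1, 0, 1],
    "assisted": [1, 0, 1, 0, 0, 1, 0, 1],
})

summary_rows = []
for teamname, team_df in shots_df.groupby("team"):
    for zone_name, zone_df in team_df.groupby("shot_zone"):
        makes_df = zone_df[zone_df["shot_made"] == 1]
        # One dictionary per (team, zone) pair
        summary_rows.append({
            "teamname": teamname,
            "zone_name": zone_name,
            "shots_count": len(zone_df),
            "shots_pct": 100 * zone_df["shot_made"].mean(),
            # Guard against zones where the team made no shots at all
            "assist_pct": 100 * makes_df["assisted"].mean() if len(makes_df) else float("nan"),
        })

# The list of dictionaries becomes one row per team per zone
flat_summary_df = pd.DataFrame(summary_rows)
print(flat_summary_df)
```

A single groupby over both columns with an aggregation would be more compact, but the explicit loop mirrors the description and is easier to step through.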

This data is now ready to be plotted. Here’s one that I prepared earlier, using: fig = px.scatter(flat_summary_df, x='assist_pct', y='shots_pct', color='zone_name', hover_name='teamname') and a little bit of formatting.

Shot accuracy vs assists, by zones.

Each set of coloured markers includes 30 points, one for each team. Take a look for yourself. Many of these groups of markers do not show much correlation, but some clearly do. A few other things stand out:

The red and green traces for what we now call the ‘midrange’ shots show that they have the lowest assist percentages. This may be due to the fact that they are relatively undesirable, so offences are not designed to go here, and thus they are usually fallback options.

Corner 3s in purple are one of the most desirable shots, so it makes sense that they have high assist percentages.

The bottom set of 30+ feet shots are quite spread apart — but considering that they are from the longest distance, it does not appear too out of line. We would expect a higher variance in outcomes due to the low-percentage nature of these shots, and some of these teams are on average going to be chucking more often from a long distance.

But the shots from the interior show pretty good linearity, as do possibly corner 3s. To get a better look, let’s actually just separate them all out, and look at them individually.

Additionally, we will colour them by team name, and introduce sizing based on shot frequency from each zone. This way, we will also be able to see how often a team is shooting from each area.

You are (hopefully) familiar with Plotly Express’ scatter function by now. This is a little different due to using the facet feature, and making modifications, so let’s step through them.

The facet_col parameter specifies which categorical variable should be used to separate out these graphs, and facet_col_wrap sets the maximum number of subplots in a row before wrapping onto a new row.

Typically, you would use the same axis range for subplots in these graphs to allow for fairer comparisons between graphs, as changing scales can lead to visual misconceptions.

However, we’ve already looked at the data as a whole above. Also, I found that the scaling obscured understanding of the data by bunching them up.

The update_xaxes and update_yaxes methods are thus introduced to unlink the axes. As we do that, I want to make sure that each axis is displayed on the graph to, again, avoid misconceptions. The resulting graph is shown below (have a play with the graphs, but keep in mind that you should be careful placing graphs side by side where axes are NOT shared but could be misconstrued as shared):

Shot accuracy vs assists, by zones in subplots.

Clearly, some zones show better correlations than others between assist percentages and shot accuracy. This probably shows a couple of things:

One is that assists can help create good shots, and the second is that as we get further from the basket, the shooters’ abilities become larger factors.

Around the rim, simply creating good opportunities makes it easy for players to get easy baskets in forms of layups and dunks. The effect decreases with distance, but we can see the impact of increased assist rate up to corner 3s.

By the time we get to regulation 3s, though, the effect is smallish, and the long 3s distribution looks like a classic random distribution shown here with zero correlation.

Wikipedia’s primer on distribution shapes & correlation coefficients.
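To put a number on "better" or "worse" correlation, one option is to compute a Pearson correlation coefficient per zone. The sketch below does this on made-up values chosen so that one zone is perfectly linear and the other has zero correlation; the column names follow the earlier scatter call, and the figures are not real results.

```python
import pandas as pd

# Made-up values: "rim" is perfectly linear, "long 3" has no linear trend
flat_summary_df = pd.DataFrame({
    "zone_name": ["rim"] * 3 + ["long 3"] * 3,
    "assist_pct": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "shots_pct": [2.0, 4.0, 6.0, 5.0, 1.0, 5.0],
})

# Pearson's r between assist rate and accuracy, computed zone by zone
r_by_zone = pd.Series({
    zone_name: zone_df["assist_pct"].corr(zone_df["shots_pct"])
    for zone_name, zone_df in flat_summary_df.groupby("zone_name")
})
print(r_by_zone)
```

With 30 teams per zone, a coefficient like this is still a coarse summary, so it is best read alongside the scatter plots rather than instead of them.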

Can we go back, and take a look at this data on basketball courts? Yes we can.

Correlation by shot charts

Remember that we could not gain a great idea of correlation between shot accuracy and assists above. That was because we were simply observing absolute values, rather than the effect of changes to one against another.

What if we plotted relative accuracies against relative assist rates on the court for a few teams, against league average values?

In my previous article, we saw how relative accuracies could be plotted by subtracting the base values (league data) from the team values. Here, we simply do the same thing with ‘ass_perc_by_hex’. I omit the code here to avoid repetition; see my git repo if you’d like to see it in full.
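The subtraction itself is simple. As a sketch with invented numbers and hypothetical variable names (the repo works per hexagon rather than per named zone), pandas aligns the two series on their index before subtracting:

```python
import pandas as pd

# Invented assist percentages, indexed by zone
league_assist_pct = pd.Series({"rim": 50.0, "midrange": 35.0, "corner 3": 95.0})
team_assist_pct = pd.Series({"rim": 58.0, "midrange": 30.0, "corner 3": 96.5})

# Positive values mean the team is assisted more often than the league average
relative_assist_pct = team_assist_pct - league_assist_pct
print(relative_assist_pct)
```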

Plotting accuracy charts against assist rates, as relative values against league average, we generate the following graphs for the Warriors & Cavaliers:

Relative accuracy vs assist rate, GSW 18–19
Relative accuracy vs assist rate, CLE 18–19

Take a look at the interior stats, and the corners, where we saw above that the correlation was relatively good. It is still true here.

Further out from the rim, the role of randomness is higher, as is the impact of players’ ability, more than being given the ball at the right spots and right circumstances. So the correlation is weaker, although it still appears to exist.

(For this to be more robust for outside shots, the stats would have to be normalised for players’ abilities somehow.)

I would like to finish off by visualising where these assists come from. We’ll delve into what some (like Plotly) call a parallel categories plot, which is similar to flow plots, or Sankey plots.

ParCat / Sankey plots

These plots were initially designed to show flow, such as of liquids or materials. Accordingly, they are great at showing relationships. Every basket is related to the player shooting the ball, a player assisting on the play (sometimes there is none, though), and the location of the shot.

It is easy to generate a parallel category plot of made shots with Plotly:

import plotly.express as px

teamname = 'HOU'
team_df = shots_df[shots_df.team == teamname]

# Simple ParCat plot of made shots only
fig = px.parallel_categories(
    team_df[team_df.shot_made == 1],
    dimensions=['player', 'shot_zone', 'assist'],
    color_continuous_scale=px.colors.sequential.Inferno,
    labels={'player': 'Shooter', 'shot_zone': 'Shot location', 'assist': 'Assist (if any)'},
)
fig.show()

My first parallel category plot

That looks impressive, but it’s hard to see what’s going on. Perhaps we should organise the left side & right side by counts, and colour the bars according to whether the shot was assisted.

To sort the dataframe by counts, we can use pandas’ pd.factorize function, extracting counts, sorting by counts, and then creating a new dataframe based on the row order.

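import numpy as np
import pandas as pd

# Encode each assister as an integer, count occurrences of each code,
# then stable-sort the rows by their assister's count (ascending)
i, r = pd.factorize(makes_df['assist'])
a = np.argsort(np.bincount(i)[i], kind='mergesort')
makes_df = makes_df.iloc[a]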
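To see what the factorize / bincount / argsort combination does, here is the same idea on a tiny made-up dataframe, with each intermediate step named. The single-letter assister labels are placeholders.

```python
import numpy as np
import pandas as pd

# Six made shots with placeholder assister labels
makes_df = pd.DataFrame({"assist": ["B", "A", "B", "C", "B", "A"]})

# Step 1: integer code per row, in order of first appearance (B=0, A=1, C=2)
codes, uniques = pd.factorize(makes_df["assist"])

# Step 2: occurrences of each code, broadcast back onto every row
counts_per_row = np.bincount(codes)[codes]

# Step 3: row order that sorts by count; mergesort is stable,
# so rows with equal counts keep their original relative order
order = np.argsort(counts_per_row, kind="mergesort")

sorted_df = makes_df.iloc[order]
print(sorted_df["assist"].tolist())
```

The result is in ascending order of frequency, so the most frequent category ends up last. Reversing the order (for example with order[::-1]) would put it first.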

The colours can also be assigned in Plotly Express, although I do so with regular plotly.graph_objects.

Excluding the formatting object, the main part of the code is as follows (see repo for the full function & call):

Now plotting these for a few players, as examples:

Clint Capela’s shots in the 2018–19 NBA season
James Harden’s shots in the 2018–19 NBA season
Klay Thompson’s shots in the 2018–19 NBA season

We can immediately see how many of a player’s shots are assisted, where they like to shoot from, how many of their shots from each zone are assisted, and where their most productive partnerships are.

The top chart immediately shows you how many of Capela’s shots are assisted, and how big a role Harden plays in Capela’s scoring right at the rim.

The next two charts show that even though Harden and Klay Thompson are incredible shooters, their scoring styles could not be more different. Most of Thompson’s shots come on passes, whereas Harden is creating almost all of his own shots.

This chart can be extended to show the entire team:

All of Houston Rockets’ shots in the 2018–19 NBA season

On the 18–19 Rockets, Harden and Chris Paul (and Austin Rivers to some extent) were the only ones creating their own shots in any kind of volume. Most of the midrange shots for the Rockets came unassisted, and the Capela-Harden connection is evident even with this complicated layout.

As you can see, we’ve discovered quite a lot about the relationship between various NBA statistics in just a few plots. As an added bonus, the interactivity of Plotly means that you can simply hover over the data points and gain further insights which may not be as easy to come by, or require creation of additional plots or annotations.

Although this example relates to basketball data, I have no doubt that these examples would be just as applicable to other sports, or any other dataset. The point here was to investigate the nature of relationships between various data columns, and these charts have allowed us to do just that, without building or training a complex model.

Please download the data and code, and have a play with it, and build something similar with your own dataset. I would love to hear about your experiences or comments!