
Clustering NBA Shots in the 2014-2015 Season

  • Writer: Andrew Giocondi
  • Apr 8, 2022
  • 8 min read


Introduction


The NBA continues to evolve with new styles of play, and one of the main drivers of that evolution is how shots are taken. Shooting is a common measure of a player's offense and scoring ability, as well as of how well an entire team performs. There is increasing demand for ways to track and analyze shots in order to help players improve, help teams win, and keep up with shooting trends. Because shooting is such an important topic in the NBA, there are many aspects of a shot that can be measured. Since 2013, the NBA has used an elaborate camera-based system that provides spatial tracking for every game. It can capture statistics on shot location, how well a shot was contested, and how shots are created through dribbles and touch time.


In this project, I will be exploring a sample of shots from the 2014-2015 NBA season to gain a deeper understanding of how shots are taken. The goal is to use clustering to uncover and analyze the most common types of shots during that season.



Clustering


As previously stated, the technique used in this project is clustering. Clustering is an example of unsupervised learning, meaning it works on unlabeled, unclassified data. This type of learning allows an algorithm to act on the data without much guidance or prior training. The idea behind this kind of machine learning is to group or identify information based on similar characteristics, patterns, and trends.


In clustering, a population of data is divided into groups, or clusters, based on similarities, and a label is assigned to each group after it is identified and analyzed. A difficult requirement of clustering is interpretability: the resulting clusters should be distinguishable from one another and well separated. Clustering is especially useful for large, unorganized data that cannot be analyzed quickly, as it produces structured groups that are easier to process.


For this project, K-means clustering is used. This method determines clusters with center points (centroids), where k is the number of clusters, and it captures the insight that each point in a cluster should be near the center of that cluster. K-means first selects initial centroids, traditionally at random, then measures the distance from every data point to each centroid, typically with the Euclidean or Manhattan distance, and assigns each point to the cluster with the nearest centroid. The centroids are then recomputed from their assigned points, and the process repeats until the assignments stop changing.
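To make the procedure concrete, here is a minimal sketch of K-means using scikit-learn on a handful of toy points; the data here is purely illustrative and is not the shot dataset used later.

```python
# Minimal K-means sketch with scikit-learn on toy 2-D points.
import numpy as np
from sklearn.cluster import KMeans

# Toy points; in practice this would be the pre-processed shot features.
points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                   [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# n_clusters is k; scikit-learn uses Euclidean distance and alternates
# between assigning points to the nearest centroid and recomputing centroids.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster assignment for every point
print(kmeans.cluster_centers_)  # final centroid coordinates
```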

To determine the number of clusters, the inertia (elbow) method is used. It plots the number of clusters against the inertia, the within-cluster sum of squared distances, which measures how far points are from their cluster center. The goal is a small inertia with a small number of clusters, so the elbow point of the graph, where adding more clusters no longer reduces the inertia significantly, is the optimal choice.
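A rough sketch of how the elbow curve can be produced with scikit-learn and matplotlib; the feature matrix X below is random stand-in data, not the actual shot features.

```python
# Elbow (inertia) method sketch: fit K-means for several k and plot the inertia.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # stand-in for the pre-processed shot features

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```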



Data Introduction


This shot-tracking dataset was found on Kaggle and contains 21 variables describing every shot taken between October 2014 and March 2015. It is formatted as a CSV file and is cleanly organized to begin with. There are 128,069 rows representing shots, so the data is fairly large. For every shot, it records who took the shot, where on the floor it was taken, who the nearest defender was, how far away that defender was, the time on the clock, the dribble count, the touch time, and a few other important details. The data was reportedly scraped from the NBA's REST API, which serves up-to-date basketball data and statistics on the current NBA season.
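As a sketch, the file can be loaded and inspected with pandas; the filename shot_logs.csv is an assumption about how the Kaggle download is named.

```python
# Load and inspect the shot-log CSV (filename assumed).
import pandas as pd

shots = pd.read_csv("shot_logs.csv")

print(shots.shape)    # expected to be roughly (128069, 21) per the description above
print(shots.dtypes)   # column names and types
print(shots.head())   # first few shots
```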



Data Understanding and Pre-Processing


The first step in manipulating the data was to delete a few unneeded columns, including game outcome, final margin, closest defender player ID, field goal make, and player ID, since these are not significant for this shot analysis. I then had to reformat the strings representing time and convert them to numerical values: I changed the game clock from a time-formatted string to an integer giving the total remaining seconds in the quarter and then removed the original game clock column.
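A sketch of these two steps, continuing from the loading snippet above; the column names (W, FINAL_MARGIN, CLOSEST_DEFENDER_PLAYER_ID, FGM, player_id, GAME_CLOCK) are assumptions about the Kaggle file's headers.

```python
# Drop the columns not needed for the shot analysis (names assumed:
# 'W' = game outcome, 'FGM' = field goal make).
drop_cols = ["W", "FINAL_MARGIN", "CLOSEST_DEFENDER_PLAYER_ID", "FGM", "player_id"]
shots = shots.drop(columns=drop_cols)

# Convert the 'MM:SS' game clock string to total remaining seconds in the quarter,
# then drop the original string column.
clock = shots["GAME_CLOCK"].str.split(":", expand=True).astype(int)
shots["GAME_CLOCK_SEC"] = clock[0] * 60 + clock[1]
shots = shots.drop(columns=["GAME_CLOCK"])
```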


Next, after noticing that there were null values in the shot clock column, I examined the data more closely. Of the 5,567 null values, 3,554 could be explained: the shot clock is disregarded at the end of a quarter when the game clock has 24 or fewer seconds remaining. For those instances, I replaced the null shot clock with the remaining game clock seconds. The rest I removed to prevent inaccuracy.
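A sketch of this shot clock fix, continuing from the snippet above and reusing the assumed column names.

```python
# When the game clock is at 24 seconds or less, the shot clock is turned off,
# so those nulls can be filled with the remaining game clock seconds.
off_clock = shots["SHOT_CLOCK"].isna() & (shots["GAME_CLOCK_SEC"] <= 24)
shots.loc[off_clock, "SHOT_CLOCK"] = shots.loc[off_clock, "GAME_CLOCK_SEC"]

# Nulls that cannot be explained this way are dropped.
shots = shots.dropna(subset=["SHOT_CLOCK"])
```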


The next task was to convert the home/away and make/miss columns from strings to numerical values: a home game is now represented by 0 and an away game by 1, while a made shot is represented by 1 and a missed shot by 0. I then noticed an error in the touch time variable, where some shots were recorded with negative values; to maintain data integrity, I removed those shots. Additionally, I created a column representing the player's cumulative points in the game at the time of each shot, so every time a player made a shot, the points scored were added to their running total.
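A sketch of these steps; the raw string codes ('H'/'A', 'made'/'missed') and the grouping columns (GAME_ID, player_name, PTS, SHOT_NUMBER) are assumptions about the file.

```python
# Encode home/away and make/miss as numbers (raw codes assumed).
shots["LOCATION"] = shots["LOCATION"].map({"H": 0, "A": 1})              # home = 0, away = 1
shots["SHOT_RESULT"] = shots["SHOT_RESULT"].map({"made": 1, "missed": 0})  # make = 1, miss = 0

# Remove shots recorded with a negative touch time.
shots = shots[shots["TOUCH_TIME"] >= 0]

# Cumulative points for each player within each game at the time of the shot.
shots = shots.sort_values(["GAME_ID", "player_name", "SHOT_NUMBER"])
shots["TOTAL_PTS"] = shots.groupby(["GAME_ID", "player_name"])["PTS"].cumsum()
```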


Lastly, before visualizing the data, I generated a column representing, as a number, the month in which the game was played. This was done by reading the first three characters of the matchup column string, which give the month abbreviation. The matchup column was then removed.
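A sketch of the month extraction; the example matchup string format is an assumption.

```python
# The first three characters of the matchup string give the month abbreviation
# (e.g. "MAR 04, 2015 - CHA @ BKN"; format assumed). Map it to a month number.
month_map = {"OCT": 10, "NOV": 11, "DEC": 12, "JAN": 1, "FEB": 2, "MAR": 3}
shots["MONTH"] = shots["MATCHUP"].str[:3].str.upper().map(month_map)
shots = shots.drop(columns=["MATCHUP"])
```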


After creating a scatter plot matrix and a correlation matrix for the variables, the initial observation is that there are not many strong relationships. Many of the variable pairs that do correlate are expected and can be explained with basketball knowledge: if a player shoots more, defenders will most likely guard them more closely, and if a player dribbles and holds the ball more, defenders will also defend them more tightly. This makes sense, as defenses want to guard better players and scorers closely because they handle the ball more and shoot frequently throughout the game. Also, aside from some very close shots, the farther a player shoots from the basket, the more loosely they are guarded.


More scatter plots were created in hopes of finding intriguing relationships. These split the shots into makes and misses and graph each variable against total points. Only the most notable ones are included below.
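As an illustration, one such plot, shot distance against total points split by makes and misses, could be drawn as sketched here (using the assumed column names from earlier).

```python
# Scatter plot of shot distance vs. cumulative points, colored by make/miss.
import matplotlib.pyplot as plt

made = shots["SHOT_RESULT"] == 1
plt.scatter(shots.loc[~made, "TOTAL_PTS"], shots.loc[~made, "SHOT_DIST"],
            s=4, alpha=0.3, label="miss")
plt.scatter(shots.loc[made, "TOTAL_PTS"], shots.loc[made, "SHOT_DIST"],
            s=4, alpha=0.3, label="make")
plt.xlabel("Total points at time of shot")
plt.ylabel("Shot distance (ft)")
plt.legend()
plt.show()
```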
[Figure: scatter plots of shot variables against total points, split into makes and misses]

The plots show many more misses at long distances than at short ones, which is expected. Players with more points, however, tend to take close shots and 3-pointers rather than mid-range shots. There are more makes when the shot is taken early in the shot clock rather than late, and more makes earlier in the game than later, especially in overtime. It is also interesting to see defenders progressively guarding players more closely as they accumulate points, and when a defender is very far from the shooter, above roughly 20 feet, the shot is made almost every time.



Principal Component Analysis (PCA)


Because the data currently contains more than 3 variables, PCA is used to reduce the dimensionality without losing a large amount of information. Before applying PCA, I removed the non-continuous variables: shot result, location, period, and month. To visualize how the data was reduced to 2 and 3 dimensions, the graphs below color the projected points by shot result, location, period, and month.
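A sketch of this reduction with scikit-learn; the exact list of continuous columns is an assumption, and standardizing the features before PCA is a common choice that is assumed here.

```python
# PCA to 2 and 3 components on the continuous shot features.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Continuous columns kept for PCA (this exact list is an assumption).
continuous_cols = ["SHOT_NUMBER", "GAME_CLOCK_SEC", "SHOT_CLOCK", "DRIBBLES",
                   "TOUCH_TIME", "SHOT_DIST", "CLOSE_DEF_DIST", "TOTAL_PTS"]

features = StandardScaler().fit_transform(shots[continuous_cols])

pca_2d = PCA(n_components=2)
pca_3d = PCA(n_components=3)
X2 = pca_2d.fit_transform(features)   # 2-D projection for plotting and clustering
X3 = pca_3d.fit_transform(features)   # 3-D projection

print(pca_2d.explained_variance_ratio_)
print(pca_3d.explained_variance_ratio_)
```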


2-Dimensions

[Figure: 2-D PCA projection colored by shot result, location, month, and quarter]

From these graphs, we can see how shot result, location, month, and quarter were separated by the PCA. Location is the only variable that is not distinguishable, while quarter appears to be the most distinguishable.


3-Dimensions

[Figure: 3-D PCA projection colored by shot result, location, month, and quarter]

From these graphs, we can see how shot result, location, month, and quarter were separated by the PCA after the 3-D reduction. Location and month are not as distinguishable, while quarter still appears to be the most distinguishable.



K-Means Clustering


As mentioned earlier, K-means is used for the clustering analysis of the shot data. K-means was a better option than hierarchical clustering because the dataset is very large. The first step of the modeling was determining how many clusters to use; based on the inertia (elbow) method, 4 clusters is the best option.
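A sketch of fitting the 4-cluster model on the 2-D and 3-D projections from the PCA sketch above.

```python
# Fit K-means with 4 clusters on the PCA-reduced data.
from sklearn.cluster import KMeans

kmeans_2d = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X2)
kmeans_3d = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X3)

# Attach the 2-D model's labels to the shot table for the cluster profiling later on.
shots["CLUSTER"] = kmeans_2d.labels_
```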

Clustering the 2-dimensional and 3-dimensional data yielded the following results:
[Figure: K-means cluster assignments on the 2-D and 3-D PCA projections]

It was very interesting to see how similar the 2-dimensional and 3-dimensional models were; each cluster took on nearly the same shape in both. With both models producing clusters that are clearly separated, the decision to use 4 clusters appears to have been a success.



Clustering Analysis


Returning to the goal of the project, to uncover and analyze the shots and better understand how they were taken that season, the clusters can be interpreted as different kinds of shots. With 4 clusters, each one is treated as a separate shot type. To interpret the differences between the clusters, each variable was plotted by cluster, with the exception of location, which showed no useful pattern across the clusters and seemed meaningless to include. For the continuous variables, instead of displaying counts, the mean for each cluster was calculated.
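A sketch of how such a cluster profile can be computed, reusing the assumed column names and the cluster labels from the sketch above.

```python
# Mean of each continuous variable per cluster, plus the make percentage.
profile = shots.groupby("CLUSTER")[continuous_cols].mean()
profile["MAKE_PCT"] = shots.groupby("CLUSTER")["SHOT_RESULT"].mean()
print(profile.round(2))
```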

[Figure: distribution or mean of each variable by cluster]

An initial observation is that the month in which the shot was taken shows no meaningful difference between the clustered shot types.


The first shot type is a 2-pointer that captures many of the shots taken in the first quarter and has a slightly higher percentage of makes than misses. It comes early in a player's shot count and has plenty of time remaining on both the shot clock and the game clock. It features a low dribble count, touch time, shot distance, and total points, while the defender's distance from the shooter is moderate.


For the second shot type, shots are spread almost evenly across the first, second, third, and fourth quarters, with only slightly fewer in the fourth. Of the 4 clusters, this type has the highest percentage of makes and a slightly higher shot number than the first cluster, and accordingly it has slightly fewer seconds remaining on both the shot clock and the game clock. It has the lowest dribble count and touch time, along with a high shot distance and defender distance. It is identified as a 3-point shot taken when the player has a low point total in the game, though slightly higher than in the first cluster.


The third shot category is a 2-point shot that is also spread roughly evenly across the four quarters, though unlike the second type it has slightly fewer shots in the first quarter. This shot is made more often than missed, has a slightly higher shot number than the second type, slightly less time on the game clock and shot clock, and the highest dribble count and touch time. Its shot distance and defender distance are moderate, like the first shot type, and it is often taken when the player has slightly more points in the game than in the second shot type.


The last shot type identified by the cluster model is a 2-point shot taken primarily in the third quarter and, to a lesser extent, the second quarter. This is the cluster with a higher miss percentage, and its shots come very late in the quarters. Both the shot number and total points are high, indicating a late-game shot. It has the fewest seconds remaining on the game clock, while its shot clock matches the second shot type. The dribble count and touch time are low, but higher than in the first and second categories. This shot type is taken from farther out than the first type, but closer than the second and third, and the defender's distance from the shooter matches the first and third shot types.



Conclusion


Overall, separating the sample of shots from the 2014-2015 NBA season into 4 clusters was a success. Even after reducing the data to both 2 and 3 dimensions, not only were the clusters similar across both models, but there were also distinct differences in the majority of the variables for each type of shot. Throughout the project, many aspects of the data were explored and analyzed, providing a deeper understanding of the shots taken during that season. The experience gained from this kind of analysis is valuable, and in the future I plan to compare differences in shot categories across seasons.



Code and References


The code for this project:


The following resources were used in the process of this project.


Cover image:



