NBA Draft Classification from College Player Statistics
- Andrew Giocondi
- Feb 17, 2022
- 7 min read
Updated: Feb 21, 2022

Introduction
Continuing my interest in sports analytics, the idea of predicting when a college basketball player will be drafted is intriguing. The NBA draft has only two rounds, each currently with 30 picks, matching the number of teams in the league (even though not every team holds a pick in every draft). With over 1,500 NCAA Division I basketball players, making it to the NBA is substantially difficult. Players who go undrafted are left to either compete overseas or move on to the careers they prepared for in college.
In this project, I will be using classification, a method of supervised learning, to predict where in the draft a player will be selected. In this type of machine learning, the target value is known and discrete; here, accuracy will be determined by how well a model can predict the draft selection.
The question regarding the problem is the following:
How well can a college basketball player's statistics predict when they will be drafted in the NBA draft?
I will use the following classification methods to find solutions to this issue:
Decision Trees
Support Vector Machine (SVM)
Random Forest
The Data
The data that I will be using in this project contains the demographics and statistics of every college basketball player, for every season from 2009 to 2021. It contains more than 60 different variables and over 60,000 instances. View the data source below to observe all of the attributes.
Data source: https://www.kaggle.com/adityak2003/college-basketball-players-20092021?select=CollegeBasketballPlayers2009-2021.csv
The data was collected from the following website, which draws on both natstat and the NCAA and is continually updated:
With many advanced statistics available to measure a player's performance, the following reference is useful:
In the NBA draft, not only college players are selected, so the one shortcoming of this dataset is its exclusion of international players. For this project, however, restricting the analysis to college players is not a problem, since college players are drafted at significantly higher rates than international players. When viewing the draft picks in the data, I noticed few gaps caused by international selections.
This dataset has many advanced attributes and not many missing or null values. The majority of the missing values belong to players who were not drafted, or to statistics such as specific shot types (rim, dunks, mid-range, etc.) that will not be used in the classification models.
Pre-processing the Data
There were many steps that were taken to end up with the best possible collection of data for classification. It predominantly consisted of trial and error, as well as brainstorming new ideas along the way.
The first action I took was removing all rows of players who were not drafted into the NBA. These null values took up a sizable portion of the data and also contained other missing information and outlier values. Next, since there were many columns, I kept only the variables that would be meaningful to the target values and did not contain additional missing data. My initial decision was to keep the columns that I believed best measured a player's performance; however, the visualizations and models showed low correlation and accuracy. For this reason, I then chose a larger list based on null values, relevance, and simplicity, and decided to construct a correlation table to assist with the variable selection.
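The filtering step above can be sketched in pandas. The column names here are hypothetical stand-ins, not guaranteed to match the real Kaggle CSV:

```python
import pandas as pd

# Toy stand-in for the Kaggle file; column names are illustrative only.
df = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "pick": [5, None, 42, None],   # None = went undrafted
    "pts": [18.2, 9.1, 11.4, 3.0],
    "Ortg": [112.0, 98.5, 104.2, 90.1],
})

# Keep only players who were actually drafted (non-null pick) ...
drafted = df.dropna(subset=["pick"]).copy()
drafted["pick"] = drafted["pick"].astype(int)

# ... and keep only the columns relevant to the target.
keep_cols = ["player_name", "pick", "pts", "Ortg"]
drafted = drafted[keep_cols]
print(len(drafted))  # 2 drafted players remain
```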
Prior to calculating the correlations, I noticed a null value for the assist/turnover ratio variable, which I just filled in with the mean to prevent loss of data. Afterwards, since the dataset includes stats for every season, the average needed to be taken for each player who played in college for multiple seasons.
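A minimal sketch of the mean-fill and per-player averaging steps, again with illustrative column names:

```python
import pandas as pd

# One row per player-season; 'ast_tov' has a missing value.
df = pd.DataFrame({
    "player_name": ["A", "A", "B"],
    "pick": [10, 10, 40],
    "pts": [12.0, 16.0, 8.0],
    "ast_tov": [1.5, None, 2.0],
})

# Fill the missing assist/turnover ratio with the column mean
# instead of dropping the row.
df["ast_tov"] = df["ast_tov"].fillna(df["ast_tov"].mean())

# Collapse multi-season players into a single averaged row.
career = df.groupby("player_name", as_index=False).mean(numeric_only=True)
print(career)
```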
When I obtained the correlation values for every variable, I filtered them down to only those with a correlation with draft pick below -0.10 or above 0.10. Even though these values do not represent a strong positive or negative correlation, doing this can rule out the possibility of including insignificant data in our classification model. This was done in hopes of increasing the accuracy of the models in comparison to the first selection of variables. I also made it a priority to check the correlations between these variables to reduce redundancy. Out of the variables that met the correlation conditions stated above, I removed total rebounds, as offensive rebounds and defensive rebounds are not as highly correlated with each other as they are with the total. I removed steal percentage since steals is more highly correlated with draft pick. Lastly, I removed eFG and two-point percentage because true shooting percentage has a higher correlation with pick and all three were highly correlated with each other. This left the remaining variables to use in our model: pick, minutes played, points, offensive rating, offensive rebounds, defensive rebounds, true shooting percentage, defensive rating, steals, and blocks.
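The threshold-based filter can be sketched on synthetic data. Here `pts` is built to correlate with pick and `junk` is pure noise, so only `pts` should survive the ±0.10 cut:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
pts = rng.normal(12, 4, n)
junk = rng.normal(0, 1, n)
# 'pick' is constructed so points correlate negatively with it.
pick = 60 - 2.5 * pts + rng.normal(0, 10, n)
df = pd.DataFrame({"pick": pick, "pts": pts, "junk": junk})

# Correlation of every candidate variable with the draft pick.
corr = df.corr()["pick"].drop("pick")

# Keep only features with |correlation| > 0.10.
selected = corr[corr.abs() > 0.10].index.tolist()
print(selected)
```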
The final step I needed to take before visualizing and modeling the data was separating the picks into categorical bins. This was another step where I attempted various approaches before arriving at the best one. I first experimented with equal binning using quartiles (4 bins) and deciles (10 bins). The model accuracy was low for the quartile binning and even lower for the decile binning. Noticing this trend, I ended up using only two bins: one for the first round, covering picks 1 to 30, and one for the second round, covering picks 31 to 60. For the graphs, however, I used 4 bins, as that displayed the relationship between the statistics and draft picks better.
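Both binnings can be produced with `pd.cut`; this is a sketch, not the project's exact code:

```python
import pandas as pd

picks = pd.Series([1, 15, 30, 31, 45, 60])

# Two classes: first round (picks 1-30) vs. second round (picks 31-60).
round_bin = pd.cut(picks, bins=[0, 30, 60], labels=["first", "second"])
print(round_bin.tolist())
# ['first', 'first', 'first', 'second', 'second', 'second']

# Quartile bins (used only for the visualizations).
quartile_bin = pd.cut(picks, bins=[0, 15, 30, 45, 60], labels=[1, 2, 3, 4])
print(quartile_bin.tolist())  # [1, 1, 2, 3, 3, 4]
```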
Data Understanding and Visualization
To comprehend the data and obtain a better understanding of each variable in regards to the target value, I first developed scatter plots.
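One such plot could be generated as follows; the data here is synthetic and the variable names are placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(12, 4, 100)
# Synthetic picks, loosely negatively related to points, clipped to 1-60.
pick = np.clip(60 - 2 * pts + rng.normal(0, 12, 100), 1, 60)

# One scatter plot per statistic against draft pick (one shown here).
fig, ax = plt.subplots()
ax.scatter(pts, pick, alpha=0.5)
ax.set_xlabel("Points per game")
ax.set_ylabel("Draft pick (lower is better)")
fig.savefig("pts_vs_pick.png")
```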



It is important to note that, since lower draft picks are better, the points falling on the lower parts of these scatter plots represent the higher selections. Slightly negative relationships are apparent in most of the plots, and a positive correlation appears in the defensive rating graph, which is expected. Given the low correlations, I did not anticipate that these plots would show distinct relationships. For this reason, the following scatter plots were created with quartile bins to validate the relationships.









In these graphs, the first two bins, representing the first round, tend to have higher values than the last two bins representing the second round, which is the relationship the models will exploit. Again, the only exception is defensive rating, where the pattern is reversed.
Data Modeling and Evaluation
As I explained earlier in the project, I will be implementing three different classification models. To do so, I needed to perform a train-test split. I separated the dataset into an X, containing the statistics, and a y, containing only the draft pick. Furthermore, I applied standardization to both the train and test sets. This improves the data's uniformity, making the features easier to compare when the classification algorithms run. The accuracy and cross-validation values fluctuate slightly when re-running the file, so keep this in mind when viewing the code.
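The split and scaling can be sketched with scikit-learn on placeholder data. Note that the scaler is fit on the training set only, so no test-set information leaks into training:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 9))      # nine statistics per player (synthetic)
y = rng.integers(0, 2, size=200)   # 0 = first round, 1 = second round

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit on train only, then transform both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(X_train.shape, X_test.shape)  # (150, 9) (50, 9)
```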
The first is a decision tree model. This algorithm is used to build classification models in the form of a tree-like structure when a target result or variable is already known. A decision tree discriminates a set of data into classes based on recursive partitioning. In this case, the decision tree will end up testing the college player statistics in order to articulate rules for predicting the draft pick target variable.
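A minimal sketch of that step, using synthetic data in place of the real statistics (the depth setting and feature construction here are assumptions, not the project's actual choices):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 9))
# Synthetic binary target loosely tied to the first feature.
y = (X[:, 0] + rng.normal(0, 1, 300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Recursive partitioning with a capped depth to limit overfitting.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)
tree.fit(X_train, y_train)
print(round(cv_scores.mean(), 3), round(tree.score(X_test, y_test), 3))
```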
The decision tree classifier produced the following cross-validation and accuracy score:

The second model I produced was an SVM. In addition to the default case, I tried polynomial and linear kernel variants. The model works to find the line or hyperplane that maximizes the margin between the two classes and separates the features into different domains. Kernel functions can be used to map the data into a higher dimension where it becomes separable.
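The three kernel cases can be sketched as below, on a synthetic classification problem standing in for the real player data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Default RBF kernel, plus the polynomial and linear variants.
for kernel in ["rbf", "poly", "linear"]:
    clf = SVC(kernel=kernel)
    cv = cross_val_score(clf, X_train, y_train, cv=5).mean()
    acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{kernel}: cv={cv:.3f} test={acc:.3f}")
```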
The following are the model results including cross-validation and accuracy score:
SVM:

SVM Polynomial kernel case:

SVM Linear kernel case:

The last classification model I used was a random forest. Its performance is based upon growing and combining multiple decision trees, which creates a more accurate representation and prediction of the data. Each tree gives a separate classification, and the random forest then chooses the final class by majority vote.
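The majority-vote ensemble described above can be sketched the same way (the tree count is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 100 trees, each trained on a bootstrap sample; the majority class wins.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
cv = cross_val_score(forest, X_train, y_train, cv=5).mean()
forest.fit(X_train, y_train)
print(f"cv={cv:.3f} test={forest.score(X_test, y_test):.3f}")
```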
The random forest classifier produced the following cross-validation and accuracy score:

After examining the performances of each, I found that the SVM polynomial kernel case performed the best. To explore how this specific model is executed, I used a confusion matrix to visualize how the predictions aligned with what was expected.
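The matrix itself follows the scikit-learn convention of rows for true labels and columns for predictions; the labels below are hypothetical, not the model's actual output:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = first round, 1 = second round.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)
# cm[0][0] = true negatives,  cm[0][1] = false positives,
# cm[1][0] = false negatives, cm[1][1] = true positives.
print(cm)
```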

The counts of true negatives (0, 0) and true positives (1, 1) are higher than the counts of false negatives (1, 0) and false positives (0, 1). This is a good sign, as it indicates more correct predictions than incorrect ones.
Conclusion
Through the sequence of analyzing the relationships between college player statistics and NBA draft picks and then developing classification models, it can be said that drafting a player is complicated. In response to how well a college basketball player's statistics predict when they will be drafted, the answer is somewhat ambiguous. However, there is clear success in distinguishing between the first and second rounds, as seen in the classification models. After choosing the SVM with the polynomial kernel case, it can be said that the model is around 66.8% accurate in predicting whether a player will be selected in the first or the second round.
An important detail I noticed while running the models was the fluctuation in accuracy score. This most likely stems from overfitting, which also suggests the benefit of obtaining a larger dataset. I used cross-validation in addition to the accuracy scores to evaluate the models and keep the possibility of overfitting to a minimum. For the models to perform at their best, many changes to the inclusion and exclusion of variables were made. There were also many model trials with different variable bins before arriving at the best-performing one.
With the correlations being fairly low and the models producing, at most, mid-to-high 60% accuracies, it is fair to say that more information may be required to predict when a player will be selected in the NBA draft. This is especially true when predicting the pick on a more specific level, such as with more bins. Variables such as schedule difficulty, playing style, and team contribution could be valuable aspects to take into consideration. Additionally, the amount of data may need to be much larger than the limited 2009 to 2021 span to produce higher prediction accuracy. Judging purely by how correlated certain statistics were with draft pick, however, offensive scoring, shooting ability, and scoring prevention on defense are the best factors to assess.
Code and References
The code for this project:
The following resources were used in the process of this project.
Cover image: