Predicting Car Prices

Andrew Giocondi
May 4, 2022

Introduction

Buying a car can be a headache, and knowing how much to pay is usually the source of the problem. With 15 million cars sold in the US last year alone, that is quite a few headaches. Unfortunately, in 2022 the problem is getting worse, with consumers paying 12.2 percent more for new cars and 40 percent more for used vehicles. Given the current state of the car market, consumers need to know how much they should pay for a car, and sellers need to know how much to charge. We plan to use machine learning to build a model that addresses this problem. Specifically, this project uses linear regression to predict the price of a new or used car from its features. The hope is that a consumer or seller could use this model to accurately predict the price of a car of interest based on its characteristics.

The Data

The dataset used to create the model is an existing one and can be found on Kaggle. It is in CSV form and contains 201 rows and 26 variables that describe the features of various cars. It is important to note that the dataset does not include the year each car was made; only the car's features are listed. For example, the variables include the make, body style, miles-per-gallon, horsepower, curb-weight, and a plethora of others. The dataset contains both categorical and continuous variables. The categorical variables make up 31 percent of the columns and contain information such as the brand of the car, while the continuous variables make up the remaining 69 percent and capture information such as the vehicle's peak-rpm.

Pre-Processing and Data Understanding

Several steps were needed to get the data into a format that would run smoothly in the regression models. First, we dealt with the NA values. For normalized losses, bore, stroke, horsepower, and peak-rpm, each NA was replaced with the mean value for cars of the same make. Normalized losses still had a few NA values after this step, so the remaining ones were filled with the mean for cars of the same body style. For the number of doors, which only takes the values two and four, we removed the cars with NA values. Finally, because the number of doors and number of cylinders were stored as text, we converted them to integers by replacing each word with its corresponding number.
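The imputation and conversion steps above can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the project's actual code; the column names (`make`, `horsepower`, `num-of-doors`) are assumed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the car data; the values are made up for illustration.
cars = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw", "bmw"],
    "horsepower": [102.0, np.nan, 101.0, 121.0, np.nan],
    "num-of-doors": ["four", "two", np.nan, "four", "two"],
})

# Replace each missing horsepower with the mean horsepower of the same make.
cars["horsepower"] = cars["horsepower"].fillna(
    cars.groupby("make")["horsepower"].transform("mean")
)

# Drop rows whose door count is missing, then map the words to integers.
cars = cars.dropna(subset=["num-of-doors"]).copy()
cars["num-of-doors"] = cars["num-of-doors"].map({"two": 2, "four": 4})

print(cars)
```

The same `groupby(...).transform("mean")` pattern could be repeated with body style as the grouping key for any values that are still missing after the first pass.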

Because the data contained many categorical variables, we stored the names of the continuous columns and the categorical columns in two separate variables. This made it significantly easier to examine and visualize the data by data type. To understand the categorical variables and their relationships with the target variable, each one was graphed against price.
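One possible way to build those two column lists is to split on dtype, as in the sketch below. The toy rows and column names are assumptions for illustration only.

```python
import pandas as pd

# Toy rows; the column names are assumed, not taken from the project code.
cars = pd.DataFrame({
    "make": ["audi", "bmw", "honda"],
    "body-style": ["sedan", "sedan", "hatchback"],
    "horsepower": [102, 121, 76],
    "price": [13950.0, 20970.0, 6529.0],
})

# Numeric columns are treated as continuous; everything else as categorical.
continuous_cols = cars.select_dtypes(include="number").columns.tolist()
categorical_cols = cars.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(continuous_cols)
```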

It is easy to see the difference in price by make of car. BMW, Jaguar, Mercedes-Benz, and Porsche have cars that are significantly more expensive than the others. Body style, drive wheels, engine location, engine type, and fuel system appear to show differences in price as well. Hatchbacks and wagons, along with front-wheel-drive and four-wheel-drive cars, have lower prices than the other types. The majority of the cars have ohc engines, front-mounted engines, and an mpfi fuel system. The more expensive cars, however, tend to have ohcv engines, rear-mounted engines, and mpfi fuel systems. Fuel type and aspiration do not show enough of a price difference to be conclusive.

We then generated graphs to explore the relationships that the continuous features have with price.

Many of these variables have a positive relationship with price, including wheel base, length, width, curb weight, number of cylinders, engine size, bore, stroke, and horsepower. Both city and highway miles-per-gallon, as expected, have negative relationships with price. Normalized losses shows only a very slight positive trend. The remaining variables will likely be less important in the price-predicting regression model.

To continue preparing the car dataset, dummy variables were created so that the categorical variables could be properly represented in the regression models. After they were created for make, fuel type, aspiration, body style, drive-wheels, engine location, engine type, and fuel system, one dummy variable was removed to satisfy the k-1 rule (a variable with k categories needs only k-1 dummy columns). The regression analysis treats the omitted dummy variable as the baseline against which all others are compared.
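As a rough sketch of dummy encoding with a dropped baseline level, `pandas.get_dummies` with `drop_first=True` removes the first category of every encoded column. Whether the project dropped levels this way is an assumption; the data below is made up.

```python
import pandas as pd

# Toy rows; the column names are assumed for illustration.
cars = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas", "gas"],
    "body-style": ["sedan", "wagon", "hatchback", "sedan"],
    "price": [13950.0, 28248.0, 6529.0, 16430.0],
})

# drop_first=True omits the alphabetically first level of each variable,
# which then acts as the baseline category in the regression.
encoded = pd.get_dummies(
    cars, columns=["fuel-type", "body-style"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
```

Here "diesel" and "hatchback" become the baselines, so a coefficient on `body-style_sedan` would be read as the price difference between a sedan and a hatchback, all else equal.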

Modeling

For modeling, we used a 70/30 split of the data: 70% served as the training set and the remaining 30% as the test set.
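A 70/30 split like this can be sketched with scikit-learn's `train_test_split`. The synthetic arrays and the `random_state` value are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (200 made-up rows).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=200)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```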

As previously stated, we fit a linear regression model to examine the relationship between car features and price. The first model included all of the variables as they stood after pre-processing, with no additional data manipulation or model optimization. It served as a baseline to improve upon.

The model yielded the following results:

R^2 score: 0.8315490424942048

Mean squared error: 18771550.40

Root mean squared error: 4332.61
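For readers who want to reproduce this kind of report, the sketch below fits a baseline model on synthetic data and computes the same three metrics. The data, the coefficients used to generate it, and the simple 140/60 row split are all assumptions; the numbers it prints are unrelated to the results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data: three made-up features and a price-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 10000 + X @ np.array([3000.0, -1500.0, 500.0]) + rng.normal(scale=2000, size=200)

# Hold out the last 60 rows as a test set (the rows are already random).
X_train, X_test = X[:140], X[140:]
y_train, y_test = y[:140], y[140:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)  # RMSE is in the same units as the target

# Write the fitted model in "intercept + coefficient * X_i" form.
terms = " ".join(f"{c:+.2f} X_{i}" for i, c in enumerate(model.coef_))
equation = f"y = {model.intercept_:.2f} {terms}"
print(f"R^2: {r2:.3f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}")
print(equation)
```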

The equation of the regression model is below:

y = 10035.623651015158 - 77.80868977632163 X_0 + 13.080156723253806 X_1 + 397.8013092622039 X_2 + 173.35481049298622 X_3 - 127.93257251801627 X_4 + 617.8986106142655 X_5 - 238.06698389536436 X_6 + 6.783561781482126 X_7 - 2649.041784005516 X_8 + 115.32068522709345 X_9 - 10264.991003414783 X_10 - 2032.4679757387369 X_11 - 106.0367346721664 X_12 + 10.523379852864764 X_13 + 0.7872308877545038 X_14 - 133.94400752615047 X_15 + 123.91619180770044 X_16 - 1412.2684807703731 X_17 + 5694.080832278168 X_18 + 1489.6433259606874 X_19 - 3942.902304725705 X_20 - 952.3452761197868 X_21 - 2821.7787350833605 X_22 + 1049.3229538520366 X_23 - 699.1909062450102 X_24 + 2019.4703480344324 X_25 - 1755.1181494270434 X_26 - 4773.709357206967 X_27 - 560.3672133268544 X_28 - 4331.974761591686 X_29 - 4795.7956182550615 X_30 + 6630.816242676512 X_31 - 2192.746968923712 X_32 + 3934.196937114924 X_33 - 2411.548260068416 X_34 - 2213.9778930803423 X_35 - 2203.6549756679497 X_36 + 204.11619965323462 X_37 + 751.5424674912363 X_38 - 1704.9309466241302 X_39 + 4240.018266137415 X_40 + 934.6788833086493 X_41 + 1220.6501284612268 X_42 + 919.5730663884428 X_43 - 379.54397597863135 X_44 + 948.6673929392069 X_45 - 6630.816242676522 X_46 + 131.01735916132202 X_47 + 968.511667602741 X_48 + 4219.2679826080985 X_49 - 2638.2570411776464 X_50 + 161.79146743646334 X_51 - 2145.5630732278923 X_52 - 960.2047043517389 X_53 + 751.5424674912194 X_54 - 1073.114031524026 X_55 + 92.12804337851765 X_56 - 712.0239709257085 X_57 + 145.02127674761687 X_58 + ε

In this model, around 83.2% of the variation in car prices is explained by the car features. Although this is fairly accurate, there is clear room for improvement. To improve model performance, we filtered out certain columns, normalized the data, and checked whether the continuous variables were normally distributed.

To filter out insignificant columns, we calculated the correlation coefficient between each independent feature and price. Columns whose correlation fell between -0.15 and 0.15 were removed; 31 columns were removed in this step. Afterward, we generated a correlation matrix to view the collinearity between the remaining features.
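The filtering step can be sketched as below: compute each feature's correlation with price and drop those whose absolute value is under 0.15. The toy values and the `weak-feature` column are hypothetical, constructed so that one column clearly falls below the threshold.

```python
import pandas as pd

# Made-up data: two features that track price and one that does not.
df = pd.DataFrame({
    "engine-size": [1, 2, 3, 4, 5, 6, 7, 8],
    "city-mpg": [40, 38, 35, 30, 28, 25, 20, 18],
    "weak-feature": [1, -1, -1, 1, 1, -1, -1, 1],  # hypothetical column
    "price": [6000, 7500, 9000, 11000, 12500, 14000, 16500, 18000],
})

# Pearson correlation of every feature with the target.
corr_with_price = df.drop(columns="price").corrwith(df["price"])

# Use the absolute value so strong negative correlations are kept too.
weak = corr_with_price[corr_with_price.abs() < 0.15].index.tolist()
filtered = df.drop(columns=weak)

print(corr_with_price.round(3))
print(filtered.columns.tolist())
```

Taking the absolute value matters here: `city-mpg` is strongly negatively correlated with price and should survive the filter.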

Because city and highway miles per gallon were highly collinear with each other yet both strongly correlated with price, we combined them into a single variable representing the average miles per gallon. Next, for each remaining pair of highly collinear features, we kept one and removed the other: we removed the 1bbl fuel system and kept Honda, removed front engine location and kept Porsche, removed length and width and kept curb weight, and removed horsepower and number of cylinders while keeping engine size. After removing these columns, the number of predictor variables decreased to 21. To normalize the data, the values were scaled to the range 0 to 1 so that all features are on a consistent scale. Note that this was applied only to the continuous predictor variables, not to the target variable.
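A minimal sketch of the two transformations described above, averaging the two mpg columns and min-max scaling the continuous predictors, might look like this. The rows and column names are assumed, and `MinMaxScaler` is one way to scale to the 0 to 1 range, not necessarily the one the project used.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy rows with assumed column names.
cars = pd.DataFrame({
    "city-mpg": [21.0, 24.0, 31.0, 19.0],
    "highway-mpg": [27.0, 30.0, 38.0, 25.0],
    "curb-weight": [2548.0, 2337.0, 1989.0, 3055.0],
    "price": [13495.0, 13950.0, 6855.0, 17450.0],
})

# Replace the two collinear mpg columns with their row-wise average.
cars["avg-mpg"] = cars[["city-mpg", "highway-mpg"]].mean(axis=1)
cars = cars.drop(columns=["city-mpg", "highway-mpg"])

# Scale only the continuous predictors to [0, 1]; price is left untouched.
feature_cols = ["curb-weight", "avg-mpg"]
cars[feature_cols] = MinMaxScaler().fit_transform(cars[feature_cols])

print(cars)
```

In a stricter workflow the scaler would be fit on the training rows only and then applied to the test rows, so that no information from the test set leaks into preprocessing.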

To check for normal distribution, we plotted the density of every continuous variable, including price. For example, the graph of price can be seen below.

Square-root transformations were used on normalized losses, wheel base, and average miles per gallon. A log transformation was used on price because it was more heavily skewed. Every predictor variable was now approximately normally distributed, which we treated as meeting the multivariate normality assumption for linear regression. The new price graph can be seen below.
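To illustrate why a log transform helps a right-skewed target, the sketch below compares the skewness of some made-up prices before and after transforming. The price and wheel-base values are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices: most cars are cheap, a few are very expensive.
price = pd.Series(
    [5118, 6229, 6989, 7609, 8495, 9549, 11245, 13495, 16500, 18150, 34184, 45400],
    dtype=float,
)
wheel_base = pd.Series([86.6, 93.7, 94.5, 95.7, 96.5, 99.8, 102.4, 115.6])

log_price = np.log(price)          # stronger correction for heavy right skew
sqrt_wheel_base = np.sqrt(wheel_base)  # milder correction for mild skew

print(f"skew before: {price.skew():.2f}, after log: {log_price.skew():.2f}")

# Predictions made on the log scale are converted back to dollars with exp.
recovered = np.exp(log_price)
```

Because the model is then trained on log price, any predicted value has to be passed through `np.exp` before it can be read as a dollar amount.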

With the data optimized, we ran the next linear regression model. It performed significantly better, with a coefficient of determination of 0.921. Even so, we wanted to increase the accuracy further. We therefore tried Ridge Regression, which is well suited to cases where the number of predictor variables is close to the number of observations.
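As a rough illustration of what Ridge Regression changes, the sketch below fits ordinary least squares and Ridge on two nearly collinear synthetic predictors. The data and the penalty strength `alpha=1.0` are assumptions; the source does not state which value was used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with two nearly collinear predictors and a log-price-like target.
rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = 9.0 + 0.3 * x1 + 0.3 * x2 + rng.normal(scale=0.1, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha is an assumed penalty strength

# The L2 penalty shrinks the coefficient vector toward zero, which
# stabilizes estimates when predictors are strongly correlated.
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Training R^2 (OLS, Ridge):", ols.score(X, y), ridge.score(X, y))
```

Ridge gives up a little training fit in exchange for smaller, more stable coefficients, which can improve accuracy on held-out data.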

The Ridge Regression model yielded the following results:

R^2 score: 0.9265481958291261

Mean squared error: 0.02

Root mean squared error: 0.16

The equation of the regression model is below:

y = 8.771565142164205 - 0.001042226515708229 X_0 + 0.029993075292856317 X_1 + 1.6662648529466921 X_2 - 0.44781457470561015 X_3 - 0.03315241116647386 X_4 - 0.42467344585245154 X_5 + 0.4123696910412718 X_6 + 0.052433132348108875 X_7 + 0.055371554450973956 X_8 + 0.29807982865917326 X_9 + 1.1111565783866522 X_10 - 0.08201516312476043 X_11 - 0.09016784723655302 X_12 + 0.28343761198270206 X_13 + 0.0942204572493309 X_14 + 0.13635156837123508 X_15 + 0.047906295476887524 X_16 + 0.13871772120538345 X_17 + 0.10600106900018498 X_18 + 0.08198078583310203 X_19 + 0.1361831310200071 X_20 + ε

In this model, around 92.7% of the variation in car prices is explained by the car features. Because price was log-transformed, the errors above are measured on the log-price scale and are not directly comparable to the first model's. To show the significance of each variable in the model, the table below lists each variable's importance as measured by its t-value, and a graph of the t-values was also produced. The t-value is the coefficient divided by its standard error, so it grows with sample size and shrinks with variance.
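The t-value calculation can be sketched with NumPy as below. This uses the textbook ordinary least squares standard errors on synthetic data; how the project computed standard errors for the Ridge model is not stated, so treat this as an illustration of the "coefficient divided by standard error" idea only. The helper name `ols_t_values` is hypothetical.

```python
import numpy as np

def ols_t_values(X, y):
    """Return OLS coefficients (intercept first) and their t-values."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])      # add an intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - design.shape[1]                      # observations minus parameters
    sigma2 = residuals @ residuals / dof           # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))                     # standard error per coefficient
    return beta, beta / se

# Synthetic data: the first feature drives y, the second is irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

beta, t = ols_t_values(X, y)
print("coefficients:", beta.round(3))
print("t-values:    ", t.round(2))
```

The relevant feature ends up with a far larger absolute t-value than the irrelevant one, which is the kind of ranking the table and graph are meant to convey.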

The table and graph show that curb-weight is the most important variable in the regression model, meaning it is the car feature that matters most when predicting price. Normalized losses, on the other hand, has very low significance in the model.

To further evaluate and visualize the accuracy of this model, the actual values were graphed against the predicted values, and a table was created to examine the percent difference between them, which was low. Only the first ten instances are shown in the table.
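A comparison table of this kind could be built as in the sketch below. The three actual and predicted prices are made up, and the percent-difference formula (absolute error relative to the actual price) is an assumption, since the source does not define it.

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted values on the log-price scale.
log_actual = np.log(np.array([13495.0, 6855.0, 17450.0]))
log_pred = np.log(np.array([12800.0, 7100.0, 18900.0]))

# Convert back to dollars before comparing, since price was log-transformed.
results = pd.DataFrame({
    "actual": np.exp(log_actual),
    "predicted": np.exp(log_pred),
})

# Assumed definition: absolute error as a percentage of the actual price.
results["pct_diff"] = (
    (results["predicted"] - results["actual"]).abs() / results["actual"] * 100
)

print(results.round(2))
```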

Although the test set contains only 60 cars, we can still see how accurate the model is at predicting car prices. The points predominantly follow a straight line, with very few outliers.

Conclusion

Our model's high accuracy is a great sign for our original goal of creating a model to predict car prices. We consider the project a success: a coefficient of determination of 0.927 is very good. The optimization steps appear to have been crucial for increasing the accuracy of the regression model.

Many aspects of the model have an intuitive explanation. Interestingly, curb weight being a strong indicator of price is likely due to heavier vehicles such as trucks tending to be more expensive. Brands such as Porsche and Jaguar being more expensive is also understandable, simply because they are luxury brands. In the visualizations, we saw that features like fwd, rwd, 2-door, and 4-door all showed differences in average price, as did engine placement. As seems to be the trend, these features commonly differ between ordinary vehicles and luxury or sports cars. Sports cars in particular often have two doors and rear-mounted engines, and are also predominantly RWD. All of these factors cause these variables to be associated with higher prices. Overall, variables associated with larger vehicles and luxury brands tend toward a higher price, and so they increase the predicted price in the model.

While our model is accurate for our dataset and the training set that was used, we believe that a larger dataset would increase the validity of the model. Obtaining data on the year of each vehicle could also prove important. We expect that testing the model on a dataset of mostly non-luxury sedans would be less accurate than testing it on a list full of luxury cars or trucks. We also considered that the model may not predict an economy car's price as accurately as a luxury vehicle's. However, although luxury vehicles could have skewed our average price greatly, the dataset contained far more economy cars by comparison, which likely helped minimize any difference in accuracy between the two. Overall, for the dataset that was used and the car features it includes, the model is highly accurate.

Impact

To understand the impact of this project, we must first examine why it is important. Today, cars are the main mode of transportation. If you want the ability to go places, owning an automobile is often the best and most convenient option. Given their importance, it makes sense that purchasing any vehicle can be a huge financial investment, and there is a constant drive to get the biggest bang for your buck. For this reason, there are already many services that help search for the best price on the automobile that is ideal for you.

In this project, we analyze the price of automobiles in depth by relating the features of a vehicle to its price. Both consumers and sellers in the automobile industry can potentially see which factors have the greatest impact on price. In doing so, they will be able to buy or sell vehicles at a reasonable price that is backed by concrete data analysis. In this case, therefore, both sellers and consumers are stakeholders. For example, a seller may realize that they are asking too high a price for one of their vehicles and cannot sell it as a result. On the other hand, they could be undercharging and losing out on potential profits. By using this model, sellers can decide how to price a listing according to the characteristics of that particular vehicle, and sell it at a reasonable price that satisfies both themselves and the consumer. For consumers, the model will help determine a sensible price at which to buy a vehicle. This will prevent a customer from overpaying for a car with the desired features and increase the probability of finding advantageous deals.

As stated at the beginning of the project, prices are continuously increasing, especially for automobiles. It is more important now than ever for people to be able to properly estimate the price of goods such as automobiles.

Code and References

The code for this project:

https://github.com/rhampt15/Final-Project/blob/main/Predicting%20Car%20Prices.ipynb

The following resources were used in the process of this project.

https://seaborn.pydata.org/api.html

https://pandas.pydata.org/docs/user_guide/dsintro.html

https://matplotlib.org/stable/api/pyplot_summary.html

https://scikit-learn.org/stable/user_guide.html

https://www.caranddriver.com/news/a39357957/car-prices-high-when-will-change/

Cover Image:

https://www.copilotsearch.com/posts/cities-with-the-biggest-increase-in-used-car-prices/