Used car price prediction - Part 3
In this part we will work on minimizing the prediction error. Looking at the raw data for patterns, I found three things that we need to consider. First, I plotted histograms of year and miles, and both are heavily skewed toward recent years and low mileage.
![](https://static.wixstatic.com/media/ec25b8_3af876d9be8a48adb36d2edec5be673d~mv2.png/v1/fill/w_789,h_395,al_c,q_85,enc_auto/ec25b8_3af876d9be8a48adb36d2edec5be673d~mv2.png)
We can apply a log or square-root transformation to bring these distributions closer to normal. The plots below show scatter plots of year and miles against price; the relationships are not linear but higher order. We can increase the model complexity (in other words, raise the polynomial degree) to account for the non-linear relationships. Also, year and miles are on very different scales, and the feature with the smaller scale (year) might be underweighted compared with miles, whose scale is much larger. We would like to normalize them so that they share the same scale, which we can do by subtracting the mean and dividing by the standard deviation.
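As a rough illustration of the transformation and normalization described above, here is a minimal NumPy sketch. The sample values, the `standardize` helper, and the choice to convert year into an age before taking the square root are all assumptions made for the example, not part of the original data or code.

```python
import numpy as np

# Hypothetical sample values; the real data set is not reproduced here.
year = np.array([2008, 2012, 2015, 2016, 2017, 2017, 2018], dtype=float)
miles = np.array([142000, 88000, 41000, 30000, 22000, 15000, 9000], dtype=float)

def standardize(x):
    """Subtract the mean and divide by the standard deviation (z-score)."""
    return (x - x.mean()) / x.std()

# Log transform to pull in the long right tail of miles.
log_miles = np.log(miles)

# Assumption: express year as an age so that a square-root transform
# operates on small, non-negative numbers.
age = year.max() - year
sqrt_age = np.sqrt(age)

# After standardizing, both features have mean 0 and standard deviation 1.
z_miles = standardize(log_miles)
z_age = standardize(sqrt_age)

print(z_miles.round(2))
print(z_age.round(2))
```

Whether the log or the square root works better depends on the data, so it is worth plotting the histogram again after each transformation.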
![](https://static.wixstatic.com/media/ec25b8_f42eb636afe44d1b825c937e3306f941~mv2.png/v1/fill/w_726,h_590,al_c,q_90,enc_auto/ec25b8_f42eb636afe44d1b825c937e3306f941~mv2.png)
In short, there were three findings that might contribute to prediction error.
![](https://static.wixstatic.com/media/ec25b8_2f1d8cc1fbdf4a0899ec6515bfa78f4f~mv2.png/v1/fill/w_411,h_109,al_c,q_85,enc_auto/ec25b8_2f1d8cc1fbdf4a0899ec6515bfa78f4f~mv2.png)
I will show you how to do all of this in Python as preprocessing.
![](https://static.wixstatic.com/media/ec25b8_66a81a8b7bb542f4899334be7ebe2aae~mv2.png/v1/fill/w_705,h_615,al_c,q_90,enc_auto/ec25b8_66a81a8b7bb542f4899334be7ebe2aae~mv2.png)
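For readers who want a text version of the "increase the degree" step, the following is a small sketch of second-order feature expansion in plain NumPy. It is not the code from the post: the `polynomial_features` helper, the input values, and the degree of 2 are assumptions chosen for illustration.

```python
import numpy as np

def polynomial_features(x1, x2, degree=2):
    """Return columns x1^i * x2^j for 1 <= i + j <= degree (no intercept)."""
    columns = []
    for total in range(1, degree + 1):
        for i in range(total, -1, -1):
            j = total - i
            columns.append((x1 ** i) * (x2 ** j))
    return np.column_stack(columns)

# Hypothetical, already standardized inputs.
z_year = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
z_miles = np.array([1.2, 0.4, 0.1, -0.6, -1.1])

# Column order for degree 2: year, miles, year^2, year*miles, miles^2
X = polynomial_features(z_year, z_miles, degree=2)
print(X.shape)
```

Standardizing before expanding keeps the squared and interaction terms on a comparable scale, which is one reason to do the steps in that order.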
Our data are all organized. Let's build a regression model.
![](https://static.wixstatic.com/media/ec25b8_9f4ec45074684a238c5bd5ac26761c1e~mv2.png/v1/fill/w_764,h_168,al_c,q_85,enc_auto/ec25b8_9f4ec45074684a238c5bd5ac26761c1e~mv2.png)
![](https://static.wixstatic.com/media/ec25b8_01c41bad704b4df284d2b5abb377c237~mv2.png/v1/fill/w_980,h_500,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/ec25b8_01c41bad704b4df284d2b5abb377c237~mv2.png)
![](https://static.wixstatic.com/media/ec25b8_540fed23b43e45eeba548d30fe4c8ca9~mv2.png/v1/fill/w_980,h_491,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/ec25b8_540fed23b43e45eeba548d30fe4c8ca9~mv2.png)
After we applied the three solutions to the data, R-squared increased to 0.92 and the prediction error dropped below $1,100. The residuals look random (there is no visible pattern). Predictions at the low and high ends of the price range now sit close to the fit line. Removing outliers would likely increase R-squared and decrease the prediction error further, but I will not discuss that in this post.
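To make the reported quantities concrete, here is a hedged sketch of how R-squared, residuals, and a prediction error can be computed for a least-squares fit. The data are synthetic (generated with made-up coefficients), and using RMSE as the "prediction error" is an assumption, since the post does not state which metric was used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed data (not the post's data set).
n = 200
z_age = rng.normal(size=n)
z_miles = rng.normal(size=n)
price = (12000 - 2500 * z_age - 1800 * z_miles + 400 * z_age**2
         + rng.normal(scale=800, size=n))

# Design matrix: intercept, linear terms, and second-order terms.
X = np.column_stack([np.ones(n), z_age, z_miles,
                     z_age**2, z_age * z_miles, z_miles**2])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

predicted = X @ coef
residuals = price - predicted

# R-squared: share of the price variance explained by the model.
ss_res = np.sum(residuals**2)
ss_tot = np.sum((price - price.mean())**2)
r_squared = 1 - ss_res / ss_tot

# Root mean squared error, in the same units as price (dollars).
rmse = np.sqrt(np.mean(residuals**2))

print(f"R-squared: {r_squared:.3f}")
print(f"RMSE: ${rmse:,.0f}")
```

Plotting `residuals` against `predicted` is a quick way to check for leftover patterns. These metrics are computed on the training data, so a held-out test set would give a more honest estimate.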
Final prediction
My wife sold her car to her brother after some quick market research. She thought her car was worth $10,000, and she sold it for $8,000 as a family discount. It was a 2012 Honda Civic LX with 55,000 miles and a clean title. The regression model we just built predicted that my wife's car was worth $10,340. We can conclude that my wife did a pretty good job on her market research, and that it was a good deal for her brother.
Future work
1. This data set does not include title status. Title status is a significant feature for price prediction, and adding it should reduce the error.
2. The non-linear relationships between the variables look exponential. It might be interesting to add an exponential term to increase complexity, rather than the polynomial features we tried above.
3. The data are skewed toward recent cars with low mileage. It might be helpful to split the data set into new cars and old cars and build separate models.
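For the second item, one simple way to try an exponential relationship is to fit a straight line to the logarithm of price. The sketch below assumes a model of the form price = a * exp(b * age). The ages and prices are made up, generated from known parameters so the fit can be checked, and are not taken from the data set.

```python
import numpy as np

# Hypothetical ages (years) and prices generated from an exact exponential
# decay, so the recovered parameters can be compared with the true ones.
age = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 10.0])
price = 20000.0 * np.exp(-0.12 * age)

# Fitting a line to log(price) is equivalent to fitting
# price = a * exp(b * age); polyfit returns [slope, intercept].
b, log_a = np.polyfit(age, np.log(price), deg=1)
a = np.exp(log_a)

print(f"a = {a:.0f}, b = {b:.3f}")
```

Note that fitting in log space minimizes relative rather than absolute errors, so cheap cars influence the fit about as much as expensive ones.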