Used car price prediction - Part 2
Let's open a pickled file from Part 1, and build a linear regression model.
You can open the saved file with:
>>> df2 = pd.read_pickle('used_car_price_prediction.pkl')
We will import some libraries; I will explain what each one does later.
![](https://static.wixstatic.com/media/ec25b8_c6102103a119499d80d9597ea8ee784f~mv2.png/v1/fill/w_497,h_194,al_c,q_85,enc_auto/ec25b8_c6102103a119499d80d9597ea8ee784f~mv2.png)
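For reference, here is a sketch of the kind of imports this workflow needs (the exact list in the screenshot above may differ slightly):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score, mean_squared_error
```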
scikit-learn (sklearn) is the machine learning library we will use the most. It provides preprocessing utilities as well as tools for data mining and data analysis, and it is built on NumPy, SciPy, and matplotlib.
Dummy variables (Converting categorical variables to numerical variables)
In our DataFrame, we have numerical variables (year and miles) and categorical variables (trim, transmission, exterior color, and interior color). Categorical variables contain strings, and they must be converted to integers (or floats) before we can build a linear regression model. Pandas can create dummy variables for us to turn categorical data into numerical values. For example, if trim has 8 unique values, pandas will create 8 columns consisting of 0s and 1s. For a car with trim 'LX', the 'LX' dummy column will be 1 and all the other trim columns will be 0. Here is an example of how the data looks before and after conversion.
![](https://static.wixstatic.com/media/ec25b8_0266d69b724e40fcbbffbc117da9be83~mv2.png/v1/fill/w_632,h_248,al_c,q_85,enc_auto/ec25b8_0266d69b724e40fcbbffbc117da9be83~mv2.png)
![](https://static.wixstatic.com/media/ec25b8_a26c55b7aefa46cf9007445238dade8f~mv2.png/v1/fill/w_716,h_202,al_c,q_85,enc_auto/ec25b8_a26c55b7aefa46cf9007445238dade8f~mv2.png)
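In code, this conversion is a one-liner with pandas. A small sketch (the column name 'trim' matches the example above):

```python
# One 0/1 dummy column per unique trim value
trim_dummies = pd.get_dummies(df2['trim'], prefix='trim')
print(trim_dummies.head())
```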
I handled the numerical and categorical variables separately so that dummies were created only for the categorical ones, as sketched below. We started with 6 feature columns in X; with all the dummy variables, X now has 36 columns. Let's move on to the regression model: I will build a Ridge regression model.
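A sketch of that step, assuming the column names from Part 1 ('exterior_color', 'interior_color', and the 'price' target are my guesses; adjust them to your actual DataFrame):

```python
# Numerical features stay as they are
numerical = df2[['year', 'miles']]

# Dummies only for the categorical features
categorical = ['trim', 'transmission', 'exterior_color', 'interior_color']
dummies = pd.get_dummies(df2[categorical])

# 2 numerical columns + 34 dummy columns = 36 feature columns
X = pd.concat([numerical, dummies], axis=1)
y = df2['price']
```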
Regression model
![](https://static.wixstatic.com/media/ec25b8_6f25fa0e2a3547fd8fc1d0f40229eb56~mv2.png/v1/fill/w_746,h_195,al_c,q_85,enc_auto/ec25b8_6f25fa0e2a3547fd8fc1d0f40229eb56~mv2.png)
![](https://static.wixstatic.com/media/ec25b8_9d2929f137c74076bca211a96975268f~mv2.png/v1/fill/w_693,h_73,al_c,q_85,enc_auto/ec25b8_9d2929f137c74076bca211a96975268f~mv2.png)
I divided the data into train and test sets: the regression model was fit on the train set and evaluated on the test set. Ridge regression has a penalty term (alpha), and we can explore different alphas by wrapping the model in scikit-learn's make_pipeline. For a better understanding of the penalty term, you might want to read up on regularization.
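A minimal sketch of this step, assuming the X and y built above (the alpha grid and the StandardScaler are illustrative choices, not necessarily what the screenshots use):

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Try a few penalty strengths and keep the one with the best
# cross-validated R-squared on the training set
best_alpha, best_cv = None, -np.inf
for alpha in [0.01, 0.1, 1, 10, 100]:
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    cv = cross_val_score(pipe, X_train, y_train, cv=5).mean()
    if cv > best_cv:
        best_alpha, best_cv = alpha, cv

model = make_pipeline(StandardScaler(), Ridge(alpha=best_alpha))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print('R-squared:', r2_score(y_test, y_pred))
print('RMSE: $%.0f' % np.sqrt(mean_squared_error(y_test, y_pred)))
```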
![](https://static.wixstatic.com/media/ec25b8_73d7c7b129784775b125bedc386277f9~mv2.jpg/v1/fill/w_551,h_61,al_c,q_80,enc_auto/ec25b8_73d7c7b129784775b125bedc386277f9~mv2.jpg)
R-squared is about 0.89 and the root mean squared error is about $1,331. That is a pretty good result for a first attempt at a regression model, and the plots below show how the model fits the data. The fit is tight overall, but the errors at the low and high ends of the price range are relatively large, and the residual plot shows a polynomial-shaped pattern. There are some things we can do to reduce the error, which means we can build a better prediction model. We will continue with that in the next part of this post.
![](https://static.wixstatic.com/media/ec25b8_55d2f59808484005846ede7fd7483f96~mv2.png/v1/fill/w_790,h_610,al_c,q_90,enc_auto/ec25b8_55d2f59808484005846ede7fd7483f96~mv2.png)
![](https://static.wixstatic.com/media/ec25b8_8d2a219794ee4d21b0fd26544a405827~mv2.png/v1/fill/w_809,h_577,al_c,q_90,enc_auto/ec25b8_8d2a219794ee4d21b0fd26544a405827~mv2.png)
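Plots like the two above can be reproduced with matplotlib (a sketch, assuming y_test and y_pred from the code earlier):

```python
# Predicted vs. actual price; points on the dashed line are perfect predictions
plt.scatter(y_test, y_pred, alpha=0.5)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, 'r--')
plt.xlabel('Actual price ($)')
plt.ylabel('Predicted price ($)')
plt.show()

# Residual plot; a visible pattern hints at a missing nonlinear term
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Predicted price ($)')
plt.ylabel('Residual ($)')
plt.show()
```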