Used car price prediction - Part 1
![](https://static.wixstatic.com/media/ec25b8_f60bdbb707de4c05a3087806afc0211a~mv2.jpg/v1/fill/w_539,h_739,al_c,q_85,enc_auto/ec25b8_f60bdbb707de4c05a3087806afc0211a~mv2.jpg)
Was it a fair trade?
My wife sold her car to her brother this year. She did some quick research on the used car market and gave him a discount. I will build a regression model to predict the price of her car and confirm it was a fair trade for both of them.
Web scraping
Web scraping is a technique for extracting data from web pages. Many sites provide APIs, but sometimes it is necessary to go through the web pages ourselves and collect the information we need.
Python has the 'selenium' and 'beautifulsoup' libraries. Selenium drives a Chrome browser: it opens a Chrome window and submits my commands through it, while BeautifulSoup does not open any browser at all and only parses data out of a page's HTML. To assess my wife's deal, I used Selenium on cars.com and extracted used car listings for the Honda Civic, which was my wife's car's model.
Selenium
First, you need to install ChromeDriver to use Selenium with Chrome. Check this webpage for ChromeDriver.
Now let's import selenium library.
>>> from selenium import webdriver
>>> from selenium.webdriver.common.keys import Keys
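As a quick sketch of getting started (the ChromeDriver path below is an assumption; point it at wherever you installed ChromeDriver):

```python
from selenium import webdriver

# The path is an assumption -- replace it with your ChromeDriver location.
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
```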
We need to find the HTML tags that hold the information we are interested in. I underlined some of the info on cars.com below, and we can use those tags to tell Selenium what to extract.
![](https://static.wixstatic.com/media/ec25b8_0f45eae7707045439abee7ecf8b1987f~mv2.png/v1/fill/w_452,h_269,al_c,q_85,enc_auto/ec25b8_0f45eae7707045439abee7ecf8b1987f~mv2.png)
Here is an example that extracts seven pieces of information for a car listed on cars.com.
![](https://static.wixstatic.com/media/ec25b8_67355986e6c9401a8fc004c92d2ab9ae~mv2.png/v1/fill/w_657,h_107,al_c,q_85,enc_auto/ec25b8_67355986e6c9401a8fc004c92d2ab9ae~mv2.png)
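A minimal sketch of this kind of extraction is below. The search URL and the class names ('listing-row__title', 'listing-row__price', 'listing-row__mileage') are assumptions for illustration; inspect the page source on cars.com for the current ones.

```python
# Placeholder search-results URL -- substitute your actual cars.com search.
url = 'https://www.cars.com/for-sale/searchresults.action'
driver.get(url)

# Collect matching elements by their tag and class attribute.
titles  = driver.find_elements_by_xpath("//h2[@class='listing-row__title']")
prices  = driver.find_elements_by_xpath("//span[@class='listing-row__price']")
mileage = driver.find_elements_by_xpath("//span[@class='listing-row__mileage']")
```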
>>> driver.get(url) opens the URL that I assign (in this case, a cars.com search page).
>>> driver.find_elements_by_xpath looks for the tags matching the XPath expression in the quotation marks; here it matches class names on 'h2' or 'span' tags. You can use other methods to locate tags. Here is a link to the Selenium documentation for the different functions.
To get the contents of the tags, you need the .text attribute. If you only collect the elements themselves, you get objects describing the tags, not their contents. See the examples below.
![](https://static.wixstatic.com/media/ec25b8_be3d30601f7347ba8c1b9e18ea642978~mv2.png/v1/fill/w_594,h_146,al_c,q_85,enc_auto/ec25b8_be3d30601f7347ba8c1b9e18ea642978~mv2.png)
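For example (reusing the hypothetical price elements from the sketch above):

```python
prices = driver.find_elements_by_xpath("//span[@class='listing-row__price']")
print(prices[0])       # a WebElement object, not the price itself
print(prices[0].text)  # the displayed string, e.g. '$14,995'
```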
We might need to go through multiple pages of results, and here is example code for scraping multiple pages.
![](https://static.wixstatic.com/media/ec25b8_9892188ce0d842749680ebd0d86ef055~mv2.png/v1/fill/w_774,h_139,al_c,q_85,enc_auto/ec25b8_9892188ce0d842749680ebd0d86ef055~mv2.png)
![](https://static.wixstatic.com/media/ec25b8_b12cead09361426db801d987d1376f15~mv2.png/v1/fill/w_775,h_550,al_c,q_90,enc_auto/ec25b8_b12cead09361426db801d987d1376f15~mv2.png)
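One way to walk through paginated results is to build the URL for each page number, as in the sketch below. The 'page' query parameter and the class names are assumptions about cars.com's markup; check the pagination links in your browser.

```python
import time

base_url = 'https://www.cars.com/for-sale/searchresults.action'  # placeholder
rows = []
for page in range(1, 21):  # first 20 result pages
    driver.get('{}?page={}'.format(base_url, page))
    time.sleep(2)  # give the page time to render before scraping
    titles = driver.find_elements_by_xpath("//h2[@class='listing-row__title']")
    prices = driver.find_elements_by_xpath("//span[@class='listing-row__price']")
    for t, p in zip(titles, prices):
        rows.append((t.text, p.text))
```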
After finishing web scraping, you might need further data cleaning; I am going to skip those details. Now let's put everything together into a DataFrame so we can plot the data and build regression models.
![](https://static.wixstatic.com/media/ec25b8_afb58291917d4d11bd46448bfb78fc76~mv2.png/v1/fill/w_765,h_95,al_c,q_85,enc_auto/ec25b8_afb58291917d4d11bd46448bfb78fc76~mv2.png)
![](https://static.wixstatic.com/media/ec25b8_9968eb2795c34390acbdb747a929349c~mv2.png/v1/fill/w_706,h_157,al_c,q_85,enc_auto/ec25b8_9968eb2795c34390acbdb747a929349c~mv2.png)
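A sketch of assembling the scraped rows into a pandas DataFrame; the column names here are illustrative, matching the fields scraped above, and real listings would need more cleaning than this.

```python
import pandas as pd

df = pd.DataFrame(rows, columns=['title', 'price'])

# Strip '$' and ',' so prices become numeric; unpriced listings
# (e.g. 'Not Priced') turn into NaN via errors='coerce'.
df['price'] = pd.to_numeric(
    df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False),
    errors='coerce')
print(df.head())
```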
We are done with data scraping and cleaning. We can save this DataFrame so we can use it later without running all the code above again. 'Pickle' saves variables to a file that we can open later.
>>> df.to_pickle('used_car_price_prediction.pkl')
It will save the .pkl file in the folder you are working in.
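In a later session, you can load the saved DataFrame back with:

```python
import pandas as pd

df = pd.read_pickle('used_car_price_prediction.pkl')
```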
We will build the linear regression model with this data in Part 2 of this post.