Used Cars Analysis: From Scraping the Data to Visualization

Web scrape and visualize the data using Tableau

Sowmya D
DataDrivenInvestor


Photo by Scott Umstattd on Unsplash

The COVID-19 pandemic barely dented the used car sector. The market for used cars is expected to expand significantly as more people choose independent mobility and as additional financing options become available. The second-hand automobile market has significant growth potential in this context: the pandemic squeezed buyers' cash flow, leading them to consider alternatives to brand-new vehicles. With new car sales and production disrupted by the pandemic, second-hand car sales are growing in popularity with consumers.

According to a recent Mordor Intelligence report, the Indian used car market was valued at US$32.14 billion in 2021 and is projected to reach US$74.70 billion by 2027, growing at a CAGR of 15.1% over the forecast period (2022–2027). This growth has also driven the rise of many semi-organized companies like Cars24, CarDekho, CarTrade, etc. Analyzing used car listings helps uncover interesting trends and correlations between the various features of used car sales, and helps identify customer preferences across cars.

Web scraping is the process of extracting the information you need from a website. Python provides web scraping libraries like BeautifulSoup and Selenium. Beautiful Soup is a Python package for parsing HTML and XML documents. It builds parse trees that make extracting data easier than with Selenium.
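As a quick illustration of how Beautiful Soup turns markup into a navigable parse tree (a minimal sketch using a made-up HTML snippet, not Cartrade's actual markup):

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet to illustrate the parse tree
html = '<div><h2 class="title"><a href="/cars/1">Honda City</a></h2></div>'
soup = BeautifulSoup(html, "html.parser")

h2 = soup.find("h2", {"class": "title"})  # locate the <h2> node
print(h2.a["href"])  # -> /cars/1 (attribute of the child <a> tag)
print(h2.a.text)     # -> Honda City (text inside the tag)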

Any decision to be made requires a clear understanding of the data. Data scientists use data analysis (DA), which frequently makes use of data visualization techniques, to examine and investigate data sets and summarize their key characteristics. The better the understanding, the better the results.

Here, we will use Tableau for data analysis. Tableau is an end-to-end data analytics platform that lets you prep, analyze, collaborate, and share your big data insights. Tableau excels at visual, self-service analytics that enables users to ask new questions about managed big data and easily share those insights across the organization.

In this article, we will gather data on used cars from Cartrade through web scraping, perform data analysis on the resulting data frame, and visualize the data using Tableau.

Here is the step-by-step outline of the project:

  1. Web scrape the data from Cartrade.
  2. Data Analysis and Visualization

Web scrape the data from the Cartrade website

In this part, we will get the details of each used car: the car model, the manufacturer, fuel type, kilometers driven, colour, number of previous owners, manufacturing year, and the quoted price.

Each page of Cartrade comprises 32 used car listings, so we're going to collect the links to every car from the first 200 pages. Then we'll visit each used car page individually and scrape the data we need.

To begin, let's install bs4 so that we can import BeautifulSoup.

pip install bs4

Let's import the necessary libraries: requests to send HTTP requests to the website, BeautifulSoup to parse the site's HTML and XML documents, and pandas to convert the web scraping results into a data frame.

import requests
from bs4 import BeautifulSoup
import pandas as pd

To begin with, let's set the base URL of the Cartrade site, because we'll need it when we construct the URLs for each of the individual cars.

Here, we will scrape the details of the cars listed on Cartrade. The Cartrade website comprises multiple pages, each listing 32 car cards. First, let's fetch the details of the cars on a single page. base_url_pg is the URL of page 1 of the used cars listed on Cartrade.

base_url = "https://www.cartrade.com"
base_url_pg = "https://www.cartrade.com/buy-used-cars/#sc=-1&so=-1&pn=1"

Now, let's inspect base_url_pg and retrieve the links to the cars listed on the page.

Photo by Author on Inspect Cartrade Webpage

On inspecting the Cartrade webpage, we find that each car card's link sits in an <a> tag of the HTML page. Let's get the links of all 32 cars present on that page.

Car link present in <a> tag

To fetch the links of the cars on base_url_pg, we will write a script that walks through each card and builds a URL for us. To do that, we first make an HTTP call, then use BeautifulSoup to extract the <a> element nested inside each h2 element.

carlinks = []
page = requests.get(base_url_pg)
soup = BeautifulSoup(page.content, "html.parser")
h2s = soup.find_all("h2", {"class": "h2heading truncate"})
for h2 in h2s:
    carlinks.append(base_url + h2.a['href'])

Here, we first declare an empty list called carlinks. We then find all the h2 elements whose <a> tag carries an href attribute, and use a for loop over each <a> element to extract the link. Since we have to build a valid URL, we prepend base_url to each extracted link before storing it in carlinks.

Now, carlinks holds the links of all 32 cars on page 1 of Cartrade.

Photo by Author on Links to Car present on page 1

Now we can loop through each of these links, extract the used car information from each page, and store it in a list of dictionaries.

Next, we are going to analyze the pattern in which the information is displayed on a used car page. We will extract details such as the car model, the manufacturer, fuel type, kilometers driven, colour, number of previous owners, manufacturing year, and the quoted price.

Photo by Author on Sample inspection of details on the car page

The car name, price, city, and manufacturer are taken from the Contact Seller dialog button, an <a> tag with the class "contact-seller-btn contact-seller-btn-class one-click flatRed pull-lef get-seller-details".

The fuel type, kilometers driven, colour, number of owners, and manufacturing year are taken from the Car Details table on the page. Let's extract the data.

data = []
for link in carlinks:
    car_url = requests.get(link)
    car = BeautifulSoup(car_url.content, "html.parser")

    # Seller-button attributes: model, price, city, make
    car_detail1 = car.find("a", {"class": "contact-seller-btn contact-seller-btn-class one-click flatRed pull-lef get-seller-details"})
    try:
        model = car_detail1['data-model']
    except:
        model = None
    try:
        price = car_detail1['data-price']
    except:
        price = None
    try:
        city = car_detail1['data-city']
    except:
        city = None
    try:
        make = car_detail1['data-make']
    except:
        make = None

    # Car Details table: each label sits in a <td class="vt1">,
    # with the value in the next sibling <td>
    try:
        fuel_type = car.find("td", {"class": "vt1"}, text="FUEL TYPE").find_next_sibling("td").text
    except:
        fuel_type = None
    try:
        km_driven = car.find("td", {"class": "vt1"}, text="KMS Driven").find_next_sibling("td").text
    except:
        km_driven = None
    try:
        color = car.find("td", {"class": "vt1"}, text="COLOUR").find_next_sibling("td").text
    except:
        color = None
    try:
        no_of_owners = car.find("td", {"class": "vt1"}, text="NUMBER OF OWNERS").find_next_sibling("td").text
    except:
        no_of_owners = None
    try:
        mfg_year = car.find("td", {"class": "vt1"}, text="MANUFACTURING YEAR").find_next_sibling("td").text
    except:
        mfg_year = None

    details = {"car_name": model, "manufactured by": make, "city": city, "fuel_type": fuel_type, "km_driven": km_driven, "colour": color, "no_of_owners": no_of_owners, "mfg_year": mfg_year, "price": price}

    data.append(details)

Here, we start a for loop to iterate over every individual link in carlinks. We make an HTTP GET call to each link and then extract the required data as mentioned earlier.

We use try and except to avoid errors when an element is not found. We create a dictionary named details to store all the extracted information, and at the very end we append the dictionary to the list data.
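As an optional tidy-up (just a sketch, equivalent in behavior to the code above), the repeated try/except blocks can be folded into two small helpers:

def safe_attr(tag, attr):
    """Return tag[attr], or None if the tag or the attribute is missing."""
    try:
        return tag[attr]
    except (TypeError, KeyError):
        return None

def safe_table_value(car, label):
    """Return the text of the cell next to the labelled <td class="vt1">, or None."""
    try:
        return car.find("td", {"class": "vt1"}, text=label).find_next_sibling("td").text
    except AttributeError:
        return None

# Usage inside the loop, e.g.:
# model = safe_attr(car_detail1, 'data-model')
# fuel_type = safe_table_value(car, "FUEL TYPE")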

Now, let's put the code bits together for 200 pages:

base_url = "https://www.cartrade.com"
base_url_pg = "https://www.cartrade.com/buy-used-cars/#sc=-1&so=-1&pn={}"

# Collect the car links from the first 200 pages
carlinks = []
for x in range(1, 201):
    page = requests.get(base_url_pg.format(x))
    soup = BeautifulSoup(page.content, "html.parser")
    h2s = soup.find_all("h2", {"class": "h2heading truncate"})
    for h2 in h2s:
        carlinks.append(base_url + h2.a['href'])

# Visit each car page and extract its details
data = []
for link in carlinks:
    car_url = requests.get(link)
    car = BeautifulSoup(car_url.content, "html.parser")

    car_detail1 = car.find("a", {"class": "contact-seller-btn contact-seller-btn-class one-click flatRed pull-lef get-seller-details"})
    try:
        model = car_detail1['data-model']
    except:
        model = None
    try:
        price = car_detail1['data-price']
    except:
        price = None
    try:
        city = car_detail1['data-city']
    except:
        city = None
    try:
        make = car_detail1['data-make']
    except:
        make = None
    try:
        fuel_type = car.find("td", {"class": "vt1"}, text="FUEL TYPE").find_next_sibling("td").text
    except:
        fuel_type = None
    try:
        km_driven = car.find("td", {"class": "vt1"}, text="KMS Driven").find_next_sibling("td").text
    except:
        km_driven = None
    try:
        color = car.find("td", {"class": "vt1"}, text="COLOUR").find_next_sibling("td").text
    except:
        color = None
    try:
        no_of_owners = car.find("td", {"class": "vt1"}, text="NUMBER OF OWNERS").find_next_sibling("td").text
    except:
        no_of_owners = None
    try:
        mfg_year = car.find("td", {"class": "vt1"}, text="MANUFACTURING YEAR").find_next_sibling("td").text
    except:
        mfg_year = None

    details = {"car_name": model, "manufactured by": make, "city": city, "fuel_type": fuel_type, "km_driven": km_driven, "colour": color, "no_of_owners": no_of_owners, "mfg_year": mfg_year, "price": price}

    data.append(details)
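One refinement worth considering (not part of the original script): pausing briefly between requests keeps the load on the server reasonable and reduces the chance of being blocked. A minimal sketch, reusing the variables defined above:

import time

for x in range(1, 201):
    page = requests.get(base_url_pg.format(x), timeout=10)  # fail fast if a page hangs
    # ... parse the page and collect links exactly as above ...
    time.sleep(1)  # pause between requests so we don't overload the server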

Let's convert the list of dictionaries to a data frame and save it to an Excel file for later use.

df = pd.DataFrame(data)
df.to_excel("Used Cars.xlsx",index=False)
Photo by Author on Sample dataset

It looks like our dataset needs some cleaning, so we'll cover that in the next part of the article. To conclude this part: we have managed to scrape all the required information from the first 200 pages of the Cartrade website.

You can find the dataset in Kaggle: Used Cars Dataset

Data Analysis and Visualization

With the data ready in hand, let's proceed with data cleaning and data analysis. During cleaning, we can see that the no_of_owners column contains \t and \n characters; let's remove them, along with the white spaces.

df['no_of_owners'] = df['no_of_owners'].replace('\t', ' ', regex=True)
df['no_of_owners'] = df['no_of_owners'].replace('\n', ' ', regex=True)
df['no_of_owners'] = df['no_of_owners'].replace(' ', '', regex=True)

Remove the " Kms" suffix and the thousands separators from the km_driven column.

# Strip " Kms" and commas from km_driven
df['km_driven'] = df['km_driven'].replace(" Kms", "", regex=True)
df['km_driven'] = df['km_driven'].replace(",", "", regex=True)
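With the suffix and commas stripped, the text columns can be cast to numbers so that averages and ranges compute correctly later. A minimal sketch, assuming the cleaned strings parse as plain integers:

# Cast the cleaned string columns to numeric; errors="coerce" turns
# any value that still fails to parse into NaN instead of raising.
for col in ['km_driven', 'price', 'mfg_year']:
    df[col] = pd.to_numeric(df[col], errors='coerce')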

Now, let's check for any null values and remove them.

print(df.shape)          # Get the shape of the DF
print(df.isnull().sum()) # Count null values per column
df.dropna(inplace=True)  # Drop rows containing null values
print(df.shape)          # Check the shape of the DF again
Photo by Author on Data Cleaning

Let's view the final used cars dataset after data cleaning:

Photo by Author on Final Used Car Dataset

You can find the source code of this project here: GitHub

Now, we will do data analysis on the used cars dataset and explore it using Tableau.

On the Tableau dashboard, I have included the following charts:

  • Average KM Driven, Average Price, and No. of Cars to give some basic info about the dataset.
  • Added Top 5 Colours, Top 10 Manufacturers, Top 5 Car Models, and Top 5 Cities, which help to understand the market and customer trends.
  • Added Distribution of Used Cars over the Years and Price vs. KM Driven by Manufacturer, which help to understand how price is affected by km_driven.
  • Added Top 10 Car Companies vs. Price Range along with the fuel type of the used cars, which helps to understand how fuel type affects the price range.
  • Added a filter on Car Manufacturer so that we can analyze the used cars by manufacturer.
  • Added a year-range filter, so that we can get insights into used cars through the years.
  • Added a fuel-type filter, which helps to identify how fuel type impacts the various charts, say Company vs. Price by Fuel Type.

NOTE: The charts also act as filters in the dashboard. All the filters and charts affect the data across the entire dashboard.

Photo by Author on Data Visualization of used cars using Tableau

From the above visualization, the following insights can be drawn:

  1. The average kilometres driven is around 35k.
  2. The average price of the used cars in the dataset is about INR 10.94 lakh.
  3. No. of Cars indicates the total number of cars present in the dataset.
  4. The top 5 colours are White, Grey, Blue, Red, and Silver, meaning users are most likely to buy white cars.
  5. The top 10 used car companies listed include Hyundai (31.27%), Honda (21.98%), and Maruti Suzuki (17.31%) of the total used cars present in the dataset.
  6. The top 5 used car models include the Jazz, WR-V, City, Creta, and EcoSport.
  7. The top 5 cities posting used cars are Mumbai, Ranchi, Panchkula, Pune, and Hyderabad, as per the dataset.
  8. The maximum number of cars were bought/manufactured in 2018 (around 1,142 cars), followed by 2020 (1,084 cars).
  9. Petrol is the most preferred fuel choice on used car websites, followed by diesel and LPG.
  10. The price column is the target to predict, i.e., a regression problem, and it is affected by km_driven and mfg_year.

All the above charts can be filtered by fuel type, manufacturer, and year range, from which various insights can be drawn about each company, its car varieties, price range, car colours, etc. A quick way to sanity-check some of these figures in pandas is shown below.
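A sketch for verifying the dashboard numbers directly in pandas, assuming the numeric conversions from the cleaning step above:

print(df['km_driven'].mean())                         # average kilometres driven
print(df['price'].mean())                             # average price
print(len(df))                                        # total number of cars
print(df['colour'].value_counts().head(5))            # top 5 colours
print(df['manufactured by'].value_counts().head(10))  # top 10 manufacturers
print(df['city'].value_counts().head(5))              # top 5 cities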

View the interactive dashboard: Used Car Analysis on CarTrade

So, with the data visualization and analysis done, let's proceed with ML modeling and deployment in the next article.

Conclusion

To summarize, in this blog we web scraped the data from CarTrade, preprocessed it, and visualized it in a Tableau dashboard. We also drew some valuable insights that help in better understanding the data.

You can find the dataset in Kaggle: Used Cars Dataset

You can find the source code of this project here: GitHub

Link to the Tableau Dashboard: Used Car Analysis on CarTrade

If you’ve enjoyed this article or have any questions feel free to connect with me on LinkedIn.

References

[1] BeautifulSoup for Web Scraping

[2] Tableau Charts

[3] Web Scraping Python Tutorial

[4] Used Cars Report in Mordor Intelligence

