End-to-End Project on Used Car Price Prediction

akhil anand · Published in DataDrivenInvestor · Dec 28, 2020 · 10 min read


Overview

This dataset contains information about used cars listed on cardekho.com. It has 9 columns, each describing a specific feature of a car:

- Car_Name: the name of the car.
- Year: the year in which the car was purchased new.
- Selling_Price: the price at which the car is being sold; this is the target label for price prediction.
- Present_Price: the current ex-showroom price of the car.
- Kms_Driven: the number of kilometres the car has been driven.
- Fuel_Type: the fuel type of the car (CNG, Petrol, Diesel, etc.).
- Seller_Type: whether the seller is an individual or a dealer.
- Transmission: whether the car is automatic or manual.
- Owner: the number of previous owners of the car.

Step 1: Setting up a virtual environment

This should be the initial step when you are building an end-to-end project. We need a new virtual environment because each project requires a different set and different versions of libraries, so by making an individual environment for a specific project we can install all the essential libraries into it. Follow these steps to do so:

conda create -n carfare python=3.6
# some essential packages for the environment will be installed
# automatically, then you will get a [y/n] prompt
[y/n] ---> y  # type y

After that we will activate the environment using:

>> conda activate carfare
>> jupyter notebook  # run jupyter notebook in the newly created env

We might get an error about Jupyter Notebook being absent; in that case one more step is needed:

>> pip install jupyter  # installing jupyter notebook in the env
>> jupyter notebook

Our environment is created; now we will do the complete project in this environment.

Step 2: Acquiring the dataset and importing the essential libraries

I have taken the dataset from here; it is in CSV format. Now I will import all the essential libraries needed for this project.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Step 3: Data Preprocessing

df=pd.read_csv("car data.csv")
df.head()  # printing the first five rows
Figure 1
len(df["Car_Name"].unique())
[out]>> 98

Here we have 98 unique car names. The name itself won't affect a car's price; price depends on how many years the car has been used, the fuel type, etc. So I will drop the Car_Name column from the original dataframe.

df.shape  # number of rows and columns in the dataset
[out]>> (301, 9)
--------------------------------------------------------------------
df.columns  # printing the index of all the columns
[out]>> Index(['Car_Name', 'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
       'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner'],
      dtype='object')
--------------------------------------------------------------------
# dropping the Car_Name column
df.drop("Car_Name",axis=1,inplace=True)
--------------------------------------------------------------------
df.isnull().values.any()  # is there any null value present?
[out]>> False  # no null values in the dataset

Now we will check the data type of each column. If the data type is numerical there is no issue, but if it is categorical we need to convert that categorical feature into numerical values.

df.dtypes
Figure 2

Observing the output above, some features have the object data type. In the next step I will collect all these categorical features and later store them in a separate dataframe, cat_df.

list1=[]  # storing the names of all features having a categorical datatype
for i in df.columns:
    if df[i].dtype == "object":
        list1.append(i)
print(list1)
[out]>> ['Fuel_Type', 'Seller_Type', 'Transmission']

Now we will build the categorical dataframe from all the features holding categorical variables; later we will drop these categorical features from the original dataframe.

cat_df=df[list1]
cat_df.head() #top five rows of cat_df
Figure 3
df.head()
Figure 4

Year represents the year in which the car was purchased new, so how can we estimate the number of years the car has been used?

number of years the car has been used = current year - purchase year

Since we are in 2020, a car purchased in 2014 has been used for 2020 - 2014 = 6 years. To compute this, we first add a column representing the current year.

df["Current_Year"]=2020
Figure 5

We have successfully added the Current_Year column; now we will add the No_of_years column and drop the Year and Current_Year columns.

df["No_of_years"]=df["Current_Year"]-df["Year"]
df=df.drop(["Current_Year","Year"],axis=1)
df.head()
Figure 5

Getting a statistical description of the data

df.describe()
Figure 6

On average, a car has been driven 36,947 kilometres, and the maximum distance a car has travelled is 500,000 kilometres. The car with the highest ex-showroom (present) price in the dataset is 92.6 lakh. The maximum number of years a car has been used before being put up for sale is 17. The maximum number of owners a single car has had is 3, and the maximum selling price for a used car is 35 lakh rupees. This is how we draw conclusions from the statistical description of a dataset.

Step 4: Data Visualization

This is the most important step of the data science life cycle; here we understand the behavior of the data and try to extract meaningful insights from it. Let's understand it by doing.

sns.set_style("darkgrid")
sns.FacetGrid(df,hue="No_of_years",height=6).map(plt.scatter,"Present_Price","Selling_Price").add_legend()
plt.show()
Figure 7

The more years you use your car, the lower the amount you will get for it.

sns.set_style("darkgrid")
sns.FacetGrid(df,hue="Present_Price",height=6).map(plt.scatter,"Kms_Driven","Selling_Price")
plt.xlabel("Kms Driven",fontsize=20)
plt.ylabel("Selling Price",fontsize=20)
plt.show()
Figure 8

The less a car has been driven, the higher its price. As the graph shows, at the maximum distance, i.e. 500,000 kilometres, the car's price is close to zero; in other words, nobody is willing to pay anything for those cars.

Plotting a pair plot

We cannot visualize a multi-dimensional scatter plot directly, so we use a pair plot to examine every pair of numerical dimensions of the multi-dimensional data precisely, as sketched below.
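The original code for Figure 9 is not shown in the post; here is a minimal sketch of how such a pair plot is typically produced with seaborn (plotting all columns of df is my assumption):

sns.pairplot(df)  # scatter plot for every pair of numerical columns, histogram on the diagonal
plt.show()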

Figure 9

As we can see, there is very little overlap in the dataset, so we cannot use KNN, linear regression, or SVM; and because of the dynamic nature of the dataset we cannot use a plain decision tree either, so we will go with random forest and XGBoost.

Univariate analysis :- analysis involving a single variable; it is predominantly used to find patterns in the dataset.

sns.set_style("darkgrid")
sns.FacetGrid(df,height=6).map(sns.histplot,"Selling_Price")
plt.xlabel("Selling Price")
plt.show()
Figure 10

Most cars were sold within a price range of 1–10 lakh, while in the 25–35 lakh range there is a negligible number of buyers.

sns.set_style("darkgrid")
sns.FacetGrid(df,height=6).map(sns.histplot,"Kms_Driven")
plt.xlabel("Distance Travelled",fontsize=20)
plt.ylabel("Demand",fontsize=20)
plt.show()
Figure 11

Cars that have travelled a shorter distance are in more demand; especially in the 0–5,000 kilometre range, people are most attracted to them.

C.D.F. Plot

The C.D.F. tells us what percentage of a variable's values are less than or equal to the corresponding x-axis value. Taking the example above: what percentage of vehicles have a selling price of at most 15 lakh? We can answer this kind of question using the C.D.F.

df_Selling_Price=df.loc[:,"Selling_Price"]
count,bin_edges=np.histogram(df_Selling_Price,bins=10,density=True)
# density=True gives the normalized (density) form of count
print(count)
print(bin_edges)
PDF=count/sum(count)
CDF=np.cumsum(PDF)  # the CDF is the cumulative sum of the PDF values
plt.figure(figsize=(8,6))
plt.plot(bin_edges[1:],PDF,label="PDF")
plt.plot(bin_edges[1:],CDF,label="CDF")
plt.yticks(np.linspace(0,1,20))
plt.legend(loc="lower left")
plt.show()
Figure 12

As we can see, 94.7% of the cars on cardekho are priced ≤ 15 lakh. So one thing is clear: if we want to purchase a used car in the 20–25 lakh price range, we won't prefer cardekho.com, because we won't get many options there.

Multivariate analysis :- analyzing two or more variables together.

sns.set_style("darkgrid")
sns.jointplot(x="Present_Price",y="Selling_Price",data=df,kind="kde")
plt.show()
Figure 13

From the graph above we can see that vehicles whose original price lies in the range of 0–20 lakh recover approximately 50% of their money when sold after being used for some period of time.

Step 5: Feature Engineering

As we have already made cat_df with all the categorical features, we will drop those features from the original dataframe. At the end, after applying feature engineering to cat_df, we will concatenate cat_df back with the original dataframe; by doing this we convert all the variables into numerical form.

df=df.drop(list1,axis=1)
Figure 14

Now we will do feature engineering on cat_df to convert the categorical variables into numerical ones. But before that, we will check how many unique categories each feature contains.

dict1={}
for col in cat_df.columns:
    dict1[col]=cat_df[col].unique().tolist()
dict1  # key is the feature, value is the list of its categories
Figure 15

As Fuel_Type, Seller_Type, and Transmission are all nominal features with a small number of categories, we will use one-hot encoding.

cat_df=pd.get_dummies(cat_df,drop_first=True)
cat_df.head()
Figure 16

Now we will concatenate the encoded cat_df with the original df (the raw categorical columns were already dropped from df above).

# the raw categorical columns were already dropped from df above
df=pd.concat([df,cat_df],axis=1)
df.head()
Figure 17

Now we have converted all the features into numerical variables. Here we will check the correlation between the features, but we won't do feature selection for this dataset: feature selection matters when a dataset has a large number of features, so here we only check how the features are correlated with each other, as sketched below.
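The code behind Figure 18 is not shown in the post; a minimal sketch of how such a correlation heatmap is typically drawn with seaborn (the figure size and color map are my assumptions):

plt.figure(figsize=(10,8))
sns.heatmap(df.corr(),annot=True,cmap="RdYlGn")  # annotated feature-correlation matrix
plt.show()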

Figure 18

Step 6: Pre-Modeling Steps

Splitting the dataset into dependent and independent variables.

X=df.iloc[:,1:]
Y=df.iloc[:,0]
print(X.shape)
print(Y.shape)
[out] >> (301, 11)
(301,)

Splitting the dataset into train and test sets:

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=1)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
[out]>> (240, 11)
(240,)
(61, 11)
(61,)

Choosing the best-fit model for our dataset:

From the pair plot it is clear that we need a model that can handle non-linear relationships and a mix of categorical and numerical data, which points to decision tree, random forest, and XGBoost. We will then check which model gives the highest accuracy and select it on that basis. We can choose the best-fit model using the cross-validation score.

from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn import model_selection
models=[]
models.append(("CART", DecisionTreeRegressor()))
models.append(("KNN", KNeighborsRegressor()))
models.append(("RF", RandomForestRegressor()))
models.append(("XGBOOST", XGBRegressor()))
names=[]
result=[]
for name,model in models:
    k_fold=model_selection.KFold(n_splits=10,shuffle=True,random_state=7)
    score=model_selection.cross_val_score(model,X_train,Y_train,cv=k_fold,scoring="r2")
    result.append(score)
    names.append(name)
    print(name,score.mean(),score.std())
Figure 19

Let's understand the variation of the score for each algorithm with a box plot.

fig = plt.figure(figsize=(10,6))
plt.boxplot(result,labels=names)
plt.title('Algorithm Comparison',fontsize=25)
plt.show()
Figure 20

Looking at the accuracy scores above, XGBoost gives better accuracy with a very low standard deviation, hence we should go with XGBoost.

Feature importance :- checking which of the given features contribute most to the output prediction.

plt.figure(figsize=(8,6))
model=XGBRegressor()
model.fit(X,Y)
# sort the features and importances together so the bars stay aligned
order=np.argsort(model.feature_importances_)
plt.barh(X.columns[order],model.feature_importances_[order])
plt.show()
Figure 21

Transmission_Manual, Seller_Type_Individual, Fuel_Type_Petrol, Fuel_Type_Diesel, and No_of_years make the most impact on the output prediction.

Hyperparameter tuning:

With hyperparameter tuning we search for the best parameters of the given model so that we can obtain the best possible results from it.

param_grid={"n_estimators":[100,120,130,140,150],
"max_depth":range(1,12),
"booster":["gbtree","gblinear","dart"]
}
from sklearn.model_selection import RandomizedSearchCV
xgb=XGBRegressor()
random_cv=RandomizedSearchCV(estimator=xgb,param_distributions=param_grid,n_iter=100,cv=10)
random_cv.fit(X_train,Y_train)

Let's check the best parameters found, which will be used for prediction.

random_cv.best_params_
[out]>> {'n_estimators': 130, 'max_depth': 3, 'booster': 'dart'}

Checking the train and test accuracy (R² score) of the model:

from sklearn.metrics import r2_score 
xgb=XGBRegressor(n_estimators= 130, max_depth=3,booster= 'dart')
xgb.fit(X_train,Y_train)
Y_train_predicted=xgb.predict(X_train)
Y_test_predicted=xgb.predict(X_test)
print("Train set accuracy: ",r2_score(Y_train,Y_train_predicted))
print("Test set accuracy : ",r2_score(Y_test,Y_test_predicted))
Figure 22

Understanding the predicted values:

Result=pd.DataFrame({"Actual":Y_test,"Predicted":Y_test_predicted})
Result.head(10)
Figure 23

Plotting a KDE plot to see the variation between the actual and predicted values; a sketch of the code is below.
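The code behind Figure 24 is not shown in the post; a minimal sketch of such an actual-vs-predicted KDE comparison (the overlaid curves and axis labels are my assumptions):

plt.figure(figsize=(8,6))
sns.kdeplot(Y_test,label="Actual")  # distribution of actual selling prices
sns.kdeplot(Y_test_predicted,label="Predicted")  # distribution of predicted prices
plt.xlabel("Selling Price")
plt.legend()
plt.show()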

Figure 24

Saving the model in serialized format:

We save the model in serialized format so that whenever we need a prediction we just load the pickle file and predict with it; we don't need to build a new model again for new test data.

import pickle

# save the trained model as a pickle file
with open("car_price_prediction.pkl","wb") as file:
    pickle.dump(xgb,file)

# load the pickle file for later predictions
with open("car_price_prediction.pkl","rb") as file:
    model=pickle.load(file)
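A quick, hypothetical usage check of the deserialized model (model comes from the loading code above):

print(model.predict(X_test[:5]))  # predicted selling prices for five test cars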

Conclusion

If you have any suggestions regarding this blog, please comment below. Keep learning, keep exploring!
