Big Data using PySpark in Databricks

Karan Choudhary
Published in DataDrivenInvestor
5 min read · Nov 14, 2020


Implementation of linear regression using the PySpark library in Databricks. Before starting the implementation, we should be familiar with the Databricks platform.

This is the homepage of Databricks, a platform developed specially for running Spark, and its Community Edition is free of cost. Before starting a session we have to create a cluster, setting its properties, such as RAM and other requirements, based on the workload we want to run.

In Databricks we can use Spark RDDs, SQL, MLlib, and GraphX. It is one of the best platforms to learn Spark without needing anything on the local system.

We will be implementing linear regression on a dataset, but before that we need to upload the file.

Click the Data icon, then browse and select the file you want to upload; on clicking Upload, the file is uploaded. Here, EC.csv gets uploaded.

Now we will create a cluster. Click the Clusters icon, then the Create Cluster option; set the name of the cluster (here karan_first_cluster) and the other requirements, and then click the Create Cluster button.

Before implementing the code, let us go over the theory of linear regression and how it works.

Linear regression is a technique used to model the relationships between observed variables. The idea behind simple linear regression is to “fit” the observations of two variables into a linear relationship between them. Graphically, the task is to draw the line that is “best-fitting” or “closest” to the points (x_i, y_i), where x_i and y_i are observations of the two variables which are expected to depend linearly on each other.

The best-fitting linear relationship between the variables x and y.

Regression is a common process used in many applications of statistics in the real world. There are two main types of applications:

  • Predictions: After a series of observations of variables, regression analysis gives a statistical model for the relationship between the variables. This model can be used to generate predictions: given two variables x and y, the model can predict values of y given future observations of x. This idea is used to predict variables in countless situations, e.g. the outcome of political elections, the behavior of the stock market, or the performance of a professional athlete.
  • Correlation: The model given by a regression analysis will often fit some kinds of data better than others. This can be used to analyze correlations between variables and to refine a statistical model to incorporate further inputs: if the model describes certain subsets of the data points very well, but is a poor predictor for other data points, it can be instructive to examine the differences between the different types of data points for a possible explanation. This type of application is common in scientific tests, e.g. of the effects of a proposed drug on the patients in a controlled study.

Although many measures of best fit are possible, for most applications the best-fitting line is found using the method of least squares. That is, viewing y as a linear function of x, the method finds the linear function L which minimizes the sum of the squares of the errors.
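As a quick illustration of the least-squares idea (this sketch uses NumPy with made-up toy numbers, not the article's dataset), the slope and intercept of the best-fitting line have a simple closed form:

```python
import numpy as np

# Toy observations: y is roughly 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Least-squares estimates for y = slope * x + intercept:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)  # slope close to 2, intercept close to 1
```

These are exactly the values that minimize the sum of squared errors between the observed y and the line's predictions.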

After creating the cluster and uploading the file, we come back to the home page and create a new notebook.

We will first import the PySpark library.

import pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('LINEAR_REGRESSION_KARAN_FIRSTFILE').getOrCreate()

appName sets the name of the application, and getOrCreate returns the existing SparkSession or creates a new one if none exists.

Now we will import the dataset through a command.

from pyspark.ml.regression import LinearRegression
data_EC = spark.read.csv('dbfs:/FileStore/tables/EC.csv', inferSchema=True, header=True)

Here is a screenshot of the commands we have applied.

# Now we will print the schema and then compute the correlation between each independent feature and the dependent feature.

print(data_EC.corr('Time on App', 'Time on Website'))

print(data_EC.corr('Time on App', 'Yearly Amount Spent'))

#output

0.0823882731909817

0.4993277700534503

#now we will assemble the independent variables into a single vector per row, so we can make a better prediction.

#converting the numerical columns into a single feature vector
from pyspark.ml.feature import VectorAssembler

#taking the numerical columns from the dataset
assembler = VectorAssembler(inputCols=['Avg Session Length', 'Time on App', 'Time on Website', 'Length of Membership'], outputCol='features')

assembler

#combining these columns into one 'features' column: ['Avg Session Length', 'Time on App', 'Time on Website', 'Length of Membership']

output = assembler.transform(data_EC)

#then predicting the dependent variable from the features

#we can see the dense vector, which is a combination of 'Avg Session Length', 'Time on App', 'Time on Website', and 'Length of Membership'
#now we will fit the multiple linear regression and evaluate it
final_data = output.select('features', 'Yearly Amount Spent')

#splitting the data into train and test
train_split, test_split = final_data.randomSplit([0.7, 0.3])

The data is split in a 70/30 ratio.

Making the linear regression model and then predicting the residuals.

lr = LinearRegression(labelCol='Yearly Amount Spent')
lr_model = lr.fit(train_split)
lr_model

LinearRegressionModel: uid=LinearRegression_89496b266f8f, numFeatures=4

#error: actual - predicted for each row
test_results = lr_model.evaluate(test_split)
#residuals refer to actual - predicted
test_results.residuals.show()

Here is the end of the linear regression. We will now save the results to a file, because if the cluster dies, the data on it will be erased.

We can save it in various formats and then load it again later.

All viewers can access the code through the drive link below, and if you find the article interesting, do comment on it.

https://drive.google.com/drive/folders/1mWmvpINd05_n2DKLtPJiyjA94PSNX0Cu?usp=sharing
