Netflix — Data Exploration and Visualization

Shagun Sharma
Published in DataDrivenInvestor
6 min read · Jul 23, 2022

Netflix is one of the most popular media and video streaming platforms. It has over 10000 movies and TV shows available on its platform and, as of mid-2021, over 222M subscribers globally. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.

Our aim in exploring this dataset is to analyze the data and generate insights that could help Netflix decide which types of shows or movies to produce and how to grow the business in different countries.

Refer to this link to access the dataset. You can download the files by right-clicking on the page and clicking on “Save As”, then naming the file as per your wish, with .csv as the extension.

Let’s start…!

Data Visualization

First, we import the relevant libraries to load and explore the dataset. We rely on numpy and pandas for numerical computation and analysis, and on matplotlib and seaborn for visualization. The dataset consists of a list of all the TV shows and movies available on Netflix, with the following columns:

Show_id: Unique ID for every Movie / TV Show
Type: Whether the title is a Movie or a TV Show
Title: Title of the Movie / TV Show
Director: Director of the Movie
Cast: Actors involved in the movie/show
Country: Country where the movie/show was produced
Date_added: Date it was added on Netflix
Release_year: Actual Release year of the movie/show
Rating: TV Rating of the movie/show
Duration: Total Duration — in minutes or number of seasons
Listed_in: Genre
Description: The summary description
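As a minimal sketch of the loading step, the snippet below reads a two-row stand-in for the dataset with pandas. The rows are toy values made up for illustration, and the lowercase column names are an assumption about how the headers appear in the downloaded file; in practice you would pass the path of the .csv you saved instead of the in-memory sample.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded file (illustrative rows, not real data).
# With the real dataset you would call pd.read_csv("<your file>.csv") instead.
SAMPLE_CSV = io.StringIO(
    "show_id,type,title,director,cast,country,date_added,"
    "release_year,rating,duration,listed_in,description\n"
    's1,Movie,Title A,Director A,"Actor 1, Actor 2",United States,'
    '"September 25, 2021",2020,PG-13,90 min,"Dramas, Comedies",Toy row\n'
    's2,TV Show,Title B,,,India,'
    '"September 24, 2021",2021,TV-MA,2 Seasons,TV Dramas,Toy row\n'
)

df = pd.read_csv(SAMPLE_CSV)

print(df.shape)   # number of rows and columns
print(df.dtypes)  # data type pandas inferred for each column
```

Empty fields, such as the director and cast of the second row, are read as missing values, which is what the null checks later on rely on.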

As a first step of data preprocessing, we need to note how many entries in our dataset are null or blank and clean the dataset accordingly. We therefore plot a heatmap of the null values to visualize which records are missing.

The white rows denote the blank records in the data
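A heatmap like the one described can be sketched as follows. This is one common way to do it with seaborn, not necessarily the exact call used in the original notebook, and the small frame is toy data. The helper name `plot_null_heatmap` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (illustrative values only).
df = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C"],
    "director": ["Director A", np.nan, np.nan],
    "cast": [np.nan, "Actor 1", "Actor 2"],
    "country": ["India", np.nan, "Japan"],
})

# Numeric counterpart of the heatmap: number of nulls per column.
null_counts = df.isnull().sum()
print(null_counts)


def plot_null_heatmap(frame):
    """Draw one cell per record and column; null cells get the light colour."""
    # Imported here so the counting above works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(frame.isnull(), cbar=False, yticklabels=False)
    ax.set_title("Missing values by column")
    plt.show()
```

Calling `plot_null_heatmap(df)` draws the figure, while `null_counts` gives the same information as plain numbers.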

Our dataset is pretty clean except for the ‘director’, ‘cast’ and ‘country’ columns. Since our insights are aimed at growing the business in different countries, we have to impute the values in the ‘country’ column. We will drop the other two columns, as they have a lot of missing values. Now let’s explore the distribution of TV Shows and Movies on Netflix.

We can see that the distribution is skewed: the two types of content are not equally represented on the platform. To avoid bias in our analysis and results, we will divide the data into two categories based on type (i.e. TV Show and Movie), apply the same methodology to both, and work out possible measures for each category separately.

The data contains a variety of genres, listed in the ‘Listed_in’ column. We will pick the top 10 genres in the data and drill deeper into them, which lets us narrow down the genres our users are interested in. We therefore plot a graph of the top genres of the content on Netflix. You can also try picking more than the top 10 genres to extend the analysis.
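The imbalance between the two types can also be checked numerically. The sketch below uses toy data and assumes the column is named `type`.

```python
import pandas as pd

# Toy data: three movies and one TV show (illustrative only).
df = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})

type_counts = df["type"].value_counts()               # absolute counts
type_share = df["type"].value_counts(normalize=True)  # fractions summing to 1

print(type_counts)
print(type_share)
```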
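Because a single title can list several genres, counting genres needs the comma-separated string split into one genre per row first. A minimal sketch, assuming the column is named `listed_in` and genres are separated by commas (toy values below):

```python
import pandas as pd

# Toy data: each title lists one or more comma-separated genres.
df = pd.DataFrame({
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies, Dramas, Thrillers"],
})

# Split each string into a list, give every genre its own row, trim spaces.
genres = df["listed_in"].str.split(",").explode().str.strip()

# Count occurrences and keep the 10 most frequent genres.
top_genres = genres.value_counts().head(10)
print(top_genres)
```

Changing the argument of `head` is all it takes to look at more than the top 10.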

Data Pre-processing

Next, we split our data into two categories on the basis of type (i.e. TV Show or Movie). We also drop the director and cast columns from the data.

Segregating the data with ‘type’ as Movie
Dropping the columns with null values
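The two steps captioned above can be sketched like this, on toy data and with the variable names `movies` and `tv_shows` chosen for illustration:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only).
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "title": ["Title A", "Title B", "Title C"],
    "director": ["Director A", np.nan, np.nan],
    "cast": [np.nan, "Actor 1", np.nan],
    "country": ["India", np.nan, "Japan"],
})

# Columns we decided to drop because they have many missing values.
sparse_columns = ["director", "cast"]

# One frame per content type, each without the sparse columns.
movies = df[df["type"] == "Movie"].drop(columns=sparse_columns).reset_index(drop=True)
tv_shows = df[df["type"] == "TV Show"].drop(columns=sparse_columns).reset_index(drop=True)

print(movies)
```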

Our next step is to impute the empty values in the country column. As discussed earlier, we cannot drop this column because our aim is to grow the business in different countries. Although there are several ways to impute these values, we will proceed with forward-fill imputation, which copies the last non-null value down into the empty records that follow it. Forward filling assumes that the data might have been collected sequentially, so that records for similar countries tend to sit next to each other. To be clear, this is just an assumption with no proof behind it, so it is entirely at your discretion whether to continue with this approach or try other methods of your choice.

In upcoming revisions, we will train a model to predict the null values for country, using the filled records as the training sample. One possible drawback is that we have relatively few samples in the training set for this case, so the model might tend to overfit.

The null records are now imputed and there is no record with a blank country value in the data
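Forward filling is available in pandas as `ffill`. A small sketch on toy data shows what it does to a run of missing countries:

```python
import numpy as np
import pandas as pd

# Toy data: gaps after "India" and after "Japan" (illustrative only).
movies = pd.DataFrame({
    "country": ["India", np.nan, np.nan, "Japan", np.nan],
})

# Each null takes the most recent non-null value above it.
movies["country"] = movies["country"].ffill()

print(movies["country"].tolist())
print(movies["country"].isna().sum())  # remaining nulls
```

One caveat: a null in the very first row has nothing above it to copy from, so it stays null and needs to be handled separately.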

We will now impute the other missing values in the data. Part 1 of the data (i.e. the records with type Movie) contains two records where the rating is null. Since this is only a handful of records, we can fill in the values manually.

A couple more records have empty values in the date_added column. For now, we will drop these records, as there are very few of them and removing them won’t hamper our analysis much. In further revisions we will revisit them and impute the values manually, since there are not many to fill.
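Manual filling can be done with a small look-up keyed on the title. The titles and ratings below are placeholders, not the values from the actual dataset; you would substitute the ratings you look up for your own null rows.

```python
import numpy as np
import pandas as pd

# Toy data with two missing ratings (illustrative only).
movies = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C"],
    "rating": ["PG-13", np.nan, np.nan],
})

# Hypothetical look-up table: title -> rating found by hand.
manual_ratings = {"Title B": "TV-14", "Title C": "R"}

# Only touch the rows where the rating is actually missing.
missing = movies["rating"].isna()
movies.loc[missing, "rating"] = movies.loc[missing, "title"].map(manual_ratings)

print(movies)
```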

Dropping the records with null value in date_added column
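Dropping only the rows with a missing date, while leaving nulls in other columns alone, can be sketched with `dropna` and its `subset` argument (toy data, assuming the column is named `date_added`):

```python
import pandas as pd

# Toy data: the second record has no date (illustrative only).
movies = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C"],
    "date_added": ["September 25, 2021", None, "January 1, 2020"],
})

# Remove rows where date_added is null; other columns are not considered.
movies = movies.dropna(subset=["date_added"]).reset_index(drop=True)

print(movies)
```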

As we can see in the data, the country column can hold multiple countries for a single title. We will use the first country listed for our further analysis.

We then narrow the processed data down to the top 10 countries and group it by genre and rating, in line with the objective of our analysis.

Restricting the data to the top 10 countries, as we do here, is optional, and you are free to analyze all the countries present in the data instead. We chose this route because some countries have no more than a couple of movies, which makes it hard to offer any concrete suggestion about the type of content to promote there.

This gives us insight into which kinds of content are most common in each country, based on the top genres, and so which kinds of content might be worth developing further. Note that the dataset lists titles rather than viewership, so title counts serve here as a proxy for audience interest.
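Taking the first country from a comma-separated entry can be sketched as below. The new column name `main_country` is a hypothetical choice, and the rows are toy values.

```python
import pandas as pd

# Toy data: some titles list more than one country (illustrative only).
movies = pd.DataFrame({
    "country": ["United States, India", "Japan", "France, Belgium, Canada"],
})

# Split on commas, keep the first piece, trim surrounding spaces.
movies["main_country"] = movies["country"].str.split(",").str[0].str.strip()

print(movies)
```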

Top 10 countries for the Movies

Similar to how we found the top 10 genres from the data, we will find our top 10 genres for Movies again (since the previous one was plotted for the entire data before segregation into two categories) that will be used for further analysis as mentioned previously.

Now that we have our top 10 countries and genres for Movies, we can clearly depict and identify the trend in the ratings and type of content preferred by the audience in different countries. Let us filter out the data to reflect the top countries and genres (as selected above).
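The filtering step can be sketched as follows. This is one possible approach rather than the notebook's exact code: it gives each genre of a title its own row, finds the most frequent countries and genres, and keeps only the rows that fall in both lists. The names `main_country`, `genre` and `TOP_N` are hypothetical, and `TOP_N` is set to 2 only because the toy data is tiny; the article uses 10.

```python
import pandas as pd

TOP_N = 2  # the article uses 10; 2 keeps the toy example readable

# Toy data (illustrative values only).
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "main_country": ["India", "India", "India", "Japan", "Japan", "Chile"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies",
                  "Dramas", "Thrillers", "Dramas"],
    "rating": ["TV-14", "TV-14", "TV-MA", "TV-MA", "R", "PG"],
})

# One row per (title, genre) pair.
long = movies.assign(genre=movies["listed_in"].str.split(",")).explode("genre")
long["genre"] = long["genre"].str.strip()

# Most frequent countries (by title) and genres (by title-genre pair).
top_countries = movies["main_country"].value_counts().head(TOP_N).index
top_genres = long["genre"].value_counts().head(TOP_N).index

# Keep rows whose country and genre are both in the top lists.
filtered = long[
    long["main_country"].isin(top_countries) & long["genre"].isin(top_genres)
]

print(filtered[["title", "main_country", "genre", "rating"]])
```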

Take a brief look into the ratings provided for our filtered content and the top countries as well as the genres for the same.

With the processing above complete, we are ready to draw our final conclusions. We will plot the data to show the most common movie genres and their corresponding ratings in different countries. This helps us decide which genres, at which ratings, to promote in each country.

Illustrated above is a summary of our data aimed directly at the goal of this analysis. You are free to build further representations in other ways to provide insights from different perspectives.

We can also plot the ratings of the most common movies in these countries. This could help producers target specific audiences with content at the ratings those audiences watch most.

The methodology described above is repeated for TV Shows, and similar conclusions are drawn from it. Please refer to the GitHub repository for this project if you wish to go through that part of the analysis.
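The numbers behind such a plot can be sketched as a count of titles per country, genre and rating. The data is a toy example, and the column names `main_country` and `genre` are hypothetical.

```python
import pandas as pd

# Toy filtered data: one row per (title, genre) pair (illustrative only).
filtered = pd.DataFrame({
    "main_country": ["India", "India", "India", "Japan"],
    "genre": ["Dramas", "Dramas", "Comedies", "Dramas"],
    "rating": ["TV-14", "TV-14", "TV-MA", "TV-MA"],
})

# Number of titles for every country / genre / rating combination.
counts = filtered.groupby(["main_country", "genre", "rating"]).size()

# Wide layout: one row per country and genre, one column per rating.
table = counts.unstack("rating", fill_value=0)

print(table)
```

A table in this shape can be handed to a plotting call such as `table.plot(kind="bar", stacked=True)` to get one stacked bar per country and genre.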

Resources:

  1. GitHub Notebook
  2. Data for analysis

To be continued……

Thank you for reading!

