This dataset contains information about more than 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The dataset was downloaded from here.
For my exploration, I'm interested in finding out which genres of movies have the highest revenues, and whether vote counts and budget have any relationship with a movie's revenue. I am also interested in knowing which production companies have the highest-grossing movies and how a movie's revenue relates to its popularity.
# import libraries needed for our investigation
import pandas as pd # data wrangling
import numpy as np # mathematical calculations
import matplotlib.pyplot as plt # visualizations
# import seaborn as sns # visualizations
%matplotlib inline
# load and summarize dataset
tmdb_movies_data = pd.read_csv("Database_TMDb_movie_data.csv")
tmdb_movies_data.shape
(10866, 21)
Our dataset contains 10866 observations (rows) and 21 columns.
# inspect dataset (1) - show the first 2 rows
tmdb_movies_data.head(2)
id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
2 rows × 21 columns
# inspect dataset (2) - show the last 2 rows
tmdb_movies_data.tail(2)
id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10864 | 21449 | tt0061177 | 0.064317 | 0 | 0 | What's Up, Tiger Lily? | Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh... | NaN | Woody Allen | WOODY ALLEN STRIKES BACK! | ... | In comic Woody Allen's film debut, he took the... | 80 | Action|Comedy | Benedict Pictures Corp. | 11/2/66 | 22 | 5.4 | 1966 | 0.000000 | 0.0 |
10865 | 22293 | tt0060666 | 0.035919 | 19000 | 0 | Manos: The Hands of Fate | Harold P. Warren|Tom Neyman|John Reynolds|Dian... | NaN | Harold P. Warren | It's Shocking! It's Beyond Your Imagination! | ... | A family gets lost on the road and stumbles up... | 74 | Horror | Norm-Iris | 11/15/66 | 15 | 1.5 | 1966 | 127642.279154 | 0.0 |
2 rows × 21 columns
# Get a general idea of every column (series) in the dataset
tmdb_movies_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10866 entries, 0 to 10865 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10866 non-null int64 1 imdb_id 10856 non-null object 2 popularity 10866 non-null float64 3 budget 10866 non-null int64 4 revenue 10866 non-null int64 5 original_title 10866 non-null object 6 cast 10790 non-null object 7 homepage 2936 non-null object 8 director 10822 non-null object 9 tagline 8042 non-null object 10 keywords 9373 non-null object 11 overview 10862 non-null object 12 runtime 10866 non-null int64 13 genres 10843 non-null object 14 production_companies 9836 non-null object 15 release_date 10866 non-null object 16 vote_count 10866 non-null int64 17 vote_average 10866 non-null float64 18 release_year 10866 non-null int64 19 budget_adj 10866 non-null float64 20 revenue_adj 10866 non-null float64 dtypes: float64(4), int64(6), object(11) memory usage: 1.7+ MB
# Show number of missing values
tmdb_movies_data.isnull().sum()
id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64
The columns homepage, tagline, and keywords have a high number of missing values and need to be processed in some way. The overview column has only four missing values, but like the other three it is free text that is not needed for this analysis, so I will be dropping all four columns from my dataset as part of the cleaning process.
I will also be dropping the id and imdb_id columns, as they are simply unique identifiers and do not have any bearing on the variable of interest (revenue).
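Raw counts of missing values are easier to compare when expressed as a share of all rows. From the counts above, homepage is missing in roughly 73% of rows (7930 of 10866), while overview is missing in well under 1%. The sketch below shows one way to compute such shares; the helper name `missing_share` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    # isnull() gives a boolean frame; the mean of booleans is the share of True values
    return df.isnull().mean().sort_values(ascending=False)


# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "homepage": [np.nan, np.nan, np.nan, "http://example.com"],
    "overview": ["a", "b", "c", np.nan],
    "revenue": [1, 2, 3, 4],
})

shares = missing_share(toy)
print(shares)
```

On the real data, the same helper could be called as `missing_share(tmdb_movies_data)` before deciding which columns to drop.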
# The genres column needs to be expanded into separate columns to see the individual genre values
tmdb_movies_data["genres"].str.split("|", expand=True).head(5)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | Action | Adventure | Science Fiction | Thriller | None |
1 | Action | Adventure | Science Fiction | Thriller | None |
2 | Adventure | Science Fiction | Thriller | None | None |
3 | Action | Adventure | Science Fiction | Fantasy | None |
4 | Action | Crime | Thriller | None | None |
# The production_companies column also needs to be expanded into separate columns to see the individual production companies
tmdb_movies_data["production_companies"].str.split("|", expand=True).head(5)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | Universal Studios | Amblin Entertainment | Legendary Pictures | Fuji Television Network | Dentsu |
1 | Village Roadshow Pictures | Kennedy Miller Productions | None | None | None |
2 | Summit Entertainment | Mandeville Films | Red Wagon Entertainment | NeoReel | None |
3 | Lucasfilm | Truenorth Productions | Bad Robot | None | None |
4 | Universal Pictures | Original Film | Media Rights Capital | Dentsu | One Race Films |
Each movie has at most five genres and at most five production companies. I will expand the dataframe to include these as separate columns for later analysis, and then drop the original genres and production_companies columns as they will no longer be needed.
tmdb_movies_data[["genre_1", "genre_2", "genre_3", "genre_4", "genre_5"]] = tmdb_movies_data["genres"].str.split("|", expand=True)
tmdb_movies_data[["prod_com_1", "prod_com_2", "prod_com_3", "prod_com_4", "prod_com_5"]] = tmdb_movies_data["production_companies"].str.split("|", expand=True)
tmdb_movies_data.head(2)
id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | genre_1 | genre_2 | genre_3 | genre_4 | genre_5 | prod_com_1 | prod_com_2 | prod_com_3 | prod_com_4 | prod_com_5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Action | Adventure | Science Fiction | Thriller | None | Universal Studios | Amblin Entertainment | Legendary Pictures | Fuji Television Network | Dentsu |
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | Action | Adventure | Science Fiction | Thriller | None | Village Roadshow Pictures | Kennedy Miller Productions | None | None | None |
2 rows × 31 columns
# Drop columns that are not needed for analysis
tmdb_movies_data.drop(columns=[
"homepage",
"tagline",
"keywords",
"overview",
"id",
"imdb_id",
"genres",
"production_companies",
],
inplace=True
)
# Check for number of missing values after column drop
tmdb_movies_data.isnull().sum()
popularity            0
budget                0
revenue               0
original_title        0
cast                 76
director             44
runtime               0
release_date          0
vote_count            0
vote_average          0
release_year          0
budget_adj            0
revenue_adj           0
genre_1              23
genre_2            2351
genre_3            5787
genre_4            8885
genre_5           10324
prod_com_1         1030
prod_com_2         4470
prod_com_3         7050
prod_com_4         8813
prod_com_5         9740
dtype: int64
The numbers of missing values in the cast, director, and genre_1 columns are very small, so the affected rows can be dropped.
For the expanded genre and production company columns, I will also drop the last four columns (_2, _3, _4, _5) of each, as the first columns contain enough non-null values to help us reach a representative conclusion.
I will then fill the missing values in the prod_com_1 column with "Not Available", as there are a little over 1000 such observations and I don't want to drop them.
# Drop other columns that are not needed for analysis
tmdb_movies_data.drop(columns=[
"genre_2",
"genre_3",
"genre_4",
"genre_5",
"prod_com_2",
"prod_com_3",
"prod_com_4",
"prod_com_5",
],
inplace=True
)
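# fill null values for the prod_com_1 column and drop null values from columns with low number of missing values
# (assign the result back instead of calling fillna with inplace=True on a column selection)
tmdb_movies_data["prod_com_1"] = tmdb_movies_data["prod_com_1"].fillna("Not Available")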
tmdb_movies_data.dropna(inplace=True)
tmdb_movies_data.info();
<class 'pandas.core.frame.DataFrame'> Int64Index: 10732 entries, 0 to 10865 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 popularity 10732 non-null float64 1 budget 10732 non-null int64 2 revenue 10732 non-null int64 3 original_title 10732 non-null object 4 cast 10732 non-null object 5 director 10732 non-null object 6 runtime 10732 non-null int64 7 release_date 10732 non-null object 8 vote_count 10732 non-null int64 9 vote_average 10732 non-null float64 10 release_year 10732 non-null int64 11 budget_adj 10732 non-null float64 12 revenue_adj 10732 non-null float64 13 genre_1 10732 non-null object 14 prod_com_1 10732 non-null object dtypes: float64(4), int64(5), object(6) memory usage: 1.3+ MB
# check dataset for any duplicates
tmdb_movies_data.duplicated().sum()
1
# drop duplicated data
tmdb_movies_data.drop_duplicates(inplace=True)
tmdb_movies_data.head()
popularity | budget | revenue | original_title | cast | director | runtime | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | genre_1 | prod_com_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | 124 | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | Action | Universal Studios |
1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | George Miller | 120 | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 | Action | Village Roadshow Pictures |
2 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | Robert Schwentke | 119 | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 | Adventure | Summit Entertainment |
3 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | J.J. Abrams | 136 | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 | Action | Lucasfilm |
4 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | James Wan | 137 | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 | Action | Universal Pictures |
tmdb_movies_data.info();
<class 'pandas.core.frame.DataFrame'> Int64Index: 10731 entries, 0 to 10865 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 popularity 10731 non-null float64 1 budget 10731 non-null int64 2 revenue 10731 non-null int64 3 original_title 10731 non-null object 4 cast 10731 non-null object 5 director 10731 non-null object 6 runtime 10731 non-null int64 7 release_date 10731 non-null object 8 vote_count 10731 non-null int64 9 vote_average 10731 non-null float64 10 release_year 10731 non-null int64 11 budget_adj 10731 non-null float64 12 revenue_adj 10731 non-null float64 13 genre_1 10731 non-null object 14 prod_com_1 10731 non-null object dtypes: float64(4), int64(5), object(6) memory usage: 1.3+ MB
We now have 10731 rows and 15 columns in our dataset after cleaning and can move on to exploration.
# describe numerical variables in the tmdb_movies_dataset
(tmdb_movies_data.describe()).style.format("{0:,.0f}")
popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|
count | 10,731 | 10,731 | 10,731 | 10,731 | 10,731 | 10,731 | 10,731 | 10,731 | 10,731 |
mean | 1 | 14,803,646 | 40,319,888 | 102 | 220 | 6 | 2,001 | 17,765,303 | 52,006,229 |
std | 1 | 31,064,556 | 117,652,421 | 30 | 579 | 1 | 13 | 34,466,302 | 145,425,154 |
min | 0 | 0 | 0 | 0 | 10 | 2 | 1,960 | 0 | 0 |
25% | 0 | 0 | 0 | 90 | 17 | 5 | 1,995 | 0 | 0 |
50% | 0 | 0 | 0 | 99 | 39 | 6 | 2,006 | 0 | 0 |
75% | 1 | 16,000,000 | 25,000,000 | 112 | 148 | 7 | 2,011 | 21,108,852 | 34,705,457 |
max | 33 | 425,000,000 | 2,781,505,847 | 900 | 9,767 | 9 | 2,015 | 425,000,000 | 2,827,123,750 |
The above gives a summary description of the various numerical variables in our dataset, including popularity, budget, revenue, budget_adj, and revenue_adj. We can already tell that there are very large skews in the budgets and revenues: the medians are zero while the maximum values run into the hundreds of millions or more. The next step is to visualize these variables to give a better idea of the skews.
# Plot the distribution of all numerical variables
tmdb_movies_data.hist(figsize=(15, 15));
The budget(adj), revenue(adj), popularity, runtime, and vote_count of our dataset are all right-skewed. As a matter of fact, only about half of the movies have a budget provided. To make valid inferences, it is better to analyse the adjusted budgets only for the subset of movies with adjusted budgets greater than zero.
The release_year is left-skewed, which shows that more of the movies in the dataset were released in later years. Only the vote_average seems close to a normal distribution.
Is there a direct relationship between a movie's budget and its revenue?
movies_with_budget = tmdb_movies_data[tmdb_movies_data["budget_adj"] > 0]
movies_with_budget.info();
<class 'pandas.core.frame.DataFrame'> Int64Index: 5153 entries, 0 to 10865 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 popularity 5153 non-null float64 1 budget 5153 non-null int64 2 revenue 5153 non-null int64 3 original_title 5153 non-null object 4 cast 5153 non-null object 5 director 5153 non-null object 6 runtime 5153 non-null int64 7 release_date 5153 non-null object 8 vote_count 5153 non-null int64 9 vote_average 5153 non-null float64 10 release_year 5153 non-null int64 11 budget_adj 5153 non-null float64 12 revenue_adj 5153 non-null float64 13 genre_1 5153 non-null object 14 prod_com_1 5153 non-null object dtypes: float64(4), int64(5), object(6) memory usage: 644.1+ KB
(movies_with_budget.describe()).style.format("{0:,.0f}")
popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|
count | 5,153 | 5,153 | 5,153 | 5,153 | 5,153 | 5,153 | 5,153 | 5,153 | 5,153 |
mean | 1 | 30,828,241 | 80,531,576 | 107 | 410 | 6 | 2,001 | 36,995,822 | 102,504,045 |
std | 1 | 38,931,994 | 159,674,751 | 23 | 789 | 1 | 12 | 41,982,022 | 196,144,368 |
min | 0 | 1 | 0 | 0 | 10 | 2 | 1,960 | 1 | 0 |
25% | 0 | 6,000,000 | 0 | 93 | 36 | 6 | 1,996 | 8,142,944 | 0 |
50% | 1 | 17,500,000 | 21,126,225 | 103 | 123 | 6 | 2,005 | 22,878,673 | 29,019,220 |
75% | 1 | 40,000,000 | 90,000,000 | 117 | 403 | 7 | 2,010 | 50,245,348 | 113,749,780 |
max | 33 | 425,000,000 | 2,781,505,847 | 540 | 9,767 | 8 | 2,015 | 425,000,000 | 2,827,123,750 |
Since we are interested in the budget for our analysis, it would be good to get an idea of how it is distributed as a single variable. A histogram is a good way to visualize the distribution of a variable.
# plot histogram of adjusted budget
movies_with_budget["budget_adj"].hist();
The above chart shows a budget distribution that is right-skewed: most movies have relatively small budgets, while a long tail of a few very expensive movies pulls the mean above the median.
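Skewness can also be checked numerically rather than only by eye. The sketch below uses a small made-up series (not the TMDb data) to show the two usual signs of right skew: a mean above the median and a positive value from pandas' `Series.skew()`.

```python
import pandas as pd

# Made-up budgets: many small values and one very large one
budgets = pd.Series([1, 2, 2, 3, 3, 4, 50])

mean_value = budgets.mean()
median_value = budgets.median()
skewness = budgets.skew()  # sample skewness; positive means a longer right tail

print("mean:", round(mean_value, 2))
print("median:", median_value)
print("skewness:", round(skewness, 2))
```

Applied to the notebook's data, the same three calls on `movies_with_budget["budget_adj"]` would give a numeric counterpart to the histogram.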
Another good way to visualize the spread of a variable is to use a box plot, which also makes the outliers in the dataset visible.
As I will be plotting more than one box plot, I will create a function box_plotter() that takes in the dataset, column to plot, title of plot, and color as parameters and plots using these parameters.
# Function to plot box plots given the dataframe, column title, plot title, and color
def box_plotter(df, col, title, color):
    """
    this function plots a box plot with the df (DataFrame),
    column (Series), color (string) and title (string) supplied as parameters.
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    plot = df[col].plot(kind="box", vert=False, title=title, color=color, ax=ax)
    return plot
# box plot of budget_adj
box_plotter(tmdb_movies_data, "budget_adj", "Distribution of adjusted movie budget for our entire dataset", "green");
The box plot above shows that a lot of the movie budgets were not provided: the box is compressed against zero at the left edge, and the movies with a reported budget appear as a long run of outlier points.
# box plot of budget_adj for dataset subset with budgets higher than zero
box_plotter(movies_with_budget, "budget_adj", "Distribution of adjusted movie budget for movies with budgets greater than zero", "red");
Our box plot for the subset shows that most movies have adjusted budgets from 1 to about 280,000,000, with a small number of outliers beyond that.
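The points drawn individually on a box plot are, by matplotlib's default, the values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below computes those bounds directly; the helper name `iqr_bounds` and the sample values are assumptions made for illustration.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) bounds outside which values count as outliers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up sample with one obviously extreme value
sample = pd.Series([10, 12, 14, 15, 16, 18, 20, 100])

lower, upper = iqr_bounds(sample)
outliers = sample[(sample < lower) | (sample > upper)]

print("bounds:", lower, upper)
print("outliers:", outliers.tolist())
```

Calling `iqr_bounds(movies_with_budget["budget_adj"])` would give the numeric upper bound that corresponds to the end of the right whisker region in the plot above.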
Next, I will be plotting a series of scatter plots to visualize the relationship between some variables and the revenue. Since there will be more than one plot, I will be creating a function scatter_plotter() that takes in parameters specific to each plot and visualizes the plot.
# Define function scatter_plotter
def scatter_plotter(df, x, y, ax_x, ax_y, color, title):
    """
    this function plots a scatter plot with the df (DataFrame),
    x(x axis), y(y axis), ax_x(x axis length), ax_y(y axis length),
    color (string), and title (string) supplied as parameters.
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    # dashed red reference line from the origin to (ax_x, ax_y)
    ax.plot([0, ax_x], [0, ax_y], linestyle="--", color="red")
    plot = df.plot(x, y, kind="scatter", color=color, title=title, ax=ax, alpha=0.9)
    return plot
# visualize budget_adj vs revenue_adj for entire dataset
scatter_plotter(tmdb_movies_data,
"budget_adj", "revenue_adj",
400000000, 400000000,
"#800080",
"Relationship between a movie's budget (adjusted) and revenue (adjusted) for entire dataset");
# Note: the correlation below is computed on movies_with_budget (budget_adj > 0), not on the full dataset plotted above
print("The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (movies with budget_adj > 0) is:", round(movies_with_budget["budget_adj"].corr(movies_with_budget["revenue_adj"]), 4))
print();
The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (movies with budget_adj > 0) is: 0.5919
The scatter plot here suggests that a movie's revenue tends to rise with its budget, which is consistent with the moderate positive correlation between the two variables. Note that the coefficient was computed on the movies with a non-zero budget rather than on the entire dataset.
# visualize budget_adj vs revenue_adj for only movies with budgets > 0
scatter_plotter(movies_with_budget,
"budget_adj", "revenue_adj",
400000000, 400000000,
"#800080",
"Relationship between a movie's budget (adjusted) and revenue (adjusted) for subset of dataset");
print("The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (for subset) is:", round(movies_with_budget["budget_adj"].corr(movies_with_budget["revenue_adj"]), 4))
print();
The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (for subset) is: 0.5919
The scatter plot here also suggests that a movie's revenue tends to rise with its budget, which is consistent with the moderate positive correlation between the two variables.
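For readers unfamiliar with the coefficient reported above: `Series.corr` defaults to the Pearson correlation, which is the sum of the products of the two variables' deviations from their means, divided by the square root of the product of their summed squared deviations. The sketch below computes it by hand on made-up numbers and compares the result with pandas; the function name `pearson_r` is an illustrative assumption.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y) -> float:
    """Pearson correlation computed directly from deviations from the mean."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return float((x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum()))


# Made-up values standing in for adjusted budget and adjusted revenue
toy = pd.DataFrame({
    "budget_adj": [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue_adj": [2.0, 1.0, 4.0, 3.0, 6.0],
})

manual = pearson_r(toy["budget_adj"], toy["revenue_adj"])
builtin = toy["budget_adj"].corr(toy["revenue_adj"])  # pandas default is Pearson

print(round(manual, 4), round(builtin, 4))
```

The coefficient ranges from -1 to 1 and only measures linear association, which is worth keeping in mind given how skewed the budget and revenue columns are.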
Does a movie's popularity have any effect on its revenue?
I would like to first visualize the distribution of a movie's popularity using a box plot.
# box plot of popularity to determine outliers
box_plotter(tmdb_movies_data, "popularity", "Distribution of movie popularity", "red");
The box plot shows that movie popularity scores are mostly clustered at or below about 1, with outliers extending above 30.
# visualize popularity vs revenue
scatter_plotter(tmdb_movies_data,
"popularity", "revenue_adj",
40, 1000000000,
"#800080",
"Relationship between a movie's popularity and revenue (adjusted)");
print("The coefficient of correlation between a movie's popularity and its revenue is:", round(tmdb_movies_data["popularity"].corr(tmdb_movies_data["revenue_adj"]), 4))
print();
The coefficient of correlation between a movie's popularity and its revenue is: 0.6084
The above plot shows that, up to a point, there is a positive correlation between a movie's popularity and the revenue it generates.
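Because popularity has a few extreme outliers, a rank-based (Spearman) correlation can be a useful cross-check on the Pearson value: it only asks whether revenue rises as popularity rises, not whether the rise is linear. The sketch below computes it as the Pearson correlation of the ranks, using made-up numbers; the helper name `spearman_r` is an assumption, and this check was not part of the original analysis.

```python
import pandas as pd


def spearman_r(x: pd.Series, y: pd.Series) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(x.rank().corr(y.rank()))


# Made-up data: revenue always rises with popularity, but not linearly,
# and one popularity value is an extreme outlier
popularity = pd.Series([0.1, 0.5, 1.0, 2.0, 30.0])
revenue = pd.Series([1.0, 4.0, 9.0, 16.0, 17.0])

pearson = popularity.corr(revenue)
spearman = spearman_r(popularity, revenue)

print("pearson:", round(pearson, 4))
print("spearman:", round(spearman, 4))
```

In this toy case the Spearman value is 1 because the ordering is perfectly preserved, while the Pearson value is pulled down by the outlier.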
Do movies with higher voting count generally have a higher revenue?
# Do movies with higher voting count generally show a higher revenue?
scatter_plotter(tmdb_movies_data,
"revenue_adj", "vote_count",
1000000000, 10000,
"#800080",
"Relationship between vote count and revenue (adjusted)");
print("The coefficient of correlation between a movie's adjusted revenue and its vote_count is:", round(tmdb_movies_data["vote_count"].corr(tmdb_movies_data["revenue_adj"]), 4))
print();
The coefficient of correlation between a movie's adjusted revenue and its vote_count is: 0.7075
There seems to be a strong positive correlation between the number of votes a movie received and its revenue, which suggests that movies watched by more people tend both to earn more and to attract more votes.
Which genres are associated with the highest revenues on average?
I am interested in getting a summary of the genres using a table before any further exploration.
# get a general idea of the genres
tmdb_movies_data["genre_1"].describe().to_frame()
genre_1 | |
---|---|
count | 10731 |
unique | 20 |
top | Drama |
freq | 2443 |
There are 20 genres in our dataset and the "Drama" genre appears the most with 2443 entries, which is roughly one out of every 5 movies.
# List the unique genres in dataset
tmdb_movies_data["genre_1"].unique()
array(['Action', 'Adventure', 'Western', 'Science Fiction', 'Drama', 'Family', 'Comedy', 'Crime', 'Romance', 'War', 'Mystery', 'Thriller', 'Fantasy', 'History', 'Animation', 'Horror', 'Music', 'Documentary', 'TV Movie', 'Foreign'], dtype=object)
# Which genres have the highest revenue on average?
avg_rev_by_genre = (tmdb_movies_data.groupby("genre_1")["revenue_adj"].mean()).sort_values()
avg_rev_by_genre.tail(5)
genre_1
Family             7.833664e+07
Animation          8.244244e+07
Fantasy            8.314328e+07
Science Fiction    1.009330e+08
Adventure          1.668203e+08
Name: revenue_adj, dtype: float64
# Which genres have the lowest revenue on average?
avg_rev_by_genre = (tmdb_movies_data.groupby("genre_1")["revenue_adj"].mean()).sort_values()
avg_rev_by_genre.head(5)
genre_1
Foreign        0.000000e+00
TV Movie       7.890419e+05
Documentary    2.311728e+06
Horror         2.420624e+07
Mystery        3.002289e+07
Name: revenue_adj, dtype: float64
# Visualize revenues by genre
avg_rev_by_genre_df = tmdb_movies_data.groupby("genre_1")["revenue_adj"].mean().to_frame()
avg_rev_by_genre_df.style.format("{0:,.2f}")
revenue_adj | |
---|---|
genre_1 | |
Action | 74,149,970.20 |
Adventure | 166,820,282.41 |
Animation | 82,442,435.58 |
Comedy | 38,477,209.78 |
Crime | 46,866,881.47 |
Documentary | 2,311,727.80 |
Drama | 35,928,149.06 |
Family | 78,336,643.31 |
Fantasy | 83,143,277.58 |
Foreign | 0.00 |
History | 65,361,945.75 |
Horror | 24,206,242.27 |
Music | 39,665,699.98 |
Mystery | 30,022,894.06 |
Romance | 47,470,361.23 |
Science Fiction | 100,933,045.30 |
TV Movie | 789,041.93 |
Thriller | 30,969,236.99 |
War | 49,583,500.45 |
Western | 47,307,389.85 |
The above table shows us the average revenue by movie genre, but to make this more relatable and understandable, it will be a good idea to visualize using bar charts.
# plot bar graph of the top revenue by genre
fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_genre.tail(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Highest revenues by genre");
The highest-earning movie genres in terms of average revenue are Adventure, Science Fiction, and Fantasy.
Next, we will plot the graph for the lowest revenues.
# plot graph of lowest average revenue by genre
fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_genre.head(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Lowest revenues by genre");
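The lowest average revenues come from the Foreign, TV Movie, and Documentary genres. The average of zero for Foreign most likely reflects revenue values that were not provided rather than films that earned nothing.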
Which production companies have produced movies with the highest revenues on average?
I am interested in getting a summary of the production companies column and will show it in a table.
# get a general idea of the number of production companies
tmdb_movies_data["prod_com_1"].describe().to_frame()
prod_com_1 | |
---|---|
count | 10731 |
unique | 3030 |
top | Not Available |
freq | 959 |
We see here that there are over 3000 unique production companies and that the "Not Available" category appears over 900 times.
# List the first few unique production companies in dataset
tmdb_movies_data["prod_com_1"].unique()[:20]
array(['Universal Studios', 'Village Roadshow Pictures', 'Summit Entertainment', 'Lucasfilm', 'Universal Pictures', 'Regency Enterprises', 'Paramount Pictures', 'Twentieth Century Fox Film Corporation', 'Walt Disney Pictures', 'Columbia Pictures', 'DNA Films', 'Marvel Studios', 'Double Feature Films', 'Studio Babelsberg', 'Escape Artists', 'New Line Cinema', 'Focus Features', 'Participant Media', 'Gotham Group', 'BBC Films'], dtype=object)
# Which production companies have the highest revenue on average?
# Let us check the top companies using a table and then a bar graph to make it more vivid.
avg_rev_by_comp = (tmdb_movies_data.groupby("prod_com_1")["revenue_adj"].mean()).sort_values()
avg_rev_by_comp.tail(10)
prod_com_1 Bookshop Productions 4.763508e+08 Horizon Pictures (II) 5.045914e+08 WingNut Films 5.167232e+08 Eon Productions 5.583549e+08 Barwood Films 6.169034e+08 Lucasfilm 7.179706e+08 1492 Pictures 8.351229e+08 Cool Music 9.866889e+08 Patalex IV Productions Limited 1.000353e+09 Robert Wise Productions 1.129535e+09 Name: revenue_adj, dtype: float64
# plot bar graph of top average revenue by production company
fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_comp.tail(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Highest revenues by production companies");
Our plot shows that Robert Wise Productions, Patalex IV Productions Limited, Cool Music, 1492 Pictures, and Lucasfilm are the top companies with the highest average revenues.
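A ranking by mean revenue can be dominated by companies that appear only once or twice, since a single very successful film then sets the whole average. One way to check for this is to compute the film count alongside the mean and keep only companies above a minimum count. The sketch below does so on made-up data; the helper name `mean_with_min_count` and the threshold are assumptions, and this filter was not applied in the analysis above.

```python
import pandas as pd


def mean_with_min_count(df: pd.DataFrame, group_col: str, value_col: str, min_count: int) -> pd.DataFrame:
    """Mean and count of value_col per group, keeping groups with at least min_count rows."""
    stats = df.groupby(group_col)[value_col].agg(["mean", "count"])
    return stats[stats["count"] >= min_count].sort_values("mean", ascending=False)


# Made-up data: Studio A has one big hit, the others have several films
toy = pd.DataFrame({
    "prod_com_1": ["Studio A", "Studio B", "Studio B", "Studio B", "Studio C", "Studio C"],
    "revenue_adj": [900.0, 100.0, 200.0, 300.0, 50.0, 150.0],
})

ranked = mean_with_min_count(toy, "prod_com_1", "revenue_adj", min_count=2)
print(ranked)
```

With the threshold set to 2, Studio A drops out of the ranking even though its single film has the highest revenue.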
# Which production companies have the lowest revenue on average?
# Let us visualize with a series table and then plot a bar graph to make it more vivid.
avg_rev_by_comp = (tmdb_movies_data.groupby("prod_com_1")["revenue_adj"].mean()).sort_values()
avg_rev_by_comp.head(10)
prod_com_1 Kinowelt Filmproduktion 0.0 Midnight Road Entertainment 0.0 Microwave Film 0.0 Micott & Basara K.K. 0.0 Michael Mailer Films 0.0 Metrodome Distribution 0.0 Metro-Goldwyn-Mayer 0.0 Meteor Film GmbH 0.0 Messick Films 0.0 Merlin Productions 0.0 Name: revenue_adj, dtype: float64
# plot graph of lowest revenue by production company
fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_comp.head(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Lowest revenues by production companies");
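According to the data, Kinowelt Filmproduktion, Midnight Road Entertainment, and Micott & Basara K.K. are among the companies with the lowest average revenues. Their averages are zero because no revenue values were provided for their movies, not necessarily because the movies earned nothing.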
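If zeros in revenue_adj stand for missing values, as appears to be the case here, they pull every group average down. A hedged alternative is to convert zeros to NaN first, since pandas skips NaN when averaging. The sketch below contrasts the two approaches on made-up data; treating every zero as missing is an assumption about the dataset, not something the source confirms.

```python
import numpy as np
import pandas as pd

# Made-up data: zeros stand in for revenues that were not provided
toy = pd.DataFrame({
    "prod_com_1": ["Studio A", "Studio A", "Studio B", "Studio B"],
    "revenue_adj": [0.0, 200.0, 0.0, 0.0],
})

# Average with zeros counted as real values (what the analysis above does)
mean_raw = toy.groupby("prod_com_1")["revenue_adj"].mean()

# Average after treating zeros as missing; NaN values are ignored by mean()
revenue_known = toy["revenue_adj"].replace(0, np.nan)
mean_known = revenue_known.groupby(toy["prod_com_1"]).mean()

print(mean_raw)
print(mean_known)
```

A company with no known revenue at all then shows NaN instead of zero, which separates "no data" from "lowest revenue".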
Do movies with higher budgets generally show a higher voting count? Let us visualize the relationship between votes and budget to get an idea.
# Do movies with higher budgets generally have a higher voting count?
scatter_plotter(tmdb_movies_data,
"vote_count", "budget_adj",
10000, 400000000,
"#800080",
"Relationship between vote count and budget (adjusted)");
print("The coefficient of correlation between a movie's adjusted budget and its vote_count is:", round(tmdb_movies_data["vote_count"].corr(tmdb_movies_data["budget_adj"]), 4))
print();
The coefficient of correlation between a movie's adjusted budget and its vote_count is: 0.5863
We can see from this plot that there is a moderate positive correlation between a movie's budget and the number of votes it received.
From our analysis, some of the trends we see are as follows:
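A movie's revenue is moderately associated with its popularity score, given the moderate positive correlation (about 0.61) between the two variables.
A movie's budget is correlated with its revenue to a similar degree (about 0.59, computed on movies with a reported budget), so the data do not show budget to be a stronger determinant of revenue than popularity.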
Movies with higher vote counts tend to have higher revenues, given the strong positive correlation. This does not in any way imply that high revenues are caused by more vote counts.
Adventure, Science Fiction, and Fantasy are the top three movie genres by average revenue. In contrast, the movie genres with the lowest average revenue are Foreign, TV Movie, and Documentary.
Budget and vote count are moderately positively correlated (about 0.59).
Robert Wise Productions, Patalex IV Productions Limited, Cool Music, 1492 Pictures, and Lucasfilm are the top companies with the highest average revenues.
The above conclusions are based on my analysis and there is a lot more that can be generated from exploring the dataset further.
The most visible limitation to this dataset (in my opinion) is the unavailability of budget values for more than half of the dataset. This translates to an incomplete picture and makes the dataset only truly representative of about half of the movies.
Another limitation is that some of the columns used, namely genres and production companies, hold multiple values separated by pipes (|). Since I only worked with the first value in each of these columns, it is possible that some vital information has been lost in the process.
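One possible way around the second limitation is to keep every genre by turning each pipe-separated value into its own row with `str.split` followed by `DataFrame.explode`, so that a movie counts toward each of its genres. The sketch below shows the idea on made-up rows; it would need to run on the original genres column before that column is dropped, and it is an alternative that was not used in this analysis.

```python
import pandas as pd

# Made-up rows in the same shape as the raw dataset
toy = pd.DataFrame({
    "original_title": ["Film X", "Film Y"],
    "genres": ["Action|Adventure", "Adventure|Comedy"],
    "revenue_adj": [100.0, 300.0],
})

# Split on the pipe into lists, then give each list element its own row
long_form = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")

# Each movie now contributes to the average of every genre it belongs to
avg_by_genre = long_form.groupby("genre")["revenue_adj"].mean()

print(long_form[["original_title", "genre", "revenue_adj"]])
print(avg_by_genre)
```

The trade-off is that a movie's revenue is counted once per genre, so the per-genre averages are no longer based on disjoint sets of movies.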