Project: TMDB Movie Data

Table of Contents

  • Introduction

  • Data Wrangling

  • Exploratory Data Analysis (EDA)

  • Report and Conclusion

  • Limitations

  • Resources

Introduction

This dataset contains information about more than 10,000 movies collected from The Movie Database (TMDB), including user ratings and revenue. The dataset was downloaded from here.

For my exploration, I'm interested in finding out which movie genres have the highest revenues, and whether vote counts and budget are related to a movie's revenue. I am also interested in which production companies have the highest-grossing movies and how a movie's revenue relates to its popularity.

Data Wrangling

Load Data

In [1]:
# import libraries needed for our investigation

import pandas as pd # data wrangling
import numpy as np # mathematical calculations
import matplotlib.pyplot as plt # visualizations
# import seaborn as sns # visualizations
%matplotlib inline

General Properties

In [2]:
# load and summarize dataset
tmdb_movies_data = pd.read_csv("Database_TMDb_movie_data.csv")

tmdb_movies_data.shape
Out[2]:
(10866, 21)

Our dataset contains 10866 observations (rows) and 21 columns.

In [3]:
# inspect dataset (1) - show the top 2 rows

tmdb_movies_data.head(2)
Out[3]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08

2 rows × 21 columns

In [4]:
# inspect dataset (2) - show the last 2 rows

tmdb_movies_data.tail(2)
Out[4]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
10864 21449 tt0061177 0.064317 0 0 What's Up, Tiger Lily? Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh... NaN Woody Allen WOODY ALLEN STRIKES BACK! ... In comic Woody Allen's film debut, he took the... 80 Action|Comedy Benedict Pictures Corp. 11/2/66 22 5.4 1966 0.000000 0.0
10865 22293 tt0060666 0.035919 19000 0 Manos: The Hands of Fate Harold P. Warren|Tom Neyman|John Reynolds|Dian... NaN Harold P. Warren It's Shocking! It's Beyond Your Imagination! ... A family gets lost on the road and stumbles up... 74 Horror Norm-Iris 11/15/66 15 1.5 1966 127642.279154 0.0

2 rows × 21 columns

In [5]:
# Get a general idea of every column (series) in the dataset

tmdb_movies_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
In [6]:
# Show number of missing values

tmdb_movies_data.isnull().sum()
Out[6]:
id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

The homepage, tagline, and keywords columns have a high number of missing values (7930, 2824, and 1493 respectively) and need to be handled in some way. The overview column is missing only 4 values, but like the other three it holds free text that is not needed for this analysis, so I will drop all four columns as part of the cleaning process.

I will also be dropping the id and imdb_id columns as they are simply unique identifiers and do not have any bearing on the variable of interest (revenue).
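To compare columns on the same scale, it can help to look at the share of missing values rather than the raw counts. The snippet below is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the TMDB file); `isnull().mean()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "homepage": [np.nan, "http://example.com", np.nan, np.nan],
    "overview": ["a", "b", "c", np.nan],
    "revenue": [0, 10, 20, 30],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real dataframe, the same expression would rank the columns by how incomplete they are.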

Data Cleaning

In [7]:
# The genres column needs to be expanded into separate columns to see the individual genre values

tmdb_movies_data["genres"].str.split("|", expand=True).head(5)
Out[7]:
0 1 2 3 4
0 Action Adventure Science Fiction Thriller None
1 Action Adventure Science Fiction Thriller None
2 Adventure Science Fiction Thriller None None
3 Action Adventure Science Fiction Fantasy None
4 Action Crime Thriller None None
In [8]:
# The production_companies column also needs to be expanded into separate columns to see the individual company values

tmdb_movies_data["production_companies"].str.split("|", expand=True).head(5)
Out[8]:
0 1 2 3 4
0 Universal Studios Amblin Entertainment Legendary Pictures Fuji Television Network Dentsu
1 Village Roadshow Pictures Kennedy Miller Productions None None None
2 Summit Entertainment Mandeville Films Red Wagon Entertainment NeoReel None
3 Lucasfilm Truenorth Productions Bad Robot None None
4 Universal Pictures Original Film Media Rights Capital Dentsu One Race Films

We can see that there are at most five genres and five production companies per movie. I will expand the dataframe to include these as separate columns for later analysis, and then drop the original genres and production_companies columns as we will no longer need them.
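Assigning the split result to a fixed list of five column names only works when the split produces exactly five columns. As a hedged alternative, the sketch below names the columns from however many parts the split actually yields. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with a pipe-separated column, including a missing entry.
toy = pd.DataFrame({"genres": ["Action|Adventure|Thriller", "Drama", None]})

# Split on the pipe; each part gets its own column, padded with missing values.
parts = toy["genres"].str.split("|", expand=True)

# Name the columns genre_1, genre_2, ... based on how many the split produced.
parts.columns = [f"genre_{i + 1}" for i in range(parts.shape[1])]

expanded = toy.join(parts)
print(expanded)
```

Rows with fewer parts, or with no genre at all, end up with missing values in the trailing columns, which is what the null counts further down reflect.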

In [9]:
tmdb_movies_data[["genre_1", "genre_2", "genre_3", "genre_4", "genre_5"]] = tmdb_movies_data["genres"].str.split("|", expand=True)

tmdb_movies_data[["prod_com_1", "prod_com_2", "prod_com_3", "prod_com_4", "prod_com_5"]] = tmdb_movies_data["production_companies"].str.split("|", expand=True)

tmdb_movies_data.head(2)
Out[9]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... genre_1 genre_2 genre_3 genre_4 genre_5 prod_com_1 prod_com_2 prod_com_3 prod_com_4 prod_com_5
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Action Adventure Science Fiction Thriller None Universal Studios Amblin Entertainment Legendary Pictures Fuji Television Network Dentsu
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... Action Adventure Science Fiction Thriller None Village Roadshow Pictures Kennedy Miller Productions None None None

2 rows × 31 columns

In [10]:
# Drop columns that are not needed for analysis

tmdb_movies_data.drop(columns=[
    "homepage",
    "tagline",
    "keywords",
    "overview",
    "id",
    "imdb_id",
    "genres",
    "production_companies",
    ],
    inplace=True
    )
In [11]:
# Check for number of missing values after column drop

tmdb_movies_data.isnull().sum()
Out[11]:
popularity            0
budget                0
revenue               0
original_title        0
cast                 76
director             44
runtime               0
release_date          0
vote_count            0
vote_average          0
release_year          0
budget_adj            0
revenue_adj           0
genre_1              23
genre_2            2351
genre_3            5787
genre_4            8885
genre_5           10324
prod_com_1         1030
prod_com_2         4470
prod_com_3         7050
prod_com_4         8813
prod_com_5         9740
dtype: int64

The number of missing values in the cast, director, and genre_1 columns is very small, so those rows can be dropped.

For the expanded genre and production company columns, I will also drop the last four columns (_2, _3, _4, _5) of each, as the first column contains enough non-null values to help us reach a representative conclusion.

I will then fill the missing values in the prod_com_1 column with "Not Available", as there are a little over 1000 such observations and I don't want to drop them.
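The order of the two steps matters: filling first protects the production company column, but a filled row can still be dropped if another column is missing. The following is a small sketch on made-up data that mirrors the plan above; it uses assignment instead of `inplace=True`, which is a stylistic choice on my part rather than a requirement.

```python
import numpy as np
import pandas as pd

# Toy frame: values are invented purely to illustrate the two steps.
toy = pd.DataFrame({
    "director": ["A", np.nan, "C", "D"],
    "genre_1": ["Drama", "Action", np.nan, "Horror"],
    "prod_com_1": ["X", np.nan, "Y", np.nan],
})

# Fill the column we want to keep, using assignment rather than inplace=True.
toy["prod_com_1"] = toy["prod_com_1"].fillna("Not Available")

# Drop rows that still have a missing value in any remaining column.
cleaned = toy.dropna()
print(cleaned)
```

In this toy example one of the two filled rows is dropped anyway because its director is missing, so the number of "Not Available" entries after cleaning can be lower than the number of values filled.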

In [12]:
# Drop other columns that are not needed for analysis

tmdb_movies_data.drop(columns=[
    "genre_2",
    "genre_3",
    "genre_4",
    "genre_5",
    "prod_com_2",
    "prod_com_3",
    "prod_com_4",
    "prod_com_5",
    ],
    inplace=True
    )
In [13]:
# fill null values in the prod_com_1 column, then drop rows with null values in the columns that have few missing values

tmdb_movies_data["prod_com_1"] = tmdb_movies_data["prod_com_1"].fillna("Not Available")

tmdb_movies_data.dropna(inplace=True)
In [14]:
tmdb_movies_data.info();
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10732 entries, 0 to 10865
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   popularity      10732 non-null  float64
 1   budget          10732 non-null  int64  
 2   revenue         10732 non-null  int64  
 3   original_title  10732 non-null  object 
 4   cast            10732 non-null  object 
 5   director        10732 non-null  object 
 6   runtime         10732 non-null  int64  
 7   release_date    10732 non-null  object 
 8   vote_count      10732 non-null  int64  
 9   vote_average    10732 non-null  float64
 10  release_year    10732 non-null  int64  
 11  budget_adj      10732 non-null  float64
 12  revenue_adj     10732 non-null  float64
 13  genre_1         10732 non-null  object 
 14  prod_com_1      10732 non-null  object 
dtypes: float64(4), int64(5), object(6)
memory usage: 1.3+ MB
In [15]:
# check dataset for any duplicates

tmdb_movies_data.duplicated().sum()
Out[15]:
1
In [16]:
# drop duplicated data

tmdb_movies_data.drop_duplicates(inplace=True)
In [17]:
tmdb_movies_data.head()
Out[17]:
popularity budget revenue original_title cast director runtime release_date vote_count vote_average release_year budget_adj revenue_adj genre_1 prod_com_1
0 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09 Action Universal Studios
1 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller 120 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08 Action Village Roadshow Pictures
2 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... Robert Schwentke 119 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08 Adventure Summit Entertainment
3 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... J.J. Abrams 136 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09 Action Lucasfilm
4 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... James Wan 137 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09 Action Universal Pictures
In [18]:
tmdb_movies_data.info();
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10731 entries, 0 to 10865
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   popularity      10731 non-null  float64
 1   budget          10731 non-null  int64  
 2   revenue         10731 non-null  int64  
 3   original_title  10731 non-null  object 
 4   cast            10731 non-null  object 
 5   director        10731 non-null  object 
 6   runtime         10731 non-null  int64  
 7   release_date    10731 non-null  object 
 8   vote_count      10731 non-null  int64  
 9   vote_average    10731 non-null  float64
 10  release_year    10731 non-null  int64  
 11  budget_adj      10731 non-null  float64
 12  revenue_adj     10731 non-null  float64
 13  genre_1         10731 non-null  object 
 14  prod_com_1      10731 non-null  object 
dtypes: float64(4), int64(5), object(6)
memory usage: 1.3+ MB

We now have 10731 rows and 15 columns in our dataset after cleaning and can move on to exploration.

Exploratory Data Analysis

In [19]:
# describe numerical variables in the tmdb_movies_dataset

(tmdb_movies_data.describe()).style.format("{0:,.0f}")
Out[19]:
  popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10,731 10,731 10,731 10,731 10,731 10,731 10,731 10,731 10,731
mean 1 14,803,646 40,319,888 102 220 6 2,001 17,765,303 52,006,229
std 1 31,064,556 117,652,421 30 579 1 13 34,466,302 145,425,154
min 0 0 0 0 10 2 1,960 0 0
25% 0 0 0 90 17 5 1,995 0 0
50% 0 0 0 99 39 6 2,006 0 0
75% 1 16,000,000 25,000,000 112 148 7 2,011 21,108,852 34,705,457
max 33 425,000,000 2,781,505,847 900 9,767 9 2,015 425,000,000 2,827,123,750

The above gives a summary description of the numerical variables in our dataset, including popularity, budget, revenue, budget_adj, and revenue_adj. We can already tell that the budgets and revenues are heavily skewed: their medians are 0, while their maximum values run into the hundreds of millions or billions. The next step is to visualize these variables to give a better idea of the skews.

In [20]:
# Plot the distribution of all numerical variables

tmdb_movies_data.hist(figsize=(15, 15));

The budget_adj, revenue_adj, popularity, runtime, and vote_count variables are all right-skewed. As a matter of fact, only about half of the movies have a budget recorded; the rest show a budget of zero. To make any valid inferences, it is better to analyse the adjusted budgets for only the subset of movies with adjusted budgets higher than zero.

The release_year is left-skewed, which shows that more movies were produced as the years progressed. Only the vote_average seems close to a normal distribution.

Budgets vs Revenue

Is there a direct relationship between a movie's budget and its revenue?

In [21]:
movies_with_budget = tmdb_movies_data[tmdb_movies_data["budget_adj"] > 0]

movies_with_budget.info();
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5153 entries, 0 to 10865
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   popularity      5153 non-null   float64
 1   budget          5153 non-null   int64  
 2   revenue         5153 non-null   int64  
 3   original_title  5153 non-null   object 
 4   cast            5153 non-null   object 
 5   director        5153 non-null   object 
 6   runtime         5153 non-null   int64  
 7   release_date    5153 non-null   object 
 8   vote_count      5153 non-null   int64  
 9   vote_average    5153 non-null   float64
 10  release_year    5153 non-null   int64  
 11  budget_adj      5153 non-null   float64
 12  revenue_adj     5153 non-null   float64
 13  genre_1         5153 non-null   object 
 14  prod_com_1      5153 non-null   object 
dtypes: float64(4), int64(5), object(6)
memory usage: 644.1+ KB
In [22]:
(movies_with_budget.describe()).style.format("{0:,.0f}")
Out[22]:
  popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 5,153 5,153 5,153 5,153 5,153 5,153 5,153 5,153 5,153
mean 1 30,828,241 80,531,576 107 410 6 2,001 36,995,822 102,504,045
std 1 38,931,994 159,674,751 23 789 1 12 41,982,022 196,144,368
min 0 1 0 0 10 2 1,960 1 0
25% 0 6,000,000 0 93 36 6 1,996 8,142,944 0
50% 1 17,500,000 21,126,225 103 123 6 2,005 22,878,673 29,019,220
75% 1 40,000,000 90,000,000 117 403 7 2,010 50,245,348 113,749,780
max 33 425,000,000 2,781,505,847 540 9,767 8 2,015 425,000,000 2,827,123,750

Since we are interested in the budget for our analysis, it would be good to get an idea of how it is distributed as a single variable. A histogram is a good way to visualize the distribution of a variable.

In [23]:
# plot histogram of adjusted budget

movies_with_budget["budget_adj"].hist();

The above chart shows that the adjusted budget is right-skewed: most movies have relatively small budgets, while a long tail of high-budget movies pulls the mean (about 37 million) above the median (about 23 million).
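Right skew can also be checked numerically. The sketch below uses a made-up series (not the TMDB budgets) to show the two usual signs: a mean above the median and a positive value from `Series.skew()`.

```python
import pandas as pd

# Made-up budgets: many small values and one very large one.
budgets = pd.Series([1, 2, 2, 3, 3, 4, 5, 50], dtype="float64")

print("mean:  ", budgets.mean())
print("median:", budgets.median())
print("skew:  ", budgets.skew())
```

Running the same three calls on `movies_with_budget["budget_adj"]` would give a quick numeric companion to the histogram.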

Another good way to visualize the spread of a variable is a box plot, which I will use to see the outliers in my dataset.

As I will be plotting more than one box plot, I will create a function box_plotter() that takes in the dataset, column to plot, title of plot, and color as parameters and plots using these parameters.

In [24]:
# Function to plot box plots given the dataframe, column title, plot title, and color

def box_plotter(df, col, title, color):
    """
    this function plots a horizontal box plot of the column col (string)
    of df (DataFrame), using the title (string) and color (string) supplied as parameters.
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    plot = df[col].plot(kind="box", vert=False, title=title, color=color, ax=ax)
    return plot
In [25]:
# box plot of budget_adj

box_plotter(tmdb_movies_data, "budget_adj", "Distribution of adjusted movie budget for our entire dataset", "green");

The box plot above shows that many of the movie budgets were not provided: both the lower quartile and the median are zero, so the box is squeezed against the left edge of the plot.

In [26]:
# box plot of budget_adj for dataset subset with budgets higher than zero

box_plotter(movies_with_budget, "budget_adj", "Distribution of adjusted movie budget for movies with budgets greater than zero", "red");

Our box plot of the subset shows that most movies have adjusted budgets from 1 to about 280,000,000.
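The points a box plot draws as outliers follow a simple rule: with matplotlib's default settings, anything more than 1.5 times the interquartile range beyond the quartiles is plotted individually. The sketch below computes those fences by hand on made-up values; the variable names are my own.

```python
import pandas as pd

# Made-up values with one obvious extreme point.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype="float64")

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the default box plot rule).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones drawn as individual points.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence)
print(outliers)
```

The same calculation on `budget_adj` would give the exact value beyond which budgets are shown as outliers in the plots above.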

Next, I will be plotting a series of scatter plots to visualize the relationship between some variables and the revenue. Since there will be more than one plot, I will be creating a function scatter_plotter() that takes in parameters specific to each plot and visualizes the plot.

In [27]:
# Define function scatter_plotter

def scatter_plotter(df, x, y, ax_x, ax_y, color, title):
    """
    this function plots a scatter plot with the df (DataFrame),
    x (x-axis column), y (y-axis column), ax_x and ax_y (end point of the
    red dashed reference line drawn from the origin),
    color (string), and title (string) supplied as parameters.
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot([0, ax_x], [0, ax_y], linestyle="--", color="red")
    plot = df.plot(x, y, kind="scatter", color=color, title=title, ax=ax, alpha=0.9)
    return plot
In [28]:
# visualize budget_adj vs revenue_adj for entire dataset

scatter_plotter(tmdb_movies_data,
                "budget_adj", "revenue_adj",
                400000000, 400000000,
                "#800080",
                "Relationship between a movie's budget (adjusted) and revenue (adjusted) for entire dataset");
# NOTE: the correlation below is computed on movies_with_budget (budget_adj > 0), not on the entire dataset
print("The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (for movies with budgets > 0) is:", round(movies_with_budget["budget_adj"].corr(movies_with_budget["revenue_adj"]), 4))
print();
The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (for movies with budgets > 0) is: 0.5919

The scatter plot here suggests that a movie's revenue tends to rise with its budget, up to a point. Note that the coefficient printed above was computed on the subset of movies with budgets greater than zero; it shows a moderate positive correlation between the two variables.

In [29]:
# visualize budget_adj vs revenue_adj for only movies with budgets > 0

scatter_plotter(movies_with_budget,
                "budget_adj", "revenue_adj",
                400000000, 400000000,
                "#800080",
                "Relationship between a movie's budget (adjusted) and revenue (adjusted) for subset of dataset");

print("The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (for subset) is:", round(movies_with_budget["budget_adj"].corr(movies_with_budget["revenue_adj"]), 4))
print();
The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (for subset) is: 0.5919

The scatter plot here also suggests that a movie's revenue tends to rise with its budget, up to a point, which is supported by the moderate positive correlation (0.5919) between the two variables.
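Because rows with a zero budget act as placeholder values, the correlation depends on which rows are included. The sketch below, on made-up numbers, computes the Pearson coefficient from its definition, checks it against `Series.corr`, and shows that the full frame and the non-zero subset give different values. The helper `pearson` is a hypothetical name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: three rows have a zero (unrecorded) budget.
toy = pd.DataFrame({
    "budget_adj":  [0.0, 0.0, 0.0, 10.0, 20.0, 30.0, 40.0],
    "revenue_adj": [5.0, 50.0, 0.0, 15.0, 60.0, 40.0, 90.0],
})

def pearson(x, y):
    """Pearson correlation computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Same filter as movies_with_budget in the notebook.
subset = toy[toy["budget_adj"] > 0]

r_full = toy["budget_adj"].corr(toy["revenue_adj"])
r_subset = subset["budget_adj"].corr(subset["revenue_adj"])

print("full frame:", round(r_full, 4))
print("subset:    ", round(r_subset, 4))
```

This is why it is worth stating next to each coefficient which set of rows it was computed on.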

Popularity vs Revenue

Does a movie's popularity have any effect on its revenue?

I would like to first visualize the distribution of a movie's popularity using a box plot.

In [30]:
# box plot of popularity to determine outliers

box_plotter(tmdb_movies_data, "popularity", "Distribution of movie popularity", "red");

The box plot shows that movie popularity scores are mostly clustered at or below about 1, with outliers reaching above 30.

In [31]:
# visualize popularity vs revenue

scatter_plotter(tmdb_movies_data,
                "popularity", "revenue_adj",
                40, 1000000000,
                "#800080", 
                "Relationship between a movie's popularity and revenue (adjusted)");
print("The coefficient of correlation between a movie's popularity and its revenue is:", round(tmdb_movies_data["popularity"].corr(tmdb_movies_data["revenue_adj"]), 4))
print();
The coefficient of correlation between a movie's popularity and its revenue is: 0.6084

The above plot shows that, up to a point, a movie's revenue tends to rise with its popularity, consistent with the moderate positive correlation (0.6084) between the two.

Voting Counts vs Revenue

Do movies with higher voting count generally have a higher revenue?

In [32]:
# Do movies with higher voting count generally show a higher revenue?

scatter_plotter(tmdb_movies_data,
                "revenue_adj", "vote_count",
                1000000000, 10000,
                "#800080",
                "Relationship between vote count and revenue (adjusted)");
print("The coefficient of correlation between a movie's adjusted revenue and its vote_count is:", round(tmdb_movies_data["vote_count"].corr(tmdb_movies_data["revenue_adj"]), 4))
print();
The coefficient of correlation between a movie's adjusted revenue and its vote_count is: 0.7075

There seems to be a strong positive correlation (0.7075) between the number of votes a movie received and its revenue. A plausible reading is that movies watched by more people collect both more votes and more revenue, although the correlation alone does not establish this.

Genres vs Revenue

Which genres are associated with the highest revenues on average?

I am interested in getting a summary of the genres using a table before any further exploration.

In [33]:
# get a general idea of the genres

tmdb_movies_data["genre_1"].describe().to_frame()
Out[33]:
genre_1
count 10731
unique 20
top Drama
freq 2443

There are 20 distinct primary genres in our dataset, and "Drama" appears the most with 2443 entries, which is a little over one in every five movies.

In [34]:
# List the unique genres in dataset

tmdb_movies_data["genre_1"].unique()
Out[34]:
array(['Action', 'Adventure', 'Western', 'Science Fiction', 'Drama',
       'Family', 'Comedy', 'Crime', 'Romance', 'War', 'Mystery',
       'Thriller', 'Fantasy', 'History', 'Animation', 'Horror', 'Music',
       'Documentary', 'TV Movie', 'Foreign'], dtype=object)
In [35]:
# Which genres have the highest revenue on average?

avg_rev_by_genre = (tmdb_movies_data.groupby("genre_1")["revenue_adj"].mean()).sort_values()

avg_rev_by_genre.tail(5)
Out[35]:
genre_1
Family             7.833664e+07
Animation          8.244244e+07
Fantasy            8.314328e+07
Science Fiction    1.009330e+08
Adventure          1.668203e+08
Name: revenue_adj, dtype: float64
In [36]:
# Which genres have the lowest revenue on average?

avg_rev_by_genre = (tmdb_movies_data.groupby("genre_1")["revenue_adj"].mean()).sort_values()

avg_rev_by_genre.head(5)
Out[36]:
genre_1
Foreign        0.000000e+00
TV Movie       7.890419e+05
Documentary    2.311728e+06
Horror         2.420624e+07
Mystery        3.002289e+07
Name: revenue_adj, dtype: float64
In [37]:
# Tabulate average revenue by genre

avg_rev_by_genre_df = tmdb_movies_data.groupby("genre_1")["revenue_adj"].mean().to_frame()

avg_rev_by_genre_df.style.format("{0:,.2f}")
Out[37]:
  revenue_adj
genre_1  
Action 74,149,970.20
Adventure 166,820,282.41
Animation 82,442,435.58
Comedy 38,477,209.78
Crime 46,866,881.47
Documentary 2,311,727.80
Drama 35,928,149.06
Family 78,336,643.31
Fantasy 83,143,277.58
Foreign 0.00
History 65,361,945.75
Horror 24,206,242.27
Music 39,665,699.98
Mystery 30,022,894.06
Romance 47,470,361.23
Science Fiction 100,933,045.30
TV Movie 789,041.93
Thriller 30,969,236.99
War 49,583,500.45
Western 47,307,389.85

The above table shows us the average revenue by movie genre, but to make this more relatable and understandable, it will be a good idea to visualize using bar charts.

In [38]:
# plot bar graph of the top revenue by genre

fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_genre.tail(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Highest revenues by genre");

The top movie genres in terms of average adjusted revenue are Adventure, Science Fiction, and Fantasy.

Next, we will plot the graph for the lowest revenues.

In [39]:
# plot graph of lowest average revenue by genre

fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_genre.head(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Lowest revenues by genre");

The lowest average revenues come from the Foreign, TV Movie, and Documentary genres.
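These averages count a revenue of zero as a real value. If zeros are instead treated as "not recorded" (an assumption on my part, not something the dataset documents), the averages change. The sketch below shows the difference on made-up data.

```python
import numpy as np
import pandas as pd

# Made-up data: zero revenue stands in for "not recorded".
toy = pd.DataFrame({
    "genre_1": ["Drama", "Drama", "Drama", "Foreign", "Foreign"],
    "revenue_adj": [100.0, 0.0, 200.0, 0.0, 0.0],
})

# Mean with zeros counted as real revenues (what the cells above do).
mean_with_zeros = toy.groupby("genre_1")["revenue_adj"].mean()

# Mean after treating 0 as missing; mean() skips NaN values.
known = toy.assign(revenue_adj=toy["revenue_adj"].replace(0, np.nan))
mean_known_only = known.groupby("genre_1")["revenue_adj"].mean()

print(mean_with_zeros)
print(mean_known_only)
```

Under this assumption a genre whose revenues are all zero has no defined average at all, rather than an average of zero.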

Production Companies vs Revenue

Which production companies have produced movies with the highest revenues on average?

I am interested in getting a summary of the production companies column and will visualize that with a table.

In [40]:
# get a general idea of the number of production companies

tmdb_movies_data["prod_com_1"].describe().to_frame()
Out[40]:
prod_com_1
count 10731
unique 3030
top Not Available
freq 959

We see here that there are over 3000 unique production companies and that the "Not Available" category appears over 900 times.

In [41]:
# List the first few unique production companies in dataset

tmdb_movies_data["prod_com_1"].unique()[:20]
Out[41]:
array(['Universal Studios', 'Village Roadshow Pictures',
       'Summit Entertainment', 'Lucasfilm', 'Universal Pictures',
       'Regency Enterprises', 'Paramount Pictures',
       'Twentieth Century Fox Film Corporation', 'Walt Disney Pictures',
       'Columbia Pictures', 'DNA Films', 'Marvel Studios',
       'Double Feature Films', 'Studio Babelsberg', 'Escape Artists',
       'New Line Cinema', 'Focus Features', 'Participant Media',
       'Gotham Group', 'BBC Films'], dtype=object)
In [42]:
# Which production companies have the highest revenue on average? 
# Let us check the top companies using a table and then a bar graph to make it more vivid.

avg_rev_by_comp = (tmdb_movies_data.groupby("prod_com_1")["revenue_adj"].mean()).sort_values()

avg_rev_by_comp.tail(10)
Out[42]:
prod_com_1
Bookshop Productions              4.763508e+08
Horizon Pictures (II)             5.045914e+08
WingNut Films                     5.167232e+08
Eon Productions                   5.583549e+08
Barwood Films                     6.169034e+08
Lucasfilm                         7.179706e+08
1492 Pictures                     8.351229e+08
Cool Music                        9.866889e+08
Patalex IV Productions Limited    1.000353e+09
Robert Wise Productions           1.129535e+09
Name: revenue_adj, dtype: float64
In [43]:
# plot bar graph of top average revenue by production company

fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_comp.tail(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Highest revenues by production companies");

Our plot shows that Robert Wise Productions, Patalex IV Productions Limited, Cool Music, 1492 Pictures, and Lucasfilm are the top companies with the highest average revenues.
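An average by company says nothing about how many movies it is based on, and a company with a single hit can top the ranking. One possible follow-up, sketched here on made-up data, is to report the count next to the mean and optionally keep only companies above a minimum number of movies. The threshold `min_movies` and the company names are hypothetical.

```python
import pandas as pd

# Made-up data: one company with a single movie, one with three.
toy = pd.DataFrame({
    "prod_com_1": ["Solo Films", "Big Studio", "Big Studio", "Big Studio"],
    "revenue_adj": [900.0, 300.0, 500.0, 400.0],
})

# Mean revenue and number of movies per company, highest mean first.
summary = (
    toy.groupby("prod_com_1")["revenue_adj"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# Hypothetical threshold: keep companies with at least this many movies.
min_movies = 3
reliable = summary[summary["count"] >= min_movies]

print(summary)
print(reliable)
```

Whether such a filter is appropriate depends on the question being asked; it is shown only as one way to qualify the ranking.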

In [44]:
# Which production companies have the lowest revenue on average? 
# Let us visualize with a series table and then plot a bar graph to make it more vivid.

avg_rev_by_comp = (tmdb_movies_data.groupby("prod_com_1")["revenue_adj"].mean()).sort_values()

avg_rev_by_comp.head(10)
Out[44]:
prod_com_1
Kinowelt Filmproduktion        0.0
Midnight Road Entertainment    0.0
Microwave Film                 0.0
Micott & Basara K.K.           0.0
Michael Mailer Films           0.0
Metrodome Distribution         0.0
Metro-Goldwyn-Mayer            0.0
Meteor Film GmbH               0.0
Messick Films                  0.0
Merlin Productions             0.0
Name: revenue_adj, dtype: float64
In [45]:
# plot graph of lowest revenue by production company

fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_comp.head(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Lowest revenues by production companies");

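According to the data, Kinowelt Filmproduktion, Midnight Road Entertainment, and Micott & Basara K.K. are among the companies with the lowest average revenues. Their averages are zero because no revenue value was provided for their movies, so this ranking reflects missing data rather than actual earnings.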

Voting Counts vs Budgets

Do movies with higher budgets generally show a higher voting count? Let us visualize the relationship between votes and budget to get an idea.

In [46]:
# Do movies with higher budgets generally have a higher voting count?

scatter_plotter(tmdb_movies_data,
                "vote_count", "budget_adj",
                10000, 400000000,
                "#800080",
                "Relationship between vote count and budget (adjusted)");
print("The coefficient of correlation between a movie's adjusted budget and its vote_count is:", round(tmdb_movies_data["vote_count"].corr(tmdb_movies_data["budget_adj"]), 4))
print();
The coefficient of correlation between a movie's adjusted budget and its vote_count is: 0.5863

We can see from this plot that there is a moderate positive correlation (0.5863) between a movie's budget and the number of votes it received.

Report and Conclusion

From our analysis, some of the trends we see are as follows:

  • A movie's revenue is moderately related to its popularity score, given the moderate positive correlation (0.6084) between the two variables.

  • A movie's budget shows a similar but slightly weaker relationship with its revenue (a correlation of 0.5919 for movies with a recorded budget) than its popularity does.

  • Movies with higher vote counts tend to have higher revenues, given the strong positive correlation. This does not in any way imply that high revenues are caused by more vote counts.

  • Adventure, Science Fiction, and Fantasy are the top three movie genres by average revenue. In contrast, the movie genres with the lowest average revenue are Foreign, TV Movie, and Documentary.

  • Budget and vote count are moderately positively correlated (0.5863).

  • Robert Wise Productions, Patalex IV Productions Limited, Cool Music, 1492 Pictures, and Lucasfilm are the top companies with the highest average revenues.

The above conclusions are based on my analysis and there is a lot more that can be generated from exploring the dataset further.

Limitations

  • The most visible limitation of this dataset (in my opinion) is the unavailability of budget values for more than half of the movies. This gives an incomplete picture and makes the budget analysis representative of only about half of the dataset.

  • Another limitation is that some of the columns used, namely genres and production companies, hold multiple values separated by pipes (|). Since I only worked with the first value in each of these columns, it is possible that some vital information has been lost in the process.
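One way to keep every pipe-separated value, rather than only the first, is to split the column into lists and give each value its own row. The sketch below does this on made-up data with `str.split` and `DataFrame.explode`. Note the trade-off: a movie with several genres then contributes its revenue to each of them.

```python
import pandas as pd

# Made-up data with pipe-separated genres.
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Thriller", "Drama"],
    "revenue_adj": [100.0, 40.0],
})

# Split into lists, then give every genre its own row.
long_form = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")

# Average revenue per genre, counting a movie once for each of its genres.
avg_by_genre = long_form.groupby("genres")["revenue_adj"].mean()

print(long_form)
print(avg_by_genre)
```

This is offered as a possible extension of the analysis, not as what the notebook above does.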

Resources

  • Pandas Documentation