Introduction

This dataset contains information on more than 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The dataset was downloaded from here.

For my exploration, I want to find out which movie genres have the highest revenues, and whether vote count and budget are related to a movie's revenue. I am also interested in which production companies have the highest-grossing movies and how a movie's revenue relates to its popularity.


# import libraries needed for our investigation

import pandas as pd # data wrangling
import numpy as np # mathematical calculations
import matplotlib.pyplot as plt # visualizations
# import seaborn as sns # visualizations
%matplotlib inline


# load and summarize dataset
tmdb_movies_data = pd.read_csv("Database_TMDb_movie_data.csv")

tmdb_movies_data.shape

(10866, 21)


# inspect dataset (1) - show the first 2 rows

tmdb_movies_data.head(2)


# inspect dataset (2) - show the last 2 rows

tmdb_movies_data.tail(2)


# Get a general idea of every column (series) in the dataset

tmdb_movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB


# Show number of missing values

tmdb_movies_data.isnull().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64


# The genres column needs to be expanded into separate columns to see the individual genre values

tmdb_movies_data["genres"].str.split("|", expand=True).head(5)


# The production_companies column also needs to be expanded into separate columns to see the individual production companies

tmdb_movies_data["production_companies"].str.split("|", expand=True).head(5)


tmdb_movies_data[["genre_1", "genre_2", "genre_3", "genre_4", "genre_5"]] = tmdb_movies_data["genres"].str.split("|", expand=True)

tmdb_movies_data[["prod_com_1", "prod_com_2", "prod_com_3", "prod_com_4", "prod_com_5"]] = tmdb_movies_data["production_companies"].str.split("|", expand=True)

tmdb_movies_data.head(2)
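
The split above keeps one column per position, and only the first genre is used later in the analysis. As a hedged alternative sketch (not what this notebook does), every listed genre can be counted by turning the pipe-delimited string into one row per genre with `explode`. The `toy` frame, its titles, and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the TMDb frame (hypothetical titles and values)
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "genres": ["Action|Adventure", "Drama", None],
    "revenue_adj": [100.0, 40.0, 10.0],
})

# Split the pipe-delimited string into a list, then give each genre its own row
long_genres = (
    toy.assign(genre=toy["genres"].str.split("|"))
       .explode("genre")
       .dropna(subset=["genre"])  # rows with no genre information drop out here
)

# Average revenue per genre, counting a movie once for every genre it lists
avg_by_any_genre = long_genres.groupby("genre")["revenue_adj"].mean()
print(avg_by_any_genre)
```

With this layout a movie tagged "Action|Adventure" contributes to both genres, so the averages answer a slightly different question than the first-genre-only version.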


# Drop columns that are not needed for analysis

tmdb_movies_data.drop(columns=[
    "homepage",
    "tagline",
    "keywords",
    "overview",
    "id",
    "imdb_id",
    "genres",
    "production_companies",
    ],
    inplace=True
    )


# Check for number of missing values after column drop

tmdb_movies_data.isnull().sum()

popularity            0
budget                0
revenue               0
original_title        0
cast                 76
director             44
runtime               0
release_date          0
vote_count            0
vote_average          0
release_year          0
budget_adj            0
revenue_adj           0
genre_1              23
genre_2            2351
genre_3            5787
genre_4            8885
genre_5           10324
prod_com_1         1030
prod_com_2         4470
prod_com_3         7050
prod_com_4         8813
prod_com_5         9740
dtype: int64


# Drop other columns that are not needed for analysis

tmdb_movies_data.drop(columns=[
    "genre_2",
    "genre_3",
    "genre_4",
    "genre_5",
    "prod_com_2",
    "prod_com_3",
    "prod_com_4",
    "prod_com_5",
    ],
    inplace=True
    )


# fill null values in the prod_com_1 column, then drop the remaining rows with missing values (cast, director, genre_1)

tmdb_movies_data["prod_com_1"] = tmdb_movies_data["prod_com_1"].fillna("Not Available")

tmdb_movies_data.dropna(inplace=True)


tmdb_movies_data.info();

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10732 entries, 0 to 10865
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   popularity      10732 non-null  float64
 1   budget          10732 non-null  int64  
 2   revenue         10732 non-null  int64  
 3   original_title  10732 non-null  object 
 4   cast            10732 non-null  object 
 5   director        10732 non-null  object 
 6   runtime         10732 non-null  int64  
 7   release_date    10732 non-null  object 
 8   vote_count      10732 non-null  int64  
 9   vote_average    10732 non-null  float64
 10  release_year    10732 non-null  int64  
 11  budget_adj      10732 non-null  float64
 12  revenue_adj     10732 non-null  float64
 13  genre_1         10732 non-null  object 
 14  prod_com_1      10732 non-null  object 
dtypes: float64(4), int64(5), object(6)
memory usage: 1.3+ MB


# check dataset for any duplicates

tmdb_movies_data.duplicated().sum()

1


# drop duplicated data

tmdb_movies_data.drop_duplicates(inplace=True)


tmdb_movies_data.head()


tmdb_movies_data.info();

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10731 entries, 0 to 10865
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   popularity      10731 non-null  float64
 1   budget          10731 non-null  int64  
 2   revenue         10731 non-null  int64  
 3   original_title  10731 non-null  object 
 4   cast            10731 non-null  object 
 5   director        10731 non-null  object 
 6   runtime         10731 non-null  int64  
 7   release_date    10731 non-null  object 
 8   vote_count      10731 non-null  int64  
 9   vote_average    10731 non-null  float64
 10  release_year    10731 non-null  int64  
 11  budget_adj      10731 non-null  float64
 12  revenue_adj     10731 non-null  float64
 13  genre_1         10731 non-null  object 
 14  prod_com_1      10731 non-null  object 
dtypes: float64(4), int64(5), object(6)
memory usage: 1.3+ MB


# describe numerical variables in the tmdb_movies_dataset

(tmdb_movies_data.describe()).style.format("{0:,.0f}")


# Plot the distribution of all numerical variables

tmdb_movies_data.hist(figsize=(15, 15));


movies_with_budget = tmdb_movies_data[tmdb_movies_data["budget_adj"] > 0]

movies_with_budget.info();

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5153 entries, 0 to 10865
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   popularity      5153 non-null   float64
 1   budget          5153 non-null   int64  
 2   revenue         5153 non-null   int64  
 3   original_title  5153 non-null   object 
 4   cast            5153 non-null   object 
 5   director        5153 non-null   object 
 6   runtime         5153 non-null   int64  
 7   release_date    5153 non-null   object 
 8   vote_count      5153 non-null   int64  
 9   vote_average    5153 non-null   float64
 10  release_year    5153 non-null   int64  
 11  budget_adj      5153 non-null   float64
 12  revenue_adj     5153 non-null   float64
 13  genre_1         5153 non-null   object 
 14  prod_com_1      5153 non-null   object 
dtypes: float64(4), int64(5), object(6)
memory usage: 644.1+ KB
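
Filtering on `budget_adj > 0` treats a zero budget as "not reported". A related sketch, under the assumption that zeros in the money columns are placeholders rather than true values, is to mark them as missing so pandas skips them automatically. The `toy` frame and its numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in with zero placeholders (hypothetical values)
toy = pd.DataFrame({
    "budget_adj": [0.0, 5e6, 2e7],
    "revenue_adj": [0.0, 1e7, 0.0],
})

money_cols = ["budget_adj", "revenue_adj"]

# Work on a copy so the original frame keeps its zeros
flagged = toy.copy()
flagged[money_cols] = flagged[money_cols].replace(0, np.nan)

# Share of rows where each money column is effectively unknown
share_missing = flagged[money_cols].isna().mean()
print(share_missing)
```

Means and correlations computed on `flagged` ignore the missing entries, which avoids maintaining a separate filtered frame for each column.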


(movies_with_budget.describe()).style.format("{0:,.0f}")


# plot histogram of adjusted budget

movies_with_budget["budget_adj"].hist();


# Function to plot box plots given the dataframe, column title, plot title, and color

def box_plotter(df, col, title, color):
    """
    this function plots a box plot with the df (DataFrame),
    column (Series), color (string) and title (string) supplied as parameters.
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    plot = df[col].plot(kind="box", vert=False, title=title, color=color, ax=ax)
    return plot


# box plot of budget_adj

box_plotter(tmdb_movies_data, "budget_adj", "Distribution of adjusted movie budget for our entire dataset", "green");


# box plot of budget_adj for dataset subset with budgets higher than zero

box_plotter(movies_with_budget, "budget_adj", "Distribution of adjusted movie budget for movies with budgets greater than zero", "red");


# Define function scatter_plotter

def scatter_plotter(df, x, y, ax_x, ax_y, color, title):
    """
    this function plots a scatter plot with the df (DataFrame),
    x(x axis), y(y axis), ax_x(x axis length), ax_y(y axis length),
    color (string), and title (string) supplied as parameters.
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot([0, ax_x], [0, ax_y], linestyle="--", color="red")
    plot = df.plot(x, y, kind="scatter", color=color, title=title, ax=ax, alpha=0.9)
    return plot


# visualize budget_adj vs revenue_adj for entire dataset

scatter_plotter(tmdb_movies_data,
                "budget_adj", "revenue_adj",
                400000000, 400000000,
                "#800080",
                "Relationship between a movie's budget (adjusted) and revenue (adjusted) for entire dataset");
# NOTE: the correlation printed here is computed on movies_with_budget (budget_adj > 0), not on the full tmdb_movies_data frame
print("The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (movies with budget > 0) is:", round(movies_with_budget["budget_adj"].corr(movies_with_budget["revenue_adj"]), 4))
print();

The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (movies with budget > 0) is: 0.5919


# visualize budget_adj vs revenue_adj for only movies with budgets > 0

scatter_plotter(movies_with_budget,
                "budget_adj", "revenue_adj",
                400000000, 400000000,
                "#800080",
                "Relationship between a movie's budget (adjusted) and revenue (adjusted) for subset of dataset");

print("The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (for subset) is:", round(movies_with_budget["budget_adj"].corr(movies_with_budget["revenue_adj"]), 4))
print();

The coefficient of correlation between a movie's adjusted budget and its adjusted revenue (for subset) is: 0.5919
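
`Series.corr` defaults to the Pearson coefficient, which a few very large values can dominate. As a hedged sketch on made-up numbers, a rank-based (Spearman) coefficient can be obtained by correlating the ranks, which shows how much of the linear figure is driven by extreme values. The `toy` frame is hypothetical.

```python
import pandas as pd

# Toy data with one blockbuster-sized outlier (hypothetical values)
toy = pd.DataFrame({
    "budget_adj": [1.0, 2.0, 3.0, 4.0, 100.0],
    "revenue_adj": [2.0, 3.0, 5.0, 4.0, 1000.0],
})

# Pearson: linear association, sensitive to the outlier
pearson = toy["budget_adj"].corr(toy["revenue_adj"])

# Spearman: Pearson correlation of the ranks, so only the ordering matters
spearman = toy["budget_adj"].rank().corr(toy["revenue_adj"].rank())

print(round(pearson, 4), round(spearman, 4))
```

Reporting both coefficients side by side is one way to check whether a relationship holds across the bulk of movies or mostly reflects a handful of extremes.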


# box plot of popularity to determine outliers

box_plotter(tmdb_movies_data, "popularity", "Distribution of movie popularity", "red");
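
The whiskers of a box plot conventionally extend 1.5 times the interquartile range beyond the quartiles, and points outside them are drawn as outliers. A minimal sketch of the same rule in numbers, on a hypothetical `popularity` series, could look like this:

```python
import pandas as pd

# Hypothetical popularity scores with one extreme value
popularity = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 32.9])

# Quartiles and interquartile range
q1, q3 = popularity.quantile([0.25, 0.75])
iqr = q3 - q1

# Standard 1.5 * IQR fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are the points a box plot would flag
outliers = popularity[(popularity < lower) | (popularity > upper)]
print(len(outliers), "outlier(s) outside", (round(lower, 2), round(upper, 2)))
```

Counting the flagged points gives a figure to quote alongside the plot instead of reading outliers off the chart by eye.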


# visualize popularity vs revenue

scatter_plotter(tmdb_movies_data,
                "popularity", "revenue_adj",
                40, 1000000000,
                "#800080", 
                "Relationship between a movie's popularity and revenue (adjusted)");
print("The coefficient of correlation between a movie's popularity and its revenue is:", round(tmdb_movies_data["popularity"].corr(tmdb_movies_data["revenue_adj"]), 4))
print();

The coefficient of correlation between a movie's popularity and its revenue is: 0.6084


# Do movies with higher vote counts generally show higher revenue?

scatter_plotter(tmdb_movies_data,
                "revenue_adj", "vote_count",
                1000000000, 10000,
                "#800080",
                "Relationship between vote count and revenue (adjusted)");
print("The coefficient of correlation between a movie's adjusted revenue and its vote_count is:", round(tmdb_movies_data["vote_count"].corr(tmdb_movies_data["revenue_adj"]), 4))
print();

The coefficient of correlation between a movie's adjusted revenue and its vote_count is: 0.7075


# get a general idea of the genres

tmdb_movies_data["genre_1"].describe().to_frame()


# List the unique genres in dataset

tmdb_movies_data["genre_1"].unique()

array(['Action', 'Adventure', 'Western', 'Science Fiction', 'Drama',
       'Family', 'Comedy', 'Crime', 'Romance', 'War', 'Mystery',
       'Thriller', 'Fantasy', 'History', 'Animation', 'Horror', 'Music',
       'Documentary', 'TV Movie', 'Foreign'], dtype=object)


# Which genres have the highest revenue on average?

avg_rev_by_genre = (tmdb_movies_data.groupby("genre_1")["revenue_adj"].mean()).sort_values()

avg_rev_by_genre.tail(5)

genre_1
Family             7.833664e+07
Animation          8.244244e+07
Fantasy            8.314328e+07
Science Fiction    1.009330e+08
Adventure          1.668203e+08
Name: revenue_adj, dtype: float64


# Which genres have the lowest revenue on average?

avg_rev_by_genre = (tmdb_movies_data.groupby("genre_1")["revenue_adj"].mean()).sort_values()

avg_rev_by_genre.head(5)

genre_1
Foreign        0.000000e+00
TV Movie       7.890419e+05
Documentary    2.311728e+06
Horror         2.420624e+07
Mystery        3.002289e+07
Name: revenue_adj, dtype: float64
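
A genre average of exactly zero suggests that its movies have no recorded revenue rather than no earnings. As a hedged sketch, assuming zero means "not reported", the averages can be recomputed on movies with positive revenue only, together with the count and median so that thinly populated genres are visible. The `toy` frame is made up.

```python
import pandas as pd

# Toy stand-in: one genre with only zero revenues (hypothetical values)
toy = pd.DataFrame({
    "genre_1": ["Foreign", "Foreign", "Drama", "Drama", "Drama"],
    "revenue_adj": [0.0, 0.0, 0.0, 30.0, 50.0],
})

# Keep only movies with a reported (positive) revenue
reported = toy[toy["revenue_adj"] > 0]

# Count, mean and median per genre on the reported subset
summary = reported.groupby("genre_1")["revenue_adj"].agg(["count", "mean", "median"])
print(summary)
```

Genres with no reported revenue disappear from `summary` entirely, which is arguably more honest than ranking them last with an average of zero.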


# Visualize revenues by genre

# (this cell displays the per-genre averages as a formatted table)
avg_rev_by_genre_df = tmdb_movies_data.groupby("genre_1")["revenue_adj"].mean().to_frame()

avg_rev_by_genre_df.style.format("{0:,.2f}")


# plot bar graph of the top revenue by genre

fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_genre.tail(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Highest revenues by genre");


# plot graph of lowest average revenue by genre

fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_genre.head(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Lowest revenues by genre");


# get a general idea of the number of production companies

tmdb_movies_data["prod_com_1"].describe().to_frame()


# List the first few unique production companies in dataset

tmdb_movies_data["prod_com_1"].unique()[:20]

array(['Universal Studios', 'Village Roadshow Pictures',
       'Summit Entertainment', 'Lucasfilm', 'Universal Pictures',
       'Regency Enterprises', 'Paramount Pictures',
       'Twentieth Century Fox Film Corporation', 'Walt Disney Pictures',
       'Columbia Pictures', 'DNA Films', 'Marvel Studios',
       'Double Feature Films', 'Studio Babelsberg', 'Escape Artists',
       'New Line Cinema', 'Focus Features', 'Participant Media',
       'Gotham Group', 'BBC Films'], dtype=object)


# Which production companies have the highest revenue on average? 
# Let us check the top companies using a table and then a bar graph to make it more vivid.

avg_rev_by_comp = (tmdb_movies_data.groupby("prod_com_1")["revenue_adj"].mean()).sort_values()

avg_rev_by_comp.tail(10)

prod_com_1
Bookshop Productions              4.763508e+08
Horizon Pictures (II)             5.045914e+08
WingNut Films                     5.167232e+08
Eon Productions                   5.583549e+08
Barwood Films                     6.169034e+08
Lucasfilm                         7.179706e+08
1492 Pictures                     8.351229e+08
Cool Music                        9.866889e+08
Patalex IV Productions Limited    1.000353e+09
Robert Wise Productions           1.129535e+09
Name: revenue_adj, dtype: float64
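
With roughly 3,000 distinct companies, an average may rest on very few movies. A minimal sketch of one possible safeguard, a minimum number of movies per company before ranking, is shown below. The threshold `min_movies`, the `toy` frame, and the studio names are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in: one prolific studio, one single-film studio (hypothetical)
toy = pd.DataFrame({
    "prod_com_1": ["Studio A", "Studio A", "Studio A", "Studio B"],
    "revenue_adj": [100.0, 200.0, 300.0, 900.0],
})

# Number of movies and average revenue per company
stats = toy.groupby("prod_com_1")["revenue_adj"].agg(
    n_movies="count", avg_revenue="mean"
)

# Hypothetical threshold: only rank companies with at least this many movies
min_movies = 3
ranked = stats[stats["n_movies"] >= min_movies].sort_values(
    "avg_revenue", ascending=False
)
print(ranked)
```

Here "Studio B" has the highest average but is excluded because it is based on a single movie; the right threshold is a judgment call.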


# plot bar graph of top average revenue by production company

fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_comp.tail(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Highest revenues by production companies");


# Which production companies have the lowest revenue on average? 
# Let us visualize with a series table and then plot a bar graph to make it more vivid.

avg_rev_by_comp = (tmdb_movies_data.groupby("prod_com_1")["revenue_adj"].mean()).sort_values()

avg_rev_by_comp.head(10)

prod_com_1
Kinowelt Filmproduktion        0.0
Midnight Road Entertainment    0.0
Microwave Film                 0.0
Micott & Basara K.K.           0.0
Michael Mailer Films           0.0
Metrodome Distribution         0.0
Metro-Goldwyn-Mayer            0.0
Meteor Film GmbH               0.0
Messick Films                  0.0
Merlin Productions             0.0
Name: revenue_adj, dtype: float64


# plot graph of lowest revenue by production company

fig, ax = plt.subplots(figsize=(10, 10))
avg_rev_by_comp.head(10).plot(kind="barh", ylabel="revenue_adj", ax=ax, title="Lowest revenues by production companies");


# Do movies with higher budgets generally have a higher voting count?

scatter_plotter(tmdb_movies_data,
                "vote_count", "budget_adj",
                10000, 400000000,
                "#800080",
                "Relationship between vote count and budget (adjusted)");
print("The coefficient of correlation between a movie's adjusted budget and its vote_count is:", round(tmdb_movies_data["vote_count"].corr(tmdb_movies_data["budget_adj"]), 4))
print();

The coefficient of correlation between a movie's adjusted budget and its vote_count is: 0.5863

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
10864	21449	tt0061177	0.064317	0	0	What's Up, Tiger Lily?	Tatsuya Mihashi\|Akiko Wakabayashi\|Mie Hama\|Joh...	NaN	Woody Allen	WOODY ALLEN STRIKES BACK!	...	In comic Woody Allen's film debut, he took the...	80	Action\|Comedy	Benedict Pictures Corp.	11/2/66	22	5.4	1966	0.000000	0.0
10865	22293	tt0060666	0.035919	19000	0	Manos: The Hands of Fate	Harold P. Warren\|Tom Neyman\|John Reynolds\|Dian...	NaN	Harold P. Warren	It's Shocking! It's Beyond Your Imagination!	...	A family gets lost on the road and stumbles up...	74	Horror	Norm-Iris	11/15/66	15	1.5	1966	127642.279154	0.0

	0	1	2	3	4
0	Universal Studios	Amblin Entertainment	Legendary Pictures	Fuji Television Network	Dentsu
1	Village Roadshow Pictures	Kennedy Miller Productions	None	None	None
2	Summit Entertainment	Mandeville Films	Red Wagon Entertainment	NeoReel	None
3	Lucasfilm	Truenorth Productions	Bad Robot	None	None
4	Universal Pictures	Original Film	Media Rights Capital	Dentsu	One Race Films

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	genre_1	genre_2	genre_3	genre_4	genre_5	prod_com_1	prod_com_2	prod_com_3	prod_com_4	prod_com_5
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Action	Adventure	Science Fiction	Thriller	None	Universal Studios	Amblin Entertainment	Legendary Pictures	Fuji Television Network	Dentsu
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	Action	Adventure	Science Fiction	Thriller	None	Village Roadshow Pictures	Kennedy Miller Productions	None	None	None

	popularity	budget	revenue	original_title	cast	director	runtime	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	genre_1	prod_com_1
0	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	124	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09	Action	Universal Studios
1	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	George Miller	120	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08	Action	Village Roadshow Pictures
2	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	Robert Schwentke	119	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08	Adventure	Summit Entertainment
3	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	J.J. Abrams	136	12/15/15	5292	7.5	2015	1.839999e+08	1.902723e+09	Action	Lucasfilm
4	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	James Wan	137	4/1/15	2947	7.3	2015	1.747999e+08	1.385749e+09	Action	Universal Pictures

Project: TMDB Movie Data

Table of Contents

Data Wrangling

Load Data

General Properties

Data Cleaning

Exploratory Data Analysis

Budgets vs Revenue

Popularity vs Revenue

Voting Counts vs Revenue

Genres vs Revenue

Production Companies vs Revenue

Voting Counts vs Budgets

Report and Conclusion

Limitations

Resources

	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10,731	10,731	10,731	10,731	10,731	10,731	10,731	10,731	10,731
mean	1	14,803,646	40,319,888	102	220	6	2,001	17,765,303	52,006,229
std	1	31,064,556	117,652,421	30	579	1	13	34,466,302	145,425,154
min	0	0	0	0	10	2	1,960	0	0
25%	0	0	0	90	17	5	1,995	0	0
50%	0	0	0	99	39	6	2,006	0	0
75%	1	16,000,000	25,000,000	112	148	7	2,011	21,108,852	34,705,457
max	33	425,000,000	2,781,505,847	900	9,767	9	2,015	425,000,000	2,827,123,750

	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	5,153	5,153	5,153	5,153	5,153	5,153	5,153	5,153	5,153
mean	1	30,828,241	80,531,576	107	410	6	2,001	36,995,822	102,504,045
std	1	38,931,994	159,674,751	23	789	1	12	41,982,022	196,144,368
min	0	1	0	0	10	2	1,960	1	0
25%	0	6,000,000	0	93	36	6	1,996	8,142,944	0
50%	1	17,500,000	21,126,225	103	123	6	2,005	22,878,673	29,019,220
75%	1	40,000,000	90,000,000	117	403	7	2,010	50,245,348	113,749,780
max	33	425,000,000	2,781,505,847	540	9,767	8	2,015	425,000,000	2,827,123,750

	revenue_adj
genre_1
Action	74,149,970.20
Adventure	166,820,282.41
Animation	82,442,435.58
Comedy	38,477,209.78
Crime	46,866,881.47
Documentary	2,311,727.80
Drama	35,928,149.06
Family	78,336,643.31
Fantasy	83,143,277.58
Foreign	0.00
History	65,361,945.75
Horror	24,206,242.27
Music	39,665,699.98
Mystery	30,022,894.06
Romance	47,470,361.23
Science Fiction	100,933,045.30
TV Movie	789,041.93
Thriller	30,969,236.99
War	49,583,500.45
Western	47,307,389.85

	prod_com_1
count	10731
unique	3030
top	Not Available
freq	959