Prosper Loan Data - Part I¶

by Oluwafunmito Blessed ODEFEMI¶

Introduction¶

The Prosper loan data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. We are interested in drawing out as many interesting insights as possible using univaraite, bivariate, and multivariate analysis and visualization.

Preliminary Wrangling¶

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In [2]:
# read data set into pandas DataFrame
prosper_dataset = pd.read_csv("prosperLoanData.csv")

# get a general overview of the data set.
prosper_dataset.shape
Out[2]:
(113937, 81)
In [3]:
# get information on the dataset's features and completeness

prosper_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss                        84853 non-null   float64
 12  EstimatedReturn                      84853 non-null   float64
 13  ProsperRating (numeric)              84853 non-null   float64
 14  ProsperRating (Alpha)                84853 non-null   object 
 15  ProsperScore                         84853 non-null   float64
 16  ListingCategory (numeric)            113937 non-null  int64  
 17  BorrowerState                        108422 non-null  object 
 18  Occupation                           110349 non-null  object 
 19  EmploymentStatus                     111682 non-null  object 
 20  EmploymentStatusDuration             106312 non-null  float64
 21  IsBorrowerHomeowner                  113937 non-null  bool   
 22  CurrentlyInGroup                     113937 non-null  bool   
 23  GroupKey                             13341 non-null   object 
 24  DateCreditPulled                     113937 non-null  object 
 25  CreditScoreRangeLower                113346 non-null  float64
 26  CreditScoreRangeUpper                113346 non-null  float64
 27  FirstRecordedCreditLine              113240 non-null  object 
 28  CurrentCreditLines                   106333 non-null  float64
 29  OpenCreditLines                      106333 non-null  float64
 30  TotalCreditLinespast7years           113240 non-null  float64
 31  OpenRevolvingAccounts                113937 non-null  int64  
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64
 33  InquiriesLast6Months                 113240 non-null  float64
 34  TotalInquiries                       112778 non-null  float64
 35  CurrentDelinquencies                 113240 non-null  float64
 36  AmountDelinquent                     106315 non-null  float64
 37  DelinquenciesLast7Years              112947 non-null  float64
 38  PublicRecordsLast10Years             113240 non-null  float64
 39  PublicRecordsLast12Months            106333 non-null  float64
 40  RevolvingCreditBalance               106333 non-null  float64
 41  BankcardUtilization                  106333 non-null  float64
 42  AvailableBankcardCredit              106393 non-null  float64
 43  TotalTrades                          106393 non-null  float64
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64
 45  TradesOpenedLast6Months              106393 non-null  float64
 46  DebtToIncomeRatio                    105383 non-null  float64
 47  IncomeRange                          113937 non-null  object 
 48  IncomeVerifiable                     113937 non-null  bool   
 49  StatedMonthlyIncome                  113937 non-null  float64
 50  LoanKey                              113937 non-null  object 
 51  TotalProsperLoans                    22085 non-null   float64
 52  TotalProsperPaymentsBilled           22085 non-null   float64
 53  OnTimeProsperPayments                22085 non-null   float64
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64
 56  ProsperPrincipalBorrowed             22085 non-null   float64
 57  ProsperPrincipalOutstanding          22085 non-null   float64
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64
 59  LoanCurrentDaysDelinquent            113937 non-null  int64  
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64
 61  LoanMonthsSinceOrigination           113937 non-null  int64  
 62  LoanNumber                           113937 non-null  int64  
 63  LoanOriginalAmount                   113937 non-null  int64  
 64  LoanOriginationDate                  113937 non-null  object 
 65  LoanOriginationQuarter               113937 non-null  object 
 66  MemberKey                            113937 non-null  object 
 67  MonthlyLoanPayment                   113937 non-null  float64
 68  LP_CustomerPayments                  113937 non-null  float64
 69  LP_CustomerPrincipalPayments         113937 non-null  float64
 70  LP_InterestandFees                   113937 non-null  float64
 71  LP_ServiceFees                       113937 non-null  float64
 72  LP_CollectionFees                    113937 non-null  float64
 73  LP_GrossPrincipalLoss                113937 non-null  float64
 74  LP_NetPrincipalLoss                  113937 non-null  float64
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64
 76  PercentFunded                        113937 non-null  float64
 77  Recommendations                      113937 non-null  int64  
 78  InvestmentFromFriendsCount           113937 non-null  int64  
 79  InvestmentFromFriendsAmount          113937 non-null  float64
 80  Investors                            113937 non-null  int64  
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
In [4]:
# get random sampe of dataset

prosper_dataset.sample(10)
Out[4]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
46159 B5FB348594719244427B0FC 462328 2010-06-13 07:08:50.627000000 NaN 36 Completed 2013-12-30 00:00:00 0.33097 0.2950 0.2850 ... -35.25 0.0 0.00 0.00 0.0 1.0 0 0 0.0 35
81917 56F03390726402205C4F164 147487 2007-06-06 04:39:58.487000000 C 36 Chargedoff 2008-06-15 00:00:00 0.29776 0.2900 0.2800 ... -146.62 0.0 21951.10 21933.70 0.0 1.0 0 0 0.0 209
97664 87473593448868465EE9951 1030929 2013-11-11 10:55:20.243000000 NaN 60 Current NaN 0.21115 0.1870 0.1770 ... -37.23 0.0 0.00 0.00 0.0 1.0 0 0 0.0 1
107839 BFE33587507307752A4BACA 888273 2013-09-05 16:01:37.460000000 NaN 36 Current NaN 0.31032 0.2712 0.2612 ... -16.05 0.0 0.00 0.00 0.0 1.0 0 0 0.0 29
59363 F5FF3579022256746145B9A 774015 2013-05-08 18:47:09.327000000 NaN 60 Current NaN 0.20081 0.1769 0.1669 ... -72.31 0.0 0.00 0.00 0.0 1.0 0 0 0.0 1
33843 D7B53601896238808A1578A 1193207 2014-02-14 05:53:27.880000000 NaN 36 Current NaN 0.12117 0.0930 0.0830 ... 0.00 0.0 0.00 0.00 0.0 1.0 0 0 0.0 1
16566 1CF93399896040134488434 199879 2007-09-10 10:46:05.807000000 D 36 Chargedoff 2008-11-03 00:00:00 0.18475 0.1775 0.1675 ... -30.87 0.0 4106.92 4106.92 0.0 1.0 0 0 0.0 166
52128 FBFC3507042071489F0746B 493263 2011-02-09 09:24:46.263000000 NaN 60 Completed 2012-03-06 00:00:00 0.18915 0.1749 0.1649 ... -120.00 0.0 0.00 0.00 0.0 1.0 0 0 0.0 313
22827 FB9A3534284455007B9FD2C 544504 2011-12-13 10:13:56.453000000 NaN 36 Current NaN 0.19108 0.1621 0.1521 ... -76.08 0.0 0.00 0.00 0.0 1.0 0 0 0.0 94
46271 6A2E33767314400493604DB 76053 2006-12-18 15:53:52.733000000 C 36 Completed 2008-09-11 00:00:00 0.13654 0.1295 0.1245 ... -18.78 0.0 0.00 0.00 0.0 1.0 0 0 0.0 85

10 rows × 81 columns

In [5]:
# Generate random dataset and seed for reproducibility.
np.random.seed(5000)
sample = np.random.choice(prosper_dataset.shape[0], 2000, replace=False)
prosper_data_rand = prosper_dataset.loc[sample]
In [6]:
#  summary statistics of numerical variables

prosper_dataset.describe().style.format("{0:,.3f}")
Out[6]:
  ListingNumber Term BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperScore ListingCategory (numeric) EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio StatedMonthlyIncome TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
count 113,937.000 113,937.000 113,912.000 113,937.000 113,937.000 84,853.000 84,853.000 84,853.000 84,853.000 84,853.000 113,937.000 106,312.000 113,346.000 113,346.000 106,333.000 106,333.000 113,240.000 113,937.000 113,937.000 113,240.000 112,778.000 113,240.000 106,315.000 112,947.000 113,240.000 106,333.000 106,333.000 106,333.000 106,393.000 106,393.000 106,393.000 106,393.000 105,383.000 113,937.000 22,085.000 22,085.000 22,085.000 22,085.000 22,085.000 22,085.000 22,085.000 18,928.000 113,937.000 16,952.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000 113,937.000
mean 627,885.693 40.830 0.219 0.193 0.183 0.169 0.080 0.096 4.072 5.950 2.774 96.072 685.568 704.568 10.317 9.260 26.755 6.970 398.292 1.435 5.584 0.592 984.507 4.155 0.313 0.015 17,598.707 0.561 11,210.225 23.230 0.886 0.802 0.276 5,608.026 1.421 22.934 22.272 0.614 0.049 8,472.312 2,930.314 -3.223 152.817 16.268 31.897 69,444.474 8,337.014 272.476 4,183.079 3,105.537 1,077.543 -54.726 -14.243 700.446 681.420 25.143 0.999 0.048 0.023 16.551 80.475
std 328,076.236 10.436 0.080 0.075 0.075 0.068 0.047 0.030 1.673 2.377 3.997 94.481 66.458 66.458 5.458 5.023 13.638 4.631 447.160 2.438 6.430 1.979 7,158.270 10.160 0.728 0.154 32,936.402 0.318 19,818.361 11.871 0.148 1.098 0.552 7,478.497 0.764 19.250 18.830 2.447 0.556 7,395.508 3,806.635 50.064 466.320 9.006 29.974 38,930.480 6,245.801 192.698 4,790.907 4,069.528 1,183.414 60.675 109.233 2,388.514 2,357.167 275.658 0.018 0.332 0.232 294.545 103.239
min 4.000 12.000 0.007 0.000 -0.010 -0.183 0.005 -0.183 1.000 1.000 0.000 0.000 0.000 19.000 0.000 0.000 2.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 -209.000 0.000 0.000 0.000 1.000 1,000.000 0.000 -2.350 0.000 -2.350 -664.870 -9,274.750 -94.200 -954.550 0.000 0.700 0.000 0.000 0.000 1.000
25% 400,919.000 36.000 0.156 0.134 0.124 0.116 0.042 0.074 3.000 4.000 1.000 26.000 660.000 679.000 7.000 6.000 17.000 4.000 114.000 0.000 2.000 0.000 0.000 0.000 0.000 0.000 3,121.000 0.310 880.000 15.000 0.820 0.000 0.140 3,200.333 1.000 9.000 9.000 0.000 0.000 3,500.000 0.000 -35.000 0.000 9.000 6.000 37,332.000 4,000.000 131.620 1,005.760 500.890 274.870 -73.180 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 2.000
50% 600,554.000 36.000 0.210 0.184 0.173 0.162 0.072 0.092 4.000 6.000 1.000 67.000 680.000 699.000 10.000 9.000 25.000 6.000 271.000 1.000 4.000 0.000 0.000 0.000 0.000 0.000 8,549.000 0.600 4,100.000 22.000 0.940 0.000 0.220 4,666.667 1.000 16.000 15.000 0.000 0.000 6,000.000 1,626.550 -3.000 0.000 14.000 21.000 68,599.000 6,500.000 217.740 2,583.830 1,587.500 700.840 -34.440 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 44.000
75% 892,634.000 36.000 0.284 0.250 0.240 0.224 0.112 0.117 5.000 8.000 3.000 137.000 720.000 739.000 13.000 12.000 35.000 9.000 525.000 2.000 7.000 0.000 0.000 3.000 0.000 0.000 19,521.000 0.840 13,180.000 30.000 1.000 1.000 0.320 6,825.000 2.000 33.000 32.000 0.000 0.000 11,000.000 4,126.720 25.000 0.000 22.000 65.000 101,901.000 12,000.000 371.580 5,548.400 4,000.000 1,458.540 -13.920 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 115.000
max 1,255,725.000 60.000 0.512 0.497 0.492 0.320 0.366 0.284 7.000 11.000 20.000 755.000 880.000 899.000 59.000 54.000 136.000 51.000 14,985.000 105.000 379.000 83.000 463,881.000 99.000 38.000 20.000 1,435,667.000 5.950 646,285.000 126.000 1.000 20.000 10.010 1,750,002.917 8.000 141.000 141.000 42.000 21.000 72,499.000 23,450.950 286.000 2,704.000 44.000 100.000 136,486.000 35,000.000 2,251.510 40,702.390 35,000.000 15,617.030 32.060 0.000 25,000.000 25,000.000 21,117.900 1.012 39.000 33.000 25,000.000 1,189.000

What is the structure of your dataset?¶

The Prosper Loan dataset is a features-heavy dataset with 81 variables on over 110,000 loan listings. The variable types include 3 booleans, 50 floats, 11 integers, and 17 strings (objects), which contain the date- and categorical-type variables.

What is/are the main feature(s) of interest in your dataset?¶

The main features of interest in this dataset will include the loan annual percentage rates (APR), Interest rates, credit scores, borrower income, etc, and how they affect or do not affect the repayment status of the loan.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

To investigate my dataset, I will be interested in using summary statistics such as averages, medians, value counts, correlations between variables, visual plots, and any other data analysis exploration as needed. I anticipate a higher APR/interest rate to lead to more defaulting by borrowers and higher risk scores to mean less defaulting.

Univariate Exploration¶

Variable 1 - ListingCreationDate¶

Description: The date the listing was created.

Question: What time period does our dataset cover? When do people borrow the most? Let's begin our exploration by answering that question.

The variable is currently saved in the object format, so we will first have to do a type conversion before we can answer our question.

In [7]:
# convert the ListingCreationDate Series to the datetime data type

prosper_dataset["ListingCreationDate"] = pd.to_datetime(prosper_dataset["ListingCreationDate"])
In [8]:
# Confirm the new data type of the column

prosper_dataset["ListingCreationDate"].info()
<class 'pandas.core.series.Series'>
RangeIndex: 113937 entries, 0 to 113936
Series name: ListingCreationDate
Non-Null Count   Dtype         
--------------   -----         
113937 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 890.3 KB
In [9]:
# Check for the start date for the data set

minimum_date = prosper_dataset["ListingCreationDate"].min()

print(f"The start date for the dataset is {minimum_date.day_name()}, {minimum_date.month_name()}, {minimum_date.day}, {minimum_date.year}")
The start date for the dataset is Wednesday, November, 9, 2005
In [10]:
# Check for the end date for the data set

maximum_date = prosper_dataset["ListingCreationDate"].max()

print(f"The end date for the dataset is {maximum_date.day_name()}, {maximum_date.month_name()}, {maximum_date.day}, {maximum_date.year}")
The end date for the dataset is Monday, March, 10, 2014

Next, let us find out what month(s) people borrow the most and what month borrowing is at the least. Also, what day of the week do people borrow the most?

In [11]:
# Get a count of the number of listings by year
listing_year_count = prosper_dataset["ListingCreationDate"].dt.strftime('%Y').value_counts().to_frame().rename_axis("Year")

# Add percentages to the above dataframe
listing_year_count["percentage"] = listing_year_count["ListingCreationDate"] / len(prosper_dataset)*100

listing_year_count
Out[11]:
ListingCreationDate percentage
Year
2013 35413 31.081212
2012 19556 17.163871
2007 11557 10.143325
2011 11442 10.042392
2008 11263 9.885287
2014 10734 9.420996
2006 6213 5.453014
2010 5530 4.853559
2009 2206 1.936158
2005 23 0.020187
In [12]:
# Get a count of the number of listings by month
listing_month_count = prosper_dataset["ListingCreationDate"].dt.strftime('%B').value_counts().to_frame().rename_axis("Month")

# Add percentages to the above dataframe
listing_month_count["percentage"] = listing_month_count["ListingCreationDate"] / len(prosper_dataset)*100

listing_month_count
Out[12]:
ListingCreationDate percentage
Month
January 11214 9.842281
October 10539 9.249849
December 10320 9.057637
February 10124 8.885612
September 10074 8.841728
November 9952 8.734652
July 9506 8.343207
August 9202 8.076393
June 8672 7.611224
May 8641 7.584016
March 8032 7.049510
April 7661 6.723891
In [13]:
# Get a count of the number of listings by day of the week
listing_day_count = prosper_dataset["ListingCreationDate"].dt.strftime('%A').value_counts().to_frame().rename_axis("Weekday")

# Add percentages to the above dataframe
listing_day_count["percentage"] = listing_day_count["ListingCreationDate"] / len(prosper_dataset)*100

listing_day_count
Out[13]:
ListingCreationDate percentage
Weekday
Tuesday 19946 17.506166
Monday 19594 17.197223
Wednesday 19057 16.725910
Thursday 17962 15.764853
Friday 16136 14.162212
Saturday 10758 9.442060
Sunday 10484 9.201576
In [14]:
# Define a function that creates a simple bar chart from a value count dataframe
def simple_bar_plot(df, xlabel, ylabel, title):
    """
    This simple function creates a bar chart from a value_counts dataframe.
    It takes four parameters:
    df (DataFrame): the value_counts dataframe
    xlabel (str): the x label for the plot
    ylabel (str): the y label for the plot
    title (str): the title of the plot
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    plot = df.plot(kind="bar", ax=ax)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend()
    ax.get_legend().remove()
    return plot
In [15]:
# plot bar graph for months count

simple_bar_plot(listing_year_count, "Listing Year",
                "Number of borrowers",
                "Bar plot for borrower listings by year");
In [16]:
# plot bar graph for months count
simple_bar_plot(listing_month_count, "Listing Month",
                "Number of borrowers",
                "Bar plot for borrower listings by month");
In [17]:
# plot bar graph for days count
simple_bar_plot(listing_day_count, "Listing Day",
                "Number of borrowers",
                "Bar plot for bumber of borrower listings by day");

We can see from the above that our dataset entries range from November 9 2005 till March 10 2014, representing over 8 years of loan data collection. 2013 had the highest amount of borrowing while 2005 had the lowest. January appears to be the month when people borrow the most, followed by October, then December. The month with the least amount of borrowing is April. Similarly, people borrow the most on Tuesday and the least on Sunday.

While the ListingCreationDate univariate exploration does give us some insight into our dataset, a better way to explore it will be using a time-series plot alongside another variable of interest. This will be explored later in the bivariate exploration section.

Variable 2 - Term¶

Description: The length of the loan expressed in months.

Question: What is the average term of the listed loans? How are the terms distributed?

In [18]:
prosper_dataset["Term"].describe()
Out[18]:
count    113937.000000
mean         40.830248
std          10.436212
min          12.000000
25%          36.000000
50%          36.000000
75%          36.000000
max          60.000000
Name: Term, dtype: float64
In [19]:
unique_vals = prosper_dataset["Term"].unique()

unique_vals
Out[19]:
array([36, 60, 12], dtype=int64)
In [20]:
#  Get the count of each unique value in the Term column

unique_counts = prosper_dataset["Term"].value_counts()

unique_counts
Out[20]:
36    87778
60    24545
12     1614
Name: Term, dtype: int64

Given the few unique values of the Term feature, a pie or chart will be a good way to explore this variable. Let us use both visualizations for this variable

In [21]:
# Define a function for plotting bar charts using seaborn.

def bar_chart_plotter(df, x, title):
    """
    This function will be used to plot bar graphs using seaborn's countplot. 
    The parameters are the dataframe (df), Series to plot(x),
    title of plot (title), xlabel (xlabel), and ylabel (ylabel)
    """
    sns.color_palette()
    base_color = sns.color_palette()[0]
    sns.set_theme(style="darkgrid")
    plt.title(title)
    return sns.countplot(data=df, x=x, color=base_color);
In [22]:
# Define a function for plotting pie charts using pyplot.

def pie_plotter(x, labels, title, textprops, autopct, explode):
    """
    x (list or series): The cound of categories to be plotted
    labels (list or series): The category labels to be plotted
    title (string): title of the pie plot
    textprops (dictionary): text customizations
    autopct (string): formating for the text
    explode (list): list representing the pie offset. Should be the same length as the number of categories
    """
    
    plt.title(title)
    return plt.pie(x=x, labels=labels, textprops=textprops, autopct=autopct, explode=explode)
    plt.legend()
In [23]:
# Plot a bar graph and pie chart of the loan term for the dataset in one plot
plt.figure(figsize = [20, 6])

plt.subplot(1, 2, 1)
bar_chart_plotter(prosper_dataset,
                  prosper_dataset["Term"],
                  "Plot of Prosper Loan Term in months")
plt.xlabel("Term (months)")
plt.ylabel("Number of listings")

plt.subplot(1, 2, 2)
pie_plotter(unique_counts,
            unique_vals,
            "Pie chart showing Term of Prosper Loan Dataset in months",
            {"fontsize":15},
            "%0.1f%%",
            [0.01,0.01,0.2])
plt.legend();

Our bar chart clearly shows that more than two-thirds of the dataset have a Term of 36 months (3 years), with a little over one-quarter having a Term of 60 months (5 years). About one percent of the dataset have a loan Term of 12 months (1 year). The average loan term is at about 40 months.

Variable 3 - LoanStatus¶

Description: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accomanied by a delinquency bucket.

Question: What is the distribution of the loan status for our dataset? Are more people defaulting than are currently paying off their loans?

In [24]:
# get an idea of the values in out LoanStatus categories
print(prosper_dataset["LoanStatus"].nunique())

prosper_dataset["LoanStatus"].unique()
12
Out[24]:
array(['Completed', 'Current', 'Past Due (1-15 days)', 'Defaulted',
       'Chargedoff', 'Past Due (16-30 days)', 'Cancelled',
       'Past Due (61-90 days)', 'Past Due (31-60 days)',
       'Past Due (91-120 days)', 'FinalPaymentInProgress',
       'Past Due (>120 days)'], dtype=object)

There are 12 unique values for the loan status in this dataset, and this can be easily represented using a bar chart. We will create another function that allows us visualize using pyplot

In [25]:
def plt_bar_plotter(x, y, xlabel, title, rotation):
    """
    This function plots a bar chart using pyplot and takes the following parameters::
    x (list or series): category to be plotted on the x-axis
    y (list or series): height of the categories plotted on the x-axis
    xlabel (string): label for the x-axis
    title (string): title of the plot
    rotation (int): xticks the rotation on the x-axis
    """

    # add text to bar chart using for loop
    for i, g in enumerate(y):
        plt.text(i, g, f"{g}", ha="center", fontsize=16, fontweight="bold")

    # add labels to plot 
    plt.xlabel(xlabel) # add x-label
    plt.xticks(rotation=rotation) # rotate the x-ticks
    plt.yticks([]) # remove y-ticks
    sns.despine(left=True) # remove left spine
    plt.title(title) # add title to plot
    return plt.bar(x, y, edgecolor="black", color="xkcd:light gray blue");
In [26]:
fig, ax = plt.subplots(figsize=(15, 5)) # create plot canvas
# save the categories and their counts into variables for plotting
loan_status = list(prosper_dataset["LoanStatus"].unique())
loan_status_counts = prosper_dataset["LoanStatus"].value_counts(sort=True).to_list()

# plot bar chart of loan statuses using pyplot
plt_bar_plotter(loan_status, # loan status categories
                loan_status_counts, # count of categories
                "Loan Status", # x label
                "Current Status of Loans", # title
                90); # xticks rotation

Our bar chart shows us that about half of all loans have been completed while roughly a third of them are currently in term. A little over ten percent are past due to various degrees ranging from 1 day to over 120 days.

Variable 4 - BorrowerAPR¶

Description: The Borrower's Annual Percentage Rate (APR) for the loan.

Question: At what rates are borrowers getting their loans? What is the distribution of the APR model?

In [27]:
prosper_dataset["BorrowerAPR"].describe()
Out[27]:
count    113912.000000
mean          0.218828
std           0.080364
min           0.006530
25%           0.156290
50%           0.209760
75%           0.283810
max           0.512290
Name: BorrowerAPR, dtype: float64

Given the nature of the APR, a histogram will be a good way to see the distribution of the variable. We will first define a function which will allow us plot histograms automatically from this point on.

In [28]:
# Function that plots a histogran
def hist_plotter(df, x, bins, xlabel, ylabel, title):
    """
    This function plots a histogram using the following parameters:
    df (DataFrame): The dataframe object 
    x (series): The series to be plotted on the histogram
    bin (list): the number of bins to plot
    xlabel (string): label on the x-axis
    ylabel (string): label on the y-axis
    title (string): title of the plot
    """
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    return plt.hist(data=df, x=x, bins=bins)
In [29]:
fig, ax = plt.subplots(figsize=(10, 5)) # create plot canvas
# customize bins
bins = np.arange(0.00, 0.5+0.1, 0.03)

# plot histogram using borrower's APR
hist_plotter(prosper_dataset,
             "BorrowerAPR",
             bins, 
             "Annual Percentage Rate (APR)","Number of Listings",
             "Borrower's Annual Percentage Rate (APR) Distribution");

The APR closely models a bimodal distibribution, with both modes lying between 0.15% and 0.2%. The average APR is at 0.22%.

Variable 5 - BorrowerState¶

Description: The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.

Question: What state has the highest number of borrowers?

In [30]:
# Get the number of states with borrowers 
print(prosper_dataset["BorrowerState"].nunique())

# Print out the list of all the states with borrowers listed
prosper_dataset["BorrowerState"].unique()
51
Out[30]:
array(['CO', 'GA', 'MN', 'NM', 'KS', 'CA', 'IL', 'MD', nan, 'AL', 'AZ',
       'VA', 'FL', 'PA', 'OR', 'MI', 'NY', 'LA', 'WI', 'OH', 'NC', 'WA',
       'NV', 'NJ', 'TX', 'AR', 'SC', 'DE', 'MO', 'NE', 'UT', 'DC', 'MA',
       'CT', 'IN', 'KY', 'OK', 'MS', 'WV', 'RI', 'TN', 'ID', 'MT', 'HI',
       'NH', 'VT', 'WY', 'ME', 'AK', 'IA', 'SD', 'ND'], dtype=object)

Given the high number of states, it would be difficult to plot everything at once, so we will plot the top 10 and bottom 10 states in terms of the number of borrowers listed. A bar chart will be a great way to visualize this.

In [31]:
# Get the states with the highest number of borrowers.

print(prosper_dataset["BorrowerState"].value_counts().head(10))

prosper_dataset["BorrowerState"].value_counts().tail(10)
CA    14717
TX     6842
NY     6729
FL     6720
IL     5921
GA     5008
OH     4197
MI     3593
VA     3278
NJ     3097
Name: BorrowerState, dtype: int64
Out[31]:
DC    382
MT    330
DE    300
VT    207
AK    200
SD    189
IA    186
WY    150
ME    101
ND     52
Name: BorrowerState, dtype: int64
In [32]:
# save variables for plotting
# highest number of borrowers
state_list_top = prosper_dataset["BorrowerState"].value_counts(sort=True).head(10).index
state_counts_top = prosper_dataset["BorrowerState"].value_counts(sort=True).head(10).to_list()
# least number of borrowers
state_list_bottom = prosper_dataset["BorrowerState"].value_counts(sort=True).tail(10).index
state_counts_bottom = prosper_dataset["BorrowerState"].value_counts(sort=True).tail(10).to_list()

fig, ax = plt.subplots(figsize=(20, 8)) # create plot canvas

plt.subplot(1, 2, 1)
# plot bar chart of borrower states
plt_bar_plotter(state_list_top, # borrower state
                state_counts_top, # count of borrowers
                "State", # x-label
                "Highest Number of borrowers by State", # title
                90); # xticks rotation

plt.subplot(1, 2, 2)
# plot bar chart of borrower states
plt_bar_plotter(state_list_bottom, # borrower state
                state_counts_bottom, # count of borrowers
                "State", # x-label
                "Lowest Number of borrowers by State", # title
                90); # xticks rotation

It is not surprising to see that California (CA) has the highest number of listed borrowers, given its status as the most populous state and one of the most important commercial hubs in the country. The number of borrowers listed are more than double the next state, Texas (TX) which had 6842 borrowers.

In contrast, North Dakota (ND) and Maine (ME) have the lowest number of listed borrowers, with 52 and 101 borrowers respectively.

This is a very important variable that would be interesting to explore using bivariate or multivariate analysis, if possible.

Variable 6 - EmploymentStatus¶

Description: The employment status of the borrower at the time they posted the listing.

Question: What is the employment status of borrowers like as at the time the listing was posetd?

In [33]:
employ_status = prosper_dataset.groupby("EmploymentStatus").size().reset_index(name="counts")

employ_status
Out[33]:
EmploymentStatus counts
0 Employed 67322
1 Full-time 26355
2 Not available 5347
3 Not employed 835
4 Other 3806
5 Part-time 1088
6 Retired 795
7 Self-employed 6134
In [34]:
fig, ax = plt.subplots(figsize=(10, 6)) # create plot canvas

plt_bar_plotter(employ_status["EmploymentStatus"], # employment status
                employ_status["counts"], # count of status
                "Employment Status", # x-label
                "Number of Borrowers by Employment Status as at listing time", # title
                45); # xticks rotation

More than 90% of borrowers were employed in one form or the other (employed, full-time, part-time, self-employed) as at the time of listing. We will be interested in knowing if a borrower's employment status has any relation with their loan status in our later explorations.

Variable 7 - IsBorrowerHomeowner¶

Description: A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.

Question: What percentage of our borrowers are homeowners?

In [35]:
homeowner_status = prosper_dataset["IsBorrowerHomeowner"].unique()

homeowner_status
Out[35]:
array([ True, False])
In [36]:
homeowner_counts = prosper_dataset["IsBorrowerHomeowner"].value_counts().to_list()

homeowner_counts
Out[36]:
[57478, 56459]

Given the nature of the Home owner variable, a pie chart will be the best option to visualize it.

In [37]:
# Plot pie chart of homeonwer status

fig, ax = plt.subplots(figsize=(10, 6)) # create plot canvas

pie_plotter(homeowner_counts, # count of homeowners
            homeowner_status, # labels
            "Home Owner Status among listed borrowers", # title of plot
            {"fontsize":15, "fontweight":"bold"}, # text customization
            "%0.1f%%", # Value formatting
            [0.001,0.001]) # explode pie slices
plt.legend();

The home owner status of borrowers is roughly split in half between True and False, with 50.4% of borrowers being homeowners and 49.6% being non-homeowners

Variable 8 - IncomeRange¶

Description: The income range of the borrower at the time the listing was created.

Question: What is the distribution of the income range for listed borrowers?

In [38]:
income_range_counts = prosper_dataset["IncomeRange"].value_counts().to_frame()

income_range_counts["percentage"] = income_range_counts["IncomeRange"] / len(prosper_dataset) * 100

income_range_counts
Out[38]:
IncomeRange percentage
$25,000-49,999 32192 28.254211
$50,000-74,999 31050 27.251902
$100,000+ 17337 15.216304
$75,000-99,999 16916 14.846801
Not displayed 7741 6.794106
$1-24,999 7274 6.384230
Not employed 806 0.707408
$0 621 0.545038
In [39]:
# plot Income range distribution on bar chart

fig, ax = plt.subplots(figsize=(12, 6)) # create plot canvas

plt_bar_plotter(income_range_counts.index, # values to plot
                income_range_counts["IncomeRange"].to_list(), # count of values
                "Income Range", # x-label
                "Income Range of listed borrowers", # title
                45); # xticks rotation
In [40]:
# Generate random dataset and seed for reproducibility.
np.random.seed(5000)
sample = np.random.choice(prosper_dataset.shape[0], 2000, replace=False)
prosper_data_rand = prosper_dataset.loc[sample]

Variable 9 - StatedMonthlyIncome¶

Description: The monthly income the borrower stated at the time the listing was created.

Question: How is monthly income distributed among borrowers?

In [41]:
# set display format for floats
pd.options.display.float_format = '{:.4f}'.format
In [42]:
# Get summary statistics for income distribution
prosper_dataset["StatedMonthlyIncome"].describe()
Out[42]:
count    113937.0000
mean       5608.0256
std        7478.4973
min           0.0000
25%        3200.3333
50%        4666.6667
75%        6825.0000
max     1750002.9167
Name: StatedMonthlyIncome, dtype: float64
In [43]:
# Plot histogram from the stated monthly income column to see the distribution
fig, ax = plt.subplots(figsize=(10, 5)) # create plot canvas
# customize bins
bins = np.arange(0.00, 1750000, 100000)

# plot histogram of stated income
hist_plotter(prosper_dataset,
             "StatedMonthlyIncome",
             bins, 
             "Stated Income",
             "Number of Listings",
             "Income distibution for borrowers");

The summary statistics and histogram above show that the income distribution is highly skewed and there will be the need to transform to get the actual distribution. We will also need to perform some wrangling at this stage to only include datasets with income above zero for the purpose of our transformation.

In [44]:
# subset data to only those with income above zero. 
# We will continue working with this new dataset henceforth.
prosper_dataset = prosper_dataset[prosper_dataset["StatedMonthlyIncome"] > 0]

print(prosper_dataset["StatedMonthlyIncome"].info())

prosper_dataset["StatedMonthlyIncome"].describe()
<class 'pandas.core.series.Series'>
Int64Index: 112543 entries, 0 to 113936
Series name: StatedMonthlyIncome
Non-Null Count   Dtype  
--------------   -----  
112543 non-null  float64
dtypes: float64(1)
memory usage: 1.7 MB
None
Out[44]:
count    112543.0000
mean       5677.4887
std        7498.4193
min           0.0833
25%        3293.3333
50%        4703.0000
75%        6833.3333
max     1750002.9167
Name: StatedMonthlyIncome, dtype: float64
In [45]:
# Get idea of bins to plot from log transform
np.log(prosper_dataset["StatedMonthlyIncome"].describe())
Out[45]:
count   11.6311
mean     8.6443
std      8.9224
min     -2.4849
25%      8.0997
50%      8.4560
75%      8.8296
max     14.3751
Name: StatedMonthlyIncome, dtype: float64
In [46]:
# Plot histogram with the log transform
fig, ax = plt.subplots(figsize=(15, 6)) # create plot canvas
# customize bins
bins = 10 ** np.arange(3, 5, 0.03)

# Generate the x-ticks you want to apply 
ticks = [1000, 3000, 5000, 10000, 30000, 50000, 100000]

# Convert ticks into string values, to be displayed along the x-axis
labels = ['{}'.format(v) for v in ticks]

# plot histogram of stated income
hist_plotter(prosper_dataset,
             "StatedMonthlyIncome",
             bins, 
             "Stated Monthly Income",
             "Number of Borrowers",
             "Income distibution for borrowers")
# Apply log scale
plt.xscale("log")
# Apply x-ticks
plt.xticks(ticks, labels);

With the log scale, we can see that the stated monthly income now closely resembles a normal distribution, with the highest number of borrowers earning between around 2000 and 10000 dollars. The monthly income variable is one other interesting variable we will explore in a bavariate or multivariate visualization.

Variable 10 - ProsperScore¶

Description: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.

Question: What does the risk score look like for the average borrower?

In [47]:
# Get summary statistics for the ProsperScore variable
prosper_dataset["ProsperScore"].describe()
Out[47]:
count   84160.0000
mean        5.9558
std         2.3788
min         1.0000
25%         4.0000
50%         6.0000
75%         8.0000
max        11.0000
Name: ProsperScore, dtype: float64
In [48]:
# Get the value count for each rating
prosper_dataset["ProsperScore"].value_counts()
Out[48]:
4.0000     12477
6.0000     12155
8.0000     11980
7.0000     10506
5.0000      9698
3.0000      7563
9.0000      6883
2.0000      5710
10.0000     4747
11.0000     1456
1.0000       985
Name: ProsperScore, dtype: int64

The value count show that there are listings with scores higher than 10. We will assume this is an error since the documentation provides for only ratings between 1 and 10. We will subset our dataset to those with ratings that only fall between this.

In [49]:
# subset dataframe to only listings with ratings between 1 and 10.
prosper_ratings_normal = prosper_dataset[prosper_dataset["ProsperScore"] < 11]

# Group the new subset by score
prosper_ratings_normal_group = prosper_ratings_normal.groupby("ProsperScore").size().reset_index(name="counts")

# describe the subset prosper score
print(prosper_ratings_normal["ProsperScore"].describe())

# view the new dataframe summary
prosper_ratings_normal_group
count   82704.0000
mean        5.8670
std         2.3027
min         1.0000
25%         4.0000
50%         6.0000
75%         8.0000
max        10.0000
Name: ProsperScore, dtype: float64
Out[49]:
ProsperScore counts
0 1.0000 985
1 2.0000 5710
2 3.0000 7563
3 4.0000 12477
4 5.0000 9698
5 6.0000 12155
6 7.0000 10506
7 8.0000 11980
8 9.0000 6883
9 10.0000 4747

We will visualize our prosper scores using seaborn's countplot.

In [50]:
# Plot bar graph
fig, ax = plt.subplots(figsize=(10, 6)) # create plot canvas

bar_chart_plotter(prosper_ratings_normal,
                  prosper_ratings_normal["ProsperScore"],
                  "Prosper Score rating for listed borrowers")
plt.xlabel("ProsperScore")
plt.ylabel("Number of listings");

It is interesting to see that most of the borrowers listed fell into the score category "4", even though the average score was "6", and 75% of borrowers had scores between "1" and "8". This goes to show that averages can sometimes not truly represent our data.

Variable 11 - ListingCategory¶

Description: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

Question: What do people borrow money for the most?

In [51]:
# Print out the list of categories
print(prosper_dataset["ListingCategory (numeric)"].nunique())

cats = list(prosper_dataset["ListingCategory (numeric)"].unique())

cats
21
Out[51]:
[0, 2, 16, 1, 7, 13, 6, 15, 20, 19, 18, 3, 8, 4, 11, 14, 5, 9, 17, 10, 12]
In [52]:
# Create dataframe with full listing category

listing_cats = prosper_dataset.groupby("ListingCategory (numeric)").size().reset_index(name="counts")

listing_cats["Listing Category"] = ["Not Available", "Debt Consolidation", "Home Improvement",
                                    "Business", "Personal Loan", "Student Use", "Auto", "Other",
                                    "Baby&Adoption", "Boat", "Cosmetic Procedure", "Engagement Ring",
                                    "Green Loans", "Household Expenses", "Large Purchases",
                                    "Medical/Dental", "Motorcycle", "RV", "Taxes", "Vacation", "Wedding Loans"]

listing_cats
Out[52]:
ListingCategory (numeric) counts Listing Category
0 0 16727 Not Available
1 1 57950 Debt Consolidation
2 2 7386 Home Improvement
3 3 6940 Business
4 4 2274 Personal Loan
5 5 689 Student Use
6 6 2544 Auto
7 7 10311 Other
8 8 198 Baby&Adoption
9 9 84 Boat
10 10 90 Cosmetic Procedure
11 11 213 Engagement Ring
12 12 57 Green Loans
13 13 1940 Household Expenses
14 14 870 Large Purchases
15 15 1505 Medical/Dental
16 16 304 Motorcycle
17 17 52 RV
18 18 881 Taxes
19 19 760 Vacation
20 20 768 Wedding Loans

As this is a categorical variable, a bar chart will be a great way to see what category people tended to borrow money for.

In [53]:
fig, ax = plt.subplots(figsize=(20, 6)) # create plot canvas
plt_bar_plotter(listing_cats["Listing Category"],
                listing_cats["counts"],
                "Listing Category",
                "Prosper Loan Data Listing Category by number of Borrowers",
                75);

From the chart, more than half (51.5%) of the borrowers took their loans for the consolidation of other debts while a little over 10% did not specify the reason for taking on the debt.

Variable 12 - LoanOriginalAmount¶

Description: The origination amount of the loan.

What was the average amount of money borrowed? What is the range?

In [54]:
prosper_dataset["LoanOriginalAmount"].describe()
Out[54]:
count   112543.0000
mean      8364.7371
std       6249.2212
min       1000.0000
25%       4000.0000
50%       6500.0000
75%      12000.0000
max      35000.0000
Name: LoanOriginalAmount, dtype: float64

For visualizing the distribution of the loan original amount, let us use box plots. We will create a function in case we need to reuse the plot

In [55]:
# function definition for plotting a box plot
def plot_a_box(data, series, color, title):
    """
    This function will plot a box and whiskers plot using the following arguments:
    data (DataFrame): the DataFrame of interest
    series (Series or list): the series or list to be visualized
    color (string): color of the box
    title (string): title to be used for the plot.
    """
    fig, ax = plt.subplots(figsize=(8, 5))
    box = data[series].plot(ax=ax, kind="box", vert=False, color=color, title=title)
    return box
In [56]:
# Plot box plot for loan amount data
plot_a_box(prosper_dataset,
           "LoanOriginalAmount",
           "#800080",
           "Box and whiskers plot for Prosper Data Original Amount Borrowed");

The box plot above shows that the data available for loan amount borrowed is positively skewed, with 25% of borrowers borrowing less than 4000 dollars, and 75% borrowing less than 12000 dollars. The median amount borrowed was 6500 dollars. Borrowers with 24000 dollars and above are responsible for the skew as they are considered outliers.

Variable 13 - BorrowerRate¶

Description: The Borrower's interest rate for this loan.

Question: What is the average interest rate for borrowing?

In [57]:
prosper_dataset["BorrowerRate"].describe()
Out[57]:
count   112543.0000
mean         0.1924
std          0.0746
min          0.0000
25%          0.1334
50%          0.1830
75%          0.2500
max          0.4975
Name: BorrowerRate, dtype: float64
In [58]:
# Plot box plot for loan amount data
plot_a_box(prosper_dataset,
           "BorrowerRate",
           "#800080",
           "Box and whiskers plot for Prosper Data Borrower Rate");

The average borrower rate for this dataset is 0.19%, with the 75th percentile at 0.25%. There are only a few outliers within this variable. We will explore this variable much later with bivariate or multivariate analysis

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

So far, we have seen some interesting trends in the distribution of our data, summarily as follows:

  • The loan data covers a period of over 8 years of loan data collection, with the most number of borrowings taking place in January and the least in April. People are most likely to borrow on a Tuesday than they are to borrow on a Sunday.
  • Debt consolidation was the biggest reason for borrowing.
  • The average loan term is at 40 months, while 36 months is the modal loan term.
  • California (CA) and Texas (TX) have the highest amount of borrowers, while North Dakota (ND) and Maine (ME) have the least.
  • More than 90% of borrowers had one form of employment or the other as at the time of their loan listing.
  • The majority of borrowers had loans between 4000 and 12000 dollars (25th to 75th percentile).
  • Borrower rates were mostly between 0.18% and 0.25%, while the annual percentage rates (APR) were mostly between 0.15% and 0.28%

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

A few transformations and cleaning had to be done in the univariate analysis process.

  • The loan listing date had to be converted to the datetime type for any analysis to be performed.
  • The monthly income distribution did not appear normally at first, and had to be transforemed using a log scale for the distribution to appear normally.
  • Our dataset was subset to include only those with stated income higher than zero.

Bivariate Exploration¶

We will now move on to the bivariate exploration of the loan dataset, so we can get some more useful insights into our data

Investigation 1: ProsperScore vs LoanStatus¶

Question: Does a borrower's rating affect how they end up with their loans? What prosper score do borrowers who default have on average? Let us explore what this relationship will unfold.

From our earlier univariate analysis, we had created a subset that only includes datasets with ratings between 1 and 10 (prosper_ratings_normal), and we will continue our work here with that dataset.

In [59]:
prosper_ratings_normal["LoanStatus"].unique()
Out[59]:
array(['Current', 'Past Due (1-15 days)', 'Defaulted', 'Completed',
       'Chargedoff', 'Past Due (16-30 days)', 'Past Due (61-90 days)',
       'Past Due (31-60 days)', 'Past Due (91-120 days)',
       'FinalPaymentInProgress', 'Past Due (>120 days)'], dtype=object)
In [60]:
# Define a function to plot a violin plot and 2D box plot
def violin_chart(data, x, y, xlabel, ylabel, title):
    """
    Function plots a violin plot using the parameters:
    data (DataFrame): DataFrame containing the data to be plotted.
    x (series): categorical series to plot on the x axis
    y (series): quantitavie series to be plotted on the y axis
    xlabel (str): x label for the plot
    ylabel (str): y label for the plot
    title (title): title of the plot
    """
    fig, ax = plt.subplots(figsize=(18, 6))
    base_color = sns.color_palette()[0]
    violin = sns.violinplot(data=data, x=x, y=y, ax=ax, color=base_color)
    plt.xticks(rotation=90)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title);
    return violin

def box_chart_2D(data, x, y, xlabel, ylabel, title):
    """
    Function plots a 2D box plot using the parameters:
    data (DataFrame): DataFrame containing the data to be plotted.
    x (series): categorical series to plot on the x axis
    y (series): quantitavie series to be plotted on the y axis
    xlabel (str): x label for the plot
    ylabel (str): y label for the plot
    title (title): title of the plot
    """
    fig, ax = plt.subplots(figsize=(18, 6))
    base_color = sns.color_palette()[0]
    box_chart = sns.boxplot(data=data, x=x, y=y, ax=ax, color=base_color)
    plt.xticks(rotation=90)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title);
    return box_chart
In [61]:
# Plot violin plot of Prosper score and Loan Status

violin_chart(prosper_ratings_normal, "LoanStatus", "ProsperScore", "Loan Status", "Prosper Score", "Distribution of Loan Status by Prosper Score");

The above chart can also be visualized using a 2D box plot to get a clearer picture of the distribution.

In [62]:
# Plot box plot of Prosper score and Loan Status

box_chart_2D(prosper_ratings_normal, "LoanStatus", "ProsperScore", "Loan Status", "Prosper Score", "Distribution of Loan Status by Prosper Score");

It is obvious from the charts above that borrowers with the prosper risk score of (7) were the most likely to complete their loans, while those with a prosper risk score of 6 or less were likely to default.

Investigation 2: LoanOriginalAmount vs StatedMonthlyIncome¶

Question: How does a borrower's monthly income affect how they borrow? To visualize this relationship, we will employ the use of a scatter plot and the correlation.

In [63]:
# Define a function to plot scatter plots. 

def plot_scatter(data, x, y, xlabel, ylabel, a_x, a_y, color, title):
    """
    A bivariate scatter plotting function. Parameters are:
    data (DataFrame): The DataFrame that contains observations to be plotted
    x (str): series or column to plot on x-axis
    y (str): series or column to plot on y-axis
    xlabel (str): x axis label
    ylabel (str): y axis label
    a_x (int): length of x axis
    a_y (int): length of y axis
    color (str or hex): color of plot
    title (str): plot title
    """
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.plot([0, a_x], [0, a_y], linestyle="--", color="#800080")
    scatter = data.plot(x, y, kind="scatter", color=color, title=title, ax=ax, alpha=0.9)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel);
    return scatter
In [64]:
# get correlation coefficient of borrower monthly income and loan amount
prosper_dataset["StatedMonthlyIncome"].corr(prosper_dataset["LoanOriginalAmount"])
Out[64]:
0.19974080694069674
In [65]:
# plot scatter plot of borrower monthly income and original loan amount.
plot_scatter(prosper_dataset, "LoanOriginalAmount", "StatedMonthlyIncome",
             "Original Loan Amount", "Borrower's Monthly Income",
             36000, 36000, "#00FFFF",
             "Relationship between borrower monthly income and original amount borrowed");

Based on the entire dataset, there is a very weak relationship between a borrower's income and how much they borrowed as loan, however, the data shows that mostly only low income earners (typically earning less than $50000) borrowed. In order to dig in more into this relationship, we will use the earlier generated random subset of our dataset and plot another scatter diagram.

In [66]:
# plot scatter plot of borrower monthly income and original loan amount.
plot_scatter(prosper_data_rand, "LoanOriginalAmount", "StatedMonthlyIncome",
             "Original Loan Amount", "Borrower's Monthly Income",
             30000, 30000, "#00FFFF",
             "Relationship between borrower monthly income and original amount borrowed");
In [67]:
# get correlation coefficient of borrower monthly income and loan amount
prosper_data_rand["StatedMonthlyIncome"].corr(prosper_data_rand["LoanOriginalAmount"])
Out[67]:
0.4189600151580486

We can see from above that there is a much stronger positive relationship between the income and amount borrowed. Generally, the higher a borrower's income, the more they tend to borrow.

Investigation 3: LoanOriginalAmount vs BorrowerRate/BorrowerAPR¶

Question: Is there any dependence of the APR/Interest Rate on the amount borrowed? Let us calculating the coefficient of correlation between the two variables.

In [68]:
prosper_dataset["LoanOriginalAmount"].corr(prosper_dataset["BorrowerAPR"])
Out[68]:
-0.3241576725500873
In [69]:
prosper_dataset["LoanOriginalAmount"].corr(prosper_dataset["BorrowerRate"])
Out[69]:
-0.33062595925551863
In [70]:
# plot scatter plot of borrower original loan amount and Annual Percentage rate
plot_scatter(prosper_dataset, "LoanOriginalAmount", "BorrowerRate",
             "Original Loan Amount", "Borrower Interest Rate",
             0, 0, "#3EB4E7",
             "Relationship between amount borrowed and interest rate");
In [71]:
# plot scatter plot of borrower original loan amount and Annual Percentage rate
plot_scatter(prosper_dataset, "LoanOriginalAmount", "BorrowerAPR",
             "Original Loan Amount", "Borrower Annual Percentage Rate",
             0, 0, "#3EB4E7",
             "Relationship between amount borrowed and APR");

The plot above shows that generally, the higher the anount borrowed, the lower the APR/Interest Rate of the loan, This is also backed up by the coefficient of correlation which is negative in value and tending toward the midpoint.

Investigation 4: LoanStatus vs BorrowerAPR/BorrowerRate¶

Question: What relationship exists between a loan's APR/Rate and how the loan status ends? Do higher APR/Rates mean a lot of defaulting? Let us find out.

In [72]:
# group relationship between LoanStatus and APR
status_apr_rel = prosper_dataset.groupby("LoanStatus")["BorrowerAPR"].mean().to_frame()

status_apr_rel
Out[72]:
BorrowerAPR
LoanStatus
Cancelled 0.1932
Chargedoff 0.2575
Completed 0.2085
Current 0.2133
Defaulted 0.2389
FinalPaymentInProgress 0.2296
Past Due (1-15 days) 0.2636
Past Due (16-30 days) 0.2679
Past Due (31-60 days) 0.2652
Past Due (61-90 days) 0.2730
Past Due (91-120 days) 0.2705
Past Due (>120 days) 0.2791
In [73]:
# group relationship between LoanStatus and BorrowerRate
status_rate_rel = prosper_dataset.groupby("LoanStatus")["BorrowerRate"].mean().to_frame()

status_rate_rel
Out[73]:
BorrowerRate
LoanStatus
Cancelled 0.1711
Chargedoff 0.2352
Completed 0.1862
Current 0.1834
Defaulted 0.2231
FinalPaymentInProgress 0.1970
Past Due (1-15 days) 0.2306
Past Due (16-30 days) 0.2347
Past Due (31-60 days) 0.2325
Past Due (61-90 days) 0.2398
Past Due (91-120 days) 0.2379
Past Due (>120 days) 0.2484
In [74]:
# Define a function that plots line graphs from groupby dataframe
def line_graph_plot(data, xlabel, ylabel, title, rotation):
    """
    Line graph plotter
    """
    fig, ax = plt.subplots(figsize=(12, 5))
    line = data.plot(kind="line", ax=ax)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.xticks(rotation=rotation)
    return line
In [75]:
# Plot line graph for BorrowerAPR and LoanStatus
line_graph_plot(status_apr_rel,
                "Loan Status", "Borrower APR",
                "Relationship between Borrower APR and Loan Status",
                45);
In [76]:
# Plot line graph for BorrowerRate and LoanStatus
line_graph_plot(status_rate_rel,
                "Loan Status", "Borrower Rate",
                "Relationship between Borrower Rate and Loan Status",
                45);

The higher a loan's rate/APR, the higher the chances of a defaulting, as seen from the above chart of loan status with APR/borrower rate.

Investigation 5: LoanStatus vs LoanOriginalAmount¶

Question: What is the likelyhood of a default based on the amount borrowed?

In [77]:
status_amount_rel = prosper_dataset.groupby("LoanStatus")["LoanOriginalAmount"].mean().to_frame()
In [78]:
status_amount_rel
Out[78]:
LoanOriginalAmount
LoanStatus
Cancelled 1875.0000
Chargedoff 6377.1221
Completed 6202.3844
Current 10383.4047
Defaulted 6444.4226
FinalPaymentInProgress 8346.1220
Past Due (1-15 days) 8473.4176
Past Due (16-30 days) 8211.3615
Past Due (31-60 days) 8558.2333
Past Due (61-90 days) 7758.1569
Past Due (91-120 days) 8044.7143
Past Due (>120 days) 8633.3333
In [79]:
# Plot line graph for Amount Borrowed and LoanStatus
line_graph_plot(status_amount_rel,
                "Loan Status", "Original Amount Borrowed",
                "Relationship between Borrower Rate and Amount Borrowed",
                45);

Investigation 6: LoanStatus vs Home ownership¶

Question: Is there any significant difference between homeowners and non-homeowners when it comes to loan repayment/status?

In [80]:
status_homeowner_rel = prosper_dataset.groupby(["LoanStatus"])["IsBorrowerHomeowner"].value_counts().to_frame().unstack()

status_homeowner_rel
Out[80]:
IsBorrowerHomeowner
IsBorrowerHomeowner False True
LoanStatus
Cancelled 3 1
Chargedoff 6444 5201
Completed 19338 18059
Current 25913 30404
Defaulted 2702 2236
FinalPaymentInProgress 95 110
Past Due (1-15 days) 422 373
Past Due (16-30 days) 133 127
Past Due (31-60 days) 178 182
Past Due (61-90 days) 130 176
Past Due (91-120 days) 174 127
Past Due (>120 days) 7 8
In [81]:
# Plot line graph for Home ownership and LoanStatus
line_graph_plot(status_homeowner_rel,
                "Loan Status", "Is Borrower Homeowner",
                "Relationship between Borrower Home ownership and Loan Status",
                45);

There does not appear to be any significant difference between borrowers with homes and those without homes in the payments of the loan statuses.

In [82]:
prosper_dataset.groupby(["LoanStatus"])["IncomeRange"].value_counts().unstack()
Out[82]:
IncomeRange $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999 $75,000-99,999 Not displayed Not employed
LoanStatus
Cancelled NaN NaN NaN 1.0000 NaN NaN 3.0000 NaN
Chargedoff 1.0000 1329.0000 968.0000 4162.0000 2633.0000 1153.0000 1377.0000 22.0000
Completed NaN 2906.0000 4774.0000 10891.0000 9282.0000 4914.0000 4579.0000 51.0000
Current NaN 2536.0000 10916.0000 15111.0000 17615.0000 10139.0000 NaN NaN
Defaulted NaN 334.0000 322.0000 1290.0000 874.0000 375.0000 1736.0000 7.0000
FinalPaymentInProgress NaN 14.0000 52.0000 49.0000 64.0000 26.0000 NaN NaN
Past Due (1-15 days) NaN 52.0000 134.0000 266.0000 212.0000 131.0000 NaN NaN
Past Due (16-30 days) NaN 20.0000 40.0000 80.0000 83.0000 37.0000 NaN NaN
Past Due (31-60 days) NaN 32.0000 65.0000 119.0000 89.0000 55.0000 NaN NaN
Past Due (61-90 days) NaN 22.0000 38.0000 102.0000 92.0000 52.0000 NaN NaN
Past Due (91-120 days) NaN 27.0000 28.0000 114.0000 99.0000 33.0000 NaN NaN
Past Due (>120 days) NaN NaN NaN 7.0000 7.0000 1.0000 NaN NaN

Investigation 7: ListingCreationDate vs LoanOriginalAmount¶

Question: What is the general trend of amount borrowed over the years?

In [83]:
# To plot a time-series chart, let us first set our index to the Listing Creation Date

prosper_dataset = prosper_dataset.set_index("ListingCreationDate")
In [84]:
# Plot raw time series of the original amount borrowed

fig, ax = plt.subplots(figsize=(15, 6))
prosper_dataset["LoanOriginalAmount"].plot(xlabel="Time", ylabel="Loan Original Amount", title="Loan Original Amount Series", ax=ax);

There are a lot missing values in our dataset, and for us to have a meaningful time series plot, we need to find a way to fill in this values. Since we are interested in the general distribution of the data, we will resample our data to the either weekly or monthly averages and visualize.

In [85]:
prosper_dataset["LoanOriginalAmount"].resample("1W").mean().isnull().sum()
Out[85]:
37
In [86]:
prosper_dataset["LoanOriginalAmount"].resample("1M").mean().isnull().sum()
Out[86]:
6

In order to cater for the missing data points, we will use the forward fill method

In [87]:
# Average amount borrowed weekly

prosper_dataset["LoanOriginalAmount"].resample("1W").mean().fillna(method="ffill").to_frame()
Out[87]:
LoanOriginalAmount
ListingCreationDate
2005-11-13 1250.0000
2005-11-20 4027.2727
2005-11-27 4000.0000
2005-12-04 2564.8333
2005-12-11 3500.0000
... ...
2014-02-16 11811.3439
2014-02-23 11981.4665
2014-03-02 12066.8385
2014-03-09 11957.3443
2014-03-16 10320.0000

436 rows × 1 columns

In [88]:
# Average amount borrowed monthly

prosper_dataset["LoanOriginalAmount"].resample("1M").mean().fillna(method="ffill").to_frame()
Out[88]:
LoanOriginalAmount
ListingCreationDate
2005-11-30 3342.1053
2005-12-31 6296.7500
2006-01-31 4962.0000
2006-02-28 5774.4298
2006-03-31 4782.1673
... ...
2013-11-30 11148.7404
2013-12-31 11800.6394
2014-01-31 11863.5759
2014-02-28 12000.4169
2014-03-31 11835.9887

101 rows × 1 columns

In [89]:
# Function to plot a time series
def plot_time_series(data, series, sample, xlabel, ylabel, title):
    """
    data (DataFrame): the DataFrame
    series (series): the column to plot
    sample (str): the smaple window for the series. Options are Day (D), Week (W), Month (M), Year (Y)
    xlabel (str): label on the x-axis
    ylabel (str): label on the y-axis
    title (str): title of plot
    """
    fig, ax = plt.subplots(figsize=(15, 6))
    plot = data[series].resample(sample).mean().fillna(method="ffill").plot(ax=ax)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    return plot
In [90]:
# Plot time series of average amount borrowed weekly
plot_time_series(prosper_dataset, "LoanOriginalAmount", "1W", 
                 "Listing Creation Date", "Average Amount Borrowed",
                 "Weekly Time series plot of Average Amount Borrowed");
In [91]:
# Plot time series of average amount borrowed monthly
plot_time_series(prosper_dataset, "LoanOriginalAmount", "1M", 
                 "Listing Creation Date", "Average Amount Borrowed",
                 "Monthly Time series plot of Average Amount Borrowed");

Going by the trend, the average amount borrowed increased with each passing year, except for the dip which occured somewhere midway between the perios of the dataset, likely attributable to the fact there was somne missing points during this period.

Investigation 8: ListingCreationDate vs BorrowerRate¶

Question: What is the general trend of the borrower interest rate over the years?

In [92]:
# Plot timw series of borrower rate with a weekly sample window.
plot_time_series(prosper_dataset, "BorrowerRate", "1W",
                 "Listing Creation Date (Year)", "Average Borrower Rate",
                 "Time series plot of Average Borrower Rate (Weekly)");
In [93]:
# Plot timw series of borrower rate with a monthly sample window.
plot_time_series(prosper_dataset, "BorrowerRate", "1M",
                 "Listing Creation Date (Year)", "Average Borrower Rate",
                 "Time series plot of Average Borrower Rate (Monthly)");

The average borrower rate was relatively steady for the first half of the period covered, peaked around 2011, then dropped off again.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

  • It appears that borrowers with prosper score ratings between 6 and 8 completed their loan payments, while the majority of those who defaulted or had their loans charged off had ratings between 5 and 6. The highest proportion of those who were still in the process of paying had a prosper score rating of 4.
  • There is no strong relationship between a borrower's income and how much they borrowed as loan, however, the data shows that mostly only low income earners (typically earning less than $50000) borrowed.
  • The higher a loan's rate/APR, the higher the chances of a defaulting, as seen from the above chart of loan status with APR/borrower rate.
  • There does not appear to be any significant difference between borrowers with homes and those without homes in the distribution of the loan statuses.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

It was interesting to see that home ownership did not really have much effect on the status of a loan, as the numbers were roughly equal between the two groups. It would have been natural to think that home owners would default a lot lesser than non home owners.

Multivariate Exploration¶

In this section, we will explore a few of the earlier features in multivariate plots.

Investigation 1: Monthly Income vs Loan Amount vs Home Ownersip¶

Question: What is the difference in how the loan amount affects borrower APR/rate by home ownership and Employment status?

In [95]:
g = sns.FacetGrid(data = prosper_data_rand, hue = "IsBorrowerHomeowner",
                height=5, aspect=2) # viridis used in reverse
g.map(sns.regplot, "StatedMonthlyIncome", "LoanOriginalAmount", x_jitter=0.5, y_jitter=0.5, fit_reg=True, scatter_kws={'alpha':5/10})
g.add_legend()
plt.xlabel("Monthly Income")
plt.ylabel("Original Amount Borrowed");

Home ownership doesnt's seem to affect the relationship between income and the amount borrowed, as the distribution seems to be fairly consistent across both groups.

Investigation 2: Loan Amount vs Borrower Rate vs Home Ownersip vs Employment Status¶

Question: What is the difference in how the loan amount affects borrower APR/rate by home ownership and Employment status?

In [96]:
# plot a multivariate Facet Grid to subset homeonwers from non homeowners for the loan amount and borrower rate relationship.

g = sns.FacetGrid(data = prosper_dataset, col = "EmploymentStatus", row = "IsBorrowerHomeowner", margin_titles=True)
g.map(plt.scatter, "LoanOriginalAmount", "BorrowerRate");

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

The general trend we can see here from the multivariate analysis is that there is not much difference between how the Loan Amount borrowed affected the Rate in both home owners and non home owners, in the same way that there isn't much difference in how income affected how much they borrowed.

However, we can clearly see the trend for both home owners and non-homeowners that people who are employed in one form or the other and have higher pay generally borrow more, which means they attract lower rates.

Were there any interesting or surprising interactions between features?¶

It was interesting to see how a risk score can tell how likely a borrower is to repay their loan. It was also interesting (and a bit surprising) to see that people borrowed to pay off an earlier debt.

Conclusions¶

The prosper loan data was an interesting dataset to explore. The main findings from the data are as follows:

  • Over 8 years of loan data,which revelaed that most people borrowed during in the first month of the year, likely due to spendings during the year-end holiday seasons.
  • The average tenure of loans is at 40 months, at an average annual percentage rate (APR) of 0.22% and average borrower rate of 0.19%.
  • The higher the amount borrowed, the more likely a borrower is likely to default.
  • Of all the features explored, home ownership among borrowers seemed to have the least effect on loan paynent status.

In conclusion, the major predictors of whether a customer repays a loan or not are the amount of loan borrowed, borrower rate, APR, and prosper risk score. However, there is so much more that can be gotten from this dataset, if more exploration is done.

Resources¶

  • Kimberly Fessel YouTube Channel
  • Convert Value Counts To List - Stack Overflow
  • Matplotlib Visualizations - Machine Learning +
  • Python Howtos - Study Tonight