The Prosper loan data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. We are interested in drawing out as many interesting insights as possible using univaraite, bivariate, and multivariate analysis and visualization.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# read data set into pandas DataFrame
prosper_dataset = pd.read_csv("prosperLoanData.csv")
# get a general overview of the data set.
prosper_dataset.shape
(113937, 81)
# get information on the dataset's features and completeness
prosper_dataset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
# get random sampe of dataset
prosper_dataset.sample(10)
ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
46159 | B5FB348594719244427B0FC | 462328 | 2010-06-13 07:08:50.627000000 | NaN | 36 | Completed | 2013-12-30 00:00:00 | 0.33097 | 0.2950 | 0.2850 | ... | -35.25 | 0.0 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 35 |
81917 | 56F03390726402205C4F164 | 147487 | 2007-06-06 04:39:58.487000000 | C | 36 | Chargedoff | 2008-06-15 00:00:00 | 0.29776 | 0.2900 | 0.2800 | ... | -146.62 | 0.0 | 21951.10 | 21933.70 | 0.0 | 1.0 | 0 | 0 | 0.0 | 209 |
97664 | 87473593448868465EE9951 | 1030929 | 2013-11-11 10:55:20.243000000 | NaN | 60 | Current | NaN | 0.21115 | 0.1870 | 0.1770 | ... | -37.23 | 0.0 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
107839 | BFE33587507307752A4BACA | 888273 | 2013-09-05 16:01:37.460000000 | NaN | 36 | Current | NaN | 0.31032 | 0.2712 | 0.2612 | ... | -16.05 | 0.0 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 29 |
59363 | F5FF3579022256746145B9A | 774015 | 2013-05-08 18:47:09.327000000 | NaN | 60 | Current | NaN | 0.20081 | 0.1769 | 0.1669 | ... | -72.31 | 0.0 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
33843 | D7B53601896238808A1578A | 1193207 | 2014-02-14 05:53:27.880000000 | NaN | 36 | Current | NaN | 0.12117 | 0.0930 | 0.0830 | ... | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
16566 | 1CF93399896040134488434 | 199879 | 2007-09-10 10:46:05.807000000 | D | 36 | Chargedoff | 2008-11-03 00:00:00 | 0.18475 | 0.1775 | 0.1675 | ... | -30.87 | 0.0 | 4106.92 | 4106.92 | 0.0 | 1.0 | 0 | 0 | 0.0 | 166 |
52128 | FBFC3507042071489F0746B | 493263 | 2011-02-09 09:24:46.263000000 | NaN | 60 | Completed | 2012-03-06 00:00:00 | 0.18915 | 0.1749 | 0.1649 | ... | -120.00 | 0.0 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 313 |
22827 | FB9A3534284455007B9FD2C | 544504 | 2011-12-13 10:13:56.453000000 | NaN | 36 | Current | NaN | 0.19108 | 0.1621 | 0.1521 | ... | -76.08 | 0.0 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 94 |
46271 | 6A2E33767314400493604DB | 76053 | 2006-12-18 15:53:52.733000000 | C | 36 | Completed | 2008-09-11 00:00:00 | 0.13654 | 0.1295 | 0.1245 | ... | -18.78 | 0.0 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 85 |
10 rows × 81 columns
# Generate random dataset and seed for reproducibility.
np.random.seed(5000)
sample = np.random.choice(prosper_dataset.shape[0], 2000, replace=False)
prosper_data_rand = prosper_dataset.loc[sample]
# summary statistics of numerical variables
prosper_dataset.describe().style.format("{0:,.3f}")
ListingNumber | Term | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperScore | ListingCategory (numeric) | EmploymentStatusDuration | CreditScoreRangeLower | CreditScoreRangeUpper | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | StatedMonthlyIncome | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 113,937.000 | 113,937.000 | 113,912.000 | 113,937.000 | 113,937.000 | 84,853.000 | 84,853.000 | 84,853.000 | 84,853.000 | 84,853.000 | 113,937.000 | 106,312.000 | 113,346.000 | 113,346.000 | 106,333.000 | 106,333.000 | 113,240.000 | 113,937.000 | 113,937.000 | 113,240.000 | 112,778.000 | 113,240.000 | 106,315.000 | 112,947.000 | 113,240.000 | 106,333.000 | 106,333.000 | 106,333.000 | 106,393.000 | 106,393.000 | 106,393.000 | 106,393.000 | 105,383.000 | 113,937.000 | 22,085.000 | 22,085.000 | 22,085.000 | 22,085.000 | 22,085.000 | 22,085.000 | 22,085.000 | 18,928.000 | 113,937.000 | 16,952.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 | 113,937.000 |
mean | 627,885.693 | 40.830 | 0.219 | 0.193 | 0.183 | 0.169 | 0.080 | 0.096 | 4.072 | 5.950 | 2.774 | 96.072 | 685.568 | 704.568 | 10.317 | 9.260 | 26.755 | 6.970 | 398.292 | 1.435 | 5.584 | 0.592 | 984.507 | 4.155 | 0.313 | 0.015 | 17,598.707 | 0.561 | 11,210.225 | 23.230 | 0.886 | 0.802 | 0.276 | 5,608.026 | 1.421 | 22.934 | 22.272 | 0.614 | 0.049 | 8,472.312 | 2,930.314 | -3.223 | 152.817 | 16.268 | 31.897 | 69,444.474 | 8,337.014 | 272.476 | 4,183.079 | 3,105.537 | 1,077.543 | -54.726 | -14.243 | 700.446 | 681.420 | 25.143 | 0.999 | 0.048 | 0.023 | 16.551 | 80.475 |
std | 328,076.236 | 10.436 | 0.080 | 0.075 | 0.075 | 0.068 | 0.047 | 0.030 | 1.673 | 2.377 | 3.997 | 94.481 | 66.458 | 66.458 | 5.458 | 5.023 | 13.638 | 4.631 | 447.160 | 2.438 | 6.430 | 1.979 | 7,158.270 | 10.160 | 0.728 | 0.154 | 32,936.402 | 0.318 | 19,818.361 | 11.871 | 0.148 | 1.098 | 0.552 | 7,478.497 | 0.764 | 19.250 | 18.830 | 2.447 | 0.556 | 7,395.508 | 3,806.635 | 50.064 | 466.320 | 9.006 | 29.974 | 38,930.480 | 6,245.801 | 192.698 | 4,790.907 | 4,069.528 | 1,183.414 | 60.675 | 109.233 | 2,388.514 | 2,357.167 | 275.658 | 0.018 | 0.332 | 0.232 | 294.545 | 103.239 |
min | 4.000 | 12.000 | 0.007 | 0.000 | -0.010 | -0.183 | 0.005 | -0.183 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 19.000 | 0.000 | 0.000 | 2.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | -209.000 | 0.000 | 0.000 | 0.000 | 1.000 | 1,000.000 | 0.000 | -2.350 | 0.000 | -2.350 | -664.870 | -9,274.750 | -94.200 | -954.550 | 0.000 | 0.700 | 0.000 | 0.000 | 0.000 | 1.000 |
25% | 400,919.000 | 36.000 | 0.156 | 0.134 | 0.124 | 0.116 | 0.042 | 0.074 | 3.000 | 4.000 | 1.000 | 26.000 | 660.000 | 679.000 | 7.000 | 6.000 | 17.000 | 4.000 | 114.000 | 0.000 | 2.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 3,121.000 | 0.310 | 880.000 | 15.000 | 0.820 | 0.000 | 0.140 | 3,200.333 | 1.000 | 9.000 | 9.000 | 0.000 | 0.000 | 3,500.000 | 0.000 | -35.000 | 0.000 | 9.000 | 6.000 | 37,332.000 | 4,000.000 | 131.620 | 1,005.760 | 500.890 | 274.870 | -73.180 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 2.000 |
50% | 600,554.000 | 36.000 | 0.210 | 0.184 | 0.173 | 0.162 | 0.072 | 0.092 | 4.000 | 6.000 | 1.000 | 67.000 | 680.000 | 699.000 | 10.000 | 9.000 | 25.000 | 6.000 | 271.000 | 1.000 | 4.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 8,549.000 | 0.600 | 4,100.000 | 22.000 | 0.940 | 0.000 | 0.220 | 4,666.667 | 1.000 | 16.000 | 15.000 | 0.000 | 0.000 | 6,000.000 | 1,626.550 | -3.000 | 0.000 | 14.000 | 21.000 | 68,599.000 | 6,500.000 | 217.740 | 2,583.830 | 1,587.500 | 700.840 | -34.440 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 44.000 |
75% | 892,634.000 | 36.000 | 0.284 | 0.250 | 0.240 | 0.224 | 0.112 | 0.117 | 5.000 | 8.000 | 3.000 | 137.000 | 720.000 | 739.000 | 13.000 | 12.000 | 35.000 | 9.000 | 525.000 | 2.000 | 7.000 | 0.000 | 0.000 | 3.000 | 0.000 | 0.000 | 19,521.000 | 0.840 | 13,180.000 | 30.000 | 1.000 | 1.000 | 0.320 | 6,825.000 | 2.000 | 33.000 | 32.000 | 0.000 | 0.000 | 11,000.000 | 4,126.720 | 25.000 | 0.000 | 22.000 | 65.000 | 101,901.000 | 12,000.000 | 371.580 | 5,548.400 | 4,000.000 | 1,458.540 | -13.920 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 115.000 |
max | 1,255,725.000 | 60.000 | 0.512 | 0.497 | 0.492 | 0.320 | 0.366 | 0.284 | 7.000 | 11.000 | 20.000 | 755.000 | 880.000 | 899.000 | 59.000 | 54.000 | 136.000 | 51.000 | 14,985.000 | 105.000 | 379.000 | 83.000 | 463,881.000 | 99.000 | 38.000 | 20.000 | 1,435,667.000 | 5.950 | 646,285.000 | 126.000 | 1.000 | 20.000 | 10.010 | 1,750,002.917 | 8.000 | 141.000 | 141.000 | 42.000 | 21.000 | 72,499.000 | 23,450.950 | 286.000 | 2,704.000 | 44.000 | 100.000 | 136,486.000 | 35,000.000 | 2,251.510 | 40,702.390 | 35,000.000 | 15,617.030 | 32.060 | 0.000 | 25,000.000 | 25,000.000 | 21,117.900 | 1.012 | 39.000 | 33.000 | 25,000.000 | 1,189.000 |
The Prosper Loan dataset is a features-heavy dataset with 81 variables on over 110,000 loan listings. The variable types include 3 booleans, 50 floats, 11 integers, and 17 strings (objects), which contain the date- and categorical-type variables.
The main features of interest in this dataset will include the loan annual percentage rates (APR), Interest rates, credit scores, borrower income, etc, and how they affect or do not affect the repayment status of the loan.
To investigate my dataset, I will be interested in using summary statistics such as averages, medians, value counts, correlations between variables, visual plots, and any other data analysis exploration as needed. I anticipate a higher APR/interest rate to lead to more defaulting by borrowers and higher risk scores to mean less defaulting.
Description: The date the listing was created.
Question: What time period does our dataset cover? When do people borrow the most? Let's begin our exploration by answering that question.
The variable is currently saved in the object format, so we will first have to do a type conversion before we can answer our question.
# convert the ListingCreationDate Series to the datetime data type
prosper_dataset["ListingCreationDate"] = pd.to_datetime(prosper_dataset["ListingCreationDate"])
# Confirm the new data type of the column
prosper_dataset["ListingCreationDate"].info()
<class 'pandas.core.series.Series'> RangeIndex: 113937 entries, 0 to 113936 Series name: ListingCreationDate Non-Null Count Dtype -------------- ----- 113937 non-null datetime64[ns] dtypes: datetime64[ns](1) memory usage: 890.3 KB
# Check for the start date for the data set
minimum_date = prosper_dataset["ListingCreationDate"].min()
print(f"The start date for the dataset is {minimum_date.day_name()}, {minimum_date.month_name()}, {minimum_date.day}, {minimum_date.year}")
The start date for the dataset is Wednesday, November, 9, 2005
# Check for the end date for the data set
maximum_date = prosper_dataset["ListingCreationDate"].max()
print(f"The end date for the dataset is {maximum_date.day_name()}, {maximum_date.month_name()}, {maximum_date.day}, {maximum_date.year}")
The end date for the dataset is Monday, March, 10, 2014
Next, let us find out what month(s) people borrow the most and what month borrowing is at the least. Also, what day of the week do people borrow the most?
# Get a count of the number of listings by year
listing_year_count = prosper_dataset["ListingCreationDate"].dt.strftime('%Y').value_counts().to_frame().rename_axis("Year")
# Add percentages to the above dataframe
listing_year_count["percentage"] = listing_year_count["ListingCreationDate"] / len(prosper_dataset)*100
listing_year_count
ListingCreationDate | percentage | |
---|---|---|
Year | ||
2013 | 35413 | 31.081212 |
2012 | 19556 | 17.163871 |
2007 | 11557 | 10.143325 |
2011 | 11442 | 10.042392 |
2008 | 11263 | 9.885287 |
2014 | 10734 | 9.420996 |
2006 | 6213 | 5.453014 |
2010 | 5530 | 4.853559 |
2009 | 2206 | 1.936158 |
2005 | 23 | 0.020187 |
# Get a count of the number of listings by month
listing_month_count = prosper_dataset["ListingCreationDate"].dt.strftime('%B').value_counts().to_frame().rename_axis("Month")
# Add percentages to the above dataframe
listing_month_count["percentage"] = listing_month_count["ListingCreationDate"] / len(prosper_dataset)*100
listing_month_count
ListingCreationDate | percentage | |
---|---|---|
Month | ||
January | 11214 | 9.842281 |
October | 10539 | 9.249849 |
December | 10320 | 9.057637 |
February | 10124 | 8.885612 |
September | 10074 | 8.841728 |
November | 9952 | 8.734652 |
July | 9506 | 8.343207 |
August | 9202 | 8.076393 |
June | 8672 | 7.611224 |
May | 8641 | 7.584016 |
March | 8032 | 7.049510 |
April | 7661 | 6.723891 |
# Get a count of the number of listings by day of the week
listing_day_count = prosper_dataset["ListingCreationDate"].dt.strftime('%A').value_counts().to_frame().rename_axis("Weekday")
# Add percentages to the above dataframe
listing_day_count["percentage"] = listing_day_count["ListingCreationDate"] / len(prosper_dataset)*100
listing_day_count
ListingCreationDate | percentage | |
---|---|---|
Weekday | ||
Tuesday | 19946 | 17.506166 |
Monday | 19594 | 17.197223 |
Wednesday | 19057 | 16.725910 |
Thursday | 17962 | 15.764853 |
Friday | 16136 | 14.162212 |
Saturday | 10758 | 9.442060 |
Sunday | 10484 | 9.201576 |
# Define a function that creates a simple bar chart from a value count dataframe
def simple_bar_plot(df, xlabel, ylabel, title):
"""
This simple function creates a bar chart from a value_counts dataframe.
It takes four parameters:
df (DataFrame): the value_counts dataframe
xlabel (str): the x label for the plot
ylabel (str): the y label for the plot
title (str): the title of the plot
"""
fig, ax = plt.subplots(figsize=(10, 6))
plot = df.plot(kind="bar", ax=ax)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(title)
plt.legend()
ax.get_legend().remove()
return plot
# plot bar graph for months count
simple_bar_plot(listing_year_count, "Listing Year",
"Number of borrowers",
"Bar plot for borrower listings by year");
# plot bar graph for months count
simple_bar_plot(listing_month_count, "Listing Month",
"Number of borrowers",
"Bar plot for borrower listings by month");
# plot bar graph for days count
simple_bar_plot(listing_day_count, "Listing Day",
"Number of borrowers",
"Bar plot for bumber of borrower listings by day");
We can see from the above that our dataset entries range from November 9 2005 till March 10 2014, representing over 8 years of loan data collection. 2013 had the highest amount of borrowing while 2005 had the lowest. January appears to be the month when people borrow the most, followed by October, then December. The month with the least amount of borrowing is April. Similarly, people borrow the most on Tuesday and the least on Sunday.
While the ListingCreationDate univariate exploration does give us some insight into our dataset, a better way to explore it will be using a time-series plot alongside another variable of interest. This will be explored later in the bivariate exploration section.
Description: The length of the loan expressed in months.
Question: What is the average term of the listed loans? How are the terms distributed?
prosper_dataset["Term"].describe()
count 113937.000000 mean 40.830248 std 10.436212 min 12.000000 25% 36.000000 50% 36.000000 75% 36.000000 max 60.000000 Name: Term, dtype: float64
unique_vals = prosper_dataset["Term"].unique()
unique_vals
array([36, 60, 12], dtype=int64)
# Get the count of each unique value in the Term column
unique_counts = prosper_dataset["Term"].value_counts()
unique_counts
36 87778 60 24545 12 1614 Name: Term, dtype: int64
Given the few unique values of the Term feature, a pie or chart will be a good way to explore this variable. Let us use both visualizations for this variable
# Define a function for plotting bar charts using seaborn.
def bar_chart_plotter(df, x, title):
"""
This function will be used to plot bar graphs using seaborn's countplot.
The parameters are the dataframe (df), Series to plot(x),
title of plot (title), xlabel (xlabel), and ylabel (ylabel)
"""
sns.color_palette()
base_color = sns.color_palette()[0]
sns.set_theme(style="darkgrid")
plt.title(title)
return sns.countplot(data=df, x=x, color=base_color);
# Define a function for plotting pie charts using pyplot.
def pie_plotter(x, labels, title, textprops, autopct, explode):
"""
x (list or series): The cound of categories to be plotted
labels (list or series): The category labels to be plotted
title (string): title of the pie plot
textprops (dictionary): text customizations
autopct (string): formating for the text
explode (list): list representing the pie offset. Should be the same length as the number of categories
"""
plt.title(title)
return plt.pie(x=x, labels=labels, textprops=textprops, autopct=autopct, explode=explode)
plt.legend()
# Plot a bar graph and pie chart of the loan term for the dataset in one plot
plt.figure(figsize = [20, 6])
plt.subplot(1, 2, 1)
bar_chart_plotter(prosper_dataset,
prosper_dataset["Term"],
"Plot of Prosper Loan Term in months")
plt.xlabel("Term (months)")
plt.ylabel("Number of listings")
plt.subplot(1, 2, 2)
pie_plotter(unique_counts,
unique_vals,
"Pie chart showing Term of Prosper Loan Dataset in months",
{"fontsize":15},
"%0.1f%%",
[0.01,0.01,0.2])
plt.legend();
Our bar chart clearly shows that more than two-thirds of the dataset have a Term of 36 months (3 years), with a little over one-quarter having a Term of 60 months (5 years). About one percent of the dataset have a loan Term of 12 months (1 year). The average loan term is at about 40 months.
Description: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accomanied by a delinquency bucket.
Question: What is the distribution of the loan status for our dataset? Are more people defaulting than are currently paying off their loans?
# get an idea of the values in out LoanStatus categories
print(prosper_dataset["LoanStatus"].nunique())
prosper_dataset["LoanStatus"].unique()
12
array(['Completed', 'Current', 'Past Due (1-15 days)', 'Defaulted', 'Chargedoff', 'Past Due (16-30 days)', 'Cancelled', 'Past Due (61-90 days)', 'Past Due (31-60 days)', 'Past Due (91-120 days)', 'FinalPaymentInProgress', 'Past Due (>120 days)'], dtype=object)
There are 12 unique values for the loan status in this dataset, and this can be easily represented using a bar chart. We will create another function that allows us visualize using pyplot
def plt_bar_plotter(x, y, xlabel, title, rotation):
"""
This function plots a bar chart using pyplot and takes the following parameters::
x (list or series): category to be plotted on the x-axis
y (list or series): height of the categories plotted on the x-axis
xlabel (string): label for the x-axis
title (string): title of the plot
rotation (int): xticks the rotation on the x-axis
"""
# add text to bar chart using for loop
for i, g in enumerate(y):
plt.text(i, g, f"{g}", ha="center", fontsize=16, fontweight="bold")
# add labels to plot
plt.xlabel(xlabel) # add x-label
plt.xticks(rotation=rotation) # rotate the x-ticks
plt.yticks([]) # remove y-ticks
sns.despine(left=True) # remove left spine
plt.title(title) # add title to plot
return plt.bar(x, y, edgecolor="black", color="xkcd:light gray blue");
fig, ax = plt.subplots(figsize=(15, 5)) # create plot canvas
# save the categories and their counts into variables for plotting
loan_status = list(prosper_dataset["LoanStatus"].unique())
loan_status_counts = prosper_dataset["LoanStatus"].value_counts(sort=True).to_list()
# plot bar chart of loan statuses using pyplot
plt_bar_plotter(loan_status, # loan status categories
loan_status_counts, # count of categories
"Loan Status", # x label
"Current Status of Loans", # title
90); # xticks rotation
Our bar chart shows us that about half of all loans have been completed while roughly a third of them are currently in term. A little over ten percent are past due to various degrees ranging from 1 day to over 120 days.
Description: The Borrower's Annual Percentage Rate (APR) for the loan.
Question: At what rates are borrowers getting their loans? What is the distribution of the APR model?
prosper_dataset["BorrowerAPR"].describe()
count 113912.000000 mean 0.218828 std 0.080364 min 0.006530 25% 0.156290 50% 0.209760 75% 0.283810 max 0.512290 Name: BorrowerAPR, dtype: float64
Given the nature of the APR, a histogram will be a good way to see the distribution of the variable. We will first define a function which will allow us plot histograms automatically from this point on.
# Function that plots a histogran
def hist_plotter(df, x, bins, xlabel, ylabel, title):
"""
This function plots a histogram using the following parameters:
df (DataFrame): The dataframe object
x (series): The series to be plotted on the histogram
bin (list): the number of bins to plot
xlabel (string): label on the x-axis
ylabel (string): label on the y-axis
title (string): title of the plot
"""
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(title)
return plt.hist(data=df, x=x, bins=bins)
fig, ax = plt.subplots(figsize=(10, 5)) # create plot canvas
# customize bins
bins = np.arange(0.00, 0.5+0.1, 0.03)
# plot histogram using borrower's APR
hist_plotter(prosper_dataset,
"BorrowerAPR",
bins,
"Annual Percentage Rate (APR)","Number of Listings",
"Borrower's Annual Percentage Rate (APR) Distribution");
The APR closely models a bimodal distibribution, with both modes lying between 0.15% and 0.2%. The average APR is at 0.22%.
Description: The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
Question: What state has the highest number of borrowers?
# Get the number of states with borrowers
print(prosper_dataset["BorrowerState"].nunique())
# Print out the list of all the states with borrowers listed
prosper_dataset["BorrowerState"].unique()
51
array(['CO', 'GA', 'MN', 'NM', 'KS', 'CA', 'IL', 'MD', nan, 'AL', 'AZ', 'VA', 'FL', 'PA', 'OR', 'MI', 'NY', 'LA', 'WI', 'OH', 'NC', 'WA', 'NV', 'NJ', 'TX', 'AR', 'SC', 'DE', 'MO', 'NE', 'UT', 'DC', 'MA', 'CT', 'IN', 'KY', 'OK', 'MS', 'WV', 'RI', 'TN', 'ID', 'MT', 'HI', 'NH', 'VT', 'WY', 'ME', 'AK', 'IA', 'SD', 'ND'], dtype=object)
Given the high number of states, it would be difficult to plot everything at once, so we will plot the top 10 and bottom 10 states in terms of the number of borrowers listed. A bar chart will be a great way to visualize this.
# Get the states with the highest number of borrowers.
print(prosper_dataset["BorrowerState"].value_counts().head(10))
prosper_dataset["BorrowerState"].value_counts().tail(10)
CA 14717 TX 6842 NY 6729 FL 6720 IL 5921 GA 5008 OH 4197 MI 3593 VA 3278 NJ 3097 Name: BorrowerState, dtype: int64
DC 382 MT 330 DE 300 VT 207 AK 200 SD 189 IA 186 WY 150 ME 101 ND 52 Name: BorrowerState, dtype: int64
# save variables for plotting
# highest number of borrowers
state_list_top = prosper_dataset["BorrowerState"].value_counts(sort=True).head(10).index
state_counts_top = prosper_dataset["BorrowerState"].value_counts(sort=True).head(10).to_list()
# least number of borrowers
state_list_bottom = prosper_dataset["BorrowerState"].value_counts(sort=True).tail(10).index
state_counts_bottom = prosper_dataset["BorrowerState"].value_counts(sort=True).tail(10).to_list()
fig, ax = plt.subplots(figsize=(20, 8)) # create plot canvas
plt.subplot(1, 2, 1)
# plot bar chart of borrower states
plt_bar_plotter(state_list_top, # borrower state
state_counts_top, # count of borrowers
"State", # x-label
"Highest Number of borrowers by State", # title
90); # xticks rotation
plt.subplot(1, 2, 2)
# plot bar chart of borrower states
plt_bar_plotter(state_list_bottom, # borrower state
state_counts_bottom, # count of borrowers
"State", # x-label
"Lowest Number of borrowers by State", # title
90); # xticks rotation
It is not surprising to see that California (CA) has the highest number of listed borrowers, given its status as the most populous state and one of the most important commercial hubs in the country. The number of borrowers listed are more than double the next state, Texas (TX) which had 6842 borrowers.
In contrast, North Dakota (ND) and Maine (ME) have the lowest number of listed borrowers, with 52 and 101 borrowers respectively.
This is a very important variable that would be interesting to explore using bivariate or multivariate analysis, if possible.
Description: The employment status of the borrower at the time they posted the listing.
Question: What is the employment status of borrowers like as at the time the listing was posetd?
employ_status = prosper_dataset.groupby("EmploymentStatus").size().reset_index(name="counts")
employ_status
EmploymentStatus | counts | |
---|---|---|
0 | Employed | 67322 |
1 | Full-time | 26355 |
2 | Not available | 5347 |
3 | Not employed | 835 |
4 | Other | 3806 |
5 | Part-time | 1088 |
6 | Retired | 795 |
7 | Self-employed | 6134 |
fig, ax = plt.subplots(figsize=(10, 6)) # create plot canvas
plt_bar_plotter(employ_status["EmploymentStatus"], # employment status
employ_status["counts"], # count of status
"Employment Status", # x-label
"Number of Borrowers by Employment Status as at listing time", # title
45); # xticks rotation
More than 90% of borrowers were employed in one form or the other (employed, full-time, part-time, self-employed) as at the time of listing. We will be interested in knowing if a borrower's employment status has any relation with their loan status in our later explorations.
Description: A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
Question: What percentage of our borrowers are homeowners?
homeowner_status = prosper_dataset["IsBorrowerHomeowner"].unique()
homeowner_status
array([ True, False])
homeowner_counts = prosper_dataset["IsBorrowerHomeowner"].value_counts().to_list()
homeowner_counts
[57478, 56459]
Given the nature of the Home owner variable, a pie chart will be the best option to visualize it.
# Plot pie chart of homeonwer status
fig, ax = plt.subplots(figsize=(10, 6)) # create plot canvas
pie_plotter(homeowner_counts, # count of homeowners
homeowner_status, # labels
"Home Owner Status among listed borrowers", # title of plot
{"fontsize":15, "fontweight":"bold"}, # text customization
"%0.1f%%", # Value formatting
[0.001,0.001]) # explode pie slices
plt.legend();
The home owner status of borrowers is roughly split in half between True and False, with 50.4% of borrowers being homeowners and 49.6% being non-homeowners
Description: The income range of the borrower at the time the listing was created.
Question: What is the distribution of the income range for listed borrowers?
income_range_counts = prosper_dataset["IncomeRange"].value_counts().to_frame()
income_range_counts["percentage"] = income_range_counts["IncomeRange"] / len(prosper_dataset) * 100
income_range_counts
IncomeRange | percentage | |
---|---|---|
$25,000-49,999 | 32192 | 28.254211 |
$50,000-74,999 | 31050 | 27.251902 |
$100,000+ | 17337 | 15.216304 |
$75,000-99,999 | 16916 | 14.846801 |
Not displayed | 7741 | 6.794106 |
$1-24,999 | 7274 | 6.384230 |
Not employed | 806 | 0.707408 |
$0 | 621 | 0.545038 |
# plot Income range distribution on bar chart
fig, ax = plt.subplots(figsize=(12, 6)) # create plot canvas
plt_bar_plotter(income_range_counts.index, # values to plot
income_range_counts["IncomeRange"].to_list(), # count of values
"Income Range", # x-label
"Income Range of listed borrowers", # title
45); # xticks rotation
# Generate random dataset and seed for reproducibility.
np.random.seed(5000)
sample = np.random.choice(prosper_dataset.shape[0], 2000, replace=False)
prosper_data_rand = prosper_dataset.loc[sample]
Description: The monthly income the borrower stated at the time the listing was created.
Question: How is monthly income distributed among borrowers?
# set display format for floats
pd.options.display.float_format = '{:.4f}'.format
# Get summary statistics for income distribution
prosper_dataset["StatedMonthlyIncome"].describe()
count 113937.0000 mean 5608.0256 std 7478.4973 min 0.0000 25% 3200.3333 50% 4666.6667 75% 6825.0000 max 1750002.9167 Name: StatedMonthlyIncome, dtype: float64
# Plot histogram from the stated monthly income column to see the distribution
fig, ax = plt.subplots(figsize=(10, 5)) # create plot canvas
# customize bins
bins = np.arange(0.00, 1750000, 100000)
# plot histogram of stated income
hist_plotter(prosper_dataset,
"StatedMonthlyIncome",
bins,
"Stated Income",
"Number of Listings",
"Income distibution for borrowers");