Data Cleaning and Pre-Processing

Contents

Data Collection

Imports

import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

from IPython.display import clear_output
from datetime import datetime

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

%matplotlib inline
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)
sns.set_style('ticks')

from os import listdir
from os.path import isfile, join
import warnings
warnings.filterwarnings('ignore')

Load Data

Here, we load and combine all of the available Lending Club data, which we downloaded from https://www.lendingclub.com/info/download-data.action. We have included all of the available data from 2007 through 2018Q3, except for the declined loan data.

After loading the data initially using the get_loan_data function, we select only the rows where loan_status is “Charged Off” or “Fully Paid”, and then store the data as a .pkl file for faster/easier loading in the future.

def get_loan_data(data_path, other_data_paths=None):
    """Read CSV files containing loan data from specified data path
    and concatenate to one dataframe
    returns: the dataframe"""
    
    if other_data_paths:
        paths = [data_path] + other_data_paths
    else:
        paths = [data_path]

    #Collect all CSV file paths in the given directories
    files = []
    for path in paths:
        files = files + [join(path, f) for f in listdir(path) if isfile(join(path, f)) and f.endswith('.csv')]

    #Read CSVs and concatenate into one dataframe
    df_list = []
    for each in files:
        print('Reading:', each)
        df_list.append(pd.read_csv(each, header=1, skipfooter=2, engine='python'))
        clear_output(wait=True)
    
    df = pd.concat(df_list)

    #Remove any excess rows that do not fit ('id' should always be empty)
    df = df[df['id'].isnull()]        
    
    return df

#loan_df = get_loan_data(data_path='./data/', other_data_paths=['./data/old'])

Selecting “Fully Paid” and “Charged Off” loans only

Some of the code below has been commented out because, once the .pkl file had been generated and saved, we loaded the data directly from it.

We see that there are over a million rows in the dataset. For processing purposes, we found it very useful in the initial stages to work with a random sample of this large dataset.

#msk = loan_df['loan_status'].isin(['Fully Paid', 'Charged Off'])
#loan_df = loan_df[msk]

#cast issue_d column to datetime for later use
#loan_df['issue_d'] = pd.to_datetime(loan_df['issue_d'])

#loan_df.to_pickle('./data/Pickle/loan_df0.pkl')

loan_df = pd.read_pickle('./data/Pickle/loan_df0.pkl').sample(frac=.05, random_state=0)

print('Total Number of Rows:', '{:,}'.format(loan_df.shape[0]))
print('Total Number of Columns:', '{:,}'.format(loan_df.shape[1]))
Total Number of Rows: 57,519
Total Number of Columns: 145

Data Cleaning

Missing Values: Visualization and Feature Selection

Checking for and Cleaning Missing Data
A visual check of the dataset shows that several columns have a large number of null values.

def get_missing_values_table(df):
    #Number of null values by column
    missing_values_df = pd.DataFrame(df.isnull().sum(),columns=['Missing_Count'])
    
    #Portion of null values by column
    missing_values_df['Portion_Missing'] = missing_values_df['Missing_Count']/df.shape[0]

    #Sort by Missing_Count
    missing_values_df = missing_values_df.sort_values(by='Missing_Count',ascending=False)  
    
    return missing_values_df

missing_values_df = get_missing_values_table(loan_df)
missing_values_df.head()
Missing_Count Portion_Missing
id 57519 1.000000
next_pymnt_d 57519 1.000000
member_id 57519 1.000000
url 57519 1.000000
orig_projected_additional_accrued_interest 57405 0.998018
print('Number of Columns with under 15% missing values:', 
      missing_values_df[missing_values_df['Portion_Missing']<=0.15].shape[0])

#Plot the distribution of Portions of missing values for the columns
fig, ax = plt.subplots(figsize=(5,5))
sns.distplot(missing_values_df['Portion_Missing'], bins=10, kde=False, rug=True)
ax.set_title('Distribution of Portion of Missing Values')
ax.set_xlabel('Portion of Missing Values for Columns in Dataset')
ax.set_ylabel('Count')
sns.despine()
Number of Columns with under 15% missing values: 87

[Figure: histogram of the portion of missing values across the dataset's columns]

At first glance, we see that many columns have a reasonably low level of missingness: 87 columns have less than 15% missing values.

There seems to be a small concentration of columns with between 60% and 70% missing values. Below, we explore this and find 13 columns with almost exactly the same number of missing values. Furthermore, using the issue_d column (which indicates the date on which the loan was issued), we found that these variables had missing values for all loans issued prior to 2015-12-01. It's possible that Lending Club simply did not record or use this particular information from potential borrowers until a later date. We have chosen not to include these columns because:

1) there are many other columns that capture similar credit-related information; and

2) we want to keep the older loan information in the dataset, especially since we are looking at only completed loans. Removing the older samples would greatly reduce our sample size.

missing_grp = missing_values_df[(missing_values_df['Portion_Missing']>=0.6)&(missing_values_df['Portion_Missing']<=0.7)]
display(missing_grp)

missing_columns = missing_grp.index

earliest_date = []
for column in missing_columns:
    earliest_date.append(min(loan_df[~loan_df[column].isnull()]['issue_d'])) 

display(pd.DataFrame({'Column':missing_columns,
                      'Earliest issue_d for which column value is not null':earliest_date}))

Missing_Count Portion_Missing
mths_since_rcnt_il 39066 0.679184
all_util 38578 0.670700
inq_fi 38576 0.670665
total_bal_il 38576 0.670665
open_il_24m 38576 0.670665
open_il_12m 38576 0.670665
open_act_il 38576 0.670665
open_acc_6m 38576 0.670665
open_rv_24m 38576 0.670665
max_bal_bc 38576 0.670665
total_cu_tl 38576 0.670665
open_rv_12m 38576 0.670665
inq_last_12m 38576 0.670665
mths_since_recent_revol_delinq 38378 0.667223
Column Earliest issue_d for which column value is not null
0 mths_since_rcnt_il 2015-12-01
1 all_util 2015-12-01
2 inq_fi 2015-12-01
3 total_bal_il 2015-12-01
4 open_il_24m 2015-12-01
5 open_il_12m 2015-12-01
6 open_act_il 2015-12-01
7 open_acc_6m 2015-12-01
8 open_rv_24m 2015-12-01
9 max_bal_bc 2015-12-01
10 total_cu_tl 2015-12-01
11 open_rv_12m 2015-12-01
12 inq_last_12m 2015-12-01
13 mths_since_recent_revol_delinq 2012-03-01

Before going on, we will remove many columns from the dataset. We decided which columns to drop initially based on the variable descriptions in the Lending Club data dictionary, the applicability of each feature to our model, and the level of missingness observed:

Text descriptions of the variables that remain in our model after cleaning are provided at the end of this page.

exclude = ['id', 'member_id', 'url', 'next_pymnt_d', 'pymnt_plan', 'out_prncp',
           'out_prncp_inv', 'policy_code', 'open_acc_6m', 'open_act_il', 'open_il_12m',
           'open_il_24m', 'total_bal_il', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',
           'inq_fi', 'total_cu_tl', 'inq_last_12m', 'il_util', 'all_util','mths_since_rcnt_il',
           'num_tl_120dpd_2m', 'num_tl_30dpd', 'deferral_term', 'hardship_amount',
           'hardship_dpd', 'hardship_end_date', 'hardship_flag', 'hardship_last_payment_amount',
           'hardship_length', 'hardship_loan_status', 'hardship_payoff_balance_amount',
           'hardship_reason', 'hardship_start_date', 'hardship_status', 'hardship_type',
           'orig_projected_additional_accrued_interest', 'payment_plan_start_date',
           'desc', 'title', 'emp_title', 'debt_settlement_flag', 'debt_settlement_flag_date',
           'settlement_amount', 'settlement_date', 'settlement_percentage', 'settlement_status',
           'settlement_term', 'recoveries', 'collection_recovery_fee', 'funded_amnt',
           'funded_amnt_inv', 'last_pymnt_amnt', 'total_pymnt',
           'total_pymnt_inv', 'total_rec_int', 'total_rec_late_fee', 'total_rec_prncp',
           'last_pymnt_d',  'last_credit_pull_d', 'disbursement_method', 'initial_list_status' ]

potential_features = np.setdiff1d(loan_df.columns.tolist(), exclude)

print(len(potential_features))
print(sorted(potential_features))

loan_df = loan_df[potential_features]
82
['acc_now_delinq', 'acc_open_past_24mths', 'addr_state', 'annual_inc', 'annual_inc_joint', 'application_type', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 'collections_12_mths_ex_med', 'delinq_2yrs', 'delinq_amnt', 'dti', 'dti_joint', 'earliest_cr_line', 'emp_length', 'grade', 'home_ownership', 'inq_last_6mths', 'installment', 'int_rate', 'issue_d', 'loan_amnt', 'loan_status', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_last_delinq', 'mths_since_last_major_derog', 'mths_since_last_record', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'open_acc', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pub_rec', 'pub_rec_bankruptcies', 'purpose', 'revol_bal', 'revol_bal_joint', 'revol_util', 'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med', 'sec_app_earliest_cr_line', 'sec_app_inq_last_6mths', 'sec_app_mort_acc', 'sec_app_mths_since_last_major_derog', 'sec_app_num_rev_accts', 'sec_app_open_acc', 'sec_app_open_act_il', 'sec_app_revol_util', 'sub_grade', 'tax_liens', 'term', 'tot_coll_amt', 'tot_cur_bal', 'tot_hi_cred_lim', 'total_acc', 'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit', 'total_rev_hi_lim', 'verification_status', 'verification_status_joint', 'zip_code']

We visualize missingness for the remaining columns and see that many of them only have values from 2012-08-01 onward. We choose to work with data from this date forward, since loans issued before it make up only a small share of our data.

def view_missingness_by_issue_d(df):
    missing_values_df = get_missing_values_table(df)
    earliest_date = []
    cols_with_missing_vals = missing_values_df[(missing_values_df.Portion_Missing < 1) &
                                               (missing_values_df.Missing_Count > 0)].index

    for column in cols_with_missing_vals:
        earliest_date.append(min(df[~df[column].isnull()]['issue_d']))

    df = pd.DataFrame({'Column':cols_with_missing_vals,
                      'Earliest issue_d for which column value is not null':earliest_date})
    
    return df.sort_values(by='Earliest issue_d for which column value is not null', ascending=False)

view_missingness_by_issue_d(loan_df).head(15)
Column Earliest issue_d for which column value is not null
0 sec_app_mths_since_last_major_derog 2017-03-01
2 sec_app_inq_last_6mths 2017-03-01
3 sec_app_earliest_cr_line 2017-03-01
4 sec_app_collections_12_mths_ex_med 2017-03-01
5 sec_app_chargeoff_within_12_mths 2017-03-01
6 revol_bal_joint 2017-03-01
7 sec_app_num_rev_accts 2017-03-01
8 sec_app_open_acc 2017-03-01
9 sec_app_open_act_il 2017-03-01
10 sec_app_mort_acc 2017-03-01
1 sec_app_revol_util 2017-03-01
12 annual_inc_joint 2015-11-01
13 dti_joint 2015-11-01
11 verification_status_joint 2015-11-01
27 num_il_tl 2012-08-01

Now, we see that there are columns relating to details of applications that have multiple borrowers (co-borrowers). These columns only have non-null values for issue dates of 2017-03-01 and later (or 2015-11-01 and later for dti_joint, annual_inc_joint, and verification_status_joint), so we remove all of them:

annual_inc_joint, dti_joint, revol_bal_joint, sec_app_chargeoff_within_12_mths, sec_app_collections_12_mths_ex_med, sec_app_earliest_cr_line, sec_app_inq_last_6mths, sec_app_mort_acc, sec_app_mths_since_last_major_derog, sec_app_num_rev_accts, sec_app_open_acc, sec_app_open_act_il, sec_app_revol_util, verification_status_joint

Note that we still have the application_type variable, which is a simple binary indicator of whether or not the application had co-borrowers; a quick check of its values is sketched below.
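As a sanity check (a minimal sketch; the category labels in the comment are assumptions inferred from the dummy columns that appear later in this notebook), we can confirm that application_type takes only two values:

#Hedged check: application_type should contain exactly two categories and no nulls
#(e.g., 'Individual' and 'Joint App' -- labels assumed; verify against the data)
print(loan_df['application_type'].value_counts(dropna=False))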

coborrower_cols =  ['annual_inc_joint', 'dti_joint', 'revol_bal_joint',
                    'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med',
                    'sec_app_earliest_cr_line', 'sec_app_inq_last_6mths', 'sec_app_mort_acc',
                    'sec_app_mths_since_last_major_derog', 'sec_app_num_rev_accts',
                    'sec_app_open_acc', 'sec_app_open_act_il', 'sec_app_revol_util',
                    'verification_status_joint']

potential_features = np.setdiff1d(loan_df.columns.tolist(), coborrower_cols)
loan_df = loan_df[potential_features]

After removing these columns, we take another look at the remaining columns that have missing values.

display(view_missingness_by_issue_d(loan_df))

get_missing_values_table(loan_df).head(15)
Column Earliest issue_d for which column value is not null
22 total_rev_hi_lim 2012-08-01
14 num_il_tl 2012-08-01
26 tot_cur_bal 2012-08-01
25 tot_coll_amt 2012-08-01
24 tot_hi_cred_lim 2012-08-01
23 total_il_high_credit_limit 2012-08-01
1 mths_since_recent_bc_dlq 2012-08-01
21 mo_sin_rcnt_rev_tl_op 2012-08-01
20 mo_sin_old_rev_tl_op 2012-08-01
18 num_tl_90g_dpd_24m 2012-08-01
17 num_rev_tl_bal_gt_0 2012-08-01
16 num_rev_accts 2012-08-01
15 num_op_rev_tl 2012-08-01
19 num_tl_op_past_12m 2012-08-01
13 num_actv_rev_tl 2012-08-01
8 avg_cur_bal 2012-08-01
12 num_bc_tl 2012-08-01
6 mo_sin_old_il_acct 2012-08-01
7 pct_tl_nvr_dlq 2012-08-01
2 mths_since_last_major_derog 2012-08-01
10 mo_sin_rcnt_tl 2012-08-01
11 num_accts_ever_120_pd 2012-08-01
9 num_actv_bc_tl 2012-08-01
33 num_bc_sats 2012-06-01
32 num_sats 2012-06-01
29 percent_bc_gt_75 2012-03-01
31 mths_since_recent_bc 2012-03-01
36 total_bal_ex_mort 2012-03-01
35 acc_open_past_24mths 2012-03-01
34 mort_acc 2012-03-01
37 total_bc_limit 2012-03-01
30 bc_open_to_buy 2012-03-01
28 bc_util 2012-03-01
3 mths_since_recent_revol_delinq 2012-03-01
5 mths_since_recent_inq 2012-03-01
39 pub_rec_bankruptcies 2008-03-01
41 chargeoff_within_12_mths 2007-08-01
42 collections_12_mths_ex_med 2007-08-01
43 tax_liens 2007-08-01
38 revol_util 2007-07-01
40 dti 2007-07-01
27 emp_length 2007-07-01
4 mths_since_last_delinq 2007-07-01
0 mths_since_last_record 2007-07-01
Missing_Count Portion_Missing
mths_since_last_record 47647 0.828370
mths_since_recent_bc_dlq 43972 0.764478
mths_since_last_major_derog 42490 0.738712
mths_since_recent_revol_delinq 38378 0.667223
mths_since_last_delinq 29089 0.505729
mths_since_recent_inq 7657 0.133121
mo_sin_old_il_acct 4930 0.085711
pct_tl_nvr_dlq 3351 0.058259
avg_cur_bal 3346 0.058172
num_actv_bc_tl 3344 0.058137
mo_sin_rcnt_tl 3344 0.058137
num_accts_ever_120_pd 3344 0.058137
num_bc_tl 3344 0.058137
num_actv_rev_tl 3344 0.058137
num_il_tl 3344 0.058137

We notice that many of the columns only have non-null values for issue dates of 2012-08-01 and later, so we remove the samples with issue dates prior to this. This only reduces our sample size by a small amount.

print("Number of rows in dataset: {}".format (loan_df.shape[0]))

loan_df = loan_df[loan_df['issue_d']>=datetime(2012,8,1)]


print("Number of rows in dataset after removing loans issued prior to 2012-08-01: {}"\
      .format (loan_df.shape[0]))

Number of rows in dataset: 57519
Number of rows in dataset after removing loans issued prior to 2012-08-01: 54383

**Taking another look at missingness:**

display(view_missingness_by_issue_d(loan_df))

get_missing_values_table(loan_df).head(15)
Column Earliest issue_d for which column value is not null
0 mths_since_last_record 2012-08-01
25 num_op_rev_tl 2012-08-01
19 mo_sin_rcnt_rev_tl_op 2012-08-01
20 mo_sin_old_rev_tl_op 2012-08-01
21 num_tl_op_past_12m 2012-08-01
22 num_tl_90g_dpd_24m 2012-08-01
23 num_accts_ever_120_pd 2012-08-01
24 num_il_tl 2012-08-01
26 num_rev_accts 2012-08-01
1 mths_since_recent_bc_dlq 2012-08-01
27 num_rev_tl_bal_gt_0 2012-08-01
28 tot_coll_amt 2012-08-01
29 tot_cur_bal 2012-08-01
30 tot_hi_cred_lim 2012-08-01
31 num_bc_tl 2012-08-01
32 revol_util 2012-08-01
18 mo_sin_rcnt_tl 2012-08-01
17 num_actv_rev_tl 2012-08-01
16 total_il_high_credit_limit 2012-08-01
15 total_rev_hi_lim 2012-08-01
14 num_actv_bc_tl 2012-08-01
13 avg_cur_bal 2012-08-01
12 pct_tl_nvr_dlq 2012-08-01
11 mths_since_recent_bc 2012-08-01
10 bc_open_to_buy 2012-08-01
9 percent_bc_gt_75 2012-08-01
8 bc_util 2012-08-01
7 mo_sin_old_il_acct 2012-08-01
6 emp_length 2012-08-01
5 mths_since_recent_inq 2012-08-01
4 mths_since_last_delinq 2012-08-01
3 mths_since_recent_revol_delinq 2012-08-01
2 mths_since_last_major_derog 2012-08-01
33 dti 2012-08-01
Missing_Count Portion_Missing
mths_since_last_record 44691 0.821783
mths_since_recent_bc_dlq 40836 0.750896
mths_since_last_major_derog 39354 0.723645
mths_since_recent_revol_delinq 35455 0.651950
mths_since_last_delinq 27073 0.497821
mths_since_recent_inq 5205 0.095710
emp_length 3093 0.056874
mo_sin_old_il_acct 1794 0.032988
bc_util 624 0.011474
percent_bc_gt_75 616 0.011327
bc_open_to_buy 602 0.011070
mths_since_recent_bc 561 0.010316
pct_tl_nvr_dlq 215 0.003953
avg_cur_bal 210 0.003862
num_actv_bc_tl 208 0.003825

Pre-Processing of Selected Columns

Great! The missingness for the remaining columns is much more reasonable, but a few columns still have large portions of missing values. Consulting the data dictionary reveals a plausible explanation:

Many of the columns that still have high levels of missingness measure "months since" something happened. For example, mths_since_recent_bc_dlq measures "Months since most recent bankcard delinquency." It's likely that these "missing" values are not truly missing data; rather, the borrower has simply never had a bankcard delinquency. We therefore process the "months since" columns by filling the null values with the maximum observed value + 1, so that borrowers who have never had a delinquency get the largest value for number of months since delinquency. Since we will be using decision-tree based models, this is an appropriate approach that allows a tree to separate the borrowers who have had recent delinquencies from those who have not.
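A toy illustration of this fill strategy (hypothetical values, not drawn from the dataset):

#Toy example: NaN means the borrower never had the event in question
s = pd.Series([3.0, 27.0, np.nan, 11.0])
s_filled = s.fillna(s.max() + 1)   #max() skips NaN, so the fill value is 28.0
print(s_filled.tolist())           #[3.0, 27.0, 28.0, 11.0]

A single split such as "months since <= 27.5" can then cleanly separate every borrower with an observed delinquency from those who never had one.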

We also process a few other columns below:

#emp_length
print('emp_length value counts: \n{}\n'.format(loan_df.emp_length.value_counts()))
emp_length value counts: 
10+ years    18067
2 years       4761
3 years       4403
< 1 year      4235
1 year        3537
5 years       3328
4 years       3255
8 years       2554
6 years       2554
7 years       2472
9 years       2124
Name: emp_length, dtype: int64
def process_emp_length(df):
    df_new = df.copy()
    #0-1 years includes: null, None, < 1 year, 1 year
    df_new['emp_length'] = df_new['emp_length'].fillna('0-1 years')
    df_new['emp_length'] = np.where(df_new['emp_length'].isin([np.nan,None,'< 1 year','1 year']),
                                     '0-1 years',df_new['emp_length'])
    #2-4 years includes: 2 years, 3 years, 4 years
    df_new['emp_length'] = np.where(df_new['emp_length'].isin(['2 years','3 years','4 years']),
                                     '2-4 years',df_new['emp_length'])
    
    #5-9 years includes: 5 years, 6 years, 7 years, 8 years, 9 years
    df_new['emp_length'] = np.where(df_new['emp_length'].isin(['5 years','6 years','7 years','8 years','9 years']),
                                     '5-9 years',df_new['emp_length'])
    #10+ years includes 10+ years (no change)
    return df_new

def process_revol_util(df):
    df_new = df.copy()
    
    #Temporarily fill nulls with a sentinel so the string-to-float cast succeeds
    df_new['revol_util'] = df_new['revol_util'].fillna(-100)
    #Strip the '%' sign and convert to a fraction
    df_new['revol_util'] = df_new['revol_util'].apply(lambda x: float(str(x).split('%')[0])/100)
    #Restore the sentinel (-100 -> -1.0 after division) back to NaN for later imputation
    df_new['revol_util'] = np.where(df_new['revol_util']==-1.0,np.nan,df_new['revol_util'])
    
    return df_new

def process_month_since_cols(df):
    df_new = df.copy()

    mo_since_cols = ['mo_sin_old_il_acct','mths_since_last_delinq','mths_since_last_major_derog',
                     'mths_since_last_record','mths_since_recent_bc_dlq','mths_since_recent_inq',
                     'mths_since_recent_revol_delinq','mo_sin_old_rev_tl_op','mths_since_recent_bc',
                     'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl']

    for col in mo_since_cols:
        df_new[col].fillna(df_new[col].max()+1, inplace=True)
    
    return df_new

def process_loan_grades(df):
    df_new = df.copy()
    
    # turn sub_grade into ordinal mapping
    sorted_subgrades = sorted(df.sub_grade.unique())
    subgrade_dict = dict(zip(sorted_subgrades, range(len(sorted_subgrades))))
    df_new['sub_grade'] = df_new['sub_grade'].map(subgrade_dict)
    
    # turn grade into ordinal mapping
    sorted_grades = sorted(df.grade.unique())
    grade_dict = dict(zip(sorted_grades, range(len(sorted_grades))))
    df_new['grade'] = df_new['grade'].map(grade_dict)
    return df_new
    
    
def process_loan_cols(df):
    df_processed = process_emp_length(df)
    df_processed = process_revol_util(df_processed)
    df_processed = process_month_since_cols(df_processed)
    df_processed = process_loan_grades(df_processed)
    
    #add credit_line_age
    df_processed['credit_line_age'] = df_processed['issue_d'] - pd.to_datetime(df_processed['earliest_cr_line'])
    df_processed = df_processed.drop(columns='earliest_cr_line')

    #process int_rate, zip_code, credit_line_age
    df_processed['int_rate'] = df_processed['int_rate'].apply(lambda x: float(str(x).split('%')[0])/100)
    df_processed['zip_code'] = df_processed['zip_code'].apply(lambda x: x[:3])
    df_processed['credit_line_age'] = df_processed['credit_line_age'].apply(lambda x: x.days)
    
    # generate new column for outcome variable ('fully_paid'); drop 'loan_status'
    df_processed['fully_paid'] = df_processed['loan_status'].map({'Fully Paid':1, 'Charged Off':0})
    df_processed = df_processed.drop(columns='loan_status')
    
    return df_processed

loan_df = process_loan_cols(loan_df)

Missing Values: Imputation for Continuous Columns

We will fill the remaining columns' missing values using linear regression model-based imputation. First, we verify that all of the remaining columns with missing values are continuous float64 variables, for which this kind of model-based imputation is appropriate.

Then, we use pd.get_dummies() to create dummy variables for the necessary categorical columns. This is an important step that must happen before model-based imputation, since the regression models used for imputation require numeric inputs.
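As a toy example (hypothetical three-row frame; the actual call on loan_df appears below), pd.get_dummies with drop_first=True keeps k-1 indicator columns per categorical variable, so the first (alphabetically sorted) category becomes the implicit baseline:

#Toy example of dummy encoding with drop_first=True
toy = pd.DataFrame({'term': [' 36 months', ' 60 months', ' 36 months']})
print(pd.get_dummies(toy, columns=['term'], drop_first=True))
#   term_ 60 months
#0                0
#1                1
#2                0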

Finally, we proceed with our imputation.

missing_data = get_missing_values_table(loan_df)
display(missing_data[missing_data.Missing_Count!=0])
missing_cols = missing_data[missing_data.Missing_Count!=0].index

print("datatypes of the columns that still have missing values: ")
print(loan_df[missing_cols].dtypes.unique().tolist())
Missing_Count Portion_Missing
bc_util 624 0.011474
percent_bc_gt_75 616 0.011327
bc_open_to_buy 602 0.011070
pct_tl_nvr_dlq 215 0.003953
avg_cur_bal 210 0.003862
num_actv_bc_tl 208 0.003825
tot_coll_amt 208 0.003825
num_actv_rev_tl 208 0.003825
num_op_rev_tl 208 0.003825
num_accts_ever_120_pd 208 0.003825
num_rev_accts 208 0.003825
num_rev_tl_bal_gt_0 208 0.003825
num_tl_90g_dpd_24m 208 0.003825
num_tl_op_past_12m 208 0.003825
num_bc_tl 208 0.003825
num_il_tl 208 0.003825
tot_cur_bal 208 0.003825
total_il_high_credit_limit 208 0.003825
tot_hi_cred_lim 208 0.003825
total_rev_hi_lim 208 0.003825
revol_util 31 0.000570
dti 13 0.000239
datatypes of the columns that still have missing values: 
[dtype('float64')]

Before performing imputation, we need to process some of the categorical variables using pd.get_dummies. We do so below.

#Rerun Types
types_df = pd.DataFrame(loan_df.dtypes,columns=['Types'])
display(types_df[types_df['Types']!='float64'].sort_values(by='Types'))


object_vars=loan_df.select_dtypes(include='object').columns.tolist()
object_vars = np.setdiff1d(object_vars,['addr_state', 'zip_code'])
print(object_vars)

#Dummy encoding
loan_df = pd.get_dummies(loan_df,columns=object_vars,drop_first=True) 

Types
grade int64
issue_d datetime64[ns]
sub_grade int64
credit_line_age int64
fully_paid int64
addr_state object
application_type object
emp_length object
home_ownership object
purpose object
term object
verification_status object
zip_code object
['application_type' 'emp_length' 'home_ownership' 'purpose' 'term'
 'verification_status']
#columns where imputation is needed
missing_cols = loan_df.columns[loan_df.isnull().any()].tolist()

#verify they're all continuous dtypes appropriate for linear regression imputation
loan_df[missing_cols].dtypes.unique() # all float64
array([dtype('float64')], dtype=object)

Performing the Linear Regression Model-Based Imputation:

def impute_missing_continuous(df, missing_columns, cols_to_exclude):
    clean_df = df.copy()
    
    for column in missing_columns:
        #Train only on rows where this column is observed; exclude non-feature columns
        #(a 10% sample keeps the per-column regression fits fast)
        impute_train = clean_df[~clean_df[column].isnull()].drop(cols_to_exclude,1).sample(frac=0.1,random_state=12)

        #Split target
        X_impute_train = impute_train.drop(column,1)
        Y_impute_train = impute_train[column]

        #Mean Imputation for current nulls for columns that did not get imputed yet
        X_impute_train = X_impute_train.fillna(X_impute_train.mean())

        # Train LinearRegression
        impute_ols = LinearRegression(fit_intercept=True)
        impute_ols.fit(X_impute_train,Y_impute_train)

        # Generate a temp column of model predictions for all rows;
        # only rows where the value is null will be replaced with the prediction
        predictions = clean_df.drop(cols_to_exclude + [column],1)

        #Mean Imputation for current nulls for columns that did not get imputed yet
        predictions = predictions.fillna(predictions.mean())
        clean_df['temp'] = impute_ols.predict(predictions)

        clean_df[column] = np.where(clean_df[column].isnull(),clean_df['temp'],clean_df[column])
        clean_df = clean_df.drop('temp',1)
    
    return clean_df

clean_df = impute_missing_continuous(loan_df, missing_columns=missing_cols, cols_to_exclude=['fully_paid','issue_d', 'zip_code', 'addr_state'])
clean_df.isnull().values.any()
False
print("float64 columns: ")
print(clean_df.dtypes[clean_df.dtypes=='float64'].index.tolist())

print("\nnon-float64 columns: ")
print(clean_df.dtypes[clean_df.dtypes!='float64'])

print("\n\nVisualizing unique values for non-float64 variables (except for zip_code and issue_d)")
for col in clean_df.dtypes[clean_df.dtypes!='float64'].index.tolist():
    if not col in['issue_d', 'zip_code']:
        print("\n", col)
        print(clean_df[col].unique())

float64 columns: 
['acc_now_delinq', 'acc_open_past_24mths', 'annual_inc', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 'collections_12_mths_ex_med', 'delinq_2yrs', 'delinq_amnt', 'dti', 'inq_last_6mths', 'installment', 'int_rate', 'loan_amnt', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_last_delinq', 'mths_since_last_major_derog', 'mths_since_last_record', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'open_acc', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pub_rec', 'pub_rec_bankruptcies', 'revol_bal', 'revol_util', 'tax_liens', 'tot_coll_amt', 'tot_cur_bal', 'tot_hi_cred_lim', 'total_acc', 'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit', 'total_rev_hi_lim']

non-float64 columns: 
addr_state                                     object
grade                                           int64
issue_d                                datetime64[ns]
sub_grade                                       int64
zip_code                                       object
credit_line_age                                 int64
fully_paid                                      int64
application_type_Joint App                      uint8
emp_length_10+ years                            uint8
emp_length_2-4 years                            uint8
emp_length_5-9 years                            uint8
home_ownership_MORTGAGE                         uint8
home_ownership_NONE                             uint8
home_ownership_OTHER                            uint8
home_ownership_OWN                              uint8
home_ownership_RENT                             uint8
purpose_credit_card                             uint8
purpose_debt_consolidation                      uint8
purpose_home_improvement                        uint8
purpose_house                                   uint8
purpose_major_purchase                          uint8
purpose_medical                                 uint8
purpose_moving                                  uint8
purpose_other                                   uint8
purpose_renewable_energy                        uint8
purpose_small_business                          uint8
purpose_vacation                                uint8
purpose_wedding                                 uint8
term_ 60 months                                 uint8
verification_status_Source Verified             uint8
verification_status_Verified                    uint8
dtype: object


Visualizing unique values for non-float64 variables (except for zip_code and issue_d)

 addr_state
['FL' 'TX' 'CA' 'IN' 'MO' 'MI' 'GA' 'AZ' 'AL' 'CT' 'WA' 'MD' 'CO' 'IL'
 'PA' 'UT' 'SC' 'MN' 'NJ' 'OH' 'WI' 'NY' 'LA' 'NC' 'NH' 'DE' 'VA' 'WY'
 'TN' 'AR' 'KS' 'NV' 'DC' 'OK' 'MA' 'KY' 'RI' 'OR' 'HI' 'AK' 'NM' 'ND'
 'WV' 'ME' 'MT' 'MS' 'NE' 'SD' 'ID' 'VT']

 grade
[1 0 2 6 3 4 5]

 sub_grade
[ 8  0 11  5  2  4 12 14 32 17  9  7 10 15 21 27 18 13 19 20 16  6 22  1
  3 24 23 25 33 28 26 30 34 29 31]

 credit_line_age
[ 7671  5599  1249 ... 14885 15034 18292]

 fully_paid
[0 1]

 application_type_Joint App
[0 1]

 emp_length_10+ years
[1 0]

 emp_length_2-4 years
[0 1]

 emp_length_5-9 years
[0 1]

 home_ownership_MORTGAGE
[1 0]

 home_ownership_NONE
[0 1]

 home_ownership_OTHER
[0 1]

 home_ownership_OWN
[0 1]

 home_ownership_RENT
[0 1]

 purpose_credit_card
[1 0]

 purpose_debt_consolidation
[0 1]

 purpose_home_improvement
[0 1]

 purpose_house
[0 1]

 purpose_major_purchase
[0 1]

 purpose_medical
[0 1]

 purpose_moving
[0 1]

 purpose_other
[0 1]

 purpose_renewable_energy
[0 1]

 purpose_small_business
[0 1]

 purpose_vacation
[0 1]

 purpose_wedding
[0 1]

 term_ 60 months
[0 1]

 verification_status_Source Verified
[0 1]

 verification_status_Verified
[0 1]
clean_df.to_pickle('./data/Pickle/clean_df_5pct_subset.pkl')

Generating clean_df For Entire Dataset

Since we worked on a subset above, we redo our cleaning on the full dataset and save the clean_df as a .pkl file.

loan_df = pd.read_pickle('./data/Pickle/loan_df0.pkl')

exclude = ['id', 'member_id', 'url', 'next_pymnt_d', 'pymnt_plan', 'out_prncp',
           'out_prncp_inv', 'policy_code', 'open_acc_6m', 'open_act_il', 'open_il_12m',
           'open_il_24m', 'total_bal_il', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',
           'inq_fi', 'total_cu_tl', 'inq_last_12m', 'il_util', 'all_util','mths_since_rcnt_il',
           'num_tl_120dpd_2m', 'num_tl_30dpd', 'deferral_term', 'hardship_amount',
           'hardship_dpd', 'hardship_end_date', 'hardship_flag', 'hardship_last_payment_amount',
           'hardship_length', 'hardship_loan_status', 'hardship_payoff_balance_amount',
           'hardship_reason', 'hardship_start_date', 'hardship_status', 'hardship_type',
           'orig_projected_additional_accrued_interest', 'payment_plan_start_date',
           'desc', 'title', 'emp_title', 'debt_settlement_flag', 'debt_settlement_flag_date',
           'settlement_amount', 'settlement_date', 'settlement_percentage', 'settlement_status',
           'settlement_term', 'recoveries', 'collection_recovery_fee', 'funded_amnt',
           'funded_amnt_inv', 'last_pymnt_amnt', 'total_pymnt',
           'total_pymnt_inv', 'total_rec_int', 'total_rec_late_fee', 'total_rec_prncp',
           'last_pymnt_d',  'last_credit_pull_d', 'disbursement_method', 'initial_list_status',
           'annual_inc_joint', 'dti_joint', 'revol_bal_joint','sec_app_chargeoff_within_12_mths',
           'sec_app_collections_12_mths_ex_med','sec_app_earliest_cr_line', 'sec_app_inq_last_6mths',
           'sec_app_mort_acc','sec_app_mths_since_last_major_derog', 'sec_app_num_rev_accts',
           'sec_app_open_acc', 'sec_app_open_act_il', 'sec_app_revol_util',
           'verification_status_joint']

potential_features = np.setdiff1d(loan_df.columns.tolist(), exclude)
loan_df = loan_df[potential_features]

loan_df = loan_df[loan_df['issue_d']>=datetime(2012,8,1)]

loan_df = process_loan_cols(loan_df)

object_vars = loan_df.select_dtypes(include='object').columns.tolist()
vars_for_dummies = np.setdiff1d(object_vars,['addr_state', 'zip_code'])

tmp_df = loan_df[vars_for_dummies]
loan_df = pd.get_dummies(loan_df,columns=vars_for_dummies,drop_first=True) 

missing_cols = loan_df.columns[loan_df.isnull().any()].tolist()

clean_df = impute_missing_continuous(loan_df, missing_columns=missing_cols,
                                     cols_to_exclude=['fully_paid','issue_d', 'zip_code', 'addr_state'])

for col in vars_for_dummies:
    clean_df[col] = tmp_df[col]
    
print(sorted(clean_df.columns.tolist()))
['acc_now_delinq', 'acc_open_past_24mths', 'addr_state', 'annual_inc', 'application_type', 'application_type_Joint App', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 'collections_12_mths_ex_med', 'credit_line_age', 'delinq_2yrs', 'delinq_amnt', 'dti', 'emp_length', 'emp_length_10+ years', 'emp_length_2-4 years', 'emp_length_5-9 years', 'fully_paid', 'grade', 'home_ownership', 'home_ownership_MORTGAGE', 'home_ownership_NONE', 'home_ownership_OTHER', 'home_ownership_OWN', 'home_ownership_RENT', 'inq_last_6mths', 'installment', 'int_rate', 'issue_d', 'loan_amnt', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_last_delinq', 'mths_since_last_major_derog', 'mths_since_last_record', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'open_acc', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pub_rec', 'pub_rec_bankruptcies', 'purpose', 'purpose_credit_card', 'purpose_debt_consolidation', 'purpose_educational', 'purpose_home_improvement', 'purpose_house', 'purpose_major_purchase', 'purpose_medical', 'purpose_moving', 'purpose_other', 'purpose_renewable_energy', 'purpose_small_business', 'purpose_vacation', 'purpose_wedding', 'revol_bal', 'revol_util', 'sub_grade', 'tax_liens', 'term', 'term_ 60 months', 'tot_coll_amt', 'tot_cur_bal', 'tot_hi_cred_lim', 'total_acc', 'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit', 'total_rev_hi_lim', 'verification_status', 'verification_status_Source Verified', 'verification_status_Verified', 'zip_code']
print(clean_df.shape)
#print(vars_for_dummies)
(1087436, 93)
clean_df.to_pickle('./data/Pickle/clean_df_for_eda.pkl')

Descriptions of Variable Meanings (Data Dictionary)

Descriptions of the remaining variables are provided below. Except for a few variables that we processed or engineered ourselves, the descriptions are taken from the data dictionary, which can be downloaded from the Lending Club website:

acc_now_delinq: The number of accounts on which the borrower is now delinquent.

acc_open_past_24mths: Number of trades opened in past 24 months.

addr_state: The state provided by the borrower in the loan application
We have kept this as a string variable for the time being; it would need further processing (via pd.get_dummies) to be included in our model. Keeping it in this form lets us compare loans by state.

annual_inc: the borrower’s self-reported annual income.

application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers. We used pd.get_dummies to change this into a binary 0/1 categorical variable.

avg_cur_bal: Average current balance of all accounts

bc_open_to_buy: Total open to buy on revolving bankcards.

bc_util: Ratio of total current balance to high credit/credit limit for all bankcard accounts.

chargeoff_within_12_mths: Number of charge-offs within 12 months

collections_12_mths_ex_med: Number of collections in 12 months excluding medical collections

credit_line_age: Number of days between earliest_cr_line and issue_d (i.e., age in days of the borrower’s credit line at the issue date). This column was not in the original dataset; we engineered it.

delinq_2yrs: The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years

delinq_amnt: The past-due amount owed for the accounts on which the borrower is now delinquent.

dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

earliest_cr_line: The month the borrower's earliest reported credit line was opened
After we used this variable to calculate credit_line_age, we removed it.

emp_length: Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years.
We were unable to change this to a continuous variable, but we reduced the number of categorical bins to help reduce potential overfitting (and to create bins with similar numbers of samples in each). We treated nulls as 0 years worked, as they most often appear when no 'emp_title' was listed.

grade: LC assigned loan grade
We processed this to make it an ordinal variable, with lower values indicating less-risky loans.

home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Values observed in our data include RENT, OWN, MORTGAGE, NONE, and OTHER.
We simply used pd.get_dummies to process this column.

inq_last_6mths: The number of inquiries in past 6 months (excluding auto and mortgage inquiries)

installment: The monthly payment owed by the borrower if the loan originates.

int_rate: Interest Rate on the loan
Originally this was a string percent (%). We processed it to cast it to a float.

issue_d: The date on which the loan was issued. While we will not use issue_d as a feature in the model itself, we keep it for now in case we need to sample or sort by date. It will be removed before model training.

loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

loan_status: Current status of the loan. We mapped this to our binary outcome variable fully_paid ('Fully Paid' = 1, 'Charged Off' = 0) and dropped the original column.

mo_sin_old_il_acct: Months since oldest bank installment account opened. We filled missing values with (maximum observed value+1)

mo_sin_old_rev_tl_op: Months since oldest revolving account opened. We filled missing values with (maximum observed value+1)

mo_sin_rcnt_rev_tl_op: Months since most recent revolving account opened. We filled missing values with (maximum observed value+1)

mo_sin_rcnt_tl: Months since most recent account opened. We filled missing values with (maximum observed value+1)

mort_acc: Number of mortgage accounts.

mths_since_last_delinq: The number of months since the borrower’s last delinquency. We filled missing values with (maximum observed value+1)

mths_since_last_major_derog: Months since most recent 90-day or worse rating. We filled missing values with (maximum observed value+1)

mths_since_last_record: The number of months since the last public record. We filled missing values with (maximum observed value+1)

mths_since_recent_bc: Months since most recent bankcard account opened. We filled missing values with (maximum observed value+1)

mths_since_recent_bc_dlq: Months since most recent bankcard delinquency. We filled missing values with (maximum observed value+1)

mths_since_recent_inq: Months since most recent inquiry. We filled missing values with (maximum observed value+1)

mths_since_recent_revol_delinq: Months since most recent revolving delinquency. We filled missing values with (maximum observed value+1)

num_accts_ever_120_pd: Number of accounts ever 120 or more days past due

num_actv_bc_tl: Number of currently active bankcard accounts

num_actv_rev_tl: Number of currently active revolving trades

num_bc_sats: Number of satisfactory bankcard accounts

num_bc_tl: Number of bankcard accounts

num_il_tl: Number of installment accounts

num_op_rev_tl: Number of open revolving accounts

num_rev_accts: Number of revolving accounts

num_rev_tl_bal_gt_0: Number of revolving trades with balance >0

num_sats: Number of satisfactory accounts

num_tl_90g_dpd_24m: Number of accounts 90 or more days past due in last 24 months

num_tl_op_past_12m: Number of accounts opened in past 12 months

open_acc: The number of open credit lines in the borrower’s credit file.

pct_tl_nvr_dlq: Percent of trades never delinquent

percent_bc_gt_75: Percentage of all bankcard accounts > 75% of limit.

pub_rec: Number of derogatory public records

pub_rec_bankruptcies: Number of public record bankruptcies

purpose: A categorical variable describing the purpose of the loan. We simply used pd.get_dummies to process this variable.

revol_bal: Total credit revolving balance

revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
We note that this variable was originally a string percent (%), which we cast to a float. (While processing, we temporarily filled nulls with a sentinel value so the column could be cast, then restored them to NaN.)

sub_grade: LC assigned loan subgrade
We processed this to make it an ordinal variable, with lower values indicating less-risky loans.

tax_liens: Number of tax liens

term: The number of payments on the loan. Values are in months and can be either 36 or 60.
We simply used pd.get_dummies to process this column.

tot_coll_amt: Total collection amounts ever owed

tot_cur_bal: Total current balance of all accounts

tot_hi_cred_lim: Total high credit/credit limit

total_acc: The total number of credit lines currently in the borrower’s credit file

total_bal_ex_mort: Total credit balance excluding mortgage

total_bc_limit: Total bankcard high credit/credit limit

total_il_high_credit_limit: Total installment high credit/credit limit

total_rev_hi_lim: Total revolving high credit/credit limit

verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified.
We simply used pd.get_dummies to process this column.

zip_code: The first 3 digits of the zip code provided by the borrower in the loan application. While our model will not be trained directly on zip_code, we keep it in the dataset because we will use it to assess the fairness of the model going forward.