Statistics using HEaaN.stat
0. Get data
In this documentation, we introduce a default detection model using the Lending Club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1).
This part covers sample selection from the original dataset. We will extract 32,768 observations and select a subset of features for a toy example.
[22]:
import numpy as np
import pandas as pd
[23]:
df = pd.read_csv('Loan_status_2007-2020Q3.gzip')
/tmp/ipykernel_1424221/178761628.py:1: DtypeWarning: Columns (1,48,58,117,127,128,129,132,133,134,137) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('Loan_status_2007-2020Q3.gzip')
[24]:
df.head()
[24]:
Unnamed: 0 | id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | hardship_start_date | hardship_end_date | payment_plan_start_date | hardship_length | hardship_dpd | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | debt_settlement_flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1077501 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
1 | 1 | 1077430 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
2 | 2 | 1077175 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
3 | 3 | 1076863 | 10000.0 | 10000.0 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
4 | 4 | 1075358 | 3000.0 | 3000.0 | 3000.0 | 60 months | 12.69% | 67.79 | B | B5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
5 rows × 142 columns
[25]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925493 entries, 0 to 2925492
Columns: 142 entries, Unnamed: 0 to debt_settlement_flag
dtypes: float64(106), int64(1), object(35)
memory usage: 3.1+ GB
[26]:
df.loan_status.value_counts()
[26]:
loan_status
Fully Paid 1497783
Current 1031016
Charged Off 362548
Late (31-120 days) 16154
In Grace Period 10028
Late (16-30 days) 2719
Issued 2062
Does not meet the credit policy. Status:Fully Paid 1988
Does not meet the credit policy. Status:Charged Off 761
Default 433
Name: count, dtype: int64
[27]:
df_sub = df.sample(n=32768, random_state=131)
df_sub.head()
[27]:
Unnamed: 0 | id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | hardship_start_date | hardship_end_date | payment_plan_start_date | hardship_length | hardship_dpd | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | debt_settlement_flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1776442 | 55398 | 166302700 | 5000.0 | 5000.0 | 5000.0 | 36 months | 8.81% | 158.56 | A | A5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
2274736 | 406975 | 38587318 | 3000.0 | 3000.0 | 3000.0 | 36 months | 14.31% | 102.99 | C | C4 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
284307 | 53590 | 116869925 | 5000.0 | 5000.0 | 5000.0 | 36 months | 19.03% | 183.36 | D | D3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
493884 | 21818 | 129581391 | 2000.0 | 2000.0 | 2000.0 | 36 months | 17.47% | 71.78 | D | D1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
1522356 | 36941 | 32629592 | 5000.0 | 5000.0 | 5000.0 | 36 months | 10.99% | 163.67 | B | B3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N |
5 rows × 142 columns
[28]:
print(df_sub.columns)
Index(['Unnamed: 0', 'id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
'term', 'int_rate', 'installment', 'grade', 'sub_grade',
...
'hardship_start_date', 'hardship_end_date', 'payment_plan_start_date',
'hardship_length', 'hardship_dpd', 'hardship_loan_status',
'orig_projected_additional_accrued_interest',
'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
'debt_settlement_flag'],
dtype='object', length=142)
[29]:
df_sub = df_sub[['loan_amnt', 'annual_inc', 'term', 'int_rate', 'installment',
                 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti',
                 'last_fico_range_high', 'last_fico_range_low', 'total_acc',
                 'delinq_2yrs', 'emp_length']]
[30]:
df_sub.head()
df_sub.to_csv('lc.csv', index=False)
1. Data import
We import the Lending Club data extracted from the original dataset downloaded from Kaggle. In this exercise, we use only the 32,768 observations and the sensitive features selected above. Let’s import the provided data.
[31]:
import pandas as pd
[32]:
df = pd.read_csv('lc.csv')
[33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32768 entries, 0 to 32767
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_amnt 32768 non-null float64
1 annual_inc 32768 non-null float64
2 term 32768 non-null object
3 int_rate 32768 non-null object
4 installment 32768 non-null float64
5 grade 32768 non-null object
6 sub_grade 32768 non-null object
7 purpose 32768 non-null object
8 loan_status 32768 non-null object
9 dti 32741 non-null float64
10 last_fico_range_high 32768 non-null float64
11 last_fico_range_low 32768 non-null float64
12 total_acc 32768 non-null float64
13 delinq_2yrs 32768 non-null float64
14 emp_length 30468 non-null object
dtypes: float64(8), object(7)
memory usage: 3.8+ MB
2. Data preprocessing
Before encoding and encrypting the data, we have to preprocess it. First, we have to convert the numerical columns to numbers. Second, the categorical columns have to be given the dtype “category”.
[34]:
df['int_rate'] = df['int_rate'].apply(lambda x: float(x.strip('%')) / 100)
df["term"] = df["term"].astype("category")
df["grade"] = df["grade"].astype("category")
df["sub_grade"] = df["sub_grade"].astype("category")
df["purpose"] = df["purpose"].astype("category")
df["loan_status"] = df["loan_status"].astype("category")
df["emp_length"] = df["emp_length"].astype("category")
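As a quick plaintext sanity check (ordinary Python, not part of the HEaaN.SDK pipeline), the int_rate conversion above turns a percent string into a fraction; the value below is the first-row interest rate shown in section 0:

```python
# Plaintext check of the int_rate conversion used above.
# "10.65%" is the first-row interest rate shown in section 0.
rate = float("10.65%".strip("%")) / 100
assert abs(rate - 0.1065) < 1e-12
```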
Now, let’s encrypt the data. Note that HEaaN.SDK supports various encryption parameters through its context. We will use the ‘FGb’ parameter for encryption and generate keys based on that parameter.
[35]:
import heaan_sdk
context = heaan_sdk.Context(
parameter=heaan_sdk.HEParameter.from_preset("FGb"),
key_dir_path="./keys_stat",
load_keys="all",
generate_keys=True,
)
HEaaN-SDK uses CUDA v11.7 (> v11.2)
Now, encode and encrypt data.
[46]:
hf = heaan_sdk.HEFrame.from_frame(context, df)
[47]:
hf.encrypt()
[47]:
HEFrame(
number of rows: 32768,
number of columns: 15,
list of columns: ['loan_amnt', 'annual_inc', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length']
)
Before conducting the analysis, let’s perform some data preprocessing. First, since our objective is to predict whether a loan is well paid or not, let’s create a binary variable called ‘bad’. We define ‘bad=0’ if the loan_status is either ‘Fully Paid’ or ‘Current’, and ‘bad=1’ otherwise.
[48]:
hf["bad"] = (hf["loan_status"] != "Fully Paid") & (hf["loan_status"] != "Current")
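As a plaintext cross-check of this definition (computed from the value_counts() output in section 0, not on the encrypted frame), roughly 13.6% of the loans in the full dataset would be labeled ‘bad’:

```python
# Loan status counts copied from the value_counts() output in section 0.
status_counts = {
    "Fully Paid": 1497783,
    "Current": 1031016,
    "Charged Off": 362548,
    "Late (31-120 days)": 16154,
    "In Grace Period": 10028,
    "Late (16-30 days)": 2719,
    "Issued": 2062,
    "Does not meet the credit policy. Status:Fully Paid": 1988,
    "Does not meet the credit policy. Status:Charged Off": 761,
    "Default": 433,
}
# 'bad' = any status other than Fully Paid or Current.
good = status_counts["Fully Paid"] + status_counts["Current"]
bad_rate = 1 - good / sum(status_counts.values())
print(f"bad rate: {bad_rate:.4f}")
```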
Suppose that we want to use each individual’s average monthly income.
[49]:
hf["avg_monthly_inc"] = hf["annual_inc"] * (1 / 12)
[40]:
hf.info()
[40]:
<class 'heaan_sdk.frame.frame.HEFrame'>
Data rows: 32768
Data columns (total 17 columns):
# Column Dtype Encrypted
--- ------------------------ --------- ----------
0 loan_amnt float64 True
1 annual_inc float64 True
2 term category True
3 int_rate float64 True
4 installment float64 True
5 grade category True
6 sub_grade category True
7 purpose category True
8 loan_status category True
9 dti float64 True
10 last_fico_range_high float64 True
11 last_fico_range_low float64 True
12 total_acc float64 True
13 delinq_2yrs float64 True
14 emp_length category True
15 bad bool True
16 avg_monthly_inc float True
dtypes: float(1), bool(1), category(6), float64(9)
You will see that “avg_monthly_inc” has been added to the encrypted dataset.
3. Descriptive Statistics
In this section, we calculate descriptive statistics for some variables. First of all, we calculate the mean and standard deviation of average monthly income.
[41]:
mean_inc = hf["avg_monthly_inc"].mean()
mean_inc.decrypt().to_series()[0]
[41]:
6610.71612473255
[42]:
std_inc = hf["avg_monthly_inc"].std()
std_inc.decrypt().to_series()[0]
[42]:
5018.305150782786
You may see that the calculated mean and standard deviation of monthly income are approximately 6610 and 5018, respectively.
Now, let’s calculate the correlation coefficient between monthly income and annual income.
[43]:
corr_inc = hf["avg_monthly_inc"].corr(hf["annual_inc"])
corr_inc.decrypt().to_series()[0]
[43]:
1.0000000251598327
Since monthly income and annual income are perfectly correlated, it is better to drop one of the two. In this example, let’s drop the annual income variable.
[50]:
hf.drop("annual_inc", axis=1, inplace=True)
[50]:
HEFrame(
number of rows: 32768,
number of columns: 16,
list of columns: ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length', 'bad', 'avg_monthly_inc']
)
If you are interested in skewness and kurtosis, you can also calculate them with the program. Let’s see how.
[51]:
skew_inc = hf["avg_monthly_inc"].skew()
skew_inc.decrypt().to_series()[0]
[51]:
8.979756358129835
[52]:
kurt_inc = hf["avg_monthly_inc"].kurt()
kurt_inc.decrypt().to_series()[0]
[52]:
252.25036127327368
Now, let’s calculate descriptive statistics for a specific credit grade. For example, consider the mean and standard deviation of average monthly income for credit grade ‘A’.
[53]:
mean_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].mean()
mean_inc_with_A.decrypt().to_series()[0]
[53]:
7548.629657508425
[54]:
std_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].std()
std_inc_with_A.decrypt().to_series()[0]
[54]:
5469.998868967143
The calculated mean and standard deviation of credit grade “A“‘s monthly income are approximately 7549 and 5470, respectively, which differ noticeably from the mean and standard deviation of the whole sample.
4. Hypothesis Testing
In this section, we explore how to conduct hypothesis testing. For example, let’s conduct a one-sample hypothesis test for the mean.
Assume that we are interested in testing whether the population mean of average monthly income is 6,500. Then, the null hypothesis is:
\(H_0: \mu_{inc} = 6,500\).
The alternative hypothesis is
\(H_1: \mu_{inc} \neq 6,500\).
[55]:
null_mean = 6500
t_test = hf['avg_monthly_inc'].t_test(null_mean)
t_test.decrypt().to_series()
[55]:
0 3.99373
1 32767.00000
Name: avg_monthly_inc_t_test, dtype: float64
The result consists of the t-statistic in the first slot and the degrees of freedom in the second slot. We use scipy to calculate the p-value.
[56]:
t_test_series = t_test.to_series()
import scipy.stats as st
st.t.sf(abs(t_test_series[0]), round(t_test_series[1])) * 2
[56]:
6.518366686853663e-05
Since the t-statistic is approximately 3.9937 and the p-value is approximately 0.000065, we reject the null hypothesis at the 5% significance level, which means that the population mean of monthly income is not equal to 6,500.
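As a plaintext sanity check, the t-statistic can be reproduced from the decrypted mean and standard deviation obtained earlier via the one-sample formula \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\):

```python
import math

n = 32768                      # number of observations
mean_inc = 6610.71612473255    # decrypted sample mean from above
std_inc = 5018.305150782786    # decrypted sample standard deviation from above
null_mean = 6500

# One-sample t-statistic: (sample mean - null mean) / standard error.
t_stat = (mean_inc - null_mean) / (std_inc / math.sqrt(n))
print(round(t_stat, 5))
```

This agrees with the first slot of the encrypted t_test result above.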
Using the above example, we can also construct a 95% confidence interval for the population mean of monthly income.
[57]:
conf_interval_t = hf['avg_monthly_inc'].t_interval_for_mean(0.95)
conf_interval_t.decrypt().to_series()
[57]:
0 6556.379042
1 6665.053186
Name: avg_monthly_inc_mean_confint_t_0.95, dtype: float64
The resulting 95% confidence interval for the population mean is (6556.379, 6665.053): we are 95% confident that the population mean of monthly income falls within this interval.
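The interval can likewise be reproduced in plaintext as \(\bar{x} \pm t_{0.975,\,n-1} \cdot s/\sqrt{n}\). A minimal sketch using the decrypted statistics from above (with the normal quantile standing in for the t critical value, which is an excellent approximation at this sample size):

```python
import math
from statistics import NormalDist

n = 32768
mean_inc = 6610.71612473255    # decrypted sample mean from above
std_inc = 5018.305150782786    # decrypted sample standard deviation from above

# For n this large, the t critical value is essentially the normal quantile;
# scipy.stats.t.ppf(0.975, n - 1) would give an almost identical number.
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * std_inc / math.sqrt(n)
lower, upper = mean_inc - margin, mean_inc + margin
print(f"({lower:.3f}, {upper:.3f})")
```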