Statistics using HEaaN.stat

0. Get data

In this documentation, we build a default detection model using the Lending Club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1).

This part covers sample selection from the original dataset. We extract 32,768 observations and select a subset of features for a toy example.

[22]:
import numpy as np
import pandas as pd
[23]:
df = pd.read_csv('Loan_status_2007-2020Q3.gzip')
/tmp/ipykernel_1424221/178761628.py:1: DtypeWarning: Columns (1,48,58,117,127,128,129,132,133,134,137) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('Loan_status_2007-2020Q3.gzip')
[24]:
df.head()
[24]:
Unnamed: 0 id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount debt_settlement_flag
0 0 1077501 5000.0 5000.0 4975.0 36 months 10.65% 162.87 B B2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
1 1 1077430 2500.0 2500.0 2500.0 60 months 15.27% 59.83 C C4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
2 2 1077175 2400.0 2400.0 2400.0 36 months 15.96% 84.33 C C5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
3 3 1076863 10000.0 10000.0 10000.0 36 months 13.49% 339.31 C C1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
4 4 1075358 3000.0 3000.0 3000.0 60 months 12.69% 67.79 B B5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N

5 rows × 142 columns

[25]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925493 entries, 0 to 2925492
Columns: 142 entries, Unnamed: 0 to debt_settlement_flag
dtypes: float64(106), int64(1), object(35)
memory usage: 3.1+ GB
[26]:
df.loan_status.value_counts()
[26]:
loan_status
Fully Paid                                             1497783
Current                                                1031016
Charged Off                                             362548
Late (31-120 days)                                       16154
In Grace Period                                          10028
Late (16-30 days)                                         2719
Issued                                                    2062
Does not meet the credit policy. Status:Fully Paid        1988
Does not meet the credit policy. Status:Charged Off        761
Default                                                    433
Name: count, dtype: int64
[27]:
df_sub = df.sample(n=32768, random_state=131)
df_sub.head()
[27]:
Unnamed: 0 id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount debt_settlement_flag
1776442 55398 166302700 5000.0 5000.0 5000.0 36 months 8.81% 158.56 A A5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
2274736 406975 38587318 3000.0 3000.0 3000.0 36 months 14.31% 102.99 C C4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
284307 53590 116869925 5000.0 5000.0 5000.0 36 months 19.03% 183.36 D D3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
493884 21818 129581391 2000.0 2000.0 2000.0 36 months 17.47% 71.78 D D1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
1522356 36941 32629592 5000.0 5000.0 5000.0 36 months 10.99% 163.67 B B3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N

5 rows × 142 columns

[28]:
print(df_sub.columns)
Index(['Unnamed: 0', 'id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'hardship_start_date', 'hardship_end_date', 'payment_plan_start_date',
       'hardship_length', 'hardship_dpd', 'hardship_loan_status',
       'orig_projected_additional_accrued_interest',
       'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
       'debt_settlement_flag'],
      dtype='object', length=142)
[29]:
df_sub = df_sub[['loan_amnt','annual_inc','term','int_rate','installment','grade','sub_grade','purpose','loan_status','dti','last_fico_range_high','last_fico_range_low','total_acc', 'delinq_2yrs', 'emp_length']]
[30]:
df_sub.head()
df_sub.to_csv('lc.csv', index=False)

1. Data import

We import the Lending Club data extracted from the original set downloaded from Kaggle. In this exercise, we use only 32,768 observations and a handful of selected features. Let’s import the provided data.

[31]:
import pandas as pd
[32]:
df = pd.read_csv('lc.csv')
[33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32768 entries, 0 to 32767
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   loan_amnt             32768 non-null  float64
 1   annual_inc            32768 non-null  float64
 2   term                  32768 non-null  object
 3   int_rate              32768 non-null  object
 4   installment           32768 non-null  float64
 5   grade                 32768 non-null  object
 6   sub_grade             32768 non-null  object
 7   purpose               32768 non-null  object
 8   loan_status           32768 non-null  object
 9   dti                   32741 non-null  float64
 10  last_fico_range_high  32768 non-null  float64
 11  last_fico_range_low   32768 non-null  float64
 12  total_acc             32768 non-null  float64
 13  delinq_2yrs           32768 non-null  float64
 14  emp_length            30468 non-null  object
dtypes: float64(8), object(7)
memory usage: 3.8+ MB

2. Data preprocessing

Before encoding and encrypting the data, we have to preprocess it. First, numerical columns must be converted to numeric values. Second, categorical columns must be set to the “category” dtype.

[34]:
df['int_rate'] = df['int_rate'].apply(lambda x: float(x.strip('%')) / 100)

df["term"] = df["term"].astype("category")
df["grade"] = df["grade"].astype("category")
df["sub_grade"] = df["sub_grade"].astype("category")
df["purpose"] = df["purpose"].astype("category")
df["loan_status"] = df["loan_status"].astype("category")
df["emp_length"] = df["emp_length"].astype("category")

Now, let’s encrypt the data. Note that HEaaN.SDK supports several parameter presets for encryption; here we use the ‘FGb’ preset and generate keys based on it.

[35]:
import heaan_sdk

context = heaan_sdk.Context(
    parameter=heaan_sdk.HEParameter.from_preset("FGb"),
    key_dir_path="./keys_stat",
    load_keys="all",
    generate_keys=True,
)
HEaaN-SDK uses CUDA v11.7 (> v11.2)

Now, encode and encrypt the data.

[46]:
hf = heaan_sdk.HEFrame.from_frame(context, df)
[47]:
hf.encrypt()
[47]:
HEFrame(
  number of rows: 32768,
  number of columns: 15,
  list of columns: ['loan_amnt', 'annual_inc', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length']
)

Before conducting the analysis, let’s perform some further preprocessing. Since our objective is to predict whether a loan is well-paid or not, let’s create a binary variable called ‘bad’. We define ‘bad=0’ if the loan_status is either ‘Fully Paid’ or ‘Current’, and ‘bad=1’ otherwise.

[48]:
hf["bad"] = (hf["loan_status"] != "Fully Paid") & (hf["loan_status"] != "Current")

Suppose that we want to use the average monthly income of each individual.

[49]:
hf["avg_monthly_inc"] = hf["annual_inc"] * (1 / 12)
[40]:
hf.info()
[40]:
<class 'heaan_sdk.frame.frame.HEFrame'>
Data rows: 32768
Data columns (total 17 columns):
#    Column                    Dtype      Encrypted
---  ------------------------  ---------  ----------
  0  loan_amnt                 float64    True
  1  annual_inc                float64    True
  2  term                      category   True
  3  int_rate                  float64    True
  4  installment               float64    True
  5  grade                     category   True
  6  sub_grade                 category   True
  7  purpose                   category   True
  8  loan_status               category   True
  9  dti                       float64    True
 10  last_fico_range_high      float64    True
 11  last_fico_range_low       float64    True
 12  total_acc                 float64    True
 13  delinq_2yrs               float64    True
 14  emp_length                category   True
 15  bad                       bool       True
 16  avg_monthly_inc           float      True
dtypes: float(1), bool(1), category(6), float64(9)

You will see that “avg_monthly_inc” has been added to the encrypted dataset.

3. Descriptive Statistics

In this section, we calculate descriptive statistics for some variables. First, we calculate the mean and standard deviation of average monthly income.
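For reference, the usual definitions of these statistics (assuming the sample standard deviation uses the \(n-1\) denominator, as in pandas) are:

\(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\).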

[41]:
mean_inc = hf["avg_monthly_inc"].mean()
mean_inc.decrypt().to_series()[0]
[41]:
6610.71612473255
[42]:
std_inc = hf["avg_monthly_inc"].std()
std_inc.decrypt().to_series()[0]
[42]:
5018.305150782786

The calculated mean and standard deviation of monthly income are approximately 6610 and 5018, respectively.

Now, let’s calculate the correlation coefficient between monthly income and annual income.

[43]:
corr_inc = hf["avg_monthly_inc"].corr(hf["annual_inc"])
corr_inc.decrypt().to_series()[0]
[43]:
1.0000000251598327

Since monthly income and annual income are perfectly correlated, it is better to drop one of the two. In this example, let’s drop the annual income variable.
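The perfect correlation is expected: avg_monthly_inc is just annual_inc scaled by 1/12, and the Pearson correlation coefficient is invariant under positive linear scaling. A small plaintext sketch on toy data (not the encrypted frame) illustrates this:

```python
import numpy as np

# Toy data standing in for annual income values.
rng = np.random.default_rng(0)
annual = rng.uniform(20_000, 150_000, size=1_000)
monthly = annual / 12  # same positive linear scaling as avg_monthly_inc

# Pearson correlation of a variable with a positive multiple of itself is 1.
corr = np.corrcoef(annual, monthly)[0, 1]
print(corr)  # 1.0 up to floating-point error
```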

[50]:
hf.drop("annual_inc", axis=1, inplace=True)
[50]:
HEFrame(
  number of rows: 32768,
  number of columns: 16,
  list of columns: ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length', 'bad', 'avg_monthly_inc']
)

If you are interested in skewness and kurtosis, you can calculate them as well. Let’s see how.

[51]:
skew_inc = hf["avg_monthly_inc"].skew()
skew_inc.decrypt().to_series()[0]
[51]:
8.979756358129835
[52]:
kurt_inc = hf["avg_monthly_inc"].kurt()
kurt_inc.decrypt().to_series()[0]
[52]:
252.25036127327368

Now, let’s calculate descriptive statistics for a specific credit grade. For example, consider the mean and standard deviation of average monthly income for credit grade ‘A’.

[53]:
mean_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].mean()
mean_inc_with_A.decrypt().to_series()[0]
[53]:
7548.629657508425
[54]:
std_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].std()
std_inc_with_A.decrypt().to_series()[0]
[54]:
5469.998868967143

The calculated mean and standard deviation of monthly income for credit grade “A” are approximately 7549 and 5470, respectively, which differ noticeably from the mean and standard deviation over the whole sample.
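In plaintext pandas, the same per-grade statistics would come from a groupby. A small sketch on synthetic data (the real lc.csv values are not reproduced here):

```python
import pandas as pd

# Synthetic stand-in for the plaintext frame: grade plus monthly income.
df_demo = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B"],
    "avg_monthly_inc": [7000.0, 8000.0, 5000.0, 6000.0, 7000.0],
})

# Per-grade mean and sample standard deviation in one call.
stats = df_demo.groupby("grade")["avg_monthly_inc"].agg(["mean", "std"])
print(stats)
```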

4. Hypothesis Testing

In this section, we explore how to conduct hypothesis testing. For example, let’s conduct one-sample hypothesis testing for mean.

Assume that we want to test whether the population mean of average monthly income is 6,500. Then, the null hypothesis is:

\(H_0: \mu_{inc} = 6,500\).

The alternative hypothesis is

\(H_1: \mu_{inc} \neq 6,500\).
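The corresponding one-sample t-statistic is

\(t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}\),

which follows a t-distribution with \(n - 1\) degrees of freedom under \(H_0\).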

[55]:
null_mean = 6500
t_test = hf['avg_monthly_inc'].t_test(null_mean)
t_test.decrypt().to_series()
[55]:
0        3.99373
1    32767.00000
Name: avg_monthly_inc_t_test, dtype: float64

The result consists of the t-statistic in the first slot and the degrees of freedom in the second. We use scipy to calculate the p-value.

[56]:
t_test_series = t_test.to_series()

import scipy.stats as st
st.t.sf(abs(t_test_series[0]), round(t_test_series[1])) * 2
[56]:
6.518366686853663e-05

Since the t-statistic is approximately 3.9937 and the p-value is approximately 0.000065, we reject the null hypothesis at the 5% significance level: the mean monthly income is not equal to $6,500.
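As a sanity check, the t-statistic can be reconstructed in plaintext from the decrypted summary values reported above (mean ≈ 6610.716, std ≈ 5018.305, n = 32768):

```python
import math

# Decrypted summary statistics from the cells above.
mean, std, n, mu0 = 6610.71612473255, 5018.305150782786, 32768, 6500

# One-sample t-statistic: (x̄ − μ0) / (s / √n).
t = (mean - mu0) / (std / math.sqrt(n))
print(t)  # ≈ 3.9937, matching the encrypted result
```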

Using the above example, we can also construct a 95% confidence interval for the mean of the monthly income.

[57]:
conf_interval_t = hf['avg_monthly_inc'].t_interval_for_mean(0.95)
conf_interval_t.decrypt().to_series()
[57]:
0    6556.379042
1    6665.053186
Name: avg_monthly_inc_mean_confint_t_0.95, dtype: float64

The resulting 95% confidence interval for the population mean is (6556.379, 6665.053): we are 95% confident that the population mean of monthly income lies within this interval. Note that 6,500 falls outside the interval, consistent with rejecting the null hypothesis above.
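The interval can likewise be reproduced in plaintext from the decrypted mean and standard deviation using scipy:

```python
import math
import scipy.stats as st

# Decrypted summary statistics from the cells above.
mean, std, n = 6610.71612473255, 5018.305150782786, 32768
se = std / math.sqrt(n)  # standard error of the mean

# 95% t-interval for the mean: x̄ ± t_{0.975, n−1} · s/√n.
lo, hi = st.t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(lo, hi)  # ≈ (6556.38, 6665.05)
```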