Statistics using HEaaN.stat

0. Get data

In this documentation, we build a default detection model using the Lending Club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1).

This part covers sample selection from the original dataset. We extract 32,768 observations and select a subset of features for a toy example.

[22]:
import numpy as np
import pandas as pd
[23]:
df = pd.read_csv('Loan_status_2007-2020Q3.gzip')
/tmp/ipykernel_1424221/178761628.py:1: DtypeWarning: Columns (1,48,58,117,127,128,129,132,133,134,137) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('Loan_status_2007-2020Q3.gzip')
[24]:
df.head()
[24]:
Unnamed: 0 id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount debt_settlement_flag
0 0 1077501 5000.0 5000.0 4975.0 36 months 10.65% 162.87 B B2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
1 1 1077430 2500.0 2500.0 2500.0 60 months 15.27% 59.83 C C4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
2 2 1077175 2400.0 2400.0 2400.0 36 months 15.96% 84.33 C C5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
3 3 1076863 10000.0 10000.0 10000.0 36 months 13.49% 339.31 C C1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
4 4 1075358 3000.0 3000.0 3000.0 60 months 12.69% 67.79 B B5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N

5 rows × 142 columns

[25]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925493 entries, 0 to 2925492
Columns: 142 entries, Unnamed: 0 to debt_settlement_flag
dtypes: float64(106), int64(1), object(35)
memory usage: 3.1+ GB
[26]:
df.loan_status.value_counts()
[26]:
loan_status
Fully Paid                                             1497783
Current                                                1031016
Charged Off                                             362548
Late (31-120 days)                                       16154
In Grace Period                                          10028
Late (16-30 days)                                         2719
Issued                                                    2062
Does not meet the credit policy. Status:Fully Paid        1988
Does not meet the credit policy. Status:Charged Off        761
Default                                                    433
Name: count, dtype: int64
[27]:
df_sub = df.sample(n=32768, random_state=131)
df_sub.head()
[27]:
Unnamed: 0 id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount debt_settlement_flag
1776442 55398 166302700 5000.0 5000.0 5000.0 36 months 8.81% 158.56 A A5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
2274736 406975 38587318 3000.0 3000.0 3000.0 36 months 14.31% 102.99 C C4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
284307 53590 116869925 5000.0 5000.0 5000.0 36 months 19.03% 183.36 D D3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
493884 21818 129581391 2000.0 2000.0 2000.0 36 months 17.47% 71.78 D D1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N
1522356 36941 32629592 5000.0 5000.0 5000.0 36 months 10.99% 163.67 B B3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN N

5 rows × 142 columns

[28]:
print(df_sub.columns)
Index(['Unnamed: 0', 'id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'hardship_start_date', 'hardship_end_date', 'payment_plan_start_date',
       'hardship_length', 'hardship_dpd', 'hardship_loan_status',
       'orig_projected_additional_accrued_interest',
       'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
       'debt_settlement_flag'],
      dtype='object', length=142)
[29]:
df_sub = df_sub[['loan_amnt','annual_inc','term','int_rate','installment','grade','sub_grade','purpose','loan_status','dti','last_fico_range_high','last_fico_range_low','total_acc', 'delinq_2yrs', 'emp_length']]
[30]:
df_sub.head()
df_sub.to_csv('lc.csv', index=False)

1. Data import

We import the Lending Club data extracted from the original set downloaded from Kaggle. In this exercise, we use only 32,768 observations and a handful of selected features. Let’s import the provided data.

[31]:
import pandas as pd
[32]:
df = pd.read_csv('lc.csv')
[33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32768 entries, 0 to 32767
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   loan_amnt             32768 non-null  float64
 1   annual_inc            32768 non-null  float64
 2   term                  32768 non-null  object
 3   int_rate              32768 non-null  object
 4   installment           32768 non-null  float64
 5   grade                 32768 non-null  object
 6   sub_grade             32768 non-null  object
 7   purpose               32768 non-null  object
 8   loan_status           32768 non-null  object
 9   dti                   32741 non-null  float64
 10  last_fico_range_high  32768 non-null  float64
 11  last_fico_range_low   32768 non-null  float64
 12  total_acc             32768 non-null  float64
 13  delinq_2yrs           32768 non-null  float64
 14  emp_length            30468 non-null  object
dtypes: float64(8), object(7)
memory usage: 3.8+ MB

2. Data preprocessing

Before encoding and encrypting the data, we have to preprocess it. First, numerical columns must be converted to numeric values. Second, categorical columns must be set to the “category” dtype.

[34]:
df['int_rate'] = df['int_rate'].apply(lambda x: float(x.strip('%')) / 100)

df["term"] = df["term"].astype("category")
df["grade"] = df["grade"].astype("category")
df["sub_grade"] = df["sub_grade"].astype("category")
df["purpose"] = df["purpose"].astype("category")
df["loan_status"] = df["loan_status"].astype("category")
df["emp_length"] = df["emp_length"].astype("category")

Now, let’s encrypt the data. Note that HEaaN.SDK supports several parameter presets for encryption; here we use the ‘FGb’ preset and generate keys based on it.

[35]:
import heaan_sdk

context = heaan_sdk.Context(
    parameter=heaan_sdk.HEParameter.from_preset("FGb"),
    key_dir_path="./keys_stat",
    load_keys="all",
    generate_keys=True,
)
HEaaN-SDK uses CUDA v11.7 (> v11.2)

Now, encode and encrypt the data.

[46]:
hf = heaan_sdk.HEFrame.from_frame(context, df)
[47]:
hf.encrypt()
[47]:
HEFrame(
  number of rows: 32768,
  number of columns: 15,
  list of columns: ['loan_amnt', 'annual_inc', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length']
)

Before conducting the analysis, let’s perform some further preprocessing. Since our objective is to predict whether a loan is well-paid or not, let’s create a binary variable called ‘bad’. We define ‘bad=0’ if the loan_status is either ‘Fully Paid’ or ‘Current’, and ‘bad=1’ otherwise.

[48]:
hf["bad"] = (hf["loan_status"] != "Fully Paid") & (hf["loan_status"] != "Current")

Suppose that we want to use the average monthly income of each individual.

[49]:
hf["avg_monthly_inc"] = hf["annual_inc"] * (1 / 12)
[40]:
hf.info()
[40]:
<class 'heaan_sdk.frame.frame.HEFrame'>
Data rows: 32768
Data columns (total 17 columns):
#    Column                    Dtype      Encrypted
---  ------------------------  ---------  ----------
  0  loan_amnt                 float64    True
  1  annual_inc                float64    True
  2  term                      category   True
  3  int_rate                  float64    True
  4  installment               float64    True
  5  grade                     category   True
  6  sub_grade                 category   True
  7  purpose                   category   True
  8  loan_status               category   True
  9  dti                       float64    True
 10  last_fico_range_high      float64    True
 11  last_fico_range_low       float64    True
 12  total_acc                 float64    True
 13  delinq_2yrs               float64    True
 14  emp_length                category   True
 15  bad                       bool       True
 16  avg_monthly_inc           float      True
dtypes: float(1), bool(1), category(6), float64(9)

You will see that “avg_monthly_inc” has been added to the encrypted dataset.

3. Descriptive Statistics

In this section, we calculate descriptive statistics for some variables. First, we calculate the mean and standard deviation of average monthly income.
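For reference, the usual definitions of these statistics (assuming the sample standard deviation uses the \(n-1\) denominator, as in pandas) are:

\(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\).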

[41]:
mean_inc = hf["avg_monthly_inc"].mean()
mean_inc.decrypt().to_series()[0]
[41]:
6610.71612473255
[42]:
std_inc = hf["avg_monthly_inc"].std()
std_inc.decrypt().to_series()[0]
[42]:
5018.305150782786

The calculated mean and standard deviation of monthly income are approximately 6610 and 5018, respectively.

Now, let’s calculate the correlation coefficient between monthly income and annual income.

[43]:
corr_inc = hf["avg_monthly_inc"].corr(hf["annual_inc"])
corr_inc.decrypt().to_series()[0]
[43]:
1.0000000251598327

Since monthly income and annual income are perfectly correlated, it is better to drop one of the two. In this example, let’s drop the annual income variable.
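The perfect correlation is expected: avg_monthly_inc is just annual_inc scaled by 1/12, and the Pearson correlation coefficient is invariant under positive linear scaling. A small plaintext sketch on toy data (not the encrypted frame) illustrates this:

```python
import numpy as np

# Toy data standing in for annual income values.
rng = np.random.default_rng(0)
annual = rng.uniform(20_000, 150_000, size=1_000)
monthly = annual / 12  # same positive linear scaling as avg_monthly_inc

# Pearson correlation of a variable with a positive multiple of itself is 1.
corr = np.corrcoef(annual, monthly)[0, 1]
print(corr)  # 1.0 up to floating-point error
```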

[50]:
hf.drop("annual_inc", axis=1, inplace=True)
[50]:
HEFrame(
  number of rows: 32768,
  number of columns: 16,
  list of columns: ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length', 'bad', 'avg_monthly_inc']
)

If you are interested in skewness and kurtosis, you can calculate them as well. Let’s see how.

[51]:
skew_inc = hf["avg_monthly_inc"].skew()
skew_inc.decrypt().to_series()[0]
[51]:
8.979756358129835
[52]:
kurt_inc = hf["avg_monthly_inc"].kurt()
kurt_inc.decrypt().to_series()[0]
[52]:
252.25036127327368

Now, let’s calculate descriptive statistics for a specific credit grade. For example, consider the mean and standard deviation of average monthly income for credit grade ‘A’.

[53]:
mean_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].mean()
mean_inc_with_A.decrypt().to_series()[0]
[53]:
7548.629657508425
[54]:
std_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].std()
std_inc_with_A.decrypt().to_series()[0]
[54]:
5469.998868967143

The calculated mean and standard deviation of monthly income for credit grade “A” are approximately 7549 and 5470, respectively, which differ noticeably from the mean and standard deviation over the whole sample.
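In plaintext pandas, the same per-grade statistics would come from a groupby. A small sketch on synthetic data (the real lc.csv values are not reproduced here):

```python
import pandas as pd

# Synthetic stand-in for the plaintext frame: grade plus monthly income.
df_demo = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B"],
    "avg_monthly_inc": [7000.0, 8000.0, 5000.0, 6000.0, 7000.0],
})

# Per-grade mean and sample standard deviation in one call.
stats = df_demo.groupby("grade")["avg_monthly_inc"].agg(["mean", "std"])
print(stats)
```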

4. Hypothesis Testing

In this section, we explore how to conduct hypothesis testing. For example, let’s conduct one-sample hypothesis testing for mean.

Assume that we want to test whether the population mean of average monthly income is 6,500. Then, the null hypothesis is:

\(H_0: \mu_{inc} = 6,500\).

The alternative hypothesis is

\(H_1: \mu_{inc} \neq 6,500\).
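The corresponding one-sample t-statistic is

\(t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}\),

which follows a t-distribution with \(n - 1\) degrees of freedom under \(H_0\).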

[55]:
null_mean = 6500
t_test = hf['avg_monthly_inc'].t_test(null_mean)
t_test.decrypt().to_series()
[55]:
0        3.99373
1    32767.00000
Name: avg_monthly_inc_t_test, dtype: float64

The result consists of the t-statistic in the first slot and the degrees of freedom in the second. We use scipy to calculate the p-value.

[56]:
t_test_series = t_test.to_series()

import scipy.stats as st
st.t.sf(abs(t_test_series[0]), round(t_test_series[1])) * 2
[56]:
6.518366686853663e-05

Since the t-statistic is approximately 3.9937 and the p-value is approximately 0.000065, we reject the null hypothesis at the 5% significance level: the mean monthly income is not equal to $6,500.
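As a sanity check, the t-statistic can be reconstructed in plaintext from the decrypted summary values reported above (mean ≈ 6610.716, std ≈ 5018.305, n = 32768):

```python
import math

# Decrypted summary statistics from the cells above.
mean, std, n, mu0 = 6610.71612473255, 5018.305150782786, 32768, 6500

# One-sample t-statistic: (x̄ − μ0) / (s / √n).
t = (mean - mu0) / (std / math.sqrt(n))
print(t)  # ≈ 3.9937, matching the encrypted result
```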

Using the above example, we can also construct a 95% confidence interval for the mean of the monthly income.

[57]:
conf_interval_t = hf['avg_monthly_inc'].t_interval_for_mean(0.95)
conf_interval_t.decrypt().to_series()
[57]:
0    6556.379042
1    6665.053186
Name: avg_monthly_inc_mean_confint_t_0.95, dtype: float64

The resulting 95% confidence interval for the population mean is (6556.379, 6665.053): we are 95% confident that the population mean of monthly income lies within this interval. Note that 6,500 falls outside the interval, consistent with rejecting the null hypothesis above.
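The interval can likewise be reproduced in plaintext from the decrypted mean and standard deviation using scipy:

```python
import math
import scipy.stats as st

# Decrypted summary statistics from the cells above.
mean, std, n = 6610.71612473255, 5018.305150782786, 32768
se = std / math.sqrt(n)  # standard error of the mean

# 95% t-interval for the mean: x̄ ± t_{0.975, n−1} · s/√n.
lo, hi = st.t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(lo, hi)  # ≈ (6556.38, 6665.05)
```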