lc_stats.Rmd
In this document, we introduce a default detection model using the Lending Club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1/versions/3?resource=download).
This part selects a sample from the original dataset. We extract 32,768 observations at random and keep a subset of features for a toy example.
library(utils)
df <- read.csv(gzfile("Loan_status_2007-2020Q3.gzip"), header = TRUE)
str(df)
head(df)
loan_status_counts <- table(df$loan_status)
print(loan_status_counts)
set.seed(123)
df_sub <- df[sample(nrow(df), 32768, replace = FALSE), ]
str(df_sub)
head(df_sub)
df_sub <- df_sub[, c("loan_amnt", "annual_inc", "term", "int_rate", "installment", "grade", "sub_grade", "purpose", "loan_status", "dti", "last_fico_range_high", "last_fico_range_low", "total_acc", "delinq_2yrs", "emp_length")]
head(df_sub)
# Save the sampled subset for the rest of the exercise
write.csv(df_sub, "lc.csv", row.names = FALSE)
For this exercise, we use only 32,768 observations and a selection of sensitive features. Let’s import the provided data.
Before encoding the dataset as an HEFrame, we need to convert every string-type column to a factor and ensure that the interest rate is numeric.
Before preprocessing the data, we need to encode and encrypt it as an HEFrame.
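The conversion step can be sketched in base R as follows. This is a minimal illustration on a hypothetical toy data frame (`tb_toy`), and it assumes that `int_rate` arrives as text with a trailing percent sign; if the column is already numeric in your copy of the data, the first step is unnecessary.

```r
# Hypothetical toy data standing in for the imported table (not the real data).
tb_toy <- data.frame(
  int_rate = c(" 13.56%", " 7.02%", " 18.25%"),  # assumed text format
  grade = c("C", "A", "D"),
  term = c(" 36 months", " 60 months", " 36 months"),
  loan_amnt = c(10000, 5000, 24000),
  stringsAsFactors = FALSE
)

# Strip whitespace and the percent sign, then convert the rate to numeric.
tb_toy$int_rate <- as.numeric(sub("%", "", trimws(tb_toy$int_rate), fixed = TRUE))

# Convert every remaining character column to a factor.
chr_cols <- vapply(tb_toy, is.character, logical(1))
tb_toy[chr_cols] <- lapply(tb_toy[chr_cols], factor)

str(tb_toy)
```

The interest rate is converted before the factor step on purpose: once a column has become a factor, `as.numeric()` would return the internal level codes rather than the rates.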
library(heaan.sdk.R)
import_heaan_sdk()
params <- heaan_sdk.HEParameter("FGb")
context <- heaan_sdk.Context(
params,
key_dir_path = "./keys",
load_keys = "all",
generate_keys = TRUE
)
hf <- HEFrame.from_frame(context, tb)
hf %>% encrypt()
We will use the average monthly income.
In this section, we calculate descriptive statistics for some variables. First, we calculate the mean and standard deviation of the average monthly income.
mean_inc <- hf["avg_monthly_inc"] %>% mean()
mean_inc %>% decrypt() %>% to_series()
sd_inc <- hf["avg_monthly_inc"] %>% sd()
sd_inc %>% decrypt() %>% to_series()
Now, let’s calculate the correlation coefficient between monthly income and annual income.
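As a plaintext point of comparison, the sketch below computes the Pearson correlation with base R's `cor()` on hypothetical toy values. It assumes that `avg_monthly_inc` is simply `annual_inc` divided by 12; the derivation step is not shown in this document, so treat that as an assumption.

```r
# Hypothetical toy incomes (not taken from the Lending Club data).
annual_inc <- c(48000, 72000, 36000, 120000, 60000)

# Assumed derivation: monthly income is annual income divided by 12.
avg_monthly_inc <- annual_inc / 12

# Pearson correlation; a positive linear rescaling gives a correlation of 1.
r <- cor(avg_monthly_inc, annual_inc)
print(r)
```

Under this assumption the two columns carry the same information, and any small deviation from 1 in the encrypted result may come from the approximate arithmetic used on encrypted data.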
The result is almost 1, indicating a very strong linear relationship between the two variables. Since annual income adds no further information, we remove it from our HEFrame.
hf$drop("annual_inc", axis = 1L, inplace = TRUE)
Let’s calculate the skewness and kurtosis.
skew_inc <- hf["avg_monthly_inc"] %>% skewness()
skew_inc %>% decrypt() %>% to_series()
kurt_inc <- hf["avg_monthly_inc"] %>% kurtosis()
kurt_inc %>% decrypt() %>% to_series()
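To make the two quantities concrete, here is a plaintext sketch of the moment-based definitions in base R. The helper names `moment_skewness` and `moment_kurtosis` are hypothetical, and the convention used by the SDK (sample versus population moments, raw versus excess kurtosis) is not stated here, so the encrypted results may differ from these by a known adjustment.

```r
# Moment-based skewness: third central moment over variance^(3/2).
moment_skewness <- function(x) {
  m <- mean(x)
  m2 <- mean((x - m)^2)  # population variance (divides by n)
  m3 <- mean((x - m)^3)
  m3 / m2^1.5
}

# Moment-based (raw) kurtosis: fourth central moment over variance^2.
# A normal distribution has raw kurtosis 3; subtract 3 for excess kurtosis.
moment_kurtosis <- function(x) {
  m <- mean(x)
  m2 <- mean((x - m)^2)
  m4 <- mean((x - m)^4)
  m4 / m2^2
}

x_sym <- c(1, 2, 3, 4, 5)  # symmetric toy data
skew_sym <- moment_skewness(x_sym)
kurt_sym <- moment_kurtosis(x_sym)
```

Income data is typically right-skewed, so a positive skewness and a kurtosis well above 3 would not be surprising for `avg_monthly_inc`.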
Now, let’s calculate descriptive statistics for a specific credit grade. For example, consider the mean and standard deviation of the average monthly income for credit grade ‘A’.
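The encrypted version of this step is not shown here. As a plaintext sketch of the same idea, the base R code below filters hypothetical toy rows by grade and then summarizes the income column; the data frame `toy` and its values are illustrative only.

```r
# Hypothetical toy rows (not taken from the Lending Club data).
toy <- data.frame(
  grade = factor(c("A", "B", "A", "C", "A")),
  avg_monthly_inc = c(5000, 4200, 6100, 3900, 5500)
)

# Keep only the incomes of borrowers with credit grade 'A'.
inc_a <- toy$avg_monthly_inc[toy$grade == "A"]

# Group-level mean and sample standard deviation.
mean_a <- mean(inc_a)
sd_a <- sd(inc_a)
print(c(mean = mean_a, sd = sd_a))
```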
In this section, we will explore how to conduct hypothesis testing. First, let’s conduct a one-sample hypothesis test for the mean.
Assume that we want to test whether the population mean of the average monthly income is 6,500. Then, the null hypothesis is:
\(H_0: \mu_{inc} = 6{,}500\).
The alternative hypothesis is:
\(H_1: \mu_{inc} \neq 6{,}500\).
null_mean <- 6500
ttest <- t_test(hf["avg_monthly_inc"], mu = null_mean, conf.level = 0.95)
# The statistic holds the t-value (first entry) and the degrees of freedom (second entry)
t_stat <- ttest$statistic
conf_int <- ttest$conf.int
t_stat <- t_stat %>% decrypt() %>% to_series()
conf_int <- conf_int %>% decrypt() %>% to_series()
print("t-statistic:")
t_stat[1]
print("degrees of freedom:")
t_stat[2]
print("confidence interval")
conf_int
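Since the t-statistic is approximately 3.2, which exceeds the two-sided critical value of about 1.96 in absolute value, we reject the null hypothesis at the 5% significance level (95% confidence level). The data therefore provide evidence that the mean of the average monthly income differs from 6,500 dollars.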
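The quantities reported by the test can be reproduced in plaintext to see where they come from. The sketch below uses hypothetical toy incomes and base R only: it computes the t-statistic, degrees of freedom, and confidence interval by hand, and compares them with `stats::t.test()`. It is a cross-check of the arithmetic, not the encrypted procedure itself.

```r
# Hypothetical toy incomes (not taken from the Lending Club data).
x <- c(6200, 7100, 6800, 5900, 7400, 6600, 7000, 6900)
null_mean <- 6500
n <- length(x)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual <- (mean(x) - null_mean) / (sd(x) / sqrt(n))
df_manual <- n - 1

# Two-sided critical value and confidence interval at the 95% level.
t_crit <- qt(0.975, df = df_manual)
ci_manual <- mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)

# Reference result from base R.
tt <- t.test(x, mu = null_mean, conf.level = 0.95)

# Reject H0 when |t| exceeds the critical value.
reject <- abs(t_manual) > t_crit
print(c(t = t_manual, df = df_manual, reject = reject))
```

With 32,767 degrees of freedom, as in the full sample, the critical value `qt(0.975, 32767)` is approximately 1.96, which is the threshold the t-statistic above is compared against.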