Lending Club Data Analysis Stats • heaan.sdk.R

library(tidyverse)
library(reticulate)

Lending Club Data Analysis

Step 0. Get data

In this document, we introduce a default detection model built on the Lending Club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1/versions/3?resource=download).

This step selects a sample from the original dataset: we draw 32,768 observations at random and keep a handful of features for a toy example.

library(utils)
df <- read.csv(gzfile("Loan_status_2007-2020Q3.gzip"), header = TRUE)

str(df)
head(df)
loan_status_counts <- table(df$loan_status)
print(loan_status_counts)

set.seed(123)
df_sub <- df[sample(nrow(df), 32768, replace = FALSE), ]
str(df_sub)
head(df_sub)

df_sub <- df_sub[, c("loan_amnt", "annual_inc", "term", "int_rate", "installment", "grade", "sub_grade", "purpose", "loan_status", "dti", "last_fico_range_high", "last_fico_range_low", "total_acc", "delinq_2yrs", "emp_length")]
head(df_sub)

# Save
write.csv(df_sub, "lc.csv", row.names = FALSE)

Step 1. Data import

For this exercise, we use only the 32,768 sampled observations and a selection of sensitive features. Let’s import the provided data.

tb <- read.csv("lc.csv")
tb <- as_tibble(tb)

Before encoding the dataset into an HEFrame, we need to convert every string-type column to a factor and make sure the interest rate is numeric.

tb <- tb %>%
  mutate(
    # Make sure the loan amount is numeric
    loan_amnt = as.numeric(loan_amnt),
    # Convert string-type columns to factors
    term = as.factor(term),
    grade = as.factor(grade),
    sub_grade = as.factor(sub_grade),
    purpose = as.factor(purpose),
    loan_status = as.factor(loan_status),
    emp_length = as.factor(emp_length),
    # Strip the percent sign and rescale the interest rate to a fraction
    int_rate = as.numeric(gsub("%", "", int_rate)) / 100
  )
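
To see what the conversion above does to individual values, here is a minimal sketch on a hypothetical three-row data frame. The values are made up for illustration, and the sketch uses base R only so that it runs without the tidyverse.

```r
# Hypothetical toy data; the real values come from lc.csv.
toy <- data.frame(
  grade = c("A", "B", "A"),
  int_rate = c("13.56%", "7.02%", "10.00%"),
  stringsAsFactors = FALSE
)

# String column -> factor, so the categories become labelled levels.
toy$grade <- as.factor(toy$grade)

# Strip the percent sign, parse as a number, and rescale to a fraction.
toy$int_rate <- as.numeric(gsub("%", "", toy$int_rate)) / 100

str(toy)
```

After the conversion, grade is a factor with levels A and B, and int_rate holds fractions such as 0.1356 rather than strings.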

Step 2. Encode and Encrypt

Before preprocessing, we need to encode the data into an HEFrame and encrypt it.

library(heaan.sdk.R)
import_heaan_sdk()
params <- heaan_sdk.HEParameter("FGb")
context <- heaan_sdk.Context(
    params,
    key_dir_path = "./keys",
    load_keys = "all",
    generate_keys = TRUE
)
hf <- HEFrame.from_frame(context, tb)
hf %>% encrypt()

We will work with average monthly income, computed as annual income multiplied by 1/12.

hf %>%
  mutate(hf["annual_inc"] * (1 / 12), col_name = "avg_monthly_inc")

Step 3. Descriptive Statistics

In this section, we calculate descriptive statistics for some variables. First, we calculate the mean and standard deviation of average monthly income.

mean_inc <- hf["avg_monthly_inc"] %>% mean()
mean_inc %>% decrypt() %>% to_series()

sd_inc <- hf["avg_monthly_inc"] %>% sd()
sd_inc %>% decrypt() %>% to_series()

Now, let’s calculate the correlation coefficient between monthly income and annual income.

corr_inc <- hf["avg_monthly_inc"] %>%
            corr(hf["annual_inc"])
corr_inc %>% decrypt() %>% to_series()

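The result is almost 1, indicating a strong correlation between the two variables. This is expected, since average monthly income is just annual income scaled by 1/12. Annual income therefore carries no additional information, so we will remove it from our HEFrame.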

hf$drop("annual_inc", axis = 1L, inplace = TRUE)
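
The reason the correlation comes out at 1 can be checked on plaintext data. The sketch below uses hypothetical income values and base R's cor() rather than the encrypted corr() call, so it only illustrates the arithmetic, not the HEFrame API.

```r
# Hypothetical annual incomes, for illustration only.
annual_inc <- c(42000, 65000, 58000, 120000, 31000)

# Same rescaling as in the HEFrame step: multiply by 1/12.
avg_monthly_inc <- annual_inc * (1 / 12)

# Pearson correlation of a variable with a positive multiple of itself.
r <- cor(avg_monthly_inc, annual_inc)
print(r)
```

Pearson correlation is invariant to positive rescaling, so any column built this way is perfectly correlated with its source, up to floating-point (or, under encryption, approximation) error.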

Let’s calculate the skewness and kurtosis of average monthly income.

skew_inc <- hf["avg_monthly_inc"] %>% skewness()
skew_inc %>% decrypt() %>% to_series()

kurt_inc <- hf["avg_monthly_inc"] %>% kurtosis()
kurt_inc %>% decrypt() %>% to_series()
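
For reference, skewness and kurtosis can be written in terms of central moments. The following is a plaintext sketch with hypothetical helper names (central_moment, skewness_plain, kurtosis_plain) and made-up data. It assumes the population-moment form and raw (not excess) kurtosis. The convention used by heaan.sdk.R is not stated here and may differ, so compare definitions before comparing numbers.

```r
# k-th central moment, using the population (divide-by-n) form.
central_moment <- function(x, k) mean((x - mean(x))^k)

# Assumed definitions: skewness = m3 / m2^1.5, kurtosis = m4 / m2^2.
# heaan.sdk.R may normalise differently or report excess kurtosis.
skewness_plain <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurtosis_plain <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Hypothetical right-skewed monthly incomes, for illustration only.
x <- c(2500, 3000, 3200, 4100, 5000, 5400, 21000)

skew_x <- skewness_plain(x)
kurt_x <- kurtosis_plain(x)
print(c(skewness = skew_x, kurtosis = kurt_x))
```

A positive skewness indicates a long right tail, which is typical for income data. Under this raw-kurtosis convention, a value above 3 indicates heavier tails than a normal distribution.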

Now, let’s calculate descriptive statistics for a specific credit grade. For example, consider the mean and standard deviation of average monthly income for credit grade ‘A’.

mean_inc_with_A <- hf["avg_monthly_inc"] %>%
  filter(hf["grade"] == "A") %>%
  mean()

mean_inc_with_A %>% decrypt() %>% to_series()

sd_inc_with_A <- hf["avg_monthly_inc"] %>%
  filter(hf["grade"] == "A") %>%
  sd()

sd_inc_with_A %>% decrypt() %>% to_series()
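
The filter-then-summarise pattern above has a direct plaintext counterpart, which can be handy for sanity-checking decrypted results. This sketch uses a hypothetical five-row data frame and base R. Note that base R's sd() divides by n - 1; whether the encrypted sd() uses the same convention is an assumption to verify.

```r
# Hypothetical plaintext stand-in for the encrypted columns.
toy <- data.frame(
  grade = factor(c("A", "B", "A", "C", "A")),
  avg_monthly_inc = c(6000, 4000, 8000, 3000, 7000)
)

# Keep only rows with grade "A", then summarise the income column.
inc_A <- toy$avg_monthly_inc[toy$grade == "A"]
mean_A <- mean(inc_A)
sd_A <- sd(inc_A)  # sample standard deviation (n - 1 denominator)

print(c(mean = mean_A, sd = sd_A))
```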

Step 4. Hypothesis Testing

In this section, we will explore how to conduct hypothesis testing. First, let’s conduct a one-sample hypothesis test for the mean.

Assume that we want to test whether the population mean of average monthly income is 6,500. The null hypothesis is:

$H_0: \mu_{inc} = 6{,}500$.

The alternative hypothesis is:

$H_1: \mu_{inc} \neq 6{,}500$.

null_mean <- 6500
ttest <- t_test(hf["avg_monthly_inc"], mu = null_mean, conf.level = 0.95)

# Extract the test statistic (t-value and degrees of freedom) and the confidence interval
t_stat <- ttest$statistic
conf_int <- ttest$conf.int

t_stat <- t_stat %>% decrypt() %>% to_series()

conf_int <- conf_int %>% decrypt() %>% to_series()

print("t-statistic:")
t_stat[1]
print("degrees of freedom:")
t_stat[2]

print("confidence interval")
conf_int

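Since the t-statistic is approximately 3.2, which exceeds the two-sided critical value of about 1.96, we can reject the null hypothesis at the 5% significance level (that is, at the 95% confidence level). This means that the average monthly income is significantly different from 6,500.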
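
To make the decision rule concrete, the sketch below runs the same one-sample test on plaintext data with base R. The data are simulated (a hypothetical log-normal income distribution, not the Lending Club sample), so the numbers will not match the encrypted result. The point is how the t-statistic, degrees of freedom, and critical value relate.

```r
# Simulated incomes; the distribution and its parameters are assumptions.
set.seed(123)
n <- 32768
x <- rlnorm(n, meanlog = log(5500), sdlog = 0.6)

null_mean <- 6500

# t = (sample mean - null mean) / (sample sd / sqrt(n)), with n - 1 df.
t_manual <- (mean(x) - null_mean) / (sd(x) / sqrt(n))
df_manual <- n - 1

# Two-sided critical value at the 5% significance level.
t_crit <- qt(0.975, df = df_manual)

# Base R's t.test should agree with the manual computation.
res <- t.test(x, mu = null_mean, conf.level = 0.95)

print(c(t_manual = t_manual, t_builtin = unname(res$statistic)))
print(c(df = df_manual, critical_value = t_crit))
print(res$conf.int)
```

The null hypothesis is rejected when the absolute t-statistic exceeds the critical value, or equivalently when the 95% confidence interval does not contain the null mean. With 32,767 degrees of freedom the critical value is essentially the normal quantile, about 1.96.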