Lending Club Data Analysis Logistic Regression • heaan.sdk.R

Lending Club Data Analysis

In this documentation, we introduce default detection model using lending club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1).

Step 1. Data import

library(tidyverse)
library(reticulate)

tb <- read.csv("lc.csv")
tb <- as_tibble(tb)

Step 2. Data preprocessing

Before conducting the analysis, we need to perform some data preprocessing. First, let’s transform int_rate into a numeric variable.

tb <- tb %>%
  mutate(int_rate = as.numeric(gsub("%", "", int_rate)) / 100)

These are several ways to deal with missing observations in variables. In this example, we simply drop the missing observations.

tb <- na.omit(tb)

We can apply a log transformation to the ‘annual_inc’ and “loan_amount” variable using log function.

tb <- tb %>%
  mutate(
    log_inc = log(annual_inc),
    log_loan_amnt = log(loan_amnt))

One of useful variable transformation technique is one-hot encoding. If we have some categorical dataset such as “term”, “interest rate”, “grade”, “sub_grade”, and “purpose of loan”. For example, let’s transform “purpose” “term” and “grade” into one-hot encoded variables. Note that one of one-hot encoded variables will be omitted due to exact collinearity in the analysis.

tb <- tb %>%
  mutate(purpose = as.numeric(as.factor(as.numeric(factor(purpose, levels = unique(purpose)[order(unique(purpose))])))))

col_names <- paste0("purpose_", sort(unique(tb$purpose)))
for (i in seq_along(sort(unique(tb$purpose)))) {
  tb[[col_names[i]]] <- ifelse(tb$purpose == sort(unique(tb$purpose))[i], 1L, 0L)
  as.factor(tb[[col_names[i]]])
}
tb <- tb %>%
  mutate(term = as.numeric(as.factor(as.numeric(factor(term, levels = unique(term)[order(unique(term))])))))
col_names <- paste0("term_", sort(unique(tb$term)))
for (i in seq_along(sort(unique(tb$term)))) {
  tb[[col_names[i]]] <- ifelse(tb$term == sort(unique(tb$term))[i], 1L, 0L)
  as.factor(tb[[col_names[i]]])
}
tb <- tb %>%
  mutate(grade = factor(grade, levels = unique(grade)[order(unique(grade))]))

col_names <- paste0("grade_", sort(unique(tb$grade)))
for (i in seq_along(sort(unique(tb$grade)))) {
  tb[[col_names[i]]] <- ifelse(tb$grade == sort(unique(tb$grade))[i], 1L, 0L)
  as.factor(tb[[col_names[i]]])
}

Step 3. Binary classification

Since our objective is to predict whether a loan is well-paid or not, let’s create a binary variable called “bad”. We can define “bad=0” if the loan_status is either “Fully Paid” or “Current”, and “bad=1” if the loan_status is not well-paid.

tb <- tb %>%
  mutate(bad = ifelse(loan_status == "Fully Paid" | loan_status == "Current", 0L, 1L))

str(tb)

Normalize ‘log_inc’ and ‘log_loan_amnt’ to ensure that the weight of each feature is evenly distributed.

tb <- tb %>%
  mutate(
    log_inc = log_inc / max(log_inc),
    log_loan_amnt = log_loan_amnt / max(log_loan_amnt))

X <- tb[, c("log_inc", "log_loan_amnt", "purpose_2", "purpose_3", "purpose_4", "purpose_5", "purpose_6", "purpose_7", "purpose_8", "purpose_9", "purpose_10", "purpose_11", "purpose_12", "purpose_13", "purpose_14", "term_1", "grade_B", "grade_C", "grade_D", "grade_E", "grade_F", "grade_G")]
y <- tb$bad

Let’s divide our data into training data and inference data.

split <- sample(nrow(X), 0.8 * nrow(X))
X_train <- X[split, ]
X_test <- X[-split, ]
y_train <- y[split]
y_test <- y[-split]

Let’s import HEaaN-SDK to conduct logistic regression.

library(heaan.sdk.R)
import_heaan_sdk()
params <- heaan_sdk.HEParameter("FGb")
context <- heaan_sdk.Context(
    params,
    key_dir_path = "./keys",
    load_keys = "all",
    generate_keys = TRUE
)

Let’s set the hyperparameters.

num_epoch <- 10L
learning_rate <- 1.0
batch_size <- 1024L
optimizer <- "sgd"
lr_scheduler <- "constant"
activation <- "sigmoid_wide"
classes <- unique(y)
num_feature <- ncol(X)
unit_shape <- c(batch_size, as.integer(py_to_r(context$num_slots) / batch_size))
train_data <- encode_train_data(
    context,
    X_train,
    y_train,
    unit_shape,
    dtype = "classification",
    path = "./training"
)
train_data %>% encrypt()

Let’s set up the logistic regression model.

model <- hml_logit(
    context,
    unit_shape,
    num_feature,
    classes,
    path = "./model"
)
model %>% encrypt() %>% to_device()

Let’s fit the logistic regression model.

model %>% fit(
    train_data,
    lr = learning_rate,
    num_epoch = num_epoch,
    batch_size = batch_size,
    optimizer = optimizer,
    lr_scheduler = lr_scheduler,
    activation = activation
)

model %>% to_host() %>% decrypt()
model

Now, we can use the trained logistic regression model to make predictions on test data.

# encode and encrypt X_test
test_data_features <- encode_encrypt(
  context,
  X_test,
  unit_shape
)
test_data_features %>% to_device()

# predict
output_binary <- model %>%
                predict(test_data_features)

output_binary %>% to_host()

# decrypt result
output_arr <- output_binary %>% decrypt_decode()

# compute auc
threshold <- 0.5
probs <- 1 / (1 + exp(-(output_arr)))
probs <- as.vector(probs)
preds <- probs > threshold
correct_cnt <- sum(preds == y_test)
acc <- correct_cnt / length(y_test)
cat(sprintf("Test accuracy: %.2f%%", acc * 100))

According to the result, the out-of-sample test of our model has 86.61% accuracy. This means that it can distinguish whether a borrower causes a bad transaction with an accuracy of about 86.61%