lc_ml.Rmd
In this documentation, we introduce default detection model using lending club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1).
Before conducting the analysis, we need to perform some data preprocessing. First, let’s transform int_rate into a numeric variable.
tb <- tb %>%
mutate(int_rate = as.numeric(gsub("%", "", int_rate)) / 100)
These are several ways to deal with missing observations in variables. In this example, we simply drop the missing observations.
tb <- na.omit(tb)
We can apply a log transformation to the ‘annual_inc’ and “loan_amount” variable using log function.
One of useful variable transformation technique is one-hot encoding. If we have some categorical dataset such as “term”, “interest rate”, “grade”, “sub_grade”, and “purpose of loan”. For example, let’s transform “purpose” “term” and “grade” into one-hot encoded variables. Note that one of one-hot encoded variables will be omitted due to exact collinearity in the analysis.
tb <- tb %>%
mutate(purpose = as.numeric(as.factor(as.numeric(factor(purpose, levels = unique(purpose)[order(unique(purpose))])))))
col_names <- paste0("purpose_", sort(unique(tb$purpose)))
for (i in seq_along(sort(unique(tb$purpose)))) {
tb[[col_names[i]]] <- ifelse(tb$purpose == sort(unique(tb$purpose))[i], 1L, 0L)
as.factor(tb[[col_names[i]]])
}
tb <- tb %>%
mutate(term = as.numeric(as.factor(as.numeric(factor(term, levels = unique(term)[order(unique(term))])))))
col_names <- paste0("term_", sort(unique(tb$term)))
for (i in seq_along(sort(unique(tb$term)))) {
tb[[col_names[i]]] <- ifelse(tb$term == sort(unique(tb$term))[i], 1L, 0L)
as.factor(tb[[col_names[i]]])
}
tb <- tb %>%
mutate(grade = factor(grade, levels = unique(grade)[order(unique(grade))]))
col_names <- paste0("grade_", sort(unique(tb$grade)))
for (i in seq_along(sort(unique(tb$grade)))) {
tb[[col_names[i]]] <- ifelse(tb$grade == sort(unique(tb$grade))[i], 1L, 0L)
as.factor(tb[[col_names[i]]])
}
Since our objective is to predict whether a loan is well-paid or not, let’s create a binary variable called “bad”. We can define “bad=0” if the loan_status is either “Fully Paid” or “Current”, and “bad=1” if the loan_status is not well-paid.
str(tb)
Normalize ‘log_inc’ and ‘log_loan_amnt’ to ensure that the weight of each feature is evenly distributed.
tb <- tb %>%
mutate(
log_inc = log_inc / max(log_inc),
log_loan_amnt = log_loan_amnt / max(log_loan_amnt))
X <- tb[, c("log_inc", "log_loan_amnt", "purpose_2", "purpose_3", "purpose_4", "purpose_5", "purpose_6", "purpose_7", "purpose_8", "purpose_9", "purpose_10", "purpose_11", "purpose_12", "purpose_13", "purpose_14", "term_1", "grade_B", "grade_C", "grade_D", "grade_E", "grade_F", "grade_G")]
y <- tb$bad
Let’s divide our data into training data and inference data.
split <- sample(nrow(X), 0.8 * nrow(X))
X_train <- X[split, ]
X_test <- X[-split, ]
y_train <- y[split]
y_test <- y[-split]
Let’s import HEaaN-SDK to conduct logistic regression.
library(heaan.sdk.R)
import_heaan_sdk()
params <- heaan_sdk.HEParameter("FGb")
context <- heaan_sdk.Context(
params,
key_dir_path = "./keys",
load_keys = "all",
generate_keys = TRUE
)
Let’s set the hyperparameters.
num_epoch <- 10L
learning_rate <- 1.0
batch_size <- 1024L
optimizer <- "sgd"
lr_scheduler <- "constant"
activation <- "sigmoid_wide"
classes <- unique(y)
num_feature <- ncol(X)
unit_shape <- c(batch_size, as.integer(py_to_r(context$num_slots) / batch_size))
train_data <- encode_train_data(
context,
X_train,
y_train,
unit_shape,
dtype = "classification",
path = "./training"
)
train_data %>% encrypt()
Let’s set up the logistic regression model.
model <- hml_logit(
context,
unit_shape,
num_feature,
classes,
path = "./model"
)
model %>% encrypt() %>% to_device()
Let’s fit the logistic regression model.
model %>% fit(
train_data,
lr = learning_rate,
num_epoch = num_epoch,
batch_size = batch_size,
optimizer = optimizer,
lr_scheduler = lr_scheduler,
activation = activation
)
Now, we can use the trained logistic regression model to make predictions on test data.
# encode and encrypt X_test
test_data_features <- encode_encrypt(
context,
X_test,
unit_shape
)
test_data_features %>% to_device()
# predict
output_binary <- model %>%
predict(test_data_features)
output_binary %>% to_host()
# decrypt result
output_arr <- output_binary %>% decrypt_decode()
# compute auc
threshold <- 0.5
probs <- 1 / (1 + exp(-(output_arr)))
probs <- as.vector(probs)
preds <- probs > threshold
correct_cnt <- sum(preds == y_test)
acc <- correct_cnt / length(y_test)
cat(sprintf("Test accuracy: %.2f%%", acc * 100))
According to the result, the out-of-sample test of our model has 86.61% accuracy. This means that it can distinguish whether a borrower causes a bad transaction with an accuracy of about 86.61%