Logistic regression using HEaaN.ml
In this documentation, we introduce a default detection model using the Lending Club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1).
1. Data import
We import Lending Club data extracted from the original set downloaded from Kaggle. In this exercise, we use only 32,768 observations and a selection of sensitive features. Let’s import the provided data.
[2]:
import pandas as pd
import numpy as np
[3]:
df = pd.read_csv('lc.csv')
[4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32768 entries, 0 to 32767
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_amnt 32768 non-null float64
1 annual_inc 32768 non-null float64
2 term 32768 non-null object
3 int_rate 32768 non-null object
4 installment 32768 non-null float64
5 grade 32768 non-null object
6 sub_grade 32768 non-null object
7 purpose 32768 non-null object
8 loan_status 32768 non-null object
9 dti 32741 non-null float64
10 last_fico_range_high 32768 non-null float64
11 last_fico_range_low 32768 non-null float64
12 total_acc 32768 non-null float64
13 delinq_2yrs 32768 non-null float64
14 emp_length 30468 non-null object
dtypes: float64(8), object(7)
memory usage: 3.8+ MB
2. Data preprocessing
Before conducting the analysis, let’s perform some data preprocessing. First, let’s convert int_rate from a percentage string into a float variable.
[5]:
df['int_rate'] = df['int_rate'].apply(lambda x: float(x.strip('%')) / 100)
There are several ways to handle missing observations. In this example, we simply drop the rows that contain them.
[6]:
df = df.dropna()
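If you would rather keep those rows, imputation is a common alternative. A minimal sketch (it would replace the dropna() call above; df_imputed is a hypothetical name):
[ ]:
# Alternative to dropna(): fill missing values instead of dropping rows.
# 'dti' is numeric, so the median is a reasonable fill value;
# 'emp_length' is categorical, so we fall back to the most frequent value.
df_imputed = df.copy()
df_imputed['dti'] = df_imputed['dti'].fillna(df_imputed['dti'].median())
df_imputed['emp_length'] = df_imputed['emp_length'].fillna(df_imputed['emp_length'].mode()[0])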
We can apply a log transformation to the 'annual_inc' and 'loan_amnt' variables using numpy: np.log(df['annual_inc']) and np.log(df['loan_amnt']). Once the transformation is applied, it’s good practice to check that it succeeded by inspecting the transformed data.
[7]:
df['log_inc'] = np.log(df['annual_inc'])
df['log_loan_amnt'] = np.log(df['loan_amnt'])
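As suggested above, it is worth inspecting the transformed columns; a quick check:
[ ]:
# Sanity check: both log-transformed columns should be finite and
# roughly comparable in scale.
df[['log_inc', 'log_loan_amnt']].describe()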
One useful variable transformation technique is one-hot encoding. We have several categorical variables, such as term, grade, sub_grade, and purpose of the loan. For example, let’s transform the purpose of the loan into one-hot encoded variables. Note that one of the one-hot encoded variables must be omitted in the analysis to avoid exact collinearity.
[8]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
[9]:
df['purpose_trans'] = le.fit_transform(df['purpose'])
dummies = pd.get_dummies(df['purpose_trans'], prefix='purpose')
df = pd.concat([df, dummies], axis=1)
The remaining cells apply the same transformation to the term and grade variables.
[10]:
df['term_trans'] = le.fit_transform(df['term'])
dummies_term = pd.get_dummies(df['term_trans'], prefix='term')
df = pd.concat([df, dummies_term], axis=1)
[11]:
df['grade_trans'] = le.fit_transform(df['grade'])
dummies_grade = pd.get_dummies(df['grade_trans'], prefix='grade')
df = pd.concat([df, dummies_grade], axis=1)
You can drop variables for various reasons. Here, to avoid the collinearity problem in the regression analysis, we omit one category from each set of one-hot encoded variables.
[12]:
df = df.drop(['purpose_0'], axis=1)
df = df.drop(['term_0'], axis=1)
df = df.drop(['grade_0'], axis=1)
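As a shortcut, pd.get_dummies can drop the baseline category during encoding itself; a sketch that is equivalent in effect to the manual drops above (the column names will differ, since it encodes the raw strings directly):
[ ]:
# One-step alternative: encode and drop the first category at once,
# instead of removing 'purpose_0', 'term_0', 'grade_0' afterwards.
dummies_all = pd.get_dummies(df[['purpose', 'term', 'grade']], drop_first=True)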
3. Data analysis
We will use the features defined in the section above, including all of the one-hot encoded variables.
[13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 30467 entries, 0 to 32767
Data columns (total 40 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_amnt 30467 non-null float64
1 annual_inc 30467 non-null float64
2 term 30467 non-null object
3 int_rate 30467 non-null float64
4 installment 30467 non-null float64
5 grade 30467 non-null object
6 sub_grade 30467 non-null object
7 purpose 30467 non-null object
8 loan_status 30467 non-null object
9 dti 30467 non-null float64
10 last_fico_range_high 30467 non-null float64
11 last_fico_range_low 30467 non-null float64
12 total_acc 30467 non-null float64
13 delinq_2yrs 30467 non-null float64
14 emp_length 30467 non-null object
15 log_inc 30467 non-null float64
16 log_loan_amnt 30467 non-null float64
17 purpose_trans 30467 non-null int64
18 purpose_1 30467 non-null bool
19 purpose_2 30467 non-null bool
20 purpose_3 30467 non-null bool
21 purpose_4 30467 non-null bool
22 purpose_5 30467 non-null bool
23 purpose_6 30467 non-null bool
24 purpose_7 30467 non-null bool
25 purpose_8 30467 non-null bool
26 purpose_9 30467 non-null bool
27 purpose_10 30467 non-null bool
28 purpose_11 30467 non-null bool
29 purpose_12 30467 non-null bool
30 purpose_13 30467 non-null bool
31 term_trans 30467 non-null int64
32 term_1 30467 non-null bool
33 grade_trans 30467 non-null int64
34 grade_1 30467 non-null bool
35 grade_2 30467 non-null bool
36 grade_3 30467 non-null bool
37 grade_4 30467 non-null bool
38 grade_5 30467 non-null bool
39 grade_6 30467 non-null bool
dtypes: bool(20), float64(11), int64(3), object(6)
memory usage: 5.5+ MB
Since our objective is to predict whether a loan is well paid or not, let’s create a binary variable called ‘bad’: bad=0 if loan_status is either ‘Fully Paid’ or ‘Current’, and bad=1 otherwise.
[14]:
df['bad'] = df['loan_status'].apply(lambda x: 0 if x == 'Fully Paid' or x == 'Current' else 1)
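It is also worth checking how balanced the new label is; a quick check:
[ ]:
# Share of good (0) vs. bad (1) loans; a strong imbalance would make
# accuracy alone a misleading performance metric.
df['bad'].value_counts(normalize=True)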
Now we can construct a binary logistic regression model that predicts the probability of a loan being classified as “bad” from the independent variables prepared above: log income, log loan amount, and the one-hot encoded purpose, term, and grade.
[15]:
y = df['bad'].to_numpy()
Normalize ‘log_inc’ and ‘log_loan_amnt’ so that all features lie on a comparable scale and no single feature dominates the weights.
[16]:
df['log_inc'] /= df['log_inc'].max()
df['log_loan_amnt'] /= df['log_loan_amnt'].max()
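After scaling, the maximum of each column should be exactly 1.0, which you can verify:
[ ]:
# Verify the normalization.
df[['log_inc', 'log_loan_amnt']].max()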
[17]:
X = df[['log_inc','log_loan_amnt','purpose_1','purpose_2','purpose_3','purpose_4','purpose_5','purpose_6','purpose_7','purpose_8','purpose_9','purpose_10','purpose_11','purpose_12','purpose_13','term_1','grade_1','grade_2','grade_3','grade_4','grade_5','grade_6']].to_numpy()
Divide the data into a training set and an inference set.
[18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)
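Since we passed stratify=y, the share of bad loans should be (almost) identical in the two splits; a quick check:
[ ]:
# Stratified splitting preserves the class ratio in both sets.
print(y_train.mean(), y_test.mean())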
Let’s import HEaaN-SDK to conduct logistic regression.
[19]:
import heaan_sdk

context = heaan_sdk.Context(
    parameter=heaan_sdk.HEParameter.from_preset("FGb"),
    key_dir_path="./keys",
    load_keys="all",
    generate_keys=True,
)
HEaaN-SDK uses CUDA v11.7 (> v11.2)
Let us set the hyperparameters first and then encrypt the training data.
[20]:
num_epoch = 10
learning_rate = 1.0
batch_size = 1024
optimizer = "sgd"
lr_scheduler = "constant"
activation = "sigmoid_wide"
classes = sorted([val for val in df['bad'].unique()])  # binary labels: [0, 1]
num_feature = len(X[0])
# One encrypted unit packs batch_size rows across the available ciphertext slots.
unit_shape = (batch_size, context.num_slots // batch_size)
[21]:
train_data = heaan_sdk.ml.preprocessing.encode_train_data(context, X_train, y_train, unit_shape, dtype="classification", path="./training")
train_data.encrypt()
Now, create the model and encrypt it.
[22]:
model = heaan_sdk.ml.LogisticRegression(context, unit_shape, num_feature, classes, path="./model")
model.encrypt()
If a GPU is available, send the model to the GPU.
[23]:
model.to_device()
To train the model on the encrypted data, use its fit() method.
[24]:
model.fit(
    train_data,
    lr=learning_rate,
    num_epoch=num_epoch,
    batch_size=batch_size,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    activation=activation,
)
Epoch 0: 100%|██████████| 24/24 [00:45<00:00, 1.91s/it]
Epoch 1: 100%|██████████| 24/24 [00:37<00:00, 1.54s/it]
Epoch 2: 100%|██████████| 24/24 [00:37<00:00, 1.54s/it]
Epoch 3: 100%|██████████| 24/24 [00:36<00:00, 1.52s/it]
Epoch 4: 100%|██████████| 24/24 [00:36<00:00, 1.51s/it]
Epoch 5: 100%|██████████| 24/24 [00:36<00:00, 1.54s/it]
Epoch 6: 100%|██████████| 24/24 [00:36<00:00, 1.52s/it]
Epoch 7: 100%|██████████| 24/24 [00:35<00:00, 1.50s/it]
Epoch 8: 100%|██████████| 24/24 [00:36<00:00, 1.50s/it]
Epoch 9: 100%|██████████| 24/24 [00:36<00:00, 1.50s/it]
Now the training is over. Let’s decrypt the trained model and look at it.
[25]:
model.to_host()
model.decrypt()
[26]:
model
[26]:
========== model description ==========
path: model
epoch_state: 10
theta: [[-3.12874321 2.01474527 -0.04004456 0.01228742 -0.22393505 0.08485522
0.04346048 0.19938321 -0.06895808 0.25302475 0.0347872 -0.13131598
0.24441581 0.34844682 0.19362228 -0.014064 0.85363718 1.32182948
1.77107065 2.25124134 2.70321801 2.69220521 -2.49021181]]
=======================================
[27]:
model.to_dataframe()
[27]:
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | -3.128743 | 2.014745 | -0.040045 | 0.012287 | -0.223935 | 0.084855 | 0.04346 | 0.199383 | -0.068958 | 0.253025 | ... | 0.348447 | 0.193622 | -0.014064 | 0.853637 | 1.321829 | 1.771071 | 2.251241 | 2.703218 | 2.692205 | -2.490212 |

1 rows × 23 columns
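To make the coefficients easier to read, you can label them with the feature names. A sketch, assuming the last of the 23 entries is the bias term; check the HEaaN-SDK documentation for the actual layout:
[ ]:
# Hypothetical labelling of the decrypted coefficients.
# ASSUMPTION: the final entry of theta is the bias term.
feature_names = (['log_inc', 'log_loan_amnt']
                 + [f'purpose_{i}' for i in range(1, 14)]
                 + ['term_1']
                 + [f'grade_{i}' for i in range(1, 7)])
theta = model.to_dataframe().to_numpy().ravel()
print(pd.Series(theta[:len(feature_names)], index=feature_names))
print('bias (assumed):', theta[-1])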
For inference, encrypt the inference data.
[28]:
test_data_feature = heaan_sdk.HEMatrix.encode_encrypt(context, X_test, unit_shape)
If a GPU is available, send the test data to the GPU.
[29]:
test_data_feature.to_device()
[29]:
HEMatrix([
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/0",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/1",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/2",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/3",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/4",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/5",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
])
])
To run inference on the encrypted data, use the model’s predict() method.
[30]:
output_binary = model.predict(test_data_feature)
Let’s decrypt the inference output and evaluate the model’s performance.
[31]:
output_binary.to_host()
output_arr_binary = output_binary.decrypt_decode()
The output consists of raw scores before conversion to probabilities, so we apply the sigmoid and a decision threshold ourselves.
[32]:
threshold = 0.5
probs = 1 / (1 + np.exp(-output_arr_binary))
probs = probs.squeeze()
preds = probs > threshold
correct_cnt = (preds == y_test).sum()
acc = correct_cnt / len(y_test)
print(f"Test accuracy: {acc * 100: .2f}%")
Test accuracy: 86.76%
According to this result, the out-of-sample test of our model reaches 86.76% accuracy; that is, it correctly predicts whether a borrower will cause a bad transaction about 86.76% of the time.
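Accuracy alone can be misleading when the classes are imbalanced; a sketch of complementary metrics using scikit-learn (already used above for the train/test split):
[ ]:
# AUC uses the predicted probabilities directly; the confusion matrix
# breaks the errors down by class.
from sklearn.metrics import roc_auc_score, confusion_matrix
print('AUC:', roc_auc_score(y_test, probs))
print(confusion_matrix(y_test, preds))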