Logistic regression using HEaaN.ml

In this document, we build a loan default detection model using the Lending Club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1).

1. Data import

We import Lending Club data extracted from the original set downloaded from Kaggle. In this exercise, we use only 32,768 observations and a selected subset of sensitive features. Let’s import the provided data.

[2]:
import pandas as pd
import numpy as np
[3]:
df = pd.read_csv('lc.csv')
[4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32768 entries, 0 to 32767
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   loan_amnt             32768 non-null  float64
 1   annual_inc            32768 non-null  float64
 2   term                  32768 non-null  object
 3   int_rate              32768 non-null  object
 4   installment           32768 non-null  float64
 5   grade                 32768 non-null  object
 6   sub_grade             32768 non-null  object
 7   purpose               32768 non-null  object
 8   loan_status           32768 non-null  object
 9   dti                   32741 non-null  float64
 10  last_fico_range_high  32768 non-null  float64
 11  last_fico_range_low   32768 non-null  float64
 12  total_acc             32768 non-null  float64
 13  delinq_2yrs           32768 non-null  float64
 14  emp_length            30468 non-null  object
dtypes: float64(8), object(7)
memory usage: 3.8+ MB

2. Data preprocessing

Before conducting the analysis, let’s perform some data preprocessing. First, let’s convert int_rate from a percentage string to a float.

[5]:
df['int_rate'] = df['int_rate'].apply(lambda x: float(x.strip('%')) / 100)
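
As a small illustration of this conversion, the sketch below applies a vectorized equivalent to a toy series. The sample strings are made up for the example and are not taken from lc.csv.

```python
import pandas as pd

# Toy stand-in for the raw 'int_rate' column, which stores rates as strings.
rates = pd.Series([" 13.56%", "7.21%", "10.0%"])

# Vectorized equivalent of the apply/lambda call: strip whitespace and the
# percent sign, cast to float, then rescale from percent to a fraction.
rates_float = rates.str.strip().str.rstrip("%").astype(float) / 100

print(rates_float.dtype)
print(rates_float.tolist())
```

Either form works; the string-accessor version avoids a Python-level function call per row.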

There are several ways to deal with missing values. In this example, we simply drop every row that contains a missing value.

[6]:
df = df.dropna()
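
Before dropping rows, it can help to see where the gaps are. The following sketch uses a made-up toy frame (not the real data) to count missing values per column and the rows that dropna() removes.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same two columns that have missing values above.
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2500.0, 4000.0, 5000.0],
    "dti": [12.5, np.nan, 18.0, 9.1],
    "emp_length": ["3 years", "10+ years", None, "< 1 year"],
})

# Missing values per column, before dropping anything.
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"rows dropped: {rows_dropped}")
print(cleaned.index.tolist())
```

Note that dropna() keeps the original index labels. This is why the later df.info() output reports 30467 entries with an index that still runs from 0 to 32767.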

Next, we apply a log transformation to the ‘annual_inc’ and ‘loan_amnt’ variables with np.log, which operates on a whole column at once. After the transformation, it is good practice to inspect the new columns (for example with df[['log_inc', 'log_loan_amnt']].describe()) to confirm that it succeeded.

[7]:
df['log_inc']=np.log(df['annual_inc'])
df['log_loan_amnt']=np.log(df['loan_amnt'])
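
One way to check the transformation is to look for non-finite results, since np.log maps 0 to -inf. The sketch below uses made-up incomes, including a zero added on purpose; whether the real sample contains such rows is not stated here.

```python
import numpy as np
import pandas as pd

# Toy incomes; the 0.0 is included deliberately to show the failure mode.
toy = pd.DataFrame({"annual_inc": [30000.0, 65000.0, 120000.0, 0.0]})

# Count inputs for which the logarithm is not a finite real number.
non_positive = int((toy["annual_inc"] <= 0).sum())

# Suppress the divide-by-zero warning so the -inf is visible in the result.
with np.errstate(divide="ignore"):
    toy["log_inc"] = np.log(toy["annual_inc"])

# True only if every transformed value is finite.
all_finite = bool(np.isfinite(toy["log_inc"]).all())

print(non_positive, all_finite)
```

If the check fails on real data, options include dropping those rows or using np.log1p, which computes log(1 + x) and is finite at zero.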

One useful variable transformation technique is one-hot encoding. The dataset contains several categorical variables, such as term, grade, sub_grade, and purpose. As an example, let’s transform the loan purpose into one-hot encoded variables. We first map each category to an integer code with LabelEncoder, so that the dummy columns are named purpose_0, purpose_1, and so on. Note that one of the dummy variables must later be omitted from the regression to avoid exact collinearity.

[8]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
[9]:
df['purpose_trans']=le.fit_transform(df['purpose'])
dummies = pd.get_dummies(df['purpose_trans'], prefix='purpose')
df = pd.concat([df, dummies], axis=1)

We apply the same transformation to the other categorical variables used in the model, term and grade.

[10]:
df['term_trans']=le.fit_transform(df['term'])
dummies_term = pd.get_dummies(df['term_trans'], prefix='term')
df = pd.concat([df, dummies_term], axis=1)
[11]:
df['grade_trans']=le.fit_transform(df['grade'])
dummies_grade = pd.get_dummies(df['grade_trans'], prefix='grade')
df = pd.concat([df, dummies_grade], axis=1)

You may want to drop variables for several reasons. Here, to avoid the collinearity problem in the regression analysis, we drop one dummy variable from each one-hot encoded group; the dropped category (code 0) serves as the baseline.

[12]:
df = df.drop(['purpose_0'], axis=1)
df = df.drop(['term_0'], axis=1)
df = df.drop(['grade_0'], axis=1)
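
As an alternative sketch, pandas can encode a string column and drop the baseline category in one call via drop_first=True. The toy grades below are made up, and the resulting column names (grade_B, grade_C) differ from the integer-coded names used in this tutorial.

```python
import pandas as pd

# Toy categorical column with three levels.
toy = pd.DataFrame({"grade": ["B", "A", "C", "A", "B"]})

# drop_first=True omits the first level in sorted order ("A"),
# which then acts as the baseline category.
dummies = pd.get_dummies(toy["grade"], prefix="grade", drop_first=True)
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```

Rows belonging to the baseline level have zeros (or False) in every dummy column, which is what removes the exact collinearity with the intercept.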

3. Data analysis

We will use the features defined in the previous section, including all remaining one-hot encoded variables.

[13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 30467 entries, 0 to 32767
Data columns (total 40 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   loan_amnt             30467 non-null  float64
 1   annual_inc            30467 non-null  float64
 2   term                  30467 non-null  object
 3   int_rate              30467 non-null  float64
 4   installment           30467 non-null  float64
 5   grade                 30467 non-null  object
 6   sub_grade             30467 non-null  object
 7   purpose               30467 non-null  object
 8   loan_status           30467 non-null  object
 9   dti                   30467 non-null  float64
 10  last_fico_range_high  30467 non-null  float64
 11  last_fico_range_low   30467 non-null  float64
 12  total_acc             30467 non-null  float64
 13  delinq_2yrs           30467 non-null  float64
 14  emp_length            30467 non-null  object
 15  log_inc               30467 non-null  float64
 16  log_loan_amnt         30467 non-null  float64
 17  purpose_trans         30467 non-null  int64
 18  purpose_1             30467 non-null  bool
 19  purpose_2             30467 non-null  bool
 20  purpose_3             30467 non-null  bool
 21  purpose_4             30467 non-null  bool
 22  purpose_5             30467 non-null  bool
 23  purpose_6             30467 non-null  bool
 24  purpose_7             30467 non-null  bool
 25  purpose_8             30467 non-null  bool
 26  purpose_9             30467 non-null  bool
 27  purpose_10            30467 non-null  bool
 28  purpose_11            30467 non-null  bool
 29  purpose_12            30467 non-null  bool
 30  purpose_13            30467 non-null  bool
 31  term_trans            30467 non-null  int64
 32  term_1                30467 non-null  bool
 33  grade_trans           30467 non-null  int64
 34  grade_1               30467 non-null  bool
 35  grade_2               30467 non-null  bool
 36  grade_3               30467 non-null  bool
 37  grade_4               30467 non-null  bool
 38  grade_5               30467 non-null  bool
 39  grade_6               30467 non-null  bool
dtypes: bool(20), float64(11), int64(3), object(6)
memory usage: 5.5+ MB

Since our objective is to predict whether a loan is well-paid or not, let’s create a binary variable called ‘bad’. We define ‘bad=0’ if the loan_status is either ‘Fully Paid’ or ‘Current’, and ‘bad=1’ for any other status.

[14]:
df['bad'] = df['loan_status'].apply(lambda x: 0 if x == 'Fully Paid' or x == 'Current' else 1)
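
The same labeling rule can be written without a lambda, and it is worth checking the share of bad loans right away. The status values below are a made-up toy sample, so the printed rate says nothing about the real data.

```python
import pandas as pd

# Toy sample of loan statuses (illustrative only).
status = pd.Series(
    ["Fully Paid", "Current", "Charged Off", "Late (31-120 days)", "Fully Paid"]
)

# bad = 0 for the two well-paid statuses, 1 for everything else.
good_statuses = ["Fully Paid", "Current"]
bad = (~status.isin(good_statuses)).astype(int)

# Fraction of loans labeled bad; useful later as a reference for accuracy.
bad_rate = bad.mean()

print(bad.tolist())
print(bad_rate)
```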

Now, we can construct a binary logistic regression model to predict the probability of a loan being classified as “bad” based on independent variables such as the borrower’s income, the loan amount, the loan purpose, the term, and the grade.

[15]:
y = df['bad'].to_numpy()

Scale ‘log_inc’ and ‘log_loan_amnt’ by their maximum values so that they are on a range comparable to the 0/1 dummy variables.

[16]:
df['log_inc'] /= df['log_inc'].max()
df['log_loan_amnt'] /= df['log_loan_amnt'].max()
[17]:
X = df[['log_inc','log_loan_amnt','purpose_1','purpose_2','purpose_3','purpose_4','purpose_5','purpose_6','purpose_7','purpose_8','purpose_9','purpose_10','purpose_11','purpose_12','purpose_13','term_1','grade_1','grade_2','grade_3','grade_4','grade_5','grade_6']].to_numpy()
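
Because the dummy columns can be stored as booleans (see the df.info() output above), it may be worth checking the dtype of the array that to_numpy() returns. The sketch below, on a made-up two-column frame, requests a float array explicitly; whether the SDK requires this is an assumption on my part, not something stated in this tutorial.

```python
import numpy as np
import pandas as pd

# Toy frame mixing a float feature with a boolean dummy column.
toy = pd.DataFrame({
    "log_inc": [0.91, 0.88, 0.95],
    "term_1": [True, False, True],
})

# Default conversion: pandas picks a common dtype for the mixed columns.
mixed = toy[["log_inc", "term_1"]].to_numpy()

# Explicit conversion: booleans become 0.0 / 1.0 in a float64 array.
as_float = toy[["log_inc", "term_1"]].to_numpy(dtype=float)

print(mixed.dtype)
print(as_float.dtype)
```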

Split the data into a training set and a test set, which we will use for inference.

[18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)
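
To see what stratify=y does, the sketch below runs the same call on made-up, imbalanced labels and compares the class shares of the two parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 20 positives out of 100 (made-up numbers).
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
)

# With stratification, both parts keep the 20% share of positives.
print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```

Without stratification, a small test set could by chance contain far fewer or far more bad loans than the training set, which would distort the accuracy estimate.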

Let’s import HEaaN-SDK to conduct logistic regression.

[19]:
import heaan_sdk

context = heaan_sdk.Context(
    parameter=heaan_sdk.HEParameter.from_preset("FGb"),
    key_dir_path="./keys",
    load_keys="all",
    generate_keys=True,
)
HEaaN-SDK uses CUDA v11.7 (> v11.2)

Let us set the hyperparameters first and then encode and encrypt the training data.

[20]:
num_epoch = 10
learning_rate = 1.0
batch_size = 1024
optimizer = "sgd"
lr_scheduler = "constant"
activation = "sigmoid_wide"
classes = sorted([val for val in df['bad'].unique()])
num_feature = len(X[0])
unit_shape = (batch_size, context.num_slots // batch_size)
[21]:
train_data = heaan_sdk.ml.preprocessing.encode_train_data(context, X_train, y_train, unit_shape, dtype="classification", path="./training")
train_data.encrypt()

Now, create the model and encrypt it.

[22]:
model = heaan_sdk.ml.LogisticRegression(context, unit_shape, num_feature, classes, path="./model")
model.encrypt()

If a GPU is available, send the model to the GPU.

[23]:
model.to_device()

To train the model on the encrypted data, call its fit() method.

[24]:
model.fit(
    train_data,
    lr=learning_rate,
    num_epoch=num_epoch,
    batch_size=batch_size,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    activation=activation,
)
Epoch 0: 100%|██████████| 24/24 [00:45<00:00,  1.91s/it]
Epoch 1: 100%|██████████| 24/24 [00:37<00:00,  1.54s/it]
Epoch 2: 100%|██████████| 24/24 [00:37<00:00,  1.54s/it]
Epoch 3: 100%|██████████| 24/24 [00:36<00:00,  1.52s/it]
Epoch 4: 100%|██████████| 24/24 [00:36<00:00,  1.51s/it]
Epoch 5: 100%|██████████| 24/24 [00:36<00:00,  1.54s/it]
Epoch 6: 100%|██████████| 24/24 [00:36<00:00,  1.52s/it]
Epoch 7: 100%|██████████| 24/24 [00:35<00:00,  1.50s/it]
Epoch 8: 100%|██████████| 24/24 [00:36<00:00,  1.50s/it]
Epoch 9: 100%|██████████| 24/24 [00:36<00:00,  1.50s/it]
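
For intuition about what such a training loop computes, here is a plaintext NumPy sketch of mini-batch SGD for logistic regression with the same learning rate, epoch count, and batch size. It is not the HEaaN-SDK implementation: it runs on unencrypted synthetic data, uses the exact sigmoid rather than the SDK’s "sigmoid_wide" option, and the function name fit_logreg_sgd is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_sgd(X, y, lr=1.0, num_epoch=10, batch_size=1024, seed=0):
    """Plaintext mini-batch SGD for logistic regression (intercept stored last)."""
    rng = np.random.default_rng(seed)
    # Append a constant column so the intercept is learned like any other weight.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    theta = np.zeros(Xb.shape[1])
    for _ in range(num_epoch):
        order = rng.permutation(len(Xb))  # reshuffle rows every epoch
        for start in range(0, len(Xb), batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the mean log-loss over the current mini-batch.
            grad = Xb[idx].T @ (sigmoid(Xb[idx] @ theta) - y[idx]) / len(idx)
            theta -= lr * grad
    return theta

# Synthetic data drawn from a known logistic model (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 3))
true_theta = np.array([1.5, -2.0, 0.5, -1.0])  # last entry is the intercept
y = (rng.random(4096) < sigmoid(X @ true_theta[:3] + true_theta[3])).astype(float)

theta = fit_logreg_sgd(X, y)
scores = np.hstack([X, np.ones((len(X), 1))]) @ theta
train_acc = ((sigmoid(scores) > 0.5) == y).mean()

print(theta)
print(train_acc)
```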

Now the training is over. Let’s decrypt the trained model and look at it.

[25]:
model.to_host()
model.decrypt()
[26]:
model
[26]:
========== model description ==========
path: model
epoch_state: 10
theta: [[-3.12874321  2.01474527 -0.04004456  0.01228742 -0.22393505  0.08485522
   0.04346048  0.19938321 -0.06895808  0.25302475  0.0347872  -0.13131598
   0.24441581  0.34844682  0.19362228 -0.014064    0.85363718  1.32182948
   1.77107065  2.25124134  2.70321801  2.69220521 -2.49021181]]
=======================================
[27]:
model.to_dataframe()
[27]:
0 1 2 3 4 5 6 7 8 9 ... 13 14 15 16 17 18 19 20 21 22
0 -3.128743 2.014745 -0.040045 0.012287 -0.223935 0.084855 0.04346 0.199383 -0.068958 0.253025 ... 0.348447 0.193622 -0.014064 0.853637 1.321829 1.771071 2.251241 2.703218 2.692205 -2.490212
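
To read the coefficients, it helps to pair them with the feature names used to build X. The sketch below assumes that the 23rd value of theta is the intercept and that the first 22 values follow the column order of X. That layout is my assumption, not something the output states, although the steadily increasing values at the positions that would correspond to grade_1 through grade_6 are consistent with it.

```python
import numpy as np

# Feature names in the same order as the columns selected for X.
feature_names = (
    ["log_inc", "log_loan_amnt"]
    + [f"purpose_{i}" for i in range(1, 14)]
    + ["term_1"]
    + [f"grade_{i}" for i in range(1, 7)]
)

# Values copied from the printed model description above.
theta = np.array([
    -3.12874321, 2.01474527, -0.04004456, 0.01228742, -0.22393505, 0.08485522,
    0.04346048, 0.19938321, -0.06895808, 0.25302475, 0.0347872, -0.13131598,
    0.24441581, 0.34844682, 0.19362228, -0.014064, 0.85363718, 1.32182948,
    1.77107065, 2.25124134, 2.70321801, 2.69220521, -2.49021181,
])

# ASSUMPTION: the last entry is the intercept; the rest map onto the features.
coefs = dict(zip(feature_names, theta[:-1]))
intercept = theta[-1]

for name, value in coefs.items():
    print(f"{name:>14s} {value: .4f}")
print(f"{'intercept':>14s} {intercept: .4f}")
```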

1 rows × 23 columns

For inference, encode and encrypt the test features.

[28]:
test_data_feature = heaan_sdk.HEMatrix.encode_encrypt(context, X_test, unit_shape)

If a GPU is available, send the test data to the GPU.

[29]:
test_data_feature.to_device()
[29]:
HEMatrix([
  HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/0",
    block_list: [
      Ciphertext(log(num_slot): 15, device: GPU, level: 12)
    ]),
  HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/1",
    block_list: [
      Ciphertext(log(num_slot): 15, device: GPU, level: 12)
    ]),
  HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/2",
    block_list: [
      Ciphertext(log(num_slot): 15, device: GPU, level: 12)
    ]),
  HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/3",
    block_list: [
      Ciphertext(log(num_slot): 15, device: GPU, level: 12)
    ]),
  HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/4",
    block_list: [
      Ciphertext(log(num_slot): 15, device: GPU, level: 12)
    ]),
  HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/5",
    block_list: [
      Ciphertext(log(num_slot): 15, device: GPU, level: 12)
    ])
])
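
The sizes seen in the outputs can be checked with a little arithmetic. The sketch below assumes that each encrypted block holds batch_size rows; this is my reading of unit_shape, not a documented fact about the SDK. Under that assumption it reproduces the 24 iterations per epoch in the training log and the 6 HESubMatrix entries printed above.

```python
import math

n_rows = 30467        # rows left after dropna(), from the second df.info()
test_size = 0.2
batch_size = 1024
num_slots = 2 ** 15   # from "log(num_slot): 15" in the printed ciphertexts

# scikit-learn rounds the test share up; the remainder is the training set.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# ASSUMPTION: one block (and one training step) covers batch_size rows.
train_batches = math.ceil(n_train / batch_size)
test_blocks = math.ceil(n_test / batch_size)

# Same expression as in the hyperparameter cell.
unit_shape = (batch_size, num_slots // batch_size)

print(n_train, n_test)
print(train_batches, test_blocks)
print(unit_shape)
```

With 32 columns per unit, the 22 features plus one extra coefficient fit into a single block, which is consistent with each HESubMatrix holding exactly one ciphertext.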

To run inference on the encrypted test data, call the model’s predict() method.

[30]:
output_binary = model.predict(test_data_feature)

Let’s decrypt the predictions and evaluate the model’s performance.

[31]:
output_binary.to_host()
output_arr_binary = output_binary.decrypt_decode()

The decrypted output contains raw scores (logits), that is, values before conversion to probabilities, so we apply the sigmoid function and then threshold at 0.5.

[32]:
threshold = 0.5
probs = 1 / (1 + np.exp(-output_arr_binary))
probs = probs.squeeze()
preds = probs > threshold
correct_cnt = (preds == y_test).sum()
acc = correct_cnt / len(y_test)
print(f"Test accuracy: {acc * 100: .2f}%")
Test accuracy:  86.76%
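
Accuracy alone can be misleading when bad loans are a minority, so it may be useful to also look at the confusion matrix and a majority-class baseline. The sketch below uses made-up scores and labels, not the tutorial’s results.

```python
import numpy as np

# Toy decrypted scores (logits) with shape (n, 1), and toy true labels.
logits = np.array([[-2.0], [0.3], [1.2], [-0.4], [-1.5], [0.8]])
y_true = np.array([0, 1, 1, 1, 0, 0])

probs = 1 / (1 + np.exp(-logits.squeeze()))
preds = (probs > 0.5).astype(int)

# The sigmoid is monotonic with sigmoid(0) = 0.5, so thresholding the
# probabilities at 0.5 is the same as thresholding the logits at 0.
same_as_logit_rule = np.array_equal(preds, (logits.squeeze() > 0).astype(int))

# Confusion-matrix counts.
tp = int(((preds == 1) & (y_true == 1)).sum())
fp = int(((preds == 1) & (y_true == 0)).sum())
fn = int(((preds == 0) & (y_true == 1)).sum())
tn = int(((preds == 0) & (y_true == 0)).sum())

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Accuracy of always predicting the more frequent class.
majority_baseline = max(y_true.mean(), 1 - y_true.mean())

print(tp, fp, fn, tn)
print(accuracy, precision, recall, majority_baseline)
```

On the real test set, the same counts can be computed from preds and y_test; comparing the accuracy with the majority baseline shows how much the model adds over always predicting the more common class.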

According to this result, our model reaches 86.76% accuracy on the out-of-sample test set. In other words, it correctly classifies whether a loan turns out to be bad for about 86.76% of the test observations.