Logistic regression using HEaaN.ml
In this documentation, we introduce a default detection model using the Lending Club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1).
1. Data import
We import Lending Club data extracted from the original set downloaded from Kaggle. In this exercise, we use only 32,768 observations and a selection of sensitive features. Let’s import the provided data.
[2]:
import pandas as pd
import numpy as np
[3]:
df = pd.read_csv('lc.csv')
[4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32768 entries, 0 to 32767
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_amnt 32768 non-null float64
1 annual_inc 32768 non-null float64
2 term 32768 non-null object
3 int_rate 32768 non-null object
4 installment 32768 non-null float64
5 grade 32768 non-null object
6 sub_grade 32768 non-null object
7 purpose 32768 non-null object
8 loan_status 32768 non-null object
9 dti 32741 non-null float64
10 last_fico_range_high 32768 non-null float64
11 last_fico_range_low 32768 non-null float64
12 total_acc 32768 non-null float64
13 delinq_2yrs 32768 non-null float64
14 emp_length 30468 non-null object
dtypes: float64(8), object(7)
memory usage: 3.8+ MB
2. Data preprocessing
Before conducting the analysis, let’s perform some data preprocessing. First, let’s convert int_rate from a percentage string into a float variable.
[5]:
df['int_rate'] = df['int_rate'].apply(lambda x: float(x.strip('%')) / 100)
There are several ways to handle missing observations. In this example, we simply drop the rows that contain them.
[6]:
df = df.dropna()
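If you would rather keep those rows, imputation is a common alternative. A minimal sketch (it would replace the dropna() call above; df_imputed is a hypothetical name):
[ ]:
# Alternative to dropna(): fill missing values instead of dropping rows.
# 'dti' is numeric, so the median is a reasonable fill value;
# 'emp_length' is categorical, so we fall back to the most frequent value.
df_imputed = df.copy()
df_imputed['dti'] = df_imputed['dti'].fillna(df_imputed['dti'].median())
df_imputed['emp_length'] = df_imputed['emp_length'].fillna(df_imputed['emp_length'].mode()[0])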
We can apply a log transformation to the 'annual_inc' and 'loan_amnt' variables using numpy: np.log(df['annual_inc']) and np.log(df['loan_amnt']). Once the transformation is applied, it’s good practice to check that it succeeded by inspecting the transformed data.
[7]:
df['log_inc'] = np.log(df['annual_inc'])
df['log_loan_amnt'] = np.log(df['loan_amnt'])
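As suggested above, it is worth inspecting the transformed columns; a quick check:
[ ]:
# Sanity check: both log-transformed columns should be finite and
# roughly comparable in scale.
df[['log_inc', 'log_loan_amnt']].describe()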
One useful variable transformation technique is one-hot encoding. We have several categorical variables, such as term, grade, sub_grade, and purpose of the loan. For example, let’s transform the purpose of the loan into one-hot encoded variables. Note that one of the one-hot encoded variables must be omitted in the analysis to avoid exact collinearity.
[8]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
[9]:
df['purpose_trans'] = le.fit_transform(df['purpose'])
dummies = pd.get_dummies(df['purpose_trans'], prefix='purpose')
df = pd.concat([df, dummies], axis=1)
The remaining cells apply the same transformation to the term and grade variables.
[10]:
df['term_trans'] = le.fit_transform(df['term'])
dummies_term = pd.get_dummies(df['term_trans'], prefix='term')
df = pd.concat([df, dummies_term], axis=1)
[11]:
df['grade_trans'] = le.fit_transform(df['grade'])
dummies_grade = pd.get_dummies(df['grade_trans'], prefix='grade')
df = pd.concat([df, dummies_grade], axis=1)
You can drop variables for various reasons. Here, to avoid the collinearity problem in the regression analysis, we omit one category from each set of one-hot encoded variables.
[12]:
df = df.drop(['purpose_0'], axis=1)
df = df.drop(['term_0'], axis=1)
df = df.drop(['grade_0'], axis=1)
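As a shortcut, pd.get_dummies can drop the baseline category during encoding itself; a sketch that is equivalent in effect to the manual drops above (the column names will differ, since it encodes the raw strings directly):
[ ]:
# One-step alternative: encode and drop the first category at once,
# instead of removing 'purpose_0', 'term_0', 'grade_0' afterwards.
dummies_all = pd.get_dummies(df[['purpose', 'term', 'grade']], drop_first=True)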
3. Data analysis
We will use the features defined in the section above, including all of the one-hot encoded variables.
[13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 30467 entries, 0 to 32767
Data columns (total 40 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_amnt 30467 non-null float64
1 annual_inc 30467 non-null float64
2 term 30467 non-null object
3 int_rate 30467 non-null float64
4 installment 30467 non-null float64
5 grade 30467 non-null object
6 sub_grade 30467 non-null object
7 purpose 30467 non-null object
8 loan_status 30467 non-null object
9 dti 30467 non-null float64
10 last_fico_range_high 30467 non-null float64
11 last_fico_range_low 30467 non-null float64
12 total_acc 30467 non-null float64
13 delinq_2yrs 30467 non-null float64
14 emp_length 30467 non-null object
15 log_inc 30467 non-null float64
16 log_loan_amnt 30467 non-null float64
17 purpose_trans 30467 non-null int64
18 purpose_1 30467 non-null bool
19 purpose_2 30467 non-null bool
20 purpose_3 30467 non-null bool
21 purpose_4 30467 non-null bool
22 purpose_5 30467 non-null bool
23 purpose_6 30467 non-null bool
24 purpose_7 30467 non-null bool
25 purpose_8 30467 non-null bool
26 purpose_9 30467 non-null bool
27 purpose_10 30467 non-null bool
28 purpose_11 30467 non-null bool
29 purpose_12 30467 non-null bool
30 purpose_13 30467 non-null bool
31 term_trans 30467 non-null int64
32 term_1 30467 non-null bool
33 grade_trans 30467 non-null int64
34 grade_1 30467 non-null bool
35 grade_2 30467 non-null bool
36 grade_3 30467 non-null bool
37 grade_4 30467 non-null bool
38 grade_5 30467 non-null bool
39 grade_6 30467 non-null bool
dtypes: bool(20), float64(11), int64(3), object(6)
memory usage: 5.5+ MB
Since our objective is to predict whether a loan is well paid or not, let’s create a binary variable called ‘bad’: bad=0 if loan_status is either ‘Fully Paid’ or ‘Current’, and bad=1 otherwise.
[14]:
df['bad'] = df['loan_status'].apply(lambda x: 0 if x == 'Fully Paid' or x == 'Current' else 1)
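It is also worth checking how balanced the new label is; a quick check:
[ ]:
# Share of good (0) vs. bad (1) loans; a strong imbalance would make
# accuracy alone a misleading performance metric.
df['bad'].value_counts(normalize=True)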
Now we can construct a binary logistic regression model that predicts the probability of a loan being classified as “bad” from the independent variables prepared above: log income, log loan amount, and the one-hot encoded purpose, term, and grade.
[15]:
y = df['bad'].to_numpy()
Normalize ‘log_inc’ and ‘log_loan_amnt’ so that all features lie on a comparable scale and no single feature dominates the weights.
[16]:
df['log_inc'] /= df['log_inc'].max()
df['log_loan_amnt'] /= df['log_loan_amnt'].max()
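After scaling, the maximum of each column should be exactly 1.0, which you can verify:
[ ]:
# Verify the normalization.
df[['log_inc', 'log_loan_amnt']].max()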
[17]:
X = df[['log_inc','log_loan_amnt','purpose_1','purpose_2','purpose_3','purpose_4','purpose_5','purpose_6','purpose_7','purpose_8','purpose_9','purpose_10','purpose_11','purpose_12','purpose_13','term_1','grade_1','grade_2','grade_3','grade_4','grade_5','grade_6']].to_numpy()
Divide the data into a training set and an inference set.
[18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)
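Since we passed stratify=y, the share of bad loans should be (almost) identical in the two splits; a quick check:
[ ]:
# Stratified splitting preserves the class ratio in both sets.
print(y_train.mean(), y_test.mean())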
Let’s import HEaaN-SDK to conduct logistic regression.
[19]:
import heaan_sdk

context = heaan_sdk.Context(
    parameter=heaan_sdk.HEParameter.from_preset("FGb"),
    key_dir_path="./keys",
    load_keys="all",
    generate_keys=True,
)
HEaaN-SDK uses CUDA v11.7 (> v11.2)
Let us set the hyperparameters first and then encrypt the training data.
[20]:
num_epoch = 10
learning_rate = 1.0
batch_size = 1024
optimizer = "sgd"
lr_scheduler = "constant"
activation = "sigmoid_wide"
classes = sorted([val for val in df['bad'].unique()])  # binary labels: [0, 1]
num_feature = len(X[0])
# One encrypted unit packs batch_size rows across the available ciphertext slots.
unit_shape = (batch_size, context.num_slots // batch_size)
[21]:
train_data = heaan_sdk.ml.preprocessing.encode_train_data(context, X_train, y_train, unit_shape, dtype="classification", path="./training")
train_data.encrypt()
Now, create the model and encrypt it.
[22]:
model = heaan_sdk.ml.LogisticRegression(context, unit_shape, num_feature, classes, path="./model")
model.encrypt()
If a GPU is available, send the model to the GPU.
[23]:
model.to_device()
To train the model on the encrypted data, use its fit() method.
[24]:
model.fit(
    train_data,
    lr=learning_rate,
    num_epoch=num_epoch,
    batch_size=batch_size,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    activation=activation,
)
Epoch 0: 100%|██████████| 24/24 [00:45<00:00, 1.91s/it]
Epoch 1: 100%|██████████| 24/24 [00:37<00:00, 1.54s/it]
Epoch 2: 100%|██████████| 24/24 [00:37<00:00, 1.54s/it]
Epoch 3: 100%|██████████| 24/24 [00:36<00:00, 1.52s/it]
Epoch 4: 100%|██████████| 24/24 [00:36<00:00, 1.51s/it]
Epoch 5: 100%|██████████| 24/24 [00:36<00:00, 1.54s/it]
Epoch 6: 100%|██████████| 24/24 [00:36<00:00, 1.52s/it]
Epoch 7: 100%|██████████| 24/24 [00:35<00:00, 1.50s/it]
Epoch 8: 100%|██████████| 24/24 [00:36<00:00, 1.50s/it]
Epoch 9: 100%|██████████| 24/24 [00:36<00:00, 1.50s/it]
Now the training is over. Let’s decrypt the trained model and look at it.
[25]:
model.to_host()
model.decrypt()
[26]:
model
[26]:
========== model description ==========
path: model
epoch_state: 10
theta: [[-3.12874321 2.01474527 -0.04004456 0.01228742 -0.22393505 0.08485522
0.04346048 0.19938321 -0.06895808 0.25302475 0.0347872 -0.13131598
0.24441581 0.34844682 0.19362228 -0.014064 0.85363718 1.32182948
1.77107065 2.25124134 2.70321801 2.69220521 -2.49021181]]
=======================================
[27]:
model.to_dataframe()
[27]:
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | -3.128743 | 2.014745 | -0.040045 | 0.012287 | -0.223935 | 0.084855 | 0.04346 | 0.199383 | -0.068958 | 0.253025 | ... | 0.348447 | 0.193622 | -0.014064 | 0.853637 | 1.321829 | 1.771071 | 2.251241 | 2.703218 | 2.692205 | -2.490212 |

1 rows × 23 columns
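To make the coefficients easier to read, you can label them with the feature names. A sketch, assuming the last of the 23 entries is the bias term; check the HEaaN-SDK documentation for the actual layout:
[ ]:
# Hypothetical labelling of the decrypted coefficients.
# ASSUMPTION: the final entry of theta is the bias term.
feature_names = (['log_inc', 'log_loan_amnt']
                 + [f'purpose_{i}' for i in range(1, 14)]
                 + ['term_1']
                 + [f'grade_{i}' for i in range(1, 7)])
theta = model.to_dataframe().to_numpy().ravel()
print(pd.Series(theta[:len(feature_names)], index=feature_names))
print('bias (assumed):', theta[-1])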
For inference, encrypt the inference data.
[28]:
test_data_feature = heaan_sdk.HEMatrix.encode_encrypt(context, X_test, unit_shape)
If a GPU is available, send the test data to the GPU.
[29]:
test_data_feature.to_device()
[29]:
HEMatrix([
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/0",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/1",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/2",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/3",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/4",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
]),
HESubMatrix(path: "/home/songhpark/.heaan_sdk/16843167795.65194/5",
block_list: [
Ciphertext(log(num_slot): 15, device: GPU, level: 12)
])
])
To run inference on the encrypted data, use the model’s predict() method.
[30]:
output_binary = model.predict(test_data_feature)
Let’s decrypt the inference output and evaluate the model’s performance.
[31]:
output_binary.to_host()
output_arr_binary = output_binary.decrypt_decode()
The output consists of raw scores before conversion to probabilities, so we apply the sigmoid and a decision threshold ourselves.
[32]:
threshold = 0.5
probs = 1 / (1 + np.exp(-output_arr_binary))
probs = probs.squeeze()
preds = probs > threshold
correct_cnt = (preds == y_test).sum()
acc = correct_cnt / len(y_test)
print(f"Test accuracy: {acc * 100: .2f}%")
Test accuracy: 86.76%
According to this result, the out-of-sample test of our model reaches 86.76% accuracy; that is, it correctly predicts whether a borrower will cause a bad transaction about 86.76% of the time.
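Accuracy alone can be misleading when the classes are imbalanced; a sketch of complementary metrics using scikit-learn (already used above for the train/test split):
[ ]:
# AUC uses the predicted probabilities directly; the confusion matrix
# breaks the errors down by class.
from sklearn.metrics import roc_auc_score, confusion_matrix
print('AUC:', roc_auc_score(y_test, probs))
print(confusion_matrix(y_test, preds))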