{ "cells": [ { "cell_type": "markdown", "id": "d421f136", "metadata": { "id": "d421f136" }, "source": [ "# Logistic regression using HEaaN.ml\n", "\n", "In this documentation, we introduce default detection model using lending club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1)." ] }, { "cell_type": "markdown", "id": "164febb8", "metadata": { "id": "164febb8" }, "source": [ "## 1. Data import" ] }, { "cell_type": "markdown", "id": "603a72c4", "metadata": { "id": "603a72c4" }, "source": [ "We import lending club data extracted from original set downloaded from Kaggle. In this exercise, we only use 32,768 observations and selected some sensitive features. Let's import the provided data." ] }, { "cell_type": "code", "execution_count": 1, "id": "e35a2025", "metadata": { "id": "e35a2025" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "id": "cfa9b3ac", "metadata": { "id": "cfa9b3ac", "outputId": "ef0f6120-9aab-4e69-ec37-09a1a8869006" }, "outputs": [], "source": [ "df = pd.read_csv('lc.csv')" ] }, { "cell_type": "code", "execution_count": 3, "id": "3f183ae0", "metadata": { "id": "3f183ae0", "outputId": "4c7d8625-00ae-48b4-cfdf-f2f1ac76001e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 32768 entries, 0 to 32767\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 loan_amnt 32768 non-null float64\n", " 1 annual_inc 32768 non-null float64\n", " 2 term 32768 non-null object \n", " 3 int_rate 32768 non-null object \n", " 4 installment 32768 non-null float64\n", " 5 grade 32768 non-null object \n", " 6 sub_grade 32768 non-null object \n", " 7 purpose 32768 non-null object \n", " 8 loan_status 32768 non-null object \n", " 9 dti 32743 non-null float64\n", " 10 last_fico_range_high 32768 non-null float64\n", " 11 last_fico_range_low 32768 non-null float64\n", " 12 total_acc 32768 non-null float64\n", " 13 delinq_2yrs 32768 non-null float64\n", " 14 emp_length 30518 non-null object \n", "dtypes: float64(8), object(7)\n", "memory usage: 3.8+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "-e7_kVk9I1QN", "metadata": { "id": "-e7_kVk9I1QN" }, "source": [ "## 2. Data preprocessing" ] }, { "cell_type": "markdown", "id": "ENNr6eY7DCFY", "metadata": { "id": "ENNr6eY7DCFY" }, "source": [ "Before conducting the analysis, let's perform some data preprocessing. First, Let's transform int_rate as float variable." ] }, { "cell_type": "code", "execution_count": 4, "id": "hIeCWEIiDCaK", "metadata": { "id": "hIeCWEIiDCaK" }, "outputs": [], "source": [ "df['int_rate'] = df['int_rate'].apply(lambda x: float(x.strip('%')) / 100)" ] }, { "cell_type": "markdown", "id": "67ddb4d6", "metadata": { "id": "67ddb4d6" }, "source": [ "There are several ways to deal with missing observations in variables. In this example, we will drop the missing observations." ] }, { "cell_type": "code", "execution_count": 5, "id": "8a722e13", "metadata": { "id": "8a722e13" }, "outputs": [], "source": [ "df = df.dropna()" ] }, { "cell_type": "markdown", "id": "c63b7bae", "metadata": { "id": "c63b7bae" }, "source": [ "We can apply a log transformation to the 'annual_inc' and 'loan_amnt' variables using either the numpy or math modules in Python. To apply the transformation, you can use the following code: np.log(df\\['annual_inc'\\]) and np.log(df\\['loan_amnt'\\]). 
 { "cell_type": "markdown", "id": "c63b7bae", "metadata": { "id": "c63b7bae" }, "source": [ "We can apply a log transformation to the 'annual_inc' and 'loan_amnt' variables using either the numpy or math modules in Python, e.g. np.log(df\['annual_inc'\]) and np.log(df\['loan_amnt'\]). Once the transformation is applied, it is good practice to check that it succeeded by inspecting the transformed data." ] },
 { "cell_type": "code", "execution_count": 6, "id": "277cc9e2", "metadata": { "id": "277cc9e2", "outputId": "1b39f538-aa01-4f22-b752-18d22e1752d7" }, "outputs": [], "source": [ "df['log_inc'] = np.log(df['annual_inc'])\n", "df['log_loan_amnt'] = np.log(df['loan_amnt'])" ] },
 { "cell_type": "markdown", "id": "c9833b97", "metadata": { "id": "c9833b97" }, "source": [ "One useful variable transformation technique is one-hot encoding. We have several categorical variables, such as term, grade, sub_grade, and purpose of the loan. For example, let's transform the purpose of the loan into one-hot encoded variables. Note that one of the one-hot encoded variables is omitted from the analysis because of exact collinearity." ] },
 { "cell_type": "code", "execution_count": 7, "id": "_XNFQfx8E-ka", "metadata": { "id": "_XNFQfx8E-ka" }, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "le = LabelEncoder()" ] },
 { "cell_type": "code", "execution_count": 8, "id": "dab5ea69", "metadata": { "id": "dab5ea69", "outputId": "14aaa3ce-070e-4f71-9051-c48313965e40" }, "outputs": [], "source": [ "df['purpose_trans'] = le.fit_transform(df['purpose'])\n", "dummies = pd.get_dummies(df['purpose_trans'], prefix='purpose')\n", "df = pd.concat([df, dummies], axis=1)" ] },
 { "cell_type": "markdown", "id": "0e39cc2e", "metadata": { "id": "0e39cc2e" }, "source": [ "The following cells apply the same transformation to the remaining categorical variables." ] },
 { "cell_type": "code", "execution_count": 9, "id": "affdb112", "metadata": { "id": "affdb112", "outputId": "a5c3929a-8f69-4546-e7e8-1a3e3387fe18" }, "outputs": [], "source": [ "df['term_trans'] = le.fit_transform(df['term'])\n", "dummies_term = pd.get_dummies(df['term_trans'], prefix='term')\n", "df = pd.concat([df, dummies_term], axis=1)" ] },
 { "cell_type": "code", "execution_count": 10, "id": "95fe48d3", "metadata": { "id": "95fe48d3", "outputId": "9d4a39e5-78a1-4c23-ab6f-e76e793df9bf" }, "outputs": [], "source": [ "df['grade_trans'] = le.fit_transform(df['grade'])\n", "dummies_grade = pd.get_dummies(df['grade_trans'], prefix='grade')\n", "df = pd.concat([df, dummies_grade], axis=1)" ] },
 { "cell_type": "markdown", "id": "efc10083", "metadata": { "id": "efc10083" }, "source": [ "You can delete variables for several reasons. For example, to avoid the collinearity problem in the regression analysis, we omit one of the one-hot encoded categories. (A one-step alternative is sketched after the next cell.)" ] },
 { "cell_type": "code", "execution_count": 11, "id": "e23c7e12", "metadata": { "id": "e23c7e12" }, "outputs": [], "source": [ "df = df.drop(['purpose_0'], axis=1)\n", "df = df.drop(['term_0'], axis=1)\n", "df = df.drop(['grade_0'], axis=1)" ] },
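 { "cell_type": "markdown", "id": "onehot-note", "metadata": {}, "source": [ "As a side note, the LabelEncoder step, pd.get_dummies, and the manual column drops can usually be collapsed into a single call. A minimal sketch (`df_onehot` is a hypothetical name, not used below):" ] },
 { "cell_type": "code", "execution_count": null, "id": "onehot-code", "metadata": {}, "outputs": [], "source": [ "# Illustrative one-step alternative: encode and drop the first category at once,\n", "# which removes the exactly collinear column directly.\n", "df_onehot = pd.get_dummies(df, columns=['purpose', 'term', 'grade'], drop_first=True)" ] },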
] }, { "cell_type": "code", "execution_count": 12, "id": "c73050ef", "metadata": { "id": "c73050ef", "outputId": "7e74949b-2ffc-4166-8342-80cc2854d1bb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 30516 entries, 0 to 32767\n", "Data columns (total 40 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 loan_amnt 30516 non-null float64\n", " 1 annual_inc 30516 non-null float64\n", " 2 term 30516 non-null object \n", " 3 int_rate 30516 non-null float64\n", " 4 installment 30516 non-null float64\n", " 5 grade 30516 non-null object \n", " 6 sub_grade 30516 non-null object \n", " 7 purpose 30516 non-null object \n", " 8 loan_status 30516 non-null object \n", " 9 dti 30516 non-null float64\n", " 10 last_fico_range_high 30516 non-null float64\n", " 11 last_fico_range_low 30516 non-null float64\n", " 12 total_acc 30516 non-null float64\n", " 13 delinq_2yrs 30516 non-null float64\n", " 14 emp_length 30516 non-null object \n", " 15 log_inc 30516 non-null float64\n", " 16 log_loan_amnt 30516 non-null float64\n", " 17 purpose_trans 30516 non-null int64 \n", " 18 purpose_1 30516 non-null uint8 \n", " 19 purpose_2 30516 non-null uint8 \n", " 20 purpose_3 30516 non-null uint8 \n", " 21 purpose_4 30516 non-null uint8 \n", " 22 purpose_5 30516 non-null uint8 \n", " 23 purpose_6 30516 non-null uint8 \n", " 24 purpose_7 30516 non-null uint8 \n", " 25 purpose_8 30516 non-null uint8 \n", " 26 purpose_9 30516 non-null uint8 \n", " 27 purpose_10 30516 non-null uint8 \n", " 28 purpose_11 30516 non-null uint8 \n", " 29 purpose_12 30516 non-null uint8 \n", " 30 purpose_13 30516 non-null uint8 \n", " 31 term_trans 30516 non-null int64 \n", " 32 term_1 30516 non-null uint8 \n", " 33 grade_trans 30516 non-null int64 \n", " 34 grade_1 30516 non-null uint8 \n", " 35 grade_2 30516 non-null uint8 \n", " 36 grade_3 30516 non-null uint8 \n", " 37 grade_4 30516 non-null uint8 \n", " 38 grade_5 30516 non-null uint8 \n", " 39 grade_6 30516 non-null uint8 \n", "dtypes: float64(11), int64(3), object(6), uint8(20)\n", "memory usage: 5.5+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "Bz2_gpRpDLCy", "metadata": { "id": "Bz2_gpRpDLCy" }, "source": [ "Since our objective is to predict whether a loan is well-paid or not, let's create a binary variable called 'bad'. We can define 'bad=0' if the loan_status is either 'Fully Paid' or 'Current', and 'bad=1' if the loan_status is not well-paid." ] }, { "cell_type": "code", "execution_count": 13, "id": "20d7d359", "metadata": { "id": "20d7d359" }, "outputs": [], "source": [ "df['bad'] = df['loan_status'].apply(lambda x: 0 if x == 'Fully Paid' or x == 'Current' else 1)" ] }, { "cell_type": "markdown", "id": "0e22cf69", "metadata": { "id": "0e22cf69" }, "source": [ "Now, we can construct a binary logistic regression model to predict the probability of a loan transaction being classified as \"bad\" based on various independent variables such as borrower credit score, loan amount, debt-to-income ratio, etc." 
] }, { "cell_type": "code", "execution_count": 14, "id": "332d6910", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 167 }, "id": "332d6910", "outputId": "7f1d912c-1895-4e13-d183-67ffa35bae10" }, "outputs": [], "source": [ "y = df['bad'].to_numpy()" ] }, { "cell_type": "markdown", "id": "abcf7f77-bca3-4d73-bbf1-8e5b5a0c4d1e", "metadata": {}, "source": [ "Normalize 'log_inc' and 'log_loan_amnt' to ensure that the weight of each feature is evenly distributed." ] }, { "cell_type": "code", "execution_count": 15, "id": "61defc27-aa32-4e4c-87a2-d27bc10388fd", "metadata": {}, "outputs": [], "source": [ "df['log_inc'] /= df['log_inc'].max()\n", "df['log_loan_amnt'] /= df['log_loan_amnt'].max()" ] }, { "cell_type": "code", "execution_count": 16, "id": "8323649a", "metadata": { "id": "8323649a" }, "outputs": [], "source": [ "X = df[['log_inc','log_loan_amnt','purpose_1','purpose_2','purpose_3','purpose_4','purpose_5','purpose_6','purpose_7','purpose_8','purpose_9','purpose_10','purpose_11','purpose_12','purpose_13','term_1','grade_1','grade_2','grade_3','grade_4','grade_5','grade_6']].to_numpy()" ] }, { "cell_type": "markdown", "id": "6ipe9bl1KnAX", "metadata": { "id": "6ipe9bl1KnAX" }, "source": [ "Divide data into training data and inference data." ] }, { "cell_type": "code", "execution_count": 17, "id": "9nasIJUjKuu2", "metadata": { "id": "9nasIJUjKuu2" }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)" ] }, { "cell_type": "markdown", "id": "e1ab62f0", "metadata": { "id": "e1ab62f0" }, "source": [ "Let's import HEaaN-SDK to conduct logistic regression." ] }, { "cell_type": "code", "execution_count": 18, "id": "ZbS10J32Fiou", "metadata": { "id": "ZbS10J32Fiou" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HEaaN-SDK uses CUDA v11.5 (> v11.2)\n" ] } ], "source": [ "import heaan_sdk\n", "\n", "context = heaan_sdk.Context(\n", " parameter=heaan_sdk.HEParameter.from_preset(\"FGb\"),\n", " key_dir_path=\"./keys\",\n", " load_keys=\"all\",\n", " generate_keys=True,\n", ")" ] }, { "cell_type": "markdown", "id": "fc3deebc", "metadata": {}, "source": [ "Let us set hyperparameters first and then encrypt train data." ] }, { "cell_type": "code", "execution_count": 19, "id": "mVr0RGUk07nT", "metadata": { "id": "mVr0RGUk07nT" }, "outputs": [], "source": [ "num_epoch = 10\n", "learning_rate = 1.0\n", "batch_size = 1024\n", "optimizer = \"sgd\"\n", "lr_scheduler = \"constant\"\n", "activation = \"sigmoid_wide\"\n", "classes = sorted([val for val in df['bad'].unique()])\n", "num_feature = len(X[0])\n", "unit_shape = (batch_size, context.num_slots // batch_size)" ] }, { "cell_type": "code", "execution_count": 20, "id": "7ec92e24", "metadata": {}, "outputs": [], "source": [ "train_data = heaan_sdk.ml.preprocessing.encode_train_data(context, X_train, y_train, unit_shape, dtype=\"classification\", path=\"./training\")\n", "train_data.encrypt()" ] }, { "cell_type": "markdown", "id": "c8faccd2", "metadata": {}, "source": [ "Now, create model and encrypt that." 
] }, { "cell_type": "code", "execution_count": 21, "id": "ucSv6iWI1CsJ", "metadata": { "id": "ucSv6iWI1CsJ" }, "outputs": [], "source": [ "model = heaan_sdk.ml.LogisticRegression(context, unit_shape, num_feature, classes, path=\"./model\")\n", "model.encrypt()" ] }, { "cell_type": "markdown", "id": "47819a7f", "metadata": {}, "source": [ "If GPU is available, send model to GPU." ] }, { "cell_type": "code", "execution_count": 22, "id": "cf97a6c1", "metadata": {}, "outputs": [], "source": [ "model.to_device()" ] }, { "cell_type": "markdown", "id": "19d19978", "metadata": {}, "source": [ "To train data, use `fit()` of model." ] }, { "cell_type": "code", "execution_count": 23, "id": "e-BnVsg-1Sk0", "metadata": { "id": "e-BnVsg-1Sk0" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Epoch 0: 100%|██████████| 24/24 [00:42<00:00, 1.78s/it]\n", "Epoch 1: 100%|██████████| 24/24 [00:30<00:00, 1.25s/it]\n", "Epoch 2: 100%|██████████| 24/24 [00:29<00:00, 1.24s/it]\n", "Epoch 3: 100%|██████████| 24/24 [00:30<00:00, 1.25s/it]\n", "Epoch 4: 100%|██████████| 24/24 [00:29<00:00, 1.22s/it]\n", "Epoch 5: 100%|██████████| 24/24 [00:29<00:00, 1.22s/it]\n", "Epoch 6: 100%|██████████| 24/24 [00:29<00:00, 1.22s/it]\n", "Epoch 7: 100%|██████████| 24/24 [00:29<00:00, 1.22s/it]\n", "Epoch 8: 100%|██████████| 24/24 [00:29<00:00, 1.22s/it]\n", "Epoch 9: 100%|██████████| 24/24 [00:29<00:00, 1.22s/it]\n" ] } ], "source": [ "model.fit(\n", " train_data,\n", " lr=learning_rate,\n", " num_epoch=num_epoch,\n", " batch_size=batch_size,\n", " optimizer=optimizer,\n", " lr_scheduler=lr_scheduler,\n", " activation=activation,\n", ")" ] }, { "cell_type": "markdown", "id": "a2fa7539", "metadata": {}, "source": [ "Now the training is over. Let's decrypt the trained model and look at it." ] }, { "cell_type": "code", "execution_count": 24, "id": "9WLYiSusHoWq", "metadata": { "id": "9WLYiSusHoWq" }, "outputs": [], "source": [ "model.to_host()\n", "model.decrypt()" ] }, { "cell_type": "code", "execution_count": 25, "id": "_m257y3G1Zzk", "metadata": { "id": "_m257y3G1Zzk" }, "outputs": [ { "data": { "text/plain": [ "========== model description ==========\n", "path: model\n", "epoch_state: 10\n", "theta: [[-2.49783586 1.59195499 -0.2623614 -0.3699792 -0.00835056 -0.48050553\n", " -0.84698687 -0.10585963 -0.47300105 -0.09100378 -0.3519928 -0.10423002\n", " 0.08120644 -0.27560361 -0.18446809 -0.16446498 0.81136428 1.39605765\n", " 1.75279313 2.18702234 2.54425704 2.97494707 -2.3168639 ]]\n", "=======================================" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model" ] }, { "cell_type": "code", "execution_count": 26, "id": "Tcp00__RJAPX", "metadata": { "id": "Tcp00__RJAPX" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...13141516171819202122
0-2.4978361.591955-0.262361-0.369979-0.008351-0.480506-0.846987-0.10586-0.473001-0.091004...-0.275604-0.184468-0.1644650.8113641.3960581.7527932.1870222.5442572.974947-2.316864
\n", "

1 rows × 23 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 -2.497836 1.591955 -0.262361 -0.369979 -0.008351 -0.480506 -0.846987 \n", "\n", " 7 8 9 ... 13 14 15 16 \\\n", "0 -0.10586 -0.473001 -0.091004 ... -0.275604 -0.184468 -0.164465 0.811364 \n", "\n", " 17 18 19 20 21 22 \n", "0 1.396058 1.752793 2.187022 2.544257 2.974947 -2.316864 \n", "\n", "[1 rows x 23 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.to_dataframe()" ] }, { "cell_type": "markdown", "id": "RspK5HS6K7KQ", "metadata": { "id": "RspK5HS6K7KQ" }, "source": [ "For the inference, encrypt inference data." ] }, { "cell_type": "code", "execution_count": 27, "id": "r4hqFSiAK-WX", "metadata": { "id": "r4hqFSiAK-WX" }, "outputs": [], "source": [ "test_data_feature = heaan_sdk.HEMatrix.encode_encrypt(context, X_test, unit_shape)" ] }, { "cell_type": "markdown", "id": "b5526f5d", "metadata": {}, "source": [ "If GPU is available, send test data to GPU." ] }, { "cell_type": "code", "execution_count": 28, "id": "39fc8f7f", "metadata": {}, "outputs": [], "source": [ "test_data_feature.to_device()" ] }, { "cell_type": "markdown", "id": "32e0d02e", "metadata": {}, "source": [ "To inference data, use `predict()` of model." ] }, { "cell_type": "code", "execution_count": 29, "id": "a6a138a2", "metadata": {}, "outputs": [], "source": [ "output_binary = model.predict(test_data_feature)" ] }, { "cell_type": "markdown", "id": "1de06e0b", "metadata": {}, "source": [ "Let's decrypt the inference and look at the result of model performance." ] }, { "cell_type": "code", "execution_count": 30, "id": "7cfd6d23", "metadata": {}, "outputs": [], "source": [ "output_binary.to_host()\n", "output_arr_binary = output_binary.decrypt_decode()" ] }, { "cell_type": "markdown", "id": "123590fd-4b33-4b10-9773-25abc8e37a64", "metadata": {}, "source": [ "The output consists of values before conversion to probability." ] }, { "cell_type": "code", "execution_count": 31, "id": "wGII6dP1LakD", "metadata": { "id": "wGII6dP1LakD" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test accuracy: 86.50%\n" ] } ], "source": [ "threshold = 0.5\n", "probs = 1 / (1 + np.exp(-output_arr_binary))\n", "probs = probs.squeeze()\n", "preds = probs > threshold\n", "correct_cnt = (preds == y_test).sum()\n", "acc = correct_cnt / len(y_test)\n", "print(f\"Test accuracy: {acc * 100: .2f}%\")" ] }, { "cell_type": "markdown", "id": "4695ae08", "metadata": {}, "source": [ "According to the result, the out-of-sample test of our model has 86.50% accuracy. This means that it can distinguish whether a borrower causes a bad transaction with an accuracy of about 86.50%." ] } ], "metadata": { "colab": { "collapsed_sections": [ "164febb8" ], "provenance": [] }, "kernelspec": { "display_name": "heaansdk-230317", "language": "python", "name": "heaansdk-1dd88bf-2e47166" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }