{ "cells": [ { "cell_type": "markdown", "id": "371315ac", "metadata": {}, "source": [ "# Statistics using HEaaN.stat\n", "\n", "In this documentation, we introduce default detection model using lending club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1)." ] }, { "cell_type": "markdown", "id": "e4a2ed42", "metadata": { "id": "e4a2ed42" }, "source": [ "## 1. Data import" ] }, { "cell_type": "markdown", "id": "ee5557f8", "metadata": { "id": "ee5557f8" }, "source": [ "We import lending club data extracted from original set downloaded from Kaggle. In this exercise, we only use 32,768 observations and selected some sensitive features. Let's import the provided data." ] }, { "cell_type": "code", "execution_count": 1, "id": "89e1a910", "metadata": { "id": "89e1a910" }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "c8735dea", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 330 }, "id": "c8735dea", "outputId": "1270720d-da8d-48d0-ac61-c71d5a479297" }, "outputs": [], "source": [ "df = pd.read_csv('lc.csv')" ] }, { "cell_type": "code", "execution_count": 3, "id": "e6dcb7da", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "e6dcb7da", "outputId": "6e56a6ef-82c7-4cb6-d797-45f6717e801d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 32768 entries, 0 to 32767\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 loan_amnt 32768 non-null float64\n", " 1 annual_inc 32768 non-null float64\n", " 2 term 32768 non-null object \n", " 3 int_rate 32768 non-null object \n", " 4 installment 32768 non-null float64\n", " 5 grade 32768 non-null object \n", " 6 sub_grade 32768 non-null object \n", " 7 purpose 32768 non-null object \n", " 8 loan_status 32768 non-null object \n", " 9 dti 32743 non-null float64\n", " 10 last_fico_range_high 32768 non-null float64\n", " 11 last_fico_range_low 32768 non-null float64\n", " 12 total_acc 32768 non-null float64\n", " 13 delinq_2yrs 32768 non-null float64\n", " 14 emp_length 30518 non-null object \n", "dtypes: float64(8), object(7)\n", "memory usage: 3.8+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "d602275e", "metadata": { "id": "d602275e" }, "source": [ "## 2. Data preprocessing" ] }, { "cell_type": "markdown", "id": "uW6QDKLL0K2-", "metadata": { "id": "uW6QDKLL0K2-" }, "source": [ "Before encoding and encrypting data, we have to preprocess data. Frist, we have to change data of numerical columns to number. Second, categorical column have to be set to dtype as \"category\"." ] }, { "cell_type": "code", "execution_count": 4, "id": "sZbc3lrW0oP1", "metadata": { "id": "sZbc3lrW0oP1" }, "outputs": [], "source": [ "df['int_rate'] = df['int_rate'].apply(lambda x: float(x.strip('%')) / 100)\n", "\n", "df[\"term\"] = df[\"term\"].astype(\"category\")\n", "df[\"grade\"] = df[\"grade\"].astype(\"category\")\n", "df[\"sub_grade\"] = df[\"sub_grade\"].astype(\"category\")\n", "df[\"purpose\"] = df[\"purpose\"].astype(\"category\")\n", "df[\"loan_status\"] = df[\"loan_status\"].astype(\"category\")\n", "df[\"emp_length\"] = df[\"emp_length\"].astype(\"category\")" ] }, { "cell_type": "markdown", "id": "7Y5ECrvi2x4I", "metadata": { "id": "7Y5ECrvi2x4I" }, "source": [ "Now, let's encrypt data. Note that HEaaN.SDK contains various encryption techniques in context. 
We will use the 'FGb' parameter preset for encryption and generate keys based on it." ] }, { "cell_type": "code", "execution_count": 5, "id": "McsFn9na21xT", "metadata": { "id": "McsFn9na21xT" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HEaaN-SDK uses CUDA v11.5 (> v11.2)\n" ] } ], "source": [ "import heaan_sdk\n", "\n", "context = heaan_sdk.Context(\n", " parameter=heaan_sdk.HEParameter.from_preset(\"FGb\"),\n", " key_dir_path=\"./keys_stat\",\n", " load_keys=\"all\",\n", " generate_keys=True,\n", ")" ] }, { "cell_type": "markdown", "id": "43ae9012", "metadata": {}, "source": [ "Now, let's encode and encrypt the data." ] }, { "cell_type": "code", "execution_count": 6, "id": "sWuYsM55KzGi", "metadata": { "id": "sWuYsM55KzGi" }, "outputs": [], "source": [ "hf = heaan_sdk.HEFrame.from_frame(context, df)" ] }, { "cell_type": "code", "execution_count": 7, "id": "fa851c36", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "HEFrame(\n", " number of rows: 32768,\n", " number of columns: 15,\n", " list of columns: ['loan_amnt', 'annual_inc', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length']\n", ")" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hf.encrypt()" ] }, { "cell_type": "markdown", "id": "f3cb07ac", "metadata": { "id": "f3cb07ac" }, "source": [ "Before conducting the analysis, let's perform some additional preprocessing on the encrypted frame. First, since our objective is to predict whether a loan is well-paid or not, let's create a binary variable called 'bad'. We define 'bad=0' if the loan_status is either 'Fully Paid' or 'Current', and 'bad=1' otherwise." ] }, { "cell_type": "code", "execution_count": 9, "id": "C3STZwGYELVg", "metadata": { "id": "C3STZwGYELVg" }, "outputs": [], "source": [ "hf[\"bad\"] = (hf[\"loan_status\"] != \"Fully Paid\") & (hf[\"loan_status\"] != \"Current\")" ] }, { "cell_type": "markdown", "id": "lkS8Z1MPKtLy", "metadata": { "id": "lkS8Z1MPKtLy" }, "source": [ "Suppose that we want to use each individual's average monthly income."
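] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a purely optional reference, the same quantity can first be computed on the plaintext frame with pandas. The sketch below is hypothetical: it assumes the data owner still holds the unencrypted `df`, and the column name `avg_monthly_inc_plain` is used here only for illustration and is not part of HEaaN.stat. The encrypted version, computed directly on the HEFrame, follows right after." ] }, { "cell_type": "code", "execution_count": null, "id": "plain-check-income", "metadata": {}, "outputs": [], "source": [ "# Hypothetical plaintext reference (data owner side only): derive the average\n", "# monthly income from the annual income with ordinary pandas arithmetic.\n", "df[\"avg_monthly_inc_plain\"] = df[\"annual_inc\"] / 12\n", "df[\"avg_monthly_inc_plain\"].describe()"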
] }, { "cell_type": "code", "execution_count": 10, "id": "WaGHeS1mFFpj", "metadata": { "id": "WaGHeS1mFFpj" }, "outputs": [], "source": [ "hf[\"avg_monthly_inc\"] = hf[\"annual_inc\"] * (1 / 12)" ] }, { "cell_type": "code", "execution_count": 11, "id": "96WSN0FCEaLz", "metadata": { "id": "96WSN0FCEaLz" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "HEFrame Description: \n", "Data rows: 32768\n", "Data columns (total 17 columns):\n", "# Column Dtype Encrypted \n", "--- ------------------------ --------- ----------\n", " 0 loan_amnt float64 True \n", " 1 annual_inc float64 True \n", " 2 term category True \n", " 3 int_rate float64 True \n", " 4 installment float64 True \n", " 5 grade category True \n", " 6 sub_grade category True \n", " 7 purpose category True \n", " 8 loan_status category True \n", " 9 dti float64 True \n", " 10 last_fico_range_high float64 True \n", " 11 last_fico_range_low float64 True \n", " 12 total_acc float64 True \n", " 13 delinq_2yrs float64 True \n", " 14 emp_length category True \n", " 15 bad bool True \n", " 16 avg_monthly_inc float True \n", "dtypes: float(1), bool(1), category(6), float64(9)\n" ] } ], "source": [ "hf.info()" ] }, { "cell_type": "markdown", "id": "Gvy_uNYJl2Bb", "metadata": { "id": "Gvy_uNYJl2Bb" }, "source": [ "You will see that \"avg_monthly_inc\" is added on the encrypted dataset." ] }, { "cell_type": "markdown", "id": "9391c5dc", "metadata": { "id": "9391c5dc" }, "source": [ "## 3. Descriptive Statistics" ] }, { "cell_type": "markdown", "id": "sjpo80swEinn", "metadata": { "id": "sjpo80swEinn" }, "source": [ "In this section, we calculate descriptive statistics for some variables. First of all, we calculate average and standard deviation of average of monthly income." ] }, { "cell_type": "code", "execution_count": 12, "id": "ee1WX2iSEjQJ", "metadata": { "id": "ee1WX2iSEjQJ" }, "outputs": [ { "data": { "text/plain": [ "6618.304940075873" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_inc = hf[\"avg_monthly_inc\"].mean()\n", "mean_inc.decrypt().to_series()[0]" ] }, { "cell_type": "code", "execution_count": 13, "id": "lMBf3TU4EoX4", "metadata": { "id": "lMBf3TU4EoX4" }, "outputs": [ { "data": { "text/plain": [ "6615.626228448389" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "std_inc = hf[\"avg_monthly_inc\"].std()\n", "std_inc.decrypt().to_series()[0]" ] }, { "cell_type": "markdown", "id": "zGzHiEy_mDi_", "metadata": { "id": "zGzHiEy_mDi_" }, "source": [ "You may see that the result of calculated average and standard deviation of monthly income are approximately 6618, 6615, respectively." ] }, { "cell_type": "markdown", "id": "gHL66UY-FrYY", "metadata": { "id": "gHL66UY-FrYY" }, "source": [ "Now, let's calculate the correlation coefficient between monthly income and annual income." ] }, { "cell_type": "code", "execution_count": 14, "id": "PlQC0E0DEocz", "metadata": { "id": "PlQC0E0DEocz" }, "outputs": [ { "data": { "text/plain": [ "1.000000401009209" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corr_inc = hf[\"avg_monthly_inc\"].corr(hf[\"annual_inc\"])\n", "corr_inc.decrypt().to_series()[0]" ] }, { "cell_type": "markdown", "id": "ENp7Cc0snQIy", "metadata": { "id": "ENp7Cc0snQIy" }, "source": [ "Since the correlation between monthly income and annual income is perfectly correlated, it is better to drop one of both. In this example, let's drop annual income variable." 
] }, { "cell_type": "code", "execution_count": 15, "id": "TSkiuqvrF1CV", "metadata": { "id": "TSkiuqvrF1CV" }, "outputs": [ { "data": { "text/plain": [ "HEFrame(\n", " number of rows: 32768,\n", " number of columns: 16,\n", " list of columns: ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length', 'bad', 'avg_monthly_inc']\n", ")" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hf.drop(\"annual_inc\", axis=1, inplace=True)" ] }, { "cell_type": "markdown", "id": "lrvMhu8uGoAT", "metadata": { "id": "lrvMhu8uGoAT" }, "source": [ "If you are interested in calculating skewness and kurtosis, you may also calculate using the program. Let's show how to conduct." ] }, { "cell_type": "code", "execution_count": 16, "id": "274V3JROEofq", "metadata": { "id": "274V3JROEofq" }, "outputs": [ { "data": { "text/plain": [ "47.25089123263138" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skew_inc = hf[\"avg_monthly_inc\"].skew()\n", "skew_inc.decrypt().to_series()[0]" ] }, { "cell_type": "code", "execution_count": 17, "id": "dHPlcvV6EoiA", "metadata": { "id": "dHPlcvV6EoiA" }, "outputs": [ { "data": { "text/plain": [ "4115.309118000383" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kurt_inc = hf[\"avg_monthly_inc\"].kurt()\n", "kurt_inc.decrypt().to_series()[0]" ] }, { "cell_type": "markdown", "id": "eMt8E5fOG3Dk", "metadata": { "id": "eMt8E5fOG3Dk" }, "source": [ "Now, let's calculate descriptive statistics for specific credit grade. For example, consider the mean and standard deviation of average of monthly income for credit grade 'A'." ] }, { "cell_type": "code", "execution_count": 18, "id": "bbErUukRG39J", "metadata": { "id": "bbErUukRG39J" }, "outputs": [ { "data": { "text/plain": [ "7473.801684469564" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].mean()\n", "mean_inc_with_A.decrypt().to_series()[0]" ] }, { "cell_type": "code", "execution_count": 19, "id": "YYqcjgiBHDDh", "metadata": { "id": "YYqcjgiBHDDh" }, "outputs": [ { "data": { "text/plain": [ "5925.766645646782" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "std_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].std()\n", "std_inc_with_A.decrypt().to_series()[0]" ] }, { "cell_type": "markdown", "id": "F2WCysVrmfIl", "metadata": { "id": "F2WCysVrmfIl" }, "source": [ "The calculated mean and standard deviation of credit grade \"A\"'s monthly income are approximately 7474, 5926, respectively, which are far from mean and stadnard deviation of whole monthly income." ] }, { "cell_type": "markdown", "id": "5XurJ4tsEmAB", "metadata": { "id": "5XurJ4tsEmAB" }, "source": [ "## 4. Hypothesis Testing" ] }, { "cell_type": "markdown", "id": "yP9q40ZMHOy1", "metadata": { "id": "yP9q40ZMHOy1" }, "source": [ "In this section, we explore how to conduct hypothesis testing.\n", "For example, let's conduct one-sample hypothesis testing for mean." ] }, { "cell_type": "markdown", "id": "K8HvwJbrHO1g", "metadata": { "id": "K8HvwJbrHO1g" }, "source": [ "Assume that we are interested in testing the population mean of average of monthly income is 6,500. 
] }, { "cell_type": "code", "execution_count": 20, "id": "CNJpo6wPHegB", "metadata": { "id": "CNJpo6wPHegB" }, "outputs": [ { "data": { "text/plain": [ "0 3.237106\n", "1 32767.000000\n", "Name: avg_monthly_inc_t_test, dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "null_mean = 6500\n", "t_test = hf['avg_monthly_inc'].t_test(null_mean)\n", "t_test.decrypt().to_series()" ] }, { "cell_type": "markdown", "id": "E_JIbCFmOlFD", "metadata": { "id": "E_JIbCFmOlFD" }, "source": [ "The result consists of the t-statistic in the first slot and the degrees of freedom in the second slot. We use scipy to calculate the p-value." ] }, { "cell_type": "code", "execution_count": 21, "id": "0Gz6BqJCOw5R", "metadata": { "id": "0Gz6BqJCOw5R" }, "outputs": [ { "data": { "text/plain": [ "0.0012086852336215324" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t_test_series = t_test.to_series()\n", "\n", "import scipy.stats as st\n", "st.t.sf(abs(t_test_series[0]), round(t_test_series[1])) * 2" ] }, { "cell_type": "markdown", "id": "-SQcbdO9H4HH", "metadata": { "id": "-SQcbdO9H4HH" }, "source": [ "Since the t-statistic is approximately 3.2371 and the p-value is approximately 0.0012, we can reject the null hypothesis at the 5% significance level, which means that the average monthly income is not equal to \\$6,500." ] }, { "cell_type": "markdown", "id": "vhXOBdO5H5vS", "metadata": { "id": "vhXOBdO5H5vS" }, "source": [ "Using the above example, we can also construct a 95% confidence interval for the mean of the average monthly income." ] }, { "cell_type": "code", "execution_count": 22, "id": "xoiPLirrH6X5", "metadata": { "id": "xoiPLirrH6X5" }, "outputs": [ { "data": { "text/plain": [ "0 6546.672442\n", "1 6689.937448\n", "Name: avg_monthly_inc_mean_confint_t_0.95, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conf_interval_t = hf['avg_monthly_inc'].t_interval_for_mean(0.95)\n", "conf_interval_t.decrypt().to_series()" ] }, { "cell_type": "markdown", "id": "hlyewcL3H-YI", "metadata": { "id": "hlyewcL3H-YI" }, "source": [ "The resulting 95% confidence interval for the population mean is (6546.672, 6689.937), meaning that we are 95% confident that the population mean of average monthly income falls within this interval. Note that the interval does not contain 6,500, which is consistent with rejecting the null hypothesis above." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "heaansdk-230317", "language": "python", "name": "heaansdk-1dd88bf-2e47166" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }