{ "cells": [ { "cell_type": "markdown", "id": "371315ac", "metadata": {}, "source": [ "# Statistics using HEaaN.stat\n", "\n", "In this documentation, we introduce default detection model using lending club dataset from Kaggle (https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1)." ] }, { "cell_type": "markdown", "id": "e4a2ed42", "metadata": { "id": "e4a2ed42" }, "source": [ "## 1. Data import" ] }, { "cell_type": "markdown", "id": "ee5557f8", "metadata": { "id": "ee5557f8" }, "source": [ "We import lending club data extracted from original set downloaded from Kaggle. In this exercise, we only use 32,768 observations and selected some sensitive features. Let's import the provided data." ] }, { "cell_type": "code", "execution_count": 1, "id": "89e1a910", "metadata": { "id": "89e1a910" }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "c8735dea", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 330 }, "id": "c8735dea", "outputId": "1270720d-da8d-48d0-ac61-c71d5a479297" }, "outputs": [], "source": [ "df = pd.read_csv('lc.csv')" ] }, { "cell_type": "code", "execution_count": 3, "id": "e6dcb7da", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "e6dcb7da", "outputId": "6e56a6ef-82c7-4cb6-d797-45f6717e801d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 32768 entries, 0 to 32767\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 loan_amnt 32768 non-null float64\n", " 1 annual_inc 32768 non-null float64\n", " 2 term 32768 non-null object \n", " 3 int_rate 32768 non-null object \n", " 4 installment 32768 non-null float64\n", " 5 grade 32768 non-null object \n", " 6 sub_grade 32768 non-null object \n", " 7 purpose 32768 non-null object \n", " 8 loan_status 32768 non-null object \n", " 9 dti 32743 non-null float64\n", " 10 last_fico_range_high 32768 non-null float64\n", " 11 last_fico_range_low 32768 non-null float64\n", " 12 total_acc 32768 non-null float64\n", " 13 delinq_2yrs 32768 non-null float64\n", " 14 emp_length 30518 non-null object \n", "dtypes: float64(8), object(7)\n", "memory usage: 3.8+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "d602275e", "metadata": { "id": "d602275e" }, "source": [ "## 2. Data preprocessing" ] }, { "cell_type": "markdown", "id": "uW6QDKLL0K2-", "metadata": { "id": "uW6QDKLL0K2-" }, "source": [ "Before encoding and encrypting data, we have to preprocess data. Frist, we have to change data of numerical columns to number. Second, categorical column have to be set to dtype as \"category\"." ] }, { "cell_type": "code", "execution_count": 4, "id": "sZbc3lrW0oP1", "metadata": { "id": "sZbc3lrW0oP1" }, "outputs": [], "source": [ "df['int_rate'] = df['int_rate'].apply(lambda x: float(x.strip('%')) / 100)\n", "\n", "df[\"term\"] = df[\"term\"].astype(\"category\")\n", "df[\"grade\"] = df[\"grade\"].astype(\"category\")\n", "df[\"sub_grade\"] = df[\"sub_grade\"].astype(\"category\")\n", "df[\"purpose\"] = df[\"purpose\"].astype(\"category\")\n", "df[\"loan_status\"] = df[\"loan_status\"].astype(\"category\")\n", "df[\"emp_length\"] = df[\"emp_length\"].astype(\"category\")" ] }, { "cell_type": "markdown", "id": "7Y5ECrvi2x4I", "metadata": { "id": "7Y5ECrvi2x4I" }, "source": [ "Now, let's encrypt data. Note that HEaaN.SDK contains various encryption techniques in context. 
We will use the 'FGb' parameter preset for encryption and generate keys based on it." ] }, { "cell_type": "code", "execution_count": 5, "id": "McsFn9na21xT", "metadata": { "id": "McsFn9na21xT" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HEaaN-SDK uses CUDA v11.5 (> v11.2)\n" ] } ], "source": [ "import heaan_sdk\n", "\n", "context = heaan_sdk.Context(\n", " parameter=heaan_sdk.HEParameter.from_preset(\"FGb\"),\n", " key_dir_path=\"./keys_stat\",\n", " load_keys=\"all\",\n", " generate_keys=True,\n", ")" ] }, { "cell_type": "markdown", "id": "43ae9012", "metadata": {}, "source": [ "Now, let's encode and encrypt the data." ] }, { "cell_type": "code", "execution_count": 6, "id": "sWuYsM55KzGi", "metadata": { "id": "sWuYsM55KzGi" }, "outputs": [], "source": [ "hf = heaan_sdk.HEFrame.from_frame(context, df)" ] }, { "cell_type": "code", "execution_count": 7, "id": "fa851c36", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "HEFrame(\n", " number of rows: 32768,\n", " number of columns: 15,\n", " list of columns: ['loan_amnt', 'annual_inc', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length']\n", ")" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hf.encrypt()" ] }, { "cell_type": "markdown", "id": "f3cb07ac", "metadata": { "id": "f3cb07ac" }, "source": [ "Before conducting the analysis, let's perform some additional preprocessing on the encrypted frame. First, since our objective is to predict whether a loan is well-paid or not, let's create a binary variable called 'bad'. We define 'bad=0' if the loan_status is either 'Fully Paid' or 'Current', and 'bad=1' otherwise." ] }, { "cell_type": "code", "execution_count": 9, "id": "C3STZwGYELVg", "metadata": { "id": "C3STZwGYELVg" }, "outputs": [], "source": [ "hf[\"bad\"] = (hf[\"loan_status\"] != \"Fully Paid\") & (hf[\"loan_status\"] != \"Current\")" ] }, { "cell_type": "markdown", "id": "lkS8Z1MPKtLy", "metadata": { "id": "lkS8Z1MPKtLy" }, "source": [ "Suppose that we want to use each individual's average monthly income."
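] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a purely optional reference, the same quantity can first be computed on the plaintext frame with pandas. The sketch below is hypothetical: it assumes the data owner still holds the unencrypted `df`, and the column name `avg_monthly_inc_plain` is used here only for illustration and is not part of HEaaN.stat. The encrypted version, computed directly on the HEFrame, follows right after." ] }, { "cell_type": "code", "execution_count": null, "id": "plain-check-income", "metadata": {}, "outputs": [], "source": [ "# Hypothetical plaintext reference (data owner side only): derive the average\n", "# monthly income from the annual income with ordinary pandas arithmetic.\n", "df[\"avg_monthly_inc_plain\"] = df[\"annual_inc\"] / 12\n", "df[\"avg_monthly_inc_plain\"].describe()"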
] }, { "cell_type": "code", "execution_count": 10, "id": "WaGHeS1mFFpj", "metadata": { "id": "WaGHeS1mFFpj" }, "outputs": [], "source": [ "hf[\"avg_monthly_inc\"] = hf[\"annual_inc\"] * (1 / 12)" ] }, { "cell_type": "code", "execution_count": 11, "id": "96WSN0FCEaLz", "metadata": { "id": "96WSN0FCEaLz" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "HEFrame Description: \n", "Data rows: 32768\n", "Data columns (total 17 columns):\n", "# Column Dtype Encrypted \n", "--- ------------------------ --------- ----------\n", " 0 loan_amnt float64 True \n", " 1 annual_inc float64 True \n", " 2 term category True \n", " 3 int_rate float64 True \n", " 4 installment float64 True \n", " 5 grade category True \n", " 6 sub_grade category True \n", " 7 purpose category True \n", " 8 loan_status category True \n", " 9 dti float64 True \n", " 10 last_fico_range_high float64 True \n", " 11 last_fico_range_low float64 True \n", " 12 total_acc float64 True \n", " 13 delinq_2yrs float64 True \n", " 14 emp_length category True \n", " 15 bad bool True \n", " 16 avg_monthly_inc float True \n", "dtypes: float(1), bool(1), category(6), float64(9)\n" ] } ], "source": [ "hf.info()" ] }, { "cell_type": "markdown", "id": "Gvy_uNYJl2Bb", "metadata": { "id": "Gvy_uNYJl2Bb" }, "source": [ "You will see that \"avg_monthly_inc\" is added on the encrypted dataset." ] }, { "cell_type": "markdown", "id": "9391c5dc", "metadata": { "id": "9391c5dc" }, "source": [ "## 3. Descriptive Statistics" ] }, { "cell_type": "markdown", "id": "sjpo80swEinn", "metadata": { "id": "sjpo80swEinn" }, "source": [ "In this section, we calculate descriptive statistics for some variables. First of all, we calculate average and standard deviation of average of monthly income." ] }, { "cell_type": "code", "execution_count": 12, "id": "ee1WX2iSEjQJ", "metadata": { "id": "ee1WX2iSEjQJ" }, "outputs": [ { "data": { "text/plain": [ "6618.304940075873" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_inc = hf[\"avg_monthly_inc\"].mean()\n", "mean_inc.decrypt().to_series()[0]" ] }, { "cell_type": "code", "execution_count": 13, "id": "lMBf3TU4EoX4", "metadata": { "id": "lMBf3TU4EoX4" }, "outputs": [ { "data": { "text/plain": [ "6615.626228448389" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "std_inc = hf[\"avg_monthly_inc\"].std()\n", "std_inc.decrypt().to_series()[0]" ] }, { "cell_type": "markdown", "id": "zGzHiEy_mDi_", "metadata": { "id": "zGzHiEy_mDi_" }, "source": [ "You may see that the result of calculated average and standard deviation of monthly income are approximately 6618, 6615, respectively." ] }, { "cell_type": "markdown", "id": "gHL66UY-FrYY", "metadata": { "id": "gHL66UY-FrYY" }, "source": [ "Now, let's calculate the correlation coefficient between monthly income and annual income." ] }, { "cell_type": "code", "execution_count": 14, "id": "PlQC0E0DEocz", "metadata": { "id": "PlQC0E0DEocz" }, "outputs": [ { "data": { "text/plain": [ "1.000000401009209" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corr_inc = hf[\"avg_monthly_inc\"].corr(hf[\"annual_inc\"])\n", "corr_inc.decrypt().to_series()[0]" ] }, { "cell_type": "markdown", "id": "ENp7Cc0snQIy", "metadata": { "id": "ENp7Cc0snQIy" }, "source": [ "Since the correlation between monthly income and annual income is perfectly correlated, it is better to drop one of both. In this example, let's drop annual income variable." 
] }, { "cell_type": "code", "execution_count": 15, "id": "TSkiuqvrF1CV", "metadata": { "id": "TSkiuqvrF1CV" }, "outputs": [ { "data": { "text/plain": [ "HEFrame(\n", " number of rows: 32768,\n", " number of columns: 16,\n", " list of columns: ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'purpose', 'loan_status', 'dti', 'last_fico_range_high', 'last_fico_range_low', 'total_acc', 'delinq_2yrs', 'emp_length', 'bad', 'avg_monthly_inc']\n", ")" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hf.drop(\"annual_inc\", axis=1, inplace=True)" ] }, { "cell_type": "markdown", "id": "lrvMhu8uGoAT", "metadata": { "id": "lrvMhu8uGoAT" }, "source": [ "If you are interested in calculating skewness and kurtosis, you may also calculate using the program. Let's show how to conduct." ] }, { "cell_type": "code", "execution_count": 16, "id": "274V3JROEofq", "metadata": { "id": "274V3JROEofq" }, "outputs": [ { "data": { "text/plain": [ "47.25089123263138" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skew_inc = hf[\"avg_monthly_inc\"].skew()\n", "skew_inc.decrypt().to_series()[0]" ] }, { "cell_type": "code", "execution_count": 17, "id": "dHPlcvV6EoiA", "metadata": { "id": "dHPlcvV6EoiA" }, "outputs": [ { "data": { "text/plain": [ "4115.309118000383" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kurt_inc = hf[\"avg_monthly_inc\"].kurt()\n", "kurt_inc.decrypt().to_series()[0]" ] }, { "cell_type": "markdown", "id": "eMt8E5fOG3Dk", "metadata": { "id": "eMt8E5fOG3Dk" }, "source": [ "Now, let's calculate descriptive statistics for specific credit grade. For example, consider the mean and standard deviation of average of monthly income for credit grade 'A'." ] }, { "cell_type": "code", "execution_count": 18, "id": "bbErUukRG39J", "metadata": { "id": "bbErUukRG39J" }, "outputs": [ { "data": { "text/plain": [ "7473.801684469564" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].mean()\n", "mean_inc_with_A.decrypt().to_series()[0]" ] }, { "cell_type": "code", "execution_count": 19, "id": "YYqcjgiBHDDh", "metadata": { "id": "YYqcjgiBHDDh" }, "outputs": [ { "data": { "text/plain": [ "5925.766645646782" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "std_inc_with_A = hf['avg_monthly_inc'][hf['grade']=='A'].std()\n", "std_inc_with_A.decrypt().to_series()[0]" ] }, { "cell_type": "markdown", "id": "F2WCysVrmfIl", "metadata": { "id": "F2WCysVrmfIl" }, "source": [ "The calculated mean and standard deviation of credit grade \"A\"'s monthly income are approximately 7474, 5926, respectively, which are far from mean and stadnard deviation of whole monthly income." ] }, { "cell_type": "markdown", "id": "5XurJ4tsEmAB", "metadata": { "id": "5XurJ4tsEmAB" }, "source": [ "## 4. Hypothesis Testing" ] }, { "cell_type": "markdown", "id": "yP9q40ZMHOy1", "metadata": { "id": "yP9q40ZMHOy1" }, "source": [ "In this section, we explore how to conduct hypothesis testing.\n", "For example, let's conduct one-sample hypothesis testing for mean." ] }, { "cell_type": "markdown", "id": "K8HvwJbrHO1g", "metadata": { "id": "K8HvwJbrHO1g" }, "source": [ "Assume that we are interested in testing the population mean of average of monthly income is 6,500. 
] }, { "cell_type": "code", "execution_count": 20, "id": "CNJpo6wPHegB", "metadata": { "id": "CNJpo6wPHegB" }, "outputs": [ { "data": { "text/plain": [ "0 3.237106\n", "1 32767.000000\n", "Name: avg_monthly_inc_t_test, dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "null_mean = 6500\n", "t_test = hf['avg_monthly_inc'].t_test(null_mean)\n", "t_test.decrypt().to_series()" ] }, { "cell_type": "markdown", "id": "E_JIbCFmOlFD", "metadata": { "id": "E_JIbCFmOlFD" }, "source": [ "The result consists of the t-statistic in the first slot and the degrees of freedom in the second slot. We use scipy to calculate the p-value." ] }, { "cell_type": "code", "execution_count": 21, "id": "0Gz6BqJCOw5R", "metadata": { "id": "0Gz6BqJCOw5R" }, "outputs": [ { "data": { "text/plain": [ "0.0012086852336215324" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t_test_series = t_test.to_series()\n", "\n", "import scipy.stats as st\n", "st.t.sf(abs(t_test_series[0]), round(t_test_series[1])) * 2" ] }, { "cell_type": "markdown", "id": "-SQcbdO9H4HH", "metadata": { "id": "-SQcbdO9H4HH" }, "source": [ "Since the t-statistic is approximately 3.2371 and the p-value is approximately 0.0012, we can reject the null hypothesis at the 5% significance level, which means that the average monthly income is not equal to \\$6,500." ] }, { "cell_type": "markdown", "id": "vhXOBdO5H5vS", "metadata": { "id": "vhXOBdO5H5vS" }, "source": [ "Using the above example, we can also construct a 95% confidence interval for the mean of the average monthly income." ] }, { "cell_type": "code", "execution_count": 22, "id": "xoiPLirrH6X5", "metadata": { "id": "xoiPLirrH6X5" }, "outputs": [ { "data": { "text/plain": [ "0 6546.672442\n", "1 6689.937448\n", "Name: avg_monthly_inc_mean_confint_t_0.95, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conf_interval_t = hf['avg_monthly_inc'].t_interval_for_mean(0.95)\n", "conf_interval_t.decrypt().to_series()" ] }, { "cell_type": "markdown", "id": "hlyewcL3H-YI", "metadata": { "id": "hlyewcL3H-YI" }, "source": [ "The resulting 95% confidence interval for the population mean is (6546.672, 6689.937), meaning that we are 95% confident that the population mean of average monthly income falls within this interval. Note that the interval does not contain 6,500, which is consistent with rejecting the null hypothesis above." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "heaansdk-230317", "language": "python", "name": "heaansdk-1dd88bf-2e47166" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }