{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Instance-based attacks\n", "\n", "This notebook demonstrates additional issues that are unique to instance-based models.\n", "\n", "Instance-based models are popular in machine learning; common examples are K-Nearest-Neighbours and the Support Vector Machine. All machine learning models (instance-based or otherwise) require access to data during the training phase. What makes instance-based models distinct is that they also require access to training data to make predictions and therefore need to store some of the training data within the model file.\n", "\n", "Because this model file is what researchers wish to export from the TRE, exporting it can disclose the stored training records.\n", "\n", "We will illustrate this with an example." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:c:\\Users\\simonr04\\git\\GRAIMatter\\data_preprocessing\\data_interface.py:ROOT PROJECT FOLDER = c:\\Users\\simonr04\\git\\GRAIMatter\n" ] } ], "source": [ "import logging\n", "import os\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline\n", "\n", "logging.getLogger(\"matplotlib.font_manager\").disabled = True\n", "\n", "os.chdir(\"c:\\\\Users\\\\simonr04\\\\git\\\\GRAIMatter\")\n", "from data_preprocessing.data_interface import get_data_sklearn\n", "\n", "logging.basicConfig(level=logging.DEBUG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use an open-source dataset, as we cannot show an example with data from the TRE." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:c:\\Users\\simonr04\\git\\GRAIMatter\\data_preprocessing\\data_interface.py:DATASET FOLDER = c:\\Users\\simonr04\\git\\GRAIMatter\\data\n", "INFO:c:\\Users\\simonr04\\git\\GRAIMatter\\data_preprocessing\\data_interface.py:Loading mimic2-iaccd\n", "INFO:c:\\Users\\simonr04\\git\\GRAIMatter\\data_preprocessing\\data_interface.py:Preprocessing\n" ] } ], "source": [ "DATASET_NAME = \"mimic2-iaccd\"\n", "X, y = get_data_sklearn(DATASET_NAME)\n", "# Choose some features (we don't need all of them)\n", "FEATURES = [\"age\", \"gender_num\", \"bmi\", \"day_icu_intime_num\", \"liver_flg\", \"copd_flg\"]\n", "subX = X[FEATURES].copy()\n", "\n", "# Convert bmi and age to integers (astype(int) drops the fractional part)\n", "subX[\"bmi\"] = subX[\"bmi\"].astype(int)\n", "subX[\"age\"] = subX[\"age\"].astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the model\n", "\n", "We now train an instance-based model (a Support Vector Machine; SVM). In this case, we are predicting whether an individual admitted to hospital died or not." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by splitting the data into training and test sets, and then train the model on the training set. We show the model performance via a ROC curve. This is just to show that the model has learned something useful (curves above the dashed diagonal indicate performance better than random guessing)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import roc_curve\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.svm import SVC\n", "\n", "font = {\"size\": 14}\n", "plt.rc(\"font\", **font)\n", "train_X, test_X, train_y, test_y = train_test_split(subX.values, y.values.flatten())\n", "svm = SVC(probability=True, gamma=0.01)\n", "svm.fit(train_X, train_y)\n", "train_probs = svm.predict_proba(train_X)\n", "test_probs = svm.predict_proba(test_X)\n", "plt.figure(figsize=(10, 10))\n", "fpr, tpr, _ = roc_curve(test_y, test_probs[:, 1])\n", "plt.plot(fpr, tpr, label=\"test\")\n", "fpr, tpr, _ = roc_curve(train_y, train_probs[:, 1])\n", "plt.plot(fpr, tpr, label=\"train\")\n", "plt.legend()\n", "plt.xlabel(\"fpr\")\n", "plt.ylabel(\"tpr\")\n", "plt.plot([0, 1], [0, 1], \"k--\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The python object `svm` is the model. This is what a researcher would want to save and export from the TRE.\n", "\n", "Unfortunately, it includes _exact_ copies of some of the data examples. Details in the next cell." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "In this example, the SVM has stored exact copies of 441 of the original data rows (out of 798 total rows)\n" ] } ], "source": [ "n_support_vectors = len(svm.support_vectors_)\n", "n_total = len(train_X)\n", "print(\n", " f\"In this example, the SVM has stored exact copies of {n_support_vectors} of the original data rows (out of {n_total} total rows)\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without these, the SVM won't work. They are immediately accessible with access to the `svm` object. 
For example, here are the first five, alongside the corresponding rows from the training data for comparison:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EXAMPLE 0\n", "\t Original:\t [46. 1. 39. 5. 0. 0.]\n", "\t Stored:\t [46. 1. 39. 5. 0. 0.]\n", "EXAMPLE 1\n", "\t Original:\t [36. 1. 26. 5. 0. 0.]\n", "\t Stored:\t [36. 1. 26. 5. 0. 0.]\n", "EXAMPLE 2\n", "\t Original:\t [67. 1. 19. 4. 0. 0.]\n", "\t Stored:\t [67. 1. 19. 4. 0. 0.]\n", "EXAMPLE 3\n", "\t Original:\t [55. 0. 19. 6. 0. 0.]\n", "\t Stored:\t [55. 0. 19. 6. 0. 0.]\n", "EXAMPLE 4\n", "\t Original:\t [87. 0. 24. 7. 0. 0.]\n", "\t Stored:\t [87. 0. 24. 7. 0. 0.]\n" ] } ], "source": [ "NTOP = 5\n", "for i in range(NTOP):\n", " sv_idx = svm.support_[i]\n", " sv = svm.support_vectors_[i]\n", " original = train_X[sv_idx, :]\n", " print(f\"EXAMPLE {i}\")\n", " print(\"\\t Original:\\t\", original)\n", " print(\"\\t Stored:\\t\", sv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is immediately clear that exporting the model would, in effect, export exact copies of some individual-level data.\n", "\n", "This is an issue with all instance-based models where an attacker has direct access to the contents of the model (or the model file). It is not an issue if the attacker can only query the model and has no access to its inner workings." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.4 ('venv': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "fcca1ce0a591990538c4a1a2cbe16853d718e2332b5914ea18ddb1937a418955" } } }, "nbformat": 4, "nbformat_minor": 2 }