Hello readers, I'm back after a long time; I've been caught up with a lot of stuff recently, and today's blog is about one of those things. I'm going to talk about my first data science freelance project. I won't disclose the name of the client, but it is one of the most famous diagnostic centres in our locality.
Now let's talk about the task I was given: predicting mortality probability, i.e. the chances that a person who has COVID-19 will survive and the chances that he/she won't. The main purpose of this project was to provide input to the various hospitals that got their COVID tests done through this particular diagnostic centre, so that patients with a higher chance of dying could be given more priority.
One of the major factors in the success of any data-related project is the amount and quality of data available; another is feature engineering. After talking with a lot of my friends who are doctors, nurses and so on, I came up with a list of features I wanted along with the mortality data. The features were:
1 TUBERCULOSIS
2 SYSTEMIC LUPUS ERYTHMATOSUS
3 RHEUMATOID ARTHRITIS
4 CANCER
5 ASPLENIA
6 HYPOSPLENIA
7 MEASLES
8 CYTOMEGALOVIRUS
9 CHICKEN POX
10 HERPES ZOSTER
11 MALNUTRITION
12 CURRENT PREGNANT
13 CHRONIC KIDNEY DISEASE
14 DIABETES TYPE I
15 DIABETES TYPE II
16 TRANSPLANT
This time I didn't have to collect the data myself: they provided me with a nicely arranged dataset containing the requested features, plus a few extra ones (see how important data is; if collected and maintained correctly, it can save lives). Data is the new fuel. So without further ado, let's start.
As is our golden rule, we import all the required dependencies and libraries at the beginning.
Note: install XGBoost (`pip install xgboost`) if it is not already installed on your system.
import pandas as pd
import numpy as np
import warnings
import xgboost as XGB
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
df = pd.read_csv('RawData.csv')
df.head()
Next, we check for missing values:
df.isnull().sum()
PATIENT_ID 0
AGE 0
SEX 0
ZIP 0
BMI 0
HEIGHT 0
WEIGHT 0
TUBERCULOSIS 0
SYSTEMIC LUPUS ERYTHMATOSUS 0
RHEUMATOID ARTHRITIS 0
EXTENSIVE BURNS 0
ASPLENIA 0
HYPOSPLENIA 0
MEASLES 0
CYTOMEGALOVIRUS 0
CHICKEN POX 0
HERPES ZOSTER 0
MALNUTRITION 0
CURRENT PREGNANT 0
CHRONIC KIDNEY DISEASE 0
DIABETES TYPE I 0
DIABETES TYPE II 0
TRANSPLANT 0
HEMODIALYSIS Pre Diagnosis 0
HEMODIALYSIS Post diagnosis 0
CANCER 0
COVID TEST POSITIVE 0
TEST NAME 0
ICU Admit 0
#ICU Admit 0
MORTALITY 0
dtype: int64
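Luckily there are no missing values here. Had there been any, a simple first pass (just a sketch on toy data, not the clinic's dataset) could fill numeric gaps with the column median and categorical gaps with a placeholder:

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps (hypothetical values, not the real data)
toy = pd.DataFrame({
    "AGE": [34, np.nan, 61],
    "SEX": ["F", None, "M"],
})

# Median for numeric columns, an explicit "UNKNOWN" level for categoricals
toy["AGE"] = toy["AGE"].fillna(toy["AGE"].median())
toy["SEX"] = toy["SEX"].fillna("UNKNOWN")

print(toy.isnull().sum().sum())  # 0
```

Median imputation is crude; for clinical data a domain-informed strategy would be preferable, but this shows the mechanics.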
cat_features = ['SEX', 'TUBERCULOSIS', 'SYSTEMIC LUPUS ERYTHMATOSUS', 'RHEUMATOID ARTHRITIS',
'EXTENSIVE BURNS', 'ASPLENIA', 'HYPOSPLENIA', 'MEASLES',
'CYTOMEGALOVIRUS', 'CHICKEN POX', 'HERPES ZOSTER', 'MALNUTRITION',
'CURRENT PREGNANT', 'CHRONIC KIDNEY DISEASE', 'DIABETES TYPE I',
'DIABETES TYPE II', 'TRANSPLANT', 'HEMODIALYSIS Pre Diagnosis',
'HEMODIALYSIS Post diagnosis', 'CANCER', 'COVID TEST POSITIVE',
'TEST NAME', 'ICU Admit', 'MORTALITY']
num_features = ['AGE', 'ZIP', 'BMI', 'HEIGHT', 'WEIGHT', '#ICU Admit']
features = cat_features + num_features
features.remove('MORTALITY')
features.remove('HEIGHT')
target = 'MORTALITY'
label = LabelEncoder()
for i in cat_features:
    df[i] = label.fit_transform(df[i])
df.tail()
We convert all categorical variables into an encoded form that the model can use. Models only understand numbers, so it is necessary to encode all our inputs numerically.
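As a quick illustration (on made-up Yes/No values, not the clinic's data), `LabelEncoder` learns the sorted set of categories and maps each one to an integer:

```python
from sklearn.preprocessing import LabelEncoder

# Toy column of Y/N flags, similar in shape to the comorbidity columns
values = ["N", "Y", "N", "Y", "Y"]

label = LabelEncoder()
encoded = label.fit_transform(values)

print(list(label.classes_))  # ['N', 'Y'] -> 'N' becomes 0, 'Y' becomes 1
print(list(encoded))         # [0, 1, 0, 1, 1]
```

Note that fitting one encoder per column (as in the loop above) is important, since each column can have its own set of categories.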
Training
Note: one of the main aims of this experiment is to keep recall as high as possible, even at the cost of some accuracy. We'll see why. (Can you guess?)
I divided the entire dataset into two parts, one for training and one for testing. The dataset had around 40,000 examples (rows), so I kept around 20% aside for validation.
I used an XGBoost model to train my classifier; the class I'm trying to predict is MORTALITY (Y/N).
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2)
model = XGB.XGBClassifier()
model.fit(X_train, y_train)
[01:47:01] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=4,
num_parallel_tree=1, predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
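One refinement worth considering (not used above, just a sketch on synthetic labels): since deaths are presumably a small fraction of the 40,000 rows, passing `stratify=y` to `train_test_split` keeps the class balance identical in both halves, so the validation metrics aren't skewed by an unlucky split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 5% positives (hypothetical, not the real data)
y = np.array([1] * 50 + [0] * 950)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits preserve the 5% positive rate
print(y_tr.mean(), y_te.mean())
```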
Testing
Woah! The accuracy came out to be 99.93% on the validation set, which is absolutely awesome. But we also need to see how it performs after deployment.
preds = model.predict(X_test)
accuracy_score(y_test, preds)
0.999375
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print("Precision: ", (tp/(tp + fp)))
print("Recall: ", (tp/(tp + fn)))
Precision: 0.9987496874218554
Recall: 1.0
Test on External Set
Later on, after I had finished my job, the diagnostic centre also provided me with an external held-out set to test my model. The model predicted the outcomes with an accuracy of 90.32%, which shows some degradation from the validation score but still outperforms a lot of existing models.
val = pd.read_csv('Valuation.csv')
y_true = val.MORTALITY
y_pred = model.predict(val.drop('MORTALITY', axis=1)[features])
accuracy_score(y_true, y_pred)
0.9032258064516129
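Given that recall was the priority, accuracy alone can hide missed deaths on the external set; the same confusion-matrix computation used earlier applies here too. A self-contained sketch with made-up labels and predictions (not the real external-set results):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = died, 0 = survived
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Recall:", tp / (tp + fn))      # 1.0  (no deaths missed)
print("Precision:", tp / (tp + fp))   # 0.75 (one false alarm)
```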
A few points to ponder upon:
1) Clean, organised data can lead to the successful completion of a data science, or any data-related, project.
2) Feature engineering is not something done only with visualisations in matplotlib and so on; it also involves many other factors, such as domain knowledge and new developments in the field.
3) This model has a recall of 1.0 and a precision of 0.99, which was very much needed since it deals with people's lives: a large number of false positives won't hurt, but a false negative might put someone's life at risk. This also makes data practitioners like us responsible for what we create and do with our data.
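If a future iteration ever needs more headroom on recall, one common lever (not used in this project, just a sketch on hypothetical probabilities) is to classify from `predict_proba` with a threshold below the default 0.5, trading precision for recall:

```python
import numpy as np

# Hypothetical predicted probabilities of death from some classifier
proba = np.array([0.10, 0.35, 0.55, 0.80, 0.95])
y_true = np.array([0, 1, 0, 1, 1])

# Default 0.5 threshold vs. a lower one chosen to favour recall
for thresh in (0.5, 0.3):
    preds = (proba >= thresh).astype(int)
    tp = int(((preds == 1) & (y_true == 1)).sum())
    fn = int(((preds == 0) & (y_true == 1)).sum())
    print(thresh, "recall:", tp / (tp + fn))  # recall rises as the threshold drops
```

The right threshold would have to be chosen on a validation set, weighing how many extra false positives the hospitals can tolerate.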
Thank you, readers. I hope you liked this article; if you did, please drop a clap. See ya, geeks!!