Hello readers, I’m back after a long time; I’ve been caught up with a lot of things recently, and today’s blog is about one of them. I’m going to talk about my first freelance data science project. I won’t disclose the name of the client, but it is one of the most well-known diagnostic centres in our locality.

Now let’s talk about the task I was given: predicting mortality probability, i.e. the chances that a person who has COVID-19 will survive and the chances that he/she won’t. The main purpose of the project was to provide input to the various hospitals that got their COVID tests done through that particular diagnostic centre, so that people with a higher chance of dying could be given more priority.

One of the major factors in the success of any data-related project is the amount and quality of data available; another is feature engineering. After talking with a lot of my friends who are doctors, nurses and so on, I came up with a list of features I wanted along with the mortality data. The features were:

This time I didn’t have to go about collecting data: the centre provided me with a nicely arranged dataset containing the requested features, plus some extra ones (see how important data is; collected and maintained correctly, it can save lives). Data is the new fuel. So without further ado, let’s start.

Following our bible rule, we import all the required dependencies and libraries at the beginning.

Note: Install XGBoost if it is not already installed on your system (pip install xgboost).
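The imports below are a minimal sketch of what a project like this typically needs (pandas and NumPy for the data, scikit-learn for encoding, splitting and metrics, and XGBoost for the classifier); your exact list may differ.

```python
# Core data-handling libraries
import pandas as pd
import numpy as np

# Preprocessing, splitting and evaluation utilities from scikit-learn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

# Gradient-boosted tree classifier
from xgboost import XGBClassifier
```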

A glimpse of our dataset
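Assuming the data arrives as a CSV (the file name below is just a placeholder), loading it and taking a first look can be as simple as:

```python
# Load the dataset provided by the diagnostic centre (placeholder file name)
df = pd.read_csv("covid_mortality.csv")

# Shape and a quick look at the first few rows
print(df.shape)
print(df.head())
```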

Checking for missing values.
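A quick, standard way to do that with pandas (shown as a sketch; how you actually handle any gaps depends on the column):

```python
# Count missing values per column
print(df.isnull().sum())

# Simplest option when only a handful of rows have gaps: drop them
df = df.dropna()
```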

We convert all categorical variables into an encoded form the model can use. Models only understand numbers, so it is necessary to encode all our inputs numerically.
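One common way to do this is label encoding every non-numeric column; the snippet below is a sketch of that idea, keeping the fitted encoders around so the same mapping can be reused on data that arrives later.

```python
# Label-encode every non-numeric column, keeping the fitted encoders
# so the same mapping can be applied to future data
encoders = {}
for col in df.select_dtypes(include="object").columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
```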

Training

Note — one of the main aims of this experiment is to keep the recall as high as possible, even if our accuracy is lower. We’ll see why. (Can you guess?)

I divided the entire dataset into two parts, one for training and another for testing. The dataset had around 40,000 examples (rows), so I kept around 20% for validation.
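In code, that split could look like the sketch below; the target column name Mortality is my assumption about the schema.

```python
# Separate features from the target column (assumed to be named "Mortality")
X = df.drop(columns=["Mortality"])
y = df["Mortality"]

# Hold out roughly 20% of the ~40,000 rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```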

I used an XGBoost model to train my classifier; the class I’m trying to predict is Mortality (Y/N).
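A minimal training call might look like this; the hyperparameters are illustrative, not necessarily the ones behind the numbers reported below.

```python
# A plain XGBoost classifier with illustrative hyperparameters
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
```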

Testing

Woah! The accuracy came out to be 99.93% on the validation set, which is absolutely awesome. But we also need to see how it performs after deployment.
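For reference, the validation metrics can be computed roughly like this (a sketch; the figures quoted in this post come from the actual run):

```python
# Evaluate on the held-out validation split
val_preds = model.predict(X_val)
print("Accuracy :", accuracy_score(y_val, val_preds))
print("Recall   :", recall_score(y_val, val_preds))
print("Precision:", precision_score(y_val, val_preds))
print(confusion_matrix(y_val, val_preds))
```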

Test on External Set

After I finished the job, the diagnostic centre also provided me with more data to test my model: an external held-out set. The model predicted the outcomes with an accuracy of 90.32%, which shows some degradation from the original validation score but still outperforms a lot of existing models.
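Scoring the external set follows the same pattern, reusing the encoders fitted earlier (the file name is again a placeholder, and this assumes the external data uses the same columns and categories):

```python
# Load the external held-out set and apply the same encodings
external = pd.read_csv("external_holdout.csv")
for col, enc in encoders.items():
    external[col] = enc.transform(external[col])

X_ext = external.drop(columns=["Mortality"])
y_ext = external["Mortality"]
print("External accuracy:", accuracy_score(y_ext, model.predict(X_ext)))
```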

A few points to ponder upon:

1) Clean, organised data can lead to the successful completion of a data science or data-related project.

2) Feature engineering is not something that is done only with visualisations in matplotlib and so on; rather, it involves many other factors such as domain knowledge, new developments in the field, etc.

3) This model has a recall of 1.0 and a precision of 0.99, which was very much needed since this has to do with people’s lives: a large number of false positives won’t hurt, but a false negative might put someone’s life at risk. This also makes data practitioners like us responsible for what we create and do with our data.

Thank you, readers. I hope you liked this article; if you did, please drop a clap. See ya, geeks!!