Predicting Hospital Readmission of Diabetic Patients Using ML Models
Introduction
Diabetes affects 1 in 10 people in the US. According to Ostling et al., diabetic patients have a higher chance of being hospitalized than other patients. Healthcare systems have also moved toward value-based care, with programs such as the Hospital Readmission Reduction Program (HRRP) reducing reimbursements to hospitals whose readmission rates exceed a threshold.
In 2011 alone, over $250 million was spent treating readmitted diabetic patients, so hospitals have a strong incentive to reduce these rates.
Project
Bearing the above in mind, I will try to predict whether a diabetic patient will be readmitted to the hospital. The data I am using comes from UCI: over 100,000 admissions of diabetic patients from 130 US hospitals between 1999 and 2008.
I will use different ML techniques, such as Random Forest, to make my predictions.
Data Preparation
The raw data contained several types of information; below are the steps I took to prepare it for modeling.
The features I extracted included:
The length of stay in the hospital
The medications administered
The number of procedures
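A minimal sketch of this extraction step, assuming the column names from the UCI "Diabetes 130-US hospitals" documentation (`time_in_hospital`, `num_medications`, `num_procedures`, `readmitted`). The inline sample frame is a synthetic stand-in for `pd.read_csv("diabetic_data.csv")`:

```python
import pandas as pd

# Synthetic rows mirroring a few columns of the UCI dataset;
# in practice this frame would come from pd.read_csv("diabetic_data.csv").
df = pd.DataFrame({
    "time_in_hospital": [3, 7, 1],   # length of stay in days
    "num_medications": [12, 25, 8],  # medications administered
    "num_procedures": [0, 2, 1],     # procedures performed
    "readmitted": ["NO", "<30", ">30"],
})

# Keep only the features of interest plus the outcome.
features = ["time_in_hospital", "num_medications", "num_procedures"]
X = df[features]
y = df["readmitted"]
print(X.shape)
```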
Preliminary Analysis and the Final Dataset
The original data was noisy, incomplete, and redundant. Some features could not be used directly because of their high percentage of missing values:
Weight: 97% of the values were missing, so I considered the feature too sparse and excluded it from the analysis
Medical_specialty: 47% of the values were missing; I kept the feature and used NumPy to fill in the missing values
Payer code: 40% of the values were missing, and the feature was not relevant to the outcome, so I dropped it
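A sketch of how such sparse columns can be handled, assuming the dataset's convention of encoding missing values as `"?"`. The toy frame and the 50% drop threshold are illustrative, not the author's exact rule:

```python
import pandas as pd

# Toy frame illustrating the idea; the UCI file marks missing values with "?".
df = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],  # ~97% missing in the real data
    "payer_code": ["?", "MC", "?", "HM"],
    "medical_specialty": ["?", "Cardiology", "InternalMedicine", "Surgery"],
})

df = df.replace("?", pd.NA)          # normalize the missing-value marker
missing_frac = df.isna().mean()      # fraction missing per column

# Drop any column where more than half the values are missing (weight here);
# sparser-but-kept columns get an explicit "Missing" category instead.
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)
df["medical_specialty"] = df["medical_specialty"].fillna("Missing")
print(list(df.columns))
```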
Also, some patients appeared in multiple visits. These observations cannot be considered independent, which violates an assumption made by logistic regression, so I kept one encounter per patient.
During data cleaning, I also removed encounters that ended in discharge to hospice or in the patient's death, to avoid bias: those patients cannot be readmitted.
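Both cleaning steps can be sketched as below. The discharge codes 11 and 13 standing for death and hospice are assumptions for illustration, as are the toy rows:

```python
import pandas as pd

df = pd.DataFrame({
    "patient_nbr": [101, 101, 202, 303],
    "encounter_id": [1, 2, 3, 4],
    "discharge_disposition_id": [1, 1, 11, 13],  # 11/13 = died/hospice (assumed codes)
})

# Keep one encounter per patient so observations are independent,
# a core assumption of logistic regression.
df = df.drop_duplicates(subset="patient_nbr", keep="first")

# Remove encounters ending in death or hospice: those patients
# cannot be readmitted, so keeping them would bias the model.
df = df[~df["discharge_disposition_id"].isin([11, 13])]
print(len(df))
```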
Numerical Features
Categorical Features
Target
The target was the readmission feature, which was divided into three classes.
Since my aim was to predict any readmission, I merged the two readmission classes into one, producing a binary target.
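A minimal sketch of the merge, assuming the dataset's three labels are `"<30"`, `">30"`, and `"NO"`:

```python
import pandas as pd

readmitted = pd.Series(["NO", "<30", ">30", "NO"])  # the three original classes

# Any readmission (within or after 30 days) becomes 1; "NO" becomes 0.
target = (readmitted != "NO").astype(int)
print(target.tolist())  # [0, 1, 1, 0]
```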
Modeling Methods
Logistic Regression
I used this to fit the relationship between the features and readmission while controlling for all other covariates. The final evaluation was supported by plots of the results.
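A sketch of the fit with scikit-learn, using a synthetic stand-in for the prepared readmission features (the raised `max_iter` helps the solver converge on unscaled data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared readmission features.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(round(clf.score(X_val, y_val), 3))  # validation accuracy
```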
Random Forest
This model tends to work better than a single decision tree, which is prone to overfitting. Its validation accuracy was better than that of the logistic regression.
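The comparison can be sketched on synthetic data: the forest averages many decorrelated trees, which usually curbs the overfitting a single deep tree exhibits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Compare single-tree vs ensemble accuracy on held-out data.
print(tree.score(X_val, y_val), forest.score(X_val, y_val))
```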
Below is a summary of the performance of the model:
Feature Importance
To improve the models, I needed to look at the features that matter most to each model.
This can suggest new feature ideas and help make the model more robust.
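For a tree ensemble, one quick view is the impurity-based importances scikit-learn exposes; the feature names here are placeholders, not the project's actual columns:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher means more useful for splits.
for name, imp in zip(["f0", "f1", "f2", "f3"], forest.feature_importances_):
    print(name, round(imp, 3))
```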
Permutation Importance
There is a clear difference in the feature importances: in this case, "number of patients" carries more weight.
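Permutation importance can be computed with scikit-learn as below: each feature is shuffled on the validation set and the drop in score is measured, so large drops mark the features the model truly relies on (synthetic data again stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times and average the score drop.
result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)
```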
Analyzing the Models
Using the area under the ROC curve (AUC), I evaluated which model was best. AUC captures the trade-off between false positives and true positives without requiring the selection of a decision threshold.
From the graph above, we can see that all the models perform almost identically on the validation dataset.
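A sketch of the comparison: AUC is computed from predicted probabilities, which is why no threshold needs to be chosen (synthetic data stands in for the project's features).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    # AUC uses the positive-class probability, not a thresholded prediction.
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(type(model).__name__, round(auc, 3))
```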
Limitations and Assumptions
There is useful information in the data, but it lacks a few aspects, such as access to care, that would significantly impact the rate of readmission.
The data can be processed further to improve the performance of the models.
Conclusions
Using readmission rates from the past ten years of data can help hospitals improve their quality of care.
The readmission rates seemed to be associated with the number of procedures that were being done.
The best model was XGBoost, with an accuracy of 62.2%. Notably, the differences between models were small, with the decision tree having the lowest score at 60.4%.