Predicting Hospital Readmission of Diabetic Patients Using ML Models
Introduction
Diabetes affects 1 in 10 people in the US. According to Ostling et al., diabetic patients have a higher chance of being hospitalized than other patients. Healthcare systems have also moved toward value-based care, with programs such as the Hospital Readmission Reduction Program (HRRP) reducing reimbursements to hospitals whose readmission rates exceed a threshold.
In 2011 alone, over $250 million was spent treating readmitted diabetic patients, so hospitals have a strong incentive to reduce these rates.
Project
Bearing the above in mind, I will try to predict whether a diabetic patient will be readmitted to the hospital. The data I am using comes from UCI: over 100,000 admissions of diabetic patients from 130 US hospitals between 1999 and 2008.
I will use different ML techniques, such as Random Forest, to make my predictions.
Data Preparation
The raw data contained several types of information; below are the steps I took to prepare it for modeling.
The features I extracted included:
The length of stay in the hospital
The medications administered
The number of procedures
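A minimal sketch of this extraction step, assuming the column names from the UCI "Diabetes 130-US hospitals" documentation (`time_in_hospital`, `num_medications`, `num_procedures`, `readmitted`). The inline sample frame is a synthetic stand-in for `pd.read_csv("diabetic_data.csv")`:

```python
import pandas as pd

# Synthetic rows mirroring a few columns of the UCI dataset;
# in practice this frame would come from pd.read_csv("diabetic_data.csv").
df = pd.DataFrame({
    "time_in_hospital": [3, 7, 1],   # length of stay in days
    "num_medications": [12, 25, 8],  # medications administered
    "num_procedures": [0, 2, 1],     # procedures performed
    "readmitted": ["NO", "<30", ">30"],
})

# Keep only the features of interest plus the outcome.
features = ["time_in_hospital", "num_medications", "num_procedures"]
X = df[features]
y = df["readmitted"]
print(X.shape)
```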
Preliminary Analysis and the Final Dataset
The original data was noisy, incomplete, and redundant. Some features could not be used directly because of their high percentage of missing values:
Weight: 97% of the values were missing, so I considered the feature too sparse and excluded it from the analysis
Medical_specialty: 47% of the values were missing; I kept the feature and used NumPy to fill in the missing values
Payer code: 40% of the values were missing, and the feature was not relevant to the outcome, so I dropped it
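A sketch of how such sparse columns can be handled, assuming the dataset's convention of encoding missing values as `"?"`. The toy frame and the 50% drop threshold are illustrative, not the author's exact rule:

```python
import pandas as pd

# Toy frame illustrating the idea; the UCI file marks missing values with "?".
df = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],  # ~97% missing in the real data
    "payer_code": ["?", "MC", "?", "HM"],
    "medical_specialty": ["?", "Cardiology", "InternalMedicine", "Surgery"],
})

df = df.replace("?", pd.NA)          # normalize the missing-value marker
missing_frac = df.isna().mean()      # fraction missing per column

# Drop any column where more than half the values are missing (weight here);
# sparser-but-kept columns get an explicit "Missing" category instead.
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)
df["medical_specialty"] = df["medical_specialty"].fillna("Missing")
print(list(df.columns))
```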
Also, some patients appeared in multiple visits. These observations cannot be considered independent, which violates an assumption made by logistic regression, so I kept one encounter per patient.
During data cleaning, I also removed encounters that ended in discharge to hospice or in the patient's death, to avoid bias: those patients cannot be readmitted.
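Both cleaning steps can be sketched as below. The discharge codes 11 and 13 standing for death and hospice are assumptions for illustration, as are the toy rows:

```python
import pandas as pd

df = pd.DataFrame({
    "patient_nbr": [101, 101, 202, 303],
    "encounter_id": [1, 2, 3, 4],
    "discharge_disposition_id": [1, 1, 11, 13],  # 11/13 = died/hospice (assumed codes)
})

# Keep one encounter per patient so observations are independent,
# a core assumption of logistic regression.
df = df.drop_duplicates(subset="patient_nbr", keep="first")

# Remove encounters ending in death or hospice: those patients
# cannot be readmitted, so keeping them would bias the model.
df = df[~df["discharge_disposition_id"].isin([11, 13])]
print(len(df))
```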
Numerical Features
Categorical Features
Target
The target was the readmission feature, which was divided into three classes.
Since my aim was to predict any readmission, I merged the two readmission classes into one, producing a binary target.
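A minimal sketch of the merge, assuming the dataset's three labels are `"<30"`, `">30"`, and `"NO"`:

```python
import pandas as pd

readmitted = pd.Series(["NO", "<30", ">30", "NO"])  # the three original classes

# Any readmission (within or after 30 days) becomes 1; "NO" becomes 0.
target = (readmitted != "NO").astype(int)
print(target.tolist())  # [0, 1, 1, 0]
```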
Modeling Methods
Logistic Regression
I used this to fit the relationship between the features and readmission while controlling for all other covariates. The final evaluation was supported by plots of the results.
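A sketch of the fit with scikit-learn, using a synthetic stand-in for the prepared readmission features (the raised `max_iter` helps the solver converge on unscaled data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared readmission features.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(round(clf.score(X_val, y_val), 3))  # validation accuracy
```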
Random Forest
This model tends to work better than a single decision tree, which is prone to overfitting. Its validation accuracy was better than that of the logistic regression.
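The comparison can be sketched on synthetic data: the forest averages many decorrelated trees, which usually curbs the overfitting a single deep tree exhibits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Compare single-tree vs ensemble accuracy on held-out data.
print(tree.score(X_val, y_val), forest.score(X_val, y_val))
```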
Below is a summary of the performance of the model:
Feature Importance
To improve the models, I needed to look at the features that matter most to each model.
This can suggest new feature ideas and help make the model more robust.
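For a tree ensemble, one quick view is the impurity-based importances scikit-learn exposes; the feature names here are placeholders, not the project's actual columns:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher means more useful for splits.
for name, imp in zip(["f0", "f1", "f2", "f3"], forest.feature_importances_):
    print(name, round(imp, 3))
```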
Permutation Importance
There is a clear difference in the feature importances: in this case, "number of patients" carries more weight.
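Permutation importance can be computed with scikit-learn as below: each feature is shuffled on the validation set and the drop in score is measured, so large drops mark the features the model truly relies on (synthetic data again stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times and average the score drop.
result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)
```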
Analyzing the Models
Using the area under the ROC curve (AUC), I evaluated which model was best. AUC captures the trade-off between false positives and true positives without requiring the selection of a decision threshold.
From the graph above, we can see that all the models perform almost identically on the validation dataset.
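A sketch of the comparison: AUC is computed from predicted probabilities, which is why no threshold needs to be chosen (synthetic data stands in for the project's features).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    # AUC uses the positive-class probability, not a thresholded prediction.
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(type(model).__name__, round(auc, 3))
```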
Limitations and Assumptions
There is useful information in the data, but it lacks a few aspects, such as access to care, that would significantly impact the rate of readmission.
The data can be processed further to improve the performance of the models.
Conclusions
Using readmission rates from the past ten years of data can help hospitals improve their quality of care.
The readmission rates seemed to be associated with the number of procedures that were being done.
The best model was XGBoost, with an accuracy of 62.2%. Notably, the differences between models were small, with the decision tree having the lowest score at 60.4%.