Abstract:
The World Health Organization estimates a rising global trend in healthcare costs, and machine learning (ML) models are anticipated to help predict and manage these costs. However, ML research on predicting patients’ expenditures from electronic health records (EHRs) is
relatively new. Furthermore, multi-headed neural network architectures remain little explored in the literature for multivariate time-series prediction in the healthcare domain.
Additionally, researchers have not explored generative adversarial networks (GANs) for predicting healthcare outcomes using multivariate time-series datasets. In this thesis, five experiments addressed these gaps in the literature. In the first experiment, the potential
of the Apriori frequent item-set mining approach was evaluated for discovering the frequently
appearing diagnosis or procedure codes among several features in healthcare datasets. The selected features, combined with demographic and clinical features, were used to classify
patients according to the medications they consumed. On a US dataset, the performance of all ML algorithms improved when only the frequent features selected by Apriori were used for classification, compared with using all the features.
However, this finding was not robust across a second dataset collected in India. In the second experiment, state-of-the-art feature selection approaches (information gain, correlation coefficient score, LASSO, and ridge regression) and feature transformation approaches
(principal component analysis and auto-encoders) were evaluated to find relevant features in healthcare datasets. Results revealed that feature engineering helped in improving the
classification accuracy in certain healthcare datasets. In the third experiment, statistical models (persistence and autoregressive integrated moving average (ARIMA)), a multi-layer
perceptron (MLP), a long short-term memory (LSTM) network, and a novel ensemble model combining the predictions of the ARIMA, MLP, and LSTM models were developed and evaluated for
predicting expenditures on certain prescription-based medications. The best performance on test data was obtained from the ensemble model, followed by the MLP, LSTM, persistence,
and ARIMA models. In the fourth experiment, multi-headed ML models (MLP, LSTM,
convolutional neural network (CNN), ConvLSTM, and CNN-LSTM) were developed using
multivariate time-series datasets for predicting patients’ expenditures. The performance of these multi-headed models was compared against that of their single-headed counterparts and a baseline vector autoregression (VAR) model. Results revealed that all the multi-headed
models outperformed the corresponding single-headed architectures and the VAR model. In the last experiment, a novel generative adversarial network model (variance-based GAN, or V-GAN) was developed that specifically minimized the difference in variance between model-generated and actual data during training to perform time-series predictions of medicine-related expenditures. The performance of the V-GAN model was compared with that of other GAN-based
variants and several ML models. Results revealed that the V-GAN model outperformed the other models in correctly predicting the medicine expenditures of patients. This thesis highlights the
utility of various ML methods and feature engineering techniques for healthcare expenditure forecasting.
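
As an illustration of the frequent item-set step used in the first experiment, the following minimal Python sketch implements the textbook Apriori procedure on a toy set of patient records. It is not the implementation used in the thesis: the function name `apriori`, the placeholder codes (`dx_A`, `dx_B`, `proc_X`, `proc_Y`), and the 0.5 support threshold are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    `transactions` is an iterable of collections of codes (one per patient);
    support is the fraction of transactions that contain the itemset.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of `itemset`.
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every single item seen in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)

        # Join step: merge frequent k-itemsets into (k+1)-itemset candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every k-item subset of a candidate must itself be frequent.
        current = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


# Hypothetical records: each set holds the codes attached to one patient.
records = [
    {"dx_A", "dx_B", "proc_X"},
    {"dx_A", "dx_B"},
    {"dx_A", "proc_X"},
    {"dx_B", "proc_Y"},
]
frequent = apriori(records, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy run, `proc_Y` appears in only one of the four records and is dropped, while the pairs (`dx_A`, `dx_B`) and (`dx_A`, `proc_X`) each reach the threshold. In a pipeline like the one summarized above, the surviving codes would then be used as candidate features for the downstream classifiers.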