The investigation of machine learning and radiomics for the prognostication of recurrence and death following curative intent radiotherapy for non-small cell lung cancer
Author(s)
Hindocha, Sumeet
Type
Thesis or dissertation
Abstract
Machine learning (ML) is a subset of artificial intelligence (AI) in which computer algorithms learn to make predictions from data without having been explicitly programmed to do so. This thesis explores the use of ML methods on electronic healthcare record (EHR) data and radiomic features derived from radiotherapy planning CT scans to develop prognostication models for recurrence, recurrence-free survival (RFS) and overall survival (OS) at 2 years following radical radiotherapy for non-small cell lung cancer (NSCLC).
Radiotherapy is a key treatment modality for NSCLC; however, recurrence following radiotherapy is reported in up to 36% of patients. Earlier detection of recurrence may improve OS and quality of life. AI models that can harness EHR and imaging data to identify patients at higher risk of recurrence would enable clinicians to intensify surveillance accordingly, and are therefore of considerable clinical value.
Initially, combinations of eight feature reduction techniques and ten ML classification algorithms were compared to develop, validate and externally test models built using readily available EHR data. The respective validation and test set areas under the receiver operating characteristic curve (AUCs, with 95% confidence intervals) were: 1) RFS: 0.682 (0.575–0.788) and 0.681 (0.597–0.766); 2) recurrence: 0.687 (0.582–0.793) and 0.722 (0.635–0.81); and 3) OS: 0.759 (0.663–0.855) and 0.717 (0.634–0.8). These results demonstrate an incremental improvement on TNM staging, the current gold-standard prognostic tool.
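As an illustration of how an AUC with a 95% confidence interval of the kind reported here can be obtained, the sketch below computes the AUC from pairwise rankings and estimates the interval with a percentile bootstrap. The abstract does not state which interval method the thesis used, so the bootstrap is an assumption, and the function names and toy data are hypothetical.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap confidence interval.
    Assumption: the thesis may have used a different interval method."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains one class only; AUC is undefined
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return roc_auc(labels, scores), lower, upper


if __name__ == "__main__":
    # Toy data only: 1 = event (e.g. recurrence), 0 = no event.
    y = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]
    p = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6, 0.3, 0.9]
    auc, lo, hi = bootstrap_auc_ci(y, p)
    print(f"AUC {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

With small validation or test sets the interval is wide, which is consistent with the spread of the intervals quoted in this abstract.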
Building on this, the use of handcrafted radiomic features extracted from the gross tumour volume (GTV) on radiotherapy planning scans was investigated. The impact of integrating EHR data with radiomic features in a combined model was also explored. Respective validation and test set AUCs for the radiomic-only models were: 1) OS: 0.712 (0.592–0.832) and 0.685 (0.585–0.784); 2) RFS: 0.825 (0.733–0.916) and 0.750 (0.665–0.835); 3) recurrence: 0.678 (0.554–0.801) and 0.673 (0.577–0.77). For the combined models: 1) OS: 0.702 (0.583–0.822) and 0.683 (0.586–0.78); 2) RFS: 0.805 (0.707–0.903) and 0.755 (0.672–0.838); 3) recurrence: 0.637 (0.51–0.765) and 0.738 (0.649–0.826). Again, an incremental improvement on TNM was demonstrated. Because the method uses the GTV already contoured for radiotherapy planning, it avoids tedious manual segmentation and can be integrated into the routine radiotherapy workflow, informing a personalised surveillance strategy at the point of treatment. This work lays the foundations for future prospective clinical trials of quantitative, personalised risk stratification for surveillance following curative-intent radiotherapy for NSCLC.
Contrast heterogeneity is prevalent in real-world medical imaging datasets, including those used in this thesis, and may impair the performance of radiomic and deep learning (DL) models. Synthesising non-contrast images of comparable quality from existing contrast-enhanced CT images may offer a solution. A cycle-GAN was developed to do this, using a multi-centre dataset of 2078 CT scans. Human experts asked to distinguish synthetic from acquired images had a false positive rate of 67% and a Fleiss' kappa of 0.06, attesting to the photorealism of the synthetic images. However, the performance of ML classifiers built on radiomic features decreased when synthetic images were used, and a marked percentage difference in feature values was noted between pre- and post-GAN non-contrast images. DL classification performance also deteriorated with synthetic images. Whilst the cycle-GAN can produce synthetic images of sufficient quality to pass human assessment, it appears to introduce subtle changes at the feature level that are detectable by DL and handcrafted radiomic classifiers. This implies that caution is warranted before GAN-synthesised images are used for data augmentation prior to radiomic modelling.
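For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Fleiss' kappa, which measures how far several raters agree beyond what chance would produce. The number of raters, the two categories (synthetic or acquired) and the example counts are hypothetical; the abstract reports only the resulting value of 0.06.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have been rated by
    the same number of raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Proportion of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]

    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1)) for row in counts]
    p_obs = sum(p_item) / n_items

    # Agreement expected by chance alone.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_obs - p_exp) / (1 - p_exp)


if __name__ == "__main__":
    # Hypothetical ratings: 3 raters, columns = [synthetic, acquired].
    table = [[2, 1], [1, 2], [3, 0], [0, 3], [1, 2], [2, 1]]
    print(round(fleiss_kappa(table), 3))
```

Values near 0 indicate agreement no better than chance, so a kappa of 0.06 suggests the experts could not consistently tell synthetic from acquired images.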
Deep learning (DL) is an alternative approach to handcrafted radiomics. In a set of preliminary experiments, several DL architectures were compared for predicting recurrence. The best-performing DL model, SEResNet101, had a validation set AUC of 0.72, which is not a considerable improvement on the handcrafted radiomic model's validation set AUC of 0.678. Possible reasons for this include the training set size of 436 scans, which may be considered relatively small in the context of DL.
Version
Open Access
Date Issued
2022-12
Date Awarded
2023-05
Copyright Statement
Creative Commons Attribution NonCommercial Licence
Advisor
Aboagye, Eric
Lee, Richard
Blackledge, Matthew
Publisher Department
Computing
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)