Abstract
This thesis addresses challenges in training machine learning (ML) models for tabular classification tasks with limited labelled data or labels containing quantified uncertainty, common in fields like healthcare. It explores two primary approaches: semi-supervised learning (SSL), which leverages both labelled and unlabelled data, and soft label learning (SLL), which explicitly incorporates label uncertainty.
Part I introduces Reliable Semi-Supervised Ensemble Learning (RESSEL), a novel SSL wrapper method that combines ensemble learning with self-training. RESSEL enhances model robustness without relying on complex assumptions, making it practical and widely applicable. Experimental evaluations on publicly available datasets demonstrate RESSEL's superior performance compared to traditional supervised learning (SL) and existing SSL wrapper methods. Further validating its utility, RESSEL is successfully applied to predict urinary tract infections (UTIs), one of the most common hospital-acquired infections. Due to the lack of structured recording of such infections, clinical data were retrospectively labelled by medical experts. Using these labels, it was established that UTIs can be reliably predicted by combining urinalysis with Gram staining. A clinical decision support system was developed, consisting of an exclusion step based on urinalysis and the application of the RESSEL method to both labelled and unlabelled data. This provided timely and accurate UTI predictions, thereby potentially reducing unnecessary antibiotic prescriptions.
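As a rough illustration of the general idea of wrapping self-training in an ensemble, the following minimal sketch may help. It is not the RESSEL algorithm: the nearest-centroid base learner, the 0.8 confidence threshold, the stratified bootstrap, the toy 1-D data, and all function names are assumptions made purely for illustration.

```python
import random

def fit_centroids(points):
    """Return the mean feature value per class for 1-D (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in points:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x):
    """Estimate P(class 1) from the relative distance to the two centroids."""
    d0, d1 = abs(x - centroids[0]), abs(x - centroids[1])
    return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)

def self_train(labelled, unlabelled, threshold=0.8, max_rounds=5):
    """Repeatedly pseudo-label confident unlabelled points, then refit."""
    train, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        centroids = fit_centroids(train)
        confident, rest = [], []
        for x in pool:
            p1 = predict_proba(centroids, x)
            if max(p1, 1 - p1) >= threshold:
                confident.append((x, int(p1 >= 0.5)))
            else:
                rest.append(x)
        if not confident:
            break
        train, pool = train + confident, rest
    return fit_centroids(train)

def stratified_bootstrap(labelled, rng):
    """Resample with replacement within each class so both classes survive."""
    sample = []
    for cls in (0, 1):
        members = [p for p in labelled if p[1] == cls]
        sample += [rng.choice(members) for _ in members]
    return sample

def ensemble_self_training(labelled, unlabelled, n_members=5, seed=0):
    """Train each ensemble member by self-training on its own bootstrap."""
    rng = random.Random(seed)
    return [self_train(stratified_bootstrap(labelled, rng), unlabelled)
            for _ in range(n_members)]

def ensemble_proba(models, x):
    """Average the members' P(class 1) estimates."""
    return sum(predict_proba(m, x) for m in models) / len(models)

# Toy data (hypothetical): four labelled points and seven unlabelled ones.
labelled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabelled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5, 5.2]
models = ensemble_self_training(labelled, unlabelled)
```

In this sketch the ambiguous point at 5.2 never reaches the confidence threshold, so it is left unlabelled rather than risking a wrong pseudo-label.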
Part II shifts focus to SLL, beginning with SYNLABEL, an innovative approach for generating synthetic datasets featuring controlled label uncertainty. By simulating uncertainty through systematically hidden variables, SYNLABEL produces realistic datasets that accurately model real-world label noise scenarios, facilitating rigorous evaluation of SLL methods. Subsequent analyses utilising data generated through SYNLABEL demonstrate that SLL methods consistently outperform traditional SL, particularly in scenarios involving limited or imbalanced datasets and label noise. When applied to UTI prediction, SLL models provide better-calibrated predictions than their SL counterparts, maintaining predictive accuracy even amidst noisy labels.
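The idea that hiding a variable turns a deterministic label into a distribution over labels can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the SYNLABEL procedure itself: the labelling rule `ground_truth`, the uniform distribution over the hidden feature, and all names are hypothetical.

```python
from collections import Counter

def ground_truth(x1, x2):
    """Hypothetical deterministic labelling rule that uses both features."""
    return int(x1 + x2 >= 2)

def soft_labels_after_hiding(x1_values, x2_values, n_classes=2):
    """Marginalise the hidden feature x2 (assumed uniform) to get P(y | x1)."""
    soft = {}
    for x1 in x1_values:
        counts = Counter(ground_truth(x1, x2) for x2 in x2_values)
        total = sum(counts.values())
        soft[x1] = [counts.get(c, 0) / total for c in range(n_classes)]
    return soft

# With x2 hidden, each observed x1 maps to a soft label instead of a hard one.
soft = soft_labels_after_hiding(x1_values=[0, 1, 2], x2_values=[0, 1, 2, 3])
# soft[0] is [0.5, 0.5], soft[1] is [0.25, 0.75], soft[2] is [0.0, 1.0]
```

The amount of label uncertainty is controlled by which variable is hidden: for `x1 = 2` the hidden feature cannot change the label, so the soft label stays certain.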
Finally, the thesis investigates improving pretrained ensemble models through weight optimisation using soft labels. Empirical evaluations show significant performance enhancements through this optimisation. Additionally, incorporating unlabelled data into the optimisation process was tested but found not to yield additional benefits, indicating that soft label optimisation alone is most effective.
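To make the notion of optimising ensemble weights against soft labels concrete, here is a minimal sketch. It assumes a two-member ensemble, a cross-entropy objective, and a simple grid search over a single mixing weight; none of these choices, nor the example numbers, are taken from the thesis.

```python
import math

def cross_entropy(soft_target, probs, eps=1e-12):
    """Cross-entropy between a soft label and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(soft_target, probs))

def blend(w, probs_a, probs_b):
    """Weighted average of two members' class probabilities."""
    return [w * a + (1 - w) * b for a, b in zip(probs_a, probs_b)]

def mean_loss(w, member_a, member_b, soft_labels):
    """Average cross-entropy of the blended ensemble over all samples."""
    losses = [cross_entropy(t, blend(w, a, b))
              for a, b, t in zip(member_a, member_b, soft_labels)]
    return sum(losses) / len(losses)

def best_weight(member_a, member_b, soft_labels, steps=100):
    """Grid-search the weight on member A that minimises the mean loss."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        loss = mean_loss(w, member_a, member_b, soft_labels)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Hypothetical predictions from two pretrained members on two samples.
member_a = [[0.9, 0.1], [0.2, 0.8]]
member_b = [[0.5, 0.5], [0.6, 0.4]]
soft_labels = [[0.8, 0.2], [0.3, 0.7]]

w, loss = best_weight(member_a, member_b, soft_labels)
```

Because the members themselves stay fixed, only the mixing weight is fitted; with more members, the grid search would typically be replaced by a constrained optimiser over a weight vector.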
In summary, this thesis significantly advances the accessibility and practicality of SSL and SLL methods by developing robust, easily integrable tools for standard ML workflows. By providing novel frameworks such as RESSEL and SYNLABEL, it promotes broader adoption of SSL and SLL techniques, offering promising strategies to address the inherent challenges of limited and uncertain labelled data in ML classification tasks.
Original language | English |
---|---|
Qualification | Doctor of Philosophy |
Awarding Institution | |
Supervisors/Advisors | |
Award date | 15 May 2025 |
Place of Publication | Utrecht |
Publisher | |
Print ISBNs | 978-90-393-7868-7 |
DOIs | |
Publication status | Published - 15 May 2025 |
Keywords
- Machine learning
- Semi-supervised learning
- Soft label learning
- Ensemble learning
- Label noise
- Label uncertainty
- Synthetic data
- Clinical decision support