Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
DSA ADS Course - 2021
Machine Learning, Radiology, Radiographs, CT scans, Medicine, Learning Algorithms
Machine learning in medicine offers massive potential to help physicians diagnose disease and suggest optimal treatments. Yet at this early stage there appear to be serious problems with the data used to train learning algorithms. Scrutiny of machine learning tools, and of the data used to train them, is particularly important because they draw correlations that are hard, if not impossible, for humans to verify independently. It is also important to consider the time-locked nature of AI models when they are evaluated: a model trained on one set of data and then deployed in an ever-changing world is not guaranteed to keep working in the same way. The effects of diseases on patients can change, and so can the methods of treating them.
A team from Cambridge examined the models, more than 400 in total, and found that every single one was fatally flawed. The review found the algorithms were often trained on small, single-origin data samples with limited diversity; some even reused the same data for training and testing, a cardinal sin that can lead to misleadingly impressive performance. An ever-growing list of papers relies on limited or low-quality data, fails to specify the training approach and statistical methods, and does not test whether the resulting models work for people of different races, genders, ages, and geographies.
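To make the train/test reuse pitfall concrete, here is a minimal Python sketch of a patient-level split. The arrays `images`, `labels`, and `patient_ids` are hypothetical placeholders for a real imaging dataset; the point is simply that no patient's scans should appear on both sides of the split.

```python
# Minimal sketch of a patient-level train/test split, assuming hypothetical
# arrays `images`, `labels`, and `patient_ids` (one entry per scan).
# Splitting by patient, not by scan, prevents images from the same patient
# leaking into both the training and test sets.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
images = rng.normal(size=(100, 64, 64))      # placeholder CXR/CT arrays
labels = rng.integers(0, 2, size=100)        # placeholder diagnoses
patient_ids = rng.integers(0, 30, size=100)  # several scans per patient

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(images, labels, groups=patient_ids))

# No patient contributes scans to both sides of the split.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```

Splitting at the scan level instead of the patient level is one of the easiest ways to produce the "misleadingly impressive performance" described above, because the model can memorize patient-specific features rather than the disease.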
By far the biggest problem, and the trickiest to solve, is machine learning's Catch-22: there are few large, diverse data sets on which to train and validate a new tool, and many of those that do exist are kept confidential for legal or business reasons. That means outside researchers have no data against which to test a paper's claims or compare it with similar work, a key step in vetting any scientific research.
The failure to test AI models on data from different sources — a process known as external validation — is common in studies published on preprint servers and in leading medical journals. It often results in an algorithm that looks highly accurate in a study, but fails to perform at the same level when exposed to the variables of the real world, such as different types of patients or imaging scans obtained with different devices.
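Below is a minimal sketch of what external validation can look like in code, assuming two hypothetical feature sets: `X_int`/`y_int` from the development site and `X_ext`/`y_ext` from a different hospital or scanner. The specific model is incidental; the point is to report the external-site number alongside the internal one.

```python
# Minimal sketch of external validation, assuming hypothetical feature
# matrices from two sources: `X_int`/`y_int` (the development site) and
# `X_ext`/`y_ext` (a different hospital or scanner). A model that only
# looks good on the internal split has not been externally validated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_int, y_int = rng.normal(size=(200, 16)), rng.integers(0, 2, size=200)
X_ext, y_ext = rng.normal(size=(80, 16)), rng.integers(0, 2, size=80)

X_tr, X_te, y_tr, y_te = train_test_split(X_int, y_int, test_size=0.25,
                                          random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Report both numbers; a large gap suggests the model will not transfer.
print("internal test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
print("external AUC:     ", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
```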
A recent STAT investigation found that only 73 of 161 AI products approved by the federal Food and Drug Administration publicly disclosed the amount of data used to validate the product, and just seven reported the racial makeup of their study populations. Even the sources of the data were almost never given. Those findings were echoed in a paper by Stanford researchers, who highlighted the lack of prospective studies (studies that examine future outcomes) even for higher-risk AI products cleared by the FDA. They also noted that most AI devices were evaluated at a small number of sites and that only a tiny fraction reported how the AI performed in different demographic groups.
The review conducted by Cambridge found that many studies not only lacked external validation, but also neglected to specify the data sources used or details on how their AI models were trained. All but 62 of the more than 400 papers failed to pass an initial quality screening based on those omissions and other lapses.
Even the papers that survived the initial screening suffered from multiple shortcomings: 55 of those 62 were found to be at high risk of bias, a problem compounded by the lack of consensus standards for evaluating machine learning research in medicine, although that is beginning to change.
In many of these studies, the AI could simply be detecting differences in scanning methods and equipment rather than in the physiology of the patients. The Cambridge researchers also noted that performance was often not tested on an independent dataset to verify a model's ability to reliably recognize the illness in different groups of patients. Similar methodological flaws are common across a wide swath of machine learning research; pointing out these lapses has become its own subgenre of medical research, with many papers and editorials calling for better evaluation models and urging researchers to be more transparent about their methods.
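One way to probe for this kind of shortcut, sketched below with assumed placeholder data (`X`, `y`, and a `site` array recording the source hospital or scanner), is to check how easily the acquisition site can be predicted from the same features, and whether the site is correlated with the label. This is only one heuristic among several, not a definitive test.

```python
# Minimal sketch of a confounding check, assuming a hypothetical feature
# matrix `X`, labels `y`, and a `site` array recording which hospital or
# scanner produced each image. If a simple probe can predict the site from
# the same features, and the site is correlated with the label, the model
# may be learning acquisition differences rather than pathology.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 16))
y = rng.integers(0, 2, size=300)
site = rng.integers(0, 3, size=300)  # e.g. three hospitals

site_probe_acc = cross_val_score(LogisticRegression(max_iter=1000),
                                 X, site, cv=5).mean()
label_site_corr = np.corrcoef(y, site)[0, 1]

print("site predictable from features (accuracy):", round(site_probe_acc, 3))
print("label/site correlation:", round(label_site_corr, 3))
```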
The inability to replicate findings is especially problematic, eroding trust in AI and undermining efforts to deploy it in clinical care.
A recent review of 511 machine learning studies across multiple fields found that the ones produced in health care were particularly hard to replicate, because the underlying code and datasets were seldom disclosed. The review, conducted by MIT researchers, found that only about 23% of machine learning studies in health care used multiple datasets to establish their results, compared to 80% in the adjacent field of computer vision, and 58% in natural language processing.
It is an understandable gap, given the privacy restrictions in health care and the difficulty of accessing data that spans multiple institutions. But it nonetheless makes it more difficult for AI developers in health care to obtain enough data to develop meaningful models in the first place, and makes it even harder for them to publicly disclose their sources so findings can be replicated.
Note that there are a number of ways to share data without undermining privacy or intellectual property, such as federated learning, in which institutions jointly develop models without exchanging their underlying data. Others are using synthetic data, modeled on real patients, to help preserve privacy.
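The following is a minimal sketch of the federated idea, using a FedAvg-style weighted average of locally trained logistic-regression weights over hypothetical per-site data. Production systems would layer on secure aggregation and differential privacy, which are omitted here.

```python
# Minimal sketch of federated averaging (FedAvg-style), assuming each
# institution holds its own (X, y) data and shares only model weights,
# never the underlying patient records. Local training here is a few steps
# of gradient descent on a logistic regression.
import numpy as np

rng = np.random.default_rng(3)
sites = [(rng.normal(size=(120, 8)), rng.integers(0, 2, size=120))
         for _ in range(3)]                       # three hospitals' local data

def local_update(w, X, y, lr=0.1, steps=20):
    """Run a few gradient steps on one site's data; only `w` leaves the site."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # logistic predictions
        w -= lr * X.T @ (p - y) / len(y)          # gradient of the log loss
    return w

w_global = np.zeros(8)
for _ in range(10):                               # communication rounds
    local_weights = [local_update(w_global.copy(), X, y) for X, y in sites]
    sizes = np.array([len(y) for _, y in sites], dtype=float)
    # Server averages the returned weights, weighted by each site's data size.
    w_global = np.average(local_weights, axis=0, weights=sizes)
```

Only the weight vectors travel between sites and the server, which is what allows institutions to collaborate without exchanging patient-level data.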
--------------------------
March, 2021
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
Abstract
Machine learning methods offer great promise for fast and accurate detection and prognostication of coronavirus disease 2019 (COVID-19) from standard-of-care chest radiographs (CXR) and chest computed tomography (CT) images. Many articles have been published in 2020 describing new machine learning-based models for both of these tasks, but it is unclear which are of potential clinical utility. In this systematic review, we consider all published papers and preprints, for the period from 1 January 2020 to 3 October 2020, which describe new machine learning models for the diagnosis or prognosis of COVID-19 from CXR or CT images. All manuscripts uploaded to bioRxiv, medRxiv and arXiv along with all entries in EMBASE and MEDLINE in this timeframe are considered. Our search identified 2,212 studies, of which 415 were included after initial screening and, after quality screening, 62 studies were included in this systematic review. Our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases. This is a major weakness, given the urgency with which validated COVID-19 models are needed. To address this, we give many recommendations which, if followed, will solve these issues and lead to higher-quality model development and well-documented manuscripts.