Hurdles to Artificial Intelligence Deployment: Noise in Schemas and “Gold” Labels


Despite frequent reports of imaging artificial intelligence (AI) that parallels human performance, clinicians often question the safety and robustness of AI products in practice. This work explores two under-reported sources of noise that negatively affect imaging AI: (a) variation in labeling schema definitions and (b) noise in the labeling process. First, the overlap between the schemas of two publicly available datasets and a third-party vendor is compared, showing low agreement (< 50%) between them. The authors also highlight the problem of label inconsistency, where different annotation schemas are selected for the same clinical prediction task; this results in inconsistent use of medical ontologies through intermingled or duplicate observations and diseases. Second, the individual radiologist annotations for the CheXpert test set are used to quantify noise in the labeling process. The analysis demonstrated that label noise varies by class: agreement was high for pneumothorax and medical devices (percent agreement > 90%). Among low-agreement classes (pneumonia, consolidation), the labels assigned as “ground truth” were unreliable, suggesting that the result of majority voting is highly dependent on which group of radiologists is assigned to annotation. Noise in labeling schemas and gold label annotations is pervasive in medical imaging classification and impacts downstream clinical deployment. Possible solutions (eg, changes to task design, annotation methods, and model training) and their potential to improve trust in clinical AI are discussed.
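The two quantities the abstract relies on, percent agreement between annotators and the dependence of a majority vote on which annotators are chosen, can be illustrated with a small sketch. This is not the authors' analysis code: the function names, the three-reader panel size, and the toy binary labels are all assumptions made for illustration, and no CheXpert data is used.

```python
from itertools import combinations

def pairwise_percent_agreement(annotations):
    """Mean fraction of items on which two raters give the same label,
    averaged over all rater pairs. `annotations` is a list of per-rater
    label lists, all of the same length."""
    n_items = len(annotations[0])
    pairs = list(combinations(range(len(annotations)), 2))
    agree = sum(
        annotations[a][i] == annotations[b][i]
        for a, b in pairs
        for i in range(n_items)
    )
    return agree / (len(pairs) * n_items)

def majority_vote(annotations, raters):
    """Binary majority label per item for the chosen raters.
    Use an odd number of raters; ties would fall to 0 here."""
    n_items = len(annotations[0])
    return [
        int(2 * sum(annotations[r][i] for r in raters) > len(raters))
        for i in range(n_items)
    ]

def vote_instability(annotations, panel_size=3):
    """Fraction of items whose majority label changes depending on
    which panel of `panel_size` raters is selected."""
    panels = combinations(range(len(annotations)), panel_size)
    votes = [majority_vote(annotations, p) for p in panels]
    n_items = len(annotations[0])
    unstable = sum(len({v[i] for v in votes}) > 1 for i in range(n_items))
    return unstable / n_items

# Hypothetical labels: 5 raters (rows) x 4 images (columns), one finding.
toy = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]

print(pairwise_percent_agreement(toy))  # 0.75
print(vote_instability(toy, 3))         # 0.25
```

In this toy example only the third image has a majority label that flips with the choice of panel, which is the kind of instability the abstract describes for low-agreement classes.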

In Radiology Artificial Intelligence