These Algorithms Look at X-Rays—and Somehow Detect Your Race

Millions of dollars are being spent to develop artificial intelligence software that reads x-rays and other medical scans in hopes it can spot things doctors look for but sometimes miss, such as lung cancers. A new study reports that these algorithms can also see something doctors don’t look for on such scans: a patient’s race.

The study authors and other medical AI experts say the results make it more crucial than ever to check that health algorithms perform fairly on people with different racial identities. Complicating that task: The authors themselves aren’t sure what cues the algorithms they created use to predict a person’s race.

Evidence that algorithms can read race from a person’s medical scans emerged from tests on five types of imagery used in radiology research, including chest and hand x-rays and mammograms. The images included patients who identified as Black, white, and Asian. For each type of scan, the researchers trained algorithms using images labeled with a patient’s self-reported race. Then they challenged the algorithms to predict the race of patients in different, unlabeled images.

Radiologists don’t generally consider a person’s racial identity—which is not a biological category—to be visible on scans that look beneath the skin. Yet the algorithms somehow proved capable of accurately detecting it for all three racial groups, and across different views of the body.

For most types of scan, the algorithms could correctly identify which of two images was from a Black person more than 90 percent of the time. Even the worst performing algorithm succeeded 80 percent of the time; the best was 99 percent correct. The results and associated code were posted online late last month by a group of more than 20 researchers with expertise in medicine and machine learning, but the study has not yet been peer reviewed.
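The "which of two images" framing is a pairwise ranking measure. The article does not name the metric, so the sketch below assumes the common convention in which this figure is equivalent to the area under the ROC curve. The scores and the function name `pairwise_accuracy` are made up for illustration and are not taken from the study's code.

```python
from itertools import product


def pairwise_accuracy(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive
    example receives the higher score. Ties count as half a win."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model outputs: higher means "more likely a Black patient".
black_scores = [0.9, 0.8, 0.7, 0.4]   # images from Black patients (invented)
other_scores = [0.6, 0.3, 0.2, 0.1]   # images from other patients (invented)

print(pairwise_accuracy(black_scores, other_scores))  # 0.9375
```

In this toy case the model orders 15 of the 16 possible pairs correctly, so it would pick the right image about 94 percent of the time.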

The results have spurred new concerns that AI software can amplify inequality in health care, where studies show Black patients and other marginalized racial groups often receive inferior care compared to wealthy or white people.

Machine-learning algorithms are tuned to read medical images by feeding them many labeled examples of conditions such as tumors. By digesting many examples, the algorithms can learn patterns of pixels statistically associated with those labels, such as the texture or shape of a lung nodule. Some algorithms made that way rival doctors at detecting cancers or skin problems; there is evidence they can detect signs of disease invisible to human experts.

Judy Gichoya, a radiologist and assistant professor at Emory University who worked on the new study, says the revelation that image algorithms can “see” race in internal scans likely primes them to also learn inappropriate associations.

Medical data used to train algorithms often bears traces of racial inequalities in disease and medical treatment, due to historical and socioeconomic factors. That could lead an algorithm searching for statistical patterns in scans to use its guess at a patient’s race as a kind of shortcut, suggesting diagnoses that correlate with racially biased patterns from its training data, not just the visible medical anomalies that radiologists look for. Such a system might give some patients an incorrect diagnosis or a false all-clear. An algorithm might suggest different diagnoses for a Black person and white person with similar signs of disease.
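A toy illustration of that shortcut, under loud assumptions: the data is invented, the "model" is just a frequency table, and group membership is handed to it directly, whereas in the scenario the researchers describe a model would have to infer it from the image. The names `fit_rates` and `predict` are hypothetical.

```python
from collections import defaultdict


def fit_rates(records):
    """Learn the recorded diagnosis rate for each (sign, group) combination."""
    counts = defaultdict(lambda: [0, 0])  # key -> [positives, total]
    for sign, group, label in records:
        counts[(sign, group)][0] += label
        counts[(sign, group)][1] += 1
    return {key: pos / total for key, (pos, total) in counts.items()}


def predict(rates, sign, group, threshold=0.6):
    """Flag disease when the learned rate meets the (arbitrary) threshold."""
    return rates[(sign, group)] >= threshold


# Invented training records: (visible_sign, group, recorded_diagnosis).
# Group B patients with the same visible sign were diagnosed less often,
# standing in for historical under-diagnosis in the training data.
train = (
    [(1, "A", 1)] * 9 + [(1, "A", 0)] * 1
    + [(1, "B", 1)] * 5 + [(1, "B", 0)] * 5
    + [(0, "A", 0)] * 10 + [(0, "B", 0)] * 10
)

rates = fit_rates(train)
print(predict(rates, sign=1, group="A"))  # True
print(predict(rates, sign=1, group="B"))  # False
```

Two patients with the same visible sign get different answers, purely because the model reproduced a pattern in how past patients were labeled.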

“We have to educate people about this problem and research what we can do to mitigate it,” Gichoya says. Her collaborators on the project came from institutions including Purdue, MIT, Beth Israel Deaconess Medical Center, National Tsing Hua University in Taiwan, University of Toronto, and Stanford.

Previous studies have shown that medical algorithms have caused biases in care delivery, and that image algorithms may perform unequally for different demographic groups. In 2019, a widely used algorithm for prioritizing care for the sickest patients was found to disadvantage Black people. In 2020, researchers at the University of Toronto and MIT showed that algorithms trained to flag conditions such as pneumonia on chest x-rays sometimes performed differently for people of different sexes, ages, races, and types of medical insurance.

Paul Yi, director of the University of Maryland’s Intelligent Imaging Center, who was not involved in the new study showing algorithms can detect race, describes some of its findings as “eye opening,” even “crazy.”

Radiologists like him don’t typically think about race when interpreting scans, or even know how a patient self-identifies. “Race is a social construct and not in itself a biological phenotype, even though it can be associated with differences in anatomy,” Yi says.

Frustratingly, the authors of the new study could not figure out how exactly their models could so accurately detect a patient’s self-reported race. They say that will likely make it harder to pick up biases in such algorithms.

Follow-on experiments showed that the algorithms were not making predictions based on particular patches of anatomy, or visual features that might be associated with race due to social and environmental factors such as body mass index or bone density. Nor did age, sex, or specific diagnoses that are associated with certain demographic groups appear to be functioning as clues.

Algorithms trained on images from a hospital in one part of the US could accurately identify race in images from institutions in other regions. That appears to rule out the possibility that the software is picking up on factors unrelated to a patient’s body, such as differences in imaging equipment or processes, says Yi.

Whatever the algorithms were seeing, they saw it clearly. The software could still predict patient race with high accuracy when x-rays were degraded so that they were unreadable to even a trained eye, or blurred to remove fine detail.

Luke Oakden-Rayner, a coauthor on the new study and director of medical imaging research at Royal Adelaide Hospital, Australia, calls the AI ability the collaborators uncovered “the worst superpower.” He says that despite the unknown mechanism, it demands an immediate response from people developing or selling AI systems to analyze medical scans.

A database of AI algorithms maintained by the American College of Radiology lists dozens for analyzing chest imagery that have been approved by the Food and Drug Administration. Many were developed using the same standard data sets that the new study used to train algorithms to predict race. Although the FDA recommends that companies measure and report performance on different demographic groups, such data is rarely released.

Oakden-Rayner says that such checks and disclosures should become standard. “Commercial models can almost certainly identify the race of patients, so companies need to ensure that their models are not utilizing that information to produce unequal outcomes,” he says.
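One simple form such a check could take is comparing how often a model catches real disease in each group. The sketch below is a minimal example, not a method described by the FDA or the study authors; the records are invented and `sensitivity_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Return, per group, the share of truly diseased patients the model flagged.

    Each record is (group, has_disease, model_flagged) with 0/1 values.
    Groups with no diseased patients are left out of the result.
    """
    flagged = defaultdict(int)
    diseased = defaultdict(int)
    for group, has_disease, model_flagged in records:
        if has_disease == 1:
            diseased[group] += 1
            if model_flagged == 1:
                flagged[group] += 1
    return {g: flagged[g] / diseased[g] for g in diseased}


# Invented evaluation data: 10 diseased patients per group.
records = (
    [("A", 1, 1)] * 9 + [("A", 1, 0)] * 1    # group A: 9 of 10 caught
    + [("B", 1, 1)] * 6 + [("B", 1, 0)] * 4  # group B: 6 of 10 caught
    + [("A", 0, 0)] * 20 + [("B", 0, 0)] * 20
)

for group, rate in sorted(sensitivity_by_group(records).items()):
    print(group, rate)
```

A gap like the one in this toy data, where the model misses far more cases in one group, is the kind of unequal outcome that would stay hidden if only a single overall accuracy figure were reported.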

Yi agrees, saying the study is a reminder that while machine-learning algorithms can help human experts with practical problems in the clinic, they work differently than people. “Neural networks are sort of like savants, they’re singularly efficient at one task,” he says. “If you train a model to detect pneumonia, it’s going to find one way or another to get that correct answer, leveraging whatever it can find in the data.”