UT Austin, UC Berkeley researchers discover bias in AI algorithms
April 7, 2022
As part of a collaborative study with researchers at the University of California, Berkeley, a UT researcher identified errors in the training of artificial intelligence algorithms that could have detrimental impacts on medical imaging.
It has recently become easier to train AI algorithms, leading to notable benefits for medical imaging research, said Jon Tamir, the UT researcher on the team. However, in their March 21 study, the researchers found that when a dataset is used to train an algorithm it was not intended for, it can introduce machine learning biases that distort medical scans.
“Machine learning is a really important topic because we’re designing these systems that (collect data) to make decisions, and how these decisions impact people is hugely important,” said Tamir, assistant professor of electrical and computer engineering.
Alex Dimakis, an electrical and computer engineering professor, said machine learning bias arises when a systematic error enters an algorithm, such as through misused or “off-label” data, which leads to biased outputs.
Efrat Shimron, a postdoctoral fellow at UC Berkeley, said that when training algorithms, researchers are expected to use raw data, which allows algorithms to learn from correct measurements. Instead, Shimron said, some researchers unintentionally use processed data whose measurements have been altered, which produces inaccurate results from the algorithm.
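To see the distinction concretely, consider the small hypothetical Python sketch below (not code from the study; the image and phase values are arbitrary stand-ins). In MRI, the raw data a scanner measures is complex-valued k-space, while the processed images typically shared online keep only the magnitude, so the original measurements cannot be recovered from them:

```python
# Hypothetical sketch (not from the study): raw MRI data is complex
# k-space, while shared "processed" data is usually a magnitude image
# from which the original measurements cannot be recovered.
import numpy as np

rng = np.random.default_rng(0)
anatomy = rng.random((64, 64))                         # stand-in image
phase = np.exp(1j * rng.uniform(-0.5, 0.5, (64, 64)))  # scanner phase

kspace = np.fft.fft2(anatomy * phase)      # raw measurement (complex)
magnitude = np.abs(np.fft.ifft2(kspace))   # processed image (phase dropped)

# Rebuilding the raw measurements from the processed image fails,
# because the phase information was discarded along the way.
rebuilt = np.fft.fft2(magnitude)
err = np.linalg.norm(kspace - rebuilt) / np.linalg.norm(kspace)
print(f"relative k-space error introduced by processing: {err:.1%}")
```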
The research team focused on datasets of JPEG-compressed images; JPEG compression shrinks image data to decrease the space the images take up on computers, making them easier to export. JPEG compression is the most common form of processing in datasets used off-label, and it interferes with the images’ underlying measurements, impeding the way algorithms respond to new data, Dimakis said.
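As a rough illustration of the effect Dimakis describes, the hypothetical Python sketch below (not code from the study; the synthetic image and quality setting are arbitrary) round-trips an image through JPEG compression and measures how far the stored pixel values drift from the originals:

```python
# Hypothetical sketch (not from the study): JPEG compression quietly
# rewrites the pixel values an algorithm would later train on.
import io

import numpy as np
from PIL import Image

# Stand-in for a raw grayscale scan: a smooth gradient plus fine texture.
rng = np.random.default_rng(0)
x = np.linspace(0, 255, 128)
raw = (np.add.outer(x, x) / 2 + rng.normal(0, 8, (128, 128))).clip(0, 255)
raw = raw.astype(np.uint8)

# Round-trip through lossy JPEG compression entirely in memory.
buf = io.BytesIO()
Image.fromarray(raw).save(buf, format="JPEG", quality=75)
buf.seek(0)
compressed = np.asarray(Image.open(buf))

# The stored "measurements" no longer match the original pixels.
diff = np.abs(raw.astype(int) - compressed.astype(int))
print(f"pixels altered: {(diff > 0).mean():.0%}")
print(f"mean absolute change: {diff.mean():.2f} gray levels")
```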
Scientists use JPEG-compressed images to train medical imaging algorithms for modalities like MRI, CT and X-ray. These modalities rely on accurate measurements to generate scans, so compressed images with altered measurements can produce false scans, Dimakis said.
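The toy experiment below (a simplified numerical sketch, not the study’s code; a strong low-pass filter stands in for both lossy compression and an undersampled scan) shows how the same crude reconstruction scores noticeably better when judged against already-processed data, because the fine detail it fails to recover was never there to begin with:

```python
# Toy sketch (not the study's code): the same reconstruction looks
# better when scored against processed data, because processing already
# removed the fine detail that reconstruction struggles to recover.
import numpy as np

rng = np.random.default_rng(1)

def psnr(ref, est):
    """Peak signal-to-noise ratio in decibels (peak fixed at 255)."""
    mse = np.mean((ref - est) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)

def lowpass(img, half):
    """Keep only the central (2*half)x(2*half) block of k-space."""
    k = np.fft.fftshift(np.fft.fft2(img))
    out = np.zeros_like(k)
    c = img.shape[0] // 2
    sl = slice(c - half, c + half)
    out[sl, sl] = k[sl, sl]
    return np.real(np.fft.ifft2(np.fft.ifftshift(out)))

# Stand-in "raw" scan: smooth anatomy plus fine, noise-like texture.
x = np.linspace(0, 255, 128)
raw = np.add.outer(x, x) / 2 + rng.normal(0, 8, (128, 128))

# "Processed" copy: fine detail already discarded, mimicking compression.
processed = lowpass(raw, 32)

# Score one crude reconstruction (an even stronger low-pass) against each.
print(f"PSNR vs raw ground truth:       {psnr(raw, lowpass(raw, 24)):.1f} dB")
print(f"PSNR vs processed ground truth: {psnr(processed, lowpass(processed, 24)):.1f} dB")
```

Both reconstruction calls produce the same image; only the reference it is scored against changes, and that alone is enough to inflate the metric.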
Ideally, scientists should use original medical imaging scans as raw data to train new AI algorithms, Shimron said. However, raw data is not easily accessible: original MRI scans are costly to obtain, averaging about $600 per scan, and thousands are necessary for training the AI.
“So people go on the web, they go awry, and find some sort of data somewhere out there that was published for something (else), not necessarily image reconstruction,” Shimron said.
The researchers coined the term “implicit data crimes” to describe the inaccurate results produced by AI trained on processed data. Studies involving such AI often obtain biased, overly optimistic results and publish them without disclosing to the public where the data came from, Shimron said.
Tamir said he hopes their paper makes people more aware of the harm processed data can have on algorithm training.
“There really is a need to collect this raw data and make it publicly accessible,” Tamir said. “It’s an expensive process, but it’s something that pays off quite well for society.”