Annotating breast cancer tissue by medical students
The Computational and Integrative Pathology Group at Northwestern University curated large-scale breast cancer annotation datasets using annotations from medical students.
Problem: Training deep learning models requires an enormous amount of labeled data. Pathology is the branch of medicine in which trained doctors interpret microscopic slides of tissue biopsies, often to determine whether cancer is present. Obtaining the labels needed to train models for the same task is difficult because pathologists are very busy and annotation tasks are often mundane and repetitive.

Solution: We recruited a large crowd of medical students from various countries to annotate tissue components in scanned breast cancer slides. Medical students have some background knowledge in pathology, and they often have strong incentives to participate in research projects, since research experience helps them secure residency training positions. We developed a "structured" crowdsourcing approach in which practicing pathologists and pathology trainees supervised the work to ensure data quality. We also asked all participants to annotate a small shared subset of images so that we could measure inter-rater variability and gauge the quality of the medical students' annotations.
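To make the idea of measuring inter-rater variability on shared images concrete, here is a minimal sketch that compares one student's labels with a pathologist's labels on the same regions. It assumes Cohen's kappa as the agreement statistic; the text above does not say which statistic the study used, and the function name, class names, and labels below are purely illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa).

    Assumption: both lists label the same items in the same order.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six shared regions (not data from the study).
student = ["tumor", "stroma", "tumor", "lymphocyte", "tumor", "stroma"]
pathologist = ["tumor", "stroma", "stroma", "lymphocyte", "tumor", "stroma"]

kappa = cohen_kappa(student, pathologist)
print(f"kappa = {kappa:.3f}")
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance. With many annotators, a pairwise statistic like this would be computed for each pair or replaced by a multi-rater measure.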

In the two iterations of this project, we collected a total of 20,000 region annotations (low-resolution patterns) and 200,000 nucleus annotations (object detection).

"We show how suggested annotations generated by a weak algorithm can improve the accuracy of annotations generated by non-experts and can yield useful data for training segmentation algorithms without laborious manual tracing. We systematically examine interrater agreement and describe modifications to the MaskRCNN model to improve cell mapping. We also describe a technique we call Decision Tree Approximation of Learned Embeddings (DTALE) that leverages nucleus segmentations and morphologic features to improve the transparency of nucleus classification models," says the study's lead author, Pre-doctoral Fellow at Northwestern University, Mohamed Amgad.

Video of the Crowd Science Seminar led by Mohamed Amgad:

Published articles about the research, for a deeper understanding:
https://arxiv.org/abs/2102.09099
https://academic.oup.com/bioinformatics/article/35...

Published datasets for training machine learning models:
https://sites.google.com/view/nucls
https://github.com/PathologyDataScience/BCSS
