Whole-cell segmentation of tissue images with human-level performance using large-scale data
annotation and deep learning
Noah F. Greenwald^1,2*, Geneva Miller^3*, Erick Moen^3, Alex Kong^2, Adam Kagel^2, Christine Camacho Fullaway^2, Brianna J. McIntosh^1, Ke Leow^1,2, Morgan Sarah Schwartz^3, Thomas Dougherty^3, Cole Pavelchek^3,4, Sunny Cui^5,6, Isabella Camplisson^3, Omer Bar-Tal^7, Jaiveer Singh^2, Mara Fong^2, Gautam Chaudhry^2, Zion Abraham^2, Jackson Moseley^2, Shiri Warshawsky^2, Erin Soon^2, Shirley Greenbaum^2, Tyler Risom^2, Travis Hollmann^8, Leeat Keren^7, Will Graf^3, Michael Angelo^2†, David Van Valen^3†
1. Cancer Biology Program, Stanford University
2. Department of Pathology, Stanford University
3. Division of Biology and Bioengineering, California Institute of Technology
4. Present address: Washington University in St. Louis Medical School
5. Department of Electrical Engineering, California Institute of Technology
6. Present address: Department of Computer Science, Princeton University
7. Department of Molecular Cell Biology, Weizmann Institute of Science
8. Department of Pathology, Memorial Sloan Kettering Cancer Center
* These authors contributed equally to this work
† These authors jointly supervised this work
Abstract
Understanding the spatial organization of tissues is of critical importance for both basic and translational
research. While recent advances in tissue imaging are opening an exciting new window into the biology of
human tissues, interpreting the data that they create is a significant computational challenge. Cell
segmentation, the task of uniquely identifying each cell in an image, remains a substantial barrier for tissue
imaging, as existing approaches are inaccurate or require a substantial amount of manual curation to yield
useful results. Here, we addressed the problem of cell segmentation in tissue imaging data through large-
scale data annotation and deep learning. We constructed TissueNet, an image dataset containing >1 million
paired whole-cell and nuclear annotations for tissue images from nine organs and six imaging platforms.
We created Mesmer, a deep learning-enabled segmentation algorithm trained on TissueNet that performs
nuclear and whole-cell segmentation in tissue imaging data. We demonstrated that Mesmer has better speed
and accuracy than previous methods, generalizes to the full diversity of tissue types and imaging platforms
in TissueNet, and achieves human-level performance for whole-cell segmentation. Mesmer enabled the
automated extraction of key cellular features, such as subcellular localization of protein signal, which was
challenging with previous approaches. We further showed that Mesmer could be adapted to harness cell
lineage information present in highly multiplexed datasets. We used this enhanced version to quantify cell
morphology changes during human gestation. All underlying code and models are released with permissive
licenses as a community resource.
Introduction
Understanding the structural and functional relationships present within tissues is a challenge at the forefront of basic and translational research. Recent advances in multiplexed imaging have dramatically expanded the number of transcripts and proteins that can be quantified in a single tissue section while also improving the throughput of these platforms^1–12. These technological improvements have opened up exciting new frontiers for large-scale analysis of human tissue samples. Ambitious collaborative efforts such as the Human Tumor Atlas Network^13, the Human BioMolecular Atlas Program^14, and the Human Cell Atlas^15 are now using novel imaging techniques to comprehensively characterize the location, function, and phenotype of the cells in the human body. By generating high-quality, open-source datasets characterizing the full breadth of human tissues, these efforts promise to be as transformative as the Human Genome Project in unleashing the next era of biological discovery.
Despite this immense promise, the tools to facilitate the analysis and interpretation of these datasets at scale do not yet exist. The clearest example of this shortcoming is the lack of a generalized algorithm for locating single cells in images. Unlike flow cytometry or single-cell RNA sequencing methods, in which individual cells are dissociated and physically separated from one another prior to being analyzed, tissue imaging is performed with intact specimens. Thus, in order to extract single-cell information from images, each pixel must be assigned to a cell after image acquisition in a process known as cell segmentation. Since the features extracted through this process are the basis for downstream analyses like cell-type identification and tissue neighborhood analyses^16, inaccuracies at this stage have far-reaching consequences for interpreting image data.
Achieving accurate and automated cell segmentation for tissues remains a substantial challenge. Depending on the tissue, cells can be rare and dispersed within a large bed of extracellular matrix or densely packed such that contrast between adjacent neighbors is limited. Cell size in non-neuronal mammalian tissues can vary over two orders of magnitude^17, while cell morphology can vary widely, from small mature lymphocytes with little discernible cytoplasm, to elongated spindle-shaped fibroblasts, to large multinucleated osteoclasts and megakaryocytes^18. Achieving accurate cell segmentation has been a long-standing goal of the biological image analysis community, and a diverse array of software tools has been developed to meet this challenge^19–24. While these efforts have been crucial for advancing our understanding of biology across a wide range of domains, they fall short for tissue imaging data. A common shortcoming has been the need to perform manual, image-specific adjustments to produce useful segmentations. This lack of full automation poses a prohibitive barrier given the increasing scale of tissue imaging experiments.
Recent advances in deep learning have transformed the field of computer vision, and are increasingly being used for a variety of tasks in biological image analysis, including cell segmentation^25–31. These methods differ from conventional algorithms in that they learn how to perform tasks from annotated data. While the accuracy of these new, data-driven algorithms can render difficult analyses routine, using them in practice can be challenging: high accuracy requires a substantial amount of annotated data. Generating ground-truth data for cell segmentation is time intensive due to the need to generate pixel-level labels; as a result, existing datasets are of modest size (10^4–10^5 annotations). Moreover, most public datasets^26,27,32–38 annotate the location of cell nuclei rather than the whole cell. Deploying pre-trained models to the life science community is also difficult, and has been the focus of a number of recent works^39–42. Despite deep learning's potential, these challenges have caused whole-cell segmentation in tissue imaging data to remain an open problem.
Here, we sought to close these gaps by creating an automated, simple, and scalable algorithm for nuclear
and whole-cell segmentation that performs accurately across a diverse range of tissue types and imaging
platforms. Developing this algorithm required two innovations: (1) a scalable approach for generating large
volumes of pixel-level training data in tissue images and (2) an integrated deep learning pipeline that utilizes
these data to achieve human-level performance. To address the first challenge, we developed a
crowdsourced, human-in-the-loop approach for segmenting cells in tissue images where humans and
algorithms work in tandem to produce accurate annotations (Figure 1a). We used this pipeline to create
TissueNet, a comprehensive segmentation dataset of >1 million paired whole-cell and nuclear annotations.
These curated annotations were derived from images of nine different organs acquired from six distinct
imaging platforms. TissueNet is the largest cell-segmentation dataset assembled to date, containing twice
as many nuclear and 16 times as many whole-cell labels as all previously published datasets combined. To
address the second challenge, we developed Mesmer, a deep learning pipeline for scalable, user-friendly
segmentation of imaging data. Mesmer was trained on TissueNet and is the first algorithm to demonstrate
human-level performance on cell segmentation. To enable broad use by the scientific community, we
harnessed DeepCell, an open-source collection of software libraries, to create cloud-native software for
using Mesmer, including plugins for ImageJ and QuPath. We have made all code, data, and trained models
available under a permissive license as a community resource, setting the stage for application of these
modern, data-driven methods to a broad range of fundamental and translational research challenges.
A human-in-the-loop approach drives scalable construction of TissueNet
Existing annotated datasets for cell segmentation are limited in scope and scale (Figure 1b)^26,27,32–38. This limitation is largely due to the linear, time-intensive approach used to construct them, which requires the border of every cell in an image to be manually demarcated. This approach scales poorly, as the time required to label each image remains constant throughout the annotation effort. We therefore implemented a three-phase approach to create TissueNet. In the first phase, expert human annotators outlined the border of each cell in 80 images. The labeled images were then used to train a preliminary model (Figure 1a, left; Methods). Once the preliminary model reached a sufficient level of accuracy, correcting its mistakes required less time than labeling from scratch. Although the exact point at which this transition occurs depends on model quality and training data diversity, we found that roughly 10,000 annotated cells was a reasonable estimate for this threshold.
The process then moved to the second phase (Figure 1a, middle), where images were first passed through
the model to generate predicted annotations. These predictions were sent to crowdsourced annotators to
correct errors. The corrected annotations then underwent final inspection by an expert prior to being added
to the training dataset. When enough new data were compiled, a new model was trained and phase two was
repeated. Each iteration yielded more training data, which led to improved model accuracy and fewer errors
that needed to be manually corrected. This virtuous cycle continued until the model achieved human-level
performance. At this point, we transitioned to the third phase (Figure 1a, right), where the model was run
without human assistance to produce high-quality predictions. One advantage of this approach is that we
utilized annotators with different amounts of bandwidth and expertise: experts have experience but limited
bandwidth, while crowdsourced annotators have limited experience but higher bandwidth. Triaging each
task according to its difficulty and accessing a much larger pool of human annotators further reduced the
time and cost of dataset construction.
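The structure of this cycle is easy to state concretely. Below is a minimal Python sketch of the three-phase control flow; it captures only the logic described above, and every injected callable (train, predict, correct, is_human_level) is a hypothetical placeholder rather than part of the actual pipeline:

```python
from typing import Callable, List

def human_in_the_loop(
    train: Callable[[List], object],            # retrains a model on all labeled data
    predict: Callable[[object, List], List],    # model annotation of a raw batch
    correct: Callable[[List], List],            # crowdsourced + expert correction
    is_human_level: Callable[[object], bool],   # accuracy check on held-out data
    seed_labels: List,                          # phase 1: expert-annotated images
    unlabeled_batches: List[List],
) -> object:
    """Sketch of the three-phase annotation cycle described in the text."""
    labeled = list(seed_labels)
    model = train(labeled)                      # phase 1: preliminary model
    for batch in unlabeled_batches:             # phase 2: predict, correct, retrain
        if is_human_level(model):
            break                               # phase 3: run fully automated
        labeled.extend(correct(predict(model, batch)))
        model = train(labeled)
    return model
```

Because each pass through the loop both grows the training set and shrinks the per-image correction burden, annotation throughput compounds over time rather than staying constant as in fully manual labeling.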
Human-in-the-loop pipelines require specialized software that is optimized for the task and can be scalably
deployed. We therefore developed DeepCell Label^43, a browser-based graphical user interface optimized
for editing existing cell annotations in tissue images (Figure S1a, Methods). DeepCell Label is supported
by a scalable cloud backend that dynamically adjusts the number of servers according to demand (Figure
S1b). Using DeepCell Label, we trained annotators from multiple crowdsourcing platforms to identify
whole-cell and nuclear boundaries. To further simplify our annotation workflow, we integrated DeepCell
Label into a pipeline that allowed us to prepare and submit images for annotation, have users annotate those
images, and download the results. The images and resulting labels were used to train and update our model,
completing the loop (Figure S1c; Methods).
Our goal in creating TissueNet was to use it to power general-purpose tissue segmentation models. To ensure that models trained on TissueNet would serve as much of the imaging community as possible, we made two key choices. First, every image in TissueNet contains two channels: a nuclear channel (such as DAPI) and a membrane or cytoplasm channel (such as E-cadherin or Pan-Keratin). Although some highly multiplexed platforms are capable of imaging dozens of markers at once^1,2,4,6, restricting TissueNet to the minimum number of channels necessary for whole-cell segmentation maximizes the number of imaging platforms where the resulting models can be used. Second, the data in TissueNet are derived from a wide variety of tissue types, disease states, and imaging platforms. This diversity allows models trained on TissueNet to handle data from many different experimental setups and biological questions. The images included in TissueNet were acquired from the published and unpublished works of labs that routinely perform tissue imaging^44–51. Thus, while this first release of TissueNet encompasses the tissue types most commonly analyzed by the community, we expect that subsequent versions will be expanded to include less-studied organs.
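As a concrete illustration of this two-channel convention, the sketch below assembles a Mesmer-ready input from any platform's output. It follows the deepcell.applications interface of the released DeepCell library as we understand it; the file names are placeholders for your own data, and argument names such as image_mpp and compartment should be verified against the current documentation:

```python
import numpy as np
from skimage.io import imread
from deepcell.applications import Mesmer  # released DeepCell library

# Two single-channel images from any platform, e.g. DAPI plus E-cadherin,
# each of shape (height, width). File names here are placeholders.
nuclear = imread("dapi.tif")
membrane = imread("ecadherin.tif")

# Mesmer expects a (batch, height, width, 2) array:
# channel 0 = nuclear, channel 1 = membrane/cytoplasm.
image = np.stack([nuclear, membrane], axis=-1)[np.newaxis, ...]

app = Mesmer()
# image_mpp: image resolution in microns per pixel;
# compartment: "whole-cell", "nuclear", or "both".
labels = app.predict(image, image_mpp=0.5, compartment="whole-cell")
```

Reducing every platform to this lowest common denominator is what lets a single trained model serve CODEX, MIBI-TOF, Vectra, and the other platforms in TissueNet without per-platform retraining.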
Figure 1: A human-in-the-loop approach enables scalable, pixel-level annotation of large image collections. a, This approach has three phases. During phase 1, annotations are created from scratch to train a model. During phase 2, new data are fed through a preliminary model to generate predictions. These predictions are used as a starting point for annotators to correct. As more images are corrected, the model improves, which decreases the number of errors, increasing the speed with which new data can be annotated. During phase 3, an accurate model is run without human correction. b, TissueNet has more nuclear and whole-cell annotations than all previously published datasets. c, The number of cell annotations per platform in TissueNet. d, The number of cell annotations per tissue type in TissueNet. e, The number of hours of annotation time required to create TissueNet.

[Figure 1 panel content: a, phase 1–3 workflow (expert annotation → model annotation → crowdsourced correction → expert correction → retrain and update → fully automated final model); b, nuclear and whole-cell annotation counts for TissueNet versus all previously published datasets; c, annotations per platform (CODEX, CyCIF, Vectra, MIBI-TOF, MxIF, IMC); d, annotations per tissue type (pancreas, tonsil, breast, lung, colon, esophagus, lymph node, skin, spleen); e, annotation hours, crowd versus expert.]
As a result of the scalability of our human-in-the-loop approach to data labeling, TissueNet is larger than the sum total of all previously published datasets^26,27,32–38 (Figure 1b), with 1.3 million whole-cell annotations and 1.2 million nuclear annotations. TissueNet contains data from six imaging platforms (Figure 1c) and nine organs (Figure 1d), and includes both histologically normal and diseased tissue (e.g., tumor resections). TissueNet also encompasses three species, with images from human, mouse, and macaque. Constructing TissueNet required >4,000 person-hours, the equivalent of nearly 2 person-years of full-time effort (Figure 1e). At an average rate of $6 per hour, we anticipate that subsequent datasets of this size will cost around USD $25,000 to produce, a significant reduction compared with using highly trained ($30/h) or expert pathologist (>$150/h) annotators.
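For reference, the arithmetic behind that estimate, assuming all hours are billed at the crowdsourced rate:

>4,000 h × $6/h > $24,000 ≈ USD $25,000 per dataset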
Mesmer is a novel algorithm for accurate whole-cell segmentation of tissue data
An ideal deep learning model for cell segmentation has two specific requirements. First, a suitable model
must be accurate, which is challenging given the range of cell morphologies, tissue types, and imaging platforms present in TissueNet. A model capable of accurately performing whole-cell segmentation in this
setting needs sufficient representational capacity to understand and interpret these heterogeneous images.
Second, a suitable model needs to be fast. Image datasets are increasing rapidly in size, and a model with
high performance but poor inference speed would be of limited utility.
To satisfy these requirements, we developed the PanopticNet deep learning architecture. To ensure adequate model capacity, PanopticNets use a ResNet50 backbone coupled to a modified Feature Pyramid Network (FPN)^52–54 (Figure S2a; Methods). ResNet backbones are a popular architecture for extracting features from imaging data for a variety of tasks^54. FPNs aggregate features across length scales, producing representations that contain both low-level details and high-level semantics^52. To perform segmentation, two semantic heads are attached to the highest level of the FPN to create pixel-level predictions. These heads perform two separate prediction tasks. The first head predicts whether a pixel is inside a cell, at the cell boundary, or part of the image background^25,26. The second head predicts the distance of each pixel within a cell to the cell centroid (Figure S2a; Methods); we extended previous work^30,55 by explicitly accounting for cell size in this step.
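To make the head design concrete, here is a deliberately simplified Keras sketch. It is not the released PanopticNet: it uses a stock ResNet50 and a single bilinear upsampling path in place of the full FPN, and builds only two heads (the released model attaches four, two per compartment):

```python
import tensorflow as tf
from tensorflow.keras import layers

def panopticnet_sketch(input_shape=(256, 256, 2)):
    # Backbone: stock ResNet50 as the feature extractor. The paper couples
    # it to a full Feature Pyramid Network; a single upsampling path is
    # used here for brevity.
    inputs = layers.Input(shape=input_shape)
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=None, input_tensor=inputs)
    features = backbone.output                   # (8, 8, 2048) for 256x256 input
    x = layers.Conv2D(256, 1, activation="relu")(features)
    x = layers.UpSampling2D(32, interpolation="bilinear")(x)  # back to input size

    # Head 1: per-pixel 3-class prediction (interior / boundary / background).
    pixelwise = layers.Conv2D(3, 1, activation="softmax", name="pixelwise")(x)
    # Head 2: per-pixel regression of the distance to the owning cell's centroid.
    inner_distance = layers.Conv2D(1, 1, activation="relu", name="inner_distance")(x)

    return tf.keras.Model(inputs, [pixelwise, inner_distance])
```

The two outputs play complementary roles downstream: the distance head supplies one seed per cell, while the pixelwise head defines cell extent and boundaries.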
We used the PanopticNet architecture and TissueNet to create Mesmer, a deep learning pipeline for accurate nuclear and whole-cell segmentation of tissue data. Mesmer's PanopticNet model contains four semantic heads (two for nuclear segmentation and two for whole-cell segmentation) that are attached to a common backbone and FPN. The input to Mesmer is a nuclear image (e.g., DAPI) to define the nucleus of each cell and a membrane or cytoplasm image (e.g., CD45 or E-cadherin) to define the shape of each cell (Figure 2a). These inputs are normalized^56 (to improve robustness), tiled into patches of fixed size (to allow processing of images with arbitrary dimensions), and then fed to the PanopticNet model. The model outputs are then untiled^57 to produce predictions for the centroid and boundary of every nucleus and cell in the image. The centroid and boundary predictions are used as inputs to a watershed algorithm^58 to create the final instance segmentation mask for each nucleus and each cell in the image (Methods).
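The final watershed step can be illustrated with scikit-image. This is a generic marker-based watershed following the logic just described, not the tuned post-processing from the Methods; the thresholds and min_distance below are illustrative placeholders:

```python
import numpy as np
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def predictions_to_mask(inner_distance, pixelwise_interior,
                        marker_threshold=0.1, interior_threshold=0.3):
    """Turn Mesmer-style pixel predictions into an instance segmentation mask.

    inner_distance:     (H, W) centroid-distance prediction; peaks mark cell centers.
    pixelwise_interior: (H, W) probability that a pixel lies inside a cell.
    """
    # Markers: one integer-labeled seed per local maximum of the distance map.
    peaks = peak_local_max(inner_distance, min_distance=5,
                           threshold_abs=marker_threshold)
    markers = np.zeros(inner_distance.shape, dtype=np.int32)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)

    # Flood from the markers over the inverted interior probability,
    # restricted to pixels confidently inside some cell.
    foreground = pixelwise_interior > interior_threshold
    return watershed(-pixelwise_interior, markers, mask=foreground)
```

Seeding from the distance-map peaks is what separates touching cells: each cell contributes its own basin, so adjacent cells with little boundary contrast are still split at the watershed line between their seeds.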
We used the newly created TissueNet dataset to train Mesmer's model. We randomly partitioned TissueNet into training (80%), validation (10%), and testing (10%) splits. The training split was used to directly update the model weights during training, with the validation split used to assess increases in model accuracy after each epoch. The test split was completely held out during training and used only to evaluate model performance after training. We used standard image augmentation during training to increase model robustness. To benchmark model accuracy, we built on our prior framework for classifying segmentation errors^37. In brief, we perform a linear assignment between predicted cells and ground-truth cells. Cells that map 1-to-1 with a ground-truth cell are marked as accurately segmented; all other cells are assigned to one of several error modes depending on their relationship with the ground-truth data. We use these assignments to calculate precision, recall, F1 score, and Jaccard index; see the Methods section for detailed descriptions.
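The core of this benchmarking procedure, matching predicted to ground-truth cells by intersection-over-union (IOU) and scoring the 1-to-1 matches, can be sketched as follows. This simplified version uses a single illustrative IOU threshold and scipy's Hungarian solver for the linear assignment; it omits the error-mode classification described in the text:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_f1(true_mask, pred_mask, iou_threshold=0.4):
    """F1 score over 1-to-1 matches between ground-truth and predicted cells.

    true_mask, pred_mask: 2D integer label images where 0 is background and
    cells are labeled 1..N (consecutive labels assumed for simplicity).
    """
    n_true, n_pred = int(true_mask.max()), int(pred_mask.max())
    iou = np.zeros((n_true, n_pred))
    for t in range(1, n_true + 1):
        t_pixels = true_mask == t
        # Only predicted cells that overlap this ground-truth cell can match it.
        for p in np.unique(pred_mask[t_pixels]):
            if p == 0:
                continue
            p_pixels = pred_mask == p
            iou[t - 1, p - 1] = (t_pixels & p_pixels).sum() / (t_pixels | p_pixels).sum()

    # Linear assignment that maximizes total IOU, then keep confident matches.
    rows, cols = linear_sum_assignment(-iou)
    true_positives = int((iou[rows, cols] > iou_threshold).sum())
    precision = true_positives / max(n_pred, 1)
    recall = true_positives / max(n_true, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```

Because the assignment is globally optimal rather than greedy, a predicted cell that weakly overlaps two ground-truth cells cannot be double-counted, which keeps the error-mode bookkeeping (splits, merges, spurious and missed cells) well defined.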