Cell Type Prediction¶
The goal of cell type prediction is to assign a label to each cell. While methods such as clustering are frequently used in single cell transcriptomics data, these methods do not necessarily translate well to highly multiplexed fluorescence images. Instead, this notebook follows a two-step approach. In the first step, major cell types (such as B cells, T cells, macrophages, etc.) are predicted by using a set of mutually exclusive markers. These initial labels are predicted with methods such as
argmax
or astir
and build the foundation for step two, in which we will look at functional markers. Here, we use a binarization-like approach. This means that for our functional markers, each cell gets either a value of 1 (if the cell expresses that marker) or 0 (if the cell does not express that marker). While this is a simplification of the underlying biology, this abstraction can be useful to counteract the often poor signal-to-noise ratio inherent in the data.
[1]:
%reload_ext autoreload
%autoreload 2
[2]:
import spatialproteomics as sp
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
from scipy.signal import medfilt2d
[3]:
ds = xr.load_dataset('../../data/BNHL_166_4_I2_LK_2.zarr', engine='zarr')
Image Processing¶
Before predicting cell types, it is typically helpful to perform some preprocessing on the images. Here, we simply perform some custom thresholding on the markers to clean up the images and boost the signal-to-noise ratio.
[4]:
# preprocessing: thresholding by percentiles and applying a median filter
channels = ['PAX5', 'CD3', 'CD11b', 'CD11c', 'CD15', 'CD68', 'Podoplanin', 'CD31', 'CD34', 'CD90', 'CD56', 'CD4', 'CD8', 'Ki-67', 'CD45RA', 'CD45RO']
quantiles = [0.8, 0.5, 0.8, 0.8, 0.8, 0.8, 0.95, 0.95, 0.95, 0.95, 0.8, 0.95, 0.95, 0.95, 0.95, 0.95]
colors = ['#e6194B', '#3cb44b', '#ffe119', '#4363d8', '#ffd8b1', '#f58231', '#911eb4', '#fffac8', '#469990', '#fabed4', '#9A6324']
channels_celltypes = channels[:-5]
# processing and recomputing quantification
ds_processed = ds.pp[channels].pp.threshold(quantiles).pp.add_quantification().pp.transform_expression_matrix(method='arcsinh')
[5]:
# plotting the ds and ds processed next to one another
# visualizing the raw vs the processed image
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
_ = ds.pp[channels_celltypes].pl.colorize(colors).pl.show(ax=ax[0])
_ = ds_processed.pp[channels_celltypes].pl.colorize(colors).pl.show(ax=ax[1])
ax[0].set_title("Raw")
ax[1].set_title("Processed")
[5]:
Text(0.5, 1.0, 'Processed')
Cell Type Prediction with Argmax¶
Given the thresholded image, the simplest way to derive cell type labels is to define associations between cell types and markers, and then assign the cell type whose marker is most highly expressed (i. e. taking the argmax). This technique, also termed argmax
, is demonstrated below.
[6]:
ct_marker_dict = {'B': 'PAX5', 'T': 'CD3', 'Myeloid': 'CD11b', 'Dendritic': 'CD11c', 'Granulo': 'CD15', 'Macro': 'CD68', 'Stroma PDPN': 'Podoplanin', 'Stroma CD31': 'CD31', 'Stroma CD34': 'CD34', 'Stroma CD90': 'CD90', 'NK': 'CD56'}
[7]:
# predicting the cell type using the argmay
ds_with_ct_predictions = ds_processed.la.predict_cell_types_argmax(ct_marker_dict, key='_intensity', overwrite_existing_labels=True)
# adding colors to match the markers
ds_with_ct_predictions = ds_with_ct_predictions.la.set_label_colors(list(ct_marker_dict.keys()), colors)
Removing labels from observations. If you want to keep the labels in the obs layer, set drop_obs=False.
Label Stroma PDPN not found in the data object. Skipping.
[8]:
# plotting the ct predictions next to the processed image
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
_ = ds_with_ct_predictions.pp[channels_celltypes].pl.colorize(colors).pl.show(ax=ax[0])
_ = ds_with_ct_predictions.pl.show(render_image=False, render_labels=True, ax=ax[1])
ax[0].set_title("Markers")
ax[1].set_title("Cell Type Predictions")
[8]:
Text(0.5, 1.0, 'Cell Type Predictions')
[9]:
# this is how the image in the README was created
fig, ax = plt.subplots(1, 3, figsize=(21, 7))
_ = ds_processed.pp[channels_celltypes].pl.colorize(colors).pl.autocrop(padding=100).pl.show(ax=ax[0])
_ = ds.pp['DAPI'].pl.autocrop(padding=100).pl.show(render_segmentation=True, ax=ax[1])
_ = ds_with_ct_predictions.pl.autocrop(padding=100).pl.show(render_image=False, render_labels=True, ax=ax[2])
# removing the x and y ticks
ax[0].set_xticks([])
ax[0].set_yticks([])
ax[1].set_xticks([])
ax[1].set_yticks([])
ax[2].set_xticks([])
ax[2].set_yticks([])
# setting titles
ax[0].set_title('Markers')
ax[1].set_title('Segmentation')
ax[2].set_title('Cell Types')
plt.tight_layout()
Getting observations from the xarray object is very easy. We can simply use the method pp.get_layer_as_df()
, which can also parse our cell types into human-readable format automatically.
[10]:
# getting the obs with get_layer_as_df() to get human readable cell type annotations in a pandas data frame
ds_with_ct_predictions.pp.get_layer_as_df('_obs', celltypes_to_str=True)
[10]:
_labels | centroid-0 | centroid-1 | |
---|---|---|---|
1 | T | 613.329787 | 768.420213 |
2 | Macro | 769.098446 | 707.544041 |
3 | T | 774.528409 | 644.284091 |
4 | T | 775.902878 | 592.744604 |
5 | Myeloid | 668.844749 | 729.310502 |
... | ... | ... | ... |
12556 | B | 2340.362319 | 1846.492754 |
12557 | Macro | 2216.324138 | 2296.089655 |
12558 | T | 2265.910853 | 2231.992248 |
12559 | B | 2281.480687 | 2218.424893 |
12560 | B | 2249.069444 | 2236.564815 |
12560 rows × 3 columns
Cell Type Prediction with Astir¶
There also exist more involved methods of predicting cell types, such as astir. Here is a quick example on how one could apply astir to the previous image.
[11]:
ds_processed
[11]:
<xarray.Dataset> Dimensions: (labels: 8, la_props: 2, cells: 12560, features: 3, y: 3000, x: 3000, channels: 16) Coordinates: * cells (cells) int64 1 2 3 4 5 6 ... 12556 12557 12558 12559 12560 * channels (channels) <U11 'PAX5' 'CD3' 'CD11b' ... 'CD45RA' 'CD45RO' * features (features) <U10 '_labels' 'centroid-0' 'centroid-1' * la_props (la_props) <U6 '_color' '_name' * labels (labels) int64 1 2 3 4 5 6 7 8 * x (x) int64 0 1 2 3 4 5 6 ... 2994 2995 2996 2997 2998 2999 * y (y) int64 0 1 2 3 4 5 6 ... 2994 2995 2996 2997 2998 2999 Data variables: _la_properties (labels, la_props) <U20 'C1' ... 'Vascular (CD31+CD34)' _obs (cells, features) float64 7.0 613.3 ... 2.249e+03 2.237e+03 _segmentation (y, x) int64 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 _image (channels, y, x) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 _intensity (cells, channels) float64 0.004255 2.46 ... 1.384 0.06662
[12]:
# minor reformatting
seg = ds_processed['_segmentation'].values
ds_for_astir = ds_processed.pp.drop_layers('_obs').pp.drop_layers('_la_properties').pp.add_segmentation(seg)
[13]:
ds_for_astir
[13]:
<xarray.Dataset> Dimensions: (channels: 16, x: 3000, y: 3000, cells: 12560, features: 2) Coordinates: * channels (channels) <U11 'PAX5' 'CD3' 'CD11b' ... 'CD45RA' 'CD45RO' * x (x) int64 0 1 2 3 4 5 6 ... 2994 2995 2996 2997 2998 2999 * y (y) int64 0 1 2 3 4 5 6 ... 2994 2995 2996 2997 2998 2999 * cells (cells) int64 1 2 3 4 5 6 ... 12556 12557 12558 12559 12560 * features (features) <U10 'centroid-0' 'centroid-1' Data variables: _image (channels, y, x) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 _intensity (cells, channels) float64 0.004255 2.46 ... 1.384 0.06662 _segmentation (y, x) int64 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 _obs (cells, features) float64 613.3 768.4 ... 2.249e+03 2.237e+03
[14]:
# using astir to predict cell types
# note the slightly different structure of the marker dictionary
ct_marker_dict = {'cell_type': {'B': ['PAX5'], 'T': ['CD3'], 'Myeloid': ['CD11b'], 'Dendritic': ['CD11c'], 'Granulo': ['CD15'], 'Macro': ['CD68'], 'Stroma PDPN': ['Podoplanin'], 'Stroma CD31': ['CD31'], 'Stroma CD34': ['CD34'], 'Stroma CD90': ['CD90'], 'NK': ['CD56']}}
ds_with_ct_predictions_astir = ds_for_astir.tl.astir(ct_marker_dict)
The image is not of type uint8, which is required for astir to work properly. Use pp.convert_to_8bit() to convert the image to uint8. If you applied operations such as filtering, you may ignore this warning.
training restart 1/5: 100%|██████████| 5/5 [ 4.01s/epochs, current loss: -17157.1]
training restart 2/5: 100%|██████████| 5/5 [ 4.10s/epochs, current loss: -18852.9]
training restart 3/5: 100%|██████████| 5/5 [ 4.12s/epochs, current loss: -19201.0]
training restart 4/5: 100%|██████████| 5/5 [ 3.82s/epochs, current loss: -17167.5]
training restart 5/5: 100%|██████████| 5/5 [ 3.84s/epochs, current loss: -17890.4]
training restart (final): 4%|▍ | 21/500 [ 4.10s/epochs, current loss: -47704.8]
[15]:
# adding custom colors and adding observations back in
cell_types = ['B', 'T', 'Myeloid', 'Dendritic', 'Granulo', 'Macro', 'Stroma PDPN', 'Stroma CD31', 'Stroma CD34', 'Stroma CD90', 'NK', 'Other']
ds_with_ct_predictions_astir = ds_with_ct_predictions_astir.la.set_label_colors(cell_types, colors + ['darkgray']).pp.add_observations()
[16]:
ds_with_ct_predictions_astir
[16]:
<xarray.Dataset> Dimensions: (labels: 12, la_props: 2, channels: 16, x: 3000, y: 3000, cells: 12560, features: 3) Coordinates: * labels (labels) int64 1 2 3 4 5 6 7 8 9 10 11 12 * la_props (la_props) <U6 '_color' '_name' * channels (channels) <U11 'PAX5' 'CD3' 'CD11b' ... 'CD45RA' 'CD45RO' * x (x) int64 0 1 2 3 4 5 6 ... 2994 2995 2996 2997 2998 2999 * y (y) int64 0 1 2 3 4 5 6 ... 2994 2995 2996 2997 2998 2999 * cells (cells) int64 1 2 3 4 5 6 ... 12556 12557 12558 12559 12560 * features (features) <U10 '_labels' 'centroid-0' 'centroid-1' Data variables: _image (channels, y, x) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 _intensity (cells, channels) float64 0.004255 2.46 ... 1.384 0.06662 _segmentation (y, x) int64 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 _obs (cells, features) float64 12.0 613.3 ... 2.249e+03 2.237e+03 _la_properties (labels, la_props) object '#e6194B' 'B' ... '#3cb44b' 'T'
[17]:
# plotting the astir and the argmax predictions
fig, ax = plt.subplots(1, 3, figsize=(21, 7))
_ = ds_processed.pp[channels_celltypes].pl.colorize(colors).pl.autocrop(padding=100).pl.show(ax=ax[0])
_ = ds_with_ct_predictions_astir.pl.autocrop(padding=100).pl.show(render_image=False, render_labels=True, ax=ax[1])
_ = ds_with_ct_predictions.pl.autocrop(padding=100).pl.show(render_image=False, render_labels=True, ax=ax[2])
# removing the x and y ticks
ax[0].set_xticks([])
ax[0].set_yticks([])
ax[1].set_xticks([])
ax[1].set_yticks([])
ax[2].set_xticks([])
ax[2].set_yticks([])
# setting titles
ax[0].set_title('Markers')
ax[1].set_title('Astir Predictions')
ax[2].set_title('Argmax Predictions')
plt.tight_layout()
Marker Binarization¶
For some (functional) markers, we want to binarize them (i. e. is a cell positive or negative for that marker). This is performed in several steps: 1. First, the image needs to be thresholded. We already did this at the beginning of the notebook using the pp.threshold()
method. 2. Afterwards, we need to once again quantify the expression for each cell. While we have already seen how to do this using the mean expression, we can also use things such as the percentage of positive pixels (after
thresholding). 3. Next, we actually perform the binarization using la.threshold_labels()
. This way, we could for example consider a cell positive for CD4 only if more than 90% of its pixels have a non-zero intensity value. 4. Finally, we create a hierarchy of cell subtypes, which is based on the labels we computed previously and subsets them further by using the binarized functional markers.
[18]:
# adding a new quantification: the percentage of positive pixels (this requires appropriate thresholding, as we have performed ealier)
ds_with_ct_predictions = ds_with_ct_predictions.pp.add_quantification(func=sp.percentage_positive, key_added='_percentage_positive')
[19]:
# these thresholds are the percentages of positive cells required
# so a cell with more than 50% of non-zero CD4 pixels would be considered CD4 positive
threshold_dict = {
"CD4": 0.5,
"CD8": 0.5,
"Ki-67": 0.5,
"CD45RA": 0.5,
"CD45RO": 0.5
}
ds_with_binarization = ds_with_ct_predictions.la.threshold_labels(threshold_dict, layer_key='_percentage_positive')
[20]:
# under the hood, we can see that our new object now contains binarization columns in the _obs field
ds_with_binarization.pp.get_layer_as_df().head()
[20]:
CD45RA_binarized | CD45RO_binarized | CD4_binarized | CD8_binarized | Ki-67_binarized | _labels | centroid-0 | centroid-1 | |
---|---|---|---|---|---|---|---|---|
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | T | 613.329787 | 768.420213 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Macro | 769.098446 | 707.544041 |
3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | T | 774.528409 | 644.284091 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | T | 775.902878 | 592.744604 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Myeloid | 668.844749 | 729.310502 |
Cell Subtype Prediction¶
Finally, it is time to combine the broad cell type predictions with the binarization results in order to predict more fine-grained cell types. Here, we use a dictionary to define a cell type hierarchy, and then use the method la.predict_cell_subtypes()
to subset our cell types accordingly. Instead of a dictionary, you can also provide the path to a yaml file. This file could look something like this:
B:
subtypes:
- name: B_prol
markers: ["Ki67+"]
T:
subtypes:
- name: T_h
markers: ["CD4+"]
- name: T_tox
markers: ["CD8+"]
subtypes:
- name: T_tox_naive
markers: ["CD45RA+"]
- name: T_tox_mem
markers: ["CD45RO+"]
Note that after each marker, we add a +
after its name if we want cells of this cell type to be positive for a marker, or a -
if we require it to be negative for the marker instead.
[21]:
# calling of cell subtypes with a provided hierarchy
subtype_dict = {
'B': {
'subtypes': [
{
'name': 'B_prol',
'markers': ['Ki-67+']
}
]
},
'T': {
'subtypes': [
{
'name': 'T_h',
'markers': ['CD4+']
},
{
'name': 'T_tox',
'markers': ['CD8+'],
'subtypes': [
{
'name': 'T_tox_naive',
'markers': ['CD45RA+']
},
{
'name': 'T_tox_mem',
'markers': ['CD45RO+']
}
]
}
]
}
}
ds_with_subtypes = ds_with_binarization.la.predict_cell_subtypes(subtype_dict)
[22]:
# if you wanted to read the hierarchy from a file, you could do it like this
ds_with_subtypes = ds_with_binarization.la.predict_cell_subtypes("../../data/subtype_hierarchy.yaml")
You can see that this added cell type labels on multiple levels to the _obs
table, so you can always go back to a lower level. In addition, the most granular cell type labels have been added as the new default labels.
[23]:
ds_with_subtypes.pp.get_layer_as_df().head()
[23]:
CD45RA_binarized | CD45RO_binarized | CD4_binarized | CD8_binarized | Ki-67_binarized | _labels | centroid-0 | centroid-1 | |
---|---|---|---|---|---|---|---|---|
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | T | 613.329787 | 768.420213 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Macro | 769.098446 | 707.544041 |
3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | T | 774.528409 | 644.284091 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | T | 775.902878 | 592.744604 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Myeloid | 668.844749 | 729.310502 |
[24]:
# plotting the cell type predictions
plt.figure(figsize=(10, 10))
_ = ds_with_subtypes.pl.autocrop().pl.show(render_image=False, render_labels=True)
[25]:
# plotting the cell type predictions
plt.figure(figsize=(10, 10))
_ = (ds_with_subtypes.la[['T', 'T_h', 'T_tox', 'T_tox_mem', 'T_tox_naive']]
.la.set_label_colors(['T', 'T_h', 'T_tox', 'T_tox_mem', 'T_tox_naive'],
colors=['gray', 'orange', 'deepskyblue', 'darkblue', 'darkviolet'])
.pl.autocrop()
.pl.show(render_image=False, render_labels=True))
Advanced Cell Type Gating¶
The example above provided an overview on how to use marker positivity to gate different cell types. You can also encode more complex gating schemes with marker negativity, double-positivity, etc. Let’s look at some ways we can encode a more complex cell type hierarchy. Note that these are just for illustrative purposes and do not represent the most biologically sensible gating strategy. Furthermore, they are shown in yaml
format for better readability.
Double positivity: You can encode a cell type with double positivity. In the case below, we would only call a cell as naive cytotoxic, if both CD8 and CD45RA are positive.
B:
subtypes:
- name: B_prol
markers: ["Ki-67+"]
T:
subtypes:
- name: T_h
markers: ["CD4+"]
- name: T_tox
markers: ["CD8+", "CD45RA+"]
Alternative markers: Alternatively, you could also define a cell type if either one of a set of markers is positive. Here, cells would be called T_tox if either CD45RO or CD45RA are positive.
B:
subtypes:
- name: B_prol
markers: ["Ki-67+"]
T:
subtypes:
- name: T_tox
markers: ["CD45RA+"]
- name: T_tox
markers: ["CD45RO+"]
Marker negativity: By using a minus sign after a marker, we call a cell if it is negative for that marker. For example, we could define T helper cells as being CD8 negative.
B:
subtypes:
- name: B_prol
markers: ["Ki-67+"]
T:
subtypes:
- name: T_h
markers: ["CD8-"]
- name: T_tox
markers: ["CD8+"]
Combined positivity and negativity: We can also use combinations of positive and negative markers to define certain cell subtypes.
B:
subtypes:
- name: B_prol
markers: ["Ki-67+"]
T:
subtypes:
- name: T_h
markers: ["CD4+", "CD8-"]
- name: T_tox
markers: ["CD8+"]