Cell Type Prediction¶

The goal of cell type prediction is to assign a label to each cell. While methods such as clustering are frequently used in single cell transcriptomics data, these methods do not necessarily translate well to highly multiplexed fluorescence images. Instead, this notebook follows a two-step approach. In the first step, major cell types (such as B cells, T cells, macrophages, etc.) are predicted by using a set of mutually exclusive markers. These initial labels are predicted with methods such as argmax or astir and build the foundation for step two, in which we will look at functional markers. Here, we use a binarization-like approach. This means that for our functional markers, each cell gets either a value of 1 (if the cell expresses that marker) or 0 (if the cell does not express that marker). While this is a simplification of the underlying biology, this abstraction can be useful to counteract the often poor signal-to-noise ratio inherent in the data.

[1]:

%reload_ext autoreload
%autoreload 2

[2]:

import spatialproteomics as sp
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
from scipy.signal import medfilt2d

[3]:

ds = xr.open_zarr("../../data/BNHL_166_4_I2_LK_2.zarr")

Image Processing¶

Before predicting cell types, it is typically helpful to perform some preprocessing on the images. Here, we simply perform some custom thresholding on the markers to clean up the images and boost the signal-to-noise ratio.

[4]:

# preprocessing: thresholding by percentiles and applying a median filter
channels = [
    "PAX5",
    "CD3",
    "CD11b",
    "CD11c",
    "CD15",
    "CD68",
    "Podoplanin",
    "CD31",
    "CD34",
    "CD90",
    "CD56",
    "CD4",
    "CD8",
    "Ki-67",
    "CD45RA",
    "CD45RO",
]
quantiles = [0.8, 0.5, 0.8, 0.8, 0.8, 0.8, 0.95, 0.95, 0.95, 0.95, 0.8, 0.95, 0.95, 0.95, 0.95, 0.95]
colors = [
    "#e6194B",
    "#3cb44b",
    "#ffe119",
    "#4363d8",
    "#ffd8b1",
    "#f58231",
    "#911eb4",
    "#fffac8",
    "#469990",
    "#fabed4",
    "#9A6324",
]

channels_celltypes = channels[:-5]

# processing and recomputing quantification
ds_processed = (
    ds.pp[channels].pp.threshold(quantiles).pp.add_quantification().pp.transform_expression_matrix(method="arcsinh")
)

[5]:

# plotting the ds and ds processed next to one another

# visualizing the raw vs the processed image
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

_ = ds.pp[channels_celltypes].pl.colorize(colors).pl.show(ax=ax[0])
_ = ds_processed.pp[channels_celltypes].pl.colorize(colors).pl.show(ax=ax[1])

ax[0].set_title("Raw")
ax[1].set_title("Processed")

[5]:

Text(0.5, 1.0, 'Processed')

../_images/notebooks_CellTypePrediction_6_1.png

Cell Type Prediction with Argmax¶

Given the thresholded image, the simplest way to derive cell type labels is to define associations between cell types and markers, and then assign the cell type whose marker is most highly expressed (i. e. taking the argmax). This technique, also termed argmax, is demonstrated below.

[6]:

marker_ct_dict = {
    "PAX5": "B",
    "CD3": "T",
    "CD11b": "Myeloid",
    "CD11c": "Dendritic",
    "CD15": "Granulo",
    "CD68": "Macro",
    "Podoplanin": "Stromal PDPN",
    "CD31": "Stromal CD31",
    "CD34": "Stromal CD34",
    "CD90": "Stromal CD90",
    "CD56": "NK",
}

[7]:

# predicting the cell type using the argmay
ds_with_ct_predictions = ds_processed.la.predict_cell_types_argmax(
    marker_ct_dict, key="_intensity", overwrite_existing_labels=True
)

# adding colors to match the markers

ds_with_ct_predictions = ds_with_ct_predictions.la.set_label_colors(list(marker_ct_dict.values()), colors)

Removing labels from observations. If you want to keep the labels in the obs layer, set drop_obs=False.
Label Stromal PDPN not found in the data object. Skipping.

[8]:

# plotting the ct predictions next to the processed image
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

_ = ds_with_ct_predictions.pp[channels_celltypes].pl.colorize(colors).pl.show(ax=ax[0])
_ = ds_with_ct_predictions.pl.show(render_image=False, render_labels=True, ax=ax[1])

ax[0].set_title("Markers")
ax[1].set_title("Cell Type Predictions")

[8]:

Text(0.5, 1.0, 'Cell Type Predictions')

../_images/notebooks_CellTypePrediction_10_1.png

[9]:

# this is how the image in the README was created
fig, ax = plt.subplots(1, 3, figsize=(21, 7))
_ = ds_processed.pp[channels_celltypes].pl.colorize(colors).pl.autocrop(padding=100).pl.show(ax=ax[0])
_ = ds.pp["DAPI"].pl.autocrop(padding=100).pl.show(render_segmentation=True, ax=ax[1])
_ = ds_with_ct_predictions.pl.autocrop(padding=100).pl.show(render_image=False, render_labels=True, ax=ax[2])


# removing the x and y ticks
ax[0].set_xticks([])
ax[0].set_yticks([])
ax[1].set_xticks([])
ax[1].set_yticks([])
ax[2].set_xticks([])
ax[2].set_yticks([])

# setting titles
ax[0].set_title("Markers")
ax[1].set_title("Segmentation")
ax[2].set_title("Cell Types")

plt.tight_layout()

../_images/notebooks_CellTypePrediction_11_0.png

Getting observations from the xarray object is very easy. We can simply use the method pp.get_layer_as_df(), which can also parse our cell types into human-readable format automatically.

[10]:

# getting the obs with get_layer_as_df() to get human readable cell type annotations in a pandas data frame
ds_with_ct_predictions.pp.get_layer_as_df("_obs", celltypes_to_str=True)

[10]:

	_labels	centroid-0	centroid-1
1	T	613.329787	768.420213
2	Macro	769.098446	707.544041
3	T	774.528409	644.284091
4	T	775.902878	592.744604
5	Myeloid	668.844749	729.310502
...	...	...	...
12556	B	2340.362319	1846.492754
12557	Macro	2216.324138	2296.089655
12558	T	2265.910853	2231.992248
12559	B	2281.480687	2218.424893
12560	B	2249.069444	2236.564815

12560 rows × 3 columns

Cell Type Prediction with Astir¶

There also exist more involved methods of predicting cell types, such as astir. Here is a quick example on how one could apply astir to the previous image.

[11]:

ds_processed

[11]:

<xarray.Dataset>
Dimensions:         (labels: 8, la_props: 2, cells: 12560, features: 3,
                     channels: 16, x: 3000, y: 3000)
Coordinates:
  * cells           (cells) int64 1 2 3 4 5 6 ... 12556 12557 12558 12559 12560
  * channels        (channels) <U11 'PAX5' 'CD3' 'CD11b' ... 'CD45RA' 'CD45RO'
  * features        (features) <U10 '_labels' 'centroid-0' 'centroid-1'
  * la_props        (la_props) <U6 '_color' '_name'
  * labels          (labels) int64 1 2 3 4 5 6 7 8
  * x               (x) int64 0 1 2 3 4 5 6 ... 2994 2995 2996 2997 2998 2999
  * y               (y) int64 0 1 2 3 4 5 6 ... 2994 2995 2996 2997 2998 2999
Data variables:
    _la_properties  (labels, la_props) <U20 dask.array<chunksize=(8, 2), meta=np.ndarray>
    _obs            (cells, features) float64 dask.array<chunksize=(6280, 3), meta=np.ndarray>
    _segmentation   (y, x) int64 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
    _image          (channels, y, x) uint8 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    _intensity      (cells, channels) float64 0.004255 2.46 ... 1.384 0.06662

[12]:

# minor reformatting
seg = ds_processed["_segmentation"].values
ds_for_astir = ds_processed.pp.drop_layers("_obs").pp.drop_layers("_la_properties").pp.add_segmentation(seg)

[13]:

ds_for_astir

[14]:

# using astir to predict cell types
# note the slightly different structure of the marker dictionary
ct_marker_dict = {
    "cell_type": {
        "B": ["PAX5"],
        "T": ["CD3"],
        "Myeloid": ["CD11b"],
        "Dendritic": ["CD11c"],
        "Granulo": ["CD15"],
        "Macro": ["CD68"],
        "Stroma PDPN": ["Podoplanin"],
        "Stroma CD31": ["CD31"],
        "Stroma CD34": ["CD34"],
        "Stroma CD90": ["CD90"],
        "NK": ["CD56"],
    }
}
ds_with_ct_predictions_astir = ds_for_astir.tl.astir(ct_marker_dict)

training restart 1/5: 100%|██████████| 5/5 [ 3.78s/epochs, current loss: -17157.1]
training restart 2/5: 100%|██████████| 5/5 [ 3.28s/epochs, current loss: -18852.9]
training restart 3/5: 100%|██████████| 5/5 [ 3.30s/epochs, current loss: -19201.0]
training restart 4/5: 100%|██████████| 5/5 [ 3.28s/epochs, current loss: -17167.5]
training restart 5/5: 100%|██████████| 5/5 [ 3.42s/epochs, current loss: -17890.4]
training restart (final):   4%|▍         | 21/500 [ 3.86s/epochs, current loss: -47704.8]

[15]:

# adding custom colors and adding observations back in
cell_types = [
    "B",
    "T",
    "Myeloid",
    "Dendritic",
    "Granulo",
    "Macro",
    "Stroma PDPN",
    "Stroma CD31",
    "Stroma CD34",
    "Stroma CD90",
    "NK",
    "Other",
]
ds_with_ct_predictions_astir = ds_with_ct_predictions_astir.la.set_label_colors(
    cell_types, colors + ["darkgray"]
).pp.add_observations()

[16]:

ds_with_ct_predictions_astir

[17]:

# plotting the astir and the argmax predictions
fig, ax = plt.subplots(1, 3, figsize=(21, 7))
_ = ds_processed.pp[channels_celltypes].pl.colorize(colors).pl.autocrop(padding=100).pl.show(ax=ax[0])
_ = ds_with_ct_predictions_astir.pl.autocrop(padding=100).pl.show(render_image=False, render_labels=True, ax=ax[1])
_ = ds_with_ct_predictions.pl.autocrop(padding=100).pl.show(render_image=False, render_labels=True, ax=ax[2])

# removing the x and y ticks
ax[0].set_xticks([])
ax[0].set_yticks([])
ax[1].set_xticks([])
ax[1].set_yticks([])
ax[2].set_xticks([])
ax[2].set_yticks([])

# setting titles
ax[0].set_title("Markers")
ax[1].set_title("Astir Predictions")
ax[2].set_title("Argmax Predictions")

plt.tight_layout()

../_images/notebooks_CellTypePrediction_21_0.png

Marker Binarization¶

For some (functional) markers, we want to binarize them (i. e. is a cell positive or negative for that marker). This is performed in several steps: 1. First, the image needs to be thresholded. We already did this at the beginning of the notebook using the pp.threshold() method. 2. Afterwards, we need to once again quantify the expression for each cell. While we have already seen how to do this using the mean expression, we can also use things such as the percentage of positive pixels (after thresholding). 3. Next, we actually perform the binarization using la.threshold_labels(). This way, we could for example consider a cell positive for CD4 only if more than 90% of its pixels have a non-zero intensity value. 4. Finally, we create a hierarchy of cell subtypes, which is based on the labels we computed previously and subsets them further by using the binarized functional markers.

[18]:

# adding a new quantification: the percentage of positive pixels (this requires appropriate thresholding, as we have performed ealier)
ds_with_ct_predictions = ds_with_ct_predictions.pp.add_quantification(
    func=sp.percentage_positive, key_added="_percentage_positive"
)

[19]:

# these thresholds are the percentages of positive cells required
# so a cell with more than 50% of non-zero CD4 pixels would be considered CD4 positive
threshold_dict = {"CD4": 0.5, "CD8": 0.5, "Ki-67": 0.5, "CD45RA": 0.5, "CD45RO": 0.5}

ds_with_binarization = ds_with_ct_predictions.la.threshold_labels(threshold_dict, layer_key="_percentage_positive")

[20]:

# under the hood, we can see that our new object now contains binarization columns in the _obs field
ds_with_binarization.pp.get_layer_as_df().head()

[20]:

	CD45RO_binarized	_labels	centroid-0	centroid-1
1	0.0	T	613.329787	768.420213
2	0.0	Macro	769.098446	707.544041
3	1.0	T	774.528409	644.284091
4	0.0	T	775.902878	592.744604
5	0.0	Myeloid	668.844749	729.310502

Cell Subtype Prediction¶

Finally, it is time to combine the broad cell type predictions with the binarization results in order to predict more fine-grained cell types. Here, we use a dictionary to define a cell type hierarchy, and then use the method la.predict_cell_subtypes() to subset our cell types accordingly. Instead of a dictionary, you can also provide the path to a yaml file. This file could look something like this:

B:
  subtypes:
    - name: B_prol
      markers: ["Ki67+"]
T:
  subtypes:
    - name: T_h
      markers: ["CD4+"]
    - name: T_tox
      markers: ["CD8+"]
      subtypes:
        - name: T_tox_naive
          markers: ["CD45RA+"]
        - name: T_tox_mem
          markers: ["CD45RO+"]

Note that after each marker, we add a + after its name if we want cells of this cell type to be positive for a marker, or a - if we require it to be negative for the marker instead.

[21]:

# calling of cell subtypes with a provided hierarchy
subtype_dict = {
    "B": {"subtypes": [{"name": "B_prol", "markers": ["Ki-67+"]}]},
    "T": {
        "subtypes": [
            {"name": "T_h", "markers": ["CD4+"]},
            {
                "name": "T_tox",
                "markers": ["CD8+"],
                "subtypes": [
                    {"name": "T_tox_naive", "markers": ["CD45RA+"]},
                    {"name": "T_tox_mem", "markers": ["CD45RO+"]},
                ],
            },
        ]
    },
}

ds_with_subtypes = ds_with_binarization.la.predict_cell_subtypes(subtype_dict)

[22]:

# if you wanted to read the hierarchy from a file, you could do it like this
ds_with_subtypes = ds_with_binarization.la.predict_cell_subtypes("../../data/subtype_hierarchy.yaml")

You can see that this added cell type labels on multiple levels to the _la_layers table, so you can always go back to a lower level. In addition, the most granular cell type labels have been added as the new default labels in _obs.

[23]:

ds_with_subtypes.pp.get_layer_as_df("_la_layers").iloc[15:25]

[23]:

	labels_0	labels_1	labels_2
16	T	T_h	T_h
17	Macro	Macro	Macro
18	T	T	T
19	T	T_tox	T_tox
20	T	T	T
21	T	T	T
22	T	T	T
23	T	T	T
24	T	T_tox	T_tox
25	T	T_h	T_h

[24]:

# plotting the cell type predictions
plt.figure(figsize=(10, 10))
_ = ds_with_subtypes.pl.autocrop().pl.show(render_image=False, render_labels=True)

../_images/notebooks_CellTypePrediction_31_0.png

[25]:

# plotting the cell type predictions
plt.figure(figsize=(10, 10))
_ = (
    ds_with_subtypes.la[["T", "T_h", "T_tox", "T_tox_mem", "T_tox_naive"]]
    .la.set_label_colors(
        ["T", "T_h", "T_tox", "T_tox_mem", "T_tox_naive"],
        colors=["gray", "orange", "deepskyblue", "darkblue", "darkviolet"],
    )
    .pl.autocrop()
    .pl.show(render_image=False, render_labels=True)
)

../_images/notebooks_CellTypePrediction_32_0.png

As a last step, let’s say you want to go back to a lower level. You can do this using the la.set_label_level() method as shown below.

[26]:

fig, ax = plt.subplots(1, 2, figsize=(20, 10))
# showing the fine-grained labels, which are the defaults
_ = ds_with_subtypes.pl.autocrop().pl.show(render_image=False, render_labels=True, ax=ax[0])
# going back a level (so that T_tox cells are not split into memory and naive)
_ = (
    ds_with_subtypes.la.set_label_level("labels_1")
    .pl.autocrop()
    .pl.show(render_image=False, render_labels=True, ax=ax[1])
)

../_images/notebooks_CellTypePrediction_34_0.png

Advanced Cell Type Gating¶

The example above provided an overview on how to use marker positivity to gate different cell types. You can also encode more complex gating schemes with marker negativity, double-positivity, etc. Let’s look at some ways we can encode a more complex cell type hierarchy. Note that these are just for illustrative purposes and do not represent the most biologically sensible gating strategy. Furthermore, they are shown in yaml format for better readability.

Double positivity: You can encode a cell type with double positivity. In the case below, we would only call a cell as naive cytotoxic, if both CD8 and CD45RA are positive.

B:
  subtypes:
    - name: B_prol
      markers: ["Ki-67+"]
T:
  subtypes:
    - name: T_h
      markers: ["CD4+"]
    - name: T_tox
      markers: ["CD8+", "CD45RA+"]

Alternative markers: Alternatively, you could also define a cell type if either one of a set of markers is positive. Here, cells would be called T_tox if either CD45RO or CD45RA are positive.

B:
  subtypes:
    - name: B_prol
      markers: ["Ki-67+"]
T:
  subtypes:
    - name: T_tox
      markers: ["CD45RA+"]
    - name: T_tox
      markers: ["CD45RO+"]

Marker negativity: By using a minus sign after a marker, we call a cell if it is negative for that marker. For example, we could define T helper cells as being CD8 negative.

B:
  subtypes:
    - name: B_prol
      markers: ["Ki-67+"]
T:
  subtypes:
    - name: T_h
      markers: ["CD8-"]
    - name: T_tox
      markers: ["CD8+"]

Combined positivity and negativity: We can also use combinations of positive and negative markers to define certain cell subtypes.

B:
  subtypes:
    - name: B_prol
      markers: ["Ki-67+"]
T:
  subtypes:
    - name: T_h
      markers: ["CD4+", "CD8-"]
    - name: T_tox
      markers: ["CD8+"]

Image Processing Neighborhood Analysis