Example Workflow with spatialproteomics

Welcome to spatialproteomics! In this notebook, we will go through an example workflow by looking at the following steps:

1. Reading in a highly multiplexed image and creating a spatialproteomics object
2. Performing basic image processing steps to boost the signal-to-noise ratio
3. Performing cell segmentation using cellpose
4. Quantifying protein expression per cell
5. Predicting cell types with a simple argmax technique
6. Plotting the results

[1]:
%load_ext autoreload
%autoreload 2

import spatialproteomics as sp
from skimage.io import imread
import matplotlib.pyplot as plt
from scipy.signal import medfilt2d
import pandas as pd

1. Getting Started

Before we can get started with spatialproteomics, we first need to read in the image. This can be done with the skimage or tifffile libraries.
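If you prefer tifffile over skimage, the equivalent call looks like this (a minimal alternative sketch; tifffile must be installed separately):

# alternative: reading the multichannel TIFF with tifffile
import tifffile

image = tifffile.imread('../../data/input.tiff')  # numpy array of shape (channels, y, x)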

[2]:
# reading in the image and displaying its shape
image = imread('../../data/input.tiff')
image.shape
[2]:
(5, 500, 500)

We can see that our image has 5 channels and a size of 500 by 500 pixels. Next, we want to create a spatialproteomics object and store the image within it. To do this, we also need to provide the names of the channels.

[3]:
sp_object = sp.load_image_data(image, channel_coords=['DAPI', 'CD4', 'CD8', 'FOXP3', 'BCL6'])

Instead of typing out all channel names manually in the form of a list, you can also read them in from a file and provide the resulting list to spatialproteomics.

[4]:
markers = pd.read_csv('../../data/markerlist.txt', header=None)[0].tolist()
markers
[4]:
['DAPI', 'CD4', 'CD8', 'FOXP3', 'BCL6']
[5]:
# loading the markers into the object
sp_object = sp.load_image_data(image, channel_coords=markers)

sp.load_image_data() returns an xarray object that we can inspect by simply calling it in a Jupyter cell. Note that the image is stored as the data variable _image.

[6]:
sp_object
[6]:
<xarray.Dataset>
Dimensions:   (channels: 5, y: 500, x: 500)
Coordinates:
  * channels  (channels) <U5 'DAPI' 'CD4' 'CD8' 'FOXP3' 'BCL6'
  * y         (y) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
  * x         (x) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
Data variables:
    _image    (channels, y, x) uint16 10816 12359 14504 10965 ... 147 129 149 59
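
Since the returned object is a regular xarray.Dataset, standard xarray indexing also works if you want the raw pixel values, e.g. for a single channel (a small sketch using plain xarray syntax):

# extract the DAPI channel as a numpy array via standard xarray indexing
dapi = sp_object['_image'].sel(channels='DAPI').values
dapi.shape  # (500, 500)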

2. Image Processing

Highly multiplexed fluorescence imaging techniques frequently suffer from poor signal-to-noise ratio. To alleviate this problem, you can threshold out low intensity pixels, thereby boosting the contrast of the image. While there are automated methods to determine the thresholds for such operations, it is difficult to come up with one that works in all cases. Here, we therefore set the thresholds based on manual inspection. For more details, check the notebook on Image Processing.

Let’s start by looking at the data set in its current form.

[7]:
_ = sp_object.pl.show()  # pl: the plotting module
../_images/notebooks_ExampleWorkflow_13_0.png

This is a bit much to look at. First, we want to check only the DAPI channel, to get a feel for the signal-to-noise ratio.

[8]:
_ = sp_object.pp['DAPI'].pl.show()  # we can use pp[channels] to subset the spatialproteomics object
../_images/notebooks_ExampleWorkflow_15_0.png

This looks promising. What about the other channels?

[9]:
_ = sp_object.pp[['CD4', 'CD8', 'FOXP3', 'BCL6']].pl.show() # pp: the preprocessing module
../_images/notebooks_ExampleWorkflow_17_0.png

Looking good, but we can make the signal a bit clearer with some image processing. In the following code block, we first threshold each channel at a given percentile (every value below that percentile is set to 0). We then apply a 2D median filter with a kernel size of 3 for smoothing.

[10]:
# the percentiles by which the channels will be thresholded
percentiles = [0.2, 0.6, 0.6, 0.6, 0.6]

# thresholding and smoothing the data
sp_object = sp_object.pp.threshold(percentiles).pp.apply(medfilt2d, kernel_size=3)
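
For intuition, this is roughly what happens to a single channel, written out in plain numpy/scipy (a conceptual sketch of the percentile thresholding described above, not the library's exact implementation):

import numpy as np
from scipy.signal import medfilt2d

# set every pixel below the channel's 60th percentile to 0, then smooth
channel = image[1].astype(float)       # e.g. the CD4 channel
cutoff = np.quantile(channel, 0.6)     # intensity value at the 60th percentile
thresholded = np.where(channel >= cutoff, channel, 0)
smoothed = medfilt2d(thresholded, kernel_size=3)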

Let’s plot the new object.

[11]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
_ = sp_object.pp['DAPI'].pl.show(ax=ax[0])
_ = sp_object.pp[['CD4', 'CD8', 'FOXP3', 'BCL6']].pl.show(ax=ax[1])
../_images/notebooks_ExampleWorkflow_21_0.png

3. Cell Segmentation

Great, that takes care of the preprocessing. Next, we can perform cell segmentation. Since we only have a universal nuclear marker at hand (and no universal membrane marker), we will segment the nuclei and then simply extend the segmentation masks by two pixels in every direction. We are going to use cellpose for this purpose, which is implemented in the tool (tl) module. Note that this requires you to install cellpose first, for example by running pip install cellpose in your terminal.
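Under the hood, tl.cellpose uses the cellpose library itself. If you want to sanity-check your cellpose installation independently of spatialproteomics, a direct call looks roughly like this (a sketch based on the cellpose 3.x API; argument details may differ between versions):

from cellpose import models

# run the pretrained cyto3 model on the nuclear channel;
# channels=[0, 0] tells cellpose to treat the input as grayscale
model = models.Cellpose(gpu=False, model_type='cyto3')
masks, flows, styles, diams = model.eval(image[0], diameter=None, channels=[0, 0])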

[12]:
sp_object = sp_object.tl.cellpose(channel='DAPI')
Neither TORCH CUDA nor MPS version not installed/working.
>>>> using CPU
>>>> using CPU
>> cyto3 << model set to be used
>>>> loading model /home/meyerben/.cellpose/models/cyto3
/home/meyerben/meyerben/.conda/envs/tmp_env_2/lib/python3.10/site-packages/cellpose/resnet_torch.py:280: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(filename, map_location=torch.device("cpu"))
>>>> model diam_mean =  30.000 (ROIs rescaled to this size during training)
channels set to [0, 0]
~~~ ESTIMATING CELL DIAMETER(S) ~~~
estimated cell diameter(s) in 17.52 sec
>>> diameter(s) =
[ 9.90 ]
~~~ FINDING MASKS ~~~
>>>> TOTAL TIME 40.22 sec

Looking at the object, you will realize that a new layer called _cellpose_segmentation has appeared. Before we can do anything with this information, we need to tell spatialproteomics that this should be treated as the main segmentation from now on. We do this via the pp.add_segmentation() function. This will cause a new layer called _segmentation to appear, which we can use for downstream analysis and visualization.

In general, after using a segmentation method from the tl module, you always need to call pp.add_segmentation() afterwards. The reason for this is that we later want to keep track of things like cell position and cell size. To ensure that these stay synchronized with our segmentation, we need to let spatialproteomics know which segmentation to treat as the main one.
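Since every accessor call returns the updated object, the two steps can also be chained in one line:

# segment and immediately register the result as the main segmentation
sp_object = sp_object.tl.cellpose(channel='DAPI').pp.add_segmentation('_cellpose_segmentation')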

[13]:
sp_object = sp_object.pp.add_segmentation('_cellpose_segmentation')
sp_object
[13]:
<xarray.Dataset>
Dimensions:                 (channels: 5, y: 500, x: 500, cells: 1335,
                             features: 2)
Coordinates:
  * channels                (channels) <U5 'DAPI' 'CD4' 'CD8' 'FOXP3' 'BCL6'
  * y                       (y) int64 0 1 2 3 4 5 6 ... 494 495 496 497 498 499
  * x                       (x) int64 0 1 2 3 4 5 6 ... 494 495 496 497 498 499
  * cells                   (cells) int64 1 2 3 4 5 ... 1331 1332 1333 1334 1335
  * features                (features) <U10 'centroid-0' 'centroid-1'
Data variables:
    _image                  (channels, y, x) float64 0.0 9.602e+03 ... 0.0 0.0
    _cellpose_segmentation  (y, x) uint16 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    _segmentation           (y, x) int64 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
    _obs                    (cells, features) float64 1.538 40.88 ... 262.0

We can plot this segmentation over the DAPI channel to see what exactly cellpose did. We can zoom in a little bit to get a clearer picture.

[16]:
_ = sp_object.pp['DAPI'].pp[100:200, 100:200].pl.show(render_segmentation=True)
../_images/notebooks_ExampleWorkflow_27_0.png

There are two issues with the current masks. One is that very small cells sometimes get segmented, which are likely artifacts. We can hence filter out cells that are too small or too large. In addition, we will grow the masks by two pixels in each direction to capture parts of the cytosol and membrane.

[17]:
# checking the distribution of cell sizes
sp_object = sp_object.pp.add_observations("area")
df = sp_object.pp.get_layer_as_df('_obs')
_ = plt.hist(df['area'], bins=100)
Found _obs in image container. Concatenating.
../_images/notebooks_ExampleWorkflow_29_1.png
[18]:
# filtering out cells with less than 20 or more than 150 pixels
sp_object = sp_object.pp.filter_by_obs('area', func=lambda x: (x > 20) & (x < 150))
# plotting the result
_ = sp_object.pp['DAPI'].pp[100:200, 100:200].pl.show(render_segmentation=True)
../_images/notebooks_ExampleWorkflow_30_0.png
[19]:
# expanding the masks
sp_object = sp_object.pp.grow_cells(iterations=2)
Mask growing requires recalculation of the observations. All features other than the centroids will be removed and should be recalculated with pp.add_observations().
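For reference, skimage offers a similar mask-growing operation via expand_labels; a comparable result outside of spatialproteomics could be sketched like this (an illustration, not necessarily what grow_cells does internally):

from skimage.segmentation import expand_labels

# expand every labeled cell by up to 2 pixels; neighboring labels never overlap
seg = sp_object['_segmentation'].values  # 2D integer label image
grown = expand_labels(seg, distance=2)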
[20]:
# plotting the resulting segmentation masks
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
_ = sp_object.pp['DAPI'].pp[200:300, 200:300].pl.show(render_segmentation=True, ax=ax[0])
_ = sp_object.pp[['CD4', 'CD8', 'FOXP3', 'BCL6']].pp[100:200, 100:200].pl.show(render_segmentation=True, ax=ax[1])
../_images/notebooks_ExampleWorkflow_32_0.png

4. Quantifying Protein Expression per Cell

Now that we have sensible segmentation masks, we can quantify the protein expression in each cell. There are multiple ways to do this, but taking the mean intensity per cell and then applying an arcsinh transform works well in practice.
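To make the computation transparent, here is the same idea written out with skimage and numpy (a sketch; the cofactor of 5 is an arbitrary illustrative choice, and spatialproteomics may handle the details differently):

import numpy as np
from skimage.measure import regionprops_table

# mean intensity per cell for the first channel, followed by an arcsinh transform
seg = sp_object['_segmentation'].values
props = regionprops_table(seg, intensity_image=sp_object['_image'].values[0],
                          properties=('label', 'intensity_mean'))
expression = np.arcsinh(props['intensity_mean'] / 5.0)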

[21]:
sp_object = sp_object.pp.add_quantification(func='intensity_mean').pp.transform_expression_matrix(method='arcsinh')
sp_object
[21]:
<xarray.Dataset>
Dimensions:                 (channels: 5, y: 500, x: 500, cells: 1320,
                             features: 2)
Coordinates:
  * channels                (channels) <U5 'DAPI' 'CD4' 'CD8' 'FOXP3' 'BCL6'
  * y                       (y) int64 0 1 2 3 4 5 6 ... 494 495 496 497 498 499
  * x                       (x) int64 0 1 2 3 4 5 6 ... 494 495 496 497 498 499
  * cells                   (cells) int64 1 2 3 4 5 ... 1316 1317 1318 1319 1320
  * features                (features) <U10 'centroid-0' 'centroid-1'
Data variables:
    _image                  (channels, y, x) float64 0.0 9.602e+03 ... 0.0 0.0
    _cellpose_segmentation  (y, x) uint16 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    _segmentation           (y, x) int64 0 0 0 0 0 10 10 10 ... 0 0 0 0 0 0 0 0
    _obs                    (cells, features) float64 1.85 41.85 ... 496.1 422.9
    _intensity              (cells, channels) float64 8.025 4.431 ... 1.26

As you can see, this introduced a new layer called _intensity. We can now use this to predict cell types.
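If you want to inspect these values directly, you can export the layer to pandas, analogous to the _obs example above (assuming get_layer_as_df works for any cell-level layer, as it did for _obs):

# the cells x channels expression matrix as a pandas DataFrame
expression_df = sp_object.pp.get_layer_as_df('_intensity')
expression_df.head()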

5. Cell Type Prediction

There are several ways to predict cell types. Since we thresholded our data beforehand, we can simply take the argmax over the cell-type-specific channels to get an idea of the cell types we are looking at. Methods related to cell type prediction are all implemented in the label (la) module.
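The argmax logic itself is simple enough to write out by hand: for every cell, pick the cell type whose marker shows the highest transformed expression. A pandas sketch of that idea, independent of the la module (assuming the exported expression matrix has channel names as columns):

# mapping from cell types to their defining markers, and the inverse for lookup
marker_dict = {'T_h': 'CD4', 'T_tox': 'CD8', 'T_reg': 'FOXP3', 'T_fh': 'BCL6'}
marker_to_type = {marker: cell_type for cell_type, marker in marker_dict.items()}

# per-cell argmax over the cell-type markers (DAPI is deliberately excluded)
expression_df = sp_object.pp.get_layer_as_df('_intensity')
predicted = expression_df[list(marker_dict.values())].idxmax(axis=1).map(marker_to_type)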

[22]:
# this dictionary maps from cell types to markers
marker_dict = {'T_h': 'CD4', 'T_tox': 'CD8', 'T_reg': 'FOXP3', 'T_fh': 'BCL6'}
sp_object = sp_object.la.predict_cell_types_argmax(marker_dict)
[23]:
sp_object
[23]:
<xarray.Dataset>
Dimensions:                 (labels: 4, la_props: 2, channels: 5, y: 500,
                             x: 500, cells: 1320, features: 3)
Coordinates:
  * labels                  (labels) int64 1 2 3 4
  * la_props                (la_props) <U6 '_color' '_name'
  * channels                (channels) <U5 'DAPI' 'CD4' 'CD8' 'FOXP3' 'BCL6'
  * y                       (y) int64 0 1 2 3 4 5 6 ... 494 495 496 497 498 499
  * x                       (x) int64 0 1 2 3 4 5 6 ... 494 495 496 497 498 499
  * cells                   (cells) int64 1 2 3 4 5 ... 1316 1317 1318 1319 1320
  * features                (features) <U10 '_labels' 'centroid-0' 'centroid-1'
Data variables:
    _la_properties          (labels, la_props) object '#7B4F4B' ... 'T_tox'
    _image                  (channels, y, x) float64 0.0 9.602e+03 ... 0.0 0.0
    _cellpose_segmentation  (y, x) uint16 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    _segmentation           (y, x) int64 0 0 0 0 0 10 10 10 ... 0 0 0 0 0 0 0 0
    _obs                    (cells, features) float64 4.0 1.85 ... 496.1 422.9
    _intensity              (cells, channels) float64 8.025 4.431 ... 1.26

This added a couple of things. For one, _obs now contains a feature called _labels. Furthermore, the _la_properties layer assigns a color to each cell type, which can be useful for plotting.
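For a quick overview of the predicted composition, you can export _obs again and tabulate the labels (whether the column holds label names or numeric IDs may depend on the version):

# counting the predicted cell types
obs_df = sp_object.pp.get_layer_as_df('_obs')
obs_df['_labels'].value_counts()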

6. Plotting

Finally, let’s do some plotting of the predicted cell types next to the markers. Before plotting, we can set some colors for the cell types (labels).

[24]:
# setting the colors for the cell types
sp_object = sp_object.la.set_label_colors(['T_h', 'T_tox', 'T_reg', 'T_fh'], ['red', 'green', 'yellow', 'blue'])
[25]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
_ = sp_object.pp['DAPI'].pl.show(ax=ax[0])
_ = sp_object.pp[['CD4', 'CD8', 'FOXP3', 'BCL6']].pl.show(ax=ax[1])
_ = sp_object.pl.show(render_image=False, render_segmentation=True, render_labels=True, ax=ax[2])
../_images/notebooks_ExampleWorkflow_42_0.png

And this is how easy it can be to analyze highly multiplexed immunofluorescence images! If you have any additional questions, check out the other notebooks for more details.