Data denoising for visual & qualitative analysis#

In this tutorial, we will demonstrate how CellMapper can be used to “denoise” scRNA-seq data by sharing information among neighboring cells. Imputation for scRNA-seq data is firecly debated, and we do not advocate building quantiative models or evaluations on top of imputed data. However, imputation can be useful for visual analysis, to inspect gene expression values and how they change along (pseudo)-temporal covariates.

The approach we take here is inspired by MAGIC [VDSN+18], however, it’s much faster thanks to some linear algebra tricks. In this tutorial, we’ll use CellRank[LBK+22, WLK+24] to demonstrate expression trend plotting with imputed data.

Preliminaries#

Import packages & data#

import matplotlib.pyplot as plt
import warnings
import scanpy as sc
import cellmapper
sc.set_figure_params(scanpy=True, frameon=False, fontsize=14)

We’ll use a dataset of CD34+ human bone marrow cells. To learn more about this dataset, take a look at the original publication [SKL+19].

adata = sc.read("data/bone_marrow.h5ad", backup_url="https://figshare.com/ndownloader/files/56264870")
adata
AnnData object with n_obs × n_vars = 5780 × 27876
    obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'
    uns: 'clusters_colors', 'palantir_branch_probs_cell_types'
    obsm: 'X_tsne', 'palantir_branch_probs'

This is raw data with some labels and tsne embedding coordinates.

Basic preprocessing#

Let’s go though some standard scanpy preprocessing.

sc.pp.filter_genes(adata, min_cells=10)

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)

sc.pp.neighbors(adata)

adata
AnnData object with n_obs × n_vars = 5780 × 12978
    obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'
    var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'clusters_colors', 'palantir_branch_probs_cell_types', 'log1p', 'hvg', 'pca', 'neighbors'
    obsm: 'X_tsne', 'palantir_branch_probs', 'X_pca'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Basic visualization#

Let’s take a look at this in a t-SNE embedding, visualizing cell type labels and a pre-computed Palantir[SKL+19] pseudotime.

sc.pl.embedding(adata, color=["clusters", "palantir_pseudotime"], basis="tsne", wspace=0.3, cmap="gnuplot2")

Denoising with CellMapper#

Denoising in CellMapper comes down to computing a k-NN graph in self-mapping mode and matrix-multiplying the gene expression matrix with the normalized k-NN adjacencies. The degree of smoothing can be controlled with the t parameter, which controls how often we multiply the expression matric with the normalized graph adjacencies. For large t, we implement a spectral appraoch which is much faster than the native iterative approach. Let’s start by initializing CellMapper in self-mapping mode by only providing a single AnnData object.

smap = cellmapper.CellMapper(adata)
smap
INFO     Initialized CellMapper for self-mapping with 5780 cells.
CellMapper(self-mapping, data=AnnData(n_obs=5,780, n_vars=12,978), 

Compute a k-NN graph and mapping matrix in PCA space.

smap.compute_neighbors(use_rep="X_pca")
smap.compute_mapping_matrix()
INFO     Self-mapping mode detected. Computing only yx neighbors for efficiency (all neighbor matrices will contain
         the same information).                                                                                    
INFO     Using sklearn to compute 30 neighbors.                                                                    
INFO     Computing mapping matrix using kernel method 'umap'.

Great, now let’s do the actual smoothing operation by multiplying the expression data with the mapping matrix, which corresponds to normalized graph adjacencies.

smap.map_layers("X", t=3)
adata.layers["smap_t3"] = smap.query_imputed.X
INFO     Mapping layer for key 'X' with t=3 steps using iterative diffusion_method                                 
INFO     Imputed expression matrix with shape (5780, 12978) converted to AnnData object.                           
         Observation metadata from query and feature metadata from reference were linked (not copied).             
INFO     Expression for layer 'X' mapped and stored in query_imputed.X.

Let’s visualize original and imputed expression values

genes = ["GATA1", "GATA2", "CD34", "IRF8", "FOSL2", "SPI1"]

with plt.rc_context({"figure.figsize": (1.5, 2), "legend.fontsize": 8, "axes.titlesize": 8}):
    for layer in [None, "smap_t3"]:
        print(f"layer={layer}")
        sc.pl.embedding(
            adata, basis="tsne", color=genes + ["clusters"], layer=layer, ncols=9, colorbar_loc=None, wspace=0.1, size=4
        )
layer=None
layer=smap_t3
../../_images/fb9207b861b573b17e7d7974024b7dfba03d100c9aa1a23a692e88f3c56fb354.png ../../_images/923471821dca6d8f08ab95623ead35571b27227b6a0616821cf864b3bc6dd8d1.png

For larger t values (corresponding to more smoothing), we recommend using the spectral appraoch, which is much faster, as we demonstrate below.

%%time

smap.map_layers("X", t=10, diffusion_method="iterative")
INFO     Mapping layer for key 'X' with t=10 steps using iterative diffusion_method                                
INFO     Imputed expression matrix with shape (5780, 12978) converted to AnnData object.                           
         Observation metadata from query and feature metadata from reference were linked (not copied).             
INFO     Expression for layer 'X' mapped and stored in query_imputed.X.                                            
CPU times: user 32.5 s, sys: 693 ms, total: 33.2 s
Wall time: 33.3 s
%%time

smap.map_layers("X", t=10, diffusion_method="spectral")
INFO     Mapping layer for key 'X' with t=10 steps using spectral diffusion_method                                 
INFO     Computing symmetric eigendecomposition with 50 components for matrix powers                               
INFO     Imputed expression matrix with shape (5780, 12978) converted to AnnData object.                           
         Observation metadata from query and feature metadata from reference were linked (not copied).             
INFO     Expression for layer 'X' mapped and stored in query_imputed.X.                                            
CPU times: user 171 ms, sys: 10.6 ms, total: 181 ms
Wall time: 163 ms

We obtain a good two orders of magnitude speedup with the spectral approach. This becomes crucial for large t or large datasets.