Data denoising for visual & qualitative analysis#

In this tutorial, we will demonstrate how CellMapper can be used to “denoise” scRNA-seq data by sharing information among neighboring cells. Imputation for scRNA-seq data is firecly debated, and we do not advocate building quantiative models or evaluations on top of imputed data. However, imputation can be useful for visual analysis, to inspect gene expression values and how they change along (pseudo)-temporal covariates.

The approach we take here is inspired by MAGIC [VDSN+18], however, it’s much faster thanks to some linear algebra tricks. In this tutorial, we’ll use CellRank[LBK+22, WLK+24] to demonstrate expression trend plotting with imputed data.

Preliminaries#

Import packages & data#

import matplotlib.pyplot as plt
import warnings
import scanpy as sc
import cellmapper

sc.set_figure_params(scanpy=True, frameon=False, fontsize=14)

We’ll use a dataset of CD34+ human bone marrow cells. To learn more about this dataset, take a look at the original publication [SKL+19].

adata = sc.read("data/bone_marrow.h5ad", backup_url="https://figshare.com/ndownloader/files/56264870")
adata

AnnData object with n_obs × n_vars = 5780 × 27876
    obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'
    uns: 'clusters_colors', 'palantir_branch_probs_cell_types'
    obsm: 'X_tsne', 'palantir_branch_probs'

This is raw data with some labels and tsne embedding coordinates.

Basic preprocessing#

Let’s go though some standard scanpy preprocessing.

sc.pp.filter_genes(adata, min_cells=10)

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)

sc.pp.neighbors(adata)

adata

AnnData object with n_obs × n_vars = 5780 × 12978
    obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'
    var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'clusters_colors', 'palantir_branch_probs_cell_types', 'log1p', 'hvg', 'pca', 'neighbors'
    obsm: 'X_tsne', 'palantir_branch_probs', 'X_pca'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Basic visualization#

Let’s take a look at this in a t-SNE embedding, visualizing cell type labels and a pre-computed Palantir[SKL+19] pseudotime.

sc.pl.embedding(adata, color=["clusters", "palantir_pseudotime"], basis="tsne", wspace=0.3, cmap="gnuplot2")

../../_images/724a8329c45d2e1eab66e993afa64ee6a69fd3fd618694cedfb646b51a03d322.png

Denoising with CellMapper#

Denoising in CellMapper comes down to computing a k-NN graph in self-mapping mode and matrix-multiplying the gene expression matrix with the normalized k-NN adjacencies. The degree of smoothing can be controlled with the t parameter, which controls how often we multiply the expression matric with the normalized graph adjacencies. For large t, we implement a spectral appraoch which is much faster than the native iterative approach. Let’s start by initializing CellMapper in self-mapping mode by only providing a single AnnData object.

smap = cellmapper.CellMapper(adata)
smap

INFO     Initialized CellMapper for self-mapping with 5780 cells.

CellMapper(self-mapping, data=AnnData(n_obs=5,780, n_vars=12,978),

Compute a k-NN graph and mapping matrix in PCA space.

smap.compute_neighbors(use_rep="X_pca")
smap.compute_mapping_matrix()

INFO     Self-mapping mode detected. Computing only yx neighbors for efficiency (all neighbor matrices will contain
         the same information).                                                                                    
INFO     Using sklearn to compute 30 neighbors.                                                                    
INFO     Computing mapping matrix using kernel method 'umap'.

Great, now let’s do the actual smoothing operation by multiplying the expression data with the mapping matrix, which corresponds to normalized graph adjacencies.

smap.map_layers("X", t=3)
adata.layers["smap_t3"] = smap.query_imputed.X

INFO     Mapping layer for key 'X' with t=3 steps using iterative diffusion_method                                 
INFO     Imputed expression matrix with shape (5780, 12978) converted to AnnData object.                           
         Observation metadata from query and feature metadata from reference were linked (not copied).             
INFO     Expression for layer 'X' mapped and stored in query_imputed.X.

Let’s visualize original and imputed expression values

genes = ["GATA1", "GATA2", "CD34", "IRF8", "FOSL2", "SPI1"]

with plt.rc_context({"figure.figsize": (1.5, 2), "legend.fontsize": 8, "axes.titlesize": 8}):
    for layer in [None, "smap_t3"]:
        print(f"layer={layer}")
        sc.pl.embedding(
            adata, basis="tsne", color=genes + ["clusters"], layer=layer, ncols=9, colorbar_loc=None, wspace=0.1, size=4
        )

layer=None
layer=smap_t3

../../_images/fb9207b861b573b17e7d7974024b7dfba03d100c9aa1a23a692e88f3c56fb354.png

../../_images/923471821dca6d8f08ab95623ead35571b27227b6a0616821cf864b3bc6dd8d1.png

For larger t values (corresponding to more smoothing), we recommend using the spectral appraoch, which is much faster, as we demonstrate below.

%%time

smap.map_layers("X", t=10, diffusion_method="iterative")

INFO     Mapping layer for key 'X' with t=10 steps using iterative diffusion_method                                
INFO     Imputed expression matrix with shape (5780, 12978) converted to AnnData object.                           
         Observation metadata from query and feature metadata from reference were linked (not copied).             
INFO     Expression for layer 'X' mapped and stored in query_imputed.X.                                            
CPU times: user 32.5 s, sys: 693 ms, total: 33.2 s
Wall time: 33.3 s

%%time

smap.map_layers("X", t=10, diffusion_method="spectral")

INFO     Mapping layer for key 'X' with t=10 steps using spectral diffusion_method                                 
INFO     Computing symmetric eigendecomposition with 50 components for matrix powers                               
INFO     Imputed expression matrix with shape (5780, 12978) converted to AnnData object.                           
         Observation metadata from query and feature metadata from reference were linked (not copied).             
INFO     Expression for layer 'X' mapped and stored in query_imputed.X.                                            
CPU times: user 171 ms, sys: 10.6 ms, total: 181 ms
Wall time: 163 ms

We obtain a good two orders of magnitude speedup with the spectral approach. This becomes crucial for large t or large datasets.

Visualizing imputed expression trends with CellRank#

One common application of data denoising is visualizing gene exression trends in pseudotime. Denoising isn’t strictly necessary here - an alternative approach is using an NB noise model in the Generalized Additive Model (GAM), but that’s computationally more expensive. Thus, many tools like CellRank work better with simple Gaussian noise models on imputed data. Again, keep in mind that we’re only using the imputed data for visualization and not to compute other CellRank properties, like terminal states and fate probabilities.

The workflow will be:

Set up a CellRank PseudotimeKernel to infer cellular dynamics based on the PalantirPseudotime
Using CellRank’s GPCCA estimator, compute terminal states and fate probabilities towards them.
Using fate probabilities, Cellmapper-smoothed expression values, and the PalantirPseudotime, visualize trajectory-specific expression trends.

Set up CellRank PseudotimeKernel#

To get started, let’s set up a PseudotimeKernel in CellRank. You can learn more about this in the CellRank docs: https://cellrank.readthedocs.io/en/latest/index.html.

from cellrank.kernels import PseudotimeKernel
from cellrank.estimators import GPCCA
import cellrank as cr

pk = PseudotimeKernel(adata, time_key="palantir_pseudotime")
pk.compute_transition_matrix()

pk

INFO     Computing transition matrix based on pseudotime
INFO         Finish (0.63s)

PseudotimeKernel[n=5780, dnorm=False, scheme='hard', frac_to_keep=0.3]

Let’s take a look at the inferred dynamics with random walks.

pk.plot_random_walks(
    n_sims=100,
    start_ixs={"clusters": "HSC_1"},
    basis="tsne",
    color="clusters",
    legend_loc="right",
    seed=1,
)

INFO     Simulating 100 random walks of maximum length 1445
INFO         Finish (9.46s)                                                                                        
INFO     Plotting random walks

../../_images/9a71fcfb7896a011e2173b31acaae34e1c9a587d9b95c6ef3919166d6f86cb47.png

When we initialize random walks in the HSC_1 stem cell population, the finish mostly in terminal populations, like Monocytes or Erythrocytes. Next, we setup the CellRank estimator.

Compute terminal states and fate probabilities towards them#

g = GPCCA(pk)
g

GPCCA[kernel=PseudotimeKernel[n=5780], initial_states=None, terminal_states=None]

Compute and visualize some macrostates

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)

    g.compute_macrostates(n_states=5, cluster_key="clusters")
    g.plot_macrostates(which="all", legend_loc="right", basis="tsne")

INFO     Computing 5 macrostates                                                                                   
WARNING  Color sequence contains non-unique elements                                                               
INFO     Adding `.macrostates`                                                                                     
                `.macrostates_memberships`                                                                         
                `.coarse_T`                                                                                        
                `.coarse_initial_distribution                                                                      
                `.coarse_stationary_distribution`                                                                  
                `.schur_vectors`                                                                                   
                `.schur_matrix`                                                                                    
                `.eigendecomposition`                                                                              
             Finish (0.23s)

../../_images/2180411ff3b86ea0d07eb02b7eb8c4a8316ba4ccf4cad2227e3deac7e690289b.png

Great, all of these represent terminal states, let’s thus declare them terminal and compute fate probabilities towards them.

g.set_terminal_states()
g.compute_fate_probabilities()

INFO     Adding `adata.obs['term_states_fwd']`                                                                     
                `adata.obs['term_states_fwd_probs']`                                                               
                `.terminal_states`                                                                                 
                `.terminal_states_probabilities`                                                                   
                `.terminal_states_memberships                                                                      
             Finish`                                                                                               
INFO     Computing fate probabilities
INFO     Adding `adata.obsm['lineages_fwd']`                                                                       
                `.fate_probabilities`                                                                              
             Finish (0.12s)

Let’s take a look at these fate probabilities now.

with plt.rc_context({"figure.figsize": (2, 2.5)}):
    g.plot_fate_probabilities(same_plot=False, ncols=3, basis="tsne")

../../_images/a04f6d0f4281d4b8e60f80f4b4f1f557fc71479ee0e8bad1609da31d55d55530.png

We can summarize these fate probabilities in a cirular embedding.

cr.pl.circular_projection(adata, keys="clusters", legend_loc="right")

INFO     Solving TSP for `5` states

../../_images/f642d902041b5e6760b4e5638cd848413c757e3bf7dbdb0813e5467cbab8a908.png

Visualize gene expression trends#

Finally, we can put the pieces together. Below, we use Generative Additive Models (GAMs) to visualize CellMapper-smoothed expression values as a function of pseudotime, where we use CellRank fate probabilities to softly assign cells to different trajectories (each leading to one terminal state).

model = cr.models.GAM(adata, n_knots=6)
model

<GAM[gene=None, lineage=None, model=GammaGAM(callbacks=['deviance', 'diffs'], fit_intercept=True, max_iter=2000, scale=None, terms=s(0), tol=0.0001, verbose=False)]>

cr.pl.gene_trends(
    adata,
    figsize=(10, 16),
    model=model,
    data_key="smap_t3",
    genes=genes,
    same_plot=True,
    ncols=2,
    time_key="palantir_pseudotime",
    hide_cells=True,
    weight_threshold=(1e-3, 1e-3),
)

INFO     Computing trends using 1 core(s)
INFO         Finish (1.72s)                                                                                        
INFO     Plotting trends

../../_images/53f397917c8e44efc4c122a97d1c28eb2a3003f579a476010ec481d9bca25887.png

We can see that we find genes with trajectory-specific trends, like GATA1, and others with more similar behavious across trajectories, like CD34. CellRank offers further analysis possibilities of these trends, like clustering, heatmap visualization, etc. Check out the corresponding CellRank tutorial to learn more.

Data denoising for visual & qualitative analysis

Contents