Data denoising for visual & qualitative analysis#
In this tutorial, we will demonstrate how CellMapper can be used to “denoise” scRNA-seq data by sharing information among neighboring cells. Imputation for scRNA-seq data is firecly debated, and we do not advocate building quantiative models or evaluations on top of imputed data. However, imputation can be useful for visual analysis, to inspect gene expression values and how they change along (pseudo)-temporal covariates.
The approach we take here is inspired by MAGIC [VDSN+18], however, it’s much faster thanks to some linear algebra tricks. In this tutorial, we’ll use CellRank[LBK+22, WLK+24] to demonstrate expression trend plotting with imputed data.
Preliminaries#
Import packages & data#
import matplotlib.pyplot as plt
import warnings
import scanpy as sc
import cellmapper
sc.set_figure_params(scanpy=True, frameon=False, fontsize=14)
We’ll use a dataset of CD34+ human bone marrow cells. To learn more about this dataset, take a look at the original publication [SKL+19].
adata = sc.read("data/bone_marrow.h5ad", backup_url="https://figshare.com/ndownloader/files/56264870")
adata
AnnData object with n_obs × n_vars = 5780 × 27876
obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'
uns: 'clusters_colors', 'palantir_branch_probs_cell_types'
obsm: 'X_tsne', 'palantir_branch_probs'
This is raw data with some labels and tsne embedding coordinates.
Basic preprocessing#
Let’s go though some standard scanpy preprocessing.
sc.pp.filter_genes(adata, min_cells=10)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
adata
AnnData object with n_obs × n_vars = 5780 × 12978
obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'
var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'clusters_colors', 'palantir_branch_probs_cell_types', 'log1p', 'hvg', 'pca', 'neighbors'
obsm: 'X_tsne', 'palantir_branch_probs', 'X_pca'
varm: 'PCs'
obsp: 'distances', 'connectivities'
Basic visualization#
Let’s take a look at this in a t-SNE embedding, visualizing cell type labels and a pre-computed Palantir[SKL+19] pseudotime.
Denoising with CellMapper#
Denoising in CellMapper comes down to computing a k-NN graph in self-mapping mode and matrix-multiplying the gene expression matrix with the normalized k-NN adjacencies. The degree of smoothing can be controlled with the t parameter, which controls how often we multiply the expression matric with the normalized graph adjacencies. For large t, we implement a spectral appraoch which is much faster than the native iterative approach. Let’s start by initializing CellMapper in self-mapping mode by only providing a single AnnData object.
smap = cellmapper.CellMapper(adata)
smap
INFO Initialized CellMapper for self-mapping with 5780 cells.
CellMapper(self-mapping, data=AnnData(n_obs=5,780, n_vars=12,978),
Compute a k-NN graph and mapping matrix in PCA space.
smap.compute_neighbors(use_rep="X_pca")
smap.compute_mapping_matrix()
INFO Self-mapping mode detected. Computing only yx neighbors for efficiency (all neighbor matrices will contain
the same information).
INFO Using sklearn to compute 30 neighbors.
INFO Computing mapping matrix using kernel method 'umap'.
Great, now let’s do the actual smoothing operation by multiplying the expression data with the mapping matrix, which corresponds to normalized graph adjacencies.
smap.map_layers("X", t=3)
adata.layers["smap_t3"] = smap.query_imputed.X
INFO Mapping layer for key 'X' with t=3 steps using iterative diffusion_method
INFO Imputed expression matrix with shape (5780, 12978) converted to AnnData object.
Observation metadata from query and feature metadata from reference were linked (not copied).
INFO Expression for layer 'X' mapped and stored in query_imputed.X.
Let’s visualize original and imputed expression values
genes = ["GATA1", "GATA2", "CD34", "IRF8", "FOSL2", "SPI1"]
with plt.rc_context({"figure.figsize": (1.5, 2), "legend.fontsize": 8, "axes.titlesize": 8}):
for layer in [None, "smap_t3"]:
print(f"layer={layer}")
sc.pl.embedding(
adata, basis="tsne", color=genes + ["clusters"], layer=layer, ncols=9, colorbar_loc=None, wspace=0.1, size=4
)
For larger t values (corresponding to more smoothing), we recommend using the spectral appraoch, which is much faster, as we demonstrate below.
%%time
smap.map_layers("X", t=10, diffusion_method="iterative")
INFO Mapping layer for key 'X' with t=10 steps using iterative diffusion_method
INFO Imputed expression matrix with shape (5780, 12978) converted to AnnData object.
Observation metadata from query and feature metadata from reference were linked (not copied).
INFO Expression for layer 'X' mapped and stored in query_imputed.X.
CPU times: user 32.5 s, sys: 693 ms, total: 33.2 s
Wall time: 33.3 s
%%time
smap.map_layers("X", t=10, diffusion_method="spectral")
INFO Mapping layer for key 'X' with t=10 steps using spectral diffusion_method
INFO Computing symmetric eigendecomposition with 50 components for matrix powers
INFO Imputed expression matrix with shape (5780, 12978) converted to AnnData object.
Observation metadata from query and feature metadata from reference were linked (not copied).
INFO Expression for layer 'X' mapped and stored in query_imputed.X.
CPU times: user 171 ms, sys: 10.6 ms, total: 181 ms
Wall time: 163 ms
We obtain a good two orders of magnitude speedup with the spectral approach. This becomes crucial for large t or large datasets.
Visualizing imputed expression trends with CellRank#
One common application of data denoising is visualizing gene exression trends in pseudotime. Denoising isn’t strictly necessary here - an alternative approach is using an NB noise model in the Generalized Additive Model (GAM), but that’s computationally more expensive. Thus, many tools like CellRank work better with simple Gaussian noise models on imputed data. Again, keep in mind that we’re only using the imputed data for visualization and not to compute other CellRank properties, like terminal states and fate probabilities.
The workflow will be:
Set up a CellRank
PseudotimeKernelto infer cellular dynamics based on thePalantirPseudotimeUsing CellRank’s
GPCCAestimator, compute terminal states and fate probabilities towards them.Using fate probabilities, Cellmapper-smoothed expression values, and the
PalantirPseudotime, visualize trajectory-specific expression trends.
Set up CellRank PseudotimeKernel#
To get started, let’s set up a PseudotimeKernel in CellRank. You can learn more about this in the CellRank docs: https://cellrank.readthedocs.io/en/latest/index.html.
from cellrank.kernels import PseudotimeKernel
from cellrank.estimators import GPCCA
import cellrank as cr
pk = PseudotimeKernel(adata, time_key="palantir_pseudotime")
pk.compute_transition_matrix()
pk
INFO Computing transition matrix based on pseudotime
INFO Finish (0.63s)
PseudotimeKernel[n=5780, dnorm=False, scheme='hard', frac_to_keep=0.3]
Let’s take a look at the inferred dynamics with random walks.
pk.plot_random_walks(
n_sims=100,
start_ixs={"clusters": "HSC_1"},
basis="tsne",
color="clusters",
legend_loc="right",
seed=1,
)
When we initialize random walks in the HSC_1 stem cell population, the finish mostly in terminal populations, like Monocytes or Erythrocytes. Next, we setup the CellRank estimator.
Compute terminal states and fate probabilities towards them#
g = GPCCA(pk)
g
GPCCA[kernel=PseudotimeKernel[n=5780], initial_states=None, terminal_states=None]
Compute and visualize some macrostates
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=UserWarning)
g.compute_macrostates(n_states=5, cluster_key="clusters")
g.plot_macrostates(which="all", legend_loc="right", basis="tsne")
Great, all of these represent terminal states, let’s thus declare them terminal and compute fate probabilities towards them.
g.set_terminal_states()
g.compute_fate_probabilities()
INFO Adding `adata.obs['term_states_fwd']`
`adata.obs['term_states_fwd_probs']`
`.terminal_states`
`.terminal_states_probabilities`
`.terminal_states_memberships
Finish`
INFO Computing fate probabilities
INFO Adding `adata.obsm['lineages_fwd']`
`.fate_probabilities`
Finish (0.12s)
Let’s take a look at these fate probabilities now.
with plt.rc_context({"figure.figsize": (2, 2.5)}):
g.plot_fate_probabilities(same_plot=False, ncols=3, basis="tsne")
We can summarize these fate probabilities in a cirular embedding.
Visualize gene expression trends#
Finally, we can put the pieces together. Below, we use Generative Additive Models (GAMs) to visualize CellMapper-smoothed expression values as a function of pseudotime, where we use CellRank fate probabilities to softly assign cells to different trajectories (each leading to one terminal state).
model = cr.models.GAM(adata, n_knots=6)
model
<GAM[gene=None, lineage=None, model=GammaGAM(callbacks=['deviance', 'diffs'], fit_intercept=True, max_iter=2000, scale=None, terms=s(0), tol=0.0001, verbose=False)]>
cr.pl.gene_trends(
adata,
figsize=(10, 16),
model=model,
data_key="smap_t3",
genes=genes,
same_plot=True,
ncols=2,
time_key="palantir_pseudotime",
hide_cells=True,
weight_threshold=(1e-3, 1e-3),
)
We can see that we find genes with trajectory-specific trends, like GATA1, and others with more similar behavious across trajectories, like CD34. CellRank offers further analysis possibilities of these trends, like clustering, heatmap visualization, etc. Check out the corresponding CellRank tutorial to learn more.