Entry Points

STEP follows the workflows of AnnData and Scanpy and provides three major entry points for users to interact with the package and analyze their data. Each corresponding Python object takes the data, data-processing parameters, and model configuration as inputs and performs data preprocessing, dataset construction, and functional model initialization. As a result, no further preprocessing is needed beyond gene selection and cell filtering. More details on using the entry points can be found in the API reference.

1. step.scModel

For generating embeddings of scRNA-seq or SRT data without incorporating spatial information. The following example shows how to handle an AnnData object adata.

Two important parameters are keys into adata.obs:

  1. batch_key: name of the key in adata.obs that stores the batch indicator for each cell/spot. This information is used for batch correction. Use None for single-batch data.

  2. class_key: name of the key in adata.obs that stores the cell-type/cell-state annotation for each cell/spot. If provided, this information is used to refine the embedding; otherwise the model runs in unsupervised mode.
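
Before constructing the model, it can help to confirm that the keys you plan to pass actually exist in adata.obs. The helper below is a minimal sketch and not part of STEP: check_obs_keys is a hypothetical name, and it only needs the column names (for example, list(adata.obs.columns)).

```python
def check_obs_keys(obs_columns, batch_key=None, class_key=None):
    """Return the requested keys that are missing from the obs columns.

    Hypothetical helper, not part of STEP. Keys set to None are skipped,
    mirroring single-batch data (batch_key=None) and unsupervised mode
    (class_key=None).
    """
    available = set(obs_columns)
    requested = [key for key in (batch_key, class_key) if key is not None]
    return [key for key in requested if key not in available]


# Stand-in for list(adata.obs.columns)
columns = ["batch", "n_genes", "celltype"]
missing = check_obs_keys(columns, batch_key="batch", class_key="cell_type")
print(missing)  # ['cell_type'] (the misspelled key is reported)
```

An empty list means both keys can be passed to the model as they are.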

Embedding generation

from step import scModel

stepc = scModel(
    adata, 
    n_top_genes=2000,
    geneset_to_use=None, # or predefined gene list 
    batch_key='batch', # None for single-batch data
    class_key='celltype', # None for unsupervised mode
)

stepc.run(
    epochs=400,
    batch_size=1024,
    split_rate=0.2,
)

Save the model config and weights

stepc.save('scmodel_config')

This creates a folder named scmodel_config in the current working directory, containing the config.json and model.pth files.
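
To verify that a save completed, one could check the folder for the two files named above. This is a small sketch using only the standard library; missing_saved_files is a hypothetical helper, not a STEP function.

```python
from pathlib import Path

# File names taken from the description above.
EXPECTED_FILES = ("config.json", "model.pth")


def missing_saved_files(folder):
    """List expected files that are absent from a saved-model folder.

    Hypothetical helper, not part of STEP.
    """
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# e.g. after stepc.save('scmodel_config'); an empty list means both files exist
print(missing_saved_files("scmodel_config"))
```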

Downstream analysis with the Scanpy workflow

import scanpy as sc

adata = stepc.adata
# 'X_rep' is the embedding generated by STEP in unsupervised mode
sc.pp.neighbors(adata, use_rep='X_rep')
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['celltype', 'batch', 'leiden'])

# 'X_anchord' is the embedding refined from 'X_rep' by using cell-type annotations
sc.pp.neighbors(adata, use_rep='X_anchord')
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['celltype', 'batch', 'leiden'])

2. step.stModel

For generating embeddings of SRT data with spatial information. STEP does not rely on contiguity between sections, so simply pass an AnnData object adata whose section identifiers are stored in adata.obs, and give that column name as batch_key. The parameters that differ from scModel are the three related to the spatial neighbor graph:

  1. coord_keys: keys in adata.obs that store the spatial coordinates of each spot/cell. The default is ('array_row', 'array_col') for Visium data.

  2. edge_clip: cut-off distance between two spots, measured in grid indices rather than pixels. Use 2 for Visium; 1 for Visium HD, ST, or Stereo-seq bins; None to activate kNN-based graph construction.

  3. max_neighs: maximum number of neighbors for each cell when using kNN for image-segmentation-based SRT data. Neighbors are computed with Euclidean distance.
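
The two graph-construction modes described above can be illustrated with a toy sketch. It assumes Euclidean distance on grid indices for the edge_clip case (the text only states Euclidean distance for kNN), and clipped_edges and knn_edges are hypothetical names; STEP's actual implementation may differ.

```python
import math


def clipped_edges(coords, edge_clip):
    """Connect spots whose grid-index distance is at most edge_clip."""
    edges = []
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i != j and math.dist(a, b) <= edge_clip:
                edges.append((i, j))
    return edges


def knn_edges(coords, max_neighs):
    """Connect each cell to its max_neighs nearest cells (Euclidean)."""
    edges = []
    for i, a in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        others.sort(key=lambda j: math.dist(a, coords[j]))
        edges.extend((i, j) for j in others[:max_neighs])
    return edges


# 3 x 3 square grid of (row, col) indices
grid = [(r, c) for r in range(3) for c in range(3)]
center = grid.index((1, 1))
neighbours = [j for i, j in clipped_edges(grid, edge_clip=1) if i == center]
print(len(neighbours))  # 4: up, down, left, right
```

On a regular grid a small cut-off gives every interior spot the same neighborhood, whereas kNN adapts to the irregular cell positions produced by image segmentation.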

By default, stModel runs in end-to-end mode (the 'fast mode' in the paper), which means the model is trained directly on the spatial graph. If you want to train the model on the gene expression data first and then smooth the embedding with the spatial information, set e2e=False.

Embedding generation

from step import stModel

stepc = stModel(
    adata, 
    n_top_genes=2000,
    batch_key='library_id', # None for single section data
    coord_keys=('array_row', 'array_col'),
    geneset_to_use=None, # or predefined gene list
    edge_clip=2,
    max_neighs=30,
)

stepc.run(
    e2e=True,
    n_iterations=2000,
    n_samples=2048, # number of sampled nodes for each iteration
    graph_batch_size=2, # when multiple sections are available
)

Save the model config and weights

stepc.save('stmodel_config')

Identify spatial domains and sub-domains

stepc.cluster(n_clusters=10)
stepc.sub_cluster(n_clusters=3, pre_key='domain') # identify 3 sub-domains within each identified domain

Visualize with the wrapped Scanpy plotting functions when the data contain multiple sections

stepc.spatial_plot(
    color='domain',
    with_images=True, # show the images of the sections if available
    **kwargs,
)

# Perform downstream analysis with the Scanpy workflow
import scanpy as sc

adata = stepc.adata
# 'X_smoothed' is the embedding generated by STEP
sc.pp.neighbors(adata, use_rep='X_smoothed')
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['domain', 'leiden', 'library_id'])

3. step.crossModel

For the integration of scRNA-seq and SRT data.

The following parameters differ from scModel and stModel:

  1. st_adata: the AnnData object for the SRT data.

  2. st_batch_key: name of the key in st_adata.obs that stores the batch indicator for each spot. This information is used for batch correction.

  3. sc_adata: the AnnData object for the scRNA-seq data.

Co-embedding

from step import crossModel

stepc = crossModel(
    sc_adata=sc_adata,
    st_adata=st_adata,
    n_top_genes=2000,
    batch_key='batch',
    class_key='celltype',
    st_batch_key='library_id',
)

stepc.integrate(
    epochs=400,
    batch_size=1024,
    split_rate=0.2,
    need_anchors=False, # whether to use cell-type annotations to post-process the co-embedding
)

# Save the model config and weights
stepc.save('crossmodel_config')

After co-embedding, the SRT part can be further processed by STEP's spatial model to identify spatial domains, but this is recommended only for single-section data, because the batch effect appears to be reintroduced by section-dependent spatial bias when the spatial model is trained on multiple sections. For multi-section scenarios, we therefore recommend using step.stModel to identify spatial domains.
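
The recommendation above can be encoded as a simple check on the number of distinct sections. This is a sketch only: recommended_domain_entry_point is a hypothetical helper, not part of STEP, and it takes any iterable of per-spot section labels (for example, the column named by st_batch_key).

```python
def recommended_domain_entry_point(section_ids):
    """Suggest where to identify spatial domains, following the guidance above.

    Hypothetical helper, not part of STEP.
    """
    n_sections = len(set(section_ids))
    if n_sections <= 1:
        # Single section: the co-embedding can be smoothed further.
        return "crossModel.generate_domains"
    # Multiple sections: train the spatial model on its own instead.
    return "stModel"


print(recommended_domain_entry_point(["A", "A", "B"]))  # stModel
```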

Smooth the embedding of SRT data

stepc.generate_domains(
    use_st_decoder=False, # whether to train a new decoder for the SRT data
    epochs=800, # epochs for the training of the decoder
    batch_size=1024, # batch size for the training of the decoder
    smooth_epochs=800, # epochs for smoothing the embedding by training a spatial model
)

Domain-wise Cell-type Deconvolution

This process infers the cell-type composition of each spot in SRT data (non-single-cell resolution). Any cell-type annotations stored in sc_adata.obs can be passed to the cell_type_key argument; likewise, any spatial-domain annotations can be passed to the domain_key argument to support the deconvolution. Passing sub_domain is recommended to match sub_types.

stepc.deconv(
    epochs=1500,
    domain_key='domain',
    cell_type_key='celltype',
)

Visualizing the results:

celltypes = stepc.st_adata.obsm['deconv'].columns
stepc.st_adata.obs = stepc.st_adata.obsm['deconv']

stepc.spatial_plot(color=celltypes)
# some cell-types in section 'A' 
stepc.spatial_plot(
    color=celltypes[:4],
    library_id='A',
)
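
Beyond plotting, a common follow-up is to summarize each spot by its dominant cell type. The sketch below uses a plain dictionary as a stand-in for the spots-by-cell-types table. The numbers are made up for illustration, and dominant_cell_types is a hypothetical helper, not part of STEP.

```python
def dominant_cell_types(proportions):
    """Map each spot to the cell type with the largest estimated proportion.

    `proportions` is {spot_id: {cell_type: value}}, a plain-Python stand-in
    for a spots x cell-types table.
    """
    return {spot: max(row, key=row.get) for spot, row in proportions.items()}


# Made-up numbers for illustration only
toy = {
    "spot_1": {"T cell": 0.6, "B cell": 0.3, "Fibroblast": 0.1},
    "spot_2": {"T cell": 0.2, "B cell": 0.1, "Fibroblast": 0.7},
}
print(dominant_cell_types(toy))
# {'spot_1': 'T cell', 'spot_2': 'Fibroblast'}
```

If the deconvolution result is a pandas DataFrame, as the .columns access above suggests, DataFrame.idxmax(axis=1) should give the same per-spot summary.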