library(scapGNN)
With the widespread availability of single-cell sequencing technology, numerous single-cell RNA-seq and ATAC-seq data are collected into public databases. However, the single-cell data have characteristics such as drop-out events and low library sizes. How to leverage these data to parse cellular heterogeneity at the pathway level and identify activated gene modules with multiple cell phenotypes is still a challenge.
Here, we propose scapGNN
, an accurate, effective, and
robust framework to comprehensively parse single-cell data at the
pathway level from single-omics data or multi-omics integration data.
The scapGNN
model takes single-cell gene expression
profiles or the gene activity matrix of scATAC-seq as input data. As
shown in Figure 1A, the computing architecture consists of three
modules:
A depp neural network-based autoencoder, which is used to extract cellular features, gene features, and mine latent associations between genes and cells.
A graph convolutional network-based graph autoencoder, which is used to construct cell-cell association network and gene-gene association network.
Random walk with restart (RWR), which is used to calculate the activity score of a pathway or gene set in each cell.
For the SNARE-seq dataset, scapGNN
can integrate
gene-cell association networks from scATAC-seq and scRNA-seq into one
combined network by the Brown’s extension of the Fisher’s combined
probability test. According to the combined gene-cell association
network, pathway activity scores or activated gene modules are supported
by single-cell multi-omics (Figure 2). scapGNN
also designs
abundant visualization programs to clearly display the analysis results.
Taken together, its capabilities enable scapGNN
to
calculate pathway activity scores in a single cell, assess pathway
heterogeneity between cells, identify key active genes in pathway, mine
gene modules under multiple cell phenotypes, and provide gene and cell
network data (Figure 1B).
Figure 1 Overview of the scapGNN framework.
Figure 2 The workflow of integrating scRNA-seq data and scATAC-seq data by scapGNN.
Single-cell gene expression profiles data needs to pass
Seurat
’s pre-processing workflow. These represent the
selection and filtration of cells based on QC metrics, data
normalization, and the detection of highly variable features.
library(Seurat)
# Load the PBMC dataset (Case data for seurat)
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Our recommended data filtering is that only genes expressed as non-zero in more than
# 1% of cells, and cells expressed as non-zero in more than 1% of genes are kept.
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "case",
min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# Normalizing the data.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize")
# Identification of highly variable features.
pbmc <- FindVariableFeatures(pbmc, selection.method = 'vst', nfeatures = 2000)
# Run Preprocessing() of scapGNN.
Prep_data <- Preprocessing(pbmc,verbose=FALSE)
Detailed procedures and descriptions can be found in vignettes of
Seurat
(https://satijalab.org/seurat/).
# Users can also directly input data in a data frame or matrix format which contains hypervariable genes and is log-transformed.
data("Hv_exp")
<- Preprocessing(Hv_exp,verbose=FALSE)
Prep_data summary(Prep_data)
#> Length Class Mode
#> orig_dara 45000 -none- numeric
#> cell_features 45000 -none- numeric
#> gene_features 45000 -none- numeric
#> ltmg_matrix 1 -none- function
#> cell_adj 8100 -none- numeric
#> gene_adj 250000 -none- numeric
First, the Signac
package is used to quantify the
activity of each gene in the genome by assessing the chromatin
accessibility associated with each gene, and create a new gene activity
matrix derived from the scATAC-seq data. For details, please refer to
vignettes of Signac
(https://satijalab.org/signac/index.html). As with
single-cell gene expression profiles, the ensuing gene activity matrix
is normalized and detected highly variable features by the
Seurat
package. Finally, the Preprocessing()
of scapGNN
further processes the data into the required
format.
Using the Preprocessing()
function results of scRNA-seq
or scATAC-seq as input, the ConNetGNN()
function evaluates
whether gene-gene, cell-cell, and gene-cell are associated and levels of
strength. Finally, an undirected and weighted gene-cell network is
constructed and use for subsequent analysis.
The ConNetGNN()
function is implemented based on
pytorch
, so an appropriate python environment is
required:
We also provide environment files for conda:
/inst/extdata/scapGNN_env.yaml
. Users can install it with
the command: conda env create -f scapGNN_env.yaml
.
library(coop)
library(reticulate)
library(parallel)
# Data preprocessing
data("Hv_exp")
Prep_data <- Preprocessing(Hv_exp)
# Run ConNetGNN()
ConNetGNN_data <- ConNetGNN(Prep_data,python.path="../miniconda3/envs/scapGNN_env/python.exe")
# View the content of the ConNetGNN() results.
data(ConNetGNN_data)
summary(ConNetGNN_data)
#> Length Class Mode
#> cell_network 8100 -none- numeric
#> gene_network 250000 -none- numeric
#> gene_cell_network 45000 -none- numeric
For the SNARE-seq dataset, the ConNetGNN()
results of
scRNA-seq and scATAC-seq can be integrated into a combined gene-cell
association network via InteNet()
.
library(ActivePathways)
library(parallel)
data(RNA_net)
data(ATAC_net)
RNA_ATAC_IntNet<-InteNet(RNA_net,ATAC_net,parallel.cores=1)
For the gene-cell association network constructed by
ConNetGNN()
, the scPathway()
function uses RWR
algorithm to calculate the pathway activity score of each cell.
library(parallel)
library(utils)
# Load the result of the ConNetGNN function.
data(ConNetGNN_data)
scPathway_data<-scPathway(ConNetGNN_data,gmt.path=system.file("extdata", "KEGG_human.gmt", package = "scapGNN"),parallel.cores=1)
The result scPathway_data
of scPathway()
is
a matrix of pathway activity scores, with rows as pathways and columns
as cells.
data(scPathway_data)
1:3,1:3]
scPathway_data[#> H9_0h_1 H9_0h_2 H9_0h_3
#> Ras signaling pathway [PATH:hsa04014] 0.20 0.19 0.17
#> Rap1 signaling pathway [PATH:hsa04015] 0.08 0.10 0.10
#> MAPK signaling pathway [PATH:hsa04010] 0.68 0.59 0.70
library(pheatmap)
pheatmap(scPathway_data)
Use the plotGANetwork()
function to plot the gene
network of pathways in the specified cell phenotype.
library(igraph)
#> Warning: 程辑包'igraph'是用R版本4.1.3 来建造的
#>
#> 载入程辑包:'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
# Load data.
data(ConNetGNN_data)
data("Hv_exp")
# Construct cell set.
<-grep("0h",colnames(Hv_exp))
index<-colnames(Hv_exp)[index]
cellset
# Construct gene set.
<-load_path_data(system.file("extdata", "KEGG_human.gmt", package = "scapGNN"))
pathways<-pathways[[which(names(pathways)=="Tight junction [PATH:hsa04530]")]]
geneset
plotGANetwork(ConNetGNN_data,cellset,geneset,vertex.label.dist=1.5,main = "Tight junction [PATH:hsa04530]")
The plotGANetwork()
function can construct and display a
cell community network.
require(igraph)
require(graphics)
data(ConNetGNN_data)
# Construct the cell phenotype vector.
<-colnames(ConNetGNN_data[["cell_network"]])
cell_id<-unlist(strsplit(cell_id,"_"))
temp<-temp[seq(2,length(temp)-1,by=3)]
cell_phennames(cell_id)<-cell_phen
head(cell_id)
#> 0h 0h 0h 0h 0h 0h
#> "H9_0h_1" "H9_0h_2" "H9_0h_3" "H9_0h_4" "H9_0h_5" "H9_0h_6"
plotCCNetwork(ConNetGNN_data,cell_id,edge.width=10)
The cpGModule()
can identify cell phenotype activated
gene module.
require(parallel)
#> 载入需要的程辑包:parallel
require(stats)
# Load the result of the ConNetGNN function.
data(ConNetGNN_data)
data(Hv_exp)
# Construct the cell set corresponding to 0h.
<-grep("0h",colnames(Hv_exp))
index<-colnames(Hv_exp)[index]
cellset<-cpGModule(ConNetGNN_data,cellset,parallel.cores=1)
H9_0h_cpGM_data#> Start RWR
#> Start perturbation analysis
summary(H9_0h_cpGM_data)
#> Genes AS Pvalue FDR
#> Length:5 Min. :0.002330 Min. :0.00000 Min. :0.0000
#> Class :character 1st Qu.:0.002377 1st Qu.:0.00000 1st Qu.:0.0000
#> Mode :character Median :0.002407 Median :0.00016 Median :0.0250
#> Mean :0.002454 Mean :0.00016 Mean :0.0188
#> 3rd Qu.:0.002546 3rd Qu.:0.00020 3rd Qu.:0.0250
#> Max. :0.002608 Max. :0.00044 Max. :0.0440
Use the plotGANetwork()
function to plot the gene
network of module.
library(igraph)
# Load data.
data(ConNetGNN_data)
data("Hv_exp")
data("H9_0h_cpGM_data")
# Construct cell set.
<-grep("0h",colnames(Hv_exp))
index<-colnames(Hv_exp)[index]
cellset
# Construct gene set.
<-H9_0h_cpGM_data$Genes
geneset
plotGANetwork(ConNetGNN_data,cellset,geneset,vertex.label.dist=1.5,main = "Gene network of 0h cells activated gene module")
For multiple cellular phenotypes (e.g. classification, time stage,
etc.), cpGModule()
identifies the activated gene modules
individually. The plotMulPhenGM()
function then quantifies
the activation strength of genes in different cell phenotypes and
displays gene networks under multiple cell phenotypes.
require(igraph)
require(grDevices)
# Load the result of the ConNetGNN function.
data(ConNetGNN_data)
# Obtain cpGModule results for each cell phenotype.
data(H9_0h_cpGM_data)
data(H9_24h_cpGM_data)
data(H9_36h_cpGM_data)
<-list(H9_0h=H9_0h_cpGM_data,H9_24h=H9_24h_cpGM_data,H9_36h=H9_36h_cpGM_data)
data.listplotMulPhenGM(data.list,ConNetGNN_data,margin=-0.05)
The single-cell pathway activity matrix can be docked with the
Seurat
package to perform non-linear dimensional reduction,
identify differentially active pathways, draw heat map, violin plot, and
more.
library(dplyr)
library(Seurat)
library(patchwork)
data(scPathway_data)
data <- CreateSeuratObject(counts = scPathway_data, project = "Pathway")
all.genes <- rownames(data)
data <- ScaleData(data, features = all.genes,verbose = FALSE)
data <- RunPCA(data, features =all.genes,verbose = FALSE)
data <- FindNeighbors(data, verbose = FALSE)
data <- FindClusters(data, resolution = 0.5,verbose = FALSE)
data <- RunUMAP(data,dims = 1:15, verbose = FALSE)
cell_phen<-unlist(strsplit(colnames(scPathway_data),"_"))
cell_phen<-cell_phen[seq(2,length(cell_phen)-1,by=3)]
data@active.ident<-as.factor(cell_phen)
# Visualize non-linear dimensional reduction
DimPlot(data, reduction = "umap")
# Identify differentially active pathways
diff.pathways <- FindAllMarkers(data,verbose = FALSE)
# violin plot
VlnPlot(data, features = "MAPK signaling pathway [PATH:hsa04010]")
# heat map
DoHeatmap(data, features = row.names(scPathway_data))