The SparseMDC package implements the multiple-condition SparseDC model and is suitable for data coming from two or more ordered conditions. The method clusters samples (cells) in each condition, links corresponding clusters (cell-types) across conditions, identifies a unique set of characteristic features (marker genes) for each cluster, and identifies features which characterize the condition change for each cluster. This vignette will guide you through the application of SparseMDC, including pre-processing of the data, calculation of the penalty parameters, and extraction of the results.
SparseMDC was designed for the analysis of single-cell RNA-sequencing (scRNA-seq) data coming from multiple ordered conditions. These conditions could include cells before and after treatment, cells at different developmental stages, or cells taken from different time points in a time-course experiment. The data SparseMDC takes as input is scRNA-seq gene expression data. SparseMDC assumes that this data has been properly normalized for sequencing depth and other technical effects prior to its application.
The first step is to install and load the SparseMDC package.
# install.packages("SparseMDC")
library("SparseMDC")
## Loading required package: doRNG
## Loading required package: foreach
## Loading required package: rngtools
## Loading required package: pkgmaker
## Loading required package: registry
##
## Attaching package: 'pkgmaker'
## The following object is masked from 'package:base':
##
## isFALSE
To demonstrate the data format and application of SparseMDC, we will use scRNA-seq data created by Biase et al. to study cell fate inclination in mouse embryos [@biase2014]. This dataset contains gene expression measurements (FPKM) for 49 cells and 16,514 genes. The cells in the dataset come from three different cell types: zygotes, two-cell embryos, and four-cell embryos. While the cells in this dataset are all from a single condition, we have developed an approach to split the data into three conditions so that the linking of clusters across conditions can be demonstrated. For this example we split the data so that the zygote cells and 10 two-cell cells are in condition A, 10 two-cell and 10 four-cell cells are in condition B, and the remaining 10 four-cell cells are in condition C.
# Load Dataset
data(data_biase)
# Summarize condition vector
summary(as.factor(condition_biase))
## A B C
## 19 20 10
# Compare condition and cell type
table(condition_biase, cell_type_biase)
## cell_type_biase
## condition_biase Four-cell Embryo Two-cell Embryo Zygote
## A 0 10 9
## B 10 10 0
## C 10 0 0
SparseMDC takes as input scRNA-seq gene expression data for multiple conditions. This data should be stored as a list with each entry containing a gene expression matrix for a single condition. The data should be stored with genes as rows and cells as columns.
The next step is to check that the data is stored correctly, separate the data by condition, and store it as a list:
# Check rows are genes and columns are cells
head(data_biase[,1:5])
## GSM1377859 GSM1377860 GSM1377861 GSM1377862 GSM1377863
## ENSMUSG00000000001 3.288693 3.631147 2.290201 3.241467 3.4727581
## ENSMUSG00000000028 4.547849 4.533851 4.560077 4.682483 4.8076946
## ENSMUSG00000000037 3.662392 3.154039 3.192203 3.767524 3.2131756
## ENSMUSG00000000049 0.000000 0.000000 0.000000 0.000000 0.0000000
## ENSMUSG00000000056 2.544338 1.889089 2.143146 1.975677 1.9743810
## ENSMUSG00000000078 1.127634 1.278873 1.085679 2.132017 0.9719827
# Separate data by condition
biase_A <- data_biase[,which(condition_biase == "A")]
cell_type_A <- cell_type_biase[which(condition_biase == "A")]
biase_B <- data_biase[,which(condition_biase == "B")]
cell_type_B <- cell_type_biase[which(condition_biase == "B")]
biase_C <- data_biase[,which(condition_biase == "C")]
cell_type_C <- cell_type_biase[which(condition_biase == "C")]
# Move data into list
dat_l <- list(biase_A, biase_B, biase_C)
Prior to the application of SparseMDC, the data must be normalized for sequencing depth and other technical effects, and centered on a gene-by-gene basis. We also recommend that the data be log-transformed. To do this we have included a function that can easily pre-process the data. For the normalization, it is recommended that users make use of one of the many methods that exist for normalizing scRNA-seq data. The centering of the data is crucially important to the function of SparseMDC; it is vital for accurately clustering the data and identifying marker genes. We recommend that all users use this function to center their data and that only experienced users set "center = FALSE".
The Biase data are FPKM measurements of gene expression and so have already been normalized using an alternative method, as advised. This means we can set "norm = FALSE". The Biase data then need to be both log-transformed and centered, so we set "log = TRUE" and "center = TRUE". We also set "dim = 3", which is the number of conditions in the data:
pdat <- pre_proc_data(dat_l, dim = 3, norm = FALSE, log = TRUE, center = TRUE)
The data is returned as a list containing the centered and log-transformed data. To check that the data has been correctly centered, we can view the row sums for each gene.
# Check condition A
summary(rowSums(pdat[[1]]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -14.4967 -2.1420 0.2121 0.5778 3.4113 14.4990
# Check condition B
summary(rowSums(pdat[[2]]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.44322 -1.74132 -0.03822 -0.19947 1.27058 11.61908
# Check condition C
summary(rowSums(pdat[[3]]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.8084 -2.0521 -0.2319 -0.3783 1.3200 11.1150
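Within a single condition the row sums are not exactly zero; this is consistent with the centering being performed gene-by-gene across the cells pooled over all conditions. The following toy sketch illustrates this kind of pooled centering on synthetic data (the matrix `toy` and the split below are illustrative only, not the Biase data or the SparseMDC implementation):

```r
# Toy illustration of gene-wise centering across pooled cells
# (synthetic data, not SparseMDC's pre_proc_data implementation)
set.seed(1)
toy <- matrix(rpois(5 * 6, lambda = 5), nrow = 5)  # 5 genes, 6 cells
cond <- rep(c("A", "B"), each = 3)
centered <- toy - rowMeans(toy)  # center each gene over all cells
toy_l <- list(centered[, cond == "A"], centered[, cond == "B"])
# Row sums within one condition need not be zero...
rowSums(toy_l[[1]])
# ...but pooled over all conditions they are zero for every gene
rowSums(toy_l[[1]]) + rowSums(toy_l[[2]])
```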
Now that the data has been centered and log-transformed it is time to estimate the penalty parameters used in SparseMDC. The target function of SparseMDC contains two penalty parameters, \(\lambda_{1}\) and \(\lambda_{2}\). The \(\lambda_{1}\) penalty acts on the center value of each cluster driving them towards zero and revealing the marker genes for each cell-type. This term then controls the number of non-zero center values which are the characteristic features (marker genes) in the solution. Larger values of \(\lambda_{1}\) lead to a sparser solution. The \(\lambda_{2}\) penalty acts on the difference between center values for each cluster across conditions. This has the effect of driving similar cells across conditions together, linking clusters across conditions, and revealing features which characterize the change. Larger values of \(\lambda_{2}\) lead to less change across conditions. Details on the estimation of these parameters can be found in the supplementary material of the manuscript.
To estimate \(\lambda_{1}\) we just need to provide the centered and log-transformed data, the number of conditions, and the number of clusters. As there are three cell-types present, we set the number of clusters, `nclust`, as 3.
lambda1 <- lambda1_calculator(pdat, dim = 3, nclust = 3 )
lambda1
## [1] 0.5909344 0.5909344 0.5909344
To estimate \(\lambda_{2}\) we again need to provide the centered and log-transformed data, the number of conditions and clusters as well as the calculated \(\lambda_{1}\) value.
lambda2 <- lambda2_calculator(pdat, dim = 3, nclust = 3,
                              lambda1 = lambda1)
lambda2
## [1] 1.732187 1.732187
Once the penalty parameters have been calculated, we are ready to apply SparseMDC. SparseMDC requires as input the processed data, the number of conditions and clusters, and the calculated penalty parameters. Other parameters which can be changed include the number of random starts, `nstarts`, and the number of iterations used when calculating the start values, `init_iter`:
# Apply SparseMDC
smdc_res <- sparse_mdc(pdat, dim = 3, nclust = 3, lambda1 = lambda1,
lambda2 = lambda2, nstarts = 50, init_iter = 1)
## Calculating start values
## Warning: did not converge in 1 iteration
## (this warning is repeated 45 times in total)
## The number of unique start values is: 22
## Start number: 1
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## Iteration: 6
## Iteration: 7
## Iteration: 8
## Clusters are unchanged!
## The score for start 1 is 81157.58
## New minimum score!
## (output for starts 2 to 22 truncated: each start converged with
## "Clusters are unchanged!"; scores ranged from 81154.73 to 82628.38,
## with the minimum score of 81154.73 first reached at start 12)
After the application, the results are stored as a list. The first item contains the clustering assignments, while the second item contains the calculated center values. Each of these items is also a list containing the solution for each condition in the corresponding entry; i.e., the clusters for condition A are stored in the first entry of the cluster list.
# Extract clustering solution
clusters <- smdc_res[[1]]
# Extract clusters for condition A
clusters_A <- clusters[[1]]
# Extract clusters for condition B
clusters_B <- clusters[[2]]
# Extract clusters for condition C
clusters_C <- clusters[[3]]
# Compare clusters and cell type
table(cell_type_A, clusters_A)
## clusters_A
## cell_type_A 1 2
## Two-cell Embryo 10 0
## Zygote 1 8
table(cell_type_B, clusters_B)
## clusters_B
## cell_type_B 1 3
## Four-cell Embryo 0 10
## Two-cell Embryo 10 0
table(cell_type_C, clusters_C)
## clusters_C
## cell_type_C 3
## Four-cell Embryo 10
# View full comparison
table(c(cell_type_A, cell_type_B, cell_type_C),
c(clusters_A, clusters_B, clusters_C))
##
## 1 2 3
## Four-cell Embryo 0 0 20
## Two-cell Embryo 20 0 0
## Zygote 1 8 0
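The combined table shows that 48 of the 49 cells are assigned to the cluster matching their cell type. One simple way to summarize this agreement, using base R only (the confusion table is rebuilt here from the counts shown above), is the fraction of cells falling in the majority cluster of each cell type:

```r
# Rebuild the combined confusion table from the counts shown above and
# compute the fraction of cells in the majority cluster per cell type
tab <- matrix(c(0, 20, 1,    # cluster 1
                0,  0, 8,    # cluster 2
               20,  0, 0),   # cluster 3
              nrow = 3,
              dimnames = list(c("Four-cell Embryo", "Two-cell Embryo",
                                "Zygote"), 1:3))
sum(apply(tab, 1, max)) / sum(tab)  # 48 / 49
```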
The centers are extracted in a similar manner.
# Extract centers
centers <- smdc_res[[2]]
# Extract centers for condition A
centers_A <- centers[[1]]
# Extract centers for condition B
centers_B <- centers[[2]]
# Extract centers for condition C
centers_C <- centers[[3]]
The center results from SparseMDC can be used to identify marker genes of different categories. Full details on the different categories of marker genes can be found in the original manuscript, but briefly: housekeeping marker genes have the same non-zero center value in every condition; condition-dependent marker genes have non-zero center values that differ between conditions; and condition-specific marker genes have a non-zero center value in only one condition.
The center values are stored in the column corresponding to the cluster number. For example, the center values for cluster 1 in condition A are stored in the first column of `centers_A`.
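To list every marker gene of a cluster, regardless of category, we can simply take the genes with a non-zero center value in that cluster's column. A toy sketch (the matrix `toy_centers` below is synthetic, not SparseMDC output):

```r
# Synthetic centers matrix: 3 genes x 2 clusters (illustrative only)
toy_centers <- matrix(c(1.2, 0, -0.8,   # cluster 1 centers
                        0,   0,  2.1),  # cluster 2 centers
                      nrow = 3,
                      dimnames = list(c("geneA", "geneB", "geneC"),
                                      c("clust1", "clust2")))
# Marker genes of cluster 1 are the genes with non-zero centers
markers_clust1 <- rownames(toy_centers)[toy_centers[, 1] != 0]
markers_clust1  # "geneA" "geneC"
```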
To identify housekeeping marker genes for cluster 1 we can use the following approach:
# Identify housekeeping marker gene index
clus_1_hk_gene_ind <- which(centers_A[,1] == centers_B[,1] &
centers_B[,1] == centers_C[,1] &
centers_A[,1] != 0)
# Identify the housekeeping marker genes
clus_1_hk_genes <- row.names(data_biase)[clus_1_hk_gene_ind]
Condition-dependent marker genes for cluster 1 can be identified in the following way; the parentheses make the grouping of the `&` and `|` conditions explicit.
# Identify condition-dependent marker gene index
clus_1_cd_gene_ind <- which((centers_A[,1] != 0 & centers_B[,1] != 0 &
                             centers_A[,1] != centers_B[,1]) |
                            (centers_A[,1] != 0 & centers_C[,1] != 0 &
                             centers_A[,1] != centers_C[,1]) |
                            (centers_B[,1] != 0 & centers_C[,1] != 0 &
                             centers_B[,1] != centers_C[,1]))
# Identify the condition-dependent marker genes
clus_1_cd_genes <- row.names(data_biase)[clus_1_cd_gene_ind]
Condition-specific marker genes for cluster 1 can be identified in the following way.
# Identify condition A specific genes
clus_1_A_cs_ind <- which(centers_A[,1] != 0 & centers_B[,1] == 0 &
centers_C[,1] == 0)
clus_1_A_cs_genes <- row.names(data_biase)[clus_1_A_cs_ind]
# Identify condition B specific genes
clus_1_B_cs_ind <- which(centers_A[,1] == 0 & centers_B[,1] != 0 &
centers_C[,1] == 0)
clus_1_B_cs_genes <- row.names(data_biase)[clus_1_B_cs_ind]
# Identify condition C specific genes
clus_1_C_cs_ind <- which(centers_A[,1] == 0 & centers_B[,1] == 0 &
centers_C[,1] != 0)
clus_1_C_cs_genes <- row.names(data_biase)[clus_1_C_cs_ind]
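The three condition-specific blocks above differ only in which condition is required to be non-zero, so they can be wrapped in a small helper. The function below is a hypothetical convenience wrapper written for this vignette, not part of the SparseMDC API:

```r
# Hypothetical helper (not part of SparseMDC): genes whose center value
# for a given cluster is non-zero in exactly one chosen condition
condition_specific_genes <- function(centers, cond, clust, gene_names) {
  vals <- sapply(centers, function(m) m[, clust])  # genes x conditions
  ind <- which(vals[, cond] != 0 & rowSums(vals != 0) == 1)
  gene_names[ind]
}

# Toy check with synthetic one-cluster centers for three conditions
cA <- matrix(c(1, 0, 0), ncol = 1)
cB <- matrix(c(0, 2, 0), ncol = 1)
cC <- matrix(c(0, 0, 0), ncol = 1)
condition_specific_genes(list(cA, cB, cC), cond = 1, clust = 1,
                         gene_names = c("g1", "g2", "g3"))  # "g1"
```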