metadata_ondisc_matrix
and multimodal_ondisc_matrix
Tutorial 1 covered ondisc_matrix
, the core class implemented by ondisc
. Thus tutorial covers metadata_ondisc_matrix
and multimodal_ondisc_matrix
, two additional classes provided by the package. metadata_ondisc_matrix
stores cell-specific and feature-specific covariate matrices alongside the expression matrix, and multimodal_ondisc_matrix
stores multiple metadata_ondisc_matrices
representing different cellular modalities. Together, metadata_ondisc_matrix
and multimodal_ondisc_matrix
facilitate feature selection, quality control, and other common single-cell data preprocessing tasks.
We begin by loading the package.
library(ondisc)
metadata_ondisc_matrix
classA metadata_ondisc_matrix
object consists of three components: (i) an ondisc_matrix
representing the expression data, (ii) a data frame storing the cell-specific covariates, and (iii) a data frame storing the feature-specific covariates. The easiest way to initialize a metadata_ondisc_matrix
is by calling create_ondisc_matrix_from_mtx
on an mtx file and associated metadata files, setting the optional parameter return_metadata_ondisc_matrix
to TRUE
. Below, we reproduce the example from Tutorial 1, this time returning a metadata_ondisc_matrix
instead of a list.
# Set paths to the .mtx and .tsv files
<- system.file("extdata", package = "ondisc")
raw_data_dir <- paste0(raw_data_dir, "/gene_expression.mtx")
mtx_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
barcodes_fp <- paste0(raw_data_dir, "/genes.tsv")
features_fp
# Specify directory in which to store the .h5 file
<- tempdir()
temp_dir
# Initialize metadata_ondisc_matrix
<- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
expressions barcodes_fp = barcodes_fp,
features_fp = features_fp,
on_disk_dir = temp_dir,
return_metadata_ondisc_matrix = TRUE)
#>
|======== | 11%
|================= | 23%
|========================== | 36%
|==================================== | 48%
|============================================= | 61%
|====================================================== | 73%
|=============================================================== | 86%
|=========================================================================| 98%
|=========================================================================| 100%
#>
|======== | 11%
|================= | 23%
|========================== | 36%
|==================================== | 48%
|============================================= | 61%
|====================================================== | 73%
|=============================================================== | 86%
|=========================================================================| 98%
|=========================================================================| 100%
#> Writing CSC data.
#> Writing CSR data.
The variable expressions
is an object of class metadata_ondisc_matrix
; expressions
contains the fields ondisc_matrix
, cell_covariates
, and feature_covariates
.
# Print the variable
expressions#> A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 300 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#> A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
We alternately can initialize a metadata_ondisc_matrix
by calling the constructor function of the metadata_ondisc_matrix
class; see documentation (via ?metadata_ondisc_matrix) for details.
multimodal_ondisc_matrix
classThe multimodal_ondisc_matrix
class is used to represent multimodal data. multimodal_ondisc_matrix
objects have two fields: (i) a named list of metadata_ondisc_matrices
representing different modalities, and (ii) a global (i.e., cross-modality) cell-specific covariate matrix. The ondisc
package ships with example CRISPR perturbation data, which we use to initialize a new perturbation modality via a call to create_ondisc_matrix_from_mtx
.
# Set paths to the perturbation .mtx and .tsv files
<- paste0(raw_data_dir, "/perturbation.mtx")
mtx_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
barcodes_fp <- paste0(raw_data_dir, "/guides.tsv")
features_fp
# Initialize metadata_ondisc_matrix
<- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
perturbations barcodes_fp = barcodes_fp,
features_fp = features_fp,
on_disk_dir = temp_dir,
return_metadata_ondisc_matrix = TRUE)
#>
#> Writing CSC data.
#> Writing CSR data.
Like expressions
, the variable perturbations
is an object of class metadata_ondisc_matrix
. However, because perturbations
represents logical perturabtion data instead of integer gene expression data, the cell-specific and feature-specific covariates of perturbations
differ from those of expressions
.
# These matrices have different columns
head(expressions@cell_covariates)
#> n_nonzero n_umis p_mito
#> 1 43 214 0.04672897
#> 2 26 169 0.00000000
#> 3 22 116 0.05172414
#> 4 37 258 0.08139535
#> 5 36 224 0.08035714
#> 6 31 147 0.07482993
head(perturbations@cell_covariates)
#> n_nonzero
#> 1 21
#> 2 11
#> 3 17
#> 4 14
#> 5 7
#> 6 11
The expressions
and perturbations
data are multimodal – they are collected from the same set of cells. We can create a multimodal_ondisc_matrix
by passing a named list of metadata_ondisc_matrix
objects – in this case, expressions
and perturbations
– to the constructor function of the multimodal_ondisc_matrix
class.
<- list(expressions = expressions, perturbations = perturbations)
modality_list <- multimodal_ondisc_matrix(modality_list) crispr_experiment
The variable crispr_experiment
is an object of class multimodal_ondisc_matrix
. The column names of the global covariate matrix are derived from the names of the modalities.
# print variable
crispr_experiment#> A multimodal_ondisc_matrix with the following modalities:
#> 1. expressions A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 300 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#> A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
#> 2. perturbations A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 250 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero.
#> A feature covariate matrix with columns n_nonzero.
#> Plus, a global cell-covariate matrix containing 4 covariates and 900 cells.
# show the global covariate matrix
head(crispr_experiment@global_cell_covariates)
#> expressions_n_nonzero expressions_n_umis expressions_p_mito
#> 1 43 214 0.04672897
#> 2 26 169 0.00000000
#> 3 22 116 0.05172414
#> 4 37 258 0.08139535
#> 5 36 224 0.08035714
#> 6 31 147 0.07482993
#> perturbations_n_nonzero
#> 1 21
#> 2 11
#> 3 17
#> 4 14
#> 5 7
#> 6 11
The figure below summarizes the relationship between ondisc_matrix
, metadata_ondisc_matrix
, and multimodal_ondisc_matrix
.
We can use the functions get_feature_ids
, get_feature_names
, and get_cell_barcodes
to obtain the feature IDs, feature names (if applicable), and cell barcodes, respectively, of a metadata_ondisc_matrix
or a multimodal_ondisc_matrix
. get_feature_ids
and get_feature_names
return a list when called on a multimodal_ondisc_matrix
, as the different modalities contain different features.
# metadata_ondisc_matrix
<- get_cell_barcodes(expressions)
cell_barcodes <- get_feature_ids(expressions)
feature_ids <- get_feature_names(expressions)
feature_names
# multimodal_ondisc_matrix
<- get_cell_barcodes(crispr_experiment)
cell_barcodes <- get_feature_ids(crispr_experiment) feature_ids
We likewise can use dim
, nrow
, and ncol
to query the dimension, number of rows, and number of columns of a metadata_ondisc_matrix
or multimodal_ondisc_matrix
. dim
and nrow
again return lists when called on a multimodal_ondisc_matrix
.
# metadata_ondisc_matrix
dim(expressions)
#> [1] 300 900
nrow(expressions)
#> [1] 300
ncol(expressions)
#> [1] 900
# multimodal_ondisc_matrix
dim(crispr_experiment)
#> $expressions
#> [1] 300 900
#>
#> $perturbations
#> [1] 250 900
nrow(crispr_experiment)
#> $expressions
#> [1] 300
#>
#> $perturbations
#> [1] 250
ncol(crispr_experiment)
#> [1] 900
Similar to ondisc_matrices
, metadata_ondisc_matrices
and multimodal_ondisc_matrices
can be subset using the [
operator. metadata_ondisc_matrices
can be subset either by feature or cell, while multimodal_ondisc_matrices
can be subset by cell only.
# metadata_ondisc_matrix
# keep cells 100 - 150
<- expressions[,100:150]
expressions_sub # keep genes ENSG00000188305, ENSG00000257284, ENSG00000251655
<- expressions[c("ENSG00000188305", "ENSG00000257284", "ENSG00000251655"),]
expressions_sub
# multimodal_ondisc_matrix
# keep all cells except 1 - 100
<- crispr_experiment[,-c(1:100)] crispr_experiment_sub
As with ondisc_matrices
, the original objects remain unchanged.
expressions#> A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 300 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#> A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
crispr_experiment#> A multimodal_ondisc_matrix with the following modalities:
#> 1. expressions A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 300 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#> A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
#> 2. perturbations A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 250 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero.
#> A feature covariate matrix with columns n_nonzero.
#> Plus, a global cell-covariate matrix containing 4 covariates and 900 cells.
To save and load a metadata_ondisc_matrix
or multimodal_ondisc_matrix
, simply use the functions saveRDS
and readRDS
, respectively.
It is not possible to apply the double-bracket “pull submatrix into memory” operator to a metadata_ondisc_matrix
or multimodal_ondisc_matrix
. To apply the double-bracket operator to an ondisc_matrix
stored inside a metadata_ondisc_matrix
or multimodal_ondisc_matrix
, simply access the desired ondisc_matrix
first.
The metadata_ondisc_matrix
and multimodal_ondisc_matrix
classes are inspired by analogous classes in the SingleCellExperiment
and Seurat
packages. However, unlike SingleCellExperiment
and Seurat
, ondisc
does not (currently) support storing the dense matrix of normalized UMI counts. We view this as a feature: storing the normalized expression matrix is almost never necessary and is hugely expensive (in space and time) for large-scale data.