This vignette documents the data format used in
PLNmodel by PLN
and its variants. It also
shows how to create an object in the proper format for further analyses
from (i) tabular data, (ii) biom-class objects and (iii) phyloseq-class
objects.
We illustrate the format using trichoptera data set, a full description of which can be found in the corresponding vignette.
library(PLNmodels)
data(trichoptera)
The trichoptera data set is a list made of two data frames:
Abundance
(hereafter referred to as the counts)
and Covariate
(hereafter the covariates).
str(trichoptera, max.level = 1)
## List of 2
## $ Abundance:'data.frame': 49 obs. of 17 variables:
## $ Covariate:'data.frame': 49 obs. of 7 variables:
The covariates include, among others, the wind, pressure and humidity.
names(trichoptera$Covariate)
## [1] "Temperature" "Wind" "Pressure" "Humidity"
## [5] "Cloudiness" "Precipitation" "Group"
In the PLN framework, we model the counts from the covariates, let’s say wind and pressure, using a Poisson Log-Normal model. Most models in R use the so-called formula interface and it would thus be natural to write something like
PLN(Abundance ~ Wind + Pressure, data = trichoptera)
Unfortunately and unlike many generalized linear models, the response in PLN is intrinsically multivariate: it has 17 dimensions in our example. The left hand side (LHS) must encode a multivariate response across multiple samples, using a 2D-array (e.g. a matrix or a data frame).
We must therefore prepare a data structure where
Abundance
refers to a count matrix whereas
Wind
and Pressure
refer to vectors
before feeding it to PLN
. That’s the purpose of
prepare_data
.
<- prepare_data(counts = trichoptera$Abundance,
trichoptera2 covariates = trichoptera$Covariate)
str(trichoptera2)
## 'data.frame': 49 obs. of 9 variables:
## $ Abundance : num [1:49, 1:17] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:49] "1" "2" "3" "4" ...
## .. ..$ : chr [1:17] "Che" "Hyc" "Hym" "Hys" ...
## $ Temperature : num 18.7 19.8 22 23 22.5 23.9 15 17.2 15.4 14.1 ...
## $ Wind : num -2.3 -2.7 -0.7 2.3 2.3 -2 -4.7 -1 -2.7 -3.7 ...
## $ Pressure : num 998 1000 997 991 990 ...
## $ Humidity : num 60 63 73 71 62 64 93 84 88 75 ...
## $ Cloudiness : num 19 0 6 81 50 50 100 19 69 6 ...
## $ Precipitation: num 0 0 0 0 0 0 1.6 0 1.6 0 ...
## $ Group : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Offset : num 29 13 38 192 79 18 8 34 12 4 ...
If you look carefully, you can notice a few difference between
trichoptera
and trichoptera2
:
list
whereas the second is a
data.frame
1;Abundance
is a matrix-column of
trichoptera2
that you can extract using the usual functions
[
and [[
to retrieve the count matrix;trichoptera2
has an additional Offset
column (more on that later).It is common practice when modeling count data to introduce an offset term to control for different sampling efforts, exposures, baselines, etc. The proper way to compute sample-specific offsets in still debated and may vary depending on the field. There are nevertheless a few popular methods:
Each of these offset be computed from a counts matrix using the
compute_offset
function and changing its
offset
argument:
## same as compute_offset(trichoptera$Abundance, offset = "TSS")
compute_offset(trichoptera$Abundance)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 29 13 38 192 79 18 8 34 12 4 4 3 49 33 600 172
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 58 51 56 127 35 13 17 3 27 40 44 8 9 1599 2980 88
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 135 327 66 90 63 15 14 20 70 53 95 43 62 149 16 31
## 49
## 86
In this particular example, the counts are too sparse and sophisticated offset methods all fail (numeric output hidden)
compute_offset(trichoptera$Abundance, "CSS")
## Warning in offset_function(counts, ...): Some samples only have 1 positive
## values. Can't compute quantiles and fall back to TSS normalization
compute_offset(trichoptera$Abundance, "RLE")
## Warning in offset_function(counts, ...): Because of high sparsity, some samples
## have null or infinite offset.
compute_offset(trichoptera$Abundance, "GMPR")
We can mitigate this problem for the RLE offset by adding pseudocounts to the counts although doing so has its own drawbacks.
compute_offset(trichoptera$Abundance, "RLE", pseudocounts = 1)
## 1 2 3 4 5 6 7 8
## 0.9186270 0.8349121 0.8570257 0.9186270 0.9186270 0.8349121 0.8192245 0.8570257
## 9 10 11 12 13 14 15 16
## 0.7322797 0.7322797 0.6321923 0.6321923 0.9240361 0.9186270 1.3788037 0.9240361
## 17 18 19 20 21 22 23 24
## 0.9186270 0.9240361 0.9240361 1.7140514 0.8908577 0.8570257 0.8349121 0.6321923
## 25 26 27 28 29 30 31 32
## 0.9240361 0.9240361 0.9186270 0.8570257 0.8349121 2.7721084 3.2934218 0.9584503
## 33 34 35 36 37 38 39 40
## 1.0406547 0.9584503 0.9584503 0.9186270 0.9584503 0.8908577 0.8349121 0.7322797
## 41 42 43 44 45 46 47 48
## 0.9240361 0.9240361 0.9584503 0.9584503 1.2643846 1.7140514 0.8233555 0.8908577
## 49
## 0.9186270
A better solution consists in using only positive counts to compute the offsets:
compute_offset(trichoptera$Abundance, "RLE", type = "poscounts")
## 1 2 3 4 5 6 7 8
## 0.5631099 0.9462046 0.8299806 0.9462046 0.6460415 0.5756774 0.6308031 0.4947789
## 9 10 11 12 13 14 15 16
## 0.7988730 0.3574190 0.2207270 0.1260525 0.8051707 0.7289732 1.5591914 1.1850090
## 17 18 19 20 21 22 23 24
## 0.6389431 1.0000000 1.4606879 1.8224330 0.8403498 0.3942114 0.4058586 0.1997183
## 25 26 27 28 29 30 31 32
## 0.6286560 0.5631099 0.8777257 0.3440303 0.2504662 2.7086423 2.5800932 1.4584997
## 33 34 35 36 37 38 39 40
## 2.6050843 4.5898981 1.2185072 1.2616062 0.8823673 0.6250713 0.3950030 0.5631099
## 41 42 43 44 45 46 47 48
## 1.4511384 1.0000000 0.9462046 1.0882448 1.0000000 1.0631099 0.4621924 0.7900060
## 49
## 1.0672361
Finally, we can use wrench to compute the offsets:
compute_offset(trichoptera$Abundance, "Wrench")
## [1] 0.41269451 0.17385897 0.31925785 0.70682391 0.44749382 0.20676142
## [7] 0.10226641 0.21204241 0.09228293 0.03919178 0.03946910 0.02295077
## [13] 24.18096281 1.54038084 3.95187144 0.80172207 10.52136472 4.55854738
## [19] 0.46968859 51.88528943 0.32803605 0.14764973 4.13635353 0.03152826
## [25] 1.63626578 0.53769290 0.38482355 0.62689594 0.08030685 83.87692476
## [31] 33.40810265 0.75508731 1.05809625 1.18385013 0.58657751 0.43005272
## [37] 4.79925224 0.18381986 0.16078594 0.18490773 4.25999137 5.46689591
## [43] 0.60139850 7.56594658 9.89766712 33.73727074 0.74264459 29.13747796
## [49] 47.12017975
prepare_data
We’ll already learned that prepare_data
can join counts
and covariates into a single data.frame. It can also compute offset
through compute_offset
and does so by default with
offset = "TSS"
, hence the Offset
column in
trichoptera2
. You can change the offset method and provide
additional arguments that will passed on to
compute_offset
.
str(prepare_data(trichoptera$Abundance,
$Covariate,
trichopteraoffset = "RLE", pseudocounts = 1))
## 'data.frame': 49 obs. of 9 variables:
## $ Abundance : num [1:49, 1:17] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:49] "1" "2" "3" "4" ...
## .. ..$ : chr [1:17] "Che" "Hyc" "Hym" "Hys" ...
## $ Temperature : num 18.7 19.8 22 23 22.5 23.9 15 17.2 15.4 14.1 ...
## $ Wind : num -2.3 -2.7 -0.7 2.3 2.3 -2 -4.7 -1 -2.7 -3.7 ...
## $ Pressure : num 998 1000 997 991 990 ...
## $ Humidity : num 60 63 73 71 62 64 93 84 88 75 ...
## $ Cloudiness : num 19 0 6 81 50 50 100 19 69 6 ...
## $ Precipitation: num 0 0 0 0 0 0 1.6 0 1.6 0 ...
## $ Group : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Offset : num 0.919 0.835 0.857 0.919 0.919 ...
Different communities use different standard for the count data where
samples are either or columns of the counts matrix.
prepare_data
uses heuristics to guess the direction of the
counts matrix (or fail informatively doing so) and automatically
transpose it if needed.
Finally, prepare_data
enforces sample-consistency
between the counts and the covariates and automatically trims away: -
samples for which only covariates or only counts are available; -
samples with no positive counts
For example, if we remove the first sample from the counts and the last one from the covariates, we end up with 49 - 2 = 47 samples left, as expected.
nrow(prepare_data(trichoptera$Abundance[-1, ], ## remove first sample
$Covariate[-49,] ## remove last sample
trichoptera ))
## [1] 47
prepare_data_from_[phyloseq|biom]
Community composition data are quite popular in microbial ecology and usually stored in flat files using the biom format and/or imported in R as phyloseq-class objects (McMurdie 2013) using the Bioconductor phyloseq package.
We show here how to import data from a biom file (or biom-class object) and form a phyloseq-class object.
Reading from a biom file requires the bioconductor package biomformat. This package is not a standard dependency of PLNmodels and needs to be installed separately.
You can easily prepare your data from a biom file using the following steps:
biomformat::read_biom()
biomformat::biom_data()
biomformat::sample_metadata()
(or build your own)prepare_data
as illustrated below:
## If biomformat is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
# install.packages("BiocManager")
# }
# BiocManager::install("biomformat")
library(biomformat)
<- system.file("extdata", "rich_dense_otu_table.biom", package = "biomformat")
biomfile <- biomformat::read_biom(biomfile)
biom ## extract counts
<- as(biomformat::biom_data(biom), "matrix")
counts ## extract covariates (or prepare your own)
<- biomformat::sample_metadata(biom)
covariates ## prepare data
<- prepare_data(counts = counts, covariates = covariates)
my_data str(my_data)
Likewise, preparing data from a phyloseq-class object requires the bioconductor package phyloseq. This package is not a standard dependency of PLNmodels and needs to be installed separately.
You can easily prepare your data from a phyloseq object using the following steps:
phyloseq::otu_table()
phyloseq::sample_data()
(or
build your own)prepare_data
as illustrated below:
## If biomformat is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
# install.packages("BiocManager")
# }
# BiocManager::install("phyloseq")
library(phyloseq)
data("enterotype")
## extract counts
<- as(phyloseq::otu_table(enterotype), "matrix")
counts ## extract covariates (or prepare your own)
<- phyloseq::sample_data(enterotype)
covariates ## prepare data
<- prepare_data(counts = counts, covariates = covariates)
my_data str(my_data)
We detail here the mathematical background behind the various offsets and the way they are computed. Note \(\mathbf{Y} = (Y_{ij})\) the counts matrix where \(Y_{ij}\) is the count of species \(j\) in sample \(i\). Assume that there are \(p\) species and \(n\) samples in total. The offset of sample \(i\) is noted \(O_i\) and computed in the following way.
Offsets are simply the total counts of a sample (frequently called depths in the metabarcoding literature): \[ O_i = \sum_{j=1}^p Y_{ij} \]
Positive counts are used to compute sample-specific quantiles \(q_i^l\) and cumulative sums \(s_i^l\) defined as \[
q_i^l = \min \{q \text{ such that } \sum_j 1_{Y_{ij} \leq q} \geq l
\sum_j 1_{Y_{ij} > 0} \} \qquad s_i^l = \sum_{j: Y_{ij} \leq q_i^l}
Y_{ij}
\] The sample-specific quantiles are then used to compute
reference quantiles defined as \(q^l =
\text{median} \{q^i_l\}\) and median average deviation around the
quantile \(q^l\) as \(d^l = \text{median} |q_i^l - q^l|\). The
method then searches for the smallest quantile \(l\) for which it detects instability,
defined as large relative increase in the \(d^l\). Formally, \(\hat{l}\) is the smallest \(l\) satisfying \(\frac{d^{l+1} - d^l}{d^l} \geq 0.1\). The
scaling sample-specific offset are then chosen as: \[
O_i = s_i^{\hat{l}} / \text{median}_i \{ s_i^{\hat{l}} \}
\] Dividing by the median of the \(s_i^{\hat{l}}\) ensures that offsets are
centered around \(1\) and compare sizes
differences with respect to the reference sample. Note also that the
reference quantiles \(q^l\) can be
computed using either the median (default, as in the original Paulson et al. (2013) paper) or the mean, by specifying
reference = mean
, as implemented in
metagenomeseq
.
A reference sample \((q_j)_j\) is
first built by computing the geometric means of each species count:
\[
q_j = \exp \left( \frac{1}{n} \sum_{i} \log(Y_{ij})\right)
\] Each sample is then compared to the reference sample to
compute one ratio per species and the final offset \(O_i\) is the median of those ratios: \[
O_i = \text{median}_j \frac{Y_{ij}}{q_j}
\] The method fails when no species is shared across all sample
(as all \(q_j\) are then \(0\)) or when a sample shares less than 50%
of species with the reference (in which case the median of the ratios
may be null or infinite). The problem can be alleviated by adding
pseudocounts to the \(c_{ij}\) with
pseudocounts = 1
or using positive counts in the
computations (type = "poscounts"
)
This method is similar to RLE but does create a reference sample. Instead, each sample is compared to each other to compute a median ratio (similar to RLE) \[ r_{ii'} = {\text{median}}_{j: Y_{ij}.Y_{i'j} > 0} \frac{Y_{ij}}{Y_{i'j}} \] The offset is then taken as the median of all the \(r_{ii'}\): \[ O_i = \text{median}_{i' != i} r_{ii'} \] The method fails when there is only one sample in the data set or when a sample shares no species with any other.
This method is fully detailed in Kumar et al.
(2018) and we
only provide a barebone implementation corresponding to the defaults
parameters of Wrench::wrench()
. Assume that samples belong
to \(K\) discrete groups and note \(g_i\) the group of sample \(i\). Wrench is based on the following
(simplified) log-normal model for counts: \[
Y_{ij} \sim \pi_{ij} \delta_0 + (1 - \pi_{ij})\log\mathcal{N}(\mu_{ij},
\sigma^2_j)
\] where the \(Y_{ij}\) are
independent and the mean \(\mu_{ij}\)
is decomposed as: \[
\mu_{ij} = \underbrace{\log{p_{0j}}}_{\text{log-ref. prop.}}
+ \underbrace{\log{d_i}}_{\text{log-depth}} +
\underbrace{\log{\zeta_{0g_i}}}_{\text{log effect of group } g_i} +
\underbrace{a_{i}}_{\text{(f|m)ixed effect}} +
\underbrace{b_{ij}}_{\text{mixed effects}}
\] where the random effects are independents centered gaussian
and the depths is the total sum of counts: \[
\begin{align*}
d_i & = \sum_{j=1}^p c_{ij} \\
b_{ij} & \sim \mathcal{N}(0, \eta^2_{g_i}) \\
\end{align*}
\]
The net log fold change \(\theta_{ij}\) of the proportion ratio \(r_{ij} = c_{ij} / d_i p_{0j}\) of species \(j\) relative to the reference is \(\log(\theta_{ij}) \overset{\Delta}{=} \mathbb{E}[\log(r_{ij}) | a_i, b_{ij}] = \log{\zeta_{0g_i}} + a_i + b_{ij}\). We can decompose it as \(\theta_{ij} = \Lambda_i^{-1} v_{ij}\) where \(\Lambda_i^{-1}\) is the compositional correction factor and \(v_{ij}\) is the fold change of true abundances.
With the above notations, the net fold change compounds both the fold change of true abundances and the compositional correction factors. With the assumption that the \(b_{ij}\) are centered, \(\log(\hat{\Lambda}_i)\) can be estimated through a robust average of the \(\hat{\theta}_{ij}\), which can themselves be computed from the log-ratio of proportions.
We detail here how the different parameters and/or effects are estimated.
Wrench::wrench()
, the
log depth \(\log(d_i)\) is used as
predictor but I believe it makes more sense to use it an offset.The offsets are then the product of compositional correction factors and depths: \[ O_i = \frac{\hat{\Lambda}_i}{(\prod_{i = 1}^n \hat{\Lambda}_i)^{1/n}} \times \frac{d_i}{(\prod_{i = 1}^n d_i)^{1/n}} \]
although a data.frame
is technically a
list
↩︎