This package was developed to help prepare some samples to be send to a facility. It assumes that you have collected the samples and the information but you still need to do the experiment in several batches due to technical or practical limitations. The question that tries to answer is:
Which samples go with each batch?
Of all the possible combinations of samples, it looks for the combination which minimizes the differences between each subgroup according to the following rules:
NA
(not available values) it looks to distribute them randomly.Even with this measures you might end up with some batch effect due to: - Confounding variables not provided for their randomization on the batches Sometimes due to being unknown, impossible to measure. - Lack of replicates(samples with the same conditions) If you can’t provide new replicates, aim to provide more technical replicates. Technical replicates mean reanalyzing the same sample twice or more, the more samples with technical replicates the more accurate your measures will be and easier to avoid or detect batch effects. - Processing If there is a change on the methodology, or you pause and resume later the sample collection there might be changes on the outcome due to external factors.
Before building this package I would like to give credit to those that made also efforts in this direction:
The CRAN task View of Experimental Design includes many packages relevant for designing an experiment before collecting data, but none of them provides how to manage them once the samples are already collected.
Two packages allow to distribute the samples on batches:
The OSAT package handles categorical variables but not numeric data. It doesn’t work with our data.
The minDiff package reported in Stats.SE, handles both numeric and categorical data. But it can only optimize for two nominal criteria. It doesn’t work for our data.
The Omixer package handles both numeric and categorical data (converting categorical variables to numeric). But both the same way either Pearson’s Chi-squared Test if there are few samples or Kendall’s correlation. It does allow to protect some spots from being used.
If you are still designing the experiment and do not have collected any data DeclareDesign might be relevant for you.
Question in Bioinformatics.SE I made before developing the package.
Imagine you have some samples already collected and you want to distributed them in batches:
library("experDesign")
metadata <- expand.grid(height = seq(60, 80, 5),
weight = seq(100, 300, 50),
sex = c("Male","Female"))
head(metadata, 15)
## height weight sex
## 1 60 100 Male
## 2 65 100 Male
## 3 70 100 Male
## 4 75 100 Male
## 5 80 100 Male
## 6 60 150 Male
## 7 65 150 Male
## 8 70 150 Male
## 9 75 150 Male
## 10 80 150 Male
## 11 60 200 Male
## 12 65 200 Male
## 13 70 200 Male
## 14 75 200 Male
## 15 80 200 Male
If you block incorrectly and end up with a group in a single batch we will end up with batch effect. In order to avoid this design()
helps you assign each sample to a batch (in this case each batch has 24 samples at most). First we can explore the number of samples and the number of batches:
size_data <- nrow(metadata)
size_batch <- 24
(batches <- optimum_batches(size_data, size_batch))
## [1] 3
# So now the best number of samples for each batch is less than the available
(size <- optimum_subset(size_data, batches))
## [1] 17
# The distribution of samples per batch
sizes_batches(size_data, size, batches)
## [1] 17 17 16
Note that instead of using a whole batch and then leave a single sample on the third distributes all the samples in the three batches that will be needed.
We can directly look for the distribution of the samples given our max number of samples per batch:
d <- design(metadata, size_batch)
# It is a list but we can convert it to a vector with:
batch_names(d)
## [1] "SubSet2" "SubSet2" "SubSet1" "SubSet2" "SubSet1" "SubSet1" "SubSet1"
## [8] "SubSet1" "SubSet3" "SubSet3" "SubSet3" "SubSet2" "SubSet1" "SubSet3"
## [15] "SubSet2" "SubSet3" "SubSet2" "SubSet2" "SubSet2" "SubSet2" "SubSet1"
## [22] "SubSet3" "SubSet3" "SubSet1" "SubSet2" "SubSet3" "SubSet3" "SubSet2"
## [29] "SubSet3" "SubSet2" "SubSet1" "SubSet1" "SubSet1" "SubSet1" "SubSet3"
## [36] "SubSet1" "SubSet1" "SubSet3" "SubSet3" "SubSet2" "SubSet2" "SubSet2"
## [43] "SubSet3" "SubSet3" "SubSet2" "SubSet3" "SubSet2" "SubSet1" "SubSet1"
## [50] "SubSet1"
Naively one would either fill some batches fully or distribute them not evenly (the first 17 packages together, the next 17 and so on). This solution ensures that the data is randomized. For more random distribution you can increase the number of iterations performed to calculate this distribution.
If you need space for replicates to control for batch effect you can use:
r <- replicates(metadata, size_batch, 5)
lengths(r)
## SubSet1 SubSet2 SubSet3
## 21 21 18
r
## $SubSet1
## [1] 3 9 10 16 17 20 27 28 30 31 32 33 34 35 40 41 42 44 45 48 49
##
## $SubSet2
## [1] 4 5 7 8 9 10 13 15 18 21 22 25 26 32 33 34 36 37 38 39 47
##
## $SubSet3
## [1] 1 2 6 9 10 11 12 14 19 23 24 29 32 33 34 43 46 50
Which seeks as controls the most diverse values and adds them to the samples distribution. Note that if the sample is already present on that batch is not added again, that’s why the number of samples per batch is different from the design without replicates.
Lastly, we can see how these samples would be distributed in a layout of 6x4:
s <- spatial(r, metadata, rows = LETTERS[1:6], columns = 1:4)
head(s)
## $A1
## [1] 33 48 24
##
## $B1
## [1] 15 22 39
##
## $C1
## [1] 40 5 32
##
## $D1
## [1] 3 9 41
##
## $E1
## [1] 28 10 50
##
## $F1
## [1] 4 13 29
We can add the batches to the initial data with inspect()
:
report <- inspect(r, metadata)
report2 <- inspect(s, report, index_name = "position")
head(report2)
## height weight sex batch position
## 1 60 100 Male SubSet3 C2
## 2 65 100 Male SubSet3 D2
## 3 70 100 Male SubSet1 D1
## 4 75 100 Male SubSet2 F1
## 5 80 100 Male SubSet2 C1
## 6 60 150 Male SubSet3 C4
And now we can see the batch and position of each sample
In the previous case the data was mostly balanced (check it out in the orig
object) but let’s create an unbalanced dataset to check it.
n <- 99
samples <- 100
unbalanced <- data.frame(Classroom = rep(c("A", "B"), each = samples/2),
Sex = c(rep("M", n), rep("F", samples-n)),
Age = rnorm(samples, mean = 25, sd = 3))
table(unbalanced)[, , 1:5]
## , , Age = 17.8386553280725
##
## Sex
## Classroom F M
## A 0 0
## B 0 1
##
## , , Age = 19.7652909098992
##
## Sex
## Classroom F M
## A 0 0
## B 0 1
##
## , , Age = 20.0141867569824
##
## Sex
## Classroom F M
## A 0 1
## B 0 0
##
## , , Age = 20.4317639136216
##
## Sex
## Classroom F M
## A 0 0
## B 0 1
##
## , , Age = 20.8002586118616
##
## Sex
## Classroom F M
## A 0 1
## B 0 0
In this dataset the classroom a single classroom has all the females (-49).
i <- design(unbalanced, 15)
# Mean entropy en each subset
rowMeans(evaluate_index(i, unbalanced)["entropy", , ])
## Classroom Sex Age
## 0.95647606 0.05303319 0.00000000
# Original entropy on the dataset
evaluate_orig(unbalanced)["entropy", ]
## Classroom Sex Age
## 1.00000000 0.08079314 0.00000000
# Dispersion of the entropy
apply(evaluate_index(i, unbalanced)["entropy", , ], 1, sd)
## Classroom Sex Age
## 0.02936726 0.14031263 0.00000000
We can see that in this simple case where a single variable has all the other cases we approximately reached the same entropy levels.
If you need a subset with the samples that are more diverse you can use the following function:
data(survey, package = "MASS")
head(survey)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 1 Female 18.5 18.0 Right R on L 92 Left Some Never 173.00 Metric
## 2 Male 19.5 20.5 Left R on L 104 Left None Regul 177.80 Imperial
## 3 Male 18.0 13.3 Right L on R 87 Neither None Occas NA <NA>
## 4 Male 18.8 18.9 Right R on L NA Neither None Never 160.00 Metric
## 5 Male 20.0 20.0 Right Neither 35 Right Some Never 165.00 Metric
## 6 Female 18.0 17.7 Right L on R 64 Right Some Never 172.72 Imperial
## Age
## 1 18.250
## 2 17.583
## 3 16.917
## 4 20.333
## 5 23.667
## 6 21.000
samples <- extreme_cases(survey, size = 10)
survey[samples, ]
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 5 Male 20.0 20.0 Right Neither 35 Right Some Never 165.00 Metric
## 48 Male 22.5 23.0 Right R on L 96 Right None Never 170.00 Metric
## 60 Male 20.6 21.0 Left L on R NA Left Freq Occas 175.26 Imperial
## 66 Male 19.5 19.8 Right Neither NA Right Freq Never 183.00 Metric
## 80 Male 20.0 20.5 Right L on R NA Right Freq Never 185.42 Imperial
## 83 Female 17.5 17.5 Right R on L 98 Left Freq Never NA <NA>
## 121 Male 20.0 20.0 Right R on L 80 Neither Freq Occas NA <NA>
## 154 Male 21.5 21.6 Right R on L 69 Right Freq Never 172.72 Imperial
## 170 Male 19.0 18.5 Right R on L 72 Right Freq Never 180.34 Imperial
## 225 Female 17.6 17.2 Right L on R NA Right Some Never NA <NA>
## Age
## 5 23.667
## 48 19.417
## 60 18.417
## 66 18.000
## 80 18.750
## 83 17.667
## 121 17.500
## 154 70.417
## 170 17.333
## 225 19.917
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] experDesign_0.1.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.27 R6_2.5.0 jsonlite_1.7.2 magrittr_2.0.1
## [5] evaluate_0.14 rlang_0.4.10 stringi_1.5.3 jquerylib_0.1.3
## [9] bslib_0.2.4 rmarkdown_2.7 tools_3.6.3 stringr_1.4.0
## [13] xfun_0.22 yaml_2.2.1 compiler_3.6.3 htmltools_0.5.1.1
## [17] knitr_1.32 sass_0.3.1