xadmix
First, let’s load the package. If you haven’t installed it yet, you
can do so with the install.packages("xadmix")
command to
get the latest stable release from CRAN. Alternatively, you can
install the latest development release from GitHub:
devtools::install_github("SpaceCowboy-71/xadmix")
. This,
however, requires the package devtools
.
library(xadmix)
The xadmix
package provides a dummy dataset
(“xadmixture”) containing simulated genetic admixture data.
Each observation has an accession identifier, the country where the
plant material was collected, an entry for the species and five
admixture coefficients, K. The values of these ancestries sum
up to 1.
data("xadmixture")
str(xadmixture)
#> 'data.frame': 600 obs. of 8 variables:
#> $ acc : chr "AA_193373_" "AA_163594_" "AA_35629_" "AA_151602_" ...
#> $ country: chr "GBR" "NLD" "FRA" "ESP" ...
#> $ species: chr "lorem" "lorem" "ipsum" "dolor" ...
#> $ K1 : num 0.0979 0.0196 0.0314 0.093 0.0756 ...
#> $ K2 : num 0.0205 0.2463 0.0863 0.2586 0.079 ...
#> $ K3 : num 0.0618 0.2183 0.3026 0.1551 0.1732 ...
#> $ K4 : num 0.4511 0.4329 0.226 0.2289 0.0472 ...
#> $ K5 : num 0.369 0.083 0.354 0.264 0.625 ...
# number of observations per country
table(xadmixture$country)
#>
#> DEU ESP FRA GBR ITA NLD
#> 100 160 130 100 50 60
# number of observations per species
table(xadmixture$species)
#>
#> dolor ipsum lorem
#> 200 200 200
xadmix
comes with an optimized data subsetting function,
admix_subset()
. It enables filtering of the data for
ancestry (K) percentages greater or less than a certain value.
Additionally, any number of columns can be filtered for values supplied
in a vector. By default, it also prints the progress after each
subsetting step.
Multiple ancestries along with their percentages can be passed as argument vectors. Subsetting is then done pairwise, filtering the first ancestry by the first pecentage, then the second ones and so on. Some examples:
# keep only observations with K1 > 0.15 and K2 > 0.01
<- admix_subset(xadmixture,
subset1 anc = c("K1", "K2"),
pct = c(0.15, 0.01))
#> observations: 600
#> 1. subset, K1 > 0.15
#> observations: 39
#> 2. subset, K2 > 0.01
#> observations: 38
# keep only observations with K2 < 0.1 and K3 < 0.1
<- admix_subset(xadmixture,
subset2 anc = c("K2", "K3"),
pct = c(0.1, 0.1),
comparison = "less")
#> observations: 600
#> 1. subset, K2 < 0.1
#> observations: 215
#> 2. subset, K3 < 0.1
#> observations: 41
# filtering for countries and species
<- admix_subset(xadmixture,
subset3 country = c("GBR", "FRA"),
species = c("lorem", "dolor"))
#> observations: 600
#> keeping only specified values in col: country
#> observations left after this step: 230
#> keeping only specified values in col: species
#> observations left after this step: 144
Subsets can be chained using the pipe operator (%>%
)
from package magrittr
. This way, filtering for percentages
greater and less than certain values is possible.
By setting the argument quiet
to TRUE
, no
subset progress will be printed.
library(magrittr)
# keep only observations with K1 > 0.1 and K4 < 0.3,
# without printing subset progress
<- admix_subset(xadmixture,
subset4 anc = "K1",
pct = 0.1,
quiet = TRUE) %>%
admix_subset(anc = "K1",
pct = 0.3,
comparison = "less",
quiet = TRUE)
# print number of observations for comparison
nrow(xadmixture)
#> [1] 600
nrow(subset4)
#> [1] 157
xadmix
also comes with a user-friendly function for
generating pretty stacked barplots, optimized to use with admixture
data.
If the dataset were to contain only the accession numbers in the first column and ancestry percentages in subsequent columns, the function could be used without any further arguments.
However, the xadmixture
dataset contains additional
columns. Therefore, the columns for all K`s have to be selected
by passing an argument. An example plotting the whole dataset:
# ancestries (K) are in the fourth to last column,
# and plotted without showing bar labels
admix_barplot(xadmixture,
K = 4:ncol(xadmixture),
names = FALSE
)
More information can be gained by looking at the subsets generated above:
# grouping data by column "country",
# and sorting each group by ancestry column "K1"
admix_barplot(subset1,
K = 4:ncol(xadmixture),
grouping = "country",
sortkey = "K1"
)
# changing color palette to "turbo" from package 'viridis',
admix_barplot(subset2,
K = 4:ncol(xadmixture),
palette = "turbo",
grouping = "species",
sortkey = "K4"
)
# removing title and changing axis labels text
admix_barplot(subset3,
K = 4:ncol(xadmixture),
main = "",
xlab = "Accessions",
ylab = "Ancestry [%]",
palette = "alternating",
sortkey = "K1",
names = FALSE
)
Sometimes the group labels can get cut off. When there are not enough
observations in the group, the width of the grouped bars can be smaller
than the width of the group label. ggplot2
does not yet
offer a flag to avoid this problem. In xadmix
, however,
there is an option to remove clipping altogether:
# directly output grouped plot with clipping removed from elements
# (useful if there are groups with a low number of observations)
<- admix_subset(xadmixture,
subset5 anc = c("K3", "K4"),
pct = c(0.3, 0.2),
quiet = TRUE)
# noclip set to "TRUE"
admix_barplot(subset5,
K = 4:ncol(xadmixture),
sortkey = "K5",
grouping = "country",
palette = "viridis",
names = FALSE,
main = "Noclip on",
noclip = TRUE)
# noclip set to "FALSE"
admix_barplot(subset5,
K = 4:ncol(xadmixture),
sortkey = "K5",
grouping = "country",
palette = "viridis",
names = FALSE,
main = "Noclip off",
noclip = FALSE)
♦