# IMPORTANT: this vignette is not created if snpStats is not installed
if (!require("snpStats")) {
knitr::opts_chunk$set(eval = FALSE)
}
## Loading required package: snpStats
## Loading required package: survival
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
In this vignette we demonstrate the use of snpClust
function in the adjclust
package. snpClust
performs adjacency-constrained hierarchical clustering of single
nucleotide polymorphisms (SNPs), where the similarity between SNPs is
defined by linkage disequilibrium (LD).
This function implements the algorithm described in [1]. It is an extension of the algorithm described in [3,4]. Denoting by \(p\) the number of SNPs to cluster and assuming that the similarity between SNPs whose indices are more distant than \(h\), its time complexity is \(O(p (\log(p) + h))\), and its space complexity is \(O(hp)\).
library("adjclust")
The beginning of this vignette closely follows the “LD vignette” of the SnpStats package [2]. First, we load genotype data.
data("ld.example", package = "snpStats")
We focus on the ceph.1mb
data.
geno <- ceph.1mb[, -316] ## drop one SNP leading to one missing LD value
p <- ncol(geno)
nSamples <- nrow(geno)
geno
## A SnpMatrix with 90 rows and 602 columns
## Row names: NA06985 ... NA12892
## Col names: rs5993821 ... rs5747302
These data are drawn from the International HapMap Project and concern 602 SNPs1 over a 1Mb region of chromosome 22 in sample of 90 Europeans.
We can compute and display the LD between these SNPs.
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = p-1)
image(ld.ceph, lwd = 0)
The snpClust
function can handle genotype data as an
input:
fit <- snpClust(geno, stats = "R.squared")
## Warning in run.snpClust(x, h = h, stats = stats): Forcing the LD similarity to
## be smaller than or equal to 1
## as(<dsTMatrix>, "dgTMatrix") is deprecated since Matrix 1.4-2; do as(., "generalMatrix") instead
## Note: 135 merges with non increasing heights.
Note that due to numerical errors in the LD estimation, some of the estimated LD values may be slightly larger than 1. These values are rounded to 1 internally.
The above figure suggests that the LD signal is concentrated close to
the diagonal. We can focus on a diagonal band with the bandwidth
parameter h
:
fitH <- snpClust(geno, h = 100, stats = "R.squared")
## Warning in run.snpClust(x, h = h, stats = stats): Forcing the LD similarity to
## be smaller than or equal to 1
## Note: 133 merges with non increasing heights.
fitH
##
## Call:
## run.adjclust(mat = mat, type = type, h = h, strictCheck = strictCheck)
##
## Cluster method : snpClust
## Number of objects: 602
The output of the snpClust
is of class
chac
. In particular, it can be plotted as a dendrogram
silently relying on the function plot.dendrogram
:
plot(fitH, type = "rectangle", leaflab = "perpendicular")
## Warning in plot.chac(fitH, type = "rectangle", leaflab = "perpendicular"):
## Detected reversals in dendrogram: mode = 'corrected', 'within-disp' or 'total-disp' might be more relevant.
Moreover, the output contains an element named merge
which describes the successive merges of the clustering, and an element
gains
which gives the improvement in the criterion
optimized by the clustering at each successive merge.
head(cbind(fitH$merge, fitH$gains))
## [,1] [,2]
## [1,] -1 -2
## [2,] -255 -256
## [3,] -488 -489
## [4,] -487 3
## [5,] -486 4
## [6,] -234 -235
In this section we show how the snpClust
function can
also be applied directly to LD values.
h <- 100
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = h, symmetric = TRUE)
image(ld.ceph, lwd = 0)
Note that we have forced the snpStats::ld
function to
return a symmetric matrix. We can apply snpClust
directly
to this LD matrix (of class Matrix::dsCMatrix
):
fitL <- snpClust(ld.ceph, h)
## Note: forcing the diagonal of the LD similarity matrix to be 1
## Warning in run.snpClust(x, h = h, stats = stats): Forcing the LD similarity to
## be smaller than or equal to 1
## Note: 133 merges with non increasing heights.
snpClust
also handles inputs of class
base::matrix
:
gmat <- as(geno, "matrix")
fitM <- snpClust(geno, h, stats = "R.squared")
## Note: 133 merges with non increasing heights.
[1] Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N. (2019). Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics. Algorithms for Molecular Biology, 14, 22.
[2] Clayton D. (2015). snpStats: SnpMatrix and XSnpMatrix classes and methods. R package version 1.20.0
[3] Dehman A., Ambroise C., Neuvial P. (2015). Performance of a blockwise approach in variable selection using linkage disequilibrium information. BMC Bioinformatics, 16, 148.
[4] Randriamihamison N., Vialaneix N., and Neuvial P. (2021). Applicability and interpretability of Ward’s hierarchical agglomerative clustering with or without contiguity constraints. Journal of Classification, 38, 363–389.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] snpStats_1.46.0 Matrix_1.4-2 survival_3.4-0
## [4] adjclust_0.6.5 HiTC_1.40.0 GenomicRanges_1.48.0
## [7] GenomeInfoDb_1.32.3 IRanges_2.30.0 S4Vectors_0.34.0
## [10] BiocGenerics_0.42.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.26.1 capushe_1.1.1
## [3] xfun_0.31 bslib_0.3.1
## [5] splines_4.2.1 lattice_0.20-45
## [7] htmltools_0.5.2 rtracklayer_1.56.1
## [9] yaml_2.3.5 XML_3.99-0.10
## [11] rlang_1.0.2 jquerylib_0.1.4
## [13] BiocParallel_1.30.3 RColorBrewer_1.1-3
## [15] matrixStats_0.62.0 GenomeInfoDbData_1.2.8
## [17] stringr_1.4.0 zlibbioc_1.42.0
## [19] MatrixGenerics_1.8.1 Biostrings_2.64.0
## [21] codetools_0.2-18 evaluate_0.15
## [23] restfulr_0.0.15 Biobase_2.56.0
## [25] knitr_1.39 fastmap_1.1.0
## [27] parallel_4.2.1 highr_0.9
## [29] Rcpp_1.0.8.3 DelayedArray_0.22.0
## [31] jsonlite_1.8.0 XVector_0.36.0
## [33] Rsamtools_2.12.0 rjson_0.2.21
## [35] digest_0.6.29 stringi_1.7.8
## [37] BiocIO_1.6.0 grid_4.2.1
## [39] cli_3.3.0 tools_4.2.1
## [41] bitops_1.0-7 magrittr_2.0.3
## [43] sass_0.4.1 RCurl_1.98-1.7
## [45] crayon_1.5.1 MASS_7.3-58
## [47] sparseMatrixStats_1.8.0 rmarkdown_2.14
## [49] rstudioapi_0.13 R6_2.5.1
## [51] GenomicAlignments_1.32.1 compiler_4.2.1
We have dropped SNP rs2401075 because it produced a missing value due to the lack of genetic diversity in the considered sample.↩︎