It can be instructive to visualize the distribution of trees in a spatial “landscape”. This can be a helpful means to address whether discrete islands of trees exist, or whether analytical runs have converged. Such analysis is relatively simple to conduct, but a few common oversights can mislead interpretation.
Tree space mapping and analysis is made simple with the Shiny app included in the “TreeDist” R package. Simply install R or RStudio, then copy the code below into the R command line:
install.packages("TreeDist")
TreeDist::MapTrees()
This will allow you to conduct and evaluate basic tree space mappings from tree lists saved in most common file formats; see an outline of the basic functionality. To avoid misinterpreting tree space, it’s worth having a broad idea of what an analysis involves, and some potential pitfalls.
Here’s an example analysis of a series of 220 trees from an ordered list. The list corresponds to a mixed-base representation of trees (see TreeTools::as.TreeNumber()), so is expected to contain some structure as we jump from one “class” of tree to another. Let’s see whether we can visualize and corroborate this structure.
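As a minimal illustration of this numbering scheme (assuming only that “TreeTools” is installed), we can decode a tree number into a topology and recover the number again:
library("TreeTools", quietly = TRUE)
tree42 <- as.phylo(42, nTip = 8)  # decode tree number 42 as an eight-leaf tree
as.TreeNumber(tree42)             # recover the mixed-base number from the topology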
First we’ll generate the trees, and load some colours with which we might identify them.
library("TreeTools", quietly = TRUE)
treeNumbers <- c(1:220)
trees <- as.phylo(treeNumbers, 8)
spectrum <- hcl.colors(220, "plasma")
treeCols <- spectrum[treeNumbers]
Now we need to calculate the distance between each pair of trees in our list. The choice of distance metric is important (Smith, 2022). The widely used Robinson–Foulds distance is, unfortunately, unsuitable for tree space analysis. The clustering information distance (Smith, 2020) is a reliable alternative that is fast to calculate:
library("TreeDist")
distances <- ClusteringInfoDistance(trees)
The reader is encouraged to repeat the exercise with other distances:
distances <- RobinsonFoulds(trees)
distances <- PhylogeneticInfoDistance(trees)
distances <- as.dist(Quartet::QuartetDivergence(
  Quartet::ManyToManyQuartetAgreement(trees), similarity = FALSE))
Then we need to reduce the dimensionality of these distances. We’ll start out with a 12-dimensional mapping; if needed, we can always drop higher dimensions.
Principal coordinates analysis is quick and performs very well:
mapping <- cmdscale(distances, k = 12)
Alternative mapping methods do exist, and sometimes give slightly better mappings. isoMDS() performs non-metric multidimensional scaling (MDS) with the Kruskal-1 stress function (Kruskal, 1964):
kruskal <- MASS::isoMDS(distances, k = 12)
mapping <- kruskal$points
whereas sammon(), one of many metric MDS methods, uses Sammon’s stress function (Sammon, 1969):
sammon <- MASS::sammon(distances, k = 12)
mapping <- sammon$points
That’s a good start. It is tempting to plot the first two dimensions arising from this mapping and be done:
par(mar = rep(0, 4))
plot(mapping,
     asp = 1, # Preserve aspect ratio - do not distort distances
     ann = FALSE, axes = FALSE, # Don't label axes: dimensions are meaningless
     col = treeCols, pch = 16)
A quick visual inspection suggests at least two clusters, with the possibility of further subdivision of the brighter trees. But visual inspection can be highly misleading (Smith, 2022). We must take a statistical approach. A combination of partitioning around medoids and hierarchical clustering with minimax linkage will typically find a clustering solution that is close to optimal, if one exists (Smith, 2022).
possibleClusters <- 2:10

pamClusters <- lapply(possibleClusters, function(k) cluster::pam(distances, k = k))
pamSils <- vapply(pamClusters, function(pamCluster) {
  mean(cluster::silhouette(pamCluster)[, 3])
}, double(1))

bestPam <- which.max(pamSils)
pamSil <- pamSils[bestPam]
pamCluster <- pamClusters[[bestPam]]$cluster

hTree <- protoclust::protoclust(distances)
hClusters <- lapply(possibleClusters, function(k) cutree(hTree, k = k))
hSils <- vapply(hClusters, function(hCluster) {
  mean(cluster::silhouette(hCluster, distances)[, 3])
}, double(1))

bestH <- which.max(hSils)
hSil <- hSils[bestH]
hCluster <- hClusters[[bestH]]
plot(pamSils ~ possibleClusters,
     xlab = "Number of clusters", ylab = "Silhouette coefficient",
     ylim = range(c(pamSils, hSils)))
points(hSils ~ possibleClusters, pch = 2)
legend("topright", c("PAM", "Hierarchical"), pch = 1:2)
Silhouette coefficients of < 0.25 suggest that structure is not meaningful; > 0.5 denotes good evidence of clustering, and > 0.7 strong evidence (Kaufman & Rousseeuw, 1990). The evidence for the visually apparent clustering is not as strong as it first appears. Let’s explore our two-cluster hierarchical clustering solution anyway.
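Before we do, it can be helpful to see where each method’s best coefficient falls on this scale. Here is a quick illustrative sketch, using the pamSil and hSil values computed above and the thresholds of Kaufman & Rousseeuw (1990):
# Describe the strength of each method's best clustering solution
vapply(c(PAM = pamSil, Hierarchical = hSil), function(sil) {
  if (sil < 0.25) "not meaningful" else if (sil < 0.5) "weak" else
    if (sil < 0.7) "good" else "strong"
}, character(1))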
cluster <- hClusters[[2 - 1]]
We can visualize the clustering solution as a tree:
class(hTree) <- "hclust"
par(mar = c(0, 0, 0, 0))
plot(hTree, labels = FALSE, main = "")
points(seq_along(trees), rep(1, length(trees)), pch = 16,
       col = spectrum[hTree$order])
Another thing we may wish to do is to take the consensus of each cluster:
par(mfrow = c(1, 2), mar = rep(0.2, 4))
col1 <- spectrum[mean(treeNumbers[cluster == 1])]
col2 <- spectrum[mean(treeNumbers[cluster == 2])]
plot(consensus(trees[cluster == 1], p = 0.5),
     edge.color = col1, edge.width = 2, tip.color = col1)
plot(consensus(trees[cluster == 2], p = 0.5),
     edge.color = col2, edge.width = 2, tip.color = col2)
Here, we learn that the two clusters are distinguished by the position of t7.
Now let’s evaluate whether our map of tree space is representative. First we want to know how many dimensions are necessary to adequately represent the true distances between trees. We hope for a trustworthiness × continuity score of > 0.9 for a usable mapping, or > 0.95 for a good one.
txc <- vapply(seq_len(ncol(mapping)), function(k) {
  newDist <- dist(mapping[, seq_len(k)])
  MappingQuality(distances, newDist, 10)["TxC"]
}, 0)
plot(txc, xlab = "Dimension")
abline(h = 0.9, lty = 2)
We are going to need at least five dimensions to adequately represent the distances between trees.
To help establish visually what structures are more likely to be genuine, we might also choose to calculate a minimum spanning tree:
mstEnds <- MSTEdges(distances)
Let’s plot the first five dimensions of our tree space, highlighting the convex hulls of our clusters:
nDim <- which.max(txc > 0.9)
plotSeq <- matrix(0, nDim, nDim)
plotSeq[upper.tri(plotSeq)] <- seq_len(nDim * (nDim - 1) / 2)
plotSeq <- t(plotSeq[-nDim, -1])
plotSeq[nDim * 1:3] <- (nDim * (nDim - 1) / 2) + 1:3
layout(plotSeq)
par(mar = rep(0.1, 4))
for (i in 2:nDim) for (j in seq_len(i - 1)) {
  # Set up blank plot
  plot(mapping[, j], mapping[, i], ann = FALSE, axes = FALSE, frame.plot = TRUE,
       type = "n", asp = 1, xlim = range(mapping), ylim = range(mapping))
  # Plot MST
  MSTSegments(mapping[, c(j, i)], mstEnds,
              col = StrainCol(distances, mapping[, c(j, i)]))
  # Add points
  points(mapping[, j], mapping[, i], pch = 16, col = treeCols)
  # Mark clusters
  for (clI in unique(cluster)) {
    inCluster <- cluster == clI
    clusterX <- mapping[inCluster, j]
    clusterY <- mapping[inCluster, i]
    hull <- chull(clusterX, clusterY)
    polygon(clusterX[hull], clusterY[hull], lty = 1, lwd = 2,
            border = "#54de25bb")
    text(mean(clusterX), mean(clusterY), clI, col = "#54de25bb", font = 2)
  }
}
# Annotate dimensions
plot(0, 0, type = "n", ann = FALSE, axes = FALSE)
text(0, 0, "Dimension 2")
plot(0, 0, type = "n", ann = FALSE, axes = FALSE)
text(0, 0, "Dimension 3")
plot(0, 0, type = "n", ann = FALSE, axes = FALSE)
text(0, 0, "Dimension 4")
Our clusters, so distinct in dimension 1, overlap strongly in every other dimension. That the minimum spanning tree moves between clusters further underlines that they are not as well defined as they appear by eye.
Note that cluster membership, as well as the precise shape of tree space, is a function of the tree distance metric. The phylogenetic information distance recovers a different pair of clusters, which may not correspond to those that are most apparent from a simple visual inspection of the two-dimensional tree space plot.
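The hypervolume example below refers to a pid_mapping and pid_cluster that are not constructed in the code above. One plausible construction, as a sketch (the object names and the two-cluster cut are assumptions for illustration, not necessarily the settings behind the original analysis):
pid_distances <- PhylogeneticInfoDistance(trees)  # assumed: PID between each pair of trees
pid_mapping <- cmdscale(pid_distances, k = 12)    # assumed: principal coordinates, as before
pid_cluster <- cutree(protoclust::protoclust(pid_distances), k = 2)  # assumed: two clusters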
It is tempting to compare the size of clusters by calculating the area of convex hulls on a two-dimensional mapping. However, mapped areas do not necessarily correspond to true hypervolumes.
Accuracy may be improved by comparing hypervolumes computed from higher dimensions of the mapping using the “hypervolume” package, though the same considerations apply (Blonder et al., 2018). Interpretation of overlap statistics is detailed in Mammola (2019).
hypervolumeInstalled <- requireNamespace("hypervolume", quietly = TRUE)
if (hypervolumeInstalled) {
  library("hypervolume")
  hv1 <- hypervolume_gaussian(pid_mapping[pid_cluster == 1, 1:3],
                              verbose = FALSE)
  hv2 <- hypervolume_gaussian(pid_mapping[pid_cluster == 2, 1:3],
                              verbose = FALSE)
  hv_dist <- hypervolume_distance(hv1, hv2)
  capture.output(
    hyperset <- hypervolume_set(hv1, hv2, verbose = FALSE,
                                check.memory = FALSE)
  ) -> XX_VerboseNotRespected
  hv_overlap <- hypervolume_overlap_statistics(hyperset)

  hv_dist
  hv_overlap
} else {
  print("Install the 'hypervolume' package to run this example")
}
## jaccard sorensen frac_unique_1 frac_unique_2
## 0.02655966 0.05174498 0.94574038 0.95054689
An alternative approach to visualizing tree space is to create emergent self-organizing maps (Kohonen, 1982; Thrun, Lerch, Lötsch, & Ultsch, 2016; Ultsch, 2003), which map high-dimensional data into two dimensions, then add a third dimension to indicate distance between data points: nearby points occur in valleys, and are separated by ridges from more distant data points.
umatrixInstalled <- requireNamespace("Umatrix", quietly = TRUE)
if (umatrixInstalled) {
  map <- Umatrix::esomTrain(as.matrix(distances), Key = seq_along(trees),
                            Epochs = 5, # Increase for better results
                            Lines = 42,
                            Columns = 42,
                            Toroid = FALSE)

  Umatrix::plotMatrix(Matrix = map$Umatrix,
                      Toroid = FALSE, FixedRatio = TRUE,
                      TransparentContours = FALSE, Clean = TRUE) +
    ggplot2::geom_point(data = data.frame(x = map$BestMatches[, 3],
                                          y = map$BestMatches[, 2]),
                        shape = 19, color = treeCols, size = 2)
} else {
  message("Install the 'Umatrix' package to run this example")
}
## [1] "Epoch: 1 started"
## [1] "Epoch: 2 started"
## [1] "Epoch: 3 started"
## [1] "Epoch: 4 started"
## [1] "Epoch: 5 started"
## [1] "---- Esom Training Finished ----"
You may wish to:
Provide context for tree distances
Review available distance measures and the corresponding functions
Compare the distribution of different sets of trees