The philentropy package has several mechanisms to calculate distances between probability density functions. The main one is the distance() function, which computes 46 different distances/similarities between probability density functions (see ?philentropy::distance and a companion vignette for details). Alternatively, it is possible to call each distance/dissimilarity function directly. For example, the euclidean() function computes the Euclidean distance, while jaccard() computes the Jaccard distance. The complete list of available distance measures is returned by the philentropy::getDistMethods() function.
Both of the above approaches have their pros and cons. The distance() function is more flexible, as it allows users to use any distance measure and can return either a matrix or a dist object. It also implements several defensive programming checks, and thus it is more appropriate for regular users. Single distance functions, such as euclidean() or jaccard(), can, on the other hand, be slightly faster, as they directly call the underlying C++ code.
Now, we introduce three new low-level functions that are intermediaries between distance() and single distance functions. They are fairly flexible, allowing the use of any implemented distance measure, but are also usually faster than calling the distance() function (especially if it needs to be called many times). These functions are:
dist_one_one() - expects two vectors (probability density functions), returns a single value
dist_one_many() - expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values
dist_many_many() - expects two matrices (two sets of probability density functions), returns a matrix of values
Let’s start testing them by attaching the philentropy package.
library(philentropy)
dist_one_one()
dist_one_one() is a lower-level equivalent to distance(). However, instead of accepting a numeric data.frame or matrix, it expects two vectors representing probability density functions. In this example, we create two vectors, P and Q.
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
To calculate the Euclidean distance between them, we can use several approaches: (a) the built-in R dist() function, (b) philentropy::distance(), (c) philentropy::euclidean(), or (d) the new dist_one_one().
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  dist(rbind(P, Q), method = "euclidean"),
  distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
  euclidean(P, Q, FALSE),
  dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
## Unit: microseconds
## expr
## dist(rbind(P, Q), method = "euclidean")
## distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE)
## euclidean(P, Q, FALSE)
## dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 20.581 21.7560 25.74881 23.1155 23.8055 284.644 100
## 28.083 28.7445 51.33078 29.8440 31.6055 2132.509 100
## 2.509 2.7240 3.39208 3.0130 3.5095 8.403 100
## 3.708 4.0495 6.02215 4.7150 5.0955 117.806 100
All of them return the same, single value. However, as you can see in the benchmark above, some are more flexible, and others are faster.
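As a quick sanity check on the claim that all approaches return the same value, the Euclidean distance can also be written out by hand in base R. This is only a sketch: the names euclid_manual and euclid_base are illustrative and not part of philentropy, and the comparison with dist_one_one() runs only if the package is installed.

```r
# Same example vectors as above
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Euclidean distance written out explicitly (illustrative helper value)
euclid_manual <- sqrt(sum((P - Q)^2))

# The same quantity from the built-in dist() function
euclid_base <- as.numeric(dist(rbind(P, Q), method = "euclidean"))

# Both values should agree up to floating-point tolerance
print(all.equal(euclid_manual, euclid_base))

# Optional cross-check against philentropy, if it is available
if (requireNamespace("philentropy", quietly = TRUE)) {
  d <- philentropy::dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
  print(all.equal(as.numeric(d), euclid_manual))
}
```

The hand-written version is useful for checking results, but it covers only one measure, whereas dist_one_one() accepts any of the implemented methods.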
dist_one_many()
The role of dist_one_many() is to calculate distances between one probability density function (in the form of a vector) and a set of probability density functions (as rows in a matrix).
Firstly, let’s create our example data.
set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))
P is our input vector and M is our input matrix.
Distances between the P vector and the probability density functions in M can be calculated using several approaches. For example, we could write a for loop (which requires additional code) or just use the existing distance() function and extract only one row (or column) from the results. The dist_one_many() function allows for this calculation directly: it goes through each row in M and calculates a given distance measure between P and the values in that row.
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
  distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
  dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
## Unit: microseconds
## expr
## as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1]
## distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1]
## dist_one_many(P, M, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 284.485 370.5515 456.4489 448.7075 533.3575 776.544 100
## 23620.943 25740.1705 29194.4739 27281.6785 32796.1225 45305.363 100
## 22.739 28.7505 34.3501 34.4265 37.6905 124.308 100
The dist_one_many() function returns a vector of values. In this benchmark, its median time is roughly 800 times lower than that of distance() and about 13 times lower than that of dist(), while it also allows for more distance measures to be used.
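To make the one-to-many computation concrete, the following base R sketch computes the Euclidean distance between P and every row of M in two ways and checks the result against dist(). The names one_many_loop, one_many_vec, and one_many_base are illustrative only; this is not how dist_one_many() is implemented, just the same quantity written out explicitly.

```r
# Same example data as above
set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))

# Row-by-row version: one Euclidean distance per row of M
one_many_loop <- apply(M, 1, function(row) sqrt(sum((P - row)^2)))

# Vectorised version: subtract P from every row, then sum squares by row
one_many_vec <- sqrt(rowSums(sweep(M, 2, P)^2))

# Reference from dist(): first row of the full distance matrix, without
# the distance from P to itself
one_many_base <- unname(as.matrix(dist(rbind(P, M), method = "euclidean"))[1, -1])

# All three should agree up to floating-point tolerance
print(all.equal(one_many_loop, one_many_base))
print(all.equal(one_many_vec, one_many_base))
```

Note that the dist() route computes all pairwise distances among the 101 rows and then discards most of them, which is one reason a dedicated one-to-many function can be faster.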
dist_many_many()
dist_many_many() calculates distances between two sets of probability density functions (as rows in two matrix objects).
Let’s create two new example matrices.
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
M1 is our first input matrix and M2 is our second input matrix. I am not aware of any function built into R that calculates distances between the rows of two matrices, and thus, to solve this problem, we can create our own, many_dists()…
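many_dists = function(m1, m2){
  # result matrix: one row per row of m1, one column per row of m2
  r = matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))){
    for (j in seq_len(nrow(m2))){
      # stack the two rows and compute their distance with distance()
      x = rbind(m1[i, ], m2[j, ])
      r[i, j] = distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}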
… and compare it to dist_many_many().
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  many_dists(M1, M2),
  dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
## Unit: microseconds
## expr min
## many_dists(M1, M2) 2310.314
## dist_many_many(M1, M2, method = "euclidean", testNA = FALSE) 37.081
## lq mean median uq max neval
## 2569.2030 2878.19842 2668.8310 2813.3190 12068.858 100
## 42.0015 97.64644 45.9455 49.4415 5060.589 100
Both many_dists() and dist_many_many() return a matrix. The benchmark above shows that dist_many_many() is about 30 times faster than our custom many_dists() approach in terms of mean time, and close to 60 times faster in terms of median time.
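For Euclidean distance specifically, the many-to-many result can be cross-checked in base R without philentropy. The sketch below is a minimal illustration under that assumption: many_many_manual() is a hypothetical helper written for this example, and the final comparison with dist_many_many() runs only if the package is installed.

```r
# Same example data as above
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))

# Hypothetical helper: Euclidean distance between every row of m1
# and every row of m2, written out explicitly
many_many_manual <- function(m1, m2) {
  stopifnot(ncol(m1) == ncol(m2))
  r <- matrix(NA_real_, nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))) {
    for (j in seq_len(nrow(m2))) {
      r[i, j] <- sqrt(sum((m1[i, ] - m2[j, ])^2))
    }
  }
  r
}

D_manual <- many_many_manual(M1, M2)

# Reference from dist(): stack both matrices and keep the block that
# pairs rows of M1 (1 to 10) with rows of M2 (11 to 20)
D_base <- unname(as.matrix(dist(rbind(M1, M2), method = "euclidean"))[1:10, 11:20])
print(all.equal(D_manual, D_base))

# Optional cross-check against philentropy, if it is available
if (requireNamespace("philentropy", quietly = TRUE)) {
  D_phil <- philentropy::dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
  print(all.equal(unname(D_phil), D_manual))
}
```

The dist() route computes the full 20 by 20 distance matrix, including the within-matrix distances that are then discarded, so it is best seen as a correctness check rather than a replacement for dist_many_many().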