Package MKdescr

Illustration of quantiles and box- and whisker plots

The following function should help to understand the meaning and computation of quantiles.

x <- 1:10
illustrate.quantile(x, alpha = 0.2)

illustrate.quantile(x, alpha = 0.5, type = 7)

illustrate.quantile(x, alpha = 0.8, type = c(2, 7))

In addition, there is a function to illustrate the meaning and computation of box-and-whisker plots.

set.seed(123)
illustrate.boxplot(rt(50, df = 5))

IQR

I implemented function IQrange before the standard function IQR gained the type argument. Since 2010 (r53643, r53644) the function is identical to function IQR.

x <- rnorm(100)
IQrange(x)

## [1] 1.501684

IQR(x)

## [1] 1.501684

It is also possible to compute a standardized version of the IQR leading to a normal-consistent estimate of the standard deviation.

sIQR(x)

## [1] 1.1132

sd(x)

## [1] 1.08751

Mean Absolute Deviation

The mean absolute deviation under the assumption of symmetry is a robust alternative to the sample standard deviation (see Tukey 1960).

meanAD(x)

## [1] 1.082176

Huber-type Skipped Mean and SD

Huber-type skipped mean and SD are robust alternatives to the arithmetic mean and sample standard deviation (see Hampel 1985).

skippedMean(x)

## [1] 0.002150837

skippedSD(x)

## [1] 1.08751

Five Number Summary

There is a function that computes a so-called five number summary which in contrast to function fivenum uses the first and third quartile instead of the lower and upper hinge.

fiveNS(x)

##     Minimum         25%      Median         75%     Maximum 
## -2.49483544 -0.75313964 -0.03735759  0.74854387  2.70816828

Coefficient of Variation (CV)

There are functions to compute the (classical) coefficient of variation as well as two robust variants. In case of the robust variants, the mean is replaced by the median and the SD is replaced by the (standardized) MAD and the (standardized) IQR, respectively [see Arachchige2019].

## 5% outliers
out <- rbinom(100, prob = 0.05, size = 1)
sum(out)

## [1] 4

x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25
CV(x)

## [1] 0.3456849

medCV(x)

## [1] 0.2295398

iqrCV(x)

## [1] 0.2252741

Signal to Noise Ratio (SNR)

There are functions to compute the (classical) signal to noise ratio as well as two robust variants. In case of the robust variants, the mean is replaced by the median and the SD is replaced by the (standardized) MAD and the (standardized) IQR, respectively.

SNR(x)

## [1] 2.892808

medSNR(x)

## [1] 4.356543

iqrSNR(x)

## [1] 4.439036

Standardized Mean Difference

We compute the standardized mean difference also known as Cohen’s und Hogdes’ g. In addition, there is a new proposal in case of unequal variances.

n1 <- 100
x <- rnorm(100)
n2 <- 150
y <- rnorm(n2, mean = 3, sd = 2)
## true value
(0-3)/sqrt((1 + n1/n2*2^2)/(n1/n2+1))

## [1] -2.0226

## estimates
## Aoki's e
SMD(x, y)

## bias-corrected SMD 
##          -2.068067

## Hedges' g
SMD(x, y, var.equal = TRUE)

## bias-corrected SMD 
##          -1.843242

## standardized test statistic of Welch's t-test
SMD(x, y, bias.cor = FALSE)

##      SMD 
## -2.07466

## Cohen's d
SMD(x, y, bias.cor = FALSE, var.equal = TRUE)

##      SMD 
## -1.84884

z-Score

There are functions to compute z-score and robust variants of z-score based on median and MAD, respectively median and IQR.

## 10% outliers
out <- rbinom(100, prob = 0.1, size = 1)
sum(out)

## [1] 5

x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25
z <- zscore(x)
z.med <- medZscore(x)
z.iqr <- iqrZscore(x)
## mean without outliers (should by close to 0)
mean(z[!out])

## [1] -0.193416

mean(z.med[!out])

## [1] 0.02841029

mean(z.iqr[!out])

## [1] 0.02682045

## sd without outliers (should by close to 1)
sd(z[!out])

## [1] 0.5450112

sd(z.med[!out])

## [1] 0.9702241

sd(z.iqr[!out])

## [1] 0.9159303

Box- and Whisker-Plot

In contrast to the standard function boxplot which uses the lower and upper hinge for defining the box and the whiskers, the function qboxplot uses the first and third quartile.

x <- rt(10, df = 3)
par(mfrow = c(1,2))
qboxplot(x, main = "1st and 3rd quartile")
boxplot(x, main = "Lower and upper hinge")

The difference between the two versions often is hardly visible.

Generalized Logarithm

The generalized logarithm, which corresponds to the shifted area hyperbolic sine, \[ \log(x + \sqrt{x^2 + 1}) - \log(2) \] may be useful as a variance stabilizing transformation when also negative values are present.

curve(log, from = -3, to = 5)

## Warning in log(x): NaNs wurden erzeugt

curve(glog, from = -3, to = 5, add = TRUE, col = "orange")
legend("topleft", fill = c("black", "orange"), legend = c("log", "glog"))

As in case of function log there is also glog10 and glog2.

curve(log10(x), from = -3, to = 5)

## Warning in eval(expr, envir = ll, enclos = parent.frame()): NaNs wurden erzeugt

curve(glog10(x), from = -3, to = 5, add = TRUE, col = "orange")
legend("topleft", fill = c("black", "orange"), legend = c("log10", "glog10"))

There are also functions that compute the inverse of the generalized logarithm.

inv.glog(glog(10))

## [1] 10

inv.glog(glog(10, base = 3), base = 3)

## [1] 10

inv.glog10(glog10(10))

## [1] 10

inv.glog2(glog2(10))

## [1] 10

Simulate Correlated Variables

To demonstrate Pearson correlation in my lectures, I have written this simple function to simulate correlated variables and to generate a scatter plot of the data.

res <- simCorVars(n = 500, r = 0.8)

cor(res$Var1, res$Var2)

## [1] 0.8037735

Plot TSH, fT3 and fT4 Values

The thyroid function is usually investigated by determining the values of TSH, fT3 and fT4. The function thyroid can be used to visualize the measured values as relative values with respect to the provided reference ranges.

thyroid(TSH = 1.5, fT3 = 2.5, fT4 = 14, TSHref = c(0.2, 3.0),
        fT3ref = c(1.7, 4.2), fT4ref = c(7.6, 15.0))

Generalized and Negative Logarithm as Transformations

We can use the generalized logarithm for transforming the axes in ggplot2 plots (Wickham 2016).

library(ggplot2)
data(mpg)
p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p1

p1 + scale_x_log10()

p1 + scale_x_glog10()

p1 + scale_y_log10()

p1 + scale_y_glog10()

The negative logrithm is for instance useful for displaying p values. The interesting values are on the top. This is for instance used in a so-called volcano plot.

x <- matrix(rnorm(1000, mean = 10), nrow = 10)
g1 <- rep("control", 10)
y1 <- matrix(rnorm(500, mean = 11.25), nrow = 10)
y2 <- matrix(rnorm(500, mean = 9.75), nrow = 10)
g2 <- rep("treatment", 10)
group <- factor(c(g1, g2))
Data <- rbind(x, cbind(y1, y2))
pvals <- apply(Data, 2, function(x, group) t.test(x ~ group)$p.value,
               group = group)
## compute log-fold change
logfc <- function(x, group){
  res <- tapply(x, group, mean)
  log2(res[1]/res[2])
}
lfcs <- apply(Data, 2, logfc, group = group)
ps <- data.frame(pvals = pvals, logfc = lfcs)
ggplot(ps, aes(x = logfc, y = pvals)) + geom_point() +
    geom_hline(yintercept = 0.05) + scale_y_neglog10() +
    geom_vline(xintercept = c(-0.1, 0.1)) + xlab("log-fold change") +
    ylab("-log10(p value)") + ggtitle("A Volcano Plot")

Change Data from Wide to Long

Often it’s better to have the data in a long format than in a wide format; e.g., when plotting with package ggplot2. The necessary transformation can be done with function melt.long.

## some random data
test <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
test.long <- melt.long(test)
test.long

##           value variable
## x1   0.70195275        x
## x2   0.33618151        x
## x3   0.74982570        x
## x4  -0.80088234        x
## x5  -0.12274139        x
## x6   0.66428859        x
## x7   0.05495788        x
## x8   0.21269503        x
## x9   0.05086068        x
## x10  0.18291685        x
## y1  -0.02467293        y
## y2  -1.09939100        y
## y3   0.17399933        y
## y4  -0.45536672        y
## y5   1.43577886        y
## y6  -0.52959017        y
## y7   0.15215140        y
## y8  -1.97326132        y
## y9  -0.65887876        y
## y10 -0.82845265        y
## z1   1.23170354        z
## z2  -0.15108595        z
## z3   0.14495627        z
## z4   0.30029691        z
## z5  -0.53111148        z
## z6   0.30071514        z
## z7   1.50662404        z
## z8   0.61699271        z
## z9   0.71440006        z
## z10 -0.83452283        z

ggplot(test.long, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable))

## introducing an additional grouping variable
group <- factor(rep(c("a","b"), each = 5))
test.long.gr <- melt.long(test, select = 1:2, group = group)
test.long.gr

##     group       value variable
## x1      a  0.70195275        x
## x2      a  0.33618151        x
## x3      a  0.74982570        x
## x4      a -0.80088234        x
## x5      a -0.12274139        x
## x6      b  0.66428859        x
## x7      b  0.05495788        x
## x8      b  0.21269503        x
## x9      b  0.05086068        x
## x10     b  0.18291685        x
## y1      a -0.02467293        y
## y2      a -1.09939100        y
## y3      a  0.17399933        y
## y4      a -0.45536672        y
## y5      a  1.43577886        y
## y6      b -0.52959017        y
## y7      b  0.15215140        y
## y8      b -1.97326132        y
## y9      b -0.65887876        y
## y10     b -0.82845265        y

ggplot(test.long.gr, aes(x = variable, y = value, fill = group)) +
  geom_boxplot()

sessionInfo

sessionInfo()

## R version 4.1.0 Patched (2021-07-20 r80661)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 20.1
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
## LAPACK: /home/kohlm/RTOP/Rbranch/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.5 MKdescr_0.7  
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.9         bslib_0.2.5.1     compiler_4.1.0    pillar_1.6.2     
##  [5] jquerylib_0.1.4   tools_4.1.0       digest_0.6.27     jsonlite_1.7.2   
##  [9] evaluate_0.14     lifecycle_1.0.0   tibble_3.1.3      gtable_0.3.0     
## [13] pkgconfig_2.0.3   rlang_0.4.11      yaml_2.2.1        xfun_0.24        
## [17] withr_2.4.2       stringr_1.4.0     dplyr_1.0.7       knitr_1.33       
## [21] generics_0.1.0    sass_0.4.0        vctrs_0.3.8       grid_4.1.0       
## [25] tidyselect_1.1.1  glue_1.4.2        R6_2.5.0          fansi_0.5.0      
## [29] rmarkdown_2.9     farver_2.1.0      purrr_0.3.4       magrittr_2.0.1   
## [33] scales_1.1.1      htmltools_0.5.1.1 ellipsis_0.3.2    colorspace_2.0-2 
## [37] labeling_0.4.2    utf8_1.2.2        stringi_1.7.3     munsell_0.5.0    
## [41] crayon_1.4.1