Calculating SHA1 hashes with digest() and sha1()

R FAQ 7.31 illustrates potential problems with floating point arithmetic. Mathematically the equality $x = \sqrt{x}^2$ should hold. But the precision of floating points numbers is finite. Hence some rounding is done, leading to numbers which are no longer identical.

> a0 <- 2
> b <- sqrt(a0)
> a1 <- b^2
> identical(a0, a1)
[1] FALSE
> a0 - a1
[1] -4.440892e-16
> a <- c(a0, a1)
> sprintf("%a", a)
[1] "0x1p+1"               "0x1.0000000000001p+1"

Although the difference is small, any difference will result in different hash when using the digest() function. However, the sha1() function tackles this problem by using the hexadecimal representation of the numbers and truncates that representation to a certain number of digits prior to calculating the hash function.

> library(digest)
> sapply(a, digest, algo = "sha1")
[1] "315a5aa84aa6cfa4f3fb4b652a596770be0365e8"
[2] "5e3999bf79c230f7430e97d7f30070a7eff5ec92"
> sapply(a, sha1)
[1] "8a938d8f5fb9b8ccb6893aa1068babcc517f32d4"
[2] "8a938d8f5fb9b8ccb6893aa1068babcc517f32d4"
> sapply(a, sha1, digits = 15)
[1] "98eb1dc9fada00b945d3ef01c77049ee5a4b7b9c"
[2] "5a173e2445df1134908037f69ac005fbd8afee74"
> sapply(a, sha1, digits = 13)
[1] "43b3b465c975af322c85473190a9214b79b79bf6"
[2] "43b3b465c975af322c85473190a9214b79b79bf6"
> sapply(a, sha1, digits = 10)
[1] "6b537a9693c750ed535ca90527152f06e358aa3a"
[2] "6b537a9693c750ed535ca90527152f06e358aa3a"
> c(sha1(pi), sha1(pi, digits = 13), sha1(pi, digits = 10))
[1] "169388cf1ce60dc4e9904a22bc934c5db33d975b"
[2] "20dc38536b6689d5f2d053f30efb09c44af78378"
[3] "3a727417bd1807af2f0148cf3de69abff32c23ec"

The result of floating point arithematic on 32-bit and 64-bit can be slightly different. E.g. print(pi ^ 11, 22) returns 294204.01797389047 on 32-bit and 294204.01797389053 on 64-bit. Note that only the last 2 digits are different.

command	32-bit	64-bit
`print(pi ^ 11, 22)`	`294204.01797389047`	`294204.01797389053`
`sprintf("%a", pi ^ 11)`	`"0x1.1f4f01267bf5fp+18"`	`"0x1.1f4f01267bf6p+18"`
`digest(pi ^ 11, algo = "sha1")`	`"c5efc7f167df1bb402b27cf9b405d7cebfba339a"`	`"b61f6fea5e2a7952692cefe8bba86a00af3de713"`
`sha1(pi ^ 11, digits = 14)`	`"5c7740500b8f78ec2354ea6af58ea69634d9b7b1"`	`"4f3e296b9922a7ddece2183b1478d0685609a359"`
`sha1(pi ^ 11, digits = 13)`	`"372289f87396b0877ccb4790cf40bcb5e658cad7"`	`"372289f87396b0877ccb4790cf40bcb5e658cad7"`
`sha1(pi ^ 11, digits = 10)`	`"c05965af43f9566bfb5622f335817f674abfc9e4"`	`"c05965af43f9566bfb5622f335817f674abfc9e4"`

Choosing digest() or sha1()

Creating a sha1 method for other classes

How to

summary.lm

Let’s illustrate this using the summary of a simple linear regression. Suppose that we want a hash that takes into account the coefficients, their standard error and sigma.

> lm_SR <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
> lm_sum <- summary(lm_SR)
> class(lm_sum)
[1] "summary.lm"
> str(lm_sum)
List of 11
 $ call         : language lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
 $ terms        :Classes 'terms', 'formula'  language sr ~ pop15 + pop75 + dpi + ddpi
  .. ..- attr(*, "variables")= language list(sr, pop15, pop75, dpi, ddpi)
  .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:5] "sr" "pop15" "pop75" "dpi" ...
  .. .. .. ..$ : chr [1:4] "pop15" "pop75" "dpi" "ddpi"
  .. ..- attr(*, "term.labels")= chr [1:4] "pop15" "pop75" "dpi" "ddpi"
  .. ..- attr(*, "order")= int [1:4] 1 1 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(sr, pop15, pop75, dpi, ddpi)
  .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
  .. .. ..- attr(*, "names")= chr [1:5] "sr" "pop15" "pop75" "dpi" ...
 $ residuals    : Named num [1:50] 0.864 0.616 2.219 -0.698 3.553 ...
  ..- attr(*, "names")= chr [1:50] "Australia" "Austria" "Belgium" "Bolivia" ...
 $ coefficients : num [1:5, 1:4] 28.566087 -0.461193 -1.691498 -0.000337 0.409695 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ...
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:5] FALSE FALSE FALSE FALSE FALSE
  ..- attr(*, "names")= chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ...
 $ sigma        : num 3.8
 $ df           : int [1:3] 5 45 5
 $ r.squared    : num 0.338
 $ adj.r.squared: num 0.28
 $ fstatistic   : Named num [1:3] 5.76 4 45
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:5, 1:5] 3.74 -7.24e-02 -4.46e-01 -7.86e-05 -1.88e-02 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ...
  .. ..$ : chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ...
 - attr(*, "class")= chr "summary.lm"
> coef_sum <- coef(lm_sum)[, c("Estimate", "Std. Error")]
> sigma <- lm_sum$sigma
> class(coef_sum)
[1] "matrix" "array" 
> class(sigma)
[1] "numeric"
> sha1(coef_sum, digits = 4)
[1] "3f0b0c552f94d753fcc8deb4d3e9fc11a83197af"
> sha1(sigma, digits = 4)
[1] "cbc83d1791ef1eeadd824ea9a038891b5889056b"
> sha1(list(coef_sum, sigma), digits = 4)
[1] "476d27265365cd41662eedf059b335d38a221cc2"
> sha1.summary.lm <- function(x, digits = 4, zapsmall = 7) {
+     coef_sum <- coef(x)[, c("Estimate", "Std. Error")]
+     sigma <- x$sigma
+     com .... [TRUNCATED] 
> sha1(lm_sum)
[1] "476d27265365cd41662eedf059b335d38a221cc2"
> LCS2 <- LifeCycleSavings[rownames(LifeCycleSavings) != 
+     "Zambia", ]
> lm_SR2 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LCS2)
> sha1(summary(lm_SR2))
[1] "90beb028833bf0542997fde7c3f19e5b9fdfeef4"

Let’s illustrate this using the summary of a simple linear regression. Suppose that we want a hash that takes into account the coefficients, their standard error and sigma.

> class(lm_SR)
[1] "lm"
> str(lm_SR)
List of 12
 $ coefficients : Named num [1:5] 28.566087 -0.461193 -1.691498 -0.000337 0.409695
  ..- attr(*, "names")= chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ...
 $ residuals    : Named num [1:50] 0.864 0.616 2.219 -0.698 3.553 ...
  ..- attr(*, "names")= chr [1:50] "Australia" "Austria" "Belgium" "Bolivia" ...
 $ effects      : Named num [1:50] -68.38 -14.29 7.3 -3.52 -7.94 ...
  ..- attr(*, "names")= chr [1:50] "(Intercept)" "pop15" "pop75" "dpi" ...
 $ rank         : int 5
 $ fitted.values: Named num [1:50] 10.57 11.45 10.95 6.45 9.33 ...
  ..- attr(*, "names")= chr [1:50] "Australia" "Austria" "Belgium" "Bolivia" ...
 $ assign       : int [1:5] 0 1 2 3 4
 $ qr           :List of 5
  ..$ qr   : num [1:50, 1:5] -7.071 0.141 0.141 0.141 0.141 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:50] "Australia" "Austria" "Belgium" "Bolivia" ...
  .. .. ..$ : chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ...
  .. ..- attr(*, "assign")= int [1:5] 0 1 2 3 4
  ..$ qraux: num [1:5] 1.14 1.17 1.16 1.15 1.05
  ..$ pivot: int [1:5] 1 2 3 4 5
  ..$ tol  : num 1e-07
  ..$ rank : int 5
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 45
 $ xlevels      : Named list()
 $ call         : language lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
 $ terms        :Classes 'terms', 'formula'  language sr ~ pop15 + pop75 + dpi + ddpi
  .. ..- attr(*, "variables")= language list(sr, pop15, pop75, dpi, ddpi)
  .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:5] "sr" "pop15" "pop75" "dpi" ...
  .. .. .. ..$ : chr [1:4] "pop15" "pop75" "dpi" "ddpi"
  .. ..- attr(*, "term.labels")= chr [1:4] "pop15" "pop75" "dpi" "ddpi"
  .. ..- attr(*, "order")= int [1:4] 1 1 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(sr, pop15, pop75, dpi, ddpi)
  .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
  .. .. ..- attr(*, "names")= chr [1:5] "sr" "pop15" "pop75" "dpi" ...
 $ model        :'data.frame':  50 obs. of  5 variables:
  ..$ sr   : num [1:50] 11.43 12.07 13.17 5.75 12.88 ...
  ..$ pop15: num [1:50] 29.4 23.3 23.8 41.9 42.2 ...
  ..$ pop75: num [1:50] 2.87 4.41 4.43 1.67 0.83 2.85 1.34 0.67 1.06 1.14 ...
  ..$ dpi  : num [1:50] 2330 1508 2108 189 728 ...
  ..$ ddpi : num [1:50] 2.87 3.93 3.82 0.22 4.56 2.43 2.67 6.51 3.08 2.8 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language sr ~ pop15 + pop75 + dpi + ddpi
  .. .. ..- attr(*, "variables")= language list(sr, pop15, pop75, dpi, ddpi)
  .. .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:5] "sr" "pop15" "pop75" "dpi" ...
  .. .. .. .. ..$ : chr [1:4] "pop15" "pop75" "dpi" "ddpi"
  .. .. ..- attr(*, "term.labels")= chr [1:4] "pop15" "pop75" "dpi" "ddpi"
  .. .. ..- attr(*, "order")= int [1:4] 1 1 1 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(sr, pop15, pop75, dpi, ddpi)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
  .. .. .. ..- attr(*, "names")= chr [1:5] "sr" "pop15" "pop75" "dpi" ...
 - attr(*, "class")= chr "lm"
> lm_model <- lm_SR$model
> lm_terms <- lm_SR$terms
> class(lm_model)
[1] "data.frame"
> class(lm_terms)
[1] "terms"   "formula"
> sha1.formula <- function(x, digits = 14, zapsmall = 7, 
+     ..., algo = "sha1") {
+     sha1(as.character(x), digits = digits, zapsmall = zapsmall .... [TRUNCATED] 
> sha1(lm_terms)
[1] "2737d209720aa7d1c0555050ad06ebe89f3850cd"
> sha1(lm_model)
[1] "27b7dd9e3e09b9577da6947b8473b63a1d0b6eb4"
> sha1.lm <- function(x, digits = 14, zapsmall = 7, 
+     ..., algo = "sha1") {
+     lm_model <- x$model
+     lm_terms <- x$terms
+     combined <- .... [TRUNCATED] 
> sha1(lm_SR)
[1] "7eda2a9d58e458c8e782e40ce140d62b836b2a2f"
> sha1(lm_SR2)
[1] "4d3abdb1f17bd12fdf9d9b91a2ad04c07824fe4a"

Calculating SHA1 hashes with digest() and sha1()

Short intro on hashes

Difference between `digest()` and `sha1()`

Choosing `digest()` or `sha1()`

Creating a sha1 method for other classes

How to

summary.lm

lm

Using hashes to track changes in analysis

Short intro on hashes

Difference between digest() and sha1()

Choosing digest() or sha1()

Creating a sha1 method for other classes

How to

summary.lm

lm

Using hashes to track changes in analysis

Difference between `digest()` and `sha1()`

Choosing `digest()` or `sha1()`