This vignette shows some simple examples of how to use UNF signatures for data citation, using the UNF package on the iris dataset available in base R.

library("UNF")

The datasets package should already be available in R, but we will load it explicitly for good measure, along with the digest package for calculating MD5 sums:

library("datasets")
library("digest")
data(iris)

MD5: The Current Alternative to UNF

A common method of creating a unique hash for a data file (or any file) is the MD5 algorithm which, in most cases, produces a unique hash for a file that will change if the file is changed. This is useful from a data citation perspective because meaningful changes to the data will produce meaningful changes to the hash:

digest(iris, "md5")
## [1] "d3c5d071001b61a9f6131d3004fd0988"
digest(head(iris), "md5")
## [1] "89c2e2229a8ee084e56b2d43d3857e24"

A subset of the iris data has a different MD5 hash than the full dataset. However, MD5 is also sensitive to variable ordering:

digest(iris, "md5")
## [1] "d3c5d071001b61a9f6131d3004fd0988"
digest(rev(iris), "md5")
## [1] "2fa0a7e072b20e0a06e4cb22d5544dfd"

iris and rev(iris) contain identical data values; the variables are simply in reverse order. This means that while MD5 is sensitive to meaningful changes in data contents, it is also sensitive to meaningless changes in data structure. Relatedly, MD5 is sensitive to irrelevant changes in variable naming that might result from modifying how data are read into memory:

iris2 <- iris
names(iris2) <- gsub("\\.","",names(iris2))
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
names(iris2)
## [1] "SepalLength" "SepalWidth"  "PetalLength" "PetalWidth"  "Species"
digest(iris, "md5")
## [1] "d3c5d071001b61a9f6131d3004fd0988"
digest(iris2, "md5")
## [1] "d5a9772a2b789048c9413f8e7fc9cbb5"

MD5 sums are also sensitive to the format in which data are actually preserved (i.e., they are format-dependent). The same data saved as an .RData file and as a CSV yield different MD5 hashes:

library("tools")
save(iris, file = "iris.RData")
write.csv(iris, file = "iris.csv", row.names = FALSE)
md5sum("iris.RData")
##                         iris.RData 
## "d71875373cd2f97b734ce31444826567"
md5sum("iris.csv")
##                           iris.csv 
## "395d6abd001abf3760a1a794ef4c453c"

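The iris dataset also yields a different MD5 after being read back in from the CSV, even though it contains the same data values (identical() reports FALSE because of differences in how the two objects are stored in memory, not because any value changed):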

# read CSV back into memory
d <- read.csv("iris.csv")
identical(iris, d)
## [1] FALSE
digest(iris, "md5")
## [1] "d3c5d071001b61a9f6131d3004fd0988"
digest(d, "md5")
## [1] "52d8aa9a93ea1934d4ac3be9f208d5f9"

This is problematic when moving between statistical software packages (e.g., R, Stata, SPSS, SAS), each of which prefers its own native file format, or even between R sessions when data might be saved from a tabular format into native .RData and back again.

Finally, MD5 depends on insignificant numerical information and that dependency cannot be modified. Thus numbers that are identical (up to a specified level of precision) still yield different MD5 hashes:

x1 <- 1:20
x2 <- x1 + 1e-7
identical(digest(x1), digest(x2))
## [1] FALSE
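MD5 itself offers no tolerance setting, so any rounding has to be done by hand before hashing. The sketch below illustrates that idea only; it is not the UNF algorithm. It assumes a tolerance of 7 significant digits and converts both vectors to doubles first, because 1:20 is stored as integer while x1 + 1e-7 is stored as double, and digest() hashes the stored representation.

```r
library("digest")

x1 <- 1:20        # integer vector
x2 <- x1 + 1e-7   # double vector; differs from x1 only in the 8th significant digit

# Store both as doubles and round to 7 significant digits before hashing
r1 <- signif(as.numeric(x1), 7)
r2 <- signif(x2, 7)

# Compare MD5 hashes with and without the manual rounding step
same_after_rounding   <- identical(digest(r1, "md5"), digest(r2, "md5"))
same_without_rounding <- identical(digest(as.numeric(x1), "md5"), digest(x2, "md5"))
```

With plain MD5, every user would have to agree on and repeat such a preprocessing step themselves, which is the kind of convention a UNF builds into the signature.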

UNF Solves Problems of MD5

While file checksums are a common strategy for verifying a file, they are not well-suited to being used as global signatures for a dataset. A UNF differs from an ordinary file checksum in several important ways:

  1. UNFs are format independent. The UNF for a dataset will be the same regardless of whether the data are saved in an R binary format, a SAS formatted file, a Stata formatted file, etc., whereas file checksums will differ. The UNF is also independent of variable arrangement and naming, which can be unintentionally changed during file reading.

    write.csv(iris, file = "iris.csv", row.names = FALSE)
    iris2 <- read.csv("iris.csv")
    identical(iris, iris2)
    ## [1] FALSE
    identical(unf(iris), unf(iris2))
    ## [1] TRUE
  2. UNFs are robust to insignificant rounding error. This is important when dealing with floating-point numeric values. A UNF will be the same if the data differ only in non-significant digits, whereas a file checksum will not. The UNF specification also allows tolerances for alternative levels of precision to meet the needs of any particular use case:

    x1 <- 1:20
    x2 <- x1 + 1e-7
    identical(unf(x1), unf(x2))
    ## [1] TRUE
    x3 <- x1 + 1e-3
    identical(unf(x1), unf(x3))
    ## [1] FALSE
    # at 3 significant digits, x1 and x3 are no longer distinguishable
    identical(unf(x1, digits = 3), unf(x3, digits = 3))
    ## [1] TRUE
  3. UNFs detect misinterpretation of the data by statistical software. If the statistical software misreads the file, the resulting UNF will not match the original, but the file checksums may match. For example, numeric values read as character will produce a different UNF than those values read in as numerics.

    x1 <- 1:20
    x2 <- as.character(x1)
    identical(unf(x1), unf(x2))
    ## [1] FALSE
  4. UNFs are strongly tamper resistant. Any accidental or intentional changes to data values will change the resulting UNF. Most file checksums and descriptive statistics detect only certain types of changes.
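
The claims in items 1 and 4 about variable order, variable naming, and changed values can be checked directly. The following is a minimal sketch. It assumes that the object returned by unf() stores its hash string in an element named unf (an assumption about the package's return structure), and same_unf() is a hypothetical helper introduced here only for illustration.

```r
library("UNF")
library("datasets")
data(iris)

# Hypothetical helper: compare only the hash strings of two objects
# (assumes the return value of unf() has an element called `unf`)
same_unf <- function(a, b) identical(unf(a)$unf, unf(b)$unf)

# Same values, variables in reverse order
reordered <- rev(iris)

# Same values, variable names with the dots removed
renamed <- iris
names(renamed) <- gsub("\\.", "", names(renamed))

# One value changed in the first decimal place
tampered <- iris
tampered$Sepal.Length[1] <- tampered$Sepal.Length[1] + 0.1

same_unf(iris, reordered)  # variable order should not matter
same_unf(iris, renamed)    # variable names should not matter
same_unf(iris, tampered)   # a changed value should change the UNF
```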

UNF in Practice

UNF has at least two significant use cases:

  1. To create a format-independent citation to a specific version (or subset of a version) of a dataset for use in academic publishing, data sharing, and archiving.
  2. To make versions and subsets explicit in the description of complex analyses.

This section walks through both of these use cases.

UNF As Data Citation

Because it is format-independent, the UNF hash is appropriately labeled a universal numeric fingerprint: any software application and any programming language can implement the algorithm. This means that data users can create UNF-embedded citations for their data and use UNF citations to confirm they have the appropriate version and subset of a specified dataset.

For example, the iris dataset is cited in the R documentation (? iris) as:

Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2-5.

While R users are familiar with these data and they have been a part of R for a long time, it is possible for anyone to go to Anderson’s article, transcribe the data (or some part thereof) into a new dataset, start calling it iris, and use the same citation. A third user holding iris in memory then has to wonder how to confirm which version of iris they are using. If the citation were expanded to include a UNF signature, there would be no question about whether the data in hand matched the data cited.

unf(iris)
## UNF6:6oVTvlCR+F1W1HTJ/QUmkA==

The expanded citation would then read:

Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2-5. UNF6:6oVTvlCR+F1W1HTJ/QUmkA==
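
Appending the signature to a citation, and later checking data in hand against a cited signature, can both be scripted. The sketch below is only an illustration: cite_with_unf() and matches_citation() are hypothetical helpers, not part of the UNF package, and they assume that unf() uses its default version 6 settings and that its return value stores the hash string in an element named unf.

```r
library("UNF")
library("datasets")
data(iris)

# Hypothetical helper: append a version 6 UNF signature to a citation string
# (assumes default unf() settings and a `unf` element in the return value)
cite_with_unf <- function(citation, data) {
  paste0(citation, " UNF6:", unf(data)$unf)
}

# Hypothetical helper: does a citation end with the UNF of the data in hand?
matches_citation <- function(citation, data) {
  endsWith(citation, paste0("UNF6:", unf(data)$unf))
}

anderson <- paste(
  "Anderson, Edgar (1935). The irises of the Gaspe Peninsula,",
  "Bulletin of the American Iris Society, 59, 2-5."
)

full_citation <- cite_with_unf(anderson, iris)
matches_citation(full_citation, iris)        # full dataset against its own citation
matches_citation(full_citation, head(iris))  # a subset against the same citation
```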

When used in tandem with a persistent locator (e.g., a Handle or data DOI), this rich citation not only points data users to a persistent location where the data can be found but also to a specific version of the data (of possibly many) available at that location. This means that as datasets evolve, expand, are corrected for errors, or move between physical storage locations, a URI + UNF combination continues to identify exactly which data were used. As an example, the following Handle and UNF citation provides a persistent link to replication files for an article. These files currently live on the Harvard University Dataverse Network, but the use of a Handle and a UNF means interested users should be able to find the specific version of the data used in the original article indefinitely:

James Druckman; Jordan Fein; Thomas Leeper, 2012, “Replication data for: A Source of Bias in Public Opinion Stability”, http://hdl.handle.net/1902.1/17864 UNF:5:esVZKwuUnh5kkpDhxXKLxA==

UNF As In-line Data Description

In addition to its use as part of a persistent identifier of a dataset, UNF can also be used within an analysis to make subsetting and versioning explicit. For example, iris contains data on three species (stored in the Species variable). If we analyze these data together, we are working with one subset of the data (the trivial subset that is the whole dataset):

unf(iris)
## UNF6:6oVTvlCR+F1W1HTJ/QUmkA==

But if we want to work with species individually, each of those subsets has a unique UNF:

unf(iris[iris$Species == 'setosa',])
## UNF6:8rTAejPnZtElgtCerfvi3A==
unf(iris[iris$Species == 'versicolor',])
## UNF6:7ltC0QDS/iRK/QObr61L/Q==
unf(iris[iris$Species == 'virginica',])
## UNF6:UysJ51ubitBP+7vy4QTwiA==
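
When there are many groups, the per-subset signatures can be collected in one step rather than typed out one call at a time. This is a minimal sketch, again assuming that the object returned by unf() stores its hash string in an element named unf; species_unfs is a name introduced here for illustration.

```r
library("UNF")
library("datasets")
data(iris)

# Split the data by species and compute one UNF hash string per subset
# (assumes the return value of unf() has an element called `unf`)
species_unfs <- vapply(
  split(iris, iris$Species),
  function(d) unf(d)$unf,
  character(1)
)

# A named character vector: one signature per species
species_unfs
```

The resulting named vector can be printed alongside the analysis of each group so that readers can confirm which subset produced which result.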

Similarly, if we wanted to work with a random subsample of observations in iris, we could use UNF to state explicitly the contents of that subset:

set.seed(123)
unf(iris[sample(1:nrow(iris), 20, FALSE),])
## UNF6:bYoIEK1D9D5q8682+ixF+A==