This vignette describes the UNF algorithm (Altman, Gill, McDonald 2003; Altman and King 2007; Altman 2008) and the R implementation thereof, which includes some peculiarities. The official specifications for various versions of UNF can be found elsewhere online. The algorithm is described in general terms here and one can also find more specific descriptions of the Version 3/4, Version 5 and Version 6 algorithms.
The current version of this package is an R implementation that
relies on general implementations of the relevant hash functions
provided by digest and base64
encoding provided by base64enc.
Versions 1 and 2 were available in an earlier version of the UNF package
authored by Micah Altman, which was built on custom C libraries, and is
included in the version logs on GitHub. That package was orphaned by
CRAN in 2009. The package retains the core unf()
function
from the earlier versions of the UNF package, but simplifies its use
considerably. The package additionally implements some new helper
functions.
Round numerics to k digits, where the default value of k is 7. (Note: In UNF versions <= 5, k was labeled n.) Then, convert those numerics to a character-class string containing exponential notation in the following form:
- A sign character
- A single leading non-zero digit
- A decimal point
- Up to *k*-1 remaining digits following the decimal, no trailing zeros
- A lowercase letter "e"
- A sign character
- The digits of the exponent, omitting trailing zeros
Note (a): Zero can be positive (“+0.e+”) or negative (“-0.e+”).
Note (b): Inf
, -Inf
, and NaN
are represented as: “+inf”, “-inf”, and “+nan”, respectively. At some
point in time, Dataverse handled non-finites by treating them as
missing.
Note (c): The Dataverse
implementation of UNFv5 represents zero values (and logical
FALSE
) values as “+0.e-6” rather than the implied “+0.e+”
(like logical TRUE
values: “+1.e+”). This can be replicated
in unf5()
by adding the argument
dvn_zero = TRUE
.
Truncate character strings to l characters, where the default value of l is 128. (Note: In UNF versions <= 5, l was labeled k.)
Handle other types of data in the following ways.
a. Convert logical values to numeric (`TRUE` is "1" and `FALSE` is "0") and handle as in (1), above.
b. In this package, "factor" and "AsIs" class vectors are coerced to character and handled as in (2), above.
c. Treat bits (raw) variables as base64-encoded big-endian bit sequences.
d. Handle dates, times, and datetimes as in (4), below. In this package, time-series classes ("ts" and "zoo") and "difftime" class objects are coerced to numeric.
e. Format complex numbers as `A,iB`, where `A` is the real component and and `B` is the complex component and both are formatted as numeric values as in (1), above.
Dates, times, datetimes, intervals, and durations are handled as follows:
Dates are converted to character strings in the form
“YYYY-MM-DD”, but partial dates (“YYYY” and “YYYY-MM”) are permitted.
(Note: Partial dates are not supported in R, but one can create
character representations of partial dates in the package by specifying
date_format
.)
Times are converted to character strings using the ISO 8601 format “hh:mm:ss.fffff”. “fffff” is fractions of a second and must not containing trailing zeroes (as with any numeric value, see [1], above). The time should be expressed in UTC time with a terminal “Z” character. (Note: Times without accompanying dates are not supported in R, and thus not implemented in the package.)
Datetimes may be expressed as a concatenated date (only in the
form “YYYY-MM-DD”) and time, separated by “T”. As an example, Fri Aug 22
12:51:05 EDT 2014 is encoded as:
"2014-08-22T16:51:05Z"
.
Intervals are represented as two datetimes, concatenated by a “/”. (Note: Intervals are not supported in R, and thus not implemented in the package.)
Note: Given the different implementation of timezones in different programming languages and software applications, UNF signatures calculated for identical datasets in different applications may differ. For example, the UNFv6 specification notes that Stata does not implement time zones, while R always assumes a timezone. The suggested work around is to convert variables to a string representation and handle as in (2), above.
Append all non-missing values with an end-of-line
(\n
) character and a single null byte. Represent all
missing values as a string of three null bytes. (Note: At some point in
time, Dataverse appeared to treat empty character strings
""
as missing values. As of UNFv6, this is explicit that a
missing value NA
is represented by only three null bytes
and an empty character string ""
is represented by an
end-of-line character and a null byte.)
Convert to Unicode bit encoding. For UNF versions < 4.1, use UTF-32BE. For UNF versions >= 4.1, use UTF-8.
Concatenate all values into a single byte sequence. Compute a hash on the resulting byte sequence. For UNF versions > 3, use SHA256. For UNF version 3, use MD5.
Base64 encode the resulting hash. For UNF versions >= 5, truncate the UNF by performing base64 encoding only on the leftmost 128, 192, 196, or 256 bits, where 128 bits (16 bytes) is the default.
To aggregate multiple variables:
Note (a): For one-variable datasets, Dataverse implements the algorithm at the variable-level only, without aggregation. Thus a UNF for a one-variable dataframe is the same as the UNF for that variable alone. The standard is ambiguous in this regard and the package copies the Dataverse implementation.
Note (b): The package treats dataframes and lists identically. Matrices are coerced to dataframes before running the algorithm.
Sort the base64-encoded UNFs in POSIX locale order.
Apply the UNF algorithm to the sorted, base64-encoded UNFs, using a truncation value as large as the original, treating the UNFs as character. For UNF versions >= 5, the algorithm is applied to the truncated UNFs.
To aggregate multiple datasets:
Calculate the UNF for each dataset, as above.
Sort the base64-encoded UNFs in POSIX locale order.
Apply the UNF algorithm to the sorted, base64-encoded UNFs, using a truncation value as large as the original, treating the UNFs as character. For UNF versions >= 5, the algorithm is applied to the truncated UNFs.
Note: Multiple datasets need to be combined based on UNFs calculated with the same version of the algorithm. Thus when calculating a study-level UNF, dataset-level UNFs need to be calculated using the same version of the algorithm. (To achieve this, Dataverse recalculates old UNFs whenever new data is added to a study.)
The UNF is intended to be used as part of a data citation, for example:
James Druckman; Jordan Fein; Thomas Leeper, 2012, “Replication data for: A Source of Bias in Public Opinion Stability”, http://hdl.handle.net/1902.1/17864 UNF:5:esVZKwuUnh5kkpDhxXKLxA==
Here, a citation to the data file includes a persistent handle URI and a UNF signature specifying a specific version of the data file available from that handle. Note the UNF is printed as with a small header indicating the algorithm version, making it easy to match any particular UNF against a data file:
UNF:[UNF version]:[UNF hash]
In UNFv5, the header might also contain details of other parameters for non-default number rounding and character string truncation, respectively:
UNF:[UNF version]:[digits],[characters]:[UNF hash]
In UNFv6, the header can contain a number rounding parameter (N), a string truncation parameter (X), and a variable-level UNF hash truncation parameter (H) (in any order):
UNF:[UNF version]:N[digits],X[characters],H[bits]:[UNF hash]
The package prints each UNF in the appropriate format, including any non-default parameters when appropriate.
Altman, Micah, Jeff Gill and Michael P. McDonald. 2003. Numerical Issues in Statistical Computing for the Social Scientist. John Wiley & Sons. (Describes version 3 of the algorithm)
Altman, Micah, & Gary King. 2007. “A Proposed Standard for the Scholarly Citation of Quantitative Data.” D-Lib 13(3/4). http://dlib.org/dlib/march07/altman/03altman.html.
Altman, Micah 2008. “A Fingerprint Method for Scientific Data Verification.” In T. Sobh, editor, Advances in Computer and Information Sciences and Engineering, chapter 57, pp. 311-316. Springer Netherlands, Netherlands. https://link.springer.com/chapter/10.1007/978-1-4020-8741-7_57. (Describes version 5 of the algorithm)
Data Citation Synthesis Group. 2013. “Declaration of Data Citation Principles.”
Altman, Michah, and Merce Crosas. 2014. “The Evolution of Data Citation: From Principles to Implementation.” IASSIST QUARTERLY.