library(manydata)
The first thing users of the package will want to do is to identify
datasets that might contribute to their research goals. Since some of
these data packages are too big for CRAN, we expect that their
developers will instead choose to make their packages available on
GitHub. To make it easier to identify all packages in the many packages
universe, we have developed the get_packages() function.
The function lists the many packages available and allows users to
download them.
get_packages()
Packages in the many packages universe facilitate the comparison and analysis of multiple datasets in a specific domain of global governance. This is possible because every package follows the same coding system, built on the same principles.
In {manystates}, for example, all datasets from the states database
contain variables named Beg and End, which represent the beginning and
ending date of an episode of state sovereignty.
In {manyenviron}, the agreements database also has the Beg and End
variables, but there they are attributed to treaties (signature and
term dates). For the memberships database, Beg and End represent when a
relationship between a state and an agreement starts (either signature,
ratification or entry into force) and ends (either withdrawal or term).
These shared variable names allow comparison across datasets that come from different sources but record the same information. They make it possible to identify where observations recur, differ or are absent across datasets, and to extract more robust data when researching a particular governance domain.
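Because the key columns share names, checking where sources agree, differ or lack an observation reduces to set operations on those columns. A minimal sketch in base R, using two made-up toy datasets (the objects `source_a` and `source_b` and all of their values are illustrative assumptions, not data from any many package):

```r
# Two toy datasets from hypothetical sources that follow the same naming scheme
source_a <- data.frame(ID = c("Augustus", "Tiberius", "Caligula"),
                       Beg = c(-27, 14, 37),
                       End = c(14, 37, 41),
                       stringsAsFactors = FALSE)
source_b <- data.frame(ID = c("Augustus", "Tiberius", "Claudius"),
                       Beg = c(-31, 14, 41),
                       End = c(14, 37, 54),
                       stringsAsFactors = FALSE)

# Observations recorded by both sources (recurrence)
in_both <- intersect(source_a$ID, source_b$ID)

# Observations recorded by only one source (absence in the other)
only_a <- setdiff(source_a$ID, source_b$ID)
only_b <- setdiff(source_b$ID, source_a$ID)

# Shared observations whose start values disagree (difference)
shared <- merge(source_a, source_b, by = "ID", suffixes = c("_a", "_b"))
disagree <- shared$ID[shared$Beg_a != shared$Beg_b]

print(in_both)
print(only_a)
print(only_b)
print(disagree)
```

The same pattern works for any pair of datasets that share an ID column and a commonly named variable.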
Let us say that we wish to download the {manystates} package, which
offers a set of datasets related to state actors in global governance.
We can download and install the latest release version of the
{manystates} package using the same function as before, only specifying
which package we want to 'get': get_packages("manystates").
For now, let’s work with the Roman Emperors database included in manydata. We can get a quick summary of the datasets included in this package with the following command:
data(package = "manydata")
data(emperors, package = "manydata")
emperors
## $wikipedia
## # A tibble: 68 × 15
## ID Beg End FullName Birth Death CityBirth ProvinceBirth Rise Cause
## <chr> <mda> <mda> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Augustus -002… 0014… IMPERAT… 0062… 0014… Rome Italia Birt… Assa…
## 2 Tiberius 0014… 0037… TIBERIV… 0041… 0037… Rome Italia Birt… Assa…
## 3 Caligula 0037… 0041… GAIVS I… 0012… 0041… Antitum Italia Birt… Assa…
## 4 Claudius 0041… 0054… TIBERIV… 0009… 0054… Lugdunum Gallia Lugdu… Birt… Assa…
## 5 Nero 0054… 0068… NERO CL… 0037… 0068… Antitum Italia Birt… Suic…
## 6 Galba 0068… 0069… SERVIVS… 0002… 0069… Terracina Italia Seiz… Assa…
## 7 Otho 0069… 0069… MARCVS … 0032… 0069… Terentin… Italia Appo… Suic…
## 8 Vitelli… 0069… 0069… AVLVS V… 0015… 0069… Rome Italia Seiz… Assa…
## 9 Vespasi… 0069… 0079… TITVS F… 0009… 0079… Falacrine Italia Seiz… Natu…
## 10 Titus 0079… 0081… TITVS F… 0039… 0081… Rome Italia Birt… Natu…
## # … with 58 more rows, and 5 more variables: Killer <chr>, Dynasty <chr>,
## # Era <chr>, Notes <chr>, Verif <chr>
##
## $UNRV
## # A tibble: 99 × 7
## ID Beg End Birth Death FullName Dynasty
## <chr> <mdate> <mdate> <chr> <chr> <chr> <chr>
## 1 Augustus -0027 -0014 63 BC 14 Gaius Julius Caesar Oct… Julio-…
## 2 Tiberius -0014 0037 42 BC 37 Tiberius Claudius Nero … Julio-…
## 3 Gaius (Caligula) 0037 0041 12 14 Gaius Caesar Germanicus… Julio-…
## 4 Claudius 0041 0054 10 BC 41 Tiberius Claudius Nero … Julio-…
## 5 Nero 0054 0068 37 68 Claudius Nero Caesar (b… Julio-…
## 6 Galba 0068 0069 3 BC 69 Servius Sulpicius Galba… Year o…
## 7 Otho 0069 0069 32 69 Marcus Salvius Otho / I… Year o…
## 8 Vitellius 0069 0069 15 69 Aulus Vitellius / Aulus… Year o…
## 9 Vespasian 0069 0079 9 79 Titus Flavius Vespasian… Year o…
## 10 Titus 0079 0081 39 79 Titus Flavius Vespasian… Flavian
## # … with 89 more rows
##
## $britannica
## # A tibble: 87 × 3
## ID Beg End
## <chr> <mdate> <mdate>
## 1 Augustus -0031 0014
## 2 Tiberius 0014 0037
## 3 Caligula 0037 0041
## 4 Claudius 0041 0054
## 5 Nero 0054 0068
## 6 Galba 0068 0069
## 7 Otho 0069-01 0069-04
## 8 Aulus Vitellius 0069-07 0069-12
## 9 Vespasian 0069 0079
## 10 Titus 0079 0081
## # … with 77 more rows
We can see that there are three named datasets relating to emperors
here: wikipedia (dataset assembled from Wikipedia pages), UNRV (United
Nations of Roma Victrix), and britannica (Britannica Encyclopedia List
of Roman Emperors). Each of these datasets has its own advantages, so
we may wish to understand their differences, summarise variables across
them, and perhaps also rerun models across them.
To retrieve an individual dataset from this database, we can use the
pluck() function.
wikipedia <- pluck(emperors, "wikipedia")
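Since the database prints as a named list of tibbles, base R list indexing offers a rough equivalent for pulling out one dataset by name. A small sketch on a toy list (the object `toy_db` and its contents are made up for illustration and are not part of manydata):

```r
# A toy "database": a named list of data frames (values are illustrative only)
toy_db <- list(
  first  = data.frame(ID = c("a", "b"), Beg = c(1, 2), stringsAsFactors = FALSE),
  second = data.frame(ID = c("a", "c"), Beg = c(1, 3), stringsAsFactors = FALSE)
)

# List the names of the datasets available in the toy database
dataset_names <- names(toy_db)

# Extract one dataset by name using base R list indexing
first_dataset <- toy_db[["first"]]

print(dataset_names)
print(first_dataset)
```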
However, the real value of the various ‘many packages’ is that multiple datasets relating to the same phenomenon are presented together.
First of all, we want to understand what the differences are between the datasets in a database. One important way to understand the relationship between these datasets is to understand their relative advantages and disadvantages. For example, one dataset may be long (has many observations) while another is shorter but wider (has more variables). One might include details further back in history while the other is more recent, but include more missing data or less precise data (i.e. coded at a less granular level) than another with a more restrictive scope. Or one might appear complete yet offer less information on where the original data points were sourced or how certain variables were coded, while another provides an extensive and transparent codebook that facilitates replication.
data_source() and data_contrast()
We can bring up the database-level documentation using ?emperors. This
informs users about the datasets present in the database as well as the
variables in the various datasets. However, if we want a more detailed
summary of the various levels of data and sources, we can use
data_source() and data_contrast().
The data_source() function displays bibliographic references for the
datasets within a database.
data_source(pkg = "manydata", database = NULL, dataset = NULL)
## Component 1 :
## Reference
## wikipedia "(????). "List_of_Roman_emperors."<https://en.wikipedia.org/wiki/List_of_Roman_emperors>. Accessed:202" [truncated]
## UNRV "(????). "Roman Emperor list."<https://www.unrv.com/government/emperor.php>. Accessed: 2021-07-22."
## britannica "(????). "List of Roman emperors."<https://www.britannica.com/topic/list-of-Roman-emperors-2043294>." [truncated]
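The data_contrast() function returns a data frame with the key metadata
of each level of data objects (many package, database, and dataset).
This metadata includes the number of unique IDs, the percentage of
missing data, the number of rows and columns, the beginning and end
dates, and the source URL of each dataset: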
data_contrast(pkg = "manydata", database = NULL, dataset = NULL)
## Please specify 'approx_range' argument if you want approximate dates to also be expanded
## Please specify 'approx_range' argument if you want approximate dates to also be expanded
## Please specify 'approx_range' argument if you want approximate dates to also be expanded
## Please specify 'approx_range' argument if you want approximate dates to also be expanded
## Please specify 'approx_range' argument if you want approximate dates to also be expanded
## Please specify 'approx_range' argument if you want approximate dates to also be expanded
## emperors :
## Unique ID Missing Data Rows Columns Beg End
## wikipedia 68 9.9 % 68 15 -0026-01-16 0014-08-19
## UNRV 98 6.06 % 99 7 -0027-01-01 -0014-12-31
## britannica 87 0 % 87 3 -0031\032 0014\032
## URL
## wikipedia https://github.com/zonination/emperors
## UNRV https://www.unrv.com/government/emperor.php
## britannica https://www.britannica.com/topic/list-of-Roman-emperors-2043294
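To make the meaning of these columns concrete, a few of them can be recomputed by hand. The following is only a sketch on a made-up toy database, not a description of how data_contrast() is implemented; the helper `summarise_dataset()` and the object `toy_db` are hypothetical names introduced here:

```r
# Toy database with one "long" and one "wide" dataset (illustrative values)
toy_db <- list(
  long = data.frame(ID = c("a", "b", "c", "c"),
                    Beg = c(1, 2, 3, NA),
                    stringsAsFactors = FALSE),
  wide = data.frame(ID = c("a", "b"),
                    Beg = c(1, 2),
                    End = c(5, NA),
                    Notes = c("x", NA),
                    stringsAsFactors = FALSE)
)

# Hypothetical helper: basic shape and missingness of one dataset
summarise_dataset <- function(x) {
  data.frame(
    unique_id = length(unique(x$ID)),             # distinct units
    missing_pct = round(100 * mean(is.na(x)), 2), # share of empty cells
    rows = nrow(x),
    columns = ncol(x)
  )
}

# One summary row per dataset, labelled by dataset name
toy_summary <- do.call(rbind, lapply(toy_db, summarise_dataset))
print(toy_summary)
```

Here the "long" dataset has more rows and the "wide" one more columns but a higher share of missing cells, which is the kind of trade-off described above.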
Next we may be interested in whether any relationships we are interested in or inferences we want to draw are sensitive to which data we use. That is, we are interested in the robustness of any results to different data specifications.
We can start by exploring whether our conclusion about when emperors
began their reign would differ depending on which dataset we use. We can
use the purrr::map() function to apply the same steps to every dataset
in the database: pass it the mean() function and tell it to operate on
just the "Beg" variable, which represents when emperors began their
reign (removing any NAs). Since manydata datasets are always ordered by
"Beg" (and then "ID"), we can remove any subsequent (duplicated) entries
by ID to concentrate on first appearances.
library(dplyr)
emperors %>%
  purrr::map(function(x){
    x %>% dplyr::filter(!duplicated(ID)) %>%
      dplyr::summarise(mean(Beg, na.rm = TRUE))
  })
## $wikipedia
## # A tibble: 68 × 1
## `mean(Beg, na.rm = TRUE)`
## <chr>
## 1 -0026-01-16
## 2 0014-09-18
## 3 0037-03-18
## 4 0041-01-25
## 5 0054-10-13
## 6 0068-06-08
## 7 0069-01-15
## 8 0069-04-17
## 9 0069-12-21
## 10 0079-06-24
## # … with 58 more rows
##
## $UNRV
## # A tibble: 98 × 1
## `mean(Beg, na.rm = TRUE)`
## <chr>
## 1 -0027-07-02
## 2 -0014-07-02
## 3 0037-07-02
## 4 0041-07-02
## 5 0054-07-02
## 6 0068-07-01
## 7 0069-07-02
## 8 0069-07-02
## 9 0069-07-02
## 10 0079-07-02
## # … with 88 more rows
##
## $britannica
## # A tibble: 87 × 1
## `mean(Beg, na.rm = TRUE)`
## <chr>
## 1 "-0031\u001a"
## 2 "0014-07-02"
## 3 "0037-07-02"
## 4 "0041-07-02"
## 5 "0054-07-02"
## 6 "0068-07-01"
## 7 "0069-01-16"
## 8 "0069-07-16"
## 9 "0069-07-02"
## 10 "0079-07-02"
## # … with 77 more rows
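The same de-duplicate-then-summarise pattern can be tried on plain numeric years, where mean() collapses each dataset to a single value. This is a base R sketch on invented toy data (the object `toy_db` and its values are assumptions for illustration; the real Beg columns are date-like objects, not numbers):

```r
# Toy database: the same units recorded by two hypothetical sources
toy_db <- list(
  source_a = data.frame(ID = c("Augustus", "Augustus", "Tiberius"),
                        Beg = c(-27, -23, 14),
                        stringsAsFactors = FALSE),
  source_b = data.frame(ID = c("Augustus", "Tiberius", "Caligula"),
                        Beg = c(-31, 14, NA),
                        stringsAsFactors = FALSE)
)

# For each dataset: keep the first row per ID, then average Beg without NAs
mean_first_beg <- vapply(toy_db, function(x) {
  first_rows <- x[!duplicated(x$ID), ]
  mean(first_rows$Beg, na.rm = TRUE)
}, numeric(1))

print(mean_first_beg)
```

A gap between the two values signals that a conclusion about the average start date depends on which source is used.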
Now that we have compared the data and looked at some of the different inferences drawn, let us examine how to select and consolidate databases.
The consolidate() function facilitates consolidating a set of datasets,
or a database, from a 'many' package into a single dataset with some
combination of the rows and columns. The function includes separate
arguments for rows and columns, as well as for how to resolve conflicts
in observations across datasets. The key argument indicates the column
to collapse datasets by. This provides users with considerable
flexibility in how they combine data.
For example, users may wish to see units and variables coded in “any” dataset (i.e. units or variables present in at least one of the datasets in the database) or units and variables coded in “every” dataset (i.e. units or variables present in all of the datasets in the database).
consolidate(database = emperors, rows = "any", cols = "any", resolve = "coalesce", key = "ID")
## # A tibble: 137 × 15
## ID CityBirth ProvinceBirth Rise Cause Killer Era Notes Verif Birth
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Augustus Rome Italia Birt… Assa… Wife Prin… birt… Redd… 0062…
## 2 Tiberius Rome Italia Birt… Assa… Other… Prin… birt… Redd… 0041…
## 3 Caligula Antitum Italia Birt… Assa… Senate Prin… assa… Redd… 0012…
## 4 Claudius Lugdunum Gallia Lugdu… Birt… Assa… Wife Prin… birt… Redd… 0009…
## 5 Nero Antitum Italia Birt… Suic… Senate Prin… <NA> Redd… 0037…
## 6 Galba Terracina Italia Seiz… Assa… Other… Prin… birt… Redd… 0002…
## 7 Otho Terentinum Italia Appo… Suic… Other… Prin… <NA> <NA> 0032…
## 8 Vitellius Rome Italia Seiz… Assa… Other… Prin… <NA> <NA> 0015…
## 9 Vespasian Falacrine Italia Seiz… Natu… Disea… Prin… <NA> <NA> 0009…
## 10 Titus Rome Italia Birt… Natu… Disea… Prin… <NA> <NA> 0039…
## # … with 127 more rows, and 5 more variables: Death <chr>, FullName <chr>,
## # Dynasty <chr>, Beg <mdate>, End <mdate>
consolidate(database = emperors, rows = "every", cols = "every", resolve = "coalesce", key = "ID")
## # A tibble: 41 × 3
## ID Beg End
## <chr> <mdate> <mdate>
## 1 Augustus -0026-01-16 0014-08-19
## 2 Tiberius 0014-09-18 0037-03-16
## 3 Claudius 0041-01-25 0054-10-13
## 4 Nero 0054-10-13 0068-06-09
## 5 Galba 0068-06-08 0069-01-15
## 6 Otho 0069-01-15 0069-04-16
## 7 Vespasian 0069-12-21 0079-06-24
## 8 Titus 0079-06-24 0081-09-13
## 9 Domitian 0081-09-14 0096-09-18
## 10 Nerva 0096-09-18 0098-01-27
## # … with 31 more rows
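The difference between "any" and "every" can be pictured as a union versus an intersection across datasets. The sketch below applies that idea to a toy database in base R; it illustrates the concept only and makes no claim about how consolidate() works internally (the object `toy_db` and its values are made up):

```r
# Toy database of three datasets with overlapping units and variables
toy_db <- list(
  a = data.frame(ID = c("x", "y", "z"), Beg = 1:3, Notes = c("p", "q", "r"),
                 stringsAsFactors = FALSE),
  b = data.frame(ID = c("x", "y"), Beg = 1:2,
                 stringsAsFactors = FALSE),
  c = data.frame(ID = c("y", "z"), Beg = 2:3, End = 4:5,
                 stringsAsFactors = FALSE)
)

ids <- lapply(toy_db, function(x) x$ID)  # units per dataset
cols <- lapply(toy_db, names)            # variables per dataset

# "any": present in at least one dataset (union)
rows_any <- Reduce(union, ids)
cols_any <- Reduce(union, cols)

# "every": present in all datasets (intersection)
rows_every <- Reduce(intersect, ids)
cols_every <- Reduce(intersect, cols)

print(rows_any)
print(rows_every)
print(cols_any)
print(cols_every)
```

This mirrors the outputs above, where "any" keeps more rows and columns than "every".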
Users can also choose how they want to resolve conflicts between
observations in consolidate() with several 'resolve' methods, such as
"coalesce", "max", "min", "mean", "median" and "random":
consolidate(database = emperors, rows = "any", cols = "every", resolve = "max", key = "ID")
## # A tibble: 137 × 3
## ID Beg End
## <chr> <chr> <chr>
## 1 Augustus "-0031\u001a" 0014-08-19
## 2 Tiberius "0014-09-18" 0037-03-16
## 3 Caligula "0037-03-18" 0041-01-24
## 4 Claudius "0041-01-25" 0054-10-13
## 5 Nero "0054-10-13" 0068-06-09
## 6 Galba "0068-06-08" 0069-01-15
## 7 Otho "0069-01-15" 0069-04-16
## 8 Vitellius "0069-04-17" 0069-12-20
## 9 Vespasian "0069-12-21" 0079-06-24
## 10 Titus "0079-06-24" 0081-09-13
## # … with 127 more rows
consolidate(database = emperors, rows = "every", cols = "any", resolve = "min", key = "ID")
## # A tibble: 41 × 15
## ID CityBirth ProvinceBirth Rise Cause Killer Era Notes Verif Birth
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Augustus Rome Italia Birt… Assa… Wife Prin… birt… Redd… 0062…
## 2 Tiberius Rome Italia Birt… Assa… Other… Prin… birt… Redd… 0041…
## 3 Claudius Lugdunum Gallia Lugdu… Birt… Assa… Wife Prin… birt… Redd… 0009…
## 4 Nero Antitum Italia Birt… Suic… Senate Prin… <NA> Redd… 0037…
## 5 Galba Terracina Italia Seiz… Assa… Other… Prin… birt… Redd… 0002…
## 6 Otho Terentinum Italia Appo… Suic… Other… Prin… <NA> <NA> 0032…
## 7 Vespasian Falacrine Italia Seiz… Natu… Disea… Prin… <NA> <NA> 0009…
## 8 Titus Rome Italia Birt… Natu… Disea… Prin… <NA> <NA> 0039…
## 9 Domitian Rome Italia Birt… Assa… Court… Prin… <NA> <NA> 0051…
## 10 Nerva Narni Italia Appo… Natu… Disea… Prin… <NA> <NA> 0030…
## # … with 31 more rows, and 5 more variables: Death <chr>, FullName <chr>,
## # Dynasty <chr>, Beg <chr>, End <chr>
consolidate(database = emperors, rows = "every", cols = "every", resolve = "mean", key = "ID")
## # A tibble: 41 × 3
## ID Beg End
## <chr> <chr> <chr>
## 1 Augustus -0026-01-16 0014-08-19
## 2 Tiberius 0014-09-18 0037-03-16
## 3 Claudius 0041-01-25 0054-10-13
## 4 Nero 0054-10-13 0068-06-09
## 5 Galba 0068-06-08 0069-01-15
## 6 Otho 0069-01-15 0069-04-16
## 7 Vespasian 0069-12-21 0079-06-24
## 8 Titus 0079-06-24 0081-09-13
## 9 Domitian 0081-09-14 0096-09-18
## 10 Nerva 0096-09-18 0098-01-27
## # … with 31 more rows
consolidate(database = emperors, rows = "any", cols = "any", resolve = "median", key = "ID")
## # A tibble: 137 × 15
## ID CityBirth ProvinceBirth Rise Cause Killer Era Notes Verif Birth
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Augustus Rome Italia Birt… Assa… Wife Prin… birt… Redd… 0062…
## 2 Tiberius Rome Italia Birt… Assa… Other… Prin… birt… Redd… 0041…
## 3 Caligula Antitum Italia Birt… Assa… Senate Prin… assa… Redd… 0012…
## 4 Claudius Lugdunum Gallia Lugdu… Birt… Assa… Wife Prin… birt… Redd… 0009…
## 5 Nero Antitum Italia Birt… Suic… Senate Prin… <NA> Redd… 0037…
## 6 Galba Terracina Italia Seiz… Assa… Other… Prin… birt… Redd… 0002…
## 7 Otho Terentinum Italia Appo… Suic… Other… Prin… <NA> <NA> 0032…
## 8 Vitellius Rome Italia Seiz… Assa… Other… Prin… <NA> <NA> 0015…
## 9 Vespasian Falacrine Italia Seiz… Natu… Disea… Prin… <NA> <NA> 0009…
## 10 Titus Rome Italia Birt… Natu… Disea… Prin… <NA> <NA> 0039…
## # … with 127 more rows, and 5 more variables: Death <chr>, FullName <chr>,
## # Dynasty <chr>, Beg <chr>, End <chr>
consolidate(database = emperors, rows = "every", cols = "every", resolve = "random", key = "ID")
## # A tibble: 41 × 3
## ID Beg End
## <chr> <chr> <chr>
## 1 Augustus "-0031\u001a" "0014\u001a"
## 2 Tiberius "0014-09-18" "0037"
## 3 Claudius "0041" "0054"
## 4 Nero "0054" "0068"
## 5 Galba "0068" "0069-01-15"
## 6 Otho "0069-01-15" "0069-04"
## 7 Vespasian "0069" "0079-06-24"
## 8 Titus "0079" "0081"
## 9 Domitian "0081-09-14" "0096-09-18"
## 10 Nerva "0096" "0098-01-27"
## # … with 31 more rows
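To see what the different methods do to a single conflicting value, consider one unit whose start year is reported differently by three hypothetical sources. The sketch below uses plain numbers and base R; treating "coalesce" as "first non-missing value" and dropping missing values for the other methods are assumptions made for illustration, not statements about consolidate() itself:

```r
# One unit, three hypothetical sources; the first source has no value
reported <- c(source_a = NA, source_b = -27, source_c = -31)

resolved <- c(
  coalesce = unname(reported[!is.na(reported)][1]), # first non-missing value
  min      = min(reported, na.rm = TRUE),           # earliest value
  max      = max(reported, na.rm = TRUE),           # latest value
  mean     = mean(reported, na.rm = TRUE),          # average of the values
  median   = median(reported, na.rm = TRUE)         # middle value
)

print(resolved)
```

Each method settles the same disagreement differently, which is why the choice should follow from the research question.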
Users can even specify how conflicts for different variables should be ‘resolved’:
consolidate(database = emperors, rows = "any", cols = "every", resolve = c(Beg = "min", End = "max"), key = "ID")
## # A tibble: 137 × 3
## ID Beg End
## <chr> <chr> <chr>
## 1 Augustus -0026-01-16 0014-08-19
## 2 Tiberius -0014 0037-03-16
## 3 Caligula 0037-03-18 0041-01-24
## 4 Claudius 0041 0054-10-13
## 5 Nero 0054 0068-06-09
## 6 Galba 0068 0069-01-15
## 7 Otho 0069 0069-04-16
## 8 Vitellius 0069-04-17 0069-12-20
## 9 Vespasian 0069 0079-06-24
## 10 Titus 0079 0081-09-13
## # … with 127 more rows
Alternatively, users can “favour” a dataset in a database over others:
consolidate(database = favour(emperors, "UNRV"), rows = "every", cols = "any", resolve = "coalesce", key = "ID")
## # A tibble: 41 × 15
## ID FullName Birth Death CityBirth ProvinceBirth Rise Cause Killer Dynasty
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Augu… Gaius J… 63 BC 14 Rome Italia Birt… Assa… Wife Julio-…
## 2 Tibe… Tiberiu… 42 BC 37 Rome Italia Birt… Assa… Other… Julio-…
## 3 Clau… Tiberiu… 10 BC 41 Lugdunum Gallia Lugdu… Birt… Assa… Wife Julio-…
## 4 Nero Claudiu… 37 68 Antitum Italia Birt… Suic… Senate Julio-…
## 5 Galba Servius… 3 BC 69 Terracina Italia Seiz… Assa… Other… Year o…
## 6 Otho Marcus … 32 69 Terentin… Italia Appo… Suic… Other… Year o…
## 7 Vesp… Titus F… 9 79 Falacrine Italia Seiz… Natu… Disea… Year o…
## 8 Titus Titus F… 39 79 Rome Italia Birt… Natu… Disea… Flavian
## 9 Domi… Titus F… 51 96 Rome Italia Birt… Assa… Court… Flavian
## 10 Nerva Marcus … 30 98 Narni Italia Appo… Natu… Disea… Adopti…
## # … with 31 more rows, and 5 more variables: Era <chr>, Notes <chr>,
## # Verif <chr>, Beg <mdate>, End <mdate>
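The output above is consistent with values from the favoured dataset taking precedence under "coalesce". One way to picture this is to assume that favouring a dataset amounts to moving it to the front of the list, so that a first-non-missing rule reaches it first. That is an assumption for illustration rather than a description of favour(); the helpers `first_non_missing()` and `move_to_front()` and the object `toy_db` are hypothetical:

```r
# Toy database: two sources disagree on a label for the same unit
toy_db <- list(
  first  = data.frame(ID = "u1", Label = "from first", stringsAsFactors = FALSE),
  second = data.frame(ID = "u1", Label = "from second", stringsAsFactors = FALSE)
)

# Hypothetical helper: first non-missing Label, scanning datasets in list order
first_non_missing <- function(db) {
  values <- vapply(db, function(x) x$Label[1], character(1))
  unname(values[!is.na(values)][1])
}

# Hypothetical helper: reorder the list so the named dataset comes first
move_to_front <- function(db, name) {
  db[c(name, setdiff(names(db), name))]
}

default_pick <- first_non_missing(toy_db)
favoured_pick <- first_non_missing(move_to_front(toy_db, "second"))

print(default_pick)
print(favoured_pick)
```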
Users can even declare multiple key ID columns to consolidate a database or multiple datasets:
consolidate(database = emperors, rows = "any", cols = "any", resolve = c(Death = "max", Cause = "coalesce"),
key = c("ID", "Beg"))
## # A tibble: 201 × 4
## ID Beg Cause Death
## <chr> <mdate> <chr> <chr>
## 1 Augustus -0026-01-16 Assassination 0014-08-19
## 2 Tiberius 0014-09-18 Assassination 0037-03-16
## 3 Caligula 0037-03-18 Assassination 0041-01-24
## 4 Claudius 0041-01-25 Assassination 0054-10-13
## 5 Nero 0054-10-13 Suicide 0068-06-09
## 6 Galba 0068-06-08 Assassination 0069-01-15
## 7 Otho 0069-01-15 Suicide 0069-04-16
## 8 Vitellius 0069-04-17 Assassination 0069-12-20
## 9 Vespasian 0069-12-21 Natural Causes 0079-06-24
## 10 Titus 0079-06-24 Natural Causes 0081-09-13
## # … with 191 more rows
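Using more than one key column means that rows are matched only when all key values agree, which is why the result above has more rows than a consolidation by ID alone. A base R sketch of that effect with merge() on invented toy data (the objects `a` and `b` are assumptions for illustration, and merge() stands in for consolidate() here):

```r
# Two toy datasets keyed by both ID and Beg (illustrative values)
a <- data.frame(ID = c("x", "x", "y"), Beg = c(1, 5, 2),
                Cause = c("p", "q", "r"), stringsAsFactors = FALSE)
b <- data.frame(ID = c("x", "y", "y"), Beg = c(1, 2, 9),
                Death = c(10, 20, 30), stringsAsFactors = FALSE)

# Full join on the composite key: rows match only if ID and Beg both agree
by_id_beg <- merge(a, b, by = c("ID", "Beg"), all = TRUE)

print(by_id_beg)
print(nrow(by_id_beg))              # rows under the composite key
print(length(unique(by_id_beg$ID))) # distinct IDs
```

The toy result has more rows than distinct IDs because the same ID appears with different Beg values.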