After going through this tutorial, you should be able to create virtual populations from fertility and mortality rates, downloaded from the Human Mortality Database (HMD) and the Human Fertility Database (HFD) for any country and any calendar year included in the database. The population may consist of multiple generations. The multi-generation virtual population is the point of departure of further analysis, using methods that are common in studies of a real population.
The principle underlying the generation of a virtual population is a separation of data generation and data analysis. The principle is inspired by the practice in sample surveys. Stochastic simulation and sample surveys have much in common. Both are data-generating processes. Simulation relies on a probability model to generate data, while a sample survey uses observations. That similarity guided the design of \(VirtualPop\). Simulation is approached as sampling. The output of the simulation is a person file, comparable to the data structure commonly used in sample surveys.
The tutorial has seven sections. The first section describes how to install the package. The second is about downloading data from the HMD and HFD. The mortality and fertility rates are downloaded for a given country and for all years for which data are available. One year (reference year) is selected to produce the virtual population. Section three describes how to generate a virtual population. By default, fertility histories are generated from birth to death. Since censuses and sample surveys record fertility histories for segments of life, censoring schemes should be simulated too in order to ensure that demographic indicators computed for a virtual population are comparable to indicators computed from census or survey data. That is the subject of Section 4. An advantage of the separation of data generation and you do not need R to analyse the data. The export of the virtual population from R to STATA and SPSS is described in Section 5. Section 6 covers the conversion of decimal dates produced by \(VirtualPop\) to calendar dates. Section 7 summarizes the steps by illustrating the generation of a virtual population from mortality and fertility rates of Japan.
To install \(VirtualPop\) and its dependencies from CRAN, use
install.packages ("VirtualPop")
To install the package from GitHub, use
devtools::install_github("willekens/VirtualPop")
with \(devtools\) a package on CRAN. To install \(devtools\) type
install.packages("devtools")
To load and attach the package, use
library (VirtualPop)
Using the HMD and HFD to create a virtual population involves three steps:
Each step is a function of the package. The first, \(Getdata()\), retrieves the data from the HMD and HFD. The second, \(Getrates()\), produces the \(rates\) object. The third, \(GetGenerations()\), generates the virtual population.
The package \(VirtualPop\) has four vignettes:
To see the list of vignettes, type
utils::vignette (package="VirtualPop")
To read a vignette, e.g. \(Tutorial\), type
utils::vignette (topic="Tutorial",package="VirtualPop")
The data in the HMD and HFD are provided free of charge to registered users. At registration, basic contact information is obtained (i.e., name, e-mail address, affiliation, and title). To register, go to https://www.mortality.org and https://www.humanfertility.org . Upon acceptance of the agreement, you receive a password. The user name (e-mail address used at registration) and the password are required to download data. In this tutorial, it is assumed that you use the same e-mail address to register with the HMD and the HFD.
Three data files are downloaded. Two are part of the HMD. The first, \(fltper\_1x1\), contains the period life tables for females for several years (1933 to 2019 in case of the United States). The second, \(mltper\_1x1\), contains the period life tables for males. One file, \(mi.txt\), is retrieved from the HFD. The file contains the conditional fertility rates for several years. The test the validity of the simulation some additional data files are downloaded. They are used in Section 4 of this vignette and in the vignette Validation of the simulation.
A note on fertility rates is in order. Fertility rates by birth order are given different names, which contributes to the confusion. The term occurrence-exposure rate is the established term in survival analysis (ratio of occurrences to exposures during a discrete interval). The rates measure the childbearing intensity among women of a specific age and birth order (e.g., second births to woman of age x are related to women of age x of parity one and not to all women of age x). Occurrence-exposure rates differ from fertility rates by birth order commonly used in demography, which are ratios of number of children of birth order j born to women of a given age divided by all women of that age (in mid-year). In the occurrence-exposure rate, the denominator is the number of women of a given age in mid-year who are at risk of giving birth to a j-th child. These are women with j-1 children ever born. For a discussion of the fertility table from a statistical perspective, see Chiang and Van Den Berg (1982) and Li (2020).
The demographers Preston, Heuveline, and Guillot (2001, 3) and (Bongaarts and Sobotka 2012, 113) refer to occurrence-exposure rates as conditional fertility rates of the first kind. In biostatistics, transition rate is the general term and occurrence-exposure rate is reserved for the ratio of observed number of events (occurrences) and total exposure time (exposure) during a discrete interval (Aalen, Borgan, and Gjessing 2008, 215ff). In biomedical statistics occurrence-exposure rates are also referred to as incidence rates (Andersen et al. 2021, 6). The HFD uses the term conditional age- and order-specific fertility rates, in short conditional fertility rates to denote occurrence-exposure rates (Jasilioniene et al. 2015, 2016).
The conditional fertility rates are extracted from period fertility tables, which in the HFD are data files with the extension “mi”. Period fertility tables are increment-decrement life tables (multistate life tables) which model the process of childbearing in synthetic female cohorts, focusing on the age and parity dimension (Jasilioniene et al. 2016, 3).
The virtual population is generated for a country and calendar year (period data) selected by the user. Countries are denoted by short codes of three letters (except for Great Britain and New Zealand). To get the list of countries and the code, you may consult the website or install and load the package HMDHFDplus:
# install.packages ("HMDHFDplus")
library (HMDHFDplus)
countries <- HMDHFDplus::getHMDcountries()
To download the data, you may consider two options. The first is to use the package HMDHFDplus. The second is to download the data manually. Consider the first option. The HMDHFDplus package was developed by Tim Riffe and colleagues at the Max Planck Institute for Demographic Research in Rostock, Germany. The functions \(readHMDweb()\) and \(readHFDweb()\) of the package HMDHFDplus download the data. The user name and passwords must be provided before running the code in this tutorial. They should be stored in the following objects:
user <- "your email address"
pw_HMD <- "password for HMD"
pw_HFD <- "password for HFD"
The functions \(readHMDweb\), \(readHFDweb\) and \(GetData()\) require the objects as arguments (see further). The code chunks in this section are not executed because a user name and passwords are required.
Once you provided your username and passwords, you are ready to use the function \(GetData\) to download the data of a given country from the HMD and HFD. To download data of the United States, use:
data_raw <- GetData (country="USA",user,pw_HMD,pw_HFD)
\(data\_raw\) is a list object with five components:
The function \(GetData()\) uses \(readHMDweb()\) and \(readHFDweb()\). You may want to use these functions directly without calling \(GetData()\):
country <- "USA"
df <- HMDHFDplus::readHMDweb(CNTRY=country,item="fltper_1x1",username=user,password=pw_HMD,fixup=TRUE)
dm <- HMDHFDplus::readHMDweb(CNTRY=country,item="mltper_1x1",username=user,password=pw_HMD,fixup=TRUE)
fert_rates <- HMDHFDplus::readHFDweb(CNTRY=country,item="mi",username=user,password=pw_HFD,fixup=TRUE)
dpopHMD <- HMDHFDplus::readHMDweb(CNTRY=country,item="Population",username=user,password=pw_HMD,fixup=TRUE)
The functions download the selected data for all years. The number of years for which data are provided differ by country. Let \(country\) denote the country selected, e.g. USA. For the USA,the HMD provides data for 41 countries, while the HFD provides data for 33 countries. To list the number of calendar years covered by the HMD and HFD, use
yearsHMD <- unique(df$Year)
yearsHFD <- unique(fert_rates$Year)
If you are using Rstudio integrated development environment (IDE) and defined a project, the file is saved in the project directory. The project directory points to the folder where the \(.Rproj\) file is saved, which in Rstudio is called root folder¨. It differs from current working directory displayed by the RStudio IDE within the title region of the Console pane. The function \(getwd()\) called from the Rstudio console retrieves the directory listed within the title plane of the Console pane. The reason for the difference is that R defines an absolute file path (set by \(setwd()\)) and Rstudio defines a relative file path. The relative path is not affected by a migration of \(.Rproj\) to another location, but the absolute path is.
The second option is to download \(fltper\_1x1\), \(mltper\_1x1\) and \(Population\) manually from the HMD website and \(mi\) from the HFD website. The URLs of the files for the United State are:
and
For the fertility data, select “Conditional age-specific fertility rates”.
To read the data, remove the first three lines in \(mltper\_1x1.txt\) and \(mltper\_1x1.txt\) and the first two lines in \(USAmi.txt\) and read the data:
dm <- read.table(paste(path,"mltper_1x1.txt",sep=""))
df <- read.table(paste(path,"fltper_1x1.txt",sep=""))
fert_rates <- read.table(paste(path,"USAmi.txt",sep=""))
To save the raw data in a single object, called \(data\_raw\),use:
data_raw <- list (country=country,LTf=df,LTm=dm,fert_rates=fert_rates,dpopHMD=dpopHMD)
attr(data_raw,"country") <- country
An object attribute is added to identify the country.
To save the data as an R data file under a name that refers to the country at a desired location, specify the path to the location (folder):
save (data_raw,file=paste(path,"data_raw",country,".RData",sep=""))
If the desired location is current working directory, .
The USA mortality data are available from 1933 to 2020, the fertility data from 1963 to 2019, and the population data from 1933 to 2021. The data included in \(VirtualPop\) are USA data of 2019.
The function \(Getrates()\) retrieves the fertility and mortality rates from \(data_raw\) for the selected reference year. In this tutorial 2019 data are used.
rates <- VirtualPop::GetRates (data=data_raw,refyear=2019)
The code chunk is not executed because the chunks in the previous section were not executed. Instead, the object \(rates\), which is included ih \(VirtualPop\), is taken from the package: The data object \(rates\), with 2019 rates for the Unites States is included in \(VirtualPop\). To load the rates, use
library (VirtualPop)
rates <- NULL
utils::data(rates)
\(rates\) is a list object with three components:
The third component contains the transition rate matrix by age (matrix \(mu(x)\) ), a format required for multistate modelling (see vignette Simulation of life histories). The list object has two attributes, country and calendar year.
To see the content of \(rates\), type \(str(rates)\).
The function \(GetGenerations()\) generates a virtual population from death rates by age and sex and fertility rates by age and birth order. The user specifies the population size of the initial cohort (\(ncohort\)) and the desired number of generations (\(ngen\)). The simulation of fertility careers accounts for mortality. In the tutorial, the initial cohort is 1,000. A cohort of 10,000 is better to assess the validity of the simulation than a cohort of 1,000, say. In the latter, the effect of random variation is large.
dataLH <- VirtualPop::GetGenerations (rates,ncohort=1000,ngen=4)
The object \(dataLH\) includes data on all individuals in the virtual population, including children, grandchildren and great-grandchildren. For each individual in the virtual population, the following data are included:
Note that the mothers and fathers of individuals in generation 1, which is the initial population, are omitted since they are not part of the virtual population.
The object \(dataLH\) represents the main result of the \(VirtualPop\) package. The package \(VirtualPop\) includes data of the virtual population generated from the mortality and fertility rates of the United States for 2019 and an initial cohort of 10,000. The virtual population consists of 29,954 individuals. To load the data, use
utils::data(dataLH)
It is good practice to save the virtual population as an R data file:
save (dataLH,file=paste(path,"dataLH",country,".RData",sep=""))
The data are saved in the current working directory under the name \(dataLHUSA.RData\). To retrieve the data, use
load(file=paste(path,"dataLH",country,".RData",sep=""))
The name of the retrieved object is \(dataLH\).
To remove the effect of mortality on the fertility career, ages at death are set at ages beyond the reproductive career, e.g. 85. The user may change the value variable \(x\_D\) in \(dataLH\) manually or by specifying the argument \(iages\) in the \(GetGenerations()\) function:
ncohort <- 1000
dLH_nomort <- VirtualPop::GetGenerations (rates,ncohort=ncohort,ngen=4,iage=85)
Death interrupts the fertility career. Other modes of interruption or censoring may be considered. The age at censoring may differ for each individual. For instance, to assess the validity of the simulation and to compare fertility indicators of the virtual population with those of a real population, the simulation window must coincide with the observation window (see vignette Validity of the simulation). Suppose the fertility indicators of a real population are obtained from data collected in a retrospective life-history survey of a cross-section of a population. At survey date (and censoring of the observation), individuals in the sample survey have different ages. To assess the validity of the simulation, the ages at censoring in the sample survey should be imposed onto the virtual population. To impose the age structure of the sample population onto the virtual population, the argument \(age\_end\_perc\) of the \(GetGenerations()\) function is specified. The argument is an age distribution (single years of age, by sex). Is the argument specified, then \(GetGenerations()\) assigns to each individual in the virtual population an age at censoring such that the age distribution at censoring in the virtual population is the same as the age distribution at censoring in the sample survey. Mortality is disregarded because respondents interviewed at survey date are survivors. Consider the age distribution of the female population in the United States in 2019 reported in the HMD (\(Population.txt\) file):
dpopus <- dpopHMD[dpopHMD$Year==2019,c("Female1","Male1")]
dimnames(dpopus) <- list (Age=0:110,Sex=c("Females","Males"))
The code chunk requires dpopHMD, which is downloaded from the HMD. The code is not executed. Instead, the data object \(dpopus\), which is distributed with %VirtualPop$, is used:
utils::data(dpopus)
The age distribution of the population 15+ is
data(dpopus)
z <- dpopus[16:nrow(dpopus),]
age_end_perc <- apply(z,2,function(x) x/sum(x))
To simulate the past fertility careers of a population with the ages structure of the female population (15+) of the United States in 2019, use
xx <- VirtualPop::GetGenerations (rates,ncohort=1000,ngen=1,age_end_perc =age_end_perc)
# Age distribution of the virtual population at censoring time
n <- table (trunc(xx$x_D),xx$sex)
nperc <- apply(n,2,function(x) x/sum(x))
age_end_perc <- NULL
In the function \(Children()\), called by \(GetGenerations()\), an individual’s age at censoring is transmitted to the function \(Sim\_bio()\) as part of the argument \(popsim\) (see the vignette Simulation of life histories).
The virtual population generated by \(GetGenerations()\) is stored in a data frame, which is a rectangular data structure. The data frame may be exported from R to a STATA or SPSS binary data format and saved at a desired location:
library(foreign)
path <- "" # or different path
foreign::write.dta(dataLH, paste (path,"dataLH.dta",sep=""))
with \(path\) the path of the directory in which the STATA file should be located. It the path is omitted, the file is saved in the working directory. The results of the simulation may also be saved as an SPSS file:
foreign::write.foreign(dataLH,
paste (path,"dataLH.txt",sep=""),
paste (path,"dataLH.sps",sep=""), package="SPSS")
The function creates a text file and an SPSS program to read it. The functions \(write.dta\) and \(write.foreign\) are functions of the package \(foreign\). An alternative is to use the \(write\_tda\) and \(write\_sav\) functions of the haven package, which is part of the tidyverse collection of R packages.
Note that decimal dates can be converted quite easily to calendar date format. The Comprehensive R Archive Network (CRAN) has several packages with functions to perform operations on dates. The package lubridate is one of them. It is part of the tidyverse collection of R packages. The \(date\_decimal\) function of the lubridate package converts a decimal date into a calendar date. R’s format function is used to obtain a calendar date in a desired format. The call function is
format(lubridate::date_decimal(2019.409), "%Y-%m-%d")
#> [1] "2019-05-30"
The calendar date is May 30th 2019. For a discussion of dates in demographic research, see Willekens (2013).
To illustrate how easy it is to generate a virtual population for another country or another year, the mortality rates and fertility rates of Japan are used. The mortality rates are available from 1947 to 2020 and the conditional fertility rates from 1998 to 2020. In the application, data for three years are used. In the virtual population, 1,000 individuals experience the mortality and conditional fertility rates of 1998, 1,000 the rates of 2010 and 1,000 the rates of 2020. The code is (requires user name and password):
country <- "JPN"
refyear2 <- c(1998,2010,2020)
ncohort <- 1000
dLH_Japan <- NULL
df <- HMDHFDplus::readHMDweb(CNTRY=country,item="fltper_1x1",username=user,password=pw_HMD,fixup=TRUE)
dm <- HMDHFDplus::readHMDweb(CNTRY=country,item="mltper_1x1",username=user,password=pw_HMD,fixup=TRUE)
fert_rates <- HMDHFDplus::readHFDweb(CNTRY=country,item="mi",username=user,password=pw_HFD,fixup=TRUE)
data_raw <- list (country=country,LTf=df,LTm=dm,fert_rates=fert_rates)
attr(data_raw,"country") <- country
for (iy in 1:3)
{ rates <- VirtualPop::GetRates (data=data_raw,refyear=refyear2[iy])
dJapan <- VirtualPop::GetGenerations (rates,ncohort=ncohort,ngen=4)
if (iy >1) ID1 <- dLH_Japan + 1
dLH_Japan <- rbind (dLH_Japan,dJapan)
}
The initial population (generation 1) consists of 3,000 individuals. The sex composition, by year of birth, is
z <- addmargins (table (dLH_Japan$sex[dLH_Japan$gen==1],trunc(dLH_Japan$bdated)[dLH_Japan$gen==1]))
z
The numbers of children, grandchildren and great-grandchildren, provided the fertility and mortality regimes of Japan in a reference year apply to all virtual individuals born in that reference year and their offspring, are:
# Children
sum(dLH_Japan$nch[dLH_Japan$gen==1 & dLH_Japan$sex=="Female"]
# Grandchildren
sum(dLH_Japan$nch[dLH_Japan$gen==2 & dLH_Japan$sex=="Female"])
# Great-grandchildren
sum(dLH_Japan$nch[dLH_Japan$gen==3 & dLH_Japan$sex=="Female"])
The virtual population facilitates the analysis of grandparenthood, families and kinship networks.