CDISC standards specify how data set metadata is submitted, through a document known as define.xml. The define provides a great deal of useful information that is machine readable and can also be viewed in a web browser. While many organizations wait to produce a define until the datasets are finalized, it can still be advantageous to read metadata directly from a define. For this purpose, we developed readers that go directly from a define.xml to a metacore object.
To do this, we’ve built a separate reader function for each of the metacore tables. For more information on the structure of the metacore tables, check out the README.
We start by reading the define from disk using the xmlTreeParse() function from the XML package.
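The reading step can be sketched as follows. This is a minimal illustration, not the package's own example: the temporary file and its two-element content are hypothetical stand-ins for a real define.xml, so the resulting doc is far too small to feed to the metacore readers. In practice, point the path at your own define.xml.

```r
library(XML)

# Hypothetical stand-in for a real define.xml, written to a temporary file
# so that this sketch runs on its own
define_path <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<ODM>',
  '  <Study OID="STUDY1"/>',
  '</ODM>'
), define_path)

# useInternalNodes = TRUE returns an internal document that can be
# traversed and queried without converting it to R lists first
doc <- xmlTreeParse(define_path, useInternalNodes = TRUE)

# A define.xml has ODM as its root element
root_name <- xmlName(xmlRoot(doc))
root_name
#> [1] "ODM"
```

Checking the root element name is a cheap sanity check that the file parsed and is the kind of document you expect before handing it to the readers.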
Next, we use the metacore readers for each of the separate tables necessary for a metacore object.
ds_spec2 <- xml_to_ds_spec(doc)
ds_vars <- xml_to_ds_vars(doc)
var_spec <- xml_to_var_spec(doc)
value_spec <- xml_to_value_spec(doc)
code_list <- xml_to_codelist(doc)
derivations <- xml_to_derivations(doc)
Great! Now we’re ready to create our metacore object.
test <- metacore(ds_spec2, ds_vars, var_spec, value_spec, derivations, code_list)
#> Warning: core from the ds_vars table only contain missing values.
#>
#> supp_flag from the ds_vars table only contain missing values.
#>
#> common from the var_spec table only contain missing values.
#>
#> dataset from the supp table only contain missing values.
#>
#> variable from the supp table only contain missing values.
#>
#> idvar from the supp table only contain missing values.
#>
#> qeval from the supp table only contain missing values.
#> Warning: The following derivations are never used:
#> MT.SUPPAE.QVAL: see value level metadata
#> MT.SUPPDM.QVAL: see value level metadata
#> Warning: The following codelist(s) are never used:
#> DRUG DICTIONARY
#> MEDICAL HISTORY DICTIONARY
#>
#> Metadata successfully imported
Something to note about a metacore object is that it holds all of the metadata from your source, be it your specification, define.xml, database, etc. In your program, it’s likely that you’ll want to keep only the metadata relevant to the dataset you’re currently programming. We’ve made this process easy, with functions that filter the metadata down to the information relevant to a specific dataset.
# a metacore object that keeps every table, filtered to DM
subset <- test %>% select_dataset("DM")
subset$ds_spec
#> # A tibble: 1 × 3
#> dataset structure label
#> <chr> <chr> <chr>
#> 1 DM One record per subject Demographics
# a simplified dataframe
subset_t <- test %>% select_dataset("DM", simplify = TRUE)
As shown above, the metacore object can be filtered directly. Alternatively, with the simplify = TRUE argument, a single simplified data frame is returned.
subset_t
#> # A tibble: 25 × 22
#> dataset variable key_seq order keep core supp_flag type length label
#> <chr> <chr> <int> <int> <lgl> <chr> <lgl> <chr> <int> <chr>
#> 1 DM STUDYID 1 1 TRUE <NA> NA text 12 Study I…
#> 2 DM DOMAIN NA 2 TRUE <NA> NA text 2 Domain …
#> 3 DM USUBJID 2 3 TRUE <NA> NA text 11 Unique …
#> 4 DM SUBJID NA 4 TRUE <NA> NA text 4 Subject…
#> 5 DM RFSTDTC NA 5 FALSE <NA> NA date 10 Subject…
#> 6 DM RFENDTC NA 6 FALSE <NA> NA date 10 Subject…
#> 7 DM RFXSTDTC NA 7 FALSE <NA> NA datetime 20 Date/Ti…
#> 8 DM RFXENDTC NA 8 FALSE <NA> NA datetime 20 Date/Ti…
#> 9 DM RFICDTC NA 9 FALSE <NA> NA datetime 20 Date/Ti…
#> 10 DM RFPENDTC NA 10 FALSE <NA> NA datetime 20 Date/Ti…
#> # … with 15 more rows, and 12 more variables: format <chr>, common <lgl>,
#> # sig_dig <int>, origin <chr>, code_id <chr>, where <chr>,
#> # derivation_id <chr>, derivation <chr>, name <chr>, codes <list>,
#> # idvar <chr>, qeval <chr>
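Because the simplified result is an ordinary tibble, it can be queried with the usual dplyr verbs. The sketch below pulls the key variables in key order. To stay self-contained it uses dm_meta, a hand-built stand-in (a hypothetical name) that copies a few columns from the first four rows printed above. With the real object you would start from subset_t instead.

```r
library(dplyr)
library(tibble)

# Hand-built stand-in for the first four rows of subset_t shown above
dm_meta <- tribble(
  ~dataset, ~variable, ~key_seq,    ~order, ~keep,
  "DM",     "STUDYID", 1L,          1L,     TRUE,
  "DM",     "DOMAIN",  NA_integer_, 2L,     TRUE,
  "DM",     "USUBJID", 2L,          3L,     TRUE,
  "DM",     "SUBJID",  NA_integer_, 4L,     TRUE
)

# Keep only variables that take part in the key, ordered by key sequence
key_vars <- dm_meta %>%
  filter(!is.na(key_seq)) %>%
  arrange(key_seq) %>%
  pull(variable)

key_vars
#> [1] "STUDYID" "USUBJID"
```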