Tutorial: create_analytic_cohort

Introduction

The create_analytic_cohort() function allows the user to create a cohort from the GENIE BPC data based on cancer diagnosis information such as cancer cohort, treating institution, histology, and stage at diagnosis, as well as cancer-directed regimen information including regimen name and regimen order. This function returns a list of data frames containing each of the clinical and genomic data files for patients that met criteria for the analytic cohort.

This vignette will walk a user through the create_analytic_cohort() function.

Setup

Before going through the tutorial, install and load the {genieBPC} library and log into Synapse using the set_synapse_credentials() function. For more information on set_synapse_credentials(), refer to the Tutorial: pull_data_synapse() vignette.

library(genieBPC)

set_synapse_credentials()

Modifying Function Arguments

The create_analytic_cohort() function includes several input parameters:

Argument	Description
`data_synapse`	The item from the nested list returned from pull_data_synapse() corresponding to the cancer cohort of interest.
`index_ca_seq`	Index cancer sequence. Default is 1, indicating the patient's first index cancer. The index cancer is also referred to as the BPC Project cancer in the GENIE BPC Analytic Data Guide; this is the cancer that met the eligibility criteria for the project and was selected at random for PRISSMM phenomic data curation. Specifying multiple index cancer sequences, e.g. index_ca_seq = c(1, 2), returns the single index cancer for patients with one index cancer and both the first and second index cancers for patients with more than one.
`institution`	GENIE BPC participating institution. Must be one of 'DFCI', 'MSK', 'UHN', or 'VICC' for NSCLC cohorts; must be one of 'DFCI', 'MSK', or 'VICC' for CRC and BrCa cohorts. Default selection is all institutions. This parameter corresponds to the variable `institution` in the Analytic Data Guide.
`stage_dx`	Stage at diagnosis. Must be one of 'Stage I', 'Stage II', 'Stage III', 'Stage I-III NOS', 'Stage IV'. Default selection is all stages. This parameter corresponds to the variable `stage_dx` in the Analytic Data Guide.
`histology`	Cancer histology. For all cancer cohorts except for BrCa (breast cancer), this parameter corresponds to the variable `ca_hist_adeno_squamous` and must be one of 'Adenocarcinoma', 'Squamous cell', 'Sarcoma', 'Small cell carcinoma', 'Carcinoma', 'Other histologies/mixed tumor'. For BrCa, this parameter corresponds to the variable `ca_hist_brca` and must be one of 'Invasive lobular carcinoma', 'Invasive ductal carcinoma', 'Other histology'. Default selection is all histologies.
`regimen_drugs`	Vector of cancer-directed regimens, with the names of the drugs in each regimen separated by commas. For example, to specify a regimen consisting of Carboplatin and Pemetrexed, specify regimen_drugs = 'Carboplatin, Pemetrexed'. Acceptable values are found in the `drug_regimen_list` dataset provided with this package. This parameter corresponds to the variable `regimen_drugs` in the Analytic Data Guide.
`regimen_type`	Indicates whether the regimen(s) specified in `regimen_drugs` give the exact regimen to return, or whether any regimen containing the drugs listed in `regimen_drugs` should be returned. Must be one of 'Exact' or 'Containing'. The default is 'Exact'.
`regimen_order`	Order of cancer-directed regimen. If multiple drug regimens are specified, `regimen_order` applies to all of them; different values of `regimen_order` cannot be specified for different drug regimens. If multiple values are specified, e.g. c(1, 2), then drug regimens that meet either order criterion are returned.
`regimen_order_type`	Specifies whether the `regimen_order` parameter refers to the order of receipt of the drug regimen within the cancer diagnosis (across all other drug regimens; 'within cancer') or the order of receipt among all the times that the same drug regimen was administered (e.g. the first time carboplatin pemetrexed was received, out of all times that the patient received carboplatin pemetrexed; 'within regimen'). Acceptable values are 'within cancer' and 'within regimen'.
`return_summary`	Specifies whether a summary table for the cohort is returned. Default is FALSE. The `gtsummary` package is required to return a summary table.
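
The difference between the two `regimen_type` values can be illustrated with a small base R sketch. This is not the package's implementation: the regimen strings are made up, `split_drugs()` is a hypothetical helper, and a single-drug target is used on purpose so that the sketch makes no assumption about how multi-drug strings are matched under 'Containing'.

```r
# Toy regimen strings (made up for illustration), one per regimen record.
regimens <- c(
  "Pemetrexed",
  "Carboplatin, Pemetrexed",
  "Bevacizumab, Carboplatin, Pemetrexed",
  "Carboplatin, Paclitaxel"
)

# Drug of interest, written the way it would be passed to `regimen_drugs`.
target <- "Pemetrexed"

# Hypothetical helper: split a comma-separated regimen into its drug names.
split_drugs <- function(x) trimws(strsplit(x, ",", fixed = TRUE)[[1]])

# "Exact": the regimen is the requested drug and nothing else.
is_exact <- regimens == target

# "Containing": the requested drug appears in the regimen; other drugs are allowed.
is_containing <- vapply(
  regimens,
  function(r) target %in% split_drugs(r),
  logical(1),
  USE.NAMES = FALSE
)

regimens[is_exact]       # only the single-agent regimen
regimens[is_containing]  # the first three regimens
```

In this toy data, 'Exact' keeps one regimen while 'Containing' keeps three, which is the kind of difference in cohort size to expect when switching `regimen_type`.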

This function returns a list of data frames containing clinical and next generation sequencing information for patients that met the specified criteria. Optionally, if return_summary = TRUE, the list also includes summary tables for the number of records per dataset (‘tbl_overall_summary’) as well as tables of key cancer diagnosis (‘tbl_cohort’), cancer-directed regimen (‘tbl_drugs’) and next generation sequencing (‘tbl_ngs’) variables.

Example using package test data

Create an analytic cohort using the package test data.

create_analytic_cohort(data_synapse = genieBPC::nsclc_test_data)
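
Because the function returns a list, it is usually more convenient to store the result and inspect it. A minimal sketch, assuming {genieBPC} is installed; the object name `test_cohort` is arbitrary and chosen here for illustration:

```r
library(genieBPC)

# Same call as above, but keep the returned list so it can be inspected.
test_cohort <- create_analytic_cohort(data_synapse = genieBPC::nsclc_test_data)

# Names of the data frames in the returned list.
names(test_cohort)

# Number of rows and columns in the cancer diagnosis data frame.
dim(test_cohort$cohort_ca_dx)
```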

Examples using data from `pull_data_synapse()`

Examples 2-4 create analytic datasets based on the GENIE BPC NSCLC cohort. To begin, the user must first pull the NSCLC data from Synapse and save it to the local environment, as shown below using the pull_data_synapse() function:

nsclc_2_0 <- pull_data_synapse("NSCLC", version = "v2.0-public")

Example 2

Create a cohort of all patients with stage IV NSCLC adenocarcinoma and also return all of their corresponding cancer-directed drugs.

nsclc_stg_iv_adeno <- create_analytic_cohort(data_synapse = nsclc_2_0$NSCLC_v2.0, 
                                             stage_dx = "Stage IV", 
                                             histology = "Adenocarcinoma")

ls(nsclc_stg_iv_adeno)
#>  [1] "cohort_ca_drugs"           "cohort_ca_dx"             
#>  [3] "cohort_ca_dx_non_index"    "cohort_cna"               
#>  [5] "cohort_fusions"            "cohort_mutations_extended"
#>  [7] "cohort_ngs"                "cohort_prissmm_imaging"   
#>  [9] "cohort_prissmm_md"         "cohort_prissmm_pathology" 
#> [11] "cohort_pt_char"
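
One way to confirm that the filters behaved as intended is to tabulate the filtered variables in the returned diagnosis data frame. The sketch below applies the same arguments to the packaged test data so that it runs without Synapse access; `test_stg_iv_adeno` is an illustrative name, and the column names are the ones listed for `stage_dx` and `histology` in the argument table.

```r
library(genieBPC)

# Same filters as Example 2, applied to the package test data.
test_stg_iv_adeno <- create_analytic_cohort(
  data_synapse = genieBPC::nsclc_test_data,
  stage_dx = "Stage IV",
  histology = "Adenocarcinoma"
)

# Cross-tabulate stage and histology among the retained diagnoses;
# only the requested combination should have a non-zero count.
table(
  test_stg_iv_adeno$cohort_ca_dx$stage_dx,
  test_stg_iv_adeno$cohort_ca_dx$ca_hist_adeno_squamous,
  useNA = "ifany"
)
```

The same check can be run on `nsclc_stg_iv_adeno$cohort_ca_dx` once the Synapse data have been pulled.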

Example 3

Create a cohort of all NSCLC patients who received either Cisplatin, Pemetrexed Disodium or Cisplatin, Etoposide as the first drug regimen for their first index NSCLC.

create_analytic_cohort(data_synapse = nsclc_2_0$NSCLC_v2.0,
                       regimen_drugs = c("Cisplatin, Pemetrexed Disodium", "Cisplatin, Etoposide"),
                       regimen_order = 1,
                       regimen_order_type = "within cancer")
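
To make the two `regimen_order_type` options concrete, here is a toy base R sketch for a single patient. It only illustrates the two ways of counting described in the argument table and is not how the package computes them; the data and the columns `order_within_cancer` and `order_within_regimen` are invented for this example.

```r
# Toy regimen history for one patient and one cancer diagnosis (made-up values).
# Rows are already sorted by the order in which the regimens were received.
regimens <- data.frame(
  record_id = "patient_1",
  regimen_drugs = c(
    "Carboplatin, Paclitaxel",
    "Cisplatin, Pemetrexed Disodium",
    "Nivolumab",
    "Cisplatin, Pemetrexed Disodium"
  ),
  stringsAsFactors = FALSE
)

# "within cancer": position among all regimens received for the diagnosis.
regimens$order_within_cancer <- ave(
  seq_len(nrow(regimens)), regimens$record_id,
  FUN = seq_along
)

# "within regimen": position among the times this same regimen was received.
regimens$order_within_regimen <- ave(
  seq_len(nrow(regimens)), regimens$record_id, regimens$regimen_drugs,
  FUN = seq_along
)

target <- "Cisplatin, Pemetrexed Disodium"

# Order 1 "within cancer": no rows, because the first regimen was a different one.
first_within_cancer <- subset(
  regimens, regimen_drugs == target & order_within_cancer == 1
)

# Order 1 "within regimen": the first time the target regimen was given (row 2).
first_within_regimen <- subset(
  regimens, regimen_drugs == target & order_within_regimen == 1
)

nrow(first_within_cancer)
first_within_regimen
```

For this patient, a 'within cancer' request for order 1 would exclude them, whereas a 'within regimen' request for order 1 would keep their second regimen overall.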

Example 4

Create a cohort of all NSCLC patients who received Cisplatin, Pemetrexed Disodium at any time during the course of treatment for their cancer diagnosis. If a patient received the regimen more than once, select only the first time it was received.

nsclc_cisplatin_pem_any <- create_analytic_cohort(data_synapse = nsclc_2_0$NSCLC_v2.0,
                                                  regimen_drugs = c("Cisplatin, Pemetrexed Disodium"),
                                                  regimen_order = 1,
                                                  regimen_order_type = "within regimen",
                                                  return_summary = TRUE)

nsclc_cisplatin_pem_any$tbl_overall_summary

Characteristic	N = 182 patients¹
Number of diagnoses per patient in cohort_ca_dx data frame
1	182 (100%)
Number of regimens per patient in cohort_ca_drugs data frame
1	182 (100%)
Number of CPTs per patient in cohort_ngs data frame
1	171 (94%)
2	10 (5.5%)
4	1 (0.5%)
¹ n (%)

nsclc_cisplatin_pem_any$tbl_cohort

Characteristic	N = 182 Diagnoses¹
Cohort (cohort)
NSCLC	182 (100%)
Institution (institution)
DFCI	71 (39%)
MSK	92 (51%)
UHN	7 (3.8%)
VICC	12 (6.6%)
Stage at diagnosis (stage_dx)
Stage I	19 (10%)
Stage I-III NOS	1 (0.5%)
Stage II	42 (23%)
Stage III	71 (39%)
Stage IV	49 (27%)
Histology (ca_hist_adeno_squamous)
	23 (13%)
Adenocarcinoma	142 (78%)
Carcinoma	14 (7.7%)
Other histologies/mixed tumor	2 (1.1%)
Squamous cell	1 (0.5%)
¹ n (%)

nsclc_cisplatin_pem_any$tbl_drugs

Characteristic	N = 182 Regimens¹
Cohort (cohort)
NSCLC	182 (100%)
Institution (institution)
DFCI	71 (39%)
MSK	92 (51%)
UHN	7 (3.8%)
VICC	12 (6.6%)
Drugs in regimen (regimen_drugs)
Cisplatin, Pemetrexed Disodium	182 (100%)
¹ n (%)

nsclc_cisplatin_pem_any$tbl_ngs

Characteristic	N = 195 Cancer Panel Tests¹
Cohort (cohort)
NSCLC	195 (100%)
Institution (institution)
DFCI	73 (37%)
MSK	103 (53%)
UHN	7 (3.6%)
VICC	12 (6.2%)
OncoTree code (cpt_oncotree_code)
LCLC	1 (0.5%)
LUAD	187 (96%)
LUPC	1 (0.5%)
LUSC	3 (1.5%)
NSCLCPD	3 (1.5%)
Sequence assay ID (cpt_seq_assay_id)
DFCI-ONCOPANEL-1	1 (0.5%)
DFCI-ONCOPANEL-2	40 (21%)
DFCI-ONCOPANEL-3	32 (16%)
MSK-IMPACT341	1 (0.5%)
MSK-IMPACT410	51 (26%)
MSK-IMPACT468	51 (26%)
UHN-48-V1	5 (2.6%)
UHN-50-V2	2 (1.0%)
VICC-01-SOLIDTUMOR	4 (2.1%)
VICC-01-T7	8 (4.1%)
¹ n (%)
