ccoptimalmatch
packageR software will be used throughout this vignette. The R statistical software is freely available and you may download and install it for Windows, Mac, and Linux systems from: https://www.r-project.org.
R is a user-friendly platform. For example, you can type in the console:
ccoptimalmatch
packageYou can install the optccmatch package directly from R:
ccoptimalmatch
packageAfter you install the ccoptimalmatch
package, you can load it to use its functions using the library function: Note that you have to load the R package that you need to use each time you start a new R session.
We have two data-sets in this package, namely the “not_processed” and the “being_processed” data-set. The “not_processed” data-set, as the name reveals, is a raw data-set containing the cases, controls, patient_ID and other relevant variables. After different pre-processing steps, we end up to the “being_processed” data, which is the one to use in the algorithm. To see the first 6 rows of the “being_processed” data, enter:
head(being_processed)
#> cluster_case Patient_Id case_control case_ind JCG entry_year CI age_diff
#> 1 case_1 Patient_608 case 1 2011 2009 1 0
#> 2 case_1 Patient_4047 control 0 2011 2009 2 1
#> 3 case_10 Patient_6326 case 1 2013 2010 1 0
#> 4 case_10 Patient_86161 control 0 2013 2010 0 0
#> 5 case_10 Patient_163561 control 0 2013 2010 0 0
#> 6 case_10 Patient_19087 control 0 2013 2010 0 1
#> fup_diff total_control_per_case freq_of_controls
#> 1 0 1 1
#> 2 0 1 1
#> 3 0 4 1
#> 4 0 4 1
#> 5 0 4 1
#> 6 0 4 1
If you wish to investigate the data-set and its attributes, then use the help function:
You can directly access a variable within this data frame as follows:
For example, let us tabulate this variable and investigate the number of cases and controls:
“case_control” is a dummy variable indicating whether the patient is a case or a control. There are 202 cases and 10,462 available controls to pool.
To see the first 6 rows of the “not_processed” data:
head(not_processed)
#> Patient_Id JCG Birth_Year Practice_Id Gender case_control entry_year foll_up
#> 1 Patient_16 2014 1977 Practice_1 F control 2014 0
#> 2 Patient_16 2015 1977 Practice_1 F control 2014 1
#> 3 Patient_18 2014 1996 Practice_2 M control 2014 0
#> 4 Patient_19 2010 1963 Practice_3 F control 2000 10
#> 5 Patient_19 2011 1963 Practice_3 F control 2000 11
#> 6 Patient_19 2012 1963 Practice_3 F control 2000 12
#> CI
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 1
#> 6 1
Let us tabulate the “case_control” variable and investigate the number of cases and controls in the “not_processed” data-set this time:
You can observe that we have 202 cases and 45,817 available controls to pool. Controls might be duplicated since a control could appear up to 6 times, from 2010 until 2015. The unique controls to pool are 13,491.
The following steps are necessary to pre-process the “not_processed” data-set in a format that can be used by our algorithm:
We start by defining the subsets. In order to define the subsets, we filter by the “cases”, take the distinct combination of exact variables (Gender, JCG and Practice_Id), and create the new variable “subset”. Finally, we select only 4 relevant variables (Gender, JCG, Practice_Id, subset):
create_subset <- not_processed %>%
filter(case_control =="case") %>%
arrange(Practice_Id, Gender, JCG) %>%
distinct(Gender, JCG, Practice_Id, .keep_all = TRUE) %>%
mutate(subset = 1:n()) %>%
select(Gender, JCG, Practice_Id, subset)
There were created n (=26) subsets, where a subset is defined as a factorial combination of the exact variables. For example, subset 1 contains females that visited practice 1 in year 2010, subset 2 contains females that visited practice 1 in year 2011, subset 3 contains females that visited practice 1 in year 2012 up to subset n, which is the last factorial combination of the exact variables:
head(create_subset)
#> Gender JCG Practice_Id subset
#> 1 F 2010 Practice_1 1
#> 2 F 2011 Practice_1 2
#> 3 F 2012 Practice_1 3
#> 4 F 2013 Practice_1 4
#> 5 F 2014 Practice_1 5
#> 6 F 2015 Practice_1 6
We merge the data that contains the “subset” variable with the data that contains the cases only:
case_with_subset <- not_processed %>%
filter(case_control =="case") %>%
full_join(create_subset, by = c("Gender", "JCG", "Practice_Id"))
We merge the data that contains the “subset” variable with the data that contains the controls only:
control_with_subset <- not_processed %>%
filter(case_control =="control") %>%
right_join(create_subset, by = c("Gender", "JCG", "Practice_Id"))
Finally we bind the cases and the controls, which will have now the new variable “subset”:
Let us tabulate the “case_control” variable again:
As we observe, the number of controls have decreased to 36,518 and the unique controls to 12,643. The gain from exact matching is that by shifting the analysis from one big data-set to several small sub-sets, the computational burden decreases substantially. There were 13,491-12,643 = 848 controls that couldn’t be matched to any of the cases, thus are excluded.
Firstly, we split the data-set in cases and controls and create a variable “cluster_case” to depict the cases separately. The “cluster_case” variable will have as many levels as the total number of cases, i.e. 202 in our example. For that purpose, the “cluster_case” will be empty in the controls data-set but have the names of the cases in the cases data-set:
bdd_controls <- not_processed[not_processed$case_control=="control",]
bdd_controls$cluster_case <- 0
bdd_cases <- not_processed[not_processed$case_control=="case",]
bdd_cases$cluster_case <- paste("case",1:nrow(bdd_cases),sep = "_")
Next, we bind the cases and the controls, which will have now the new variable “cluster_case” and create the variable age:
not_processed <- rbind(bdd_cases,bdd_controls)
not_processed$age <- not_processed$JCG-not_processed$Birth_Year
After creating the variable “cluster_case”, we split again the cases and controls into two different data-sets:
bdd_cases <- not_processed[not_processed$case_control=="case",]
bdd_control <- not_processed[not_processed$case_control=="control",]
Next, we create an empty data-frame and a unique list of the variable “cluster_case”:
Below it is the loop to generate the pseudo-observations for controls, which will be explained in details. We start by identifying in which subset each case belongs. Next, we check which controls are in the same subset and bind those controls to the case. For example, subset 1 has 2 cases and 1,523 controls. By creating pseudo-observations for controls, subset 1 will have 2 cases and 3,046 controls. Finally, we select the range for the age and follow-up. For demonstration purposes, we decided that an absolute difference of age smaller than 2 is acceptable and that the follow-up time between cases and controls is exact. Since the 2 cases are different in subset 1 in terms of age and follow-up, each case will end up with a different number of controls available to pool:
for(i in 1:length(list_p)){
temp <- bdd_cases[bdd_cases$cluster_case==list_p[i],]
subset_identified <- temp$subset
temp0 <- bdd_control[bdd_control$subset==temp$subset,]
temp_final <- rbind(temp,temp0)
temp_final$cluster_case <- list_p[i]
temp_final=temp_final %>%
group_by(cluster_case) %>%
mutate(age_diff = abs(age - age[case_control=="case"]),
fup_diff = foll_up - foll_up[case_control=="case"])
temp_final$age_fup <- ifelse(temp_final$age_diff<=2&temp_final$fup_diff==0,"accept","delete")
temp_final <- temp_final[temp_final$age_fup=="accept",]
temp_final$age_fup <- NULL
bdd_temp <- rbind(bdd_temp,temp_final)
}
Let us tabulate the “case_control” variable again:
The number of duplicated controls have decreased to 202 and the unique controls to identify are 10,462. Now, all the remaining controls are those that have at most 2 years difference from the case in the same subset, and also have the exact follow up.
We create the variable “total controls per case”, which depicts the total pool of controls available for each case. We also create the variable “case_ind” which takes the value 1 if the patient is a case and 0 if the patient is a control. Lastly, we select only relevant variables:
bdd_temp = bdd_temp %>% group_by(cluster_case) %>% mutate(total_control_per_case = n()-1)
bdd_temp$case_ind <- ifelse(bdd_temp$case_control=="case",1,0)
bdd_temp <- subset(bdd_temp, select=c(cluster_case, Patient_Id, case_control, case_ind,
JCG, entry_year, CI, age_diff, fup_diff, total_control_per_case))
The variable “frequency of controls” depicts how many times a control is assigned to a case:
Let us have a glimpse of the data by looking at the 10 first rows:
head(bdd_temp, 10)
#> cluster_case Patient_Id case_control case_ind JCG entry_year CI age_diff
#> 1 case_1 Patient_608 case 1 2011 2009 1 0
#> 2 case_1 Patient_4047 control 0 2011 2009 2 1
#> 3 case_2 Patient_681 case 1 2011 2000 2 0
#> 4 case_2 Patient_3833 control 0 2011 2000 4 2
#> 5 case_2 Patient_8934 control 0 2011 2000 1 1
#> 6 case_2 Patient_10023 control 0 2011 2000 2 0
#> 7 case_2 Patient_11680 control 0 2011 2000 2 1
#> 8 case_2 Patient_13330 control 0 2011 2000 0 1
#> 9 case_2 Patient_14968 control 0 2011 2000 2 2
#> 10 case_2 Patient_15592 control 0 2011 2000 0 2
#> fup_diff total_control_per_case freq_of_controls
#> 1 0 1 1
#> 2 0 1 1
#> 3 0 73 1
#> 4 0 73 8
#> 5 0 73 6
#> 6 0 73 5
#> 7 0 73 7
#> 8 0 73 3
#> 9 0 73 8
#> 10 0 73 8
Some first conclusions can be drawn:
Ordering the variables in a correct order is of utter importance. For simplicity, assuming that there are three variables, namely ”age-difference”, ”follow-up difference” and ”frequency of controls”. The data-set should be ordered by the variables ”case”, ”control”, ”follow-up difference”, ”age-difference” and lastly by ”frequency of controls”. The variable ”follow-up difference” appears before ”age-difference” since the ”follow-up difference” has more weight (importance) than the ”age-difference”.
bdd_temp<-bdd_temp[order(bdd_temp$cluster_case,bdd_temp$case_control,bdd_temp$fup_diff,
bdd_temp$age_diff,bdd_temp$freq_of_controls),]
By checking the 10 first rows, we can see that the closest controls are ordered after the case, indicating that they are optimal (have the same age-difference). Also, we observe that the “frequency of controls” is ordered which allows the control with the lowest frequency to be matched first:
head(bdd_temp, 10)
#> cluster_case Patient_Id case_control case_ind JCG entry_year CI
#> 1 case_1 Patient_608 case 1 2011 2009 1
#> 2 case_1 Patient_4047 control 0 2011 2009 2
#> 527 case_10 Patient_6326 case 1 2013 2010 1
#> 529 case_10 Patient_86161 control 0 2013 2010 0
#> 530 case_10 Patient_163561 control 0 2013 2010 0
#> 528 case_10 Patient_19087 control 0 2013 2010 0
#> 531 case_10 Patient_208844 control 0 2013 2010 0
#> 5190 case_100 Patient_99675 case 1 2014 2000 5
#> 5226 case_100 Patient_99368 control 0 2014 2000 0
#> 5241 case_100 Patient_123241 control 0 2014 2000 0
#> age_diff fup_diff total_control_per_case freq_of_controls
#> 1 0 0 1 1
#> 2 1 0 1 1
#> 527 0 0 4 1
#> 529 0 0 4 1
#> 530 0 0 4 1
#> 528 1 0 4 1
#> 531 1 0 4 1
#> 5190 0 0 89 1
#> 5226 0 0 89 1
#> 5241 0 0 89 2
We have the data ready to be used for the algorithm. The “optimal_matching” function generates an optimal match between cases and controls in an iterative and computational efficient way. For demonstration purposes, we select 4 controls to match, and we perform the analysis without replacement:
final_data <- optimal_matching(bdd_temp, n_con=4, cluster_case, Patient_Id,
total_control_per_case, case_control, with_replacement = FALSE)
Below we summarise the steps that explain how the algorithm works.
We can see the first 20 rows:
final_data <- final_data %>% arrange(cluster_case)
head(final_data,20)
#> cluster_case Patient_Id case_control case_ind JCG entry_year CI
#> 1 case_1 Patient_608 case 1 2011 2009 1
#> 2 case_1 Patient_4047 control 0 2011 2009 2
#> 3 case_10 Patient_6326 case 1 2013 2010 1
#> 4 case_10 Patient_86161 control 0 2013 2010 0
#> 5 case_10 Patient_163561 control 0 2013 2010 0
#> 6 case_10 Patient_19087 control 0 2013 2010 0
#> 7 case_10 Patient_208844 control 0 2013 2010 0
#> 8 case_100 Patient_99675 case 1 2014 2000 5
#> 9 case_100 Patient_99368 control 0 2014 2000 0
#> 10 case_100 Patient_102769 control 0 2014 2000 2
#> 11 case_100 Patient_213631 control 0 2014 2000 2
#> 12 case_100 Patient_126384 control 0 2014 2000 0
#> 13 case_101 Patient_102927 case 1 2015 2000 4
#> 14 case_101 Patient_38270 control 0 2015 2000 1
#> 15 case_101 Patient_96569 control 0 2015 2000 2
#> 16 case_101 Patient_218022 control 0 2015 2000 2
#> 17 case_101 Patient_215336 control 0 2015 2000 3
#> 18 case_102 Patient_104299 case 1 2013 2000 0
#> 19 case_102 Patient_154077 control 0 2013 2000 1
#> 20 case_102 Patient_198652 control 0 2013 2000 0
#> age_diff fup_diff total_control_per_case freq_of_controls
#> 1 0 0 1 1
#> 2 1 0 1 1
#> 3 0 0 4 1
#> 4 0 0 4 1
#> 5 0 0 3 1
#> 6 1 0 2 1
#> 7 1 0 1 1
#> 8 0 0 89 1
#> 9 0 0 89 1
#> 10 0 0 80 3
#> 11 0 0 68 12
#> 12 1 0 57 2
#> 13 0 0 62 1
#> 14 0 0 62 1
#> 15 0 0 58 1
#> 16 0 0 52 1
#> 17 0 0 44 3
#> 18 0 0 99 1
#> 19 0 0 99 1
#> 20 0 0 97 1
If we want to see how many controls are matched for each case, then:
final_data = final_data %>% group_by(cluster_case) %>% mutate(total_control_matched = n()-1)
table(final_data$case_control,final_data$total_control_matched)
#>
#> 1 2 3 4
#> case 16 9 6 161
#> control 16 18 18 644
16 cases have only 1 control, 9 cases have 2 controls, 6 cases have 3 controls and finally 161 cases have 4 controls, using the criteria above.
In the clinical case described in the previous sections, we used 3 exact variables (Gender, JCG, Practice_Id), age and follow-up. This algorithm is very flexible in accommodating even more continuous and exact variables, and as a matter of fact, the more criteria are added, the less the computational burden is. It is very useful to operate different scenarios of matching by adjusting the age range, the follow-up time and the Comorbidity Index.
The user (epidemiologist, researcher) has available a 1:n case-control matching algorithm in an optimal, efficient, and fast way using a multi-step (iterative) procedure for traditional and nested case-control studies.