This vignette covers changes between versions 2.0.0 and 2.0.1.
This document explains how to use the clustcurv R package to cluster multiple nonparametric curves in both the survival and the regression framework. To this end, we illustrate the package with real data sets. In the survival context, the algorithm that determines the groups automatically is applied to the German breast cancer data included in the condSURV package. For the regression analysis, the clustcurv package includes a data set, loaded with data(barnacle5), that contains measurements of rostro-carinal length and dry weight of barnacles collected from five sites in Galicia (northwest Spain).
We will use the German breast cancer data, data(gbcsCS), to illustrate how the package builds clusters of survival curves based on a covariate. This data set is available in the condSURV package. A total of 686 patients with primary node-positive breast cancer were recruited between July 1984 and December 1989, and 16 variables were measured, including the age of the patient (age), menopausal status (menopause), hormonal therapy (hormone), tumour size (in mm, size), tumour grade (grade) and number of positive nodes (nodes). In addition to these and other variables, the recurrence-free survival time (in days, rectime) and the corresponding censoring indicator (censrec; 0 - censored, 1 - event) were also recorded.
After a regular installation with install.packages(), load the packages and the data set with
library(clustcurv)
library(condSURV)
data(gbcsCS)
head(gbcsCS[, c(5:10, 13, 14)])
#> age menopause hormone size grade nodes rectime censrec
#> 1 38 1 1 18 3 5 1337 1
#> 2 52 1 1 20 1 1 1420 1
#> 3 47 1 1 30 2 1 1279 1
#> 4 40 1 1 24 1 3 148 0
#> 5 64 2 2 19 2 1 1863 0
#> 6 49 2 2 56 1 3 1933 0
The first three patients developed a recurrence, as shown by the censrec variable being equal to 1, unlike the following three, for which it takes the value 0. This variable, along with rectime and nodes, will be used by the algorithm for clustering survival curves. Because few patients have more than 13 positive nodes, all counts above 13 have been merged into a single level, >13. The preprocessing steps are shown below
table(gbcsCS$nodes)
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 26 30 33 35 36 38 51
#> 187 110 79 57 41 33 36 20 20 19 15 13 11 3 5 8 5 5 5 3 1 1 2 1 1 1 1 1 1 1
gbcsCS[gbcsCS$nodes > 13,'nodes'] <- 14
gbcsCS$nodes <- factor(gbcsCS$nodes)
levels(gbcsCS$nodes)[14]<- '>13'
table(gbcsCS$nodes)
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 >13
#> 187 110 79 57 41 33 36 20 20 19 15 13 11 45
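The same recoding can also be written in a single step. The following is a minimal sketch on a toy vector (the values in nodes are hypothetical stand-ins for gbcsCS$nodes, not the real data); it uses only base R and is one possible alternative, not the preprocessing used by the package authors.

```r
# Toy stand-in for gbcsCS$nodes (hypothetical values, not the real data).
nodes <- c(1, 2, 2, 5, 13, 14, 20, 51)

# Cap every count above 13 at 14, then label that last level '>13'.
capped  <- pmin(nodes, 14)
nodes_f <- factor(capped, levels = 1:14, labels = c(1:13, ">13"))

# Frequency of each level; levels that do not occur are kept with count 0.
table(nodes_f)
```

Unlike the three-step version above, fixing levels = 1:14 keeps every level from 1 to >13 in the factor even when a given count does not occur in the data.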
Clusters and estimates of the survival curves are obtained with the ksurvcurves() function or the survclustcurves() function. The main difference between them is that ksurvcurves() determines the group to which each survival function belongs for a fixed value of \(K\), whereas survclustcurves() also determines the number of groups automatically. The functions check that the data have been supplied correctly and create kcurves and clustcurves objects, respectively. Both functions can determine the groups using either the \(K\)-means or the \(K\)-medians optimization algorithm (algorithm = 'kmeans' or algorithm = 'kmedians'). The first three arguments are required: time is a vector of event times, status holds the corresponding status indicators, and x is the categorical covariate.
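To give an intuition of what grouping survival curves with \(K\)-means involves, the sketch below estimates one Kaplan-Meier curve per level of a factor, evaluates the curves on a common grid and clusters the resulting vectors with stats::kmeans(). This is only an illustration under our own assumptions (simulated data, a hand-picked grid, two groups fixed in advance); it is not the implementation used by ksurvcurves() or survclustcurves().

```r
library(survival)

set.seed(1)

# Simulated data (assumption): four levels, two underlying hazard regimes.
n <- 300
g <- factor(sample(c("a", "b", "c", "d"), n, replace = TRUE))
rate <- c(a = 0.5, b = 0.55, c = 2, d = 2.2)[as.character(g)]
event_time <- rexp(n, rate)
cens_time  <- rexp(n, 0.3)
time   <- pmin(event_time, cens_time)           # observed time
status <- as.integer(event_time <= cens_time)   # 1 = event, 0 = censored

# One Kaplan-Meier curve per level, evaluated on a common time grid.
grid <- seq(0.1, 1, by = 0.1)
km   <- survfit(Surv(time, status) ~ g)
s    <- summary(km, times = grid, extend = TRUE)

# Rows = levels of g, columns = grid points.
curves <- matrix(s$surv, nrow = nlevels(g), byrow = TRUE,
                 dimnames = list(levels(g), NULL))

# Cluster the four curves into two groups.
km_fit <- kmeans(curves, centers = 2, nstart = 10)
cl <- km_fit$cluster
cl
```

In this toy setting the levels a and b share a low hazard and c and d a high one, so the two pairs are expected to end up in separate clusters.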
Using the ksurvcurves() function with, for example, the arguments k = 4 and algorithm = 'kmedians', the estimates and the group to which each survival function belongs can be obtained as follows,
fit.kgbcs <- ksurvcurves(time = gbcsCS$rectime, status = gbcsCS$censrec, x = gbcsCS$nodes,
                         algorithm = 'kmedians', k = 4, seed = 300716)
Additionally, it can be of interest to know not only the group to which each survival curve is assigned but also the number of groups, selected automatically. As mentioned, this is possible with the survclustcurves() function. The following command shows an example, again using the \(K\)-medians algorithm (i.e. algorithm = 'kmedians'):
fit.gbcs <- survclustcurves(time = gbcsCS$rectime, status = gbcsCS$censrec, x = gbcsCS$nodes,
                            nboot = 100, seed = 300716, algorithm = 'kmedians')
#> Checking 1 cluster...
#> Checking 2 clusters...
#> Checking 3 clusters...
#>
#> Finally, there are 3 clusters.
The function above also includes an argument that reduces execution time by parallelizing the testing procedure: cluster = TRUE. With this argument, the number of cores used in the parallelized procedure can be specified through the argument ncores. By default, ncores = NULL, so the number of cores used equals the number of cores of the machine minus one.
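As a hedged sketch of how these two arguments might be combined, the code below computes the default described above (all cores but one) with parallel::detectCores() and wraps the call in a small helper. The helper name fit_parallel is ours and purely illustrative; the arguments passed to survclustcurves() are the ones named in this document.

```r
# Number of workers: all cores but one, with a floor of one.
# detectCores() can return NA on some platforms, hence the guard.
n_cores <- parallel::detectCores()
n_cores <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

# Hypothetical helper (not part of clustcurv): same call as above,
# with the testing procedure run in parallel on 'ncores' cores.
fit_parallel <- function(time, status, x, ncores) {
  clustcurv::survclustcurves(time = time, status = status, x = x,
                             nboot = 100, seed = 300716,
                             algorithm = "kmedians",
                             cluster = TRUE, ncores = ncores)
}
```

With the data prepared as before, the helper would be called as fit_parallel(gbcsCS$rectime, gbcsCS$censrec, gbcsCS$nodes, ncores = n_cores).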
The following piece of code produces a short summary of each fit
summary(fit.kgbcs)
#>
#> Call:
#> ksurvcurves(time = gbcsCS$rectime, status = gbcsCS$censrec, x = gbcsCS$nodes,
#> k = 4, algorithm = "kmedians", seed = 300716)
#>
#> Clustering curves in 4 groups
#>
#> Number of observations: 642
#> Cluster method: kmedians
#>
#> Factor's levels:
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" ">13"
#>
#> Clustering factor's levels:
#> [1] 4 4 4 1 1 1 1 3 1 3 3 3 3 2
#>
#> Available components:
#> [1] "measure" "levels" "cluster" "centers" "curves" "method" "data" "algorithm" "call"
summary(fit.gbcs)
#>
#> Call:
#> survclustcurves(time = gbcsCS$rectime, status = gbcsCS$censrec,
#> x = gbcsCS$nodes, nboot = 100, algorithm = "kmedians", seed = 300716)
#>
#> Clustering curves in 3 groups
#>
#> Number of observations: 640
#> Cluster method: kmedians
#>
#> Factor's levels:
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" ">13"
#>
#> Clustering factor's levels:
#> [1] 3 3 3 1 1 1 1 2 1 2 2 2 2 2
#>
#> Testing procedure:
#> H0 Tvalue pvalue
#> 1 1 95.68626 0.00
#> 2 2 56.03966 0.01
#> 3 3 33.63386 0.94
#>
#> Available components:
#> [1] "num_groups" "table" "levels" "cluster" "centers" "curves" "method" "data" "algorithm" "call"
As can be seen, the summary() function, as well as the print() function, can be used to obtain some brief information about the output of ksurvcurves() and survclustcurves().
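The cluster labels printed by summary() are easier to read when the factor levels are listed by group. The sketch below does this in base R on a stand-in list, fit_stub, whose levels and cluster entries are copied by hand from the summary(fit.gbcs) output above. We assume, without having verified it, that the levels and cluster components of the fitted object can be used in the same way.

```r
# Stand-in for fit.gbcs, filled in by hand from the summary() output above.
fit_stub <- list(
  levels  = c(as.character(1:13), ">13"),
  cluster = c(3, 3, 3, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2)
)

# List the levels of 'nodes' that fall in each cluster.
groups <- split(fit_stub$levels, fit_stub$cluster)
groups
```

Here patients with 1 to 3 positive nodes form one group, those with 4 to 7 or 9 nodes a second one, and those with 8 or with 10 or more nodes the third.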
The graphical representation of the fitted model can be easily obtained using the function autoplot(). The plot obtained by specifying the arguments groups_by_color = FALSE and interactive = TRUE represents the estimated survival curves for each level of the factor nodes by means of the Kaplan-Meier estimator. As expected, patient survival is influenced by the number of positive lymph nodes: recurrence-free time rises as the number of nodes decreases.
The assignment of the curves to the three groups can be observed in the following plot, obtained simply by setting groups_by_color = TRUE.
Equivalently, the following piece of code shows the input commands and the results obtained with algorithm = 'kmeans'. The number of groups and the assignments differ from those obtained with algorithm = 'kmedians'. Although this situation is not common, it can happen in some real applications.
We will use the barnacle growth data, data(barnacle5), to illustrate how the package builds clusters of regression curves based on a covariate. This data set (barnacle5) is available in the clustcurv package. A total of 5000 specimens were collected from five sites along the region's Atlantic coastline, corresponding to the stretches of coast where this species is harvested: Punta do Mouro, Punta Lens, Punta de la Barca, Punta del Boy and Punta del Alba. Two biometric variables were measured for each specimen: RC (rostro-carinal length, the maximum distance across the capitulum between the ends of the rostral and carinal plates) and DW (dry weight).
data("barnacle5")
head(barnacle5)
#> DW RC F
#> 1 0.52 12.0 laxe
#> 2 1.46 18.9 laxe
#> 3 0.05 6.4 laxe
#> 4 0.17 9.4 laxe
#> 5 0.05 6.2 laxe
#> 6 0.41 12.2 laxe
Here, the aim is to study the relation between the RC and DW variables along the coast, i.e., to analyze whether barnacle growth is similar in all locations (F) or whether, by contrast, geographical differences in growth can be detected. To do this, the regclustcurves() function is used with the input variables y (here DW), x (RC) and z (the factor F), by executing the following piece of code
fit.bar <- regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F,
                          nboot = 100, seed = 300716, algorithm = 'kmeans')
#> Checking 1 cluster...
#> Checking 2 clusters...
#> Checking 3 clusters...
#>
#> Finally, there are 3 clusters.
The output of this function can be inspected with the print() or summary() functions, as in the example below
print(fit.bar)
#>
#> Call:
#> regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F,
#> nboot = 100, algorithm = "kmeans", seed = 300716)
#>
#> Clustering curves in 3 groups
#>
#> Number of observations: 5000
#> Cluster method: kmeans
summary(fit.bar)
#>
#> Call:
#> regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F,
#> nboot = 100, algorithm = "kmeans", seed = 300716)
#>
#> Clustering curves in 3 groups
#>
#> Number of observations: 5000
#> Cluster method: kmeans
#>
#> Factor's levels:
#> [1] "laxe" "lens" "barca" "boy" "alba"
#>
#> Clustering factor's levels:
#> [1] 2 3 1 1 2
#>
#> Testing procedure:
#> H0 Tvalue pvalue
#> 1 1 0.94353014 0.00
#> 2 2 0.15463483 0.02
#> 3 3 0.02348982 0.46
#>
#> Available components:
#> [1] "num_groups" "table" "levels" "cluster" "centers" "curves" "method" "data" "algorithm" "call"
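The testing table in the summary can be read sequentially: the hypothesis of \(K\) groups is tested for \(K = 1, 2, \ldots\), and the first value of \(K\) that is not rejected is retained. The sketch below reproduces this reading on the values printed above, assuming a significance level of 0.05 (our assumption; the level is not stated in this document). The table and the cluster labels are copied by hand from the summary(fit.bar) output rather than extracted from the fitted object.

```r
# Testing table copied from the summary(fit.bar) output above.
test_table <- data.frame(
  H0     = 1:3,
  Tvalue = c(0.94353014, 0.15463483, 0.02348982),
  pvalue = c(0.00, 0.02, 0.46)
)

alpha <- 0.05  # assumed significance level

# First number of groups whose null hypothesis is not rejected.
k_selected <- test_table$H0[which(test_table$pvalue > alpha)[1]]
k_selected

# Sites grouped by the cluster labels printed in the summary.
sites   <- c("laxe", "lens", "barca", "boy", "alba")
cluster <- c(2, 3, 1, 1, 2)
site_groups <- split(sites, cluster)
site_groups
```

Under this reading, one and two groups are rejected and three groups are retained, which agrees with the message printed by regclustcurves(); barca and boy share a growth curve, as do laxe and alba, while lens forms its own group.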
As in the survival example shown before, the results obtained above can be plotted using the autoplot() function.