Sparse weighted k-means for mixed data

Marie Chavent and Alex Mourer and Madalina Olteanu

2020-12-18

\(\DeclareMathOperator{\R}{\mathbb{R}}\)

Basic function description

sparsewkm performs sparse clustering of a dataset described by numerical features, categorical features, or a mixture of both. In this respect, it generalizes the sparse \(k\)-means algorithm introduced in Witten and Tibshirani (2010) for numerical features only.

The implementation is based on the optimization of a penalized between-class variance criterion. If the features used for clustering are numerical only, the algorithm reduces to the one in Witten and Tibshirani (2010). If some or all the features are categorical, sparsewkm transforms the data using a factor analysis step (see M. Chavent and Saracco (2014) for the preprocessing), and then trains a group-sparse \(k\)-means algorithm, each group being defined by the levels of one specific feature. For more technical details on the cost function and the optimization procedure, one may refer to M. Chavent and Olteanu (2020).
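To fix ideas, the weight update at the core of sparse \(k\)-means can be sketched in a few lines of base R. This is an illustrative sketch of the soft-thresholding step of Witten and Tibshirani (2010), not the package's group-sparse implementation; soft_threshold and update_weights are hypothetical helper names.

```r
# Illustrative sketch (not the package internals): in sparse k-means,
# each feature j has a between-class variance contribution a_j. Weights
# are obtained by soft-thresholding a at lambda and rescaling to unit
# L2 norm, which drives weakly discriminant features exactly to zero.
soft_threshold <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)

update_weights <- function(a, lambda) {
  s <- soft_threshold(a, lambda)
  if (all(s == 0)) return(s)   # lambda too large: every feature is dropped
  s / sqrt(sum(s^2))           # rescale to unit L2 norm
}

a <- c(0.70, 0.54, 0.14, 0.02) # toy between-class contributions
round(update_weights(a, lambda = 0.1), 3)
```

With a larger lambda, more entries of the thresholded vector vanish, which is exactly the sparsity mechanism at work in the regularization paths shown below.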

Arguments

Several arguments may be passed to sparsewkm, but only the first two are required: X, the input data, and centers, the number of clusters.

The remaining arguments concern the choice of the regularization parameter, as well as the number of iterations and of random starts in the algorithm. Default values are provided for all of them; see help(sparsewkm) for further details.

Output

The sparsewkm function returns an object of class spwkm (see help(sparsewkm) for further details on this class).

A case study: HDdata dataset

The HDdata dataset consists of 270 patients described by six numerical features and eight categorical ones. It is a sample from the Cleveland Heart Disease Data available in the UCI Machine Learning Repository.

Further details about this dataset may be found with help(HDdata).

library(vimpclust)
head(HDdata)
##   age sex cp trestbps chol fbs restecg maxhr exang oldpeak slope numv thal
## 1  70   1  4      130  322   0       2   109     0     2.4     2    3    3
## 2  67   0  3      115  564   0       2   160     0     1.6     2    0    7
## 3  57   1  2      124  261   0       0   141     0     0.3     1    0    7
## 4  64   1  4      128  263   0       0   105     1     0.2     2    1    7
## 5  74   0  2      120  269   0       2   121     1     0.2     1    1    3
## 6  65   1  4      120  177   0       0   140     0     0.4     1    0    7
##         HD
## 1 presence
## 2  absence
## 3 presence
## 4  absence
## 5  absence
## 6  absence

Training the sparsewkm function

The sparsewkm function is applied to HDdata on all features except the last one, HD, which codes for the presence or the absence of a heart disease. HD was removed from the clustering, and will be used later as a control variable. Since the control variable has only two classes, the number of clusters is set to 2 with the argument centers. We shall check after the clustering procedure whether the algorithm retrieves these two classes.

res <- sparsewkm(X = HDdata[,-14], centers = 2)

The algorithm converges here for all values of lambda. In some cases, however, the stopping criterion may not be satisfied and convergence over the weights w may not be achieved. If this happens, one should increase the number of iterations itermaxw, which is set to 20 by default. Note, however, that a larger value of this parameter increases the computational time, since a full \(k\)-means algorithm is trained at each iteration.

Results

The weights associated with each feature are stored in the matrix W. They reflect the contribution of each feature, numerical or categorical, to the clustering, and may be seen as a measure of the relative importance of each feature in the clustering.

Each column of W contains the weights computed for a given value of the regularization parameter lambda. The default setting in the sparsewkm function selects 20 values for lambda, chosen uniformly between 0 (no regularization) and a maximum value automatically tuned.
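The construction of this default grid can be sketched as follows; lambda.max is a placeholder value here, since sparsewkm computes the actual maximum from the data.

```r
# Hypothetical sketch of the default regularization grid: 20 values
# spaced uniformly between 0 (no penalty) and a data-driven maximum.
# lambda.max below is a made-up placeholder, not the tuned value.
lambda.max  <- 0.3
lambda.grid <- seq(0, lambda.max, length.out = 20)
length(lambda.grid)
```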

In the following, only the weights associated with the first 5 values of lambda are displayed.

res$W[,1:5]
##          Lambda 1 Lambda 2 Lambda 3 Lambda 4 Lambda 5
## age       1.3e-01    0.138    0.112    0.091    0.046
## trestbps  2.2e-02    0.000    0.000    0.000    0.000
## chol      1.7e-05    0.000    0.000    0.000    0.000
## maxhr     7.0e-01    0.816    0.860    0.879    0.909
## oldpeak   5.4e-01    0.449    0.428    0.424    0.402
## numv      1.4e-01    0.122    0.094    0.071    0.046
## sex       2.6e-02    0.000    0.000    0.000    0.000
## cp        1.3e-01    0.086    0.023    0.000    0.000
## fbs       6.5e-06    0.000    0.000    0.000    0.000
## restecg   3.1e-02    0.000    0.000    0.000    0.000
## exang     2.3e-01    0.182    0.138    0.106    0.046
## slope     3.2e-01    0.231    0.188    0.151    0.077
## thal      1.2e-01    0.058    0.023    0.000    0.000

One may see that, as lambda increases, the weights of some features are progressively set to 0. The evolution of the regularization paths for the features used in the clustering may be illustrated using the plot function for the spwkm class. With the default settings, the paths associated with numerical features are drawn with continuous lines, while those associated with categorical features are drawn with dotted lines.

According to the results, the numerical features maxhr and oldpeak and the categorical features slope and exang appear as the most discriminant for small values of lambda. As lambda increases, only maxhr remains selected by the algorithm.

plot(res, what="weights.features")

Furthermore, the weights associated with each level of the categorical features may also be displayed. They are stored in the matrix Wm. Similarly to W, each column of this matrix contains the weights, associated either with the numerical features or with the levels of the categorical features, for a given value of lambda.

We display here the weights associated with the first 5 values of lambda.

res$Wm[,1:5]
##           Lambda 1 Lambda 2 Lambda 3 Lambda 4 Lambda 5
## age        1.3e-01  1.4e-01  1.1e-01    0.091    0.046
## trestbps   2.2e-02  0.0e+00  0.0e+00    0.000    0.000
## chol       1.7e-05  0.0e+00  0.0e+00    0.000    0.000
## maxhr      7.0e-01  8.2e-01  8.6e-01    0.879    0.909
## oldpeak    5.4e-01  4.5e-01  4.3e-01    0.424    0.402
## numv       1.4e-01  1.2e-01  9.4e-02    0.071    0.046
## sex=0      2.3e-02  0.0e+00  0.0e+00    0.000    0.000
## sex=1      1.1e-02  0.0e+00  0.0e+00    0.000    0.000
## cp=1       2.1e-04  9.6e-05  4.5e-06    0.000    0.000
## cp=2       8.2e-02  5.7e-02  1.5e-02    0.000    0.000
## cp=3       3.7e-02  2.2e-02  5.3e-03    0.000    0.000
## cp=4       9.5e-02  6.2e-02  1.6e-02    0.000    0.000
## fbs=0      1.1e-06  0.0e+00  0.0e+00    0.000    0.000
## fbs=1      6.4e-06  0.0e+00  0.0e+00    0.000    0.000
## restecg=0  2.1e-02  0.0e+00  0.0e+00    0.000    0.000
## restecg=1  1.6e-02  0.0e+00  0.0e+00    0.000    0.000
## restecg=2  1.6e-02  0.0e+00  0.0e+00    0.000    0.000
## exang=0    9.9e-02  8.0e-02  6.1e-02    0.047    0.020
## exang=1    2.0e-01  1.6e-01  1.2e-01    0.095    0.041
## slope=1    2.5e-01  1.8e-01  1.5e-01    0.118    0.061
## slope=2    2.0e-01  1.4e-01  1.2e-01    0.093    0.046
## slope=3    3.1e-02  2.7e-02  2.1e-02    0.017    0.011
## thal=3     8.2e-02  4.0e-02  1.6e-02    0.000    0.000
## thal=6     3.3e-02  2.0e-02  1.0e-02    0.000    0.000
## thal=7     7.8e-02  3.7e-02  1.3e-02    0.000    0.000

The regularization paths for the levels of the categorical features may be plotted using the argument what="weights.levels" in the plot function. This option provides a more detailed view of how each level of a categorical feature contributes to the clustering. One may see here, for instance, that only the first two levels of slope and one level of exang contribute significantly.

plot(res, what="weights.levels")

Depending on the number of levels in the categorical features, Wm may be quite a large matrix. One may choose to plot the regularization paths for a subset of features only, whether numerical or categorical. To do so, it suffices to add the argument Which in the plot function and list the features. Note that after training the sparsewkm function, the features are ordered differently than in the input data, with the numerical features listed first; Which refers to the ordering of the W output matrix.

plot(res, what="weights.features", Which=c(4,5,11,12))

For the categorical features, one may equally plot the regularization paths of the associated levels, as we do here for slope and exang.

plot(res, what="weights.levels", Which=c(11,12))

Additional plots

The number of selected features and the number of selected levels for a given value of the regularization parameter lambda provide valuable information. These two criteria may be illustrated using the options what="sel.features" and what="sel.levels" in the plot function.

plot(res, what="sel.features")

plot(res, what="sel.levels")
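What these curves count can be reproduced on a toy example: the number of nonzero rows of the weight matrix, for each value of lambda. The matrix W below is made up for illustration and is not actual sparsewkm output.

```r
# Toy illustration of what the "sel.features" curve counts: the number
# of features with a nonzero weight for each value of lambda. Features
# are in rows, values of lambda in columns, mimicking the shape of res$W.
W <- rbind(age   = c(0.13, 0.14, 0.11),
           maxhr = c(0.70, 0.82, 0.86),
           chol  = c(0.02, 0.00, 0.00))
colSums(W > 0)   # 3 2 2
```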

The two curves are very similar and show how rapidly the number of features relevant to the clustering decreases with lambda.

Clustering quality may also be assessed using criteria based on the evolution of the explained variance. The latter is computed as the ratio between the between-class variance and the global variance in the data, and represents the fraction of information explained by the clustering. The explained variance may be computed on all features used for clustering, without taking the weights into account, or in a weighted fashion, by taking the computed weights into account and thus suppressing all features discarded by the algorithm. In the latter case, and for large lambda, the weighted explained variance ends up being computed on one feature only, maxhr.
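This ratio can be illustrated with a small base-R sketch on toy data; explained_var is an illustrative helper written for this example, not the package's internal code.

```r
# Illustrative sketch of the explained-variance ratio: between-class
# variance over total variance, summed over features, on toy data
# with two well-separated groups.
explained_var <- function(X, cluster) {
  X <- as.matrix(X)
  total   <- sum(scale(X, scale = FALSE)^2)               # total dispersion
  centers <- apply(X, 2, function(col) ave(col, cluster)) # per-point cluster means
  between <- sum(scale(centers, center = colMeans(X), scale = FALSE)^2)
  between / total
}

set.seed(1)
X  <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 4), ncol = 2))
cl <- rep(1:2, each = 20)
ev <- explained_var(X, cl)
round(ev, 2)   # high for well-separated clusters
```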

plot(res, what="expl.var")

plot(res, what="w.expl.var")

These two criteria, combined with the number of selected features, may be used to select an appropriate regularization parameter lambda. A good choice of lambda should preserve a high percentage of variance explained by the clustering, while discarding a large number of features. This amounts to a trade-off between the quality and the parsimony of the model.

One may remark that with the fifth value of the regularization parameter, lambda=0.07, the number of features is reduced by half, while the loss in explained variance is close to 1%. One may also notice that the percentage of explained variance is very low for all clusterings, while the percentage of explained weighted variance is significantly larger and increases with the penalty. This amounts to saying that most of the features in the dataset are not discriminant, and that the global quality of the clustering, computed on all features, is rather poor. If only the most discriminant features are selected, the explained weighted variance increases substantially, whereas the loss in the unweighted explained variance is minor.

Furthermore, if one selects the sixth value of the regularization parameter, lambda=0.09, only two features are kept for clustering. The loss in the unweighted explained variance is close to 2%, but the weighted explained variance gains more than 30%.

It appears, according to these remarks, that only very few features are actually playing a significant part in the clustering procedure. If one prefers to discard most of the features, she may keep maxhr and oldpeak, with an explained variance of roughly 6.2% and an explained weighted variance of roughly 60%. If one wants a larger ratio of explained variance, she would also keep age, numv, exang and slope besides maxhr and oldpeak. In this case, the ratio of explained variance is about 7.5%, while the ratio of explained weighted variance drops to approximately 45%. Furthermore, according to the regularization paths, it appears that these six features are the most discriminant and the most important for the clustering.

Comparing the clustering with the “ground truth”

Since we have a control variable for HDdata, we shall use it to compare three clusterings produced by the algorithm, for three different values of lambda. The comparison criterion chosen here is the Adjusted Rand Index (ARI). We shall also display the confusion matrices for the first, fifth, and sixth values of lambda.

library(mclust)
sapply(c(1,5,6), function(x) {adjustedRandIndex(res$cluster[,x],HDdata$HD)})
## [1] 0.29 0.21 0.16
table(HDdata$HD, res$cluster[,1])
##           
##              1   2
##   absence  128  22
##   presence  40  80
table(HDdata$HD, res$cluster[,5])
##           
##              1   2
##   absence  123  27
##   presence  45  75
table(HDdata$HD, res$cluster[,6])
##           
##              1   2
##   absence   34 116
##   presence  73  47

According to these results, the agreement between the clustering and the control variable decreases as the penalty grows and fewer features are kept for the clustering. Nevertheless, although accuracy deteriorates, reducing the number of features by half leads to a loss of only 3.7% in accuracy, while reducing the number of features from 13 to 2 leads to a loss of 7%. It appears here that sparse clustering may be particularly useful when one wishes to find a trade-off between clustering quality and dimensionality, while highlighting the features that contribute most to the clustering.
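The accuracy figures above can be recovered directly from the confusion matrices, keeping in mind that cluster labels are arbitrary and may be swapped, as happens for the sixth value of lambda.

```r
# Recomputing the accuracies quoted above from the confusion matrices.
# Cluster labels are arbitrary, so we take the best of the two possible
# label assignments before computing the proportion of correct matches.
accuracy <- function(tab) max(sum(diag(tab)), sum(diag(tab[, 2:1]))) / sum(tab)

t1 <- matrix(c(128, 40, 22, 80), nrow = 2)   # lambda index 1
t5 <- matrix(c(123, 45, 27, 75), nrow = 2)   # lambda index 5
t6 <- matrix(c(34, 73, 116, 47), nrow = 2)   # lambda index 6 (labels swapped)

round(c(accuracy(t1), accuracy(t5), accuracy(t6)), 3)   # 0.770 0.733 0.700
# losses: 208/270 - 198/270 = 3.7%, and 208/270 - 189/270 = 7.0%
```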

We shall also mention here that the comparison with the control variable is done for illustration purposes only. Since sparse weighted \(k\)-means is an unsupervised method, there is no reason the clustering it builds should correspond to a prior partitioning defined by some control variable.

Cluster composition

Finally, once one selects an appropriate regularization parameter lambda and the associated clustering, she may consider the cluster composition. Using the function info_clust in the package, she may display the feature distributions within the clusters (means for numerical features, frequencies of levels for categorical ones). This first insight may be completed with a more thorough analysis, for instance using analysis of variance.

info_clust(res, 5, X = HDdata)
## $mean.by.clust
## $mean.by.clust$`lambda = 0.0702617509207402`
##          cluster=1 cluster=2
## age          52.21      58.1
## trestbps    130.21     133.2
## chol        250.30     248.6
## maxhr       163.90     126.3
## oldpeak       0.55       1.9
## numv          0.44       1.0
## 
## 
## $freq.by.clust
## $freq.by.clust$`lambda = 0.0702617509207402`
##             cluster=1 cluster=2
## sex=0           0.381     0.225
## sex=1           0.619     0.775
## cp=1            0.071     0.078
## cp=2            0.226     0.039
## cp=3            0.351     0.196
## cp=4            0.351     0.686
## fbs=0           0.851     0.853
## fbs=1           0.149     0.147
## restecg=0       0.548     0.382
## restecg=1       0.000     0.020
## restecg=2       0.452     0.598
## exang=0         0.821     0.422
## exang=1         0.179     0.578
## slope=1         0.679     0.157
## slope=2         0.286     0.725
## slope=3         0.036     0.118
## thal=3          0.685     0.363
## thal=6          0.018     0.108
## thal=7          0.298     0.529
## HD=absence      0.732     0.265
## HD=presence     0.268     0.735
## 
## 
## $lambda
## [1] 0.07

Bibliography

Chavent, M., A. Labenne, V. Kuentz-Simonet, and J. Saracco. 2014. “Multivariate Analysis of Mixed Data: The PCAmixdata R Package.” arXiv preprint arXiv:1411.4911.

Chavent, M., A. Mourer, J. Lacaille, and M. Olteanu. 2020. “Sparse k-Means for Mixed Data via Group-Sparse Clustering.” In Proceedings of ESANN.

Witten, D. M., and R. Tibshirani. 2010. “A Framework for Feature Selection in Clustering.” Journal of the American Statistical Association 105 (490): 713–26.