To limit the size of this documentation, most figures are not displayed in the
version of the vignette included in the package. To see all figures, you can
rerun the vignettes, changing the following option to TRUE:
knitr::opts_chunk$set(include = FALSE)
Alternatively, this compilation is also available at: http://sombrero.nathalievialaneix.eu/articles/d-doc-korrespSOM.html
SOMbrero implements different variants of the Self-Organizing Map algorithm
(also called Kohonen's algorithm). To process a given dataset with the SOM
algorithm, you can use the function trainSOM().
This documentation only considers the case of contingency tables.
The trainSOM function has several arguments, but only the first one is
required. This argument is x.data, which is the dataset used to train the
SOM. In this documentation, it is passed to the function as a matrix or a data
frame and encodes a contingency table (the entries are the frequencies of joint
observations for two factors). Column and row names must be supplied to ease the
interpretation.
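As a minimal sketch of the expected input format, the following base R code builds a small contingency table with row and column names. The district names, candidate names and counts are invented for illustration and are not taken from presidentielles2002.

```r
# Toy contingency table: rows are districts, columns are candidates.
# All names and counts below are hypothetical.
toy.votes <- matrix(c(120,  80,  40,
                       60, 150,  30,
                       90,  70, 110),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("district_a", "district_b", "district_c"),
                                    c("CAND_1", "CAND_2", "CAND_3")))

# Entries must be non-negative counts and both sets of names must be present
stopifnot(all(toy.votes >= 0),
          !is.null(rownames(toy.votes)),
          !is.null(colnames(toy.votes)))

# Marginal totals per candidate (columns) and per district (rows)
col.totals <- colSums(toy.votes)
row.totals <- rowSums(toy.votes)
```

An object with this shape (a matrix or data frame of counts with row and column names) is what the x.data argument expects in the korresp case.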
The function accepts other options, which are the same as the ones passed to
initSOM (they are parameters defining the algorithm, see help(initSOM) for
further details).
The trainSOM function returns an object of class somRes (see help(trainSOM)
for further details on this class).
presidentielles2002 data set
The presidentielles2002 data set provides the number of votes for the
first round of the 2002 French presidential election for each of the 16
candidates in all of the 106 French administrative districts called
“départements”. Further details about this data set and the 2002 French
presidential election are given with help(presidentielles2002).
data(presidentielles2002)
apply(presidentielles2002, 2, sum)
## MEGRET LEPAGE GLUCKSTEIN BAYROU CHIRAC LE_PEN
## 667043 535875 132696 1949219 5666021 4804772
## TAUBIRA SAINT_JOSSE MAMERE JOSPIN BOUTIN HUE
## 660515 1204801 1495774 4610267 339157 960548
## CHEVENEMENT MADELIN LAGUILLER BESANCENOT
## 1518568 1113551 1630118 1210562
The two candidates who reached the second round of the election were Jacques Chirac and the far-right candidate Jean-Marie Le Pen.
set.seed(01091407)
korresp.som <- trainSOM(x.data = presidentielles2002, dimension = c(8, 8),
                        type = "korresp", scaling = "chi2", nb.save = 10,
                        topo = "hexagonal", maxit = 500)
korresp.som
## Self-Organizing Map object...
## online learning, type: korresp
## 8 x 8 grid with hexagonal topology
## neighbourhood type: gaussian
## distance type: euclidean
As the energy is recorded during the intermediate backups, we can take a look at its evolution:
plot(korresp.som, what = "energy")
The energy has approximately stabilized by iteration 500.
The clustering component contains the final classification of the dataset. As both row and column variables are classified, the length of the resulting vector is equal to the sum of the number of rows and the number of columns.
NB: The clustering component shows first the column variables (here, the candidates) and then the row variables (here, the départements).
korresp.som$clustering
## MEGRET LEPAGE GLUCKSTEIN
## 8 8 8
## BAYROU CHIRAC LE_PEN
## 40 33 61
## TAUBIRA SAINT_JOSSE MAMERE
## 8 4 32
## JOSPIN BOUTIN HUE
## 25 8 6
## CHEVENEMENT MADELIN LAGUILLER
## 32 24 4
## BESANCENOT ain aisne
## 5 61 61
## allier alpes_de_haute_provence hautes_alpes
## 59 57 57
## alpes_maritimes ardeche ardennes
## 64 59 57
## ariege aube aude
## 57 57 59
## aveyron bouches_du_rhone calvados
## 57 4 53
## cantal charente charente_maritime
## 57 57 43
## cher correze corse_sud
## 57 57 57
## haute_corse cote_d'or cotes_d'armor
## 57 61 25
## creuse dordogne doubs
## 57 59 61
## drome eure eure_et_loir
## 60 61 59
## finistere gard haute_garonne
## 17 62 9
## gers gironde herault
## 57 1 55
## ille_et_vilaine indre indre_et_loire_
## 17 57 61
## isere jura landes
## 56 57 57
## loir_et_cher loire haute_loire
## 59 63 57
## loire_atlantique loiret lot
## 10 61 57
## lot_et_garonne_ lozere maine_et_loire_
## 59 57 45
## manche marne haute_marne
## 60 61 57
## mayenne meurthe_et_moselle meuse
## 57 62 57
## morbihan moselle nievre
## 26 64 57
## nord oise orne
## 6 63 57
## pas_de_calais puy_de_dome pyrenees_atlantiques
## 2 53 25
## hautes_pyrenees pyrenees_orientales bas_rhin
## 57 59 56
## haut_rhin rhone haute_saone
## 63 40 57
## saone_et_loire_ sarthe savoie
## 61 52 59
## haute_savoie paris seine_maritime_
## 63 24 1
## seine_et_marne_ yvelines deux_sevres
## 56 48 49
## somme tarn tarn_et_garonne
## 61 59 57
## var vaucluse vendee
## 64 61 25
## vienne haute_vienne vosges
## 58 49 59
## yonne territoire_de_belfort essonne
## 59 57 46
## hauts_de_seine_ seine_saint-denis val_de_marne
## 40 37 46
## val_d'oise guadeloupe martinique
## 55 57 57
## guyane la_reunion mayotte
## 57 33 57
## nouvelle_caledonie polynesie_francaise saint_pierre_et_miquelon
## 57 57 57
## wallis_et_futuna francais_de_l'etranger
## 57 49
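Because the column variables come first, the clustering vector can be split into its two parts using the dimensions of the input table. The sketch below uses a small hypothetical clustering vector rather than the real korresp.som object; n.col and n.row stand in for ncol(x.data) and nrow(x.data).

```r
# Hypothetical clustering vector for a table with 3 columns and 4 rows:
# column variables come first, then row variables
toy.clustering <- c(CAND_1 = 8, CAND_2 = 40, CAND_3 = 33,
                    dep_a = 61, dep_b = 61, dep_c = 59, dep_d = 8)
n.col <- 3  # would be ncol(x.data)
n.row <- 4  # would be nrow(x.data)

# The length is the number of columns plus the number of rows
stopifnot(length(toy.clustering) == n.col + n.row)

# Split the vector into its column part and its row part
col.clusters <- toy.clustering[seq_len(n.col)]
row.clusters <- toy.clustering[n.col + seq_len(n.row)]

# Row variables that share a neuron with at least one column variable
shared <- names(row.clusters)[row.clusters %in% col.clusters]
```

Here only dep_d shares a neuron (number 8) with a column variable, which is the kind of row/column association the korresp map is meant to reveal.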
The following table indicates which graphics are available for a korresp SOM.
Type | SOM: Energy | SOM: Obs | SOM: Prototypes | SOM: Add | SuperCluster: (no what) | SuperCluster: Obs | SuperCluster: Prototypes | SuperCluster: Add |
---|---|---|---|---|---|---|---|---|
(no type) | x | |||||||
hitmap | x | x | ||||||
color | x | x | ||||||
lines | x | x | ||||||
barplot | x | x | ||||||
3d | x | |||||||
poly.dist | x | x | ||||||
umatrix | x | |||||||
smooth.dist | x | |||||||
mds | x | x | ||||||
grid.dist | x | |||||||
names | x | |||||||
grid | x | |||||||
dendrogram | x | |||||||
dendro3d | x |
The resulting distribution of the clustering on the map can also be visualized by a hitmap:
plot(korresp.som, what = "obs", type = "hitmap", show.names = FALSE)
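The counts behind a hitmap can be sketched in base R by tabulating the neuron indices. The example uses the cluster numbers printed above for the first six candidates and assumes the 8 x 8 grid is numbered from 1 to 64; how plot() actually computes and draws the hitmap is not shown in this document.

```r
# Neuron indices of the first six candidates, as printed by korresp.som$clustering
toy.clustering <- c(MEGRET = 8, LEPAGE = 8, GLUCKSTEIN = 8,
                    BAYROU = 40, CHIRAC = 33, LE_PEN = 61)
n.neurons <- 64  # 8 x 8 grid, neurons assumed to be numbered 1..64

# Number of elements assigned to each neuron (0 for empty neurons)
hits <- tabulate(toy.clustering, nbins = n.neurons)

# Neurons that received at least one element
occupied <- which(hits > 0)
```

With the full clustering vector, the same two lines give one count per neuron, which is the quantity that the size of each hitmap symbol reflects.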
For a more precise view, a "names" plot is implemented: the names of the
values assigned to every neuron are displayed in the corresponding cluster. In
the korresp SOM, both row and column names are displayed.
plot(korresp.som, what="obs", type="names")
The map is divided into two main parts: minor candidates are classified at its top left hand side whereas the three main candidates CHIRAC, LE PEN and JOSPIN are classified at the bottom right hand side of the map, in three different parts of this corner. Some striking facts are:
most rural départements (Corrèze, Creuse, Jura, Cantal, Ariège, …) are classified in the bottom right corner, between CHIRAC and LE PEN, who obtained a high number of votes (compared to the other candidates) in these départements;
CHIRAC is characterized by higher votes for La Réunion (an overseas département) whereas LE PEN has higher votes for Indre-et-Loire, Aisne, Loiret and Côte-d'Or;
some well-known associations, like HUE (communist party) in the Nord, are also visible on the map.
Some graphics from the numeric SOM algorithm are still available in the korresp
case. They are detailed below. As the resulting clustering provides the
classification for both rows and columns, a new argument, view, is used to
specify which one should be considered. Its possible values are either "r"
for row variables (the default value) or "c" for column variables.
Three representations are available:
view argument is used)
The département profiles are much flatter (and with low values) in the top left corner of the map than in the bottom right corner, which shows more differences between départements and a globally higher number of votes.
On the contrary, the candidate profiles are flatter, with globally lower values in the bottom right corner of the map.
A more precise individual view is given by the graphics “color” and “3d”, drawn here, as an example, for the candidate “LE PEN” and for the département “La Réunion”.
variable) is represented on the map;
"color".
plot(korresp.som, what = "prototypes", type = "color", variable = "LE_PEN")
The first graphic shows that LE_PEN obtained more votes in the départements located at the top left corner of the map. The second graphic shows that the candidates who obtained the highest scores in La Réunion are located at the bottom of the map (like Chirac).
The graphics can also be drawn by giving the variable number and its type, either “r” or “c” (here, as an example, CHIRAC, who is the 5th candidate and thus the 5th column):
Hence CHIRAC obtained more votes in the départements located at the left hand side of the map.
These graphics are exactly the same as in the numerical case and provide various ways to display the distance between prototypes on the grid.
plot(korresp.som, what = "prototypes", type = "poly.dist", show.names = FALSE)
All these graphics show a clear separation between the top left corner of the map and the bottom right corner of the map.
The quality of the projection is provided by the function quality, which
outputs the same quality criteria as in the numeric case.
quality(korresp.som)
## $topographic
## [1] 0.1603774
##
## $quantization
## [1] 60033.83
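To give an intuition for these two criteria, here is a minimal base R sketch on toy data. It uses common textbook definitions (quantization error as the mean squared distance to the closest prototype, topographic error as the share of observations whose two closest prototypes are not neighbours on the grid); the exact formulas used by quality() are not stated in this document, so treat the definitions, the data and the 1 x 3 grid as assumptions.

```r
# Toy observations in 2 dimensions (hypothetical values)
obs <- matrix(c(0, 0,
                0, 1,
                4, 4,
                5, 4,
                2, 2), ncol = 2, byrow = TRUE)
# Toy prototypes of 3 neurons on a 1 x 3 grid: neuron i neighbours i - 1 and i + 1
protos <- matrix(c(0, 0,
                   5, 4,
                   2, 2), ncol = 2, byrow = TRUE)

# Squared Euclidean distances: one row per observation, one column per prototype
sq.dist <- sapply(seq_len(nrow(protos)), function(k) {
  rowSums(sweep(obs, 2, protos[k, ])^2)
})

# Best and second best matching neurons for each observation
bmu <- apply(sq.dist, 1, which.min)
second <- apply(sq.dist, 1, function(d) order(d)[2])

# Quantization error: mean squared distance to the best matching prototype
quantization <- mean(sq.dist[cbind(seq_len(nrow(obs)), bmu)])

# Topographic error: share of observations whose two closest neurons
# are not adjacent on the 1 x 3 grid
topographic <- mean(abs(bmu - second) > 1)
```

In this toy setting the prototypes are deliberately "folded" (neurons 1 and 3 are close in the data space but not adjacent on the grid), which is what produces a non-zero topographic error.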
In the SOM algorithm, the number of clusters is necessarily close to the number of neurons on the grid (not necessarily equal as some neurons may have no observations assigned to them). This - quite large - number may not suit the original data for a clustering purpose.
A usual way to address clustering with SOM is to perform a hierarchical
clustering on the prototypes. This clustering is directly available in the
package SOMbrero using the function superClass. To do so, you can first get a
quick overview to decide on the number of super clusters that suits your data.
plot(superClass(korresp.som))
## Warning in plot.somSC(superClass(korresp.som)): Impossible to plot the rectangles: no super clusters.
By default, the function plots both a dendrogram and the evolution of the
percentage of explained variance. Here, 3 super clusters seem to be a good
choice. The output of superClass is a somSC class object.
Basic functions have been defined for this class:
my.sc <- superClass(korresp.som, k = 3)
summary(my.sc)
##
## SOM Super Classes
## Initial number of clusters : 64
## Number of super clusters : 3
##
##
## Frequency table
## 1 2 3
## 18 14 32
##
## Clustering
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 1 1 2 2 2 2 2 1 1 1 2 2 2 2 2 3 1 1 1 1 2 2 2 3 3
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 3 1 1 1 1 2 3 3 3 3 3 1 1 1 3 3 3 3 3 3 3 1 3 3 3 3
## 53 54 55 56 57 58 59 60 61 62 63 64
## 3 3 3 3 3 3 3 3 3 3 3 3
plot(my.sc, plot.var = FALSE)
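The idea of clustering the prototypes hierarchically can be sketched with base R alone. The prototype values, the Ward linkage and the Euclidean distance below are assumptions chosen for illustration; superClass may use different settings internally.

```r
# Hypothetical prototypes: 6 neurons described by 2 variables
toy.protos <- matrix(c( 0.0, 0.1,
                        0.2, 0.0,
                        5.0, 5.1,
                        5.2, 4.9,
                       10.0, 0.0,
                       10.1, 0.3), ncol = 2, byrow = TRUE,
                     dimnames = list(paste0("neuron_", 1:6), c("v1", "v2")))

# Hierarchical clustering of the prototypes (Ward linkage is an assumption)
tree <- hclust(dist(toy.protos), method = "ward.D2")

# Cut the dendrogram into k = 3 super clusters: one label per neuron
super.clusters <- cutree(tree, k = 3)

# Frequency table: number of neurons in each super cluster
super.freq <- table(super.clusters)

# Observations inherit the super cluster of the neuron they are assigned to
toy.obs.neuron <- c(obs_a = 1, obs_b = 3, obs_c = 6, obs_d = 2)
obs.super <- setNames(super.clusters[toy.obs.neuron], names(toy.obs.neuron))
```

The frequency table plays the same role as the one printed by summary(my.sc): it counts neurons, not observations, in each super cluster.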
Like plot.somRes, the function plot.somSC has an argument 'type' which offers
many different plots and can thus be combined with most of the graphics
produced by plot.somSC:
"grid" fills the grid with colors according to the super clustering (and can provide a legend).
"dendro3d" plots a 3d dendrogram.
plot(my.sc, type = "grid")
The three super-clusters correspond to the candidates with the most votes (blue), the candidates with fewer votes (green) and, in between, départements with intermediate votes, among which BAYROU (from one of the center parties) is classified.
A couple of plots from plot.somRes are also available for the super
clustering. Some identify the super clusters with colors:
And some others identify the super clusters with titles:
This vignette has been computed with the following environment:
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.5 SOMbrero_1.4-1 markdown_1.1 igraph_1.2.10
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.24 purrr_0.3.4
## [4] lattice_0.20-45 colorspace_2.0-2 vctrs_0.3.8
## [7] generics_0.1.0 utf8_1.2.2 rlang_0.4.11
## [10] isoband_0.2.5 hexbin_1.28.2 pillar_1.6.2
## [13] glue_1.4.2 withr_2.4.2 DBI_1.1.1
## [16] RColorBrewer_1.1-2 lifecycle_1.0.0 plyr_1.8.6
## [19] stringr_1.4.0 metR_0.11.0 munsell_0.5.0
## [22] gtable_0.3.0 codetools_0.2-18 evaluate_0.14
## [25] labeling_0.4.2 ggwordcloud_0.5.0 knitr_1.33
## [28] fansi_0.5.0 highr_0.9 Rcpp_1.0.7
## [31] scales_1.1.1 backports_1.2.1 checkmate_2.0.0
## [34] scatterplot3d_0.3-41 mime_0.10 deldir_1.0-6
## [37] farver_2.1.0 interp_1.0-33 png_0.1-7
## [40] digest_0.6.27 stringi_1.6.2 dplyr_1.0.7
## [43] grid_4.1.2 tools_4.1.2 magrittr_2.0.1
## [46] tibble_3.1.3 crayon_1.4.1 pkgconfig_2.0.3
## [49] ellipsis_0.3.2 data.table_1.14.0 lubridate_1.7.10
## [52] assertthat_0.2.1 R6_2.5.1 compiler_4.1.2