This short vignette illustrates basic usage of the OutlierTree library for outlier detection, using the hypothyroid dataset which is bundled with it.
The library flags suspicious values within an observation and presents them in a human-readable format, contrasting each one against the normal values and, where applicable, adding conditions on other variables that make the observation more suspicious. It does so in much the same way as one would manually: by checking extreme values in sorted order and filtering observations according to the values of other variables (e.g. whether some other variable is TRUE or FALSE).
For a full description of the procedure, see "Explainable outlier detection through decision tree conditioning".
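The manual procedure described above can be sketched in a few lines of base R. The data frame below is invented toy data (not the hypothyroid dataset), and the "largest gap" rule is a simplification chosen for illustration; it is not the criterion the library actually applies.

```r
# Toy data, invented for illustration only (not the hypothyroid dataset)
toy <- data.frame(
  age      = c(23, 27, 29, 31, 33, 35, 38, 75, 52, 61, 68, 79),
  pregnant = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
               FALSE, FALSE, FALSE, FALSE)
)

# Step 1: filter observations according to the value of another variable
grp <- toy[toy$pregnant, ]

# Step 2: sort the values of the variable being examined
ages <- sort(grp$age)

# Step 3: find the largest gap between consecutive sorted values
gaps <- diff(ages)
largest_gap <- which.max(gaps)
threshold <- ages[largest_gap]   # last value before the jump

# Values beyond the jump are the candidates for manual review
suspects <- grp[grp$age > threshold, ]

print(threshold)
print(suspects)
```

In this toy example an age of 75 is unremarkable across all rows (there is also a 79), but it stands apart once the rows are restricted to `pregnant == TRUE`, which is the kind of conditional outlier the library is meant to surface.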
This dataset describes hospital patients who might have hypo- or hyperthyroidism. Each observation is an anonymous person whose demographic characteristics, drug intake, and hormone indicators were recorded, along with the judgement about their condition.
It contains many interesting outliers which have something obviously wrong when examined visually, but which would nevertheless be missed by other outlier detection methods.
library(outliertree)
data(hypothyroid)
summary(hypothyroid)
 age               sex          on.thyroxine     query.on.thyroxine
 Min.   :  1.00    F   :1817    Mode :logical    Mode :logical
 1st Qu.: 36.00    M   : 849    FALSE:2442       FALSE:2733
 Median : 54.00    NA's: 106    TRUE :330        TRUE :39
 Mean   : 51.75
 3rd Qu.: 67.00
 Max.   :455.00
 NA's   :1
 on.antithyroid.medication   sick             pregnant         thyroid.surgery
 Mode :logical               Mode :logical    Mode :logical    Mode :logical
 FALSE:2738                  FALSE:2663       FALSE:1882       FALSE:2734
 TRUE :34                    TRUE :109        TRUE :41         TRUE :38
                                              NA's :849
 I131.treatment    query.hypothyroid   query.hyperthyroid   lithium
 Mode :logical     Mode :logical       Mode :logical        Mode :logical
 FALSE:2724        FALSE:2611          FALSE:2600           FALSE:2758
 TRUE :48          TRUE :161           TRUE :172            TRUE :14
 tumor             goitre              hypopituitary        psych
 Mode :logical     Mode :logical       Mode :logical        Mode :logical
 FALSE:2747        FALSE:2701          FALSE:2771           FALSE:2638
 TRUE :25          TRUE :71            TRUE :1              TRUE :134
 TSH               T3               TT4              T4U
 Min.   :  0.005   Min.   : 0.05    Min.   :  2.0    Min.   :0.3100
 1st Qu.:  0.430   1st Qu.: 1.60    1st Qu.: 88.0    1st Qu.:0.8800
 Median :  1.400   Median : 2.00    Median :104.0    Median :0.9800
 Mean   :  4.509   Mean   : 2.03    Mean   :109.1    Mean   :0.9986
 3rd Qu.:  2.600   3rd Qu.: 2.40    3rd Qu.:125.0    3rd Qu.:1.0800
 Max.   :478.000   Max.   :10.60    Max.   :430.0    Max.   :2.1200
 NA's   :277       NA's   :576      NA's   :179      NA's   :291
 referral.source   diagnosis           FTI
 other:1615        compensated: 154    Min.   :  2.381
 STMW :  91        negative   :2553    1st Qu.: 92.857
 SVHC : 275        primary    :  63    Median :107.175
 SVHD :  31        secondary  :   2    Mean   :110.791
 SVI  : 760                            3rd Qu.:124.272
                                       Max.   :394.495
                                       NA's   :292
otree <- outlier.tree(hypothyroid, nthreads=1)
Reporting top 8 outliers [out of 8 found]

row [531] - suspicious column: [hypopituitary] - suspicious value: [TRUE]
    distribution: 99.964% different [norm. obs: 2772]

row [623] - suspicious column: [age] - suspicious value: [455.00]
    distribution: 99.964% <= 94.00 - [mean: 51.60] - [sd: 18.98] - [norm. obs: 2770]

row [2230] - suspicious column: [T3] - suspicious value: [10.60]
    distribution: 99.951% <= 7.10 - [mean: 1.98] - [sd: 0.75] - [norm. obs: 2050]
    given:
        [query.hyperthyroid] = [FALSE]

row [1138] - suspicious column: [age] - suspicious value: [75.00]
    distribution: 95.122% <= 42.00 - [mean: 31.46] - [sd: 5.28] - [norm. obs: 39]
    given:
        [pregnant] = [TRUE]

row [2211] - suspicious column: [age] - suspicious value: [73.00]
    distribution: 95.122% <= 42.00 - [mean: 31.46] - [sd: 5.28] - [norm. obs: 39]
    given:
        [pregnant] = [TRUE]

row [1438] - suspicious column: [FTI] - suspicious value: [394.50]
    distribution: 99.618% <= 232.08 - [mean: 132.68] - [sd: 28.23] - [norm. obs: 261]
    given:
        [TT4] > [123.00] (value: 430.00)
        [referral.source] != [other] (value: STMW)

row [745] - suspicious column: [TT4] - suspicious value: [239.00]
    distribution: 98.571% <= 177.00 - [mean: 135.23] - [sd: 12.57] - [norm. obs: 69]
    given:
        [FTI] between (97.96, 128.12] (value: 112.74)
        [T4U] > [1.12] (value: 2.12)
        [age] > [55.00] (value: 87.00)

row [1412] - suspicious column: [TT4] - suspicious value: [430.00]
    distribution: 99.762% <= 230.00 - [mean: 111.88] - [sd: 31.88] - [norm. obs: 420]
    given:
        [T3] is NA
In other words, the model is saying that it is abnormal to be pregnant at the age of 75, or to not be classified as hyperthyroidal when having very high thyroid hormone levels.
A look at the distributions within the clusters in which some outliers were flagged:
pregnant <- hypothyroid[hypothyroid$pregnant,]
hist(pregnant$age, breaks=50, col="navy",
     main="Age distribution among pregnant patients",
     xlab="Age")
non.hyperthyr <- hypothyroid[!hypothyroid$query.hyperthyroid,]
hist(non.hyperthyr$T3, breaks=50, col="darkred",
     main="T3 hormone levels\n(Non-hyperthyroidal patients)",
     xlab="T3 blood concentration")
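One detail of the subsetting above deserves a note: the `pregnant` column contains missing values (849 according to the summary), and indexing rows with a logical vector that contains `NA` returns an all-`NA` row for each missing entry. `hist()` discards non-finite values, so the plots are unaffected, but row counts taken from such a subset would be inflated. The sketch below shows the difference on invented toy data; wrapping the condition in `which()` is one common way to keep only the rows where it is actually TRUE.

```r
# Toy data frame with missing values in the logical column (invented)
toy <- data.frame(
  age      = c(25, 31, 40, 58, 75),
  pregnant = c(TRUE, NA, FALSE, TRUE, NA)
)

# Plain logical indexing: every NA in the index produces a row filled with NA
with_na <- toy[toy$pregnant, ]
print(nrow(with_na))

# which() returns only the positions where the condition is TRUE
only_true <- toy[which(toy$pregnant), ]
print(nrow(only_true))
```

Here `with_na` has four rows (two real ones plus two all-`NA` rows), while `only_true` has the two rows that genuinely satisfy the condition.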
The identified outliers, along with all the relevant information, are returned as a list of lists, which can be inspected manually to extract the exact conditions (see the documentation for more details).
The result nevertheless has a class of its own, which provides pretty-printing and slicing:
outliers <- predict(otree, hypothyroid, outliers_print=FALSE)
outliers[1:700]
Reporting top 2 outliers [out of 2 found]

row [531] - suspicious column: [hypopituitary] - suspicious value: [TRUE]
    distribution: 99.964% different [norm. obs: 2772]

row [623] - suspicious column: [age] - suspicious value: [455.00]
    distribution: 99.964% <= 94.00 - [mean: 51.60] - [sd: 18.98] - [norm. obs: 2770]

outliers[1138]
Reporting top 1 outliers [out of 1 found]

row [1138] - suspicious column: [age] - suspicious value: [75.00]
    distribution: 95.122% <= 42.00 - [mean: 31.46] - [sd: 5.28] - [norm. obs: 39]
    given:
        [pregnant] = [TRUE]

outliers[[1138]]
$suspicous_value
$suspicous_value$column
[1] "age"

$suspicous_value$value
[1] 75

$suspicous_value$decimals
[1] 0

$group_statistics
$group_statistics$upper_thr
[1] 42

$group_statistics$pct_below
[1] 0.9512195

$group_statistics$mean
[1] 31.46154

$group_statistics$sd
[1] 5.28078

$group_statistics$n_obs
[1] 39

$conditions
$conditions[[1]]
$conditions[[1]]$column
[1] "pregnant"

$conditions[[1]]$value_this
[1] TRUE

$conditions[[1]]$comparison
[1] "="

$conditions[[1]]$value_comp
[1] TRUE

$tree_depth
[1] 1

$uses_NA_branch
[1] FALSE

$outlier_score
[1] 0.01297346
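As a rough sketch of how the exact conditions might be extracted programmatically from such an entry, the snippet below rebuilds part of the printed structure as a plain list, copied by hand from the output above so that it runs without the package (with the real object one would start from `outliers[[1138]]` instead). The field names are those shown in the printout; the standard-deviation distance computed at the end is an extra illustration and is not a quantity the library reports.

```r
# Hand-copied subset of the printed entry for row 1138 (see output above);
# with the real object, use: res <- outliers[[1138]]
res <- list(
  suspicous_value  = list(column = "age", value = 75, decimals = 0),
  group_statistics = list(upper_thr = 42, pct_below = 0.9512195,
                          mean = 31.46154, sd = 5.28078, n_obs = 39),
  conditions = list(
    list(column = "pregnant", value_this = TRUE,
         comparison = "=", value_comp = TRUE)
  )
)

# Turn each condition into a short "column comparison value" string
cond_text <- vapply(
  res$conditions,
  function(cond) paste(cond$column, cond$comparison, cond$value_comp),
  character(1)
)

# Distance of the flagged value from the group mean, in standard deviations
# (illustrative only; not part of the library's output)
z <- (res$suspicous_value$value - res$group_statistics$mean) /
  res$group_statistics$sd

cat(res$suspicous_value$column, "=", res$suspicous_value$value,
    "given", paste(cond_text, collapse = " and "), "\n")
cat("standard deviations above the group mean:", round(z, 2), "\n")
```

Using `vapply()` over `$conditions` keeps the same code working for entries that carry several conditions, such as row 745 above.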