Oscar Kjell
The text-package uses natural language processing and machine learning methods to examine text and numerical variables.
Central text functions are described below. The data and methods come from Kjell et al. (2019, pre-print), who show how individuals’ open-ended text answers can be used to measure, describe and differentiate psychological constructs.
In short, the workflow first transforms text variables into word embeddings. These word embeddings are then used to, for example, predict numerical variables, compute semantic similarity scores, statistically test differences in meaning between two sets of texts, and plot words in the word embedding space.
The textEmbed() function automatically transforms character variables in a given tibble to word embeddings. The example data used in this tutorial come from participants who described their harmony in life and satisfaction with life using a text response, 10 descriptive words, or rating scales. For a more detailed description, please see the word embedding tutorial.
library(text)
# Get example data including both text and numerical variables
sq_data <- Language_based_assessment_data_8
# Transform the text data to BERT word embeddings
wordembeddings <- textEmbed(sq_data)
# See how word embeddings are structured
wordembeddings
# Save the word embeddings to avoid having to import the text every time
saveRDS(wordembeddings, "wordembeddings.rds")
# Get the word embeddings again
wordembeddings <- readRDS("_YOURPATH_/wordembeddings.rds")
The textTrain() function is used to examine how well the word embeddings from a text can predict a numeric variable. This is done by training a ridge regression model on the word embeddings with 10-fold cross-validation (where the word embeddings are pre-processed using PCA). In the example below, we examine how well the harmony text responses can predict the rating scale scores from the Harmony in Life Scale.
library(text)
library(rio)
# Load data that has already gone through textEmbed
# The previous example only imported 10 participants;
# whereas below we load data from 100 participants
wordembeddings <- rio::import("https://r-text.org/text_data_examples/wordembeddings4_100.rda")
# Load corresponding numeric variables
numeric_data <- rio::import("https://r-text.org/text_data_examples/Language_based_assessment_data_8_100.rda")
# Examine the relationship between harmonytext and the corresponding rating scale
model_htext_hils <- textTrain(wordembeddings$harmonytexts,
                              numeric_data$hilstotal,
                              penalty = 1)
# Examine the correlation between predicted and observed Harmony in life scale scores
model_htext_hils$correlation
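To make the idea behind this kind of training concrete, here is a minimal base-R sketch of ridge regression evaluated with 10-fold cross-validation on PCA-reduced features. It is not the implementation used by textTrain(). The simulated data, the helper functions ridge_fit() and ridge_predict(), the number of components and the penalty value are all assumptions made for illustration.

```r
# Illustrative sketch only: simulated data stands in for word embeddings
# and rating scale scores; this is not how textTrain() is implemented.
set.seed(1)
n <- 100  # participants
d <- 20   # embedding dimensions
latent <- matrix(rnorm(n * 3), nrow = n)               # hidden "meaning" factors
loadings <- matrix(rnorm(3 * d), nrow = 3)
X <- latent %*% loadings + 0.1 * matrix(rnorm(n * d), nrow = n)
y <- latent[, 1] - latent[, 2] + rnorm(n, sd = 0.5)    # numeric outcome

# Hypothetical helper: closed-form ridge regression.
# The intercept is handled by centring and is not penalised.
ridge_fit <- function(X, y, penalty) {
  x_mean <- colMeans(X)
  y_mean <- mean(y)
  Xc <- sweep(X, 2, x_mean)
  coefs <- solve(crossprod(Xc) + penalty * diag(ncol(X)),
                 crossprod(Xc, y - y_mean))
  list(coefs = coefs, x_mean = x_mean, y_mean = y_mean)
}

# Hypothetical helper: predictions from a ridge_fit() object.
ridge_predict <- function(fit, X) {
  as.numeric(sweep(X, 2, fit$x_mean) %*% fit$coefs + fit$y_mean)
}

k <- 10             # number of folds
n_components <- 5   # assumed number of principal components to keep
fold <- sample(rep(seq_len(k), length.out = n))
predicted <- numeric(n)

for (i in seq_len(k)) {
  test <- fold == i
  # Fit the PCA on the training fold only, so that no information
  # from the held-out fold leaks into the pre-processing
  pca <- prcomp(X[!test, , drop = FALSE], center = TRUE, scale. = FALSE)
  train_scores <- pca$x[, seq_len(n_components), drop = FALSE]
  test_scores <- predict(pca, X[test, , drop = FALSE])[, seq_len(n_components), drop = FALSE]
  fit <- ridge_fit(train_scores, y[!test], penalty = 1)
  predicted[test] <- ridge_predict(fit, test_scores)
}

# Correlation between out-of-fold predictions and observed scores
cv_correlation <- cor(predicted, y)
cv_correlation
```

The quantity of interest is the correlation between the out-of-fold predictions and the observed scores, which is the same kind of summary as the correlation reported by the trained model above.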
The textSimilarityTest() function provides a permutation-based test to examine whether two sets of texts differ significantly in meaning. It produces a p-value and an estimate that serves as an effect size. Below we examine whether the harmony text and satisfaction text responses differ in meaning.
library(text)
# Compare the meaning between individuals' harmony in life and satisfaction with life answers
textSimilarityTest(word_embeddings_4$harmonytexts,
                   word_embeddings_4$satisfactiontexts,
                   Npermutations = 100,
                   output.permutations = FALSE)
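The logic of a permutation test on embeddings can be sketched in base R as follows. This is one common way to build such a test and is only an assumption about the general approach; the exact statistic computed by textSimilarityTest() may differ. The simulated embeddings and the helper functions cosine() and group_distance() are hypothetical.

```r
# Illustrative sketch only: simulated embeddings for two sets of texts.
set.seed(2)
n <- 40  # texts per set
d <- 10  # embedding dimensions
emb_a <- matrix(rnorm(n * d), nrow = n)
emb_b <- matrix(rnorm(n * d), nrow = n)
emb_a[, 1] <- emb_a[, 1] + 2  # set A leans towards dimension 1
emb_b[, 2] <- emb_b[, 2] + 2  # set B leans towards dimension 2

# Hypothetical helper: cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Hypothetical test statistic: 1 minus the cosine similarity
# between the mean embeddings of the two sets
group_distance <- function(x, y) 1 - cosine(colMeans(x), colMeans(y))

observed <- group_distance(emb_a, emb_b)

# Null distribution: shuffle the set labels and recompute the statistic
pooled <- rbind(emb_a, emb_b)
n_perm <- 1000
permuted <- replicate(n_perm, {
  idx <- sample(nrow(pooled))
  group_distance(pooled[idx[1:n], , drop = FALSE],
                 pooled[idx[(n + 1):(2 * n)], , drop = FALSE])
})

# One-sided p-value with the usual +1 correction
p_value <- (sum(permuted >= observed) + 1) / (n_perm + 1)
p_value
```

Because the labels are exchangeable under the null hypothesis of no difference in meaning, the proportion of shuffled statistics at least as large as the observed one gives the p-value, and the observed statistic itself can be read as an effect size.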
Plotting is done in two steps. First, the textProjection() function pre-processes the data, including computing statistics for each word to be plotted. Second, textProjectionPlot() visualizes the words, with many options for setting color, font, etc. for the figure. Dividing the procedure into two steps makes the process more transparent (since the user gets to see the output that the words are plotted according to) and quicker: the heavier computations are made in the first step, so the last step runs fast enough to try different design settings.
library(text)
# Pre-process word data to be plotted with the textProjectionPlot function
# word_embeddings_4 and Language_based_assessment_data_8 contain example data provided with the package.
# Pre-process data
df_for_plotting <- textProjection(Language_based_assessment_data_8$harmonywords,
                                  word_embeddings_4$harmonywords,
                                  word_embeddings_4$singlewords_we,
                                  Language_based_assessment_data_8$hilstotal,
                                  Language_based_assessment_data_8$swlstotal
)
df_for_plotting
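As a rough intuition for what a projection onto a score-defined direction involves, the following base-R sketch splits words into low and high groups by quartile of a numeric score, computes a centroid for each group, and projects every word embedding onto the line between the two centroids. It is a simplified assumption about the general idea, not the algorithm implemented in textProjection(), which also computes p-values and other statistics. The words, scores and random embeddings are made up.

```r
# Illustrative sketch only: made-up words, scores and random embeddings.
set.seed(3)
d <- 8
words <- c("calm", "peace", "balance", "stress", "conflict", "worry")
word_emb <- matrix(rnorm(length(words) * d), nrow = length(words),
                   dimnames = list(words, NULL))

# Hypothetical responses: one word per participant plus a rating scale score
responses <- data.frame(
  word  = c("calm", "peace", "balance", "peace",
            "stress", "conflict", "worry", "stress"),
  score = c(30, 28, 27, 29, 10, 12, 9, 11)
)

# Split responses into low and high groups by score quartile
q <- quantile(responses$score, c(0.25, 0.75))
low_words  <- responses$word[responses$score <= q[1]]
high_words <- responses$word[responses$score >= q[2]]

# Centroid (mean embedding) of the words used in each group
low_centroid  <- colMeans(word_emb[low_words, , drop = FALSE])
high_centroid <- colMeans(word_emb[high_words, , drop = FALSE])

# Unit vector pointing from the low centroid to the high centroid
direction <- high_centroid - low_centroid
unit_direction <- direction / sqrt(sum(direction^2))

# Centre on the midpoint between the centroids and project every word
midpoint <- (high_centroid + low_centroid) / 2
projection <- as.numeric(sweep(word_emb, 2, midpoint) %*% unit_direction)
names(projection) <- words
sort(projection)
```

Words with positive values lie closer to the high-score centroid along this direction and words with negative values lie closer to the low-score centroid; repeating the procedure with a second score would give a second axis.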
library(text)
# Used data (DP_projections_HILS_SWLS_100) has
# been pre-processed with the textProjection function
plot_projection <- textProjectionPlot(
word_data = DP_projections_HILS_SWLS_100,
k_n_words_to_test = FALSE,
plot_n_words_square = 5,
plot_n_words_p = 5,
plot_n_word_extreme = 1,
plot_n_word_frequency = 1,
plot_n_words_middle = 1,
y_axes = TRUE,
p_alpha = 0.05,
title_top = " Supervised Bicentroid Projection of Harmony in life words",
x_axes_label = "Low vs. High HILS score",
y_axes_label = "Low vs. High SWLS score",
p_adjust_method = "bonferroni",
points_without_words_size = 0.4,
points_without_words_alpha = 0.4
)
plot_projection
#> $final_plot
#>
#> $description
#> [1] "INFORMATION ABOUT THE PROJECTION type = textProjection words = $ wordembeddings = Information about the embeddings. textEmbedLayersOutput: model: bert-base-uncased ; layers: 11 12 . Warnings from python: Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers = 11 12 aggregate_layers = concatenate aggregate_tokens = mean tokens_select = tokens_deselect = single_wordembeddings = Information about the embeddings. textEmbedLayersOutput: model: bert-base-uncased layers: 11 12 . textEmbedLayerAggregation: layers = 11 12 aggregate_layers = concatenate aggregate_tokens = mean tokens_select = tokens_deselect = x = $ y = $ pca = aggregation = mean split = quartile word_weight_power = 1 min_freq_words_test = 0 Npermutations = 1e+06 n_per_split = 1e+05 type = textProjection words = Language_based_assessment_data_3_100 wordembeddings = Information about the embeddings. textEmbedLayersOutput: model: bert-base-uncased ; layers: 11 12 . 
Warnings from python: Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers = 11 12 aggregate_layers = concatenate aggregate_tokens = mean tokens_select = tokens_deselect = single_wordembeddings = Information about the embeddings. textEmbedLayersOutput: model: bert-base-uncased layers: 11 12 . textEmbedLayerAggregation: layers = 11 12 aggregate_layers = concatenate aggregate_tokens = mean tokens_select = tokens_deselect = x = Language_based_assessment_data_3_100 y = Language_based_assessment_data_3_100 pca = aggregation = mean split = quartile word_weight_power = 1 min_freq_words_test = 0 Npermutations = 1e+06 n_per_split = 1e+05 type = textProjection words = harmonywords wordembeddings = Information about the embeddings. textEmbedLayersOutput: model: bert-base-uncased ; layers: 11 12 . 
Warnings from python: Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers = 11 12 aggregate_layers = concatenate aggregate_tokens = mean tokens_select = tokens_deselect = single_wordembeddings = Information about the embeddings. textEmbedLayersOutput: model: bert-base-uncased layers: 11 12 . 
textEmbedLayerAggregation: layers = 11 12 aggregate_layers = concatenate aggregate_tokens = mean tokens_select = tokens_deselect = x = hilstotal y = swlstotal pca = aggregation = mean split = quartile word_weight_power = 1 min_freq_words_test = 0 Npermutations = 1e+06 n_per_split = 1e+05 INFORMATION ABOUT THE PLOT word_data = word_data k_n_words_to_test = FALSE min_freq_words_test = 1 min_freq_words_plot = 1 plot_n_words_square = 5 plot_n_words_p = 5 plot_n_word_extreme = 1 plot_n_word_frequency = 1 plot_n_words_middle = 1 y_axes = TRUE p_alpha = 0.05 p_adjust_method = bonferroni bivariate_color_codes = #398CF9 #60A1F7 #5dc688 #e07f6a #EAEAEA #40DD52 #FF0000 #EA7467 #85DB8E word_size_range = 3 - 8 position_jitter_hight = 0 position_jitter_width = 0.03 point_size = 0.5 arrow_transparency = 0.5 points_without_words_size = 0.4 points_without_words_alpha = 0.4 legend_x_position = 0.02 legend_y_position = 0.02 legend_h_size = 0.2 legend_w_size = 0.2 legend_title_size = 7 legend_number_size = 2"
#>
#> $processed_word_data
#> # A tibble: 583 × 32
#>    words   x_plotted p_values_x n_g1.x n_g2.x y_plotted p_values_y n_g1.y n_g2.y
#>    <chr>       <dbl>      <dbl>  <dbl>  <dbl>     <dbl>      <dbl>  <dbl>  <dbl>
#>  1 able         1.42      0.194      0      1      2.99  0.0000181      0      0
#>  2 accept…     0.732      0.451     -1      1      1.40     0.0396     -1      1
#>  3 accord       2.04     0.0651      0      1      3.45 0.00000401      0      1
#>  4 active       1.46      0.180      0      1      1.92    0.00895      0      1
#>  5 adapta…      2.40     0.0311      0      0     0.960      0.113      0      0
#>  6 admiri…     0.161      0.839      0      0      1.58     0.0255      0      0
#>  7 adrift      -2.64     0.0245     -1      0     -3.17  0.0000422     -1      0
#>  8 affini…      1.03      0.320      0      1      2.24    0.00324      0      1
#>  9 agreei…      1.62      0.140      0      1      2.12    0.00500      0      0
#> 10 alcohol     -2.15     0.0822     -1      0     -1.78     0.0212      0      0
#> # … with 573 more rows, and 23 more variables: n <dbl>, n.percent <dbl>,
#> # N_participant_responses <int>, adjusted_p_values.x <dbl>,
#> # adjusted_p_values.y <dbl>, square_categories <dbl>, check_p_square <dbl>,
#> # check_p_x_neg <dbl>, check_p_x_pos <dbl>, check_extreme_max_x <dbl>,
#> # check_extreme_min_x <dbl>, check_extreme_frequency_x <dbl>,
#> # check_middle_x <dbl>, extremes_all_x <dbl>, check_p_y_pos <dbl>,
#> # check_p_y_neg <dbl>, check_extreme_max_y <dbl>, …
The text package is new and has not yet been used in a publication. Therefore, the list below consists of papers that analyze human language in ways similar to what is possible with text.
Methods Articles
Gaining insights from social media language: Methodologies and challenges.
Kern et al. (2016). Psychological Methods.
Semantic measures: Using natural language processing to measure, differentiate, and describe psychological constructs. Pre-print.
Kjell et al. (2019). Psychological Methods.
Clinical Psychology
Facebook language predicts depression in medical records.
Eichstaedt, J. C., … & Schwartz, H. A. (2018). PNAS.
Social and Personality Psychology
Personality, gender, and age in the language of social media: The open-vocabulary approach.
Schwartz, H. A., … & Seligman, M. E. (2013). PLoS ONE.
Automatic Personality Assessment Through Social Media Language.
Park, G., Schwartz, H. A., … & Seligman, M. E. P. (2014). Journal of Personality and Social Psychology.
Health Psychology
Psychological language on Twitter predicts county-level heart disease mortality.
Eichstaedt, J. C., Schwartz, et al. (2015). Psychological Science.
Positive Psychology
The Harmony in Life Scale Complements the Satisfaction with Life Scale: Expanding the Conceptualization of the Cognitive Component of Subjective Well-Being.
Kjell et al. (2016). Social Indicators Research.
Computer Science: Python Software
DLATK: Differential language analysis toolkit.
Schwartz, H. A., Giorgi, et al. (2017). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.