Getting started

Oscar Kjell

The text-package uses natural language processing and machine learning methods to examine text and numerical variables.

Central text functions are described below. The data and methods come from Kjell et al. (2019, pre-print), which shows how individuals’ open-ended text answers can be used to measure, describe and differentiate psychological constructs.

In short, the workflow starts by transforming text variables into word embeddings. These word embeddings are then used to, for example, predict numerical variables, compute semantic similarity scores, statistically test differences in meaning between two sets of texts, and plot words in the word embedding space.
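To make the idea of a semantic similarity score concrete, here is a minimal base R sketch that uses cosine similarity, a common measure for comparing embeddings. The three toy vectors and the helper name cosine_similarity() are hypothetical and only serve as illustration; the package's own similarity functions may be configured differently.

```r
# Toy 4-dimensional "embeddings" (hypothetical values);
# real embeddings have hundreds of dimensions.
emb_harmony      <- c(0.8, 0.1, 0.3, 0.5)
emb_satisfaction <- c(0.7, 0.2, 0.4, 0.4)
emb_unrelated    <- c(-0.5, 0.9, -0.2, 0.1)

# Cosine similarity: dot product divided by the product of the vector lengths.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim_close <- cosine_similarity(emb_harmony, emb_satisfaction)
sim_far   <- cosine_similarity(emb_harmony, emb_unrelated)
print(c(close = sim_close, far = sim_far))
```

Vectors that point in a similar direction get a score near 1, whereas vectors pointing in different directions get a lower (possibly negative) score.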

textEmbed(): mapping text to numbers

The textEmbed() function automatically transforms character variables in a given tibble to word embeddings. The example data used in this tutorial come from participants who described their harmony in life and satisfaction with life using a text response, 10 descriptive words, or rating scales. For a more detailed description, please see the word embedding tutorial.

library(text)

# Get example data including both text and numerical variables
sq_data <- Language_based_assessment_data_8

# Transform the text data to BERT word embeddings
wordembeddings <- textEmbed(sq_data)

# See how word embeddings are structured
wordembeddings

# Save the word embeddings to avoid having to embed the text every time
saveRDS(wordembeddings, "_YOURPATH_/wordembeddings.rds")

# Get the word embeddings again
wordembeddings <- readRDS("_YOURPATH_/wordembeddings.rds")
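The save-and-reload step can be tried out without running textEmbed(). The sketch below uses a temporary file and a small hypothetical list as a stand-in for the embeddings; the real object returned by textEmbed() may be structured differently.

```r
# Hypothetical stand-in for the embeddings: a named list holding one
# data frame with one row per participant and one column per dimension.
toy_embeddings <- list(
  harmonytexts = data.frame(Dim1 = c(0.1, 0.2), Dim2 = c(0.3, 0.4))
)

# Write the object to a temporary file and read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_embeddings, rds_path)
restored <- readRDS(rds_path)

# Check that the restored object matches the one that was saved.
round_trip_ok <- isTRUE(all.equal(toy_embeddings, restored))
round_trip_ok
```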

textTrain(): Examine the relationship between text and numeric variables

textTrain() is used to examine how well the word embeddings from a text can predict a numeric variable. This is done by training a ridge regression model on the word embeddings with 10-fold cross-validation (the word embeddings are pre-processed using PCA). In the example below, we examine how well the harmony text responses can predict the rating scale scores from the Harmony in life scale.

library(text)
library(rio)
# Load data that has already gone through textEmbed
# The previous example only imported 10 participants; 
# whereas below we load data from 100 participants
wordembeddings <- rio::import("https://r-text.org/text_data_examples/wordembeddings4_100.rda")
# Load corresponding numeric variables
numeric_data <- rio::import("https://r-text.org/text_data_examples/Language_based_assessment_data_8_100.rda")

# Examine the relationship between harmonytext and the corresponding rating scale
model_htext_hils <- textTrain(wordembeddings$harmonytexts, 
                              numeric_data$hilstotal, 
                              penalty = 1)

# Examine the correlation between predicted and observed Harmony in life scale scores
model_htext_hils$correlation
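The procedure described for textTrain() (PCA pre-processing, ridge regression, 10-fold cross-validation, and a correlation between predicted and observed scores) can be sketched in base R. This is a simplified illustration on simulated data, not the package's implementation: the helper ridge_fit(), the number of retained components (10) and the fixed penalty of 1 are assumptions made for the example.

```r
set.seed(1)

# Simulated data: 100 participants, 20 embedding dimensions driven by
# 3 latent factors, and a rating score that depends on the same factors.
n <- 100; p <- 20
latent <- matrix(rnorm(n * 3), n, 3)
X <- latent %*% matrix(rnorm(3 * p), 3, p) + matrix(rnorm(n * p, sd = 0.5), n, p)
colnames(X) <- paste0("Dim", seq_len(p))
y <- as.numeric(latent %*% c(1, -0.5, 0.5) + rnorm(n, sd = 0.5))

# Closed-form ridge fit on centered data; returns a prediction function.
ridge_fit <- function(X, y, penalty) {
  x_means <- colMeans(X)
  y_mean <- mean(y)
  Xc <- sweep(X, 2, x_means)
  beta <- solve(crossprod(Xc) + penalty * diag(ncol(Xc)), crossprod(Xc, y - y_mean))
  function(X_new) as.numeric(sweep(X_new, 2, x_means) %*% beta + y_mean)
}

k <- 10             # number of folds
n_components <- 10  # PCA components kept (assumption for this sketch)
fold <- sample(rep(seq_len(k), length.out = n))
predicted <- numeric(n)

for (i in seq_len(k)) {
  train <- fold != i
  # Fit PCA on the training folds only, then apply it to the held-out fold.
  pca <- prcomp(X[train, ], center = TRUE, scale. = TRUE)
  Z_train <- pca$x[, seq_len(n_components)]
  Z_test <- predict(pca, X[!train, , drop = FALSE])[, seq_len(n_components), drop = FALSE]
  predict_fold <- ridge_fit(Z_train, y[train], penalty = 1)
  predicted[!train] <- predict_fold(Z_test)
}

# Correlation between out-of-fold predictions and observed scores.
cv_correlation <- cor(predicted, y)
cv_correlation
```

Fitting the PCA inside each fold keeps information from the held-out participants out of the pre-processing, so the correlation reflects out-of-sample prediction.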

textSimilarityTest(): Test the difference in meaning between two sets of texts

The textSimilarityTest() function provides a permutation-based test to examine whether two sets of texts differ significantly in meaning. It produces a p-value and an estimate that serves as an effect size. Below we examine whether the harmony text and satisfaction text responses differ in meaning.

library(text)

# Compare the meaning between individuals' harmony in life and satisfaction with life answers
textSimilarityTest(
  word_embeddings_4$harmonytexts,
  word_embeddings_4$satisfactiontexts,
  Npermutations = 100,
  output.permutations = FALSE
)
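The logic of a permutation test on two sets of embeddings can be illustrated with a small base R sketch. The simulated data and the test statistic (the Euclidean distance between the mean embeddings of the two sets) are assumptions chosen for clarity; the statistic that textSimilarityTest() actually uses may differ.

```r
set.seed(42)

# Simulated embeddings: 40 texts per set, 8 dimensions; set B is shifted.
n_per_set <- 40; n_dim <- 8
set_a <- matrix(rnorm(n_per_set * n_dim), n_per_set, n_dim)
set_b <- matrix(rnorm(n_per_set * n_dim, mean = 0.8), n_per_set, n_dim)

# Test statistic (an assumption for this sketch): Euclidean distance
# between the mean embeddings of the two sets.
centroid_distance <- function(x, y) sqrt(sum((colMeans(x) - colMeans(y))^2))

observed <- centroid_distance(set_a, set_b)

# Null distribution: shuffle which set each text belongs to.
pooled <- rbind(set_a, set_b)
n_permutations <- 1000
null_stats <- replicate(n_permutations, {
  idx <- sample(nrow(pooled), n_per_set)
  centroid_distance(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
})

# Permutation p-value with a +1 correction so that it is never exactly 0.
p_value <- (sum(null_stats >= observed) + 1) / (n_permutations + 1)
c(estimate = observed, p_value = p_value)
```

More permutations give a finer-grained p-value at the cost of computation time, which is the trade-off controlled by Npermutations in the call above.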

Plot statistically significant words

Plotting is done in two steps. First, textProjection() pre-processes the data, including computing statistics for each word to be plotted. Second, textProjectionPlot() visualizes the words and offers many options for setting the color, font, etc. of the figure. Dividing the procedure into two steps makes it more transparent, since the user gets to see the output that the words are plotted according to. It also makes it quicker: the heavy computations are done in the first step, so the second step is fast enough to try out different design settings.

textProjection(): Pre-process data for plotting

library(text)

# Pre-process word data to be plotted with the textProjectionPlot() function
# word_embeddings_4 and Language_based_assessment_data_8 contain example data provided with the package.

# Pre-process data
df_for_plotting <- textProjection(
  Language_based_assessment_data_8$harmonywords,
  word_embeddings_4$harmonywords,
  word_embeddings_4$singlewords_we,
  Language_based_assessment_data_8$hilstotal,
  Language_based_assessment_data_8$swlstotal
)
df_for_plotting
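The geometric idea behind projecting words onto a direction between two groups can be sketched in base R. The example below is a simplified illustration under stated assumptions, not the computation performed by textProjection(): the 2-dimensional embeddings, the word lists and the group assignment are all hypothetical, and the p-value computation is left out.

```r
# Toy 2-dimensional word embeddings (hypothetical values).
word_embeddings <- rbind(
  peace   = c(0.9, 0.2),
  balance = c(0.8, 0.3),
  calm    = c(0.7, 0.1),
  stress  = c(-0.8, 0.2),
  adrift  = c(-0.9, -0.1),
  work    = c(0.0, 0.5)
)

# Words written by participants in the lowest and highest score groups
# (for example, the bottom and top quartiles of a rating scale).
low_group_words  <- c("stress", "adrift", "work")
high_group_words <- c("peace", "balance", "calm", "work")

# Centroid (mean embedding) of each group.
low_centroid  <- colMeans(word_embeddings[low_group_words, , drop = FALSE])
high_centroid <- colMeans(word_embeddings[high_group_words, , drop = FALSE])

# Unit-length direction pointing from the low to the high centroid.
direction <- high_centroid - low_centroid
direction <- direction / sqrt(sum(direction^2))

# Center the words midway between the centroids and project onto the direction.
midpoint <- (low_centroid + high_centroid) / 2
projection <- as.numeric(sweep(word_embeddings, 2, midpoint) %*% direction)
names(projection) <- rownames(word_embeddings)
sort(projection)
```

Words with positive values lie toward the high-score group and words with negative values lie toward the low-score group, which is the kind of information an axis such as "Low vs. High HILS score" conveys.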

textProjectionPlot(): A two-dimensional word plot

library(text)
# Used data (DP_projections_HILS_SWLS_100) has
# been pre-processed with the textProjection function
plot_projection <- textProjectionPlot(
  word_data = DP_projections_HILS_SWLS_100,
  k_n_words_to_test = FALSE,
  plot_n_words_square = 5,
  plot_n_words_p = 5,
  plot_n_word_extreme = 1,
  plot_n_word_frequency = 1,
  plot_n_words_middle = 1,
  y_axes = TRUE,
  p_alpha = 0.05,
  title_top = " Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  p_adjust_method = "bonferroni",
  points_without_words_size = 0.4,
  points_without_words_alpha = 0.4
)
plot_projection
#> $final_plot

#> 
#> $description
#> [1] "INFORMATION ABOUT THE PROJECTION type = textProjection words = $ wordembeddings = Information about the embeddings. textEmbedLayersOutput:  model: bert-base-uncased ;  layers: 11 12 . Warnings from python:  Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers =  11 12 aggregate_layers =  concatenate aggregate_tokens =  mean tokens_select =   tokens_deselect =   single_wordembeddings = Information about the embeddings. textEmbedLayersOutput:  model: bert-base-uncased layers: 11 12 . textEmbedLayerAggregation: layers =  11 12 aggregate_layers =  concatenate aggregate_tokens =  mean tokens_select =   tokens_deselect =   x = $ y = $ pca =  aggregation =  mean split =  quartile word_weight_power = 1 min_freq_words_test = 0 Npermutations = 1e+06 n_per_split = 1e+05 type = textProjection words = Language_based_assessment_data_3_100 wordembeddings = Information about the embeddings. textEmbedLayersOutput:  model: bert-base-uncased ;  layers: 11 12 . 
Warnings from python:  Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers =  11 12 aggregate_layers =  concatenate aggregate_tokens =  mean tokens_select =   tokens_deselect =   single_wordembeddings = Information about the embeddings. textEmbedLayersOutput:  model: bert-base-uncased layers: 11 12 . textEmbedLayerAggregation: layers =  11 12 aggregate_layers =  concatenate aggregate_tokens =  mean tokens_select =   tokens_deselect =   x = Language_based_assessment_data_3_100 y = Language_based_assessment_data_3_100 pca =  aggregation =  mean split =  quartile word_weight_power = 1 min_freq_words_test = 0 Npermutations = 1e+06 n_per_split = 1e+05 type = textProjection words = harmonywords wordembeddings = Information about the embeddings. textEmbedLayersOutput:  model: bert-base-uncased ;  layers: 11 12 . 
Warnings from python:  Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers =  11 12 aggregate_layers =  concatenate aggregate_tokens =  mean tokens_select =   tokens_deselect =   single_wordembeddings = Information about the embeddings. textEmbedLayersOutput:  model: bert-base-uncased layers: 11 12 . 
textEmbedLayerAggregation: layers =  11 12 aggregate_layers =  concatenate aggregate_tokens =  mean tokens_select =   tokens_deselect =   x = hilstotal y = swlstotal pca =  aggregation =  mean split =  quartile word_weight_power = 1 min_freq_words_test = 0 Npermutations = 1e+06 n_per_split = 1e+05 INFORMATION ABOUT THE PLOT word_data = word_data k_n_words_to_test = FALSE min_freq_words_test = 1 min_freq_words_plot = 1 plot_n_words_square = 5 plot_n_words_p = 5 plot_n_word_extreme = 1 plot_n_word_frequency = 1 plot_n_words_middle = 1 y_axes = TRUE p_alpha = 0.05 p_adjust_method = bonferroni bivariate_color_codes = #398CF9 #60A1F7 #5dc688 #e07f6a #EAEAEA #40DD52 #FF0000 #EA7467 #85DB8E word_size_range = 3 - 8 position_jitter_hight = 0 position_jitter_width = 0.03 point_size = 0.5 arrow_transparency = 0.5 points_without_words_size = 0.4 points_without_words_alpha = 0.4 legend_x_position = 0.02 legend_y_position = 0.02 legend_h_size = 0.2 legend_w_size = 0.2 legend_title_size = 7 legend_number_size = 2"
#> 
#> $processed_word_data
#> # A tibble: 583 × 32
#>    words   x_plotted p_values_x n_g1.x n_g2.x y_plotted p_values_y n_g1.y n_g2.y
#>    <chr>       <dbl>      <dbl>  <dbl>  <dbl>     <dbl>      <dbl>  <dbl>  <dbl>
#>  1 able        1.42      0.194       0      1     2.99  0.0000181       0      0
#>  2 accept…     0.732     0.451      -1      1     1.40  0.0396         -1      1
#>  3 accord      2.04      0.0651      0      1     3.45  0.00000401      0      1
#>  4 active      1.46      0.180       0      1     1.92  0.00895         0      1
#>  5 adapta…     2.40      0.0311      0      0     0.960 0.113           0      0
#>  6 admiri…     0.161     0.839       0      0     1.58  0.0255          0      0
#>  7 adrift     -2.64      0.0245     -1      0    -3.17  0.0000422      -1      0
#>  8 affini…     1.03      0.320       0      1     2.24  0.00324         0      1
#>  9 agreei…     1.62      0.140       0      1     2.12  0.00500         0      0
#> 10 alcohol    -2.15      0.0822     -1      0    -1.78  0.0212          0      0
#> # … with 573 more rows, and 23 more variables: n <dbl>, n.percent <dbl>,
#> #   N_participant_responses <int>, adjusted_p_values.x <dbl>,
#> #   adjusted_p_values.y <dbl>, square_categories <dbl>, check_p_square <dbl>,
#> #   check_p_x_neg <dbl>, check_p_x_pos <dbl>, check_extreme_max_x <dbl>,
#> #   check_extreme_min_x <dbl>, check_extreme_frequency_x <dbl>,
#> #   check_middle_x <dbl>, extremes_all_x <dbl>, check_p_y_pos <dbl>,
#> #   check_p_y_neg <dbl>, check_extreme_max_y <dbl>, …
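The effect of p_adjust_method = "bonferroni" together with p_alpha = 0.05 can be illustrated with base R's p.adjust(). The sketch below takes the p_values_y of five words shown in full in the output above and assumes, for illustration only, that the correction counts all 583 rows of the table; how the package counts the number of tests may differ.

```r
# p_values_y for five of the words shown in the output above.
p_values <- c(
  able    = 0.0000181,
  accord  = 0.00000401,
  active  = 0.00895,
  adrift  = 0.0000422,
  alcohol = 0.0212
)

# Assumption for this sketch: correct for all 583 words in the table.
n_tests <- 583

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
adjusted <- p.adjust(p_values, method = "bonferroni", n = n_tests)

# Words that remain significant at the chosen alpha level.
p_alpha <- 0.05
significant <- names(p_values)[adjusted < p_alpha]
significant
```

Under this assumption, a word such as "active" has an unadjusted p-value below 0.05 but no longer passes the threshold after correction.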

Relevant References

The text package is new and has not yet been used in a publication. Therefore, the list below consists of papers that analyze human language in ways similar to what is possible with text.

Methods Articles
Gaining insights from social media language: Methodologies and challenges.
Kern et al. (2016). Psychological Methods.

Semantic measures: Using natural language processing to measure, differentiate, and describe psychological constructs. Pre-print
Kjell et al. (2019). Psychological Methods.

Clinical Psychology
Facebook language predicts depression in medical records
Eichstaedt, J. C., … & Schwartz, H. A. (2018). PNAS.

Social and Personality Psychology
Personality, gender, and age in the language of social media: The open-vocabulary approach
Schwartz, H. A., … & Seligman, M. E. (2013). PLoS ONE.

Automatic Personality Assessment Through Social Media Language
Park, G., Schwartz, H. A., … & Seligman, M. E. P. (2014). Journal of Personality and Social Psychology.

Health Psychology
Psychological language on Twitter predicts county-level heart disease mortality
Eichstaedt, J. C., Schwartz, et al. (2015). Psychological Science.

Positive Psychology
The Harmony in Life Scale Complements the Satisfaction with Life Scale: Expanding the Conceptualization of the Cognitive Component of Subjective Well-Being
Kjell et al. (2016). Social Indicators Research.

Computer Science: Python Software
DLATK: Differential language analysis toolkit
Schwartz, H. A., Giorgi, et al. (2017). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.
