Please see the tidycode website for full documentation.
The tidycode package is an attempt to make analyzing R code tidy. It is modeled after the tidytext package.
One way to analyze code is to read in existing R files. The read_rfiles() function parses your R files into individual R calls, recording the original file path along with the line number of each call. The tidycode package includes some example files, with their paths accessible via the tidycode_example() function. Let’s examine two: the example_plot.R file and the example_analysis.R file.
cat(readLines(tidycode_example("example_plot.R")), sep = '\n')
#> library(tidyverse)
#>
#> starwars %>%
#> select(height, mass) %>%
#> filter(!is.na(mass), !is.na(height)) %>%
#> ggplot(aes(height, mass)) +
#> geom_point()
cat(readLines(tidycode_example("example_analysis.R")), sep = '\n')
#> library(tidyverse)
#> library(rms)
#>
#> starwars %>%
#> mutate(bmi = mass / ((height / 100) ^ 2)) %>%
#> select(bmi, gender) -> starwars
#>
#> dd <- datadist(starwars)
#> options(datadist = "dd")
#>
#> mod <- ols(bmi ~ gender, data = starwars) %>%
#> summary()
#>
#> plot(mod)
Using the read_rfiles() function, we can read them in as a tidy data frame.
(d <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
))
#> # A tibble: 9 x 3
#> file expr line
#> <chr> <list> <int>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 1
#> 2 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 2
#> 3 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 1
#> 4 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 2
#> 5 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 3
#> 6 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 4
#> 7 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 5
#> 8 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 6
#> 9 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 7
This tidy data frame has one row per R call in the original files. It places the file path in the file column, the R call in the expr column, and the line number in the line column. Since this is in a tidy format, we can manipulate it using common data manipulation functions.
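To give a rough sense of the idea, the following base R sketch turns a file into one row per top-level call. It is not how read_rfiles() is implemented; it relies only on base R's parse(), and the temporary file and its contents are made up for illustration.

```r
# Base R sketch only; tidycode's read_rfiles() may work differently.
# Write a small, made-up script to a temporary file.
path <- tempfile(fileext = ".R")
writeLines(c("library(stats)", "", "x <- c(1, 2, 3)", "mean(x)"), path)

# parse() returns one unevaluated call per top-level expression.
exprs <- parse(path, keep.source = TRUE)

# The srcref attribute records where each call starts in the file.
start_line <- vapply(
  attr(exprs, "srcref"),
  function(ref) as.integer(ref)[1L],
  integer(1)
)

# One row per call; expr is a list column holding the call objects.
calls <- data.frame(file = path, line = start_line, stringsAsFactors = FALSE)
calls$expr <- as.list(exprs)

calls$line
deparse(calls$expr[[3]])
```

Because the calls are stored unevaluated, the script is never run; only its structure is inspected.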
Let’s examine the first row.
d[1, ]
#> # A tibble: 1 x 3
#> file expr line
#> <chr> <list> <int>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 1
This is the first line of the example_plot.R file. We can dig into the expr list column to see what R call was made on this first line. The call is library(tidyverse).
Just as the tidytext package unnests text into tokens, such as words or sentences, with the unnest_tokens() function, we can unnest these calls into individual functions using the unnest_calls() function. To do this, we pipe the data frame we just created, d, into the unnest_calls() function and specify the column that contains the R calls, in this case expr.
library(dplyr)
d_funcs <- d %>%
  unnest_calls(expr)
d_funcs
#> # A tibble: 35 x 4
#> file line func args
#> <chr> <int> <chr> <list>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 1 libra… <list […
#> 2 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 + <list […
#> 3 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 %>% <list […
#> 4 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 %>% <list […
#> 5 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 %>% <list […
#> 6 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 select <list […
#> 7 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 filter <list […
#> 8 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 ! <list […
#> 9 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 is.na <list […
#> 10 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 ! <list […
#> # … with 25 more rows
This added two columns to our data frame: func, a column of type character indicating each function called, and args, a list column containing the arguments for each function. Let’s examine that first row again.
d_funcs[1, ]
#> # A tibble: 1 x 4
#> file line func args
#> <chr> <int> <chr> <list>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/… 1 libra… <list […
Here the function is library, which tracks with what we have previously observed. Examining the args list column, we see that the argument for the library function on this first line is tidyverse. This aligns with what we observed: the first R call is library(tidyverse).
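As a rough illustration of what unnesting a call involves, the base R sketch below first splits a single call into its function name and arguments, then walks a nested expression to collect every call inside it. The collect_calls() helper is hypothetical and written only for this example; unnest_calls() may order its rows differently and handle more edge cases.

```r
# Base R sketch only; unnest_calls() may differ in ordering and edge cases.

# A call object behaves like a list: the function first, then its arguments.
cl <- quote(library(tidyverse))
as.character(cl[[1]])  # the function name
as.list(cl)[-1]        # the arguments

# Hypothetical helper: walk an expression and record every call inside it.
# It does not handle empty arguments such as the one in x[1, ].
collect_calls <- function(expr) {
  if (!is.call(expr)) {
    return(list())  # symbols and constants contain no calls
  }
  fn <- expr[[1]]
  func <- if (is.symbol(fn)) {
    as.character(fn)
  } else {
    paste(deparse(fn), collapse = "")
  }
  args <- as.list(expr)[-1]
  here <- list(list(func = func, args = args))
  # Recurse into each argument and flatten the results by one level.
  nested <- unlist(lapply(args, collect_calls), recursive = FALSE)
  c(here, nested)
}

found <- collect_calls(quote(filter(!is.na(mass), !is.na(height))))
funcs <- vapply(found, function(rec) rec$func, character(1), USE.NAMES = FALSE)
funcs
```

Each record keeps the arguments alongside the function name, which mirrors the func and args columns described above.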
In text analysis, there is the concept of “stopwords”. These are often small, common filler words you want to remove before completing an analysis, such as “a” or “the”. In a tidy code analysis, we can use a similar concept to remove some functions. For example, we may want to remove the assignment operator, <-, before completing an analysis. We have compiled a list of common stop functions, returned by the get_stopfuncs() function, to anti-join from the data frame.
d_funcs %>%
  anti_join(get_stopfuncs())
#> Joining, by = "func"
#> # A tibble: 17 x 4
#> file line func args
#> <chr> <int> <chr> <list>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 1 library <list [1]>
#> 2 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 select <list [2]>
#> 3 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 filter <list [2]>
#> 4 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 is.na <list [1]>
#> 5 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 is.na <list [1]>
#> 6 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 ggplot <list [1]>
#> 7 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 aes <list [2]>
#> 8 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 geom_po… <list [0]>
#> 9 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 3 library <list [1]>
#> 10 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 4 library <list [1]>
#> 11 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 5 mutate <named lis…
#> 12 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 5 select <list [2]>
#> 13 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 6 datadist <list [1]>
#> 14 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 7 options <named lis…
#> 15 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 8 ols <named lis…
#> 16 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 8 summary <list [0]>
#> 17 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 9 plot <list [1]>
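For readers less familiar with anti-joins, the same filtering can be sketched in base R: keep only the rows whose func has no match in the stop list. The stop list below is a made-up stand-in, not the actual contents of get_stopfuncs().

```r
# Base R sketch of an anti-join on a single key column.
# The stop list is hypothetical, not the real get_stopfuncs() output.
funcs <- data.frame(
  line = c(1L, 2L, 2L, 2L, 2L),
  func = c("library", "+", "%>%", "select", "!"),
  stringsAsFactors = FALSE
)
stop_funcs <- c("+", "%>%", "!", "<-")

# Keep only the rows whose func does not appear in the stop list.
kept <- funcs[!funcs$func %in% stop_funcs, , drop = FALSE]
kept$func
```

Operators such as the pipe drop out, leaving the functions that say something about what the code is doing.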
Akin to the tidytext get_sentiments() function for sentiment analysis, the tidycode package has a get_classifications() function that outputs a classification data frame. By default, this data frame contains two classification lexicons, crowdsource and leeklab; to get just one, specify the lexicon parameter. The crowdsource lexicon was developed by Twitter users who tried out the classify Shiny application. The leeklab lexicon was curated by members of Jeff Leek’s Lab. In both lexicons, the same functions were classified multiple times by different users. The score column indicates the proportion of times a function was assigned a given classification. To use only the most prevalent classification, set the include_duplicates parameter to FALSE in the get_classifications() function. Here we will merge in the crowdsource lexicon, picking the most prevalent classification by setting the include_duplicates parameter to FALSE.
d_funcs %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(get_classifications("crowdsource", include_duplicates = FALSE)) %>%
  select(func, classification)
#> Joining, by = "func"
#> Joining, by = "func"
#> # A tibble: 15 x 2
#> func classification
#> <chr> <chr>
#> 1 library setup
#> 2 select data cleaning
#> 3 filter data cleaning
#> 4 is.na data cleaning
#> 5 is.na data cleaning
#> 6 ggplot visualization
#> 7 aes visualization
#> 8 geom_point visualization
#> 9 library setup
#> 10 library setup
#> 11 mutate data cleaning
#> 12 select data cleaning
#> 13 options setup
#> 14 summary exploratory
#> 15 plot visualization
Notice we now have one classification per function. If we left the include_duplicates parameter at its default, TRUE, we would end up with more than one classification per function, along with a score column.
d_funcs %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(get_classifications("crowdsource")) %>%
  select(func, classification, score)
#> Joining, by = "func"
#> Joining, by = "func"
#> # A tibble: 115 x 3
#> func classification score
#> <chr> <chr> <dbl>
#> 1 library setup 0.687
#> 2 library import 0.213
#> 3 library visualization 0.0339
#> 4 library data cleaning 0.0278
#> 5 library modeling 0.0134
#> 6 library exploratory 0.0128
#> 7 library communication 0.00835
#> 8 library evaluation 0.00278
#> 9 library export 0.00111
#> 10 select data cleaning 0.636
#> # … with 105 more rows
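To connect the scored output to the single classification per function shown earlier, here is a base R sketch that keeps the highest-scoring classification for each function and then joins it onto a table of functions. It assumes that "most prevalent" means the highest score; how get_classifications() handles ties is not covered here. The lexicon rows are a small subset copied from the output above.

```r
# Base R sketch only; scores are copied from the output above and only a
# few rows are reproduced.
lexicon <- data.frame(
  func = c("library", "library", "library", "select"),
  classification = c("import", "setup", "visualization", "data cleaning"),
  score = c(0.213, 0.687, 0.0339, 0.636),
  stringsAsFactors = FALSE
)

# Sort so the highest score comes first within each function,
# then keep the first row per function.
by_score <- lexicon[order(lexicon$func, -lexicon$score), ]
top <- by_score[!duplicated(by_score$func), c("func", "classification")]

# Attach the top classification to a small table of functions.
funcs <- data.frame(
  func = c("library", "select", "library"),
  stringsAsFactors = FALSE
)
classified <- merge(funcs, top, by = "func")
classified
```

Note that merge() sorts its result by the key column, so the original row order is not preserved.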