tidycode

Read R files in as a tidy data frame

One way to analyze code is to read in existing R files. The read_rfiles() function parses your R files into individual R calls, recording the original file path and the line number for each call. The tidycode package includes some example files, with paths accessible via the tidycode_example() function. Let’s examine two: the example_plot.R file and the example_analysis.R file.

cat(readLines(tidycode_example("example_plot.R")), sep = '\n')
#> library(tidyverse)
#> 
#> starwars %>%
#>   select(height, mass) %>%
#>   filter(!is.na(mass), !is.na(height)) %>%
#>   ggplot(aes(height, mass)) +
#>   geom_point()

cat(readLines(tidycode_example("example_analysis.R")), sep = '\n')
#> library(tidyverse)
#> library(rms)
#> 
#> starwars %>%
#>   mutate(bmi = mass / ((height / 100) ^ 2)) %>%
#>   select(bmi, gender) -> starwars
#> 
#> dd <- datadist(starwars)
#> options(datadist = "dd")
#> 
#> mod <- ols(bmi ~ gender, data = starwars) %>%
#>   summary()
#> 
#> plot(mod)

Using the read_rfiles() function, we can read them in as a tidy data frame.

(d <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
  ))
#> # A tibble: 9 x 3
#>   file                                                            expr      line
#>   <chr>                                                           <list>   <int>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     1
#> 2 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     2
#> 3 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     1
#> 4 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     2
#> 5 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     3
#> 6 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     4
#> 7 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     5
#> 8 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     6
#> 9 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     7

This tidy data frame has one row per R call in the original file. It places the file path in the file column, the R call in the expr column, and the line number in the line column. Since this is in a tidy format, we can manipulate it using common data manipulation functions.
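As a minimal sketch of such a manipulation, assuming the tidycode and dplyr packages are installed, the following counts how many calls were read from each file. The name calls_per_file is introduced here for illustration only.

```r
library(tidycode)
library(dplyr)

# Re-create the data frame from the example files shipped with tidycode
d <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
)

# Count the calls per file, keeping only the file name so the output stays short
calls_per_file <- d %>%
  mutate(file = basename(file)) %>%
  count(file)

calls_per_file
```

Because each row is one call, the counts in the n column should add up to the number of rows in d.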

Let’s examine the first row.

d[1, ]
#> # A tibble: 1 x 3
#>   file                                                            expr      line
#>   <chr>                                                           <list>   <int>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua…     1

This is the first line of the example_plot.R file. We can dig into the expr list column to see what R call was made on this first line.

d[1, "expr"][[1]]
#> [[1]]
#> library(tidyverse)

The call is library(tidyverse).
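The <langua… type shown in the printed tibble suggests that each element of the expr column is an R language object rather than a string. As a rough base R illustration, the sketch below builds the same call by hand with quote() (standing in for the first element of d$expr) and takes it apart. The names first_call, fn_name and arg_text are hypothetical.

```r
# A call built by hand, standing in for the first element of d$expr
first_call <- quote(library(tidyverse))

# Calls behave like lists: the first element is the function,
# the remaining elements are the arguments
is.call(first_call)

# Name of the function being called
fn_name <- as.character(first_call[[1]])

# Arguments, converted back to text
arg_text <- vapply(as.list(first_call)[-1], deparse, character(1))

fn_name
arg_text
```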

Unnest calls into individual functions

The tidytext package unnests text into tokens, such as words or sentences, with its unnest_tokens() function. Similarly, we can unnest these calls into individual functions using the unnest_calls() function. To do this, we pipe the data frame we just created, d, into unnest_calls() and specify the column that contains the R calls, in this case expr.

library(dplyr)

d_funcs <- d %>%
  unnest_calls(expr)

d_funcs
#> # A tibble: 35 x 4
#>    file                                                     line func   args    
#>    <chr>                                                   <int> <chr>  <list>  
#>  1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     1 libra… <list […
#>  2 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     2 +      <list […
#>  3 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     2 %>%    <list […
#>  4 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     2 %>%    <list […
#>  5 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     2 %>%    <list […
#>  6 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     2 select <list […
#>  7 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     2 filter <list […
#>  8 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     2 !      <list […
#>  9 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     2 is.na  <list […
#> 10 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn…     2 !      <list […
#> # … with 25 more rows

This added two columns to our data frame: func, a character column indicating each function called, and args, a list column containing the arguments for each function. Let’s examine that first row again.

d_funcs[1, ]
#> # A tibble: 1 x 4
#>   file                                                      line func   args    
#>   <chr>                                                    <int> <chr>  <list>  
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/…     1 libra… <list […

Here the function is library, which tracks with what we have previously observed. Examining the args list column, we see the following.

d_funcs[1, "args"][[1]]
#> [[1]]
#> [[1]][[1]]
#> tidyverse

The argument for the library function on this first line is tidyverse. This aligns with what we observed: the first R call is library(tidyverse).
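To build some intuition for what unnesting a call involves, here is a small base R sketch that walks a call recursively and collects the name of every function it contains. It is an illustration under simplifying assumptions, not the implementation used by unnest_calls(); the helper list_funcs() and the object example_call are hypothetical names.

```r
# Recursively collect the name of every function called inside an expression.
# Symbols and constants (e.g. `mass`, 100) contribute nothing.
list_funcs <- function(expr) {
  if (!is.call(expr)) {
    return(character(0))
  }
  head <- expr[[1]]
  # The head is usually a symbol such as `filter`; deparse() also covers
  # heads like pkg::fun
  name <- if (is.symbol(head)) {
    as.character(head)
  } else {
    paste(deparse(head), collapse = "")
  }
  # Visit the arguments in order and flatten the results
  nested <- unlist(lapply(as.list(expr)[-1], list_funcs), use.names = FALSE)
  c(name, nested)
}

example_call <- quote(filter(starwars, !is.na(mass), !is.na(height)))
list_funcs(example_call)
```

The outer function comes first, followed by the functions found in its arguments, so operators such as ! show up as functions in their own right.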

Remove “stopwords”

In text analysis, there is the concept of “stopwords”. These are often small, common filler words you want to remove before completing an analysis, such as “a” or “the”. In a tidy code analysis, we can use a similar concept to remove some functions. For example, we may want to remove the assignment operator, <-, before completing an analysis. We have compiled a list of common stop functions, available via the get_stopfuncs() function, to anti-join from the data frame.

d_funcs %>%
  anti_join(get_stopfuncs())
#> Joining, by = "func"
#> # A tibble: 17 x 4
#>    file                                                line func     args       
#>    <chr>                                              <int> <chr>    <list>     
#>  1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     1 library  <list [1]> 
#>  2 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     2 select   <list [2]> 
#>  3 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     2 filter   <list [2]> 
#>  4 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     2 is.na    <list [1]> 
#>  5 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     2 is.na    <list [1]> 
#>  6 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     2 ggplot   <list [1]> 
#>  7 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     2 aes      <list [2]> 
#>  8 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     2 geom_po… <list [0]> 
#>  9 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     3 library  <list [1]> 
#> 10 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     4 library  <list [1]> 
#> 11 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     5 mutate   <named lis…
#> 12 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     5 select   <list [2]> 
#> 13 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     6 datadist <list [1]> 
#> 14 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     7 options  <named lis…
#> 15 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     8 ols      <named lis…
#> 16 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     8 summary  <list [0]> 
#> 17 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0…     9 plot     <list [1]>
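The same anti-join pattern works with a stop list of your own. The sketch below uses small hand-made data frames so that it runs with dplyr alone; funcs, custom_stopfuncs and kept are hypothetical names, and the choice to treat library as a stop function is only for illustration.

```r
library(dplyr)

# A toy stand-in for the func column of d_funcs
funcs <- tibble(func = c("library", "<-", "%>%", "select", "library"))

# A hypothetical custom stop list: any data frame with a func column will do
custom_stopfuncs <- tibble(func = c("<-", "%>%", "library"))

# Keep only the rows whose func is not in the stop list
kept <- anti_join(funcs, custom_stopfuncs, by = "func")

kept
```

Passing by = "func" explicitly also silences the "Joining, by" message. Assuming get_stopfuncs() returns a data frame with a func column, as the join message above suggests, a custom list like this could be combined with it using bind_rows() before the anti-join.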

Classify code

Akin to the tidytext get_sentiments() function for sentiment analysis, the tidycode package has a get_classifications() function that outputs a classification data frame. By default, this data frame contains two classification lexicons, crowdsource and leeklab; to get just one, specify the lexicon parameter. The crowdsource lexicon was developed by Twitter users who tried out the classify shiny application. The leeklab lexicon was curated by members of Jeff Leek’s Lab. In both lexicons, the same functions were classified multiple times by different users. The score column indicates the proportion of times a function was assigned to a given class. To use only the most prevalent classification, set the include_duplicates parameter to FALSE in the get_classifications() function. Here we merge in the crowdsource lexicon and keep only the most prevalent classification.

d_funcs %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(get_classifications("crowdsource", include_duplicates = FALSE)) %>%
  select(func, classification)
#> Joining, by = "func"
#> Joining, by = "func"
#> # A tibble: 15 x 2
#>    func       classification
#>    <chr>      <chr>         
#>  1 library    setup         
#>  2 select     data cleaning 
#>  3 filter     data cleaning 
#>  4 is.na      data cleaning 
#>  5 is.na      data cleaning 
#>  6 ggplot     visualization 
#>  7 aes        visualization 
#>  8 geom_point visualization 
#>  9 library    setup         
#> 10 library    setup         
#> 11 mutate     data cleaning 
#> 12 select     data cleaning 
#> 13 options    setup         
#> 14 summary    exploratory   
#> 15 plot       visualization

Notice we now have one classification per function. If we left the include_duplicates parameter at its default, TRUE, we would end up with more than one classification per function, along with a score column.

d_funcs %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(get_classifications("crowdsource")) %>%
  select(func, classification, score)
#> Joining, by = "func"
#> Joining, by = "func"
#> # A tibble: 115 x 3
#>    func    classification   score
#>    <chr>   <chr>            <dbl>
#>  1 library setup          0.687  
#>  2 library import         0.213  
#>  3 library visualization  0.0339 
#>  4 library data cleaning  0.0278 
#>  5 library modeling       0.0134 
#>  6 library exploratory    0.0128 
#>  7 library communication  0.00835
#>  8 library evaluation     0.00278
#>  9 library export         0.00111
#> 10 select  data cleaning  0.636  
#> # … with 105 more rows
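To see how the scored output relates to a single classification per function, the sketch below keeps the highest-scoring class for each function. It uses a hand-made data frame containing a few of the rows printed above, so it needs only dplyr (version 1.0.0 or later for slice_max()). This is an illustration of the idea, not necessarily how include_duplicates = FALSE is implemented; scores and top_class are hypothetical names.

```r
library(dplyr)

# A few rows copied from the scored output above
scores <- tibble(
  func = c("library", "library", "library", "select"),
  classification = c("setup", "import", "visualization", "data cleaning"),
  score = c(0.687, 0.213, 0.0339, 0.636)
)

# For each function, keep the single row with the highest score
top_class <- scores %>%
  group_by(func) %>%
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup()

top_class
```

For these rows, library is kept as setup and select as data cleaning, which agrees with the one-per-function output shown earlier.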