The CodeDepends package provides a flexible framework for statically analyzing R code (i.e., without evaluating it). It also contains higher-level functionality for: detecting dependencies between R code blocks or expressions, “tree-shaking” (pruning a script down to only the expressions necessary to evaluate a given expression), plotting variable usage timelines, and more.
The primary functions for basic code analysis are readScript, which reads in R scripts of various forms (including .R and .Rmd files), and getInputs, which performs the low-level code analysis.
The readScript function returns a Script object (essentially a list of ScriptNodes representing the top-level expressions in the script). This can then be passed to getInputs, which in that case returns a ScriptInfo object. A ScriptInfo object can be thought of as a list of ScriptNodeInfo objects representing information about those top-level expressions.
R expressions can also be passed directly to getInputs, which returns a single ScriptNodeInfo object in that case. While in practice users will generally call getInputs on entire scripts, passing expressions directly is useful for testing and illustration.
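As a minimal sketch of the script-level workflow, the following writes a small, made-up script to a temporary file, reads it with readScript, and analyzes it with getInputs. The file contents and variable names are purely illustrative, and iterating over the result with lapply relies on the list-like behavior of ScriptInfo described above.

```r
library(CodeDepends)

# Hypothetical three-line script, written to a temporary .R file
f <- tempfile(fileext = ".R")
writeLines(c("n <- 10",
             "x <- rnorm(n)",
             "m <- mean(x)"), f)

sc <- readScript(f)      # Script: one node per top-level expression
info <- getInputs(sc)    # ScriptInfo: one ScriptNodeInfo per node

# Variables created by each top-level expression
outs <- lapply(info, function(node) node@outputs)
```

Each element of outs should correspond to one top-level expression in the file, in order.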
As stated above, ScriptNodeInfo objects are the units of information about single expressions being analyzed, and collect various information extracted from examining the expression itself:
library(CodeDepends)
getInputs(quote(x <- y + rnorm(10, sd = z)))
## An object of class "ScriptNodeInfo"
## Slot "files":
## character(0)
##
## Slot "strings":
## character(0)
##
## Slot "libraries":
## character(0)
##
## Slot "inputs":
## [1] "y" "z"
##
## Slot "outputs":
## [1] "x"
##
## Slot "updates":
## character(0)
##
## Slot "functions":
## + rnorm
## NA NA
##
## Slot "removes":
## character(0)
##
## Slot "nsevalVars":
## character(0)
##
## Slot "sideEffects":
## character(0)
##
## Slot "code":
## x <- y + rnorm(10, sd = z)
As we can see, the information includes any string literals used in the expression, split into file and non-file strings based on whether the string appears to point to an existing path at analysis time with respect to the basedir argument (which defaults to the current directory). It also contains any libraries loaded by the code (via library, require, or requireNamespace calls).
Next are the inputs and outputs of the expression, which are the variables used by the expression and created by the expression (via assignment), respectively. By default, these lists will not include symbols that are used in a non-standardly evaluated way (e.g., within the construction of a ggplot2 plot object). These non-standard evaluation variables are collected separately (as nsevalVars).
Variables whose values are updated (i.e., ones that are assigned new values which depend on their existing value) are collected separately. These updates can take many forms, including:
x = x + 5
rownames(x) = 5
x[1:3] = 5
x = lapply(1:5, function(i) x[i]^2)
x$y = 5
In all of the above cases, the variable x will be listed in both the updates and inputs categories, but NOT in the outputs category.
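One way to check this behavior is to analyze a few of these forms directly and inspect the updates slot. This is only a sketch: it uses <- inside quote() (assuming it is treated the same as the = assignments listed above), and the object names exprs, infos, and upd are illustrative.

```r
library(CodeDepends)

# A few of the update forms shown above, written with `<-` so that
# they can be passed through quote()
exprs <- list(quote(x <- x + 5),
              quote(x[1:3] <- 5),
              quote(x$y <- 5))

# One ScriptNodeInfo per expression
infos <- lapply(exprs, getInputs)

# Collect the "updates" slot of each ScriptNodeInfo
upd <- lapply(infos, function(node) node@updates)
```

The inputs and outputs slots can be inspected in the same way (for example, infos[[1]]@inputs) to compare against the description above.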
Next are the functions that were called by the expression. These include those invoked as functionals, e.g., via the apply family or the mutate_* and summarize_* families. We note here that the functions list is actually a logical vector indicating whether each function was locally defined within the script (TRUE), defined within a package (FALSE), or unknown (NA). The names of the vector are the names of the functions. Currently, functions will always be unknown if a single expression is analyzed directly; function provenance detection is only applied to full scripts.
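Because the functions slot is a named logical vector, it can be queried with ordinary vector operations. The sketch below reuses the expression from the earlier example; the variable names fns, called, and unknown are illustrative.

```r
library(CodeDepends)

info <- getInputs(quote(x <- y + rnorm(10, sd = z)))

fns <- info@functions           # named logical vector
called <- names(fns)            # names of the functions that were called
unknown <- called[is.na(fns)]   # functions whose provenance is unknown (NA)
```

Since a single expression is analyzed here rather than a full script, every entry is expected to be NA, so unknown should match called.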
Finally, the list of removed variables, any side effects CodeDepends is able to detect, and a copy of the code complete the information extracted.
Symbols within formulas are treated specially when analyzing code, based on the formulaInputs argument to getInputs. If FALSE (the default), they are assumed to be evaluated non-standardly (e.g., in the context of a data.frame); if TRUE, they are counted as standard inputs. Currently there is no capacity for mixing these behaviors within a single call to getInputs.
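To see the effect of formulaInputs, the same expression can be analyzed under both settings and the relevant slots compared. This is a sketch only: the lm call and the names y, x, and df are hypothetical, and the exact slot contents may depend on the default handlers in the installed version of CodeDepends.

```r
library(CodeDepends)

e <- quote(lm(y ~ x, data = df))   # hypothetical model-fitting call

nse <- getInputs(e)                        # default: formulaInputs = FALSE
std <- getInputs(e, formulaInputs = TRUE)  # formula symbols as standard inputs

# Compare where the formula symbols end up under each setting
cmp <- list(
    default       = list(inputs = nse@inputs, nsevalVars = nse@nsevalVars),
    formulaInputs = list(inputs = std@inputs, nsevalVars = std@nsevalVars)
)
```

Printing cmp shows the inputs and nsevalVars slots side by side for the two settings.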
The getInputs function accepts a collector argument, which essentially specifies a state tracker to be used when walking the code to collect inputs, functions called, etc.
For largely historical reasons, input collectors are roughly defined as the output from the inputCollector constructor, rather than as a more formal class.
When creating an input collector, various behaviors can be customized, primarily in the form of handlers which specify what to do when analyzing calls to specific functions. This is, for example, how CodeDepends knows that some arguments within certain functions are non-standardly evaluated. CodeDepends ships with a robust set of default handlers, but these can be overridden or supplemented with custom handlers by specifying them when constructing the collector, either via the ... arguments or as a list. In both cases, the names are the names of the functions the handlers should be used on.
col = inputCollector(library = function(e, collector, ...) {
    print(paste("Hello", asVarName(e)))
    defaultFuncHandlers$library(e, collector, ...)
})
getInputs(quote(library(CodeDepends)), collector = col)
## [1] "Hello CodeDepends"
## An object of class "ScriptNodeInfo"
## Slot "files":
## character(0)
##
## Slot "strings":
## character(0)
##
## Slot "libraries":
## [1] "CodeDepends"
##
## Slot "inputs":
## character(0)
##
## Slot "outputs":
## character(0)
##
## Slot "updates":
## character(0)
##
## Slot "functions":
## named logical(0)
##
## Slot "removes":
## character(0)
##
## Slot "nsevalVars":
## character(0)
##
## Slot "sideEffects":
## character(0)
##
## Slot "code":
## library(CodeDepends)
inputCollector also accepts arguments which control what is counted as an input when processing expressions. The inclPrevOutput argument specifies whether output variables should be included as inputs to subsequent expressions when processing multiple expressions as a single block (e.g., when they are wrapped in {}). The checkLibrarySymbols and funcsAsInputs arguments control how symbols which appear to be resolved within libraries and functions which are called are handled, respectively. The default behavior is for all of these to be FALSE.
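As a rough illustration of inclPrevOutput, the sketch below analyzes a small {} block with the default collector and with a collector constructed with inclPrevOutput = TRUE. The block contents and the names blk, res_default, and res_prev are hypothetical; the point is only to compare the resulting inputs slots.

```r
library(CodeDepends)

# Hypothetical block in which the second expression uses the output of the first
blk <- quote({
    a <- 1
    b <- a + 1
})

# Default collector versus one that counts earlier outputs as inputs
res_default <- getInputs(blk)
res_prev <- getInputs(blk, collector = inputCollector(inclPrevOutput = TRUE))

in_default <- res_default@inputs
in_prev <- res_prev@inputs
```

Comparing in_default with in_prev shows whether a, which is created earlier in the same block, is reported as an input under each setting.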
CodeDepends can visualize code in various ways.
We can create the graph of dependencies between variables via the makeVariableGraph function:
f = system.file("samples", "results-multi.R", package = "CodeDepends")
sc = readScript(f)
g = makeVariableGraph(info = getInputs(sc))
if(require(Rgraphviz))
    plot(g)
## Loading required package: Rgraphviz
## Loading required package: graph
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, cbind, colMeans, colSums, colnames, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rowMeans,
## rowSums, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
## Loading required package: grid
We can also create call graphs for functions or entire packages:
gg = makeCallGraph("package:CodeDepends")
if(require(Rgraphviz)) {
    gg = layoutGraph(gg, layoutType = "circo")
    graph.par(list(nodes = list(fontsize=55)))
    renderGraph(gg) ## could also call plot directly
}
Finally, we can display timelines showing when variables are defined, redefined, and used:
f = system.file("samples", "results-multi.R", package = "CodeDepends")
sc = readScript(f)
dtm = getDetailedTimelines(sc, getInputs(sc))
plot(dtm)
## [1] TRUE
# A big/long function
info = getInputs(arima0)
dtm = getDetailedTimelines(info = info)
plot(dtm, var.cex = .7, mar = 4, srt = 30)
## [1] TRUE