Crunch is designed to facilitate collaboration on a common dataset, a
single source of truth in the cloud. As the previous vignettes have
shown, you can get a lot of work done in R without pulling the data
itself off the server. Indeed, whenever possible, you should strive
to get your work done without pulling data across the network: shipping
data across the wire can be slow and inefficient. However, in some
cases, you may need to extract a subset of a dataset to do more
extensive calculations or manipulations locally. This vignette shows you
how to get a local data.frame from your Crunch dataset, as well as how
to export a CSV or SPSS file of the dataset or a subset of it.
To get the local R representation of a Crunch variable, use as.vector:
party_id <- as.vector(ds$pid3)
as.vector translates Crunch types to R types, reversing the mapping
that newDataset applies when translating from R to Crunch: categoricals
become factors, numerics are numeric, and so on. Array variables
(categorical array and multiple response) return a data.frame of
categoricals, despite the name of as.vector, because doing so allows
natural indexing into the subvariables (like ds$var$subvar).
While categorical variables by default are translated as factors, you
can use the “mode” argument to as.vector to request either the category
“id” or the “numeric” values of the categories.
party_id <- as.vector(ds$pid3, mode="id")
Requesting mode="id" may be particularly useful when you want to work
with data locally that most closely matches the representation of the
data on the server; however, the category names are disconnected from
the data, so proceed with caution.
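To make that caution concrete, here is a minimal base R sketch of reattaching names to ids after the fact. The categories table and the party_id values below are made up for illustration and do not come from a real Crunch dataset; in practice the id-to-name mapping would have to come from the variable's category metadata.

```r
# Hypothetical category metadata: these ids and names are assumptions
# for illustration, not values read from a Crunch dataset.
categories <- data.frame(
  id = c(1L, 2L, 3L),
  name = c("Democrat", "Independent", "Republican"),
  stringsAsFactors = FALSE
)

# Stand-in for the kind of integer vector that mode="id" yields locally.
party_id <- c(1L, 3L, 2L, 1L, NA)

# Reattach names by matching each id against the lookup table.
# Ids with no match (including NA) stay NA.
party_name <- factor(
  categories$name[match(party_id, categories$id)],
  levels = categories$name
)

table(party_name, useNA = "ifany")
```

Matching on ids rather than on position keeps the result correct even when the ids are not consecutive or not sorted.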
Similarly, as.data.frame on a CrunchDataset gives you access to the
values in the dataset, yet there is an important distinction:
as.data.frame doesn’t itself pull data off the server. Rather,
as.data.frame returns a data.frame-like object that lazily fetches
columns only when requested.
v1 <- as.vector(ds$var)
df <- as.data.frame(ds)
identical(v1, df$var)
## TRUE
That way, you can call as.data.frame and get convenient access to the
columns of data without having to download things you don’t need up
front.
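The idea of fetching a column only on first access can be sketched in base R with promises. This is an illustration of the general technique only, not how the crunch package implements its lazy object; fetch_column, add_lazy_column, and the column values are hypothetical stand-ins for a real network request.

```r
# Hypothetical stand-in for a remote fetch: it records which columns
# were requested so we can see when a "download" actually happens.
fetched <- character(0)
fetch_column <- function(name) {
  fetched <<- c(fetched, name)
  switch(name,
    age = c(34, 51, 27),
    educ = c("HS", "BA", "MA")
  )
}

# Register a column as a promise: fetch_column() runs only the first
# time the binding is read, and the result is cached afterwards.
add_lazy_column <- function(env, name) {
  delayedAssign(name, fetch_column(name),
                eval.env = environment(), assign.env = env)
}

lazy_df <- new.env()
for (col in c("age", "educ")) add_lazy_column(lazy_df, col)

mean(lazy_df$age)  # forces the "age" promise; "educ" is never fetched
fetched
```

Only the columns that later code touches incur a fetch, which is the property the lazy as.data.frame result is described as having.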
Of course, you can download all of the data at once if you want (even
though it’s discouraged), either by calling as.data.frame a second time
df <- as.data.frame(ds)
is.data.frame(df)
## FALSE
df <- as.data.frame(df)
is.data.frame(df)
## TRUE
or by calling as.data.frame the first time with force=TRUE:
df <- as.data.frame(ds, force=TRUE)
is.data.frame(df)
## TRUE
Given the cost in network traffic of shipping data from the servers
to your local computer, you should be mindful of what you extract. One
way you can do this is by taking advantage of the lazy evaluation of the
as.data.frame method, which only pulls variables you explicitly
reference in your subsequent code. Another way is to filter the dataset
down to the rows and columns of interest.
Suppose I wanted to look at the specific values on a couple of
demographic variables just for self-identified Democrats. I can filter
the rows and columns of my dataset just as if I were working with a
data.frame, and only pull that subset to my computer.
df <- as.data.frame(ds[ds$pid3 == "Democrat", c("age", "educ", "gender")], force=TRUE)
This dataset filtering is much more efficient (and thus faster) than
attempting to download the entire dataset and then subsetting the
resulting data.frame.
You can also use this subsetting for convenience when lazily accessing the data.
dems <- ds[ds$pid3 == "Democrat", ]
gives a view of the dataset that is filtered by party identification.
dem_age <- dems$age
thus gives you just the values of “age” for those rows where “pid3”
is equal to “Democrat”. This is equivalent to calling as.vector
directly on a subsetted variable:
identical(as.vector(ds$age[ds$pid3 == "Democrat"]), dem_age)
## TRUE
You can also get values for an on-the-fly derivation with as.vector:
perc_completed <- as.vector(100 - ds$perc_skipped)
Even though perc_completed doesn’t exist in the dataset, we can get
values for it by expression.
If you want to download a file of the dataset or of a subset, use
exportDataset:
exportDataset(ds, file="econ.sav", format="spss")
exportDataset writes to SPSS (.sav) and CSV formats. Alternatively, to
get a CSV, write.csv is short for exportDataset(..., format="csv") and
works similarly to how it does for regular data.frames.
write.csv(ds, file="econ.csv")
CSV export does have a “categories” option that governs whether categorical variables are exported as category names or ids. The latter is more concise and pairs well with the Crunch metadata export, but category names can be useful when taking the file without additional metadata.
As with the as.data.frame methods, you can subset what you export by
indexing the rows and columns. Following the previous example, we can
get a CSV of that demographic subset by:
write.csv(ds[ds$pid3 == "Democrat", c("age", "educ", "gender")], file="demo-demos.csv")