Import, Export, and Convert Data Files

The idea behind rio is to simplify the process of importing data into R and exporting data from R. This process is, probably unnecessarily, extremely complex for beginning R users. Indeed, R supplies an entire manual describing the process of data import/export. And, despite all of that text, most of the packages described are (to varying degrees) out-of-date. Faster, simpler, packages with fewer dependencies have been created for many of the file types described in that document. rio aims to unify data I/O (importing and exporting) into two simple functions: import() and export() so that beginners (and experienced R users) never have to think twice (or even once) about the best way to read and write R data.

The core advantage of rio is that it makes assumptions that the user is probably willing to make. Specifically, rio uses the file extension of a file name to determine what kind of file it is. This is the same logic used by Windows OS, for example, in determining what application is associated with a given file type. By taking away the need to manually match a file type (which a beginner may not recognize) to a particular import or export function, rio allows almost all common data formats to be read with the same function.

By making import and export easy, it’s an obvious next step to also use R as a simple data conversion utility. Transferring data files between various proprietary formats is always a pain and often expensive. The convert function therefore combines import and export to easily convert between file formats (thus providing a FOSS replacement for programs like Stat/Transfer or Sledgehammer).

Supported file formats

rio supports a variety of different file formats for import and export. To keep the package slim, all non-essential formats are supported via “Suggests” packages, which are not installed (or loaded) by default. To ensure rio is fully functional, install these packages the first time you use rio via:

install_formats()

The full list of supported formats is below:

Format	Typical Extension	Import Package	Export Package	Installed by Default
Comma-separated data	.csv	data.table	data.table	Yes
Pipe-separated data	.psv	data.table	data.table	Yes
Tab-separated data	.tsv	data.table	data.table	Yes
CSVY (CSV + YAML metadata header)	.csvy	data.table	data.table	Yes
SAS	.sas7bdat	haven	haven	Yes
SPSS	.sav	haven	haven	Yes
SPSS (compressed)	.zsav	haven	haven	Yes
Stata	.dta	haven	haven	Yes
SAS XPORT	.xpt	haven	haven	Yes
SPSS Portable	.por	haven		Yes
Excel	.xls	readxl		Yes
Excel	.xlsx	readxl	openxlsx	Yes
R syntax	.R	base	base	Yes
Saved R objects	.RData, .rda	base	base	Yes
Serialized R objects	.rds	base	base	Yes
Epiinfo	.rec	foreign		Yes
Minitab	.mtp	foreign		Yes
Systat	.syd	foreign		Yes
“XBASE” database files	.dbf	foreign	foreign	Yes
Weka Attribute-Relation File Format	.arff	foreign	foreign	Yes
Data Interchange Format	.dif	utils		Yes
Fortran data	no recognized extension	utils		Yes
Fixed-width format data	.fwf	utils	utils	Yes
gzip comma-separated data	.csv.gz	utils	utils	Yes
Apache Arrow (Parquet)	.parquet	arrow	arrow	No
EViews	.wf1	hexView		No
Feather R/Python interchange format	.feather	feather	feather	No
Fast Storage	.fst	fst	fst	No
JSON	.json	jsonlite	jsonlite	No
Matlab	.mat	rmatio	rmatio	No
OpenDocument Spreadsheet	.ods	readODS	readODS	No
HTML Tables	.html	xml2	xml2	No
Shallow XML documents	.xml	xml2	xml2	No
YAML	.yml	yaml	yaml	No
Clipboard	default is tsv	clipr	clipr	No
Google Sheets	as Comma-separated data

Additionally, any format that is not supported by rio but that has a known R implementation will produce an informative error message pointing to a package and import or export function. Unrecognized formats will yield a simple “Unrecognized file format” error.

Data Import

rio allows you to import files in almost any format using one, typically single-argument, function. import() infers the file format from the file’s extension and calls the appropriate data import function for you, returning a simple data.frame. This works for any for the formats listed above.

library("rio")

x <- import("mtcars.csv")
y <- import("mtcars.rds")
z <- import("mtcars.dta")

# confirm identical
all.equal(x, y, check.attributes = FALSE)

## [1] TRUE

all.equal(x, z, check.attributes = FALSE)

## [1] TRUE

If for some reason a file does not have an extension, or has a file extension that does not match its actual type, you can manually specify a file format to override the format inference step. For example, we can read in a CSV file that does not have a file extension by specifying csv:

head(import("mtcars_noext", format = "csv"))

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Importing Data Lists

Sometimes you may have multiple data files that you want to import. import() only ever returns a single data frame, but import_list() can be used to import a vector of file names into R. This works even if the files are different formats:

str(import_list(dir()), 1)

Similarly, some single-file formats (e.g. Excel Workbooks, Zip directories, HTML files, etc.) can contain multiple data sets. Because import() is type safe, always returning a data frame, importing from these formats requires specifying a which argument to import() to dictate which data set (worksheet, file, table, etc.) to import (the default being which = 1). But import_list() can be used to import all (or only a specified subset, again via which) of data objects from these types of files.

Data Export

The export capabilities of rio are somewhat more limited than the import capabilities, given the availability of different functions in various R packages and because import functions are often written to make use of data from other applications and it never seems to be a development priority to have functions to export to the formats used by other applications. That said, rio currently supports the following formats:

library("rio")

export(mtcars, "mtcars.csv")
export(mtcars, "mtcars.rds")
export(mtcars, "mtcars.dta")

It is also easy to use export() as part of an R pipeline (from magrittr or dplyr). For example, the following code uses export() to save the results of a simple data transformation:

library("magrittr")
mtcars %>% subset(hp > 100) %>%  aggregate(. ~ cyl + am, data = ., FUN = mean) %>% export(file = "mtcars2.dta")

Some file formats (e.g., Excel workbooks, Rdata files) can support multiple data objects in a single file. export() natively supports output of multiple objects to these types of files:

# export to sheets of an Excel workbook
export(list(mtcars = mtcars, iris = iris), "multi.xlsx")

# export to an .Rdata file
## as a named list
export(list(mtcars = mtcars, iris = iris), "multi.rdata")

## as a character vector
export(c("mtcars", "iris"), "multi.rdata")

It is also possible to use the new (as of v0.6.0) function export_list() to write a list of data frames to multiple files using either a vector of file names or a file pattern:

export_list(list(mtcars = mtcars, iris = iris), "%s.tsv")

File Conversion

The convert() function links import() and export() by constructing a dataframe from the imported file and immediately writing it back to disk. convert() invisibly returns the file name of the exported file, so that it can be used to programmatically access the new file.

Because convert() is just a thin wrapper for import() and export(), it is very easy to use. For example, we can convert

# create file to convert
export(mtcars, "mtcars.dta")

# convert Stata to SPSS
convert("mtcars.dta", "mtcars.sav")

convert() also accepts lists of arguments for controlling import (in_opts) and export (out_opts). This can be useful for passing additional arguments to import or export methods. This could be useful, for example, for reading in a fixed-width format file and converting it to a comma-separated values file:

# create an ambiguous file
fwf <- tempfile(fileext = ".fwf")
cat(file = fwf, "123456", "987654", sep = "\n")

# see two ways to read in the file
identical(import(fwf, widths = c(1,2,3)), import(fwf, widths = c(1,-2,3)))

## [1] FALSE

# convert to CSV
convert(fwf, "fwf.csv", in_opts = list(widths = c(1,2,3)))
import("fwf.csv") # check conversion

##   V1 V2  V3
## 1  1 23 456
## 2  9 87 654

With metadata-rich file formats (e.g., Stata, SPSS, SAS), it can also be useful to pass imported data through characterize() or factorize() when converting to an open, text-delimited format: characterize() converts a single variable or all variables in a data frame that have “labels” attributes into character vectors based on the mapping of values to value labels (e.g., export(characterize(import("file.dta")), "file.csv")). An alternative approach is exporting to CSVY format, which records metadata in a YAML-formatted header at the beginning of a CSV file.

It is also possible to use rio on the command-line by calling Rscript with the -e (expression) argument. For example, to convert a file from Stata (.dta) to comma-separated values (.csv), simply do the following:

Rscript -e "rio::convert('mtcars.dta', 'mtcars.csv')"

Package Philosophy

The core advantage of rio is that it makes assumptions that the user is probably willing to make. Eight of these are important:

rio uses the file extension of a file name to determine what kind of file it is. This is the same logic used by Windows OS, for example, in determining what application is associated with a given file type. By removing the need to manually match a file type (which a beginner may not recognize) to a particular import or export function, rio allows almost all common data formats to be read with the same function. And if a file extension is incorrect, users can force a particular import method by specifying the format argument. Other packages do this as well, but rio aims to be more complete and more consistent than each:

reader handles certain text formats and R binary files
io offers a set of custom formats
ImportExport focuses on select binary formats (Excel, SPSS, and Access files) and provides a Shiny interface.
SchemaOnRead iterates through a large number of possible import methods until one works successfully

rio uses data.table::fread() for text-delimited files to automatically determine the file format regardless of the extension. So, a CSV that is actually tab-separated will still be correctly imported. It’s also crazy fast.
rio, wherever possible, does not import character strings as factors.
rio supports web-based imports natively, including from SSL (HTTPS) URLs, from shortened URLs, from URLs that lack proper extensions, and from (public) Google Documents Spreadsheets.
rio imports from from single-file .zip and .tar archives automatically, without the need to explicitly decompress them. Export to compressed directories is also supported.
rio wraps a variety of faster, more stream-lined I/O packages than those provided by base R or the foreign package. It uses data.table for delimited formats, haven for SAS, Stata, and SPSS files, smarter and faster fixed-width file import and export routines, and readxl and openxlsx for reading and writing Excel workbooks.
rio stores metadata from rich file formats (SPSS, Stata, etc.) in variable-level attributes in a consistent form regardless of file type or underlying import function. These attributes are identified as:
- label: a description of variable
- labels: a vector mapping numeric values to character strings those values represent
- format: a character string describing the variable storage type in the original file
The gather_attrs() function makes it easy to move variable-level attributes to the data frame level (and spread_attrs() reverses that gathering process). These can be useful, especially, during file conversion to more easily modify attributes that are handled differently across file formats. As an example, the following idiom can be used to trim SPSS value labels to the 32-character maximum allowed by Stata:
```
dat <- gather_attrs(rio::import("data.sav"))
attr(dat, "labels") <- lapply(attributes(dat)$labels, function(x) {
    if (!is.null(x)) {
        names(x) <- substring(names(x), 1, 32)
    }
    x
})
export(spread_attrs(dat), "data.dta")
```
In addition, two functions (added in v0.5.5) provide easy ways to create character and factor variables from these “labels” attributes. characterize() converts a single variable or all variables in a data frame that have “labels” attributes into character vectors based on the mapping of values to value labels. factorize() does the same but returns factor variables. This can be especially helpful for converting these rich file formats into open formats (e.g., export(characterize(import("file.dta")), "file.csv").
rio imports and exports files based on an internal S3 class infrastructure. This means that other packages can contain extensions to rio by registering S3 methods. These methods should take the form .import.rio_X() and .export.rio_X(), where X is the file extension of a file type. An example is provided in the rio.db package.

Import, Export, and Convert Data Files

2021-11-08

Import, Export, and Convert Data Files

Supported file formats

Data Import

Importing Data Lists

Data Export

File Conversion

Package Philosophy