Analyze the data values for at least one variable, such as the annual salaries of employees at a company. Typically have available the data for multiple variables. Organize the data values into a specific kind of structure from which analysis proceeds.
Data Table: Organize data values into a rectangular data table with the data values for each variable in a column, and the name of the variable at the top of the column.
Store the structured data values within a computer file, such as an Excel or OpenDocument Spreadsheet (ODS) formatted file or text file. This file can be stored on your computer, an accessible local network, or the world wide web. The data table in the following figure, formatted as an Excel file, contains four variables: Years, Gender, Dept, and Salary plus an ID field called Name for a total of five columns.
Describe the data table by its columns, rows, and cell entries.
Data value: The contents of a single cell of a data table, a specific measurement.
For example, according to the data values for employee Darnell Ritchie, he has worked at the company for 7 years, identifies as a male, and works in administration with an annual salary of $43,788.26.
Missing data: A cell for which there is no recorded data value.
Two data values in this section of the data table are missing. The number of years James Wu has worked at the company is not recorded, nor is the department in which Alissa Jones works.
Variable name: A short, concise word or abbreviation that identifies a column of data values in a data table.
Instance or Example or Sample or Case: Each row of the data table, the data for a specific instance of a single person, organization, place, event, or whatever is being studied.
Encode the data table in one of a variety of computer file formats.
Common formats include Excel files (.xlsx
), OpenDocument
Spreadsheet files (.ods
), and text files the form of
comma-separated value files (.csv
) or tab-delimited text
files (.txt
).
Analysis of data can only proceed with the data table identified and the relevant variables identified by their name.
R data analysis functions analyze the data values for one or more specified variables, identified by their names, such as Salary.
Analysis requires the correct spelling of each variable name, including the same pattern of capitalization.
Your data organized as a data table exists somewhere as a data file stored on a computer system. To analyze data in a data table stored in a computer file, first read the data table from the computer file into a corresponding data table within a running R session.
Data frame: A data table stored within an R session, referenced by its name.
Each variable in a data table has a name, and so does the data table itself. Initially reference the data table stored on your computer system by its file name and location. When read into R, name the data table, the R data frame, with a name of your choice. Regardless of the file name on your computer system, typically name the data table within the active R session, the data frame, as d for data. Not only is d easy to type, but it is also the lessR default data frame name for the data processed by its various analysis functions.
When analyzing data read into R, the same data exists in two different locations: a computer file on your computer system, and an R data frame within a running R app. Different locations, different names: same data. On your computer system, identify the data table by its file name and location. Within a running R app, identify the same data by the name of the data frame, such as d, within which the data from the computer file was read.
To read the data from a file into a data frame of a running R
application, as with every other task in R, accomplish the task with a
function. The R ecosystem, base R and its many packages, presents many
such functions. We use the lessR function Read()
for its simplicity and for its useful output that helps understanding
the data that was read.
The lessR function Read()
can read data files in many
file formats, including MS Excel. The most generic format is the csv
format, for comma separated values. Read() also reads SPSS and
SAS data files, as well as data files in R’s own native format, of type
.rda
.
To read the data, direct R to the location of the data file.”)` R cannot read the data file until it knows where the data is stored. One option has you locate the data file on your computer system by browsing for it, navigating your file system until you locate the file.
To locate your data file by browsing through your file system, call the
Read()
function with an empty file reference("")
, literally nothing between the quotes.
The following Read()
statement reads the data stored as
a rectangular data table from an external file stored on your computer
system into an R data frame called d. The empty quotes indicate
to R to open your file browser for you to locate the data file that
already exists somewhere on your computer system).
d <- Read("")
As with all R (and Excel) functions, the call to invoke the function includes a matching set of parentheses. Any information within the parentheses specifies the information provided to the function for analysis.
The <-
indicates to assign what is on the right of
the expression, here the data read from an external file, to the object
on the left, here the R data frame stored within R, named d in
this example. You can also use an ordinary equals sign, =
,
to indicate the assignment, but the <-
is more
descriptive, and more widely used by R practitioners.
Specify either the full path name of a file on your computer system,
or specify a web address that locates the data table on the web. Also
can explicitly select the location of the data file to be read within
the quotes and parentheses. Specify either the full path name of a file
on your computer system, or specify a web address that locates the data
table on the web. Again, read the data into the d data
frame.
d <- Read("path name" or "web address")
With Excel, R, or any other computer apps that processes data,
enclose values that are character strings, such as a file name, in
quotes. For example, to read the data stored on the web in the data file
called employee.xlsx into the data frame d, invoke the
following Read()
function call.
d <- Read("http://web.pdx.edu/~gerbing/data/employee.xlsx")
This example reads a data file on the web. To specify a location on
your computer, provide the full path name of your data file, its name
and location. To obtain this path name, first browse for the file with
Read("")
. The resulting output displays the path name of
the identified file. Copy this path name and insert between the quotes
of Read("")
, save this and other R function calls in a text
file, and then run the code in the future to directly read the data file
for future analyses without needing to browse for its location.
With the Read()
function, put nothing between the quotes
to browse for a data file, or specify the location of the data file on
your computer system or the web. Direct the data read from a file into
an R data frame, usually named d, but can choose any valid
name.
Read()
The Read()
function displays useful output. Because R
organizes analyses by variable name, it is crucial to know the exact
variable names, including the pattern of capitalization.
In addition to the variable names, Read()
displays each
variable’s type as stored in the computer, as numbers with or without
decimal digits, or as character strings. Also listed are the number of
complete and missing values for each variable, the number of unique
values for each variable, and sample data values.
The following lists the output from reading from a data file downloaded with lessR. All that is needed to read these data files is the name. A file on the web cannot be specified here because you may not have web access when this file is generated.
<- Read("Employee") d
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 7 ... 1 2 10
## 2 Gender character 37 0 2 M M W ... W W M
## 3 Dept character 36 1 5 ADMN SALE FINC ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low high ... high low high
## 6 Plan integer 37 0 3 1 1 2 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 90 ... 83 59 80
## 8 Post integer 37 0 22 92 74 86 ... 90 71 87
## ------------------------------------------------------------------------------------------
To allow for many variables, Read()
lists the
information for each variable in a row. Note that the data file
organizes the variables by column.
Always distinguish continuous variables from categorical variables. This distinction between these two types of variables is fundamental in data analysis.
Continuous (quantitative) variable: A numerical variable with many possible values.
Categorical (qualitative) variable: A variable with relatively few unique labels as data values.
Examples of continuous variables are Salary or Time. Examples of categorical variables are Gender or State of Residence, each with just a relatively few number of possible values compared to numerical values. This distinction of continuous and categorical variables is common to all data analytics.
Sometimes that distinction gets a little confusing because variables with integer values, which are numeric, could be either quantitative or qualitative. For example, sometimes Male, Female, and Other are encoded as 0, 1, and 2, respectively, for three levels of the categorical variable Gender. However, these integer values are just labels for different non-numeric categories. Best to avoid this confusion. Instead, encode categorical variables with non-numeric values, such as Gender, for example, with M F, and O for Other.
A variable label is a longer description of the corresponding variable than that of the variable name. The variable label displays in conjunction with the variable name on the text and visualization output to further clarify the interpretability of the output.
The variable label file has two columns. The the first column lists
the variable names and the second column the corresponding labels. The
file can be of type .csv
, or .xlsx
, or
contained within lessR, as with the following
example.
The variable labels must be read into the data frame l.
Specify the var_labels
parameter as TRUE
to
instruct the Read()
function to read variable labels
instead of data.
<- Read("Employee_lbl", var_labels=TRUE) l
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 label character 8 0 8 Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------
Relying upon the have package,
lessR can write data with function Write()
in formats: csv, Excel, ODS, and native R and native SPSS formats. The R
format preserves a binary copy of the data frame as it is stored within
R, as does the SPSS format for native SPSS files. The Excel and ODS
formats yield worksheets. The csv format yields a comma separated value
text file.
A recommended procedure begins the analysis with data in
.csv
, .xlsx
, or .ods
format. Then
proceed with data cleaning and preparation, including needed
transformations and re-coding. When the data is ready for analysis, save
the cleaned, prepared data as a native R file of format
.rda
. This format is the most efficient for size and for
speed of reading back into R, with all data preparations already
completed. Can also write the data in a format such as Excel so that
those on the team not using R can also access.
With the Write()
function, specify the name of the file
for the output, as well as the type of file with the format
parameter, with the default of .csv
. The following R
statements that call Write()
are not run here as the intent
is not to create additional files.
Write the current contents of default data frame d to GoodData.csv.
Write(d, "GoodData")
Write the data as a Excel data table in an Excel file.
Write(d, "GoodData", format="Excel")
Can also use the abbreviation for an Excel file,
wrt_x()
.
wrt_x(d, "GoodData")
Write the data as a R data table.
Write(d, "GoodData", format="R")
Can also use the abbreviation for an R file,
wrt_r()
.
wrt_r(d, "GoodData")
Use format="ODS"
to specify to write to the OpenDocument
Spreadsheet format with file type .ods
. Use
format="SPSS"
to specify to write to the SPSS format with
file type .sav
.
The output of Write()
indicates the full path name of
the written file.
The haven package has an excellent function for
reading SPSS files, read_spss()
. Read()
adds
to the given functionality by preserving the SPSS variable labels and
value labels. Read()
invokes this function to read SPSS
files, with the default file types of .sav
and
zsav
. The functionality of haven also
allows for reading SAS and Stata data files.
Within SPSS, an integer scored variable can have value labels. An
example is Likert scaled data with 1 representing a Strongly Disagree to
5 a Strongly Agree. The corresponding SPSS variable has integer values
but displays the more informative value labels such as in bar charts.
This type of categorical variable corresponds to a factor
in the R system.
The read_spss()
function preserves the value labels with
a special variable type called haven_labels
. These
variables can be converted to an R factor for processing in the R system
with the haven function as_factor()
.
Read()
performs this conversion automatically for each
relevant variable.
One problem is that the factor conversion preserves the value labels
listed in the correct order, but looses the original integer scoring
information. To preserve both the labels as a standard R factor, and to
preserve the original scoring, Read()
converts the original
read variable with the labels (type haven_labels
) into two
variables. The first variable, an integer variable with the original
integer scoring, has the name of the read variable. The corresponding
factor variable has the same name as the read variable with the suffix
\_f
.
For example, the use of read_spss()
results in the
following, here showing just the first four lines of data. The variable
region contains both the integer scoring and the value label
that are part of the SPSS data file that was read.
# A tibble: 4 x 4
city region growth income
<chr> <dbl+lbl> <dbl> <dbl>
1 ALBANY-SCHNTADY-TROY,N.Y. 1 [NE] -71 3313
2 ATLANTA,GA. 2 [SE] 264 3153
3 BALTIMORE,MD. 1 [NE] 38 3540
4 BIRMINGHAM,ALA. 2 [SE] -178 2528
With Read()
, obtain the following standard R data frame.
The variable region is now a standard R integer variable, and
region_f is the corrsponding factor.
city region region_f growth income
1 ALBANY-SCHNTADY-TROY,N.Y. 1 NE -71 3313
2 ATLANTA,GA. 2 SE 264 3153
3 BALTIMORE,MD. 1 NE 38 3540
4 BIRMINGHAM,ALA. 2 SE -178 2528
With these data, the analyst may accomplish a numerical analysis with the integer variable, and for analyses such as a bar chart, instead, display the corresponding value labels.
lessR automatically accesses variable labels stored
in a data frame named l, and then displays with the variable
name in text and visualization output. Usually, the variable labels are
stored in an Excel or .csv
file with two columns, the
variable name, and the variable label. There are no column titles, just
the names and labels. Then read into the l data frame with the
var_labels
parameter set to TRUE
.
l <- Read(file_reference, var_labels=TRUE)
Read()
also processes the variable labels of these
(usually) integer-scored variables with value labels in the SPSS data
file. The overseers of the R system do not permit package authors to
create stored data structures from internal R code. Only the user can
create these structures. As such, Read()
lists each
variable name, a comma, and then the corresponding label. This example
only has one such relevant variable, region and its factor
equivalent.
Variable and Variable Label --> See vignette("Read"), SPSS section
---------------------------
region, region of US
region_f, region of US
To access these labels, copy the names and labels, paste into a text
file, and then save as a file. Then read the file of names/labels into R
with the preceding Read()
statement.
Use the base R help()
function to view the full manual
for Read()
or Write()
. Simply enter a question
mark followed by the name of the function.
?Read
?Write