Extract[]
The most general way to subset a data frame by rows and/or columns is
the base R Extract[]
function, indicated by matched square
brackets instead of the usual matched parentheses. For a data frame
named d the general format is
d[rows, columms]
.
For the rows
parameter, pass either
To specify a logical expression for the rows
parameter,
use the standard R operators.
operator | meaning |
---|---|
& | and |
| | or |
! | not |
== | is equal to |
!= | is not equal to |
%in% | is in a vector |
For example, to obtain a subset of the data frame that consists of only those who report a value of the Gender variable as Female, specify a logical condition such as:
Gender=="Female"
For the cols
parameter, pass either
If subsetting is done by only rows or only columns, then leave the
other value blank. For example, to subset the d data frame only
by rows, the general form reduces to d[rows,]
. Similarly,
to subset only by columns, d[,cols]
.
Extract
When Extract[]
evaluates the row or column
specifications to obtain the indices, there are several annoying
properties.
rows
: Any reference to the variables in the data frame
for this specification must contain the name of the data frame followed
by a $. But this name has already been specified in the call to
Extract[]
by listing the data frame name in front of the
square brackets, so now is redundant, repeated for every variable
reference.rows
: When specifying a value of a variable for a row
selection, any missing values for the variable are also provided even
though the missing data values are not the requested value of the
variable.cols
: Usually specified with a vector of variable names
but all variable names in in the provided vector of names must be
quoted.cols
: No variable ranges specified with a colon
:
such as m01:m10 to specify 10 variables: m01, m02, …, up
to m10.rows
and cols
. Instead directly enter the
conditions for both rows and columns, which can make the entire
expression quite large.Extract[]
To address the first two deficiencies, one possibility is the base R
subset()
function. To address these deficiencies and still
use Extract[]
directly, lessR provides the
function .()
for obtaining the indices of selected rows and
of selected columns. This function is only callable within the base R
Extract[]
function, with what R refers to as non-standard
evaluation. That basically means that the annoying restrictions are
removed, though in some advanced programming uses the .()
may not apply.
The general form of the subsetting with the .()
function
follows.
d[.(rows), .(columns)]
That is, call the same Extract[]
function with reference
to rows
and cols
, but now wrap the row and
column expressions with the lessR function call
.()
.
To illustrate, use the Employee data set contained in lessR, here read into the d data frame.
<- Read("Employee") d
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 7 ... 1 2 10
## 2 Gender character 37 0 2 M M W ... W W M
## 3 Dept character 36 1 5 ADMN SALE FINC ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low high ... high low high
## 6 Plan integer 37 0 3 1 1 2 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 90 ... 83 59 80
## 8 Post integer 37 0 22 92 74 86 ... 90 71 87
## ------------------------------------------------------------------------------------------
Subset the data frame by only listing observations with a
Gender of “M” with scores on Post larger than 90. Only
list columns for the variables in the range from Years to
Salary, and Post. Referring back to the output of
Read()
, the variable range includes Years,
Gender, Dept, and Salary.
=="M" & Post>90), .(Years:Salary, Post)] d[.(Gender
## Years Gender Dept Salary Post
## Ritchie, Darnell 7 M ADMN 53788.26 92
## Hoang, Binh 15 M SALE 111074.86 97
## Pham, Scott 13 M SALE 81871.05 94
## Correll, Trevon 21 M SALE 134419.23 94
## Langston, Matthew 5 M SALE 49188.96 93
## Anderson, David 9 M ACCT 69547.60 91
Following is the traditional R call to Extract[]
to
obtain the same subsetting.
$Gender=="M" & d$Post>90, c("Years", "Gender", "Dept", "Salary", "Post")] d[d
## Years Gender Dept Salary Post
## Ritchie, Darnell 7 M ADMN 53788.26 92
## Hoang, Binh 15 M SALE 111074.86 97
## Pham, Scott 13 M SALE 81871.05 94
## Correll, Trevon 21 M SALE 134419.23 94
## Langston, Matthew 5 M SALE 49188.96 93
## Anderson, David 9 M ACCT 69547.60 91
A row selection is a logical condition. To negate a row selection,
add a !
to the beginning of the condition passed to
.()
, within the call to .()
. To exclude the
specified variables, place a -
, in front of the call to
.()
.
!(Gender=="M" & Post>90)), -.(Dept:Plan, Pre)] d[.(
## Years Gender Post
## Wu, James NA M 74
## Downs, Deborah 7 W 86
## Jones, Alissa 5 W 62
## Afshari, Anbar 6 W 100
## Knox, Michael 18 M 84
## Campagna, Justin 8 M 84
## Kimball, Claire 8 W 92
## Cooper, Lindsay 4 W 91
## Saechao, Suzanne 8 W 100
## Tian, Fang 9 W 61
## Bellingar, Samantha 10 W 72
## Sheppard, Cory 14 M 73
## Kralik, Laura 10 W 71
## Skrotzki, Sara 18 W 61
## James, Leslie 18 W 70
## Osterman, Pascal 5 M 70
## Adib, Hassan 14 M 69
## Gvakharia, Kimberly 3 W 79
## Stanley, Grayson 9 M 73
## Link, Thomas 10 M 83
## Portlock, Ryan 13 M 73
## Stanley, Emma 3 W 84
## Singh, Niral 2 W 59
## Fulton, Scott 13 M 73
## Korhalkar, Jessica 2 W 87
## LaRoe, Maria 10 W 86
## Billing, Susan 4 W 90
## Capelle, Adam 24 M 81
## Hamide, Bita 1 W 90
## Anastasiou, Crystal 2 W 71
## Cassinelli, Anastis 10 M 87
Can still provide the indices directly for one or both of the
expressions as the base R Extract[]
function is unmodified
with the use of .()
. The purpose of .()
is
simply to return the row or column row indices to identify specific rows
or columns of the specified data frame. You can either specify the
indices directory for the rows or columns, or let .()
identify them for you.
1:3, .(Years:Salary, Post)] d[
## Years Gender Dept Salary Post
## Ritchie, Darnell 7 M ADMN 53788.26 92
## Wu, James NA M SALE 94494.58 74
## Downs, Deborah 7 W FINC 57139.90 86
=="M" & Post>90), 1:3] d[.(Gender
## Years Gender Dept
## Ritchie, Darnell 7 M ADMN
## Hoang, Binh 15 M SALE
## Pham, Scott 13 M SALE
## Correll, Trevon 21 M SALE
## Langston, Matthew 5 M SALE
## Anderson, David 9 M ACCT
To enhance readability, store the specified row or column conditions
as character strings. Each string must be named either rows
or cols
. Because the entire expression for
rows
or cols
is a character string,
differentiate between single and double quotes as needed. For example,
use single quotes within the string and double quotes to define the
entire string, illustrated here.
<- "Gender=='M' & Post>93"
rows <- "Gender:Salary, Post" cols
To subset, pass the respective character strings, rows
and cols
, to .()
, respectively.
d[.(rows), .(cols)]
## Gender Dept Salary Post
## Hoang, Binh M SALE 111074.86 97
## Pham, Scott M SALE 81871.05 94
## Correll, Trevon M SALE 134419.23 94
To negate, as with the literal expressions, use !
for
the logical expression that defines the rows and -
for the
columns. Notice their placement, where the !
is inside the
call to .()
, and the -
is outside the
call.
!rows), -.(cols)] d[.(
## Years JobSat Plan Pre
## Ritchie, Darnell 7 med 1 82
## Wu, James NA low 1 62
## Downs, Deborah 7 high 2 90
## Jones, Alissa 5 <NA> 1 65
## Afshari, Anbar 6 high 2 100
## Knox, Michael 18 med 3 81
## Campagna, Justin 8 low 1 76
## Kimball, Claire 8 high 2 93
## Cooper, Lindsay 4 high 1 78
## Saechao, Suzanne 8 med 1 98
## Tian, Fang 9 med 2 60
## Bellingar, Samantha 10 med 1 67
## Sheppard, Cory 14 low 3 66
## Kralik, Laura 10 med 2 74
## Skrotzki, Sara 18 med 2 63
## James, Leslie 18 low 3 70
## Osterman, Pascal 5 high 3 69
## Adib, Hassan 14 med 2 71
## Gvakharia, Kimberly 3 med 2 83
## Stanley, Grayson 9 low 1 74
## Link, Thomas 10 low 1 83
## Portlock, Ryan 13 low 1 72
## Langston, Matthew 5 low 3 94
## Stanley, Emma 3 high 2 86
## Singh, Niral 2 high 2 59
## Anderson, David 9 low 1 94
## Fulton, Scott 13 low 1 72
## Korhalkar, Jessica 2 <NA> 2 74
## LaRoe, Maria 10 high 2 80
## Billing, Susan 4 med 2 91
## Capelle, Adam 24 med 2 83
## Hamide, Bita 1 high 2 83
## Anastasiou, Crystal 2 low 2 59
## Cassinelli, Anastis 10 high 1 80
The variable Dept is missing for the fourth row of data.
1:5,] d[
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
Here with the traditional use of Extract[]
, specify rows
of data only when the value of Dept is ADMN.
$Dept=="ADMN", c('Gender', 'Dept', 'Salary')] d[d
## Gender Dept Salary
## Ritchie, Darnell M ADMN 53788.26
## NA <NA> <NA> NA
## Afshari, Anbar W ADMN 69441.93
## James, Leslie W ADMN 122563.38
## Singh, Niral W ADMN 61055.44
## Billing, Susan W ADMN 72675.26
## Capelle, Adam M ADMN 108138.43
The result provides what is requested, and also when Dept is
<NA>
, which is not requested. The requested value of
ADMN is not the same as <NA>
.
Use the .()
function to obtain what is requested, rows
of data in which the value of Dept is ADMN.
=="ADMN"), .(Gender:Salary)] d[.(Dept
## Gender Dept Salary
## Ritchie, Darnell M ADMN 53788.26
## Afshari, Anbar W ADMN 69441.93
## James, Leslie W ADMN 122563.38
## Singh, Niral W ADMN 61055.44
## Billing, Susan W ADMN 72675.26
## Capelle, Adam M ADMN 108138.43
The function .()
also provides for random selection of
rows. To randomly select the specified number of rows from the data
frame to subset, specify the random()
function for the
logical criterion of the rows
. The value passed to
random()
can either be the actual number of rows to select,
or the proportion of rows to select.
Here randomly select five rows of data from the d data frame.
random(5)), .(Years:Salary)] d[.(
## Years Gender Dept Salary
## Wu, James NA M SALE 94494.58
## Kralik, Laura 10 W SALE 92681.19
## Afshari, Anbar 6 W ADMN 69441.93
## Korhalkar, Jessica 2 W ACCT 72502.50
## Portlock, Ryan 13 M SALE 77714.85
Here specify a proportion of rows to select.
random(0.1)), .(Years:Salary)] d[.(
## Years Gender Dept Salary
## Capelle, Adam 24 M ADMN 108138.43
## Stanley, Grayson 9 M SALE 69624.87
## Langston, Matthew 5 M SALE 49188.96
## Skrotzki, Sara 18 W MKTG 91352.33