This vignette demonstrates how to use the ‘rtry’ package to preprocess data exported from the TRY database: importing and exploring the data, binding multiple datasets, selecting and excluding data using user-defined criteria, removing duplicates, and finally exporting the preprocessed data.
Make sure you have the ‘rtry’ package installed. If not, you may refer to the vignette “Introduction to rtry” (rtry-introduction).
To start, set the working directory to the desired location:
# Set the working directory
setwd("<path_to_dir>")
# Check the working directory
getwd()
Note: The character “\” is used as an escape character in R to give the following character a special meaning (e.g. “\n” for newline, “\t” for tab, “\r” for carriage return and so on). Therefore, for Windows users, it is important to use “/” (or a doubled “\\”) in the file path instead of a single “\” so that R interprets the input path correctly.
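As a minimal sketch of the note on path separators (the directory names are hypothetical), the following base R lines write the same Windows path in two accepted ways and show that file.path() assembles paths with forward slashes:

```r
# Hypothetical Windows directory, written in two ways that R accepts
p_forward <- "C:/Users/me/TRY"       # forward slashes
p_escaped <- "C:\\Users\\me\\TRY"    # every backslash doubled (escaped)

# Replacing the backslashes by forward slashes gives the same string
p_converted <- gsub("\\", "/", p_escaped, fixed = TRUE)
identical(p_converted, p_forward)

# file.path() joins the components with "/"
p_built <- file.path("C:", "Users", "me", "TRY")
p_built
```

Either form can be passed to setwd() or to an import function; a path containing single backslashes would be misread or rejected.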
Load the required packages using the commands:
# Load the rtry package
library(rtry)
# Check the version of rtry
packageVersion("rtry")
rtry_import() takes five arguments (input, separator, encoding, quote and showOverview) and returns a data table that contains the entire dataset. Because the function by default imports the text file exported from the TRY database, importing TRY data only requires the path to the text file.
In this example workflow, we use the trait data provided within the ‘rtry’ package. The path to the sample file data_TRY_15160.txt can be obtained via system.file(), which returns the full path to a file inside an installed package:
# Obtain and print the path to the sample dataset within the rtry package
path_to_data <- system.file("testdata", "data_TRY_15160.txt", package = "rtry")
path_to_data
## [1] "C:/Program Files/R/R-4.0.5/library/rtry/testdata/data_TRY_15160.txt"
# Import TRY data requests into data frames
TRYdata1 <- rtry_import(path_to_data)
## input: C:/Program Files/R/R-4.0.5/library/rtry/testdata/data_TRY_15160.txt
## dim: 1782 28
## col: LastName FirstName DatasetID Dataset SpeciesName AccSpeciesID AccSpeciesName ObservationID ObsDataID TraitID TraitName DataID DataName OriglName OrigValueStr OrigUnitStr ValueKindName OrigUncertaintyStr UncertaintyName Replicates StdValue UnitName RelUncertaintyPercent OrigObsDataID ErrorRisk Reference Comment V28
Note: You may ignore the message “Registered S3 method overwritten by <package_name>
” if it appears.
The rtry_import
function stores data in a format consistent with both classes data.table
and data.frame
. There are two ways to view the imported data (in this case TRYdata1
).
Method 1: Print the first six rows of TRYdata1 using the command:
head(TRYdata1)
Method 2: To view the entire TRYdata1, use the following command, which opens the data viewer (only available in RStudio):
View(TRYdata1)
We see that datasets released from TRY are in a long-table format, where the traits are defined in the columns TraitID
and TraitName
. Ancillary data are defined in the columns DataID
and DataName
, which also provide additional information for the traits.
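To illustrate the long-table layout, here is a small hand-made data frame (all values and names are hypothetical, not taken from the sample data). Trait records and ancillary records of the same observation share an ObservationID, and ancillary rows have TraitID set to NA:

```r
# Hypothetical long-table extract: one row per record
toy <- data.frame(
  ObservationID = c(1, 1, 1, 2, 2),
  TraitID       = c(3115, NA, NA, 3116, NA),
  DataID        = c(6582, 59, 60, 6583, 59),
  DataName      = c("SLA: petiole excluded", "Latitude", "Longitude",
                    "SLA: petiole included", "Latitude"),
  stringsAsFactors = FALSE
)

# Trait records carry a TraitID; ancillary records do not
trait_rows     <- toy[!is.na(toy$TraitID), ]
ancillary_rows <- toy[is.na(toy$TraitID), ]

nrow(trait_rows)      # number of trait records
nrow(ancillary_rows)  # number of ancillary records
```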
Import another sample TRY dataset (data_TRY_15161.txt
) within the ‘rtry’ package:
path_to_data <- system.file("testdata", "data_TRY_15161.txt", package = "rtry")
path_to_data
## [1] "C:/Program Files/R/R-4.0.5/library/rtry/testdata/data_TRY_15161.txt"
TRYdata2 <- rtry_import(path_to_data)
## input: C:/Program Files/R/R-4.0.5/library/rtry/testdata/data_TRY_15161.txt
## dim: 4627 28
## col: LastName FirstName DatasetID Dataset SpeciesName AccSpeciesID AccSpeciesName ObservationID ObsDataID TraitID TraitName DataID DataName OriglName OrigValueStr OrigUnitStr ValueKindName OrigUncertaintyStr UncertaintyName Replicates StdValue UnitName RelUncertaintyPercent OrigObsDataID ErrorRisk Reference Comment V28
Again to view the imported data, use either of the following commands:
head(TRYdata2)
View(TRYdata2)
rtry_explore() takes four arguments (input, ..., sortBy and showOverview) and converts the input into a grouped data table based on the specified column names. To provide a first understanding of the data, an additional column is added to show the total count within each group. By default (if sortBy is not specified), the output data are grouped by the first attribute.
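The idea of grouping rows and counting them can be sketched in base R on a toy table (hypothetical values). This only illustrates the concept and is not the code used inside rtry_explore():

```r
# Hypothetical records: three trait rows and three ancillary rows (TraitID NA)
toy <- data.frame(
  TraitID = c(3115, 3115, 3116, NA, NA, NA)
)

# One label per group; NA is spelled out so that ancillary rows are counted too
group  <- ifelse(is.na(toy$TraitID), "NA", as.character(toy$TraitID))
counts <- table(group)

# Grouped overview with an additional Count column
explore <- data.frame(TraitID = names(counts), Count = as.vector(counts),
                      stringsAsFactors = FALSE)
explore
```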
To explore the data TRYdata1
based on TraitID
and TraitName
, use the following:
TRYdata1_explore_trait <- rtry_explore(TRYdata1, TraitID, TraitName)
## dim: 3 3
Note: You may ignore the message “Registered S3 method overwritten by <package_name>
” if it appears.
View the output data using the command:
View(TRYdata1_explore_trait)
With this, it is clear that TRYdata1 only includes data for two TraitIDs (3115 and 3116), and that the dataset contains 1632 ancillary data entries, i.e. the entries where TraitID is NA.
Next, further exploration of TRYdata1
can be done based on the AccSpeciesID
, AccSpeciesName
, TraitID
and TraitName
.
TRYdata1_explore_species <- rtry_explore(TRYdata1, AccSpeciesID, AccSpeciesName, TraitID, TraitName)
## dim: 9 5
View(TRYdata1_explore_species)
The output is, by default, sorted by AccSpeciesID. TRYdata1 contains only three consolidated species (AccSpeciesID 10773, 35846 and 45737). Each consolidated species has records for TraitID 3115 and 3116, as well as the corresponding ancillary data.
After confirming that the data contain the necessary traits and species, it is also necessary, for the purpose of preprocessing, to understand which ancillary data are provided within the dataset. To do so, explore TRYdata1 based on DataID, DataName, TraitID and TraitName. This time, the exploration is sorted by TraitID to see if there are similar data in each trait.
# Group the input data based on DataID, DataName, TraitID and TraitName
# and sort the output by TraitID using the sortBy argument
TRYdata1_explore_anc <- rtry_explore(TRYdata1, DataID, DataName, TraitID, TraitName, sortBy = TraitID)
## dim: 156 5
View(TRYdata1_explore_anc)
With this exploration and the way it is sorted, it is clear that in the TRYdata1
: (1) there are two traits (TraitID
3115 and 3116); (2) within each trait, whether or not similar data exists (DataID
and DataName
); and (3) what types of ancillary data are provided (DataID
and DataName
, TraitID
: NA
) and how many (Count).
In the case of TraitID
3115, it can be seen that DataID
7222 and 7223 contain the extreme values (min or max) of the “Leaf specific area (SLA): petiole excluded”. This information might be useful later on when further preprocessing is performed.
A similar procedure can be performed on the other TRY data (TRYdata2
).
# Group the input data based on TraitID and TraitName
TRYdata2_explore_trait <- rtry_explore(TRYdata2, TraitID, TraitName)
# Group the input data based on AccSpeciesID, AccSpeciesName, TraitID and TraitName
# Note: Entries where TraitID is NA are ancillary data
TRYdata2_explore_species <- rtry_explore(TRYdata2, AccSpeciesID, AccSpeciesName, TraitID, TraitName)
# Group the input data based on DataID, DataName, TraitID and TraitName
# Then sort the output by TraitID using the sortBy argument
TRYdata2_explore_anc <- rtry_explore(TRYdata2, DataID, DataName, TraitID, TraitName, sortBy = TraitID)
## dim: 2 3
## dim: 6 5
## dim: 236 5
Then use either head()
or View()
function to view the data.
Via View(TRYdata2_explore_anc), it can be seen that in TRYdata2: (1) there is only one trait, with TraitID 3117; (2) within this trait, whether or not similar data exist; and (3) what types of ancillary data are available and how many (Count).
Here, it is clear that DataID 6584 and 6598 have very similar DataNames. This could mean that in the original dataset, two values of “SLA: undefined if petiole in- or excluded” were mapped to this trait, and the user might want to keep this in mind before further preprocessing.
rtry_bind_row() takes two arguments (... and showOverview). It takes a sequence of data and combines them by rows. Note: A common attribute is not necessary (unlike for the functions rtry_join_left and rtry_join_outer). The binding process simply puts the data one after another while matching the column names, and any missing columns are filled with NA.
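A base R sketch of this row-binding behaviour on two toy data frames follows. The helper name bind_rows_fill and all values are hypothetical; this is an illustration of the idea, not the implementation behind rtry_bind_row():

```r
# Hypothetical helper: bind two data frames by rows, matching column names
# and filling columns that are missing in one of them with NA
bind_rows_fill <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  rbind(a[all_cols], b[all_cols])
}

a <- data.frame(ObsDataID = 1:2, StdValue = c(10.5, 12.1))
b <- data.frame(ObsDataID = 3L, Comment = "hypothetical",
                stringsAsFactors = FALSE)

combined <- bind_rows_fill(a, b)
combined
```

The first data frame ends up on top, followed by the second, and each gets NA in the column it did not originally have.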
With the two TRY data TRYdata1
and TRYdata2
already imported, it is possible to combine the two datasets into one (TRYdata
):
TRYdata <- rtry_bind_row(TRYdata1, TRYdata2)
## dim: 6409 28
## col: LastName FirstName DatasetID Dataset SpeciesName AccSpeciesID AccSpeciesName ObservationID ObsDataID TraitID TraitName DataID DataName OriglName OrigValueStr OrigUnitStr ValueKindName OrigUncertaintyStr UncertaintyName Replicates StdValue UnitName RelUncertaintyPercent OrigObsDataID ErrorRisk Reference Comment V28
From the dimensions, it can be seen that the two imported datasets have been combined by rows, with TRYdata1 on top followed by TRYdata2, and that a new dataset TRYdata has been created.
The combined dataset TRYdata can now be explored, once again using rtry_explore().
# Group the input data based on TraitID and TraitName
TRYdata_explore_trait <- rtry_explore(TRYdata, TraitID, TraitName)
# Group the input data based on AccSpeciesID, AccSpeciesName, TraitID and TraitName
# Note: Entries where TraitID is NA are ancillary data
TRYdata_explore_species <- rtry_explore(TRYdata, AccSpeciesID, AccSpeciesName, TraitID, TraitName)
# Group the input data based on DataID, DataName, TraitID and TraitName
# Then sort the output by TraitID using the sortBy argument
TRYdata_explore_anc <- rtry_explore(TRYdata, DataID, DataName, TraitID, TraitName, sortBy = TraitID)
## dim: 4 3
## dim: 12 5
## dim: 331 5
To view the data, use either head()
or View()
.
Within the ‘rtry’ package, there are two ways to reduce the number of columns, i.e. either selecting the columns to keep: rtry_select_col()
or explicitly removing certain columns: rtry_remove_col()
.
Note: To ensure that the later preprocessing steps (such as data selection and duplicate removal) work properly, do not remove the columns ObservationID and OrigObsDataID.
rtry_select_col() takes three arguments (input, ... and showOverview) to select the specified columns from the imported data.
rtry_remove_col() also takes three arguments (input, ... and showOverview), but removes the specified columns from the input data instead.
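The two approaches lead to the same result when they describe the same set of columns. A base R sketch on a toy data frame (hypothetical values, not the rtry functions themselves):

```r
# Hypothetical table with an empty trailing column named V28
toy <- data.frame(ObsDataID = 1:2, StdValue = c(10.5, 12.1), V28 = c(NA, NA))

# Remove-style: name the few columns to drop
removed <- toy[, setdiff(names(toy), "V28"), drop = FALSE]

# Select-style: name the columns to keep
selected <- toy[, c("ObsDataID", "StdValue"), drop = FALSE]

identical(removed, selected)
```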
It is up to the users to decide which function they prefer for retrieving the relevant columns of the data. In general, rtry_remove_col() is easier to use when the user would like to keep most of the columns and remove only a few, such as V28.
workdata <- rtry_remove_col(TRYdata, V28)
## dim: 6409 27
## col: LastName FirstName DatasetID Dataset SpeciesName AccSpeciesID AccSpeciesName ObservationID ObsDataID TraitID TraitName DataID DataName OriglName OrigValueStr OrigUnitStr ValueKindName OrigUncertaintyStr UncertaintyName Replicates StdValue UnitName RelUncertaintyPercent OrigObsDataID ErrorRisk Reference Comment
From the feedback in the console, it can be seen that only the column V28 has been removed, so the output workdata contains 27 columns instead of the original 28. Users can continue removing further columns in the same way.
On the other hand, if users are clear which columns they would like to keep, a more direct approach is, of course, the use of rtry_select_col()
:
workdata <- rtry_select_col(workdata, ObsDataID, ObservationID, AccSpeciesID, AccSpeciesName, ValueKindName, TraitID, TraitName, DataID, DataName, OriglName, OrigValueStr, OrigUnitStr, StdValue, UnitName, OrigObsDataID, ErrorRisk, Comment)
## dim: 6409 17
## col: ObsDataID ObservationID AccSpeciesID AccSpeciesName ValueKindName TraitID TraitName DataID DataName OriglName OrigValueStr OrigUnitStr StdValue UnitName OrigObsDataID ErrorRisk Comment
Here, from the feedback in the console, the users can confirm the remaining columns within the data (in this case the workdata
), and continue to reduce the number of columns by selecting only the relevant columns.
rtry_select_row()
takes five arguments input
, ...
, getAncillary
, rmDuplicates
and showOverview
to select specified rows based on the specified criteria from the imported data. Note that the argument getAncillary
makes use of the ObservationID
provided within the TRY data to select the whole observation including all corresponding ancillary data during the row extraction process.
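The effect of pulling in the whole observation can be sketched in base R, assuming (as described above) that every row sharing an ObservationID with a matching row is kept. The values are hypothetical and the code is not the rtry implementation:

```r
# Hypothetical long table with two observations
toy <- data.frame(
  ObservationID = c(1, 1, 1, 2, 2),
  ObsDataID     = 101:105,
  TraitID       = c(3115, NA, NA, 3116, NA),
  DataID        = c(6582, 59, 60, 6583, 413)
)

# Rows that fulfil the criterion "DataID is 59" (Latitude)
hit <- toy$DataID %in% 59

# Without ancillary data: only the matching rows
rows_only <- toy[hit, ]

# With ancillary data: every row of the observations that contain a match
with_ancillary <- toy[toy$ObservationID %in% toy$ObservationID[hit], ]

nrow(rows_only)
nrow(with_ancillary)
```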
For our example we will select all trait records, but only relevant ancillary data.
First, identify the relevant ancillary data using the rtry_explore
function.
workdata_explore_anc <- rtry_explore(workdata, DataID, DataName, TraitID, TraitName, sortBy = TraitID)
## dim: 331 5
To view the data, use either head()
or View()
.
We retrieve six DataIDs with trait records and 325 DataIDs with ancillary information. For further data preprocessing, we select the DataIDs that contain the relevant information: 59 (Latitude), 60 (Longitude), 61, 6601, 327, 413 (Plant developmental status), 1961 and 113.
To select all trait records and the ancillary data of interest:
workdata <- rtry_select_row(workdata, TraitID > 0 | DataID %in% c(59, 60, 61, 6601, 327, 413, 1961, 113))
## dim: 1017 17
To view the data, use either head()
or View()
.
In order to better understand the data, inside the data viewer (only available via RStudio), click on the column ObservationID to sort the dataset. This way, it can be seen that, for example:
For ObservationID 94068, there are two ObsDataIDs, 1021243 and 1021245; the first belongs to TraitID 3115 and the second is ancillary data. Looking deeper into the DataID and DataName, we can see that this record of “SLA: petiole excluded” was measured within “growth chambers”, and could be eliminated later on depending on the research question.
For ObservationID 158137, there are four ancillary data entries with DataID 59, 60, 61 and 413. The ErrorRisk of the record “SLA: petiole excluded” is roughly 2.5, meaning the observation is about 2.5 standard deviations away from the mean. This is probably a “good” value that we would want to keep later. In addition, the OrigObsDataID is NA, meaning that the record is not a duplicate. The “Plant developmental status” (DataID 413) could also be important information for further processing.
To check if the required data are selected, rtry_explore() can be used:
workdata_explore_anc <- rtry_explore(workdata, DataID, DataName, TraitID, TraitName, sortBy = TraitID)
## dim: 14 5
Then use the command View(workdata_explore_anc)
to view the output in the data viewer.
It is recommended to back up the data at certain stages of the preprocessing, ideally before excluding data according to different attributes.
# Save workdata_unexcluded as backup
workdata_unexcluded <- workdata
If necessary, load the backup data using the following:
# Load workdata_unexcluded
workdata <- workdata_unexcluded
The ‘rtry’ package provides the function rtry_exclude
to exclude (remove) non-representative data.
rtry_exclude()
takes four arguments input
, ...
, baseOn
and showOverview
to exclude data from the input data based on the specified criteria.
The rtry_exclude function is designed to use the baseOn argument with either the column header ObservationID or ObsDataID. When the baseOn argument is set to ObservationID, the function removes all records of the whole observation if one selection criterion is fulfilled. If the baseOn argument is set to ObsDataID, only the (trait) record that fulfills the selection criteria is removed. Other column headers may be used for the baseOn argument, but the result would need to be checked carefully.
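The difference between the two settings can be sketched in base R on a toy table (hypothetical values; this mimics the behaviour described above and is not the rtry implementation):

```r
# Hypothetical long table: two observations, each with one trait record
# and one ancillary record on plant developmental status (DataID 413)
toy <- data.frame(
  ObservationID = c(1, 1, 2, 2),
  ObsDataID     = c(101, 102, 103, 104),
  DataID        = c(6582, 413, 6582, 413),
  OrigValueStr  = c("12.3", "juvenile", "15.1", "mature"),
  stringsAsFactors = FALSE
)

# Exclusion criterion: developmental status is "juvenile"
crit <- toy$DataID %in% 413 & toy$OrigValueStr %in% "juvenile"

# Based on ObservationID: drop every row of an observation that has a match
by_observation <- toy[!(toy$ObservationID %in% toy$ObservationID[crit]), ]

# Based on ObsDataID: drop only the matching records themselves
by_record <- toy[!(toy$ObsDataID %in% toy$ObsDataID[crit]), ]

nrow(by_observation)
nrow(by_record)
```

In the first case the trait record of the juvenile plant disappears together with its ancillary row; in the second case only the ancillary row is removed and the trait record stays.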
The rtry_exclude
function is very powerful and can be used on a range of different criteria. This function is therefore key for data cleaning. The workflow provides examples for the exclusion of whole observations using ancillary data (8.1 and 8.2), and the exclusion of single trait records according to additional trait specifications (8.3) and outliers (8.4 and 8.5).
For demonstration purposes, we remove the data where the observed plant is a juvenile or a sapling. We keep only observations on mature or adult plants, and observations where the developmental state is explicitly unknown or not provided (no DataID 413 for the given observation), assuming that observations without this information followed the recommended measurement protocol, i.e. measuring traits on mature plants.
First, identify the DataID that contains the information about plant development status, i.e. 413. Then check the different development statuses listed in the OrigValueStr column using the functions rtry_select_row() and rtry_explore(). The temporary data is saved as tmp_unfiltered.
# Select the rows where DataID is 413, i.e. the data containing the plant development status
tmp_unfiltered <- rtry_select_row(workdata, DataID %in% 413)
# Then explore the unique values of the OrigValueStr within the selected data
tmp_unfiltered <- rtry_explore(tmp_unfiltered, DataID, DataName, OriglName, OrigValueStr, OrigUnitStr, StdValue, Comment, sortBy = OrigValueStr)
## dim: 104 17
## dim: 7 8
By sorting the exploration by OrigValueStr
, it is clear what types of developmental state exist in the dataset. This way it is possible to set the criteria and start the excluding process using rtry_exclude()
.
In this case, to exclude the juvenile plants and saplings, we need to use the keywords “juvenile” and “saplings” when using the rtry_exclude
function.
# Criteria
# 1. DataID equals to 413
# 2. OrigValueStr equals to "juvenile" or "saplings"
workdata <- rtry_exclude(workdata, (DataID %in% 413) & (OrigValueStr %in% c("juvenile", "saplings")), baseOn = ObservationID)
## dim: 957 17
Once the excluding is completed, double-check the workdata
using the functions rtry_select_row
and rtry_explore
. The temporary data is saved as tmp_filtered
.
# Select the rows where DataID is 413, i.e. the data containing the plant development status
tmp_filtered <- rtry_select_row(workdata, DataID %in% 413)
# Then explore the unique values of the OrigValueStr within the selected data
tmp_filtered <- rtry_explore(tmp_filtered, DataID, DataName, OriglName, OrigValueStr, OrigUnitStr, StdValue, Comment, sortBy = OrigValueStr)
## dim: 91 17
## dim: 5 8
From the exploration, it is clear that all the juvenile and sapling plants were excluded as expected.
To further confirm if the trait and/or other ancillary data of the deleted development states were also removed accordingly, use again the rtry_explore
function.
# Group the input data based on DataID, DataName, TraitID and TraitName
# Then sort the output by TraitID using the sortBy argument
workdata_explore_anc_excluded <- rtry_explore(workdata, DataID, DataName, TraitID, TraitName, sortBy = TraitID)
## dim: 14 5
Compared with the workdata_explore_anc
, it can be seen that the number of traits and ancillary data also decreased, which is as expected. We can see as well, that observations without a record for DataID
413 have not been removed by the exclude function, since it is assumed that these observations were measured according to the recommended measurement protocol (measuring traits on mature plants).
To also exclude the observations without information on plant development state, the user would first select only the observations that include DataID 413, using the function rtry_select_row (please see the following example on geo-referenced data).
To keep only the geo-referenced observations from a certain region for further processing, users can make use of the ancillary data “Latitude” (DataID 59) and “Longitude” (DataID 60). To ensure the exclusion works as expected, it is best to apply the criteria one at a time. In this case, exclude according to latitude first, then longitude.
Filter according to latitude
First, obtain only the observations that contain the Latitude (DataID
59) information, i.e. geo-referenced observations, using the function rtry_select_row
.
# Select only the geo-referenced observations, i.e. with DataID 59 Latitude
# Set getAncillary to TRUE to obtain (keep) all traits and ancillary data
workdata <- rtry_select_row(workdata, DataID %in% 59, getAncillary = TRUE)
## dim: 717 17
Next, check the DataID
that contains the latitude information, i.e. 59
, and check the different states of the StdValue
using the functions rtry_select_row
and rtry_explore
.
# Select the rows that contain DataID 59, i.e. latitude information
tmp_unfiltered <- rtry_select_row(workdata, DataID %in% 59)
# Then explore the unique values of the StdValue within the selected data
tmp_unfiltered <- rtry_explore(tmp_unfiltered, DataID, DataName, OriglName, OrigValueStr, OrigUnitStr, StdValue, Comment, sortBy = StdValue)
## dim: 149 17
## dim: 55 8
For demonstration purposes, the following example excludes observations where the latitude is smaller than 40 or where this information is missing, i.e. NA.
# Exclude observations using latitude information
# Criteria
# 1. DataID equals to 59
# 2. StdValue smaller than 40 or NA
workdata <- rtry_exclude(workdata, (DataID %in% 59) & (StdValue < 40 | is.na(StdValue)), baseOn = ObservationID)
## dim: 624 17
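The explicit is.na() term in the criterion matters because a comparison with a missing value yields NA rather than TRUE or FALSE. A short base R sketch with hypothetical latitude values:

```r
# Hypothetical latitude values, one of them missing
lat <- c(35.2, 48.9, NA, 52.0)

# The comparison alone is NA for the missing value
below_40 <- lat < 40

# Adding is.na() turns the missing value into an explicit match
flag <- lat < 40 | is.na(lat)

below_40
flag
```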
Once the excluding is completed, double check the workdata
using the functions rtry_select_row
and rtry_explore
.
# Select the rows where DataID is 59 (Latitude)
tmp_filtered <- rtry_select_row(workdata, DataID %in% 59)
# Then explore the unique values of the StdValue within the selected data
# Sort the exploration by StdValue
tmp_filtered <- rtry_explore(tmp_filtered, DataID, DataName, OriglName, OrigValueStr, OrigUnitStr, StdValue, Comment, sortBy = StdValue)
## dim: 131 17
## dim: 42 8
From the exploration result, it is clear that only the latitude values larger than or equal to 40 remain.
Filter according to longitude
A similar procedure is performed for longitude (DataID 60). To ensure that all the observations within workdata contain the longitude information, use the rtry_select_row function.
# Select only the geo-referenced observations with DataID 60 Longitude
# Set getAncillary to TRUE to obtain (keep) all traits and ancillary data
workdata <- rtry_select_row(workdata, DataID %in% 60, getAncillary = TRUE)
## dim: 620 17
For demonstration purposes, this time we show how to remove observations outside a certain longitude range. The column StdValue will be used for the exclusion. To identify which values to exclude, use the rtry_select_row and rtry_explore functions to explore the dataset.
# Select the rows that contain DataID 60, i.e. longitude information
tmp_unfiltered <- rtry_select_row(workdata, DataID %in% 60)
# Then explore the unique values of the StdValue within the selected data
tmp_unfiltered <- rtry_explore(tmp_unfiltered, DataID, DataName, OriglName, OrigValueStr, OrigUnitStr, StdValue, Comment, sortBy = StdValue)
## dim: 130 17
## dim: 41 8
After data exploration, use the following command to exclude observations where the longitude is smaller than 10 or larger than 60, or where this information is missing, i.e. NA.
# Exclude observations using longitude information
# Criteria
# 1. DataID equals to 60
# 2. StdValue smaller than 10 or larger than 60 or NA
workdata <- rtry_exclude(workdata, (DataID %in% 60) & (StdValue < 10 | StdValue > 60 | is.na(StdValue)), baseOn = ObservationID)
## dim: 227 17
Once the excluding is completed, double check the workdata
using the functions rtry_select_row
and rtry_explore
.
# Select the rows where DataID is 60 (Longitude)
tmp_filtered <- rtry_select_row(workdata, DataID %in% 60)
# Then explore the unique values of the StdValue within the selected data
# Sort the exploration by StdValue
tmp_filtered <- rtry_explore(tmp_filtered, DataID, DataName, OriglName, OrigValueStr, OrigUnitStr, StdValue, Comment, sortBy = StdValue)
## dim: 47 17
## dim: 18 8
From the exploration result, it is clear that only the longitude values between 10 and 60 remain.
In the above examples, the argument baseOn is set to ObservationID, which removes whole observations. To select traits measured following standard measurement protocols and from a specified environment, we recommend checking at least the following DataIDs:
In this and the following examples, the argument baseOn
is set to ObsDataID
. These examples are supposed to exclude individual trait records or outliers, while keeping the rest of the observation, because the observation might contain records for other traits with relevant measurements.
In this context it might be of interest to remove non-representative sub-traits. This information can usually be found in the column DataName
. In order to identify these data, explore the workdata
based on DataID
, DataName
, TraitID
and TraitName
. Then sort the exploration by TraitID
, this way it is possible to see if there are similar data in each trait. Note: This step has already been performed in the previous step, but for demonstration purpose and completeness of this example, the same function is called again.
# Group the input data based on DataID, DataName, TraitID and TraitName
# Then sort the output by TraitID using the sortBy argument
tmp_unfiltered <- rtry_explore(workdata, DataID, DataName, TraitID, TraitName, sortBy = TraitID)
## dim: 13 5
From this exploration, it can be seen that DataID 7222 and 7223 contain the minimum and maximum values of the “Leaf specific area (SLA: petiole excluded)” for trait 3115. Depending on the research question, the user might want to remove these observations. The same goes for DataID 6598 in trait 3117, where two values of “SLA: undefined if petiole in- or excluded” had been provided.
Now that the excluding criteria have been identified, the user might decide to only remove the trait records for DataID
7222, 7223 and 6598, while keeping the rest of the observation, because it might contain relevant records for other traits. In this case, use the rtry_exclude
function with baseOn
specified to ObsDataID
:
# Criteria
# 1. DataID equals to 7222, 7223 or 6598
workdata <- rtry_exclude(workdata, DataID %in% c(7222, 7223, 6598), baseOn = ObsDataID)
## dim: 218 17
Double check the workdata
using the functions rtry_select_row
and rtry_explore
.
# Group the input data based on DataID, DataName, TraitID and TraitName
# Then sort the output by TraitID using the sortBy argument
tmp_filtered <- rtry_explore(workdata, DataID, DataName, TraitID, TraitName, sortBy = TraitID)
## dim: 10 5
It is clear that the defined DataID
s have been removed, while the corresponding ancillary data are kept.
For demonstration purposes, we assume the user has decided that SLA (specific leaf area) values below 5 mm² mg⁻¹ are out of the relevant range for the purpose of the analyses.
Knowing that the SLA values can be found in DataID
6582, 6583 and 6584, first check if values below 5 exist in the dataset using the functions rtry_select_row
and rtry_explore
. The temporary data are saved as tmp_unfiltered
. Note: To exclude numeric values, it is recommended to use the column StdValue
.
# Select the rows where DataID is 6582, 6583 and 6584, i.e. the data containing the SLA information
tmp_unfiltered <- rtry_select_row(workdata, DataID %in% c(6582, 6583, 6584))
# Then explore the unique values of the StdValue within the selected data
tmp_unfiltered <- rtry_explore(tmp_unfiltered, DataID, DataName, OriglName, OrigValueStr, OrigUnitStr, StdValue, UnitName, Comment, sortBy = StdValue)
## dim: 43 17
## dim: 31 9
Here, it can be seen that there are three trait records with an SLA value of less than 5 mm² mg⁻¹. To exclude these records, use the following command:
# Criteria
# 1. DataID equals to 6582, 6583 or 6584
# 2. StdValue smaller than 5
workdata <- rtry_exclude(workdata, (DataID %in% c(6582, 6583, 6584)) & (StdValue < 5), baseOn = ObsDataID)
## dim: 215 17
Once the excluding is completed, it is always recommended to double check the workdata
using the functions rtry_select_row
and rtry_explore
.
# Select the rows where DataID is 6582, 6583 and 6584, i.e. the data containing the SLA information
tmp_filtered <- rtry_select_row(workdata, DataID %in% c(6582, 6583, 6584))
# Then explore the unique values of the StdValue within the selected data
tmp_filtered <- rtry_explore(tmp_filtered, DataID, DataName, OriglName, OrigValueStr, OrigUnitStr, StdValue, UnitName, Comment, sortBy = StdValue)
## dim: 40 17
## dim: 30 9
The final example for the exclusion process is to remove the outliers identified in the context of TRY data integration. To do so, we take advantage of the column ErrorRisk provided inside the TRY output. The ErrorRisk quantifies the maximum distance of the trait record from a respective mean at the species, genus or family level in terms of standard deviation (a modified z-transformation). An ErrorRisk value of 3 indicates that the trait record is three standard deviations larger or smaller than the mean value based on species, genus or family (Kattge et al. 2011, 2020). Here we exclude the records with ErrorRisk larger than or equal to 3.0. With this in mind, the rtry_explore function is used to explore the data.
# Group the input data based on DataID, DataName, TraitID, TraitName and ErrorRisk
# Then sort the output by ErrorRisk using the sortBy argument
tmp_unfiltered <- rtry_explore(workdata, DataID, DataName, TraitID, TraitName, ErrorRisk, sortBy = ErrorRisk)
## dim: 34 6
From the exploration result, it can be seen that there are two outliers (with ErrorRisk larger than 3.0 in this case). It can also be observed that rows where ErrorRisk is NA may contain, for example, the geo-reference information and exposition of the observation. ErrorRisk does not apply to these values, and they should be kept.
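As a simplified illustration of a distance expressed in standard deviations, the sketch below computes a plain z-score against an assumed species mean and standard deviation (all numbers are hypothetical). TRY uses a modified z-transformation across species, genus and family levels, which this sketch does not reproduce:

```r
# Hypothetical reference statistics for one species
species_mean <- 20
species_sd   <- 4

# Hypothetical SLA records and their distance from the mean in SD units
sla <- c(18, 25, 33, 6)
z   <- abs(sla - species_mean) / species_sd

# Append an ancillary row, for which no risk value applies
risk <- c(z, NA)

# Flag records at or beyond three standard deviations; NA rows are not flagged
exclude <- !is.na(risk) & risk >= 3
exclude
```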
Once it is clear which data are to be excluded, use the rtry_exclude function to perform the data removal:
# Criteria
# 1. ErrorRisk larger than or equal to 3
workdata <- rtry_exclude(workdata, ErrorRisk >= 3, baseOn = ObsDataID)
## dim: 213 17
Always double check the excluded data before continuing the preprocessing.
# Group the input data based on DataID, DataName, TraitID, TraitName and ErrorRisk
# Then sort the output by ErrorRisk using the sortBy argument
tmp_filtered <- rtry_explore(workdata, DataID, DataName, TraitID, TraitName, ErrorRisk, sortBy = ErrorRisk)
## dim: 32 6
From the exploration result, it can be seen that the outliers (i.e. ErrorRisk
larger than or equal to 3.0) were removed.
As of July 2019, the TRY database comprised 588 data sets from 765 data contributors (Kattge et al. 2020). To keep track of potential duplicate entries, a unique identifier OrigObsDataID is assigned when the probability is high that certain trait records have previously been contributed (Kattge et al. 2020). With the help of this OrigObsDataID, the ‘rtry’ package provides a function rtry_remove_dup for users to easily remove the duplicates from the data for further processing.
rtry_remove_dup()
takes two arguments input
and showOverview
, and returns a data table of the input data after removing the duplicates. Note: This function depends on the duplicate identifier OrigObsDataID
listed in the TRY data, therefore, if the column OrigObsDataID
has been removed, this function will not work. Also, if the original, not duplicate, trait record was not imported to ‘rtry’ (e.g. if only public data or specific datasets were requested from TRY and the original trait record was part of the restricted data or another dataset) the duplicates identified by TRY will still be removed, resulting in data loss.
To remove the duplicates from workdata, simply call the function:
# Remove duplicates
workdata <- rtry_remove_dup(workdata)
## 13 duplicates removed.
## dim: 200 17
Once the function is called and executed, by default the number of duplicates removed and the resulting dimension of the data will be displayed on the console as reference.
For data management purposes, the TRY data is structured as a long table, and this is the format that has been used so far. However, human readers often find a dataset easier to read and understand in a wide-table format. Therefore, the ‘rtry’ package offers the function rtry_trans_wider to transform the long table into a wide table (i.e. increasing the number of columns and decreasing the number of rows) when needed. The rtry_trans_wider function represents the selected data types (traits, ancillary data) as columns, and every observation becomes one row.
rtry_trans_wider() takes five arguments, input, names_from, values_from, values_fn and showOverview, to transform the original long-table format of the TRY data into a wide-table format. Note: This function is based on the function tidyr::pivot_wider.
To ensure that the long-table to wide-table transformation does not result in duplicate entries because of the potential existence of multiple ObservationID for a single trait, the first step is to select only the traits with numerical values and then select the relevant columns.
#-------------------------------------------------
# Exclude
# 1. All entries with "" in TraitID
# 2. Potential categorical traits that don't have a StdValue
# 3. Traits that have not yet been standardized in TRY
# Then select the relevant columns for transformation
# Note: The complete.cases() is used to ensure the cases are complete,
# i.e. have no missing values
#-------------------------------------------------
num_traits <- rtry_select_row(workdata, complete.cases(TraitID) & complete.cases(StdValue))
num_traits <- rtry_select_col(num_traits, ObservationID, AccSpeciesID, AccSpeciesName, TraitID, TraitName, StdValue, UnitName)
## dim: 25 17
## dim: 25 7
## col: ObservationID AccSpeciesID AccSpeciesName TraitID TraitName StdValue UnitName
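To illustrate what the row criterion does, here is a small base R sketch on a toy table. It is only an illustration of the complete.cases() logic, not of rtry_select_row() itself; the data frame and its values are hypothetical.

```r
# Toy long table (hypothetical values)
# Row 2: ancillary data, so TraitID is missing
# Row 3: a trait without a standardized value, so StdValue is missing
toy <- data.frame(
  ObservationID = c(1, 1, 2, 3),
  TraitID       = c(3115, NA, 3116, 3115),
  StdValue      = c(12.5, 48.2, NA, 9.8)
)

# TRUE only where both TraitID and StdValue are present
keep <- complete.cases(toy$TraitID) & complete.cases(toy$StdValue)

# Keep the complete rows and the columns of interest
toy_num <- toy[keep, c("ObservationID", "TraitID", "StdValue")]
print(toy_num)
```

Only the first and the last row pass both conditions, which mirrors the reduction in the number of rows reported above.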
Now that only the traits with numerical values have been selected, the ancillary data need to be added as additional columns before the transformation, so that they are kept when the long table is transformed into a wide table on traits.
For demonstration purposes, the latitude and longitude information will be added to the input data to create a geo-referenced trait table. To extract the unique values of the ancillary data and the corresponding ObservationID, use the rtry_select_anc function.
# Extract the unique value of latitude (DataID 59) and the corresponding ObservationID
workdata_lat <- rtry_select_anc(workdata, 59)

# Extract the unique value of longitude (DataID 60) and the corresponding ObservationID
workdata_lon <- rtry_select_anc(workdata, 60)
## dim: 47 2
## col: ObservationID Latitude
## dim: 47 2
## col: ObservationID Longitude
Next, add the extracted latitude and longitude information to the traits. To do so, use the ObservationID present in both the trait records and the extracted ancillary data to merge the data frames. To keep all the trait records, a left join should be used. This can be done with the rtry_join_left function provided in the ‘rtry’ package.
# To merge the extracted ancillary data with the numerical traits
# Merge the relevant data frames based on the ObservationID using rtry_join_left (left join)
num_traits_georef <- rtry_join_left(num_traits, workdata_lat, baseOn = ObservationID)
num_traits_georef <- rtry_join_left(num_traits_georef, workdata_lon, baseOn = ObservationID)
## dim: 25 8
## col: ObservationID AccSpeciesID AccSpeciesName TraitID TraitName StdValue UnitName Latitude
## dim: 25 9
## col: ObservationID AccSpeciesID AccSpeciesName TraitID TraitName StdValue UnitName Latitude Longitude
It can be seen that the extracted latitude and longitude information was added to the right of the numerical traits.
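The effect of a left join can be sketched in base R with merge() and all.x = TRUE. This is an analogue for illustration, not the code behind rtry_join_left(); the two toy data frames are hypothetical.

```r
# Toy trait records (left table) and toy ancillary data (right table)
traits <- data.frame(
  ObservationID = c(1, 2, 3),
  StdValue      = c(12.5, 9.8, 15.1)
)
lat <- data.frame(
  ObservationID = c(1, 3),
  Latitude      = c(51.3, 48.9)
)

# all.x = TRUE keeps every row of the left table;
# observations without a latitude get NA instead of being dropped
georef <- merge(traits, lat, by = "ObservationID", all.x = TRUE)
print(georef)
```

All three trait records are kept, and the observation without latitude information simply carries a missing value in the new column.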
Once the trait records and the necessary ancillary data have been prepared, the transformation is performed on TraitID, TraitName and UnitName: the values of these three columns are combined to form the names of the output columns (names_from), and StdValue provides the cell values (values_from). In addition, the mean function is applied to each cell in the output (values_fn), e.g. for the case that several trait records were measured within the same observation, because the tidyr::pivot_wider function expects only one record per trait or ancillary data (column) and observation (row). To do so, use the following command:
# Perform wide table transformation on TraitID, TraitName and UnitName
# With cell values to be the mean values calculated for StdValue
num_traits_georef_wider <- rtry_trans_wider(num_traits_georef, names_from = c(TraitID, TraitName, UnitName), values_from = c(StdValue), values_fn = list(StdValue = mean))
## dim: 25 8
Immediately, it can be seen that the columns TraitID, TraitName and UnitName were “transformed” into new columns, named for example “3115_Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): petiole excluded_mm2 mg-1”.
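The role of the mean aggregation can be shown with a base R analogue using tapply(). This sketch does not call tidyr::pivot_wider or rtry_trans_wider(); it only mimics the idea of combining three columns into one column name and averaging repeated measurements. The toy records, trait names and units are hypothetical.

```r
# Toy long table: observation 1 has two records for the same trait
long <- data.frame(
  ObservationID = c(1, 1, 1, 2),
  TraitID       = c(3115, 3115, 14, 3115),
  TraitName     = c("SLA", "SLA", "Leaf N", "SLA"),
  UnitName      = c("mm2 mg-1", "mm2 mg-1", "mg/g", "mm2 mg-1"),
  StdValue      = c(10, 14, 21, 20)
)

# Combine the three columns into one name, joined by "_"
col_name <- paste(long$TraitID, long$TraitName, long$UnitName, sep = "_")

# One row per observation, one column per trait, mean of repeated records
wide <- tapply(
  long$StdValue,
  list(ObservationID = long$ObservationID, Trait = col_name),
  mean
)
print(wide)
```

The two records of observation 1 for the same trait collapse into a single averaged cell, and a trait that was not measured for an observation appears as NA.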
rtry_export() takes four arguments, data, output, quote and encoding, and exports the preprocessed data as a CSV file. Note: If the specified output directory does not exist, it will be created automatically.
To save the preprocessed wide table num_traits_georef_wider, simply:
# Export the data into a CSV file
output_file <- file.path(tempdir(), "workdata_wider_traits.csv")
rtry_export(num_traits_georef_wider, output_file)
## File saved at: C:/Users/user/AppData/Local/Temp/Rtmp4wJAvQ/workdata_wider_traits.csv
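To check an exported file, it can be read back and compared with the data in memory. The sketch below uses base R write.csv() as a stand-in, so it does not show rtry_export() itself; the explicit dir.create() call only mimics the automatic directory creation mentioned above. The toy table, the directory name rtry_sketch and the file name are hypothetical.

```r
# Toy table to export (hypothetical values)
toy <- data.frame(
  ObservationID = c(1, 2),
  Latitude      = c(51.3, 48.9)
)

# Target file inside a subdirectory that may not exist yet
out_file <- file.path(tempdir(), "rtry_sketch", "toy_export.csv")

# Create the output directory if needed
dir.create(dirname(out_file), recursive = TRUE, showWarnings = FALSE)

# Write the table as CSV in UTF-8, without row names
write.csv(toy, out_file, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back and compare the dimensions with the original
check <- read.csv(out_file)
print(dim(check))
```

A matching number of rows and columns is a quick sign that the export completed as expected.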