Input data frame required for greatR
to register your
gene expression profiles is a data frame that contains gene expression
time-course data with all replicates. The illustrated diagram below
shows the required input_df
data frame format of
greatR
. This data frame must contain:
two different species which users wish to compare (for each gene, at least two replicates are required in all species), and
six data columns:
locus_name
: accession or unique ID of each gene.accession
: accession or name of the species to
compare.timepoint
: time point of the gene expression data.expression_value
: desired RNA-seq expression value or
measure of the abundance of gene or transcripts which one wishes to
compare. This value can be RPM, RPKM, FPKM, TPM, TMM, DESeq, SCnorm,
GeTMM, ComBat-Seq and raw reads counts.tissue
: tissue from which RNA-seq data were
extracted.group
: replicate ID for each expression value.greatR
package provides a helper function
get_expression_of_interest()
to combine reference and query
(data to register) data frames, if table ID which connect reference and
query data frames provided (as illustrated in the figure below).
This package also provides data frames as examples to use this function. Those data frames are expression data of B. rapa (Calderwood et al., 2021), Arabidopsis (Klepikova et al., 2015), and join table or table ID which contains both accession of B. rapa and Arabidopsis. This table ID is important to combine both B. rapa and Arabidopsis data frames.
# Load the package
library(greatR)
library(dplyr)
To load example data frames provided within the package, users need to run the following procedure:
# Load example data
<- system.file(
id_table_rapa_subset "extdata/sample_data/id_table_5genes.rds",
package = "greatR") %>%
readRDS()
<- system.file(
klepikova_subset "extdata/sample_data/arabidopsis_expression_5genes.rds",
package = "greatR") %>%
readRDS()
<- system.file(
rapa_subset "extdata/sample_data/brassica_rapa_expression_5genes.rds",
package = "greatR") %>%
readRDS()
Let’s take a look on each data frame we have loaded below:
# View reference data
%>%
rapa_subset ::head(3) %>%
utils::kable() knitr
CDS.model | sample_id | FPKM | accession | tissue | timepoint | dataset | est_counts | group | norm.cpm |
---|---|---|---|---|---|---|---|---|---|
BRAA02G018970.3C | ERR_ro18_rna_seq_v3_R18A1_1 | 0.287365 | Ro18 | apex | 11 | ro18_rna_seq | 12 | Ro18-apex-11 | 0.3968734 |
BRAA02G018970.3C | ERR_ro18_rna_seq_v3_R18A1_2 | 1.069565 | Ro18 | apex | 11 | ro18_rna_seq | 47 | Ro18-apex-11 | 1.4147711 |
BRAA02G018970.3C | ERR_ro18_rna_seq_v3_R18A1_3 | 0.541893 | Ro18 | apex | 11 | ro18_rna_seq | 23 | Ro18-apex-11 | 0.7423984 |
# View query data
%>%
klepikova_subset ::head(3) %>%
utils::kable() knitr
CDS.model | sample_id | FPKM | accession | tissue | timepoint | dataset | est_counts | group | norm.cpm |
---|---|---|---|---|---|---|---|---|---|
AT1G69120 | ERR_ds_klepikova_SRR1659921 | 0.299101 | Col0 | apex | 7 | ds_klepikova_1 | 4 | Col0-apex-07 | 0.4667855 |
AT1G69120 | ERR_ds_klepikova_SRR1659922 | 0.020064 | Col0 | apex | 7 | ds_klepikova_2 | 1 | Col0-apex-07 | 0.0741901 |
AT1G69120 | ERR_ds_klepikova_SRR1660397 | 0.000000 | Col0 | apex | 8 | ds_klepikova_1 | 0 | Col0-apex-08 | 0.0000000 |
As we can see, both data frames of expression profiles above contain
a column of gene accession (CDS.model
). These two columns
from each data frame were mapped into a table ID as we can see below:
CDS.model
for B. rapa gene accessions and
locus_name
for Arabidopsis gene accessions.
# View table ID
%>%
id_table_rapa_subset ::select(CDS.model, locus_name, symbol, gene_id) %>%
dplyr::head(3) %>%
utils::kable() knitr
CDS.model | locus_name | symbol | gene_id |
---|---|---|---|
BRAA01G016140.3C | AT4G24540 | AGL24 | AT4G24540 |
BRAA02G018970.3C | AT1G69120 | AP1 | AT1G69120 |
BRAA02G043220.3C | AT5G61850 | LFY | AT5G61850 |
Having all of these informations, we can then easily get the required input data.
Users can use function get_expression_of_interest()
to
combine two different table data frames containing expression values
that users wish to compare.
<- get_expression_of_interest(
all_data data_ref = rapa_subset,
data_to_transform = klepikova_subset,
id_table = id_table_rapa_subset,
lookup_col_ref_and_id_table = "CDS.model",
lookup_col_ref_and_to_transform = "locus_name",
colnames_wanted = NULL,
tissue_wanted = "apex",
gene_of_interest_acc = c("AT1G69120", "AT5G618"),
sum_exp_data_ref = FALSE,
accession_data_to_transform = "Col0"
)
Let’s take a look on how the combined data:
# View data
%>%
all_data ::group_by(accession) %>%
dplyr::slice(1:3) %>%
dplyr::kable() knitr
locus_name | CDS.model | sample_id | FPKM | accession | tissue | timepoint | dataset | est_counts | group | norm.cpm | id_transform_data |
---|---|---|---|---|---|---|---|---|---|---|---|
BRAA02G018970.3C | BRAA02G018970.3C | ERR_ds_klepikova_SRR1659921 | 0.299101 | Col0 | apex | 7 | ds_klepikova_1 | 4 | Col0-apex-07 | 0.4667855 | AT1G69120 |
BRAA02G018970.3C | BRAA02G018970.3C | ERR_ds_klepikova_SRR1659922 | 0.020064 | Col0 | apex | 7 | ds_klepikova_2 | 1 | Col0-apex-07 | 0.0741901 | AT1G69120 |
BRAA02G018970.3C | BRAA02G018970.3C | ERR_ds_klepikova_SRR1660397 | 0.000000 | Col0 | apex | 8 | ds_klepikova_1 | 0 | Col0-apex-08 | 0.0000000 | AT1G69120 |
BRAA02G018970.3C | BRAA02G018970.3C | ERR_ro18_rna_seq_v3_R18A1_1 | 0.287365 | Ro18 | apex | 11 | ro18_rna_seq | 12 | Ro18-apex-11 | 0.3968734 | AT1G69120 |
BRAA02G018970.3C | BRAA02G018970.3C | ERR_ro18_rna_seq_v3_R18A1_2 | 1.069565 | Ro18 | apex | 11 | ro18_rna_seq | 47 | Ro18-apex-11 | 1.4147711 | AT1G69120 |
BRAA02G018970.3C | BRAA02G018970.3C | ERR_ro18_rna_seq_v3_R18A1_3 | 0.541893 | Ro18 | apex | 11 | ro18_rna_seq | 23 | Ro18-apex-11 | 0.7423984 | AT1G69120 |
After combining reference data and data to transform, users need to adjust it to the required input data, this can be done as follows.
<- all_data %>%
all_data_final ::select(
dplyr
locus_name,
accession,
timepoint, expression_value = norm.cpm,
tissue,
group
)
%>%
all_data_final ::group_by(accession) %>%
dplyr::slice(1:6) %>%
dplyr::kable() knitr
locus_name | accession | timepoint | expression_value | tissue | group |
---|---|---|---|---|---|
BRAA02G018970.3C | Col0 | 7 | 0.4667855 | apex | Col0-apex-07 |
BRAA02G018970.3C | Col0 | 7 | 0.0741901 | apex | Col0-apex-07 |
BRAA02G018970.3C | Col0 | 8 | 0.0000000 | apex | Col0-apex-08 |
BRAA02G018970.3C | Col0 | 8 | 0.0000000 | apex | Col0-apex-08 |
BRAA02G018970.3C | Col0 | 9 | 0.3722542 | apex | Col0-apex-09 |
BRAA02G018970.3C | Col0 | 9 | 0.0000000 | apex | Col0-apex-09 |
BRAA02G018970.3C | Ro18 | 11 | 0.3968734 | apex | Ro18-apex-11 |
BRAA02G018970.3C | Ro18 | 11 | 1.4147711 | apex | Ro18-apex-11 |
BRAA02G018970.3C | Ro18 | 11 | 0.7423984 | apex | Ro18-apex-11 |
BRAA02G018970.3C | Ro18 | 29 | 11.3007002 | apex | Ro18-apex-29 |
BRAA02G018970.3C | Ro18 | 29 | 23.2055664 | apex | Ro18-apex-29 |
BRAA02G018970.3C | Ro18 | 29 | 22.0307747 | apex | Ro18-apex-29 |