Preparing data

Data requirements

Input data frame required for greatR to register your gene expression profiles is a data frame that contains gene expression time-course data with all replicates. The illustrated diagram below shows the required input_df data frame format of greatR. This data frame must contain:


Using a helper function to prepare input data

greatR package provides a helper function get_expression_of_interest() to combine reference and query (data to register) data frames, if table ID which connect reference and query data frames provided (as illustrated in the figure below).


This package also provides data frames as examples to use this function. Those data frames are expression data of B. rapa (Calderwood et al., 2021), Arabidopsis (Klepikova et al., 2015), and join table or table ID which contains both accession of B. rapa and Arabidopsis. This table ID is important to combine both B. rapa and Arabidopsis data frames.

Load and view example data

# Load the package
library(greatR)
library(dplyr)

To load example data frames provided within the package, users need to run the following procedure:

# Load example data
id_table_rapa_subset <- system.file(
  "extdata/sample_data/id_table_5genes.rds", 
  package = "greatR") %>% 
  readRDS()
klepikova_subset <- system.file(
  "extdata/sample_data/arabidopsis_expression_5genes.rds", 
  package = "greatR") %>% 
  readRDS()
rapa_subset <- system.file(
  "extdata/sample_data/brassica_rapa_expression_5genes.rds", 
  package = "greatR") %>% 
  readRDS()

Let’s take a look on each data frame we have loaded below:

# View reference data
rapa_subset %>%
  utils::head(3) %>%
  knitr::kable()
CDS.model sample_id FPKM accession tissue timepoint dataset est_counts group norm.cpm
BRAA02G018970.3C ERR_ro18_rna_seq_v3_R18A1_1 0.287365 Ro18 apex 11 ro18_rna_seq 12 Ro18-apex-11 0.3968734
BRAA02G018970.3C ERR_ro18_rna_seq_v3_R18A1_2 1.069565 Ro18 apex 11 ro18_rna_seq 47 Ro18-apex-11 1.4147711
BRAA02G018970.3C ERR_ro18_rna_seq_v3_R18A1_3 0.541893 Ro18 apex 11 ro18_rna_seq 23 Ro18-apex-11 0.7423984
# View query data
klepikova_subset %>%
  utils::head(3) %>%
  knitr::kable()
CDS.model sample_id FPKM accession tissue timepoint dataset est_counts group norm.cpm
AT1G69120 ERR_ds_klepikova_SRR1659921 0.299101 Col0 apex 7 ds_klepikova_1 4 Col0-apex-07 0.4667855
AT1G69120 ERR_ds_klepikova_SRR1659922 0.020064 Col0 apex 7 ds_klepikova_2 1 Col0-apex-07 0.0741901
AT1G69120 ERR_ds_klepikova_SRR1660397 0.000000 Col0 apex 8 ds_klepikova_1 0 Col0-apex-08 0.0000000

As we can see, both data frames of expression profiles above contain a column of gene accession (CDS.model). These two columns from each data frame were mapped into a table ID as we can see below: CDS.model for B. rapa gene accessions and locus_name for Arabidopsis gene accessions.

# View table ID
id_table_rapa_subset %>%
  dplyr::select(CDS.model, locus_name, symbol, gene_id) %>% 
  utils::head(3) %>%
  knitr::kable()
CDS.model locus_name symbol gene_id
BRAA01G016140.3C AT4G24540 AGL24 AT4G24540
BRAA02G018970.3C AT1G69120 AP1 AT1G69120
BRAA02G043220.3C AT5G61850 LFY AT5G61850

Having all of these informations, we can then easily get the required input data.

Combine data

Users can use function get_expression_of_interest() to combine two different table data frames containing expression values that users wish to compare.

all_data <- get_expression_of_interest(
  data_ref = rapa_subset,
  data_to_transform = klepikova_subset,
  id_table = id_table_rapa_subset,
  lookup_col_ref_and_id_table = "CDS.model",
  lookup_col_ref_and_to_transform = "locus_name",
  colnames_wanted = NULL,
  tissue_wanted = "apex",
  gene_of_interest_acc = c("AT1G69120", "AT5G618"),
  sum_exp_data_ref = FALSE,
  accession_data_to_transform = "Col0"
)

Let’s take a look on how the combined data:

# View data
all_data %>%
  dplyr::group_by(accession) %>%
  dplyr::slice(1:3) %>%
  knitr::kable()
locus_name CDS.model sample_id FPKM accession tissue timepoint dataset est_counts group norm.cpm id_transform_data
BRAA02G018970.3C BRAA02G018970.3C ERR_ds_klepikova_SRR1659921 0.299101 Col0 apex 7 ds_klepikova_1 4 Col0-apex-07 0.4667855 AT1G69120
BRAA02G018970.3C BRAA02G018970.3C ERR_ds_klepikova_SRR1659922 0.020064 Col0 apex 7 ds_klepikova_2 1 Col0-apex-07 0.0741901 AT1G69120
BRAA02G018970.3C BRAA02G018970.3C ERR_ds_klepikova_SRR1660397 0.000000 Col0 apex 8 ds_klepikova_1 0 Col0-apex-08 0.0000000 AT1G69120
BRAA02G018970.3C BRAA02G018970.3C ERR_ro18_rna_seq_v3_R18A1_1 0.287365 Ro18 apex 11 ro18_rna_seq 12 Ro18-apex-11 0.3968734 AT1G69120
BRAA02G018970.3C BRAA02G018970.3C ERR_ro18_rna_seq_v3_R18A1_2 1.069565 Ro18 apex 11 ro18_rna_seq 47 Ro18-apex-11 1.4147711 AT1G69120
BRAA02G018970.3C BRAA02G018970.3C ERR_ro18_rna_seq_v3_R18A1_3 0.541893 Ro18 apex 11 ro18_rna_seq 23 Ro18-apex-11 0.7423984 AT1G69120

After combining reference data and data to transform, users need to adjust it to the required input data, this can be done as follows.

all_data_final <- all_data %>% 
  dplyr::select(
    locus_name, 
    accession, 
    timepoint, 
    expression_value = norm.cpm, 
    tissue, 
    group
  )

all_data_final %>%
  dplyr::group_by(accession) %>%
  dplyr::slice(1:6) %>%
  knitr::kable()
locus_name accession timepoint expression_value tissue group
BRAA02G018970.3C Col0 7 0.4667855 apex Col0-apex-07
BRAA02G018970.3C Col0 7 0.0741901 apex Col0-apex-07
BRAA02G018970.3C Col0 8 0.0000000 apex Col0-apex-08
BRAA02G018970.3C Col0 8 0.0000000 apex Col0-apex-08
BRAA02G018970.3C Col0 9 0.3722542 apex Col0-apex-09
BRAA02G018970.3C Col0 9 0.0000000 apex Col0-apex-09
BRAA02G018970.3C Ro18 11 0.3968734 apex Ro18-apex-11
BRAA02G018970.3C Ro18 11 1.4147711 apex Ro18-apex-11
BRAA02G018970.3C Ro18 11 0.7423984 apex Ro18-apex-11
BRAA02G018970.3C Ro18 29 11.3007002 apex Ro18-apex-29
BRAA02G018970.3C Ro18 29 23.2055664 apex Ro18-apex-29
BRAA02G018970.3C Ro18 29 22.0307747 apex Ro18-apex-29