Introduction

The corto (“correlation tool”) package provides a pipeline to infer networks between “centroid” and “target” variables in a dataset, using a combination of Pearson correlation and Data Processing Inequality (DPI), first proposed in [1]. The main application of corto is in the field of Bioinformatics and Transcriptomics, where co-occurrence between variables can be used as a mean to infer regulatory mechanisms [2] or gene functions [3]. In this field, usually the tested features are genes (or rather, their expression profile across samples), whereas the centroids are Transcription Factors (TFs) and their targets are Target Genes (TGs). The TF-TG co-expression can hint at a causal regulatory relationship, as proven in many studies [4,5,6]. The corto tool replicates the well-established pipeline of the ARACNe family of tools [7,8,9]

In brief, corto operates using the following steps:

  1. Load a gene expression matrix and a list of user-provided centroids (e.g. TFs). Data should be normalized via Variance Stabilizing Transformation (VST) for RNA-Seq, and Robust Multiarray Average (RMA) for microarrays, as described in [10].
  2. Calculate all pairwise Pearson correlation coefficients between centroid and target features. The rank transformation operated by the Pearson correlation coefficient is a common procedure used in co-expression based studies, due to its benefits in reducing the effects of outliers [11].
  3. Filter out all centroid-target features whose correlation coefficient absolute value is below the provided p-value threshold p.
  4. Apply DPI to all centroid-centroid-target triplets, in order to identify which centroid-target correlation is stronger and identify the most likely association (e.g. which TF is the most likely regulator of the TG in the dataset).
  5. All edges are tested for Data Processing Inequality in a number of bootstrapped versions of the same input matrix (specified by the nbootstraps parameter, 100 by default, as in ARACNE-AP [8]). This step can be run in parallel by specifiying the number of threads using the nthreads parameter. The number of occurrences of each edge (if they survived DPI) is calculated. This number can range from 0 (the edge is not significant in any bootstrap) to nbootstraps+1 (the edge is significant in all bootstraps, and also in the original matrix).
  6. A network is generated in the form of a list, where each element is a centroid-based list of targets. In order to follow the structure of the regulon class implemented by downstream analysis packages (e.g. VIPER [12]), each target is characterized by two parameters:
    • The tfmode, the mode of action, i.e. the Pearson correlation coefficient between the target and the centroid in the original input matrix
    • The likelihood of the interaction, calculated as the number of bootstraps in which the edge survives DPI, divided by the total number of bootstraps (in this impelmentation, the presence of an edge in the original, non-bootstrapped matrix is considered as an additional evidence)

Running corto

Here is how to run corto. First, install the package:

install.packages("corto")

Then, load it:

library(corto)

Then, you can see how the input matrix looks like. For example, this dataset comes from the TCGA mesothelioma project [13] and measures the expression of 10021 genes across 87 samples:

load(system.file("extdata","inmat.rda",package="corto"))
inmat[1:5,1:5]
#         TCGA-3H-AB3K-01A TCGA-3H-AB3L-01A TCGA-3H-AB3M-01A TCGA-3H-AB3O-01A
# ZSCAN29         9.710627        10.045033         9.581725         9.966440
# MYOG            5.699747         5.699747         6.044523         5.699747
# PAX3            5.926491         5.699747         6.318039         6.667444
# BNC1           13.059897        13.289953        13.932597        11.860305
# SPIB            7.581019         7.818206         6.780734         7.629027
#         TCGA-3H-AB3S-01A
# ZSCAN29         9.317134
# MYOG            6.335294
# PAX3            6.303159
# BNC1           13.143449
# SPIB            5.985807
dim(inmat)
# [1] 7000   87

Another input needed by corto is a list of centroid features. In our case, we can specify a list of TFs generated from Gene Ontology with the term “Regulation of Transcription” [14].

load(system.file("extdata","centroids.rda",package="corto"))
centroids[15]
# [1] "BNC1"
length(centroids)
# [1] 294

Finally, we can run corto. In this example, we will run it with p-value threshold of 1e-30, 10 bootstraps and 2 threads

regulon<-corto(inmat,centroids=centroids,nbootstraps=10,p=1e-30,nthreads=2)
# Input Matrix has 87 samples and 10021 features
# Correlation Coefficient Threshold is: 0.889962633618839
# Removed 112 features with zero variance
# Calculating pairwise correlations
# Initial testing of triplets for DPI
# 246 edges passed the initial threshold
# Building DPI network from 37 centroids and 136 targets
# Running 100 bootstraps with 2 thread(s)
# Calculating edge likelihood
# Generating regulon object

The regulon object is a list:

regulon[1:2]
# $SCRT1
# $SCRT1$tfmode
#    ACTL6B    CACNG2     CDY1B 
# 0.9032878 0.8994689 0.9132632 
# 
# $SCRT1$likelihood
# [1] 0.2727273 0.3636364 0.3636364
# 
# 
# $ELF5
# $ELF5$tfmode
#      APOH  ARHGEF38       C8B   CEACAM5    CLDN18      CPB2    CXCL17       FGA 
# 0.9018898 0.9079885 0.9054840 0.9150803 0.9353695 0.9221696 0.8943170 0.9042021 
# 
# $ELF5$likelihood
# [1] 0.7272727 0.7272727 0.7272727 0.7272727 0.8181818 0.8181818 0.1818182
# [8] 0.7272727

The regulon in this dataset is composed of 34 final centroids with at least one target:

length(regulon)
# [1] 28
names(regulon)
#  [1] "SCRT1"   "ELF5"    "NEUROG1" "ZSCAN29" "BCL6B"   "MYOG"    "REST"   
#  [8] "MYBL2"   "FEZF2"   "PAX5"    "UHRF1"   "DMBX1"   "SOX7"    "PAX3"   
# [15] "BNC1"    "TBX21"   "IKZF3"   "EOMES"   "TFEC"    "ZNF831"  "FOXM1"  
# [22] "NKX2-1"  "SPIB"    "RFX4"    "SPI1"    "NOBOX"   "FOS"     "FOXO3"

Correcting with CNV data

As an additional, optional feature, corto gives the user the possibility to provide Copy Number Variation (CNV) data, if available, which can generate spurious correlations between TFs and targets [15]. The gene expression profiles for the target genes are corrected via linear regression and the residuals of the expression~cnv model substitute the original gene expression profile.

In this example, a CNV matrix is provided. The analysis will be run only for the features (rows) and samples (columns) present in both matrices

load(system.file("extdata","cnvmat.rda",package="corto",mustWork=TRUE))
regulon <- corto(inmat,centroids=centroids,nthreads=2,nbootstraps=10,verbose=TRUE,cnvmat=cnvmat,p=0.01)

References

[1] Reverter, Antonio, and Eva KF Chan. “Combining partial correlation and an information theory approach to the reversed engineering of gene co-expression networks.” Bioinformatics 24.21 (2008): 2491-2497.

[2] D’Haeseleer, Patrik, Shoudan Liang, and Roland Somogyi. “Genetic network inference: from co-expression clustering to reverse engineering.” Bioinformatics 16.8 (2000): 707-726.

[3] Hansen, Bjoern O., et al. “Elucidating gene function and function evolution through comparison of co-expression networks of plants.” Frontiers in plant science 5 (2014): 394.

[4] Basso, Katia, et al. “Reverse engineering of regulatory networks in human B cells.” Nature genetics 37.4 (2005): 382. [5] Amar, David, Hershel Safer, and Ron Shamir. “Dissection of regulatory networks that are altered in disease via differential co-expression.” PLoS computational biology 9.3 (2013): e1002955.

[6] Vandepoele, Klaas, et al. “Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and coexpression networks.” Plant physiology 150.2 (2009): 535-546.

[7] Margolin, Adam A., et al. “ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context.” BMC bioinformatics. Vol. 7. No. 1. BioMed Central, 2006.

[8] Lachmann, Alexander, et al. “ARACNe-AP: gene network reverse engineering through adaptive partitioning inference of mutual information.” Bioinformatics 32.14 (2016): 2233-2235.

[9] Khatamian, Alireza, et al. “SJARACNe: a scalable software tool for gene network reverse engineering from big data.” Bioinformatics 35.12 (2018): 2165-2166.

[10] Giorgi, Federico M., et al. “Comparative study of RNA-seq-and microarray-derived coexpression networks in Arabidopsis thaliana.” Bioinformatics 29.6 (2013): 717-724.

[11] Usadel, Björn, et al. “Co‐expression tools for plant biology: opportunities for hypothesis generation and caveats.” Plant, cell & environment 32.12 (2009): 1633-1651.

[12] Alvarez, Mariano J., Federico Giorgi, and Andrea Califano. “Using viper, a package for Virtual Inference of Protein-activity by Enriched Regulon analysis.” Bioconductor (2014): 1-14.

[13] Ladanyi, Marc, et al. “The TCGA malignant pleural mesothelioma (MPM) project: VISTA expression and delineation of a novel clinical-molecular subtype of MPM.” (2018): 8516-8516.

[14] Gene Ontology Consortium. “The Gene Ontology (GO) database and informatics resource.” Nucleic acids research 32.suppl_1 (2004): D258-D261.

[15] Schubert, Michael, et al. “Gene networks in cancer are biased by aneuploidies and sample impurities.” BioRxiv (2019): 752816.