ActivePathways is a method for analysing multiple omics datasets in the context of molecular pathways, biological processes and other types of gene sets. The package uses p-value merging to combine gene- or protein-level signals, followed by ranked hypergeometric tests to determine enriched pathways and processes. This approach allows researchers to interpret a series of omics datasets in the context of known biology and gene function, and discover associations that are only apparent when several datasets are combined.
The package is part of the following publication:
Integrative Pathway Enrichment Analysis of Multivariate Omics Data. Paczkowska M, Barenboim J, Sintupisut N, Fox NS, Zhu H, Abd-Rabbo D, Mee MW, Boutros PC, PCAWG Drivers and Functional Interpretation Working Group; Reimand J, PCAWG Consortium. Nature Communications (2020) https://doi.org/10.1038/s41467-019-13983-9.
From a matrix of p-values, ActivePathways
creates a ranked gene list where genes are prioritised based on their combined significance of in the series of omics datasets provided in the input matrix. The ranked gene list includes the most significant genes first. ActivePathways
then performs a ranked hypergeometric test to determine if a pathway (i.e., a gene set with a common functional annotation) is enriched in the ranked gene list, by performing a series of hypergeometric tests (also known as Fisher’s exact tests). In each such test, a larger set of genes from the top of the ranked gene list is considered. At the end of the series, the ranked hypergeometric test returns the top most significant p-value from the series, corresponding to the point in the ranked gene list where the pathway enrichment reached the greatest significance of enrichment. This approach is useful when the genes in our ranked gene list have varying signals of biological importance in the input omics datasets, as the test identifies the top subset of genes that are the most relevant to the enrichment of the pathway.
A basic example of using ActivePathways is shown below.
We will analyse cancer driver gene predictions for a collection of cancer genomes. Each dataset (i.e., column in the matrix) contains a statistical significance score (P-value) where genes with small P-values are considered stronger candidates of cancer drivers based on the distribution of mutations in the genes. For each gene (i.e., row in the matrix), we have several predictions representing genomic elements of the gene, such as coding sequences (CDS), untranslated regions (UTR), and core promoters (promCore).
To analyse these driver genes using existing knowledge of gene function, we will use gene sets corresponding to known molecular pathways from the Reactome database. These gene sets are commonly distributed in text files in the GMT format (Gene Matrix Transposed) file.
Let’s start by reading the data from the files embedded in the R package. For the p-value matrix, ActivePathways
expects an object of the matrix class so the table has to be cast to the correct class after reading the file.
read.table(
scores <-system.file('extdata', 'Adenocarcinoma_scores_subset.tsv', package = 'ActivePathways'),
header = TRUE, sep = '\t', row.names = 'Gene')
as.matrix(scores)
scores <- scores
## X3UTR X5UTR CDS promCore
## A2M NA 3.339676e-01 9.051708e-01 4.499201e-01
## AAAS NA 4.250601e-01 7.047723e-01 7.257641e-01
## ABAT 0.9664125635 4.202735e-02 7.600985e-01 1.903789e-01
## ABCC1 0.9383431571 4.599443e-01 2.599319e-01 2.980455e-01
## ABCC5 0.8021028187 4.478902e-01 4.788846e-01 8.447995e-01
## ABCC8 NA 4.699356e-01 6.890250e-01 9.226617e-01
## ABHD6 0.4019418029 4.488881e-01 7.587176e-01 3.062514e-01
## ABI1 0.2894847305 1.977600e-01 4.815032e-01 3.486056e-01
## [ reached getOption("max.print") -- omitted 2406 rows ]
ActivePathways
does not allow missing (NA) values in the matrix of P-values and these need to be removed. One conservative option is to re-assign all missing values as ones, indicating our confidence that the missing values are not indicative of cancer drivers. Alternatively, one may consider removing genes with NA values.
is.na(scores)] <- 1 scores[
The basic use of ActivePathways
requires only two input parameters, the matrix of P-values with genes in rows and datasets in columns, as prepared above, and the path to the GMT file in the file system. Here we use a GMT file provided with the package. Importantly, the gene IDs (symbols, accession numbers, etc) in the P-value matrix need to match those in the GMT file.
library(ActivePathways)
system.file('extdata', 'hsapiens_REAC_subset.gmt', package = 'ActivePathways')
gmt.file <-ActivePathways(scores, gmt.file)
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 4.491268e-05 358
## 2: REAC:422475 Axon guidance 2.028966e-02 555
## 3: REAC:177929 Signaling by EGFR 6.245734e-04 366
## overlap evidence
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
## 3: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## Genes_X3UTR Genes_X5UTR
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## 3: NA NA
## Genes_CDS
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## 2: NA
## 3: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## Genes_promCore
## 1: NA
## 2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
## 3: NA
## [ reached getOption("max.print") -- omitted 17 rows ]
A pathway is considered to be significantly enriched if it has adjusted.p.val <= significant
. The parameter significant
represents the maximum adjusted P-value for a resulting pathway to be considered statistically significant. Only the significant pathways are returned. P-values from pathway enrichment analysis are adjusted for multiple testing correction to provide a more conservative analysis (see below).
nrow(ActivePathways(scores, gmt.file, significant = 0.05))
## [1] 20
nrow(ActivePathways(scores, gmt.file, significant = 0.1))
## [1] 23
In the most common use case, a GMT file is downloaded from a database and provided directly to ActivePathways
as a location in the file system. In some cases, it may be useful to load a GMT file separately for preprocessing. The ActivePathways package includes an interface for working with GMT objects. The GMT object can be read from a file using the read.GMT
function. The GMT is structured as a list of terms (e.g., molecular pathways, biological processes, etc.). In the GMT object, each term is a list containing an id, a name, and the list of genes associated with this term.
read.GMT(gmt.file)
gmt <-names(gmt[[1]])
## [1] "id" "name" "genes"
# Pretty-print the GMT
1:3] gmt[
## REAC:1630316 - Glycosaminoglycan metabolism
## ST3GAL1, CHP1, B4GALT5, ST3GAL4, SLC9A1, CHST11, B3GAT3, SLC26A1, ARSB, ABCC5, LUM, PAPSS1, SDC4, NAGLU, AC022400.6, HYAL1, NDST3, SLC35B2, B3GAT1, CHST2, DCN, CEMIP, IDS, IDUA, HS3ST3B1, B3GNT7, CHST12, CHST14, BCAN, B4GALT3, HS3ST1, B4GALT1, CSPG4, CHPF2, HEXB, UST, XYLT1, NDST2, AL050331.3, NAT6, CHSY3, NCAN, FMOD, B3GNT2, HPSE, ACAN, LYVE1, GPC2, GPC1, B4GALT2, SLC35B3, KERA, GPC4, GUSB, HYAL3, HS6ST3, HAS3, CHPF, OMD, SDC2, B3GNT3, CSGALNACT2, CSPG5, BGN, CHST6, HS6ST2, SDC3, HS2ST1, CD44, AGRN, B3GALT6, PRELP, GLB1, GPC6, B3GNT4, PAPSS2, HS3ST4, CHST9, GNS, CHST1, CSGALNACT1, HS6ST1, CHST7, AL590542.1, HYAL2, HAS1, CHSY1, OGN, B4GAT1, B4GALT4, DSEL, EXT1, SDC1, SLC35D2, EXT2, ST3GAL2, ST3GAL6, CHST3, CHST5, HSPG2, SLC26A2, HEXA, XYLT2, HS3ST6, HS3ST5, GLCE, AC244197.3, B4GALT7, HAS2, NDST4, CHST15, B4GALT6, B3GAT2, STAB2, HMMR, ST3GAL3, VCAN, NDST1, HS3ST2, GPC5, HS3ST3A1, HPSE2, GLB1L, SGSH, GPC3, CHST13, DSE
##
## REAC:5633007 - Regulation of TP53 Activity
## PPP1R13L, RFC3, SGK1, HDAC1, SETD9, RPA3, CSNK2A2, EHMT1, PDPK1, CCNG1, TAF6, ING5, TP53INP1, BRCA1, TAF1, TAF9, EHMT2, RBBP8, CHD4, USP2, BRPF1, BANP, PPP2R5C, SSRP1, TAF4, TMEM55B, PPP1R13B, TAF10, PRR5, ATM, ZNF385A, MRE11, PIP4K2C, SMYD2, BARD1, MBD3, TP53BP2, AURKA, RPA1, KAT6A, CHD3, ATR, EXO1, PIN1, RMI2, PRKAB2, AKT2, PPP2CA, TAF13, RAD9B, PPP2CB, MDM4, POU4F1, TAF4B, MDM2, BRD1, CDK2, CSNK2B, PIP4K2A, RAD50, PML, AKT3, RAD1, USP7, PRDM1, RFC4, MTOR, POU4F2, RFC2, KMT5A, DNA2, NUAK1, PIP4K2B, MAPKAP1, PPP2R1A, RBBP7, MTA2, TP53RK, TAF1L, GATAD2B, JMY, TAF12, CDK5R1, RICTOR, PHF20, AC134772.2, STK11, TAF3, BLM, ATRIP, TAF7L, TP73, MAP2K6, RPA2, PRKAG1, TOPBP1, PRKAB1, TOP3A, CCNA1, RBBP4, PPP2R1B, DAXX, TTC5, CHEK2, CDK5, RMI1, TAF2, NOC2L, TP63, CSNK2A1, UBB, PRKAG2, ING2, RNF34, MLST8, HIPK2, TBP, L3MBTL1, EP300, MAPK11, BRIP1, PRMT5, PRKAG3, RFC5, NBN, TPX2, HDAC2, CDK1, MAPKAPK5, CHEK1, TAF5, BRPF3, RPS27A, AKT1, GATAD2A, RHNO1, CCNA2, HUS1, MEAF6, PRKAA1, CDKN2A, RAD17, KAT5, SUPT16H, TAF7, PLK3, UBA52, WRN, MAPK14, TAF9B, UBC, HIPK1, TP53, DYRK2, RFFL, TAF11, BRD7, RAD9A, AURKB, PRKAA2
##
## REAC:5358346 - Hedgehog ligand biogenesis
## PSMB11, PSMD8, PSMB6, PSMB8, PSMD13, PSMA2, PSMB3, PSMB7, PSME2, PSMD1, PSMA8, PSMA1, PSMD2, PSMD14, PSMC1, IHH, SEL1L, SCUBE2, PSMB10, PSMD10, PSMA5, PSMD3, PSMC2, PSMC6, PSME1, PSME4, PSMC4, AC010132.3, PSMD5, UBB, GPC5, PSMA7, PSMB2, PSMD9, SEM1, PSMA3, DISP2, PSMB9, PSMD6, OS9, PSMC3, PSMA6, RPS27A, ADAM17, PSMB1, DHH, PSMF1, HHAT, UBA52, UBC, DERL2, PSMD4, VCP, ERLEC1, PSME3, PSMB4, P4HB, PSMA4, PSMD7, PSMD11, PSMC5, NOTUM, PSMD12, PSMB5, SYVN1, SHH
# Look at the genes annotated to the first term
1]]$genes gmt[[
## [1] "ST3GAL1" "CHP1" "B4GALT5" "ST3GAL4" "SLC9A1"
## [6] "CHST11" "B3GAT3" "SLC26A1" "ARSB" "ABCC5"
## [11] "LUM" "PAPSS1" "SDC4" "NAGLU" "AC022400.6"
## [16] "HYAL1" "NDST3" "SLC35B2" "B3GAT1" "CHST2"
## [21] "DCN" "CEMIP" "IDS" "IDUA" "HS3ST3B1"
## [26] "B3GNT7" "CHST12" "CHST14" "BCAN" "B4GALT3"
## [31] "HS3ST1" "B4GALT1" "CSPG4" "CHPF2" "HEXB"
## [ reached getOption("max.print") -- omitted 92 entries ]
# Get the full name of Reactome pathway 2424491
$`REAC:2424491`$name gmt
## [1] "DAP12 signaling"
The most common processing step for GMT files is the removal of gene sets that are too large or small. Here we remove pathways (gene sets) that have less than 10 or more than 500 annotated genes.
Filter(function(term) length(term$genes) >= 10, gmt)
gmt <- Filter(function(term) length(term$genes) <= 500, gmt) gmt <-
The new GMT object can now be used for analysis with ActivePathways
ActivePathways(scores, gmt)
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 0.0002904356 358
## 2: REAC:177929 Signaling by EGFR 0.0032110362 366
## 3: REAC:2559583 Cellular Senescence 0.0001568181 196
## overlap evidence Genes_X3UTR
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS NA
## 2: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS NA
## 3: TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,... CDS NA
## Genes_X5UTR Genes_CDS Genes_promCore
## 1: NA TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 2: NA TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 3: NA TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,... NA
## [ reached getOption("max.print") -- omitted 14 rows ]
This filtering can also be done automatically using the geneset.filter
option to the ActivePathways
function. By default, ActivePathways
removes gene sets with less than five or more than a thousand genes from the GMT prior to analysis. In general, gene sets that are too large are likely not specific and less useful in interpreting the data and may also cause statistical inflation of enrichment scores in the analysis. Gene sets that are too small are likely too specific for most analyses and also make the multiple testing corrections more stringent, potentially causing deflation of results. A stricter filter can be applied by running ActivePathways
with the parameter geneset.filter = c(10, 500)
.
ActivePathways(scores, gmt.file, geneset.filter = c(10, 500))
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 3.660807e-05 358
## 2: REAC:177929 Signaling by EGFR 5.064108e-04 366
## 3: REAC:2559583 Cellular Senescence 5.366637e-05 196
## overlap evidence Genes_X3UTR
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS NA
## 2: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS NA
## 3: TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,... CDS NA
## Genes_X5UTR Genes_CDS Genes_promCore
## 1: NA TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 2: NA TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 3: NA TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,... NA
## [ reached getOption("max.print") -- omitted 14 rows ]
This GMT object can be saved to a file
write.GMT(gmt, 'hsapiens_REAC_subset_filtered.gmt')
To perform pathway enrichment analysis, a global set of genes needs to be defined as a statistical background set. This represents the universe of all genes in the organism that the analysis can potentially consider. By default, this background gene set includes every gene that is found in the GMT file in any of the biological processes and pathways. Another option is to provide the full set of all protein-coding genes, however this may cause statistical inflation of the results since a sizable fraction of all protein-coding genes still lack any known function.
Sometimes the statistical background set needs to be considerably narrower than the GMT file or the full set of genes. Genes need to be excluded from the background if the analysis or experiment specifically excluded these genes initially. An example would be a targeted screen or sequencing panel that only considered a specific class of genes or proteins (e.g., kinases). In analysing such data, all non-kinase genes need to be excluded from the background set to avoid statistical inflation of all gene sets related to kinase signalling, phosphorylation and similar functions.
To alter the background gene set in ActivePathways
, one can provide a character vector of gene names that make up the statistical background set. In this example, we start from the original list of genes in the entire GMT and remove one gene, the tumor suppressor TP53. The new background set is then used for the ActivePathways analysis.
makeBackground(gmt)
background <- background[background != 'TP53']
background <-ActivePathways(scores, gmt.file, background = background)
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 2.096408e-03 357
## 2: REAC:422475 Axon guidance 8.689293e-03 527
## 3: REAC:177929 Signaling by EGFR 2.003969e-02 365
## overlap evidence
## 1: PIK3CA,KRAS,PTEN,BRAF,NRAS,B2M,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR
## 3: PIK3CA,KRAS,PTEN,BRAF,NRAS,CALM2,... CDS
## Genes_X3UTR Genes_X5UTR
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## 3: NA NA
## Genes_CDS
## 1: PTEN,KRAS,PIK3CA,BRAF,NRAS,CDKN1A,...
## 2: NA
## 3: PTEN,KRAS,PIK3CA,BRAF,NRAS,CDKN1A,...
## Genes_promCore
## 1: NA
## 2: NA
## 3: NA
## [ reached getOption("max.print") -- omitted 15 rows ]
Note that only the genes found in the background set are used for testing enrichment. Any genes in the input data that are not in the background set will be automatically removed by ActivePathways
.
A key feature of ActivePathways
is the integration of multiple complementary omics datasets to prioritise genes for the pathway analysis. In this approach, genes with significant scores in multiple datasets will get the highest priority, and certain genes with weak scores in multiple datasets may be ranked higher, highlighting functions that would be missed when only single datasets were analysed. ActivePathways
accomplishes this by merging the series of p-values in the columns of the scores matrix for each gene into a single combined P-value. The two main methods to merge P-values are Brown’s method (the default) and Fisher’s method. The Brown’s method is more accurate and/or conservative in the case when the input datasets show some large-scale similarities (i.e., covariation), since the Brown’s method will take that into account when prioritising genes across similar datasets. The Brown’s method is recommended for most cases since omics datasets are often not statistically independent of each other and genes with high scores in one dataset are more likely to have high scores in another dataset just by chance.
The following example compares the merged P-values of the first few genes between the two methods. The genes with the top scores are the same while the P-values of the second method are somewhat more conservative.
sort(merge_p_values(scores, 'Fisher'))[1:5]
## TP53 VHL PIK3CA KRAS PTEN
## 2.275933e-32 4.677780e-31 9.878802e-30 5.169163e-29 5.813021e-29
sort(merge_p_values(scores, 'Brown'))[1:5]
## TP53 VHL PIK3CA KRAS PTEN
## 2.850936e-21 1.933490e-20 1.334809e-19 3.808817e-19 4.102937e-19
This function can be used to combine some of the data before the analysis for any follow-up analysis or visualisation. For example, we can merge the columns X5UTR
, X3UTR
, and promCore
into a single non_coding
column (these correspond to predictions of driver mutations in 5’UTRs, 3’UTRs and core promoters of genes, respectively). This will consider the three non-coding regions as a single column, rather than giving them all equal weight to the CDS
column.
cbind(scores[, 'CDS'], merge_p_values(scores[, c('X3UTR', 'X5UTR', 'promCore')], 'Brown'))
scores2 <-colnames(scores2) <- c('CDS', 'non_coding')
c(2179, 1760),] scores[
## X3UTR X5UTR CDS promCore
## TP53 0.8331532 0.005613463 1.842404e-33 0.0254239
## PTEN 0.1627596 0.310132725 2.180301e-32 0.6867016
c(2179, 1760),] scores2[
## CDS non_coding
## TP53 1.842404e-33 0.0189964
## PTEN 2.180301e-32 0.3437851
ActivePathways(scores, gmt.file)
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 4.491268e-05 358
## 2: REAC:422475 Axon guidance 2.028966e-02 555
## 3: REAC:177929 Signaling by EGFR 6.245734e-04 366
## overlap evidence
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
## 3: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## Genes_X3UTR Genes_X5UTR
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## 3: NA NA
## Genes_CDS
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## 2: NA
## 3: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## Genes_promCore
## 1: NA
## 2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
## 3: NA
## [ reached getOption("max.print") -- omitted 17 rows ]
ActivePathways(scores2, gmt.file)
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 0.0003694968 358
## 2: REAC:422475 Axon guidance 0.0051873493 555
## 3: REAC:177929 Signaling by EGFR 0.0018584534 366
## 4: REAC:2559583 Cellular Senescence 0.0005822291 196
## overlap evidence
## 1: TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,RPS6KA3,CALM2,... combined
## 3: TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,... CDS
## 4: TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,... CDS
## Genes_CDS Genes_non_coding
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 2: NA NA
## 3: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 4: TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,... NA
## [ reached getOption("max.print") -- omitted 26 rows ]
To perform pathway enrichment of the ranked gene list of merged P-values, ActivePathways
defines a P-value cutoff to filter genes that have little or no significance in the series of omics datasets. This threshold represents the maximum p-value for a gene to be considered of interest in our analysis. The threshold is 0.1
by default, but can be changed using the cutoff
option. The default option considers raw P-values that have not been adjusted for multiple-testing correction. Therefore the default option provides a relatively lenient approach to filtering the input data. This is useful for finding additional genes with weaker signals that associate with well-annotated and strongly significant genes in the pathway and functional context.
nrow(ActivePathways(scores, gmt.file))
## [1] 20
nrow(ActivePathways(scores, gmt.file, cutoff = 0.01))
## [1] 18
Multiple testing correction is essential in the analysis of omics data since each analysis routinely considers thousands of hypotheses and apparently significant P-values will occur by chance alone. ActivePathways
uses multiple testing correction at the level of pathways as P-values from the ranked hypergeometric test are adjusted for multiple testing (note that the ranked gene list provided to the ranked hypergeometric test remain unadjusted for multiple testing by design).
The package uses the p.adjust
function of base R to run multiple testing corrections and all methods in this function are available. By default, ‘holm’ correction is used. The option correction.method = 'none'
can be used to override P-value adjustment (not recommended in most cases).
nrow(ActivePathways(scores, gmt.file))
## [1] 20
nrow(ActivePathways(scores, gmt.file, correction.method = 'none'))
## [1] 84
Consider the results object from the basic use case of ActivePathways
ActivePathways(scores, gmt.file)
res <- res
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 4.491268e-05 358
## 2: REAC:422475 Axon guidance 2.028966e-02 555
## 3: REAC:177929 Signaling by EGFR 6.245734e-04 366
## overlap evidence
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
## 3: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## Genes_X3UTR Genes_X5UTR
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## 3: NA NA
## Genes_CDS
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## 2: NA
## 3: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## Genes_promCore
## 1: NA
## 2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
## 3: NA
## [ reached getOption("max.print") -- omitted 17 rows ]
The columns term.id
, term.name
, and term.size
give information about each pathway detected in the enrichment analysis. The adjusted.p.val
column with the adjusted P-value indicates the confidence that the pathway is enriched after multiple testing correction.
The overlap
column provides the set of genes from the integrated gene list that occur in the given enriched gene set (i.e., molecular pathway or biological process). These genes were quantified across multiple input omics datasets and prioritized based on their joint significance in the input data. Note that the genes with the strongest scores across the multiple datasets are listed first.
$overlap[1:3] res
## [[1]]
## [1] "TP53" "PIK3CA" "KRAS" "PTEN" "BRAF" "NRAS" "B2M" "CALM2"
## [9] "CDKN1A" "CDKN1B"
##
## [[2]]
## [1] "PIK3CA" "KRAS" "BRAF" "NRAS" "CALM2" "RPS6KA3" "ACTB"
## [8] "EFNA1" "SCN2B" "EPHA2" "GAP43" "COL4A1" "RASAL2" "CLTC"
## [15] "IQGAP1" "NF1" "FGF9" "ADAM10" "PTPRC" "ITGA10" "PDGFB"
## [22] "COL4A2" "RGMB" "RASA1" "FGF6" "CALM1" "PSMB7" "PPP2R5D"
## [29] "PPP2R1A"
##
## [[3]]
## [1] "TP53" "PIK3CA" "KRAS" "PTEN" "BRAF" "NRAS" "CALM2" "CDKN1A"
## [9] "CDKN1B"
This column is useful in the further analysis on the data, allowing the researcher to go move from the space of enriched pathways back to space of individual genes and proteins involved in the pathways and the input omics datasets.
The evidence
column provides insights to which of the input omics datasets (i.e., columns in the scores matrix) contributed to the discovery of this pathway or process in the integrated enrichment analysis. To achieve this level of detail, ActivePathways
also analyses the gene lists ranked by the individual columns of the input matrix to detect enriched pathways. The evidence
column lists the name of a given column of the input matrix if the given pathway is detected both in the integrated analysis and the analysis of the individual column. For example, in this analysis the majority of the detected pathways have only ‘CDS’ as their evidence, since these pathways were found to be enriched in data fusion through P-value merging and also by analysing the gene scores in the column CDS
(for reference, CDS corresponds to protein-coding sequence where the majority of known driver mutations have been found). As a counter-example, the record for the pathway REAC:422475
in our results lists as evidence list('X3UTR', 'promCore')
, meaning that the pathway was found to be enriched when considering either the X3UTR
column, the promCore
column, or the combined omics datasets.
$term.id == "REAC:422475","evidence"] res[res
## evidence
## 1: X3UTR,promCore
Finally, if a pathway is found to be enriched only with the combined data and not in any individual column, ‘combined’ will be listed as the evidence. This subset of results may be particularly interesting since it highlights complementary aspects of the analysis that would remain hidden in the analysis of any input omics dataset separately.
The following columns named as Genes_{column}
help interpret how each pathway was detected in the multi-omics data integration, as listed in the column evidence
. These columns show the genes present in the pathway and any of the the input omics datasets. If the given pathway was not identified using the scores of the given column of the input scores matrix, an NA
value is shown. Again, the genes are ranked in by the significance of their scores in the input data, to facilitate identification of the most relevant genes in the analysis.
The results are returned as a data.table
object due to some additional data structures needed to store lists of gene IDs and supporting evidence. The usual R functions write.table
and write.csv
will struggle with exporting the data unless the gene and evidence lists are manually transformed as strings. Fortunately, the fwrite
function of data.table
can be used to write the file directly and the ActivePathways package includes the function export_as_CSV
as a shortcut that uses the vertical bar symbol to concatenate gene lists.
paste('ActivePathways_results.csv', sep = '/')
result.file <-export_as_CSV (res, result.file) # remove comment to run
read.csv(result.file, stringsAsFactors = F)[1:3,]
The fwrite
can be called directly for customised output.
paste('ActivePathways_results2.txt', sep = '/')
result.file <-::fwrite(res, result.file, sep = '\t', sep2 = c('', ',', ''))
data.tablecat(paste(readLines(result.file)[1:2], collapse = '\n'))
The Cytoscape software and the EnrichmentMap app provide powerful tools to visualise the enriched pathways from ActivePathways
as a network (i.e., an Enrichment Map). To facilitate this visualisation step, ActivePathways
provides the files needed for building enrichment maps. To create these files, a file prefix must be supplied to ActivePathways
using the argument cytoscape.file.tag
. The prefix can be a path to an existing writable directory.
ActivePathways(scores, gmt.file, cytoscape.file.tag = "enrichmentMap__") res <-
Four files are written using the prefix:
enrichmentMap__pathways.txt
contains the table of significant terms (i.e. molecular pathways, biological processes, other gene sets) and the associated adjusted P-values. Note that only terms with adjusted.p.val <= significant
are written.
enrichmentMap__subgroups.txt
contains a matrix indicating the columns of the input matrix of P-values that contributed to the discovery of the corresponding pathways. These values correspond to the evidence
evaluation of input omics datasets discussed above, where a value of one indicates that the pathway was also detectable using a specific input omics dataset. A value of zero indicates otherwise. This file will be not generated if a single-column matrix of scores corresponding to just one omics dataset is provided to ActivePathways
.
enrichmentMap__pathways.gmt
contains a shortened version of the supplied GMT file which consists of only the significant pathways detected by ActivePathways
.
enrichmentMap__legend.pdf
is a pdf file that displays a color legend of different omics datasets visualised in the enrichment map that can be used as a reference to the generated enrichment map.
The following sections will discuss how to create a pathway enrichment map using the results from ActivePathways
. The datasets analysed earlier in the vignette will be used ActivePathways vignette. To follow the steps, save the required files from ActivePathways
in an accessible location.
ActivePathways
writes four files that are used to build enrichment maps in Cytoscape.
c(system.file('extdata', 'enrichmentMap__pathways.txt', package='ActivePathways'),
files <-system.file('extdata', 'enrichmentMap__subgroups.txt', package='ActivePathways'),
system.file('extdata', 'enrichmentMap__pathways.gmt', package='ActivePathways'),
system.file('extdata', 'enrichmentMap__legend.pdf', package='ActivePathways'))
The following commands will perform the basic analysis again and write output files required for generating enrichment maps into the current working directory of the R session. All file names use the prefix enrichmentMap__
. The generated files are also available in the ActivePathways
R package as shown above.
system.file('extdata', 'hsapiens_REAC_subset.gmt', package = 'ActivePathways')
gmt.file <- system.file('extdata', 'Adenocarcinoma_scores_subset.tsv', package = 'ActivePathways')
scores.file <-
read.table(scores.file, header = TRUE, sep = '\t', row.names = 'Gene')
scores <- as.matrix(scores)
scores <-is.na(scores)] <- 1
scores[
ActivePathways(scores, gmt.file, cytoscape.file.tag = "enrichmentMap__") res <-
The four files written are:
enrichmentMap__pathways.txt
, a table of significant pathways and the associated adjusted P-values.
enrichmentMap__subgroups.txt
, a table of pathways and corresponding omics datasets supporting the enrichment of those pathways. This corresponds to the evidence
column of the ActivePathways
result object discussed above.
enrichmentMap__pathways.gmt
, a shortened version of the supplied GMT file which consists of only the significant pathways detected by ActivePathways
.
enrichmentMap__legend.pdf
, a reference color legend of different omics datasets visualised in the enrichment map.
The following code will examine a few lines of the files generated by ActivePathways
.
cat(paste(readLines(files[1])[1:5], collapse='\n'))
## term.id term.name adjusted.p.val
## REAC:2424491 DAP12 signaling 4.49126833230489e-05
## REAC:422475 Axon guidance 0.0202896586319552
## REAC:177929 Signaling by EGFR 0.000624573372270557
## REAC:2559583 Cellular Senescence 6.59544666946083e-05
cat(paste(readLines(files[2])[1:5], collapse='\n'))
## term.id CDS X3UTR promCore combined instruct
## REAC:2424491 1 0 0 0 piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
## REAC:422475 0 1 1 0 piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
## REAC:177929 1 0 0 0 piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
## REAC:2559583 1 0 0 0 piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
cat(paste(readLines(files[3])[18:19], collapse='\n'))
## REAC:5655253 Signaling by FGFR2 in disease NRAS FGF5 POLR2F GTF2F2 SOS1 FGF16 POLR2D PIK3CA FGF7 NCBP1 HRAS KRAS POLR2G FGF18 FGF4 FGFR2 FGF1 FGF10 POLR2K FGF3 POLR2J GRB2 GAB1 FGF23 POLR2C FGF9 POLR2H FGF8 POLR2A FGF17 FGF20 PIK3R1 GTF2F1 FRS2 POLR2I FGF22 POLR2L PLCG1 NCBP2 FGF2 FGF6 POLR2E POLR2B
## REAC:5673000 RAF activation PPP2CB ARAF BRAP HRAS KRAS PPP2R5D PPP2CA SRC YWHAB PPP2R5A MAP3K11 PPP2R5E BRAF PPP2R5B PHB MAP2K1 NRAS KSR1 RAF1 PPP2R1A PPP2R1B MARK3 MAP2K2 PPP2R5C JAK2
+
Add Data Set from Files in the top left corner of the dialogue.enrichmentMap__pathways.txt
and enrichmentMap__pathways.gmt
in the Enrichments and GMT fields, respectively.To color nodes in the network (i.e., molecular pathways, biological processes) according to the omics datasets supporting the enrichments, the third file enrichmentMap__subgroups.txt
needs to be imported to Cytoscape directly. To import the file, activate the menu option File -> Import -> Table from File and select the file enrichmentMap__subgroups.txt
. In the following dialogue, select To a Network Collection in the dropdown menu Where to Import Table data. Click OK to proceed.
Next, Cytoscape needs to use the imported information to color nodes using a pie chart visualisation. To enable this click the Style tab in the left control panel and select the Image/Chart1 Property in a series of dropdown menus (Properties -> Paint -> Custom Paint 1 -> Image/Chart 1).
The image/Chart 1 property now appears in the Style control panel. Click the triangle on the right, then set the Column to instruct and the Mapping type to Passthrough.
This step colours the nodes corresponding to the enriched pathways according to the supporting omics datasets, based on in the scores matrix initially analysed in ActivePathways
.
To allow better interpretation of the enrichment map, ActivePathways
generates a color legend in the file enrichmentMap__legend.pdf
that shows which colors correspond to which omics datasets.
Note that one of the colors corresponds to a subset of enriched pathways with combined evidence that were only detected through data fusion and P-value merging and not when any of the input datasets were detected separately. This exemplifies the added value of integrative multi-omics pathway enrichment analysis.
For a more diverse range of colors, ActivePathways supports any color palette from RColorBrewer. The color_palette parameter must be provided.
ActivePathways(scores, gmt.file, cytoscape.file.tag = "enrichmentMap__", color_palette = "Pastel1") res <-
Instead, to manually input the color of each dataset the custom_colors parameter must be specified as a vector. This vector should contain the same number of colors as columns in the scores matrix.
ActivePathways(scores, gmt.file, cytoscape.file.tag = "enrichmentMap__", custom_colors = c("violet","green","orange","red")) res <-
To change the color of the combined contribution, a color must be provided to the color_integrated_only parameter.
Tip: if the coloring of nodes did not work in Cytoscape after setting the options in the Style panel, check that the EnhancedGraphics Cytoscape app is installed.
Integrative Pathway Enrichment Analysis of Multivariate Omics Data. Paczkowska M, Barenboim J, Sintupisut N, Fox NS, Zhu H, Abd-Rabbo D, Mee MW, Boutros PC, PCAWG Drivers and Functional Interpretation Working Group; Reimand J, PCAWG Consortium. Nature Communications (2020) https://pubmed.ncbi.nlm.nih.gov/32024846/ https://doi.org/10.1038/s41467-019-13983-9.
Pathway Enrichment Analysis and Visualization of Omics Data Using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Reimand J, Isserlin R, Voisin V, Kucera M, Tannus-Lopes C, Rostamianfar A, Wadi L, Meyer M, Wong J, Xu C, Merico D, Bader GD. Nature Protocols (2019) https://pubmed.ncbi.nlm.nih.gov/30664679/ https://doi.org/10.1038/s41596-018-0103-9.