The MetaIntegrator package comprises several analysis and plot functions to perform integrated multi-cohort analysis of gene expression data (meta-analysis). The advent of the gene expression microarray has allowed for a rapid increase in gene expression studies. Largely due to the MIAME standards for data sharing, many of these studies have been deposited into public repositories such as the NIH Gene Expression Omnibus (GEO) and ArrayExpress. There is now a wealth of publicly available gene expression data for re-analysis.
An obvious next step to increase statistical power in detecting changes in gene expression associated with some condition is to aggregate data from multiple studies. However, inter-study technical and biological differences prevent us from simply pooling results and summarizing our findings. A random-effects model of meta-analysis circumvents these issues by assuming that the results from each study are drawn from a single distribution, and that such inter-study differences are thus a ‘random effect’. Accordingly, the MetaIntegrator package performs a DerSimonian & Laird random-effects meta-analysis for each gene (not probeset), comparing cases and controls across all target studies; it also applies Fisher's sum-of-logs method to the same data, and requires that a gene be significant by both methods. The resulting p-values are false discovery rate (FDR) corrected to q-values, which are used to evaluate the hypothesis that each gene is differentially expressed between cases and controls across all studies included in the analysis.
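As a rough illustration of the second test, Fisher's sum-of-logs method combines per-study p-values into one statistic that follows a chi-squared distribution with \(2k\) degrees of freedom for \(k\) studies. The sketch below uses base R only; the p-values and gene names are made up, and the use of Benjamini-Hochberg for the FDR correction is an assumption, not a statement about the package internals.

```r
# Hypothetical one-sided p-values for one gene, one per study
p <- c(0.01, 0.04, 0.20)

# Fisher's sum-of-logs statistic and its chi-squared p-value (df = 2k)
fisher_stat <- -2 * sum(log(p))
fisher_p <- pchisq(fisher_stat, df = 2 * length(p), lower.tail = FALSE)

# FDR correction across genes (Benjamini-Hochberg assumed here)
gene_p <- c(GeneA = 0.001, GeneB = 0.01, GeneC = 0.03, GeneD = 0.5)
gene_q <- p.adjust(gene_p, method = "BH")
```

A gene would then be kept only if both the random-effects q-value and the Fisher q-value fall below the chosen cutoff.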
The resulting list of genes with significantly different expression between cases and controls can be used for multiple purposes, such as (1) a new diagnostic or prognostic test for the disease of interest, (2) a better understanding of the underlying biology, and (3) identification of therapeutic targets, among other applications. Our lab has already used these methods in a wide variety of diseases, including organ transplant rejection, lung cancer, neurodegenerative disease, and sepsis (Khatri et al., J Exp Med 2013; Chen et al., Cancer Res 2014; Li et al., Acta Neur Comm 2014; Sweeney et al., Sci Trans Med 2015).
The MetaIntegrator Vignette will take the user through the basic steps of the package, including basic multi-cohort analysis, leave-one-out (LOO) analysis (whereby each of the included datasets is left out and multi-cohort analysis is run on the remaining datasets in a round-robin fashion), selection of significant genes, and then analysis of the gene set. The MetaIntegrator package assumes that the user (1) already has their data in hand, and (2) has already decided which datasets to include in the multi-cohort meta-analysis. Our group recommends that some datasets be left out of the analysis, if possible, for independent validation.
Contact
Winston A. Haynes hayneswa@stanford.edu
Links
MetaIntegrator
Installation
install.packages("MetaIntegrator")
The MetaIntegrator package can be used to run a meta-analysis on microarray gene expression data as described in Khatri et al., J Exp Med 2013. Briefly, it computes a Hedges’ g effect size for each gene in each dataset, defined as:
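In the standard form given by Borenstein et al. (the notation is assumed here, with \(\bar{X}\), \(S\) and \(n\) denoting the per-group mean, standard deviation and sample count, and \(J\) the usual small-sample correction factor):

\[
g = J \cdot \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\dfrac{(n_1 - 1) S_1^2 + (n_0 - 1) S_0^2}{n_1 + n_0 - 2}}},
\qquad
J = 1 - \frac{3}{4 (n_1 + n_0 - 2) - 1}
\]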
where \(1\) and \(0\) represent the groups of cases and controls for a given condition, respectively. For each gene, the summary effect size \(g_s\) is computed using a random-effects model as:
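In the usual inverse-variance form (notation assumed, with \(g_i\) the effect size of the gene in dataset \(i\)):

\[
g_s = \frac{\sum_{i} W_i \, g_i}{\sum_{i} W_i}
\]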
where \(W_i\) is a weight equal to \(1/(V_i+T^2)\), where \(V_i\) is the variance of that gene within a given dataset \(i\), and \(T^{2}\) is the inter-dataset variation (for details see: Borenstein M et al Introduction to Meta-analysis, Wiley 2009). For each gene, the False discovery rate (FDR) is computed and a final set of genes is selected based on FDR thresholding.
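The weighting scheme above can be sketched in a few lines of base R. This is a minimal illustration of a DerSimonian & Laird estimate for a single gene, not the package's own implementation; the effect sizes and variances are hypothetical.

```r
# Hypothetical per-dataset effect sizes (g) and variances (V) for one gene
g <- c(0.9, 1.2, 0.7)
V <- c(0.10, 0.08, 0.12)

# Fixed-effect weights and Cochran's Q
w <- 1 / V
g_fixed <- sum(w * g) / sum(w)
Q <- sum(w * (g - g_fixed)^2)
df <- length(g) - 1

# DerSimonian & Laird estimate of the inter-dataset variance T^2 (truncated at 0)
C <- sum(w) - sum(w^2) / sum(w)
T2 <- max(0, (Q - df) / C)

# Random-effects weights W_i = 1 / (V_i + T^2) and summary effect size g_s
W <- 1 / (V + T2)
g_s <- sum(W * g) / sum(W)
se_s <- sqrt(1 / sum(W))

# Two-sided p-value for the summary effect size
p_val <- 2 * pnorm(-abs(g_s / se_s))
```

When \(Q\) does not exceed its degrees of freedom, \(T^2\) is set to 0 and the random-effects summary coincides with the fixed-effect one.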
For a set of signature genes, a signature score can be computed as:
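Written out as the difference of two geometric means (a reconstruction from the description of calculateScore(); the notation \(|pos|\) and \(|neg|\) for the sizes of the two gene sets is assumed):

\[
S_i = \left( \prod_{gene \in pos} x_i(gene) \right)^{1/|pos|} - \left( \prod_{gene \in neg} x_i(gene) \right)^{1/|neg|}
\]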
where \(pos\) and \(neg\) are the sets of positive and negative genes, respectively, and \(x_i(gene)\) is the expression of any particular gene in sample \(i\) (a positive score indicates an association with cases and a negative score with controls). This score \(S\) is then converted into a z-score \(Z_s\) as:
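Assuming the usual standardization within a dataset (mean 0, standard deviation 1), this reads:

\[
Z_s = \frac{S - \operatorname{mean}(S)}{\operatorname{sd}(S)}
\]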
1. Data collection, curation and annotation, select datasets for discovery and validation: Helper Functions
2. Meta-analysis on discovery datasets: Meta-Analysis, Filtering, Validation, Visualization, Search, Helper Functions
3. Validation on independent validation datasets: Visualization, Validation, Helper Functions
Generate a metaObject as input for analysis
Create a datasetObject for each gene expression GEO dataset
Read in the expression (datasetObject$expr) and phenotype (datasetObject$pheno) information using the read.table() function in R.
Create the datasetObject$class vector using the phenotype information (0 is control, 1 is case).
Map probes to genes in the datasetObject$keys vector. Mappings are usually stored in GPL files (for format details see GEO Platform guidelines).
Set a formatted name for the dataset in datasetObject$formattedName.
The final datasetObject should have the structure:
datasetObject: named list
$class: named vector. Names are sample names. Values are 0 if control, 1 if case.
$expr: matrix. Row names are probe names. Column names are sample names. Values are expression values
$keys: named vector. Names are probe names. Values are gene names.
$pheno: data frame. Row names are the sample names. Column names are the annotation information (none required).
$formattedName: string. A formatted name for this dataset which will be used in plots.
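A toy object with this structure can be assembled by hand, which may help when checking imported data against the layout above. The sketch below uses base R only; all names and values (probe IDs, gene names, the group column) are hypothetical.

```r
set.seed(1)
sampleNames <- paste("Sample", 1:6)
probeNames <- c("probe_1", "probe_2", "probe_3")

# Expression matrix: probes in rows, samples in columns (log2 scale assumed)
expr <- matrix(rnorm(18, mean = 8), nrow = 3,
               dimnames = list(probeNames, sampleNames))

# Phenotype table; the 'group' column is a hypothetical annotation
pheno <- data.frame(group = rep(c("control", "case"), each = 3),
                    row.names = sampleNames, stringsAsFactors = FALSE)

toyDataset <- list(
  class = setNames(ifelse(pheno$group == "case", 1, 0), sampleNames),
  expr = expr,
  keys = setNames(c("GeneA", "GeneB", "GeneC"), probeNames),
  pheno = pheno,
  formattedName = "Toy Dataset"
)
str(toyDataset, max.level = 1)
```

The sample names tie $class, $expr and $pheno together, and the probe names tie $expr to $keys, so keeping them identical and in the same order avoids mismatches later.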
Example object structure for one datasetObject from tinyMetaObject:
dataObj1 <- tinyMetaObject$originalData$PBMC.Study.1
str(dataObj1, max.level = 1)
## List of 5
## $ class : Named num [1:115] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr [1:115] "Sample 1" "Sample 2" "Sample 3" "Sample 4" ...
## $ expr : num [1:190, 1:115] 10.3 13.6 13.2 13.8 10.2 ...
## ..- attr(*, "dimnames")=List of 2
## $ keys : Named chr [1:190] "Gene41" "Gene58" "Gene112" "Gene59" ...
## ..- attr(*, "names")= chr [1:190] "1294_at" "200036_s_at" "200080_s_at" "200089_s_at" ...
## $ pheno :'data.frame': 115 obs. of 8 variables:
## $ formattedName: chr "PBMC Study 1"
Note: Gene expression values in dataObj1$expr should be in \(log_2\) scale, and the expression data might need normalization. Also, negative gene expression values are problematic for the geometric mean calculation.
This can be checked by generating a boxplot of the dataset expression values:
boxplot(dataObj1$expr[,1:15]) # -> shows samples 1-15, to see all run: boxplot(dataObj1$expr)
Here, normalization is not necessary because the medians of the samples are similar, and the data is already in log scale because the expression values are between 0 and 15. (If negative expression values are observed, e.g. the lowest expression value is -1, we recommend shifting all expression values of the dataset above 0 by adding a constant, here more than 1, to each gene expression measurement in all samples.)
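The same checks can be done numerically. The following base R sketch uses a hypothetical expression matrix and shifts it so that the smallest value becomes 1, which mirrors the shift described for calculateScore(); the exact constant to add is a choice, not a requirement.

```r
# Hypothetical expression matrix with one negative value
expr <- matrix(c(-1, 2.5, 7, 0.3, 4, 9), nrow = 2)

# Compare per-sample medians as a quick normalization check
sampleMedians <- apply(expr, 2, median, na.rm = TRUE)

# Shift so that the smallest value becomes 1 (only if negatives are present)
minValue <- min(expr, na.rm = TRUE)
if (minValue < 0) {
  expr <- expr - minValue + 1
}
```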
Check the datasetObject using checkDataObject()
The function checks for errors within the datasetObject. It returns TRUE if the object passed error checking, FALSE otherwise, and it prints warning messages explaining failed checks.
checkDataObject(dataObj1, "Dataset")
## [1] TRUE
Generate a metaObject from dataset objects
Generate a named list of dataset objects that have been imported for analysis:
# use the additional 2 example datasets from tinyMetaObject
dataObj2 = tinyMetaObject$originalData$Whole.Blood.Study.1
dataObj3 = tinyMetaObject$originalData$Whole.Blood.Study.2
# and create the metaObject
discovery_datasets <- list(dataObj1, dataObj2, dataObj3)
names(discovery_datasets) = c(dataObj1$formattedName, dataObj2$formattedName, dataObj3$formattedName)
exampleMetaObj=list()
exampleMetaObj$originalData <- discovery_datasets
IMPORTANT: Keep at least one dataset out of the discovery datasets to use it for validation!
The final metaObject should have the structure:
metaObject: named list
$originalData: named list [1]
$datasetName: Dataset object. 'datasetName' will be the (unquoted) name of that dataset.[0,n]
Check the metaObject before MetaAnalysis using checkDataObject()
The function checks for errors within the metaObject.
Example of how to check your metaObject:
checkDataObject(exampleMetaObj, "Meta", "Pre-Analysis")
## [1] TRUE
runMetaAnalysis()
Once the data is written to metaObject$originalData, the meta-analysis can be started with:
metaObject <- runMetaAnalysis(metaObject)
The meta-analysis results are written into metaObject$metaAnalysis and the results of the leave-one-out analysis into metaObject$leaveOneOutAnalysis.
For details, see the meta-analysis algorithm above.
Example:
exampleMetaObj <- runMetaAnalysis(exampleMetaObj, maxCores=1)
## Found common probes in 3
## Computing effect sizes...
## Computing summary effect sizes...
## Computing Fisher's output...
## Computing effect sizes...
## Computing summary effect sizes...
## Computing Fisher's output...Found common probes in 2
## Computing effect sizes...
## Computing summary effect sizes...
## Computing Fisher's output...Computing effect sizes...
## Computing summary effect sizes...
## Computing Fisher's output...
The generated effect size plot compares the effect size distributions of the datasets used in the Meta-Analysis. Ideally, the effect sizes should have a normal distribution around 0.
Check if the result has been written into $metaAnalysis and $leaveOneOutAnalysis:
str(exampleMetaObj, max.level = 2)
## List of 3
## $ originalData :List of 3
## ..$ PBMC.Study.1 :List of 5
## ..$ Whole.Blood.Study.1:List of 5
## ..$ Whole.Blood.Study.2:List of 5
## $ metaAnalysis :List of 4
## ..$ datasetEffectSizes : num [1:167, 1:3] 0.959 0.279 -0.192 -0.332 -0.242 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ datasetEffectSizeStandardErrors: num [1:167, 1:3] 0.312 0.306 0.305 0.306 0.305 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ pooledResults :'data.frame': 167 obs. of 14 variables:
## ..$ analysisDescription : chr "MetaAnalysis: Random Effects Model"
## $ leaveOneOutAnalysis:List of 3
## ..$ removed_PBMC.Study.1 :List of 4
## ..$ removed_Whole.Blood.Study.1:List of 4
## ..$ removed_Whole.Blood.Study.2:List of 4
filterGenes()
After the meta-analysis results have been written to the metaObject, they can be examined using different gene filter criteria with filterGenes(). The standard filter parameters are the FDR cutoff and leave-one-out analysis on/off:
FDRThresh: FDR cutoff: a gene is selected if its FDR-corrected p-value is less than or equal to the cutoff (default: 0.05)
isLeaveOneOut: do leave-one-out analysis on discovery datasets (default: TRUE). Needs at least 2 datasets for discovery.
(For the remaining filter parameters, see filterGenes().)
Example:
exampleMetaObj <- filterGenes(exampleMetaObj, isLeaveOneOut = TRUE, FDRThresh = 0.001)
summarizeFilterResults()
Either use the filter label as input for summarizeFilterResults()
summarizeFilterResults(exampleMetaObj, "FDR0.001_es0_nStudies1_looaTRUE_hetero0")
Or use the most recent filter with the function getMostRecentFilter()
summarizeFilterResults(exampleMetaObj, getMostRecentFilter(exampleMetaObj))
## $pos
## effectSize effectSizeStandardError effectSizePval effectSizeFDR
## Gene27 1.163213 0.1066813 1.107537e-27 1.849586e-25
## Gene13 1.629049 0.1514056 5.345649e-27 4.463617e-25
## Gene36 1.185257 0.1600311 1.297580e-13 3.095656e-12
## Gene26 1.009014 0.1396065 4.917610e-13 1.026551e-11
## Gene17 1.186679 0.1739487 8.978005e-12 1.363024e-10
## Gene24 1.056540 0.1798291 4.222159e-09 4.406879e-08
## tauSquared numStudies cochranesQ heterogeneityPval fisherStatUp
## Gene27 0.00000000 3 0.01310449 0.9934692 130.50116
## Gene13 0.00000000 3 1.39880786 0.4968814 136.99524
## Gene36 0.00000000 2 0.10687482 0.7437306 66.32116
## Gene26 0.00000000 3 1.43300798 0.4884569 68.53009
## Gene17 0.04541667 3 3.95629827 0.1383250 83.64513
## Gene24 0.01187769 2 1.16373481 0.2806923 55.75905
## fisherPvalUp fisherFDRUp fisherStatDown fisherPvalDown fisherFDRDown
## Gene27 1.008046e-25 8.417186e-24 1.818131e-02 0.9999999 1
## Gene13 4.313804e-27 7.204053e-25 4.881354e-05 1.0000000 1
## Gene36 1.355405e-13 1.886272e-12 9.336704e-05 1.0000000 1
## Gene26 8.182527e-13 9.760586e-12 6.352420e-03 1.0000000 1
## Gene17 6.298572e-16 2.103723e-14 1.365882e-04 1.0000000 1
## Gene24 2.252504e-11 2.285523e-10 3.895598e-05 1.0000000 1
##
## $neg
## effectSize effectSizeStandardError effectSizePval effectSizeFDR
## Gene59 -0.5772361 0.08198619 1.913453e-12 3.550518e-11
## Gene54 -0.4100211 0.08539501 1.575098e-06 8.768044e-06
## tauSquared numStudies cochranesQ heterogeneityPval fisherStatUp
## Gene59 0.001354013 3 2.119463 0.3465489 0.4061699
## Gene54 0.004890694 3 2.407518 0.3000641 6.2240635
## fisherPvalUp fisherFDRUp fisherStatDown fisherPvalDown fisherFDRDown
## Gene59 0.9988003 1.0000000 42.66136 1.360968e-07 1.623441e-06
## Gene54 0.3985643 0.6656024 68.13827 9.843187e-13 5.479374e-11
Validate the gene signature in a datasetObject with violinPlot()
A violin plot is similar to a box plot, except the width of each violin is proportional to the density of points. It can be used to validate how well the selected gene set can be used to separate the groups (e.g. cases vs. controls). P-values are calculated and displayed on the plots.
Example:
violinPlot(exampleMetaObj$filterResults$FDR0.001_es0_nStudies1_looaTRUE_hetero0, dataObj2, labelColumn = 'group')
Note: Usually the validation would be performed on an independent validation dataset, not on the discovery dataset dataObj2 as done here!
Validate the gene signature in a datasetObject with rocPlot()
In addition, a ROC plot can be generated to validate how well the groups (e.g. cases vs. controls) can be separated with the selected gene set. The function rocPlot() returns a standard ROC plot, with area under the curve (AUC) and 95% confidence interval (CI) calculated according to Hanley's method (Hanley et al., 1982).
Example:
rocPlot(exampleMetaObj$filterResults$FDR0.001_es0_nStudies1_looaTRUE_hetero0, dataObj2, title = "ROC plot for discovery dataset2, FDR: 0.001")
## Used 6 of 6 pos genes, and 2 of 2 neg genes
## For dataset Whole Blood Study 1, AUC = 0.866
Note: Usually the validation would be performed on an independent validation dataset, not on the discovery dataset dataObj2 as done here!
forestPlot()
A forest plot can be used to compare the expression values of a gene across different datasets. The size of the blue boxes is proportional to the number of samples in the study, and the light blue lines indicate the standard error of the effect sizes for each study (95% confidence interval). The summary effect size for all studies is indicated as a yellow diamond at the bottom, and the width of the diamond indicates the summary standard error.
Example:
forestPlot(exampleMetaObj, "Gene27")
## [1] "character"
calculateScore()
Given a gene set of interest, it is often desirable to summarize the expression of that gene set using a single integrated signature score (for details see above). The method calculateScore() calculates the geometric mean of the expression level of all positive genes, minus the geometric mean of the expression level of all negative genes. Although mostly used internally (e.g. to calculate the Z-scores for violinPlot()), the function has been exported in case users want to compare multiple classes, etc., using the same Z-score as is used for producing two-class comparisons.
Example:
calculateScore(exampleMetaObj$filterResults$FDR0.001_es0_nStudies1_looaTRUE_hetero0, dataObj2)
## Used 6 of 6 pos genes, and 2 of 2 neg genes
## [1] -0.15306434 -0.68681735 -0.81023751 -0.81863726 -1.25195646 -1.87385218
## [7] -1.35369218 -1.00883527 -1.29960805 -1.38863102 -0.76999804 -0.71777841
## [13] 0.07056772 -0.37165758 -0.42032379 -0.51503945 -0.74899507 -0.14092856
## [19] 2.22416036 -0.01965896 0.78892927 0.40788351 0.17038795 1.59325560
## [25] 0.15670475 0.76272870 -0.85048217 0.55503900 -0.73279309 -0.77016199
## [31] -0.49176551 0.54325166 0.93394146 0.26956889 1.61741129 1.23905868
## [37] 0.99592740 1.38078543 1.02264230 0.90994709 0.33786849 -0.24990791
## [43] -1.05145613 -0.93880708 1.92819891 1.18440716 -0.62728134 -0.68417686
## [49] 1.65387796
Typically, this vector would be added as a $score column in datasetObject$pheno.
metaObject
The metaObject contains all input and output data of the meta-analysis.
metaObject: named list
$originalData: named list [1]
$datasetName: datasetObject. 'datasetName' will be the (unquoted) name of that dataset.[0,n]
$metaAnalysis: MetaAnalysisObject. Corresponds to the meta-analysis results including all data in originalData [0,1]
$leaveOneOutAnalysis: named list [0,1]
$removed_datasetName: MetaAnalysisObject. 'datasetName' will be name of removed dataset in LOOA. [0,n]
$filterResults: named list [0,1]
$filterCriteria: filterObject. 'filterCriteria' will be a string representation of the filter criteria. [0,n]
datasetObject
The datasetObject contains all information of one GEO gene expression dataset.
datasetObject: named list
$class: named vector. Names are sample names. Values are 0 if control, 1 if case.
$expr: matrix. Row names are probe names. Column names are sample names. Values are expression values
$keys: named vector. Names are probe names. Values are gene names.
$pheno: data frame. Row names are the sample names. Column names are the annotation information (none required).
$formattedName: string. A formatted name for this dataset which will be used in plots.
MetaAnalysisObject
Object that contains the results of one meta-analysis, generated by runMetaAnalysis() and stored in metaObject$metaAnalysis and metaObject$leaveOneOutAnalysis.
MetaAnalysisObject: named list
$datasetEffectSizes: numeric matrix. Column names are dataset names. Row names are gene names. Values are dataset-specific effect sizes.
$datasetEffectSizeStandardErrors: numeric matrix. Column names are dataset names. Row names are gene names. Values are dataset-specific effect size standard errors.
$pooledResults: data frame. Row names are gene names. Column names are:
$effectSize: double. pooled.ES$summary
$effectSizeStandardError: double. Non-negative.
$effectSizePval: double. Ranges from 0 to 1.
$effectSizeFDR: double. Ranges from 0 to 1.
$tauSquared: double. Non-negative.
$numStudies: integer. Non-negative.
$cochranesQ: double. Non-negative.
$heterogeneityPval: double. Ranges from 0 to 1.
$fisherStatUp: double. Non-negative.
$fisherPvalUp: double. Ranges from 0 to 1.
$fisherFDRUp: double. Ranges from 0 to 1.
$fisherStatDown: double. Non-negative.
$fisherPvalDown: double. Ranges from 0 to 1.
$fisherFDRDown: double. Ranges from 0 to 1.
$analysisDescription: string. Describes details of meta-analysis applied.
filterObject
Object that contains the filter results of one meta-analysis, generated by filterGenes() and stored in metaObject$filterResults.
filterObject: named list
$posGeneNames: character vector. Values are positively-regulated gene names which passed the filter.
$negGeneNames: character vector. Values are negatively-regulated gene names which passed the filter.
$FDRThresh: double. Ranges from 0 to 1.
$effectSizeThresh: double. Non-negative.
$numberStudiesThresh: integer. Non-negative.
$isLeaveOneOut: boolean.
$heterogeneityPvalThresh: double. Ranges from 0 to 1.
$filterDescription: string. Describes any additional details of the filter.
$timestamp: POSIXct. Result from Sys.time() call from when the filter was executed.
runMetaAnalysis()
Given a metaObject with $originalData populated, this function will run the meta-analysis algorithm.
It returns a modified version of the metaObject with the meta-analysis results written into metaObject$metaAnalysis and the results of the leave-one-out analysis into metaObject$leaveOneOutAnalysis.
Usage:
metaObject <- runMetaAnalysis(metaObject)
metaObject: a metaObject which must have $originalData populated
Returns: a modified metaObject with both $metaAnalysis and $leaveOneOutAnalysis populated with MetaAnalysisObject results
filterGenes()
After the meta-analysis results have been written to the metaObject, the results can be examined using different gene filtering criteria. This function will use the given filter parameters to select genes that fulfill the filter conditions. The function returns a modified version of the metaObject with results stored in metaObject$filterResults.
Usage:
metaObject <- filterGenes(metaObject, filterParameter)
metaObject: a metaObject, which must have $originalData and $metaAnalysis populated
optional filterParameter:
isLeaveOneOut: do leave-one-out analysis on discovery datasets (default: TRUE). Needs at least 2 datasets for discovery.
FDRThresh: FDR cutoff: a gene is selected if its FDR-corrected p-value is less than or equal to the cutoff (default: 0.05)
effectSizeThresh: a gene is selected if the absolute value of its effect size is above this threshold (default: 0)
numberStudiesThresh: number of studies in which a selected gene has to be significantly up/down regulated (default: 1)
heterogeneityPvalThresh: heterogeneity p-value cutoff (filter is off by default: heterogeneityPvalThresh = 0). Genes with significant heterogeneity, and thus a significant (low) heterogeneity p-value, can be filtered out by using e.g. heterogeneityPvalThresh = 0.05 (removes all genes with heterogeneity p-value < 0.05)
Returns: a modified version of the input metaObject with an additional filterObject stored within metaObject$filterResults
summarizeFilterResults()
Given a filterObject, this function will print a summary-style message about genes that passed the filtering step using filterGenes() and return a data frame that contains the $pooledResults information for each gene which passed the filter.
Usage:
summarizeFilterResults(metaObject, metaFilterLabel)
metaObject: the metaObject that contains the filterObject of interest
metaFilterLabel: the name of a filterObject generated with the function filterGenes()
Returns: a data frame with the $pooledResults information for each gene which passed the filter
calculateScore()
Given a gene set of interest, it is often desirable to summarize the expression of that gene set using a single integrated signature score (for details see above). The calculateScore() method calculates the geometric mean of the expression level of all positive genes, minus the geometric mean of the expression level of all negative genes. The resulting scores are then standardized within the given dataset, such that the output ‘Z-score’ has mean=0 and std. dev=1. Such a Z-score can then be used for classification, etc.
Details: The Z-score is based on the geometric mean of expression. As such, negative expression values are not allowed. A dataset is thus always scaled by its minimum value + 1, such that the lowest value = 1. Any individual NaNs or NAs are also set to 1. If a dataset does not have any information on a given gene, the entire gene is simply left out of the score. When run, the function prints to the command line the number of genes used and the number passed in.
Although mostly used internally, the function has been exported in case users want to compare multiple classes, etc., using the same Z-score as is used for producing two-class comparisons.
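The score described above can be reproduced on a small example. This is a minimal base R sketch of the geometric-mean difference followed by within-dataset standardization; it assumes gene-level, strictly positive expression values and hypothetical gene names, and it omits the probe-to-gene summarization and NA handling that the function performs.

```r
geomMean <- function(x) exp(mean(log(x)))

# Hypothetical gene-level expression; each line below is one sample (column)
expr <- matrix(c(8, 9, 4,
                 10, 12, 3,
                 6, 7, 8,
                 5, 6, 9),
               nrow = 3,
               dimnames = list(c("GeneA", "GeneB", "GeneC"),
                               paste0("S", 1:4)))
pos <- c("GeneA", "GeneB")  # genes up-regulated in cases
neg <- c("GeneC")           # genes down-regulated in cases

# Geometric mean of positive genes minus geometric mean of negative genes
score <- apply(expr[pos, , drop = FALSE], 2, geomMean) -
  apply(expr[neg, , drop = FALSE], 2, geomMean)

# Standardize within the dataset (mean 0, standard deviation 1)
zScore <- (score - mean(score)) / sd(score)
```

Because the standardization is done per dataset, Z-scores are comparable between samples of one dataset but not directly between datasets.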
Usage:
calculateScore(filterObject, datasetObject)
datasetObject: a Dataset object for which the signature score (Z-score) will be calculated.
filterObject: a MetaFilter object generated with filterGenes() containing the signature genes that will be used for Z-score calculation
Returns: a vector of Z-scores, of length ncol(datasetObject$expr) (and in the same order). This vector would typically be added as a $score column in datasetObject$pheno.
forestPlot()
A forest plot can be used to compare the expression values of a gene across different datasets. The size of the blue boxes is proportional to the number of samples in the study, and the light blue lines indicate the standard error of the effect sizes for each study (95% confidence interval). The summary effect size for all studies is indicated as a yellow diamond at the bottom, and the width of the diamond indicates the summary standard error.
Usage:
forestPlot(metaObject, geneName)
metaObject: a filtered metaObject (i.e. needs to include a filterObject generated with filterGenes())
geneName: name of the gene for which the forest plot should be generated
violinPlot()
Given a filterObject and a datasetObject, this function will use the selected genes of the filterObject to calculate and compare the z-scores of the groups (e.g. cases vs. controls) from the datasetObject by generating a violin plot. A violin plot is similar to a box plot, except the width of each violin is proportional to the density of points. violinPlot() is commonly used to validate a gene signature in an independent dataset.
Usage:
violinPlot(filterObject, datasetObject, labelColumn)
filterObject: a MetaFilter object containing the signature genes that will be used for the z-score calculation
datasetObject: a Dataset object for group comparison in a violin plot
labelColumn: the label of the column in $pheno that specifies the groups to compare, typically case or control (default: ‘label’)
Generates:
rocPlot()
rocPlot() will plot an ROC curve (and return the AUC) that describes how well a gene signature (as defined in a filterObject) classifies groups in a dataset (in the form of a datasetObject).
Details:
Evaluates the ability of a given gene set to separate two classes. The gene set is evaluated as a Z-score of the difference in geometric means between the positive genes and the negative genes (see calculateScore()). Returns a standard ROC plot, plus AUC with 95% CI (calculated according to Hanley's method).
Usage:
rocPlot(filterObject, datasetObject, title = "ROC Plot")
filterObject: a MetaFilter object containing the signature genes that will be used for calculation of the ROC plot.
datasetObject: a Dataset object for group comparison in the ROC plot. (At least, must have a $expr of probe-level data, $keys of probe:gene mappings, and $class of two-class labels)
title: title for the ROC plot.
Returns:
forwardSearch()
forwardSearch() is a method of optimizing a given set of significant genes to maximize discriminatory power, as measured by area under the ROC curve (AUC). The function works by taking a given set of genes (presumably a set that has been filtered for statistical significance), and iteratively adding one gene at a time, until the stopping threshold is reached. At each round, the gene whose addition contributes the greatest increase in weighted AUC is added. Weighted AUC is defined as the sum of the AUC of each dataset, times the number of samples in that dataset. The stopping threshold is in units of weighted AUC.
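To make the units of the stopping threshold concrete, the weighted AUC and the gain from one candidate gene can be computed as follows. This is a sketch in base R with made-up AUCs and sample counts; the comparison `gain > forwardThresh` is an assumed reading of the stopping rule, not taken from the package source.

```r
# Hypothetical per-dataset AUCs and sample counts
auc <- c(PBMC = 0.85, Blood1 = 0.90, Blood2 = 0.78)
nSamples <- c(PBMC = 115, Blood1 = 49, Blood2 = 60)

# Weighted AUC: sum over datasets of AUC times number of samples
weightedAUC <- function(auc, n) sum(auc * n)
current <- weightedAUC(auc, nSamples)

# Suppose adding a candidate gene changes the per-dataset AUCs to:
aucWithGene <- c(PBMC = 0.87, Blood1 = 0.90, Blood2 = 0.80)
gain <- weightedAUC(aucWithGene, nSamples) - current

# Assumed stopping rule: keep adding genes while the best gain exceeds the threshold
forwardThresh <- 0
addGene <- gain > forwardThresh
```

Since the weighted AUC scales with the total number of samples, a threshold that is meaningful for one collection of datasets may need adjusting for another.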
Usage:
forwardSearch(metaObject, geneList, yes.pos = NULL, yes.neg = NULL, forwardThresh = 0)
metaObject: a Meta object which must be complete (have $filterResults included)
geneList: a list of two vectors of filtered genes; must have positive genes as the first item and negative genes as the second item. Designed to pass in the filtered gene lists from the metaObject.
yes.pos: (optional) if passed, the forwardSearch will start with the genes in yes.pos and yes.neg (instead of starting from 0 genes).
yes.neg: (optional) if passed, the forwardSearch will start with the genes in yes.pos and yes.neg (instead of starting from 0 genes).
forwardThresh: stopping threshold for the forward search. Default=0.
Returns: Data frame containing genes which passed the filtering process
Examples
#Run a forward search
forwardRes <- forwardSearch(tinyMetaObject, tinyMetaObject$filterResults[[1]], forwardThresh = 0)
backwardSearch()
backwardSearch() is a method of optimizing a given set of significant genes to maximize discriminatory power, as measured by area under the ROC curve (AUC). The function works by taking a given set of genes (presumably a set that has been filtered for statistical significance), and iteratively removing one gene at a time, until the stopping threshold is reached. At each round, the gene whose removal contributes the greatest increase in weighted AUC is removed. Weighted AUC is defined as the sum of the AUC of each dataset, times the number of samples in that dataset. The stopping threshold is in units of weighted AUC.
Usage:
backwardSearch(metaObject, geneList, backThresh = 0)
metaObject: the metaObject which must be complete (have $filterResults included)
geneList: a list of two vectors of filtered genes; must have positive genes as the first item and negative genes as the second item. Designed to pass in the filtered gene lists from the metaObject.
backThresh: stopping threshold for the backward search. Default=0.
Returns: Data frame containing genes which passed the filtering process
Examples
#Run a backward search
backwardRes <- backwardSearch(tinyMetaObject, tinyMetaObject$filterResults[[1]], backThresh = -3)
checkDataObject()
Given an object to check, its objectType and the objectStage, the function checkDataObject() looks for errors within Meta, Dataset, MetaAnalysis, or MetaFilter objects. It returns TRUE if the object passed error checking, FALSE otherwise, and it prints warning messages explaining failed checks.
Usage:
checkDataObject(object, objectType, objectStage)
object: the object to be checked for validation
objectType: one of “Meta”, “Dataset”, “MetaAnalysis”, “MetaFilter”
objectStage: if a Meta object, one of “Pre-Analysis”, “Pre-Filter”, or “Post-Filter”. Otherwise: "" (empty string)
Returns: TRUE if passed error checking, FALSE otherwise
Examples
# check a datasetObject
checkDataObject(tinyMetaObject$originalData$Whole.Blood.Study.1, "Dataset")
# check a metaObject before running the meta-analysis
checkDataObject(tinyMetaObject, "Meta", "Pre-Analysis")
# check a metaObject after running the meta-analysis with runMetaAnalysis()
checkDataObject(tinyMetaObject, "Meta", "Pre-Filter")
# check a metaObject after filtering the meta-analysis results with filterGenes()
checkDataObject(tinyMetaObject, "Meta", "Post-Filter")
# check a metaAnalysisObject
checkDataObject(tinyMetaObject$metaAnalysis, "MetaAnalysis")
# check a filterObject
checkDataObject(tinyMetaObject$filterResults[[1]], "MetaFilter")
getMostRecentFilter()
Given a metaObject, this function will look through $filterResults for the most recent filter used and return the filter name.
Usage:
getMostRecentFilter(metaObject)
metaObject: a Meta object
Returns: FilterLabel: name of the most recent filter
calculateROC()
Calculates receiver operating characteristic curve data, including AUC (using trapezoidal method). Takes only a vector of labels and a vector of predictions.
Details:
The code borrows its core ROC calculations from the ROCR package. AUC is calculated by the trapezoidal method. AUC standard errors are calculated according to Hanley’s method (Hanley et al, 1982).
Usage:
calculateROC(labels, predictions, AUConly = F)
labels: vector of labels; must have exactly two unique values (i.e. cases and controls).
predictions: vector of predictions (for instance, test scores) to be evaluated for ability to separate the two classes. Must be exactly the same length as labels.
AUConly: return all ROC values, or just the AUC.
Returns:
roc: data frame consisting of two columns, FPR and TPR, meant for plotting
auc: area under the curve
auc.CI: 95% confidence interval for AUC
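The three return values can be illustrated without the package. The sketch below computes ROC points, a trapezoidal AUC and a Hanley & McNeil (1982) standard error in base R; the data are hypothetical, ties in the predictions are not handled, and clipping the interval to [0, 1] is a choice made here rather than documented behaviour.

```r
# Hypothetical labels (1 = case, 0 = control) and prediction scores without ties
labels <- c(0, 0, 0, 0, 1, 1, 1, 1)
predictions <- c(0.1, 0.4, 0.35, 0.8, 0.5, 0.7, 0.9, 0.6)

# ROC points: sweep the threshold from the highest score downwards
ord <- order(predictions, decreasing = TRUE)
lab <- labels[ord]
n1 <- sum(lab == 1)
n0 <- sum(lab == 0)
roc <- data.frame(FPR = c(0, cumsum(lab == 0) / n0),
                  TPR = c(0, cumsum(lab == 1) / n1))

# Trapezoidal AUC
auc <- sum(diff(roc$FPR) * (head(roc$TPR, -1) + tail(roc$TPR, -1)) / 2)

# Hanley & McNeil standard error and a normal-approximation 95% CI
q1 <- auc / (2 - auc)
q2 <- 2 * auc^2 / (1 + auc)
se <- sqrt((auc * (1 - auc) + (n1 - 1) * (q1 - auc^2) +
              (n0 - 1) * (q2 - auc^2)) / (n1 * n0))
auc.CI <- pmin(1, pmax(0, auc + c(-1, 1) * qnorm(0.975) * se))
```

With only eight samples the interval is wide, which is a useful reminder that AUCs from small validation datasets carry considerable uncertainty.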
getSampleLevelGeneData()
Given a standard datasetObject and a set of target genes, this function will summarize probe-level data to gene-level data for the target genes. Returns a data frame with only the genes of interest, for each sample in the dataset.
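One simple way to summarize probes to genes is to average all probes that map to the same gene. The base R sketch below does this for a toy matrix; averaging is an assumption made for illustration, and the function may combine probes differently.

```r
# Hypothetical probe-level expression and probe-to-gene mapping
expr <- matrix(c(1, 3, 5,
                 2, 4, 6),
               nrow = 3,
               dimnames = list(c("p1", "p2", "p3"), c("S1", "S2")))
keys <- c(p1 = "GeneA", p2 = "GeneA", p3 = "GeneB")

# Sum probes per gene, then divide by the number of probes per gene
geneSums <- rowsum(expr, group = keys[rownames(expr)])
probesPerGene <- as.vector(table(keys)[rownames(geneSums)])
geneMeans <- geneSums / probesPerGene

# Keep only the genes of interest
geneNames <- c("GeneA")
geneLevel <- as.data.frame(geneMeans[geneNames, , drop = FALSE])
```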
Usage:
getSampleLevelGeneData(datasetObject, geneNames)
datasetObject: a Dataset object that is used to extract sample-level data (at least, must have a $expr of probe-level data, and probe:gene mappings in $keys).
geneNames: a vector of gene names
Returns: a data frame with only the genes of interest, for each sample in the dataset