The mstherm
package is used for modeling and analysing thermal proteome
profiling (TPP) experiments. For more details on the motivation and
methodology behind such experiments, please see Savitski et al,
2014. Analyzing a TPP experiment in
mstherm
involves generating table of quantative temperature series data for
each replicate, setting up metadata files describing the experiment, creating
an MSThermExperiment object, normalizing the data, modeling, visualization,
and data export. Each of these steps will be described below as a sample
dataset is analyzed.
The primary input to mstherm
is a set of tab-delimited data tables
containing quantative data from the experiment, one file per replicate. The
table consists of a header line followed by one line per peptide spectral
match. It can contain any number of columns, but certain columns must be
present (and labeled properly) to be used with mstherm
. Specifically, the
folowing columns are required to be present:
peptide
- the sequence of the matched peptide in single-letter IUPACprotein
- the protein group the peptide is assigned to...
- one column per temperature point, with labels matching those in
the meta
table described belowThe following columns are optional but contain information that can be used for filtering (all other columns are simply ignored):
score
- the score assigned by the search engine or processing software
(e.g. Percolator, PeptideProphet) to the peptide spectral matchcoelute_inf
- the fraction of the signal in the precursor window derived
from off-target ions (e.g. the percent co-isolation interference, from
0.0-1.0)These data files are generally produced by whatever quantification software is used, but may need manual or batch processing to apply the correct column labels.
Although mstherm
has some capabilities for reading metadata from existing
dataframes, typically this information will be read from files on disk. While
these files can be arranged in any way the user wishes, it is recommended that
a new directory be set up for each project with the following layout:
├── control.tsv
├── data
│ ├── Control_R1.tsv
│ ├── Control_R2.tsv
│ ├── Control_R3.tsv
│ ├── Treated_R1.tsv
│ ├── Treated_R2.tsv
│ └── Treated_R3.tsv
└── meta
└── tags.meta
The control.tsv
file is a tab-delimited table describing the experiment and paths to relevant files.
An example of a control file for the above experiment would be as follows:
name sample data_file meta_file
C1 Control data/Control_R1.tsv meta/tags.meta
C2 Control data/Control_R2.tsv meta/tags.meta
C3 Control data/Control_R3.tsv meta/tags.meta
T1 Treated data/Treated_R1.tsv meta/tags.meta
T2 Treated data/Treated_R2.tsv meta/tags.meta
T3 Treated data/Treated_R3.tsv meta/tags.meta
Paths in this file are relative to the location of the control file itself. In
this experiment, all samples used the same quantitative methodology and so all
share the same meta file, but any number of meta files can be used. The
meta_file
is another tab-delimited table that maps quantitative column
labels in the data files to temperature points. An example for an experiment
utilizing TMT-10plex isobaric labeling might look as follows:
channel temp
TMT10.126 28.2
TMT10.127N 31.3
TMT10.127C 35.0
TMT10.128N 38.5
TMT10.128C 42.1
TMT10.129N 45.7
TMT10.129C 49.1
TMT10.130N 52.2
TMT10.130C 56.2
TMT10.131 59.4
The channel labels can be anything the user chooses as long as they match with the appropriate columns in the data file. In all of these metadata files, the column labels must match those shown above.
If available, a tab-delimited table containing descriptions for each protein
group can be imported and used for plot labels during visualization. This
table should have two columns labeled name
and annotation
.
With the data files in hand as described, an mstherm
session is started as
follows:
library(mstherm)
control <- system.file("extdata", "demo_project/control.tsv", package="mstherm")
annots <- system.file("extdata", "demo_project/annots.tsv", package="mstherm")
expt <- MSThermExperiment(control, annots)
This will read the control file and load all necessary quantitative data and metadata into the MSThermExperiment object. This is the step in which the table of annotations can be provided if available (this information is only used to provide descriptive labels on protein melting plots). Here we are using files included in the demo of this package.
Typically, after importing the raw quantitiative data a normalization step is
performed to correct for differences in sampling handling and other sources of
variance that can introduce variance from channel to channel. There are two
methods available in mstherm
to perform this step, normalize_to_std()
and
normalize_to_profile()
. normalize_to_std()
requires that a protein
spike-in be added equally to all samples directly after gradient
precipitation, and the method calculates and applies scaling factors to each
channel such that the values for the spike-in are roughly equal across
channels. The normalization is applied independently to each replicate. In the
demo data, BSA was spiked in to each sample, and we use it here to apply the
normalization method (using the protein ID which matches that used in the data
tables):
expt <- normalize_to_std(expt, "cRAP_ALBU_BOVIN")
This performs the normalization as well as producing a summary plot for QC purposes (which can be captured to PDF, etc, by changing the output device before the above step). In the plots above, green points represent the original channel intensity sums (relative to the first temperature), the red points show relative quantification of spike-in standard pre-normalization (with IQR indicated by red bars), the blue points show normalized relative channel sums, and the blue curve is final normalization curve used to generate scaling factors.
The other available normalization method, normalize_to_profile()
, takes a
vector of values equal in length to the number of quantification channels and
calculates and applies scaling factors to each channel such that the ratios
between sums of quantification values for each channel match the ratios
between values given in the vector. Such values might correspond to, for
example, measured total protein concentration in each sample after gradient
precipitation.
The temperature series data is attempted to be fit to a logistic curve model for two-state unfolding according the following formula:
\[Pn = \frac{1 - \textit{p}}{1 + e^{-k(\frac{1}{T}-\frac{1}{m})}} + \textit{p}\]
where \(T\) is the temperature and \(m\), \(k\), and \(p\) are estimated parameters (\(m\) corresponds to the melting temperature, \(p\) the lower plateau of the curve, and \(k\) contributes to the slope of the curve).
Modeling of the data is performed with the following command:
res <- model_experiment(expt, bootstrap=T, smooth=T, min_rep_psm=3, np=1)
There are many, many options available to this command which control various
aspects of modeling and filtering (see the documentation for full details). As
modeling can be slow (especially if you have data for several thousand
proteins and bootstrapping is turned on), the package can utilize parallel
processing to speed things up. This is set by the np
parameter, which by
default will use all available threads on the machine.
A second round of normalization can be performed to correct for differences in sample handling between replicates in experiments where most proteins are not expected to be changing and the global distribution of protein Tms is expected to be constant. Doing so requires that the dataset first be modeled as above to estimate Tms for each protein. A single replicate is then chosen as the baseline and correction factors are calculated for all other replicates based on linear regression between a set of high-quality melting temperature estimates. These linear corrections are applied to the intial temperature vector for each replicate and a new MSThermExperiment object is returned which can be re-modeled as above. The command to perform this inter-replicate normalization is:
expt <- normalize_to_tm( expt, res )
res <- model_experiment(expt, bootstrap=T, smooth=T, min_rep_psm=3, np=1)
Melting plots can be produced for each protein in the result set. This can be
done either for an individual protein or for the whole set at once. For
example, the following command will generate a plot for protein P38707
of
the test data:
plot(res$P38707)
Again, there are a number of options that can be passed to the plot()
command to control what is shown on the plots (by default, most options are
turned on). The plot()
command can also be passed a list of proteins to
plot, or no arguments at all (in which case all proteins are plotted
sequentially). For example, the following will produce a PDF file with each
protein plot on a separate page:
pdf("models.pdf", 5, 5, pointsize=10, useDingbats=F)
plot(res)
dev.off()
The as.data.frame()
method produces a summary dataframe containing melting
temperatures, model parameter estimates, and other descriptive values for each
protein that can be used for further analysis in R or written to file using
the usual methods (e.g. write.table()
). The output from this command
contains the following columns (row labels are protein group IDs):