To cite MALDIrppa in publications, please use:
Palarea-Albaladejo J., McLean K., Wright F. and Smith (2018). MALDIrppa: quality control and robust analysis for mass spectrometry data. Bioinformatics 34(3):522–523. (http://dx.doi.org/10.1093/bioinformatics/btx628).
(Note that the following applies to MALDIrppa version 1.1.0 onwards. For compatibility with scripts using previous versions see the corresponding article.)
Mass spectrometry is actively being used to discover proteomic patterns in complex mixtures of peptides derived from biological samples. Reproducibility is however a significant challenge in MALDI protein profiling. Approaches to improve the analytical performance of MALDI protein profiling include extensive pre-fractionation strategies, immunocapture, pre-structured target surfaces, standardized matrix (co)crystallization, improved instrument components, internal standard peptides, quality-control samples, among others (Albrethsen 2007). Mass spectrometry (MS) data pre-processing algorithms play a crucial role in rendering the subsequent data analysis more robust and accurate. The package MALDIrppa
contributes a number of procedures for robust pre-processing and analysis, along with a number of functions to facilitate common data management operations. It is thought to work in conjunction with the MALDIquant
package (Gibb and Strimmer 2012), using object classes and methods from this latter. These notes provide basic guidelines for users to run a complete robust MALDI mass spectra pre-processing pipeline using their own data with small adaptations. Further specific details and examples on the full functionality of the MALDIrppa
package can be found in the associated help pages.
The data set spectra
used as example here is an anonymised collection of mass spectra trimmed to the [2500, 13000] m/z interval. Their m/z resolution is 30 times lower than the original (achieved using the redResolution
function) so as to make it manageable for sharing and illustration purposes. It consists of 4 technical replicates of 5 biological replicates from 20 bacterial isolates. The mass spectra were originally acquired over the [2000, 20000] Daltons mass range using the Bruker Daltonics Ultraflex II mass spectrometer. The data set type
contains a simplified version of the associated metadata.
The following sections go through the stages to transform raw mass spectra into formats suitable for downstream data analysis and modelling for biological interpretation. These involve data transformation, smoothing, baseline correction, normalisation, peak detection and peak alignment and binning. The features of the signals depend on technological progress and characteristics of the species under study. Unfortunately there is not a standard methodology for MS data pre-processing best working in all cases. This document illustrates a generic pipeline through MALDIrppa
in combination with MALDIquant
based on robust statistical methods which contributes to lessen the intrinsic reproducibility issues with MS data and facilitates obtaining reliable biological conclusions (Mclean et al. 2017). However, the finest pre-processing in a particular case might require further discussion of choices for parameter settings and methods.
The data are included in MALDIrppa
in the own R’s RData
binary file format. They can be loaded into the workspace using the data
function.
library(MALDIrppa)
data(spectra) # list of MassSpectra class objects
data(type) # metadata matrix
Alternatively, raw data stored in text files can be easily imported into R and given the convenient MassSpectrum
class format using the importSpectra
function included in MALDIrppa
. For importing data from more specialised file formats we suggest the reader to check out the package MALDIquantForeign
. The function summarySpectra
computes a table including some basic summary statistics. A look at these along with some graphical representations are useful to get a first insight into the data set and identify possible issues. The next code shows numerical results for the first 10 signals.
summarySpectra(spectra[1:10])
## ID No.MZ Min.MZ Max.MZ Min.Int Mean.Int Std.Int Med.Int MAD.Int Max.Int
## 1 160408F21 1857 2500.05 12995.77 19 722.2649 572.9375 664 486.2928 5888
## 2 160408F22 1857 2500.05 12995.77 18 637.1357 531.1853 582 382.5108 6128
## 3 160408F23 1857 2500.05 12995.77 8 542.4976 447.7237 483 401.7846 4713
## 4 160408F24 1857 2500.05 12995.77 34 957.3430 737.8402 903 587.1096 7843
## 5 160408G01 1857 2500.05 12995.77 42 1129.8185 871.5639 1064 652.3440 7860
## 6 160408G02 1857 2500.05 12995.77 24 637.1648 530.9029 595 358.7892 5459
## 7 160408G03 1857 2500.05 12995.77 12 505.9499 442.6567 466 298.0026 5198
## 8 160408G04 1857 2500.05 12995.77 43 984.1179 754.5525 936 530.7708 7333
## 9 160408G05 1857 2500.05 12995.77 27 906.9138 686.9536 854 558.9402 6068
## 10 160408G06 1857 2500.05 12995.77 46 906.9133 689.0228 891 481.8450 7505
Pre-processing methods are not immune to low-quality mass spectra and inadvertently letting them through can severely distort the results. For example, when dealing with a number of data sets we observed both clusters of spurious peaks in some cases and sparse peaks in others which interfere with the determination of common reference peaks for peak alignment and binning. Common low-quality mass spectra are characterised by extremely ripply and indented profiles due to low signal intensities. The low signal intensity causes poor peak formation and poorly resolved peaks. Moreover, ion suppression cases show one or a few abnormally high peaks whereas the other peptide features are pulled down. The package MALDIrppa
helps to detect exotic signals by a semi-automatic screening process implemented in the screenSpectra
function. It is based on robust scale estimators (Rousseeuw and Croux 1993) of the derivative mass spectra and median intensities. These components are given a weight to build an atypicality score (\(A\)) for each signal, which is then labelled as a suspicious case (failure
) if the score falls beyond the estimated tolerance limits (see help for screenSpectra
for options available). The atypicality score \(A\) is calculated as
\[A = \frac{S^{\lambda}}{(\texttt{med}+1)^{(1-\lambda)/2}},\]
where \(S\) is the robust scale estimator used (either Rousseeuw’s \(Q\) or the well-known median absolute deviance, MAD, estimator are available), \(\texttt{med}\) is the median intensity of the raw signal and \(\lambda\) distributes the weight given to each of these components (\(\lambda = 1/2\) by default).
The screenSpectra
function generates a table est.table
with individual information for each mass spectrum. The following code illustrates its use with default settings and creates a new scSpectra
class object called sc.results
. The function can take a matrix of metadata (type
in our case) as argument. This is convenient to easily filter problematic cases out in both the spectra and metadata sets for subsequent analysis. Ordinary R methods summary
and plot
can be applied on scSpectra
class objects to obtain a summary table and graphs of the results.
<- screenSpectra(spectra, meta = type)
sc.results summary(sc.results)
## (10 first mass spectra)
## ID A score Class
## 1 160408F21 0.1683850 success
## 2 160408F22 0.1716699 success
## 3 160408F23 0.1887859 success
## 4 160408F24 0.1478541 success
## 5 160408G01 0.1479799 success
## 6 160408G02 0.1734508 success
## 7 160408G03 0.1772491 success
## 8 160408G04 0.1534341 success
## 9 160408G05 0.1716697 success
## 10 160408G06 0.1538116 success
##
## ----------------------------
##
## Scale estimator: Q
## Method: adj.boxplot
## Threshold: 1.5
## Limits: [0.1196,0.2942]
## Deriv. order: 1
## Lambda: 0.5
## No. potentially faulty spectra: 19 (4.75 %)
plot(sc.results, labels = TRUE)
Among other features, the method plot
allows to visualise the distribution of the atypicality scores (type = "hist"
), customise the point labels and interactively display marked spectra for individual visual inspection (type = "casewise"
). When label = TRUE
it shows the position index of the mass spectra in the data set. It is advisable to visually inspect marked mass spectra, particular borderline cases, and further investigate any pattern. For example, in our illustrative data set we observe that mass spectra number 25 to 28 correspond to a series of technical replicates within the same biological replicates (see e.g. type[20:30,]
). Focusing on some of the most extreme \(A\) scores, the following figures display mass spectra 28, 29 and 87, which correspond with low signal intensity, ion suppression and flatline mass spectra respectively. Flatline mass spectra are produced when the instrument is unable to collect data that does not meet the minimum resolution and intensity requirements. These low-quality mass spectra are clearly identified using the \(A\) score as seen above.
plot(spectra[[28]])
plot(spectra[[29]])
plot(spectra[[87]])
As an example of the problems caused by faulty, low-quality spectra, the next figure shows the case of two peak profiles originated from the same biological material. It is evident that they are carrying very different information.
For the purpose of this tutorial we move on by updating the original spectra
and type
objects with the filtered collection of mass spectra and associated metadata provided by screenSpectra
through the fspectra
and fmeta
elements of the output.
<- sc.results$fspectra # Filtered list of mass spectra
spectra <- sc.results$fmeta # Filtered metadata type
A number of parameters are required to be set for the different pre-processing stages. The functions include safe default values, but parameter settings throughout pre-processing stages interact in complex ways. Using simulation, we customised some of the key parameter so as to obtain optimal combinations in terms of sensibility and false discovery rate in peak profiles. They have been slightly modified to accommodate the characteristics of the reduced example data set used here. It is convenient to write an initial section in the pre-processing pipeline script to set the values for the parameters which will be used later on. This facilitates testing results from different settings.
# Pre-processing parameter settings
<- 4 # Wavelet smoothing level
smoothLevel <- 105 # Baseline correction
ite <- 2 # Peak extraction
SigNoi <- 20 # Peak extraction
hws <- 0.003 # Peak binning tol
Signal denoising or smoothing is conducted in MALDIrppa
by undecimated discrete wavelet transform (UDWT) (Coombes et al. 2005) using the wavSmooting
function. Wavelets adapt well to changes in peak shape and the UDWT produces the same results if the start of the signal is shifted by a few time points (shift-invariance). This prevents from undesirable artifacts into the signal near either end of the spectrum due to large changes in wavelet coefficients as a consequence of small shifts in location. Alternatively, smoothing methods SavitzkyGolay
and MovingAverage
from the MALDIquant
package can be called directly from this function. Note that from version 1.1.0 of MALDIrppa
wavelet smoothing is conducted by maximal overlap discrete wavelet transformation and universal thresholding of coefficients based on methods available on the waveslim
package. If the previous implementation of the wavelet method is required for compatibility with your pre-processing pipelines you can download and install manually source files of version 1.0.5-1 from the archive of old sources of the package (https://CRAN.R-project.org/package=MALDIrppa); see also article/vignette about MALDIrppa v1.0.5-1.
In this example, smoothing is applied on square root-transformed (sqrt
function in R) signals using transfIntensity
for variance stabilisation. Note that in fact this function allows to apply any sensible user-defined transformation on the signal intensities (see MALDIrppa
documentation for an example). For baseline correction, the statistics-sensitive non-linear iterative peak-clipping (SNIP) algorithm as implemented in MALDIquant
generates positive intensities and provides better results with high matrix effects (pronounced mode). Regarding signal normalisation, MALDIquant
offers a number of sensible methods through the calibrateIntensity
function. In particular, probabilistic quotient normalisation (PQN) provides a robust method for scaling spectra, commonly originated from different dilutions, into a common overall concentration (Dieterie et al. 2006).
<- transfIntensity(spectra, fun = sqrt)
spectra <- wavSmoothing(spectra, method = "Wavelet", n.levels = smoothLevel)
spectra <- removeBaseline(spectra, method = "SNIP", iterations = ite)
spectra <- calibrateIntensity(spectra, method = "PQN") spectra
Note that in this illustrative pipeline we go on overwriting the MassSpectrum
object containing the list of signals as we apply the different operations to produce a clean and corrected collection of signals. Given the usual large size of the data set, this is convenient to save computer memory and speed up the process. However we do not then keep individual backup copies of the spectra at each pre-processing stage. If this is required, simply using the ordinary save(objectname, filename)
line in R after each stage will do. In any case, it is recommended to delete previous versions of the data from R’s workspace as soon as they are saved in order to prevent from memory usage problems.
Using detectPeaks
in MALDIquant
peak profiles are produced by searching for and extracting local maximum intensities along the corrected MS signals. A noise function is estimated (see estimateNoise
) and signal-to-noise ratio (SNR) and mass window size values are set in order to determine peaks corresponding to genuine peptide-induced signals. In particular, peaks are extracted if their intensity is greater than SNR times the estimated noise within a mass \(\pm\) half window size interval. Extracted peak lists still require alignment and binning across profiles. This is necessary to correct for slight mass drifts and arrange all peak profiles onto a common range of masses. A number of landmark peaks that are matched a minimum number of times across profiles serve as reference points for alignment, following procedures implemented in MALDIquant
(the minFreq
argument sets this as a proportion, 80% frequency in our example). Binning is then used to equalise masses of peaks found within a tolerable relative mass deviation range and, hence, assumed to be representatives of the same peptide (the tolerance
argument sets this as a proportion, 0.3% deviance in our case). The alignPeaks
function in MALDIrppa
is a wrapper for a number of MALDIquant
functions involved in this stage that simplifies the workflow into a single line of code. Moreover, alignPeaks
implements an additional binning round which we found useful to deal with misalignment issues that may be still present after putting the peak profiles through default MALDIquant
alignment and binning procedures. We faced this situation when working with mass spectra over a particularly high-resolution mass range.
<- detectPeaks(spectra, SNR = SigNoi, halfWindowSize = hws)
peaks <- alignPeaks(peaks, minFreq = 0.8, tolerance = tol) peaks
It is important to keep exploring the data throughout the pre-processing workflow in order to identify potential issues or unexpected results. Analogously to summarySpectra
used above, summaryPeaks
produces a table of summary statistics from the peak profiles. The following line of code shows for example details for the first 10 peak profiles.
summaryPeaks(peaks[1:10])
## ID No.MZ Min.MZ Max.MZ Min.Int Mean.Int Std.Int Med.Int MAD.Int Max.Int Min.SNR Mean.SNR Std.SNR Med.SNR MAD.SNR Max.SNR
## 1 1 17 2834.517 10498.76 1e-04 3e-04 2e-04 4e-04 2e-04 7e-04 2.0658 5.2588 2.5826 5.7045 3.0740 11.0548
## 2 2 19 2834.517 10498.76 1e-04 4e-04 2e-04 3e-04 2e-04 8e-04 2.1924 6.1643 3.1813 5.2446 3.4485 13.6566
## 3 3 16 2834.517 10498.76 2e-04 4e-04 2e-04 4e-04 2e-04 8e-04 2.2235 5.2367 2.4604 5.8425 3.1136 10.8363
## 4 4 19 2834.517 12185.69 1e-04 3e-04 2e-04 3e-04 2e-04 7e-04 2.0122 5.1029 2.7053 4.0903 3.0105 11.5476
## 5 5 17 2834.517 10498.76 1e-04 3e-04 2e-04 3e-04 2e-04 7e-04 2.4147 5.7237 2.7602 4.9986 3.7420 11.1606
## 6 6 22 2834.517 12185.69 1e-04 3e-04 2e-04 2e-04 1e-04 7e-04 2.1804 5.7572 3.5148 4.0762 2.1713 13.1585
## 7 7 18 2834.517 12185.69 1e-04 4e-04 2e-04 3e-04 2e-04 8e-04 2.2183 6.2045 3.3271 5.1182 3.2585 13.2845
## 8 8 18 2834.517 12185.69 1e-04 3e-04 2e-04 3e-04 2e-04 7e-04 2.2711 5.6168 2.9050 4.7738 2.6931 11.6444
## 9 9 20 2834.517 12185.69 1e-04 3e-04 2e-04 2e-04 2e-04 6e-04 2.0212 4.9953 2.7368 3.8578 2.5544 10.7881
## 10 10 26 2834.517 12185.69 1e-04 3e-04 2e-04 2e-04 1e-04 7e-04 2.0199 5.5798 3.6450 4.4211 2.5821 14.6057
The function countPeaks
directly counts the number of peaks in each profile. Either an obvious excess or scarcity of peaks in a profile may deserve a closer look. The following code produces the list of counts and plots them for overall comparison using the profile number as label.
<- countPeaks(peaks)
cP plot(cP, type = "n")
text(cP, label = 1:length(cP))
At this point we have essentially concluded data pre-processing and it is then convenient to save a backup of the peak profiles and associate metadata (note that in this example the file is created in a temporary location, the user should choose their preferred location).
save(peaks, file = file.path(tempdir(), "peaks.pp.Rdata"))
save(type, file = file.path(tempdir(), "type.pp.Rdata"))
The peakPatterns
function is useful to visualise and explore the resulting peak patterns. It displays peak presence (coloured cells) and absence (blank cells). The barplot on the top reflects the frequency of a peak across replicates. The function can be applied either on a list of MassPeaks
objects, like our peaks
, or on an intensity matrix. Any remaining data artifacts or hints about peak features differentiating e.g. strains should be evident here. This display can also be useful to check for homogeneity of replicates from an isolate by simply subsetting the replicates within that same isolate (see ?peakPatterns
for more details and examples).
peakPatterns(peaks)
Note that due to process disturbances, chemical contamination or human errors for example, some peak profiles may exhibit patterns beyond anticipated biological variability at a level at which certain homogeneity is expected, say at biological and technical replication level. This is not related to low-quality, data acquisition problems as before, the signals are correct, it is instead a case of outlying peak profiles characterised for its large distance from the main mass of data points. They usually include relevant information about certain artifacts or substructures in the data, although in practice it is not always straightforward to find the underlying reasons for their deviating behaviour. As it is well-known in general data modelling, outliers can heavily influence e.g. correlations, fitting in regression models or similarity measures used by classification algorithms. Outliers in our context are often sought after on a peptide-wise basis (Erhard and Zimmer 2012), but we adhere to the idea that peaks of outlying profiles do not necessarily stand out individually. They can simply show an atypical relative pattern with combinations of peaks that occur very rarely. Working within a robust statistics framework we consider the multidimensional and inter-dependent structure of the data for outlier detection (Rocke and Woodruff 1996; Maronna, Martin, and Yohai 2006). In particular, we adapt the multivariate outlier detection method proposed by Filzmoser, Garrett, and Reimann (2005) (adaptive outlier detection, AOD) which allows the boundaries for a peak profile to be considered as potential outlier to be adjusted to the number of replicates involved. By using a multidimensional scaling (MDS) coordinate projections of the data, the approach can be applied at different levels of replication and on varied data formats, e.g. either peak intensities or peak presence/absence binary data. This customised procedure is implemented in the detectOutliers
function. It generates a logical vector flagging out potential outlying replicates at a given level of aggregation, e.g. at isolate level as set using the argument by
in the following line of code. This can be used to filter outliers out if that is judged adequate. The argument binary
facilitates the use of the method on either peak presence/absence profiles (binary = TRUE
) or raw peak intensity profiles (binary = FALSE
).
<- detectOutliers(peaks, type$Isolate, binary = TRUE) out
The following figure illustrates the result from a collection of 20 replicates from the same bacterial isolate. Note that the method was applied on peak presence/absence profiles. Quantile ellipsoids based on the robust Mahalanobis distance determine outlierness thresholds. The graph displays ellipsoids at increasing quantiles, including the adjusted quantile ellipsoid produced by the AOD method. The profiles, projected onto a 2-dimensional space by MDS, falling beyond the adjusted quantile boundaries were marked as potential outliers. In this case there is evidence of a set of 4 technical replicates within a biological replicate that were probably contaminated samples (distant points located on the right-hand side of the graph).
The identification of outliers can be sometimes a primary purpose of data analysis, as they might reveal hidden structures in the data of biological interest. Other times outliers are a nuisance resulting from errors or contamination, and it might be then convenient to discard or downweight them prior to the actual data modelling. MALDIrppa
provides a exploratory tool to inform of samples that are considerably different from the majority. The final decision about their biological meaning and how to deal with them will depend on the assessment of the researcher. To simplify the exposition we here proceed by discarding outlying peak profiles.
<- peaks[out[,2] == FALSE] # Discard outlying peak profiles
peaks.clean <- type[out[,2] == FALSE,] # and corresponding metadata type.clean
The inherent nature of the data implies that pre-processed data may still contain peaks only reflecting noise and probably irrelevant signals. MALDIquant
facilitates further cleaning by eliminating rare peaks across replicates using the filterPeaks
function. This is done by setting a minimum frequency (minFreq
argument) for a peak at a desired level of aggregation (isolate level in our case). Note that in MALDIquant
the argument labels
is used to set the grouping factor. The following line discards peaks occurring in less than 25% replicates within isolate.
<- filterPeaks(peaks.clean, minFreq = 0.25, labels = type.clean$Isolate) peaks.clean.f
We can compare these filtered peak profiles with the more noisy raw peaks profiles above using peakPatterns
.
peakPatterns(peaks.clean.f)
Finally replicates are typically merged to obtain a single composite representative peak profile by isolate. In accordance with our robust approach we propose to do this by computing the median peak profile across replicates within isolate. This contributes to downgrade the influence of extreme peak intensities on the pooled estimates. The associated metadata should be also aggregated in the same way. Here we use the aggregate
function available in R base, but other alternatives are available through several R packages. Note that we specify FUN = length
simply as a trick, as we are not really interested in any summary of the metadata. Moreover, we only keep the information about the isolate, which is in the first column of the resulting table (note that in this illustrative example type.clean.fm
becomes a vector after this).
<- mergeMassPeaks(peaks.clean.f, labels = type.clean$Isolate, method = "median")
peaks.clean.fm <- aggregate(type.clean, list(Isolate = type.clean$Isolate), FUN = length)[,1] type.clean.fm
Statistical and machine learning methods typically work on data matrices. The intensityMatrix
function in MALDIquant
converts a list of MassPeaks
objects into a matrix. The replicates or composite profiles are arranged by rows, and the columns correspond with the list of transformed intensities along the common \(m/z\) range determined after peak alignment and binning. When a peak is not present at a certain mass position the cell is given a NA
value by default. The final intensity matrix is obtained as follows.
<- intensityMatrix(peaks.clean.fm) int.clean.fm
The intensity matrix and associated metadata can be stored in different formats to facilitate its use as input in, for example, specialised bioinformatics software packages. The following code uses simple functions to save them as CSV text files and also in the popular NEXUS format, which includes options to save either peak intensities or binary patterns and to add peak weights among others (see ?writeIntensity
help files for more details). Note that in this example the file is saved to a temporary location, the user should choose their preferred destination. The argument labels
can be used to add labels to the pre-processed samples. For this example data set, we use the spectra IDs stored in type.clean.fm
, which became a vector containing only the IDs in a previous stage. Thus, using the $
operator for matrices is not required to select the corresponding ID column here.
writeMetadata(type.clean.fm, file = file.path(tempdir(),"type.clean.fm"),
format = "csv") # save metadata
writeIntensity(int.clean.fm, file = file.path(tempdir(),"int.clean.fm"),
format = "csv", labels = type.clean.fm)
writeIntensity(int.clean.fm, file = file.path(tempdir(),"int.clean.fm"),
format = "NEXUS", labels = type.clean.fm)
The argument binary
can be used to export data in binary (peak presence/absence) format. Note that this is internally turned TRUE
whenever the NEXUS or FASTA format is chosen. As an example, the next figure shows a phylogenetic network contructed using the NeighborNet algorithm implemented in the SplitsTree software package (https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/splitstree/) directly from the NEXUS file generated above by MALDIrppa
.