This release brings an updated implementation of the PLS algorithm (SIMPLS), which is more numerically stable and gives considerably fewer warnings about using too many components when you work with small y-values. The speed of the pls() method in general has also been improved.
Another important change is that cross-validation of regression and classification models has been rewritten and simplified. You can now also use your own custom splits by providing a vector with segment indices associated with each measurement. For example, if you run PLS with parameter cv = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2), it is assumed that you want to use a venetian blinds split with four segments and that your dataset has 10 measurements. See more details in the tutorial, where the description of the cross-validation procedure has been moved to a separate section.
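Here is a minimal sketch of such a custom split. The small simulated dataset and the number of components are arbitrary and only for illustration; the cv vector is the one from the example above:

```r
library(mdatools)
set.seed(42)

# simulated data: 10 measurements, 5 variables and one response
X <- matrix(rnorm(50), nrow = 10)
y <- rnorm(10)

# segment index for every measurement; this particular vector is
# equivalent to a venetian blinds split with four segments
m <- pls(X, y, ncomp = 3, cv = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2))
summary(m)
```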
Other changes and improvements:
* Refactoring and improvements of prep.savgol() code made the method much faster (up to 50-60 times faster for datasets with many measurements).
* Refactoring and improvements of prep.alsbasecorr() code made the method 2-3 times faster, especially for large datasets.
* added a new plotting method, plotRMSERatio(), for regression models (inspired by this post by Barry M. Wise).
* added the PQN normalization method to the prep.norm() function.
* fixed a bug in vipscores() which could lead to slightly higher values for PLS2 models.
* fixed several small bugs and made general improvements.
This release is mostly about preprocessing: some new methods have been added, the existing ones improved, and it is now possible to combine preprocessing methods together (including parameter values) and apply them all in the correct sequence. See the preprocessing section in the tutorials for details.
* method prep.norm() for normalization of spectra (or any other signals) is more versatile now and supports normalization to unit sum, length or area, to height or area under an internal standard peak, and SNV. SNV via prep.snv() is still supported for compatibility.
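As a rough sketch, the different normalization types could be applied as shown below. The argument name type and the value names ("area", "sum", "snv") are assumptions based on the description above, and the simdata set shipped with the package is used only for illustration; check ?prep.norm for the actual interface:

```r
library(mdatools)
data(simdata)

# calibration spectra from the simulated dataset
spectra <- simdata$spectra.c

# normalize to unit area, to unit sum and apply SNV
# (argument and value names are assumptions, see ?prep.norm)
spectra.area <- prep.norm(spectra, type = "area")
spectra.sum  <- prep.norm(spectra, type = "sum")
spectra.snv  <- prep.norm(spectra, type = "snv")
```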
* prep.savgol() has been rewritten to fix a minor bug which inverted the first derivative, and also to improve the handling of the edge points. See details in the help text for the function and in the tutorial.
* added a new method, prep.transform(), which can be used to transform values of e.g. a response variable to handle non-linearity.
* added a new method, prep.varsel(), which makes it possible to select particular variables as a part of the preprocessing framework. For example, you can apply baseline correction, normalization and noise suppression to the whole spectra and after that select only a particular part for modelling.
* added a new method, prep(), which lets you combine several preprocessing methods and their parameters into a list and use it, e.g., as a part of a model.
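Below is a minimal sketch of how such a combined sequence could look. The way the list is applied to the data (here via employ.prep()) and the exact parameter values are assumptions based on the description above; check ?prep for the actual interface:

```r
library(mdatools)
data(simdata)
spectra <- simdata$spectra.c

# combine several preprocessing methods (with their parameters) into one list
myprep <- list(
   prep("snv"),
   prep("savgol", list(width = 15, porder = 2, dorder = 1))
)

# apply the whole sequence to the spectra in the given order
# (employ.prep() as the applying function is an assumption)
pspectra <- employ.prep(myprep, spectra)
```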
* fixed a bug in mcrals() which on rare occasions could lead to a wrong error message.
* fixed a bug where attribute yaxis.value was used as ylab when creating line and bar plots.
* fixed an earlier reported issue with plotXYResiduals (#100).
* function employ(), which was used to employ constraints in MCR-ALS, has been renamed to employ.constraint(). The function is for internal use and this change should not cause any issues in your code.
* the user guides have been revised and improved.
* an issue related to .Machine$longdouble.eps, which led to an error when the package is tested on Apple M1, has been fixed.
* added the possibility to provide partially known contributions (parameter cont.forced) or spectral values (parameter spec.forced) to mcrals(). See more in the help text and user guide for the package.
* added the possibility to run iPLS using a test set (parameters x.test and y.test) instead of cross-validation (see the sketch after this list).
* added the possibility to provide user defined indices of the purest variables in mcrpure() instead of detecting them automatically.
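A minimal sketch of an iPLS call with a test set. The simdata set shipped with the package is used only for illustration, the glob.ncomp and int.num values are arbitrary, and x.test and y.test are used as described above:

```r
library(mdatools)
data(simdata)

# calibration and test sets from the simulated spectral data
X  <- simdata$spectra.c
y  <- simdata$conc.c[, 2]
Xt <- simdata$spectra.t
yt <- simdata$conc.t[, 2]

# interval PLS validated with the test set instead of cross-validation
m <- ipls(X, y, glob.ncomp = 4, int.num = 10, x.test = Xt, y.test = yt)
summary(m)
```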
* fixed bug #98, which caused a drop of row names when a data frame was used as a data source for PCA/SIMCA.
* fixed bug #99, which did not allow using user defined indices of pure variables in mcrpure().
* added Procrustes cross-validation method, pcv() (it is also available as a separate project).
* added Kubelka-Munk transformation for diffuse reflectance spectra (prep.ref2km()).
* fixed bug #94, which caused wrong limits in the PCA distance plot when outliers are present but excluded.
* fixed bug #95, which led to issues when PLS regression methods (e.g. plotRMSE()) are used for a PLS-DA model object.
* added an additional check that parameter cgroup for plotting functions is provided as a vector or as a factor, to avoid confusion.
* added a link to the YouTube channel with a chemometrics course based on the mdatools package.
* fixed an issue which led to a bug in simcam.getPerformanceStats, returning implausible and asymmetrical results (thanks to @svonallmen).
* fixed a small issue sometimes giving a warning when running tests on CRAN (it did not influence the user experience).
* mcrpure() method has been modified to avoid potential issues with the original patented version.
* added a new method, mcrals(), implementing multivariate curve resolution based on alternating least squares. The method uses one of three solvers (OLS, NNLS, FC-NNLS) together with several basic constraints (non-negativity, normalization, closure, etc.). It is also possible to create and use user-defined constraints as well as combine them with the implemented ones. A sketch using the new carbs dataset is shown after this group of items.
* added a new method, mcrpure(), implementing multivariate curve resolution based on the purity approach (also known as SIMPLISMA).
* added a new preprocessing method, prep.alsbasecorr(), implementing baseline correction with asymmetric least squares. It preserves all important data attributes, similar to other preprocessing methods.
* added a new dataset, carbs, with Raman spectra of ribose, glucose and fructose and simulated spectra of their mixtures. The dataset is aimed at testing and trying the curve resolution methods.
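A minimal sketch of an MCR-ALS call on the new carbs dataset. The constraint() helper, the cont.constraints/spec.constraints argument names and the assumption that the mixture spectra are stored in carbs$D are not stated above, so treat this only as a sketch and check ?mcrals and ?carbs:

```r
library(mdatools)
data(carbs)

# resolve three components from the simulated mixture spectra using MCR-ALS
# with non-negativity constraints for both contributions and spectra
m <- mcrals(carbs$D, ncomp = 3,
   cont.constraints = list(constraint("nonneg")),
   spec.constraints = list(constraint("nonneg")))
summary(m)
```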
* fixed bug #88, which appeared when the initial number of components in a PLS model is too large. From v. 0.10.3, in this case the algorithm warns the user and reduces the maximum number of components automatically. But if cross-validation is used, sometimes for a local cross-validation model this number should be even smaller (because the local calibration subset has fewer observations). In this case the pls() method will raise an error and ask the user to limit the maximum number of components and run the model again.
* main model methods (pls(), pca(), etc.) now do an additional check for the consistency of the provided datasets.
* opacity option in plots.
* fixed bug #85, when using y-values as a data frame gave an error in PLS regression.
* fixed bug #86 and changed the way PLS limits the maximum number of components to avoid problems with singular matrices. Now if the PLS algorithm finds during calculations that the provided number of components is too large, it gives a warning and reduces this number.
* code refactoring and tests for preprocessing methods.
* fixed a bug in the categorize.pls() method, which could give wrong results for test set measurements (see issue #82).
* small improvements to plotExtreme.pca() so the user can specify additional parameters, such as, for example, cex. If the plot is made for several components, you can now specify just one value for all points (e.g. color of points or marker symbol).
* parameter show.limits in methods plotResiduals.pca(), plotXResiduals.pls() and plotXYResiduals.pls() can now take two logical values: the first for the extreme limit and the second for the outlier limit. So you can show only one of the two limits on the plot. If one value is specified it will be used for both limits.
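A minimal sketch of this option, assuming a PCA model made for the people dataset shipped with the package (used here only for illustration):

```r
library(mdatools)
data(people)

m <- pca(people, ncomp = 4, scale = TRUE)

# show only the extreme limit (first value) and hide the outlier limit
plotResiduals(m, show.limits = c(TRUE, FALSE))
```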
* new function plotHotellingEllipse() adds a Hotelling T2 ellipse to any scatter plot (of course it is made first of all for PCA and PLS scores plots). The function works similarly to plotConvexHull() and plotConfidenceEllipse(); see the help for examples.
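A minimal sketch, assuming (as described elsewhere in these notes) that scatter plots return a plot data object which can be passed to the ellipse function; the people dataset is used only for illustration:

```r
library(mdatools)
data(people)

m <- pca(people, ncomp = 4, scale = TRUE)

# make a scores plot and add a Hotelling T2 ellipse on top of it
p <- plotScores(m)
plotHotellingEllipse(p)
```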
* fixed a bug in summary() method for PLS, which worked incorrectly in case of several response variables (PLS2).
Many changes have been made in this version, but most of them are under the hood. The code has been refactored significantly in order to improve its efficiency and make future support easier. Some functionality has been rewritten from scratch. Most of the code is backward compatible, which means your old scripts should run with this version without problems. However, some changes are incompatible and this can lead to occasional errors and warning messages. All details are shown below; pay special attention to the breaking changes part.
Another important thing is the way cross-validation works starting from this version. It was decided to use cross-validation only for computing performance statistics, e.g. error of predictions in PLS or classification error in SIMCA or PLS-DA. Decomposition results, such as explained variance or residual distances, are not computed for cross-validation anymore. It was a bad idea from the beginning, as the way it was implemented is not fully correct: distances and variances measured for different local models should not be compared directly. After a long consideration it was decided to implement this part in a more correct and conservative way.
Finally, all model results (calibration, cross-validation and test set validation) are now combined into a single list, model$res. This makes a lot of things easier. However, the old way of accessing the result objects (e.g. model$calres or model$cvres) still works; you can access e.g. calibration results both using model$res$cal and model$calres, so this change will not break compatibility.
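A small illustration of the two equivalent ways of accessing the calibration results (a PCA model for the people dataset is used here only as a quick example):

```r
library(mdatools)
data(people)

m <- pca(people, ncomp = 4, scale = TRUE)

# new way: all results are combined into the model$res list
res.new <- m$res$cal

# old field still works for backward compatibility
res.old <- m$calres
```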
Below is a more detailed list of changes. The tutorial has been updated accordingly.
Here are the changes which can potentially lead to error messages in previously written code.
* Cross-validation results are no longer available for PCA (as mentioned above), so any use of the model$cvres object for a PCA model will lead to an error. For the same reason pca() does not take the cv parameter anymore.
* Method plotModellingPower() is no longer available (it was used for SIMCA models).
* Method plotResiduals() is no longer available for SIMCAM models (multiclass SIMCA); use the corresponding method for individual models instead.
* Selectivity ratio and VIP scores are no longer part of a PLS model. This is done to make the calibration of models faster. Use selratio() and vipscores() to compute them. Functions plotSelectivityRatio() and plotVIPScores() are still available but they both compute the values first, which may take a bit of time on large datasets. This change makes parameter light superfluous and it is no longer supported in pls().
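For example (a sketch using the simdata set shipped with the package; the ncomp argument of the two functions is an assumption, check their help pages):

```r
library(mdatools)
data(simdata)

m <- pls(simdata$spectra.c, simdata$conc.c[, 2], ncomp = 4)

# compute the values on demand instead of at calibration time
vip <- vipscores(m, ncomp = 3)
sr  <- selratio(m, ncomp = 3)
```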
* Two other parameters which are no longer needed when you use pls() are coeffs.ci and coeffs.alpha. Jack-knifing based confidence intervals for regression coefficients are now computed automatically every time you use cross-validation. You can specify the significance level for the intervals when you either visualize them using plot.regcoeffs() or plotRegcoeffs() for a PLS model, or when you get the values using getRegcoeffs().
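A minimal sketch, assuming the significance level is passed via an alpha argument and the intervals are switched on with show.ci (both argument names are assumptions, check the help pages for the actual interface):

```r
library(mdatools)
data(simdata)

# cross-validation triggers jack-knifing of the regression coefficients
m <- pls(simdata$spectra.c, simdata$conc.c[, 2], ncomp = 4, cv = 4)

# plot coefficients with confidence intervals and get their values
# ("show.ci" and "alpha" argument names are assumptions)
plotRegcoeffs(m, ncomp = 3, show.ci = TRUE)
coeffs <- getRegcoeffs(m, ncomp = 3, alpha = 0.05)
```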
* When you make a prediction plot for any classification model, you should specify the name of the result object to show the predictions for. In old versions the names of the results were "calres", "cvres" and "testres". From this version they have been changed to "cal", "cv" and "test" correspondingly.
* In PLS-DA there was a possibility to show predictions not for the classification results but for the regression model the PLS-DA is built upon, using the following code: plotPredictions(structure(model, class = "pls")). From this version you should use plotPredictions(structure(model, class = "regmodel")) instead, as the plotPredictions() function for regression has been moved from the pls class to its parent, more general class, regmodel.
* In methods plotCorr() and plotHist() for the randomization test, parameter comp has been renamed to ncomp. Parameter comp assumes the possibility to specify several values as a vector, while ncomp assumes only one value, which is the case for these two plots.
* In the regression coefficients plot the logical parameter show.line has been replaced with the more general show.lines from mdaplot().
* plotPredictions() method for models and results is now based on mdaplot() (not mdaplotg() as before) and does not support arguments for e.g. legend position, etc.
* build status badge in GitHub.
* mdaplot() now returns an object with plot data (plotseries class), which can be used for extra options, e.g. adding a convex hull (see the sketch after this list).
* new default color scheme (use colmap="old" if you don't like it).
* plotConvexHull() adds a convex hull for groups of points on any scatter plot.
* plotConfidenceEllipse() adds a confidence ellipse for groups of points on any scatter plot.
* opacity can now be used with mdaplotg() plots and can be different for each group.
* mdaplot() and mdaplotg() based plots can now take parameters grid.col and grid.lwd for tuning the grid look.
* point symbols pch=21...25 can be used with col and bg parameters.
* density plot (type="d") is now based on hexagonal binning, which is very fast for large data (>100 000 rows).
* mdaplotyy() creates a line plot for two line series with a separate y-axis for each.

As mentioned above, the biggest change which can potentially lead to some issues with your old code is that cross-validation is no longer available for PCA models.
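To illustrate the plot data objects and plotConvexHull() mentioned in the list above, here is a minimal sketch; the people dataset and its column names are used only for illustration:

```r
library(mdatools)
data(people)

# scatter plot with color grouping made from a factor
p <- mdaplot(people[, c("Height", "Weight")], type = "p",
   cgroup = as.factor(people[, "Sex"]))

# add a convex hull around each group using the returned plot data
plotConvexHull(p)
```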
Other changes:
* Default value for the lim.type parameter is "ddmoments" (before it was "jm"). This changes the default method for computing critical limits for orthogonal and score distances.
* Added new tools for assessing the complexity of a model (e.g. DoF plots, see tutorial for details).
* More options available for the analysis of residual distances (e.g. marking objects as extremes, etc.).
* Method setResLimits() is renamed to setDistanceLimits() and has an extra parameter, lim.type, which allows changing the method for critical limits calculation without rebuilding the PCA model itself.
* Extended output for summary() of a PCA model, including DoF for distances (Nh and Nq).
* plotExtreme() is now also available for a PCA model (it was used only for SIMCA models before).
* For most of the PCA model plots you can now provide a list with result objects to show the plot for. This makes it possible to combine, for example, results from the calibration set and new predictions on the same plot.
* You can now add a convex hull or confidence ellipse to groups of points on a scores or residuals plot made for a result object.
* New method categorize() allows categorizing data rows as "regular", "extreme" or "outliers" based on residual distances and the corresponding critical limits (see the sketch below).
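A minimal sketch of the new distance related tools, assuming the second argument of categorize() is the result object to categorize (check the help pages for the exact interface); the people dataset is used only for illustration:

```r
library(mdatools)
data(people)

m <- pca(people, ncomp = 4, scale = TRUE)

# change the method for critical limits without rebuilding the model
m <- setDistanceLimits(m, lim.type = "jm")

# categorize calibration objects as regular, extreme or outlier
groups <- categorize(m, m$res$cal)
table(groups)
```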
* plotResiduals.simcam() and plotResiduals.simcamres() are not available anymore (both were a shortcut for plotResiduals.simca(), which was superfluous).
* confint() method, which returns confidence intervals (if the corresponding statistics are available).
* show.line is replaced with show.lines from mdaplot().

As mentioned above, the PLS calibration has been simplified, so selectivity ratio and VIP scores are not computed automatically when a PLS model is created. This makes the calibration faster and makes the parameter light unnecessary (it has been removed). Also, jack-knifing is used every time you apply cross-validation, so there is no need to specify the parameters coeffs.alpha and coeffs.ci anymore (both have been removed). It does not lead to any additional computational time and therefore it was decided to do it automatically.
Other changes are listed below:
* summary() output has been slightly improved.
* plotWeights() for creating a plot with PLS weights.
* selratio() for computing selectivity ratio.
* getSelectivityRatio() is deprecated and shows a warning (use selratio() instead).
* plotSelectivityRatio() computes the ratio values first, which makes it a bit slower.
* vipscores() for computing VIP scores.
* getVIPScores() is deprecated and shows a warning (use vipscores() instead).
* plotVIPScores() computes the score values first, which makes it a bit slower.
* venetian blinds cross-validation ("ven") now takes into account the order of response values, so there is no need to order data rows in advance.
* lim.type parameter (default value "ddsimca"); the X-residuals plot shows the limits.
* plotXYResiduals() shows a distance/residuals plot for both X (full distance) and Y.
* categorize() allows categorizing data rows based on PLS results and critical limits computed for X- and Y-distance.
* regres methods
* regres methods
* cv
* Y cumexpvar
* cex parameter for group plots (can be specified differently for each group).
* if a single cex value is specified it will also be applied to legend items.
* max.cov in prep.autoscale() (#59).
* ipls() method, plus fixed a bug preventing breaking the selection loop (#56).
* selectCompNum() related to the use of the Wold criterion (#57).
* max.cov parameter in prep.autoscale() (#58).
* max.cov value in prep.autoscale() is set to 0 (to avoid scaling only of constant variables).
* prep.autoscale()
* opacity parameter for semi-transparent colors.
* plotExtreme() method for SIMCA models.
* setResLimits() method for PCA/SIMCA models.
* plotProbabilities() method for SIMCA results.
* getConfusionMatrix() method for classification results.
* plotPrediction() for PLS results.
* plotPrediction() for PLS results.
* pls.getRegCoeffs() now also returns standard error and confidence intervals calculated for unstandardized variables.
* summary() for an object with regression coefficients (regcoeffs).
* mdaplot for a data frame with one or more factor columns: the factors are now transformed to dummy variables (before it led to an error).
* using a factor with more than 8 levels for color grouping in mdaplots led to an error.
* wrong calculation of eigenvalues in the NIPALS algorithm in pca.
* lab.cex and lab.col are now also applied to colorbar labels.
* docs folder.
* mdaplot() and mdaplotg() were rewritten completely and are now easier to use (check the tutorial).
* 'd' plot type for a density scatter plot.
* xlas and ylas parameters in plots to rotate axis ticks.
* plotBiplot()
* cgroup if there is no test set.
* prep.autoscale() now does not scale columns with a coefficient of variation below a given threshold.
* prep.norm
* getRegcoeffs was added to the PLS model.
* cgroup for plots now works correctly with factors (including ones with text levels).
* lab.col and lab.cex for changing the color and font size of data point labels.
* ?randtest
* ?crossval
* roxygen2 package.
* classres class for representation and visualisation of classification results.
* xticklabels and yticklabels parameters for the mdaplot and mdaplotg functions.
* simca and simcares classes for one-class SIMCA model and results.
* simcam and simcamres classes for multiclass SIMCA model and results.
* plsda and plsdares classes for PLS-DA model and results.
* selectCompNum(model, ncomp) instead of pls.selectncomp(model, ncomp), test.x and test.y instead of Xt and yt, and separate logical arguments center and scale instead of the previously used autoscale. By default scale = F and center = T (see ?pls).
* mdaplot or mdaplotg functions, which extend the basic functionality of R plots: for example, they allow making color groups and a colorbar legend, calculating limits automatically depending on the elements on a plot, making an automatic legend and many other things. A small sketch is shown below.
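To give an idea of the kind of functionality described in the last item, here is a minimal sketch; the people dataset and its column names are used only for illustration:

```r
library(mdatools)
data(people)

# scatter plot where point colors are mapped to a numeric variable,
# so a colorbar legend is shown automatically
mdaplot(people[, c("Height", "Weight")], type = "p",
   cgroup = people[, "Beer"])
```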