
Since the XtX matrix depends on the relative magnitudes of the individual x vectors, the z vectors depend on a possible scaling of the original x descriptors.
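This scaling dependence is easy to demonstrate numerically. The following sketch (NumPy, with synthetic data; all names are illustrative and not from the text) shows that rescaling a single x descriptor, e.g. by a change of units, redirects the leading eigenvector of the XtX matrix towards that descriptor:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 samples, 3 descriptors

# Eigenvectors of XtX for the raw descriptors
_, vecs_raw = np.linalg.eigh(X.T @ X)

# Rescale one descriptor (e.g. change its units by a factor of 100)
X_scaled = X.copy()
X_scaled[:, 0] *= 100.0
_, vecs_scaled = np.linalg.eigh(X_scaled.T @ X_scaled)

# The leading eigenvector now points almost entirely along the rescaled axis
print(vecs_raw[:, -1])
print(vecs_scaled[:, -1])
```

For this reason the x descriptors are usually centred and scaled to unit variance before the XtX matrix is formed.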

The idea in a Principal Component Analysis (PCA) is to use the z variables as the descriptive variables. If all M eigenvectors are used, the result is identical to using the original x variables (i.e. MLR). The premise of the PCA method, however, is to include only a few (J) z variables, selected according to their eigenvalues. A series of multiple linear regressions is performed using more and more eigenvectors: first z1, then z1 and z2, then z1, z2 and z3, etc. At each stage the predictive capability of the model is calculated, for example quantified by Q2. If the original x data have a reasonable correlation with the y data, a plot of Q2 against the number of variables included will typically display an initial steep increase, but then level off or even start to decrease slightly as the number of latent variables is increased. The point where Q2 levels off indicates that the optimal number of components has been reached, i.e. at this point the predictive power of the model cannot be increased further by including more components.
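A minimal sketch of this procedure (assuming NumPy and leave-one-out cross-validation as the definition of Q2; the synthetic data and function names are illustrative) could look like:

```python
import numpy as np

def q2_pcr(X, y, n_components):
    """Leave-one-out Q2 for a principal-component regression using the
    n_components z variables with the largest eigenvalues (a sketch;
    the data are zero-mean by construction, so no intercept is fitted)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        # z variables: eigenvectors of XtX (eigh returns them in
        # ascending order, so the top-J are the last J columns)
        _, vecs = np.linalg.eigh(Xtr.T @ Xtr)
        V = vecs[:, -n_components:]
        coef, *_ = np.linalg.lstsq(Xtr @ V, ytr, rcond=None)
        pred = (X[i] @ V) @ coef
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic data: y depends mainly on two descriptor directions
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
y = X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=40)

for j in range(1, 7):
    print(j, round(q2_pcr(X, y, j), 3))
```

With all six components the model is equivalent to MLR on the original descriptors; the plot of Q2 against j shows the levelling-off behaviour described above.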

The main problem with the PCA method is that some of the x variables may not be particularly good at describing the y variables, i.e. the first few PCA vectors describing the largest variation among the x variables may correlate poorly with the variation in the y data. In such cases, a global optimization search can be made for a model based on a relatively small number of components selected from all the PCA vectors with eigenvalues above a suitable cutoff.
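Such a search can be sketched as follows (NumPy; an exhaustive search is used here for simplicity, which is only feasible for a small number of retained PCA vectors, and for brevity the PCA vectors are computed once on the full data rather than refitted in each cross-validation fold):

```python
import numpy as np
from itertools import combinations

def loo_q2(Z, y):
    """Leave-one-out Q2 for a least-squares fit in the given latent variables."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(Z[mask], y[mask], rcond=None)
        press += (y[i] - Z[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
X[:, 4] *= 10.0                 # large-variance descriptor unrelated to y
y = X[:, 1] + 0.05 * rng.normal(size=30)

# Keep the PCA vectors with eigenvalues above a cutoff
vals, vecs = np.linalg.eigh(X.T @ X)
keep = [j for j in range(5) if vals[j] > 1.0]
Z = X @ vecs[:, keep]

# Exhaustive search for the component subset giving the highest Q2
best = max(
    (cols for r in range(1, len(keep) + 1)
          for cols in combinations(range(len(keep)), r)),
    key=lambda cols: loo_q2(Z[:, cols], y),
)
print(best, round(loo_q2(Z[:, best], y), 3))
```

A real application with many retained vectors would replace the exhaustive enumeration with a stochastic global optimizer.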

The Partial Least Squares (PLS, also sometimes called Projection to Latent Structures) method attempts to improve the selection of the latent variables by weighting the X matrix with the y vector prior to diagonalization, i.e. diagonalizing the XtyytX matrix (equivalent to (ytX)t(ytX)) instead of XtX. This ensures that the eigenvectors with the largest eigenvalues are biased towards describing the variation in y. The only difference between PCA and PLS is thus in how the latent z variables are generated: either by diagonalization of the XtX matrix, or of the corresponding y-weighted matrix. The PLS latent variables are naturally ordered according to their ability to describe the y variation, which removes the need for a combinatorial search for which latent vectors to use in the regression. In favourable cases, a plot of Q2 against the number of PLS components will rapidly reach a maximum, providing a compact model with good predictive capabilities.
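The difference between the two latent-variable constructions can be illustrated with a small NumPy sketch (synthetic data; note that for a single y vector the XtyytX matrix is rank one, so its only eigenvector with a nonzero eigenvalue is simply the normalized Xty vector):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))
X[:, 3] *= 3.0                      # high-variance descriptor unrelated to y
y = X[:, 0] + 0.1 * rng.normal(size=400)

# First PCA latent direction: eigenvector of XtX with the largest eigenvalue
_, vecs = np.linalg.eigh(X.T @ X)
w_pca = vecs[:, -1]

# First PLS weight: leading eigenvector of XtyytX; since this rank-one
# matrix has a single nonzero eigenvalue, it is just Xty normalized
w = X.T @ y
w_pls = w / np.linalg.norm(w)

print(w_pca)   # dominated by the high-variance but irrelevant descriptor
print(w_pls)   # dominated by the descriptor that actually correlates with y
```

The first PCA vector follows the largest variance in X regardless of y, while the first PLS vector is pulled towards the descriptor that carries the y variation.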

A disadvantage of the PLS method is the inherent bias towards selecting latent variables describing noise in the y data, i.e. x variables that only have a small internal variation but that correlate with the noise in the y data are selected as important. For this reason, x variables with small internal variance over the y data points are often removed from the descriptor data set prior to performing the PLS analysis. This preselection procedure, however, requires user involvement, and it is not always easy to decide which variables to remove. Unfortunately, the predictive capabilities of a PLS model are often sensitive to the elimination of one or more x variables. A global optimization scheme may again be employed in such cases, i.e. performing a search for which x variables to remove from the PLS analysis in order to provide a model with a high Q2 value.
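A sketch of such a variable-selection search (NumPy; plain least squares is used in place of a full PLS fit to keep the example self-contained, and the exhaustive enumeration would be replaced by a global optimizer for realistic descriptor counts):

```python
import numpy as np
from itertools import combinations

def loo_q2(X, y):
    """Leave-one-out Q2 for a plain least-squares model (stand-in for PLS)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        press += (y[i] - X[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 5))
y = X[:, 0] - X[:, 2] + 0.05 * rng.normal(size=25)

# Search over which descriptors to keep, maximizing Q2
best = max(
    (cols for r in range(1, 6) for cols in combinations(range(5), r)),
    key=lambda cols: loo_q2(X[:, cols], y),
)
print(best, round(loo_q2(X[:, best], y), 3))
```

The search recovers the descriptors that actually determine y and discards those that only contribute noise to the model.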
