Readme file

SERIES C  
Applied Statistics

Simple component analysis
V. Rousson and Th. Gasser
Appl. Statist., 53 (2004), 539 - 555

This file contains information on the interactive S-Plus procedure `sca'
for simple component analysis, whose code can be found in the file `sca.s'.

Further useful references are:

Rousson, V. and Gasser, Th. (2003). Some Case Studies of Simple Component
Analysis. Manuscript available at http://www.unizh.ch/biostat/Manuscripts/.

Gervini, D. and Rousson, V. (2004).
Criteria for evaluating dimension-reducing components for multivariate data.
The American Statistician. To appear.

Download of the S-Plus code
-------------------------------

To load the interactive procedure `sca', as well as the associated procedures
(such as `print.sca'), into S-Plus, run:

source("sca.s")

Description of the procedure `sca'
------------------------------------

sca

Interactive program for Simple Component Analysis

Description:

A system of simple components calculated from a correlation matrix is built interactively
following the methodology of Rousson and Gasser (2004).

Usage:

sca(S, b=5, d=0, criterion="csv", cluster="median", withinblock=TRUE, invertsigns=FALSE,
corblocks=0, qmin=0, inter=TRUE)

Arguments:

S: the correlation matrix to be analyzed.

b: the number of block-components initially proposed.

d: the number of difference-components initially proposed.

criterion: indicates the optimality criterion to be used for evaluating a
system of simple components. One of "csv" (corrected sum of variances) or
"blp" (best linear predictor); the value can be abbreviated.

cluster: indicates the clustering method to be used in the definition of
the block-components. One of "single" (single linkage), "median" (median linkage) or
"complete" (complete linkage); the value can be abbreviated.

withinblock: a logical indicating whether each difference-component should only
involve variables belonging to the same block-component.

invertsigns: a logical indicating whether the sign of some variables may be inverted
in order to avoid negative correlations.

corblocks: if larger than zero, the number of block-components is chosen
such that correlations among them are all smaller than `corblocks' (this overrides argument `b').

qmin: if larger than zero, the number of difference-components is chosen
such that the system contains at least `qmin' components (this overrides argument `d').

inter: a logical indicating whether the system of simple components should be
built interactively. If inter=FALSE, an optimal system of simple components is
automatically calculated without any intervention of the user
(according to `b' or `corblocks', and to `d' or `qmin').

Details:

When confronted with a large number `p' of variables measuring different aspects of the same theme,
the practitioner may like to summarize the information into a limited number `q' of components.
A ``component'' is a linear combination of the original variables, and the weights in
this linear combination are called the ``loadings''. Thus, a system of components is defined
by a `p' by `q' matrix of loadings.
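
As a minimal illustration of this definition (written in Python rather than S-Plus, and not
part of `sca.s'), component scores are obtained by multiplying the data matrix by the matrix
of loadings. The helper name `component_scores' and the numbers are made up for the example.

```python
def component_scores(X, A):
    """Return Z = X A: one row per subject, one column per component."""
    p, q = len(A), len(A[0])
    return [[sum(x[j] * A[j][c] for j in range(p)) for c in range(q)]
            for x in X]

# Two subjects measured on p = 3 variables (hypothetical values).
X = [[1.0, 2.0, 3.0],
     [0.0, 1.0, -1.0]]

# p x q matrix of loadings with q = 2: the sum of the three variables,
# and a contrast between the first and the third variable.
A = [[1.0, 1.0],
     [1.0, 0.0],
     [1.0, -1.0]]

Z = component_scores(X, A)
print(Z)
```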

Among all systems of components, principal components (PCs) are optimal in many ways.
In particular, the first few PCs extract a maximum of the variability of the original variables
and they are uncorrelated, such that the extracted information is organized in an optimal way:
we may look at one PC after the other, separately, without taking into account the rest.
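
For reference, the first PC of a correlation matrix can be approximated by power iteration.
The sketch below is in Python and is only an illustration: it is not how `sca' computes PCs,
the function name `first_pc' and the matrix `S' are invented for the example, and it assumes
that the starting vector is not orthogonal to the first PC.

```python
import math

def first_pc(S, iterations=500):
    """Power iteration: leading eigenvalue and unit-length eigenvector of S."""
    p = len(S)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' S v gives the variance of the component.
    eigenvalue = sum(v[i] * S[i][j] * v[j] for i in range(p) for j in range(p))
    return eigenvalue, v

# Hypothetical 3 x 3 correlation matrix.
S = [[1.0, 0.6, 0.2],
     [0.6, 1.0, 0.2],
     [0.2, 0.2, 1.0]]

eigenvalue, loadings = first_pc(S)

# The total variability of a correlation matrix is its trace, i.e. p.
share = eigenvalue / len(S)
print(loadings, share)
```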

Unfortunately PCs are often difficult to interpret. The goal of Simple Component
Analysis is to replace (or to supplement) the optimal but non-interpretable PCs by suboptimal
but interpretable ``simple components''. The proposal of Rousson and Gasser (2004) is to look for an
optimal system of components, but only among the simple ones, according to some definition of
optimality and simplicity. The outcome of their method is a simple matrix of loadings calculated
from the correlation matrix `S' of the original variables.

Simplicity is not a guarantee for interpretability (but it helps in this regard).
Thus, the user may wish to partly modify an optimal system of simple components in order
to enhance interpretability. While PCs are by definition 100% optimal, the optimal
system of simple components proposed by the procedure `sca' may be, say, 95% optimal,
whereas the simple system altered by the user may be, say, 93% optimal. It is ultimately
up to the user to decide whether the gain in interpretability is worth the loss of optimality.

The interactive procedure `sca' is intended to assist the user in his/her choice of an interpretable
system of simple components. The algorithm consists of three distinct stages and proceeds in an
interactive way. At each step of the procedure, a simple matrix of loadings is displayed in a window.
The user may alter this matrix by clicking on its entries, following the instructions given there.

If all the loadings of a component share the same sign, it is a ``block-component''.
If some loadings are positive and some loadings are negative, it is a ``difference-component''.
Block-components are arguably easier to interpret than difference-components. Unfortunately, a system
of PCs almost always contains only one block-component. In the procedure `sca', the user may choose the
number of block-components in the system, the rationale being to have as many block-components as
possible while keeping the correlations among them below some cut-off value (typically 0.3 or 0.4).
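
The distinction between the two kinds of components can be sketched as follows (in Python, for
illustration only). The function name `component_type' is invented, and ignoring zero loadings
when comparing signs is an assumption of this sketch.

```python
def component_type(loadings):
    """Classify one column of loadings by the signs of its non-zero entries."""
    has_positive = any(a > 0 for a in loadings)
    has_negative = any(a < 0 for a in loadings)
    if has_positive and has_negative:
        return "difference"
    if has_positive or has_negative:
        return "block"
    return "empty"  # all loadings are zero

print(component_type([1, 1, 0, 0]))   # all non-zero loadings share a sign
print(component_type([1, -1, 0, 0]))  # mixed signs
```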

Simple block-components should define a partition of the original variables. This is done in the
first stage of the procedure `sca'. An agglomerative hierarchical clustering procedure is used there.
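
The following Python sketch shows one way such a partition could be built by agglomerative
clustering on the correlations. It is an assumption-laden illustration, not the code of `sca.s':
the names `cluster_variables' and `block_matrix' are invented, the similarity between two
clusters is taken to be the largest, median or smallest correlation between their members
(for "single", "median" and "complete" respectively), and each block-component is given
loadings of 1 on its own variables and 0 elsewhere.

```python
from statistics import median

def cluster_variables(S, b, linkage="median"):
    """Merge the variables into b blocks, always joining the most correlated pair."""
    summarise = {"single": max, "median": median, "complete": min}[linkage]
    blocks = [[j] for j in range(len(S))]
    while len(blocks) > b:
        best = None
        for i in range(len(blocks)):
            for k in range(i + 1, len(blocks)):
                sim = summarise([S[u][v] for u in blocks[i] for v in blocks[k]])
                if best is None or sim > best[0]:
                    best = (sim, i, k)
        _, i, k = best
        blocks[i] = sorted(blocks[i] + blocks[k])
        del blocks[k]
    return blocks

def block_matrix(blocks, p):
    """Simple matrix with one column per block (rows are variables)."""
    return [[1 if j in block else 0 for block in blocks] for j in range(p)]

# Hypothetical correlation matrix with two clear groups of variables.
S = [[1.0, 0.8, 0.1, 0.1],
     [0.8, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.7],
     [0.1, 0.1, 0.7, 1.0]]

blocks = cluster_variables(S, b=2)
simplemat = block_matrix(blocks, p=len(S))
print(blocks)
print(simplemat)
```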

The second stage of the procedure `sca' consists in the definition of simple difference-components.
Those are obtained as simplified versions of some appropriate ``residual components''. The idea is to
retain the large loadings (in absolute value) of these residual components and to shrink to zero the
small ones. For each difference-component, the interactive procedure `sca' displays the loadings of the
corresponding residual component (at the right side of the window), such that the user may know which
variables are especially important for the definition of this component.
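
The idea of retaining large loadings and shrinking small ones can be sketched in Python as
below. This file does not state the exact rule used by `sca', so the relative cut-off of 0.5,
the function name `simplify' and the example loadings are all assumptions made for illustration.

```python
def simplify(residual, cutoff=0.5):
    """Keep the sign of loadings that are large relative to the largest one.

    Loadings smaller (in absolute value) than cutoff times the largest
    absolute loading are set to zero; this rule is hypothetical.
    """
    largest = max(abs(a) for a in residual)
    if largest == 0:
        return [0] * len(residual)
    return [0 if abs(a) < cutoff * largest else (1 if a > 0 else -1)
            for a in residual]

# Hypothetical loadings of a residual component on five variables.
residual = [0.62, 0.55, -0.48, 0.05, -0.08]
simple = simplify(residual)
print(simple)
```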

At the third stage of the interactive procedure `sca', it is possible to remove some of the
difference-components from the system.

For many examples, it is possible to find a simple system which is 90% or 95% optimal, and where
correlations between components are below 0.3 or 0.4. When the structure in the correlation matrix
is complicated, it might be advantageous to invert the sign of some of the variables in order
to avoid negative correlations as much as possible. This can be done using the option `invertsigns=TRUE'.
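
Inverting the sign of a variable changes the sign of its correlations with all other variables
and leaves the rest of the matrix untouched. A small Python sketch of this effect follows; the
function name `invert_signs' and the matrix are invented, and how `sca' chooses which variables
to invert is not shown here.

```python
def invert_signs(S, flip):
    """Correlation matrix after multiplying the variables listed in flip by -1."""
    p = len(S)
    sign = [-1.0 if j in flip else 1.0 for j in range(p)]
    return [[sign[i] * S[i][j] * sign[j] for j in range(p)] for i in range(p)]

# Hypothetical correlation matrix in which the second variable is
# negatively correlated with the two others.
S = [[1.0, -0.5, 0.4],
     [-0.5, 1.0, -0.3],
     [0.4, -0.3, 1.0]]

T = invert_signs(S, flip={1})
print(T)
```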

Value:

A list containing the following components:

simplemat: a matrix defining a system of simple components, whose rows correspond to variables
and whose columns correspond to components.

loadings: loadings of simple components. This is a normalized version of `simplemat'.

allcrit: a list containing the following components:

varpc: a vector containing the percentage of total variability accounted for by each of
the first `nblock'+`ndiff' principal components of `S'.

varsc: a vector containing the percentage of total variability accounted for by each of
the simple components defined by `simplemat'.

cumpc: the sum of varpc, indicating the percentage of total variability accounted for by
the first `nblock'+`ndiff' principal components of `S'.

cumsc: a score indicating the percentage of total variability accounted for
by the system of simple components. `cumsc' is calculated according to `criterion'.

opt: indicates the optimality of the system of simple components.
`opt' is obtained as `cumsc'/`cumpc'.

corsc: correlation matrix of the simple components defined by `simplemat'.

maxcor: a list containing the following components:

row: label of the row of the maximum value in `corsc'.

col: label of the column of the maximum value in `corsc'.

val: maximum value in `corsc' (in absolute value).

nblock: number of block-components in `simplemat'.

ndiff: number of difference-components in `simplemat'.

criterion: as above.

cluster: as above.

withinblock: as above.

invertsigns: as above.

vardata: the correlation matrix which was analyzed. In principle it should be
equal to argument `S' above, except if it has been transformed in order to avoid negative correlations.
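
To make some of these quantities concrete, the Python sketch below recomputes analogues of
`loadings', `varsc', `corsc' and `maxcor' from a simple matrix and a correlation matrix. It is
an illustration under stated assumptions, not the code of `sca.s': the normalization is taken
to be unit Euclidean length per column, percentages are taken relative to the trace of `S',
and row and column positions are returned as indices rather than labels. It also assumes at
least two components. `cumsc' and `opt' are not reproduced because they depend on `criterion'.

```python
import math

def normalise_columns(M):
    """Scale each column of M to unit Euclidean length."""
    p, q = len(M), len(M[0])
    norms = [math.sqrt(sum(M[j][c] ** 2 for j in range(p))) for c in range(q)]
    return [[M[j][c] / norms[c] for c in range(q)] for j in range(p)]

def component_covariance(S, A):
    """Covariance matrix A' S A of the components with loadings A."""
    p, q = len(A), len(A[0])
    return [[sum(A[i][a] * S[i][j] * A[j][b] for i in range(p) for j in range(p))
             for b in range(q)] for a in range(q)]

# Hypothetical correlation matrix and simple matrix (rows are variables).
S = [[1.0, 0.6, 0.2],
     [0.6, 1.0, 0.2],
     [0.2, 0.2, 1.0]]
simplemat = [[1, 1],
             [1, 0],
             [1, -1]]

loadings = normalise_columns(simplemat)
C = component_covariance(S, loadings)
q = len(C)

# Percentage of total variability per component (total is p for a correlation matrix).
varsc = [100 * C[c][c] / len(S) for c in range(q)]

# Correlation matrix of the simple components.
corsc = [[C[a][b] / math.sqrt(C[a][a] * C[b][b]) for b in range(q)]
         for a in range(q)]

# Largest off-diagonal correlation in absolute value, with its position.
val, row, col = max((abs(corsc[a][b]), a, b)
                    for a in range(q) for b in range(a + 1, q))
print(varsc, val, row, col)
```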

Examples:

Let `X' be a matrix containing some data set with at least 5 columns, whose rows
correspond to subjects and whose columns correspond to variables.

An optimal simple system with two block-components and three difference-components
for the data in `X' can be automatically obtained as:

r<-sca(cor(X),b=2,d=3,inter=FALSE)

The resulting simple matrix is contained in `r$simplemat'.
A matrix of scores for such simple components can then be obtained as:

Z<-as.matrix(scale(X))%*%r$loadings

An optimal simple system with at least 5 components for the data in `X',
where the number of block-components is such that correlations among them are all smaller
than 0.4, can be obtained as:

r<-sca(cor(X),corblocks=0.4,qmin=5)

Since the interactive part of the program is active here, the proposed system can then be
modified according to the user's wishes. The result of the procedure will be contained in `r'.

Description of the procedure `print.sca'
---------------------------------------

print.sca

Printing of a list resulting from the procedure `sca'.

Usage:

print.sca(r)

Arguments:

r: a list resulting from the procedure `sca'.

Examples:

Let `S' be a correlation matrix.

r<-sca(S,b=2,d=3,inter=FALSE)
print.sca(r)

Valentin Rousson
Department of Biostatistics
Institute for Social and Preventive Medicine
University of Zürich
Sumatrastrasse 30
CH-8006 Zürich
Switzerland

E-mail: rousson@ifspm.unizh.ch
