Readme file
Series B: Statistical Methodology
Functional clustering and identifying substructures of longitudinal data,
J.-M. Chiou and P.-L. Li
J. R. Statist. Soc. B, Volume 69 (2007), 679–699
1. Data sets used in the data applications in Section 5:
Growth curve data
growth.dat - The heights of 54 girls and 39 boys measured at 31 ages between 1 and 18 years in
the Berkeley Growth Study (Tuddenham and Snyder, 1954). The 31 recording time points
are 1, 1.25, 1.5, 1.75, 2, 3, 4, 5, 6, 7, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5, 12, 12.5,
13, 13.5, 14, 14.5, 15, 15.5, 16, 16.5, 17, 17.5 and 18 years.
This data set consists of the following variables:
Column 1 : ID number of subject
Column 2 : Gender classification (1 = boy; 2 = girl)
Column 3 : Recording time point
Column 4 : Height
Source: Tuddenham, R. D. and Snyder, M. M. (1954)
Physical growth of California boys and girls from birth to eighteen years.
University of California Publications in Child Development 1, 183-364.
Website: http://www.stats.ox.ac.uk/~silverma/fdacasebook/growthchap.html
Gene expression profile data
gene.dat - A subset of the Drosophila life cycle gene expression data of Arbeitman
et al. (2002). The original data set contains 77 gene expression profiles
during 58 sequential time points from the embryonic, larval, and pupal
periods of the life cycle. The gene expression levels were obtained by
a cDNA microarray experiment. This subset of the data consists of the
following variables:
Column 1 : CG number of gene
Column 2 : Biological classifications identified in Arbeitman et al. (2002)
(1 = transient early zygotic genes; 2 = muscle-specific genes;
3 = eye-specific genes)
Column 3 : Recording time
Column 4 : Gene expression level
Source: Arbeitman, M. N., Furlong, E. E. M., Imam, F., Johnson, E., Null, B. H., Baker, B. S.,
Krasnow, M. A., Scott, M. P., Davis, R. W. and White, K. P. (2002)
Gene expression during the life cycle of Drosophila melanogaster.
Science, 297, 2270-2274.
Website: http://genome.med.yale.edu/Lifecycle/ or
Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo/ (find "GDS191" record)
2. Matlab functions for k-centers FC algorithm:
datap.m - Transform the input data set into the format required by kcfc.m.
kcfc.m - K-centers functional clustering algorithm (main function).
cmp2p.m - Calculate the adjusted Rand index and the correct classification rate,
and write the clustering results to an output file.
Ex1_growth.m - Example of clustering the Berkeley growth data (in Section 5).
Ex2_gene.m - Example of clustering the gene expression data (in Section 5).
3. Procedure for using the k-centers FC algorithm:
(1) Transform the input data object (data) for k-centers FC:
[idsubj,idclass,data] = datap(file_in,nrow).
Input arguments:
file_in - file name of the data set, which must have the following format:
1st column = ID number of subject
2nd column = External class label
3rd column = Recording time
4th column = Observation measured at the recording time
nrow - number of rows of the input data set in "file_in"
Output arguments:
idsubj - n x 1 vector of ID numbers of n subjects
idclass - n x 1 vector of external class labels of n subjects
data - object of input data for kcfc.m, including
data.isobs - n x m matrix of indicators for data status.
isobs(i,j)=1: the ith curve is observed at time Tin(i,j).
isobs(i,j)=0: no observation for the ith curve at time Tin(i,j).
data.Tin - n x m matrix of recording time points of n subjects,
where m is the maximum number of time points per subject.
data.Yin - n x m matrix of observations corresponding to the time points in Tin
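
The layout described above (four-column long format in, n x m matrices out) can be
illustrated with a short sketch. This is written in Python rather than Matlab and is
not the code of datap.m; the function name long_to_wide, the use of NaN to pad
unobserved entries, and the toy values are all assumptions made for illustration.

```python
def long_to_wide(rows):
    """Group long-format rows into per-subject curves.

    rows: iterable of (subject_id, class_label, time, value) tuples,
    i.e. the four columns described above.
    Returns (idsubj, idclass, isobs, Tin, Yin) as plain Python lists.
    Curves shorter than the longest one are padded: isobs gets 0 and
    Tin/Yin get NaN (the NaN padding is an assumption of this sketch).
    """
    curves = {}  # subject id -> (class label, [(time, value), ...])
    for sid, label, t, y in rows:
        if sid not in curves:
            curves[sid] = (label, [])
        curves[sid][1].append((t, y))

    # m is the maximum number of time points over all subjects.
    m = max(len(obs) for _, obs in curves.values())
    nan = float("nan")

    idsubj, idclass, isobs, Tin, Yin = [], [], [], [], []
    for sid, (label, obs) in curves.items():
        obs = sorted(obs)            # order each curve by recording time
        pad = m - len(obs)
        idsubj.append(sid)
        idclass.append(label)
        isobs.append([1] * len(obs) + [0] * pad)
        Tin.append([t for t, _ in obs] + [nan] * pad)
        Yin.append([y for _, y in obs] + [nan] * pad)
    return idsubj, idclass, isobs, Tin, Yin


# Toy input with made-up values (not taken from growth.dat or gene.dat).
toy_rows = [
    (1, 1, 1.0, 80.0), (1, 1, 2.0, 90.0), (1, 1, 3.0, 97.0),
    (2, 2, 1.0, 78.0), (2, 2, 3.0, 95.0),
]
idsubj, idclass, isobs, Tin, Yin = long_to_wide(toy_rows)
print(idsubj, idclass)
print(isobs)
```

In the toy input the second subject has only two observations, so its third entry in
isobs is 0 and the matching entries of Tin and Yin are padding.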
(2) Implement the k-centers FC algorithm:
[nc_kcfc,P_out,M_out,idfpca,idkcfc] = kcfc(nc,pcopt,clustopt,data)
Input arguments:
nc - number of clusters for initial clustering
pcopt.ops - whether to use smoothing procedures when estimating the unknown functions
in the FPCA model
0: without smoothing procedures
1: with smoothing procedures (need to set pcopt.bw1d and pcopt.bw2d)
pcopt.bw1d - bandwidth for 1d smoothing
pcopt.bw2d - bandwidth for 2d smoothing
clustopt.PP - threshold value for selecting the number of random components in initial clustering
(i.e. the value \tau_{\lambda} in the paper)
clustopt.MM - threshold value for selecting the number of random components in iterative
reclassification (i.e. the value \tau_D in the paper)
Output arguments:
nc_kcfc - total number of clusters after iterative reclassification
P_out - number of FPC scores used for initial clustering
M_out - 1 x nc_in vector of numbers of eigenfunctions, where M_out(k) is the number
of eigenfunctions for the truncated KL expansion under group k.
idfpca - n x 1 vector of initial clustering result (cluster labels)
idkcfc - n x 1 vector of final clustering result (cluster labels)
(3) Produce outputs for clustering results and compare similarity between two partitions of nobs subjects
using adjusted Rand index (Hubert & Arabie, 1985) and correct classification rate:
[adjustRand,correct_rate,class_order,ctable] = cmp2p(clust_result,clust_external)
Input arguments:
clust_result - labels of cluster memberships by clustering results
clust_external - labels of cluster memberships by external criterion
Output arguments:
adjustRand - adjusted Rand index
correct_rate - correct classification rate
class_order - nclust x 1 vector matching clusters to classes:
if the jth cluster of clust_result corresponds to the ith class
of clust_external, then class_order(j) = i.
ctable - nclass x nclust contingency table, whose nclass rows correspond to the partition given by
clust_external and whose nclust columns correspond to clust_result.
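
The quantities returned in this step can be illustrated with a short sketch. It is
written in Python and is not the code of cmp2p.m. The adjusted Rand index follows the
Hubert and Arabie (1985) formula. The correct classification rate is not defined in
this file, so the sketch assumes it is the share of subjects matched under the best
one-to-one assignment of clusters to classes. All function names and the toy labels
are hypothetical.

```python
from itertools import permutations
from math import comb


def contingency_table(clust_external, clust_result):
    """nclass x nclust table: rows are external classes, columns are clusters."""
    classes = sorted(set(clust_external))
    clusters = sorted(set(clust_result))
    table = [[0] * len(clusters) for _ in classes]
    for c, k in zip(clust_external, clust_result):
        table[classes.index(c)][clusters.index(k)] += 1
    return table


def adjusted_rand_index(table):
    """Adjusted Rand index (Hubert and Arabie, 1985) from a contingency table.

    Undefined (division by zero) when both partitions put all subjects
    in a single group.
    """
    n = sum(sum(row) for row in table)
    sum_cells = sum(comb(v, 2) for row in table for v in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)


def correct_classification_rate(table):
    """Assumed definition: best one-to-one matching of clusters to classes.

    Brute force over assignments, so only suitable for a few clusters;
    requires nclust <= nclass.
    """
    nclass, nclust = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    best = max(
        sum(table[perm[j]][j] for j in range(nclust))
        for perm in permutations(range(nclass), nclust)
    )
    return best / n


# Toy labels (made up for illustration).
external = [1, 1, 1, 2, 2, 2]
result = [1, 1, 2, 2, 2, 2]
ctable = contingency_table(external, result)
ari = adjusted_rand_index(ctable)
rate = correct_classification_rate(ctable)
print(ctable, ari, rate)
```

For the toy labels one subject of class 1 is placed in cluster 2, so the contingency
table has a single off-diagonal count and five of the six subjects are matched.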
(4) Examples for demonstration: 'Ex1_growth.m' and 'Ex2_gene.m'
Jeng-Min Chiou
Institute of Statistical Science, Academia Sinica
128 Section 2 Academia Road
Taipei 115
Taiwan
E-mail: jmchiou@stat.sinica.edu.tw