Readme file

SERIES C  
Applied Statistics

A modified false discovery rate multiple-comparisons procedure for discrete data, applied to human immunodeficiency virus genetics, by P. Gilbert et al
Journal of the Royal Statistical Society, Series C, Applied Statistics, Volume 54 (2005) part 1, 143-158

This file describes how to implement the Tarone-modified FDR multiple comparisons adjustment procedure described in the paper.

Two programs are needed, the Splus function fastAtobinary.s and the Fortran program genomescanTaroneFDR.f

For each of the two groups the amino acid sequence data are read in as 1's and 0's, as an nk by p by 21 array, with first dimension indexing the sequences, second dimension indexing the positions in the sequences, and third dimension indexing the 20 amino acids plus a gap as the 21st element. The reference sequence is also read in as an nk by p by 21 array.

The following code carries out the analyses that are presented in the paper, using Splus (R would also work) and Fortran.

First, the Splus function fastAtobinary.s is used to translate Fast A versions of the sequence datasets into files that are input into the Fortran program genomescanTaroneFDR.f

source('fastAtobinary.s')

The Example data consist of gag p24 HIV sequences, 73 subtype C from Southern Africa and 73 subtype B from the United States and Europe
n1 <- 73
n2 <- 73
p <- 231
seqfile1 <- 'p24_73C.fas'# subtype C
seqfile2 <- 'p24_73B.fas'# subtype B
seqfileref <- 'Referenceseq.txt'# Reference sequence, the C/B consensus sequence
alpha <- 0.05

Unweighted analysis:
Blosmat <- matrix(rep(1,21*21),ncol=21)
weightprior <- rep(1,p)
Delta <- matrix(rep(1,21*21),ncol=21)

fastAtobinary(n1,n2,p,seqfile1,seqfile2,seqfileref,Blosmat,Delta,weightprior,alpha)

fastAtobinary writes out the three input files inputexample.dat, input1.dat, and input2.dat that are used by the Fortran program genomescanTaroneFDR.f.
input1.dat contains the sequence data for group 1 and input2.dat contains the sequence data for group 2, in the binary format described above. The file inputexample.dat contains the following variables:
n1 n2 p# nk = No. of sequences in group k, k=1,2
# p = No. of positions in the sequences, i.e.,
the length of the aligned sequences
alpha# The type I error rate (e.g., 0.05)
1st row of Blosmat# Blosmat is a 21 by 21 matrix that weights
# amino acid substitutions; the 20 rows
2nd row of Blosmat

...
# are for the 20 amino acids and the 21st row
# is for a gap (insertion or deletion).
# The simplest choice for Blosmat is simply
# a matrix of all 1's.
21st row of Blosmat

1st row of Delta# Delta is a 21 by 21 matrix that weights
# biostructural importance of amino acid
2nd row of Delta

...
# substitutions for a different test statistic
# than Fisher's exact test statistic. For the
# current program Delta is set to the matrix
# of all 1's.
21st row of Delta

weightprior # a p-vector of prior weights on the p positions
# 1, ... , p. These weights are not used for Fisher's
# exact test; they are used for other test statistics
# studied in other work. Thus, weightprior is set
# to be a vector of 1's for the current application.
yyref

# the sequence data for the reference sequence
entry 1, 1, 21 entries

entry 1, 2, 21 entries

...
entry 1, p, 21 entries

entry 2, 1, 21 entries

and so on... (a large file of 1's and 0's)

After the Splus function fastAtobinary is run, the Fortran program can be run at a Unix or Linux command line. The Fortran function can be compiled with the command f77 -O3 genomescanTaroneFDR.f -o genomescanTaroneFDR.out, and run with the command: genomescanTaroneFDR.out

The file genomescanTaroneFDR.f contains documentation describing the output of the Fortran program.

Contact information:
Peter B. Gilbert
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue North
PO Box 19024
Seattle
WA 98109
USA

Phone: 206 667-7299
Fax: 206 667-4812
E-mail: pgilbert@scharp.prg

Journals

SERIES A
Statistics in Society

SERIES B
Statistical Methodology

SERIES C
Applied Statistics

SERIES D
The Statistician