Title: A non-homogeneous hidden Markov model for precipitation occurrence Authors: Hughes J.P., Guttorp P, Charles S Reference: Applied Statistics Contact: Jim Hughes (hughes@diamond.cfas.washington.edu) ====================================================================== nhmm - A program for fitting nonhomogeneous hidden Markov models ====================================================================== A. INTRODUCTION NHMM is a partially interactive program for parameter estimation, state estimation and simulation from a m-state, nsta-station nonstationary hidden Markov model with multivariate binary outcome data. A hidden Markov model has two parts - a transition matrix and an observation distribution. In this application the transition matrix is assumed to depend on observable covariates, hence the term nonstationary. The multivariate binary observations may be assumed to be (conditionally) independent or they may be modelled jointly using an autologistic model. In the latter case, the user may specify a model relating the between station dependence parameters to the distance and direction between stations (see section D for details). The EM algorithm with modification is used for parameter estimation. Missing values in the outcome are allowed only for the conditional independence model. Missing values in the covariates are not allowed. To run this program the user must link the compiled code with either the NAG library (as presently implemented) or the NPSOL optimization routine (a version using NPSOL is available from the author). After starting the program the user is prompted for the name of the parameter control file. The parameter control file defines the current model and provides information about the form of the data. See section B for further details. The program reads the data, echos the model definition and then presents the user with the following menu of options: (1) Run EM algorithm from current parameters (2) Evaluate likelihood at current parameters (3) Compute covariance matrix of parameters (4) Print current parameters to the screen (5) Print current parameters to a file (6) Run Viterbi algorithm for current parameters (7) Generate conditional simulations from current model (9) Exit See section C for more information on each of these options. The user may select any option and upon completion of that task the program returns to this menu. The remainder of this documentation is divided into 5 sections: PROGRAM INPUTS, OPTIONS, DEPENDENCE MODELS, EXAMPLE, NOTES. B. PROGRAM INPUTS Input consists of 2 or 3 files, depending on the model. These are PARAMETER CONTROL FILE - sets user-defined options, defines the model, and specifies initial parameter values. The program echos the values that have been entered. RAW DATA FILE - contains the outcome (coded as 0/1) and covariate data with one record per time step. A brief summary of the data is printed after this file is read. DISTANCE FILE - contains the matrix of distances and directions between the measurement stations. Used only if the autologistic model is selected for the multivariate binary outcome. Each of these files is described further below. 1. PARAMETER CONTROL FILE. Name: supplied by the user interactively at program startup Function: sets user-defined options, defines the model, specifies initial parameter values Format: all input is free format. Note: In the following each dashed line (-----) denotes a new record. ------------------------------------------------------------ First record of PARAMETER CONTROL FILE read(11,*) m, nsta, nyears, ndays, natm, atmopt, raiopt number of states (m), number of stations (nsta), number of years of data (nyear), number of time periods per year (ndays), number of covariates (natm), covariate model option (atmopt), outcome model option (raiopt). Note: Years are treated as independent sequences. Thus, one could have 10 years of Jan. data. However, if the data do not form independent sequences (e.g. 5 complete years of data), then a "year" may be arbitrarily long. For example, one "year" of 1825 days. The only constraint is that nyears*ndays < MAXOBS (set in main.inc; see TIPS) Note: atmopt = 1 is Bayes model = 2 is Logistic model (not implemented) Note: raiopt = 1 is Independence model = 3 is autologistic model (a spatial model) ------------------------------------------------------------ read(11,'(a80)') datfil the data file name (datfil) ------------------------------------------------------------ read(11,'(a80)') dfmt the format of the data in datfil (dfmt); this should read 1 day per line and the first `ndays' lines should be the first year, etc. Each line gives the outcome indicators and the covariate data. See format of RAW DATA FILE below. ------------------------------------------------------------ INCLUDE THIS ONLY IF raiopt = 3 read(11,'(i5)') nb The number of parameters in the spatial model for each state (nb) ------------------------------------------------------------ INCLUDE THIS ONLY IF raiopt = 3 read(11,'(a80)') disfil The name of file containing distances/directions between stations (disfil). See format of DISTANCE FILE below. ------------------------------------------------------------ read(11,*) ifix(1) An indicator variable which determines whether the next set of parameters (gam) should be treated as fixed or estimated. Values are 0 = fixed, 1 = estimate. ------------------------------------------------------------ Next m records read(11,*)(gam(i,j),j=1,m) Each line should contain the m initial transition probabilities (gam) from state i (i=1,m) to states 1 - m. These are the \gamma_{ij} in equation (3) in Hughes et al. ------------------------------------------------------------ read(11,*) ifix(2) An indicator variable which determines whether the next set of parameters (pipars) should be treated as fixed or estimated. Values are 0 = fixed, 1 = estimate. ------------------------------------------------------------ Next nsta records read(11,*)(pipars(2,i,j),j=1,m) Each line should contain m values (pipars), one for each state. There are nsta such lines, one for each station. The value given is the initial value for p_is = 1/(1+exp(-alpha_is)) where i is the station number and s is the state. alpha is the parameter in equation 1 in Hughes et al. Under the independence model, p_is is the probability of rain in state s (s=1,m) for station i (i=1,nsta). ----------------------------------------------------------------------- INCLUDE ONLY IF raiopt = 3 read(11,*) ifix(3) An indiator variable which determines whether the next set of parameters (b) should be treated as fixed or estimated. Values are 0 = fixed, 1 = estimate. ----------------------------------------------------------------------- INCLUDE ONLY IF raiopt = 3 Next nb records (b) read(11,*) (b(i,j),j=1,m) These lines should contain the m*nb inital values for b. Each line contains the m values of b(i,.), one for each of the m states. The b's are used to relate \beta_{sij} (equation 1 in Hughes et al) to the distance and direction between the observation stations. See equation 13 in Hughes et al and section D for an example and more details. ----------------------------------------------------------------------- INCLUDE ONLY IF natm > 0 read(11,*) ifix(4) An indicator variable which determines whether the next set of parameters (mu) should be treated as fixed or estimated. Values are 0 = fixed, 1 = estimate. ----------------------------------------------------------------------- INCLUDE ONLY IF natm > 0 Next m*m records read(11,*)(mu(i,j,k),k=1,natm) Each line should contain the natm initial means (mu) for the atmospheric variables when S(t-1) = i, S(t) = j (i,j = 1,m) with j varying more quickly. These are the vectors \mu_{ij} in equation (3) of Hughes et al. ----------------------------------------------------------------------- INCLUDE ONLY IF natm > 0 read(11,*) ifix(5) ifix(5) must be entered as 0 ----------------------------------------------------------------------- INCLUDE ONLY IF natm > 0 Next natm records read(11,*)(sigma(i,j),j=1,natm) The (symmetric) variance-covariance matrix for the atmospheric variables (sigma). This is the matrix V in equation (3) of Hughes et al. Not estimated. ----------------------------------------------------------------------- read(11,*)dbeg,dend Beginning of time period to analyse (dbeg), end of time period to analyse (dend) Note: 1 <= dbeg < dend <= ndays ----------------------------------------------------------------------- read(11,*) iest estimation option (iest) Note: iest = 1 is exact maximum likelihood = 2 is mcml ----------------------------------------------------------------------- INCLUDE ONLY IF iest = 2 read(11,*) iseed,neps A negative random number seed (iseed) and the maximum relative variability allowed in the monte carlo estimate of the norm (neps) ----------------------------------------------------------------------- INCLUDE ONLY IF iest = 2 read(11,*) (norm0(j),j=1,m) the m log(normalizing constants) corresponding to the initial values of alpha and b given above. If the b are set such that \beta_{ij} = 0 for all i and j (where \beta_{ij} are the second order parameters in the autologistic model i.e. the autologistic model reduces to conditional independence) then norm0 is ignored the program computes the norms anyway. However, norm0 may be used to restart the estimation procedure from some parameter values for which beta is not zero. In that case the values of norm0 are given in the output from the previous run. ----------------------------------------------------------------------- INCLUDE ONLY IF iest = 2 read(11,*) (lsamp(j),j=1,m) m binary coded decimal random numbers from each of the autologistic distributions corresponding to the initial values of alpha and b. If the b are set such that \beta_{ij} = 0 for all i and j (i.e. the autologistic model reduces to conditional independence) then lsamp is ignored, so just enter 0. However, lsamp may be used to restart the estimation procedure from some parameter values for which beta is not zero. In that case the values of lsamp are given in the output from the previous run. ----------------------------------------------------------------------- read(11,*) ncladd Number of additional linear constraints on the parameters. Note that the constraints \sum_j gam_{ij} = 1 and \sum_j mu_{ijk} = 0 are already imposed. ----------------------------------------------------------------------- read(11,*) (a(nclin+i,j),j=1,n),bl(n+nclin+i),bu(n+nclin+i) The linear constraint (a) on the parameters and the lower (bl) and upper (bu) values of the constraint. 2. RAW DATA FILE Name: Given in PARAMETER CONTROL FILE Function: supplies the data Format: This file should contain ndays*nyears lines. Each line represents one day and contains the outcome (e.g. rainfall) indicators coded as 0/1/9 (0 = dry, 1 = wet, 9 = missing) followed by the natm atmospheric vars for that day. Each line is read with the format specified by dfmt (see PARAMETER CONTROL FILE) 3. DISTANCE FILE Name: Given in PARAMETER CONTROL FILE if rainopt = 3 Function: supplies the distance and direction matrix for the spatial model for rainfall. Format: There are 2*nsta lines and each line contains nsta numbers. The first nsta lines give distances; the second nsta give directions (in degrees). The j'th number in line i is the distance (direction) from station j to i. C. OPTIONS The options given by the program are as follows: (1) Run EM algorithm from current parameters This option runs either the "exact" EM algorithm or the EM\MCML algorithm described by Hughes et al (1998) to estimate the model parameters. (2) Evaluate likelihood at current parameters Computes the "exact" or approximate (using MCML) likelihood at the current parameter values. (3) Compute covariance matrix of parameters This routine is experimental. It implements the method described by Hughes (Stat. and Prob Letters 32:107-114, 1997) to evaluate the covariance matrix. However, it may be numerically unstable with large numbers of parameters. (4) Print current parameters to the screen (5) Print current parameters to a file User is prompted for a file name. (6) Run Viterbi algorithm for current parameters Estimates the most likely (MAP) hidden state sequence, given the data and current parameter values. Results are saved to a user-specified file. (7) Generate conditional simulations from current model Generates any number of simulated datasets conditional on the observed covariates (if there are any). Writes results to a user-specified file. User is prompted for a random number seed. (9) Exit D. DEPENDENCE MODELS In many applications we have found that the conditional independence model (equation 2 in Hughes et al) for the observation distribution provides satisfactory results. Sometimes however, it may be necessary to specify a model with includes additional local dependencies. Thus, NHMM also has implemented the autologistic model (equation 1 in Hughes et al.) for the observation distribution. In its most general form the autologistic model may be written as P(r|s) \propto \exp[\sum_i \alpha_{si} r^i + \sum_{j