Approaches to clustering gene expression time course data
Conventional techniques for clustering gene expression time course data either ignore the temporal structure, by treating time points as independent, or use parametric models whose complexity must be fixed beforehand. In this thesis, we apply a non-parametric version of the traditional hidden Markov model (HMM), called the hierarchical Dirichlet process hidden Markov model (HDP-HMM), to the task of clustering gene expression time course data. The HDP-HMM is an instantiation of an HMM in the hierarchical Dirichlet process (HDP) framework of Teh et al. (2004): a non-parametric prior on the hidden states allows for a countably infinite number of them, and hence removes the need to fix the model complexity. At the same time, placing the Dirichlet process in a hierarchical framework lets the same countably infinite set of "next states" in the Markov chain be shared across all states, without constraining the flexible architecture of the model. We describe the algorithm in detail and compare the results of our method with those of traditional methods on two popular datasets, Iyer et al. (1999) and Cho et al. (1998). We show that a non-parametric hierarchical model such as ours can solve complex clustering tasks effectively without fixing the model complexity beforehand, while avoiding overfitting.
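To make the shared "next states" idea concrete, the following is a minimal sketch of drawing from an HDP-HMM prior. It is not the inference algorithm described in the thesis. It assumes a finite truncation to K states as a stand-in for the countably infinite state space. The names gamma, alpha, K, and all function names are hypothetical choices made for this illustration. A global weight vector beta is drawn by stick-breaking, and every row of the transition matrix is then drawn from a Dirichlet distribution centred on that same beta, which is what ties the rows to a common set of states.

```python
import random

def stick_breaking(gamma, K, rng):
    """Truncated stick-breaking draw of global state weights (assumed truncation level K)."""
    weights, remaining = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, gamma)   # fraction of the remaining stick to break off
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)             # last component absorbs the leftover mass
    return weights

def dirichlet(params, rng):
    """Dirichlet draw via normalised Gamma variates (tiny floor keeps each shape positive)."""
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_transition_matrix(alpha, beta, rng):
    """Each row is Dirichlet(alpha * beta), so all rows share the same global weights."""
    return [dirichlet([alpha * b for b in beta], rng) for _ in beta]

def sample_state_path(pi, T, rng, start=0):
    """Simulate T steps of the hidden Markov chain defined by transition matrix pi."""
    path = [start]
    for _ in range(T - 1):
        row = pi[path[-1]]
        path.append(rng.choices(range(len(pi)), weights=row)[0])
    return path

rng = random.Random(0)                    # fixed seed for repeatability
beta = stick_breaking(gamma=3.0, K=10, rng=rng)
pi = sample_transition_matrix(alpha=5.0, beta=beta, rng=rng)
path = sample_state_path(pi, T=12, rng=rng)
```

In this sketch a larger alpha pulls each row of pi closer to beta, while a smaller gamma concentrates beta on fewer states, so the number of states that are actually visited is governed by the prior rather than fixed in advance.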