Mining coherent patterns and clusters from genomic data
Recent high-throughput biotechnologies have generated a variety of large-scale biological data, such as genome sequences, protein structures, gene expression measurements, protein-protein interactions, and DNA binding data. These ever-increasing genomic data give us the opportunity to explore the modular organization of the cell on a genome-wide scale. As a first step toward knowledge discovery, mining hidden patterns and clusters in genomic data is a critical task in bioinformatics research and biomedical applications. In particular, microarray technology has made it possible to monitor the expression levels of thousands of genes in parallel. Many clustering algorithms have been applied to microarray data to find co-expressed genes and coherent gene expression patterns. However, due to the specific characteristics of microarray data and the special requirements of the biological domain, clustering microarray data still faces several challenges. In this dissertation, we first propose three novel approaches that effectively and efficiently identify coherent expression patterns (1) across the full set of experimental conditions; (2) over subsets of samples or sub-intervals of time series; and (3) embedded in three-dimensional microarray data, respectively. As experimental results of high-throughput technologies, large-scale genomic data are typically noisy, so mining a single type of genomic data may not lead to reliable and meaningful results. It has been suggested that combining multiple data sets is likely to reveal interesting, novel, and reliable patterns that cannot be obtained from any single source alone. In this dissertation, we also study the problem of joint mining across multiple genomic data sets. We propose a cross-graph quasi-clique model to describe clusters that are consistently supported by heterogeneous data sources, and we develop efficient algorithms to search for cross-graph quasi-cliques in multiple data sets.
We conduct an extensive performance study of our approaches on both real and synthetic data. The mining results on real data sets are shown to be biologically meaningful. Moreover, our approaches are robust to parameter settings and scale to large data sets.