Selected topics in statistical methods for DNA microarray analysis
DNA microarray technology has revolutionized biomedical research in the past decade, making it possible to study hundreds of thousands of DNA markers simultaneously. It is well recognized that experimental design, data preprocessing, and classification are the three key components of microarray data analysis. Motivated by the unique statistical challenges that emerged with microarray analysis, we developed new computational methods and tools for each of these three components. First, we presented two algorithms for proper sample-to-batch assignment that reduce the impact of batch effects at the microarray profiling stage. Our methods can effectively remove potential dependence between experimental batches and outcome variables. They can handle ideally balanced designs as well as challenging cases involving incomplete and unbalanced sample collections. Second, we developed a statistical model based on principal component analysis and a generalized additive model to correct the technical bias stemming from microarray probe compositions. We demonstrated that the proposed preprocessing method is capable of adjusting for the undesired technical bias while improving the detection of genuine biological signal. Third, we proposed a new nonparametric method for microarray data classification based on the receiver operating characteristic (ROC) technique. The proposed method uses a novel pairwise approach to identify the linear combination of biomarkers that maximizes the area under the ROC curve (AUC). We demonstrated the superiority of the method over existing ones in various simulation scenarios, and applied it to a suite of real datasets used to develop MammaPrint, an FDA-cleared microarray-based prognostic test for breast cancer. The methods we developed here are not limited to microarray analysis and can be applied or adapted to other high-throughput omics technologies.
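To make the third component concrete, the following Python sketch shows one generic way to look for a linear combination of two markers that maximizes the empirical AUC. It is not the method developed in this work. The grid search over directions, the function names `empirical_auc` and `best_pairwise_combination`, and the parameter `n_angles` are all assumptions introduced for illustration only.

```python
import math


def empirical_auc(cases, controls):
    """Mann-Whitney estimate of the AUC: the fraction of (case, control)
    pairs in which the case scores higher, with ties counted as 1/2."""
    wins = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def best_pairwise_combination(cases, controls, n_angles=180):
    """Grid-search the direction (cos t, sin t) that maximizes the empirical
    AUC of the score cos(t) * m1 + sin(t) * m2 for a pair of markers.

    cases, controls: lists of (m1, m2) tuples (hypothetical input layout).
    Returns (best_auc, (w1, w2)).
    """
    best_auc, best_t = -1.0, 0.0
    for k in range(n_angles):
        # Scan the full circle so that either marker may enter with a negative weight.
        t = 2.0 * math.pi * k / n_angles
        a, b = math.cos(t), math.sin(t)
        auc = empirical_auc(
            [a * m1 + b * m2 for m1, m2 in cases],
            [a * m1 + b * m2 for m1, m2 in controls],
        )
        if auc > best_auc:  # strict comparison keeps the first maximizer found
            best_auc, best_t = auc, t
    return best_auc, (math.cos(best_t), math.sin(best_t))
```

Because the AUC depends only on the ranking of the scores, rescaling the weights by a positive constant leaves it unchanged, which is why the sketch restricts the search to unit-length directions. Extending such a search beyond two markers requires a further strategy, for example combining markers one pair at a time, and that choice is likewise an assumption rather than something stated in the text above.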