Novel methods for estimating null distributions in gene and gene pathway analysis for large scale hypothesis testing
RNA-seq is a novel technology for transcriptome profiling that is replacing traditional microarrays in many gene expression studies. It also poses challenges to statistical methods for gene differential expression (DE) and pathway analysis, both of which rely on inference from a large number of test statistics, e.g. likelihood ratio test (LRT) or score test statistics. A key step in determining which test statistics are significant is obtaining the null distribution. We show that the asymptotic null assumption is often inappropriate for many of the Chi-squared tests in RNA-seq analysis. To estimate a more accurate null, we propose a Gamma mixture model for the empirical distribution of the test statistics. Based on the mixture model, we derive new methods from maximum likelihood and characteristic functions to estimate the null parameters. The estimated empirical null leads to more accurate inference for false discovery rate control in large-scale hypothesis testing.

In RNA-seq experiments, gene expression is summarized by the number of reads, which is correlated with gene length. This inherent correlation creates bias in gene set analysis. We develop a gene set analysis method designed to work with RNA-seq tests based on read counts. We address length bias in RNA-seq data by implementing a flexible randomization scheme in gene set analysis. In the presence of length bias, our method improves the power to identify enriched gene sets and reduces type I errors for null sets.

The commonly used framework of Gene Set Analysis (GSA) is based on the maxmean statistic. We derive an asymptotic distribution of the maxmean statistic under the Lindeberg condition. Further, we propose an empirical method based on a two-group mixture model and maximum likelihood to estimate the empirical null distribution. Compared to the permutation-based approach, the empirical method improves the accuracy of GSA results in large-scale hypothesis testing. It is easy to implement and works well with all common test statistics for individual genes. It also reduces the computational burden of the permutation steps in restandardization.
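
For readers unfamiliar with the maxmean statistic mentioned above, the following is a minimal sketch of its standard unsigned form for a single gene set: the larger of the average positive part and the average negative part of the gene-level scores. The function name `maxmean` and the example scores are illustrative assumptions, not taken from this work, and the sketch does not include the restandardization or null estimation steps described in the abstract.

```python
def maxmean(scores):
    """Unsigned maxmean statistic for one gene set (illustrative sketch).

    `scores` holds one test statistic per gene in the set, e.g. z-scores.
    """
    scores = list(scores)
    n = len(scores)
    if n == 0:
        raise ValueError("gene set must contain at least one score")
    # Average of the positive parts, taken over ALL genes in the set.
    pos_mean = sum(max(s, 0.0) for s in scores) / n
    # Average of the negative parts (as magnitudes), also over all genes.
    neg_mean = sum(max(-s, 0.0) for s in scores) / n
    # The statistic is whichever direction shows the stronger average signal.
    return max(pos_mean, neg_mean)


if __name__ == "__main__":
    # Hypothetical gene-level scores for a four-gene set.
    print(maxmean([2.0, -1.0, 0.0, 3.0]))
```

Because both averages divide by the full set size, a few strongly shifted genes in one direction are not cancelled by mild shifts in the other, which is the usual motivation for preferring maxmean over a plain mean of scores.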