High dimensional factor analysis for predictive modeling, gene discovery, and genetic risk assessment
Imam, Netsanet Tewelde
High-dimensional datasets from proteomic, metabonomic, gene expression microarray, and imaging studies contain a number of variables p that is larger than the sample size n. The first objective of this dissertation is to develop a factor analysis (FA)-based method for predicting values of a univariate response y as a linear function of x when p > n. We compute the estimate γ̂ of γ in y = xγ + E and use it to predict values of y. To achieve this goal, we develop methods for high-dimensional FA (HDFA) regression and thereby obtain a prediction equation. We employ HDFA under the assumption that the relationships among the observed variables are driven by underlying latent constructs. Following a bivariate factor model conceptualization, we developed a high-dimensional bivariate HDFA of z = (y, x')' to estimate the covariance matrix of z and use it to compute γ̂. We perform a Monte Carlo (MC) study to compare the performance of HDFA regression with that of two popular dimension-reduction methods, principal component regression (PCR) and partial least squares (PLS) regression, under three underlying correlation structures: arbitrary correlation, a factor model correlation structure, and independence of y and x. Under the independence structure, we observe severe overfitting by PLS regression compared with HDFA regression and PCR. Under the two dependent structures, HDFA regression outperforms PCR and is comparable to or slightly better than PLS regression. Thus, HDFA regression is recommended over PCR and PLS regression. The second objective of the dissertation focuses on gene discovery and genetic risk assessment. We assume the existence of a latent construct associated with characteristic genes, a group of genes that are known to be involved in the etiology of a disease or biological process of interest.
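For the first objective, the step from an estimated covariance matrix of z to a prediction equation can be sketched as below. This is only an illustration under assumptions not stated in the abstract: a standard orthogonal factor model with m factors, a diagonal unique-variance matrix Ψ, centered variables, and the usual best-linear-predictor formula. The symbols Λ, Ψ, λ_y, Λ_x, ψ_y, and Ψ_x are introduced here for illustration, and the dissertation's actual estimator may differ.

```latex
% Assumed factor model for z = (y, x')' with m latent factors
\[
  z = \Lambda f + e, \qquad
  \Sigma = \operatorname{Cov}(z) = \Lambda \Lambda' + \Psi
  = \begin{pmatrix} \sigma_{yy} & \sigma_{yx} \\ \sigma_{xy} & \Sigma_{xx} \end{pmatrix}.
\]
% Partition the loadings and unique variances conformably with (y, x')'
\[
  \Lambda = \begin{pmatrix} \lambda_y' \\ \Lambda_x \end{pmatrix}, \quad
  \Psi = \begin{pmatrix} \psi_y & 0 \\ 0 & \Psi_x \end{pmatrix}
  \;\Longrightarrow\;
  \sigma_{xy} = \Lambda_x \lambda_y, \quad
  \Sigma_{xx} = \Lambda_x \Lambda_x' + \Psi_x .
\]
% Best linear predictor coefficients implied by the estimated covariance
\[
  \hat{\gamma} = \hat{\Sigma}_{xx}^{-1} \hat{\sigma}_{xy}
  = \bigl( \hat{\Lambda}_x \hat{\Lambda}_x' + \hat{\Psi}_x \bigr)^{-1}
    \hat{\Lambda}_x \hat{\lambda}_y .
\]
```

Under these assumptions, the estimated Σ_xx stays invertible even when p > n, because the diagonal of Ψ_x is strictly positive. The sample covariance matrix of x, by contrast, is singular in that setting.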
HDFA with varimax rotation is applied to identify candidate genes associated with Bardet-Biedl syndrome (BBS). For cases where varimax rotation fails to identify the characteristic genes, we propose Sparse Procrustean rotation (SPR). SPR is a new rotation that rotates the loading matrix toward a target matrix that is Procrustean but otherwise has simple structure, so that potential characteristic genes can be identified as those that load most highly on the same factor. We applied SPR to the expression of 1781 genes, 52 of which are genes whose mutation is associated with ciliopathy. We identified a factor, f1, on which the expression of 38 of the 52 ciliopathic genes (73.1%) loaded most highly. Of the 1729 other gene expression variables, 638 (36.9%) also loaded most highly on f1 and are proposed as candidates for further investigation to determine whether they are cilia function genes. Finally, we derived factor scores for genetic risk assessment: an index, represented by the selected factor, that reflects the risk of ciliopathy conditions. We computed a 90% upper normal limit for F1* for the population of interest, in our case twelve-week-old male F2 rats, and any subject with F1* > 0.47 is categorized as "high-risk" for the ciliopathy conditions under SPR. Together, HDFA and SPR are promising methods for gene discovery and genetic risk assessment.
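The gene-discovery and risk-classification steps described above can be illustrated with a small sketch. It assumes a rotated loading matrix is already available and that "loads most highly" means the largest absolute loading. The gene names, the loading values, the helper names (`top_factor`, `factor_membership`, `classify`), and the label "not high-risk" are all hypothetical. Only the 0.47 cutoff and the "high-risk" label come from the abstract.

```python
from collections import Counter


def top_factor(loadings_row):
    """Index of the factor with the largest absolute loading (assumed criterion)."""
    return max(range(len(loadings_row)), key=lambda k: abs(loadings_row[k]))


def factor_membership(loadings, characteristic):
    """Find the factor on which most characteristic genes load most highly.

    Returns (factor index, number of characteristic genes on it,
    sorted list of other genes on the same factor, i.e. the candidates).
    """
    assigned = {gene: top_factor(row) for gene, row in loadings.items()}
    counts = Counter(assigned[g] for g in characteristic)
    factor, hits = counts.most_common(1)[0]
    candidates = sorted(
        g for g, k in assigned.items() if k == factor and g not in characteristic
    )
    return factor, hits, candidates


def classify(score, upper_limit=0.47):
    """Flag a subject whose factor score exceeds the upper normal limit."""
    return "high-risk" if score > upper_limit else "not high-risk"


# Hypothetical toy loadings on two factors (not data from the dissertation).
loadings = {
    "geneA": [0.8, 0.1],
    "geneB": [0.7, -0.2],
    "geneC": [0.1, 0.9],
    "geneD": [0.6, 0.3],
    "geneE": [-0.2, 0.5],
}
characteristic = {"geneA", "geneB", "geneC"}

factor, hits, candidates = factor_membership(loadings, characteristic)
print(factor, hits, candidates)
print(classify(0.5), classify(0.2))
```

In this toy example the selected factor is the one that collects the most characteristic genes, and the remaining genes on it play the role of the candidates proposed for follow-up.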