Machine-learning-based meta approaches to protein structure prediction
Girgis, Hani Zakaria
The importance of knowing the three-dimensional structure of proteins, and the difficulty of determining it experimentally, have led scientists to develop several computational methods for protein structure prediction. Despite the abundance of protein structure prediction methods, these approaches have two major limitations, among others. First, the top-ranked 3D model reported by a prediction server is not necessarily the best predicted 3D model; the correct 3D model may rank within the top 10 predictions, after some false positives. Second, no single method gives correct predictions for all proteins. To remedy these limitations, protein structure prediction "meta" approaches have been developed. Some meta-servers apply a local model quality assessment program (MQAP) to select a set of candidate 3D models by ranking 3D models obtained from other servers. However, model quality assessment programs suffer from the same two limitations as the prediction servers. In addition, the data available for training machine-learning-based meta-approaches constantly grows, on a monthly or weekly basis. Once new data become available, possibly containing new patterns, one typically discards the models trained on the old data and trains new ones. Clearly, such an approach wastes computation and requires manual human intervention to retrain the learning algorithm. My research has three goals: (i) to invent a novel machine-learning-based meta-MQAP; (ii) to develop a new meta-selector based on the meta-MQAP; and (iii) to devise new machine learning algorithms that extend my meta-MQAP-meta-selector to make use of newly available labeled data dynamically. To that end, (i) I have developed a new meta-MQAP-meta-selector based on a three-level hierarchy of general linear models, and (ii) I have proposed two algorithms to handle the problem of the constantly growing training data.
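The hierarchy of general linear models can be sketched roughly as follows. This is a minimal illustration only: the weight values, the number of models per level, and the function names are assumptions for demonstration, not the trained parameters or architecture of this work.

```python
def linear_model(weights, inputs):
    """One general linear model: a weighted sum of its inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def meta_mqap(mqap_scores, levels):
    """Hierarchy of linear models: each level is a list of weight vectors,
    one per linear model; the outputs of one level feed the next, and the
    final level yields a single consensus quality score for a 3D model."""
    outputs = list(mqap_scores)
    for level in levels:
        outputs = [linear_model(weights, outputs) for weights in level]
    return outputs[0]

# Illustrative three-level hierarchy over three component MQAP scores.
levels = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]],  # level 1: two linear models
    [[1.0, 0.0], [0.0, 1.0]],            # level 2: pass-through models
    [[0.5, 0.5]],                        # level 3: one consensus model
]
score = meta_mqap([0.8, 0.6, 0.7], levels)
```

The point of the sketch is that a weighted combination of several MQAPs' scores can carry more information than any single component score.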
The first algorithm dynamically trains a model on the data related to the unlabeled query (testing) data; in other words, it dynamically trains a custom-made expert. The second algorithm dynamically mixes local experts that have already been trained and cached. My experimental results show that my meta-MQAP outperforms the best of the tested model quality assessment programs by 7%-8% in the overall score. When selecting from the predictions made by humans in the standard benchmark CASP7, my meta-selectors achieve about a 3% improvement over the best human predictor. I participated in the worldwide CASP8 competition with three meta-MQAP-meta-selectors. Based on the evaluation of the 46 target proteins used in the recently completed, truly blind, and independent CASP8 experiment, my meta-MQAP outperforms the best tested MQAP by 6%, 5%, and 29% in the easy, medium, and hard categories, respectively, and by 10% in the overall score. These results show that my meta-MQAP outperforms any of its components, proving that a hierarchy of weighted sums of the MQAPs' scores carries more information than a single MQAP. The three meta-selectors perform very similarly to the best-performing CASP8 server, demonstrating that the "meta" approach used here, namely meta-MQAP meta-selection, is promising, and that further improvements are likely to yield a significant performance gain over the best of the servers.
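The second algorithm, mixing already trained and cached local experts, can be sketched as below. The similarity measure (a Gaussian kernel over a distance to each expert's training-data centroid) and the expert interface are illustrative assumptions, not the specific formulation used in this work.

```python
import math

def gaussian_similarity(query, centroid, bandwidth=1.0):
    """Similarity between a query and an expert's training-data centroid
    (assumed Gaussian kernel; the actual measure may differ)."""
    d2 = sum((q - c) ** 2 for q, c in zip(query, centroid))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))

def mix_experts(query, experts):
    """Dynamically mix cached local experts at query time.
    experts: list of (centroid, predict_fn) pairs, where predict_fn maps
    the query's features to a prediction. Returns the similarity-weighted
    average of the experts' predictions."""
    weights = [gaussian_similarity(query, centroid) for centroid, _ in experts]
    preds = [predict(query) for _, predict in experts]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, preds)) / total
```

A query close to one expert's training data is dominated by that expert's prediction, so no retraining is needed when new experts are simply added to the cache.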