Comparing imputation procedures for Affymetrix gene expression analysis using MAQC datasets
Sadananda Sadasiva Rao, Sreevidya
Background: Microarray technology makes it possible to monitor gene expression on a genome-wide scale and has been widely applied to detect changes in gene activity in many areas of biomedical research. However, because of the complexities of the microarray process, expression values for individual genes may be missing owing to flaws in the array and background noise. The microarray datasets on well-characterized RNA samples from the MicroArray Quality Control (MAQC) project have enabled the assessment of the precision and comparability of microarrays, as well as the strengths and weaknesses of various microarray analysis methods. To date, however, few studies have reported the performance of missing value imputation schemes on the MAQC datasets. In this study, we use the Affymetrix datasets generated by the MAQC project to evaluate various imputation procedures on a single-color microarray platform.

Results: Using the MAQC data, we evaluated several imputation procedures (BPCA, KNN, LLS, LSA, NIPALS, SVD, Row average) and compared them using five error measures (RMSE, LRMSE, NRMSE, RAE, RAEL2). We randomly deleted 5% and 10% of the data and imputed the missing values using these imputation procedures. We performed 1,000 simulations and averaged the results. The results for 5% and 10% deletion are similar. Among all the imputation methods, LLS with k = 4 has the lowest error on every measure, and KNN with k = 1 has the highest error on every measure.

Conclusion: Based on our study, we conclude that, for imputing missing values in Affymetrix microarray datasets pre-processed with the MAS 5.0 scheme, the local least squares method with k = 4 has the best overall performance and the k-nearest neighbour method with k = 1 has the worst overall performance. These results hold for both 5% and 10% missing values. These conclusions are based on technical datasets and do not include any downstream analyses.
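The delete, impute, and score loop described in the Results can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the expression matrix is synthetic rather than MAQC data, only the simplest method (row average) is shown, RMSE is computed in its usual form over the deleted entries, and the function names (mask_random, row_average_impute, rmse, simulate) are hypothetical. The abstract does not define the other four error measures, so they are left out.

```python
import numpy as np


def mask_random(data, fraction, rng):
    """Set a random `fraction` of entries to NaN; return the masked copy and the mask."""
    n_missing = int(round(fraction * data.size))
    flat_mask = np.zeros(data.size, dtype=bool)
    flat_mask[rng.choice(data.size, size=n_missing, replace=False)] = True
    mask = flat_mask.reshape(data.shape)
    masked = data.copy()
    masked[mask] = np.nan
    return masked, mask


def row_average_impute(masked):
    """Fill each missing value with the mean of the observed values in its row (gene).

    Assumes every row keeps at least one observed value.
    """
    row_means = np.nanmean(masked, axis=1, keepdims=True)
    return np.where(np.isnan(masked), row_means, masked)


def rmse(truth, imputed, mask):
    """Root mean squared error over the deleted entries only."""
    diff = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def simulate(data, fraction, n_runs, seed=0):
    """Repeat delete -> impute -> score n_runs times and average the RMSE."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_runs):
        masked, mask = mask_random(data, fraction, rng)
        errors.append(rmse(data, row_average_impute(masked), mask))
    return float(np.mean(errors))


# Synthetic stand-in for an expression matrix (genes x arrays); an assumption.
demo = np.random.default_rng(42).normal(loc=8.0, scale=1.0, size=(200, 20))
for frac in (0.05, 0.10):
    print(f"{frac:.0%} deleted: mean RMSE = {simulate(demo, frac, n_runs=20):.3f}")
```

The study averaged over 1,000 runs; the sketch uses 20 only to keep it quick. Under the same assumptions, another method could be compared by swapping it in for row_average_impute while keeping the masking and scoring steps unchanged.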