Title: Data imputation algorithms for mixed variable types in large scale educational assessment: a comparison of random forest, multivariate imputation using chained equations, and MICE with recursive partitioning
Authors: W. Holmes Finch; Maria E. Hernandez Finch; Melissa Singh
Addresses: Department of Educational Psychology, Ball State University, Muncie, IN 47306, USA; Department of Educational Psychology, Ball State University, Muncie, IN 47306, USA; Department of Educational Psychology, Ball State University, Muncie, IN 47306, USA
Abstract: Missing data is a major issue with which researchers working on large scale assessments must contend. Such research efforts frequently collect a wide array of variable types, including dichotomous, ordinal, nominal, normal, skewed, and count variables. This variation in data distributions renders many recommended methods for missing data imputation less than optimal because they assume a single joint probability model for all variables. This simulation study compared four imputation methods: random forest imputation (RF), multivariate imputation by chained equations (MICE), and two combinations of these approaches, in which MICE uses either recursive partitioning trees (MICE-RPT) or random forests (MICE-RF). Results reveal that data imputed with RF, MICE, MICE-RF, and MICE-RPT yield more accurate parameter estimates than data treated with listwise deletion (LD), and that MICE-RF and MICE-RPT are associated with more accurate estimates than MICE or RF alone. Implications of these results and recommendations for practice are discussed.
Keywords: missing data; random forest; multivariate imputation; chained equations; MICE; data imputation; recursive partitioning; educational assessment datasets; simulation.
DOI: 10.1504/IJQRE.2016.077803
International Journal of Quantitative Research in Education, 2016 Vol.3 No.3, pp.129 - 153
Received: 23 Dec 2014
Accepted: 02 Oct 2015
Published online: 15 Jul 2016