Title: Evaluating the performance of regression algorithms on datasets with missing data
Authors: Luciano Costa Blomberg; Daiane Hemerich; Duncan Dubugras Alcoba Ruiz
Addresses: Graduate Program in Computer Science, Pontifical Catholic University of Rio Grande do Sul (PUCRS), Av. Ipiranga 6681, Porto Alegre – RS, Brazil (all authors)
Abstract: Real-world applications frequently involve missing data, which makes data analysis a non-trivial task. This paper analyses six representative regression algorithms, evaluating their predictive performance and their sensitivity to missing data. For this purpose, we used 20 public datasets and manipulated them to contain controlled levels of missing data. Our empirical analysis shows that RepTree is the least influenced by missing data, followed by LinearRegression. IBK is the most influenced, presenting the highest error. However, M5P remains the algorithm with the best predictive performance, although it is only the fourth least influenced by missing data.
Keywords: business intelligence; missing data; machine learning; data mining; regression algorithms; predictive performance.
DOI: 10.1504/IJBIDM.2013.057744
International Journal of Business Intelligence and Data Mining, 2013 Vol.8 No.2, pp.105 - 131
Received: 05 Jul 2013
Accepted: 11 Jul 2013
Published online: 28 Jun 2014