Abstract
The treatment of missing data is a recurrent problem in biology, particularly in sero-epidemiological studies. The most common approach is to restrict the analysis to subjects with complete information on the variables of interest, which can reduce the sample size and/or introduce bias into the estimates.
The aim of this work is to study the statistical methods developed for handling missing data. We consider two approaches. The first is single imputation: mean imputation (I.mean), k-nearest neighbors imputation (knn), and regression prediction (I.Reg). The second is multiple imputation: bootstrap-based multiple imputation approximating the results of the EM algorithm (IM.EM) and multiple imputation using predictive mean matching (IM.pmm).
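As an illustration of the two simplest single imputation methods named above (I.mean and knn), the following minimal Python sketch may help. It is not the implementation used in this study, which was carried out in R. The function names, the toy data, and the choice of Euclidean distance over fully observed covariates are assumptions made for the example.

```python
import math


def impute_mean(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_knn(rows, target, k=2):
    """Fill missing values in column `target` using the k nearest complete rows.

    Assumption for this sketch: all other columns are fully observed, and
    closeness is measured by Euclidean distance over those columns.
    """
    donors = [r for r in rows if r[target] is not None]
    completed = []
    for row in rows:
        new_row = list(row)
        if row[target] is None:
            def distance(donor, row=row):
                return math.sqrt(sum(
                    (row[j] - donor[j]) ** 2
                    for j in range(len(row)) if j != target
                ))
            nearest = sorted(donors, key=distance)[:k]
            # Impute with the average of the neighbors' observed values.
            new_row[target] = sum(d[target] for d in nearest) / len(nearest)
        completed.append(new_row)
    return completed


# Hypothetical toy data: (covariate, measurement), one measurement missing.
toy_rows = [(1.0, 10.0), (2.0, 20.0), (3.0, None), (10.0, 100.0)]
mean_filled = impute_mean([r[1] for r in toy_rows])
knn_filled = impute_knn(toy_rows, target=1, k=2)
```

On this toy data, mean imputation is pulled toward the outlying value, whereas the k-nearest neighbors estimate relies only on the two closest subjects.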
Data from 1448 children under 10 years of age, from eight villages in the Toubacouta administrative subdivision of the Fatick region, were used to assess the effect of malaria. From a complete sample of 300 children, we created 10 incomplete databases with rates of missing values ranging from 5% to 50%, and obtained 290 completed databases by imputation. For every completed database, we computed the mean absolute error (MAE) and the root mean square error (RMSE), as well as estimates of the mean and standard deviation, in order to compare the different methods. Calculations were done using R version 2.15.1.
All methods produced estimates with a fairly low average error rate (between 1.33% and 4.9%). This rate was lower with the multiple imputation methods (IM.pmm, IM.EM) and with k-nearest neighbors imputation. For the estimation of means and standard deviations, the single imputation methods (I.mean, I.Reg) gave more centered estimates, with a slight misestimation from 25% of missing data onward. Overall, the multiple imputation methods (IM.pmm, IM.EM) and k-nearest neighbors imputation gave the best results.
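The evaluation principle described above (delete known values, impute them, then compare against the truth) can be sketched as follows. This is a hedged illustration in Python rather than the R code used in the study. The helper names are hypothetical, values are assumed to be deleted completely at random, and the errors are assumed to be computed over the deleted entries only, using MAE = (1/m) Σ |x_i − x̂_i| and RMSE = sqrt((1/m) Σ (x_i − x̂_i)²), where m is the number of deleted entries.

```python
import math
import random


def mask_at_random(values, rate, seed=0):
    """Delete a fraction `rate` of the values completely at random.

    Returns the masked list (None marks a deleted value) and the sorted
    indices of the deleted entries.
    """
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing_idx else v for i, v in enumerate(values)]
    return masked, sorted(missing_idx)


def mae(true_values, imputed_values, idx):
    """Mean absolute error over the entries listed in `idx`."""
    return sum(abs(true_values[i] - imputed_values[i]) for i in idx) / len(idx)


def rmse(true_values, imputed_values, idx):
    """Root mean square error over the entries listed in `idx`."""
    squared = sum((true_values[i] - imputed_values[i]) ** 2 for i in idx)
    return math.sqrt(squared / len(idx))


# Hypothetical example: 20 known values, 25% of them deleted.
complete = [float(v) for v in range(20)]
masked, deleted = mask_at_random(complete, rate=0.25)

# Any imputation method can be plugged in here; mean imputation is used
# only to keep the sketch short.
observed = [v for v in masked if v is not None]
fill = sum(observed) / len(observed)
imputed = [fill if v is None else v for v in masked]

print(mae(complete, imputed, deleted), rmse(complete, imputed, deleted))
```

Repeating this procedure for each missing-data rate and each imputation method gives one pair of error values per completed database, which is the kind of comparison summarized in this abstract.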
Our results show that multiple imputation (IM.EM, IM.pmm) is the most appropriate approach. The k-nearest neighbors method also gives satisfactory results.