Conférence Internationale de Statistique Appliquée pour le Développement en Afrique / International Conference on Applied Statistics for Development in Africa

Statistique Appliquée pour le Développement en Afrique

4-8 March 2013, Cotonou (Benin)

Thursday 7

Epidemiology 3

8:30 - 9:00 (30min)

sciencesconf.org:sada2013:10906

Treatment of missing data in a sero-epidemiological study

Oumy Niass 1, 2, Aïssatou Touré 2, Abdou Kâ Diongue 3, Ali Souleymane Dabye 3

1 : Laboratoire d’Etudes et de Recherches en Statistiques et Développement (LERSTAD)

2 : Unité d’Immunologie, Institut Pasteur de Dakar, 36 Avenue Pasteur, BP 220 Dakar-Sénégal (UI/IPD)

3 : Laboratoire d’Etudes et de Recherches en Statistiques et Développement, UFR SAT, UGB, BP 234, Saint-Louis, Sénégal (LERSTAD)

Abstract

The treatment of missing data is a recurrent problem in biology, particularly in sero-epidemiological studies. The most common approach is to restrict the analysis to subjects with complete information on all variables of interest (complete-case analysis), which can reduce the sample size and/or bias the estimates.

The aim of this work is to study statistical methods developed for handling missing data. We consider two approaches. The first is single imputation: mean imputation (I.mean), k-nearest neighbors imputation (knn), and regression prediction (I.Reg). The second is multiple imputation: bootstrap-based multiple imputation approximating the results of the EM algorithm (IM.EM), and multiple imputation using predictive mean matching (IM.pmm).
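
As an illustration of the two simplest single-imputation methods named above, here is a minimal sketch in Python (the study itself used R, and the abstract does not give its implementation). The function names `mean_impute` and `knn_impute`, the use of `None` to mark a missing value, the Euclidean distance, and the assumption that all columns other than the target are fully observed are choices made for this example only.

```python
import math


def mean_impute(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def knn_impute(rows, target, k=2):
    """Fill missing values in column `target` using the k nearest complete rows.

    Assumption for this sketch: only column `target` may contain None;
    the other columns are fully observed and are used to measure distance.
    """
    # Donors are the rows where the target column is observed.
    donors = [r for r in rows if r[target] is not None]
    completed = []
    for r in rows:
        if r[target] is not None:
            completed.append(list(r))
            continue

        def dist(d, r=r):
            # Euclidean distance over every column except the target.
            return math.sqrt(sum((a - b) ** 2
                                 for j, (a, b) in enumerate(zip(r, d))
                                 if j != target))

        nearest = sorted(donors, key=dist)[:k]
        # Impute with the average target value of the k nearest donors.
        fill = sum(d[target] for d in nearest) / len(nearest)
        new_row = list(r)
        new_row[target] = fill
        completed.append(new_row)
    return completed
```

Mean imputation ignores the other variables entirely, whereas the nearest-neighbors variant borrows values from similar subjects, which is one plausible reason the two can behave differently as the missing rate grows.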

Data from 1448 children under 10 years of age from eight villages in the administrative subdivision of Toubacouta, Fatick region, are used to assess the effect of malaria. From a complete sample of 300 children, we created 10 incomplete databases, with missing-value rates ranging from 5% to 50%, and 290 completed databases obtained by imputation. For each completed database, we computed the mean absolute error (MAE) and the root mean square error (RMSE), as well as estimates of the mean and standard deviation, in order to compare the methods. Calculations were done using R version 2.15.1.

All methods produced estimates with a fairly low average error rate (between 1.33% and 4.9%). This rate was lower with the multiple imputation methods (IM.pmm, IM.EM) and the k-nearest neighbors method. For the estimation of means and standard deviations, the single imputation methods (I.mean, I.Reg) gave better-centered estimates, with a slight bias appearing from 25% of missing data onward. Overall, multiple imputation (IM.pmm, IM.EM) and the k-nearest neighbors method gave the best results.
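
The evaluation design described above (delete values from a complete sample, impute them, then compare with the known truth) can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: the helper names `make_missing` and `evaluate` are hypothetical, values are deleted completely at random (the abstract does not state the missingness mechanism), and errors are returned in the variable's own units rather than as the percentages reported in the abstract.

```python
import math
import random


def make_missing(values, rate, seed=0):
    """Return a copy of `values` with a fraction `rate` set to None,
    plus the sorted indices that were deleted.

    Assumption: deletion is completely at random.
    """
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in idx else v for i, v in enumerate(values)]
    return masked, sorted(idx)


def mae(true, pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def rmse(true, pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def evaluate(true_values, completed, masked_idx):
    """Compute (MAE, RMSE) over the deleted cells only."""
    t = [true_values[i] for i in masked_idx]
    p = [completed[i] for i in masked_idx]
    return mae(t, p), rmse(t, p)
```

Restricting the error to the deleted cells keeps the observed values, which every method reproduces exactly, from diluting the comparison between methods.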

Our results show that multiple imputation (IM.EM, IM.pmm) is the most appropriate approach. The k-nearest neighbors method also gives satisfactory results.

Type	:	oral
Topics	:	Epidemiology 3
Keywords	:	serology ; missing ; data ; Toubacouta ; imputation
