
A missing values tour with principal components methods

The problem of missing values has existed since the earliest attempts to exploit data as a source of knowledge, as it is intrinsic to the process of obtaining, recording, and preparing the data itself. Clearly, to cite Gertrude Mary Cox, "The best thing to do with missing values is not to have any", but in a contemporary world with ever-growing demand for statistical justification and ever-growing amounts of accessible data, this is not always possible, to say the least. Missing values occur for a variety of reasons: machines that fail, survey participants who do not answer certain questions, destroyed or lost data, dead animals, damaged plants, etc. In addition, missing data are almost ubiquitous for anyone analyzing multi-source data, performing meta-analyses, etc. Missing values are problematic since most statistical methods cannot be applied directly to incomplete data. In this talk, we show how to perform dimensionality reduction methods such as Principal Component Analysis (PCA) with missing values. PCA is a powerful tool for studying the similarities between observations and the relationships between variables, and for visualizing data. Then we show how principal component methods can be used to predict (impute) the missing values. These approaches have shown excellent performance in recommendation system problems such as the "Netflix challenge" and consequently caught the attention of the machine learning community. Indeed, the methods can handle large matrices with a large number of missing entries. We also present other popular techniques for imputing missing values, and discuss the potential pitfalls of the different approaches and the challenges that need to be addressed in the future.
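
To make the idea of imputing with principal components concrete, here is a minimal sketch of one common scheme: fill the missing cells with column means, fit a low-rank PCA reconstruction, replace the missing cells with the reconstructed values, and repeat until the imputed values stabilize. This is an illustration under stated assumptions, not necessarily the method presented in the talk: the function name `iterative_pca_impute` and its parameters `rank`, `n_iter` and `tol` are hypothetical, missing entries are assumed to be encoded as NaN, every column is assumed to have at least one observed value, and no regularization or variable scaling is applied.

```python
import numpy as np


def iterative_pca_impute(X, rank=2, n_iter=200, tol=1e-8):
    """Impute NaN entries of X with a rank-`rank` PCA reconstruction.

    Hypothetical helper for illustration: unregularized, no scaling,
    and it assumes each column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start from mean imputation, using observed values only.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(n_iter):
        # Center the completed matrix and take its truncated SVD (PCA).
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mean

        # Only the missing cells are overwritten; observed cells are kept.
        updated = np.where(missing, low_rank, X)
        change = np.sum((updated - filled) ** 2)
        filled = updated
        if change < tol:
            break

    return filled


if __name__ == "__main__":
    # Small made-up example: a noiseless rank-1 table with two holes.
    X = np.outer([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
    X[0, 2] = np.nan
    X[3, 1] = np.nan
    print(iterative_pca_impute(X, rank=1))
```

In practice the number of components has to be chosen (for instance by cross-validation), and an unregularized version like this one can overfit when many entries are missing, which is one of the pitfalls an imputation method has to guard against.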

CMAP UMR 7641 École Polytechnique CNRS, Route de Saclay, 91128 Palaiseau Cedex France, Tél: +33 1 69 33 46 23 Fax: +33 1 69 33 46 46