Missing data can plague researchers in many scenarios, arising from incomplete surveys, broken or destroyed experimental units, or errors in data collection and computation. This short course will explore what missing data is and where it comes from, as well as how to deal with it effectively. First, we will explore the concepts of "missing completely at random", "missing at random", and "not missing at random", learning the differences among the three, how to judge which one fits a particular data set, and how this classification affects our procedures for handling the missing data. Next, we will briefly cover early methods for handling missing data, such as complete case analysis and single imputation techniques (mean, hot deck, etc.), and why in practice they can produce biased or inefficient results. Finally, we will learn the basics of multiple imputation and how to apply it to some real-world data sets.
We will use SAS to perform the imputation methods, and the instructor will explain the code. Basic SAS knowledge will be helpful but is not required. Some basic probability and Bayesian background may also be helpful. We will use a San Francisco data set in which the goal is to predict household income from demographic information. The data set may be found at the link below, but it will be provided in the course.
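As a taste of the workflow the course will demonstrate, a multiple-imputation analysis in SAS typically runs PROC MI to create several completed data sets, fits the analysis model to each one, and then pools the results with PROC MIANALYZE. The sketch below illustrates this pattern; the data set name and variable names (income, age, hhsize, education) are placeholders, not the actual course variables.

```sas
/* Step 1: create 5 imputed data sets (stacked, indexed by _Imputation_) */
proc mi data=sf_income out=sf_income_mi nimpute=5 seed=12345;
   var income age hhsize education;   /* placeholder variable names */
run;

/* Step 2: fit the analysis model separately to each imputed data set */
proc reg data=sf_income_mi outest=reg_est covout noprint;
   model income = age hhsize education;
   by _Imputation_;
run;

/* Step 3: combine the 5 sets of estimates using Rubin's rules */
proc mianalyze data=reg_est;
   modeleffects Intercept age hhsize education;
run;
```

The pooled estimates from PROC MIANALYZE incorporate both within-imputation and between-imputation variability, which is what distinguishes multiple imputation from the single-imputation methods covered earlier in the course.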
Income Data. Impact Resources, Inc., Columbus, OH (1987). Retrieved Jan 30, 2014 from http://sci2s.ugr.es/keel/dataset.php?cod=163