Introduction
To understand living conditions and assess the impacts of programs aimed at improving those conditions, governments and development partners have often relied on household quantitative and qualitative surveys of samples of the population. Sometimes, this sample focuses on specific types of individuals, while at other times the sample is meant to be representative of the whole population. These surveys are seen as the best way to discover both the overall well-being of the population and the potential impacts of a certain project.
Development aid resources are finite, and while these surveys are important, they can be very expensive. For example, the cost of conducting a standard round of the Demographic & Health Surveys (DHS) in even a medium-sized country averages approximately $1M; in countries with large populations, such as Cote d'Ivoire, these costs typically exceed $1.2M (Jerven 2017). Because of this high cost, nationally representative surveys are conducted only infrequently (typically every 3-5 years). Moreover, these costs can be prohibitive for potential impact evaluations of development programs, so only a small share of such programs is covered by survey-based data collection. Finding ways to economize on survey costs while still gathering high-quality data from the target population may thus be crucial.
One way to potentially decrease survey costs is to build on existing (secondary) geolocated data that are increasingly available, including national or project household surveys, administrative data, and remotely sensed data (from satellites and other sensors). These can be used to improve the precision of estimates from new primary data, thereby reducing the sample sizes required in these new data collection efforts, along with the associated cost and effort. Moreover, in evaluation settings, these additional data sources can sometimes be used in place of primary data where no such data were collected, especially as baseline data that help identify comparison groups similar to the treated groups.
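The link between precision and sample size can be illustrated with a textbook normal approximation for a two-arm comparison of means. This is a minimal sketch and not the simulation approach used in this paper: it assumes equal-sized arms, a standardized effect size, and a single secondary-data covariate whose correlation with the outcome is `rho`. The function name `n_per_arm` and all numeric inputs are hypothetical and chosen only for illustration. Under these assumptions, adjusting for the covariate shrinks the residual variance, and hence the required sample size, by a factor of roughly (1 - rho^2).

```python
import math
from statistics import NormalDist


def n_per_arm(effect_size, rho=0.0, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-arm comparison of means.

    effect_size: standardized difference in means (delta / sigma), assumed.
    rho: assumed correlation between the secondary-data covariate and the
         outcome; rho = 0 corresponds to no covariate adjustment.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    # Covariate adjustment scales the residual variance by (1 - rho^2).
    n = 2 * (z_alpha + z_power) ** 2 * (1 - rho ** 2) / effect_size ** 2
    return math.ceil(n)


# Illustrative inputs only: a standardized effect of 0.2, with and without
# a covariate correlated 0.5 with the outcome.
unadjusted = n_per_arm(0.2)
adjusted = n_per_arm(0.2, rho=0.5)
print(unadjusted, adjusted)
```

With these illustrative inputs, a covariate correlated 0.5 with the outcome lowers the approximate requirement by about a quarter. The size of any real saving depends on how strongly the secondary data predict the outcome in the setting at hand.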
Using existing secondary data for these purposes poses a number of challenges. First, high-quality surveys, such as the DHS and Multiple Indicator Cluster Surveys (MICS), are available only for samples of the population, and these are often sparsely distributed over a given country or area and only representative at national scales. As a result, most locations in a country are not directly covered by a DHS round. Second, administrative data (such as health facility data) and remotely sensed data (such as satellite imagery) may not capture the exact outcomes of interest in a given study, particularly in the health, education, and related social sectors. However, recent progress against both challenges now puts workable, general solutions within reach. For example, newly produced interpolated layers from the Institute for Health Metrics and Evaluation (IHME) provide estimates of key health measures at small subnational scales, potentially allowing us to compensate for populations not covered in sample surveys.
We aim to help evaluators, funders, and project implementers understand their options for combining multiple rounds of surveys and spatial data to evaluate projects. We describe new geospatial methods that allow these diverse datasets to be spatially interpolated and joined to survey data from primary data collection. We simulate the statistical power under the most common cases that entail alternative configurations of these methods. We concentrate on cases where new, primary data collection for both baseline and follow-up rounds is not feasible or not within budget. We then provide both an overall assessment and a comparison of the statistical power afforded by the alternative configurations. As an application of these methods, we focus on the case of HIV/AIDS indicators in Cote d'Ivoire, derived from geolocated DHS and MICS data, as well as interpolated layers from IHME. This setting is particularly salient for our application because it offers both a rich potential set of data configurations and an array of development efforts aimed at combating HIV/AIDS in a large population.
Our simulation results suggest that combining baseline outcome measures from predicted surfaces with follow-up measures from newly collected primary data provides the greatest statistical precision. Notably, such designs are more feasible than alternatives that rely on baseline survey data collection, opening the door for many more retrospectively conducted evaluation designs. These results also highlight how newly created predicted surfaces can be used to help recover "missing baseline" scenarios (where no baseline survey was conducted in both treated and comparison locations), an important feature of geospatial impact evaluations.