Incorporating survey-derived information in human trait modeling
Shaila Musharoff (Cornell University)
In the era of biobank-scale human genetic datasets, trait modeling requires new approaches. These datasets also contain individual-level survey data that may capture trait-relevant environmental information. However, survey data are typically noisy and affected by missingness, which makes integrating them with genetic data to model human traits challenging. In addition, environmental factors differ between populations, further complicating cross-population comparisons. Here, we consider two key trait modeling tasks: heritability estimation from population samples and trait prediction with polygenic risk scores. We analyze data from the All of Us Research Program, which include genetic data and health- and lifestyle-related surveys. We apply dimensionality reduction techniques to summarize the survey data and include these survey summaries as covariates in heritability estimation and trait prediction models. Applying these models to common biomarkers, several of which are used in disease diagnosis, we find that the gains from including survey-derived data vary by population and by trait, indicating a context-specific role for survey data in trait modeling.
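
As a rough illustration of the covariate idea described above, the following minimal sketch summarizes a noisy, partly missing survey matrix with principal components and compares a trait model with and without those summaries. It is not the analysis from the talk: the abstract does not name a specific dimensionality reduction or imputation method, so mean imputation and PCA are assumptions here, all data are simulated, and the names `survey_pcs` and `r_squared` are hypothetical helpers.

```python
import numpy as np


def survey_pcs(survey, n_components):
    """Mean-impute missing answers, standardize, and return top PC scores.

    Mean imputation and PCA are illustrative assumptions, not the
    methods used in the talk.
    """
    col_means = np.nanmean(survey, axis=0)
    filled = np.where(np.isnan(survey), col_means, survey)
    std = filled.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant questions
    z = (filled - filled.mean(axis=0)) / std
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def r_squared(covariates, trait):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(trait)), covariates])
    beta, *_ = np.linalg.lstsq(design, trait, rcond=None)
    resid = trait - design @ beta
    return 1.0 - resid.var() / trait.var()


# Simulated data (purely hypothetical): a latent environmental factor
# drives both the survey answers and the trait.
rng = np.random.default_rng(0)
n, n_questions = 500, 20
environment = rng.normal(size=n)
prs = rng.normal(size=n)  # stand-in for a polygenic risk score
loadings = rng.uniform(0.5, 1.0, size=n_questions)
survey = environment[:, None] * loadings + rng.normal(size=(n, n_questions))
survey[rng.random(survey.shape) < 0.1] = np.nan  # about 10% missing answers
trait = 0.5 * prs + 0.5 * environment + rng.normal(scale=0.5, size=n)

pcs = survey_pcs(survey, n_components=3)
r2_prs = r_squared(prs[:, None], trait)
r2_both = r_squared(np.column_stack([prs, pcs]), trait)
print(f"R^2 with PRS only:         {r2_prs:.3f}")
print(f"R^2 with PRS + survey PCs: {r2_both:.3f}")
```

Because the two models are nested, the in-sample R^2 can only stay the same or increase when survey summaries are added; a real comparison would evaluate both models on held-out individuals, and separately within each population.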