Ethan Linck

Inferring population structure – the subdivision of a species into groups of individuals interbreeding with each other at a higher frequency than expected by chance – is a fundamental goal of population genetics. Beyond its obvious relevance to systematics, taxonomy, and conservation, understanding patterns of population structure is crucial for a range of applications, from detecting selection and migration to identifying genetic variants associated with specific traits in genome-wide association studies (the red-hot “GWAS” trend). The wealth of data resulting from the advent of next-generation sequencing technology has proved a boon for describing subdivision in non-model organisms, but it has also created challenges. For example, many computational methods for detecting structure were developed prior to the genomics era and may not be suited to the unique characteristics of large SNP datasets. Of these, Pritchard et al.’s structure is the most widely cited, and it is the basis for a raft of other methods that feature an underlying generative model and explicitly estimate a suite of population genetic parameters.

Figure 2 from our preprint, showing inferred (A) population discrimination and (B) admixture across different MAF levels.

Unfortunately, at least in our lab’s experience, these model-based approaches frequently fail with modern datasets produced by methods like RADseq. In particular, we’ve observed a pattern where all individuals have the great majority of their ancestry assigned to a single population, producing a “smear”, with only the remaining small percentage of ancestry showing informative clustering. Minor allele frequency (MAF) cutoffs (or selecting parsimony-informative sites alone) appeared to help partially – perhaps reflecting theoretical predictions that different frequency classes of alleles reflect different evolutionary histories and levels of subdivision. But we lacked a real rationale for applying these filters while processing our data, and we didn’t understand what different levels of filtering actually did.
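For readers who haven’t applied one, here is a minimal sketch of what a MAF cutoff does to a SNP matrix. It is not the pipeline from our preprint: the function names (`minor_allele_frequency`, `filter_by_maf`), the 0/1/2 genotype coding, the toy data, and the 0.12 threshold are all assumptions made purely for illustration.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF for one biallelic SNP.

    `genotypes` holds one value per diploid individual: the count of
    alternate alleles (0, 1, or 2), or None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0  # no data at this site; treat it as uninformative
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(snp_matrix, min_maf):
    """Keep only SNPs (rows) whose MAF is at least `min_maf`."""
    return [snp for snp in snp_matrix if minor_allele_frequency(snp) >= min_maf]


# Hypothetical toy data: 4 SNPs (rows) x 5 individuals (columns).
snps = [
    [0, 0, 0, 0, 1],     # singleton: 1 of 10 allele copies
    [0, 1, 2, 1, 0],     # common variant: 4 of 10 allele copies
    [2, 2, 2, 2, 2],     # fixed for the alternate allele, so not variable
    [1, None, 0, 0, 0],  # one missing call: 1 of 8 called allele copies
]

kept = filter_by_maf(snps, min_maf=0.12)
print(f"kept {len(kept)} of {len(snps)} SNPs")
```

Raising `min_maf` strips out singletons and other rare alleles first, which is exactly the frequency class that the filtering question above is about.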

To address these issues, C.J. and I used both simulated and empirical (Golden-crowned Kinglets!) SNP datasets to explore how MAF choice affected our ability to recover known population limits. We looked at both model-based programs like structure and non-model-based approaches like k-means clustering with principal components. We found that the former class of methods is indeed very sensitive to the MAF cutoff, while the latter class is fairly robust; we speculate on why and make a few recommendations for best practices. You can read our new preprint here – we’re hoping to submit it soon, but would love feedback in the meantime.
