Variant Filtering

In the previous lab we produced multi-sample .vcf files and learned how to interpret the most important fields of these plain text files—particularly those related to variant quality, such as DP (read depth), AD (allele depth), and MQ (mapping quality). We now turn to these fields as the basis for filtering our data for reliability, size, and content.

First, some caveats are in order. As you likely gathered from this week’s reading assignments, variant filtering is a complex topic beyond what we can address in a single week of class. There are also no hard-and-fast rules: The quality filters that work for one dataset may not work for another, and the very goals of filtering will differ based on the requirements of a given downstream analysis. As a result, it is not uncommon to produce multiple filtered .vcf files for different purposes, each with a different composition and quantity of SNPs and/or indels. In keeping with our general approach this semester, today’s goal is simply to introduce a fundamental variant filtering tool, and some of the options you might choose to use with it. (We will likely return to filtering in the second half of the semester, when we will work with larger datasets.)

Recall first that the command bcftools view -H sample.vcf | wc -l counts the number of variants—both SNPs and indels—in your .vcf file, something that may be useful for keeping track of the impact of parameter choices on the quantity of data (e.g., for the homework). Confusingly, however, we will be doing the actual filtering with a program called VCFtools, written and maintained by a separate research group. As usual, I recommend reviewing the documentation to get a feel for its uses and style. Our mamba environment today should thus include both of these tools (and a few others besides):

Save the following as vcftools.yml:

name: vcftools
channels:
  - conda-forge
  - bioconda
dependencies:
  - vcftools
  - bcftools
  - htslib
  - tabix
  - samtools

Then create the environment:

mamba env create -f vcftools.yml

(At this point, you may be wondering why we didn’t simply create a single mamba environment with all the tools we will need for the duration of the semester. My response is that doing so would reduce the names of these programs to jargon—and also take many minutes to install.)

Before beginning, let’s compress our .vcf with gzip to speed up processing. For datasets as small as ours, this step is hardly necessary; however, it is generally good practice, and can make a large difference as the number of SNPs and samples increases. Do so on the log-in node, editing the command below for the specifics of your file:

gzip multi_sample.vcf
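Compression is lossless, so the variant count should be unchanged. If you want to reassure yourself of this, you can count the data (non-header) lines without decompressing the file on disk; the grep-based count below is a rough stand-in for bcftools and assumes your file is named as above:

```shell
# Count non-header lines (i.e., variant records) in the compressed VCF;
# zcat streams the decompressed text without touching the file on disk.
zcat multi_sample.vcf.gz | grep -vc '^#'
```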

Our first filtering step is standard for many population genetics studies: removing indels and leaving only SNPs in the resulting data. To do so, create a new .sbatch script that loads Mamba, makes it callable, loads your new vcftools environment, invokes vcftools with these arguments, and saves a file with a before and after summary of variant numbers:

vcftools --gzvcf multi_sample.vcf.gz --remove-indels --recode --out step1

This should produce an uncompressed .vcf with the prefix step1 (or whatever you put in its stead). Check the .err file using cat or tail. After some unimportant warnings triggered by the .vcf version, you should see text like this:

After filtering, kept 2 out of 2 Individuals
Outputting VCF file...
After filtering, kept 369 out of a possible 388 Sites
Run Time = 0.00 seconds

A helpful summary, though somewhat hard to extract. (We’ll return to this later.) Next, let’s remove all sites with a Phred-scaled quality score lower than 40:

vcftools --vcf step1.recode.vcf --minQ 40 --recode --out step2 
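Each of these runs appends the same style of summary to its log. If all you want is the kept-sites line from every run, a quick grep works; the .err filenames below are assumptions, so adjust the pattern to however your jobs name their output:

```shell
# Pull the "kept ... out of a possible ... Sites" line from every log;
# -H prefixes each match with its filename so runs stay distinguishable.
grep -H 'out of a possible' *.err
```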

By now you should be getting into the swing of things. We’ll perform one more filtering step together, thinning our .vcf so that no two sites are closer than 50 nucleotides apart:

vcftools --vcf step2.recode.vcf --thin 50 --recode --out step3 
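If you want to convince yourself the thinning worked, a short awk sketch can report the smallest gap between consecutive sites on the same chromosome; after --thin 50 this should never be below 50. (This assumes a position-sorted VCF, which is what vcftools emits; with only one site per chromosome it will print the large starting value.)

```shell
# Smallest distance between consecutive sites on the same chromosome.
awk 'BEGIN { min = 1e9 }
     !/^#/ { if ($1 == chrom && $2 - pos < min) min = $2 - pos
             chrom = $1; pos = $2 }
     END { print min }' step3.recode.vcf
```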

Simple bash scripting can summarize the impacts of filtering decisions each step of the way. The script below first writes a header row to a file called snp_counts.tsv with two columns—filename and total SNP count—separated by a tab (the \t below). It then loops over each file matched by the wildcard, counts its SNPs with a bcftools command, and appends the filename and count as a new row of the .tsv. (You should create a new .sbatch script to run this, as it requires an environment with bcftools available and may be memory-intensive for large files.)

echo -e "file\tsnp_count" > snp_counts.tsv
for f in *.vcf *.vcf.gz; do   # include the compressed original; bcftools reads either
    count=$(bcftools view -H -v snps "$f" | wc -l)
    echo -e "${f}\t${count}" >> snp_counts.tsv
done

The output will look something like this:

file    snp_count
multi_sample.vcf.gz    388
step1.recode.vcf    369
step2.recode.vcf    360
step3.recode.vcf    354
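To see the cumulative impact at a glance, another line of awk can convert these counts to percentages of the raw total. This sketch assumes the first data row of snp_counts.tsv holds the unfiltered file, as in the listing above:

```shell
# Express each file's SNP count as a percentage of the raw total;
# assumes the first data row of snp_counts.tsv is the unfiltered file.
awk 'NR == 1 { next }            # skip the header row
     NR == 2 { total = $2 }      # raw SNP count
     { printf "%s\t%.1f%%\n", $1, 100 * $2 / total }' snp_counts.tsv
```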

At this point I must again confess that I have forced you to do something the long way, as the typical production VCFtools command takes multiple arguments simultaneously. In other words, we could have simply run the following line of code to skip directly to the end of this filtering tutorial:

vcftools --gzvcf multi_sample.vcf.gz --remove-indels --minQ 40 --thin 50 --recode --out final

The upsides of this are obvious. The downside is that you have no way of determining which filter removed the most variants—and thus whether it might be worth changing to be more stringent or more sensitive.

VCFtools Summary Statistics

Those of you who were diligent in perusing the VCFtools manual may have noticed that in addition to filtering, the program can be used to produce a wide range of common population genetic summary statistics. For example, running the following command (in an .sbatch) will produce a summary of the count of mutations of different types, as well as the ratio of transitions to transversions:

vcftools --vcf step3.recode.vcf --TsTv-summary 

Similarly, appending the flag --site-quality will output per-site SNP quality (taken from the QUAL column of a .vcf), --site-pi will output average pairwise nucleotide diversity (\(\pi\)), etc. Most useful of these options are the flags for \(F_{ST}\) (especially in rolling windows for genome scans), \(H_e\), Hardy-Weinberg Equilibrium, and other widely reported metrics; keep this functionality in mind for your own work.
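As a cross-check on --TsTv-summary, you can approximate the ratio yourself: among single-base REF/ALT pairs, transitions are A↔G and C↔T, and everything else is a transversion. The awk sketch below is deliberately rough—it skips multiallelic and multi-base records entirely—but it illustrates the arithmetic behind the statistic:

```shell
# Count transitions (A<->G, C<->T) vs. transversions among biallelic SNPs;
# records with multi-base or multiallelic REF/ALT fields are skipped.
awk '!/^#/ && length($4) == 1 && length($5) == 1 {
         pair = $4 $5
         if (pair ~ /^(AG|GA|CT|TC)$/) ts++; else tv++
     }
     END { printf "Ts=%d Tv=%d Ts/Tv=%.2f\n", ts, tv, (tv ? ts / tv : 0) }' step3.recode.vcf
```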

Homework 7

Today’s homework includes the following tasks:

  1. First, I’d like you to start your filtering over from scratch. Specifically, I’d like to have you write a new .sbatch script with a single VCFtools command containing multiple filtering arguments. At a minimum, these must include --remove-indels, --minQ 40, --mac, --max-missing-count, --min-meanDP, and one other option of your choice. Note that several of these flags were not introduced in the tutorial, i.e., you’ll need to read about their usage in the manual.

  2. In a separate .md file, justify your filtering choices. This does not need to be extensive, or reveal bulletproof reasoning, but it should demonstrate an understanding of what each flag is actually doing, and what the impacts of too loose or too strict a value might be.

  3. Run one summary statistic command (e.g., vcftools --vcf step3.recode.vcf --hardy). Interpret its output, decide whether or not it is sensible, and add this to a separate section of your .md file.

  4. Upload your .sbatch script(s), your .md discussion of filtering choices and summary statistics, and any other documentation you think is necessary. Organize and update your README.md appropriately.