Calculating Kinship Coefficients

For the remainder of the semester we will transition from assembling genomic data and calling variants to analyzing those data to answer biological questions. Doing so will involve the continued use of command-line programs and .sbatch scripts, but will add the R programming language and several packages developed for population genomics and data visualization. Today’s lab involves a bit of both: calculating kinship coefficients for empirical data from Hauser et al. 2022 using multiple methods. You will recall their study assessed the consistency of kinship estimates from both reduced-representation and whole-genome sequence data across a range of taxa. As is customary on publication, the authors deposited .vcf files in the Dryad archive; I have downloaded them to our class directory for your use (~/bioe-591-genomes/course-materials/data/kinship/). To begin, choose a single dataset to use in the following tutorials. (As usual, you will not be able to write in this directory, so remember to change output file paths or copy a .vcf into your own subdirectory.)

Working with ngsRelate

The first tool we will use is the command-line program ngsRelate, part of the ANGSD ecosystem of methods for working with low-coverage sequence data. Importantly, ngsRelate can work directly from genotype likelihoods, incorporating uncertainty in variants into downstream estimates. However, it can also take a .vcf file with hard-called genotypes; we will follow Hauser et al. in doing so below.
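Before committing to the genotype route, it is worth checking whether your chosen .vcf even carries PL (Phred-scaled genotype likelihood) fields alongside the GT calls. A minimal sketch of that check using only awk; the two-record VCF fragment below is invented for illustration and is not from the Hauser et al. data:

```shell
# Write a toy two-sample VCF fragment (invented records, illustration only)
cat > toy.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind1 ind2
chr1 101 . A G 50 PASS . GT:PL 0/1:40,0,37 0/0:0,21,210
chr1 205 . C T 50 PASS . GT 0/1 1/1
EOF

# Column 9 of a VCF record is the FORMAT field; report per site
# whether a PL entry accompanies the GT call
awk '!/^#/ {status = ($9 ~ /PL/) ? "PL present" : "GT only"; print $1, $2, status}' toy.vcf
```

On a real, bgzipped file you would stream records into the same awk one-liner (e.g., via bcftools view) rather than reading a plain-text toy file.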

As usual, ngsRelate can be installed via Mamba and the bioconda channel. Create the following environment file:

name: ngsrelate
channels:
  - conda-forge
  - bioconda
dependencies:
  - ngsrelate
  - bcftools
  - htslib
Save the file as ngsrelate.yaml, then build the environment:

mamba env create -f ngsrelate.yaml

Once you have done so and confirmed the installation was successful with the command ngsRelate, create an .sbatch script with the usual header information and module / Mamba initialization, updated for a new job name. You will also need to compress the .vcf with bgzip-style compression (bcftools view -Oz accomplishes this; note that tabix cannot index a plain-gzipped file), index it with tabix (from htslib), and extract sample IDs with bcftools query. Following those commands, ngsRelate can be run in a single line:

#!/bin/bash
#SBATCH --account=priority-bioe-591-genomics
#SBATCH --job-name=ngsrelate                            
#SBATCH --partition=priority              
#SBATCH --nodes=1                       
#SBATCH --ntasks-per-node=1             
#SBATCH --cpus-per-task=1              
#SBATCH --time=0-10:00:00                 
#SBATCH --output=ngsrelate-%j.out
#SBATCH --error=ngsrelate-%j.err

module load Mamba/23.11.0-0;
eval "$(conda shell.bash hook)"
conda activate ngsrelate

# gzip the vcf
bcftools view -Oz -o sample.vcf.gz sample.recode.vcf

# index the vcf
tabix -p vcf sample.vcf.gz

# generate a sample ID file 
bcftools query -l sample.vcf.gz > sample_ids.txt

# call ngsRelate
ngsRelate -h sample.vcf.gz -T GT -c 1 -O sample.ngsrelate.out -z sample_ids.txt

Above, the flag -h indicates an input .vcf, -T GT instructs the program to use called genotypes instead of Phred-scaled likelihoods (i.e., data under the PL field of a .vcf, which is often absent), -c 1 tells ngsRelate to treat those genotypes as known calls rather than likelihoods (it does not set a core count; threads are controlled with -p), -O is the path for an output file, and -z inputs the list of sample IDs created by bcftools query. Run the script with sbatch. Even using a single core, the analysis should complete in a matter of minutes. Examine sample.ngsrelate.out using head. You should see something like the following, albeit with individual IDs in column positions 3 and 4 (scroll right to view the entire text file):

a   b   nSites  J9  J8  J7  J6  J5  J4  J3  J2  J1  rab Fa  Fb  theta   inbred_relatedness_1_2  inbred_relatedness_2_1  fraternity  identity    zygosity    2of3_IDB    FDiff   loglh   nIter   bestoptimll coverage    2dsfs   R0  R1  KING    2dsfs_loglike   2dsfsf_niter
0   1   13374   0.998610    0.000004    0.000003    0.000020    0.000000    0.000063    0.000000    0.001299    0.000000    0.000000    0.001362    0.001321    0.000002    0.000000    0.000000    0.001300    0.000000    0.001302    0.001346    0.000020    -19679.443977   78  -1  0.971101    4.349743e-01,1.627332e-01,2.734623e-02,1.748214e-01,1.164063e-01,2.195767e-02,3.280485e-02,2.277343e-02,6.182655e-03    0.516734    0.263103    -0.006334   -21871.119570   10
0   2   13442   0.999888    0.000000    0.000000    0.000000    0.000000    0.000007    0.000002    0.000000    0.000103    0.000100    0.000111    0.000103    0.000104    0.000104    0.000103    0.000000    0.000103    0.000103    0.000108    0.000003    -19574.629564   50  -1  0.976038    4.329153e-01,1.693494e-01,2.289844e-02,1.824367e-01,1.063359e-01,2.463403e-02,2.989518e-02,2.559036e-02,5.944733e-03    0.496480    0.233806    0.001218    -21904.265296   10

Each column is described in detail in the “Output format” section of the documentation. The most important are columns 13 through 23, eleven summary statistics spanning familiar (e.g., column 13, which is \(R_{AB}\)) and unfamiliar (e.g., column 23, “Two-out-of-three IBD”) ways of measuring kinship and inbreeding. The final columns use a two-dimensional site frequency spectrum to calculate three additional metrics, of which no. 31 should be familiar (“KING-robust” from Hauser et al. 2022). Once you have oriented yourself to this output (and made sure it is saved somewhere convenient) you may exit the Tempest command-line interface; the remainder of the lab will use RStudio from its web application environment.
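Since counting to column 31 by eye is error-prone, one way to pull the KING-robust estimates out of the output is to look the column up by its header name. A sketch with awk; the two-row file below is a made-up, truncated stand-in for sample.ngsrelate.out that keeps only the columns the command touches (real files have 30+ columns, and the same commands work on them unchanged):

```shell
# Made-up, truncated stand-in for sample.ngsrelate.out (illustration only)
cat > demo.ngsrelate.out <<'EOF'
a b nSites KING
0 1 13374 -0.006334
0 2 13442 0.001218
EOF

# Locate the KING column by name in the header line, then print each pair
# with its KING estimate, sorted from most related to least
awk 'NR==1 {for (i=1; i<=NF; i++) if ($i == "KING") k = i; next}
     {print $1, $2, $k}' demo.ngsrelate.out | sort -k3,3gr
```

Looking columns up by header name, rather than hard-coding position 31, also protects you if a different ngsRelate version adds or drops columns.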