Understanding .fastq Data and Short Read Quality Control
Our introduction to the Tempest HPC included our first exposure to real live short read sequencing data. I will assume that this week’s reading and discussion has everyone feeling a little more comfortable with how these data are generated and what a read is in the first place. Today’s lab, which is somewhat shorter than its predecessors, will cover the basics of .fastq data: its format, what aspects of a sample and its preparation are contained in reads, and how to trim and filter reads to ensure that only high-quality data get passed on to the next stage of your workflow.
Example Data
The remaining classes before spring break will lead us through an abbreviated sequence alignment, variant calling, and variant filtering pipeline. These basic bioinformatic processing steps are required for nearly all population-level genomics and transcriptomics projects; while the exact programs and filtering choices will vary, the backbone presented here should orient you to successfully preparing your own data for subsequent analysis. To achieve this goal, we will work with unpublished exon capture data from 13 species of high Andean tanagers (Passeriformes: Thraupidae). These data were generated by exon capture—a kind of reduced representation library preparation method that targets specific expressed genes by enriching them with synthetic user-designed probes—of the seven hemoglobin peptide chains, or subunits. Because hemoglobin structure impacts its ability to bind and transport \(O_2\) in the blood, variation in the underlying sequences may be associated with adaptations to high or low altitude and the attendant differences in the partial pressure of oxygen (\(PO_2\)).
Below is a complete list of species with linked natural history and distributional data. Feel free to pick your favorite, but note that sample sizes vary, posing a trade-off between the time needed to run tools and how interesting the data ultimately is. All can be found in named subdirectories within bioe-591-genomics/course-materials/data/raw_reads/.
- Giant Conebill Oreomanes fraseri (outdated binomial used here)
- Capped Conebill Conirostrum albifrons
- Cinereous Conebill Conirostrum cinereum
- Blue-backed Conebill Conirostrum sitticolor
- White-sided Flowerpiercer Diglossa albilatera
- Cinnamon-bellied Flowerpiercer Diglossa baritula
- Black-throated Flowerpiercer Diglossa brunneiventris
- Bluish Flowerpiercer Diglossa caerulescens
- Masked Flowerpiercer Diglossa cyanea
- Deep-blue Flowerpiercer Diglossa glauca
- Glossy Flowerpiercer Diglossa lafresnayii
- Moustached Flowerpiercer Diglossa mystacalis
- Rusty Flowerpiercer Diglossa sittoides
The .fastq Format
Sequencing reads are turned into plain-text data as .fastq files, which are then typically compressed with gzip into .fastq.gz files. Typing gunzip filename.fastq.gz will decompress the target, albeit slowly; because most tools can read compressed data and even small genomic datasets can be prohibitively large in their raw form, there is usually little reason to do this. However, it can be useful to look at at least part of a human-readable .fastq. Luckily, we can do so using a version of the familiar bash command cat that works on compressed data (zcat). We’ll pipe this to head to avoid printing the entire contents of the file to your screen; if you’d like to expand beyond ten lines, use head’s -n <integer> argument.
zcat intro/catamenia_analis.fastq.gz | head

The output of this command should look something like this:
@J00138:88:HG7TGBBXX:3:1101:1083:1086 1:N:0:NGCGGAAT+NATGTTCC
NTCAAAGTCCAGCCCAGACCTCTCACCTTGCAGACAACTCCTTCCCCAACACGGCCCCCCACACCCCGCCCCCCCCAACCCCCCCCCAAACCCCCCAACAGCCCCCCCCGGCCCCCCCCTGCTCCCGCCCCCCCCCCCCGGCTCCCCCCCC
+
#<AAFJFAFFJ77JJ<JJJJJJJ<JJJJJJJ7J7AJ7JJJJAF<FAA--F-7--7-AAJ----777-7A-7FJA----777JJ----7--7-AJF-------7-7AA----7-FF7A77-----------A-A)-A)-)))))))-A---)
@J00138:88:HG7TGBBXX:3:1101:7415:1086 1:N:0:NGCGGAAT+NATGTTCC
NTCAGGGCATGGGCAAATGCATAGACTGGCACAGACCCTCCCCAGAACTGGGAATTCAGGGCACGGGCAAATGCACAGACTGGCACAGACCCTCCCCAGAACTGGGAATTCAGGGCATGGGCAAATGCACAGACTGGCACAGACCCTCCCC
+
#AAAFFAJJJJJJJFJJJJJJJJJJJJJFFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJFAJJJAJJJJJJFJJFJFJJJJJJJFJA<JAJJJJJJJJFJJJJJFJFFJFJJFJF-AFFJAFJFFJJJA-<AFJ-FAFFJJA7-)<A<AJ
@J00138:88:HG7TGBBXX:3:1101:8958:1086 1:N:0:NGCGGAAT+NATGTTCC
NGTGAGCGAGCAACAGCGCCAGAGGGCAGACAGGATGGAAGAGCGCAAGAACAGAGCCTGGGGAGTGCGGAGACGCCGGCAGGCAGTTCACTTCCCATGCTCTTTCTTCTCCTGTGTGTCGTGACAGCAGCAGCAGCAGCAGCAGCAGCAG

Though initially overwhelming, you will notice that this output consists of repeating four-line entries, which can be interpreted as follows:
- a unique sequence identifier;
- the nucleotide sequence itself;
- a quality score identifier line (simply a +);
- a quality score for each base, encoded as an ASCII character.
The sequence identifier, which is initially cryptic, follows the format below. Note that these are technical details about sequencing itself, and largely not the biology of the sample.
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

The nucleotide sequence itself should be self-explanatory, though note that N (and sometimes other ambiguity characters) marks a position where the base could not be confidently called.
(Illumina has a helpful explainer on exactly what each character preceding the sequence data means.)
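To make the fields concrete, here is one quick way to pull a few of them out with awk, splitting on both the space and the colons. (This is just an illustration, not part of the course materials; field positions follow the format string above.)

```shell
# Split an Illumina header on spaces and colons and label a few fields.
# Note that $1 keeps the leading "@" of the identifier line.
echo '@J00138:88:HG7TGBBXX:3:1101:1083:1086 1:N:0:NGCGGAAT+NATGTTCC' |
  awk -F'[ :]' '{ print "instrument:", $1
                  print "flowcell:", $3
                  print "lane:", $4
                  print "read:", $8 }'
```

For the first read of our example file, this prints the instrument (@J00138), flowcell (HG7TGBBXX), lane (3), and read direction (1).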
In your own subdirectory, dig up the size.txt file you generated during last week’s activity:
reads bases
5417329 818016679

Here, the reads column is equivalent to the number of unique entries in the .fastq, i.e., every four-line set beginning with @. Notice too that dividing the number of bases by the number of reads produces a round number: 818016679 / 5417329 = 151. This means that each Illumina sequencing read in the file is 151 bases long, or more precisely, 150 bases with an extra base appended due to the chemistry of the sequencer. 150 bases is a standard read length for modern genomics, though 50-base and 100-base reads are also common, particularly on older sequencing platforms.
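If you ever need to regenerate these numbers without seqtk, the same counts can be recovered with a few lines of awk, since sequence lines are line 2 of every 4-line record. This is a sketch: count_fastq is just an illustrative helper name, and the demo below uses two fabricated reads rather than course data.

```shell
# Count reads and bases in a FASTQ stream; on real data you would pipe in
# e.g. `zcat intro/catamenia_analis.fastq.gz | count_fastq`.
count_fastq() {
  awk 'NR % 4 == 2 { reads++; bases += length($0) }
       END { printf "%d reads, %d bases, %d bases/read\n",
             reads, bases, bases / reads }'
}

# Demo on two tiny fabricated 8-base reads:
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' | count_fastq
```

The demo prints `2 reads, 16 bases, 8 bases/read`; note that the bases/read figure is truncated to an integer, which is fine when all reads are the same length.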
We now need to turn to a different source of data, as the simplicity of the standalone catamenia_analis.fastq.gz file elides something important—that .fastq reads often come as “paired-end reads”, or two sequencing reads of the same DNA fragment from opposite directions. Because we know the distribution of fragment sizes, paired-end (PE) reads give us constraints to work with during assembly. For example, a given read pair should map to the same chromosome and will likely be within a standard deviation or two of the mean insert size apart. This can be useful in many contexts, but particularly in resolving the repetitive elements common to many genomes.
Though you should know long before you touch the terminal what kind of sequencing data you will receive, naming conventions take the guesswork out of identifying PE reads. Navigate to raw_reads/ within ~/bioe-591-genomics/course-materials/data/, pick a species, and look at the data within its named subdirectory. You should see files in the following format: a binomial, a sample ID number, and a string reading R1_001 or R2_001 before the suffix. (In addition to *.fastq.gz, you’ll see files ending in .md5. These contain “checksum” values to verify the integrity of the data with their prefix; we will not concern ourselves with them further in this class, I hope.)
Conirostrum_albifrons_163719_R1_001.fastq.gz
Conirostrum_albifrons_163719_R1_001.fastq.gz.md5
Conirostrum_albifrons_163719_R2_001.fastq.gz
Conirostrum_albifrons_163719_R2_001.fastq.gz.md5
...

Intuitively, *R1_001.fastq.gz contains the forward reads for a given sample, while *R2_001.fastq.gz contains the reverse reads. Within each file, reads are in the same order, i.e., entry N in one file corresponds to entry N in the other, sequenced from the same DNA fragment in opposite directions. Let’s look at the first read in each direction from a single sample:
zcat Conirostrum_albifrons_163719_R1_001.fastq.gz | head -n 4

@J00138:88:HG7TGBBXX:3:1101:4229:1086 1:N:0:NATGGAAC+NGAACTGT
NAATGGGAAGTGTAGGGATACTCCTGACAACTAGATAAGAATCCTTCTTCACCTAAATATAAACGCTTTTCCTCAACCTCTGTGATCAGAAGAGTGGAAGAAACCCAACCATAGCACATCCAGAAACACTCTTGGCCAGGACACAGCAGAT
+
#AAAFJJFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJFFFJJFJJJJJJJJJJFJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJJJFFJJFAJ

zcat Conirostrum_albifrons_163719_R2_001.fastq.gz | head -n 4

@J00138:88:HG7TGBBXX:3:1101:4229:1086 2:N:0:NATGGAAC+NGAACTGT
NTTATGCTTTAAGTCAGAAAAGGGCCAGTCAGCAGGAAAACATGCTACTTGCTTGGGGAAAGGATTGTGGATGTGGCAGAGCAAGGGAAGAGGTGCCCAACCATCAGCAGCATCTGTTGGTCTGAGAGGGTGGGAGATGGTGGGGTTCCTC
+
#AAFAAF7J<FFJFFFJJFFFJJJJJJJJJJJJFJJJFFFFJJJFJJJJFJAAAFFJJJJJJJJ--<AJJF7FFJ<7AJJJJJJJAJJFJFFFFJJJFFFFJJFFJJFJJJF-<-<-<AFAA<FAFFJJJJJJJ-F<-7<-7<)-)-)-<<

You should note two things. First, the headers are identical apart from the number preceding :N after the space, which corresponds to read 1 (1:N) or read 2 (2:N). This is what we would expect, as PE reads come from the exact same flowcell, lane, and position on the sequencer. Second, it should be clear that the sequences are different—the two reads begin at opposite ends of the fragment, and reads are shorter than the DNA fragments they are generated from.
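One way to convince yourself that two files really are synchronized is to compare identifiers record by record. Below is a sketch: check_pairs is a hypothetical helper, not a course-provided tool, and the process substitution assumes bash.

```shell
# Compare the pre-space identifier of record N in R1 against record N in R2;
# any divergence means the paired files are out of sync.
check_pairs() {  # usage: check_pairs R1.fastq.gz R2.fastq.gz
  paste <(zcat "$1" | awk 'NR % 4 == 1 { print $1 }') \
        <(zcat "$2" | awk 'NR % 4 == 1 { print $1 }') |
    awk '$1 != $2 { print "mismatch at record " NR; bad = 1 }
         END { if (!bad) print "all pairs match" }'
}
```

Running `check_pairs Conirostrum_albifrons_163719_R1_001.fastq.gz Conirostrum_albifrons_163719_R2_001.fastq.gz` on intact data should print `all pairs match`.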
Filtering Reads with fastp
We will now learn the basics of the traditional first step of preparing sequence data for analysis: trimming adapter sequences and filtering reads based on quality and levels of missing data. Recall from your reading last class that preparing genomic libraries typically involves appending adapter sequences to each sample. These contain oligonucleotides used in sequencing reactions, as well as unique indices that permit data to be disambiguated (i.e., matched to known samples). Though relatively short, adapter sequences left in reads can cause numerous errors in downstream processing steps, and thus must be removed prior to alignment. Additionally, rare but unavoidable sequencing errors result in variable quality among reads. Ensuring reliable data necessarily involves filtering out the reads with the highest likelihood of errors.
To achieve these goals, we will use a clever, simple tool called fastp. Please briefly read or skim the documentation here before proceeding. Some of it will likely be above your head, but you will hopefully recognize a simple command-line syntax from using seqtk the week prior: the name of the program, the name of a particular task it performs, and parameters.
As with last week and the weeks to come, we will begin by setting up a mamba environment for fastp. If it’s your first time logging on today, you’ll need to load the mamba module and make it callable:
module load Mamba/23.11.0-0;
eval "$(conda shell.bash hook)"

Next, create a file called fastp.yaml somewhere in your personal directory. The contents should be as follows:
name: fastp
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastp

Create a new environment with the command mamba env create -f fastp.yaml. Activate the environment, and type fastp. You should see a suite of commands matching those in the documentation. Discussing how to make a highly informed decision about which parameter settings to use is beyond our purview (and realistically beyond the purview of most fastp users). Luckily, however, the program’s default parameters are sensible for the majority of cases: it automatically detects and removes commonly used adapters, trims error-prone bases from both ends until a quality threshold is reached, ditches reads below 15 bp in length (from super-short DNA fragments), ditches reads where 40% or more bases are low quality, and trims repetitive sequences from ends (if present). Quality filtering is based on Phred scores, which are a transformation of error probabilities at each base:
\[ Q = -10\log_{10}(P_{\mathrm{error}}) \]
In other words, a Phred score of 10 (typically reported as Q10) means there is a 1 in 10 probability of error; Q20, a 1/100 probability; Q30, a 1/1000 probability. But where does the probability of error come from, I can hear you asking. The answer is numerous aspects of the sequencing process itself, such as fluorescence intensity and the overlap of different “signals” for nucleotides. For now, the important thing is that filtering by Phred scores gives you control over the error you are willing to tolerate in your data.
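As a concrete check on these numbers, you can decode any quality string by hand: each character’s ASCII value minus 33 (the standard “Phred+33” offset used in modern Illumina data) gives Q, and the error probability follows from the formula above. A sketch, where phred is just an illustrative helper:

```shell
# Convert each quality character to its Phred score (ASCII value - 33)
# and back to an error probability P = 10^(-Q/10).
phred() {
  awk 'BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
       { for (j = 1; j <= length($0); j++) {
           c = substr($0, j, 1); q = ord[c] - 33
           printf "%s Q%d %.5f\n", c, q, 10^(-q / 10) } }'
}

echo '#AJ' | phred
```

This prints `# Q2 0.63096`, `A Q32 0.00063`, and `J Q41 0.00008`—which is why the quality strings in our example reads, dominated by J and F characters, indicate high-confidence base calls.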
Other fastp parameters include paths for input files, output files, and reports. Very few of these need to be set explicitly, as we will see when we use the tool shortly. To do so, we will again need to schedule a job via Slurm. Copy the file ~/bioe-591-genomics/course-materials/scripts/fastp.sbatch to your own directory. It should look like this:
#!/bin/bash
##
## fastp.sbatch: trim and filter one sample's paired-end reads with fastp
##
## Lines starting with #SBATCH are read by Slurm. Lines starting with ## are comments.
## All other lines are read by the shell.
##
#SBATCH --account=priority-bioe-591-genomics #specify the account to use
#SBATCH --job-name=fastp # job name
#SBATCH --partition=priority # queue partition to run the job in
#SBATCH --nodes=1 # number of nodes to allocate
#SBATCH --ntasks-per-node=1 # number of discrete tasks - keep at one except for MPI
#SBATCH --cpus-per-task=1 # number of cores to allocate
#SBATCH --time=0-00:30:00 # Maximum job run time
#SBATCH --output=fastp_demo-%j.out
#SBATCH --error=fastp_demo-%j.err
# load module and activate mamba
module load Mamba/23.11.0-0
eval "$(conda shell.bash hook)"
# activate your fastp environment
conda activate fastp
# run fastp
fastp -i ~/bioe-591-genomics/course-materials/data/raw_reads/species/species_ID_R1_001.fastq.gz \
-o ~/bioe-591-genomics/students/username/trimmed_reads/species_ID_R1_001_trimmed.fastq.gz \
-I ~/bioe-591-genomics/course-materials/data/raw_reads/species/species_ID_R2_001.fastq.gz \
-O ~/bioe-591-genomics/students/username/trimmed_reads/species_ID_R2_001_trimmed.fastq.gz \
-h ~/bioe-591-genomics/students/username/trimmed_reads/sample.fastp.html \
-j ~/bioe-591-genomics/students/username/trimmed_reads/sample.fastp.json

Four flags indicate required arguments, all of which appear at the top of fastp's usage info:
options:
-i, --in1 read1 input file name (string [=])
-o, --out1 read1 output file name (string [=])
-I, --in2 read2 input file name (string [=])
-O, --out2 read2 output file name (string [=])

Here, either the short version (-) or long version (--) of each argument flag will work; string [=] simply means characters (your path) are accepted. Note that the demo script includes input paths to a specific species in course-materials/data/raw_reads, but output paths to a new subdirectory called trimmed_reads in your personal directory. Create this directory, and edit the paths so that they point to real files for your tanager species of choice. Pay particular attention to the read direction—it’s easy to end up with both input and output files labeled R1_001 if you aren’t careful. From the subdirectory containing your copy of this now-edited .sbatch script, run it via the usual command (sbatch fastp.sbatch). Recall that you can monitor its progress via sacct and related commands, and that standard output and error messages are logged in the .out and .err files that appear in your working directory.
The program should finish running on most data within a few minutes. Navigate to the path you set for the .html report (-h). This file is an extremely useful summary of what fastp did and discovered in your data; it can be viewed in any web browser after transferring it to your laptop with Globus. (Today’s homework will address its contents in greater detail.) Navigate also to your trimmed_reads directory and confirm it contains the expected files. If so, congratulations—you’ve completed the first step of a typical bioinformatics pipeline for one sample.
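Another quick sanity check, beyond eyeballing the report, is to confirm that the trimmed files contain somewhat fewer reads than the raw ones. A sketch, where count_reads is an illustrative helper and the paths are placeholders for your own sample:

```shell
# Reads per FASTQ = total line count / 4 (each record spans four lines).
count_reads() {  # usage: count_reads file.fastq.gz
  zcat "$1" | awk 'END { print NR / 4 }'
}

# e.g. compare raw vs. trimmed counts for one read direction:
# count_reads species_ID_R1_001.fastq.gz
# count_reads trimmed_reads/species_ID_R1_001_trimmed.fastq.gz
```

The trimmed count should be slightly lower, reflecting the small fraction of reads discarded by the quality and length filters.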
Today’s homework assignment has four components:
- The main goal is to write an sbatch script that will run fastp on all samples of your focal species and summarize the output. (I realize this will involve re-trimming one sample; overwriting the files you generated above is fine.) You may do so any way you see fit, so long as the script is robust and achieves that goal. However, you will likely use one of the following three approaches.
  - Copy and paste your fastp command as many times as necessary, editing paths as needed.
  - Use fastp’s built-in “batch processing” script.
  - Write a bash script that runs your fastp command on each file in a directory (or list of files). Using online resources to do so is encouraged, but I request that you add a comment to each line of code explaining what it is doing. Here is one possibility, though note that as written it will only echo the commands it generates, not run them:
for r1 in data/*_R1*.fastq.gz; do                # loop over every forward-read file
  r2="${r1/_R1/_R2}"                             # derive the matching reverse-read filename
  sample=$(basename "$r1" | sed 's/_R1.*//')     # strip the path and suffix, leaving the sample ID
  echo "fastp -i \"$r1\" -I \"$r2\" \
    -o clean/${sample}_R1.clean.fastq.gz -O clean/${sample}_R2.clean.fastq.gz \
    -h reports/${sample}.fastp.html -j reports/${sample}.fastp.json"
done

- Once you have written your sbatch script, push it to your GitHub repository as 04_homework_<your_name>.sh. Next, transfer the fastp.html report from the cluster to your computer, and rename it in the format 04_fastp_<your_name>.html. Push this to GitHub as well.
- Open 04_fastp_<your_name>.html in your web browser. In a separate Markdown file named 04_interpretation_<your_name>.md, answer the following brief questions:
- What percentage of reads passed filtering?
- What percentage of reads were low quality?
- What is the estimated insert size, and what percentage of your reads is it based on?
- At approximately what read position does mean quality drop below Q30 on read2 after filtering? How does this differ from the position at which mean quality drops below Q30 on read2 before filtering?
- Assess the structure of your GitHub repository, organize as needed, and edit your README.md to explain what your files and directories contain and where to find them. You now have .sbatch scripts, an .html document, and a .md file with answers to the questions above. These could be arranged in subdirectories corresponding to the week they were assigned, separated into scripts/, markdown/, and reports/ subdirectories, or sorted in some other way. What you do is up to you, but make sure it is clear and well documented. (And don’t forget to add, commit, and push your changes!)