Our previous analysis finished (hopefully!) with cleaned fastq files that are free of low-quality bases and adapter sequences. We’re now ready to map these reads to the spruce reference genome!
1. Obtaining and indexing the reference genome
The first step of read mapping is downloading and preparing the reference genome.
The black spruce (Picea mariana) reference genome was published by Lo et al. (2024).
You don’t actually need to download the genome because we already have the file in the directory below. But for future reference, wget is a useful command to download files from the web.
cd /gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ref_genome
wget "ftp://plantgenie.org:980/Data/PlantGenIE/Picea_abies/v1.0/fasta/GenomeAssemblies/Pabies01-genome.fa.gz"
Rather than trying to map to the entire 18+ Gbp reference (yikes!), we first subsetted the P. mariana reference to include just the contigs that contain one or more of our 80,000 probes from our exome capture experiment. For this, we did a BLAST search of each probe against the P. mariana reference genome, and then retained all scaffolds that had a best hit.
- This reduced reference contains:
  - 2,927,599,912 bp (~2.93 Gbp) in 28,953 scaffolds
  - The largest scaffold is 3,698,311 bp (~3.7 Mbp)
  - The N50 of the reduced reference is 250,906 bp
- The indexed reduced reference genome to use for your mapping is on our server here:
/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ref_genome/Pmariana/Pmariana1.0-genome_reduced.fa
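You won’t need to index the reference yourself for this exercise, but for future datasets, bwa-mem2 builds its index directly from a fasta file. A minimal sketch (the fasta filename here is a placeholder):
# Build the bwa-mem2 index files alongside the fasta (run once per reference)
bwa-mem2 index my_reference.fa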
2. Mapping our cleaned and trimmed reads against the reference genome!
Let’s cd to the class directory where shared files live:
cd /gpfs1/cl/ecogen/pbio6800/PopulationGenomics/
Become familiar with this part of the filesystem – it’s where a lot of our shared data and scripts will reside.
Take a peek inside each directory:
cleanreads/
bams/
scripts/
To help make our scripting approach efficient, we’re going to use several short scripts, running each one at a time. In the future, we could put them all together into a single pipeline.
From the scripts/ directory, copy the following files over to your repo and save them into your myscripts/ directory:
mapping.sh
process_bams.sh
bam_stats.sh
SBATCH_header.txt
For mapping, we’ll use the program bwa-mem2, which is a very efficient and well-vetted read mapper. Lots of others exist and can be useful to explore for future datasets; we tried several, and for our exome data, bwa-mem2 seems to be the best. The citation info for bwa-mem2 is here: Vasimuddin et al. 2019
Let’s customize the bash script called mapping.sh, which calls the R1 and R2 reads for each individual in your population and uses the bwa-mem2 algorithm to map reads to the reduced black spruce reference genome. The basic bwa-mem2 command we’ll use is below:
bwa-mem2 mem -t 10 ${REF} ${READ1} ${READ2} > ${OUT}/${NAME}.sam
where
- -t 10 is the number of threads, or computer CPUs, to use (in this case, 10)
- ${REF} specifies the path and filename for the reference genome
- ${READ1} specifies the path and filename for the cleaned and trimmed R1 reads
- ${READ2} specifies the path and filename for the cleaned and trimmed R2 reads
- > ${OUT}/${NAME}.sam specifies the path and filename for the .sam file to be saved into a new directory
Think about how we should write this into a loop to call all the fastq files for our population of interest… (hint: look back at the fastp.sh script). A rough sketch of one way to do it is below.
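Here’s a minimal sketch of one possible loop. It assumes paired fastq files named like SAMPLE_R1.fq.gz / SAMPLE_R2.fq.gz; the variable values, output directory, and file-naming pattern are placeholders, and your mapping.sh may differ:
# Minimal sketch of a mapping loop -- paths, population code, and
# file-naming pattern below are placeholders
REF="/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ref_genome/Pmariana/Pmariana1.0-genome_reduced.fa"
INPUT="/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/cleanreads"
OUT="mybams"    # hypothetical output directory
MYPOP="2030"    # hypothetical population code

for READ1 in ${INPUT}/${MYPOP}*_R1.fq.gz    # one R1 file per sample
do
  READ2=${READ1/_R1/_R2}                    # infer the matching R2 filename
  NAME=$(basename ${READ1} _R1.fq.gz)       # sample name without path or suffix
  bwa-mem2 mem -t 10 ${REF} ${READ1} ${READ2} > ${OUT}/${NAME}.sam
done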
Open up the mapping.sh script and let’s customize the fields together.
Because you’re each mapping sequences from multiple samples (N=8/pop), it’s going to take a little while.
Whenever you have a job that will take a long time, you’re going to want to run it on the cluster by writing a batch script and submitting it to the SLURM scheduler (remember Shelly and Bennet’s VACC tutorial?).
You’ll find a version of the SBATCH header in this file you copied to your myscripts/ directory:
SBATCH_header.txt
Open that in the RStudio text editor and copy/paste it into your mapping.sh script. Be sure to paste the header below the top line #!/bin/bash.
Customize the SBATCH settings to fit your run and resource needs!
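For orientation, a typical SBATCH header looks something like the sketch below (pasted just under #!/bin/bash). The partition name and resource values here are placeholders; use the settings from the course’s SBATCH_header.txt:
#SBATCH --job-name=mapping      # name shown in the squeue listing
#SBATCH --partition=general     # placeholder -- use your cluster's partition
#SBATCH --nodes=1               # run on a single node
#SBATCH --ntasks=1              # one task (bwa-mem2 parallelizes with threads)
#SBATCH --cpus-per-task=10      # match the -t 10 threads passed to bwa-mem2
#SBATCH --mem=64G               # placeholder memory request
#SBATCH --time=12:00:00         # placeholder wall-clock limit (hh:mm:ss)
#SBATCH --output=%x_%j.out      # log file named jobname_jobID.out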
When we’re ready, cd to your myscripts/ folder and submit to SLURM like so:
cd ~/projects/eco_genomics_2025/population_genomics/myscripts
sbatch mapping.sh
We can check on the status of our submission:
squeue --me
And if you ever need to stop the run:
scancel jobID
where jobID is the unique ID number assigned to your job.
3. Process the alignment files
The next step is to process the sequence alignment files (SAM files) that are output by the mapping.sh script.
We can use the program sambamba for manipulating sam/bam files. This program was published by Tarasov et al. 2015.
There are several steps we need to do (a command-line sketch follows the list):
- convert sam alignment file to (binary) bam format
- sort the bam file by its read coordinates
- mark and remove PCR duplicate reads
- index the processed alignment file for quick lookup
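Roughly, the four steps correspond to the sambamba commands below; the filenames here are placeholders for whatever your script loops over:
# Sketch of the four processing steps with sambamba -- filenames are placeholders
sambamba view -S -f bam -o NAME.bam NAME.sam                 # 1. convert sam to (binary) bam
sambamba sort -o NAME.sorted.bam NAME.bam                    # 2. sort by read coordinates
sambamba markdup -r NAME.sorted.bam NAME.sorted.rmdup.bam    # 3. mark and remove PCR duplicates
sambamba index NAME.sorted.rmdup.bam                         # 4. index for quick lookup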
Open up the process_bams.sh script and we’ll customize the fields together. Add the SBATCH header and submit to SLURM!
4. Calculate mapping stats: How can we get a summary of how well our reads mapped to the reference?
We can use the program samtools, written by Heng Li, the same person who wrote bwa. It is a powerful tool for manipulating sam/bam files. The citation is here: Danecek et al. 2021
The samtools command flagstat gets us some basic info on how well the mapping worked. We can also estimate depth of coverage (avg. number of reads/site) using the samtools command depth.
We’ll use both of these commands in loops to assess the mapping stats on each sample in our population.
We’ll also use the awk tool to help format the output. Open up the bam_stats.sh script and let’s customize it together. Add the SBATCH header and submit to SLURM!
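For a sense of what flagstat and depth report, here’s a minimal sketch on a single bam file (the filename is a placeholder):
# Sketch of basic mapping stats on one bam file -- filename is a placeholder
samtools flagstat NAME.sorted.rmdup.bam    # totals: mapped, properly paired, duplicates, etc.
# `samtools depth` prints scaffold, position, and read depth per site;
# awk averages column 3 to get mean depth of coverage
samtools depth NAME.sorted.rmdup.bam | awk '{sum+=$3} END {if (NR>0) print "Mean depth: " sum/NR}'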
Something for the future: You could put it all together in one pipeline
Now that we have each of our mapping, processing, and stats scripts written, if we ran this again in the future or on other data, we could put these separate scripts into a single pipeline script and submit to the cluster.
Here’s how we’d do it:
- Open a new text file in RStudio
- On the first line, write #!/bin/bash to let it know this is a bash script
- Paste the SBATCH_header.txt to the top
- Code the lines below into the body of the script:
# cd to your `scripts/` folder in your repo:
cd ~/projects/eco_genomics_2025/population_genomics/myscripts
# Step 1: map reads to the reference
bash mapping.sh
# Step 2: process the alignment files
bash process_bams.sh
# Step 3. get stats on our alignments
bash bam_stats.sh
Save as mypipeline.sh.
At the terminal, cd to your myscripts/ folder and submit by writing:
sbatch mypipeline.sh
Check the status of your job:
squeue --me
5. Visually inspect the alignment using samtools tview
We can get a sense of how the data look after they’re aligned by using the alignment viewer tview built into samtools.
We just need an indexed bam file and an indexed reference genome.
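If either file isn’t indexed yet, samtools can index both; the filenames below are placeholders:
samtools index NAME.sorted.rmdup.bam    # creates NAME.sorted.rmdup.bam.bai
samtools faidx reference.fa             # creates reference.fa.fai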
Here’s an example using a previously processed bam file, but you can use your own once they’re done being mapped:
module load gcc samtools
samtools tview \
/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/examples/2030_57_O.sorted.rmdup.bam \
/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ref_genome/Pmariana/Pmariana1.0-genome_reduced.fa
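Once tview opens, a few keystrokes are handy: ? brings up the help menu, g jumps to a scaffold and position you type in, and q quits the viewer.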