Population Genomics Tutorial 5: Visualizing genomic diversity and intro to population structure

1. Review the diversity stats

We’ll live code this using RMarkdown on the server to get practice making documents that summarize our results.

Let’s compare the diversity across our different populations with this google doc:

https://docs.google.com/spreadsheets/d/1SLwhW3OgQiX2z1rxH-ske236NYxjDXCvUu0l8XFeS_w/edit?usp=sharing

I also put map in there for reference so you can see where different pops are located within the range.

What do you notice about the diversities?
Where is Ne the highest/lowest?
What do the average Tajima’s D values suggest about demographic history in these pops?

2. Use ANGSD and the SFS for multiple pops to calculate genetic divergence between pops (Fst)

We can calculate Fst between any pair of populations by comparing their SFS to each other. For this, we’ll need to estimate the SFS for pairs of populations; we can each contribute to the overall analysis by looking at how our focal pop is divergent from the others.

For this analysis, let’s calculate Fst between our focal red spruce population (MYRSPOP) and the black spruce samples (MYBSPOP)…this could tell us which of our pops might be hybridizing. What would we expect for Fst in this case?
Let’s write a bash script called ANGSD_Fst.sh that includes the following code:

# Start with the usual bash script header

# load your modules

module purge

module load gcc angsd

# Give yourself some notes

# Path to the `population_genomics` folder in your repo

REPO=""

# Path to the Black Spruce (BS) input saf.idx data (I made this file for you using the ANGSD.sh script)

BLKSPR="/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/black_spruce"

# Make directory in `myresults/ANGSD` to store your results

mkdir ${REPO}/myresults/ANGSD/Fst

MYBSPOP="WISC"  

MYRSPOP=""

SUFFIX="ALL"

### Estimate Fst between my red spruce pop and black spruce:

# Step 1: estimate the 2D site frequency spectrum for BS and your RS pop
realSFS \
${REPO}/mydata/ANGSD/${MYRSPOP}_${SUFFIX}.saf.idx \
${BLKSPR}/${MYBSPOP}_${SUFFIX}.saf.idx \
>${REPO}/myresults/ANGSD/Fst/${MYRSPOP}_${MYBSPOP}_2D.sfs \
-P 10

# Step 2: Calculate Fst for all SNPs and store it in a binary format (.fst.idx)
realSFS fst index \
${REPO}/mydata/ANGSD/${MYRSPOP}_${SUFFIX}.saf.idx \
${BLKSPR}/${MYBSPOP}_${SUFFIX}.saf.idx \
-sfs ${REPO}/myresults/ANGSD/Fst/${MYRSPOP}_${MYBSPOP}_2D.sfs \
-fstout ${REPO}/mydata/ANGSD/${MYRSPOP}_${MYBSPOP} \
-whichFst 1 \
-P 10

# Step 3: Caclulate the average Fst across all sites (we use the weighted average)
realSFS fst stats \
${REPO}/mydata/ANGSD/${MYRSPOP}_${MYBSPOP}.fst.idx \
>${REPO}/myresults/ANGSD/Fst/${MYRSPOP}_${MYBSPOP}_Fst.txt \
-P 10

# Step 4 (optional): Calculate the averate Fst in sliding windows
realSFS fst stats2 \
${REPO}/mydata/ANGSD/${MYRSPOP}_${MYBSPOP}.fst.idx \
-win 50000 \
-step 50000 \
>${REPO}/myresults/ANGSD/Fst/${MYRSPOP}_${MYBSPOP}_Fst_50kbWindows.txt \
-P 10

After a few minutes you should get some results…we can talk about the calculation and interpretation of Fst in the meantime

Enter the weighted Fst value into the google sheet for your pop. What’s the trend?
Do you see any relationships between Fst with black spruce and other diversity metrics for your red spruce populations?
(Optional/if time allows): If you ran the sliding window analysis, I’ve placed some R code you can use to explore the results here:

/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/scripts/Fst.R

3. Visualize population structure across the landscape using PCA and Admixture

We often want to visualize differences in the genetic structure or genetic ancestry in our sample, and lots of papers we’ve read approach this using PCA or Admixture analysis. We can do each of these approaches on genotype likelihoods in ANGSD using a special routine called PCAngsd.

PCAngsd uses a really cool iterative approach where it refines the estimation of allele frequencies for each individual at the same time that it finds the clusters that individual may have ancestry within.

Here are some resources to understand the program options:

The manual page
A nice PCAngsd tutorial](http://www.popgen.dk/software/index.php/PCAngsdTutorial) that walks through most of the routines
The original paper describing the application to PCA and admixture are here: Meisner & Albrechtsen 2019, Genetics

Since we want to run PCAngsd for the entire set of samples – not just your focal pop – we need the genotype likelihoods from ANGSD for all 95 red spruce + 18 black spruce samples. That would take a long time to run (about 24 hrs) and would be redundant for each of you to do, so I ran these once for the class.

For your reference (and future work), the code I used to estimate the genotype likelihoods is here: (you don’t have to run this now!)

/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/scripts/ANGSD_RSBS_poly.sh

and exported the genotype likelihoods in “beagle” format here:

/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/RSBS_poly.beagle.gz

We can use the beagle file containing the genotype likelihoods for all 95 red spruce + 18 black spruce samples as input to PCAngsd. The script is actually not too bad…let’s give it a go:

# Start with the usual bash script header

# load modules

module purge
module load pcangsd/1.36.1

# Give yourself some notes

# Path to your input data (where the beagle file lives)

INPUT="/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD"

# Make directory and set path to save your output (in your repo):

mkdir ~/x/x/xxxx/population_genomics/myresults/ANGSD/PCA_ADMIX

OUT="/x/x/xxxx/population_genomics/myresults/ANGSD/PCA_ADMIX"

SUFFIX="RSBS_poly"

# Make a copy of the list of bam files for all the red + black spruce samples and place in your repo. You'll need this later for making figures.

cp ${INPUT}/RSBS_bam.list ${OUT}


# Set value of K and number of PCA eigenvalues (=K-1)
# K corresponds to the number of distinct ancestry groups you want to cluster genotypes into

K=2
E=$((K-1))

# Then, run PCA and admixture scores with pcangsd:

pcangsd -b ${INPUT}/${SUFFIX}.beagle.gz \
        -o ${OUT}/${SUFFIX}_K${K} \
        -e $E \
        --admix \
        --admix-K $K \
        --maf 0.05 \
        --threads 10

This will run PCAngsd assuming it fits a single “eigenvalue” (the -e 1 option) to split your samples into K=2 clusters. If you want to explore higher levels of clustering in the future, you can increase the -e <int> flag and the --admix-K <int> flag to test different numbers of clusters you want to fit to the data.

Add your SBATCH header to it and submit to the cluster with 10 cpus and 64G RAM requested

Once the run is finished, you should see the following files in your myresults/ANGSD/PCA_Admix/

RSBS_bam.list
RSBS_poly.cov
RSBS_poly.admix.2.Q

When you have these files transferred (don’t forget where you saved them to on your laptop!), open up RStudio and let’s start making some figures in an RMarkdown document!

Just a reminder, the following is R code, not bash. ;)

Remember to set your path in your RMarkdown using:

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir="<path to your repo>/population_genomics/myresults/ANGSD/PCA_ADMIX/")

Load your libraries

library(ggplot2) # plotting
library(ggpubr) # plotting

First, let’s work on the genetic PCA:

COV <- as.matrix(read.table("RSBS_poly.cov")) # read in the genetic covariance matrix

PCA <- eigen(COV) # extract the principal components from the COV matrix

How much variance is explained by the first few PCs?

var <- round(PCA$values/sum(PCA$values),3)

var[1:3]

A “screeplot” of the eigenvalues of the PCA hows you how each axis explains variance in the genotypes:

barplot(var, 
        xlab="Eigenvalues of the PCA", 
        ylab="Proportion of variance explained")

Bring in the bam.list file and extract the sample info to prepare for plotting:

names <- read.table("RSBS_bam.list")
names <- unlist(strsplit(basename(as.character(names[,1])), split = ".sorted.rmdup.bam"))
split = strsplit(names, "_")
pops <- data.frame(names[1:113], do.call(rbind, split[1:113]))
names(pops) = c("Ind", "Pop", "Row", "Col")

data=as.data.frame(PCA$vectors)
data=data[,c(1:3)]
data= cbind(data, pops)

Now we can plot the PCA using ggplot

ggscatter(data, x = "V1", y = "V2",
          color = "Pop",
          mean.point = TRUE,
          star.plot = TRUE,
          main="Red spruce-Black spruce genetic PCA") +
  theme_bw(base_size = 13, base_family = "Times") +
  labs(x = paste0("PC1: (",var[1]*100,"%)"), y = paste0("PC2: (",var[2]*100,"%)"))

Next, we can look at the admixture clustering at K=2

Import the ancestry scores (these are the .Q files)

q <- read.table("RSBS_poly.admix.2.Q", sep=" ", header=F)

K=dim(q)[2] #Find the level of K modeled

## rank order the samples according to population code
ord<-order(pops[,2])

Now make the plot:

barplot(t(q)[,ord],
        #col=cols[1:K],
        col=c("black","red"),
        space=0,border=NA,
        xlab="Populations",ylab="Admixture proportions",
        main=paste0("Red Spruce-Black Spruce Admixture: K=",K))
text(tapply(1:nrow(pops),pops[ord,2],mean),-0.05,unique(pops[ord,2]),xpd=T,cex=0.75,srt=45)
abline(v=cumsum(sapply(unique(pops[ord,2]),function(x){sum(pops[ord,2]==x)})),col=1,lwd=1.2)

What have we learned about the genetic structure from the PCA and admixture plots?

What does the PCA seem to be telling us?
What different picture does the admixture plot reveal? How does it relate to the PCA?
Would we want to look for higher levels of K in the admixture analysis? How do we do that??