1. Review the diversity stats
We’ll live code this using RMarkdown on the server to get practice making documents that summarize our results.
Let’s compare the diversity across our different populations with this google doc:
https://docs.google.com/spreadsheets/d/1SLwhW3OgQiX2z1rxH-ske236NYxjDXCvUu0l8XFeS_w/edit?usp=sharing
I also put map in there for reference so you can see where different pops are located within the range.
- What do you notice about the diversities?
- Where is Ne the highest/lowest?
- What do the average Tajima’s D values suggest about demographic history in these pops?
2. Use ANGSD and the SFS for multiple pops to calculate genetic divergence between pops (Fst)
We can calculate Fst between any pair of populations by comparing their SFS to each other. For this, we’ll need to estimate the SFS for pairs of populations; we can each contribute to the overall analysis by looking at how our focal pop is divergent from the others.
For this analysis, let’s calculate Fst between our focal red spruce population (MYRSPOP) and the black spruce samples (MYBSPOP)…this could tell us which of our pops might be hybridizing. What would we expect for Fst in this case?
Let’s write a bash script called ANGSD_Fst.sh that includes the following code:
# Start with the usual bash script header
# load your modules
module purge
module load gcc angsd
# Give yourself some notes
# Path to the `population_genomics` folder in your repo
REPO=""
# Path to the Black Spruce (BS) input saf.idx data (I made this file for you using the ANGSD.sh script)
BLKSPR="/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/black_spruce"
# Make directory in `myresults/ANGSD` to store your results
mkdir ${REPO}/myresults/ANGSD/Fst
MYBSPOP="WISC"
MYRSPOP=""
SUFFIX="ALL"
### Estimate Fst between my red spruce pop and black spruce:
# Step 1: estimate the 2D site frequency spectrum for BS and your RS pop
realSFS \
${REPO}/mydata/ANGSD/${MYRSPOP}_${SUFFIX}.saf.idx \
${BLKSPR}/${MYBSPOP}_${SUFFIX}.saf.idx \
>${REPO}/myresults/ANGSD/Fst/${MYRSPOP}_${MYBSPOP}_2D.sfs \
-P 10
# Step 2: Calculate Fst for all SNPs and store it in a binary format (.fst.idx)
realSFS fst index \
${REPO}/mydata/ANGSD/${MYRSPOP}_${SUFFIX}.saf.idx \
${BLKSPR}/${MYBSPOP}_${SUFFIX}.saf.idx \
-sfs ${REPO}/myresults/ANGSD/Fst/${MYRSPOP}_${MYBSPOP}_2D.sfs \
-fstout ${REPO}/mydata/ANGSD/${MYRSPOP}_${MYBSPOP} \
-whichFst 1 \
-P 10
# Step 3: Caclulate the average Fst across all sites (we use the weighted average)
realSFS fst stats \
${REPO}/mydata/ANGSD/${MYRSPOP}_${MYBSPOP}.fst.idx \
>${REPO}/myresults/ANGSD/Fst/${MYRSPOP}_${MYBSPOP}_Fst.txt \
-P 10
# Step 4 (optional): Calculate the averate Fst in sliding windows
realSFS fst stats2 \
${REPO}/mydata/ANGSD/${MYRSPOP}_${MYBSPOP}.fst.idx \
-win 50000 \
-step 50000 \
>${REPO}/myresults/ANGSD/Fst/${MYRSPOP}_${MYBSPOP}_Fst_50kbWindows.txt \
-P 10
After a few minutes you should get some results…we can talk about the calculation and interpretation of Fst in the meantime
Enter the weighted Fst value into the google sheet for your pop. What’s the trend?
Do you see any relationships between Fst with black spruce and other diversity metrics for your red spruce populations?
(Optional/if time allows): If you ran the sliding window analysis, I’ve placed some R code you can use to explore the results here:
/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/scripts/Fst.R
3. Visualize population structure across the landscape using PCA and Admixture
We often want to visualize differences in the genetic structure or
genetic ancestry in our sample, and lots of papers we’ve read approach
this using PCA or Admixture analysis. We can do each of these approaches
on genotype likelihoods in ANGSD using a special routine called
PCAngsd
.
PCAngsd
uses a really cool iterative approach where it
refines the estimation of allele frequencies for each individual at
the same time that it finds the clusters that individual may have
ancestry within.
Here are some resources to understand the program options:
The manual page
A nice PCAngsd tutorial](http://www.popgen.dk/software/index.php/PCAngsdTutorial) that walks through most of the routines
The original paper describing the application to PCA and admixture are here: Meisner & Albrechtsen 2019, Genetics
Since we want to run PCAngsd
for the entire set of
samples – not just your focal pop – we need the genotype likelihoods
from ANGSD for all 95 red spruce + 18 black spruce samples. That would
take a long time to run (about 24 hrs) and would be redundant for each
of you to do, so I ran these once for the class.
For your reference (and future work), the code I used to estimate the genotype likelihoods is here: (you don’t have to run this now!)
/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/scripts/ANGSD_RSBS_poly.sh
and exported the genotype likelihoods in “beagle” format here:
/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/RSBS_poly.beagle.gz
We can use the beagle file containing the genotype likelihoods for
all 95 red spruce + 18 black spruce samples as input to
PCAngsd
. The script is actually not too bad…let’s give it a
go:
# Start with the usual bash script header
# load modules
module purge
module load pcangsd/1.36.1
# Give yourself some notes
# Path to your input data (where the beagle file lives)
INPUT="/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD"
# Make directory and set path to save your output (in your repo):
mkdir ~/x/x/xxxx/population_genomics/myresults/ANGSD/PCA_ADMIX
OUT="/x/x/xxxx/population_genomics/myresults/ANGSD/PCA_ADMIX"
SUFFIX="RSBS_poly"
# Make a copy of the list of bam files for all the red + black spruce samples and place in your repo. You'll need this later for making figures.
cp ${INPUT}/RSBS_bam.list ${OUT}
# Set value of K and number of PCA eigenvalues (=K-1)
# K corresponds to the number of distinct ancestry groups you want to cluster genotypes into
K=2
E=$((K-1))
# Then, run PCA and admixture scores with pcangsd:
pcangsd -b ${INPUT}/${SUFFIX}.beagle.gz \
-o ${OUT}/${SUFFIX}_K${K} \
-e $E \
--admix \
--admix-K $K \
--maf 0.05 \
--threads 10
This will run PCAngsd assuming it fits a single “eigenvalue” (the
-e 1
option) to split your samples into K=2 clusters. If
you want to explore higher levels of clustering in the future, you can
increase the -e <int>
flag and the
--admix-K <int>
flag to test different numbers of
clusters you want to fit to the data.
Add your SBATCH header to it and submit to the cluster with 10 cpus and 64G RAM requested
Once the run is finished, you should see the following files in your
myresults/ANGSD/PCA_Admix/
- RSBS_bam.list
- RSBS_poly.cov
- RSBS_poly.admix.2.Q
When you have these files transferred (don’t forget where you saved them to on your laptop!), open up RStudio and let’s start making some figures in an RMarkdown document!
Just a reminder, the following is R code, not bash. ;)
Remember to set your path in your RMarkdown using:
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir="<path to your repo>/population_genomics/myresults/ANGSD/PCA_ADMIX/")
Load your libraries
library(ggplot2) # plotting
library(ggpubr) # plotting
First, let’s work on the genetic PCA:
COV <- as.matrix(read.table("RSBS_poly.cov")) # read in the genetic covariance matrix
PCA <- eigen(COV) # extract the principal components from the COV matrix
How much variance is explained by the first few PCs?
var <- round(PCA$values/sum(PCA$values),3)
var[1:3]
A “screeplot” of the eigenvalues of the PCA hows you how each axis explains variance in the genotypes:
barplot(var,
xlab="Eigenvalues of the PCA",
ylab="Proportion of variance explained")
Bring in the bam.list file and extract the sample info to prepare for plotting:
names <- read.table("RSBS_bam.list")
names <- unlist(strsplit(basename(as.character(names[,1])), split = ".sorted.rmdup.bam"))
split = strsplit(names, "_")
pops <- data.frame(names[1:113], do.call(rbind, split[1:113]))
names(pops) = c("Ind", "Pop", "Row", "Col")
data=as.data.frame(PCA$vectors)
data=data[,c(1:3)]
data= cbind(data, pops)
Now we can plot the PCA using ggplot
ggscatter(data, x = "V1", y = "V2",
color = "Pop",
mean.point = TRUE,
star.plot = TRUE,
main="Red spruce-Black spruce genetic PCA") +
theme_bw(base_size = 13, base_family = "Times") +
labs(x = paste0("PC1: (",var[1]*100,"%)"), y = paste0("PC2: (",var[2]*100,"%)"))
Next, we can look at the admixture clustering at K=2
Import the ancestry scores (these are the .Q files)
q <- read.table("RSBS_poly.admix.2.Q", sep=" ", header=F)
K=dim(q)[2] #Find the level of K modeled
## rank order the samples according to population code
ord<-order(pops[,2])
Now make the plot:
barplot(t(q)[,ord],
#col=cols[1:K],
col=c("black","red"),
space=0,border=NA,
xlab="Populations",ylab="Admixture proportions",
main=paste0("Red Spruce-Black Spruce Admixture: K=",K))
text(tapply(1:nrow(pops),pops[ord,2],mean),-0.05,unique(pops[ord,2]),xpd=T,cex=0.75,srt=45)
abline(v=cumsum(sapply(unique(pops[ord,2]),function(x){sum(pops[ord,2]==x)})),col=1,lwd=1.2)
What have we learned about the genetic structure from the PCA and admixture plots?
What does the PCA seem to be telling us?
What different picture does the admixture plot reveal? How does it relate to the PCA?
Would we want to look for higher levels of K in the admixture analysis? How do we do that??