1. What have we learned about the genetic structure from the PCA and admixture plots?

Let’s take a look at the PCA and admixture figures

2. How has selection acted to drive divergence of individual loci in excess of the population structure we’ve observed?

When selection acts in response to local environmental conditions, we observe an excess of population structure at certain loci. This can be thought of as Fst at a single locus exceeding some background level of divergence that exists across the genome as a whole. We call these “Fst outliers” and identifying these outliers is a major goal in ecological genomic studies. But, a major challenge is how best to control for background population structure – in other words, how can we sift out the “outliers” from the rest of the population structure in the genome?

Run the PCA on the 95 Red Spruce samples

/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/allRS_poly.beagle.gz

Try creating this script yourself based on the PCANgsd_RSBS.sh script you wrote last week…

You’ll want to hink about what level of K to test…

K=
E=$((K-1))

To tell PCAngsd to test for selection, you’ll need to add the following options after the call to pcangsd. Also, you’ don’t need the options about amix or admix-K anymore, so can remove those. All other options should stay the same.

--selection-eig $E \
--selection \
--sites-save \
--maf-save \
--snp-weights \

Note the minMaf 0.05 option tests only those loci that have a minor allele frequency of 0.05 of greater. That’s because there’s little statistical power for testing association of alleles that are very rare (<5% in the sample).

It should run quickly! Once finished, you’ll the see a new set of files, including:

File name contents
allRS_poly.selection selection scores for each locus on each tested PC axis
allRS_poly.weights weights that show how strongly each locus “loads” on each PC axis
allRS_poly.sites for each locus shows whether it got tested (1) or not (0)

After we’ve got these resiults files, we can go over to RStudio and start exploring the results.

Note: In RStudio, you’ll need to read in a file with minor allele frequencies (“maf”) that ANGSD makes when estimating genotype likelihoods. You have one of these files from when you worked on your focal pops, but we need one for the “allRS_poly” run of all red spruce individuals for just polymorphic sites. I’ve provided this here:

/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/allRS_poly.mafs.gz

3. Identify and visualize outlier loci

We’ll use R to import the selection scan results and associated meta-data, assign p-values for each tested locus, and visualize the results!

We’ll create this in RMarkdown, annotating in between code chunks. But here’s some useful code if you’re doing it on your own again later.

Make sure your working directory is set using root.dir option in the initial RMarkdown setup chunk.

Set the `root.dir=“~/projects/eco_genomics_2025/population_genomics/myresults/ANGSD/PCA_Admix”

library(tidyverse)
library(ggplot2)
library(ggpubr)
library(qqman)
allRScov <- as.matrix(read.table("allRS_poly_K3.cov"))

allRSPCA <- eigen(allRScov)

var <- round(allRSPCA$values/sum(allRSPCA$values),3)

var[1:3]

# A "screeplot" of the eigenvalues of the PCA:

barplot(var, 
        xlab="Eigenvalues of the PCA", 
        ylab="Proportion of variance explained")

Bring in the bam.list file and extract the sample info:

names <- read.table("allRS_bam.list")
names <- unlist(strsplit(basename(as.character(names[,1])), split = ".sorted.rmdup.bam"))
split = strsplit(names, "_")
pops <- data.frame(names[1:95], do.call(rbind, split[1:95]))
names(pops) = c("Ind", "Pop", "Row", "Col")

Plot the PCA using ggplot

data = as.data.frame(allRSPCA$vectors)
data = data[,c(1:3)]
data = cbind(data, pops)

# PCA with just red spruce (95 samples):

ggscatter(data, x = "V1", y = "V2",
          color = "Pop",
          mean.point = TRUE,
          star.plot = TRUE,
          main="Red spruce-Black spruce genetic PCA") +
  theme_bw(base_size = 13, base_family = "Times") +
  labs(x = paste0("PC1: (",var[1]*100,"%)"), y = paste0("PC2: (",var[2]*100,"%)")

What do you see in terms of clustering with the PCA using only red spruce samples?

We could see if there are correlations with the Admixture ancestry scores from the RSBS analysis we ran last week:

q_K2 <- read.table("RSBS_poly.admix.2.Q")
names(q_K2) = c("qBS","qRS")

data2 <- cbind(q_K2[1:95,1:2],data)
  
p1 <- ggscatter(data2, x ="qBS", y = "V1",
          color="Pop",
          main="Red spruce-Black spruce genetic PCA") +
  theme_bw(base_size = 13, base_family = "Times") +
  labs(x = "Black spruce ancestry", y = "PC1 score")

p2 <- ggscatter(data2, x ="qBS", y = "V2",
                color="Pop",
                main="Red spruce-Black spruce genetic PCA") +
  theme_bw(base_size = 13, base_family = "Times") +
  labs(x = "Black spruce ancestry", y = "PC2 score")

ggarrange(p1,p2)

read in selection statistics (these are chi^2 distributed)

!!!You'll need to edit the value you used for K here!!!

s <- readtable("allRS_poly_K3.selection")

# convert test statistic to p-value
pval_PC1 <- as.data.frame(1-pchisq(s[,1],1))
names(pval_PC1) = "p_PC1"

pval_PC2 <- as.data.frame(1-pchisq(s[,2],1))
names(pval_PC2) = "p_PC2"

## read positions
pos <- read.table("allRS_poly_mafs.sites",sep="\t",header=T, stringsAsFactors=T)
dim(p)

Get the info on the sites tested:

## read in site positions

PCAngsd_sites <- read.table("allRS_poly_K3.sites")
MAF_all_sites <- read.table("/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/allRS_poly.mafs.gz", header=T)

pos_filtered <- cbind(MAF_all_sites,PCAngsd_sites)[which(PCAngsd_sites==1),]

dim(pos_filtered)

How many sites got filtered out when testing for selection? Why?

make a manhattan plot


plot(-log10(pval_PC1), col="black", cex=0.8, xlab="Locus position", main="Selection outliers: PCAngsd PC1")

plot(-log10(pval_PC2), col="black", cex=0.8, xlab="Locus position", main="Selection outliers: PCAngsd PC2")

We can zoom in if there’s something interesting near a position…

plot(-log10(pval$p_PC1[2e05:2.01e05]),
  col=p_filtered$chromo, 
  xlab="Position", 
  ylab="-log10(p-value)", 
  main="Selection outliers: pcANGSD e=1 (K2)")

# get the contig with the lowest p-value for selection
sel_contig <- p_filtered[which(pval==min(pval$p_PC1)),c("chromo","position")]
sel_contig

# get all the outliers with p-values below some cutoff
cutoff=1e-3   # equals a 1 in 5,000 probability
outlier_contigs <- p_filtered[which(pval<cutoff),c("chromo","position")]
outlier_contigs

# how many outlier loci < the cutoff?
dim(outlier_contigs)[1]

# how many unique contigs harbor outlier loci?
length(unique(outlier_contigs$chromo))