1. What have we learned about the genetic structure from the PCA and admixture plots?

Let’s take a look at the PCA and admixture figures

What does the PCA seem to be telling us?
What different picture does the admixture plot reveal? How does it relate to the PCA?
Would we want to look for higher levels of K in the admixture analysis? How do we do that??

2. How has selection acted to drive divergence of individual loci in excess of the population structure we’ve observed?

When selection acts in response to local environmental conditions, we observe an excess of population structure at certain loci. This can be thought of as Fst at a single locus exceeding some background level of divergence that exists across the genome as a whole. We call these “Fst outliers” and identifying these outliers is a major goal in ecological genomic studies. But, a major challenge is how best to control for background population structure – in other words, how can we sift out the “outliers” from the rest of the population structure in the genome?

One approach to identifying Fst outliers is using genetic PCA to (a) identify the major axes of population structure, and then (b) find loci that “load” very strongly on these axes, indicating their exceptional divergence is probably driven by the action of selection in excess of genetic drift.
We can run a scan for Fst outliers using PCAngsd, just like we did for the genetic PCA, but here we initiate a selection scan for loci that are exceptionally divergent along one or more of the inferred genetic PC axes. This method is especially helpful when it’s hard to define what are genetic “populations”, since it uses the genetic PCA to infer the genetic structure directly. The method employed in PCAngsd follows the “FastPCA” selection scan method described by Galinsky et al. 2015
The approach essentially uses the SNP weights or “loadings” on a given genetic PC axis to determine the strength of association, and if it exceeds the expectation due to neutral drift.
We could run this on our full set of 113 red + black spruce samples, but the results would be dominated by loci that are strongly divergent between the species (potentially owing to reproductive isolation). What we’d like to find is evidence of adaptive introgression, where alleles from blask spruce hae contributed to divergence and local adaptation within red spruce
To do this, we can run PCAngsd on just the 95 red spruce sampls and look at the resulting PCA to see how the PC axes capture the genetic divergence within red spruce. Is one of the axes related to the degree of introgression with black spruce?
If so, we can have PCAngsd run a selection scan to identify lociu under selection along 1 or more of the PC axes it identifies within red spruce. Knowing which PC axis relates to introgression should then let us test outlier loci within red spruce due to introgression from its sister species.

Run the PCA on the 95 Red Spruce samples

Let’s write a bash script called PCAngsd_allRS_selection.sh that includes the following code. Recall that PCAngsd uses genotype likelihoods in “beagle” format, which have been calculated for you (see tutorial 5) and can be found here:

/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/allRS_poly.beagle.gz

Try creating this script yourself based on the PCANgsd_RSBS.sh script you wrote last week…

You’ll want to hink about what level of K to test…

K=
E=$((K-1))

To tell PCAngsd to test for selection, you’ll need to add the following options after the call to pcangsd. Also, you’ don’t need the options about amix or admix-K anymore, so can remove those. All other options should stay the same.

--selection-eig $E \
--selection \
--sites-save \
--maf-save \
--snp-weights \

Note the minMaf 0.05 option tests only those loci that have a minor allele frequency of 0.05 of greater. That’s because there’s little statistical power for testing association of alleles that are very rare (<5% in the sample).

It should run quickly! Once finished, you’ll the see a new set of files, including:

File name	contents
allRS_poly.selection	selection scores for each locus on each tested PC axis
allRS_poly.weights	weights that show how strongly each locus “loads” on each PC axis
allRS_poly.sites	for each locus shows whether it got tested (1) or not (0)

After we’ve got these resiults files, we can go over to RStudio and start exploring the results.

Note: In RStudio, you’ll need to read in a file with minor allele frequencies (“maf”) that ANGSD makes when estimating genotype likelihoods. You have one of these files from when you worked on your focal pops, but we need one for the “allRS_poly” run of all red spruce individuals for just polymorphic sites. I’ve provided this here:

/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/allRS_poly.mafs.gz

3. Identify and visualize outlier loci

We’ll use R to import the selection scan results and associated meta-data, assign p-values for each tested locus, and visualize the results!

We’ll create this in RMarkdown, annotating in between code chunks. But here’s some useful code if you’re doing it on your own again later.

Make sure your working directory is set using root.dir option in the initial RMarkdown setup chunk.

Set the `root.dir=“~/projects/eco_genomics_2025/population_genomics/myresults/ANGSD/PCA_Admix”

library(tidyverse)
library(ggplot2)
library(ggpubr)
library(qqman)

allRScov <- as.matrix(read.table("allRS_poly_K3.cov"))

allRSPCA <- eigen(allRScov)

var <- round(allRSPCA$values/sum(allRSPCA$values),3)

var[1:3]

# A "screeplot" of the eigenvalues of the PCA:

barplot(var, 
        xlab="Eigenvalues of the PCA", 
        ylab="Proportion of variance explained")

Bring in the bam.list file and extract the sample info:

names <- read.table("allRS_bam.list")
names <- unlist(strsplit(basename(as.character(names[,1])), split = ".sorted.rmdup.bam"))
split = strsplit(names, "_")
pops <- data.frame(names[1:95], do.call(rbind, split[1:95]))
names(pops) = c("Ind", "Pop", "Row", "Col")

Plot the PCA using ggplot

data = as.data.frame(allRSPCA$vectors)
data = data[,c(1:3)]
data = cbind(data, pops)

# PCA with just red spruce (95 samples):

ggscatter(data, x = "V1", y = "V2",
          color = "Pop",
          mean.point = TRUE,
          star.plot = TRUE,
          main="Red spruce-Black spruce genetic PCA") +
  theme_bw(base_size = 13, base_family = "Times") +
  labs(x = paste0("PC1: (",var[1]*100,"%)"), y = paste0("PC2: (",var[2]*100,"%)")

What do you see in terms of clustering with the PCA using only red spruce samples?

We could see if there are correlations with the Admixture ancestry scores from the RSBS analysis we ran last week:

q_K2 <- read.table("RSBS_poly.admix.2.Q")
names(q_K2) = c("qBS","qRS")

data2 <- cbind(q_K2[1:95,1:2],data)
  
p1 <- ggscatter(data2, x ="qBS", y = "V1",
          color="Pop",
          main="Red spruce-Black spruce genetic PCA") +
  theme_bw(base_size = 13, base_family = "Times") +
  labs(x = "Black spruce ancestry", y = "PC1 score")

p2 <- ggscatter(data2, x ="qBS", y = "V2",
                color="Pop",
                main="Red spruce-Black spruce genetic PCA") +
  theme_bw(base_size = 13, base_family = "Times") +
  labs(x = "Black spruce ancestry", y = "PC2 score")

ggarrange(p1,p2)

read in selection statistics (these are chi^2 distributed)

!!!You'll need to edit the value you used for K here!!!

s <- readtable("allRS_poly_K3.selection")

# convert test statistic to p-value
pval_PC1 <- as.data.frame(1-pchisq(s[,1],1))
names(pval_PC1) = "p_PC1"

pval_PC2 <- as.data.frame(1-pchisq(s[,2],1))
names(pval_PC2) = "p_PC2"

## read positions
pos <- read.table("allRS_poly_mafs.sites",sep="\t",header=T, stringsAsFactors=T)
dim(p)

Get the info on the sites tested:

## read in site positions

PCAngsd_sites <- read.table("allRS_poly_K3.sites")
MAF_all_sites <- read.table("/gpfs1/cl/ecogen/pbio6800/PopulationGenomics/ANGSD/allRS_poly.mafs.gz", header=T)

pos_filtered <- cbind(MAF_all_sites,PCAngsd_sites)[which(PCAngsd_sites==1),]

dim(pos_filtered)

How many sites got filtered out when testing for selection? Why?

make a manhattan plot


plot(-log10(pval_PC1), col="black", cex=0.8, xlab="Locus position", main="Selection outliers: PCAngsd PC1")

plot(-log10(pval_PC2), col="black", cex=0.8, xlab="Locus position", main="Selection outliers: PCAngsd PC2")

We can zoom in if there’s something interesting near a position…

plot(-log10(pval$p_PC1[2e05:2.01e05]),
  col=p_filtered$chromo, 
  xlab="Position", 
  ylab="-log10(p-value)", 
  main="Selection outliers: pcANGSD e=1 (K2)")

# get the contig with the lowest p-value for selection
sel_contig <- p_filtered[which(pval==min(pval$p_PC1)),c("chromo","position")]
sel_contig

# get all the outliers with p-values below some cutoff
cutoff=1e-3   # equals a 1 in 5,000 probability
outlier_contigs <- p_filtered[which(pval<cutoff),c("chromo","position")]
outlier_contigs

# how many outlier loci < the cutoff?
dim(outlier_contigs)[1]

# how many unique contigs harbor outlier loci?
length(unique(outlier_contigs$chromo))

Population Genomics Tutorial 6: Genome scan for selection