Volume 233, Issue 1 p. 555-568
Full paper

Linked selection shapes the landscape of genomic variation in three oak species

Yi-Ye Liang

Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650 China

University of the Chinese Academy of Sciences, Beijing, 100049 China

Yong Shi

Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650 China

Shuai Yuan

Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650 China

Biao-Feng Zhou

Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650 China

Xue-Yan Chen

Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650 China

Qing-Qing An

Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650 China

Pär K. Ingvarsson

Department of Plant Biology, Linnean Center for Plant Biology, Uppsala BioCenter, Swedish University of Agricultural Sciences, Uppsala, SE-75007 Sweden

Christophe Plomion

University of Bordeaux, INRAE, BIOGECO, Cestas, F-33610 France

Baosheng Wang

Corresponding Author

Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650 China

Center of Conservation Biology, Core Botanical Gardens, Chinese Academy of Sciences, Guangzhou, 510650 China

Author for correspondence:

Baosheng Wang

First published: 12 October 2021


  • Natural selection shapes genome-wide patterns of diversity within species and divergence between species. However, quantifying the efficacy of selection and elucidating the relative importance of different types of selection in shaping genomic variation remain challenging.
  • We sequenced whole genomes of 101 individuals of three closely related oak species to track the divergence history, and to dissect the impacts of selective sweeps and background selection on patterns of genomic variation.
  • We estimated that the three species diverged around the late Neogene and experienced a bottleneck during the Pleistocene. We detected genomic regions with elevated relative differentiation (‘FST-islands’). Population genetic inferences from the site frequency spectrum and ancestral recombination graph indicated that FST-islands were formed by selective sweeps. We also found extensive positive selection; the fixation of adaptive mutations and reduction of neutral diversity around substitutions generated a signature of selective sweeps. Prevalent negative selection and background selection have reduced genetic diversity in both genic and intergenic regions, and contributed substantially to the baseline variation in genetic diversity.
  • Our results demonstrate the importance of linked selection in shaping genomic variation, and illustrate how the extent and strength of different selection models vary across the genome.


How evolutionary forces shape patterns of genetic diversity within species and divergence between species is a fundamental question in evolutionary biology (Mitchell-Olds et al., 2007; Wolf & Ellegren, 2016; Kern & Hahn, 2018). The neutral theory of molecular evolution states that most polymorphisms within species are generally neutral and under mutation–drift equilibrium, and divergence between species is a result of the accumulation of neutral substitutions rather than adaptation (Kimura, 1968, 1983). Ohta (1973, 1992) further developed the ‘nearly neutral theory’ and suggested that slightly deleterious amino acid substitutions are common. Recent studies in population genomics have revealed widespread signatures of selection in many species (e.g. Sattath et al., 2011; Williamson et al., 2014; Stankowski et al., 2019; Wang et al., 2020; Murphy et al., 2021), indicating that genomic landscapes are influenced predominantly by natural selection (Kreitman, 1996; Corbett-Detig et al., 2015; Kern & Hahn, 2018). Although the importance of selection in shaping genomic variation is clearly established, quantifying the efficacy of selection and elucidating the relative contribution of different selection models remain challenging.

Natural selection affects genetic diversity at functional sites and at surrounding neutral sites via linked selection. The extent of linked selection varies across the genome and is associated with the rate of recombination and density of targets for selection (Maynard Smith & Haigh, 1974; Kaplan et al., 1989; B. Charlesworth et al., 1993; Charlesworth et al., 1995; Hudson & Kaplan, 1995). Strong linked selection is expected in genomic regions with low recombination rates and high densities of functional sites, resulting in a heterogeneous genomic landscape with peaks and troughs of diversity and divergence (Renaut et al., 2013; Cruickshank & Hahn, 2014; Burri et al., 2015; Vijay et al., 2016; Wang et al., 2016). Other processes can generate a pattern of genomic variation similar to that created by linked selection. In a scenario of speciation with gene flow, divergent selection could prevent introgression at loci involved in reproductive isolation or ecological adaptation, generating genomic islands with elevated differentiation (Malinsky et al., 2015). Alternatively, divergent sorting of ancient polymorphisms resulting from selection or drift can lead to regions with elevated divergence between descendant lineages (Guerrero & Hahn, 2017; Han et al., 2017; Ma et al., 2018; Wang et al., 2019). Additionally, neutral demographic changes may affect the genomic landscape via genetic drift, mimicking the patterns produced by selection (Cruickshank & Hahn, 2014). These processes are not mutually exclusive. Therefore, understanding the evolutionary mechanisms that drive genetic variation requires detailed information on demographic history and the use of multiple informative summary statistics (Wolf & Ellegren, 2016). 
For example, divergent selection with gene flow and divergent sorting of ancient polymorphisms can generate genomic regions with elevated relative population differentiation (FST) and absolute divergence (dXY) (Malinsky et al., 2015; Guerrero & Hahn, 2017). By contrast, linked selection is expected to generate regions with elevated FST but reduced or unchanged dXY (Cruickshank & Hahn, 2014). Combined use of FST and dXY can therefore distinguish the effects of linked selection from other evolutionary processes affecting genomic divergence (Wolf & Ellegren, 2016).

Two models explain how linked selection reduces genetic diversity at neutral sites: selective sweeps acting on beneficial mutations (Maynard Smith & Haigh, 1974; Kaplan et al., 1989) and background selection (BGS) against deleterious mutations (Charlesworth et al., 1992; B. Charlesworth et al., 1993; Hudson & Kaplan, 1995; Slotte, 2014). Although BGS contributes to patterns of diversity (D. Charlesworth et al., 1993; Charlesworth et al., 1995), recent studies have shown that BGS alone cannot fully explain genomic variation and that other evolutionary forces, such as positive selection, should be taken into account (Elyashiv et al., 2016; Matthey-Doret & Whitlock, 2019; Rettelbach et al., 2019; Schrider, 2020). Several approaches have been developed to infer the types of linked selection and assess their relative contributions to genetic diversity. Patterns of neutral diversity surrounding amino acid substitutions and synonymous substitutions can provide evidence of selective sweeps (Hernandez et al., 2011; Sattath et al., 2011; Williamson et al., 2014). Models of selective sweeps predict reduced neutral diversity around substitutions as a consequence of hitchhiking, with the size of affected regions determined by the strength of selection and recombination rate (Maynard Smith & Haigh, 1974; Sattath et al., 2011). Recent studies have introduced a joint modelling framework to distinguish between the effects of BGS and selective sweeps by incorporating information on the recombination rate and distribution of functional sites under selection (Elyashiv et al., 2016; Rettelbach et al., 2019; Murphy et al., 2021). However, comprehensive investigations of linked selection are limited to a few plant species, for example Capsella grandiflora (Williamson et al., 2014) and monkeyflowers (Stankowski et al., 2019). The action of selection across the genome remains poorly understood in nonmodel plants.

The genus Quercus (oak) is one of the most widespread and species-rich tree genera in the Northern Hemisphere (Denk et al., 2017). Ecological divergence has played important roles in driving diversification and speciation in oaks (Cavender-Bares, 2019; Kremer & Hipp, 2020). Common garden and reciprocal transplant experiments have provided evidence for adaptation (Kremer & Hipp, 2020), and genome scans have identified many single nucleotide polymorphisms (SNPs) significantly correlated with climatic variables and phenological traits (Martins et al., 2018; Gao et al., 2020; Leroy et al., 2020). The widespread signature of selection (Rellstab et al., 2016; Sork, 2018; Kremer & Hipp, 2020) and recently established genomic resources (Sork et al., 2016; Plomion et al., 2018) have made oak species an important system for understanding how selection drives genomic divergence and speciation in long-lived organisms, which are likely to experience changes in the strength and direction of selection over their lifespan.

We focused on three closely related oak species (Q. acutissima, Q. variabilis and Q. chenii) confined to East Asia (Denk et al., 2017). These three species form an East Asian clade of the section Cerris. They diverged from other Cerris species (confined to Europe) 30 million years ago (Ma) and from other sections of the genus Quercus 48–52 Ma (Hipp et al., 2020). Previous studies have shown that there is no gene flow between these three species and other oak species (Hipp et al., 2020; Kremer & Hipp, 2020). Therefore, the three species provide an excellent opportunity to investigate evolutionary processes driving genomic divergence, in which the contribution of gene flow from other species is negligible. Of these species, Q. acutissima and Q. variabilis are dominant in warm temperate deciduous forests with wide geographical distributions, whereas Q. chenii is endemic to eastern China and has a relatively narrow distribution (Chen et al., 2012; Zhang et al., 2018; Li et al., 2019). These species occupy similar niches in overlapping ranges and form mixed forests (Huang et al., 1999). Nevertheless, they can be distinguished based on leaf and nut morphologies (Supporting Information Fig. S1) (Huang et al., 1999). Phylogeographical studies based on genetic markers have reported high levels of intraspecific genetic diversity and low levels of intraspecific population differentiation in all three species, and revealed that their demographic histories were influenced by topographic movement and climatic oscillations during the late Neogene (Chen et al., 2012; Zhang et al., 2018; Li et al., 2019). A recent study of Q. acutissima based on genetic and phenotypic data identified a set of adaptive morphological traits and candidate genes (Gao et al., 2020). Owing to their focus on a small subset of the genome, these studies cannot provide broad insights into genome-wide patterns of variation and their underlying driving forces.
In this study, we re-sequenced the whole genomes of 101 individuals representing much of the range of the three species. We inferred patterns of genomic variation and tracked divergence among the three species. Next, we quantified the efficacy of positive and negative selection in each species and investigated the signatures of linked selection across the genome. Finally, we evaluated the extent to which classic selective sweeps and/or BGS influenced genome-wide patterns of variation.

Materials and Methods

Sampling, whole-genome re-sequencing and SNP calling

We collected samples from 37, 50 and 14 individuals spanning the geographic distribution of Q. acutissima, Q. variabilis and Q. chenii, respectively (Fig. 1a; Table S1). Each individual was sequenced on the Illumina NovaSeq 6000 platform (paired-end 150 bp) to a target coverage of 30×. We cleaned short reads using Trimmomatic v.0.38 (Bolger et al., 2014), aligned them to the Q. robur reference genome (Plomion et al., 2018) using bwa v.0.7.15 (Li & Durbin, 2010), and called genotypes using HaplotypeCaller implemented in Gatk v.4.1 (DePristo et al., 2011). We applied a strict filtering process to reduce low-quality variants (Notes S1) and retained 16 654 671 SNPs for subsequent analyses, including 2 523 800 (15.15%) SNPs that were polymorphic in all species (Fig. S2).

Fig. 1 Population structure and demographic history of the three oak species. (a) Locations of 19, 25 and three populations of Quercus acutissima, Q. variabilis and Q. chenii sampled in this study. (b) Principal component analysis (PCA) based on genome-wide single nucleotide polymorphisms (SNPs). The first two principal components (PCs) are shown, with the percentage of variance explained by each component shown in parentheses. (c) Changes in the historical effective population size (Ne) of each species estimated by multiple sequentially Markovian coalescent (MSMC) based on eight haplotypes. The solid line and shading represent means and standard deviations of estimates, respectively. (d) Best demographic model for the three oak species inferred by fastsimcoal2 (model 8 in Supporting Information Fig. S3). The relative effective population size and divergence time are shown; yellow arrows represent gene flow, as detailed in Table S7. Ma, million years ago.

Population structure analyses

In order to investigate the population structure, we performed an admixture analysis using Admixture v.1.3.0 (Alexander et al., 2009) and a principal component analysis (PCA) using PCAngsd v.0.98 (Meisner & Albrechtsen, 2018). We also constructed neighbor-joining (NJ) trees using MegaX (Kumar et al., 2018). Admixture was run with predefined numbers of clusters (K) ranging from 1 to 6, each repeated 20 times. The K-value with the lowest cross-validation error was chosen as the most likely number of putative genetic groups. PCAngsd was implemented based on genotype likelihoods accounting for sequencing errors and uncertainty in genotype calls. Population structure analyses were performed based on 618 038 common SNPs with missing rate < 0.1, minor allele frequency (MAF) > 5%, and correlation coefficient < 0.2 with any other SNPs in sliding windows of 50 SNPs.

Demographic history

We used three methods to infer the demographic history of the three oak species. First, we performed multiple sequentially Markovian coalescent (MSMC) in msmc v.2.0.0 (Schiffels & Durbin, 2014) to estimate the cross-coalescence rate for each pair of species and to track variation in effective population size (Ne) over time for each species. MSMC analyses of 50 combinations of haplotypes (eight haplotypes per species or species-pair) were performed to obtain means and standard deviations. Second, we applied the sequentially Markovian method using Smc++ v.1.15.2 (Terhorst et al., 2017) to investigate changes in Ne based on unphased data.

Finally, we implemented coalescent simulations using fastsimcoal2 (Excoffier et al., 2013) to infer demographic parameters. The 2D joint-unfolded site frequency spectrum (SFS) was calculated using Angsd v.0.935 (Korneliussen et al., 2014) based on 101 953 475 intergenic sites, which are least affected by selection. Ten models with different demographic scenarios were evaluated (Fig. S3). For each model, we sampled parameters from a wide prior distribution (Table S2) and optimized model parameters using 80 independent runs. The best-fit model (referred to as the ‘unconstrained-best-fit model’) was chosen based on Akaike weights (Excoffier et al., 2013). We further constrained the times of two splitting events according to MSMC estimates (i.e. 1.5–3 Ma for the divergence between Q. chenii and the ancestor of Q. acutissima and Q. variabilis, and 0.5–1.5 Ma for the divergence between Q. acutissima and Q. variabilis; see Results for details). We applied the same procedure described above to choose the best model (referred to as the ‘constrained-best-fit model’). For both constrained- and unconstrained-best-fit models, the 95% confidence intervals (CIs) of parameters were computed based on 100 simulated datasets. The goodness-of-fit was evaluated by comparing the SFS, two SFS-based summary statistics (nucleotide diversity, π, and relative population differentiation, FST) and a linkage disequilibrium (LD)-based statistic (ZnS; Kelly, 1997) between observed and simulated data from the best-fit models. To convert the demographic estimates into absolute values, we assumed a mutation rate of 2.0 × 10⁻⁹ per site per year based on an estimate for a low copy nuclear gene in oak species (Cavender-Bares et al., 2011) and a generation time of 50 yr proposed for Q. variabilis (Chen et al., 2012).
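The model-choice step based on Akaike weights can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name and the example inputs are hypothetical, and the sketch assumes that the maximised likelihoods are supplied on a log10 scale, so they are converted to natural logarithms before the AIC is computed.

```python
import math


def akaike_weights(log10_likelihoods, n_params):
    """Return (AIC values, Akaike weights) for a set of candidate models.

    Assumption: likelihoods are given as log10 values and are converted
    to natural logs here; n_params holds the number of free parameters
    of each model, in the same order.
    """
    aic = [
        2 * k - 2 * (ll * math.log(10))
        for ll, k in zip(log10_likelihoods, n_params)
    ]
    best = min(aic)
    # Relative likelihood of each model given the data, exp(-delta_AIC / 2).
    rel = [math.exp(-0.5 * (a - best)) for a in aic]
    total = sum(rel)
    return aic, [r / total for r in rel]


if __name__ == "__main__":
    # Hypothetical likelihoods and parameter counts for three models.
    aic, weights = akaike_weights([-1000.0, -1010.0, -1003.0], [10, 10, 12])
    for i, (a, w) in enumerate(zip(aic, weights), start=1):
        print(f"model {i}: AIC = {a:.1f}, weight = {w:.4f}")
```

The model with the largest weight is retained; a weight close to 1 indicates that the alternatives have negligible relative support.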

Population genetic statistics

We used Angsd to calculate summary statistics, including π (Tajima, 1989), nucleotide polymorphism (θw) (Watterson, 1975), Tajima’s D (Tajima, 1989), FST (Weir & Cockerham, 1984) and genetic distance dXY (Nei, 1987) based on the folded SFS, and Fay & Wu’s H (Fay & Wu, 2000) based on the unfolded SFS. We also calculated the relative node depth (RND) (Feder et al., 2005) to account for mutation rate variation across the genome. Quercus robur was used as an outgroup species to estimate Fay & Wu’s H and RND. All summary statistics were calculated in 10-kb nonoverlapping sliding windows. FST within species was estimated by averaging all pairwise FST values between populations.
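As a small worked example of the RND statistic, the sketch below divides the divergence between two focal species by their mean divergence from the outgroup, which is the usual definition of relative node depth. The function name and the input values are hypothetical; per-window divergence estimates are assumed to have been computed beforehand.

```python
def relative_node_depth(d_xy, d_x_out, d_y_out):
    """Relative node depth for one window.

    d_xy: absolute divergence between focal species X and Y.
    d_x_out, d_y_out: divergence of X and Y from the outgroup.
    Returns NaN when the outgroup divergence is zero (no information).
    """
    mean_out = (d_x_out + d_y_out) / 2.0
    if mean_out == 0:
        return float("nan")
    return d_xy / mean_out


if __name__ == "__main__":
    # Hypothetical window: d_XY = 0.01, outgroup divergences 0.04 and 0.06.
    print(relative_node_depth(0.01, 0.04, 0.06))
```

Because a locally elevated mutation rate inflates both the numerator and the denominator, the ratio is less sensitive to mutation rate variation than dXY alone.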

We calculated Z-transformed per-window FST using the formula Z-FST = (per-window FST – mean FST across windows)/standard deviation of FST across windows (Han et al., 2017) and identified windows with Z-FST ≥ 2 as ‘FST-islands’. To test whether FST-islands could be generated by demographic processes alone, we simulated 10-kb fragments (100 000 in each simulated dataset) based on the two best-fit models inferred by fastsimcoal2 (see ‘Demographic history’ above). We conducted simulations with six recombination rates (0−100 cM/Mb), generating 12 datasets. For each dataset, we estimated FST for species-pairs, applied the same cutoff (Z-FST ≥ 2) used for empirical data to define the upper tails, and compared FST values in FST-islands vs the upper tails of simulated data. To examine over-represented functional classes in FST-islands, we performed gene ontology (GO) analyses using Gowinda (Kofler & Schloetterer, 2012). GO terms with Benjamini–Hochberg (Benjamini & Hochberg, 1995) false discovery rate (FDR) < 0.01 were considered significantly enriched.
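The Z-transformation and the island cutoff described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name is hypothetical, per-window FST values are assumed to be available as a plain list, and the population standard deviation is used because the text does not specify which estimator was applied.

```python
import statistics


def fst_islands(fst_by_window, z_cutoff=2.0):
    """Return (indices of island windows, Z-FST values).

    Z-FST = (window FST - mean FST) / SD of FST across windows.
    Assumption: the population SD (statistics.pstdev) is used.
    """
    mean = statistics.fmean(fst_by_window)
    sd = statistics.pstdev(fst_by_window)
    if sd == 0:
        raise ValueError("FST does not vary across windows")
    z = [(f - mean) / sd for f in fst_by_window]
    islands = [i for i, zi in enumerate(z) if zi >= z_cutoff]
    return islands, z


if __name__ == "__main__":
    # Hypothetical data: nine background windows and one outlier window.
    fst = [0.2] * 9 + [0.9]
    islands, z = fst_islands(fst)
    print(islands, [round(v, 2) for v in z])
```

The same function can be applied to simulated windows so that the upper tail of the neutral expectation is defined with an identical cutoff.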

The population-scaled recombination rate (ρ = 4Nec) between each pair of SNPs with a sliding window of 50 SNPs was estimated using LDhelmet v.1.9 (Chan et al., 2012) and then weight-averaged over 10-kb windows. Because LDhelmet can handle up to 50 haplotypes, we randomly sampled 25 individuals from Q. acutissima and Q. variabilis and retained all 14 individuals of Q. chenii (Table S1). LD between SNPs with a sliding window of 10 kb was estimated using Plink v.1.07 (Purcell et al., 2007) and then averaged over all pairwise sites to obtain the ZnS statistic (Kelly, 1997). For LDhelmet and Plink, only SNPs with MAF > 5% were included and only windows with ≥ 10 SNPs were retained. Identical-by-descent (IBD) blocks were identified using Beagle v.4.1 (Browning & Browning, 2013).
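The ZnS statistic is the average squared correlation (r²) over all pairs of segregating sites in a window. The sketch below illustrates the calculation on phased 0/1 haplotypes for simplicity; this is an assumption made for the example and not necessarily how the LD values were obtained in the study. Function names and data are hypothetical, and every site is assumed to be polymorphic.

```python
from itertools import combinations


def r_squared(a, b):
    """Squared correlation between two biallelic sites (0/1 per haplotype)."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def zns(sites):
    """Average r^2 over all pairs of sites in one window.

    sites: list of allele vectors, one per SNP, each of length n haplotypes.
    Assumption: all sites are segregating (no monomorphic vectors).
    """
    pairs = list(combinations(sites, 2))
    if not pairs:
        return float("nan")
    return sum(r_squared(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical window with three SNPs typed in four haplotypes.
    window = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
    print(zns(window))
```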

Ancestral recombination graph

We estimated the ancestral recombination graph (ARG) across the genome. We first inferred gene genealogies using ARGweaver (Rasmussen et al., 2014; Hubisz & Siepel, 2020) and then calculated three measures in 10-kb nonoverlapping windows based on the genealogical information of local trees. Following Hejase et al. (2020), we estimated enrichment scores (ECH) to test for species differentiation in local trees, the relative time to most recent common ancestry half-life (RTH′) and cross-coalescent times (CCs) between species. The RTH′ and CC values are related to the genetic diversity (π) and divergence (dXY), respectively, but are less affected by mutation rate variation and background selection (BGS) (Hejase et al., 2020). The parameter settings for ARGweaver are provided in Notes S2.

Distribution of fitness effects (DFE) and proportion of adaptive substitutions (α)

In order to quantify the strength of positive and negative selection, we calculated DFE and α using polyDFE2.0 (Tataru et al., 2017) for each category of sites (Table S3). The rate of adaptive substitutions (ω) was estimated as: ω = α × (dN/dS), where dN and dS are per-site nonsynonymous and synonymous substitutions, respectively. Following Tataru et al. (2017), we projected data into 50, 50 and 20 haplotypes for Q. acutissima, Q. variabilis and Q. chenii, respectively. All sites retained after projection were used for analyses in polyDFE2.0. The four-fold degenerate sites were used as a neutral reference to estimate α. To account for both demography and polarization errors, four models were tested, each with a maximum of 500 iterations to thoroughly explore the parameter space. The best model was chosen based on the Akaike information criterion. The 95% CIs were calculated based on 200 bootstrap replicates.
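The relationship ω = α × (dN/dS) amounts to simple arithmetic once α and the substitution counts are available. The sketch below is a worked example only; the function name and all numbers are hypothetical, and α is assumed to come from a separate DFE analysis.

```python
def adaptive_substitution_rate(alpha, n_nonsyn_subs, n_nonsyn_sites,
                               n_syn_subs, n_syn_sites):
    """Rate of adaptive substitution, omega = alpha * (dN / dS).

    dN and dS are per-site nonsynonymous and synonymous substitution
    rates, computed here from raw counts of substitutions and sites.
    """
    d_n = n_nonsyn_subs / n_nonsyn_sites
    d_s = n_syn_subs / n_syn_sites
    return alpha * d_n / d_s


if __name__ == "__main__":
    # Hypothetical inputs: alpha = 0.3, dN = 0.01, dS = 0.04.
    print(adaptive_substitution_rate(0.3, 100, 10_000, 200, 5_000))
```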

Genetic diversity around fixed substitutions and genic regions

When a new beneficial mutation is fixed by a hard selective sweep, diversity at surrounding linked sites is expected to be reduced (Sattath et al., 2011). To detect this signature of hard selective sweeps, we calculated neutral diversity (π) in 500-bp and 1-kb sliding windows around each fixed nonsynonymous and synonymous substitution based on four-fold degenerate sites and intergenic positions. To correct for systematic variation in mutation rates, π was divided by divergence (d) from Q. robur in each window, generating scaled diversity (π/d). Windows were grouped based on the distance to the nearest substitution, mean diversity was calculated, and 95% CIs were constructed by 1000 bootstrap replicates. A one-tailed P-value was calculated to test whether scaled diversity is lower around nonsynonymous substitutions than around synonymous substitutions. Following Hernandez et al. (2011), we extracted synonymous substitutions that were ≥ 5 kb from the nearest nonsynonymous substitution, resulting in 4441−4542 synonymous substitutions for analyses (Fig. S4). Similar results were obtained when using all synonymous substitutions (Fig. S5); however, lower diversity around nonsynonymous substitutions was only observed in a short region close to the fixed positions, indicating the effects of clustering.

In order to look for the signature of BGS, we calculated scaled diversity (π/d; see above) in 1-kb sliding windows from a gene to halfway to the nearest gene. Windows were grouped based on the distance to nearest genes to calculate mean values across windows, and 95% CIs were estimated based on 1000 bootstrap replicates.
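The windowed analyses of scaled diversity (π/d) around substitutions and genes share one pattern: windows are grouped by distance, the mean is taken per group, and a bootstrap provides the 95% CI. A minimal sketch of that pattern is given below. The function name, the tuple layout of the input and the percentile bootstrap are assumptions made for illustration.

```python
import random
import statistics
from collections import defaultdict


def binned_scaled_diversity(windows, bin_size=1000, n_boot=1000, seed=1):
    """Mean pi/d per distance bin with a percentile bootstrap 95% CI.

    windows: iterable of (distance_bp, pi, d) tuples, where d is the
    divergence from the outgroup in the same window (hypothetical layout).
    Returns {bin_start_bp: (mean, ci_low, ci_high)}.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for dist, pi, d in windows:
        if d > 0:  # skip windows without divergence information
            bins[dist // bin_size].append(pi / d)
    result = {}
    for b, vals in sorted(bins.items()):
        # Resample windows with replacement and recompute the mean.
        boots = sorted(
            statistics.fmean(rng.choices(vals, k=len(vals)))
            for _ in range(n_boot)
        )
        low = boots[int(0.025 * n_boot)]
        high = boots[int(0.975 * n_boot) - 1]
        result[b * bin_size] = (statistics.fmean(vals), low, high)
    return result


if __name__ == "__main__":
    # Hypothetical windows: (distance to nearest substitution, pi, d).
    data = [(100, 0.004, 0.02), (600, 0.006, 0.02), (1500, 0.010, 0.02)]
    for start, (mean, low, high) in binned_scaled_diversity(data).items():
        print(start, round(mean, 3), round(low, 3), round(high, 3))
```

A trough in the binned means close to distance zero that recovers with distance is the pattern expected under linked selection.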

Modelling the effects of BGS and selective sweeps

In order to assess the effects of BGS and selective sweeps on patterns of diversity, we calculated the expected nucleotide diversity under two models, following Elyashiv et al. (2016) and Rettelbach et al. (2019). Only BGS was considered in Model 1, whereas both BGS and selective sweeps were considered in Model 2.

In Model 1, diversity at a neutral site affected by BGS (B-value) was calculated using the equation developed by Hudson & Kaplan (1995) and Nordborg et al. (1996):
[equation image not reproduced] (Eqn 1)
where ud is the deleterious mutation rate, Sd is the negative selection coefficient, and r(x, i) is the recombination probability between the neutral site x and the selected site i. Following Apuli et al. (2020), we converted the population-scaled recombination rate (in units of ρ/bp) to cM/Mb according to the genetic map of Q. robur (Plomion et al., 2018).
In Model 2, diversity at a neutral site (x) was modelled under BGS and selective sweeps using the function developed by Elyashiv et al. (2016):
[equation image not reproduced] (Eqn 2)
In Eqn 2, S(x) represents the effect of a selective sweep at neutral site x due to positive selection at site j, and was estimated as follows:
[equation image not reproduced] (Eqn 3)
where T is the divergence time between the focal species and outgroup; α is the fraction of nonsynonymous substitutions fixed due to positive selection; Sb is the positive selection coefficient; and τ is the expected time to fixation of a positively selected mutation, which depends on Ne and Sb. To reduce the computational time, B-values were calculated for each 1-kb nonoverlapping window following Rettelbach et al. (2019) and then averaged in sliding windows. We considered three different selection coefficients (s = 0.01, 0.05 and 0.1) and calculated the Pearson correlation coefficient (r) between observed and expected diversity using four window sizes (Table 1). All parameters in models 1 and 2 are provided in Table S4.
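For orientation, the expressions referred to as Eqns 1–3 are commonly written in the cited literature (Hudson & Kaplan, 1995; Nordborg et al., 1996; Elyashiv et al., 2016) in forms similar to those below. This is a hedged sketch of those standard forms, not a transcription of this study's equations, and the exact parameterisation used here may differ. The symbol π0 (neutral diversity in the absence of linked selection) and the approximation for τ are assumptions introduced for the sketch.

```latex
% Standard textbook-style forms; assumed, not copied from this paper.
\begin{align}
B(x) &= \exp\!\left(-\sum_{i}
        \frac{u_d\, S_d}{\bigl(S_d + r(x,i)\,(1 - S_d)\bigr)^{2}}\right) \\[4pt]
\pi(x) &= \frac{\pi_0}{\,1/B(x) + 2 N_e\, S(x)\,} \\[4pt]
S(x) &= \frac{\alpha}{T} \sum_{j} \exp\bigl(-r(x,j)\,\tau\bigr),
        \qquad \tau \approx \frac{2\ln(2 N_e S_b)}{S_b}
\end{align}
```

In this form, setting S(x) = 0 reduces the second expression to π(x) = π0 B(x), which corresponds to the BGS-only case of Model 1.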
Table 1. Pearson correlation coefficients (r) for relationships between nucleotide diversity (π) from the Quercus data and models including only background selection (BGS) or both BGS and selective sweeps (BGS & SS).

Window size          BGS      BGS & SS
                              s = 0.01   s = 0.05   s = 0.1
Quercus acutissima
  20 kb              0.4711   0.4695     0.4709     0.4714
  100 kb             0.5288   0.5275     0.5291     0.5292
  200 kb             0.5529   0.5517     0.5532     0.5532
  500 kb             0.5905   0.5894     0.5915     0.5927
Q. variabilis
  20 kb              0.4283   0.4267     0.4278     0.4284
  100 kb             0.4834   0.482      0.4835     0.4842
  200 kb             0.5051   0.5038     0.5049     0.5049
  500 kb             0.5279   0.5265     0.5279     0.529
Q. chenii
  20 kb              0.3509   0.3497     0.3501     0.3502
  100 kb             0.3886   0.3878     0.3883     0.3886
  200 kb             0.4002   0.3995     0.3999     0.4001
  500 kb             0.4148   0.4147     0.4157     0.4169

r was calculated using nonoverlapping windows of different sizes, and three different selection coefficients (s) were tested in the BGS & SS model.


Results

Population structure and demographic history of three oak species

Population structure analyses revealed a clear division among the three species. Using Admixture, cross-validation error was lowest when K was 3 (Fig. S6a). At K = 3, the analysis assigned 34 individuals to Q. acutissima, 47 to Q. variabilis and 14 to Q. chenii (Fig. S6b). Six individuals showing high admixture were excluded from downstream analyses (Fig. S6b; Table S1). The PCA and NJ analyses confirmed the patterns of genetic differentiation detected by the Admixture algorithm (Figs 1b, S6c). Consistent with the clear population structure, genetic differentiation was high among the three species (FST = 0.25–0.36; Table S5). Within species, population differentiation was low in Q. acutissima and Q. variabilis (FST = 0.03 and 0.04, respectively) but relatively high in Q. chenii (FST = 0.11) owing to a highly divergent population (Fig. S7; Table S5).

Demographic analyses using MSMC revealed that the divergence between Q. chenii and the other two species (Q. acutissima and Q. variabilis) occurred about 2 Ma, earlier than that between Q. acutissima and Q. variabilis (c. 1 Ma; Fig. S8). The early splitting of Q. chenii was consistent with the finding that there were fewer IBD blocks shared between Q. chenii and the other two species (Q. acutissima and Q. variabilis) than between Q. acutissima and Q. variabilis (Fig. S6d). All species experienced bottlenecks with different magnitudes and durations (Fig. 1c). Quercus chenii experienced a long bottleneck, lasting > 1 Myr (1.5–0.15 Ma; Fig. 1c), whereas Q. acutissima and Q. variabilis experienced much shorter bottlenecks, starting almost simultaneously at 0.3 Ma and ending around 0.05 Ma and 0.1 Ma, respectively (Fig. 1c). Smc++ confirmed the population size changes in all three species revealed by the MSMC analyses and supported the more severe bottleneck of Q. chenii (Fig. S9).

We further inferred the divergence history between the three species from the joint SFS by coalescent simulation in fastsimcoal2. The unconstrained-best-fit model (model 8; Akaike’s weight = 1; Table S6) revealed that Q. chenii split from the ancestor of Q. acutissima and Q. variabilis at approximately 7.06 Ma (95% CI = 11.22–4.61 Ma), whereas Q. acutissima and Q. variabilis diverged approximately 2.40 Ma (95% CI = 2.72–2.25 Ma; Fig. 1d; Table S7). All species experienced bottlenecks between 1.01 Ma and 0.30 Ma with different magnitudes. The contemporary population sizes of the three species estimated by fastsimcoal2 (Ne = 1.36 × 10⁵ to 4.36 × 10⁵) were very similar to the Smc++-based estimates (Figs 1d, S9; Table S7). The rates of inferred gene flow among species were generally low, with the highest gene flow between the two recently diverged species, Q. acutissima and Q. variabilis (per generation migration rate = 1.81 × 10⁻⁵ to 1.99 × 10⁻⁵; Table S7). By constraining the split times according to the MSMC estimates (see the Materials and Methods section for details), we obtained a constrained-best-fit model with the same scenario (model 8) as the unconstrained-best-fit model. Compared to the unconstrained-best-fit model, the constrained-best-fit model had similar demographic parameters (except for constrained split times) but a lower likelihood (Tables S6, S7). Both models predicted the observed SFS and summary statistics (FST, π and ZnS) well, and the unconstrained model fitted better (Fig. S10). These results suggested that the demographic parameters of these models provide a good description of the population history.

Heterogeneous landscape of genomic variation

Genetic variation was heterogeneous across the genome in the three oak species. Nucleotide diversity π was positively correlated with the population-scaled recombination rate (Spearman’s ρ = 0.466–0.549, P < 2.20 × 10⁻¹⁶; Fig. S11a) and negatively correlated with gene density (Spearman’s ρ = −0.087 to −0.050, P < 2.20 × 10⁻¹⁶; Fig. S11b) in all species. The recombination rate, nucleotide diversity, Tajima’s D, and Fay & Wu’s H were all positively correlated between species (Spearman’s ρ = 0.24–0.84, P < 2.20 × 10⁻¹⁶; Table S8), indicating similar patterns of genomic variation across these species.

We further examined genomic regions with elevated differentiation, and identified 1323–2248 (3.27–5.44%) windows with Z-FST ≥ 2 as FST-islands (Fig. 2a). Of these FST-islands, 1598 (52.5%) were unique to a single species-pair and 630 (20.7%) were shared between all three species-pairs (Fig. S12). Importantly, the shared FST-islands may be partially a consequence of nonindependence (i.e. each species was included in two pairwise comparisons). By combining adjacent 20-kb islands, we found that most combined FST-islands were short (< 50 kb; Fig. S13) and scattered across the genome (Fig. 2a). Coalescent simulations under the constrained- and unconstrained-best-fit demographic models showed that the FST values in the upper tails of simulated data were significantly lower than observed values in the FST-islands (P < 2.20 × 10⁻¹⁶, Wilcoxon–Mann–Whitney U-test; Fig. S14), indicating that elevated differentiation in FST-islands cannot be explained by neutral demographic processes alone.

Genome-wide differentiation among the three oak species. (a) Manhattan plot of population differentiation (FST) in 10-kb nonoverlapping windows. Alternating colours represent different chromosomes, and ‘FST-islands’ are highlighted in red. Scaffolds that could not be mapped to the 12 pseudo-chromosomes are pooled as chromosome 13. (b) Comparisons of summary statistics, relative node depth (RND), nucleotide diversity (π), enrichment score (ECH), relative time to most recent common ancestor half-life (RTH′), Fay & Wu’s H, and recombination rate (ρ) between FST-islands (red boxes) and genomic background (grey boxes). All parameters, except RND, were calculated within species for each species-pair. All parameters differed significantly between FST-islands and the genomic background (P < 2.2 × 10⁻¹⁶, Wilcoxon–Mann–Whitney U-test), except RND for the species-pair Q. acutissima vs Q. chenii (0.047 vs 0.049, P = 0.075, Wilcoxon–Mann–Whitney U-test). In these box plots, the median is shown by a horizontal line, whereas the bottom and top of each box represent the first and third quartiles. The whiskers extend to 1.5 times the interquartile range. Outliers are not shown in the plot. Qac, Quercus acutissima vs Q. chenii; Qvc, Q. variabilis vs Q. chenii; Qav, Q. acutissima vs Q. variabilis; Qa, Q. acutissima; Qv, Q. variabilis; Qc, Q. chenii.

Compared to background regions, the dXY and CCs were lower in FST-islands of one species-pair (Q. acutissima vs Q. variabilis, P < 2.20 × 10⁻¹⁶, Wilcoxon–Mann–Whitney U-test) and slightly higher in FST-islands of the two other species-pairs (P < 2.20 × 10⁻¹⁶, Wilcoxon–Mann–Whitney U-test; Table S9). After controlling for the effects of mutation rate variation, the mean RND was lower in FST-islands across all three species-pairs, significantly so in two comparisons (P < 2.78 × 10⁻⁸, Wilcoxon–Mann–Whitney U-test) but not in the third (Q. acutissima vs Q. chenii, P = 0.075; Fig. 2b; Table S9). The recombination rate and RTH′ in FST-islands were lower than in the genomic background in both species for all comparisons (P < 2.20 × 10⁻¹⁶, Wilcoxon–Mann–Whitney U-test; Fig. 2b; Table S9). Gene genealogies also revealed higher ECH in FST-islands than in the genomic background (P < 2.20 × 10⁻¹⁶, Wilcoxon–Mann–Whitney U-test; Fig. 2b; Table S9), reflecting dense clusters of coalescent events due to recent sweeps in these regions. These results are consistent with a model of selective sweeps contributing to the formation of FST-islands in the three oak species.

Consistent with the hypothesis that FST-islands were driven by selective sweeps, we found multiple signals of selection in FST-islands, including reduced genetic diversity (π and θw), an excess of low-frequency alleles (more negative Tajima’s D), and higher frequencies of derived alleles (more negative Fay & Wu’s H) in both species for each comparison (P < 2.20 × 10⁻¹⁶, Wilcoxon–Mann–Whitney U-test; Fig. 2b; Table S9). GO analyses of genes located in FST-islands revealed 7–11 over-represented GO categories, including proteasome assembly, mRNA splice site selection and DNA binding (Table S10).

Genome-wide estimates of positive and negative selection

The inferred DFE indicated strong purifying selection at zero-fold sites for all three oak species, with 37.5%, 37.3% and 26.1% of nonsynonymous mutations identified as strongly deleterious (Nes < −100) in Q. acutissima, Q. variabilis and Q. chenii, respectively (Fig. 3a; Table S3). Strong purifying selection also was detected in other genic regions (UTRs and introns) but not at intergenic sites (Table S3). Using four-fold degenerate sites as a neutral reference, we found strong positive selection at zero-fold degenerate sites. The estimated α and ω values were highest for Q. acutissima (α = 0.44 and ω = 0.16), intermediate for Q. variabilis (α = 0.36 and ω = 0.13) and lowest for Q. chenii (α = 0.19 and ω = 0.07; Fig. 3b,c; Table S3). Overall, we observed prevalent signatures of positive and negative selection in coding regions in all three Quercus species, with more sites affected by strong selection in the two species with larger Ne (Q. variabilis and Q. acutissima).

Strength of purifying and positive selection at zero-fold degenerate sites in the three oak species (Quercus acutissima, Q. variabilis and Q. chenii). (a) Full distribution of fitness effects of new amino acid mutations. (b) Proportion of adaptive substitutions (α). (c) Rate of adaptive substitutions (ω). Error bars represent 95% bootstrap confidence intervals.

The strong positive and negative selection also affected genetic diversity at surrounding neutral sites. First, scaled neutral diversity was lower around nonsynonymous than around synonymous substitutions (Figs 4, S5, S15). Second, the scaled diversity was reduced in genic regions and gradually recovered to the genomic background level within c. 10–20 kb (Fig. S16), consistent with the extent of LD observed in these species (Fig. S17).

Relative diversity vs distance to nearest substitutions in the three oak species. Upper panels, scaled diversity around synonymous (blue) and nonsynonymous (red) substitutions in Quercus acutissima, Q. variabilis and Q. chenii, with means represented by solid lines and 95% confidence intervals represented by shading. Lower panels, bar plot of one-tailed P-values (on a −log10 scale) based on 1000 bootstrap replicates. Results are shown as a function of the distance from the nearest substitution, and each bar represents a 500-bp bin. Only the subset of synonymous substitutions located ≥ 5 kb from the nearest nonsynonymous substitutions was used. The red horizontal line indicates the threshold for significance (P < 0.05).

Relative impacts of BGS and selective sweeps on diversity

To investigate the relative contributions of BGS and selective sweeps to the observed pattern of genetic diversity, we calculated the expected nucleotide diversity under two selection models, one considering only BGS (Model 1) and another including both BGS and selective sweeps (Model 2). The genome-wide correlations between observed diversity and expected diversity from Model 1 in 20-kb windows were 0.47, 0.43 and 0.35 for Q. acutissima, Q. variabilis and Q. chenii, respectively (Table 1). The lower correlation in Q. chenii could be explained by strong genetic drift due to the small Ne. The correlations were stronger for larger windows in all species (r² = 0.39–0.53, 0.40–0.55 and 0.41–0.59 for 100-kb, 200-kb and 500-kb windows, respectively), indicating lower noise in larger genomic windows (Table 1). Regardless of the selection coefficient, the correlations between observed diversity and expected diversity for Model 2 were comparable to those for Model 1 when applying the same window size (Table 1). These results indicated that BGS largely explains the base-level variation in genetic diversity.
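The effect of window size on the fit between observed and expected diversity can be illustrated with a toy example. The sketch below averages fine-scale windows into larger ones and compares r² at both scales; the data are hypothetical, the noise is deliberately constructed to cancel within each pair of windows, and simple averaging is an assumption that may differ from how diversity was recomputed in the study.

```python
def aggregate(values, factor):
    """Average consecutive groups of `factor` windows (a trailing partial group is dropped)."""
    n = len(values) - len(values) % factor
    return [sum(values[i:i + factor]) / factor for i in range(0, n, factor)]

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Hypothetical expected diversity (arbitrary units) in small windows,
# and observed diversity = expected + alternating sampling noise.
expected = [1, 1, 2, 2, 3, 3, 4, 4]
noise = [0.5, -0.5] * 4
observed = [e + d for e, d in zip(expected, noise)]

r2_fine = r_squared(expected, observed)
r2_coarse = r_squared(aggregate(expected, 2), aggregate(observed, 2))
```

Here the fit improves from the fine to the coarse scale because window-level noise averages out while the underlying signal is retained, which is the pattern the larger-window correlations suggest.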


Divergence of the three oak species during the Neogene

The demographic analyses indicated that the three oak species diverged around 7.06−2.40 Ma, coinciding with the strengthening of the modern Asian winter monsoon in the late Miocene and Pliocene (Sun & Wang, 2005). Thus, the dramatic climate shift may have triggered the divergence of these species. Similar scenarios have been proposed to explain the rapid diversification of other East Asian oak lineages during the late Neogene (Meng et al., 2017; Deng et al., 2018; Hipp et al., 2020). After divergence, the three oaks experienced a bottleneck at 1.0–0.3 Ma, coinciding with Pleistocene glaciations in East Asia (Zhou et al., 2011). The estimated time (0.3 Ma) of population expansion of these three oaks was much earlier than the Last Glacial Maximum (LGM; 0.026−0.019 Ma; Clark et al., 2009), supporting previous results showing that East Asia was not extensively glaciated (Zhou et al., 2011) and that temperate deciduous trees had relatively stable populations during the LGM (Qiu et al., 2011). The evolutionary history of East Asian forest trees differs from that of most of their counterparts in Europe and North America, which recolonized their current ranges after the LGM (Petit & Vendramin, 2007; Shafer et al., 2010).

Notably, the inferred time frame depends on the mutation rate and generation time, both of which are difficult to estimate and may vary over space and time for oaks. In this study, we assumed a mutation rate of 2.0 × 10⁻⁹ per site per year estimated for a low-copy nuclear gene in oak species (Cavender-Bares et al., 2011) and a generation time of 50 yr proposed for Q. variabilis (Chen et al., 2012), resulting in a per-generation mutation rate of 1 × 10⁻⁷ per site. This mutation rate is close to recent estimates of 4.2–5.2 × 10⁻⁸ per site per generation for somatic mutations in Q. robur (Schmid-Siegert et al., 2017), suggesting that the parameters in this study are reasonable for oaks. When using a longer generation time (100 yr) and a lower bound for the mutation rate (4.2 × 10⁻⁸ per generation), the estimated divergence time between the three species was 33.3–11.4 Ma and the expansion occurred 1.43 Ma. Thus, we concluded that the demographic history of the three oak species was most likely to have been affected by the climate shift during the Neogene.
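The dependence of the time frame on these two parameters can be made explicit with a short calculation. The sketch below assumes that times expressed in generations scale inversely with the per-generation mutation rate, which is a common simplification and not a re-analysis of the data; the function and variable names are illustrative.

```python
def rescale_time(t_ma, mu_old, gen_old, mu_new, gen_new):
    """Rescale a time estimate (in Ma) to a new mutation rate and generation time."""
    generations = t_ma * 1e6 / gen_old      # time in generations under the original assumptions
    generations *= mu_old / mu_new          # assumed inverse scaling with the per-generation mutation rate
    return generations * gen_new / 1e6      # back to millions of years

MU_PER_YEAR = 2.0e-9      # per site per year, as assumed in the text
GEN_TIME = 50             # years per generation, as assumed in the text
mu_per_gen = MU_PER_YEAR * GEN_TIME   # 1 x 10^-7 per site per generation

# Alternative assumptions discussed in the text: 4.2e-8 per generation, 100-yr generations.
split_recent = rescale_time(2.40, mu_per_gen, GEN_TIME, 4.2e-8, 100)
expansion = rescale_time(0.30, mu_per_gen, GEN_TIME, 4.2e-8, 100)
```

Under these assumptions the 2.40 Ma split and the 0.30 Ma expansion rescale to the 11.4 Ma and 1.43 Ma given in the text. The same scaling applied to 7.06 Ma gives about 33.6 Ma, slightly above the reported 33.3 Ma, a difference that may reflect rounding in the original calculation.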

Genomic variation driven by linked selection

We detected heterogeneous genomic variation in the three oak species driven by linked selection. Genetic diversity was positively correlated with recombination rates and negatively correlated with the density of functional sites, consistent with the expectation that genomic regions with lower recombination rates and higher gene densities tend to be under stronger linked selection (Nordborg et al., 1996; Slotte, 2014). Genomic islands with elevated relative differentiation (FST) were scattered across the genome and showed low recombination rates, reduced absolute divergence (dXY) and RND, and multiple signals of selection. These results are compatible with a model in which FST-islands are derived from the action of linked selection, either background selection (BGS) or recurrent selective sweeps (Cruickshank & Hahn, 2014; Burri et al., 2015). Moreover, FST co-varied with both dXY (Spearman’s ρ = −0.208 to −0.048, P < 2.20 × 10⁻¹⁶) and RND (Spearman’s ρ = −0.423 to −0.371, P < 2.20 × 10⁻¹⁶; Table S8), providing further evidence for linked selection in the ancestral population of the three species (Cruickshank & Hahn, 2014). The formation of genomic islands also could be due to divergent sorting of ancient polymorphisms (Guerrero & Hahn, 2017; Han et al., 2017; Ma et al., 2018; Wang et al., 2019) or reduced gene flow in regions associated with reproductive isolation or local adaptation (Malinsky et al., 2015). However, these hypotheses predict elevated dXY and RND in FST-islands (Cruickshank & Hahn, 2014; Guerrero & Hahn, 2017), inconsistent with our observation of reduced dXY and RND. Additionally, FST-islands were generally short (< 50 kb) and scattered across the genome, different from the pattern expected under a model of heterogeneous gene flow among genomic regions (Cruickshank & Hahn, 2014). Overall, our observations clearly support the hypothesis that long-term linked selection can best explain genome-wide patterns of differentiation among the three oak species, consistent with observations in other plants, such as sunflower (Renaut et al., 2013) and Populus (Wang et al., 2016).

The efficacy of linked selection is determined by both the strength of selection and recombination rate (Maynard Smith & Haigh, 1974; Kaplan et al., 1989; B. Charlesworth et al., 1993; Charlesworth et al., 1995; Hudson & Kaplan, 1995; Slotte, 2014; Burri, 2017). In all three species, we found extensive positive and negative selection in coding regions. It is worth noting that we considered only point mutations in coding regions; however, structural variation and mutations in noncoding regions also could be targets of linked selection. Previous studies have revealed that structural variants, such as indels and inversions, are involved in adaptation (Wellenreuther & Bernatchez, 2018) and conserved noncoding regions are under selection in plant species (Williamson et al., 2014). Therefore, linked selection could affect a large proportion of the genome owing to strong selection on both coding and noncoding regions. Genomes in the genus Quercus are evolutionarily stable with high levels of synteny (Cannon & Petit, 2020). Thus, we expected a conserved genomic architecture among the three closely related oak species. If this assumption is correct, long-term linked selection could have acted at the same genomic regions in these species to reduce intraspecific genetic diversity and result in genomic islands with elevated differentiation. Conserved genomic features have been proposed as important factors driving differentiation in closely related species, such as Populus (Wang et al., 2016), sunflower (Renaut et al., 2013) and avian taxa (Burri et al., 2015; Vijay et al., 2016; Van Doren et al., 2017).

Relative contributions of BGS and selective sweeps in shaping genomic variation

Linked selection arises from selective sweeps fixing haplotypes with beneficial alleles or BGS removing haplotypes carrying deleterious mutations (Maynard Smith & Haigh, 1974; Kaplan et al., 1989; B. Charlesworth et al., 1993; Hudson & Kaplan, 1995; Slotte, 2014). We found multiple lines of evidence for the contributions of selective sweeps in shaping genomic variation in all three species. First, the reduced genetic diversity around fixed nonsynonymous substitutions is consistent with a classic selective sweep model, where selection fixes a new beneficial mutation and leaves a signature of reduced diversity at surrounding linked sites (Sattath et al., 2011). Second, both summary statistics and ancestral recombination graph (ARG)-based inferences suggested that selective sweeps have generated highly differentiated genomic regions, including shared FST-islands across species-pairs. BGS also is expected to affect genetic variation and result in FST-islands (Cruickshank & Hahn, 2014). However, the ARG-based statistics were designed to accommodate the effects of BGS and be sensitive to sweeps (Rasmussen et al., 2014). Simulation studies also revealed that BGS is unlikely to produce patterns of genetic diversity generated by selective sweeps (Schrider, 2020), and the detection of genomic islands with elevated FST is robust to BGS (Matthey-Doret & Whitlock, 2019). A recent study in flycatchers suggested that selective sweeps were responsible for the formation of both shared and lineage-specific FST-islands (Chase et al., 2021). We concluded that the FST-islands detected in the three oak species could be explained by selective sweeps. Gene ontology (GO) analyses suggested that genes with a wide range of functions may have been involved in the genetic differentiation among the three species.

We also found that BGS contributed substantially to the base-level variation of genetic diversity. The correlation between simulated diversity data based on the BGS model and diversity estimated from the data was within the range of correlations found in similar studies of Drosophila (Elyashiv et al., 2016), flycatchers (Rettelbach et al., 2019) and humans (Murphy et al., 2021). These results are consistent with the expectation that BGS is widespread across the genome due to prevalent deleterious mutations (B. Charlesworth et al., 1993; Eyre-Walker & Keightley, 2007; Burri, 2017). Notably, including selective sweeps did not improve the model fit in our study, consistent with observations in humans (Murphy et al., 2021). By contrast, in studies of Drosophila (Elyashiv et al., 2016) and flycatchers (Rettelbach et al., 2019), selective sweeps were necessary to explain the pattern of diversity. These contrasting results may reflect different patterns of linked selection among species resulting from differences in the genetic architecture of adaptive traits, recombination rate, and population size history (Cutter & Payseur, 2013; Josephs & Wright, 2016). Alternatively, the lack of a signal of sweeps in neutral diversity may be explained by an inability to distinguish between the effects of different selection models. First, soft selective sweeps and polygenic selection are common in many plant species (Pritchard & Di Rienzo, 2010; Flood & Hancock, 2017), and can result in patterns of diversity similar to those of BGS (Cutter & Payseur, 2013; Josephs & Wright, 2016). Thus, these processes may have contributed to the inference of BGS in our model. Second, we considered only substitutions in coding regions. Other types of mutation (e.g. structural variation and noncoding substitutions) could be targets of both BGS and selective sweeps, and affect the model predictions. Future studies should apply a comprehensive modelling framework to properly account for the effects of multiple forms of selection to provide an accurate overview of the relative importance of different types of selection processes in structuring genomic variation.

In summary, our results provide insights into divergence among three oak species and the evolutionary forces that contributed to the distribution of variation across the genome. We found that the three species diverged and experienced fluctuations in population size during the Neogene. We further revealed widespread positive and negative selection in coding regions in all species, which may have facilitated linked selection driving differentiation. BGS contributed substantially to the global patterns of genetic variation, whereas selective sweeps were the dominant force behind the formation of highly differentiated genomic islands and the deep troughs in diversity around fixed substitutions. Our results demonstrate how linked selection contributed to the heterogeneous landscape of genomic variation and highlight the importance of incorporating background selection as a null model to scan for genes related to adaptive evolution and speciation.


We thank anonymous reviewers and the Associate Editor for their insightful comments. This work was supported by the Guangdong Natural Science Funds for Distinguished Young Scholar (2018B030306040) and the National Natural Science Foundation of China (NSFC 31971673 and 31901325).

    Author contributions

    BW designed the study; SY, B-FZ and YS collected samples; Y-YL, X-YC and Q-QA generated the sequencing data; Y-YL and YS analyzed data; and Y-YL, BW, PKI and CP drafted the manuscript. All authors edited the revision of the manuscript.

    Data availability

    Whole-genome sequencing data have been deposited in NCBI BioProject PRJNA769460 under GenBank accession nos. SAMN22137851–SAMN22137951.