Identifying genomic regions associated with C4 photosynthetic activity and leaf anatomy in Alloteropsis semialata
Summary
- C4 photosynthesis is a complex trait requiring multiple developmental and metabolic alterations. Despite this complexity, it has independently evolved over 60 times. However, our understanding of the transition to C4 is complicated by the fact that variation in photosynthetic type is usually segregated between species that diverged a long time ago.
- Here, we perform a genome-wide association study (GWAS) using the grass Alloteropsis semialata, the only known species to have C3, intermediate, and C4 accessions that recently diverged. We aimed to identify genomic regions associated with the strength of the C4 cycle (measured using δ13C), and the development of C4 leaf anatomy.
- Genomic regions correlated with δ13C include regulators of C4 decarboxylation enzymes (RIPK), nonphotochemical quenching (SOQ1), and the development of Kranz anatomy (SCARECROW-LIKE). Regions associated with the development of C4 leaf anatomy in the intermediate individuals contain additional leaf anatomy regulators, including those responsible for vein patterning (GSL8) and meristem determinacy (GIF1).
- The parallel recruitment of paralogous leaf anatomy regulators between A. semialata and other C4 lineages implies the co-option of these genes is context-dependent, which likely has implications for the engineering of the C4 trait into C3 species.
Introduction
Oxygenic photosynthesis originated over 2 billion years ago and is the ultimate source of nearly all energy used by living organisms. Almost 90% of plants fix carbon using the ancestral C3 cycle, but this process is inefficient in hot environments (Sage & Monson, 1999). This is because the key enzyme responsible for the initial fixation of atmospheric CO2 (Ribulose-1,5-bisphosphate carboxylase/oxygenase, Rubisco) is less able to discriminate CO2 from O2 at higher temperatures, and as a result, energy is lost through photorespiration (Farquhar et al., 1982). To reduce photorespiration, plants have evolved C4 photosynthesis, wherein atmospheric CO2 is initially assimilated into a 4-carbon organic acid by phosphoenolpyruvate carboxylase (PEPC) in the mesophyll cells, before shuttling the acid to the neighboring bundle sheath cells where it is decarboxylated and the CO2 recaptured by Rubisco (Hatch, 1971; Edwards & Ku, 1987). This compartmentalization of Rubisco effectively prevents photorespiration. C4 photosynthesis is a complex trait that relies on both changes to the leaf anatomy and the coordinated regulation of multiple metabolic enzymes (Hatch, 1987). In order to understand the sequence of events that led to C4 evolution, comprehensive genomic and phenotypic datasets have been generated in many systems, such as Flaveria (Adachi et al., 2023) and Alloteropsis (Pereira et al., 2023). These existing data sets can potentially be mined for quantitative genetics approaches to identify novel genetic factors involved in the evolution of C4 (Simpson et al., 2021).
By comparing species with different photosynthetic types, the core C4 enzymes, multiple accessory genes, and loci associated with C4 leaf anatomy (often termed ‘Kranz’ anatomy) have been identified (Langdale et al., 1987, 1988; Slewinski et al., 2012; Cui et al., 2014). However, decomposing the individual steps during the transition to C4 is confounded by the fact that variation in photosynthetic type is usually segregated between distinct species that have been independently evolving for millions of years, meaning that they differ in many aspects besides those linked to the photosynthetic pathway (Heyduk et al., 2019). The interspecific segregation of variation in photosynthetic type makes it challenging to apply quantitative genetics methods, such as quantitative trait loci (QTL) mapping and genome-wide association studies (GWAS), since these rely on traits varying within a species, or the ability to hybridize species with divergent phenotypes. GWAS has been used to investigate the variation of C4 traits within C4 species, such as photosynthetic performance during chilling in maize (Strigens et al., 2013), and to identify genes associated with stomatal conductance and water use efficiency in sorghum (Ferguson et al., 2021; Pignon et al., 2021). However, to date there has been no QTL region identified for differences in C4 carbon fixation or Kranz anatomy (Simpson et al., 2021).
In grasses, the proportion of carbon that is fixed through the C4 cycle can be measured using the stable carbon isotope ratio (δ13C) (O'Leary, 1981; Farquhar et al., 1989). Both 12C and 13C occur naturally in the atmosphere, and in C3 plants, Rubisco preferentially fixes 12C during photosynthesis (O'Leary, 1981). Conversely, in C4 plants, carbon is initially fixed by CA and PEPC, and this coupled enzyme system discriminates less than Rubisco between the two isotopes (O'Leary, 1981). The rate of CO2 release in the bundle sheath is coordinated with the rate of CO2 fixation by Rubisco, which reduces the fractionation effect of this enzyme. δ13C is therefore commonly used as a proxy for photosynthetic type and the relative strength of the C4 cycle (Bender, 1968; Smith & Epstein, 1971; Smith & Brown, 1973; Von Caemmerer, 1992; Cerros-Tlatilpa & Columbus, 2009; Gowik et al., 2011; Lundgren et al., 2015; Stata et al., 2019; Olofsson et al., 2021). While there is intraspecific variation in δ13C for C4 species such as maize and Gynandropsis (Voznesenskaya et al., 2007), we do not know whether this variation arises from differences in anatomy or biochemistry (Simpson et al., 2021). In addition, some of the observed variation in δ13C could also be due to environmental effects on water use efficiency (Farquhar & Richards, 1984), particularly if the phenotypic data comes from individuals sampled in the field. However, differences in the δ13C between accessions of some species are maintained in a common environment (Lundgren et al., 2016), indicating that the δ13C ratio likely has a genetic component. Intraspecific, heritable variation in δ13C offers an excellent opportunity for using quantitative genetic approaches to discover C4 QTLs.
The grass Alloteropsis semialata has long been used as a model to study C4 evolution, since it is the only species known to have C3, C4, and intermediate genotypes that diverged relatively recently and can be crossed, allowing gene flow among them (reviewed by Pereira et al., 2023). The common ancestor of this species is thought to have been an intermediate with some chloroplasts in its bundle sheath and performing a very weak C4 cycle, with the C3 type being a reversal from this intermediate state as that lineage colonized cooler environments in southern Africa (Dunning et al., 2017). The intermediate populations, which are found in the grassy ground layer of the Central Zambezian miombo forests, are referred to here as ‘C3+C4’ because they perform a weak C4 cycle in addition to directly fixing CO2 through the C3 cycle (Lundgren et al., 2016; Dunning et al., 2017). Comparative studies have shown that the transition to a purely C4 physiology in A. semialata is caused by the overexpression of relatively few core C4 enzymes (Dunning et al., 2019a) and the acquisition of C4-like morphological traits, notably the presence of minor veins (Lundgren et al., 2019). The δ13C of the C3+C4 plants ranges from values characteristic of a weak (or absent) C4 cycle to values that show that the C4 cycle accounts for more than half of the carbon acquisition (Von Caemmerer, 1992; Lundgren et al., 2015; Stata et al., 2019; Olofsson et al., 2021). Furthermore, the strengthening of the C4 cycle in the C3+C4 intermediates (measured using δ13C) is significantly associated with alterations in a number of leaf anatomical traits related to the preponderance of inner bundle sheath (IBS) tissue, the cellular location of the C4 cycle in this species. These traits include the distance between consecutive bundle sheaths, the width of IBS cells, and the proportion of bundle sheath tissue in the leaf (Alenazi et al., 2023).
Alloteropsis semialata therefore represents an ideal system to identify the genes correlated with the strengthening of the C4 cycle. Here, we first conducted a global analysis to identify candidate genes associated with the strength of the C4 cycle (δ13C) using genomic data from 420 individuals representing C3, C3+C4, and C4 phenotypes. We then focused specifically on the C3+C4 intermediates, to identify candidate genes associated with the relative expansion of bundle sheath tissue during the transition from a weak to a strong C4 cycle. The high level of intraspecific variation in A. semialata permits a fine-scale understanding of the genetic basis of C4 evolution, including the intermediate steps involved in the assembly of this complex trait. This is crucial for identifying the initial changes required for the emergence of this trait, which may ultimately have applications in the engineering of C4 photosynthesis in C3 crops such as rice.
Materials and Methods
Genome data, δ13C values, and population genetic analyses
For the genomic analyses, we compiled previously published double digest restriction-site associated DNA sequencing (ddRADSeq) data sets for Alloteropsis semialata (R. Br.) Hitchc. individuals that also had known δ13C values from field-collected leaves measured using mass spectrometry (Lundgren et al., 2015, 2016; Bianconi et al., 2020; Olofsson et al., 2021; Alenazi et al., 2023). Depending on the source of the δ13C values, these were either single measures (Lundgren et al., 2015, 2016; Bianconi et al., 2020), replicated if the δ13C values did not match other individuals of the population and genomic group (Olofsson et al., 2021), or medians of triplicate technical replicates if sufficient material was available (Alenazi et al., 2023). In total, the data set comprised 420 individuals collected from 87 populations across Africa and Asia (Supporting Information Table S1), representing the full range of photosynthetic types found in A. semialata (45 × C3; 132 × C3+C4; 243 × C4).
The ddRADseq data were downloaded from the NCBI Sequence Read Archive and cleaned using Trimmomatic v.0.38 (Bolger et al., 2014) to remove adapter contamination (ILLUMINACLIP option in palindrome mode) and low-quality bases (Q < 3 from both 5′ and 3′ ends; Q < 15 for all bases in a four-base sliding window). The cleaned ddRADseq data were then mapped to a chromosome-scale A. semialata reference genome for a C4 Australian individual (Dunning et al., 2019b) using Bowtie2 v.2.2.3 with default parameters (Langmead & Salzberg, 2012). We called single-nucleotide polymorphisms (SNPs) from these alignments using the Gatk v.3.8 (McKenna et al., 2010) pipeline with default parameters. We generated individual variant files (gVCF) with HaplotypeCaller and then combined them into a single multi-sample VCF file with GenotypeGVCFs. Biallelic SNPs were extracted from this file using SelectVariants, and high-quality SNPs were retained using VariantFiltration (MQ > 40, QD > 5, FS < 60, MQRankSum > −12.5, ReadPosRankSum > −8). Finally, we used VCFtools to filter the remaining SNPs to remove those with > 30% missing data and/or a minor allele frequency < 0.05 (Danecek et al., 2011).
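The two final thresholds (missing data and minor allele frequency) can be illustrated with a minimal sketch. It assumes biallelic genotypes coded as alternate-allele counts (0, 1, or 2), with `None` for a missing call. The function name and data layout are hypothetical, and this is not how VCFtools is implemented; it only mirrors the cut-offs stated above.

```python
from typing import Optional, Sequence

# Hypothetical encoding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def passes_filters(genotypes: Sequence[Genotype],
                   max_missing: float = 0.30,
                   min_maf: float = 0.05) -> bool:
    """Return True if a biallelic SNP survives the missingness and MAF filters."""
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    # Remove SNPs with more than 30% missing genotype calls.
    missing_fraction = (len(genotypes) - len(called)) / len(genotypes)
    if missing_fraction > max_missing:
        return False
    # Each diploid individual contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    # Remove SNPs with a minor allele frequency below 0.05.
    return maf >= min_maf
```

In this sketch, a SNP is retained only if both conditions hold, so a well-genotyped but nearly monomorphic site is discarded just as a poorly genotyped one is.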
The evolutionary relationships among individuals were inferred using a maximum likelihood phylogenetic tree. We used VCF2phylip v.2.8 (Ortiz, 2019) to generate a nucleotide alignment from the filtered VCF file. To reduce the effect of linked SNPs on phylogenetic reconstruction, we thinned the data set so that SNPs were at least 1 kb apart (starting from the first SNP on each chromosome). The phylogenetic tree was inferred using RAxML v.8.2.12 (Stamatakis, 2014) with the GTRCAT model and 100 bootstrap replicates (Dataset S1). Finally, to verify previous phylogenetic groupings (Alenazi et al., 2023), we determined the population structure of the C3+C4 accessions using Admixture v.1.3.0 (Alexander et al., 2009). We ran the analysis with multiple values of k (range: 2–7), with 10 replicate runs for each value. The optimal k was inferred using Admixture's cross-validation error method. We also used Plink v.1.9 to perform a principal component analysis to quantify population structure and to generate a pairwise kinship matrix (Purcell et al., 2007).
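The thinning step can be sketched as a greedy pass along each chromosome. The input layout (a list of chromosome and position pairs) and the function name are assumptions for illustration. The text does not state which tool performed the thinning, so this is one plausible reading of "at least 1 kb apart, starting from the first SNP".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def thin_snps(snps: Iterable[Tuple[str, int]],
              min_gap: int = 1000) -> Dict[str, List[int]]:
    """Keep SNPs that are at least `min_gap` bp from the previously kept SNP."""
    by_chrom: Dict[str, List[int]] = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)

    kept: Dict[str, List[int]] = {}
    for chrom, positions in by_chrom.items():
        positions.sort()
        # The first SNP on each chromosome is always retained.
        retained = [positions[0]]
        for pos in positions[1:]:
            if pos - retained[-1] >= min_gap:
                retained.append(pos)
        kept[chrom] = retained
    return kept
```

Because the distance is measured from the last retained SNP rather than the last SNP seen, a dense cluster contributes a single marker to the alignment.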
Leaf anatomical traits of C3+C4 A. semialata
Leaf anatomy data for all 132 C3+C4 individuals were either extracted from a previous study (n = 100; Alenazi et al., 2023) or generated here using the same method from field-preserved samples (n = 32; Table S2). The measurements themselves were taken from leaf cross-sections that were prepared from silica-dried leaf material following the method described by Alenazi et al. (2023). The slide images were captured using a camera mounted on an Olympus BX51 microscope (Olympus, Hamburg, Germany), and images from the same leaf were stitched together with the Hugin software (Hugin Development Team, 2015). All measurements of leaf anatomical characteristics were made using ImageJ v.1.53f (Schneider et al., 2012), avoiding the midrib and leaf margins. For each individual, the anatomical measurements are based on the mean of at least five technical replicates, each measured between independent pairs of secondary veins in the same cross-section.
We recorded the total cross-sectional areas between secondary veins (i.e. veins accompanied by extraxylary fibers and epidermal thinning) for mesophyll (including airspaces; MS) and IBS tissues (Fig. 1). We used these values to then calculate the inner bundle sheath fraction (IBSF = IBS/[MS + IBS]), which is the portion of the photosynthetic part of the leaf that can be responsible for refixing carbon obtained through the C4 cycle. Finally, we also measured the bundle sheath distance (BSD) and the inner bundle sheath width (IBSW) using the mean widths of equatorial cells.
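As a worked illustration of the IBSF definition, the sketch below computes IBS/(MS + IBS) from a pair of cross-sectional areas. The function names are hypothetical. Averaging per-replicate fractions (rather than computing one fraction from averaged areas) is an assumption, since the text only states that measurements are means of technical replicates.

```python
from statistics import mean
from typing import Iterable, Tuple


def ibsf(ibs_area: float, ms_area: float) -> float:
    """Inner bundle sheath fraction: IBS / (MS + IBS), areas in the same units."""
    total = ibs_area + ms_area
    if total <= 0:
        raise ValueError("Combined tissue area must be positive")
    return ibs_area / total


def mean_ibsf(replicates: Iterable[Tuple[float, float]]) -> float:
    """Mean IBSF across technical replicates given as (IBS area, MS area) pairs.

    Assumption: the fraction is computed per replicate and then averaged.
    """
    return mean(ibsf(ibs, ms) for ibs, ms in replicates)
```

For example, a segment with 20 units of IBS tissue and 80 units of mesophyll gives an IBSF of 0.2, which falls within the range reported for the C3+C4 populations.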
Estimating trait heritability
To estimate the proportion of phenotypic variation explained by underlying genetic differences, we calculated the heritability of δ13C (complete and restricted C3+C4 datasets) and the leaf anatomical traits (C3+C4 dataset) using Genome-wide Complex Trait Analysis (Gcta) v.1.94.1 (Yang et al., 2011). A genetic relationship matrix was inferred from the previously generated SNP calls and combined with the phenotype values in GCTA. Heritability was then estimated for each trait using the restricted maximum likelihood method.
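For orientation, the quantity being estimated can be written under the standard SNP-based linear mixed model with which genomic REML analyses are usually described. The notation is an assumption introduced here for illustration: y is the vector of phenotype values, X and β are fixed-effect covariates and their coefficients, g holds the additive genetic effects captured by the SNPs, A is the genetic relationship matrix, and ε is the residual. The Gcta documentation remains the reference for the exact estimator.

```latex
\begin{align}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\varepsilon}, &
\operatorname{Var}(\mathbf{y}) &= \mathbf{A}\,\sigma^2_g + \mathbf{I}\,\sigma^2_e, \\
h^2 &= \frac{\sigma^2_g}{\sigma^2_g + \sigma^2_e}
\end{align}
```

Under this formulation, h2 is the share of the total phenotypic variance attributed to the genetic variance component, so it is bounded between 0 and 1.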
Genome-wide association study of photosynthetic traits
We performed a GWAS for the strength of the C4 cycle measured using δ13C, and for leaf anatomy traits previously correlated with the strength of the C4 cycle (IBSF, BSD, and IBSW; Fig. 1; Alenazi et al., 2023), with the objective of proposing candidate genes underpinning the phenotype. We first used the variation in photosynthetic type that exists across A. semialata as a whole, before focusing on anatomical variation in the C3+C4 individuals that has been associated with the strength of the C4 cycle (Alenazi et al., 2023). We defined each associated region of the genome as the linkage block containing a significant SNP from the GWAS. We then identified the gene models located within each associated region as candidate genes and assessed their functional relevance, gene expression patterns, and the selective pressures they have been evolving under.
The GWAS itself was performed using the rmvp package (Yin et al., 2021) in Rstudio v.4.3, with the MVP.Data function and default parameters used for single-locus GWAS analysis for each phenotypic trait with the fixed and random model circulating probability unification (FarmCPU) approach (Yin et al., 2021). Population structure and genetic relatedness can confound a GWAS and result in false associations (Chen et al., 2016). We therefore included the previously generated pairwise kinship matrix so that the relationships among individuals could be accounted for. The phenotypic data for each trait were normalized (if required), and a Bonferroni corrected SNP significance threshold of P ≤ 0.05 was used.
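The Bonferroni correction amounts to simple arithmetic, sketched below. This is not the rMVP code, and the number of tested SNPs used in the usage note is a hypothetical round figure, since the total is not stated in the text.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-SNP P-value cut-off that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted P-value, capped at 1."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return min(1.0, p_value * n_tests)


def is_significant(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """True if the adjusted P-value does not exceed alpha."""
    return bonferroni_adjust(p_value, n_tests) <= alpha
```

With a hypothetical 10 000 tested SNPs, the per-SNP cut-off would be 0.05 / 10 000 = 5 × 10⁻⁶. Comparing a raw P-value with that cut-off is equivalent to comparing the adjusted P-value with 0.05.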
Linkage disequilibrium
Linkage blocks are regions of the genome that are likely to be co-inherited, and the association of the significant SNPs identified from the GWAS could be caused by any gene within this region. To determine the linkage block encapsulating each SNP, we used Haploview v.4.1 (Barrett et al., 2005). The input map and binary files were processed using Plink v.1.9 (Purcell et al., 2007), and we used a solid spine of linkage disequilibrium (LD) with default parameters to infer linkage block size (Kim et al., 2018). This approach requires the first and last SNPs in a block to be in strong LD with all intermediate markers (normalized deviation (D′) ≥ 0.8), but the intermediate markers do not necessarily need to be in LD with each other. Identifying linkage blocks is heavily impacted by the distribution of SNPs across the genome, an effect that is accentuated by reduced-representation sequencing methods such as ddRADseq. We therefore used the genome-wide median linkage block size if the analysis failed to place a significant SNP in a block of its own (Fig. S1). To do this, we positioned the significant SNP at the center of the artificial linkage block and, if necessary, truncated it to avoid incorporating unlinked SNPs up- and/or downstream from this marker.
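The fallback for SNPs that were not placed in a block can be sketched as follows. The function name, the 1-based coordinates, and the rule that the block stops one base short of the nearest unlinked SNP are assumptions for illustration. The block size is passed in as a parameter because the genome-wide value is not given in the text.

```python
from typing import Iterable, Tuple


def artificial_block(snp_pos: int,
                     block_size: int,
                     unlinked_positions: Iterable[int] = ()) -> Tuple[int, int]:
    """Centre a block of `block_size` bp on a SNP and trim it at unlinked SNPs.

    Returns 1-based inclusive (start, end) coordinates. Chromosome length is
    not checked in this sketch, so `end` may run past the chromosome end.
    """
    half = block_size // 2
    start = max(1, snp_pos - half)
    end = snp_pos + half
    for pos in unlinked_positions:
        if start <= pos < snp_pos:
            # Unlinked SNP upstream of the focal SNP: pull the start inwards.
            start = pos + 1
        elif snp_pos < pos <= end:
            # Unlinked SNP downstream of the focal SNP: pull the end inwards.
            end = pos - 1
    return start, end
```

Because each boundary only ever moves towards the focal SNP, the result does not depend on the order in which the unlinked positions are supplied.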
Identification of candidate genes
The linkage blocks associated with the phenotype of interest contain the causal gene(s) in addition to those that happen to be in close physical linkage (hitchhiking). To try and identify plausible candidate genes in each region, we compared their functional annotations, expression patterns, and the selective pressures they are evolving under.
Orthofinder v.2.5.4 (Emms & Kelly, 2015) was used to identify genes orthologous to the loci in the associated regions. To do this, we combined the A. semialata protein sequences with those of eight other plant species (Arabidopsis thaliana, Brachypodium distachyon, Hordeum vulgare, Oryza sativa, Physcomitrium patens, Solanum lycopersicum, Triticum aestivum, and Zea mays) downloaded from Phytozome v.13 (Goodstein et al., 2012). Orthogroup phylogenies are presented in Dataset S2. We then used publicly available databases (e.g. TAIR (Berardini et al., 2015), RAP-DB (Sakai et al., 2013), and MaizeGDB (Monaco et al., 2013)) and literature searches to infer the functions of each orthogroup containing a gene from an associated linkage block identified from the GWAS.
Gene expression data for the candidate genes were extracted from Dunning et al. (2019a). These data come from mature leaf tissue of plants grown under controlled conditions (60% relative humidity, day/night temperatures of 25/20°C) and sampled in the middle of the photoperiod (Dunning et al., 2019a). To test for differential expression between the photosynthetic types, we used two-tailed t-tests, with P-values Bonferroni corrected to account for multiple testing.
Finally, we used whole-genome resequencing data (Bianconi et al., 2020) for 45 A. semialata individuals to determine whether the genes in the GWAS regions were evolving under positive selection. In short, the datasets were downloaded from the NCBI Sequence Read Archive and mapped to the reference genome using Bowtie2, and consensus sequences were generated using previously developed methods (Olofsson et al., 2016; Dunning et al., 2022). A maximum likelihood phylogenetic tree for each gene was then inferred using RAxML (Stamatakis, 2014) with 100 bootstrap replicates. We then inferred the selective pressure each gene was evolving under by running the M0 (one-ratio) model in codeML v.4.9h.
Results
Population structure
The broadscale phylogenetic (Fig. 2a) and population genetic (Fig. 2b) analyses recovered the relationships inferred by earlier studies, with the different photosynthetic types (C3, C3+C4, and C4) belonging to separate clades (Olofsson et al., 2016, 2021; Bianconi et al., 2020). Within the C3+C4 intermediates, individuals are separated into five populations geographically spread across the Central Zambezian miombo woodlands (Fig. 2). This reconfirms the phylogenetic groupings previously demarcated (Alenazi et al., 2023), although the earliest diverging sixth lineage is absent in this study because it is only represented by a single herbarium individual from the Democratic Republic of the Congo that lacks ddRADseq data. The population structure analysis (Fig. 2c) largely concurs with the phylogenetic groupings, although it indicates gene flow between populations. The distribution of the C3+C4 groups largely matches a scenario of isolation by distance along an east–west axis through Zambia and Tanzania (Fig. 2d). Q–Q plots show no sign of P-value inflation for the subsequent GWAS results, indicating that population structure was sufficiently corrected for using the pairwise kinship matrix (Fig. S2).
Identifying regions of the genome correlated with the strength of the C4 cycle
We used the δ13C values as a proxy for the strength of the C4 cycle for all 420 A. semialata individuals used in this study. As expected, the δ13C values supported the demarcation of the main nuclear clades into the C3, C3+C4, and C4 phenotypes (Fig. 3a). For C3 and C4 individuals, we found δ13C average values of −26.67‰ and −12.63‰ with little dispersion within each group, whereas for C3+C4 individuals, we found substantial variation ranging from −28.35‰ to −18.47‰ with an average of −23.87‰. The heritability estimate, which represents the proportion of phenotypic variation due to genetic variation in the population, was high for δ13C when considering all photosynthetic types (h2 = 0.75; SE = 0.06; n = 420), and threefold lower when just considering the C3+C4 intermediates (h2 = 0.25; SE = 0.00; n = 132). This reduced heritability estimate could be due to multiple factors, including reduced power (132 C3+C4 individuals vs 420 full dataset), limited variation (C3+C4 range = 9.88‰; full data set range = 19.4‰), or increased residual variation within the C3+C4 compared with other photosynthetic types (C3+C4 SD = 2.1‰, C3 SD = 1.08‰, C4 SD = 1.0‰; Fig. 3a).
We conducted a combined GWAS using all individuals (Fig. 3b), as well as various partitions by photosynthetic type (Fig. S3). When considering all individuals, the GWAS identified three significant SNPs on chromosome 9, which all corresponded to relatively narrow regions based on the LD (Fig. 3b). The region with the highest association with δ13C (LB-01) is a 121-kb region at 32.2 Mb (Tables 1, S3; Fig. 3b). The same region was also significant when repeating the GWAS within the C3+C4, and when combining the C3+C4 with either the C3 or C4 individuals (Table S3; Fig. S3), but not when excluding the C3+C4 individuals. These results imply that the underlying causative gene segregates only within the C3+C4 group. There were six predicted protein-coding genes in the LB-01 region, and all were expressed in the leaf tissue of at least one A. semialata individual (Table S4). One of these genes (Shewanella-like protein phosphatase 1, SLP1 (ASEM_AUS1_34302)) was significantly more highly expressed in the C3 than in the other photosynthetic types (C3 vs C4 Bonferroni-adjusted P-value = 0.073; C3 vs C3+C4 Bonferroni-adjusted P-value = 0.015; Fig. 3c; Table S4), although there is no consistent differential expression between photosynthetic types when individual populations are compared separately (Dunning et al., 2019a). None of the six genes were found to be strictly evolving under positive selection with a dN/dS ratio (ω) > 1 (Table S4), although those with the highest values may be experiencing a relaxation of purifying selection (e.g. ω = 0.83 for ASEM_AUS1_34303). The annotated genes in the LB-01 region have a variety of functions (Table S4), including loci associated with the regulation of the Calvin cycle (SLP1 (ASEM_AUS1_34302) (Kutuzov & Andreeva, 2012; Johnson et al., 2020)) and the activation of NADP-malic enzyme 2 (NADP-ME2), a C4 decarboxylation enzyme (RPM1-Induced Protein Kinase, RIPK (ASEM_AUS1_34305) (Wu et al., 2022)).
Phenotype | Chromosome | SNP position | LD block (kb) | −log10 P | Bonferroni-adjusted P-value | No. of genes |
---|---|---|---|---|---|---|
δ13C | 9 | 32 191 256 | 121 | 29.18 | 5.33E-26 | 6 |
δ13C | 9 | 49 592 498 | <1 (b) | 5.73 | 1.51E-02 | 1 |
δ13C | 9 | 81 051 792 | 63 | 5.58 | 2.12E-02 | 1 |
IBSF | 9 | 58 663 539 | 362 | 6.88 | 3.58E-04 | 33 |
IBSF | 2 | 634 463 | 6 (b) | 4.83 | 4.95E-02 | 2 |
IBSF | 5 | 23 291 361 | 59 (a) | 5.89 | 3.53E-03 | 4 |
IBSF | 4 | 63 231 217 | 13 | 6.71 | 5.34E-04 | 2 |
IBSF | 8 | 23 357 684 | 275 | 16.47 | 9.25E-14 | 24 |
BSD | 9 | 10 612 628 | 11 | 5.10 | 2.17E-02 | 1 |
BSD | 9 | 48 749 591 | <1 (b) | 5.59 | 7.02E-03 | 0 |
- Detailed location of the significant single-nucleotide polymorphisms (SNPs) identified by GWAS and the linkage disequilibrium (LD) blocks in which they are contained within the Alloteropsis semialata genome. The phenotypes used to perform the GWAS are carbon isotope ratio (δ13C), inner bundle sheath fraction (IBSF), and bundle sheath distance (BSD).
- a This SNP was not located in a linkage block in our analyses; we therefore defined the region using the genome-wide median block size.
- b This SNP was not located in a linkage block in our analyses; we therefore defined the region using the genome-wide median block size that was truncated if there was a closely located unlinked SNP up- or downstream.
The two other regions identified in the δ13C GWAS using all individuals (LB-02 and LB-03; Fig. 3b) were not significant when partitioning the data by photosynthetic type (Table S3). Both regions are delimited by narrow LD blocks that contain one annotated gene each. The candidate gene in LB-02 (ASEM_AUS1_29467) was not expressed in any A. semialata mature leaves, while the one in LB-03 (ASEM_AUS1_14480) was expressed in all individuals, but was not differentially expressed between photosynthetic types. In addition, neither gene seems to have been under positive selection (Table S4). The LB-02 gene (ASEM_AUS1_29467) encodes SCARECROW-LIKE protein 9 (SCL9), which belongs to the GRAS gene family, a group of transcription factors shown to play a key role in C4 leaf anatomy and photosynthetic development in maize (Slewinski et al., 2012; Hughes & Langdale, 2020). The lack of expression (or differential expression) of these candidate genes in transcriptomes generated from mature leaf tissues is likely explained by the involvement of these genes, such as SCL9, in leaf development. The LB-03 gene encodes a protein associated with the suppression of nonphotochemical quenching and maintaining the efficiency of light harvesting (suppressor of quenching 1, SOQ1 (ASEM_AUS1_14480) (Brooks et al., 2013; Duan et al., 2023)).
The δ13C GWAS analyses were repeated with a subset of photosynthetic types (Fig. S3; Table S4), and these also identified potentially interesting candidate genes, particularly those related to leaf vein patterning. WIP C2H2 zinc finger protein (WIP2 (ASEM_AUS1_03361); LB-20; C4 & C3+C4 individuals) is paralogous to the WIP6 transcription factor TOO MANY LATERALS that specifies vein rank in maize and rice (Vlad et al., 2024). Defectively Organized Tributaries 4 (DOT4 (ASEM_AUS1_05127); LB-23; C3+C4 individuals) is orthologous to a vein patterning gene in Arabidopsis thaliana (Petricka et al., 2008).
Identifying regions of the genome associated with C4 leaf anatomy in the C3+C4 intermediates
We studied the genetic basis of three leaf anatomical traits previously associated with the strength of the C4 cycle (δ13C) using the 132 C3+C4 individuals (Fig. 1; Alenazi et al., 2023). The heritability estimates for the three leaf anatomical traits in the C3+C4 intermediates ranged from roughly equivalent to the δ13C estimate to much lower (IBSF h2 = 0.22 (SE = 0.04); BSD h2 = 0.12 (SE = 0.06); IBSW h2 = 0.06 (SE = 0.06); n = 132). No significantly associated genomic region was detected for IBSW (Fig. S4), the trait with the lowest heritability. However, we did detect SNPs significantly associated with BSD and IBSF.
Bundle sheath distance
The distance between consecutive bundle sheaths plays a significant role in determining the rate and efficiency of photosynthesis in plants, with smaller distances being significantly correlated with higher δ13C (more C4-like) values (Alenazi et al., 2023). The C3+C4 intermediate individuals showed a range of BSDs from 55.14 to 178.36 μm, with variation between subclades (Fig. 4a). The GWAS identified two significant regions associated with BSD, both on chromosome 9 (Tables 1, S3; Fig. 4). Only one annotated gene was identified in the correlated genomic regions associated with BSD, the function of which is associated with leaf development (Glucan Synthase-Like 8, GSL8 (ASEM_AUS1_16831); Table S4 (Linh & Scarpella, 2022)).
Inner bundle sheath fraction
Inner bundle sheath fraction represents the portion of the leaf that can be used for C4 photosynthesis (Fig. 1). A higher IBSF in the C3+C4 A. semialata has been significantly correlated with a higher δ13C (more C4-like; Alenazi et al., 2023). In the C3+C4 populations, IBSF ranges from 0.05 to 0.29, with variation between subclades (Fig. 5a). We identified five regions of the genome correlated with IBSF, each on a different chromosome (Tables 1, S3; Fig. 5). Expression was detected in mature leaves for 62% of the 65 genes located in the five regions, with no consistent differential expression between photosynthetic types in mature leaves (Dunning et al., 2019a), although two are on average more highly expressed in the C3 vs C4 accessions (ASEM_AUS1_21119 Bonferroni-adjusted P-value = 0.073; ASEM_AUS1_17094 Bonferroni-adjusted P-value < 0.001). Five out of the 65 genes were also evolving under strong positive selection with a dN/dS ratio (ω) > 1 using the one-ratio model (Table S4). The annotated genes in the correlated regions of the genome have a variety of functions (Tables 1, S4), including loci directly connected to the response to light stress (Ferulate 5-Hydroxylase 1, FAH1 (ASEM_AUS1_36251) (Maruta et al., 2014)) and leaf development (GATA transcription factor 19, GAT19 (ASEM_AUS1_21136), CCR4-NOT transcription complex subunit 11, CNOT11 (ASEM_AUS1_25789) & GRF1-interacting factor 1, GIF1 (ASEM_AUS1_21151)) (Sarowar et al., 2007; Zhang et al., 2018; An et al., 2020).
Discussion
Alloteropsis semialata has C3, C3+C4, and C4 genotypes that recently diverged. It is therefore a useful model to study the initial steps leading to the establishment of the C4 phenotype, since these modifications are not conflated with other changes that accumulate over time (Pereira et al., 2023). The emergence of C4 in this species also provided an immediate demographic advantage (Sotelo et al., 2024). Here, we estimate the heritability and identify regions of the genome correlated with variation in both the stable carbon isotope ratio (δ13C) and leaf anatomical traits known to influence δ13C from field-based measurements (Alenazi et al., 2023). Despite a relatively modest sample size (n = 420 for δ13C; n = 132 for leaf anatomy), we identified regions of the genome significantly associated with these traits, which may indicate that the genetic architecture of C4 evolution in A. semialata is relatively simple, although a broader study may identify additional loci. At present, functional validation of the candidate loci is not possible in Alloteropsis, as no proven stable transformation system has been established (Pereira et al., 2023). Establishing one would greatly advance the utility of Alloteropsis semialata as a model system to study C4 evolution in the future.
Genetic basis of the carbon isotope ratio (δ13C) in A. semialata
Using linked phenotype and genotype information for 420 A. semialata individuals, we identified three associated regions of the genome, containing eight protein-coding genes (Fig. 3). The underlying differences in the δ13C between photosynthetic types are driven by C4 plants evolving to fix carbon with the PEPC enzyme rather than Rubisco. However, genes encoding PEPC were not detected in the associated regions identified in our GWAS. This absence could be due to variation in the specific PEPC gene copy used for C4 in the different individuals masking the signal, with up to five different versions known to be used by different A. semialata populations (Dunning et al., 2017). Among these five copies, three were laterally acquired (Christin et al., 2012), complicating the matter further as they appear as large structural variants inserted randomly into the genome (Dunning et al., 2019b) and are only present in a subset of individuals (Raimondeau et al., 2023). However, based on the annotations of the genes in the associated regions, we did identify candidate genes with functions potentially associated with the δ13C, the most promising of which include those co-expressed with Rubisco (SLP1 (ASEM_AUS1_34302)), the activation of the NADP-ME C4 decarboxylating enzyme (RIPK (ASEM_AUS1_34305)), the development of C4 ‘Kranz’ anatomy (SCL9 (ASEM_AUS1_29467)), and the suppression of nonphotochemical quenching (SOQ1 (ASEM_AUS1_14480)).
SLP1 encodes a Shewanella-like protein phosphatase 1, an ancient chloroplast phosphatase that is generally more highly expressed in photosynthetic tissue (Kutuzov & Andreeva, 2012; Johnson et al., 2020). In Arabidopsis thaliana, it is co-expressed with a number of photosynthetic genes (including all of the Calvin cycle enzymes and Rubisco activase) and it is predicted to play a role in the light-dependent regulation of chloroplast function (Kutuzov & Andreeva, 2012). In A. semialata, SLP1 is significantly more highly expressed in the C3 individuals compared with the other photosynthetic types. This greater expression in C3 individuals could indicate a higher Calvin cycle activity at the whole leaf level; meanwhile, in the C3+C4 and C4 individuals, its expression would be increasingly restricted to the IBS tissue. Subdivision of the light signaling networks is one of the key steps in the partitioning of photosynthesis across tissue types in C4 species (Hendron & Kelly, 2020), and SLP1 is potentially one of the regulators of this key innovation in A. semialata.
RIPK is an enzyme that plays a role in disease resistance and plant immunity (Liu et al., 2011), but has pleiotropic effects. In A. thaliana, RIPK directly phosphorylates NADP-ME2 (AT5G11670) to enhance its activity and increase cytosolic NADPH concentrations (Wu et al., 2022). In C4 species, CO2 is initially fixed in the mesophyll by CA and PEPC before being transported to an internal leaf compartment, where it is released by a decarboxylating enzyme such as NADP-ME for Rubisco to assimilate through the Calvin cycle. Preliminary studies in A. semialata concluded that NADP-ME was the predominant decarboxylating enzyme, although its activity varied with temperature (Frean et al., 1983). Subsequent transcriptome work showed that NADP-ME expression (specifically the nadpme-1P4 gene that is a many-to-many ortholog of AT5G11670) has a mean expression level four times higher in C4 and C3+C4 individuals (mean = 300 RPKM; SD = 235) than in C3 plants (mean = 75 RPKM; SD = 32), although this difference is not always consistent between populations (Dunning et al., 2019a). The other decarboxylating enzyme commonly used by C4 Alloteropsis is phosphoenolpyruvate carboxykinase (PCK), but like PEPC, a C4 copy of PCK was also laterally acquired (Christin et al., 2012), complicating its identification in a GWAS analysis because it is absent in C3 individuals (Dunning et al., 2019b).
SCL9 belongs to the GRAS gene family of transcription factors that regulate plant development (Hirsch & Oldroyd, 2009). This multigene family includes two known C4 Kranz anatomy regulators identified in maize, SHORTROOT (Slewinski et al., 2014) and SCARECROW (Slewinski et al., 2012). Orthologous SCARECROW (SCR) genes have divergent functions, being recruited for distinct roles in leaf development within maize, rice, and A. thaliana (Hughes & Langdale, 2022). In addition to its influence on leaf anatomy, SCR is also required for maintaining photosynthetic capacity in maize (Hughes & Langdale, 2020). The correlation of the SCARECROW-LIKE SCL9 gene with the strength of the C4 cycle in A. semialata may indicate that convergence in C4 phenotypes is a result of the parallel recruitment of GRAS transcription factors between species, although there is divergence in the specific loci recruited for this purpose.
SOQ1 is a chloroplast-localized thylakoid membrane protein that regulates nonphotochemical quenching in A. thaliana (Brooks et al., 2013; Duan et al., 2023). In full sunlight, plants absorb more light energy than they can process, which can ultimately result in the generation of free radicals that damage the photosynthetic apparatus (Müller et al., 2001). To overcome this, plants have evolved nonphotochemical quenching, which enables them to dissipate the excess energy as heat. This problem is potentially exacerbated in C4 species, which typically grow in high-light conditions compared with their C3 counterparts (Sage & Monson, 1999). Preliminary evidence indicates that C4 species exhibit a significantly faster and greater nonphotochemical quenching relaxation than their C3 relatives, including between photosynthetic types in A. semialata (Acre Cubas, 2023). SOQ1 may therefore play a direct role in regulating differences in the nonphotochemical quenching responses among A. semialata photosynthetic types, and it may represent a good candidate gene to target for reduced photoinhibition associated with fluctuating light conditions in crops (Long et al., 1994).
The genetic basis of C4 leaf anatomy
In A. semialata, the IBS is the site of C4 photosynthesis, and three leaf anatomical variables linked to the proliferation of this tissue explain the strength of the C4 cycle (δ13C) in the C3+C4 intermediate individuals: IBSW, BSD, and IBSF (Alenazi et al., 2023). IBSW has the lowest heritability (h2 = 0.06 (SE = 0.06)), and we failed to identify any significant SNPs correlated with this phenotype in our GWAS. This absence of significant genetic factors contributing to the trait may indicate that IBSW has a complex genetic architecture or high phenotypic plasticity. In A. semialata, δ13C is largely genetically based as it is highly heritable after population structure has been accounted for (h2 = 0.75 (SE = 0.06)), and field-based differences are preserved in a common environment (Lundgren et al., 2016). However, slight variation in δ13C can still be caused by environmental effects on water use efficiency (Farquhar & Richards, 1984). The previously observed correlation of field-based IBSW measurements with δ13C may potentially arise from such environmentally induced plasticity (Alenazi et al., 2023). For example, bundle sheath cells in wheat have a larger diameter (more C4-like) under drought conditions (David et al., 2017).
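As a rough illustration of how different the two heritability estimates above are, the sketch below computes approximate 95% intervals from the reported point estimates and standard errors. It assumes a normal (Wald) approximation, estimate ± 1.96 × SE, which is not necessarily how the intervals would be derived in the original analysis; the function name wald_interval is hypothetical and used only for this example.

```python
# Approximate 95% Wald intervals for the heritability estimates quoted in the text.
# Assumption: the sampling distribution of h2 is roughly normal, so the interval
# is estimate +/- 1.96 * SE. This is an illustration, not the authors' method.

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def wald_interval(estimate, se, z=Z_95):
    """Return the (lower, upper) bounds of estimate +/- z * se (not clipped to [0, 1])."""
    margin = z * se
    return estimate - margin, estimate + margin


# (h2, SE) pairs taken from the text
estimates = {
    "IBSW": (0.06, 0.06),
    "delta13C": (0.75, 0.06),
}

for trait, (h2, se) in estimates.items():
    lower, upper = wald_interval(h2, se)
    includes_zero = lower <= 0.0
    print(f"{trait}: h2 = {h2:.2f}, approx. 95% interval = "
          f"({lower:.3f}, {upper:.3f}), includes zero: {includes_zero}")
```

Under this approximation the IBSW interval spans zero whereas the δ13C interval does not, which is consistent with the absence of significant SNPs for IBSW.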
Plasmodesmata and reduced distance between bundle sheaths
We identified two regions of the genome associated with BSD, which together contain a single protein-coding gene. This gene is GSL8 (ASEM_AUS1_16831), a member of the GSL family that encodes enzymes synthesizing callose. GSL8 plays an important role in tissue-level organization (Chen et al., 2009), including stomatal (Guseman et al., 2010) and leaf vein patterning (Linh & Scarpella, 2022). Mutants of GSL8 in A. thaliana formed networks of fewer veins in their leaves (Linh & Scarpella, 2022). This change in venation is mediated by the aperture of plasmodesmata, channels through cell walls that connect neighboring cells (Paterlini, 2020; Band, 2021), which is regulated by GSL8 (Saatian et al., 2018; Linh & Scarpella, 2022). Normal vein patterning is reliant on an auxin hormone signal travelling through these plasmodesmata, and any interference with this signal disrupts leaf vein development (Linh & Scarpella, 2022). GSL8 might play a role in strengthening the C4 cycle in A. semialata by reducing the distance between bundle sheaths through modulation of the auxin signal. The transition to being fully C4 in A. semialata is also correlated with the presence of minor veins, which reduces both the number of mesophyll cells and the distance between bundle sheaths in C3+C4 in comparison with C3 populations (Lundgren et al., 2019). Therefore, GSL8 may play a pleiotropic role in the strengthening of C4 photosynthesis in A. semialata by increasing both the proportion of bundle sheath tissue in the leaf, and the connectivity between the two distinct cell types required to complete the cycle. Interestingly, the δ13C GWAS using the C4 and C3+C4 individuals (which differ in the presence of minor veins; Lundgren et al., 2019) identified WIP2, a paralog of a gene recently shown to specify vein rank in maize (Vlad et al., 2024), and WIP2 has potentially been co-opted for a similar function in A. semialata.
The genetic basis of the inner bundle sheath fraction in A. semialata
The inner bundle sheath fraction has the highest heritability of the three leaf anatomy measures used (h2 = 0.22 (SE = 0.04)). Since it is a composite trait, it is more likely to be influenced by multiple developmental processes. Our GWAS identified five regions of the genome significantly associated with IBSF, containing 65 predicted protein-coding genes. Interestingly, we found a number of genes associated with leaf development that could play a role in the development of C4 leaf architecture. These include homologs of genes that alter leaf area and vascular development (GAT19 (ASEM_AUS1_21136)) (An et al., 2020), leaf thickness (CNOT11 (ASEM_AUS1_25789)) (Sarowar et al., 2007), and leaf width by regulating meristem determinacy (GIF1 (ASEM_AUS1_21151)) (Zhang et al., 2018). GIF1 (also called ANGUSTIFOLIA3) is perhaps the most interesting of these genes, since it is expressed in the mesophyll cells of the leaf primordium and can influence the proliferation of other clonally independent leaf cells (e.g. epidermal cells (Kawade et al., 2013)). The numerous regulators of leaf development identified in the GWAS point to an interacting balance of growth regulators that increases the proportion of bundle sheath tissue within the leaf for C4 photosynthesis.
These regions also contain genes with a diverse set of other functions, including genes associated with light stress and lignin biosynthesis, although it is unclear how they could modulate IBSF. For example, FAH1 encodes ferulic acid 5-hydroxylase (F5H) 1, a cytochrome P450 protein that, when disrupted, reduces anthocyanin accumulation under photooxidative stress (Maruta et al., 2014) and is more highly expressed in the C3 (mean RPKM = 7.38; SD = 5.29) than other photosynthetic types (mean RPKM = 1.00; SD = 2.10; Table S4). These loci could also play a role in C4 photosynthesis, although it is more likely that they are simply in close physical linkage with the causal loci.
Conclusion
C4 photosynthesis is a complex trait that requires the rewiring of metabolic gene networks and alterations to the internal leaf anatomy. We investigated the genetic basis of these key innovations in Alloteropsis semialata, which has recently diverged C3, C3+C4 intermediate, and C4 phenotypes. We performed a GWAS that identified regulators of C4 decarboxylation enzymes (RIPK), nonphotochemical quenching (SOQ1), and several genes involved in tissue-level organization and leaf development (e.g. SCL9, GSL8, and GIF1). Interestingly, these tend to come from the same gene families as the previously identified C4 leaf anatomy regulators in other species. This parallel recruitment appears to mirror the pattern observed in the core metabolic enzymes, with the paralog recruited for the C4 function depending on its ancestral expression pattern and catalytic properties (Wang et al., 2009; Hibberd & Covshoff, 2010; Christin et al., 2013, 2015; Aubry et al., 2014; Emms et al., 2016; Moreno-Villena et al., 2018). Thus, the easiest path to C4 leaf anatomy would be context-dependent, which likely has implications for engineering C4 anatomy in C3 species.
Acknowledgements
The authors extend their appreciation to the Deanship of Scientific Research at Northern Border University, Arar, KSA, for funding this research work (project no.: NBU-SAFIR-2024). LP is supported by a Natural Environment Research Council grant (grant no.: NE/V000012/1). P-AC was funded by a Royal Society University Research Fellowship (grant no.: URF\R\180022). LTD is funded by a NERC fellowship (grant no.: NE/T011025/1).
Competing interests
None declared.
Author contributions
ASA, LP, P-AC, CPO and LTD designed the study. ASA conducted the experimental work and generated the phenotype data. ASA, LP and LTD analyzed the data. All authors interpreted the results and helped write the manuscript. ASA and LP contributed equally to this work.