Genome-wide association studies and prediction of 17 traits related to phenology, biomass and cell wall composition in the energy grass Miscanthus sinensis

Increasing demands for food and energy require a step change in the effectiveness, speed and flexibility of crop breeding. Therefore, the aim of this study was to assess the potential of genome-wide association studies (GWASs) and genomic selection (i.e. phenotype prediction from a genome-wide set of markers) to guide fundamental plant science and to accelerate breeding in the energy grass Miscanthus. We generated over 100 000 single-nucleotide variants (SNVs) by sequencing restriction site-associated DNA (RAD) tags in 138 Micanthus sinensis genotypes, and related SNVs to phenotypic data for 17 traits measured in a field trial. Confounding by population structure and relatedness was severe in naïve GWAS analyses, but mixed-linear models robustly controlled for these effects and allowed us to detect multiple associations that reached genome-wide significance. Genome-wide prediction accuracies tended to be moderate to high (average of 0.57), but varied dramatically across traits. As expected, predictive abilities increased linearly with the size of the mapping population, but reached a plateau when the number of markers used for prediction exceeded 10 000–20 000, and tended to decline, but remain significant, when cross-validations were performed across subpopulations. Our results suggest that the immediate implementation of genomic selection in Miscanthus breeding programs may be feasible.

1 Figure S1 Linkage disequilibrium (LD, measured as pairwise r 2 from genotypic correlation) based on single-nucleotide variants (SNVs) detected using restriction-site associated DNA sequencing (RAD-Seq) in Miscanthus sinensis. Left: large-scale LD among SNVs that were (1) detected using alignments to the Sorghum bicolor genome; (2) filtered using 'liberal' criteria; and (3) located within 1 Mb of each other, assuming microsynteny between Miscanthus and Sorghum. Only values of r 2 ≥ 0.1 are shown. Right: fine-scale LD among SNVs that (1) were detected using alignments to a Miscanthus pseudo-reference; (2) were filtered using 'liberal' criteria; (3) were located within the same RAD tag; and (4) had minor allele frequencies of at least 0.10. To equalize sample sizes, r 2 values were calculated based on N = 26 individuals from the overall population, as well as from the 'Continent' and 'Japan' subpopulations.  (Pritchard et al., 2000;Falush et al., 2003Falush et al., , 2007. (a) Mean log-likelihoods and their standard deviations from runs assuming different numbers of subpopulations (K). (b) Values of the ad hoc statistic ΔK, which tends to peak at the value of K that corresponds to the highest hierarchical level of substructure (Evanno et al., 2005). (c) Individual proportional memberships of 138 M. sinensis genotypes assuming K = 2. Clustering results from 10 independent runs of STRUCTURE were aligned using the CLUMPP program (Jakobsson & Rosenberg, 2007) and illustrated using the DISTRUCT program (Rosenberg, 2004). Genotypes were ordered by their scores on the primary axis of variation (PC1) detected using individual-based principal component analysis (Fig. 1, Patterson et al., 2006).

Figure S3
Statistical power (left) and effect size inflation (right) based on data perturbation simulations (Yu et al., 2006) for genome-wide association study (GWAS) analyses using the EMMAX program, with mixed linear models including the identity-by-state matrix and the first two eigenvectors of population structure. Left: statistical power was estimated for  = 10 -5 and setting a target proportion of variance explained (PVE, see Materials and Methods). Right: naïve estimates of (PVE) for associations detected at  = 10 -5 vs their 'real' simulated values. In the absence of bias, points are expected to follow a trend around the x = y diagonal (dashed line).

Figure S4
Genome-wide association study (GWAS) results for 17 phenotypic traits in a population of 138 M. sinensis genotypes based on 53,174 single-nucleotide variants (SNVs) detected using alignments to the Sorghum bicolor genome and filtered using 'liberal' criteria (Table 2). Manhattan (left) and quantile-quantile (QQ) plots (right) are based on the GWAS p-values of all SNVs aligning to Sorghum chromosomes 1-10. Manhattan plots: blue lines indicate suggestive (P = 10 -5 ) and red lines Bonferroni-adjusted genome-wide significance (P = 0.05/53,174  9.4 × 10-7) based on mixed linear models including both the identity-by-state (IBS) kinship matrix and the first two eigenvectors (PC1 and PC2) from individual-based principal component analysis. QQ plots: results from analyses including (1) the IBS matrix, as well as PC1 and PC2 (black points); (2) the IBS matrix only (blue points); and (3) simple linear regression without corrections for population structure and relatedness (red points). In the absence of confounding, the majority of points are expected to fall on the x = y diagonal (dashed line).

Figure S5
Genome-wide association study (GWAS) results for 17 phenotypic traits in a population of 138 M. sinensis genotypes based on 121,771 single-nucleotide variants (SNVs) detected using alignments to a M. sinensis pseudo-reference and filtered using 'liberal' criteria (Table 2). Quantilequantile (QQ) plots are based on the GWAS p-values of all SNVs from analyses including (1) the identity-by-state (IBS) kinship matrix and the first two eigenvectors (PC1 and PC2) from individualbased principal component analysis (black points); (2) the IBS matrix only (blue points); and (3) simple linear regression without corrections for population structure and relatedness (red points). The threshold for Bonferroni-adjusted genome-wide significance was P = 0.05/121,771  4.1 × 10 -7 . In the absence of confounding, the majority of points are expected to fall on the x = y diagonal (dashed line).  (Table 2). In addition to the kinship matrix, MLMMs also included the first two eigenvectors of population structure (Fig. 1) as fixed effects. Only results from the optimal model selection step are shown (multiple Bonferroni criterion); the dotted line indicates Bonferroni-adjusted genome-wide significance (P = 0.05/53,174  9.4 × 10-7). Results for other traits were similar or more conservative than those based on MLMMs without population structure covariates (Fig. S6).

Table S1
Simple linear regression measures of performance of genome-wide prediction in a population of 138 M. sinensis genotypes based on markers filtered using liberal criteria (Table 2). All regression coefficients are based on 100 random ten-fold cross-validations (i.e., using a training population with N = 124 genotypes). Method S1 Miscanthus pseudo-reference First, we selected a reference genotype based on its relatively large sequence coverage (17,152,027 reads). After trimming reads from the 3' end to a total length of 85 bp, custom Perl scripts were used to cluster identical sequences represented between 30 and 250 times. Based on the observed coverage distribution, these sequences were assumed to represent low-or single-copy restrictionsite associated DNA (RAD) loci in the Miscanthus genome. As a result of these procedures, 149,974 RAD clusters were coalesced from 8,323,912 sequence reads (i.e., approximately 49% of the total sequence data). The assembly was then condensed to fasta format and aligned to itself using BWA (Li & Durbin, 2009), with the following command options: bwa aln -n 5 -N. This self-alignment was used to identify highly similar sequences within the assembly. Any cluster with more than two observed presumed haplotypes was discarded. This was done to eliminate duplicated sequences in the assembly. The final filtered assembly contained 48,426 RAD clusters, representing 4.1 Mb of putatively low-copy Miscanthus genomic sequence.