A chromosome‐scale genome assembly reveals a highly dynamic effector repertoire of wheat powdery mildew

Summary Blumeria graminis f. sp. tritici (B.g. tritici) is the causal agent of the wheat powdery mildew disease. The highly fragmented B.g. tritici genome available so far has prevented a systematic analysis of effector genes that are known to be involved in host adaptation. To study the diversity and evolution of effector genes we produced a chromosome‐scale assembly of the B.g. tritici genome. The genome assembly and annotation was achieved by combining long‐read sequencing with high‐density genetic mapping, bacterial artificial chromosome fingerprinting and transcriptomics. We found that the 166.6 Mb B.g. tritici genome encodes 844 candidate effector genes, over 40% more than previously reported. Candidate effector genes have characteristic local genomic organization such as gene clustering and enrichment for recombination‐active regions and certain transposable element families. A large group of 412 candidate effector genes shows high plasticity in terms of copy number variation in a global set of 36 isolates and of transcription levels. Our data suggest that copy number variation and transcriptional flexibility are the main drivers for adaptation in B.g. tritici. The high repeat content may play a role in providing a genomic environment that allows rapid evolution of effector genes with selection as the driving force.


Fig. S1
Linear regression between chromosome size and gene numbers per chromosome.

Table S1
Sizes and features of the assembled chromosomes.

Table S2
Size estimates of five collapsed repeats. Table S3 B.g. tritici1 isolates used for analyses of centromeric repeats, mapping coverage and copy number variations.

Table S4
Other formae speciales isolates used for analyses of centromeric repeats and mapping coverage.

Table S6
Numbers of homologs of candidate effector genes and non-effector genes in the two Leotiomycetes Botrytis cinerea and Phialocephala subalpina Table S7 Summary of the genetic information for the 11 chromosomes of B.g. tritici from the cross 96224 X THUN-12.

Table S8
Description of the recombination hotspots that were validated by PCR as described in Note S1.

Table S9
Primers used to verify the recombination hotspots by PCR.

Table S10
Recombination frequency in the 96224 X THUN-12 depending of the genomic origin of THUN-12.

Table S11
Conservation of recombination hotspots in three mapping populations.

Table S12
Conservation of recombination coldspots in three mapping populations.
Table S13 Description of the 16 candidate effector gene families with at least 10 members in B.g. tritici.
Table S14 Summary of duplicated genes in the reference assembly Bgt_genome.v3.16 Table S15 Copy number variation of genes in 36 isolates of B.g. tritici.

Table S16
Estimates of nucleotide polymorphism rates in synonymous sites of genes derived from sequences of 36 B.g. tritici isolates.
Table S17 Nucleotide polymorphism rates in B.g. tritici genes from 36 isolates.
Table S18 Nucleotide polymorphism rates in B.g. tritici genes from 36 isolates.

Table S20
Enriched TE superfamilies in up-and downstream regions of group 1 and 2 candidate effector genes as well as non-effector genes.

Table S21
Number of copy number variants (CNV) and templates for unequal crossing over (UECO).

Table S22
Expression levels of candidate effector and non-effector genes located near recombination breakpoints.
Method S1 Construction of the genetic map.

Method S4 Statistical analyses.
Note S1 Verification of the integrity of the genome assembly in recombination hotspots by PCR.
Note S2 Analysis of sequence diversity of genes in 36 B.g. tritici isolates.
Note S3 A chromosome-scale assembly of the B.g. tritici genome.
Note S4 Genome size estimation.

Note S5 Transposable element annotation.
Note S6 Gene annotation and candidate effector classification.

Fig. S1 Linear regression between chromosome size and gene numbers per chromosome. The
x-axis is chromosome size in Mb, while the y-axis indicates the number of genes per chromosome. Individual chromosomes are indicated with squares. a. Overall, the total number of genes and chromosome size correlate strongly with a Pearson product-moment correlation coefficient (PMCC) of r=0.978 (When r is close to 1, it indicates a strong positive relationship). b, Numbers of effectors correlate less well, but still strongly, with chromosome. c. Numbers of group 1 and 2 candidate effectors (in total 657 genes) do not correlate with chromosome size. This is possibly due to the strong physical clustering of gene families observed in these two groups. d. In contrast, numbers of candidate effectors that were not included in group 1 or 2 correlate strongly with chromosome size.

Fig. S2
Contributions to the B.g. tritici genome of the 20 most abundant TE families TE family names are given at the left. The second column indicates the size of the consensus sequence of the TE family. The copy number estimate is calculated by dividing the total number of bp contributed by the TE family by the size of the consensus sequence. The x-axis of the graph shows the contribution in % of the total genome size. Different superfamilies are shown in different colors.

Fig. S3 Distribution of GC content in genomic windows with different recombination frequencies.
GC content and recombination was estimated in 50kb windows. Windows in which no recombination occurred (cM=0) were compared to windows in with recombination (cM>0 or cM>5cM). Centromeric windows consist of windows that completely overlap with the defined centromeres (Fig. 2).

Fig. S4
Identification and characterization of tandem repeats that were collapsed in the PacBio assembly. a, To identify collapsed repeats, Illumina sequences were mapped to the B.g. tritici pseudomolecules. The inset at the top right depicts the distribution of sequence coverage in 500 bp windows, showing that Illumina sequences cover the genome approximately 23-fold. Sequence coverage is plotted along chromosomes in vertical red lines. The scale is in Mb. Most of the chromosomes have the expected even sequence coverage of approximately 23-fold. Regions containing collapsed repeats are visible as spikes. Most sequences representing collapsed repeats are deposited in Chr-Un. DRC represent a newly identified tandem repeat (see Note S4). b, Dotplot of a region containing CentA repeats. c, Example of a PacBio contig which shows extreme variation in G/C content. The x-axis represents the bp position while the y-axis shows the G/C content in sliding windows of 200 bp.

Fig. S5
Sequence organization of the centromere of chromosomes 8 and 11. a, Sequence coverage with Illumina reads in the 1 Mb region containing the centromere of chromosome 8. b, Sequence coverage around the region containing centromeric repeats type A in chromosome 8. In a and b, the dark grey boxes indicate regions with higher coverage corresponding to the centromeric repeats type A (CentA); the x-axis corresponds to positions over the chromosome in bp and the y-axis to the coverage in number of Illumina reads. c, Annotation of the region corresponding to b. d, Annotation of the centromere of chromosome 8. e, Annotation of the centromere of chromosome 11. The legend box corresponds to c, d and e. Fig. S6 BUSCO assessment of the B.g. tritici genome version v3.16. BUSCO was done with gene sets for the taxa Ascomycetes and fungi. The dataset for fungi is smaller (290 genes) and contains more highly conserved genes of which almost all (286) are present in the B.g. tritici genome version v3.16. The higher number of absent genes in Ascomycetes is due to the larger number of genes used in that dataset (1275), and the fact that the reference species of the database is Aspergillus nidulans, which is a rather distant relative of B.g. tritici.

Fig. S7
General characteristics of recombination in the cross 96224 X THUN-12. a, Distribution of the physical distance of 1,520 marker pairs that are genetically >1cM apart from each other b, Correlation between chromosome size and recombination rate. Every dot represents one chromosome. c, Correlation between chromosome size and size of the centromeric regions. Each dot represents one chromosome. d, Distribution of recombination rate estimated in 50kb windows. Supplementary Table 7. Each hotspot region was amplified in 96224 (left band) and THUN-12 (right band). a. Hotspots RH1-RH8 b. Hotspots RH9-RH18

Fig. S9
Physical clustering of effector gene families along B.g. tritici chromosomes 1-6. The physical positions of B.g. tritici genes are plotted along their chromosomal position (x-axis, in kb). Each dot represents one gene. In the lowest row in grey all genes are plotted. The second row, in black, represents all candidate effector genes and the next 21 rows represent the 21 candidate effector gene families containing more than 10 members.  (xaxis, in kb). Each dot represents one gene. In the lowest row in grey all the genes are plotted. The second row, in black, represents all candidate effector genes and the next 21 rows represent the 21 candidate effector gene families containing more than 10 members.

Fig
. S10 Phylogenetic and expression analyses of candidate effector gene families E007 and E011. a. Phylogenetic tree of effector gene family E007. b. Gene expression of the members of family E007. c. Phylogenetic tree of effector gene family E011. d. Gene expression of the members of family E011. e. Physical position of the six clusters of effector family E007 and two clusters of family E001. The different clusters are indicated on their chromosomal positions with boxes of different colors corresponding to the colors in a, b, c and d respectively. For a and c, the grey branches correspond to B.g. hordei genes and the B.g. tritici genes are indicated either with colored branches corresponding to different clusters or with black branches for genes that are not part of a cluster. For b and d, the expression values are indicated in rpkm based on three biological replicates and the error bars represent the standard error of the mean. The colors refer to the clusters in a and c respectively. Gene expression varies drastically between family members as well as between genes of the same cluster.
Fig. S11 Gene expression in the 16 candidate effector families with more than 10 genes. Gene expression levels of all members of the candidate effector gene families with more than 10 genes are plotted in reads per kilo basepairs per million reads (rpkm). Mean expression of three RNA-Seq biological replicates is plotted and the error bars represent standard deviation of the mean. The genes are ordered by increasing expression levels. Families in the green frame are from group-1, the ones in the blue frame from group-2 and the family in the yellow frame is family E016, the only family with more than 10 members that could not be attributed to one of the two groups.
Fig. S12 Expression levels of the 16 largest candidate effector gene families in 96224.The expression level was deduced from three RNA-Seq replicates and depicted in log(rpkm). Gene families of group 1 candidate effectors are depicted in green, families of group 2 in blue and family E016 in yellow.

Fig. S14 Heat map of normalized genomic coverage of candidate effector family E014.
Genomic coverage for each gene was normalized to the coverage of all genes. The genes are ordered according to the phylogenetic tree in Fig. 3. The isolates originate from Switzerland (CH), The United Kingdom (UK), China (CHN) and Israel (ISR).

Fig. S15
Amino acid compositions of 7,164 predicted Non-effector proteins, 412 group 1 and 243 group 2 candidate effector proteins. Multiple Amino-acids that are encoded by multiple codons (Alanine, Leucine, Arginine and Serine) are under-represented in group 1 and 2 candidate effector proteins, while amino acids with only two codons (Cysteine, Phenylalanine and Tyrosine) are over-represented. This amino acid usage bias contributes to the higher number of synonymous sites in non-effector genes.
Fig. S16 Two-speed-genome hypothesis tested in B.g. tritici. Density plots displaying the intergenic distance for each gene from the 3'UTR and 5'UTR of the coding sequence. a, shows all 8,470 genes. b, 844 candidate effector genes. Red dots indicate known avirulence genes (Avr) and the suppressor of avirulence (Svr) in B.g. tritici.
Fig. S17 Example of recombination hotspots in the 96224 X THUN-12 population. For each hotspot a physical distance of 150 kb is shown. Upper panels show collinearity between genetic (upper line) and physical (lower line) position of the SNP marker in the interval. Middle panels show the genes localized in the interval. Candidate effector genes of families with more than 10 members are indicated in color, effector candidate genes belonging to families with less than 10 members in black and non-effector genes by empty boxes. The third panel shows the genotypes of the 118 progeny for each SNP marker. Blue color indicates 96224 genotype and red indicates THUN-12 genotype. Grey color indicates missing data for the SNP marker. a, Bgt_chr-01 3350000-3500000, red: candidate effector family E003, black: candidate effector family E060 b, Bgt_chr-01 40500000-42000000, black: candidate effector family E046 c, Bgt_chr-04 2350000-2500000 green: candidate effector family E001 , blue: candidate effector family E016, black: candidate effector family E148 d, Bgt_chr-06 4900000-5050000 blue: candidate effector family E004           Names of the genetic marker described in Bourras et al., 2015 4 Physical distance between the markers 5 Genetic distance between the markers 6 Estimated recombination frequency between the markers 7 Chromosome that corresponds to the physical location of the marker in 3 in the 96224 assembly 8 Names of the genetic markers in 96224 X THUN-12 corresponding to the marker in 3 9 Conservation of high recombining regions (>3cM/50kb) in the 96224 X JIW2 and 96224 X THUN-12 populations is higher than e xpected by chance ( -square goodness-of-fit, p= 3.674e-07) LG14 M189RE , 2015 4 Physical distance between the markers 5 Genetic distance between the markers 6 Estimated recombination frequency between the markers 7 Chromosome that corresponds to the physical location of the marker in 3 in the 96224 assembly 8 Names of the genetic marker in 96224 X THUN-12 corresponding to the marker in 3 9 Conservation of recombination coldspots (=0cM/50kb) in the 96224 X JIW2 and 96224 X THUN-12 populations is higher than expected by chance ( -square goodness-of-fit, p=0.0022)    Group 1 expected h n.a n.a 7131 297 n.a n.a n.a Group 2 expected i n.a n.a 11501 455 n.a n.a n.a P-value exp. vs. obs. 1 n.a n.a < 0.00001 0.0858 n.a n.a n.a P-value exp. vs. obs. 2 n.a n.a < 0.00001 0.7900 n.a n.a n.a Group 1 expected h n.a n.a 5588 232 n.a n.a n.a

Method S1 Construction of the genetic map.
Two versions of the genetic map were created during the course of the work. The first genetic map served as backbone for the anchoring of the polished raw assembly contigs into chromosomes (v1). The second genetic map was used to study recombination rates in Blumeria graminis and was based on the mapping to the final genome assembly (v3.16) and corrected for possible genotyping errors (described below) (v2). Both maps were created using the filtered SNP markers (see Methods) using the program MSTmap (Wu et al., 2008). The genetic map was estimated using the Kosambi's distance function (Wu et al., 2008). The significance threshold for markers to be placed in one linkage group was set to p <10-6. Pairs or single markers that were grouped more than 15 cM away from any other linkage group were placed in a separate linkage group. These single marker groups were removed from our analysis. Estimation of missing data by the program was turned off. The genetic map used for anchoring the contigs was 4,910 cM in size. In high-density genetic maps, genotyping errors can lead to an inflation of the genetic map due to overestimation of recombination events. We therefore eliminated SNP markers with possible genotyping errors from our dataset. We removed single SNP markers when the genetic order of the marker did not correspond to the physical order of the chromosome (3% of the markers). If co-segregating markers groups with more than 10 SNP markers did not correspond to the physical orientation of the chromosome, they were retained in the genetic map. After removing of potential erroneous SNP markers, genetic distances were re-estimated using MSTmap for each linkage group separately. To test for potential gene conversion events, we detected potential double-crossovers that are less than 1 kb in size in the 96224 X THUN-12 genetic map. These intervals were tested for overlap with genes with the Bedtools intersect command (Quinlan & Hall, 2010) Method S2 Transcriptome analysis.
We used published RNA-Seq data from three different B.g. tritici isolates (each with three biological replicates) after infection of the susceptible wheat variety Chinese Spring and harvested at 48 hours after infection (Praz et al., 2018). Reads were mapped to the genome using STAR with the parameters: --outFilterMultimapNmax 10 --outFilterMismatchNoverLmax 0.02 --alignIntronMax 500 (Dobin et al., 2013). Read counts for all genes were obtained with featureCounts using standard parameters and the -M option allowing multi-mapping reads.
Expression analyses were done using the R package edgeR as previously published (Praz et al., 2018), and the same criteria were used for differential expression (log2FC > |1.5| and p-value (FDR) < 0.01).

Method S3 Phylogenetic analyses.
Protein sequence alignments were performed with Muscle 3.8.320. Phylogenetic trees were inferred with RaxML 8.2.8 using a protein GTR model and 100 bootstraps (Stamatakis, 2014).
Pairwise physical and phylogenetic distances were calculated within the 21 largest candidate effector gene families. If two genes are located on the same chromosome, the physical distance between them was calculated using the positions in the middle of the genes, if two genes are located on different chromosomes, the physical distance between them was arbitrary set to 1000000. The phylogenetic distance between two genes was calculated based on the phylogenetic tree of the family as the sum of the lengths of the branches that separate the two genes.

Method S4 Statistical analyses.
The binomial test from the R base (Team, 2008) package was used for the analysis of enrichment of candidate effectors around recombination breakpoints and among duplicated and deleted genes. For the binomial test, the probability of success was calculated by calculating the proportion of each gene class contributing to the number of duplication and deleted genes respectively. Correlation between pairwise genetic and physical distances of members of candidate effector families was tested using Mantel test (R package ade4, (Dray & Dufour, 2007).
Significance of differences in polymorphism rates between candidate effector and non-effector genes was tested with a -square test. To test if recombination hotspot and coldspots are conserved between different mapping population, 100 random intervals of the size of the intervals in Table S11 and S12 were simulated with the Bedtools random command (Quinlan & Hall, 2010)). The corresponding intervals in the 96224 X THUN-12 genetic map were identified with Bedtools interval command. -square goodness-of-fit test was used to test if observed conservation of hot and coldspots are higher than expected by chance. The proportion of recombination hotspots near centromere was also tested with a -square goodness-of-fit test.

Note S1 Verification of the integrity of the genome assembly in recombination hotspots by PCR.
In order to exclude artefacts from potential sequence mis-assemblies on the analysis of recombination rate we amplified a number of highly recombining regions by PCR and sequenced them. We selected 19 (RH1-19) marker pairs that were physically less than 2,500 bp apart from each other and had a genetic distance of more than 5cM. The selected genome regions are listed in Table S7. PCR primers for each region were designed to amplify PCR products which included both indicative SNP markers used for recombination rate analysis. Primers used for the amplification of highly recombinogenic genomic regions (RH 1-9 and 11-19) are listed in Table S8.
PCR amplification was conducted on genomic DNA of the parental isolates 96224 and THUN-12 using Phusion High-Fidelity DNA Polymerase (New England Biolabs). PCR products were purified with the GenElute PCR Clean-up Kit (Sigma-Aldrich) and Sanger-sequenced on an ABI3730 DNA Analyser (Applied Biosystems) using Bright-Dye Terminator Mix (Nimagen) with the primers designed for initial PCR amplification. In a few cases additional sequencing primers was necessary to cover the indicative SNPs defined for recombination analysis (Table S8). Due to the highly repetitive nature of the target regions, we ensured primer specificity with blast against the assembly (v3.16). Regions that could not be sequenced with desired accuracy due to unspecific by-products were subcloned using the StrataClone Blunt PCR Cloning Kit (Agilent) before sequencing with either internal or vector primers (Table S8).
Except for one region (RH10) primers within the vicinity (100 -1000bp) of the informative marker could be designed. For 17 out of the 18 remaining regions PCR amplification from both parental isolates was successful and resulted in an amplicon of the expected size (Fig. S8). The obtained PCR products were sequenced. We confirmed sequence polymorphisms at the marker positions through manual inspection of the obtained sequences. Thus, PCR amplification confirmed the sequence of our reference assembly and excluded the possibility that the recombination hotspots were caused by genome mis-assemblies.

Note S2 Analysis of sequence diversity of genes in 36 B.g. tritici isolates.
To sample diversity and polymorphism rates in candidate effector and non-effector genes, we mapped genomic sequences from the 36 previously published B.g. tritici isolates (Table S3) to the genome assembly. In total we identified 36,358 SNPs in coding sequences (CDS) of genes, and only 89 codons with more than one polymorphism. This relatively low level of sequence diversity (3.8 SNPs per kb CDS) prevented a meaningful analysis of the diversity in single genes. Instead, the 412 group 2 candidate effector genes, 245 group 2 candidate effector genes, and 7,171 noneffector genes were analysed as separate groups (i.e. lengths of genes numbers of polymorphisms were added up for each group). Weak candidate effector genes (see Note S4) were excluded from this analysis because the majority of them are derived from high-copy TEs which show high levels of polymorphisms between isolates. Erroneous mapping for TE reads to the reference genome can thus lead to frequent false positive SNP calling. An in-house perl script was used to determine the type of mutation in CDS (synonymous, non-synonymous). Using the method of (Nei & Gojobori, 1986), we determined the proportions of non-synonymous differences per non-synonymous sites and synonymous differences per synonymous sites. Here, sequences from all 35 isolates were aligned to the gene sequence of the reference isolate 96224.
Numbers of synonymous and non-synonymous sites were calculated for all sequences. Numbers of synonymous (pS) and non-synonymous differences (pN) then determined for all sequence pairs. Because polymorphism levels between isolates were very low, numbers of synonymous (pS) and non-synonymous differences (pN) were added up for all members of the gene group examined (group 1 and group 1 candidate effectors, non-effector genes or individual gene families). From this, the ratio of non-synonymous to synonymous differences per site (pN/pS) was calculated for each examined gene group. Both group 1 and 2 candidate effector genes show more than twice the pN/pS values of non-effector genes (Table S16). The high pN/pS values of group 1 and 2 candidate effector genes could indicate that many of these genes are under positive selection.
Analysis of individual families of candidate effector genes showed that high pN/pS values are relatively consistent across all families (Table S17). This excludes the possibility that high pN/pS values of an entire group are simply the result of extreme diversity in a small number of genes.
Only a few (mostly small) families are outliers with low or very high pN/pS values. These are likely statistical fluctuations due to small numbers of nucleotide differences in these gene families.
Nevertheless, some large candidate effector genes also have very high values (e.g. E008). Here, sequence data from additional isolates will be needed to determine whether these particular gene families indeed have such high rates of non-synonymous polymorphisms.
The two-speed genome hypothesis states that candidate effector genes may have higher overall polymorphism rates than non-effector genes. In some fungi, mutations can be induced if a gene is close to a TE that is silenced through the RIP pathway (Dong et al., 2015). However, B.g. tritici lacks RIP pathway genes (Wicker et al., 2013). To test whether candidate effector genes have higher mutation rates, we compared frequencies of nucleotide polymorphisms in synonymous sites in candidate effectors and non-effector genes. Here, we only considered fourfold degenerate codons where the third base can be substituted freely by any base without causing an amino acid change (e.g. Leucine is encoded by codons starting with CT while the third position can be any of the four bases), which are the codons starting with CT, GT, TC, CC, AC, GC, CG, and GG.
If candidate effector genes indeed had overall higher mutation rates, one would expect higher numbers of polymorphisms in these sites. We found that candidate effector genes (group 1 and 2) do not have significantly higher substitution rates in synonymous sites than non-effector genes (Table S15). Curiously, non-effector genes had overall significantly more synonymous sites than candidate effector genes (Table S15). This is can be explained by differences in amino-acid compositions, because predicted group 1 and 2 effector proteins are depleted in amino-acids that are encoded by multiple codons (e.g. Leucine, Arginine and Serine) compared to predicted non-effector proteins (Fig. S15).
Because the numbers of polymorphisms can be distorted if sequences from paralogs are compared (e.g. a given isolate contains a different copy of a gene), we repeated the analysis using only genes which show no evidence for copy number variations (Table S15). This had only little effect on the results.

Note S3 A chromosome-scale assembly of the B.g. tritici genome.
To obtain a high quality B.g. tritici reference genome assembly consisting of chromosome-sized scaffolds, we combined three experimental approaches. First, the genome of B.g. tritici isolate 96224 was sequenced to approximately 85-fold coverage with PacBio technology obtaining an average read length of 18'000 bp. These sequences were assembled into 653 sequence scaffolds with a cumulative length of 140.6 Mb. The N50 was 847.9 kb meaning that 50% of the assembly was in scaffolds larger than 847.9 kb, while the N90 was 190.4 kb ( Method S1). The SNP markers were used as genetic markers to construct a genetic map consisting of 11 linkage groups ranging in length from 188.6 to 732 cM and with a cumulative length of 4,910 cM.
The BAC end sequences were mapped to the PacBio sequence contigs by blastn, resulting in 17,917 BAC ends that could unambiguously be mapped. Positional information of BAC ends on PacBio contigs was used to form larger scaffolds (i.e. in cases were one end of a BAC was located on one scaffold while the other end matched to a different scaffold). In addition, positional information was used to identify 47 mis-assemblies in PacBio sequence contigs. Positions of misassemblies were also narrowed down based on SNP marker information. Manual examination of 10 cases showed that mis-assemblies were often caused by highly similar copies of transposable elements which range in size from 6-8 kb.
Information from BAC end sequencing and genetic mapping complemented each other and helped arranging PacBio contigs into large scaffolds. For example, PacBio contigs could be oriented via BAC ends if their orientation could not be determined through genetic map data (in regions of low recombination or for short contigs). Additionally, PacBio contigs connected by BAC ends could be integrated into scaffolds in the absence of genetic marker data (in regions with few polymorphisms).
In total, 313 PacBio contigs were assembled into 11 pseudomolecules which corresponded to the 11 genetic linkage groups (Table S3). The 313 contigs had an N50 of 873 kb (N90 of 223.7 kb) and represent 97.6% of the total sequence (Table 1). The un-anchored contigs were assembled into chromosome unknown (Chr-Un). The 11 pseudomolecules (hereafter equated with chromosomes) range in size from 3.6 to 19.7 Mb (Table S1) and contain only 357 sequence gaps The final genome version that is publicly available and used in this study will hereafter be referred to as Bgt_genome_v3.16.
We used the curated genetic map with a size of 4,279.5 cM to verify the quality of the assembly after scaffolding (see Method S1 for details on curation). We analyzed co-linearity between the order of the genetic and the physical position of markers (Fig. 2). A total of 99.8% of the markers in the genetic map were collinear with physical order in the genome assembly. We detected only four small discrepancies between genetic and physical positions (discrepancies of size: 27kb, 350kb, 1.7kb, 19kb) on chromosomes Bgt_chr-06, 07, 08, 09 respectively (Fig 2.). These four discrepancies were all located within individual PacBio sequence contigs. Because we could not determine whether they represent mis-assemblies or artefacts of the genetic map, they were not modified. The genetic map contains 119,023 markers, the largest chromosome Bgt_chr-09 is covered by 14,200 markers and the smallest Bgt_chr-11 by 2,646 markers, respectively.

Note S4 Genome size estimation.
The chromosome-scale assembly generated in this study has a size of 140.6 Mb which is a 71% size increase compared to the previous genome version of the same isolate (Wicker et al., 2013).
In the previous study, genome size was estimated to be 180 Mb based on BAC fingerprint assemblies. The much shorter cumulative length of the sequence assembly was assumed to be due to collapsed repeats. To obtain an estimate of the amount of collapsed repeats, we mapped Illumina reads onto the chromosome assembly. Sequence coverage with Illumina reads showed a narrow distribution for most of the genome with a mean of 22 and a standard deviation of 5.9 ( Fig. S4). Only approximately 0.5% of genomic 500 bp windows had a sequence coverage higher than 44 (i.e. twice the mean coverage). Thus, we decided to use 44-fold coverage as a cutoff for sequences to be examined further. We identified 136 regions with a total length of 2.24 Mb which have sequence coverage of 44 or more, some having nearly 20-fold the average coverage. Most of these correspond to short PacBio contigs that were integrated in Chr-Un (Fig. S4a), indicating that they were also problematic for the PacBio assembly. Nevertheless, some repeat arrays were assembled in longer PacBio scaffolds and could be integrated into the pseudomolecules (Fig.   S4a).
Some of regions with highest sequence coverage (indicative of strongly collapsed repeats) correspond to tandem repeat arrays of 5S and 45S ribosomal DNA (rDNA). One cluster of 45S rDNA genes is located on chromosome 9 approximately at position 16.4 Mb (Fig. 4). It is comprised of tandemly repeated 7.3. kb units containing the genes for 18S, 5.8S and 28S ribosomal RNA. This cluster has a size of 134.9 kb, but additional 45S rDNA arrays were also found on Chr-Un. We assume that these all actually belong to the same cluster on chromosome 9 but were not integrated in the assembly due to their highly repetitive nature.  (Table S2).
The genetic centromeres contain large arrays of tandem repeats (example in Fig. S5) for which Illumina sequence coverage indicates that they are strongly collapsed Fig. S4, Fig S5, Table S2).
We identified two types of centromeric tandem repeats: CentA has a size of 189 bp and is found in the centromeres of chromosomes 1 and 8, while CentB has a size of 197 bp and is found in the centromere of chromosomes 9 and 11 (Fig. S4). The 64 kb of assembled CentA and 14 kb CentB repeats probably represent approximately 440 kb and 76 kb of collapsed repeats (Table S2).
However, we can not exclude the possibility that other centromeres also contain CentA and/or CentB repeats, as many of them could not be integrated into the chromosome assemblies and instead were included in Chr-Un. Different chromosome-specific centromeric repeats in the same species were reported before (Plohl et al., 2014) and could witness an ancient hybridization event that was postulated previously . We identfied CentA and CentB repeats in various formae speciales of cereal mildews (Table S4 and   . In addition, we identified a novel 4.6 kb tandem repeat (DRC) which contains no genes but contributes approximately 1.16 Mb to the genome. It is likely that most DRC repeats are arranged in a single cluster on chromosome 4 because there were no other full-length copies on any other chromosome (Table S2).
Surprisingly, an additional 1.03 Mb represent sequences with very high sequence coverage but without clear repeat structure. Several of these loci share some stretches of near-identical sequences, but they contain no obvious tandem repeats or TEs. What is common to most of them is a highly fluctuating CG content (Fig. S4c). We hypothesize that either these sequences indeed represent collapsed repeats, or their sequence composition gave them a positive bias during the sequencing procedure and thus resulted in a higher coverage. Assuming that these are also 3,500 copies, representing 1.7% of the genome (Fig. S2). SINEs are evenly distributed along chromosome arms but mostly absent from centromeres (Fig. 1). LINE elements are found along the entire chromosomes with an enrichment in centromeric regions (Fig. 1). Interestingly, this enrichment is due to 11 LINE families which are highly specific for centromeres but absent from chromosome arms (for details of centromeric repeat composition see Fig. 1, Fig. S5). The other 46 identified LINE families are found almost exclusively on chromosome arms.
LTR retrotransposons of the Gypsy superfamily are also mainly found on chromosome arms, showing a similar distribution as SINEs, while LTR retrotransposons of the Copia superfamily are found all across the genome with no obvious enrichment in any chromosomal compartment (not shown). While retrotransposons show an extreme diversity in the B.g. tritici genome, we only identified three families of Class II (DNA) transposons, all of them belonging to the Mariner superfamily. They contribute only approximately 1% to the total genome sequence. A more detailed analysis of the TE fraction of the B.g. tritici genome will be published elsewhere.

Note S6 Gene annotation and candidate effector classification.
To annotate the new genome assembly, we combined different annotation approaches. We first transferred the gene annotation published in (Praz et al., 2018) and manually curated it to the B.g. tritici genome v3.16. In parallel we used the annotation pipeline Maker 2.31.8 (Cantarel et al., 2008) to annotate genes in the genome using the protein databases of B.g hordei and B.g.
tritici as templates (Spanu et al., 2010;Wicker et al., 2013). We used the B. graminis repeat database (integrated in TREP, botinst.uzh.ch/en/research/genetics/thomasWicker/trep-db.html) to mask repeats during the annotation. We then excluded all genes shorter than 50 bp and compared genes from the original annotation with genes annotated by Maker using a bidirectional blast search. All genes for which the reciprocal blast results were ambiguous were then manually curated using published RNA-Seq data from isolate 96224 (Praz et al., 2018).
We focused especially on the identification and annotation of candidate effector genes as they play important roles in host-pathogen interactions (Jones & Dangl, 2006;Panstruga & Dodds, 2009;Giraldo & Valent, 2013). To identify so far undiscovered candidate effector genes in the new genome sequence, we performed an ab-initio search by screening the entire genome for open reading frames (ORFs) that encode an N-terminal signal peptide. This search resulted in the identification of 12,863 ORFs with signal peptides. These 12,863 ORFs were then filtered in the following steps: To filter out known genes, these ORFs were then used as queries in blastn searches against the already annotated genes. Then, only those were selected that were also supported by expression data (cutoff at 20 rpkm). Finally, these were then filtered for overlap with typical TE genes (i.e. genes that clearly corresponded to canonical TE genes were removed).
Predicted ORFs that overlapped with annotated TEs were only retained if they fulfilled one or more of the following criteria: (i) the predicted ORF is in reverse orientation relative to the TE, (ii) the ORF is in the same orientation as the TE, but expression is only found in the region of the predicted ORF and not the rest of the TE, (iii) the predicted ORF only partially overlaps the respective TE, or (iv) the ORF is in the same orientation as the TE but is encoded in a different reading frame than the canonical TE genes. This filtering of the initially identfied 12,863 ORFs with signal peptides resulted in the addition of 124 gene models to the genome annotation. The combination of the de novo annotation, transfer of the existing B.g. tritici annotation to the new genome and the annotation of ORFs with signal peptide resulted in the annotation of 8'470 genes.
Using the method described in (Praz et al., 2017) clustered the proteins in families following the pipeline described in (Praz et al., 2017) and retained families with genes only from B.g. tritici and B.g. hordei as candidate effector genes.
The annotations of these candidate effector genes were manually curated, comparing the gene models with RNA-seq data, on the genome browser IGV14. The final annotation was then mapped to the genome with gmap to obtain a gff (general feature format) file.
After manual curation, we redefined candidate effector families using the B.g. tritici and B.g.
hordei candidate effector genes using the same pipeline as before (with option -I 1.4). Families with only 10% or fewer member per family having a signal peptide were flagged as "weak" effector families as well as families with more than 50% of their members showing homology to repeats (based on CDS and protein blast searches against the databases nrTREP17 and PTREP17).
To study whether the identified candidate effector genes are conserved in more closely related fungal species, we searched for homologs in Botrytis cinerea (a necrotrophic fungus) and Phialocephala subalpina (a globally distributed root endophyte, (Schlegel et al., 2016). Predicted protein sequences of Group 1 and 2, weak candidate effector genes and non-effector genes were used in blastp searches against the predicted proteins of B. cinerea and P. subalpina (accessed 09/19/2018 from ncbi.nlm.nih.gov/genome/), using an E-value cutoff of 10E-6. Despite the low