Albugo candida race diversity, ploidy and host-associated microbes revealed using DNA sequence capture on diseased plants in the field

Agathe Jouet , Diane G. O. Saunders, Mark McMullan , Ben Ward , Oliver Furzer , Florian Jupe , Volkan Cevik, Ingo Hein , Gaetan J. A. Thilliez, Eric Holub , Cock van Oosterhout and Jonathan D. G. Jones The Sainsbury Laboratory, Norwich Research Park, Norwich, NR4 7UH, UK; School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK; John Innes Centre, Norwich Research Park, Norwich, NR4 7UH, UK; The Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, UK; University of North Carolina, Chapel Hill,


Introduction
Localized suppression of plant defence may predispose the host tissue to secondary infection by other strains of the same or of different pathogen species. In 1951, Yarwood (1951) reported localized suppression of nonhost resistance to viruses in leaves that were pre-infected by a compatible basidiomycete rust, and other examples of nonhost resistance suppression to viral, fungal and oomycete pathogens have since demonstrated that this is a common feature of infection by rusts and powdery mildew fungi (Moseman & Greely, 1964;Gill, 1965;Yarwood, 1977;Heath, 1980;Lyngkjaer & Carver, 2000;Olesen et al., 2003).
Suppression of nonhost resistance also has been demonstrated for the oomycete Albugo (Cooper et al., 2008;Belhaj et al., 2017;Prince et al., 2017) and several Brassicaceae species can lose resistance to fungal and oomycete pathogens after preinoculation with Albugo spp. For example, Arabidopsis thaliana can be colonized by Phytophthora infestans (late blight of potato and tomato) and Bremia lactuca (lettuce downy mildew) after pre-inoculation with either Albugo laibachii  or A. candida , and Brassica juncea by Hyaloperonospora arabidopsidis (Arabidopsis downy mildew) after pre-inoculation with A. candida (Cooper et al., 2008).
Suppression of nonhost resistance by Albugo species is likely to be beneficial for some microorganisms Ruhe et al., 2016), but it could also give an evolutionary advantage to the pathogen. For example, A. candida comprises at least 10 physiological races that have specialized on different Brassicaceae species (Hiura, 1930;Pound & Williams, 1963;Hill et al., 1988;Kaur et al., 2008;Meena et al., 2014) including destructive pathogens of oilseed mustard (B. juncea, Race 7), Brassica oleracea (Race 9), B. rapa (Race 2), radish (Raphanus sativus, Race 1), and wild species (Armoracia rusticana, Race 3; Capsella bursa-pastoris, Race 4; Sisymbrium officinale, Race 5 and Rorippa islandica, Race 6). McMullan et al. (2015) used this race variation to demonstrate that defense suppression by a virulent race of A. candida enables subsequent co-colonization by an avirulent race. Suppression of plant defence could therefore provide a means for cohabitation for different physiological races on the same host, and potentially the emergence of new hybrid races as a consequence of inter-race mating. This idea is reinforced by evidence for hybridization between pathogen races or species (Brasier, 2001;Olson & Stenlid, 2002;Ioos et al., 2006;Stukenbrock et al., 2012). More specifically, sexual mating has been demonstrated under laboratory conditions between A. candida Races 2 and 7 (Adhikari et al., 2003); and comparative genomics has provided evidence for historical recombination amongst Races 2, 7 and 9 (McMullan et al., 2015). Because A. candida can impact both wild and cultivated hosts, populations of susceptible, genetically uniform crop varieties growing adjacent to related wild host species may create suitable conditions for co-infection by different races, and the generation and spread of recombinant variants with novel properties.
Development of methods for the rapid identification of (new) pathogen races is therefore imperative for sustainable disease management. With the advent of current DNA sequencing technologies, researchers can now generate high-resolution genotypic data from any organism. However, the quality of the data depends on the quality and purity of the DNA sample, which means that the pathogen under study needs to be in excess compared to the host plant or cultivated axenically, which is not possible for the biotroph A. candida. Recently, the field pathogenomics of wheat yellow rust has been investigated using cDNA sequencing from infected leaves in the field, resulting in c. 37% of reads being mapped to the pathogen reference genome (Hubbard et al., 2015). In addition, complexity reduction methods have been developed for sequencing of specific subfractions of plant genomes (e.g. RenSeq (Jupe et al., 2013), MutChromSeq (S anchez-Mart ın et al., 2016) and exome capture (e.g. in barley, Mascher et al., 2013, and in wheat, (Henry et al., 2014). In this paper, we report a complexity reduction method (Pathogen Enrichment Sequencing or 'PenSeq') that enables capture and sequencing of plant-associated microbial sequences directly from field samples. We apply this to explore race diversity in A. candida and the impact of A. candida on the host leaf microbiome.

Materials and Methods
Design of Pathogen Enrichment Sequencing for the investigation of microbial DNA sequences from field sampled leaves We designed 120mer biotinylated RNA baits based on both mitochondrial genes and nuclear 'housekeeping' genes of 49 microbial species, comprising two rhizaria, five oomycetes, 12 fungi and 30 bacteria with the aim of identifying these species from environmental samples (Supporting Information  Table S1). We expect, however, to be able to identify many more species, because baits only require c. 80% nucleotide identity to hybridize with and capture DNA before sequencing (Jupe et al., 2013). In addition, we designed baits to capture putative RxLR (Rehmany et al., 2005) and CHxC class effectors (Kemen et al., 2011;Links et al., 2011), and other secreted proteins, from oomycetes. In Albugo candida, baits also were designed both to 32 neutrally evolving loci and to a contig of c. 400 kb with the aim of investigating the genetic diversity of wild pathogen populations. Neutrally evolving loci were selected using three independent neutrality tests performed on whole-genome data from seven laboratory isolates: Tajima's D (Tajima, 1989), Fu's Fs (Fu, 1997) and d N /d S (Kimura, 1977 ;  Table S2) whereas the c. 400 kb contig was the longest contiguous sequence that could be assembled from A. candida at the time of the bait library design, using reads from isolate AcNc2 (Race 4, 'contig 1' (in McMullan et al., 2015), 246 genes including 11 with a predicted secretion signal, no putative effectors). AcNc2 was one of three races analysed in a previous study that reported on the genetic diversity of five A. candida isolates. We also targeted the nuclear internal transcribed spacer (ITS) sequence in plants to facilitate host species identification. In total, 18 348 120mer baits were synthesized that anneal specifically to targeted sequences (> 2 Mb) without overlap ( Fig. S1; Table S1 for list of targeted loci). These baits were used to capture DNA from a total of 115 samples, including 91 A. candida isolates (Table S3). DNA samples were barcoded and sequenced on three Hiseq lanes with 150 bp paired-end reads.

Preparation of samples for reconstruction experiments
We set up reconstruction experiments to verify that DNA from targeted loci was enriched and sequenced in control samples, and assess the relationship between read depth and DNA abundance. To do this, two sets of controls were prepared (Table S4). In samples #18 to 22, DNA from Erysiphe cruciferarum, Hyaloperonospora arabidopsidis, Phytophthora infestans, Albugo laibachii, Pseudomonas syringae and A. candida was combined in various ways (see also Methods S1). These DNA samples were isolated either from infected leaves, or, for P. syringae, grown on medium before DNA extraction. We also included a noninfected A. thaliana sample as a negative control (#114). In samples #56 to 59, varying relative molar amounts of DNA from P. syringae and A. candida were mixed so that the amount of DNA from P. syringae increased, and from A. candida decreased. Seven additional samples collected on various host species were used to investigate whether the ITS is sufficient for host species identification (#1, 97, 14, 17, 20, 38 and 40). Reads were mapped to all targets and average read depth as well as the breadth of coverage (the percentage of targeted base pairs with at least 109 read depth) was computed for each control organism (see 'Read alignment and consensus sequences calling').

Read alignment and consensus sequences calling
Reads were aligned to targets using the BWA-MEM algorithm of the Burrow-Wheelers Aligner BWA v.0.7.4 (Li & Durbin, 2009). Read duplicates which may have arisen from amplification of adapter-ligated DNA and post-capture enrichment were discarded using SAMTOOLS v.0.1.19 (rmdup command). BAM files were further processed using ngsutils-0.5.7 to remove reads with > 25% clipped bases and with the highest edit distance to the reference and lowest alignment score in case of multiple alignments (NM and AS tags). Read depth, the number of reads per base, as well as the breadth of coverage, the percentage of bases covered by at least 10 reads, was estimated using the depth command in SAMTOOLS. These statistics were averaged across loci or organisms depending on the analysis performed. Consensus sequences of A. candida loci were generated before nucleotide divergence and phylogenetic analyses. To do this, variants (> 109) were called from the bam file produced by BWA using the MPILEUP command of SAMTOOLS and stored in BCF format. Conversion from BCF to VCF and then FASTQ format was performed using the 'BCFTOOLS VIEW' and 'VCFU-TILS.PL VCF2FQ' commands in SAMTOOLS. A short shell script was written in Linux to convert FASTQ files into FASTA files. We also independently fed aforementioned bam and vcf files to SHAPEIT2 v.2.20 (O'Connell et al., 2014) to obtain phased vcf files that were used to generate neutral and putative effector haplotypes using BCFTOOLS v.1.3.1.
Sequences were finally aligned using the multiple alignment program MAFFT v7.127 (Katoh & Standley, 2013). The reference A. candida isolate used in this study was Nc2 (Race 4; McMullan et al., 2015), except for some effectors that are absent from that particular isolate. There was a total of 6493, 24 249, 202 390 and 398 508 base positions in the final datasets of conserved loci, neutrally evolving loci, putative effectors and the c. 400 kb contig, respectively.

Phylogenetic, nucleotide diversity and recombination analyses
Phylogenies were inferred using RAXML v.7.7.3 (Stamatakis, 2014) with the Generalized Time Reversible (GTR) model of nucleotide substitution, gamma distributed rate variation among sites (GTRGAMMA) and 100 bootstrap replicates. Pairwise nucleotide divergence was evaluated using MEGA v.6.06 (Tamura et al., 2013). It was computed as the number of base differences per site from averaging over all sequence pairs between or within lineages. Ambiguous positions were ignored in each sequence pair. The split network was inferred with SPLITSTREE v.4.14.2 (Huson & Bryant, 2006) using the UNCOR-RECTEDP and the NEIGHBORNET methods and 500 bootstrap replicates. A more in-depth genetic diversity and selection analysis was also performed using DNASP v.5.10.01 (Librado & Rozas, 2009) on neutral and putative effector haplotypes to compute six genetic diversity estimates (number of segregating sites, of haplotypes, of pairwise differences, heterozygosity, pi and theta) and three population genetic statistics (Tajima, 1989;Fu & Li, 1993;Fu, 1997). Finally, recombination analysis was performed using the software HyBRIDCHECK (Ward & van Oosterhout, 2016). Using a sliding window approach, this software scans for sudden changes in nucleotide divergence between sequences and identifies potential recombination events when nucleotide identity is significantly increased. Recombination blocks were then dated using the same software and assuming a strict molecular clock with a mutation rate of 10 À9 mutations/site/generation.

Evaluation of A. candida heterozygosity and ploidy level
Throughout this paper, heterozygosity is expressed as the proportion of heterozygous sites within individuals. Before estimating heterozygosity in A. candida isolates, sites where read depth was < 509 and/or with a Phred-scaled quality (QUAL) score < 100 were first removed from vcf files prepared above using vcflib (https://github.com/vcflib/vcflib#vcflib) and VCFTOOLS v.0.1.10 (Danecek et al., 2011) and isolates where > 10% sites were removed were discarded. Heterozygosity was then estimated for the remaining isolates as the proportion of heterozygous sites at the c. 400 kb contig. The mean and standard deviation of the mean percentage of heterozygous sites per A. candida lineage are reported. In addition, the ploidy level of A. candida was evaluated using the c. 400 kb contig as in Yoshida et al. (2013). The distribution of the read proportion of each single nucleotid polymorphism (SNP) at heterozygous sites was graphed for each isolate using R v.3.1.2. This analysis was repeated using whole-genome data from laboratory isolates AcNc2, AcEm2, AcEx1, Ac2v, Ac7v, AcBoT and AcBoL. In this case, reads were aligned to AcNc2 (McMullan et al., 2015) and to Ac2v (Links et al., 2011) contigs. In all ploidy analyses, heterozygous sites where read depth was higher than 1.5 the average read depth were removed. This is to avoid biases due to mismapping of reads from paralogous sequences.

Microbiome analysis
Albugo candida can infect its host both symptomatically and asymptomatically. Therefore, to compare the microbiome of A. candida-infected and noninfected samples, we used plant samples from which A. candida locus Ev1786 could not be amplified ('healthy' in Table S3, see Methods S1) and symptomatically infected plants. In total, we used 82 infected samples with high read depth at the 400 kb contig and 12 noninfected plants (#62-68, 108-112).
Reads generated for the above samples were fed into KRAKEN (v.0.10.5). First, the proportion of reads classified as Fungi or as Bacteria was averaged and compared between both groups of samples. In addition, the number of operational taxonomic units (OTUs) detected by KRAKEN was compared between the infected and noninfected plants using a Mann-Whitney U-test. Common and rare microbial OTUs on the A. candida-infected plants were identified using a binomial test. In order to correct for an inflated type I error rate due to multiple testing, the critical value was divided by the total number of OTUs identified by KRAKEN, resulting in a Bonferroni corrected a 0 = 1.92 9 10 À5 .

Sequence capture enables detection of defined pathogen sequences from infected leaves in reconstruction experiments
Although some targets were not sequenced in some samples due to presence/absence polymorphisms between the strains used in the bait design and those in the samples, almost all targets were captured and sequenced from control organisms when present (min breadth of coverage = 94.14%, max = 100%; Fig. 1a). Read depth was variable between samples and organisms but in all cases sufficient for downstream analyses (mean read depth (AE SD) = 1003 9 (AE 696), min = 111, Table S4). Additionally, targets were not sequenced in sample #114 where control organisms were absent (mean breadth of coverage (AE SD) = 5.87% (AE 10.7), mean read depth (AE SD) = 79 (AE 10)). In all samples, however, a small proportion of reads were wrongly assigned to control organisms likely due to the presence of closely related species (Table S4). Read misalignment was also observed at the ITS sequences of the hosts which were all fully covered by reads due to high homology between species (Fig. 1b). The highest read depth at the ITS was used to identify the host species.
After verification that targeted loci were captured and sequenced to reasonable depth and coverage, the relationship between input DNA and output reads was examined using samples #56-59. Because read depth is highly correlated with the total number of reads generated per sample (Pearson R 2 = 0.880, P < 0.001), we report on the percentage of reads mapping to each of the target organisms (Fig. 2). We also normalize this against the number of targeted base pairs in A. candida and P. syringae, respectively, to account for differences between them. Linear regression analyses revealed a positive association between the proportion of reads mapping to control organisms and the amount of DNA that was used during library preparation for both A. candida and P. syringae (R 2 = 0.88 and 0.78). Due to the small sample size, P-values were not significant for individual sets of samples (P = 0.06 and 0.12, respectively). However, combining both datasets renders the correlation significant (R 2 = 0.83, P < 0.01). Therefore, although additional samples are needed to generate more robust statistics, Pathogen Enrichment Sequencing (PenSeq) may provide an indication of the relative abundance of pathogen DNA across samples, as well as within samples, providing that the number of targeted base pairs per pathogen is taken into account. Next, we investigated whether PenSeq could be used to investigate both the species diversity and genetic diversity of organisms within field samples.
PenSeq from A. candida-infected leaves reveals correspondence of physiological race structure with phylogenetic lineage based on four sets of loci PenSeq data were used to investigate the genetic diversity of A. candida natural populations. To do this, 85 isolates were analysed that had high read depth at the c. 400 kb contig (mean read depth (AE SD) 479 9 (AE 529); min = 18, max = 2336; Fig. 3; Table S3). Initially, 32 neutrally evolving loci were used to build a phylogeny and measure genetic diversity ( Fig. 4a; Table S2). Two additional phylogenetic trees were built using the c. 400 kb contig and seven conserved loci (mitochondrial and nuclear 'housekeeping' genes), respectively (Figs 4b, S2). Discrimination of A. candida isolates was not possible due to low genetic variation within conserved loci ( Fig. S2; Table S5). However, phylogenies built with the neutral loci and the 400 kb contig identified 17 A. candida lineages (I-XVII in Fig. 4 and throughout the paper, Table S5 for a detailed description; phylogenetic divergence was inferred if observed in both trees). These lineages consist of isolates collected on the same host or closely related hosts, irrespective of country of origin (mean within lineages pdistance = 0-0.058%). Conversely, isolates collected on different hosts formed distinct lineages (mean between lineages pdistance = 0.36-1.40%), with the exception of isolates collected on radish (R. sativus). In accordance with this observation, phylogenetic clustering mostly agreed with physiological race nomenclature with groups X and V corresponding to Races 5 and 9 in both trees, respectively. However, what would be considered as Races 1 (lineages VI and VII), 2 (XIII) and 4 (I) consist of genetically more diverged isolates. In particular, isolates collected on radish (R. sativus) are so diverged that isolate #5 clusters distinctly from other radish isolates in both phylogenetic trees.
A phylogenetic tree was also constructed using presence/absence polymorphisms of putative CHxC and related effectors of A. candida (Fig. S3). Isolates collected on the same or closely related hosts share similar effector candidate repertoires. Thus, as with the two previous trees, phylogenetic clustering based on these sequences mainly agrees with physiological race designation, including Races 1 (lineages VI and VII) and 4 (I). However, Races 2 and 11 were not discriminated, as in the tree built using the neutral loci (lineages XIII and XIV). Genetic diversity at putative effectors was higher compared to other loci, emphasizing the important role that these proteins may have in host adaptation (Table S5).
The relatedness of some lineages to others is similar in the phylogenies in Fig. 4 with for example, the B. juncea virulent races, including the reference genome Race 2 (isolate Ac2v), clustering distantly from most others in both datasets. However, this is not always the case and in Fig. 4(a) (neutral loci), isolates infecting Brassica oleracea (lineage V, Race 9) appear closely related to Capsella-, Camelina-and Arabidopsis-infecting races (I, II and III), whereas they appear more distantly related in the tree created using the 400 kb contig (Fig. 4b). This observation may be explained either by incomplete lineage sorting (the random sorting of ancestral alleles into the descendant host-specific races) or hybridization by secondary contact of isolates from two hostspecific races.

PenSeq enables detection of historical recombination in a 400 kb contig
In order to discriminate between incomplete lineage sorting and hybridization by secondary contact, the software HybridCheck (Ward & van Oosterhout, 2016)

Research
New Phytologist contig to identify and date regions of high nucleotide identity between A. candida races (i.e. putative recombinant regions). If hybridization has occurred recently, after the split between the races, the number of mutations accumulated in the recombinant region will be significantly lower compared to the nucleotide divergence elsewhere in the genome. In total, 159 913 putative recombinant regions of an average of 9569 bp (AE SD = 13425) were detected by HybridCheck. These were dated from 5798 (5-95% CI = 0-6104) to 474 852 (5-95% CI = 383 186-579 694) generations ago, assuming a base mutation rate of 10 À9 New Phytologist mutations/site/generation (Fig. 5). Although the detection of these ancient regions may be explained by incomplete lineage sorting, recombinant regions that were dated to coalesce recently are best explained by recent hybridization followed by recombination, given that only few or even no mutations distinguish isolates from distinct lineages (Fig. S4).
Finally, as bifurcating phylogenetic trees can only poorly describe the evolutionary history of taxa when recombination-like processes are frequent, a neighbour-net network was built using SPLITSTREE v.4.14.2 (Huson & Bryant, 2006) ; Fig. 6). Although based on the c. 400 kb contig alone, the network represents a combination of both of the trees shown above confirming some of the divisions suggested in Fig. 4(a,b). Consistent with the recombination analysis performed with HybridCheck, the network identifies events of reticulate evolution (i.e. the partial merging of ancestor lineages), early (long branches, right) or late (short branches, left) during the divergence of A. candida lineages.
Genotypes in the 400 kb contig reveal between-race variation in rates of clonal and sexual reproduction Although there is evidence for sexual reproduction between A. candida lineages, verifying the prevalence of sexual reproduction within them is rendered more difficult by low withinrace genetic diversity in A. candida (p-distance = 0-0.58%). In McMullan et al. (2015), the percentage of heterozygous sites shared between isolates of the same race was used to estimate clonality in A. candida. The rationale behind this is that the percentage of heterozygous sites shared between two genetically identical isolates of the same lineage only reduces gradually as each isolate accumulates novel mutations (which are mostly in a heterozygous state). By contrast, shared heterozygosity of isolates that reproduce sexually will be reduced by 50% after one generation, and by 75% after two generations of sexual reproduction, assuming diploidy and Mendelian segregation of the alleles.

New Phytologist
Here, we estimated the level of shared heterozygosity between individuals of the same lineage, as defined above. Throughout the text, the level of heterozygosity is expressed as the proportion of heterozygous sites within individuals at the c. 400 kb contig. Although isolates in some lineages had high levels of shared heterozygosity (84.4-97.8%; Fig. 7), others shared few of their heterozygous sites (24.8-55.4%; Fig. 7) which suggests that the importance of sexual reproduction varies between A. candida lineages, and we postulate that a 44.6% and 75.2% loss in shared heterozygosity is consistent with one and two generations of sexual reproduction, respectively. The A. candida races were previously thought to be evolving mainly clonally based on the sequencing of five isolates (McMullan et al., 2015). Although this could be confirmed for isolates collected on B. oleracea (BoT, BoL; Race 9), those from A. thaliana, C. bursa-pastoris and B. juncea (Nc2, Em2; Race 4 and 2v; Race 2) appear to be able to reproduce sexually.

Allele frequency analysis suggests widespread polyploidy in some A. candida races
The level of observed heterozygosity at the c. 400 kb contig also varied between lineages and although heterozygosity was low in most lineages (e.g. 0.05% for C. sativa-specific isolates; Fig. 8), other lineages had intermediate to high levels of heterozygosity (isolates collected on B. juncea (0.14% (AE 0.09)) and B. oleracea (0.65% (AE 0.02)) -R. sativus (1.15% (AE 0.03)), respectively). To better understand this, we investigated the distribution of SNPs at heterozygous sites along the c. 400 kb contig using the method published in Yoshida et al. (2013). The rationale behind this method is that, for diploid organisms such as A. candida, each bi-allelic SNP should account for c. 50% of the reads. Shifted proportions of reads per SNP could either be explained by polyploidy or mixed infections. However, this is unlikely to be a consequence of mixed infections if the same pattern is observed in all isolates within a race, collected at different times and locations.
It was not possible to determine ploidy in 11 samples due to the lack of heterozygous sites in the c. 400 kb contig (Fig. S5), suggesting mechanisms such as loss-of-heterozygosity (Lamour et al., 2012) or high levels of selfing in these isolates. However, heterozygosity was sufficient for analysis in 72 isolates and a diploid pattern was observed for isolates with low levels of heterozygosity (Fig. 8, yellow). Isolates with intermediate and high levels of heterozygosity showed a tetraploid and a triploid pattern, respectively (c. 33 and 67% of reads for each SNP in triploids (red, lineages V and VII; Races 9 and 1) and both 50-50 and 25-75% in putative tetraploids (blue, lineage XIII; Race 2)). This analysis, repeated using whole-genome data of several laboratory isolates (single-spored), could not confirm tetraploidy of isolates collected on B. juncea (Ac2v, Race 2; Fig. S6 using two reference genomes Ac2v (Links et al., 2011) and AcNc2 (McMullan et al., 2015). This may be because of duplicated regions in the 400 kb contig of Ac2v that are not well assembled in the reference genomes. If that is the case, homologous regions would map to the same location, altering the per-SNP read proportion at heterozygous sites in such a way that it resembles that of a tetraploid (i.e. a double diploid). By contrast, triploidy of isolates collected on B. oleracea (AcBoT/AcBoL; lineage V, Race 9) was confirmed, based on both single-spore propagated lab strains, and field isolates.  Table S3 for a detailed description and Table S5 for a description of the dataset per host).

New Phytologist
Population genetic analysis of effector and neutral genes across host-pathogen associations We assessed the impact of host lifestyle on A. candida evolution in more detail. For this, we evaluated the genetic diversity and selection pressure on both the neutral and putative effector loci of A. candida growing on wild hosts, garden escapes and crops. Fig. 9(a) and Table S6 reveal that effectors are significantly more polymorphic than neutral loci, a pattern that is consistent across all summary statistics used to describe genetic variation (Fig. S7). Significant genetic variation was also observed between plant types, and crop-infecting A. candida races are on average significantly more polymorphic than those infecting garden and wild plants (Figs 9a, S7; Table S6). The higher level of polymorphism of these cropinfecting races is consistent with their triploidy and (likely) clonal mode of reproduction, which in turn will result in 'frozen heterozygosity' (Schwarz, 2017).

New Phytologist
Next we analysed the frequency spectrum of segregating sites of effector and neutral genes of A. candida using three population genetic statistics (Tajima's D, Fu & Li's D and Fu's Fs;Tajima, 1989;Fu & Li, 1993;Fu, 1997). These statistics can be used to identify selection as well as changes in the population's demography (Tajima, 1989;Slatkin & Hudson, 1991). By including both neutral and effector genes, we designed our experiment with the aim to delineate the demographic signal (expected both in the neutral and the effector loci) from the signal of diversifying selection (effectors only). Surprisingly, neither Tajima's D nor Fu & Li's D differed significantly between neutral and effector genes (Table S7), and Fu's Fs revealed only a marginally significant difference (P = 0.036, Fig. 9b; Table S7), which suggests that demographic processes outweigh signatures of diversifying selection. This interpretation is further supported by the consistency across the host plants in how both classes of genes deviate from neutrality (Fig. 9b); in 24 of 27 pairwise comparisons between these genes the deviation from neutrality is in the same direction for neutral as it is for effector genes. Furthermore, all three statistics show a highly significant difference between host types, with A. candida of crop plants showing the highest positive values, indicating an excess of intermediate frequency polymorphisms (Fig. 9b; Table S7). Such an excess is consistent with balancing selection (Slatkin & Hudson, 1991), but given that it is observed also at neutral genes, we rule out this explanation. Rather, we propose that this observation is caused by a population bottleneck, which may be the result of resistance-breaking by a small number of genotypes to a genetically relatively uniform crop.

Detection of colonization by other microbes in Albugoinfected field samples
We used baits designed to capture sequences from 12 genes from 42 bacterial and fungal species to investigate the presence of microorganisms on both A. candida-infected and noninfected plants (Table S3). We classified reads from each sample to taxonomic units (OTUs) using the software KRAKEN (Wood & Salzberg, 2014). We could detect and phylogenetically assign microbial reads from both A. candida-infected and noninfected plants (Fig. 10). Interestingly, we found on average more reads from bacteria on plants infected by A. candida compared to noninfected plants (infected: median (1st and 3rd quartile) = 1.695% (0.885-3.030); noninfected: 0.670% (0.153-2.185)), Mann-Whitney U-test; W = 5.74, df = 1, P < 0.017). A similar result was observed with fungi-derived reads (infected: 1.973% (1.205-2.433); noninfected: 0.090% (0.045-0.165), Mann-Whitney Utest; W = 78.0, df = 1, P < 0.00001). In addition, the mean (AE SEM) number of OTUs detected is larger on A. candidainfected plants (infected: 473 (AE 14); noninfected: 358 (AE 51), Mann-Whitney U-test; W = 167.0, P < 0.00001), and some of these OTUs appear to be associated with the presence of A. candida (e.g. Rickettsiales and Chytridiomycetes, Table S8). However, these preliminary results warrant a more focused, quantitative analysis to assess the extent with which A. candida infection elevates the colonization by other microbes in the field.

Discussion
PenSeq enables cost-effective investigation of DNA sequence variation in pathogen genotypes from field samples We report here a sequence capture-based protocol (PenSeq) that enables cost-effective assessment of pathogen genetic diversity in field samples. We designed baits based on specific sequences from c. 50 microbial species, and although we may have introduced biases associated to baits, we anticipated we would recover targeted gene sequences from essentially all likely plant-associated bacteria because only 80% nucleotide identity is required for baits to hybridize with DNA (Jupe et al., 2013). We were able to recover all targets, at high read depth, from control pathogens when present in reconstruction experiments. In addition, we showed that the more abundant the microbe in the sample, the higher the proportion of reads observed. PenSeq thus reveals not only presence-absence differences, but also relative abundance of microbes from field samples. With this in mind, we compared the phyllosphere of A. candida-infected and noninfected plants and found an increase in the abundance and diversity of bacteria and fungi, which supports the hypothesis that A. candida suppresses the host's immune response, impacting microbial communities. The most striking examples are the Rickettsiales (Proteobacteria) and Chytridiomycetes that were detected in all 82 A. candida-infected plants but on none of the 12 noninfected plants. A caveat in this association analysis is that many variables, other than infection by A. candida, may play a role in the presence or absence of microbes. We therefore advise conducting microbiome-focused studies to confirm and quantitatively assess the impact of A. candida infection on the various associated microbes.
During the same experiment, we sequenced a total of c. 660 kb of the A. candida genome from 91 isolates (187 loci including putative effectors). We used these data to identify distinct lineages that had diverged from each other by 0.29-1.15%. This could not be achieved using a limited number of loci  or would have been very expensive and laborious using whole-genome sequencing of purified isolates. These lineages appeared to be mostly specialized on one host species and to be consistent with what has been defined as a pathotype or race in A. candida (Biga, 1955;Pound & Williams, 1963;Verma, 2012). Conceivably, some A. candida lineages can infect closely related host species that were not sampled in this study, as observed for lineage I (Race 4) infecting C. bursa-pastoris, A. thaliana and Eutrema japonica and lineage XII infecting Lunaria annua and Lunaria rediviva.
With PenSeq, we were able to sequence c. 115 samples on three HiSeq lanes, by multiplexing up to 47 samples in a lane. We could investigate not only the genetic diversity in A. candida, but also the presence and abundance of other microbes from samples collected in the field. This method enables inspection of many genetic marker loci and is a good alternative to expensive whole-genome sequencing of microbiota and to PCR-based methods such as internal transcriber spacing (ITS) and 16S sequencing. Potential applications would be to study the coevolution of host resistance genes and pathogen effectors in natural populations or of microorganisms within specific environments. Sequence capture has been used to sequence viral genomes from clinical samples (VirCapSeq; Briese et al., 2015) and in a companion paper, PenSeq was used to investigate putative effector sequences in P. infestans and Phytophthora capsici (Thilliez et al., 2018). Although field pathogenomics can also be very helpful for assigning races to lineages (Hubbard et al., 2015), sequence capture enables the sequences under investigation to be predefined.
A rich history of recombination breakpoints revealed in comparisons of 17 different races Hybridization between A. candida lineages may have enabled adaptation to the numerous hosts on which the pathogen has been reported (Adhikari et al., 2003;McMullan et al., 2015).  6 Neighbour-Net network built using SPLITSTREE v.4.11.3 and based on the c. 400 kb contig (398 508 bp) of 85 Albugo candida isolates (including AcBoL and AcEm2). The scale represents distances estimated using the uncorrected p-distance. Isolates are colour-coded according to the hosts they were collected on and phylogenetic lineages identified in Fig. 4

New Phytologist
Genetic exchange between lineages may occur if they share a common host (Pound & Williams, 1963;Saharan et al., 2014) or if the immune system of the host is compromised by a compatible isolate (Cooper et al., 2008;Belhaj et al., 2017;Prince et al., 2017). By sampling more isolates from different host species, we set out to better understand the role of hybridization in A. candida evolution.
In the present study, we identified many putative recombinant regions between lineages within the c. 400 kb contig. Some of these regions are quite polymorphic and are probably due to incomplete lineage sorting or represent trans-species polymorphism. However, other regions are (nearly) identical between races over long stretches of sequence, suggesting that they arose by recent hybridization. In addition, although the existence of tetraploid races of A. candida is unproven, the high levels of heterozygosity in triploid lineages (B. oleracea and R. sativus; lineages V and VII, Races 9 and 1) suggests that they originated from intercrosses between diploids and tetraploids rather than from the absence of reduction division in either male or female germ cells. This idea is reinforced by the mid-branch placement of triploids in the network shown above (lineages V and VII (Races 9 and 1) mid-branch of lineages IV and VI (Races 7 and 1)). However, no mixed infections were detected which suggests that this scenario is rare. Conceivably, genetic exchanges between lineages that are adapted to diverged hosts (each with their unique sets of resistance genes) would most likely be maladaptive, in comparison with combinations of alleles which have been selected over extended evolutionary timescales (Brasier, 2001).
Polyploidy is a significant contributor to race diversification and evolution Polyploidy in A. candida may have important impacts on both the occurrence of sexual reproduction within and between lineages and the adaptive potential of the lineages. First, polyploid isolates may have reduced fertility or be strictly asexual (especially triploids; Comai, 2005). In triploid isolates from B. oleracea (V, race 9) and R. sativus (VII, race 1), heterozygous sites are mostly shared between isolates, suggesting strict clonal reproduction. The hypothesis of asexuality in these lineages is reinforced by the lack of observable oospores from isolates collected on B. oleracea and propagated in the laboratory (AcBoT and AcBoL; V. Cevik, pers. comm.; McMullan et al., 2015). Second, polyploid isolates may not be able to hybridize with other lineages (Pannell et al., 2004;K€ ohler et al., 2010). This may, in the long term, lead to speciation of the polyploid lineages. Finally, polyploidy may allow for relaxed selection pressure and therefore increased mutation rates at the duplicated genome(s) which may lead to the acquisition of novel gene functions (Comai, 2005;Madlung, 2012). It may also increase vigour (increased biomass, growth rate) in comparison to diploid races. Interestingly, polyploid races identified during this work had previously been classified as A. candida macrospora due to large sporangia, compared to those observed in A. candida microspora (e.g. isolates collected on C. bursa-pastoris or Sisymbrium spp.; Biga, 1955;Pound & Williams, 1963). It would be interesting to test for polyploidy in other lineages classified as macrospora (other Brassica spp. or Erucastrum spp.) and for an increased sporangia/zoospore resistance or survival in the polyploids.
In A. candida, polyploidy appears to have been selected for only in cultivated host species. This is interesting because polyploidy may allow for (and indeed impose a need for) rapid clonal expansion of an adapted race on the relatively uniform genotypes of cultivated host populations. We indeed observe this with cropinfecting (mostly polyploid) A. candida races showing a relative excess of intermediate frequency variants, both at neutral as well as effector genes. Conceivably, this observation is caused by a genetic bottleneck resulting from resistance-breaking by a small number of genotypes during the widespread cultivation of Fig. 7 Mean percentage of shared heterozygous sites (y-axis) per Albugo candida phylogenetic cluster (x-axis). The number of isolates used in each group is provided at the top of each bar (races with less than two sampled isolates were discarded from the analysis). Left of the dashed line are races sharing most of their heterozygous sites (clonal, > 84%) and right are races that share few of their heterozygous sites (sexual, < 55%). Standard deviation is also shown. It is larger for races that reproduce mainly sexually suggesting stochasticity in segregation. Brassica crops in Northern Europe over the last 2000 yr (Maggioni et al., 2000). In addition, asexuality associated with polyploidy may be beneficial as sexual reproduction would break up combinations of alleles that are adapted to the host genotypes. It also can result in a phenomenon known as 'frozen heterozygosity' (Schwarz, 2017), in which the polymorphism contained within genetic loci becomes fixed in the asexual lineage, which enables it to resist the eroding effects of genetic drift. By contrast, clonal reproduction may be less advantageous for the pathogen in a coevolutionary arms race with host species that are genetically Fig. 8 Mean heterozygosity of Albugo candida phylogenetic lineages at the c. 400 kb contig (398 508 bp) along with representative ploidy graphs for races with low (yellow), intermediate (blue) and high levels of heterozygosity (red) based on that same contig. Heterozygosity is expressed as the percentage of observed heterozygous sites. Median (black solid line) and mean (red solid line) heterozygosity are provided. In the ploidy graphs, the x-axis represents the proportion of reads per single nucleotide polymorphism (SNP) at heterozygous positions and the y-axis is the count of heterozygous sites.
(b) (a) Fig. 9 Genetic diversity and selection at neutral and putative effector loci of Albugo candida races infecting wild plants, garden escapes and crops. Shown are the mean and SD of (a) The number of segregating sites S and (b) Tajima's D, both computed using DNAsp v5.10.01. Other summary statistics for genetic diversity (the number of haplotypes (H), heterozygosity (H t ), the average number of nucleotide differences, and the mutation-drift parameter theta (Θ)) and selection (Fu & Li, 1993;Fu, 1997)

Research
New Phytologist diverse, such as, for example, pathogens infecting wild populations of C. bursa-pastoris or S. officinale.
In summary, we have established PenSeq, a cost-effective sequence capture-based method to investigate genetic diversity on pathogens while they are colonizing their hosts. We used this method to substantially enhance our understanding of A. candida race diversity, and to reveal associated leaf microbiota in infected leaves. We believe that this method has broad potential applications ranging from investigating obligate biotrophic pathogens such as rusts, downy mildews and powdery mildews, to diagnostics of symptomless pathogens and even to investigating mammalian diseases, such as genotyping parasites in blood samples.

Fig. S8
Analyses of the frequency spectrum of segregating sites of effector and neutral genes of Albugo candida using two population genetic statistics; Fu & Li (1993) and Fu (1997) (mean and SD).
Table S1 Details about species and loci to which oligos were designed.

Table S3
Detailed list of the samples sequenced using PenSeq.

Table S4
Breadth of coverage and read depth at loci of control organisms used in the reconstruction experiment.

Table S5
Within-and between-lineage p-distance (in %) at the four sets of loci sequenced in Albugo candida.

Table S6
General linear model (GLM) with summary statistics for genetic diversity of Albugo candida genes as response variable, and plant species nested within plant type (i.e. wild, garden or crop) and gene type (effector or neutral gene) as fixed factors.

Table S7
General linear model (GLM) with summary statistics for selection of Albugo candida genes as response variable, and plant species nested within plant type (i.e. wild, garden or crop) and gene type (effector or neutral gene) as fixed factors. Methods S1 Detailed description of the methodology used in this paper and including sample collection, control preparation, DNA extraction, library preparation, DNA capture and enrichment.
Please note: Wiley Blackwell are not responsible for the content or functionality of any Supporting Information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office.