Bridging the micro- and macroevolutionary levels in phylogenomics: Hyb-Seq solves relationships from populations to species and above.

Reconstructing phylogenetic relationships at the micro- and macroevoutionary levels within the same tree is problematic because of the need to use different data types and analytical frameworks. We test the power of target enrichment to provide phylogenetic resolution based on DNA sequences from above species to within populations, using a large herbarium sampling and Euphorbia balsamifera (Euphorbiaceae) as a case study. Target enrichment with custom probes was combined with genome skimming (Hyb-Seq) to sequence 431 low-copy nuclear genes and partial plastome DNA. We used supermatrix, multispecies-coalescent approaches, and Bayesian dating to estimate phylogenetic relationships and divergence times. Euphorbia balsamifera, with a disjunct Rand Flora-type distribution at opposite sides of Africa, comprises three well-supported subspecies: western Sahelian sepium is sister to eastern African-southern Arabian adenensis and Macaronesian-southwest Moroccan balsamifera. Lineage divergence times support Late Miocene to Pleistocene diversification and climate-driven vicariance to explain the Rand Flora pattern. We show that probes designed using genomic resources from taxa not directly related to the focal group are effective in providing phylogenetic resolution at deep and shallow evolutionary levels. Low capture efficiency in herbarium samples increased the proportion of missing data but did not bias estimation of phylogenetic relationships or branch lengths.


Introduction
Evolutionary biologists, in their efforts to determine which factors govern biodiversity dynamics, have used two approaches that differ primarily in the time frame in which they operate: microevolutionary processes (genetic drift, mutation, migration) act mostly on individuals within populations (recent time), while macroevolutionary processes (speciation, extinction, dispersal) focus on diversification at and above species level in relation to environments and over longer timescales (Benton, 1995;Reznick & Ricklefs, 2009). Species experience population size changes and range shifts in response to climatic oscillations, structuring their gene pools across their geographic ranges; these processes are often addressed through population-genetic or demographic studies (Hewitt, 2000(Hewitt, , 2004Davis & Shaw, 2001;Parmesan, 2006). However, over longer timescales (hundreds of thousands to millions of yr), the signature of these microevolutionary processes can become saturated, in which case phylogenies addressing the order and timing of diversification events provide more information on the evolutionary fate of lineages (Svenning et al., 2015). Although micro-and macroevolution operate over different geographic and temporal scales, stochastic processes have been shown to leave their imprint on deep phylogenetic histories (Oliver, 2013).
Bridging the micro-and macroevolutionary scales is difficult because the types of molecular data, sampling schemes, and phylogenetic models that are often employed differ when analysing population and species-level relationships. Microevolutionary studies typically use multiple individuals per population and rely on repeated DNA or polymorphic molecular markers (e.g. simple sequence repeat, amplified fragment length polymorphism), as they, in comparison to DNA sequence data, provide more variability at the population level and can be used to detect recent admixture (Guichoux et al., 2011). Yet these types of markers are not readily analysable using the standard molecular substitution models employed in phylogenetics based on DNA sequences (e.g. Tavar e, 1986); instead, approximate models (e.g. Luo et al., 2007) or still debated statistical models are employed (Hobolth et al., 2008;Wu & Drummond, 2011). Conversely, macroevolutionary studies often rely on DNA sequences from only one individual per species, but the use of a single or few genetic regions is usually not enough to obtain well-resolved and supported phylogenies (e.g. Pelser et al., 2007). Furthermore, these analyses are limited to parts of the genome that may contain conflicting signals, leading to spurious phylogenetic relationships (Shen et al., 2017).
High-throughput sequencing (HTS) provides an avenue for bridging the micro-and macroevolutionary gap in phylogenetics by scaling up the number of loci and individuals within populations and across species that can be sequenced at a reasonable cost (Barrett et al., 2016). Target sequencing has gained popularity in recent years because, unlike whole-genome sequencing (WGS), which requires fresh material (but see Dentinger et al., 2016), this technique can work with both fresh and old museum material (e.g. Lemmon & Lemmon, 2013;Barrett et al., 2016) and generally demands less complex bioinformatics (e.g. F er & Schmickl, 2018). Reduced representation techniques such as restriction site-associated DNA sequencing (RADseq, Baird et al., 2008) or genotype-by-sequencing (Elshire et al., 2011) work well for population-level analysis and do not require a reference genome. Yet, their outputs are single nucleotide polymorphisms (SNPs) or small reads from anonymous loci, which make assessment of gene orthology challenging in the case of deep divergences (e.g. Rubin et al., 2012; although see Hipp et al., 2018). A modification of target sequencing, Hyb-Seq, that is, hybrid enrichment with genome skimming (Weitemier et al., 2014), is promising for nonmodel organisms because it generates thousands of DNA sequences from low or single-copy nuclear genes (LCNGs), combining exon capture with genome skimming of intronic and intergenic regions, flanking the targeted exon regions. Hyb-Seq also generates highly repetitive DNA from organellar genomes as a by-product; the latter is important to detect reticulate evolution and introgression.
Here, we examine the efficiency of Hyb-Seq to provide phylogenetic resolution at deep and shallow evolutionary levels using fine-scale within-population sampling from both silica-dried tissue and old herbarium material to solve multilevel relationships. Fragmented DNA from museum samples can lead to a potential bias in phylogenomic analyses (Sayyari et al., 2017); however, only few studies have tested this effect within species based on a limited number of herbarium samples (e.g. Hart et al., 2016). As a case study, we use Euphorbia balsamifera Aiton (the sweet tabaiba, Euphorbiaceae; Fig. 1), a taxon exhibiting a deep intraspecific divergence (c. 3.8 million yr ago (Ma); Peirson et al., 2013;Pokorny et al., 2015) and a disjunct distribution spanning thousands of kilometres on opposite sides of Africa, the so-called Rand Flora pattern (Christ, 1892;Andrus et al., 2004;Sanmart ın et al., 2010). In this biogeographic pattern, unrelated plant lineages show similar disjunct distributions with sister taxa in distantly located regions along African continental margins and adjacent islands, such as Macaronesia-northwest Africa, western Africa mountains, Horn of Africa-South Arabia, eastern Africa and southern Africa. Euphorbia balsamifera is a diploid species that belongs to section Balsamis Webb & Berthelot within subgenus Athymalus Neck. ex Rchb. Its current taxonomy (see Supporting Information Notes S1) recognizes two subspecies: balsamifera, distributed in all major islands in the Canaries, the northwestern Atlantic coast of Africa and western Sahel; and adenensis (Deflers) P.R.O. Bally, occurring in eastern Africa, southern Arabia and the Socotra Archipelago (Govaerts et al., 2000;Peirson et al., 2013). There is a third taxon, ssp. sepium N.E.Br. (Molero et al., 2002), currently synonymized under ssp. balsamifera, which has been applied to all the populations of western Sahel and whose taxonomic status and phylogenetic position have never been addressed using molecular data (Bruyns et al., 2011;Peirson et al., 2013). Difficulties in obtaining fresh material from remote and now politically unstable countries across eastern Africa and the Middle East have restricted phylogenetic studies of Euphorbia from those areas to multicopy DNA regions like nuclear internal transcriber spacers (ITS) and plastid markers (Bruyns et al., 2011;Peirson et al., 2013;Fig. S1); these might be easily Sanger-sequenced from herbarium specimens.
In this phylogenomic study, the first within Euphorbiaceae, we designed probes to capture 431 orthologous low-copy nuclear loci (c. 709 kbp) using a significant amount of herbarium material. Our specific aims were to test the utility of probes based on genomic resources from taxa not directly related to the focal group in providing phylogenetic resolution from within populations and species (E. balsamifera) to above-species level (within section Balsamis and subgenus Athymalus); to assess the effect of extensive use of highly degraded DNA from herbarium material on capture success and potential bias in phylogenomic inference at different evolutionary scales; to test the ability of our probes to obtain off-target chloroplast sequences; and to test the monophyly of E. balsamifera ssp. sepium and its phylogenetic relationship within E. balsamifera and estimate lineage divergence times.

Sampling
We started with 165 samples, of which 121 were successfully amplified (Table S1). In the final dataset, Euphorbia balsamifera s.l. is represented by 40 populations (i.e. localities; Table S1, Fig. 1g,h). Seven populations of the Canarian ssp. balsamifera included at least eight individuals, and one (Telde, IS462) came from a single individual. Also several populations were represented by a single individual: ssp. balsamifera from W Africa (three), ssp. adenensis (19) and ssp. sepium (10). Our dataset also included at least one sample from other representatives within section Balsamis (E. larica Boiss., E. masirahensis Ghaz. and E. noxia Pax), and subgenus Athymalus (E. antso Denis, sect.  (g, h) Geographic distribution of Euphorbia balsamifera in the Canary Islands, North Africa, and Arabian Peninsula based on herbarium specimens: ssp. balsamifera (circles), ssp. sepium (triangles), and ssp. adenensis (diamonds); (h) Detailed distribution of ssp. balsamifera in the Canary Islands. Fuchsia, green and orange symbols represent sampled DNA specimens of each taxon included in the phylogeny shown in Fig. 3 (Table S1).

Probe design and sequence capture
As genomic resources for the design of probes, we used the transcriptomes of two Euphorbia species from subgenera Chamaesyce and Esula (RHAU and PXYR, E. mesembryanthemifolia Jacq. and E. pekinensis Rupr., respectively)available through the 1KP initiative (www.onekp.com/public_data.html)and Ricinus communis L. (Euphorbiaceae) as the reference genome (v0.1, available at www.phytozome.net; Chan et al., 2010) to estimate intron/exon boundaries. MARKERMINER v.1.0 (Chamala et al., 2015) was employed to identify LCNGs' orthologous loci, which were used to develop the gene target probes. Probes were designed for 431 orthologous LCNGs (Table S2) ranging from 639 to 6774 bp with at least one exonic fragment per LCNGs longer than 500 bp, representing a total length of c. 709 kbp. Arbor Biosciences (Ann Arbor, Michigan, USA) manufactured a target enrichment kit with in-solution biotinylated probes. The 120-mer probes (20 002 total baits) were tiled at 29 density for the sequences of the two Euphorbia transcriptomes and 1.59 for the sequences of R. communis.
Genomic DNA (extraction protocols in Methods S1) from silica-dried samples and recent herbarium collections, which had fragment sizes > 550 bp, were sonicated to a target fragment size of 550 bp using a Covaris E220 Focused-ultrasonicator (Wohurn, MA, USA). The remaining samples for which the average fragment size was < 550 bp were not sonicated and thus resulted in libraries with insert size < 550 bp. Sequencing libraries were prepared using the Illumina TruSeq Nano HT DNA Kit (Illumina Inc., San Diego, CA, USA). Indexed samples were pooled in approximately equal quantities (typically 16-22 samples per equimolar 1000 ng pool, when possible). Each pool was enriched using the custom baits kit following the manufacturer's protocol, at 60-65°C for 16-24 h. Pooling of samples was phylogenetically informed; we avoided including in the same pool intra-and interspecific sampling, for example, other taxa and E. balsamifera specimens, to prevent a large number of baits being sequestered by ingroup samples. Enriched products were PCRamplified for 14 cycles and purified using the QIAquick PCR purification kit (Qiagen). All sequencing took place on an Illumina MiSeq at The Field Museum of Natural History (Chicago, IL, USA). The 96 silica-dried samples (five pools) were sequenced on two 2 9 300 bp runs (600 cycle v3). The 48 herbarium samples (two pools) were sequenced on one 2 9 75 bp (150 cyle v3) run; as a result of low enrichment, these pools were re-enriched following the myBaits ® protocol using previously enriched product as the input. The double-enriched products were then sequenced on another 2 9 75 bp run. To ensure recovery of chloroplast sequences, we added 10% unenriched library to all sequencing runs. Further details can be found in Methods S1.

Data processing and phylogenetic analysis of targeted nuclear loci
Demultiplexed sequences were quality-filtered using TRIMMO-MATIC (Bolger et al., 2014) to remove adapter sequences. We also removed poorly aligned regions from alignments using TRIMAL v.1.2 (removing all columns with gaps in > 50% of the sequences or with a similarity score < 0.001, unless this removes > 40% of the columns in the original alignment; Capella-Guti errez et al., 2009). The HYBPIPER pipeline (v.1.0; Johnson et al., 2016) was used to assemble loci. Summary statistics were obtained using SAMTOOLS v.1.8 (Li et al., 2009). Orthologous sequences from 428 nuclear loci containing only exons were aligned using MAFFT v.7.222 (Katoh & Standley, 2013). We evaluated gene capture success as the percentage of summed captured length of all target loci per individual divided by the summed mean length of all reference loci. We considered an alignment to be poor quality if the aligned pairwise identity was < 65.5% and the percentage of identical sites was < 15% (Table S3). This resulted in 132 loci removed from the final dataset and subsequent analyses. Therefore, the final dataset included exons from 296 loci (486 878 bp, 121 samples).
All 296 exon matrices were concatenated into a supermatrix and a phylogenetic tree was built by maximum likelihood (ML) after automatic model selection using ModelFinder (Kalyaanamoorthy et al., 2017) in IQ-TREE v.1.4.2  (1000 ultrafast bootstraps '-bb', '-m TEST') and RAXML v.8 2.9 using the GTRCAT model with 200 fast bootstraps followed by slow ML optimization (default '-fa' search; Stamatakis, 2014). Alternatively, we used methods that implement the multispecies coalescent models (MSC), where individual genes are allowed to evolve within the species tree under independent tree topologies to account for gene tree discordance. In particular, we used ASTRAL-II v.2.4.7.7  with default parameters to estimate a species tree from the individual gene trees by maximizing the number of quartet trees (sets of four species) shared between gene trees and the species tree (Chou et al., 2015;. This method is not based on parameter estimation and thus is efficient with large-genomic datasets; besides it can generate species trees that are statistically consistent with the MSC model. New alignments and ML phylogenies were created for each of the 296 exon matrices with Practical Alignment using SAT e and Transitivity (PASTA; . We ran PASTA with default parameters and different options for alignment: MAFFT or MUSCLE (Edgar, 2004); merging: OPAL (Wheeler & Kececioglu, 2007) or MUSCLE; and ML tree estimation: FASTTREE (Price et al., 2010) or RAXML. The resulting ML trees were used to infer the coalescence-based species phylogeny in ASTRAL-II with local posterior probabilities estimated to provide support for relationships. Because of concerns with the effect of alignment trimming on phylogenetic estimation (Tan et al., 2015), we ran our analyses with and without trimmed sites.
Additionally, we used SVDQUARTETS to infer a species tree under the coalescent framework (Chifman & Kubatko, 2014. Unlike ASTRAL-II, SVDQUARTETS does not require a priori inference of individual gene trees. Instead, quartet trees are evaluated using an algebraic approximation (i.e. the expected rank of the flattened rate matrix under MSC) and combined into a species tree using a supertree approach. Originally designed for SNPs, SVDQUARTETS has been shown to perform well on multilocus datasets even though it violates the assumption that sites are independent (Chifman & Kubatko, 2015). We used the concatenated 296-exon supermatrix as input with the option evaluating 100 000 random quartets or all possible quartets if lower than 100 000. Clade support was assessed by running 100 bootstrap replicates, using the MSC model. The analysis was performed in PAUP* 4.0a146 (Swofford, 2002). Both ASTRAL-II and SVDQUAR-TETS were run with samples unassigned ('blind') or assigned to the lineages identified in the 'blind' analyses: sepium, adenensis, and within balsamifera (Western Sahara, Gran Canaria, East Tenerife, West Tenerife, and La Gomera).
To explore the effect of more than one single copy per locus, we used HYBPIPER scripts to generate a list of potential paralogues for each sample and locus (Table S4). If a gene is identified as a paralogue in several samples, that gene is a candidate for potential duplication and paralogy; if only in one sample, it is probably a different allele; if a sample contains multiple paralogues, that sample is a potential polyploid. Thus, we ran different analyses in ASTRAL-II by sequentially removing loci with paralogues in at least one sample, two or more samples, or > 10 samples, and compared the resulting topologies with the full dataset. We also evaluated gene tree-species tree discordance by computing the level of support/conflict provided by each of the 296 gene trees used in the analyses for bipartitions shown in the ASTRAL-II species tree, as well as for other alternative bipartitions. We followed the procedure described in Smith et al. (2015) and used Matt Johnson's scripts (https://github.c om/mossmatters/MJPythonNotebooks/blob/master/PhyParts_ PieCharts.ipynb) to visualize the results. This procedure allowed us to evaluate how many genes support or conflict with individual bipartitions within the species tree, that is, if there is a dominant tree topology in the gene trees or, if there is conflict, whether this stems from an alternative tree topology or from many low-frequency alternative gene topologies or lack of support for conflicting bipartitions (Smith et al., 2015).
Finally, we also analysed introns and supercontig (exon + intron) matrices generated by HYBPIPER. As before, we considered an alignment to be poor quality if the percentages of identical sites were < 40% and < 25%, respectively. This resulted in matrices of 15 out of 404 introns and 112 out of 424 supercontigs (Tables S5, S6). Phylogenetic trees using concatenated introns (97 samples, 16 817 bp) and supercontigs (117 samples, 347 878 bp) matrices were analysed with IQ-TREE, with same settings as earlier.

Data processing and phylogenetic analysis of chloroplast (skimmed) data
To compare the nuclear and plastid phylogenetic signals, we recovered plastid DNA using the annotated plastome of R. communis (NC_016736) and transferring its annotations to a draft plastome of Euphorbia esula (Horvath et al., 2018), if similarity was 95% between the two plastomes, using the transfer annotations function in GENEIOUS v.9.1.7 (http://www.geneious. com, Kearse et al., 2012). Subsequently, we extracted coding sequence (CDS) regions from each gene (86 exons in total) and used HYBPIPER with the default parameters to extract exons sequences. We only included 10 samples of the three subspecies (ssp. balsamifera, adenensis and sepium). For ssp. balsamifera, samples were merged by island (i.e. Gran Canaria, Tenerife and La Gomera); for the other subspecies, samples were merged by country (Table S7). Admittedly, this strategy was not optimal but it was motivated by the low coverage recovered by sample (Table S8). Also, our main interest with this analysis was to test if the plastid genome supported the same clades, corresponding to the three subspecies, recovered in all the analyses of nuclear loci. We obtained 66 exon matrices that were aligned with MAFFT and corrected manually following a similarity criterion (Simmons, 2004). Then, they were concatenated into a supermatrix (63 133 bp) that included only seven samples because the sequence length in three samples did not reach 5% of the total length in the concatenated matrix. The latter was analysed with RAxML applying GTRCAT and 200 fast bootstraps followed by slow ML optimization (default '-fa' search).

Divergence time estimation
Divergence times were estimated in BEAST v.1.8 (Drummond & Rambaut, 2007) using the nuclear exon supermatrix (296 loci). Analyses were run in a reduced dataset that included all the target species, and two samples for each of the three clades (subspecies) within E. balsamifera (Methods S1). The dataset was run unpartitioned under the best-fitting substitution model estimated in IQ-TREE (GTR+I). A birth-death process with incomplete taxon sampling (Stadler, 2009) was used as the tree-growth prior. We forced the monophyly of some clades to conform to accepted phylogenetic relationships among the Athymalus taxa (Peirson et al., 2013; Methods S1). We did the same for ssp. balsamifera, sepium and adenensis, as these taxa were recovered as monophyletic groups in all phylogenomic analyses of the nuclear and chloroplast data (see later). We estimated divergence times under the strict clock (SC) model and two Bayesian relaxed clocks: the uncorrelated lognormal clock (UCLD; Drummond et al., 2003) and a random local clock (RLC; Drummond & Suchard, 2010). Final analyses comprised Markov chain Monte Carlo (MCMC) searches run for 400 million generations, with samples logged every 40 000 th generation. The root node and the next basal nodethat is, the divergence from the rest of the tree of E. antso and E. hadramautica, respectivelywere constrained using two secondary calibration points from Horn et al. (2014). They were assigned normal distribution priors (Ho & Phillips, 2009) (Rambaut, 2009) were used, respectively, to generate and visualize the resulting maximum clade credibility (MCC) tree.
To date population divergence events within Canarian E. balsamifera (the best sampled in our dataset), we used a reduced exon supermatrix containing one accession of ssp. adenensis and all (80) accessions of ssp. balsamifera. All priors were set as described earlier, except for constant coalescent as tree prior. The divergence of ssp. balsamifera from ssp. adenensis was constrained with a normal distribution prior (4.5 AE 1.05 Ma) from the species-level analysis described earlier.
The raw reads were deposited in GenBank under BioProject PRJNA415769. Draft assemblies, bait sequences and full-length targets are archived in DIGITAL-CSIC (https://digital.csic.es); BEAST xml files are provided in Methods S2.

Gene-capture success
The baits designed based on genomic resources from R. communis and two species of Euphorbia belonging to subgenera other than Athymalus were effective in capturing the target genes across broad evolutionary scales: among sections of Athymalus, within section Balsamis, and among subspecies and populations of balsamifera (Table S9). The average number of reads per sample obtained was 1 036 822 and the percentage of mapped reads per sample was 48.61%, ranging from 7 to 69% (Table S10). In general, capture success was much higher for silica-dried material than for herbarium material (Table 1; Fig. 2). Average success was 73% and varied across taxa. Gene-capture success was almost complete for ssp. balsamifera (99%) with the exception of three herbarium samples (IS401, IS459, IS460) from southwest Morocco and Western Sahara (Table S9). Other taxa also had a relatively high gene-capture success (86%) in contrast to ssp. adenensis (43%) and sepium (45%) that were represented almost entirely by herbarium samples (Table S9). There was a significant correlation between age of herbarium material (which ranged between 3 and 126 yr) and percentage of success (P < 0.01; Fig. 2). However, the plot of residuals shows that this correlation does not hold for all age groups: it is not significant for specimens within the time interval 1950-2002 (P > 0.5).
Phylogenetic estimation of gene tree-species tree from nuclear data All our analyses solved evolutionary relationships within the subgenus Athymalus as well as within E. balsamifera. Maximum likelihood analysis (IQ-TREE) of the 296-exon dataset recovered a well-supported phylogeny with values of bootstrap supports (BS) of 100 for backbone and main clade relationships, except for the position of E. noxia (BS = 94), which is included in a phylogeny here for the first time. Within E. balsamifera, the three subspecies were recovered in a highly supported clade. Subspecies sepium was sister to ssp. adenensis and balsamifera (Fig. 3a). A clade comprising E. masirahensis, E. larica and E. noxia was inferred as sister to E. balsamifera, with E. davyi as sister to them. Successive sister clades are two other clades: E. socotrana-E. scheffleri and E. smithii-E. matabelensis; E. hadramautica is sister to the aforementioned clades (Fig. 3a), with E. antso at the root of the tree.
At the population level, specimens of ssp. balsamifera were grouped into several highly supported clades (> 95 BS) that seem to follow geographic location. Specimens from Gran Canaria and eastern Tenerife were grouped into two clades that were sister to each other (Fig. 3a). Sequences from western Tenerife were divided into two subclades (BS = 100 and 85) embedded within a larger La Gomera clade. The three samples from southwest Morocco (IS 401) and Western Sahara (IS459, IS460) were grouped together, nested within a clade containing samples of eastern Tenerife. The clades of ssp. adenensis and sepium had substantially longer branches than the ssp. balsamifera clade, and in contrast to it, they lacked clear geographic structure (Fig. 3a). Similar tree topologies and relative branch lengths were recovered by the RAxML analysis (Fig. S2) and IQ-TREE, with or without trimmed low-quality sites (results not shown).
To test whether the observed differences in branch length within E. balsamifera were an artefact of low capture-efficiency in herbarium samples, as a result of an excess of short DNA sequences in ssp. adenensis and sepium, we repeated the ML analysis using a subset of 18 selected loci (the ones having the shortest sequences) and excluding sequences with < 25% of the target length. Each gene was aligned separately with MAFFT and trimmed with trimAL to retain only sites with < 25% missing data for each gene. The exon sequences were combined in a supermatrix and analysed with RAxML. The resulting tree (Fig. S3) shows the same pattern of branch length and relationships as in the previous analyses (Figs 3a, S2). We also estimated the ASTRAL-II tree (Fig. S4), which shows the same topology as Fig. 3(b).
Multispecies coalescent methods recovered relationships consistent with the ML (IQ-TREE, RAXML) concatenated trees. Both ASTRAL-II and SVDQUARTETS generated species trees in which a strongly supported monophyletic ssp. sepium was sister to ssp. adenensis and balsamifera, which were in turn reciprocally monophyletic (Figs 3b, S5, S6). Backbone relationships among Athymalus species were similar to those in the IQ-TREE (Fig. 3a); the only difference within sect. Balsamis was the position of E. noxia, which appeared as sister to E. balsamifera in ASTRAL-II ( Fig. 3b) but formed a clade with E. masirahensis and E. larica in the IQ-TREE and SVDQUARTETS (Figs 3a, S5, S6). Topologies obtained in ASTRAL-II (Fig. S7) using a diverse array of aligners (MUSCLE vs MAFFT), mergers (MUSCLE vs OPAL), and tree estimation algorithms (FASTTREE vs RAXML) all supported the same relationships among subspecies and species depicted in Fig. 3(b). Within ssp. balsamifera, ASTRAL-II trees showed similar structure to the IQ-TREE (Fig. 3b). The main differences were in the western Tenerife populations, which form a clade sister to La Gomera populations in ASTRAL and SVDQUARTETS (Figs 3b, S6). Similar results were found in ASTRAL (Fig. 3b) and SVDQUARTETS with the blind analyses (results not shown) or when samples were assigned to populations (Figs S8, S5). Another difference concerned the samples from southwest Morocco and Western Sahara. These appear as sister to the remaining ssp. balsamifera from the Canary Islands in the MSC trees (Figs 3b, S6), rather than nested within a Tenerife clade as in the IQ-TREE and RAXML trees (Figs 3a, S2). Also, the position of the southwest Morocco specimen (IS401) varied across the ASTRAL-II trees (Fig. S7e-h). The long branches subtending the northwest African samples IS401 and IS459which presented a < 50% capture success (Fig. 3a) suggest the possibility of long-branch attraction, against which MSC methods are more robust (Liu et al., 2014). To evaluate this, we sequentially removed each of these samples and re-estimated the tree in IQ-TREE (following Bergsten, 2005). Results (Fig. S9) showed no changes in the backbone topology but the position of the problematic samples varied across trees, as expected when long-branch attraction is at play.
A total of 116 loci out of 428 (27%) were warned to contain potentially paralogue sequences (Table S4). Sensitivity analyses in ASTRAL-II (by sequentially removing paralogues from the final 296-loci dataset) recovered similar topologies (Fig. S10) to the full analysis (Fig. 3b), suggesting no significant bias in our results. Comparison of the individual gene tree topologies against the ASTRAL-II species tree (Fig. S11) reveals gene tree concordance for phylogenetic relationships between subspecies and above but gene tree discordance within subspecies of E. balsamifera (see also Fig. 4a). For the latter, the source of conflict in most nodes is a result of incongruence with other alternative, low-frequency bipartitions or lack of support for any bipartition, whereas conflict with an alternative dominant bipartition appears to be low (Fig. S11).
Finally, the phylogenetic tree obtained from the concatenated supercontig matrixexons plus introns (Fig. S12)was congruent with the topology of the concatenated exon matrix, supporting similar relationships among Athymalus taxa and within the E. balsamifera clade (BS = 100), although population structure within ssp. balsamifera lacked support. The phylogenetic tree obtained from the concatenated intron matrix provided little resolution (results not shown), suggesting that the level of missing data within this dataset might be responsible for the lower resolution in the supercontig vs the exon-based analyses (Figs S2, S12).

Analysis of plastid data
We recovered a low number of plastid sequences (Table S14) for E. balsamifera s.l. and none for any of the other taxa. No relationship was found between type of material and capture success: that is, the average percentage of mapped reads in silica samples was 0.09%, whereas for herbarium samples it ranged between 0.05% and 0.53%. The ML trees using the exon concatenated matrix showed strong support for the monophyly of ssp. sepium, adenensis and balsamifera (Fig. 3c).

Divergence time estimation
Divergence time estimates for major clades are shown in Table 2. MCMC runs with the SC and RLC models converged (Table S15) and gave estimates with overlapping confidence intervals ( Table 2). The UCLD model provided younger age estimates (Table S15), but inspection of traces indicated poor mixing and low EES values. The mean and standard deviation parameters of the lognormal clock showed a bimodal distribution, suggesting difficulties in accommodating molecular rate variation within our mixed population/species-level large genomic dataset.
The tree topology from the SC and RLC analyses (Fig. 4a) was congruent with the one recovered by the ASTRAL-II MSC tree (Fig. 3b), except that ssp. balsamifera was not recovered as  (Table 2). Divergence times between ssp. sepium and adenensis-balsamifera were dated in the Late Miocene (SC, 11.57 Ma, 95% HPD 7.68-15.33; RLC, 9.93 Ma; 6.63-13.30, respectively; Table 2), whereas the divergence between ssp. balsamifera and adenensis dates back to the Early Pliocene (SC, 4.93 Ma; 3.29-6.56). Divergence between E. balsamifera and closest relatives masirahensis-larica-noxia-davyi was estimated in the Mid-Miocene (15.01 Ma in SC; 13.44 RLC). Initial divergence within ssp. sepium was estimated as 7.69 Ma (SC) or 4.55 (RLC), while those of ssp. adenensis and balsamifera were estimated as 4.27 Ma (SC; 4.16 in RLC) and 3.45 Ma (SC), respectively ( Table 2). The BEAST coalescent analysis within ssp. balsamifera gave younger estimates for mean population divergences (Fig. 4b), but the 95% HPD credibility intervals overlap with those in Fig. 3(a). The samples from southwest Morocco/Western Sahara branched earlier (as in the ASTRAL-II tree; Fig. 3b), around the Early Pleistocene (2.72 (1.25-4.16); 2.08 (0.86-3.23) Ma), with a very recent split for the Canarian samples (0.26, 0.11-0.30 Ma). The population geographic structure resembled that of the IQ-TREE (Fig. 3a) except that there was no clear split into two major eastern/western clades. Instead, East Tenerife populations appeared sister to the Gran Canaria populations, while those from West Tenerife formed a grade relative to populations from La Gomera (Fig. 3b).

Discussion
Hyb-Seq as a tool to bridge the micro-and macroevolutionary gap in phylogenomics The advent of HTS techniques has opened new avenues to solve phylogenetic relationships within deep and shallow evolutionary radiations (Nicholls et al., 2015;Stephens et al., 2015;Barrett et al., 2016;Gardner et al., 2016;Hamilton et al., 2016;Mitchell et al., 2017). Our studythe first phylogenomic analysis within Euphorbiaceaedemonstrates that our probes combined with target sequencing and genome skimming (Hyb-Seq), can be applied to solve evolutionary relationships from populations to species and above within the same tree in Euphorbia. Even if a proportion of an individual LCNG contained low phylogenetic signal (Table S11), our combined dataset of 296 exons alone provided enough resolution to address evolutionary relationships among species within subgenus Athymalus and section Balsamis, among subspecies within E. balsamifera, and even among and within populations in the latter. The phylogeny inferred from the nuclear genome dataset (Fig. 3a,b) agreed with the topologies found in previous studies with Sanger sequencing (Peirson et al., 2013), but with far better support towards E. balsamifera and its closest relatives. The fact that the output of Hyb-Seq are DNA sequences of targeted known loci instead of SNPs as in RADs allowed us to use the same molecular evolutionary models across phylogenetic levels, from intrapopulation to populations and species, thus effectively bridging the micro-and macroevolutionary gap.
One advantage of Hyb-Seq is the recovery of plastid sequences as a by-product of nuclear enrichment (e.g. Schmickl et al., 2016;Crowl et al., 2017). At shallow phylogenetic levels, comparison of the nuclear and plastid signal is key to detecting any potential conflict attributed to ongoing gene flow and reticulate evolution. Here, even though < 1% of mapped reads corresponded to plastid genes, the plastid genome data recovered was enough to support the presence of three clades (subspecies) within E. balsamifera s.l. in agreement with the nuclear dataset (Fig. 3). Our efforts to recover introns and intergenic regions by genome skimming was met with partial success (Table S12; Fig. S12). Johnson et al. (2016) found a similar result: the analysis of supercontig matrices did not provide improved resolution relative to Table 2 Divergence times of the main clades in Euphorbia balsamifera and its closest relatives in Fig. 4(a) presented as the mean time to the most recent common ancestor and the 95% highest posterior density (HPD) interval obtained from the divergence time analyses performed under the strict clock and the random local clock of the 296-exon supermatrix Clades (Fig. 4a the exon-only datasets. Nonetheless, this should not be considered a failure of our approach (partly explained by our extensive use of herbarium material) as a technique to bridge across phylogenetic levels: despite the low variability of each individual loci, the combined information from all exons provided enough signal to resolve relationships at different phylogenetic depths. Kadlec et al. (2017) compared target-sequencing techniques on the basis of their effectiveness for marker selection, using criteria such as gene length and variability. They concluded that approaches employing universal markers such as ultraconserved elements (Faircloth et al., 2012) or anchored hybrid enrichment (Lemmon et al., 2012) performed worse than those using a set of probes specific to the focal group. Among the latter, they preferred 'custom-made' scripts (Mandel et al., 2015) over automated ones (MarkerMiner, Chamala et al., 2015;Hyb-Seq, Weitemier et al., 2014) because they tend to generate longer, more variable reads. The difference lies in how phylogenetically close to the group of interest are the genomic resources employed in the design of baits. Here, we used the proteome of R. communis available in MarkerMiner (Chamala et al., 2015), in subfamily Acalyphoideae, against two Euphorbia (Euphorbioideae) transcriptomes from subgenera Chamaesyce and Esula. This probably resulted in more conserved, less variable genes (< 10%, Table S11) compared with noncoding nuclear regions such as ITS in Malpighiales (e.g. Euphorbiaceae: 45%, Horn et al., 2012;Hypericaceae: 43%, Meseguer et al., 2013). Indeed, inspection of gene tree discordance (Figs 4a, S11) suggests that most conflict stems from low resolution within each gene and nonoverlapping sample sets (i.e. missing data as a result of capture failure), rather than real conflict among gene trees. While we agree that the use of a fully annotated genome of the group (i.e. Euphorbia) is the right path to generate probes with high variability (Kadlec et al., 2017), this is often out of the reach of small laboratories, both economically and time-wise. Additionally, by selecting a relatively distant proteome, we have generated a set of baits that are effective for gene capture at macroand microevolutionary levels within Euphorbia (Fig. 3a) and potentially across genera within the large angiosperm family Euphorbiaceae (c. 6300 species; Wurdack & Farfan-Rios, 2017). This set of baits is now available for future studies in this family.
Another advantage of target sequencing such as Hyb-Seq over alternative HTS approaches is the possibility of using material from natural history collections ('museomics') which, even if degraded, is crucial when working with old, rare and endangered taxa or with species coming from under-sampled and difficult-to-access regions (e.g. Staats et al., 2013;Bakker et al., 2016). In our study, this is crucial because many Rand Flora taxa occur in what are currently politically unstable countries in Africa (e.g. Somalia) and the Middle East (e.g. Yemen), limiting the possibilities of doing fieldwork to obtain fresh material. Nearly all samples of ssp. sepium and ssp. adenensis, and the nonCanarian samples of ssp. balsamifera, were obtained from herbaria (Tables 1, S9; Figs 2, 3a). Our study demonstrates that Hyb-Seq works well with old collections. Although there is a correlation between age of specimen and capture success ( Fig. 2), we show that even very old specimens (IS401, 1893) and those with very low capture success (e.g. IS435, Table S9) can be placed within their (nominal) clade in the phylogenetic trees (Figs 3a,S2,S5,S8). We also show that this correlation only holds for very old and recent samples, suggesting that for other samples, additional variables such as preservation quality might be more important in capture success. Sayyari et al. (2017) showed that highly fragmentary DNA often results in a surplus of short sequences, which might lead to spurious phylogenetic reconstructions. In particular, these incomplete genes could result in long branches for the affected lineages. In our study, we used a large number of herbarium samples, which typically yield fragmented DNA below 500 bp, resulting in shorter sequences after hybridization compared with those typically obtained from fresh material. We have addressed this issue showing that even after removing these very short sequences (< 25% target gene length) the resulting phylogenetic tree recovered the same topological relationships and similar branch length differences among the three subspecies of E. balsamifera as in the dataset including all sequences (Fig. S4). Thus, the presence of fragmentary DNA sequences might not be necessarily misleading in a phylogenomic study. Our analyses of the influence of paralogues did not reveal a significant bias in tree estimation either (Fig. S10).

Solving relationships within an ancient continental disjunct lineage
Euphorbia balsamifera represents one of the few studied intraspecific examples of the Rand Flora pattern (Pokorny et al., 2015). Our results from the phylogenomic analyses strongly support the monophyly of three subspecies within E. balsamifera: adenensis, balsamifera and sepium. Subspecies sepium, here sequenced for the first time, was recovered as sister to the clade formed by ssp. adenensis and balsamifera, the latter being congruent with the phylogeny of Peirson et al. (2013). The strongly supported reciprocal monophyly by both nuclear and chloroplast genomes, and comparatively long branch lengths subtending the three subspecies (Fig. 3), suggest the need for a taxonomic revision of their taxonomic status (R. Riina et al., unpublished).
The inferred divergence times between subspecies of E. balsamifera and with species in sect. Balsamis are older (ranging from Late Miocene to Early Pliocene) than those obtained in previous studies (Bruyns et al., 2011;Horn et al., 2014;Pokorny et al., 2015), although none of them included ssp. sepium. Aridification of North Africa linked to the formation of the Sahara Desert (Senut et al., 2009) was probably responsible for the isolation of ssp. balsamifera and adenensis during the Early to Mid-Pliocene (4.93; 3.29-6.56 Ma). The much older split of sepium in the Late Miocene (c. 11 Ma), however, predates the formation of the Sahara. This taxon has a wider distribution in North Africa and appears to be adapted to more inland xeric environments compared with ssp. adenensis and balsamifera. The Tortonian (11.6-7.2 Ma) was a period characterized by lower temperatures and wetter environments than the Pliocene, so it is possible that ecological vicariance contributed to the divergence of ssp. sepium New Phytologist (2018)

Research
New Phytologist from the ancestor of adenensis and balsamifera, as has been postulated in other Rand Flora taxa .
Our estimates of lineage divergence times support a recent colonization of the Canary Islands by ssp. balsamifera (crown age c. 0.26 Ma), probably from coastal Moroccan/Western Saharan populations (Fig. 4). This agrees well with the hypothesis of Mairal et al. (2015a) that the Macaronesian component of the Rand Flora originated from a recent dispersal event from a northwestern African population during the Pleistocene glacial cycles when geographic distances became shorter, although other taxa show exceptions to this pattern (Thiv et al., 2010). The recent Canarian radiation stands in contrast with the longer branch lengths and older time estimates for population divergence within ssp. sepium and adenensis (Figs 3a, 4a). Our phylogenetic tests showed that the branch length variation observed between African and Canarian populations of ssp. balsamifera is not an artefact of including incomplete genes with short sequences (Fig. S4). Moreover, this pattern is not unique to E. balsamifera. Rand Flora genera such as Canarina (Campanulaceae, Mairal et al., 2015a), Camptoloma (Scrophulariaceae) and Plocama (Rubiaceae; Sanmart ın et al., 2017) exhibit a similar pattern of shallower population divergences in the Macaronesian taxa compared with those in eastern Africa/southern Arabia.

Conclusions
Our study demonstrates that the baits designed for Euphorbia subgenus Athymalus using genomic and transcriptomic resources from distantly related taxa (i.e. subfamily Acalyphoideae and Euphorbia subgenera Chamaesyce and Esula) are able to solve phylogenetic relationships at shallow and deep levels. We show that inclusion of fragmentary gene sequences in a dataset, a typical effect of working with degraded DNA from natural history collections, does not necessarily bias phylogenomic analyses at inter-and intraspecific levels. This might be crucial for systematic studies involving large-scale spatial sampling or fieldwork in difficult-to-access regions. Finally, we have revealed that E. balsamifera consists of three highly supported clades (ssp. adenensis, balsamifera and sepium) and that estimates of divergence times and phylogeographic structure within ssp. balsamifera in the Canary Islands are consistent with those found in other Rand Flora taxa.

Supporting Information
Additional Supporting Information may be found online in the Supporting Information at the end of the article.                Methods S1 Extended materials and methods.
Methods S2 XML files used for the divergence time estimation analyses under a Bayesian framework in BEAST.
Notes S1 Notes about the taxonomy of Euphorbia balsamifera.