Genetic diversity and domestication of hazelnut (Corylus avellana) in Turkey

Assessing and describing genetic diversity in crop plants is a crucial first step towards their improvement. The European hazelnut, Corylus avellana, is one of the most economically important tree nut crops worldwide. It is primarily produced in Turkey, where rural communities depend on it for their livelihoods. Despite this, we know little about hazelnut’s domestication history and the genetic diversity it holds. We use double digest Restriction-site Associated DNA (ddRAD) sequencing to produce a genome-wide dataset containing wild and domesticated hazelnut. We uncover patterns of population structure and diversity, determine levels of crop-wild gene flow and estimate the timing of key divergence events. We find that genetic clusters of cultivars do not reflect their given names and that there is limited evidence for a reduction in genetic diversity in domesticated individuals. Admixture has likely occurred multiple times between wild and domesticated hazelnut. Domesticates appear to have first diverged from their wild relatives during the Mesolithic. We provide the first genomic assessment of Turkish hazelnut diversity and suggest that it is currently in a partial stage of domestication. Our study provides a platform for further research that will protect this crop from the threats of climate change and an emerging fungal disease.

We used TreeMix to infer patterns of population splitting and mixing from allele frequency data. We calculated allele frequencies for each of the clusters that were identified using DAPC. We sequentially increased the number of migration events from zero to five (m0-m5) and examined changes in likelihood with each event added. We also used the '-se' option to calculate the significance of each migration event. We used two different block sizes (10, 100). We then examined levels of admixture between wild and domesticated clusters using the D statistic (Patterson et al., 2012) [...]

Distinct groups were more easily distinguishable in wild Turkish individuals. We recovered three major groups corresponding to three different areas of collection: Bolu, Giresun and Ordu (Fig. 2). Samples from Giresun and Ordu were each split into two different groups, indicating that there may be some fine-scale genetic structure in these regions. A small number of Giresun individuals fell close to individuals from Ordu, which may point to exchange of DNA between these adjacent regions.

We conducted a DAPC on wild and cultivated individuals together (Fig. 3a) and inferred that six clusters was the optimal number, retaining 13 PCs. Four clusters were made up of cultivated individuals, two of which were markedly different from the others: cluster six contained Italian cultivars (referred to as the Italian cluster) and cluster four contained several individuals of the Turkish cultivar 'Tombul' (Turkish cultivars 2, referred to as the 'Tombul' cluster). The remaining three clusters were tightly grouped. One of these contained mostly wild C. avellana individuals, regardless of their country of origin. Another was made up of Turkish cultivars including many 'Cakildak' and 'Palaz' (Turkish cultivars 3, referred to as the 'Cakildak' cluster). The last cluster of cultivated individuals was a mix of many different strains (Turkish cultivars 1). Although we refer to some clusters by their most prominent cultivar, each also contained a mix of different cultivars. We note that the C. maxima samples included in our analysis fell into clusters with cultivated, rather than wild, individuals. The final cluster (referred to as the C. colurna cluster) contained individuals previously identified as C. colurna as well as some thought to belong to C. avellana cultivars, e.g. 'Anac Orta', as in our phylogenetic network (Fig. 2). We treat all members of this cluster as C. colurna for downstream analyses. We examined the geographic distribution of the clusters (Fig. 3b) and this revealed evidence for an East-West division between cultivated [...]

We performed a similar analysis using the same individuals and fastSTRUCTURE. This revealed that eight clusters (k = 8) best explained the structure in the data. Unlike in the DAPC, wild C. avellana individuals were spread across multiple clusters. Most fell into a single large cluster (coloured red in Fig. 4c), while groups of individuals from Giresun (teal, [...] Fig. 4c) again contained a mix of different cultivars. We then grouped our fastSTRUCTURE results using our DAPC clusters (Fig. 4d). This revealed that all fastSTRUCTURE wild clusters belonged to the single DAPC wild cluster. Individuals belonging to the Turkish cultivars 1 and 'Tombul' clusters were grouped together in fastSTRUCTURE, though most individuals with mixed ancestry were in the former cluster (Fig. 4d). The last major difference between the two analyses was that the 'Cakildak' cluster was split in two in the fastSTRUCTURE analysis (Fig. 4d).

The main purpose of this analysis was to uncover evidence of mixed ancestry in wild and domesticated individuals. We detected little evidence for admixture between the C. colurna group and other groups, except for the individual 'CK1', which was sampled at Royal Botanic Gardens, Kew. This specimen was thought to be a variety of C. colurna but may instead be the product of a cross between C. avellana and C. colurna. We found extensive evidence for admixture among wild and cultivated C. avellana. This was particularly evident in two cultivar clusters (yellow and purple, Fig. 4c). We also recovered evidence of admixture between all cultivated clusters, which may be the result of past crosses between cultivars belonging to different clusters. At the same time, there were many domesticated samples with ancestry assigned to just a single genetic cluster, showing little evidence for past admixture.
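Admixture signals of this kind are tested formally elsewhere in the study with the D statistic (Patterson et al., 2012), where Z scores above 2 are taken as evidence for gene flow. The following is a minimal sketch of how such a test can be computed from per-SNP derived-allele frequencies. It is not the pipeline used in the study: the function names, the sign convention (D > 0 meaning excess allele sharing between P2 and P3), the equal-sized SNP blocks and the unweighted delete-one-block jackknife are all assumptions made for illustration.

```python
from math import sqrt


def d_statistic(p1, p2, p3, outgroup):
    """D for the tree (((P1, P2), P3), O) from per-SNP derived-allele frequencies.

    Sign convention (an assumption of this sketch): D > 0 means that P2
    shares more derived alleles with P3 than P1 does.
    """
    num = 0.0
    den = 0.0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        abba = (1 - a) * b * c * (1 - o)  # derived allele shared by P2 and P3
        baba = a * (1 - b) * c * (1 - o)  # derived allele shared by P1 and P3
        num += abba - baba
        den += abba + baba
    return num / den if den > 0 else 0.0


def jackknife_z(p1, p2, p3, outgroup, block_size):
    """Unweighted delete-one-block jackknife; returns (D, standard error, Z)."""
    n_snps = len(p1)
    d_full = d_statistic(p1, p2, p3, outgroup)
    leave_one_out = []
    for start in range(0, n_snps, block_size):
        # Drop one contiguous block of SNPs and recompute D on the rest.
        keep = [i for i in range(n_snps) if not (start <= i < start + block_size)]
        subset = [[pop[i] for i in keep] for pop in (p1, p2, p3, outgroup)]
        leave_one_out.append(d_statistic(*subset))
    n = len(leave_one_out)
    mean = sum(leave_one_out) / n
    se = sqrt((n - 1) / n * sum((d - mean) ** 2 for d in leave_one_out))
    z = d_full / se if se > 0 else float("nan")
    return d_full, se, z


if __name__ == "__main__":
    # Toy frequencies, invented for illustration only (not data from the study).
    cultivar_a = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1]
    cultivar_b = [0.4, 0.5, 0.3, 0.6, 0.5, 0.9]
    wild = [0.8, 0.7, 0.9, 0.6, 0.8, 0.7]
    outgroup = [0.0] * 6
    print(jackknife_z(cultivar_a, cultivar_b, wild, outgroup, block_size=2))
```

With real data the blocks would normally follow genomic position, so that linked SNPs are dropped together.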

We also ran a fineRADSTRUCTURE analysis on wild and cultivated individuals. The [...] groups. Many of the wild individuals showed a similar level of coancestry to one another. There were a number of small groups of wild individuals that were grouped by their geographic region: samples from Bolu, Ordu and Georgia shared high levels of coancestry. Individuals from the DAPC C. colurna cluster also stood out and were placed within the large group of wild individuals, rather than outside it as expected. Coancestry was much more variable among cultivated individuals, indicating more pronounced genetic structure. They were split into several large groups that broadly reflected the clusters inferred using other approaches, but revealed additional fine-scale structure within each group. This approach, alongside others, allowed us to accept our hypothesis (i) that there is more structure in cultivated than in wild populations.

Diversity among wild and cultivated individuals
We found that observed heterozygosity (H_o) was generally higher in cultivated than wild clusters but estimates of expected heterozygosity (H_e) did not follow this pattern ( [...] 'Tombul' samples. We then assessed those specimens where the cultivar name information was available by pooling individuals based on cluster name (Fig. 4b). We examined the relative proportion of each cluster that made up each cultivar. For all cases in which we had more than one sample, we found that named cultivars were composed of variation from more than one cluster. We therefore rejected our hypothesis (iii) that genetic clustering supports given cultivar names.

[...] or from a more diverged population, which was the case for the migration event from the Italian cluster. Each of the three events was highly significant (p < 2.1e-06). The amount of variance explained was high (98.24%) even without any migration edges and increased until three migration edges were present, up to 99.98% (Table S2). Matrices of pairwise residuals are shown in Figure S4.

We then examined whether gene flow has occurred between the wild cluster and clusters of Turkish cultivars. We inferred D statistics for three tests (Table S3), two of which had Z scores > 2, indicating some evidence for gene flow between the 'Cakildak' and wild clusters, in agreement with our TreeMix analysis (Fig. 6b). Results from fastSTRUCTURE, TreeMix and D statistics indicate that gene flow between wild and domesticated hazelnut has taken place and we therefore accept our hypothesis (iv).

DISCUSSION

Genetic clusters do not match cultivars
All approaches used revealed that there was more pronounced genetic structure in domesticated than wild hazelnut (Fig. 3, 4, S1). Perhaps the most striking pattern we recovered was the mismatch between genetic data and named cultivars. We identified five genetic clusters across all of our cultivated individuals (Fig. 4a). When we grouped individuals by cultivar name, mean ancestry coefficients were always made up of more than one genetic cluster. This suggests that inferences from our genomic markers do not reflect the naming system of Turkish cultivars. This may be because cultivar names are based on traits that are not correlated with neutral genetic variation, such as kernel size, shape or taste. Morphology has been used to assign Turkish cultivars to three primary groups, primarily based on nut shape (Kafkas et al., 2009), and these do not correspond to the genetic clusters we have recovered. Kernels of 'Yassi Badem', one of the cultivars that grouped with wild individuals instead of cultivars in our DAPC, are shaped like almonds and not suitable for processing. This cultivar was also found to be the most genetically distant by Kafkas et al. (2009) and did group with cultivars rather than wild individuals in our fastSTRUCTURE analysis (Fig. 4c). It may be that cultivars like 'Yassi Badem' have not undergone complete domestication.

Our clustering was similar in some aspects to a previous study based on several nuclear marker types (Kafkas et al., 2009). 'Tombul' was split among genetic clusters, a pattern also recovered in Boccacci et al. (2006). This cultivar is the most economically important, and it has been implied that 'Tombul' nuts come from just a single clone (Ayfer et al., 1986; Caliskan, 1995), but this is not supported by the genetic variation within 'Tombul' we recovered. Furthermore, morphological differences in their nuts and husks have been observed between different 'Tombul' samples (Kafkas et al., 2009).
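The comparison of cultivar names with genetic clusters rests on pooling per-individual ancestry coefficients by cultivar name and checking how many clusters contribute to each mean. A minimal sketch of that pooling step is given below. The function names, the 5% cut-off for counting a cluster as represented and the toy ancestry values are assumptions made for illustration, not values or code from the study.

```python
def mean_ancestry_by_cultivar(samples):
    """Average ancestry coefficients over individuals sharing a cultivar name.

    samples: list of (cultivar_name, [q_1, ..., q_K]) pairs, one per individual,
    where each q vector sums to 1 across the K genetic clusters.
    """
    sums = {}
    counts = {}
    for name, q in samples:
        if name not in sums:
            sums[name] = [0.0] * len(q)
            counts[name] = 0
        for k, value in enumerate(q):
            sums[name][k] += value
        counts[name] += 1
    return {name: [s / counts[name] for s in total] for name, total in sums.items()}


def clusters_represented(mean_q, min_share=0.05):
    """Indices of clusters contributing at least min_share (hypothetical cut-off)."""
    return [k for k, share in enumerate(mean_q) if share >= min_share]


if __name__ == "__main__":
    # Invented ancestry vectors for three clusters; not data from the study.
    toy = [
        ("Tombul", [1.0, 0.0, 0.0]),
        ("Tombul", [0.0, 1.0, 0.0]),
        ("Palaz", [0.0, 0.5, 0.5]),
    ]
    for cultivar, mean_q in mean_ancestry_by_cultivar(toy).items():
        print(cultivar, mean_q, clusters_represented(mean_q))
```

A cultivar whose mean ancestry spans more than one cluster under such a rule would count against a one-to-one match between names and clusters.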

Variable distance between domesticated and wild hazelnut
Our DAPC analysis revealed that most cultivated clusters fall close to wild clusters (Fig. 3), an inference that is supported by the work of Ozturk et al. (2017). These patterns could be the result of local domestication, though we think this is unlikely as we would have expected wild and cultivated individuals to cluster together geographically. The 'Tombul' and Italian clusters were highly differentiated from other groups in our DAPC (Fig. 3a). Italian cultivars are geographically isolated from Turkish samples as they occur more than 1,500 km away, which may explain their differentiation. Boccacci & Botta (2009) found little evidence of gene flow from east (Turkey/Iran) to west (Italy/Spain), which supports the differentiation we uncovered. However, we do find some evidence for admixture (Fig. 4, 6b), suggesting that some of the genomes of present-day Turkish and Italian cultivars may be the result of past introgression.

The geographic distribution of 'Tombul' overlaps with other Turkish cultivars yet it remains highly differentiated (Fig. 3a), which may be indicative of more considered breeding efforts to improve the cultivar. This cluster also had the lowest level of H_e among the six [...] the highest quality so any hybrids may be weeded out by farmers to protect the cultivar. Alternatively, the quality of the nuts may mean that 'Tombul' is often planted in new areas where it has not yet had time to interact with local wild relatives. Either way, farmers could be maintaining the distinction between 'Tombul' and other cultivars.

Evidence for gene flow among wild and cultivated samples
We identified two potential instances of past gene flow between wild and domesticated C. avellana (Fig. 6b). These were supported by extensive admixture in our clustering analysis (Fig. 4c). However, only gene flow between 'Cakildak' and wild C. avellana was also supported by D statistic tests. This event was recovered in our TreeMix analysis (Fig. 6b) and we found some evidence for admixture between wild and 'Cakildak' in our fastSTRUCTURE analysis (Fig. 4c), which also pointed to extensive admixture between wild C. avellana and individuals belonging to other cultivars. We also inferred an admixture event between the 'Tombul' and Italian clusters (Fig. 6c), but this was poorly supported by fastSTRUCTURE (Fig. 4a).

[...] 1998) yet we found similar levels of heterozygosity in cultivated compared to wild [...] that any domestication bottleneck has not had a strong effect on genetic diversity. As C. avellana is an obligate outcrosser and self-incompatible, any attempts to augment cultivars could also increase levels of heterozygosity. Another possibility is that highly heterozygous individuals have been preferentially retained and clonally propagated in orchards, perhaps because of increased yields caused by hybrid vigour. Our observations are not entirely uncommon: cultivated grapevine (Marrano et al., 2017) was more heterozygous than its wild counterpart and a study using microsatellites found that genetic diversity in hazelnut cultivars was similar to or higher than that of wild populations in southern Europe (Boccacci et al., 2013).

While levels of H_o were lower, levels of H_e were actually higher in wild C. avellana (Fig. 5), which could point to a reduction of genetic diversity during domestication. We took wild C. avellana samples from a wider geographic distribution than cultivated samples and this may have led to the observed patterns of H_e. Our comparison of all wild and cultivated samples (Fig. S1) accounts for this somewhat, and we find that values of H_o and H_e are more similar than when using separated clusters (Fig. 5). Furthermore, small clusters of wild individuals inferred using fastSTRUCTURE had levels and patterns of heterozygosity similar to their cultivated counterparts (Fig. 5), so increased H_e is not always observed for wild individuals.

[...] bottleneck. However, when we calculated heterozygosity after removing admixed individuals we found very similar results (Fig. 5), which suggests that introgression is likely not driving the observed pattern in genetic diversity. One of the major concerns for modern-day crop plants is that reduced genetic diversity caused by domestication will limit the potential for crop improvement in the future (Harlan, 1972). European hazelnut displays relatively high levels of diversity, which is promising both for improvement and for resistance to environmental stressors such as pathogens or climate change.

Given the proximity of some wild and domesticated clusters (Fig. 3a), similar levels of heterozygosity (Fig. 5) and the existence of cultivars that group with wild individuals, we suggest that hazelnut is still in the early stages of domestication. Our results indicate that cultivated hazelnut may not have experienced a strong domestication bottleneck that reduced genetic diversity. Our phylogenetic analyses suggest that domesticated hazelnut first split from its wild progenitors around 10-15 kya and that the common ancestor of current Turkish cultivars dates to about 5-10 kya. This lends support to the idea that domestication has been a gradual process instead of a single event in the past (Brown et al., 2009; Brown, 2019), and the genetic proximity of wild and cultivated samples may suggest it is still ongoing today. These characteristics make C. avellana a useful model for understanding the genetic effects of partial domestication.
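The contrast drawn above between observed (H_o) and expected (H_e) heterozygosity can be made concrete with a short calculation. The sketch below assumes biallelic SNPs coded as counts of the alternate allele (0, 1, 2, with None for missing calls) and uses the uncorrected expectation 2pq at each locus. The function name and the coding are assumptions for illustration, and the study may have used a different estimator or software.

```python
def heterozygosity(genotypes):
    """Mean observed and expected heterozygosity across biallelic loci.

    genotypes: list of loci; each locus is a list of per-individual
    alternate-allele counts (0, 1 or 2), with None for missing calls.
    Returns (H_o, H_e), where H_e is the uncorrected 2pq averaged over loci.
    """
    ho_values = []
    he_values = []
    for locus in genotypes:
        called = [g for g in locus if g is not None]
        if not called:
            continue  # skip loci with no genotype calls
        n = len(called)
        ho_values.append(sum(1 for g in called if g == 1) / n)
        p = sum(called) / (2 * n)  # alternate-allele frequency
        he_values.append(2 * p * (1 - p))
    if not ho_values:
        return float("nan"), float("nan")
    return sum(ho_values) / len(ho_values), sum(he_values) / len(he_values)


if __name__ == "__main__":
    # Toy genotypes, invented for illustration only.
    clonal_heterozygotes = [[1, 1, 1, 1]]   # every individual heterozygous
    two_homozygous_groups = [[0, 2, 0, 2]]  # same allele frequency, no heterozygotes
    print(heterozygosity(clonal_heterozygotes))
    print(heterozygosity(two_homozygous_groups))
```

The two toy loci share the same allele frequency, and therefore the same H_e, but differ sharply in H_o. Clonal propagation of heterozygous individuals could produce the first pattern, which is one of the explanations raised above for high H_o in cultivars.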

CONCLUSION
The European hazelnut is one of the most important tree nut crops worldwide and is a large part of the economy and livelihood of communities on the north coast of Turkey. We conducted an assessment of the diversity of cultivars and wild populations in this area and beyond, the first using a genomic approach. We found that cultivars are highly heterozygous, and that admixture has likely occurred among wild and domesticated hazelnut as well as among different genetic clusters of cultivated individuals. We used genomic data to cluster different cultivars into major groups and, surprisingly, these did not overlap with the current naming of cultivars. Our efforts could be useful as a starting point for more efficient use of genetic diversity in breeding programmes. We inferred divergence times of wild and [...]

[...] showing standard error. We calculated heterozygosity using three different groupings, delineated by black bars. From left to right: the first grouping was based on DAPC clustering (Fig. 3a), the second grouping was based on fastSTRUCTURE clustering and only included individuals with pure ancestry (no admixture) (Fig. 4c). Colours of x-axis labels correspond to the colours used in Fig. 4c. The third grouping was based on the major split between wild and cultivated individuals in our fineRADSTRUCTURE analysis (Fig. S1).