Inferring pathogen dynamics from temporal count data: the emergence of Xylella fastidiosa in France is probably not recent

Summary Unravelling the ecological structure of emerging plant pathogens persisting in multi‐host systems is challenging. In such systems, observations are often heterogeneous with respect to time, space and host species, and may lead to biases of perception. The biased perception of pathogen ecology may be exacerbated by hidden fractions of the whole host population, which may act as infection reservoirs. We designed a mechanistic‐statistical approach to help understand the ecology of emerging pathogens by filtering out some biases of perception. This approach, based on SIR (Susceptible–Infected–Removed) models and a Bayesian framework, disentangles epidemiological and observational processes underlying temporal counting data. We applied our approach to French surveillance data on Xylella fastidiosa, a multi‐host pathogenic bacterium recently discovered in Corsica, France. A model selection led to two diverging scenarios: one scenario without a hidden compartment and an introduction around 2001, and the other with a hidden compartment and an introduction around 1985. Thus, Xylella fastidiosa was probably introduced into Corsica much earlier than its discovery, and its control could be arduous under the hidden compartment scenario. From a methodological perspective, our approach provides insights into the dynamics of emerging plant pathogens and, in particular, the potential existence of infection reservoirs.


Supplementary Notes S2: Detailed model description Model skeleton
Time is discrete and regular, and takes values in the set of integers Z (in our application, the time unit is the month). The unobserved process Y represents the real epidemic and is split into 4 components: • S O (t) is the number of susceptible hosts at time t in the observable compartment, • S H (t) is the number of susceptible hosts at time t in the hidden compartment, • I O (t) is the number of infected hosts at time t in the observable compartment, • I H (t) is the number of infected hosts at time t in the hidden compartment. Thereafter, indices O and H are used to distinguish variables concerning the observable and hidden compartments, respectively.
The epidemic is observed via the surveillance of symptomatic and asymptomatic hosts in the observable compartment. In addition, hosts which are observed as infected are destroyed. Thus, the observation process is a so-called control process, which changes the course of the epidemic (the aim of the control may be to eradicate or slow down the epidemic). The control process consists of a deterministic component C N providing the numbers of symptomatic and asymptomatic sampled hosts at time t, and a stochastic component C I providing the numbers of symptomatic and asymptomatic infected hosts among the sampled hosts: is the number of symptomatic hosts sampled at time t in the observable compartment, • N ∅ obs (t) is the number of asymptomatic hosts sampled at time t in the observable compartment, • I † obs (t) is the number of infected hosts at time t among the N † obs (t) sampled symptomatic hosts, • I ∅ obs (t) is the number of infected hosts at time t among the N ∅ obs (t) sampled asymptomatic hosts. Thereafter, symbols † and ∅ are used to distinguish variables concerning symptomatic and asymptomatic hosts, respectively, The process C N , which represents the sampling effort, is supposed to be known. When no host is observed at time t, the two components of C N are equal to zero. The processes Y and C I are updated as follows: where the new state of Y depends on the past states of Y (up to time t − K, K ∈ N * ) and the previous observation of infected hosts, the new state of C I depends on the new state of the epidemic and the (deterministic) numbers of sampled hosts, and the functions F θ and G θ parameterized by θ govern these dependencies.
The two following sections describe how we specified F θ (i.e. the controlled epidemic process) and G θ (i.e. the observation process). Then, we will specify the prior distributions of model parameters.

Model of the controlled epidemic process
Before the time of introduction of the disease, say t 0 , the process Y is constant and equal to: where N 0 is the initial host population size, φ is the initial proportion of the host population corresponding to the observable compartment, and · is the rounding operator. The epidemic is initiated at time t 0 by distributing I 0 infected hosts in the observable and hidden compartments, with the proportions φ and (1 − φ), respectively: At time t > t 0 , new infections in the observable and hidden compartments, say I * O (t) and I * H (t) respectively, occur as a function of past prevalences: where N (t − k) is the total number of hosts at time t − k (i.e. the sum of the 4 components of is the total number of infected hosts at time t − k, and w k ≥ 0 is a weight measuring the contribution of the disease prevalence at time t − k to new infections at time t. Thus, w k I(t − k)/N (t − k) gives the potential of infection of hosts infected at time t − k over hosts susceptible at time t − 1, and min 1, K k=1 w k N (t−k) gives the probability of infection at time t of hosts that were susceptible at time t − 1. In this model, any infected host contributes to the new infections in both the observable and hidden compartments, irrespective of the compartment it belongs to. Equations (1-2), without the rounding and minimum operators, can be viewed as the auto-regressive terms appearing in the discrete-time renewal equation 1 , which is often used to build temporal models of disease incidence, like in Champredon et al. (2017).
In addition, at time t > t 0 , 1. infected hosts observed at time t − 1 are removed from the infected hosts in the observable compartment and replaced by resistant hosts 2 , in the aim of enhancing the spread of the disease; 2. the other infected hosts at time t − 1 are affected by a mortality rate ρ, irrespective of the compartments they belong to, and they are replaced by susceptible hosts 3 . These assumptions and Equations (1-2) lead to: where I obs (t − 1) = I † obs (t − 1) + I ∅ obs (t − 1) is the number of (symptomatic and asymptomatic) infected hosts detected at time t − 1, and the rounding operator allows us to obtain a vector of integers for Y (t). The number of removed hosts in our SIR model is defined in Footnote 2.

Model of the observation process
We modeled the numbers I † obs (t) and I ∅ obs (t) of symptomatic and asymptomatic observed infected hosts at times t satisfying N † obs (t) > 0 and N ∅ obs (t) > 0 with independent hypergeometric distributions taking into account the effects of false-negative in the diagnostic test: where is the number of infected symptomatic hosts at time t in the observable compartment, which is assumed to be a fixed proportion γ 1 ∈ [0, 1] of I O (t): is the number of infected asymptomatic hosts at time t in the observable compartment and satisfies: is the number of susceptible symptomatic hosts at time t in the observable compartment that are considered as at-risk (i.e. hosts whose propensity to be sampled is high, despite their healthy status); this number is assumed to be a time-varying proportion g(t)γ 2 of S O (t): where γ 2 ∈ [0, 1] is the fixed proportion of symptomatic hosts among the at-risk susceptible hosts in the observable compartment, and g(t) is the time-varying proportion of at-risk hosts among the susceptible hosts in the observable compartment; is the number of susceptible asymptomatic hosts at time t in the observable compartment that are considered as at-risk, and satisfies: • is the rate of false-negatives: the fraction of infected hosts cannot be observed as positive cases because of diagnostic test issues, and are accounted for in the negative cases in Equations (4-5). Note that our model assumes that the rate of false-positives is zero, and that all infected hosts in the observable compartment are at-risk (i.e. all infected hosts in the observable compartment have a high propensity to be sampled). Note also that the rounding operator in the hypergeometric distributions is applied to get integers for the parameters.

Competing models and prior distributions
We built eight competing models, by considering several options for the initial proportion φ of the host population that is observable, and several options for the g function providing the time-varying proportion of at-risk hosts among the susceptible hosts in the observable compartment. These models are labelled M 1 , . . . , M 8 and are classified in Table S2.
Model parameters are listed in Table S3, which also provides their prior distributions. Below we justify the choices of priors. For the introduction date t 0 , we chose a vague uniform prior over a period of 50 years (i.e. 600 months) before the first detection that occurred at time t = 0 (i.e. July 2015).
Specifying the prior for the initial population size N 0 is tricky and requires to define the host unit. We considered the host-unit to be an area that makes sense regarding the type of data under consideration. Samples are collected from individual plants, but generally represent a set of neighbor plants. The median distance between the locations of samples being 9 meters, we considered that an host unit is an area of about π(9/2) 2 ≈ 64 square meters. Using Corine Land Cover data, the total area potentially covered with plants represents about 3500km 2 in South Corsica, 10% of which was arbitrarily considered as covered by potential host plants of Xylella fastidiosa. Thus, we considered that the initial population of host plants consisted of approximately 0.1 × (3500 × 10 6 )/(π(9/2) 2 ) ≈ 5.5 × 10 6 host units. Based on this rough evaluation of N 0 , we set a relatively vague prior for this parameter, namely, a log-normal distribution with mean and standard deviation parameters equal to log(5 × 10 6 ) and 0.5, respectively, which lead to quantiles of order 0.025 and 0.975 equal to 1.9 million and 13.3 million.
The prior distribution for I 0 was supposed to be Dirac(10) (i.e. I 0 can only take the value 10), which means that 10 infected host units were introduced at t 0 . Obviously, this number has a strong influence on the introduction date: a larger I 0 would lead to a more recent introduction date t 0 . The dependence between I 0 and t 0 yields an identifiability issue in the estimation algorithm, and led us to fix the value I 0 . The value 10 was considered as a plausible number (it might correspond, for instance, to a non negligible volume of infected plants imported by a nursery).
The prior for the mortality rate ρ is a log-normal distribution with mean and standard deviation parameters equal to log(0.02) and 0.5, respectively, whose quantiles of order 0.025 and 0.975 are 0.0075 and 0.0533. If ρ = 0.0075, 50% of infected hosts die in the first 7.7 years of their infection. If ρ = 0.0533, 50% of infected hosts die in the first year of their infection. Thus, a priori, the mortality rate can take diverse values corresponding to significantly different mortality dynamics.
Because of some identifiability issues in the estimation algorithm and to allow an eventual annual periodicity to be inferred, the weights w k were all assumed to be equal to 0, except w 12 , for which we specified a vague uniform prior over [0,10]. In the main text, w 12 is denoted by w.
A priori, the false-negative rate has a uniform prior over [0, 0.2] and is therefore assumed to be relatively low, but it can take non-negligible values.
For the proportions φ, γ 1 , γ 2 , β 1 and β 2 , we chose vague uniform priors over [0, 1], except mentioned otherwise in Table S3 to satisfy the specification of each competing model. For instance, for models M 1 , M 2 , M 3 without hidden compartment, φ can only be equal to 1, and for models M 7 (with an a priori small hidden compartment) and M 8 (with an a priori large hidden compartment), φ has a beta distribution with parameter vectors equal to (4,1) and (1,4), respectively.

Hidden compartment
None Fraction of the whole population Preference in sampling

Supplementary Notes S3: Impact of the choice of the number I 0 of introduced infected hosts
As indicated in the main text, the number I 0 of introduced infected hosts at t 0 , was set at a fixed value in all models because of some identifiability issues. We set I 0 = 10, which amounts to assume that the epidemic began with the introduction of a small batch of infected plants and that subsequent introductions did not significantly impact the overall curse of the epidemic.
Here, we illustrate the identifiability issue encountered when I 0 is not fixed, and we investigate the impact of the value chosen for I 0 on the estimation of other parameters. For the sake of example, we used model M 8 that includes an a priori large hidden compartment and a varying preference in sampling at-risk hosts. Figure S2 provides an example of posterior distribution obtained for I 0 when this parameter is not fixed at 10 but a priori distributed under a log-normal distribution with mean parameter log 10 and standard deviation 1. Despite a prior whose mass is mostly concentrated over values lower than 100, the MCMC chains made excursions to values larger than 1000. We performed additional estimations of parameters of model M 8 by varying the fixed value of I 0 . Figure S3 gives, for each model parameter, its posterior mean and quantiles of order 0.025 and 0.975 when I 0 varies between 1 and 100. The value I 0 = 10 is indicated by the vertical dashed red line. Figure S3 also shows the posterior mean and quantiles of the log-likelihood that is a measure of the goodness-of-fit of the model. Large differences in posterior means and quantiles are observed for low values of I 0 , namely I 0 = 1 or I 0 = 5, but such low values lead to rather low log-likelihoods. In contrast, larger values of I 0 led to roughly similar results and log-likelihoods. It has however to be noted that the value I 0 = 10 generates an introduction date that tends to be more recent than larger values. This is somehow counterintuitive but this result is due, in particular, to the larger estimate obtained for the infection strength when I 0 = 10. The value I 0 = 10 that is used in the main text is indicated by the vertical dashed red line. Parameters are the introduction date (t 0 ), the initial population size (N 0 ), the mortality rate (ρ), the infection strength (w), the proportion of the observable compartment (φ), the proportions of symptomatic hosts in the observable compartment (γ 1 and γ 2 ), the parameters governing the preference in sampling at-risk hosts (β 1 and β 2 ), and the false-negative rate ( ).