Explanation of and Comments on McCowan's Re-analysis of Covid Spillover

trying to make it more accessible to those who wish to understand. As for the others....

Mar 03, 2025

Introduction

The origins of the SARS-CoV-2 (SC2) virus have understandably attracted much attention. Pekar et al.2022 (1) (denoted P2022 henceforth) is one of the most influential papers addressing the question, e.g. being picked up by more than 500 news outlets. It makes the qualitative claim that the phylogeny of the downstream sequences indicates two spillovers from a non-human host and that the most recent common ancestor (MRCA) existed shortly before cases were officially detected at the Huanan Seafood Market (HSM). The paper has multiple flaws in basic Bayesian methods, in coding implementation, and in modeling. The most important coding implementation errors have now been acknowledged in a correction, but it seems worthwhile to have a published discussion of the other issues. Before delving into those, however, it is important to note that the case for a natural spillover does not require that P2022 conclusions are correct nor does the case for a lab spillover require that they are incorrect.

Context

Worobey et al.(2) argued that HSM was the site where SC2 spilled over into humans. There were several facts that tended to disfavor that claim, including that all the sequenced HSM-linked cases were lineage B, 2 nucleotides more distant from related natural viruses than was the lineage A found in many other locations. That suggests prima facie that the HSM cases were downstream from a spillover that occurred elsewhere. P2022 address this question by arguing that the phylogeny indicates that B probably spilled over separately from A, which in principle would allow other evidence (e.g. the location data in Worobey) to show that B spilled over in HSM. That would still leave questions about the spillover leading to lineage A, but there is general agreement that if lineage B spilled over at HSM then the origins were zoonotic regardless of the details about any other spillovers.

The Model

P2022 use a complicated model in which the infection spreads between individuals who have a distribution of numbers of possible connections to others. No population structure is included, i.e. there are no sub-populations in which disease mostly spreads internally with reduced rates of transmission to other sub-populations. (3) There are some adjustable parameters for the disease doubling time and the probability of sequencing the virus from a case, both assumed somewhat unrealistically to be independent of time. Without getting into the detailed form of the model, one can see that these simplifying assumptions, although justified in a first look at the process, can lead to unrealistic probabilities for detailed features of the outcomes.

The doubling time is set to several values in a realistic range. The probability of detecting a sequence seems to be set to an irrelevant probability of detecting a case. This leads to simulations having about ten times as many sequences included as were present in the data set used, an important and unnecessary difference that’s relevant to the key hypotheses tested.

P2022 compares two hypotheses, I₁and I₂. I₁ is that all the human cases descend from a single introduction. I₂ is that the two major clades descend from two separate introductions. P2022 use their model to calculate a Bayes factor BF by which they say that the observed two-clade topology favors I₂ over I₁.

Regardless of whether the model of post-spillover transmission, mutation, and ascertainment is realistic and whether the data used are representative the paper applies a specific well-defined mathematical model to a well-defined set of data. Therefore it has well-defined implications. Like any mathematical implications, those can be calculated correctly or incorrectly.

Coding Errors

The original P2022 paper made at least three coding errors in calculating those implications. These were noted by Angus McCowan under the pseudonym “Neoacanthoparyphium echinatoides“ on the “pubpeer” post-publication review site . The code is available on github. Correcting the first three errors reduced the likelihood ratio favoring the two spillover hypothesis I₂ over the one-spillover hypothesis I₁ from ~60 to ~4.3, as now acknowledged in a corrected version of P2022. McCowan went on to correct some further minor errors and to run the code enough time to get good statistics, finding BF= ~3.5, although these have not yet been formally acknowledged by the original authors. The threshold used for nominal significance has been lowered from 10 in the original paper to 3.2 in the corrected version, so the factor remains “significant”.

Conventionally, one would not consider a Bayes factor as small as that to have any significance unless the particular features to be observed were pre-specified, since for any complicated set of observations it’s always possible to post-hoc select some that are more compatible with one hypothesis. In frequentist null hypothesis significance testing a familiar equivalent problem is dealt with, or at least is supposed to be, by requiring either pre-specified statistical tests or truly extreme p-values, enough to more than make up for the implicit collection of possible comparisons.

Bayesian Logic

The most fundamental issue with P2022, however, concerns basic Bayesian logic. (4) Bayesian odds update factors are of the form
P(observations|hypothesis 1)/P(observations|hypothesis 2).

It is obviously essential that the same observations be used for each hypothesis. For example, if one were to update the odds for deciding which of two suspects committed a burglary, it would not be correct to use the ratio P(drives car|suspect 2)/P(drives blue Toyota|suspect 1). The P2022 paper makes an analogous mistake. For the I₁ hypothesis the observation used is that there are two polytomies with sequences differing by at least 2 nucleotides with the smaller polytomy having at least 30% of the detected cases. (Lineage A had 35%). For the I₂ hypothesis the requirement was only that there be two polytomies with no restriction on the sequence difference or the relative size. (5) This is a fundamental error in logic.

One of the two conditions P2022 required for I₁ but not for I₂ was that the smaller clade have a size of at least 30%. The ratio of the sizes of the two clades is determined entirely by the transmission after the introductions and thus requires no modeling other than that already used by P2022. McCowan has spot-checked the size ratio for I₂ finding that 100/674 =15 % met the condition. Thus including just this one condition would reduce the Bayes factor from ~4 to ~0.6.

There is not a unique correct algorithm for correcting the logic error for the sequence difference constraint. The reason is that for I₂ one needs some picture of the pre-spillover sequence distribution in order to obtain the probability of meeting the observational constraint for the sequence difference, D, between the clades. No such model is explicitly provided. Nevertheless McCowan has managed a minimal extension of the P2022 model from which estimates of the largest plausible Bayes factor with balanced requirements can be calculated. (5) The model simply puts in two successful introductions, each followed by a descendent tree generated by exactly the same algorithm that P2022 used. By adjusting the two introduction times, i.e. the post-MRCA start times of the two trees, to maximize the likelihood of the I₂ hypothesis, McCowan obtains an upper bound for what its Bayes factor should be with the P2022 model. (5) Since it turns out that this likelihood is maximized when the two times are equal, the only way the time from MRCA to introduction shows up is by the expected mutation number. Since the time is chosen to maximize the likelihood, the likelihood is insensitive even to the mutation rate in the prior host for rates close to that in humans because the time is always picked to get the most favorable expected number of mutations.

McCowan finds a maximum Bayes factor for I₂ vs. I₁ to be ~0.3. (5) I₁ is thus fairly strongly favored.

As I discuss below, when proper Bayesian outcome descriptions are used one can more easily obtain approximately this upper bound from simple considerations.

Further thoughts

I should comment briefly on a few other obvious corrections that might be made. These comments are purely my responsibility and are not simply attempts to clarify McCowan’s new paper.

Genuine Bayesian Conditions (see Appendix for tutorial)

The conditions used by P2022 are not correct for Bayesian analyses. These one-sided conditions, that D > 1 and that the smaller size was at least 30% of the total, seem to be remnants of a frequentist analysis in which one asks how probable it would be for some result to be at least as far from the prediction of some model as was actually observed. Bayesian analysis compares the likelihoods under different hypotheses of getting the actual observations, not arbitrary one-sided extensions of them. In this case the relevant observations are that D= 2 and that the smaller size was 35% of the total. From the first round of McCowan’s simulations on pubpeer, one would expect that using the exact D will lower the Bayes factor for I₂, unsurprisingly since separate introductions can have arbitrarily large sequence differences while clades descending from a single introduction tend to have closely similar sequences. This distinction becomes particularly important in evaluating models that propose a prior quasi-species with a broad distribution of sequences.

One can obtain an approximate upper bound on the sequence difference correction once one uses real Bayesian conditions. For I₁ switching from requiring only D>1 to requiring D=2 makes little difference in likelihood because the early sequences cluster closely near the single introduced sequence and are especially unlikely to have major clades with D > 2 . For I₂ the prior evolution just sets an expected D between the two randomly sampled introductions for I₂ for some distribution of sequences. Since the small number of observed differences come from a huge collection of possible mutations the distribution for small D should have a Poisson form. The distribution then is fully described by its expectation value, E(D) The maximum probability of D = 2 is obtained when E(D) = 2, giving P(D=2) = 2/e^2= 0.27. This is almost exactly the factor by which inclusion of the D = 2 constraint reduced BF in the McCowan’s previous pubpeer analyses.

This argument provides a simple intuitive way of seeing why McCowan’s BF is largest when the introduction times are ~15 days post-MRCA, since that corresponds to about an expectation of one mutation for each introduction on its descent from the MRCA. McCowan’s argument, however, uses the specific possible MRCA sequences proposed in P2022. If I understand correctly, longer times since the MRCA are disfavored in those simulations because of reduced probability of getting the right starting sequences in each introduction rather than by the more generic reduced probability of getting the right sequence difference between the two introductions. One advantage of the more generic approach would be to avoid the appearance of reliance on specific choices of ancestors.

For the size ratio the approximate fraction of the I₂ trees meeting the inequality condition can be directly calculated from the distribution of ratios of sizes among different I₁ simulations. That calculation is independent of the details of the pre-introduction evolution. The calculation might not be exact because even when the introduced sequences differ by D=2 early mutations in either batch could accidentally give sequences that look like intermediates. I don’t know how the P2022 algorithm would interpret those. McCowan has now found that ~15% of the simulation pairs meet the inequality condition. The combination of this factor with the pre-evolution-independent 2/e^2= 0.27 factor from the sequence constraint should give a good approximation to the upper bound of the net correction factor. It need not be exact because the joint distribution doesn’t have to factorize into the product of the two distributions, especially for I₁.

Using the actual size ratio rather than the P2022 inequality (>30%) would amount to using the ratio of the relevant probability density functions: PDF(ratio=65%/35%|I₂)/PDF(ratio=65%/35%|I₁) rather than the probability ratio P((30% to 70%)|I₂)/P((30% to 70%)|I₁). The one-introduction range of sizes is large enough that the I₂ Bayes factor should remain maximized at equal introduction times. I suspect that the distribution of logarithms of ratios of sizes of the two clades in I₁is somewhat narrower than distribution of size ratios of the two clades in I₂, so that using the correct Bayesian condition might slightly shift the odds toward I₂. Unless the distribution of ratios for I₁ is surprisingly narrow that tweak will leave I₁ favored.

Adjustable Parameters

McCowan initially allows two adjustable parameters (the two times) to maximize the likelihood of I₂. Since in the range of mutation rates explored only one time is used, in this range the I₂ model has in effect only one extra parameter. In the standard Akaike Information Criterion each additional parameter requires a penalty of 1/e in likelihood. Without some such penalty models with more parameters would always be favored no matter how irrelevant those parameters are since each parameter can be adjusted to whatever happens to improve the fit. Conventionally, the posterior odds P(I₂)/P(I₁) should be reduced be another factor of 1/e = 0.37. If a time difference between introductions is also allowed then there should be another such factor.

Priors

P2022 use priors P₀(I₁) = P₀(I₂) = 0.5 and P₀(I_n) = 0 for n >2. These seem unjustifiably informative. A more conventional non-informative prior would describe the expectation value x for the number of successful spillovers, with a uniform probability density of ln(x). Divergence of its integral for small x is removed by conditioning on there being an observation, i.e. n>0. Weak divergence for large x can be removed by truncating on any large number, e.g. the population of Asia. Then assuming Poisson probabilities for I given any x, one easily obtains P₀(I₂)=P₀(I₁)/2. P(I₂)/P(I₁) should be reduced by this factor of 0.5.

Data

The P2022 paper omitted numerous sequences that appeared to be intermediates from their data set due to reliability issues. Although questions have been raised about whether consistent criteria were used for exclusion of sequences it may be that all the excluded intermediates were indeed misreads of some sort. Now in a larger data set reliable intermediates have been found.(6) Lv et al. interpret these as descendants of the actual evolutionary intermediates in humans, although one cannot yet be sure that they aren’t later mutations from one of the lineages. If that more recent interpretation of more complete data is correct, then the two-lineage empirical basis of the P2022 paper is lost.

Time-dependent Ascertainment

Less formally, but perhaps most importantly, the common framing of the question has missed the key point. The particular observed feature that leads P2022 to conclude there were two successful spillovers is that, at least in their sequence set, there were two distinct lineages without any reliably verified intermediate sequences. Since by all accounts all the sequences descend from an MRCA the issue is why intermediate sequences were missing from those data.

It is possible for the two mutations separating the lineages to have occurred in one transmission event without intermediate cases existing although that is fairly unlikely and already included in the simulations. Another possibility already incorporated in the P2022 probability calculations is that the intermediates existed but simply happened by chance to be missed because most cases were missed. Those calculations, however, assume that the probability of detecting a case is fixed as a function of time. That assumption is very far from accurate, as detailed by Tsang et al. (7) based on the way cases were ascertained at different stages of the pandemic. Early ascertainment rates were much lower than later ones. Thus the simulations miss a key reason why early intermediates are likely to be missing.

Rather than incorporate that realistic reason for an enhanced probability of missing intermediates in the model, P2022 proposed an alternative two-spillover explanation that plays the same role mathematically. Since no cases have been ascertained in any animal host prior to the human cases, intermediates in such prior hosts would be missed if the MRCA occurred in them rather than in a human. That explanation is called the two-spillover hypothesis. It moves the MRCA from the early stage of the human pandemic, when the actual ascertainment probability was especially low (although that reduction is not included in the model) to a hypothetical host in which the ascertainment probability was zero. In other words, the model replaces the approximately known time-dependence of the ascertainment rate in humans with a similar though more extreme hypothetical time dependence from the spillover itself. Using more realistic time-dependent ascertainment rates might have removed even the superficial appeal of the I₂ hypothesis.

Appendix on Bayesian Observational Descriptions

It seems that there’s widespread misunderstanding of a basic principle of Bayesian reasoning. Bayesian likelihoods are conditional probabilities of seeing the actual observed results and not the conditional probabilities of seeing the sort of extended range of more extreme unobserved results that are used in frequentist null hypothesis significance testing p-values. Here’s a quick argument showing how radically wrong it is to use the p-value-ish ranges rather than the actual results.

Let's say you have two hypotheses about a coin:

H1 is that the coin is 50/50
H2 is that the coin is 70/30 heads.

Now flip the coin 99 times. Say you get 50 heads.
Say we define the outcomes as broad one-sided ranges:
(E1=more heads or E2=more tails).

We observed E1.

P(E1|H1) = 0.5

P(E1|H2) =1-epsilon.
So the likelihood ratio favors H2 by almost a factor of 2.

Say we define the outcomes as the actual number of heads seen, not some extended interval.
Then P(E1|H1)/P(E1|H2) is (5/7)*(0.25/0.21)^49 =3666, strongly favoring H1.

Which reasoning is right and which is wrong?

This is not a judgment call.

References

1. J. E. Pekar et al., The molecular epidemiology of multiple zoonotic origins of SARS-CoV-2. Science 377, 960 (2022).https://www.science.org/doi/abs/10.1126/science.abp8337
2. M. Worobey et al., The Huanan Seafood Wholesale Market in Wuhan was the early epicenter of the COVID-19 pandemic. Science 377, 951 (2022).https://www.science.org/doi/abs/10.1126/science.abp8715
3. S. Zhao et al., Pinpointing the animal origins of SARS-CoV-2: a genomic approach. Journal of Genetics and Genomics 49, 900 (2022).https://www.sciencedirect.com/science/article/pii/S1673852722001539
4. M. He, and L. Dunn, Assessing Probabilities for the Two Leading SARS-CoV-2 Origin Narratives. 2022. https://www.researchgate.net/publication/363158480_Assessing_Probabilities_for_the_Two_Leading_SARS-CoV-2_Origin_Narratives
5. A. McCowan, Purported quantitative support for multiple introductions of SARS-CoV-2 into humans is an artefact of an imbalanced hypothesis testing framework. arXiv q-bio.qm, (2025).https://arxiv.org/abs/2502.20076
6. J.-X. Lv et al., Evolutionary trajectory of diverse SARS-CoV-2 variants at the beginning of COVID-19 outbreak. Virus Evolution 10, (2024).https://doi.org/10.1093/ve/veae020
7. T. K. Tsang, P. Wu, Y. Lin, E. H. Y. Lau, G. M. Leung, and B. J. Cowling, Effect of changing case definitions for COVID-19 on the epidemic curve and transmission parameters in mainland China: a modelling study. The Lancet Public Health 5, e289 (2020).https://doi.org/10.1016/S2468-2667(20)30089-X
Share
Leave a comment

Michael’s Substack (Free)

Discussion about this post