29 Comments

fantastic work!

I agree with almost everything, just a few comments:

- I'd put some weight on the earliest genome, which was sequenced at Sangon in the presence of Vero DNA. Both the company and Vero cell lines are mentioned in DEFUSE. If it had come from a human, the sample would likely have been less mixed. It was also certainly never published, which in itself is suspicious.

IMO WIV panic-sequenced all their RaTG13-like / FCS insertion project samples after they got the first SARS2 sequence to check if it was them. This would also explain the CHO DNA (often used for spike characterisation).

- There are 2 more important pieces of evidence you do not discuss:

1: human-optimized SARS2 spike expression vectors found in 2019 patient samples

2: our endonuclease preprint, specifically the high concentration of synonymous mutations in restriction sites used by WIV researchers in 2017.

this talk may help:

https://youtu.be/EuuY94tsbls?si=IVu6DXPxMDxhNT98

let me know if you'd like to discuss this.

author

p.s. I had a restriction segment factor in early drafts but took it out due to some advice from experts. That's the best call now, but it could change.

author

Valentin- Many thanks for the thoughtful comment, which may point to future updates. I have watched your talk. Here's why I didn't yet include those points.

1. Spike sequences in plasmids. I don't understand the context well enough to use it yet. Perhaps more write-up of why LL would lead to them and ZW wouldn't would help.

2. Restriction fragment pattern. I wanted to avoid using models as much as possible because I'm not knowledgeable enough to evaluate them. You show patterns for ~42 viruses which break down into ~24 clusters of closely similar ones. I find one that lands in the little segment-number max-segment-length region you've found for the synthetic 10, and another is close. So I'd say P(segment pattern|ZW) is something like 1/15. Why not use that update factor? Because I don't know P(segment pattern|LL). Friedemann Weber points out that the easiest synthesis methods wipe out the restriction sites. What's the probability that someone would use a harder method to preserve modularity for future changes? I don't know. Is it bigger than 1/15? I'd like to follow up on this but would need more data.

Sangon: I'd love to get more detailed info on Sangon. I had a strong suspicion that it was worth an update but didn't get far enough into the weeds to be sure. Those 3 reversionary nt's - how good were the read depths? How many other similar read-depth mutations were around? How strong was the evidence that the SC2 came from the culture cells rather than from humans? I just ran out of energy and didn't follow up.


Michael,

first, you should absolutely only include things in your analysis that you are very certain are correct. and to be frank, your result is crystal clear as is. ending up with 99.5 or 99.99% does not really move the needle, both justify a diligent investigation.

regarding your questions:

- i basically explain the pCDNA3.1 spikes in my talk to the best of my understanding. The constructs are highly similar to ones published here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7092805/ , ideal to test spike to hACE affinity or as a vaccine for an animal (both mentioned in DEFUSE), and so heavily mutated that a later lab contamination can be pretty much excluded.

- about 1% of all BetaCoVs have a BsmBI/BsaI pattern falling into the below-8kb / fewer-than-8-fragments window. the point here is that none of the viruses closely related to SARS2 have a similar pattern, so SARS2 sticks out here. one could look into a few other highly efficient type IIs enzymes, but BsaI and BsmBI are the only ones previously used by the WIV. what makes SARS2 more special is that it has one type of IIs site flanking the RBD and FCS, and another one for the rest of the backbone. This allows you to build the backbone once, and then to easily make several RBD/FCS variants. SARS2 is the only alpha or betaCoV with a pattern that allows for this (so 1 in ~3k, I can send you the slides). Weber would be correct if they only wanted to make one virus, but if you want to make several variants, you need those sites in the genome (see https://twitter.com/CD57227/status/1696475005259862479?s=20. Weber is not an experienced bioengineer, and even tweeted that Vero sequences in SARS2 would not support a lab leak hypothesis…).

more important than the pattern is the very high frequency of synonymous mutations in these restriction sites, almost 10x higher than in the rest of the genome when compared to RaTG13. I don't know of any natural process that would put selection pressure on fairly evenly spaced type IIs sites. happy to elaborate on this, but likely easier via zoom.

regarding Sangon, I haven't looked into read depths, mostly trusted the analysis of others (Istvan Csabai's, which was confirmed by Jesse Bloom).

at the moment, I think the most effective way forward would be to have an analysis like yours that includes all of the evidence in a peer reviewed publication. i started working on this with a colleague, let me know in case you feel like it makes sense to join forces.

best regards, Valentin

author

Hi Valentin- I'd love to help as a critic etc. for a peer-reviewed publication, but I'm kind of too worn out to be a real coauthor.

Your point that "BsaI and BsmBI are the only ones previously used by the WIV" directly answers one of my questions, and does indicate that the restriction pattern should give another likelihood factor. But as you say, the public and governmental debate won't be much changed if the robust Bayes odds go from 100/1 to 1000/1. People will still say "so you're saying there's a chance".

author
Sep 5, 2023 · edited Sep 5, 2023

I don't want to censor anybody, especially one of my rare cohort of fellow Harvard people who did time in a penitentiary. Nonetheless, I'm deleting a series of very long Comments from reader "Harvard2TheBigHouse" because they wandered off topic into extremely naive remarks about quantum mechanics, etc. (Next week I'll post something about quantum mechanics!) I don't want this substack to be a woo forum.

His key relevant point was that he believes that SC2 came from live-attenuated-vaccine research, which I would consider to be a subset of LL. He gave a link to an early paper on that: https://onlinelibrary.wiley.com/doi/10.1002/bies.202100017.

He also believes that HIV came from similar research, a topic about which I know nothing.

Readers who wish to follow up on his thoughts may go to the substack under that name.

author
Sep 3, 2023 · edited Sep 3, 2023

Tentative not-ready-for-prime-time updates in response to Valentin. Here's a glance at sausage-making.

Valentin's collaborator Alex Washburne says that of the 10 relevant synthetic viruses they found, 8 used methods that left the restriction sites in (https://alexwasburne.substack.com/p/the-synthetic-origin-theory-of-sars). I found that one or two of the 24 natural sequence clusters they present land in the pattern region of the synthetic sequences.

But there's another issue. There exists a pair of suitable restriction enzymes that give just the right synthetic segment pattern for SC2. But aren't there several other possible combinations of suitable restriction enzymes? We need to compare the ~80% chance that the right pattern could be found under LL with the probability that the right pattern could be found under ZW including all plausible sets of restriction enzymes that might be used for synthesis. Are there 3 such combinations? 6? Somebody in the business should know the answer.

I hate to recommend machine learning for anything, but it might make sense to use an ML method to distinguish the patterns of the synthetic sequences from the many others, then use the results to give odds.

***

Sangon has the most intriguing data. I should have mentioned that DEFUSE specified that lab as where some of their sequencing would be done, and that the cells included not just VERO but hamster cells, also standard for lab culture. There were some weird features indicating that these cells had been pretty messed up by some virus. Here's a tentative update. I hope more knowledgeable people can review it.

3 of 13 mutations looked ancestral. It doesn't much matter whether there could have been misreads, because the probability of a random misread looking ancestral isn't much different from the probability I mentioned of early SC2 mutations looking ancestral. I get

P(3of13|ZW)=~1/70.

What about P(3of13|LL)? Here I don't really have a clue. You expect more ancestral nt's in a recent ancestral line, but how many? The simplest way I can express my cluelessness would be to assign equal probabilities to any number of ancestral nt's from 0 to 13. Then P(3of13|LL)=1/14. This would give a likelihood ratio of 70/14=5, or logit = 1.6. Maybe ±1.
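
For anyone who wants to check the arithmetic in the paragraph above, here is a minimal sketch. It takes the two numbers exactly as stated there (the rough 1/70 for ZW and a flat prior over 0 to 13 ancestral nt's for LL) and is not a refined estimate. The variable names are only illustrative.

```python
import math

# Rough value quoted above for P(3 of 13 | ZW); an input, not derived here.
p_zw = 1 / 70

# Flat "cluelessness" prior under LL: every count from 0 to 13 equally likely.
n_mutations = 13
p_ll = 1 / (n_mutations + 1)  # 1/14

# Likelihood ratio favoring LL, and its natural log (the logit update).
likelihood_ratio = p_ll / p_zw
logit_update = math.log(likelihood_ratio)

print(f"likelihood ratio = {likelihood_ratio:.2f}")
print(f"logit update     = {logit_update:.2f}")
```

Changing `p_zw` shows how sensitive the update is to the ZW estimate, which is part of why the ±1 uncertainty is attached.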

***

I am still too ignorant of the roles of plasmids to have even a first look at that possible update.

Sep 8, 2023 · Liked by Michael Weissman

If the probability of a random mutation looking ancestral is 23/654=~0.035, then the probability of 3 of 13 mutations looking ancestral should be (0.035)^3(1-0.035)^10(13!)/(10!3!)=~1/115. Am I misunderstanding this?
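
The binomial calculation above can be reproduced with a short sketch like the following. It assumes, as the comment does, that each of the 13 mutations independently looks ancestral with probability 23/654. The function name `binomial_pmf` is just an illustrative helper.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Per-mutation probability of looking ancestral, as quoted in the comment.
p_ancestral = 23 / 654

# Probability that exactly 3 of 13 mutations look ancestral.
prob = binomial_pmf(3, 13, p_ancestral)

print(f"P(3 of 13) = {prob:.5f}, i.e. about 1/{1 / prob:.0f}")
```

Swapping in a different per-mutation probability shows how strongly the result depends on that one input.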

author
Sep 8, 2023 · edited Sep 8, 2023

You're right. I was just slopping through a first look using 0.04 rather than 0.035 to calculate in my head. If I end up confident enough to use this, more careful calculations are needed. One slight complication is that 23/654 was never the exact ratio needed since it counted "distinct" mutations rather than total mutations. But I should put that in the uncertainty, not in conservatively rounding up.

author

There's a bigger question I'd like some help with before incorporating Sangon. I saw a claim that the 3 ancestral mutations actually match the ones for Kumar et al.'s independently estimated progenitor MRCA. If so, there's basically no chance that they're misreads. Anybody know more about this?

First thoughts: If it's right, the possible interpretations would narrow. This would clearly be ancestral RNA from a sample with VERO, hamster, and human DNA. It's a lab culture sample. Even if the RNA happened to come from the person(s) it would mean that the MRCA was present in a lab sample. That sounds like it strongly favors LL over ZW.

Sep 10, 2023 · Liked by Michael Weissman

Kumar et al.'s progenitor sequence seems to have the first two synonymous mutations listed in Table 2 of Csabai et al. Kumar's third mutation is T->C at position 28,144, not G->C at position 29,449 as reported by Csabai. Disclaimer: I'm not a pro at this. I just plugged the data from Kumar's sequence file (https://igem.temple.edu/data/COVID-19/proCoV2.fasta), containing both the progenitor and the reference sequence directly into Clustal Omega.

author

Thanks! You're way more of a pro than me.

Sep 10, 2023 · edited Sep 10, 2023 · Liked by Michael Weissman

Dammit! I was looking at the synonymous mutations. When I instead look at the sites of the most ancestral mutations (denoted alpha_1, alpha_2, and alpha_3 by Kumar et al.), the sequence variants found by Csabai seem to line up perfectly with the progenitor predicted by Kumar et al. Specifically, they find C at position 28,144 (alpha_3), T at position 8782 (alpha_2), and T at position 18,060 (alpha_1). This is much more interesting.

author

No kidding! It's embarrassing. I wanted to be sophisticated and not use phrases like "smoking gun" but...


How are you thinking about the wide time window in which the Sangon contaminant could have been collected? If it pre-dates the pandemic, it's a smoking gun. But if it was collected in November or December of 2019, wouldn't it be consistent with lab work performed on early patient samples?


Since you won't answer on X, let's try again here. You quote Bob Garry "Do the alignment of the spikes at the amino acid level -- it's stunning."

He was aligning SARS2 with RaTG13, which the WIV uploaded on January 24th. So why did Shi and WIV publish the 96% match exposing the furin cleavage site, which kickstarted engineering rumors?

https://virological.org/t/tackling-rumors-of-a-suspicious-origin-of-ncov2019/384

author

I wasn't aware of your asking on X. I do other things. I don't know why they chose to publish some things and not others. Engineering rumors would have started due to location and FCS regardless of whether RaTG13 were published or not. Actually, they probably would have started even if there were no evidence, but in this case there was.


I don't think you do "other things" if you write a million-word Bayesian analysis. Now, for the 2nd time: why did Shi and WIV publish RaTG13, bringing Bob Garry's attention to the FCS that you allege they inserted?

author
Sep 11, 2023 · edited Sep 11, 2023

I don't know, but it's hard to see how the FCS could have escaped attention.

I planted 3 little trees today and harvested about 100 peppers, some tomatillos, mustard greens, arugula, and cilantro. Zoomed w grandkids. Took a few mile walk out to Japanese garden. ...


where does "I don't know" fit into your Bayesian analysis?

Bob (and every virologist on the planet) knew of the FCS by Jan 12th, but none suspected engineering until Shi published RaTG13 on Jan 24th.

https://www.youtube.com/watch?v=HhsBE0C8Zcg&t=2160s

author

"I don't know" fits in the infinite sea of things we don't know and thus cannot use to estimate probabilities.


Food for thought, and most of the #mousecrew (jikky the mouse, etc..) agrees on the broader point that it's not merely a "leak":

https://twitter.com/Jikkyleaks/status/1693588342649377202?s=20

https://zenodo.org/record/8216373

https://theleadingreport.com/2023/08/31/new-study-reveals-that-all-covid-variants-have-been-created-in-biolabs/

author

I find that highly implausible. Although I'm not an expert on modeling the evolution of new variants I do get a chance to talk with some. The big new variants are, it's true, not arising from ordinary mutations as the virus spreads throughout the population. Bottlenecks in each transmission event are a big issue. Long-term evolution within immune-compromised hosts looks really different. One also really has to be wary of contamination of samples with more recent strains.
