Probing the contested origins of SARS-CoV-2 is grounded in two hypotheses: that the pandemic virus entered the human population after escaping from a laboratory, or that it jumped to our species from infected animals, its natural hosts. Since Charles Darwin’s publication of “On the Origin of Species by Means of Natural Selection” in 1859, there have been countless debates over two logically independent processes: common descent and natural selection. The only figure in that publication (the Diagram of Divergence of Taxa) is frequently interpreted as a broad conceptual model of Darwin’s theory, because it illustrates the causal efficacy of natural selection in producing well-defined varieties and ultimately species.
Darwin’s Taxa Tree Diagram encompasses the idea that natural selection explains common descent and the origin of organic diversity. His initial Tree of Life sketch, which appears in his Notebook B of 1837, illustrates his theoretical insight into how a genus of related species might originate by divergence from a starting point. On December 31, 2019, four cases of pneumonia of unknown etiology in Wuhan were reported to the Chinese office of the WHO. On January 12, 2020, Chinese scientists disseminated the genetic sequence of the etiological agent, a virus that would be labeled SARS-CoV-2; the disease it causes came to be known as COVID-19.
On February 26, 2020, Brazil’s first case of COVID-19 was confirmed in a 61-year-old man who had recently visited the Lombardy region of Italy. Twenty-four hours later, the genome of the SARS-CoV-2 virus from that patient was sequenced. By that time, the agent was known to have a unique sequence of bases that differed from that of other species and was different enough from SARS-CoV-1 (genetic similarity of approximately 79%) and MERS-CoV (50%) to be considered novel.
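A figure such as “approximately 79% similarity” is typically a percent identity: the share of aligned positions at which two genomes carry the same base. The sketch below shows that arithmetic on two short, invented strings. It assumes the sequences have already been aligned to equal length by a separate tool, and the function name percent_identity is illustrative rather than taken from any of the studies discussed here.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned, non-gap positions at which two sequences match.

    Assumes both sequences were already aligned to the same length,
    with '-' marking alignment gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    matches = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        # Skip columns where either sequence has a gap.
        if base_a == "-" or base_b == "-":
            continue
        compared += 1
        if base_a == base_b:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Toy, invented sequences (not real viral data): 8 of 10 positions match.
toy_a = "ACGTACGTAC"
toy_b = "ACGTTCGTAA"
print(percent_identity(toy_a, toy_b))  # 80.0
```

Real genome comparisons run over roughly 30,000 bases and depend on how the alignment was built and how gaps are counted, so published percentages can differ slightly between studies.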
But twelve months ago, more than 200 data entries from the genetic sequencing of early cases of SARS-CoV-2 vanished from the Sequence Read Archive (SRA), a digital scientific database. The SRA is an international public archival resource for next-generation sequence data, set up under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC). This repository for raw sequencing data is maintained by the National Center for Biotechnology Information (NCBI), which is part of the US National Institutes of Health (NIH).
The mysterious disappearances neither buttress nor diminish the hypothesis that the pathogen seeped out of a Wuhan laboratory. But they do raise suspicions. Dr. Jesse Bloom, a computational biologist at the Fred Hutchinson Cancer Research Center, was reviewing genetic data published by COVID-19 research teams when he stumbled upon a March 2020 study with a spreadsheet cataloguing 241 genetic sequences collected by Wuhan University scientists.
The spreadsheet indicated that the scientists had uploaded the sequences to the SRA. But his search of the SRA for the raw files returned the message “no item found.” Confounded, he went back to the spreadsheet, which stated that the 241 sequences had been collected by a scientist named Ming Wang at Renmin Hospital in Wuhan. Dr. Bloom continued to review the literature and found that the sequences were tied to a paper in which Ming Wang and others used nanopore-sequencing technology to detect SARS-CoV-2 genetic material in samples taken from humans. That study was published in the journal “Small” on June 24, 2020 (Vol. 16, Issue 32), having been posted on “bioRxiv” on March 6, 2020.
In that study, scientists examined 45 nasal swab samples from outpatients, searching them for a portion of SARS-CoV-2’s genetic material. The scientists did not publish the actual sequences sifted out from the samples. They published only an elaborate table of mutations. Dr. Bloom suspected that the 45 samples were the source of the 241 deleted sequences, so he continued his search.
He found that the deleted files were still stored in the “cloud”. Many SRA files are held on Google Cloud at web addresses that share the same basic format, so he took such an address and swapped in the code for a missing sequence from Wuhan. Eureka! Using this approach, Dr. Bloom excavated 13 sequences from Google Cloud. He then combined his 13 sequences with other data from the early stages of the pandemic to map the Darwinian Tree of Life for SARS-CoV-2. He posits that it is plausible that the Wuhan market episode was one of the first super-spreading events and that by then SARS-CoV-2 may already have developed extensive diversity, even within Wuhan.
Tracing all the steps by which SARS-CoV-2 could have evolved from a bat virus is hindered by the paucity of samples from the Huanan Seafood Market. A further complication is that the Huanan market viruses had three extra mutations that are missing from SARS-CoV-2 samples collected weeks later. This suggests that those later viruses more closely resemble coronaviruses found in bats, which supports the hypothesis that there was some early lineage of the virus that did not appear in the seafood market.
Dr. Bloom found that the deleted sequences he recovered from Google Cloud also lack those extra mutations. Thus, the Wuhan market virus varieties may not be representative of the full diversity of coronaviruses already loose by December 2019. These deleted sequences do not shed light on the origins of COVID-19. But the excavation of partial SARS-CoV-2 genome sequences from the beginning of the pandemic at its apparent epicenter in Wuhan, sequences that were once in the SRA and later removed, is unnerving.
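The comparison behind the “three extra mutations” can be pictured as listing the positions at which each sample differs from a related outgroup sequence, such as a bat coronavirus. The sketch below is only a toy illustration of that idea. The sequences are invented, the names outgroup, market_like and recovered_like are hypothetical, and this is not Dr. Bloom’s actual analysis, which relied on full alignments and phylogenetic methods.

```python
from typing import List, Tuple


def differences(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """List (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Columns where either sequence has a gap ('-')
    or an unknown base ('N') are skipped, so partial sequences can still
    be compared over the positions they do cover.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")

    skipped = {"-", "N"}
    found = []
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, sample_base) in enumerate(pairs, start=1):
        if ref_base in skipped or sample_base in skipped:
            continue
        if ref_base != sample_base:
            found.append((position, ref_base, sample_base))
    return found


# Invented toy data, not real viral sequences.
outgroup = "ACGTACGTACGT"        # stands in for a bat coronavirus relative
market_like = "ACTTACATACGA"     # carries three differences from the outgroup
recovered_like = "ACGTNCGTACGT"  # partial read: one unknown base, no differences

print(differences(outgroup, market_like))
# [(3, 'G', 'T'), (7, 'G', 'A'), (12, 'T', 'A')]
print(differences(outgroup, recovered_like))
# []
```

In this toy case the market-like sample sits three mutations further from the outgroup than the recovered-like sample, which mirrors the pattern the article describes without saying anything about which lineage actually came first.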