The Chinese side of the COVID data withdrawal controversy

Much ado about nothing, it turns out.

Jul 24, 2021

Around one month ago, in late June 2021, media outlets from The New York Times, the Wall Street Journal, Financial Times, Bloomberg, to Nature reported that Wuhan University researchers had “mysteriously” withdrawn COVID-19 sequencing data from a U.S. database, quoting experts saying “the incident demonstrated further evidence of how Chinese researchers and officials have not been fully transparent in how they dealt with data related to the pandemic’s origins.” (FT)

Your Pekingnologist published a Chinese vice-minister’s account of the incident at a press conference the day before yesterday, which you can read as background for this newsletter.

Yang Liu, a friend and colleague at day-job employer Xinhua, and your Pekingnologist have managed to speak to the Wuhan University researcher at the center of the controversy.

The researcher insisted on anonymity before talking to us, for fear of harm to his or her career. The researcher said that so far he or she had not been named in any press coverage, that the Chinese vice-minister didn’t name him or her, and that Dr. Jesse Bloom, the U.S. researcher challenging the data withdrawal, had also not named him or her.

Yang Liu and your Pekingnologist decided to grant anonymity with the full realization that this might affect the reception of our respective newsletters. But we have to leave this to your better judgment.

We want to stress that nobody in Beijing directed or aided our outreach to the Wuhan University researcher. In fact, our first phone call was shut down in about ten seconds - as soon as we identified ourselves. Later, Yang Liu’s repeated pledge that he would not betray a fellow Wuhan University alumnus apparently helped a lot.

Now a spoiler to the story: it’s actually very boring, especially for a weekend, but for a story that has received wide international press coverage, this is one of the two sides that hasn’t been heard.

***

The Wuhan University researchers were trying to publish a paper in an academic journal, which has been named by the Chinese vice-minister as Small. The paper is about a new sequencing method to determine whether a person has been infected by SARS-CoV-2 or other respiratory viruses.

To prove a new sequencing method works, the researchers needed to show the data where their sequencing method was applied. The text of the paper described the raw sequencing data from their sequencing method, but to further strengthen the credibility of the raw sequencing data and consequently the new sequencing method, the authors also uploaded their raw sequencing data to a U.S. database and wrote one paragraph in their paper detailing how to access the raw sequencing data (including the Internet link, the accession code, etc.).

After reviewing and editing, the journal sent a draft back to the Chinese researchers, and in this draft that paragraph was gone - apparently deleted by people working for the journal.

The Chinese researchers thought since it was the journal that deleted the paragraph, then it must mean that the journal decided the link to the raw sequencing data was unimportant.

Then the Chinese researchers thought: since the link would not be included in the published paper, what’s the good of leaving the raw sequencing data hanging alone with no direct citation in any paper? In the original words of the Chinese researcher that we spoke to, the raw sequencing data would be like 无头苍蝇, a headless fly. So they wrote to the U.S. database asking for a withdrawal, and the database, per established protocols, obliged.

The Chinese researcher repeatedly emphasized in the interview that the raw sequencing data is of little value to SARS-CoV-2 origin tracing, for about three reasons summarized by your Pekingnologist:

1) The first of the raw sequencing data was sampled on January 30, 2020. It wasn’t “early” for the purpose of origin tracing.

On that day, China reported 9,692 COVID-19 cases, according to CNN.

The first known case, as of now, was Dec. 8, 2019. That’s about two months before the first sampling.

2) Raw sequencing data has to meet some quality standards so as to be of meaning and value to origin tracing efforts, but the raw sequencing data for that paper didn’t meet those standards, because the raw sequencing data was only for proving the new sequencing method - a much lower bar.

To explain this to a layman like Yang Liu or your Pekingnologist, the researcher said their raw sequencing data was like covering only a few digits of a dozen-digit-long ID number; to use their raw sequencing data to identify the person behind the ID number would be impossible - they don’t even have the whole ID number.

3) The text of the paper already has a description of the raw sequencing data, which provides reasonably sufficient information - enough that the paper was accepted and published by the journal Small - a decent journal.

With a 2016 ISI Impact Factor of 8.64, Small continues to be among the top multidisciplinary journals covering a broad spectrum of topics at the nano- and microscale at the interface of materials science, chemistry, physics, engineering, medicine, and biology.

The paper was published on June 24, 2020, without that paragraph and hence without the description of how to access the data in the U.S. database.

Sometime later, a Dr. Jesse Bloom in the U.S. apparently stumbled upon the data, became interested, did detective work, and recovered the data via some other means.

Dr. Bloom on June 7, 2021, which is already almost one year after the paper’s publication, wrote to the Chinese researcher asking “why the raw sequencing data for the study are no longer available?” - this is a direct quote, shown by the Chinese researcher to us. If Dr. Bloom reads this newsletter and checks his sent email, he would know this is word-for-word accurate.

The Chinese researcher told us they didn’t know Dr. Bloom, and that they thought - and still think - that if they were to share the “raw sequencing data,” the best way was to upload it to a database and make it public, not to share it exclusively with one person.

The Chinese researcher admitted they didn’t write back to Dr. Bloom at the time.

About two weeks later, on June 22, 2021, U.S. time (June 23 in Beijing Time, GMT+8), which is about one month ago from now, Dr. Bloom published this pre-print and Twitter thread, which immediately went viral.

Bloom Lab @jbloom_lab

In a new study, I identify and recover a deleted set of #SARSCoV2 sequences that provide additional information about viruses from the early Wuhan outbreak: biorxiv.org/content/10.110… (1/n)

biorxiv.org: Recovery of deleted deep sequencing data sheds more light on the early Wuhan SARS-CoV-2 epidemic. The origin and early spread of SARS-CoV-2 remains shrouded in mystery. Here I identify a data set containing SARS-CoV-2 sequences from early in the Wuhan epidemic that has been deleted from the NIH’s Sequence Read Archive. I recover the deleted files from the Google Cloud, and reconstruct partial se…

Your Pekingnologist clearly recalls reading this thread while he was walking out of Exit E of Beijing’s Xuanwumen subway station to his office and thinking “shit, this is gonna blow up.”

And it did. Nearly all the mainstream international media outlets you’ve heard of published a story on the same day (June 23: NYT, WSJ) or the next day (June 24: FT, Nature).

The press reports said that the outlets had reached out but didn’t hear back IMMEDIATELY from the Chinese researchers:

NYT: Three of the co-authors of the 2020 testing study that produced the 13 sequences did not immediately respond to emails inquiring about Dr. Bloom’s finding.
WSJ: The researchers didn’t immediately respond to a request for comment. China’s National Health Commission didn’t immediately respond to a request for comment.
FT: Wuhan University did not respond to a request for comment.
Nature: The corresponding authors of the Small paper did not respond to questions from Nature’s news team about why they asked for the sequences to be removed from the SRA, which happened before the paper was published.

Meanwhile, the overwhelming majority of the reports hinted strongly that this may be another incident offering, in FT’s summary of experts’ comments:

further evidence of how Chinese researchers and officials have not been fully transparent in how they dealt with data related to the pandemic’s origins.

The Chinese researcher said that the Chinese government authorities did NOT contact them to ask about the matter UNTIL the international media coverage.

The Chinese vice-minister mentioned in Thursday’s press conference:

这个事情报道出来以后，我们马上对这个事情进行了调查、了解。
After this incident was reported, we immediately conducted an investigation and gained an understanding of it.

***


The following, built largely on the Wuhan University researcher’s account, is a play-by-play account of what happened (and is largely the same as Yang Liu’s Beijing Channel newsletter, which went out several hours ago).

I. What was the research in question actually about?

The research, as shown in the published paper, was an attempt to find a sequencing method. At the time of the research in 2020, medical professionals in China, and especially in Wuhan, lacked an efficient and effective method to diagnose SARS-CoV-2 and differentiate it from other respiratory viruses.

Or, as described in the published abstract of the paper:

The ongoing global novel coronavirus pneumonia COVID-19 outbreak has engendered numerous cases of infection and death. COVID-19 diagnosis relies upon nucleic acid detection; however, currently recommended methods exhibit high false-negative rates and are unable to identify other respiratory virus infections, thereby resulting in patient misdiagnosis and impeding epidemic containment.
Combining the advantages of targeted amplification and long-read, real-time nanopore sequencing, herein, nanopore targeted sequencing (NTS) is developed to detect SARS-CoV-2 and other respiratory viruses simultaneously within 6–10 h, with a limit of detection of ten standard plasmid copies per reaction...NTS is thus suitable for COVID-19 diagnosis; moreover, this platform can be further extended for diagnosing other viruses and pathogens.

II. What was the COVID-19 data for?

The raw sequencing data was originally uploaded to the U.S. National Library of Medicine database because it could be of value to reviewers in the journal’s publication process, so that they could assess both the results and the sequencing method. The researcher said:

因为审稿人会要求你说明这些数据是编造的还是真实的？真实发生完以后，你测到的数据长成什么样，得给大家看一下，你得给审稿人看一下。我们就按照这个要求把数据提交到美国国立卫生研究院的数据库上面了。
Because the reviewers would need proof of the authenticity of the data used in the research - was the data (in the paper) real or fabricated? What does your data look like? So we submitted our data to the National Library of Medicine database.

According to the researcher, the primary purpose of the paper was to establish a new diagnosis method; the sequence of the SARS-CoV-2 that was in the samples was somewhat irrelevant to the main point of the research.

因为首先这个数据对这篇文章没有什么直接的科学价值。我们开发的检测方法就相当于炒盘菜，这道菜使用的原材料可以写也可以不写，因为我们已经在正文中完整地展示了病毒的序列。
The raw sequencing data had no direct scientific value to the paper. The new diagnosis method that we developed was like a Chinese stir-fried dish, and the exact ingredients for this dish didn’t matter that much, because we had already fully presented the sequences of the viruses in the paper’s text.
所有信息都在正文中以图表的形式展示出来。其实这些序列就像一堆铅笔，然后我们已经使用了非常易懂的方式描述了出来，铅笔有红色的，有蓝色的，你就算不把这些铅笔放在那里，我们也已经描述了事实。
All the information has been presented in the body of the text in the form of tables and figures. Actually, these sequences are like a bunch of pencils and then we have described it in a very understandable way, some pencils are red, some are blue. Even if these pencils are not before your eyes, we have described the facts.

The researcher further stated that, due to the nature of the data collected from the samples, it is not suitable for use in any origin tracing efforts.

我们检测就是去撒个网的范围，只能最大程度上能捞到病毒序列的 1/3。但是绝大多数我们这里面还覆盖不到。所以这些数据的质量从覆盖度来讲，达不到溯源数据的要求。在世卫组织的报告里面有很明确的这个描述。在他讲那段之前上来就提到几个标准，这几个标准我们没有一条是满足的，所以说是谁都不会拿我们这个数据去讲溯源的事情。第一，他的覆盖度不够，你想给做一个人的身份证鉴定，我们只包括了身份证号码的后面的那一小部分，怎么能证明说这个人身份证号是全的。第二就是我们深度测序的准确性。这种方法的准确性用于诊断是足够了，但是用于溯源精准的判断，那是不够的。所以这是很正常的，不会有人拿我们这个数据去做溯源分析的。
The net we cast could only capture one-third of the sequence of SARS-CoV-2. The majority of the SARS-CoV-2 sequences couldn’t be captured in our sequencing method. So from this perspective, the data does not meet the standard of origin tracing requirements. A WHO report described in detail the requirements, none of which our raw sequencing data meets. So no one (familiar with all the ins and outs) would use our data for origin tracing work.
The first thing is the limited coverage. This is like identifying a person by his or her ID number: if we only have a fraction of that ID number, how can we know the complete ID number?
Secondly, there is the accuracy of the data harvested from our samples. They are accurate enough for diagnosis (of SARS-CoV-2), but not for origin tracing. This is very normal - no one would take our data to do origin tracing work.

III. When was the sampling done that produced the data?

According to the researcher, a total of two batches of samples were taken.

In the first batch, a total of 45 samples were taken randomly from patients that sought treatment in Wuhan on Jan. 30th, 2020. The second batch of samples was taken from a group of patients in mid-February, 2020.

In the opinion of the Chinese vice-minister, that means the data had little value in COVID-19 origin tracing - it’s simply not “early” data.

On Jan. 30, 2020, China reported 9,692 COVID-19 cases, according to CNN. The first known case, as of now, was Dec. 8, 2019. That’s about two months before the first sampling.

IV. Why did Wuhan University researchers withdraw the data from the U.S. database?

The researcher said that in their submission to the journal, they included a paragraph that describes the Internet link and accession code to find the raw sequencing data in the U.S. database.

However, the paragraph was deleted during the (copy) editing process by people on the journal’s side.

The Chinese researchers didn’t object to that deletion.

所以基于这种情况我们一看这个杂志就是把这段删掉了，我们觉得这个是没必要的，因为这个杂志本身也是一个方法学的杂志，它不是一个国外的媒体或学者关注或发表的病毒序列文章，大家完全不在一个领域里面。
When we saw that the journal had deleted the paragraph, we believed that then the paragraph was unnecessary. The journal itself focused on methodology. And the paper was not intended to be a paper publishing virus sequences, which foreign media and scholars paid a lot of attention to. We were not in the same field (of science).

Since the language pointing to the data on the database was deleted, the researcher thought there was no reason for the data to be kept on the database, “because no one will know why it existed there”.

就是你这个数据因为在正文里面它没有这段描述了，所以你把数据传到一个地方，它就像一只无头的苍蝇在里面，没有人知道说这个数据是跟我们没有关系的，也可能时间长了，我们自己也找不着那个东西了，也没有一个链接，所以我们就把这个数据删掉了。这个是去年6月份的事情。
Because the paper no longer included this descriptive paragraph (of the link to the database), the data that was stored in the database was like a headless fly. Nobody would know the data’s association, maybe after some time, even we wouldn’t be able to find the data, since there was no link. So we asked for the data to be deleted. This took place in June 2020.

V. How did Dr. Jesse Bloom come into the picture?

Fast forward one year to 2021: the researcher said an email came from Dr. Jesse Bloom on Monday, June 7, a snapshot of which was shown to Yang Liu and your Pekingnologist.

我完全不认识这个人，在我这个研究领域里面也都没有交集。我只是把他当成一个普通的研究者。普通的研究者如果给我写邮件，向我要数据的话，如果这个数据也确实是发表的，我就会把数据公开共享。但我不可能说我把这个数据共享给他一个人。
I didn't know this person (Dr. Jesse Bloom) at all, and he and I had no intersection in my research field. I just regarded him as an ordinary researcher. If an ordinary researcher writes me an email and asks me for data, if this data needs to be published, I will share it publicly, not exclusively with the ordinary researcher.
我之前压根没听说过这个人，因为，我看到他给我写信的话，我的第一反应是我们把这个数据再上传一个地方。
I had never heard of this person (Dr. Jesse Bloom) before. When I saw his email to me, my first reaction was that we should re-upload this data to another place.

Dr. Jesse Bloom published on June 22 (in the time zone of the continental United States), which was June 23 in China.

他在6月22号发了一篇文章，非常具有攻击性和污蔑性，直接说我们是有隐瞒的，不然为什么把数据撤回来。在他发布文章后，美国有很多家媒体，包括纽约时报等很多就给我们轮番轰炸，写信问我这边是怎么回事，我压根不知道是怎么回事，然后紧接着就在当天美国的主媒体就在那个人的基础上把这个事情又进行了升华，就是衍生了一些报道。
He (Dr. Jesse Bloom) posted an article on June 22nd that was very offensive and slanderous, directly saying that we were hiding something, otherwise why did we pull back the data.
After he published the article, many media in the United States, including the New York Times and many others, bombarded us with emails asking me what happened. I absolutely didn't know what was happening. And then on the same day the mainstream media in the United States “upgraded” the story on the basis of that person's article.
报道各式各样，比如武汉研究人员神秘的删除了在美国国立卫生研究院数据库上上传的数据。我数了一下，大约有四十几家媒体进行报道，包括主流杂志Science、Nature也给我写信，问我事件的真是情况，但是他们还没等我回应，或者没等我反应过来，就纷纷发表了文章，描述这个事件。然后我们就先后回复我们做了什么，第一就是我们把这个数据重新上传了，上传到国家生物信息数据库里面，原来怎么上传，我们现在就怎么上传，是完全公开的。
There were all kinds of reports, such as Wuhan researchers mysteriously deleting the data uploaded to the NIH database. I counted forty-odd media outlets that reported on it. The mainstream (academic) journals Science and Nature also wrote to me asking about the real situation, but they published articles describing the incident before I could respond or fully understand what was going on. Then we responded with what we had done. First, we re-uploaded the data to a national bioinformatics database in China, exactly as we had uploaded it before, and it is completely open-access.

VI. Was there any hiding of Wuhan COVID-19 data?

Bloom Lab @jbloom_lab

There are also broader implications. First, fact this dataset was deleted should make us skeptical that all other relevant early Wuhan sequences have been shared. We already know many labs in China ordered to destroy early samples: scmp.com/news/china/soc… (16/n)

The researcher said that there was no basis for asserting that they were hiding anything, because first and foremost, the published paper didn’t include any description of the Internet link, accession codes, etc. to the raw sequencing data in the U.S. database.

正文里面也没有这段，这个很简单，你可以到我那个杂志，文章是开源的，就是open access的，你都可以下载，在正式发表的文章里面是没有任何数据描述的，如果说我在正文里面描述了这么一段它存在哪个地方了，然后我又自己私下里面就把这段给它删除了，杂志社都不会饶了我。
The published text did not have that paragraph, this is very simple, you can go to that journal, and the article is open-access, you can download it. In the officially published version, there is no description of the link to the data’s storage in the database. If there were such a paragraph in the text describing where the data was accessible, and then I privately deleted it, the journal would not spare me.

VII. Why not reply at that time?

如果在23号他铺天盖地写的时候去回应他，我们有几个证据还没有完成。首先，我需要把这个数据公开。我不能说在我还没公开这个数据的时候，就去回应他。你回应完你不还是没有公开吗？
If we were to respond to him (publicly) on June 23, when he was writing all over the place, we still had several pieces of evidence that were not yet complete. First, I needed to make this data public. I can't respond when I still haven’t made the data public - in that case, even if I responded, the data was still not public, what’s the point?

The researcher said the entire dataset has been re-uploaded:

We re-uploaded the data to the GSA database run by the China National Center for Bioinformation. What we uploaded to the NCBI (National Center for Biotechnology Information) of the United States, we re-uploaded to the Chinese database in full. It is totally public.

Responding to suspicions that the Chinese government was behind the withdrawal of the data, the researcher said he or she couldn't recall the exact date when the Chinese authorities made contact, but it was sometime after June 22, 2021 (when Dr. Jesse Bloom published his preprint and Twitter thread) and before early July, a full year after the data was withdrawn.

The Chinese vice-minister hinted at this during Thursday’s press conference, saying

这个事情报道出来以后，我们马上对这个事情进行了调查、了解。
After this incident was reported, we immediately conducted an investigation and gained an understanding of it.

In hindsight, the researcher said:

我也在反思这个事、这个过程。上一次有人给我写信问这个事的时候，我怎么样回复能更好？但是请你理解，这个时候我不知道该怎么办，也没有人告诉我该怎么办。
I am also reflecting on this matter/process: when I received the email about it, could I have responded better? Would things turn out to be better if I had said something better? But I ask for your understanding - at the time, I didn’t know what to do, and there was nobody telling me what to do.

***

(Totally personal opinions below)

The Chinese researcher should definitely do that reflecting, and perhaps take International Communication 101.

Dr. Jesse Bloom and the media outlets which jumped on the nothing-burger should do some reflecting as well.

Hopefully, the response to the latter suggestion won’t be just something like:

“I’ve written them an email and given them two weeks to respond. They didn’t respond. It’s on them.”

“We reached out to them, gave them a chance to speak before our publication, which is the usual practice based on widely accepted journalism standards, but they didn’t respond immediately. So we went to publish.”

Two outstanding interns, Chenjie Liao, a student at Sun Yat-sen University, and Qi Cui, a graduate student at China Foreign Affairs University, have just joined Yang Liu’s Beijing Channel newsletter and contributed to this newsletter since it’s a joint production between him and me.

Pekingnology
