
Human gene function publications that describe wrongly identified nucleotide sequence reagents are unacceptably frequent within the genetics literature

Q: How many citations have been accumulated by the problematic papers?

Given that over 17,000 citations have been accumulated by the problematic papers that the authors have identified, it seems inevitable that unreliable gene function papers are already wasting time and resources.

Q: What expect threshold was used to double-check the flagged incorrect sequences?

All flagged incorrect targeting sequences were double-checked through additional blastn searches against the database: “Homo sapiens (taxid:9606)”, optimized for “Somewhat similar sequences (blastn)”, using an expect threshold of 1000, in February 2021.
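The double-check described above was run through the NCBI web interface. As a rough sketch only, the same settings could be expressed as a command line for the standalone BLAST+ `blastn` program. The mapping is an assumption, not the authors' procedure: `-task blastn` is taken to correspond to the "Somewhat similar sequences (blastn)" option, `-taxids 9606` assumes a version 5 BLAST database with taxonomy information, and the file name `flagged_sequences.fasta` and database name `nt` are hypothetical. The function only builds the argument list; it does not run BLAST.

```python
from typing import List


def build_blastn_recheck_command(
    query_fasta: str,
    db: str = "nt",          # assumed database name, not stated in the source
    evalue: float = 1000,    # expect threshold quoted in the source
    taxid: int = 9606,       # Homo sapiens
) -> List[str]:
    """Build (but do not run) a BLAST+ blastn argument list."""
    return [
        "blastn",
        "-task", "blastn",             # assumed equivalent of "Somewhat similar sequences"
        "-query", query_fasta,
        "-db", db,
        "-evalue", format(evalue, "g"),
        "-taxids", str(taxid),         # restrict hits to human sequences
        "-outfmt", "6",                # tabular output, easier to parse
    ]


cmd = build_blastn_recheck_command("flagged_sequences.fasta")
print(" ".join(cmd))
```

A permissive expect threshold matters here because primers and siRNAs are short, so genuine matches can have E-values that a default cut-off would hide.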

Q: What was used to screen all the papers published in Gene from 2007-2018?

S&B was employed to screen all original articles published in Gene from 2007-2018, and all open-access articles published in Oncology Reports from 2014-2018 (Table 4).

Q: How many wrongly identified sequences were found in the flagged C+G papers?

Approximately half (51/100, 51%) of the flagged C+G papers were found to include a median of 2 (range 1-8) wrongly identified sequences per paper (Table 1).

Q: What are the types of errors that are common in the biomedical and genetics literature?

These error types represent the equivalent of spelling errors (12, 14, 15), as well as identity errors, where a correct sequence is replaced by a different and possibly genetically unrelated sequence (11-13, 16-21).

Q: Why might a focus upon ncRNA’s make large publication series less visible?

ncRNA’s possess largely numeric identifiers that may be more difficult to recognise and recall than alphanumeric gene identifiers, so any focus upon ncRNA’s could contribute to large publication series being less visible within the literature.

Q: What can be done to encourage further fact-checking of the gene function literature?

Professional societies can reinforce the importance of reagent verification through conference presentations, education programs, and journal editorials, and can advocate for tangible incentives to encourage further fact-checking of the genetics literature.

Q: Is workplace sabotage a likely explanation for wrongly identified sequences across hundreds of gene function papers?

As workplace sabotage is typically directed towards known individuals (39, 40), this seems an unlikely explanation for wrongly identified sequences across hundreds of gene function papers published by many different authors.

Yasunori Park¹, Rachael A. West¹,², Pranujan Pathmendra¹, Bertrand Favier³, Thomas Stoeger⁴, Amanda Capes-Davis¹,⁵, Guillaume Cabanac⁶, Cyril Labbé³, Jennifer A. Byrne¹,⁷

University of Sydney¹, Children's Hospital at Westmead², University of Grenoble³, Northwestern University⁴, Children's Medical Research Institute⁵, University of Toulouse⁶, Ministry of Health (New South Wales)⁷

31 Jul 2021, bioRxiv (Cold Spring Harbor Laboratory)

TL;DR: In this article, a semi-automated screening tool, Seek & Blastn, was used to identify 712 papers across 78 journals that described at least one wrongly identified nucleotide sequence.


Abstract: Nucleotide sequence reagents underpin a range of molecular genetics techniques that have been applied across hundreds of thousands of research publications. We have previously reported wrongly identified nucleotide sequence reagents in human gene function publications and described a semi-automated screening tool Seek & Blastn to fact-check the targeting or non-targeting status of nucleotide sequence reagents. We applied Seek & Blastn to screen 11,799 publications across 5 literature corpora, which included all original publications in Gene from 2007-2018 and all original open-access publications in Oncology Reports from 2014-2018. After manually checking the Seek & Blastn screening outputs for over 3,400 human research papers, we identified 712 papers across 78 journals that described at least one wrongly identified nucleotide sequence. Verifying the claimed identities of over 13,700 nucleotide sequences highlighted 1,535 wrongly identified sequences, most of which were claimed targeting reagents for the analysis of 365 human protein-coding genes and 120 non-coding RNAs, respectively. The 712 problematic papers have received over 17,000 citations, which include citations by human clinical trials. Given our estimate that approximately one quarter of problematic papers are likely to misinform or distract the future development of therapies against human disease, urgent measures are required to address the problem of unreliable gene function papers within the literature.

Author summary: This is the first study to have screened the gene function literature for nucleotide sequence errors at the scale that we describe. The unacceptably high rates of human gene function papers with incorrect nucleotide sequences that we have discovered represent a major challenge to the research fields that aim to translate genomics investments to patients, and that commonly rely upon reliable descriptions of gene function. Indeed, wrongly identified nucleotide sequence reagents represent a double concern, as both the incorrect reagents themselves and their associated results can mislead future research, both in terms of the research directions that are chosen and the experiments that are undertaken. We hope that our research will inspire researchers and journals to seek out other problematic human gene function papers, as we are unfortunately concerned that our results represent the tip of a much larger problem within the literature. We hope that our research will encourage more rigorous reporting and peer review of gene function results, and we propose a series of responses for the research and publishing communities.


Summary


Introduction

The promise of genomics to improve the health of cancer and other patients has resulted in billions of dollars of research investment which have been accompanied by expectations of similar quantum gains in health outcomes (1, 2).
The authors' initial application of S&B identified 77/203 (38%) screened papers with incorrect nucleotide sequence reagents, with their focus being the description of the S&B tool (12), as opposed to its application.
The authors have now employed S&B to screen original research papers across 5 literature corpora, representing 3 targeted and 2 journal corpora.
The 75 problematic single gene knockdown (SGK) papers analysed 24 human cancer types, most frequently brain cancer, where 1-9 problematic SGK papers were identified per queried gene (Table 2).

miR-145 corpus

PubMed similarity searches employing individual SGK papers identified numerous papers that analysed the functions of different human miR's in cancer cell lines.
Papers that focussed upon miR-145 were identified using PubMed similarity searches of index papers (12, 13).
In contrast to SGK papers, most incorrect sequences in miR-145 papers were employed as (RT-)PCR primers (Table 3) and were identified only once within the corpus (Fig 3).
The miR-145 corpus included papers that analysed human miR-145 function in human cell lines.
Publication dates were limited to 2019 to broadly align with the SGK corpus.

Analysis of all problematic human gene function papers

After adjusting for 9 duplicate papers across the 5 corpora, the authors identified 712 problematic papers with wrongly identified sequences (Fig 1) that were published by 78 journals and 31 publishers (S6 Data).
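The duplicate adjustment mentioned above can be pictured as a set union over paper identifiers. The sketch below is illustrative only: the corpus names and identifiers are hypothetical placeholders, and the source does not state which identifier (PMID, DOI or other) the authors used to detect duplicates.

```python
from typing import Dict, Set


def count_unique_papers(corpora: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count listings, unique papers and duplicate listings across corpora."""
    listed = sum(len(ids) for ids in corpora.values())
    unique = len(set().union(*corpora.values()))
    # A paper found in k corpora contributes k - 1 duplicate listings.
    return {"listed": listed, "unique": unique, "duplicates": listed - unique}


# Hypothetical identifiers, not data from the study.
example = {
    "corpus_A": {"paper-001", "paper-002", "paper-003"},
    "corpus_B": {"paper-003", "paper-004"},
    "corpus_C": {"paper-004", "paper-005"},
}
summary = count_unique_papers(example)
print(summary)
```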
As most incorrect reagents represented (RT-)PCR primers which are employed as paired reagents, the authors considered the verified identities of primer pairs that were found to include at least one wrongly identified primer (Fig 6).
Many problematic papers (n=192) described primer pairs that were predicted to target the same incorrect gene (Fig 6).
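To make the primer-pair comparison concrete, here is a minimal sketch that compares the claimed target of a pair with the verified target of each primer. The category labels, the `PrimerPair` type and the placeholder gene names are assumptions made for illustration; they are not the categories used in Fig 6.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrimerPair:
    claimed_target: str
    forward_verified: Optional[str]  # None means no verified human target
    reverse_verified: Optional[str]


def classify_pair(pair: PrimerPair) -> str:
    """Assign a primer pair to one of four illustrative categories."""
    fwd_ok = pair.forward_verified == pair.claimed_target
    rev_ok = pair.reverse_verified == pair.claimed_target
    if fwd_ok and rev_ok:
        return "correct"
    if pair.forward_verified is not None and pair.forward_verified == pair.reverse_verified:
        # Both primers agree with each other, but not with the claimed gene.
        return "same incorrect gene"
    if fwd_ok or rev_ok:
        return "one primer incorrect"
    return "both incorrect, different or no targets"


# Placeholder gene names, not examples from the study.
pairs = [
    PrimerPair("GENE_A", "GENE_A", "GENE_A"),
    PrimerPair("GENE_A", "GENE_B", "GENE_B"),
    PrimerPair("GENE_A", "GENE_A", "GENE_C"),
    PrimerPair("GENE_A", "GENE_B", None),
]
labels = [classify_pair(p) for p in pairs]
print(labels)
```

A pair in the "same incorrect gene" category would still amplify a product, which is one reason such errors can pass unnoticed in published results.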

Bibliometric analysis of human genes analysed in problematic papers

Primary protein-coding genes, which represented the first-listed genes in publication titles or abstracts, tended to be associated with more papers in PubMed than a randomly chosen human protein-coding gene (median publication numbers: 167 vs 31, P < 10⁻¹⁰⁹, two-sided Mann-Whitney U test) (S3A Fig).
Again, most wrongly identified target genes have appeared in more papers than a randomly chosen protein-coding gene (median publication numbers: 238 vs 31, P < 10⁻⁹⁴, two-sided Mann-Whitney U test) (S3B Fig).
The most frequent wrongly claimed gene targets were GAPDH and ACTB (Fig 7B), reflecting their widespread use as RT-PCR control genes.
In summary, these analyses demonstrate that problematic papers can focus upon and/or employ reagents that are wrongly claimed to target highly-investigated human genes such as BCL2, EGFR, PTEN, STAT3, and CCND1 (Fig 7).
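The comparisons above rely on a two-sided Mann-Whitney U test, which asks whether values in one group tend to be larger than values in another without assuming a normal distribution. The following is a minimal standard-library sketch using the normal approximation with tie and continuity corrections. It is not the authors' analysis code, the publication counts are synthetic, and in practice a statistics library (for example SciPy's `mannwhitneyu`) would normally be used instead.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                        # size of this group of tied values
        avg_rank = (i + 1 + j) / 2.0     # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0                  # all values tied: no evidence of a shift
    z = max((abs(u_x - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    return u_x, min(math.erfc(z / math.sqrt(2.0)), 1.0)


# Synthetic publication counts for illustration only.
studied_genes = [150, 240, 95, 310, 180, 120]
random_genes = [31, 12, 45, 8, 60, 25]
u_stat, p_value = mann_whitney_u(studied_genes, random_genes)
print(u_stat, p_value)
```

The normal approximation is adequate for large samples such as gene-level publication counts; for very small samples an exact test would be more appropriate.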

Discussion

Experimental analyses of gene function require nucleotide sequence reagent identities to precisely match their published descriptions.
As previously discussed, papers that describe incorrect nucleotide sequences could encourage the incorrect selection of genes for further experimentation, possibly at the expense of more productive candidates (8).
Large numbers of human gene function papers with incorrect nucleotide sequences that list hospital affiliations in China could reflect hospital doctors turning to paper mills to meet publication requirements, whereas the contrasting institutional profiles of problematic papers from other countries could highlight different publication pressures elsewhere.
In summary, the authors are concerned that the sheer number of human genes that are available for analysis, combined with research drivers that favour the continued investigation of genes of known function (48-50), are unwittingly providing an extensive source of topics around which gene function papers can be fraudulently created.

Future directions

The authors' results indicate that the problem of incorrect gene function papers requires urgent action.
Within the research community, this can take place in several ways.
Similarly, recent changes to researcher assessment (65, 66) will not address problematic papers that have already been published.
While the described efforts to screen incoming manuscripts are welcome and should be extended to all journals that publish gene function research, screening incoming manuscripts must be coupled with addressing problematic papers that are already embedded in the literature (71, 74-76).
These efforts could be supported by gene function experts who could explain the significance of incorrect nucleotide sequences and/or provide training for editorial staff, particularly as the necessary researcher skills are already widely available.

Summary and conclusions

To fully extend the benefits of genomics towards patients and broader populations, it is widely recognised that the functions of every human gene must be understood (1, 2).
Whereas genuine gene research requires time, expertise, and material resources, the mass production of fraudulent gene function papers by paper mills could be quicker and cheaper by orders of magnitude (8).
Given the number of human genes whose functions can be analysed singly and/or in combination with other genes and/or drugs across different cancer types or other diseases, combined with acute demands for research productivity that may not always be matched by researcher capacity and training (78), fraudulent gene function papers could unfortunately outstrip the publication of genuine gene function research.
Indeed, the possible extent of the problem of unreliable human gene function papers is indicated by the lack of overlap between the problematic papers that the authors have reported, and other papers of concern reported elsewhere (71, 74, 79).
While publishers and journals decide how to address this urgent problem, laboratory scientists, text miners and clinical researchers must approach the human gene function literature with a critical mindset, and carefully evaluate the merits of individual papers before acting upon their results.



Short title: Wrongly identified nucleotide sequences in gene function papers

Yasunori Park¹, Rachael A West¹,², Pranujan Pathmendra¹, Bertrand Favier³, Thomas Stoeger⁴,⁵,⁶, Amanda Capes-Davis¹,⁷, Guillaume Cabanac⁸, Cyril Labbé⁹, Jennifer A Byrne¹,¹⁰,*

¹ Faculty of Medicine and Health, The University of Sydney, NSW, Australia

² Children’s Cancer Research Unit, Kids Research, The Children’s Hospital at Westmead, Westmead, NSW, Australia

³ Univ. Grenoble Alpes, TIMC, Grenoble, France

⁴ Successful Clinical Response in Pneumonia Therapy (SCRIPT) Systems Biology Center, Northwestern University, Evanston, United States

⁵ Department of Chemical and Biological Engineering, Northwestern University, Evanston, United States

⁶ Center for Genetic Medicine, Northwestern University School of Medicine, Chicago, United States

⁷ CellBank Australia, Children’s Medical Research Institute, Westmead, New South Wales, Australia

⁸ Computer Science Department, IRIT UMR 5505 CNRS, University of Toulouse, 118 route de Narbonne, 31062 Toulouse Cedex 9, France

⁹ Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France

¹⁰ NSW Health Statewide Biobank, NSW Health Pathology, Camperdown, NSW, Australia

* Corresponding author

bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.453321; this version posted July 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Keywords: cancer; gene function; miRNA; non-coding RNA’s; nucleotide sequence reagent; paper mill; protein-coding gene






Introduction

The promise of genomics to improve the health of cancer and other patients has resulted in billions of dollars of research investment which have been accompanied by expectations of similar quantum gains in health outcomes (1, 2). Since the first draft of the human genome was reported (3, 4), a series of increasingly rapid technological advances has permitted the routine sequencing of human genomes at scale (1, 2), and the increasing application of genomics to inform clinical care (1, 2, 5). Despite the now routine capacity to sequence the human genome, genomics research relies upon results produced by other research fields to translate genome sequencing results to patients (5-7). For example, while whole genome sequencing demonstrates that thousands of human genes are mutated or deregulated in human cancers (1), knowledge of human gene function is required to prioritise individual gene candidates for subsequent pre-clinical and translational studies (5-7).

A first step in triaging and prioritising gene candidates for further analysis is the consideration of available knowledge of predicted and/or demonstrated gene functions (5-8). High quality, reliable information about gene function is important to select the most promising gene candidates and to then progress these candidates through pre-clinical and translational research pipelines (8), which is supported by drug candidates with genetically supported targets being significantly more likely to progress through phased clinical trials (9, 10). However, in contrast to the sophisticated platforms that produce genomic or transcriptomic sequence data at scale, gene function experiments typically analyse single or small numbers of genes through the application of more ubiquitous molecular techniques (6), some of which have been in routine experimental use for 15-30 years. For example, gene knockdown approaches have been widely employed to assess the consequences of reduced

