
Human gene function publications that describe wrongly identified nucleotide sequence reagents are unacceptably frequent within the genetics literature

TL;DR: In this article, a semi-automated screening tool, Seek & Blastn, was used to identify 712 papers across 78 journals that described at least one wrongly identified nucleotide sequence.
Abstract: Nucleotide sequence reagents underpin a range of molecular genetics techniques that have been applied across hundreds of thousands of research publications. We have previously reported wrongly identified nucleotide sequence reagents in human gene function publications and described a semi-automated screening tool Seek & Blastn to fact-check the targeting or non-targeting status of nucleotide sequence reagents. We applied Seek & Blastn to screen 11,799 publications across 5 literature corpora, which included all original publications in Gene from 2007-2018 and all original open-access publications in Oncology Reports from 2014-2018. After manually checking the Seek & Blastn screening outputs for over 3,400 human research papers, we identified 712 papers across 78 journals that described at least one wrongly identified nucleotide sequence. Verifying the claimed identities of over 13,700 nucleotide sequences highlighted 1,535 wrongly identified sequences, most of which were claimed targeting reagents for the analysis of 365 human protein-coding genes and 120 non-coding RNAs, respectively. The 712 problematic papers have received over 17,000 citations, which include citations by human clinical trials. Given our estimate that approximately one quarter of problematic papers are likely to misinform or distract the future development of therapies against human disease, urgent measures are required to address the problem of unreliable gene function papers within the literature.

Author summary: This is the first study to have screened the gene function literature for nucleotide sequence errors at the scale that we describe. The unacceptably high rates of human gene function papers with incorrect nucleotide sequences that we have discovered represent a major challenge to the research fields that aim to translate genomics investments to patients, and that commonly rely upon reliable descriptions of gene function. Indeed, wrongly identified nucleotide sequence reagents represent a double concern, as both the incorrect reagents themselves and their associated results can mislead future research, both in terms of the research directions that are chosen and the experiments that are undertaken. We hope that our research will inspire researchers and journals to seek out other problematic human gene function papers, as we are unfortunately concerned that our results represent the tip of a much larger problem within the literature. We hope that our research will encourage more rigorous reporting and peer review of gene function results, and we propose a series of responses for the research and publishing communities.

Summary

Introduction

  • The promise of genomics to improve the health of cancer and other patients has resulted in billions of dollars of research investment which have been accompanied by expectations of similar quantum gains in health outcomes (1, 2).
  • The authors' initial application of Seek & Blastn (S&B) identified 77/203 (38%) screened papers with incorrect nucleotide sequence reagents, with their focus being the description of the S&B tool (12), as opposed to its application.
  • The authors have now employed S&B to screen original research papers across 5 literature corpora, representing 3 targeted and 2 journal corpora.
  • The 75 problematic single gene knockdown (SGK) papers analysed 24 human cancer types, most frequently brain cancer, where 1-9 problematic SGK papers were identified per queried gene (Table 2).

miR-145 corpus

  • PubMed similarity searches employing individual SGK papers identified numerous papers that analysed the functions of different human miRs in cancer cell lines.
  • Papers that focussed upon miR-145 were identified using PubMed similarity searches of index papers (12, 13).
  • In contrast to SGK papers, most incorrect sequences in miR-145 papers were employed as (RT-)PCR primers (Table 3) and were identified only once within the corpus (Fig 3). The miR-145 corpus included papers that analysed human miR-145 function in human cell lines.
  • Publication dates were limited to 2019 to broadly align with the SGK corpus.

Analysis of all problematic human gene function papers

  • After adjusting for 9 duplicate papers across the 5 corpora, the authors identified 712 problematic papers with wrongly identified sequences (Fig 1) that were published by 78 journals and 31 publishers (S6 Data).
  • As most incorrect reagents represented (RT-)PCR primers, which are employed as paired reagents, the authors considered the verified identities of primer pairs that were found to include at least one wrongly identified primer (Fig 6).
  • Many problematic papers (n=192) described primer pairs that were predicted to target the same incorrect gene (Fig 6).

Bibliometric analysis of human genes analysed in problematic papers

  • Primary protein-coding genes, which represented the first-listed genes in publication titles or abstracts, tended to be associated with more papers in PubMed than a randomly chosen human protein-coding gene (median publication numbers: 167 vs 31, P < 10^-109, two-sided Mann-Whitney U test) (S3A Fig).
  • Again, most wrongly identified target genes have appeared in more papers than a randomly chosen protein-coding gene (median publication numbers: 238 vs 31, P < 10^-94, two-sided Mann-Whitney U test) (S3B Fig).
  • The most frequent wrongly claimed gene targets were GAPDH and ACTB (Fig 7B), reflecting their widespread use as RT-PCR control genes.
  • In summary, these analyses demonstrate that problematic papers can focus upon and/or employ reagents that are wrongly claimed to target highly-investigated human genes such as BCL2, EGFR, PTEN, STAT3, and CCND1 (Fig 7).
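
As a rough illustration of the two-sided Mann-Whitney U comparisons reported above, the following minimal Python sketch computes the U statistic with a normal approximation. The function name `mann_whitney_u` and the publication counts are made up for illustration and are not the study's data; the source does not say which software or exact procedure the authors used.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for sample x, approximate two-sided p-value).
    Tied values receive average ranks and the variance is tie-corrected.
    No continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        avg_rank = (i + 1 + j) / 2.0   # mean of the ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, p_two_sided


# Hypothetical publication counts per gene (illustrative values only).
primary_genes = [167, 420, 95, 310, 240, 188]
random_genes = [31, 12, 45, 8, 60, 27]
u, p = mann_whitney_u(primary_genes, random_genes)
print(f"U = {u:.1f}, approximate two-sided p = {p:.4f}")
```

With samples as small as these toy lists, an exact test would usually be preferred; the normal approximation is shown only because it is compact.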

Discussion

  • Experimental analyses of gene function require nucleotide sequence reagent identities to precisely match their published descriptions.
  • As previously discussed, papers that describe incorrect nucleotide sequences could encourage the incorrect selection of genes for further experimentation, possibly at the expense of more productive candidates (8).
  • Large numbers of human gene function papers with incorrect nucleotide sequences that list hospital affiliations in China could reflect hospital doctors turning to paper mills to meet publication requirements, whereas the contrasting institutional profiles of problematic papers from other countries could highlight different publication pressures elsewhere.
  • In summary, the authors are concerned that the sheer number of human genes that are available for analysis, combined with research drivers that favour the continued investigation of genes of known function (48-50), are unwittingly providing an extensive source of topics around which gene function papers can be fraudulently created.

Future directions

  • The authors' results indicate that the problem of incorrect gene function papers requires urgent action.
  • Within the research community, this can take place in several ways.
  • Similarly, recent changes to researcher assessment (65, 66) will not address problematic papers that have already been published.
  • While the described efforts to screen incoming manuscripts are welcome and should be extended to all journals that publish gene function research, screening incoming manuscripts must be coupled with addressing problematic papers that are already embedded in the literature (71, 74-76).
  • These efforts could be supported by gene function experts who could explain the significance of incorrect nucleotide sequences and/or provide training for editorial staff, particularly as the necessary researcher skills are already widely available.

Summary and conclusions

  • To fully extend the benefits of genomics towards patients and broader populations, it is widely recognised that the functions of every human gene must be understood (1, 2).
  • Whereas genuine gene research requires time, expertise, and material resources, the mass production of fraudulent gene function papers by paper mills could be quicker and cheaper by orders of magnitude (8).
  • Given the number of human genes whose functions can be analysed singly and/or in combination with other genes and/or drugs across different cancer types or other diseases, combined with acute demands for research productivity that may not always be matched by researcher capacity and training (78), fraudulent gene function papers could unfortunately outstrip the publication of genuine gene function research.
  • Indeed, the possible extent of the problem of unreliable human gene function papers is indicated by the lack of overlap between the problematic papers that the authors have reported, and other papers of concern reported elsewhere (71, 74, 79).
  • While publishers and journals decide how to address this urgent problem, laboratory scientists, text miners and clinical researchers must approach the human gene function literature with a critical mindset, and carefully evaluate the merits of individual papers before acting upon their results.


Short title: Wrongly identified nucleotide sequences in gene function papers

Yasunori Park¹, Rachael A West¹,², Pranujan Pathmendra¹, Bertrand Favier³, Thomas Stoeger⁴,⁵,⁶, Amanda Capes-Davis¹,⁷, Guillaume Cabanac⁸, Cyril Labbé⁹, Jennifer A Byrne¹,¹⁰,*

¹ Faculty of Medicine and Health, The University of Sydney, NSW, Australia
² Children’s Cancer Research Unit, Kids Research, The Children’s Hospital at Westmead, Westmead, NSW, Australia
³ Univ. Grenoble Alpes, TIMC, Grenoble, France
⁴ Successful Clinical Response in Pneumonia Therapy (SCRIPT) Systems Biology Center, Northwestern University, Evanston, United States
⁵ Department of Chemical and Biological Engineering, Northwestern University, Evanston, United States
⁶ Center for Genetic Medicine, Northwestern University School of Medicine, Chicago, United States
⁷ CellBank Australia, Children’s Medical Research Institute, Westmead, New South Wales, Australia
⁸ Computer Science Department, IRIT UMR 5505 CNRS, University of Toulouse, 118 route de Narbonne, 31062 Toulouse Cedex 9, France
⁹ Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
¹⁰ NSW Health Statewide Biobank, NSW Health Pathology, Camperdown, NSW, Australia

* Corresponding author

bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.453321; this version posted July 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Keywords: cancer; gene function; miRNA; non-coding RNAs; nucleotide sequence reagent; paper mill; protein-coding gene


Introduction

The promise of genomics to improve the health of cancer and other patients has resulted in billions of dollars of research investment which have been accompanied by expectations of similar quantum gains in health outcomes (1, 2). Since the first draft of the human genome was reported (3, 4), a series of increasingly rapid technological advances has permitted the routine sequencing of human genomes at scale (1, 2), and the increasing application of genomics to inform clinical care (1, 2, 5). Despite the now routine capacity to sequence the human genome, genomics research relies upon results produced by other research fields to translate genome sequencing results to patients (5-7). For example, while whole genome sequencing demonstrates that thousands of human genes are mutated or deregulated in human cancers (1), knowledge of human gene function is required to prioritise individual gene candidates for subsequent pre-clinical and translational studies (5-7).

A first step in triaging and prioritising gene candidates for further analysis is the consideration of available knowledge of predicted and/or demonstrated gene functions (5-8). High quality, reliable information about gene function is important to select the most promising gene candidates and to then progress these candidates through pre-clinical and translational research pipelines (8), which is supported by drug candidates with genetically supported targets being significantly more likely to progress through phased clinical trials (9, 10). However, in contrast to the sophisticated platforms that produce genomic or transcriptomic sequence data at scale, gene function experiments typically analyse single or small numbers of genes through the application of more ubiquitous molecular techniques (6), some of which have been in routine experimental use for 15-30 years. For example, gene knockdown approaches have been widely employed to assess the consequences of reduced

References
Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum, +245 more (29 institutions)
15 Feb 2001-Nature
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations

J. Craig Venter, Mark Raymond Adams, Eugene W. Myers, Peter W. Li, +269 more (12 institutions)
16 Feb 2001-Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies-a whole-genome assembly and a regional chromosome assembly-were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. 
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

12,098 citations

TL;DR: SciPy is an open-source scientific computing library for the Python programming language, which has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories and millions of downloads per year.
Abstract: SciPy is an open-source scientific computing library for the Python programming language. Since its initial release in 2001, SciPy has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories and millions of downloads per year. In this work, we provide an overview of the capabilities and development practices of SciPy 1.0 and highlight some recent technical developments.

6,244 citations

TL;DR: This comprehensive review highlights the physicochemical properties of cisplatin and related platinum-based drugs, and discusses its uses (either alone or in combination with other drugs) for the treatment of various human cancers.

3,467 citations

27 Jul 2017-Cell
TL;DR: DEMETER, an analytical framework that segregates on- from off-target effects of RNAi, demonstrates the basis behind one such predictive model linking hypermethylation of the UBB ubiquitin gene to a dependency on UBC and provides a foundation for a cancer dependency map that facilitates the prioritization of therapeutic targets.

1,533 citations

Frequently Asked Questions
Q1. How many citations have been accumulated by the problematic papers?

Given that over 17,000 citations have been accumulated by the problematic papers that the authors have identified, it seems inevitable that unreliable gene function papers are already wasting time and resources.

All flagged incorrect targeting sequences were double-checked through additional blastn searches against the database: “Homo sapiens (taxid:9606)”, optimized for “Somewhat similar sequences (blastn)”, using an expect threshold of 1000, in February 2021.
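
The settings quoted above read like options of a blastn search form. As a hedged sketch of an approximate command-line counterpart, the Python snippet below only assembles a BLAST+ `blastn` argument list and does not run it. The helper name `build_blastn_command`, the query file name, and the choice of the `nt` database are assumptions, and the `-taxids` restriction assumes a local taxonomy-aware (version 5) BLAST database. This is not the authors' documented procedure.

```python
import shlex
from typing import List


def build_blastn_command(query_fasta: str,
                         database: str = "nt",
                         evalue: float = 1000,
                         taxid: int = 9606) -> List[str]:
    """Assemble (but do not execute) a BLAST+ blastn command line.

    All defaults are illustrative assumptions, not the study's settings file.
    """
    return [
        "blastn",
        "-task", "blastn",            # the standard blastn algorithm
        "-query", query_fasta,        # FASTA file of reagent sequences
        "-db", database,              # assumed local database name
        "-taxids", str(taxid),        # restrict hits to Homo sapiens (9606)
        "-evalue", f"{evalue:g}",     # permissive expect threshold
        "-outfmt", "6",               # tabular output for easy parsing
    ]


# Hypothetical query file name, used only to show the resulting command.
command = build_blastn_command("flagged_sequences.fasta")
print(shlex.join(command))
```

A permissive expect threshold matters here because primers and other short reagents rarely produce low E-values, even for perfect matches.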

S&B was employed to screen all original articles published in Gene from 2007-2018, and all open-access articles published in Oncology Reports from 2014-2018 (Table 4).

Approximately half (51/100, 51%) the flagged C+G papers were found to include a median of 2 (range 1-8) wrongly identified sequences/paper (Table 1).

These error types represent the equivalent of spelling errors (12, 14, 15), as well as identity errors, where a correct sequence is replaced by a different and possibly genetically unrelated sequence (11-13, 16-21).
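
To make the distinction between spelling-like errors and identity errors concrete, the toy Python sketch below compares a reported sequence with the sequence that was presumably intended, using edit distance. The function names, the two-edit threshold, and the sequences are all hypothetical. Seek & Blastn itself relies on blastn searches rather than this comparison, and in practice the intended sequence is often unknown.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def classify_error(reported: str, intended: str, max_typo_edits: int = 2) -> str:
    """Label a reported sequence relative to an assumed intended sequence.

    The threshold max_typo_edits is an arbitrary illustrative choice.
    """
    distance = edit_distance(reported.upper(), intended.upper())
    if distance == 0:
        return "match"
    if distance <= max_typo_edits:
        return "spelling-like error"
    return "possible identity error"


# Toy sequences: the first reported sequence differs by one substitution.
intended = "ACGTACGTAC"
print(classify_error("ACGTTCGTAC", intended))
print(classify_error("TTTTTTTTTT", intended))
```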

As ncRNAs possess largely numeric identifiers that may be more difficult to recognise and recall than alphanumeric gene identifiers, any focus upon ncRNAs could contribute to large publication series being less visible within the literature.

Professional societies can reinforce the importance of reagent verification through conference presentations, education programs, and journal editorials, and can advocate for tangible incentives to encourage further fact-checking of the genetics literature.

As workplace sabotage is typically directed towards known individuals (39, 40), this seems an unlikely explanation for wrongly identified sequences across hundreds of gene function papers published by many different authors.