Human gene function publications that describe wrongly identified nucleotide sequence reagents are unacceptably frequent within the genetics literature
Summary
Introduction
- The promise of genomics to improve the health of cancer and other patients has resulted in billions of dollars of research investment, which has been accompanied by expectations of similar quantum gains in health outcomes (1, 2).
- The authors' initial application of Seek & Blastn (S&B) identified 77/203 (38%) screened papers with incorrect nucleotide sequence reagents, with their focus being the description of the S&B tool (12), as opposed to its application.
- The authors have now employed S&B to screen original research papers across 5 literature corpora, representing 3 targeted and 2 journal corpora.
- The 75 problematic single gene knockdown (SGK) papers analysed 24 human cancer types, most frequently brain cancer, where 1-9 problematic SGK papers were identified per queried gene (Table 2).
miR-145 corpus
- PubMed similarity searches employing individual SGK papers identified numerous papers that analysed the functions of different human miRs in cancer cell lines.
- Papers that focussed upon miR-145 were identified using PubMed similarity searches of index papers (12, 13).
- In contrast to SGK papers, most incorrect sequences in miR-145 papers were employed as (RT-)PCR primers (Table 3) and were identified only once within the corpus (Fig 3).
- The miR-145 corpus included papers that analysed human miR-145 function in human cell lines.
- Publication dates were limited to 2019 to broadly align with the SGK corpus.
Analysis of all problematic human gene function papers
- After adjusting for 9 duplicate papers across the 5 corpora, the authors identified 712 problematic papers with wrongly identified sequences (Fig 1) that were published by 78 journals and 31 publishers (S6 Data).
- As most incorrect reagents represented (RT-)PCR primers, which are employed as paired reagents, the authors considered the verified identities of primer pairs that were found to include at least one wrongly identified primer (Fig 6).
- Many problematic papers (n=192) described primer pairs that were predicted to target the same incorrect gene (Fig 6).
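The idea of checking whether both primers of a pair point to the same gene, and whether that gene is the claimed one, can be sketched in a few lines. This is a toy illustration only and is not the S&B method: the reference sequences, gene names and primers are made up, and exact substring matching on both strands stands in for a real alignment search such as blastn.

```python
# Toy illustration: hypothetical reference sequences and primers,
# with exact substring matching standing in for a real alignment search.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def matching_genes(primer: str, references: dict) -> set:
    """Names of reference sequences that contain the primer on either strand."""
    primer = primer.upper()
    rc = reverse_complement(primer)
    return {name for name, ref in references.items() if primer in ref or rc in ref}


def classify_pair(claimed: str, forward: str, reverse: str, references: dict) -> str:
    """Compare the genes hit by both primers with the claimed target gene."""
    shared = matching_genes(forward, references) & matching_genes(reverse, references)
    if claimed in shared:
        return "pair matches claimed gene"
    if shared:
        return "pair targets the same incorrect gene: " + ", ".join(sorted(shared))
    return "pair has no shared target"


# Hypothetical reference sequences (not real genes).
REFERENCES = {
    "GENE_A": "ATGGCCTTAGCGGATCCGTTAACGGCATTGCA",
    "GENE_B": "ATGCCGGAATTCTTGACCAGTGGCTAAGGTCC",
}

# Both primers were designed against GENE_B but are claimed to target GENE_A.
FORWARD = "CCGGAATTCTTG"
REVERSE = "CCTTAGCCACTG"

print(classify_pair("GENE_A", FORWARD, REVERSE, REFERENCES))
```

In practice, primers can match several genes or match imperfectly, so a real check would need an alignment tool and manual review rather than exact matching.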
Bibliometric analysis of human genes analysed in problematic papers
- Primary protein-coding genes, which represented the first-listed genes in publication titles or abstracts, tended to be associated with more papers in PubMed than a randomly chosen human protein-coding gene (median publication numbers: 167 vs 31, P < 10^-109, two-sided Mann-Whitney U test) (S3A Fig).
- Again, most wrongly identified target genes have appeared in more papers than a randomly chosen protein-coding gene (median publication numbers: 238 vs 31, P < 10^-94, two-sided Mann-Whitney U test) (S3B Fig).
- The most frequent wrongly claimed gene targets were GAPDH and ACTB (Fig 7B), reflecting their widespread use as RT-PCR control genes.
- In summary, these analyses demonstrate that problematic papers can focus upon and/or employ reagents that are wrongly claimed to target highly-investigated human genes such as BCL2, EGFR, PTEN, STAT3, and CCND1 (Fig 7).
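The comparisons above rely on a two-sided Mann-Whitney U test, which asks whether one group of publication counts tends to be larger than another without assuming a normal distribution. The following is a minimal standard-library sketch of that test using the normal approximation with tie correction (no continuity correction). The publication counts are invented for illustration and are not the paper's data, and the authors' actual software is not stated in this summary.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U for sample x, approximate two-sided p-value).
    Assumes the pooled values are not all identical.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool the samples, tagging each value with its group (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_x, p_value


# Hypothetical publication counts per gene (illustrative only).
studied_genes = [167, 200, 310, 450, 520]
random_genes = [5, 12, 31, 40, 60]

u, p = mann_whitney_u(studied_genes, random_genes)
print(f"U = {u}, p = {p:.4f}")
```

With samples as small as this toy example an exact test would be preferable; the normal approximation is more appropriate for the large gene sets compared in the text.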
Discussion
- Experimental analyses of gene function require nucleotide sequence reagent identities to precisely match their published descriptions.
- As previously discussed, papers that describe incorrect nucleotide sequences could encourage the incorrect selection of genes for further experimentation, possibly at the expense of more productive candidates (8).
- Large numbers of human gene function papers with incorrect nucleotide sequences that list hospital affiliations in China could reflect hospital doctors turning to paper mills to meet publication requirements, whereas the contrasting institutional profiles of problematic papers from other countries could highlight different publication pressures elsewhere.
- In summary, the authors are concerned that the sheer number of human genes that are available for analysis, combined with research drivers that favour the continued investigation of genes of known function (48-50), are unwittingly providing an extensive source of topics around which gene function papers can be fraudulently created.
Future directions
- The authors' results indicate that the problem of incorrect gene function papers requires urgent action.
- Within the research community, this can take place in several ways.
- Similarly, recent changes to researcher assessment (65, 66) will not address problematic papers that have already been published.
- While the described efforts to screen incoming manuscripts are welcome and should be extended to all journals that publish gene function research, screening incoming manuscripts must be coupled with addressing problematic papers that are already embedded in the literature (71, 74-76).
- These efforts could be supported by gene function experts who could explain the significance of incorrect nucleotide sequences and/or provide training for editorial staff, particularly as the necessary researcher skills are already widely available.
Summary and conclusions
- To fully extend the benefits of genomics towards patients and broader populations, it is widely recognised that the functions of every human gene must be understood (1, 2).
- Whereas genuine gene research requires time, expertise, and material resources, the mass production of fraudulent gene function papers by paper mills could be quicker and cheaper by orders of magnitude (8) .
- Given the number of human genes whose functions can be analysed singly and/or in combination with other genes and/or drugs across different cancer types or other diseases, combined with acute demands for research productivity that may not always be matched by researcher capacity and training (78), fraudulent gene function papers could unfortunately outstrip the publication of genuine gene function research.
- Indeed, the possible extent of the problem of unreliable human gene function papers is indicated by the lack of overlap between the problematic papers that the authors have reported, and other papers of concern reported elsewhere (71, 74, 79) .
- While publishers and journals decide how to address this urgent problem, laboratory scientists, text miners and clinical researchers must approach the human gene function literature with a critical mindset, and carefully evaluate the merits of individual papers before acting upon their results.
Frequently Asked Questions (8)
Q2. What expect threshold was used to double-check the sequences that were flagged as incorrect?
All flagged incorrect targeting sequences were double-checked through additional blastn searches against the database: “Homo sapiens (taxid:9606)”, optimized for “Somewhat similar sequences (blastn)”, using an expect threshold of 1000, in February 2021.
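The answer above describes searches run through the NCBI web interface. As a rough sketch, a comparable search with a local NCBI BLAST+ installation might be assembled as follows. This mapping is an assumption rather than the authors' procedure: `-task blastn` is taken here as the command-line counterpart of the "Somewhat similar sequences (blastn)" option, `-evalue` carries the expect threshold, and the file name `flagged_primers.fasta` and database name `human_transcripts` are hypothetical placeholders for a query file and a human-only database.

```python
def build_blastn_command(query_fasta: str, database: str, expect: int = 1000) -> list:
    """Build (but do not run) a BLAST+ blastn command line as an argument list."""
    return [
        "blastn",
        "-task", "blastn",        # assumed counterpart of "Somewhat similar sequences"
        "-query", query_fasta,    # FASTA file of flagged sequences (hypothetical name)
        "-db", database,          # local human-only database (hypothetical name)
        "-evalue", str(expect),   # permissive expect threshold, as in the text
        "-outfmt", "6",           # tabular output, one line per hit
    ]


command = build_blastn_command("flagged_primers.fasta", "human_transcripts")
print(" ".join(command))
```

The list can be passed to `subprocess.run` once BLAST+ and a suitable database are available. A very permissive expect threshold is useful for short primer-length queries, whose hits would otherwise be filtered out as statistically insignificant.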
Q3. What was used to screen all the papers published in Gene from 2007-2018?
S&B was employed to screen all original articles published in Gene from 2007-2018, and all open-access articles published in Oncology Reports from 2014-2018 (Table 4).
Q4. How many wrongly identified sequences were found in the flagged cisplatin + gemcitabine (C+G) papers?
Approximately half (51/100, 51%) of the flagged C+G papers were found to include a median of 2 (range 1-8) wrongly identified sequences per paper (Table 1).
Q5. What are the types of errors that are common in the biomedical and genetics literature?
These error types represent the equivalent of spelling errors (12, 14, 15), as well as identity errors, where a correct sequence is replaced by a different and possibly genetically unrelated sequence (11-13, 16-21).
Q6. Why might publication series that focus upon ncRNAs be less visible within the literature?
As ncRNAs possess largely numeric identifiers that may be more difficult to recognise and recall than alphanumeric gene identifiers, any focus upon ncRNAs could contribute to large publication series being less visible within the literature (bioRxiv preprint, https://doi.org/10.1101/2021.07.29.453321).
Q7. What can be done to encourage further fact-checking of the gene function literature?
Professional societies can reinforce the importance of reagent verification through conference presentations, education programs, and journal editorials, and can advocate for tangible incentives to encourage further fact-checking of the genetics literature.
Q8. Is workplace sabotage a likely explanation for wrongly identified sequences across hundreds of gene function papers?
As workplace sabotage is typically directed towards known individuals (39, 40), this seems an unlikely explanation for wrongly identified sequences across hundreds of gene function papers published by many different authors.