PDF to JATS XML Conversion — Why it’s important for an Academic Publisher

Shanu K
Is PDF the deprecated standard for Academic Publishing and Archiving?


The first version of Adobe Acrobat PDF was launched on June 15, 1993. The initial vision was to build a “Paperless office”, as the engineers at Adobe had dreamt it.

It was a revolutionary product in its own right: a file format that displays uniformly on any operating system, easy to use and reliable across devices.

Fast forward to the decade 2010–2020.

PDF has now become the de facto medium for sharing and reading documents of every kind, including research papers.

The shift to digital consumption via PDFs has upended the business model for academic publishers. In the 1990s and early 2000s, big publishers used to sell hard copies of their journals to university libraries.

In the last decade, however, all major publishers have shifted towards digital subscription offerings that allow full-text PDF downloads of research papers. Digital subscriptions now account for more than 80% of business for major publishers.

What’s the Problem with PDF then?

To be usable everywhere and render the same way on any OS, PDF forgoes the semantic information present in a document. It treats every piece of information as a string of text.

To give you an example: a research paper contains a variety of sections, each conveying a different meaning:
- Title
- Author Names and Affiliations
- Abstract
- Introduction
- Results and Methods
- Conclusion
- Acknowledgment

To a PDF, every section is just a string of text. The format cannot infer what that section means.
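
To make the contrast concrete, here is a minimal Python sketch. The sample lines, the `flat_text` list, and the `structured` dictionary are all invented for illustration; they are not the output of any real PDF library. The point is only that a flat run of strings forces a program to guess, while labelled data does not.

```python
# Hypothetical illustration: text as it might come out of a PDF page,
# versus the same content with its sections labelled.

# A PDF text layer is, in effect, an ordered run of strings.
flat_text = [
    "A Study of Example Systems",
    "Jane Doe, Example University",
    "We study example systems and report two findings.",
    "1 Introduction",
]

# Nothing in the data says which line is the title or the author.
# A program has to guess, for example by position on the page.
guessed_title = flat_text[0]

# The same content with the meaning of each piece recorded explicitly.
structured = {
    "title": "A Study of Example Systems",
    "authors": [{"name": "Jane Doe", "affiliation": "Example University"}],
    "abstract": "We study example systems and report two findings.",
}

# No guessing: the label says what each value means.
title = structured["title"]
first_author = structured["authors"][0]["name"]

print(guessed_title == title)
print(first_author)
```

The positional guess happens to work here, but it would break as soon as a journal header or a running title came first on the page.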

This semantic data loss seemed a good enough compromise during the 1990s. Then a major shift happened.

Search engines, led by Google, stormed the internet. Within 10 years, they became the default way people discover content online. If you are reading a paper online, there is a high probability that you arrived there via a Google search.

“Search and Discovery” is Key

Every Academic Publisher now relies heavily on the digital subscription business.

The philosophy is simple:

Discovery through Search → Online Readers → Demand for Downloads of Publications → Demand for Subscription → $$

All publishers are strongly incentivized to build a reader base on an organic basis. It’s good for business.

Search engines such as Google and Google Scholar need contextual information to crawl your website and build a knowledge graph about it. This is where PDF fares badly: since the semantic information is lost in a PDF, Google and Google Scholar have less to work with when building a knowledge graph of your research paper.

This breaks the funnel.

Less information passed to Google → Less discovery via Google → Fewer online readers → Less demand → Less $$

Enter JATS XML…

The Journal Article Tag Suite (JATS) is an XML format for scientific literature published online. Driven by community adoption, it has become a de facto technical standard for archiving research papers. Read this blog on JATS XML to know more about it.

The tag suite within JATS is built to be semantic from the ground up. It natively comprises semantic elements such as <article-title>, <journal-title>, <contrib>, <aff>, <kwd>, <abstract>, <ref> and the like.

This XML stores meaning at the source. The structure of the XML keeps each piece of content tied to the element that describes it. That makes it efficient for a Google bot to crawl, and the resulting knowledge graph is richer than what a PDF document can offer.
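
As a rough illustration of what "meaning at the source" buys you, the sketch below reads a small JATS-style fragment with Python's standard `xml.etree.ElementTree` module. The `SAMPLE_JATS` fragment is hand-written for this example and has not been validated against the JATS DTD, and `extract_metadata` is a hypothetical helper name, not part of any JATS tooling.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified JATS-style fragment (illustrative only).
SAMPLE_JATS = """\
<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Examples</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Study of Example Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
        <aff id="aff1">Example University</aff>
      </contrib-group>
      <abstract><p>We study example systems and report two findings.</p></abstract>
      <kwd-group>
        <kwd>examples</kwd>
        <kwd>metadata</kwd>
      </kwd-group>
    </article-meta>
  </front>
</article>
"""


def element_text(element):
    """Collect all text inside an element and normalise whitespace."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_metadata(jats_xml):
    """Pull a few labelled fields out of a JATS-style document."""
    root = ET.fromstring(jats_xml)
    authors = []
    for contrib in root.iterfind(".//contrib[@contrib-type='author']"):
        given = contrib.findtext("name/given-names", default="")
        surname = contrib.findtext("name/surname", default="")
        authors.append(f"{given} {surname}".strip())
    return {
        "journal": element_text(root.find(".//journal-title")),
        "title": element_text(root.find(".//article-title")),
        "authors": authors,
        "affiliations": [element_text(a) for a in root.iterfind(".//aff")],
        "abstract": element_text(root.find(".//abstract")),
        "keywords": [element_text(k) for k in root.iterfind(".//kwd")],
    }


metadata = extract_metadata(SAMPLE_JATS)
for field, value in metadata.items():
    print(f"{field}: {value}")
```

Each field is found by its tag name rather than by its position on a page, which is the property a crawler can take advantage of.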

This can help Google answer queries accurately:
- when a user searches for a specific author-name
- when a user searches for an affiliation
- when a user searches for a specific text within the abstract.
The search applications are endless.
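
The queries listed above can be sketched as field-scoped lookups. This is a toy substring match over invented records, meant only to show why labelled fields make such queries easy; it is not how Google or Google Scholar actually index content. The `records` list and the `search` function are assumptions made for this example.

```python
# Hypothetical records, as they might look after metadata has been
# pulled out of JATS XML. All names and values are made up.
records = [
    {
        "title": "A Study of Example Systems",
        "authors": ["Jane Doe"],
        "affiliations": ["Example University"],
        "abstract": "We study example systems and report two findings.",
    },
    {
        "title": "Notes on Sample Data",
        "authors": ["John Roe"],
        "affiliations": ["Sample Institute"],
        "abstract": "A short note on building sample data sets.",
    },
]


def search(records, field, query):
    """Return titles of records whose `field` contains `query` (case-insensitive)."""
    needle = query.lower()
    hits = []
    for record in records:
        value = record.get(field, "")
        # A field holds either one string or a list of strings.
        values = value if isinstance(value, list) else [value]
        if any(needle in v.lower() for v in values):
            hits.append(record["title"])
    return hits


by_author = search(records, "authors", "jane doe")
by_affiliation = search(records, "affiliations", "sample institute")
by_abstract = search(records, "abstract", "two findings")

print(by_author)
print(by_affiliation)
print(by_abstract)
```

With flat PDF text, the same author query would also match any paper that merely cites Jane Doe, because nothing separates the author line from the reference list.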

End Result

Rich information passed to Google → More discovery via Google → More online readers → More demand → More $$

If you are an academic publisher, you probably have a historical set of volumes and issues published as PDFs. It’s time to bring that archive up to date. Convert all your existing PDF documents into JATS XML and host them on your website.

The benefits over the long run are substantial.


— — — — — - — — — — — — — — — — — — — — — —

P.S.: If you are looking for a quick and simple way to convert your PDF files to JATS XML, try out SciSpace’s (formerly Typeset) PDF to JATS XML converter.

Typeset’s PDF to JATS XML converter (Source: https://typeset.io/for-publishers/convert/pdf-to-jats-xml/)

Before you go


If you found the above article interesting, the following blogs might also interest you.

  1. Tools for STM Publishers: Running an Open Access journal on a shoestring budget
  2. Thoughts on the future of academic publishing
  3. Top 4 MS-Word (Docx) to JATS XML Converters
  4. How to Submit Metadata to Crossref: A Step by Step Guide