
Showing papers in "Journal of Biomedical Semantics in 2020"


Journal ArticleDOI
TL;DR: A list of sixteen recommendations regarding the usage of NLP systems and algorithms, usage of data, evaluation and validation, presentation of results, and generalizability of results was developed; the authors believe these recommendations will increase the reproducibility and reusability of future studies and NLP algorithms in medicine.
Abstract: Free-text descriptions in electronic health records (EHRs) can be of interest for clinical research and care optimization. However, free text cannot be readily interpreted by a computer and, therefore, has limited value. Natural Language Processing (NLP) algorithms can make free text machine-interpretable by attaching ontology concepts to it. However, implementations of NLP algorithms are not evaluated consistently. Therefore, the objective of this study was to review the current methods used for developing and evaluating NLP algorithms that map clinical text fragments onto ontology concepts. To standardize the evaluation of algorithms and reduce heterogeneity between studies, we propose a list of recommendations. Two reviewers examined publications indexed by Scopus, IEEE, MEDLINE, EMBASE, the ACM Digital Library, and the ACL Anthology. Publications reporting on NLP for mapping clinical text from EHRs to ontology concepts were included. Year, country, setting, objective, evaluation and validation methods, NLP algorithms, terminology systems, dataset size and language, performance measures, reference standard, generalizability, operational use, and source code availability were extracted. The studies’ objectives were categorized by way of induction. These results were used to define recommendations. Two thousand three hundred fifty five unique studies were identified. Two hundred fifty six studies reported on the development of NLP algorithms for mapping free text to ontology concepts. Seventy-seven described development and evaluation. Twenty-two studies did not perform a validation on unseen data and 68 studies did not perform external validation. Of 23 studies that claimed that their algorithm was generalizable, 5 tested this by external validation. A list of sixteen recommendations regarding the usage of NLP systems and algorithms, usage of data, evaluation and validation, presentation of results, and generalizability of results was developed. We found many heterogeneous approaches to the reporting on the development and evaluation of NLP algorithms that map clinical text to ontology concepts. Over one-fourth of the identified publications did not perform an evaluation. In addition, over one-fourth of the included studies did not perform a validation, and 88% did not perform external validation. We believe that our recommendations, alongside an existing reporting standard, will increase the reproducibility and reusability of future studies and NLP algorithms in medicine.
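The evaluation the review focuses on typically reduces to comparing predicted concept mappings with a manually annotated reference standard. The sketch below shows a minimal micro-averaged precision/recall/F1 computation of that kind; the fragment identifiers and SNOMED CT codes are illustrative stand-ins, not data from the review.

```python
# Minimal sketch: micro-averaged precision/recall/F1 for text-to-concept mapping,
# scored against a manually annotated reference standard (illustrative data only).

def evaluate_mappings(predicted, reference):
    """Each argument maps a text fragment id to a set of ontology concept IDs."""
    tp = fp = fn = 0
    for fragment_id, gold in reference.items():
        pred = predicted.get(fragment_id, set())
        tp += len(pred & gold)          # concepts found and correct
        fp += len(pred - gold)          # concepts found but wrong
        fn += len(gold - pred)          # concepts missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = {"note1": {"SNOMED:22298006"}, "note2": {"SNOMED:38341003"}}
predicted = {"note1": {"SNOMED:22298006"}, "note2": {"SNOMED:73211009"}}
print(evaluate_mappings(predicted, reference))   # (0.5, 0.5, 0.5)
```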

42 citations


Journal ArticleDOI
TL;DR: An integrative knowledge graph containing a large volume of biomedical and research data is developed that highlights an opportunity to take a translational science view on the field of rare diseases by enabling researchers to identify disease characteristics, which may play a role in the translation of discoveries across different research domains.
Abstract: The Genetic and Rare Diseases (GARD) Information Center was established by the National Institutes of Health (NIH) to provide freely accessible consumer health information on over 6500 genetic and rare diseases. As the cumulative scientific understanding and underlying evidence for these diseases have expanded over time, existing practices to generate knowledge from these publications and resources have not been able to keep pace. Through determining the applicability of computational approaches to enhance or replace manual curation tasks, we aim not only to improve the sustainability and relevance of consumer health information, but also to develop a foundational database, from which translational science researchers may start to unravel disease characteristics that are vital to the research process. We developed a meta-ontology based integrative knowledge graph for rare diseases in Neo4j. This integrative knowledge graph includes a total of 3,819,623 nodes and 84,223,681 relations from 34 different biomedical data resources, including curated drug and rare disease associations. Semi-automatic mappings were generated for 2154 unique FDA orphan designations to 776 unique GARD diseases, and 3322 unique FDA designated drugs to UNII, as well as 180,363 associations between drug and indication from Inxight Drugs, which were integrated into the knowledge graph. We conducted four case studies to demonstrate the capabilities of this integrative knowledge graph in accelerating the curation of scientific understanding on rare diseases through the generation of disease mappings/profiles and pathogenesis associations. By integrating well-established database resources, we developed an integrative knowledge graph containing a large volume of biomedical and research data. Demonstration of several immediate use cases and limitations of this process reveals both the potential feasibility and barriers of utilizing graph-based resources and approaches to support their use by providers of consumer health information, such as GARD, that may struggle with the needs of maintaining knowledge reliant on an evolving and growing evidence-base. Finally, the successful integration of these datasets into a freely accessible knowledge graph highlights an opportunity to take a translational science view on the field of rare diseases by enabling researchers to identify disease characteristics, which may play a role in the translation of discoveries across different research domains.
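The knowledge graph is stored in Neo4j, so downstream curation tools can retrieve associations with Cypher. The sketch below uses the official neo4j Python driver; the connection details, node labels, relationship type, and property names are hypothetical placeholders rather than the actual GARD graph schema.

```python
# Illustrative sketch: querying a Neo4j knowledge graph of rare-disease/drug
# associations with the official Python driver. The node labels, relationship
# type and property names below are hypothetical, not the actual GARD schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = """
MATCH (drug:Drug)-[:INDICATED_FOR]->(disease:Disease {name: $disease_name})
RETURN drug.name AS drug, drug.unii AS unii
"""

with driver.session() as session:
    for record in session.run(cypher, disease_name="Cystic fibrosis"):
        print(record["drug"], record["unii"])

driver.close()
```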

23 citations


Journal ArticleDOI
TL;DR: Developing domain-specific methods is crucial to address complex NLP tasks such as symptom onset extraction and retrospective calculation of the duration of a preclinical syndrome; this challenge requires solving three different tasks: time expression extraction, symptom extraction, and temporal “linking”.
Abstract: Duration of untreated psychosis (DUP) is an important clinical construct in the field of mental health, as longer DUP can be associated with worse intervention outcomes. DUP estimation requires knowledge about when psychosis symptoms first started (symptom onset), and when psychosis treatment was initiated. Electronic health records (EHRs) represent a useful resource for retrospective clinical studies on DUP, but the core information underlying this construct is most likely to lie in free text, meaning it is not readily available for clinical research. Natural Language Processing (NLP) is a means to addressing this problem by automatically extracting relevant information in a structured form. As a first step, it is important to identify appropriate documents, i.e., those that are likely to include the information of interest. Next, temporal information extraction methods are needed to identify time references for early psychosis symptoms. This NLP challenge requires solving three different tasks: time expression extraction, symptom extraction, and temporal “linking”. In this study, we focus on the first step, using two relevant EHR datasets. We applied a rule-based NLP system for time expression extraction that we had previously adapted to a corpus of mental health EHRs from patients with a diagnosis of schizophrenia (first referrals). We extended this work by applying this NLP system to a larger set of documents and patients, to identify additional texts that would be relevant for our long-term goal, and developed a new corpus from a subset of these new texts (early intervention services). Furthermore, we added normalized value annotations (“2011–05”) to the annotated time expressions (“May 2011”) in both corpora. The finalized corpora were used for further NLP development and evaluation, with promising results (normalization accuracy 71–86%). To highlight the specificities of our annotation task, we also applied the final adapted NLP system to a different temporally annotated clinical corpus. Developing domain-specific methods is crucial to address complex NLP tasks such as symptom onset extraction and retrospective calculation of duration of a preclinical syndrome. To the best of our knowledge, this is the first clinical text resource annotated for temporal entities in the mental health domain.
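The normalisation step described above maps a surface time expression such as "May 2011" to a normalized value such as "2011-05". A minimal rule-based sketch of that idea is shown below; the regular-expression patterns are illustrative and far simpler than the rules in the adapted system.

```python
# Minimal sketch of rule-based normalisation of time expressions to ISO-like
# values ("May 2011" -> "2011-05"). Patterns are illustrative only and far
# simpler than the rules used in the adapted system.
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june",
     "july", "august", "september", "october", "november", "december"])}

def normalise(expression: str):
    expr = expression.strip().lower()
    m = re.fullmatch(r"(\w+)\s+(\d{4})", expr)            # "May 2011"
    if m and m.group(1) in MONTHS:
        return f"{m.group(2)}-{MONTHS[m.group(1)]:02d}"
    m = re.fullmatch(r"(\d{1,2})/(\d{4})", expr)           # "05/2011"
    if m:
        return f"{m.group(2)}-{int(m.group(1)):02d}"
    return None                                            # unhandled expression

print(normalise("May 2011"))   # 2011-05
print(normalise("05/2011"))    # 2011-05
```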

19 citations


Journal ArticleDOI
TL;DR: This work developed a method that uses machine learning and word embeddings to identify words and phrases that are used to refer to an ontology class in biomedical Europe PMC full-text articles and can identify whether a class should be a subclass of some high-level ontology classes.
Abstract: Ontologies are widely used across biology and biomedicine for the annotation of databases. Ontology development is often a manual, time-consuming, and expensive process. Automatic or semi-automatic identification of classes that can be added to an ontology can make ontology development more efficient. We developed a method that uses machine learning and word embeddings to identify words and phrases that are used to refer to an ontology class in biomedical Europe PMC full-text articles. Once labels and synonyms of a class are known, we use machine learning to identify the super-classes of a class. For this purpose, we identify lexical term variants, use word embeddings to capture context information, and rely on automated reasoning over ontologies to generate features, and we use an artificial neural network as classifier. We demonstrate the utility of our approach in identifying terms that refer to diseases in the Human Disease Ontology and to distinguish between different types of diseases. Our method is capable of discovering labels that refer to a class in an ontology but are not present in an ontology, and it can identify whether a class should be a subclass of some high-level ontology classes. Our approach can therefore be used for the semi-automatic extension and quality control of ontologies. The algorithm, corpora and evaluation datasets are available at https://github.com/bio-ontology-research-group/ontology-extension.
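The final step described above is a supervised classification of candidate terms into super-classes using an artificial neural network over embedding-based features. The sketch below illustrates that step with scikit-learn; the feature vectors and labels are synthetic stand-ins, not the Europe PMC embeddings or Human Disease Ontology data used in the paper.

```python
# Sketch of the classification step: given a vector representation of a candidate
# term (here random stand-ins for real word embeddings plus ontology-derived
# features), predict whether it should be placed under a given high-level class.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
embedding_dim = 100

# Synthetic data: 200 candidate terms, binary label 1 = subclass of the target class.
X = rng.normal(size=(200, embedding_dim))
y = rng.integers(0, 2, size=200)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```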

17 citations


Journal ArticleDOI
TL;DR: It is assumed that the CAS corpus, built with clinical cases as they are reported in the published scientific literature in French, can be effectively exploited for the development and testing of NLP tools and methods.
Abstract: Textual corpora are extremely important for various NLP applications as they provide information necessary for creating, setting and testing those applications and the corresponding tools. They are also crucial for designing reliable methods and reproducible results. Yet, in some areas, such as the medical area, due to confidentiality or ethical reasons, it is complicated or even impossible to access representative textual data. We propose the CAS corpus, built with clinical cases as they are reported in the published scientific literature in French. Currently, the corpus contains 4,900 clinical cases in French, totaling nearly 1.7M word occurrences. Some clinical cases are associated with discussions. A subset of the whole set of cases is enriched with morpho-syntactic (PoS-tagging, lemmatization) and semantic (UMLS concepts, negation, uncertainty) annotations. The corpus is being continuously enriched with new clinical cases and annotations. The CAS corpus has been compared with similar clinical narratives. When computed on tokenized and lowercase words, the Jaccard index indicates that the similarity between clinical cases and narratives reaches up to 0.9727. We assume that the CAS corpus can be effectively exploited for the development and testing of NLP tools and methods. In addition, the corpus will be used in NLP challenges and distributed to the research community.
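The corpus comparison reported above relies on the Jaccard index over tokenized, lowercased words. A minimal sketch of that measure (with naive whitespace tokenization and toy French sentences) is given below.

```python
# Minimal sketch of the corpus-similarity measure: the Jaccard index computed
# over tokenised, lower-cased word sets. Whitespace splitting is a simplification
# of real tokenisation; the two sentences are toy examples.
def jaccard(text_a: str, text_b: str) -> float:
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(jaccard("Le patient présente une fièvre",
              "La patiente présente une toux"))   # 0.25
```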

9 citations


Journal ArticleDOI
TL;DR: The study shows that the presented system produces a coherent and logical structure for freely written nursing narratives and has the potential to reduce the time and effort nurses are currently spending on documenting care in hospitals.
Abstract: Up to 35% of nurses’ working time is spent on care documentation. We describe the evaluation of a system aimed at assisting nurses in documenting patient care and potentially reducing the documentation workload. Our goal is to enable nurses to write or dictate nursing notes in a narrative manner without having to manually structure their text under subject headings. In the current care classification standard used in the targeted hospital, there are more than 500 subject headings to choose from, making it challenging and time consuming for nurses to use. The task of the presented system is to automatically group sentences into paragraphs and assign subject headings. For classification the system relies on a neural network-based text classification model. The nursing notes are initially classified on sentence level. Subsequently coherent paragraphs are constructed from related sentences. Based on a manual evaluation conducted by a group of three domain experts, we find that in about 69% of the paragraphs formed by the system the topics of the sentences are coherent and the assigned paragraph headings correctly describe the topics. We also show that the use of a paragraph merging step reduces the number of paragraphs produced by 23% without affecting the performance of the system. The study shows that the presented system produces a coherent and logical structure for freely written nursing narratives and has the potential to reduce the time and effort nurses are currently spending on documenting care in hospitals.
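After sentence-level classification, the system constructs paragraphs by grouping related sentences under a shared heading. The sketch below illustrates that grouping step; the placeholder classifier and the two headings are invented for illustration and do not reflect the hospital's care classification standard.

```python
# Sketch of the paragraph-construction step: once each sentence has a predicted
# subject heading (here produced by a placeholder classifier), consecutive
# sentences sharing a heading are merged into one paragraph.
from itertools import groupby

def classify(sentence: str) -> str:
    # Placeholder for the neural sentence classifier described in the paper.
    return "Medication" if "mg" in sentence else "Nutrition"

def build_paragraphs(sentences):
    labelled = [(classify(s), s) for s in sentences]
    return [(heading, " ".join(s for _, s in group))
            for heading, group in groupby(labelled, key=lambda x: x[0])]

notes = ["Patient ate well at lunch.",
         "Paracetamol 500 mg given at 14:00.",
         "Dose repeated, 500 mg at 20:00."]
for heading, paragraph in build_paragraphs(notes):
    print(f"{heading}: {paragraph}")
```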

7 citations


Journal ArticleDOI
TL;DR: The open-source Oral Health and Disease Ontology (OHD) and an instance-based representation are developed as a framework for dental and medical health care information, demonstrating a prospective design for a principled data backend that can be used consistently and encompass both dental and medical information in a single framework.
Abstract: A key challenge for improving the quality of health care is to be able to use a common framework to work with patient information acquired in any of the health and life science disciplines. Patient information collected during dental care exposes many of the challenges that confront a wider scale approach. For example, to improve the quality of dental care, we must be able to collect and analyze data about dental procedures from multiple practices. However, a number of challenges make doing so difficult. First, dental electronic health record (EHR) information is often stored in complex relational databases that are poorly documented. Second, there is not a commonly accepted and implemented database schema for dental EHR systems. Third, integrative work that attempts to bridge dentistry and other settings in healthcare is made difficult by the disconnect between representations of medical information within dental and other disciplines’ EHR systems. As dentistry increasingly concerns itself with the general health of a patient, for example in increased efforts to monitor heart health and systemic disease, the impact of this disconnect becomes more and more severe. To demonstrate how to address these problems, we have developed the open-source Oral Health and Disease Ontology (OHD) and our instance-based representation as a framework for dental and medical health care information. We envision a time when medical record systems use a common data back end that would make interoperating trivial and obviate the need for a dedicated messaging framework to move data between systems. The OHD is not yet complete. It includes enough to be useful and to demonstrate how it is constructed. We demonstrate its utility in an analysis of longevity of dental restorations. Our first narrow use case provides a prototype, and is intended to demonstrate a prospective design for a principled data backend that can be used consistently and encompass both dental and medical information in a single framework. The OHD contains over 1900 classes and 59 relationships. Most of the classes and relationships were imported from existing OBO Foundry ontologies. Using the LSW2 (LISP Semantic Web) software library, we translated data from a dental practice’s EHR system into a corresponding Web Ontology Language (OWL) representation based on the OHD framework. The OWL representation was then loaded into a triple store, and as a proof of concept, we addressed a question of clinical relevance – a survival analysis of the longevity of resin filling restorations. We provide queries using SPARQL and statistical analysis code in R to demonstrate how to perform clinical research using a framework such as the OHD, and we compare our results with previous studies. This proof-of-concept project translated data from a single practice. By using dental practice data, we demonstrate that the OHD and the instance-based approach are sufficient to represent data generated in real-world, routine clinical settings. While the OHD is applicable to integration of data from multiple practices with different dental EHR systems, we intend our work to be understood as a prospective design for EHR data storage that would simplify medical informatics. The system has well-understood semantics because of our use of BFO-based realist ontology and its representation in OWL. The data model is a well-defined web standard.
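Clinical questions are answered by running SPARQL queries over the OWL/triple-store representation. The sketch below shows how such a query might be issued from Python with SPARQLWrapper; the endpoint URL, prefix IRI, and class/property names are placeholders, not the actual OHD identifiers or the queries published with the paper.

```python
# Illustrative sketch of asking a clinical question over an OHD-style triple
# store with SPARQL. The endpoint URL, prefix IRI and class/property names are
# placeholders, not the actual OHD vocabulary.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3030/ohd/sparql")   # hypothetical endpoint
sparql.setQuery("""
PREFIX ohd: <http://example.org/ohd#>
SELECT ?restoration ?placed
WHERE {
  ?restoration a ohd:ResinFillingRestoration ;
               ohd:placementDate ?placed .
}
LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["restoration"]["value"], row["placed"]["value"])
```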

7 citations


Journal ArticleDOI
TL;DR: A scientometric analysis of bioprinting based on a competitive technology intelligence (CTI) cycle assesses scientific documents to establish the publication rate of science and technology in terms of institutions, patents or journals, and offers valuable information for the identification of potential associations for bioprinting researchers and stakeholders.
Abstract: Scientific activity for 3D bioprinting has increased over the past years, focusing mainly on fully functional biological constructs to overcome issues related to organ transplants. This research performs a scientometric analysis on bioprinting based on a competitive technology intelligence (CTI) cycle, which assesses scientific documents to establish the publication rate of science and technology in terms of institutions, patents or journals. Although analyses of publications can be observed in the literature, the identification of the most influential authors and affiliations has not been addressed. This study involves the analysis of authors and affiliations, and their interactions in a global framework. We use network collaboration maps and Betweenness Centrality (BC) to identify the most prominent actors in bioprinting, enhancing the CTI analysis. 2088 documents were retrieved from the Scopus database from 2007 to 2017, revealing exponential growth with an average publication increase of 17.5% per year. A threshold of five articles with ten or more citations was established for authors, while the same number of articles but cited five or more times was set for affiliations. The author with the most publications was Atala A. (36 papers and a BC = 370.9), followed by Khademhosseini A. (30 documents and a BC = 2104.7), and Mironov (30 documents and a BC = 2754.9). In addition, a small correlation was observed between the number of collaborations and the number of publications. Furthermore, 1760 institutions with a median of 10 publications were found, but only 20 were within the established threshold. 30% of the 20 institutions had an external collaboration, and institutions located in and close to the life science cluster in Massachusetts showed strong cooperation. The institution with the most publications was Harvard Medical School, with 61 publications, followed by Brigham and Women’s Hospital, with 46 papers, and the Massachusetts Institute of Technology with 37 documents. Network map analysis and BC allowed the identification of the most influential authors working on bioprinting, and the collaboration between institutions was found to be limited. This analysis of authors and affiliations and their collaborations offers valuable information for the identification of potential associations for bioprinting researchers and stakeholders.
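Betweenness centrality over a collaboration network is the key measure used to identify influential actors. The sketch below computes it with networkx on a toy co-authorship edge list; the listed collaborations are invented for illustration and are not the Scopus-derived data.

```python
# Minimal sketch of the network measure used above: betweenness centrality (BC)
# on a co-authorship graph built with networkx. The edge list is a toy stand-in
# for the real collaboration data extracted from Scopus.
import networkx as nx

collaborations = [("Author A", "Author B"), ("Author A", "Author C"),
                  ("Author C", "Author D"), ("Author D", "Author E"),
                  ("Author B", "Author D")]

graph = nx.Graph(collaborations)
for author, bc in sorted(nx.betweenness_centrality(graph, normalized=False).items(),
                         key=lambda item: -item[1]):
    print(f"{author}: {bc:.1f}")
```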

6 citations


Journal ArticleDOI
TL;DR: The African Wildlife Ontology provides a wide range of options concerning examples and exercises for ontology engineering well beyond illustrating just language features and automated reasoning, and assists in demonstrating tasks concerning ontology quality.
Abstract: Most tutorial ontologies focus on illustrating one aspect of ontology development, notably language features and automated reasoners, but ignore ontology development factors, such as emergent modelling guidelines and ontological principles. Yet, novices replicate examples from the exercise they carry out. Not providing good examples holistically causes the propagation of sub-optimal ontology development, which may negatively affect the quality of a real domain ontology. We identified 22 requirements that a good tutorial ontology should satisfy regarding subject domain, logics and reasoning, and engineering aspects. We developed a set of ontologies about African Wildlife to serve as tutorial ontologies. A majority of the requirements have been met with the set of African Wildlife Ontology tutorial ontologies, which are introduced in this paper. The African Wildlife Ontology is mature and has been used yearly in an ontology engineering course or tutorial since 2010 and is included in a recent ontology engineering textbook with relevant examples and exercises. The African Wildlife Ontology provides a wide range of options concerning examples and exercises for ontology engineering well beyond illustrating just language features and automated reasoning. It assists in demonstrating tasks concerning ontology quality, such as alignment to a foundational ontology and satisfying competency questions, versioning, and multilingual ontologies.
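Competency questions against a tutorial ontology can be checked programmatically. The sketch below shows one hedged way to do this with rdflib over a locally downloaded copy of the ontology; the file name and class IRI are placeholders and should be replaced with those of the released African Wildlife Ontology files.

```python
# Sketch of checking a simple competency-question-style query ("which classes are
# declared as subclasses of animal?") against a downloaded tutorial ontology with
# rdflib. The file path and class IRI below are placeholders, not the ontology's
# actual identifiers.
from rdflib import Graph

g = Graph()
g.parse("AfricanWildlifeOntology1.owl", format="xml")   # placeholder local copy

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subclass
WHERE { ?subclass rdfs:subClassOf <http://example.org/awo#animal> . }
"""
for row in g.query(query):
    print(row.subclass)
```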

6 citations


Journal ArticleDOI
TL;DR: This work demonstrates that predicates between proteins can be used to identify disease trajectories by leveraging the predicate information from paths between (disease) proteins in a knowledge graph.
Abstract: Knowledge graphs can represent the contents of biomedical literature and databases as subject-predicate-object triples, thereby enabling comprehensive analyses that identify e.g. relationships between diseases. Some diseases are often diagnosed in patients in specific temporal sequences, which are referred to as disease trajectories. Here, we determine whether a sequence of two diseases forms a trajectory by leveraging the predicate information from paths between (disease) proteins in a knowledge graph. Furthermore, we determine the added value of directional information of predicates for this task. To do so, we create four feature sets, based on two methods for representing indirect paths, and both with and without directional information of predicates (i.e., which protein is considered subject and which object). The added value of the directional information of predicates is quantified by comparing the classification performance of the feature sets that include or exclude it. Our method achieved a maximum area under the ROC curve of 89.8% and 74.5% when evaluated with two different reference sets. Use of directional information of predicates significantly improved performance by 6.5 and 2.0 percentage points respectively. Our work demonstrates that predicates between proteins can be used to identify disease trajectories. Using the directional information of predicates significantly improved performance over not using this information.
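Once each disease pair is represented by a predicate-based feature vector, trajectory prediction becomes a standard binary classification problem evaluated with ROC AUC. The sketch below illustrates that setup with scikit-learn on synthetic features and labels; it does not reproduce the paper's path-representation methods or reference sets.

```python
# Sketch of the evaluation idea: disease pairs are represented by predicate-based
# feature vectors (here random stand-ins) and a classifier's ability to separate
# trajectory from non-trajectory pairs is summarised by ROC AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_pairs, n_predicates = 400, 50

X = rng.poisson(1.0, size=(n_pairs, n_predicates))   # predicate counts on paths
y = rng.integers(0, 2, size=n_pairs)                  # 1 = known disease trajectory

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```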

6 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a Core Ontology of Phenotypes (COP) and the software PhenoMan, which implements a novel ontology-based method to model, classify and compute phenotypes from already available data.
Abstract: The successful determination and analysis of phenotypes plays a key role in the diagnostic process, the evaluation of risk factors and the recruitment of participants for clinical and epidemiological studies. The development of computable phenotype algorithms to solve these tasks is a challenging problem for several reasons. Firstly, the term ‘phenotype’ has no generally agreed definition and its meaning depends on context. Secondly, the phenotypes are most commonly specified as non-computable descriptive documents. Recent attempts have shown that ontologies are a suitable way to handle phenotypes and that they can support clinical research and decision making. The SMITH Consortium is dedicated to rapidly establishing an integrative medical informatics framework to provide physicians with the best available data and knowledge and enable innovative use of healthcare data for research and treatment optimisation. In the context of a methodological use case ‘phenotype pipeline’ (PheP), a technology to automatically generate phenotype classifications and annotations based on electronic health records (EHR) is developed. A large series of phenotype algorithms will be implemented. This implies that for each algorithm a classification scheme and its input variables have to be defined. Furthermore, a phenotype engine is required to evaluate and execute developed algorithms. In this article, we present a Core Ontology of Phenotypes (COP) and the software Phenotype Manager (PhenoMan), which implements a novel ontology-based method to model, classify and compute phenotypes from already available data. Our solution includes an enhanced iterative reasoning process combining classification tasks with mathematical calculations at runtime. The ontology as well as the reasoning method were successfully evaluated with selected phenotypes including SOFA score, socio-economic status, body surface area and WHO BMI classification based on available medical data. We developed a novel ontology-based method to model phenotypes of living beings with the aim of automated phenotype reasoning based on available data. This new approach can be used in a clinical context, e.g., for supporting the diagnostic process, evaluating risk factors, and recruiting appropriate participants for clinical and epidemiological studies.
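The WHO BMI classification mentioned above is a convenient example of a phenotype that combines a mathematical calculation with a classification step. The sketch below shows the plain computation using the standard WHO adult cut-offs; it does not reproduce the ontology-based reasoning implemented in PhenoMan.

```python
# Minimal sketch of a computed phenotype: a WHO BMI class derived by combining a
# numeric calculation (BMI from weight and height) with a classification step.
# Thresholds are the standard WHO adult cut-offs; PhenoMan's ontology-based
# reasoning is not reproduced here.
def bmi(weight_kg: float, height_m: float) -> float:
    return weight_kg / height_m ** 2

def who_bmi_class(value: float) -> str:
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    return "obese"

patient_bmi = bmi(weight_kg=82.0, height_m=1.75)
print(round(patient_bmi, 1), who_bmi_class(patient_bmi))   # 26.8 overweight
```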

Journal ArticleDOI
TL;DR: The LSTM-based machine learning method was able to extract named entities to be de-identified with better performance, in general, than that of the authors’ rule-based methods; however, machine learning methods are inadequate for processing expressions with low occurrence.
Abstract: Recently, more electronic data sources are becoming available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese language EHRs has not been studied sufficiently. This study was conducted to raise de-identification performance for Japanese EHRs through classic machine learning, deep learning, and rule-based methods, depending on the dataset. Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the de-identification performances found for rule-based, Conditional Random Fields (CRF), and Long-Short Term Memory (LSTM)-based methods. Gold standard tags for de-identification were annotated manually for age, hospital, person, sex, and time. We used different combinations of our datasets to train and evaluate our three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, for evaluations of the MedNLP dataset, a dummy EHR dataset that was virtually written by a medical doctor, and a Pathology Report dataset. Our LSTM-based method was the best performing, except for the MedNLP dataset. The rule-based method was best for the MedNLP dataset. The LSTM-based method achieved a good score of 83.07 points for this MedNLP dataset, which differs by 1.16 points from the best score obtained using the rule-based method. Results suggest that LSTM adapted well to different characteristics of our datasets. Our LSTM-based method performed better than our CRF-based method, yielding an F1-score 7.41 points higher when applied to our Pathology Report dataset. This report is the first study applying this LSTM-based method to a de-identification task for Japanese EHRs. Our LSTM-based machine learning method was able to extract named entities to be de-identified with better performance, in general, than that of our rule-based methods. However, machine learning methods are inadequate for processing expressions with low occurrence. Our future work will specifically examine the combination of LSTM and rule-based methods to achieve better performance. Our currently achieved level of performance is sufficiently higher than that of publicly available Japanese de-identification tools. Therefore, our system will be applied to actual de-identification tasks in hospitals.
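The rule-based side of such a de-identification system can be as simple as a list of pattern-to-tag substitutions. The sketch below illustrates the idea with English-style regular expressions for age, time, and person mentions; the actual study targets Japanese text with considerably richer rules, and the example name is invented.

```python
# Sketch of rule-based de-identification: regular expressions replace age, date
# and person mentions with category tags. These English-style patterns are
# illustrative only, much simpler than the Japanese rules used in the study.
import re

RULES = [
    (re.compile(r"\b\d{1,3}\s*(?:years? old|y/?o)\b", re.IGNORECASE), "<AGE>"),
    (re.compile(r"\b\d{4}[-/]\d{1,2}[-/]\d{1,2}\b"), "<TIME>"),
    (re.compile(r"\bDr\.\s+[A-Z][a-z]+\b"), "<PERSON>"),
]

def deidentify(text: str) -> str:
    for pattern, tag in RULES:
        text = pattern.sub(tag, text)
    return text

print(deidentify("Seen by Dr. Tanaka on 2020-05-14; patient is 67 years old."))
# Seen by <PERSON> on <TIME>; patient is <AGE>.
```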

Journal ArticleDOI
TL;DR: The Latent Dirichlet Allocation approach is used to analyze the correlation between funding efforts and actually published research results in order to provide the policy makers with a systematic and rigorous tool to assess the efficiency of funding programs in the medical area.
Abstract: Medical knowledge is accumulated in scientific research papers along time. In order to exploit this knowledge by automated systems, there is a growing interest in developing text mining methodologies to extract, structure, and analyze in the shortest time possible the knowledge encoded in the large volume of medical literature. In this paper, we use the Latent Dirichlet Allocation approach to analyze the correlation between funding efforts and actually published research results in order to provide the policy makers with a systematic and rigorous tool to assess the efficiency of funding programs in the medical area. We have tested our methodology in the Revista Medica de Chile, years 2012-2015. 50 relevant semantic topics were identified within 643 medical scientific research papers. Relationships between the identified semantic topics were uncovered using visualization methods. We have also been able to analyze the funding patterns of scientific research underlying these publications. We found that only 29% of the publications declare funding sources, and we identified five topic clusters that concentrate 86% of the declared funds. Our methodology allows analyzing and interpreting the current state of medical research at a national level. The funding source analysis may be useful at the policy making level in order to assess the impact of actual funding policies, and to design new policies.
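Topic models of this kind are commonly fitted with off-the-shelf LDA implementations. The sketch below uses scikit-learn on a tiny invented corpus with two topics; the study itself fitted 50 topics to 643 papers from Revista Medica de Chile.

```python
# Minimal sketch of the topic-modelling step: Latent Dirichlet Allocation over a
# bag-of-words representation of abstracts. The corpus and number of topics are
# illustrative only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "hypertension treatment trial blood pressure outcomes",
    "blood pressure medication adherence in primary care",
    "influenza vaccination coverage among older adults",
    "vaccination campaign effectiveness and influenza incidence",
]

counts = CountVectorizer().fit(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts.transform(abstracts))

terms = counts.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {topic_idx}: {', '.join(top_terms)}")
```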

Journal ArticleDOI
TL;DR: A NEural ArchiTecture for Drug side effect prediction (NEAT) is proposed, which is optimized on the task of drug side effect discovery based on a complete discussion while being attentive to user credibility and experience, thus, addressing the mentioned shortcomings.
Abstract: Health 2.0 allows patients and caregivers to conveniently seek medical information and advice via e-portals and online discussion forums, especially regarding potential drug side effects. Although online health communities are helpful platforms for obtaining non-professional opinions, they pose risks in communicating unreliable and insufficient information in terms of quality and quantity. Existing methods for extracting user-reported adverse drug reactions (ADRs) in online health forums are not only insufficiently accurate as they disregard user credibility and drug experience, but are also expensive as they rely on supervised ground truth annotation of individual statements. We propose a NEural ArchiTecture for Drug side effect prediction (NEAT), which is optimized on the task of drug side effect discovery based on a complete discussion while being attentive to user credibility and experience, thus addressing the mentioned shortcomings. We train our neural model in a self-supervised fashion using ground truth drug side effects from mayoclinic.org. NEAT learns to assign each user a score that is descriptive of their credibility and highlights the critical textual segments of their post. Experiments show that NEAT improves drug side effect discovery from online health discussion by 3.04% from user-credibility agnostic baselines, and by 9.94% from non-neural baselines in terms of F1. Additionally, the latent credibility scores learned by the model correlate well with trustworthiness signals, such as the number of “thanks” received from other forum members, and improve credibility heuristics such as number of posts by 0.113 in terms of Spearman’s rank correlation coefficient. Experience-based self-supervised attention highlights critical phrases such as mentioned side effects, and enhances fully supervised ADR extraction models based on sequence labelling by 5.502% in terms of precision. NEAT considers both user credibility and experience in online health forums, making feasible a self-supervised approach to side effect prediction for mentioned drugs. The derived user credibility and attention mechanism are transferable and improve downstream ADR extraction models. Our approach enhances automatic drug side effect discovery and fosters research in several domains including pharmacovigilance and clinical studies.
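Attention over textual segments, as used by NEAT to highlight critical phrases, can be summarised as softmax-normalised segment scores that weight a pooled representation. The sketch below shows only that generic pooling step with synthetic vectors and scores; it is not the NEAT architecture itself.

```python
# Sketch of attention-weighted pooling: segment scores are normalised with a
# softmax and the post representation is the weighted sum of segment vectors.
# Vectors and scores here are synthetic stand-ins for learned representations.
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

segment_vectors = np.array([[0.2, 0.1], [0.9, 0.8], [0.1, 0.0]])  # 3 segments, dim 2
segment_scores = np.array([0.1, 2.5, 0.2])   # the second segment is "critical"

weights = softmax(segment_scores)
post_representation = weights @ segment_vectors
print(weights.round(3), post_representation.round(3))
```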

Journal ArticleDOI
TL;DR: A configurable and automated schema extraction and publishing approach, which enables ad-hoc SPARQL query formulation against RDF triple stores without requiring direct access to the private data, can enable reuse of data from private repositories and in settings where agreeing upon a shared schema and encoding a priori is infeasible.
Abstract: Sharing sensitive data across organizational boundaries is often significantly limited by legal and ethical restrictions. Regulations such as the EU General Data Protection Regulation (GDPR) impose strict requirements concerning the protection of personal and privacy-sensitive data. Therefore, new approaches, such as the Personal Health Train initiative, are emerging to utilize data directly in their original repositories, circumventing the need to transfer data. Circumventing limitations of previous systems, this paper proposes a configurable and automated schema extraction and publishing approach, which enables ad-hoc SPARQL query formulation against RDF triple stores without requiring direct access to the private data. The approach is compatible with existing Semantic Web-based technologies and allows for the subsequent execution of such queries in a safe setting under the data provider’s control. Evaluation with four distinct datasets shows that a configurable amount of concise and task-relevant schema, closely describing the structure of the underlying data, was derived, enabling the schema introspection-assisted authoring of SPARQL queries. Automatically extracting and publishing data schema can enable the introspection-assisted creation of data selection and integration queries. In conjunction with the presented system architecture, this approach can enable reuse of data from private repositories and in settings where agreeing upon a shared schema and encoding a priori is infeasible. As such, it could provide an important step towards reuse of data from previously inaccessible sources and thus towards the proliferation of data-driven methods in the biomedical domain.
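Schema extraction of the kind described above can start from simple introspection queries that return only class and predicate identifiers, never instance values. The sketch below issues such a query with SPARQLWrapper; the endpoint URL is a placeholder, and the query is a simplification of the configurable extraction the paper proposes.

```python
# Sketch of the basic idea behind schema introspection: query a triple store for
# the classes used as subject types and the predicates used with them, without
# touching the underlying instance values. The endpoint URL is a placeholder.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:7200/repositories/private-data")  # placeholder
sparql.setQuery("""
SELECT DISTINCT ?class ?predicate
WHERE { ?subject a ?class ; ?predicate ?object . }
LIMIT 100
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["class"]["value"], "->", row["predicate"]["value"])
```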