Author

Chuan Lei

Other affiliations: Worcester Polytechnic Institute
Bio: Chuan Lei is an academic researcher from IBM. The author has contributed to research in topics: Complex event processing & Graph (abstract data type). The author has an h-index of 11 and has co-authored 33 publications receiving 259 citations. Previous affiliations of Chuan Lei include Worcester Polytechnic Institute.

Papers
Journal ArticleDOI
01 Jul 2020
TL;DR: This paper presents ATHENA++, an end-to-end system that can answer complex queries in natural language by translating them into nested SQL queries, and combines linguistic patterns from NL queries with deep domain reasoning using ontologies to enable nested query detection and generation.
Abstract: Natural Language Interfaces to Databases (NLIDB) systems eliminate the requirement for an end user to use complex query languages like SQL, by translating the input natural language (NL) queries to SQL automatically. Although a significant volume of research has focused on this space, most state-of-the-art systems can at best handle simple select-project-join queries. There has been little to no research on extending the capabilities of NLIDB systems to handle complex business intelligence (BI) queries that often involve nesting as well as aggregation. In this paper, we present Athena++, an end-to-end system that can answer such complex queries in natural language by translating them into nested SQL queries. In particular, Athena++ combines linguistic patterns from NL queries with deep domain reasoning using ontologies to enable nested query detection and generation. We also introduce a new benchmark data set (FIBEN), which consists of 300 NL queries, corresponding to 237 distinct complex SQL queries on a database with 152 tables, conforming to an ontology derived from standard financial ontologies (FIBO and FRO). We conducted extensive experiments comparing Athena++ with two state-of-the-art NLIDB systems, using both FIBEN and the prominent Spider benchmark. Athena++ consistently outperforms both systems across all benchmark data sets with a wide variety of complex queries, achieving 88.33% accuracy on FIBEN benchmark, and 78.89% accuracy on Spider benchmark, beating the best reported accuracy results on the dev set by 8%.

63 citations
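
As a rough illustration of what the abstract above means by a query that involves both nesting and aggregation, here is a minimal sketch using Python's built-in `sqlite3` module. The `company` table, its columns, and the example question ("which companies have above-average revenue?") are hypothetical and are not taken from FIBEN or from Athena++ itself.

```python
import sqlite3

def companies_above_average_revenue(rows):
    """Return names of companies whose revenue exceeds the average revenue.

    The schema (company: name, revenue) is a hypothetical example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (name TEXT, revenue REAL)")
    conn.executemany("INSERT INTO company VALUES (?, ?)", rows)
    # The inner SELECT aggregates over the whole table; the outer SELECT
    # filters on its result. This is the "nested + aggregation" shape.
    query = """
        SELECT name FROM company
        WHERE revenue > (SELECT AVG(revenue) FROM company)
        ORDER BY name
    """
    result = [name for (name,) in conn.execute(query)]
    conn.close()
    return result

ROWS = [("Acme", 10.0), ("Birch", 40.0), ("Cedar", 70.0)]
print(companies_above_average_revenue(ROWS))
```

A select-project-join query alone cannot express this question, because the filter threshold depends on an aggregate computed over the same table.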

Proceedings ArticleDOI
14 Jun 2016
TL;DR: The SPASS optimizer identifies opportunities for effective shared processing among CEP queries by leveraging time-based event correlations among queries and finds a shared pattern plan in polynomial-time covering all sequence patterns while still guaranteeing an optimality bound.
Abstract: Complex Event Processing (CEP) has emerged as a technology of choice for high-performance event analytics in time-critical decision-making applications. Yet it is becoming increasingly difficult to support high-performance event processing due to the rising number and complexity of event pattern queries and the increasingly high velocity of event streams. In this work we design the SPASS framework that successfully tackles these demanding CEP workloads. Our SPASS optimizer identifies opportunities for effective shared processing among CEP queries by leveraging time-based event correlations among queries. The problem of pattern sharing is shown to be NP-hard by reducing the Minimum Substring Cover problem to our CEP pattern sharing problem. The SPASS optimizer is designed to find a shared pattern plan in polynomial time that covers all sequence patterns while still guaranteeing an optimality bound. To execute this shared pattern plan, the SPASS runtime employs stream transactions that assure concurrent shared maintenance and re-use of sub-patterns across queries. Our experimental study confirms that the SPASS framework achieves over 16-fold performance improvement for a wide range of experiments compared to the state-of-the-art solution.

52 citations

Proceedings ArticleDOI
Fatma Özcan, Abdul Quamar, Jaydeep Sen, Chuan Lei, Vasilis Efthymiou
11 Jun 2020
TL;DR: This tutorial will review natural language interface solutions in terms of their interpretation approach, as well as the complexity of the queries they can generate, and discuss open research challenges.
Abstract: Recent advances in natural language understanding and processing resulted in renewed interest in natural language based interfaces to data, which provide an easy mechanism for non-technical users to access and query the data. While early systems only allowed simple selection queries over a single table, some recent work supports complex BI queries, with many joins and aggregation, and even nested queries. There are various approaches in the literature for interpreting user's natural language query. Rule-based systems try to identify the entities in the query, and understand the intended relationships between those entities. Recent years have seen the emergence and popularity of neural network based approaches which try to interpret the query holistically, by learning the patterns. In this tutorial, we will review these natural language interface solutions in terms of their interpretation approach, as well as the complexity of the queries they can generate. We will also discuss open research challenges.

32 citations

Journal ArticleDOI
TL;DR: This paper presents "Query Mesh" (or QM), a practical alternative to state-of-the-art data stream processing approaches, and proposes several cost-based query optimization heuristics designed to effectively find nearly optimal QMs.

27 citations

Proceedings ArticleDOI
Alina Vretinaris, Chuan Lei, Vasilis Efthymiou, Xiao Qin, Fatma Özcan
09 Jun 2021
TL;DR: This paper introduces ED-GNN, based on three representative graph neural networks (GraphSAGE, R-GCN, and MAGNN), for medical entity disambiguation.
Abstract: Medical knowledge bases (KBs), distilled from biomedical literature and regulatory actions, are expected to provide high-quality information to facilitate clinical decision making. Entity disambiguation (also referred to as entity linking) is considered as an essential task in unlocking the wealth of such medical KBs. However, existing medical entity disambiguation methods are not adequate due to word discrepancies between the entities in the KB and the text snippets in the source documents. Recently, graph neural networks (GNNs) have proven to be very effective and provide state-of-the-art results for many real-world applications with graph-structured data. In this paper, we introduce ED-GNN based on three representative GNNs (GraphSAGE, R-GCN, and MAGNN) for medical entity disambiguation. We develop two optimization techniques to fine-tune and improve ED-GNN. First, we introduce a novel strategy to represent entities that are mentioned in text snippets as a query graph. Second, we design an effective negative sampling strategy that identifies hard negative samples to improve the model's disambiguation capability. Compared to the best performing state-of-the-art solutions, our ED-GNN offers an average improvement of 7.3% in terms of F1 score on five real-world datasets.

25 citations
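
The abstract above mentions a negative sampling strategy that identifies hard negative samples. The sketch below shows one common way to pick hard negatives: rank the non-gold candidate entities by cosine similarity to the mention embedding and keep the closest ones. This is an assumption made for illustration, not the strategy used in ED-GNN; the function names, entity identifiers, and toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hard_negatives(mention_vec, candidates, gold_id, k):
    """Return the k non-gold entity ids most similar to the mention.

    candidates maps entity id -> embedding. Ties are broken by entity id
    so that the result is deterministic.
    """
    scored = [
        (cosine(mention_vec, vec), ent_id)
        for ent_id, vec in candidates.items()
        if ent_id != gold_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [ent_id for _, ent_id in scored[:k]]

# Toy 2-d embeddings, purely illustrative.
CANDIDATES = {
    "gold": [1.0, 0.0],
    "near": [0.9, 0.1],
    "mid": [0.5, 0.5],
    "far": [0.0, 1.0],
}
print(hard_negatives([1.0, 0.0], CANDIDATES, "gold", k=2))
```

Negatives that sit close to the mention are harder for a model to separate from the correct entity than randomly drawn ones, which is the usual motivation for sampling them.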


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Journal Article

373 citations

Posted Content
TL;DR: This paper proposes to decompose complex questions into a sequence of simple questions and to compute the final answer from the sequence of answers; it empirically demonstrates that question decomposition improves performance from 20.8 precision@1 to 27.5 precision@1 on a new dataset, ComplexWebQuestions.
Abstract: Answering complex questions is a time-consuming activity for humans that requires reasoning and integration of information. Recent work on reading comprehension made headway in answering simple questions, but tackling complex questions is still an ongoing research challenge. Conversely, semantic parsers have been successful at handling compositionality, but only when the information resides in a target knowledge-base. In this paper, we present a novel framework for answering broad and complex questions, assuming answering simple questions is possible using a search engine and a reading comprehension model. We propose to decompose complex questions into a sequence of simple questions, and compute the final answer from the sequence of answers. To illustrate the viability of our approach, we create a new dataset of complex questions, ComplexWebQuestions, and present a model that decomposes questions and interacts with the web to compute an answer. We empirically demonstrate that question decomposition improves performance from 20.8 precision@1 to 27.5 precision@1 on this new dataset.

256 citations
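
The decomposition idea in the abstract above can be sketched as follows: a complex question is split into simple sub-questions, each is answered in turn, and earlier answers are substituted into later sub-questions. This is a minimal sketch under stated assumptions. The `#1`-style placeholder convention, the function names, and the dictionary standing in for a search engine plus reading comprehension model are all hypothetical and are not the paper's implementation.

```python
def answer_by_decomposition(sub_questions, answer_simple):
    """Answer sub-questions in order and return the last answer.

    A placeholder "#i" in a sub-question is replaced by the answer to the
    i-th sub-question (1-based). answer_simple is any callable that maps a
    simple question string to an answer string.
    """
    answers = []
    for template in sub_questions:
        question = template
        # Substitute higher indices first so "#10" is not clobbered by "#1".
        for i in range(len(answers), 0, -1):
            question = question.replace(f"#{i}", answers[i - 1])
        answers.append(answer_simple(question))
    return answers[-1]

# Toy lookup table standing in for the simple-question answerer.
TOY_ANSWERS = {
    "Which city hosted the 2016 Summer Olympics?": "Rio de Janeiro",
    "Which country is Rio de Janeiro in?": "Brazil",
}

SUB_QUESTIONS = [
    "Which city hosted the 2016 Summer Olympics?",
    "Which country is #1 in?",
]
print(answer_by_decomposition(SUB_QUESTIONS, TOY_ANSWERS.__getitem__))
```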

Journal ArticleDOI
Jungeun Kim, Jae-Gil Lee
03 Dec 2015
TL;DR: This survey provides readers with a comprehensive understanding of community detection in multi-layer graphs and compares the state-of-the-art algorithms with respect to their underlying properties.
Abstract: Community detection, also known as graph clustering, has been extensively studied in the literature. The goal of community detection is to partition vertices in a complex graph into densely connected components, so-called communities. In recent applications, however, an entity is associated with multiple aspects of relationships, which brings new challenges in community detection. The multiple aspects of interactions can be modeled as a multi-layer graph comprised of multiple interdependent graphs, where each graph represents an aspect of the interactions. Great efforts have therefore been made to tackle the problem of community detection in multi-layer graphs. In this survey, we provide readers with a comprehensive understanding of community detection in multi-layer graphs and compare the state-of-the-art algorithms with respect to their underlying properties.

162 citations

Journal ArticleDOI
TL;DR: A tensor-based multiple clustering on bicycle renting and returning data is illustrated, which can provide several suggestions for rebalancing of the bicycle-sharing system and some challenges about the proposed framework are discussed.
Abstract: Due to the rapid advances of information technologies, Big Data, recognized with 4Vs characteristics (volume, variety, veracity, and velocity), bring significant benefits as well as many challenges. A major benefit of Big Data is to provide timely information and proactive services for humans. The primary purpose of this paper is to review the current state-of-the-art of Big Data from the aspects of organization and representation, cleaning and reduction, integration and processing, security and privacy, analytics and applications, then present a novel framework to provide high-quality so-called Big Data-as-a-Service. The framework consists of three planes, namely sensing plane, cloud plane and application plane, to systemically address all challenges of the above aspects. Also, to clearly demonstrate the working process of the proposed framework, a tensor-based multiple clustering on bicycle renting and returning data is illustrated, which can provide several suggestions for rebalancing of the bicycle-sharing system. Finally, some challenges about the proposed framework are discussed.

121 citations