
Showing papers by "Srikanta Bedathur published in 2021"


Proceedings ArticleDOI
14 Aug 2021
TL;DR: Fast Random projection-based One-Class Classification (FROCC) as mentioned in this paper is an efficient, scalable and easily parallelizable method for one-class classification with provable theoretical guarantees, which transforms the training data by projecting it onto a set of random unit vectors chosen uniformly and independently from the unit sphere.
Abstract: Several applications, like malicious URL detection and web spam detection, require classification on very high-dimensional data. In such cases anomalous data is hard to find but normal data is easily available. As such it is increasingly common to use a one-class classifier (OCC). Unfortunately, most OCC algorithms cannot scale to datasets with extremely high dimensions. In this paper, we present Fast Random projection-based One-Class Classification (FROCC), an extremely efficient, scalable and easily parallelizable method for one-class classification with provable theoretical guarantees. Our method is based on the simple idea of transforming the training data by projecting it onto a set of random unit vectors that are chosen uniformly and independently from the unit sphere, and bounding the regions based on separation of the data. FROCC can be naturally extended with kernels. We provide a new theoretical framework to prove that FROCC generalizes well in the sense that it is stable and has low bias for some parameter settings. We then develop a fast scalable approximation of FROCC using vectorization, exploiting data sparsity and parallelism to develop a new implementation called ParDFROCC. ParDFROCC achieves up to 2 percentage points better ROC than the next best baseline, with up to 12× speedup in training and test times over a range of state-of-the-art benchmarks for the OCC task.
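
To make the projection-and-bounding idea concrete, here is a minimal NumPy sketch of the core mechanism, under the simplifying assumption that each projection is bounded by a single [min, max] interval; the paper's epsilon-separated intervals, kernel extension, and the vectorized ParDFROCC implementation are omitted.

```python
# Minimal sketch of the FROCC idea, simplified to one interval per projection.
import numpy as np

class SimpleFROCC:
    def __init__(self, n_vectors=100, seed=0):
        self.n_vectors = n_vectors
        self.rng = np.random.default_rng(seed)

    def fit(self, X):
        # Random unit vectors drawn uniformly from the unit sphere.
        W = self.rng.standard_normal((self.n_vectors, X.shape[1]))
        self.W = W / np.linalg.norm(W, axis=1, keepdims=True)
        P = X @ self.W.T                       # project the training data
        self.lo, self.hi = P.min(axis=0), P.max(axis=0)
        return self

    def predict(self, X):
        P = X @ self.W.T
        inside = (P >= self.lo) & (P <= self.hi)
        return inside.all(axis=1).astype(int)  # 1 = normal, 0 = anomaly
```

A point is flagged anomalous as soon as any of its projections falls outside the range spanned by the training data; the full method refines each projection into separated intervals rather than one bounding interval.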

8 citations


Proceedings ArticleDOI
26 Oct 2021
TL;DR: In this article, a transfer learning framework called REFORMD is proposed for continuous-time location prediction for regions with sparse checkin data, where the authors model user-specific checkin-sequences in a region using a marked temporal point process (MTPP) with normalizing flows to learn the inter-checkin time and geo-distributions.
Abstract: There exists a high variability in mobility data volumes across different regions, which deteriorates the performance of spatial recommender systems that rely on region-specific data. In this paper, we propose a novel transfer learning framework called REFORMD, for continuous-time location prediction for regions with sparse checkin data. Specifically, we model user-specific checkin-sequences in a region using a marked temporal point process (MTPP) with normalizing flows to learn the inter-checkin time and geo-distributions. Later, we transfer the model parameters of spatial and temporal flows trained on a data-rich origin region for the next check-in and time prediction in a target region with scarce checkin data. We capture the evolving region-specific checkin dynamics for MTPP and spatial-temporal flows by maximizing the joint likelihood of the next checkin with three channels: (1) checkin-category prediction, (2) checkin-time prediction, and (3) travel distance prediction. Extensive experiments on different user mobility datasets across the U.S. and Japan show that our model significantly outperforms state-of-the-art methods for modeling continuous-time sequences. Moreover, we also show that REFORMD can be easily adapted for product recommendations, i.e., sequences without any spatial component.
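
A rough, purely illustrative skeleton of the transfer step follows. The model submodules and `*_nll` likelihood methods are hypothetical names (the abstract does not specify the architecture); only the overall recipe — reuse the origin region's spatial and temporal flows and fit the target region by maximizing the three-channel joint likelihood — comes from the text above.

```python
# Illustrative transfer skeleton; all module/method names are assumptions.
import copy
import torch

def transfer_to_target(source_model, target_loader, epochs=10, lr=1e-3):
    model = copy.deepcopy(source_model)
    # Keep the spatial/temporal flows trained on the data-rich origin region;
    # reinitialize only the region-specific category head (hypothetical name).
    model.category_head.reset_parameters()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in target_loader:
            # Joint negative log-likelihood over the three channels described
            # above: check-in category, check-in time, and travel distance.
            loss = (model.category_nll(batch)
                    + model.time_nll(batch)
                    + model.distance_nll(batch))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```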

8 citations


Proceedings ArticleDOI
14 Aug 2021
TL;DR: The 2nd International Workshop on Data Quality Assessment for Machine Learning (DQAML'21) is organized in conjunction with the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) as discussed by the authors.
Abstract: The 2nd International Workshop on Data Quality Assessment for Machine Learning (DQAML'21) is organized in conjunction with the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). This workshop aims to serve as a forum for the presentation of research related to data quality assessment and remediation in the AI/ML pipeline. Data quality is a critical issue in the data preparation phase and involves numerous challenging problems related to the detection, remediation, visualization and evaluation of data issues. The workshop aims to provide a platform for researchers and practitioners to discuss such challenges across different data modalities, such as structured, time-series, text and graph data. The aim is to attract perspectives from both industrial and academic circles.

3 citations



Book ChapterDOI
11 May 2021
TL;DR: In the last several years, AI/ML technologies have become pervasive in academia and industry, finding their utility in newer and more challenging applications, as discussed by the authors.
Abstract: In the last several years, AI/ML technologies have become pervasive in academia and industry, finding their utility in newer and more challenging applications.

2 citations


Proceedings ArticleDOI
26 Oct 2021
TL;DR: In this paper, the authors propose a system, HAPPI (How Provenance of Probabilistic Inference), to handle query processing and inference over probabilistic knowledge graphs.
Abstract: Knowledge graphs (KG) model relationships between entities as labeled edges (or facts). They are mostly constructed using a suite of automated extractors, thereby inherently leading to uncertainty in the extracted facts. Modeling the uncertainty as probabilistic confidence scores results in a probabilistic knowledge graph. Graph queries over such probabilistic KGs require answer computation along with the computation of result probabilities, i.e., probabilistic inference. We propose a system, HAPPI (How Provenance of Probabilistic Inference), to handle such query processing and inference. Complying with the standard provenance semiring model, we propose a novel commutative semiring to symbolically compute the probability of the result of a query. These provenance-polynomial-like symbolic expressions encode fine-grained information about the probability computation process. We leverage this encoding to efficiently compute as well as maintain probabilities of results even as the underlying KG changes. Focusing on conjunctive basic graph pattern queries, we observe that HAPPI is more efficient than knowledge compilation for answering commonly occurring queries with a lower range of probability derivation complexity. We propose an adaptive system that leverages the strengths of both HAPPI and compilation-based techniques, not only to perform efficient probabilistic inference and compute its provenance, but also to incrementally maintain both.
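
For intuition only, the snippet below computes a query-answer probability by brute-force possible-world enumeration over a toy probabilistic KG of three facts; HAPPI's semiring provenance exists precisely to avoid this exponential enumeration and to support incremental maintenance. The facts and the derivation (f1 AND f2) OR f3 are invented for illustration.

```python
# Toy possible-worlds evaluation of a provenance expression over
# independent probabilistic facts; invented example, not HAPPI itself.
from itertools import product

facts = {"f1": 0.9, "f2": 0.7, "f3": 0.5}  # fact -> confidence score

def provenance(world):
    # Boolean provenance of one query answer: derived via (f1 AND f2) OR f3.
    return (world["f1"] and world["f2"]) or world["f3"]

def answer_probability():
    prob = 0.0
    names = list(facts)
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        if provenance(world):
            # Probability of this world under fact independence.
            w = 1.0
            for name, present in world.items():
                w *= facts[name] if present else 1 - facts[name]
            prob += w
    return prob

print(answer_probability())  # 0.9*0.7 + 0.5 - 0.9*0.7*0.5 = 0.815
```

The symbolic expression lets the probability be recomputed locally when a single fact's confidence changes, rather than re-enumerating worlds; that is the maintenance advantage the abstract describes.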

1 citation


Posted Content
TL;DR: In this paper, the authors address the problem of learning low-dimensional representation of entities on relational databases consisting of multiple tables and propose an attention-based model to learn embeddings for entities in the relational database.
Abstract: In this paper, we address the problem of learning low-dimensional representations of entities in relational databases consisting of multiple tables. Embeddings help to capture the semantics encoded in the database and can be used in a variety of settings, like auto-completion of tables, fully-neural query processing of relational join queries, seamlessly handling missing values, and more. Current work is restricted to working with just a single table, or uses pretrained embeddings over an external corpus, making it unsuitable for use in real-world databases. In this work, we look into ways of using attention-based models to learn embeddings for entities in the relational database. We are inspired by BERT-style pretraining methods and are interested in observing how they can be extended for representation learning on structured databases. We evaluate our approach on the autocompletion of relational databases and achieve improvements over standard baselines.
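
A minimal sketch of what BERT-style masked-cell pretraining over a serialized table might look like, under heavy assumptions: cells are mapped to a flat token vocabulary, one cell per row is masked, and recovering it doubles as the autocompletion objective. The paper's actual tokenization and architecture are not specified here.

```python
# Masked-cell pretraining sketch; vocabulary, sizes, and single-table
# serialization are illustrative assumptions.
import torch
import torch.nn as nn

MASK_ID = 0  # reserved token id for the masked cell

class RowEncoder(nn.Module):
    """BERT-style encoder: each table row is a sequence of cell tokens."""
    def __init__(self, vocab_size, dim=64, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):                 # (batch, n_columns)
        return self.out(self.encoder(self.embed(token_ids)))

def masked_cell_step(model, rows, opt):
    """Mask one random cell per row and train the model to recover it."""
    batch = torch.arange(rows.size(0))
    pos = torch.randint(rows.size(1), (rows.size(0),))
    targets = rows[batch, pos]
    inputs = rows.clone()
    inputs[batch, pos] = MASK_ID
    logits = model(inputs)[batch, pos]            # logits at masked positions
    loss = nn.functional.cross_entropy(logits, targets)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

After pretraining, the encoder states (or the embedding table itself) serve as entity embeddings, and predicting a masked position ranks candidate values for autocompletion.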

1 citation


Posted Content
TL;DR: REFORMD, as discussed by the authors, is a transfer learning framework for continuous-time location prediction for regions with sparse checkin data, which learns the inter-checkin time and geo-distributions by maximizing the joint likelihood of the next checkin with three channels: (1) checkin-category prediction, (2) checkin-time prediction, and (3) travel-distance prediction.
Abstract: There exists a high variability in mobility data volumes across different regions, which deteriorates the performance of spatial recommender systems that rely on region-specific data. In this paper, we propose a novel transfer learning framework called REFORMD, for continuous-time location prediction for regions with sparse checkin data. Specifically, we model user-specific checkin-sequences in a region using a marked temporal point process (MTPP) with normalizing flows to learn the inter-checkin time and geo-distributions. Later, we transfer the model parameters of spatial and temporal flows trained on a data-rich origin region for the next check-in and time prediction in a target region with scarce checkin data. We capture the evolving region-specific checkin dynamics for MTPP and spatial-temporal flows by maximizing the joint likelihood of the next checkin with three channels: (1) checkin-category prediction, (2) checkin-time prediction, and (3) travel distance prediction. Extensive experiments on different user mobility datasets across the U.S. and Japan show that our model significantly outperforms state-of-the-art methods for modeling continuous-time sequences. Moreover, we also show that REFORMD can be easily adapted for product recommendations, i.e., sequences without any spatial component.

1 citation


Posted Content
TL;DR: In this paper, the authors propose a system called HAPPI (How Provenance of Probabilistic Inference) to handle query processing over probabilistic knowledge graphs.
Abstract: Knowledge graphs (KG) that model the relationships between entities as labeled edges (or facts) in a graph are mostly constructed using a suite of automated extractors, thereby inherently leading to uncertainty in the extracted facts. Modeling the uncertainty as probabilistic confidence scores results in a probabilistic knowledge graph. Graph queries over such probabilistic KGs require answer computation along with the computation of those result probabilities, aka probabilistic inference. We propose a system, HAPPI (How Provenance of Probabilistic Inference), to handle such query processing. Complying with the standard provenance semiring model, we propose a novel commutative semiring to symbolically compute the probability of the result of a query. These provenance-polynomial-like symbolic expressions encode fine-grained information about the probability computation process. We leverage this encoding to efficiently compute as well as maintain the probability of results as the underlying KG changes. Focusing on a popular class of conjunctive basic graph pattern queries on the KG, we compare the performance of HAPPI against a possible-world model of computation and a knowledge compilation tool over two large datasets. We also propose an adaptive system that leverages the strengths of both HAPPI and compilation-based techniques. Since existing systems for probabilistic databases mostly focus on query computation, they default to re-computation when facts in the KG are updated. HAPPI, on the other hand, does not just perform probabilistic inference and maintain its provenance, but also provides a mechanism to incrementally maintain both as the KG changes. We extend this maintainability as part of our proposed adaptive system.

1 citation


Book ChapterDOI
11 May 2021
TL;DR: In this article, the authors propose a Dual-Network Hawkes Process (DNHP) to model bursty diffusion of text-based events over a social network of user nodes, where closeness of nodes is captured using topic-topic, user-user, and user-topic interactions.
Abstract: We address the problem of modeling bursty diffusion of text-based events over a social network of user nodes. The purpose is to recover, disentangle and analyze overlapping social conversations from the perspective of user-topic preferences, user-user connection strengths and, importantly, topic transitions. For this, we propose a Dual-Network Hawkes Process (DNHP), which executes over a graph whose nodes are user-topic pairs, and closeness of nodes is captured using topic-topic, user-user, and user-topic interactions. No existing Hawkes Process model captures such multiple interactions simultaneously. Additionally, unlike existing Hawkes Process based models, where event times are generated first and event topics are conditioned on the event times, the DNHP is more faithful to the underlying social process by making the event times depend on interacting (user, topic) pairs. We develop a Gibbs sampling algorithm for estimating the three network parameters that allows evidence to flow between the parameter spaces. Using experiments over a large real collection of tweets by US politicians, we show that the DNHP generalizes better than state-of-the-art models, and also provides interesting insights about user and topic transitions.
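
One way to picture an intensity over (user, topic) nodes is the sketch below, assuming an exponential decay kernel and a purely multiplicative combination of the three interaction matrices; DNHP's exact parameterization and its Gibbs sampler are beyond this snippet.

```python
# Illustrative factorized Hawkes intensity over (user, topic) pairs.
# The multiplicative form and exponential kernel are assumptions.
import numpy as np

def intensity(u, k, t, events, mu, UU, TT, UT, beta=1.0):
    """Intensity of (user u, topic k) at time t, given past
    events as a list of (t_i, u_i, k_i) triples."""
    lam = mu[u, k]  # base rate for this (user, topic) pair
    for t_i, u_i, k_i in events:
        if t_i < t:
            # A past event excites (u, k) in proportion to user-user
            # closeness, topic-topic closeness, and u's topic preference,
            # with exponentially decaying influence over time.
            lam += UU[u_i, u] * TT[k_i, k] * UT[u, k] * np.exp(-beta * (t - t_i))
    return lam
```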

1 citation


Posted Content
TL;DR: The TechTrack dataset, as mentioned in this paper, is a dataset for tracking entities in technical procedures; prepared by annotating open-domain articles from WikiHow, it consists of 1351 procedures and contains more than 1200 unique entities, with an average of 4.7 entities per procedure.
Abstract: We introduce TechTrack, a new dataset for tracking entities in technical procedures. The dataset, prepared by annotating open-domain articles from WikiHow, consists of 1351 procedures, e.g., "How to connect a printer", and identifies more than 1200 unique entities, with an average of 4.7 entities per procedure. We evaluate the performance of state-of-the-art models on the entity-tracking task and find that they are well below human annotation performance. We describe how TechTrack can be used to take forward research on understanding procedures from temporal texts.