Author

L. Venkata Subramaniam

Bio: L. Venkata Subramaniam is an academic researcher from IBM. The author has contributed to research in topics including noisy text and data cleansing. The author has an h-index of 20 and has co-authored 93 publications receiving 1,470 citations. Previous affiliations of L. Venkata Subramaniam include the Indian Institute of Technology Delhi.


Papers
Proceedings ArticleDOI
16 Apr 2012
TL;DR: This paper proposes generative models that can discover communities based on the discussed topics, interaction types and the social connections among people and shows that it performs better than existing community discovery models.
Abstract: In recent years, social networking sites have not only enabled people to connect with each other using social links but have also allowed them to share, communicate and interact over diverse geographical regions. Social networks provide a rich source of heterogeneous data which can be exploited to discover previously unknown relationships and interests among groups of people. In this paper, we address the problem of discovering topically meaningful communities from a social network. We assume that a person's membership in a community is conditioned on his or her social relationships, the type of interaction and the information communicated with other members of that community. We propose generative models that can discover communities based on the discussed topics, interaction types and the social connections among people. In our models a person can belong to multiple communities and a community can participate in multiple topics. This allows us to discover both community interests and user interests based on the information and linked associations. We demonstrate the effectiveness of our model on two real-world data sets and show that it performs better than existing community discovery models.

183 citations
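
The generative process sketched in the abstract can be made concrete with a short forward-sampling toy. Everything below is our own illustrative assumption, not the paper's exact model: the Dirichlet priors, the sizes, and the simplified conditional structure (a user draws a community, the community draws per-word topics).

    # Hypothetical forward-sampling sketch of a community-topic generative
    # process in the spirit of the paper; the actual model may condition
    # differently on interaction types and social links.
    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_communities, n_topics, vocab = 50, 4, 6, 200

    # Dirichlet priors (assumed hyperparameters)
    user_comm = rng.dirichlet(np.ones(n_communities), size=n_users)   # user -> community mix
    comm_topic = rng.dirichlet(np.ones(n_topics), size=n_communities) # community -> topic mix
    topic_word = rng.dirichlet(np.ones(vocab), size=n_topics)         # topic -> word dist

    def generate_message(user, length=10):
        """Sample one message: pick a community for the user, then a topic
        per word from that community's topic mix, then the word itself."""
        c = rng.choice(n_communities, p=user_comm[user])
        words = []
        for _ in range(length):
            z = rng.choice(n_topics, p=comm_topic[c])
            words.append(rng.choice(vocab, p=topic_word[z]))
        return c, words

    community, message = generate_message(user=7)

Note that a user's community mixture lets one person belong to several communities, and a community's topic mixture lets one community span several topics, mirroring the two properties the abstract highlights.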

Proceedings ArticleDOI
23 Jul 2009
TL;DR: A survey of the existing measures for noise in text is presented and application areas that ingest this noisy text for various tasks like Information Retrieval and Information Extraction are covered.
Abstract: Noise is ubiquitous in real-world text communications. Text produced by processing signals intended for human use is often noisy for automated computer processing. Automatic speech recognition, optical character recognition and machine translation all introduce processing noise. Digital text produced in informal settings such as online chat, SMS, emails, message boards, newsgroups, blogs, wikis and web pages also contains considerable noise. In this paper, we present a survey of the existing measures for noise in text. We also cover application areas that ingest this noisy text for various tasks like Information Retrieval and Information Extraction.

88 citations
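
One family of measures such a survey covers is edit-distance based, for instance word error rate: the word-level edit distance between the noisy text and a clean reference, normalized by reference length. A minimal self-contained sketch (the function names and the example sentence pair are our own):

    def edit_distance(a, b):
        """Levenshtein distance between two token sequences, single-row DP."""
        dp = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, y in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
        return dp[-1]

    def word_error_rate(noisy, clean):
        """Word-level edit distance, normalized by reference length."""
        ref = clean.split()
        return edit_distance(noisy.split(), ref) / len(ref)

    print(word_error_rate("c u 2morrow at d mall", "see you tomorrow at the mall"))  # 4/6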

Proceedings Article
23 Aug 2010
TL;DR: An unsupervised method for translating noisy text to clean text, in which a weighted list of possible clean tokens is obtained for each noisy token.
Abstract: In this paper we look at the problem of cleansing noisy text using a statistical machine translation model. Noisy text is produced in informal communications such as Short Message Service (SMS), Twitter and chat. A typical statistical machine translation system is trained on parallel text comprising noisy and clean sentences. In this paper we propose an unsupervised method for the translation of noisy text to clean text. Our method has two steps. For a given noisy sentence, a weighted list of possible clean tokens is obtained for each noisy token. The clean sentence is then obtained by maximizing the product of the weighted lists and the language model scores.

76 citations
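
A toy decoding sketch of the two-step method just described. The candidate lists and the bigram language model below are hand-written stand-ins (the paper derives the candidate weights without supervision), and the exhaustive search is only workable for short sentences:

    import math
    from itertools import product  # exhaustive; fine for toy-sized sentences

    # Step 1 (assumed precomputed): weighted clean candidates per noisy token.
    candidates = {
        "gud": [("good", 0.9), ("god", 0.1)],
        "mrng": [("morning", 1.0)],
    }

    def lm_logprob(tokens, bigram_logp):
        """Score a clean-token sequence under a toy bigram language model."""
        return sum(bigram_logp.get((a, b), -10.0) for a, b in zip(["<s>"] + tokens, tokens))

    def decode(noisy_tokens, bigram_logp):
        """Step 2: pick the combination maximizing candidate weights x LM score."""
        best, best_score = None, -math.inf
        lists = [candidates.get(t, [(t, 1.0)]) for t in noisy_tokens]
        for combo in product(*lists):
            tokens = [w for w, _ in combo]
            score = sum(math.log(p) for _, p in combo) + lm_logprob(tokens, bigram_logp)
            if score > best_score:
                best, best_score = tokens, score
        return best

    toy_lm = {("<s>", "good"): -0.5, ("good", "morning"): -0.3}
    print(decode(["gud", "mrng"], toy_lm))  # -> ['good', 'morning']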

Proceedings ArticleDOI
02 Aug 2009
TL;DR: This work presents an efficient search algorithm that does not require any training data or SMS normalization and can handle semantic variations in question formulation, and demonstrates the effectiveness of the approach on two real-life datasets.
Abstract: Short Messaging Service (SMS) is popularly used to provide information access to people on the move. This has resulted in the growth of SMS based Question Answering (QA) services. However, automatically handling SMS questions poses significant challenges due to the inherent noise in SMS questions. In this work we present an automatic FAQ-based question answering system for SMS users. We handle the noise in an SMS query by formulating the query similarity over FAQ questions as a combinatorial search problem. The search space consists of combinations of all possible dictionary variations of tokens in the noisy query. We present an efficient search algorithm that does not require any training data or SMS normalization and can handle semantic variations in question formulation. We demonstrate the effectiveness of our approach on two real-life datasets.

73 citations
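
The following toy illustrates the flavor of matching a noisy SMS query against FAQ questions; the character-level similarity and the additive scoring are our own simplifications of the paper's combinatorial search over dictionary variations of tokens:

    from difflib import SequenceMatcher

    faq = [
        "how do i reset my password",
        "what are the bank opening hours",
    ]

    def sim(a, b):
        """Character-level similarity, standing in for a dictionary-variation score."""
        return SequenceMatcher(None, a, b).ratio()

    def score(query_tokens, question):
        """For each noisy token, take its best-matching token in the FAQ question."""
        q_tokens = question.split()
        return sum(max(sim(t, q) for q in q_tokens) for t in query_tokens)

    def answer(noisy_query):
        tokens = noisy_query.split()
        return max(faq, key=lambda question: score(tokens, question))

    print(answer("hw 2 reset pwd"))  # -> "how do i reset my password"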

Proceedings ArticleDOI
03 Nov 2003
TL;DR: A system called BioAnnotator for identifying and annotating biological terms in documents and a system called MedSummarizer that uses the extracted terms to identify the common concepts in a given group of genes.
Abstract: Journals and conference proceedings represent the dominant mechanisms of reporting new biomedical results. The unstructured nature of such publications makes it difficult to utilize data mining or automated knowledge discovery techniques. Annotation (or markup) of these unstructured documents represents the first step in making them machine analyzable. In this paper we first present a system called BioAnnotator for identifying and annotating biological terms in documents. BioAnnotator uses domain based dictionary look-up for recognizing known terms and a rule engine for discovering new terms. The combination of dictionary look-up and rules results in good performance (87% precision and 94% recall on the GENIA 1.1 corpus for extracting general biological terms based on an approximate matching criterion). To demonstrate the subsequent mining and knowledge discovery activities that are made feasible by BioAnnotator, we also present a system called MedSummarizer that uses the extracted terms to identify the common concepts in a given group of genes.

54 citations
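
The two-stage idea, dictionary look-up for known terms plus rules for discovering new ones, can be sketched in a few lines. The dictionary entries and the single "-ase" suffix rule below are invented examples, not BioAnnotator's actual resources:

    import re

    dictionary = {"interleukin-2", "nf-kappa b", "t cell"}

    # One illustrative rule: tokens ending in "-ase" are candidate enzyme names.
    rule = re.compile(r"\b\w+ase\b", re.IGNORECASE)

    def annotate(text):
        spans = []
        lower = text.lower()
        for term in dictionary:                       # known-term look-up
            start = lower.find(term)
            if start != -1:
                spans.append((term, start, start + len(term)))
        for m in rule.finditer(text):                 # rule-based discovery
            spans.append((m.group(), m.start(), m.end()))
        return sorted(spans, key=lambda s: s[1])

    print(annotate("Interleukin-2 activates a tyrosine kinase and a phosphatase."))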


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations
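
The fourth category above, a per-user mail filter learned from examples, is easy to make concrete with a minimal naive Bayes classifier; the training messages and add-one smoothing below are our own choices for the illustration:

    from collections import Counter
    import math

    def train(messages):
        """messages: list of (text, label) with label 'spam' or 'ham'."""
        counts = {"spam": Counter(), "ham": Counter()}
        labels = Counter()
        for text, label in messages:
            labels[label] += 1
            counts[label].update(text.lower().split())
        return counts, labels

    def classify(text, counts, labels):
        vocab = set(counts["spam"]) | set(counts["ham"])
        best, best_lp = None, -math.inf
        for label in counts:
            lp = math.log(labels[label] / sum(labels.values()))
            total = sum(counts[label].values()) + len(vocab)   # add-one smoothing
            for w in text.lower().split():
                lp += math.log((counts[label][w] + 1) / total)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

    model = train([("win money now", "spam"), ("meeting at noon", "ham")])
    print(classify("free money", *model))  # -> 'spam'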

01 Jan 2013

1,098 citations

Book
05 Jul 2012
TL;DR: Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database as mentioned in this paper.
Abstract: Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database. Based on research in various domains including applied statistics, health informatics, data mining, machine learning, artificial intelligence, database management, and digital libraries, significant advances have been achieved over the last decade in all aspects of the data matching process, especially on how to improve the accuracy of data matching and its scalability to large databases. Peter Christen's book is divided into three parts: Part I, Overview, introduces the subject by presenting several sample applications and their special challenges, as well as a general overview of a generic data matching process. Part II, Steps of the Data Matching Process, then details its main steps like pre-processing, indexing, field and record comparison, classification, and quality evaluation. Lastly, Part III, Further Topics, deals with specific aspects like privacy, real-time matching, or matching unstructured data. Finally, it briefly describes the main features of many research and open source systems available today. By providing the reader with a broad range of data matching concepts and techniques and touching on all aspects of the data matching process, this book helps researchers as well as students specializing in data quality or data matching aspects to familiarize themselves with recent research advances and to identify open research challenges in the area of data matching. To this end, each chapter of the book includes a final section that provides pointers to further background and research material. Practitioners will better understand the current state of the art in data matching as well as the internal workings and limitations of current systems. In particular, they will learn that it is often not feasible to simply implement an existing off-the-shelf data matching system without substantial adaptation and customization. Such practical considerations are discussed for each of the major steps in the data matching process.

713 citations
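
The main steps the book details, indexing/blocking, field comparison, and classification, compress into a short sketch; the records, the city blocking key, and the 0.8 threshold below are invented for illustration:

    from difflib import SequenceMatcher

    db_a = [{"id": 1, "name": "Jon Smith", "city": "Delhi"}]
    db_b = [{"id": 9, "name": "John Smith", "city": "Delhi"},
            {"id": 10, "name": "Mary Jones", "city": "Mumbai"}]

    def block_key(rec):
        """Indexing step: only compare records sharing a cheap blocking key."""
        return rec["city"]

    def field_sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match(threshold=0.8):
        pairs = []
        for ra in db_a:
            for rb in db_b:
                if block_key(ra) != block_key(rb):
                    continue                      # pruned by blocking
                score = field_sim(ra["name"], rb["name"])
                if score >= threshold:            # classification step
                    pairs.append((ra["id"], rb["id"], round(score, 2)))
        return pairs

    print(match())  # -> [(1, 9, 0.95)]; Mary Jones is never compared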

Journal Article
TL;DR: The Health Insurance Portability and Accountability Act, also known as HIPAA, was designed to protect health insurance coverage for workers and their families while between jobs and establishes standards for electronic health care transactions.
Abstract: The Health Insurance Portability and Accountability Act, also known as HIPAA, was first delivered to Congress in 1996 and consisted of just two Titles. It was designed to protect health insurance coverage for workers and their families while between jobs. It establishes standards for electronic health care transactions and addresses the issues of privacy and security when dealing with Protected Health Information (PHI). HIPAA is applicable only in the United States of America.

561 citations