Showing papers by "Sameep Mehta published in 2018"

PDF

Open Access

Posted Content•

AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias

[...]

Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn A. Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John T. Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, Yunfeng Zhang - Show less +14 more

03 Oct 2018-arXiv: Artificial Intelligence

TL;DR: A new open source Python toolkit for algorithmic fairness, AI Fairness 360 (AIF360), released under an Apache v2.0 license to help facilitate the transition of fairness research algorithms to use in an industrial setting and to provide a common framework for fairness researchers to share and evaluate algorithms.

...read moreread less

Abstract: Fairness is an increasingly important concern as machine learning models are used to support decision making in high-stakes applications such as mortgage lending, hiring, and prison sentencing. This paper introduces a new open source Python toolkit for algorithmic fairness, AI Fairness 360 (AIF360), released under an Apache v2.0 license {this https URL). The main objectives of this toolkit are to help facilitate the transition of fairness research algorithms to use in an industrial setting and to provide a common framework for fairness researchers to share and evaluate algorithms. The package includes a comprehensive set of fairness metrics for datasets and models, explanations for these metrics, and algorithms to mitigate bias in datasets and models. It also includes an interactive Web experience (this https URL) that provides a gentle introduction to the concepts and capabilities for line-of-business users, as well as extensive documentation, usage guidance, and industry-specific tutorials to enable data scientists and practitioners to incorporate the most appropriate tool for their problem into their work products. The architecture of the package has been engineered to conform to a standard paradigm used in data science, thereby further improving usability for practitioners. Such architectural design and abstractions enable researchers and developers to extend the toolkit with their new algorithms and improvements, and to use it for performance benchmarking. A built-in testing infrastructure maintains code quality.

...read moreread less

501 citations

Proceedings Article•DOI•

Model Extraction Warning in MLaaS Paradigm

[...]

Manish Kesarwani¹, Bhaskar Mukhoty², Vijay Arya¹, Sameep Mehta¹•Institutions (2)

IBM¹, Indian Institute of Technology Kanpur²

03 Dec 2018

TL;DR: A model extraction monitor that quantifies the extraction status of models by continually observing the API query and response streams of users is introduced and two novel strategies that measure either the information gain or the coverage of the feature space spanned by user queries to estimate the learning rate of individual and colluding adversaries are presented.

...read moreread less

Abstract: Machine learning models deployed on the cloud are susceptible to several security threats including extraction attacks. Adversaries may abuse a model's prediction API to steal the model thus compromising model confidentiality, privacy of training data, and revenue from future query payments. This work introduces a model extraction monitor that quantifies the extraction status of models by continually observing the API query and response streams of users. We present two novel strategies that measure either the information gain or the coverage of the feature space spanned by user queries to estimate the learning rate of individual and colluding adversaries. Both approaches have low computational overhead and can easily be offered as services to model owners to warn them against state of the art extraction attacks. We demonstrate empirical performance results of these approaches for decision tree and neural network models using open source datasets and BigML MLaaS platform.

...read moreread less

98 citations

Posted Content•

Increasing Trust in AI Services through Supplier's Declarations of Conformity

[...]

Michael Hind, Sameep Mehta, Aleksandra Mojsilovic, Ravi Nair, Karthikeyan Natesan Ramamurthy, Alexandra Olteanu, Kush R. Varshney - Show less +3 more

22 Aug 2018

TL;DR: In this article, a supplier's declaration of conformity (SDoC) for artificial intelligence (AI) services is proposed to help increase trust in AI services, which is a transparent, standardized, but often not legally required, document used in many industries and sectors to describe the lineage of a product along with safety and performance testing it has undergone.

...read moreread less

Abstract: The accuracy and reliability of machine learning algorithms are an important concern for suppliers of artificial intelligence (AI) services, but considerations beyond accuracy, such as safety, security, and provenance, are also critical elements to engender consumers' trust in a service. In this paper, we propose a supplier's declaration of conformity (SDoC) for AI services to help increase trust in AI services. An SDoC is a transparent, standardized, but often not legally required, document used in many industries and sectors to describe the lineage of a product along with the safety and performance testing it has undergone. We envision an SDoC for AI services to contain purpose, performance, safety, security, and provenance information to be completed and voluntarily released by AI service providers for examination by consumers. Importantly, it conveys product-level rather than component-level functional testing. We suggest a set of declaration items tailored to AI and provide examples for two fictitious AI services.

...read moreread less

94 citations

Book Chapter•DOI•

Generating Adversarial Text Samples

[...]

Suranjana Samanta¹, Sameep Mehta¹•Institutions (1)

IBM¹

26 Mar 2018

TL;DR: Experimental results on IMDB movie review dataset for sentiment analysis and Twitter dataset for gender detection show the efficacy of the proposed method of crafting adversarial text samples by modification of the original samples.

...read moreread less

Abstract: Adversarial samples are strategically modified samples, which are crafted with the purpose of fooling a trained classifier. In this paper, we propose a new method of crafting adversarial text samples by modification of the original samples. Modifications of the original text samples are done by deleting or replacing the important or salient words in the text or by introducing new words in the text sample. While crafting adversarial samples, one of the key constraint is to generate meaningful sentences which can at pass off as legitimate from the language (English) viewpoint. Experimental results on IMDB movie review dataset for sentiment analysis and Twitter dataset for gender detection show the efficacy of our proposed method.

...read moreread less

33 citations

Posted Content•

FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity

[...]

Matthew Arnold, Rachel K. E. Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilovic, Ravi Nair, Karthikeyan Natesan Ramamurthy, Darrell C. Reimer, Alexandra Olteanu, David Piorkowski, Jason Tsay, Kush R. Varshney - Show less +9 more

22 Aug 2018-arXiv: Computers and Society

TL;DR: In this article, the authors propose FactSheets, a set of declaration items tailored to Artificial Intelligence (AI) services, and provide examples for two fictitious AI services in the appendix of the paper.

...read moreread less

Abstract: Accuracy is an important concern for suppliers of artificial intelligence (AI) services, but considerations beyond accuracy, such as safety (which includes fairness and explainability), security, and provenance, are also critical elements to engender consumers' trust in a service. Many industries use transparent, standardized, but often not legally required documents called supplier's declarations of conformity (SDoCs) to describe the lineage of a product along with the safety and performance testing it has undergone. SDoCs may be considered multi-dimensional fact sheets that capture and quantify various aspects of the product and its development to make it worthy of consumers' trust. Inspired by this practice, we propose FactSheets to help increase trust in AI services. We envision such documents to contain purpose, performance, safety, security, and provenance information to be completed by AI service providers for examination by consumers. We suggest a comprehensive set of declaration items tailored to AI and provide examples for two fictitious AI services in the appendix of the paper.

...read moreread less

31 citations

Analyze, Detect and Remove Gender Stereotyping from Bollywood Movies

[...]

Nishtha Madaan, Sameep Mehta, Taneea S Agrawaal, Vrinda Malhotra, Aditi Aggarwal, Yatin Gupta, Mayank Saxena - Show less +3 more

21 Jan 2018

TL;DR: This paper analyzes movie plots and posters for all movies released since 1970 and proposes an algorithm to remove gender biased graphs from unstructured piece of text in stories from movies and de-bias these graphs to generate plausible unbiased stories.

...read moreread less

Abstract: The presence of gender stereotypes in many aspects of society is a well-known phenomenon. In this paper, we focus on studying such stereotypes and bias in Hindi movie industry (Bollywood) and propose an algorithm to remove these stereotypes from text. We analyze movie plots and posters for all movies released since 1970. The gender bias is detected by semantic modeling of plots at sentence and intra-sentence level. Different features like occupation, introductions, associated actions and descriptions are captured to show the pervasiveness of gender bias and stereotype in movies. Using the derived semantic graph, we compute centrality of each character and observe similar bias there. We also show that such bias is not applicable for movie posters where females get equal importance even though their character has little or no impact on the movie plot. The silver lining is that our system was able to identify 30 movies over last 3 years where such stereotypes were broken. The next step, is to generate debiased stories. The proposed debiasing algorithm extracts gender biased graphs from unstructured piece of text in stories from movies and de-bias these graphs to generate plausible unbiased stories.

...read moreread less

27 citations

Proceedings Article•DOI•

Efficiently Processing Temporal Queries on Hyperledger Fabric

[...]

Himanshu Gupta¹, Sandeep Hans¹, Kushagra Aggarwal², Sameep Mehta¹, Bapi Chatterjee¹, Praveen Jayachandran¹ - Show less +2 more•Institutions (2)

IBM¹, Indian Institute of Technology Kharagpur²

01 Apr 2018

TL;DR: This paper presents two models for overcoming limitations and improving the performance of temporal queries on Fabric, a popular implementation of Blockchain technology and shows that these two models significantly outperform the naive ways of handling temporal query on Fabric.

...read moreread less

Abstract: In this paper, we discuss the problem of efficiently handling temporal queries on Hyperledger Fabric, a popular implementation of Blockchain technology. The temporal nature of the data inserted by the Hyperledger Fabric transactions can be leveraged to support various use-cases. This requires that the temporal queries be processed efficiently on this data. Currently this presents significant challenges as this data is organized on file-system, is exposed to users via a limited API and does not support any temporal indexes. We present two models for overcoming these limitations and improving the performance of temporal queries on Fabric. The first model creates a copy of each event inserted by a Fabric transaction and stores temporally close events together on Fabric. The second model keeps the event count intact but tags some metadata to each event being inserted on Fabric s.t. temporally close events share the same metadata. We discuss these two models in detail and show that these two models significantly outperform the naive ways of handling temporal queries on Fabric. We also discuss the performance trade-offs for these two models across various dimensions - data storage, query performance, data ingestion time etc.

...read moreread less

23 citations

Journal Article•DOI•

Leveraging semantic resources in diversified query expansion

[...]

Adit Krishnan¹, Deepak P², Sayan Ranu³, Sameep Mehta⁴•Institutions (4)

University of Illinois at Urbana–Champaign¹, Queen's University Belfast², Indian Institute of Technology Delhi³, IBM⁴

01 Jul 2018-World Wide Web

TL;DR: This paper develops two methods, those that leverage Wikipedia and pre-learnt distributional word embeddings respectively, and shows that SLR performs state-of-the-art diversified query expansion methods, thus establishing that Wikipedia is an effective resource to aid diversification query expansion.

...read moreread less

Abstract: A search query, being a very concise grounding of user intent, could potentially have many possible interpretations. Search engines hedge their bets by diversifying top results to cover multiple such possibilities so that the user is likely to be satisfied, whatever be her intended interpretation. Diversified Query Expansion is the problem of diversifying query expansion suggestions, so that the user can specialize the query to better suit her intent, even before perusing search results. In this paper, we consider the usage of semantic resources and tools to arrive at improved methods for diversified query expansion. In particular, we develop two methods, those that leverage Wikipedia and pre-learnt distributional word embeddings respectively. Both the approaches operate on a common three-phase framework; that of first taking a set of informative terms from the search results of the initial query, then building a graph, following by using a diversity-conscious node ranking to prioritize candidate terms for diversified query expansion. Our methods differ in the second phase, with the first method Select-Link-Rank (SLR) linking terms with Wikipedia entities to accomplish graph construction; on the other hand, our second method, Select-Embed-Rank (SER), constructs the graph using similarities between distributional word embeddings. Through an empirical analysis and user study, we show that SLR ourperforms state-of-the-art diversified query expansion methods, thus establishing that Wikipedia is an effective resource to aid diversified query expansion. Our empirical analysis also illustrates that SER outperforms the baselines convincingly, asserting that it is the best available method for those cases where SLR is not applicable; these include narrow-focus search systems where a relevant knowledge base is unavailable. Our SLR method is also seen to outperform a state-of-the-art method in the task of diversified entity ranking.

...read moreread less

22 citations

Proceedings Article•

Efficient Secure k-Nearest Neighbours over Encrypted Data.

[...]

Manish Kesarwani¹, Akshar Kaul¹, Naldurg Prasad G, Sikhar Patranabis², Gagandeep Singh¹, Sameep Mehta¹, Debdeep Mukhopadhyay² - Show less +3 more•Institutions (2)

IBM¹, Indian Institute of Technology Kharagpur²

01 Jan 2018

TL;DR: This paper describes a novel protocol in the two-party cloud setting, using an underlying somewhat homomorphic encryption scheme, and provides asymptotically faster performance, without sacrificing any security guarantees.

...read moreread less

Abstract: Enterprise customers of cloud services are wary of outsourcing sensitive user and business data due to inherent security and privacy concerns. In this context, storing and computing directly on encrypted data is an attractive solution, especially against insider attacks. Homomorphic encryption, the keystone enabling technology is unfortunately prohibitively expensive. In this paper, we focus on finding k-Nearest Neighbours (k-NN) directly on encrypted data, a basic data-mining and machine learning algorithm. The goal is to compute the nearest neighbours to a given query, and present exact results to the clients, without the cloud learning anything about the data, query, results, or the access and search patterns. We describe a novel protocol in the two-party cloud setting, using an underlying somewhat homomorphic encryption scheme. In comparison to the state-of-the-art protocol in this setting, we provide asymptotically faster performance, without sacrificing any security guarantees. We implemented our protocol to demonstrate that it is efficient and practical on large and relevant real-world datasets and study how it scales well across different parameters on simulated data.

...read moreread less

19 citations

Proceedings Article•DOI•

On Building Efficient Temporal Indexes on Hyperledger Fabric

[...]

Himanshu Gupta¹, Sandeep Hans¹, Sameep Mehta¹, Praveen Jayachandran¹•Institutions (1)

IBM¹

02 Jul 2018

TL;DR: In this paper, the problem of constructing efficient temporal indexes on Hyperledger Fabric, a popular blockchain platform, is discussed and two models for creating temporal indexes are presented. But the performance tradeoffs among these variants across various dimensions - data storage, query performance, event insertion time etc.

...read moreread less

Abstract: We discuss the problem of constructing efficient temporal indexes on Hyperledger Fabric, a popular Blockchain platform. The temporal nature of the data inserted by Fabric transactions can be leveraged to support various use-cases. This requires that temporal queries be processed efficiently on this data. Currently this presents significant challenges as this data is organized on file-system, is exposed via limited API and does not support temporal indexes. In a prior work [1], we presented two models for creating temporal indexes on Fabric which overcome these limitations and improve the performance of temporal queries on Fabric. The first model creates a copy of each event inserted and stores temporally close events together on Fabric. The second model keeps the event count intact but tags metadata to each event s.t. temporally close events share the same metadata. In this paper, we present variants on these two models which are better able to handle the skew present in Fabric data. We discuss the details and show that these variants significantly outperform the approaches presented in [1] when Fabric data contains skew. We also discuss the performance tradeoffs among these variants across various dimensions - data storage, query performance, event insertion time etc.

...read moreread less

12 citations

Proceedings Article•

Semantic Understanding for Contextual In-Video Advertising.

[...]

Rishi Madhok¹, Shashank Mujumdar², Nitin Gupta², Sameep Mehta²•Institutions (2)

Delhi Technological University¹, IBM²

01 Jan 2018

TL;DR: A framework to semantically understand the video content for better ad recommendation that ensures ad relevance to video content, where and how video ads are placed, and non-intrusive user experience is proposed.

...read moreread less

Abstract: With the increasing consumer base of online video content, it is important for advertisers to understand the video context when targeting video ads to consumers. To improve the consumer experience and quality of ads, key factors need to be considered such as (i) ad relevance to video content (ii) where and how video ads are placed, and (iii) non-intrusive user experience. We propose a framework to semantically understand the video content for better ad recommendation that ensure these criteria.

...read moreread less

Posted Content•

Extracting Fairness Policies from Legal Documents.

[...]

Rashmi Nagpal, Chetna Wadhwa, Mallika Gupta, Samiulla Shaikh, Sameep Mehta, Vikram Goyal - Show less +2 more

12 Sep 2018-arXiv: Learning

TL;DR: It is shown that similarity based on word vectors beats the classical approach with a large margin, whereas other vector representations of senses and sentences fail to even match the classical baseline.

...read moreread less

Abstract: Machine Learning community is recently exploring the implications of bias and fairness with respect to the AI applications. The definition of fairness for such applications varies based on their domain of application. The policies governing the use of such machine learning system in a given context are defined by the constitutional laws of nations and regulatory policies enforced by the organizations that are involved in the usage. Fairness related laws and policies are often spread across the large documents like constitution, agreements, and organizational regulations. These legal documents have long complex sentences in order to achieve rigorousness and robustness. Automatic extraction of fairness policies, or in general, any specific kind of policies from large legal corpus can be very useful for the study of bias and fairness in the context of AI applications. We attempted to automatically extract fairness policies from publicly available law documents using two approaches based on semantic relatedness. The experiments reveal how classical Wordnet-based similarity and vector-based similarity differ in addressing this task. We have shown that similarity based on word vectors beats the classical approach with a large margin, whereas other vector representations of senses and sentences fail to even match the classical baseline. Further, we have presented thorough error analysis and reasoning to explain the results with appropriate examples from the dataset for deeper insights.

...read moreread less

Proceedings Article•DOI•

Learning an Order Preserving Image Similarity through Deep Ranking

[...]

Nitin Gupta¹, Shashank Mujumdar¹, Suranjana Samanta¹, Sameep Mehta¹•Institutions (1)

IBM¹

01 Aug 2018

TL;DR: This paper proposes a deep quadlet network to learn the feature embedding using the quadlet loss function, and presents an extensive evaluation of the proposed ranking model against state-of-the-art baselines on three datasets with fine-grained categorization.

...read moreread less

Abstract: Recently, deep learning frameworks have been shown to learn a feature embedding that captures fine-grained image similarity using image triplets or quadruplets that consider pairwise relationships between image pairs. In real-world datasets, a class contains fine-grained categorization that exhibits within-class variability. In such a scenario, these frameworks fail to learn the relative ordering between - (i) samples belonging to the same category, (ii) samples from a different category within a class and (iii) samples belonging to a different class. In this paper, we propose the quadlet loss function, that learns an order-preserving fine-grained image similarity by learning through quadlets (query:q, positive:p, intermediate:i, negative:n) where $p$ is sampled from the same category as q, i belongs to a fine-grained category within the class of $q$ and $n$ is sampled from a different class than that of q. We propose a deep quadlet network to learn the feature embedding using the quadlet loss function. We present an extensive evaluation of our proposed ranking model against state-of-the-art baselines on three datasets with fine-grained categorization. The results show significant improvement over the baselines for both order-preserving fine-grained ranking task and general image ranking task.

...read moreread less

Posted Content•

What is my data worth? From data properties to data value.

[...]

Kalapriya Kannan, Rema Ananthanarayanan, Sameep Mehta

12 Nov 2018-arXiv: Computers and Society

TL;DR: The approach is to view the data as composed of various attributes or characteristics, which are referred to as facets, and which in turn comprise many sub-facets, which provides a basis for the comparison of the relative merits of two or more data sets in a structured manner, independent of context.

...read moreread less

Abstract: Data today fuels both the economy and advances in machine learning and AI. All aspects of decision making, at the personal and enterprise level and in governments are increasingly data-driven. In this context, however, there are still some fundamental questions that remain unanswered with respect to data. \textit{What is meant by data value? How can it be quantified, in a general sense?}. The "value" of data is not understood quantitatively until it is used in an application and output is evaluated, and hence currently it is not possible to assess the value of large amounts of data that companies hold, categorically. Further, there is overall consensus that good data is important for any analysis but there is no independent definition of what constitutes good data. In our paper we try to address these gaps in the valuation of data and present a framework for users who wish to assess the value of data in a categorical manner. Our approach is to view the data as composed of various attributes or characteristics, which we refer to as facets, and which in turn comprise many sub-facets. We define the notion of values that each sub-facet may take, and provide a seed scoring mechanism for the different values. The person assessing the data is required to fill in the values of the various sub-facets that are relevant for the data set under consideration, through a questionnaire that attempts to list them exhaustively. Based on the scores assigned for each set of values, the data set can now be quantified in terms of its properties. This provides a basis for the comparison of the relative merits of two or more data sets in a structured manner, independent of context. The presence of context adds additional information that improves the quantification of the data value.

...read moreread less

Posted Content•

Generating Clues for Gender based Occupation De-biasing in Text.

[...]

Nishtha Madaan, Gautam Singh, Sameep Mehta, Aditya Chetan, Brihi Joshi - Show less +1 more

11 Apr 2018-arXiv: Computation and Language

TL;DR: This paper presents the first system that discovers the possibility that a given text portrays a gender stereotype associated with an occupation, and offers counter-evidences of opposite gender also being associated with the same occupation in the context of user-provided geography and timespan.

...read moreread less

Abstract: Vast availability of text data has enabled widespread training and use of AI systems that not only learn and predict attributes from the text but also generate text automatically. However, these AI models also learn gender, racial and ethnic biases present in the training data. In this paper, we present the first system that discovers the possibility that a given text portrays a gender stereotype associated with an occupation. If the possibility exists, the system offers counter-evidences of opposite gender also being associated with the same occupation in the context of user-provided geography and timespan. The system thus enables text de-biasing by assisting a human-in-the-loop. The system can not only act as a text pre-processor before training any AI model but also help human story writers write stories free of occupation-level gender bias in the geographical and temporal context of their choice.

...read moreread less

Proceedings Article•DOI•

REXplore: A Sketch Based Interactive Explorer for Real Estates Using Building Floor Plan Images

[...]

Divya Sharma¹, Nitin Gupta², Chiranjoy Chattopadhyay¹, Sameep Mehta²•Institutions (2)

Indian Institute of Technology, Jodhpur¹, IBM²

01 Dec 2018

TL;DR: REXplore is proposed, a novel framework that uses sketch based query mode to retrieve corresponding similar floor plan images from a repository using Cyclic Generative Adversarial Networks (Cyclic GAN) for mapping between sketch and image domain.

...read moreread less

Abstract: The increasing trend of using online platforms for real estate rent/sale makes automatic retrieval of similar floor plans a key requirement to help architects and buyers alike. Although sketch based image retrieval has been explored in the multimedia community, the problem of hand-drawn floor plan retrieval has been less researched in the past. In this paper, we propose REXplore (Real Estate eXplore), a novel framework that uses sketch based query mode to retrieve corresponding similar floor plan images from a repository using Cyclic Generative Adversarial Networks (Cyclic GAN) for mapping between sketch and image domain. The key contributions of our proposed approach are : (1) a novel sketch based floor plan retrieval framework using an intuitive and convenient sketch query mode; (2) A conjunction of Cyclic GANs and Convolution Neural Networks (CNNs) for the task of hand-drawn floor plan image retrieval. Extensive experimentation and comparison with baseline results authenticates our claim.

...read moreread less

Posted Content•

Judging a Book by its Description : Analyzing Gender Stereotypes in the Man Bookers Prize Winning Fiction.

[...]

Nishtha Madaan, Sameep Mehta, Shravika Mittal, Ashima Suvarna

25 Jul 2018-arXiv: Computation and Language

TL;DR: This paper considers 275 books shortlisted for Man Bookers Prize between 1969 and 2017 and reveals the pervasiveness of gender bias and stereotype in the books on different features like occupation, introductions and actions associated to the characters in the book.

...read moreread less

Abstract: The presence of gender stereotypes in many aspects of society is a well-known phenomenon. In this paper, we focus on studying and quantifying such stereotypes and bias in the Man Bookers Prize winning fiction. We consider 275 books shortlisted for Man Bookers Prize between 1969 and 2017. The gender bias is analyzed by semantic modeling of book descriptions on Goodreads. This reveals the pervasiveness of gender bias and stereotype in the books on different features like occupation, introductions and actions associated to the characters in the book.

...read moreread less

Proceedings Article•DOI•

Content Driven Enrichment of Formal Text using Concept Definitions and Applications

[...]

Abhinav Jain¹, Nitin Gupta¹, Shashank Mujumdar¹, Sameep Mehta¹, Rishi Madhok² - Show less +1 more•Institutions (2)

IBM¹, Delhi Technological University²

03 Jul 2018

TL;DR: A text enrichment framework that identifies the key concepts from input text, highlights definitions and fetches the definition from external data sources in case the concept is undefined, and enriches the input text with concept applications and a pre-requisite concept graph that showcases the inter-dependency within the extracted concepts.

...read moreread less

Abstract: Formal text is objective, unambiguous and tends to have complex sentence construction intended to be understood by the target demographic. However, in the absence of domain knowledge it is imperative to define key concepts and their relationship in the text for correct interpretation for general readers. To address this, we propose a text enrichment framework that identifies the key concepts from input text, highlights definitions and fetches the definition from external data sources in case the concept is undefined. Beyond concept definitions, the system enriches the input text with concept applications and a pre-requisite concept graph that showcases the inter-dependency within the extracted concepts. While the problem of learning definition statements is attempted in literature, the task of learning application statements is novel. We manually annotated a dataset for training a deep learning network for identifying application statements in text. We quantitatively compared the results of both application and definition identification models with standard baselines. To validate the utility of the proposed framework for general readers, we report enrichment accuracy and show promising results.

...read moreread less

Proceedings Article•

Content and Context: Two-Pronged Bootstrapped Learning for Regex-Formatted Entity Extraction

[...]

Stanley Simoes¹, Deepak S. Padmanabhan, Munu Sairamesh¹, Deepak Khemani¹, Sameep Mehta² - Show less +1 more•Institutions (2)

Indian Institute of Technology Madras¹, IBM²

02 Feb 2018

TL;DR: This paper proposes a bootstrapped approach to improve the recall for extraction of regex-formatted entities, with the only source of supervision being the seed regex.

...read moreread less

Abstract: Regular expressions are an important building block of rule-based information extraction systems. Regexes can encode rules to recognize instances of simple entities which can then feed into the identification of more complex cross-entity relationships. Manually crafting a regex that recognizes all possible instances of an entity is difficult since an entity can manifest in a variety of different forms. Thus, the problem of automatically generalizing manually crafted seed regexes to improve the recall of IE systems has attracted research attention. In this paper, we propose a bootstrapped approach to improve the recall for extraction of regex-formatted entities, with the only source of supervision being the seed regex. Our approach starts from a manually authored high precision seed regex for the entity of interest, and uses the matches of the seed regex and the context around these matches to identify more instances of the entity. These are then used to identify a set of diverse, high recall regexes that are representative of this entity. Through an empirical evaluation over multiple real world document corpora, we illustrate the effectiveness of our approach.

...read moreread less

Proceedings Article•DOI•

Joint Distributed Representation of Text and Structure of Semi-Structured Documents

[...]

Abhishek Laddha¹, Salil Joshi², Samiulla Shaikh¹, Sameep Mehta¹•Institutions (2)

IBM¹, American Express²

03 Jul 2018

TL;DR: A joint modeling of text and the associated structure to effectively capture the semantics of the semi-structure documents is proposed and performs at par with state-of-the-art rule based and other unsupervised approaches.

...read moreread less

Abstract: Majority of textual data over web is in the form of semi-structured documents. Thus, structural skeleton of such documents plays important role in determining the semantics of the data content. Presence of structure sometimes allows us to write simple rules to extract such information, but it may not be always possible due to flexibility in the structure and the frequency with which such structures are altered. In this paper, we propose a joint modeling of text and the associated structure to effectively capture the semantics of the semi-structure documents. The model simultaneously learns the dense continuous representation for word tokens and the structure associated with them. We utilize the context of structures for projection such that similar structures containing semantically similar topics are close to each other in vector space. We explore two semantic text mining tasks over web data to test the effectiveness of our representation viz., document similarity, and table semantic component identification. In context of traditional rule-based approaches, both these tasks demand rich, domain-specific knowledge sources, homogeneous schema for the documents, and rules that capture the semantics. On the other hand, our approach is unsupervised and resource conscious in nature. Despite of working without knowledge resources and large training data, it performs at par with state-of-the-art rule based and other unsupervised approaches.

...read moreread less

Proceedings Article•DOI•

Secure k-NN as a Service over Encrypted Data in Multi-User Setting

[...]

Gagandeep Singh¹, Akshar Kaul¹, Sameep Mehta¹•Institutions (1)

IBM¹

02 Jul 2018

TL;DR: A new SkNN solution is proposed which satisfies all the four existing properties along with an additional essential property of Query Check Verification and is analyzed and presented to showcase the efficiency in real world scenario.

...read moreread less

Abstract: To securely leverage the advantages of Cloud Computing, a lot of research has happened in the area of "Secure Query Processing over Encrypted Data". As a concrete use case, many encryption schemes have been proposed for securely processing k Nearest Neighbors (SkNN) over encrypted data in the outsourced setting. Recently Zhu et al.[1] proposed a SkNN solution which claimed to satisfy following four properties: (1)Data Privacy, (2)Key Confidentiality, (3)Query Privacy, and (4)Query Controllability. However, in this paper, we present an attack which breaks the Query Controllability claim of their scheme. Further, we propose a new SkNN solution which satisfies all the four existing properties along with an additional essential property of Query Check Verification. We analyze the security of our proposed scheme and present the detailed experimental results to showcase the efficiency in real world scenario.

...read moreread less

Patent•

Detecting and delaying effect of machine learning model attacks

[...]

Manish Kesarwani¹, Amit Kumar¹, Vijay Arya, Rakesh Pimplikar, Sameep Mehta - Show less +1 more•Institutions (1)

IBM¹

22 May 2018

TL;DR: In this paper, the authors propose a method for delaying malicious attacks on machine learning models that are trained using input captured from a plurality of users, including: deploying a model, said model designed to be used with an application, for responding to requests received from users, wherein the model comprises a machine learning model that has been previously trained using a data set; receiving input from one or more users; determining, using a malicious input detection technique, if the received input comprises malicious input; if received input, removing the malicious input from the input to being used to retrain the model;

...read moreread less

Abstract: One embodiment provides a method for delaying malicious attacks on machine learning models that a trained using input captured from a plurality of users, including: deploying a model, said model designed to be used with an application, for responding to requests received from users, wherein the model comprises a machine learning model that has been previously trained using a data set; receiving input from one or more users; determining, using a malicious input detection technique, if the received input comprises malicious input; if the received input comprises malicious input, removing the malicious input from the input to be used to retrain the model; retraining the model using received input that is determined to not be malicious input; and providing, using the retrained model, a response to a received user query, the retrained model delaying the effect of malicious input on provided responses by removing malicious input from retraining input.

...read moreread less

Posted Content•

Hardening Deep Neural Networks via Adversarial Model Cascades

[...]

Deepak Vijaykeerthy¹, Anshuman Suri², Sameep Mehta¹, Ponnurangam Kumaraguru²•Institutions (2)

IBM¹, Indraprastha Institute of Information Technology²

02 Feb 2018-arXiv: Learning

TL;DR: The proposed Adversarial Model Cascades (AMC) trains a cascade of models sequentially where each model is optimized to be robust towards a mixture of multiple attacks, which yields a single model which is secure against a wide range of attacks.

...read moreread less

Abstract: Deep neural networks (DNNs) are vulnerable to malicious inputs crafted by an adversary to produce erroneous outputs Works on securing neural networks against adversarial examples achieve high empirical robustness on simple datasets such as MNIST However, these techniques are inadequate when empirically tested on complex data sets such as CIFAR-10 and SVHN Further, existing techniques are designed to target specific attacks and fail to generalize across attacks We propose the Adversarial Model Cascades (AMC) as a way to tackle the above inadequacies Our approach trains a cascade of models sequentially where each model is optimized to be robust towards a mixture of multiple attacks Ultimately, it yields a single model which is secure against a wide range of attacks; namely FGSM, Elastic, Virtual Adversarial Perturbations and Madry On an average, AMC increases the model's empirical robustness against various attacks simultaneously, by a significant margin (of 6225% for MNIST, 5075% for SVHN and 265% for CIFAR10) At the same time, the model's performance on non-adversarial inputs is comparable to the state-of-the-art models

...read moreread less

Patent•

Ensuring resilience of a business function by managing resource availability of a mission-critical project

[...]

Sreyash Kenkre¹, Sameep Mehta, Krishnasuri Narayanam, Vinayaka Pandit•Institutions (1)

IBM¹

03 Jan 2018

TL;DR: In this article, a method and associated systems for ensuring resilience of a business function manages resource availability for projects that perform mission-critical tasks for the business function, such that the model describes how the unavailability of one instance of a resource propagates disruptions to other instances of the same type of resource.

...read moreread less

Abstract: A method and associated systems for ensuring resilience of a business function manages resource availability for projects that perform mission-critical tasks for the business function The method and systems create a model that reveals dependencies among types of resources needed by a project, such that the model describes how the unavailability of one instance of a resource propagates disruptions to other instances of the same type of resource This model automatically identifies a resource type as being critical if a disruption of an instance of the resource type would render a project task infeasible, and if restoring that task would incur unacceptable cost The model may also automatically identify a first resource type as being critical for a second resource type when disruption of the first resource type reduces the available capacity of the second resource type to an unacceptable level

...read moreread less

Proceedings Article•DOI•

Pentuplet Loss for Simultaneous Shots and Critical Points Detection in a Video

[...]

Nitin Gupta¹, Abhinav Jain¹, Prerna Agarwal¹, Shashank Mujumdar¹, Sameep Mehta¹ - Show less +1 more•Institutions (1)

IBM¹

01 Aug 2018

TL;DR: This work proposes a novel pentuplet loss to learn the frame image similarity metric through a pent uplet-based deep learning framework and shows promising results for the task of critical frame detection against human annotations on soccer highlight videos.

...read moreread less

Abstract: Critical events in videos amount to the set of frames where the user attention is heightened. Such events are usually fine-grained activities and do not necessarily have defined shot boundaries. Traditional approaches to the task of Shot Boundary Detection (SBD) in videos perform frame-level classification to obtain shot boundaries and fail to identify the critical shots in the video. We model the problem of identifying critical frames and shot boundaries in a video as learning an image frame similarity metric where the distance relationships between different types of video frames are modeled. We propose a novel pentuplet loss to learn the frame image similarity metric through a pentuplet-based deep learning framework. We showcase the results of our proposed framework on soccer highlight videos against state-of-the-art baselines and significantly outperform them for the task of shot boundary detection. The proposed framework shows promising results for the task of critical frame detection against human annotations on soccer highlight videos.

...read moreread less

Posted Content•

Efficiently Processing Workflow Provenance Queries on SPARK

[...]

Rajmohan C, Pranay Lohia, Himanshu Gupta, Siddhartha Brahma, Mauricio A. Hernández, Sameep Mehta - Show less +2 more

25 Aug 2018-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: In this article, a weakly connected component based framework is proposed to quickly determine a minimal volume of data containing the entire lineage of the queried attribute-value, which is then processed to figure out the provenance of the query attribute value.

...read moreread less

Abstract: In this paper, we investigate how we can leverage Spark platform for efficiently processing provenance queries on large volumes of workflow provenance data. We focus on processing provenance queries at attribute-value level which is the finest granularity available. We propose a novel weakly connected component based framework which is carefully engineered to quickly determine a minimal volume of data containing the entire lineage of the queried attribute-value. This minimal volume of data is then processed to figure out the provenance of the queried attribute-value. The proposed framework computes weakly connected components on the workflow provenance graph and further partitions the large components as a collection of weakly connected sets. The framework exploits the workflow dependency graph to effectively partition the large components into a collection of weakly connected sets. We study the effectiveness of the proposed framework through experiments on a provenance trace obtained from a real-life unstructured text curation workflow. On provenance graphs containing upto 500M nodes and edges, we show that the proposed framework answers provenance queries in real-time and easily outperforms the naive approaches.

...read moreread less

Posted Content•

Secure k-NN as a Service Over Encrypted Data in Multi-User Setting

[...]

Gagandeep Singh¹, Akshar Kaul¹, Sameep Mehta¹•Institutions (1)

IBM¹

15 Jan 2018-arXiv: Cryptography and Security

TL;DR: In this article, the authors proposed a new SkNN solution which satisfies all the four existing properties along with an additional essential property of Query Check Verification, and analyzed the security of their proposed scheme and presented the detailed experimental results to showcase the efficiency in real world scenario.

...read moreread less

Abstract: To securely leverage the advantages of Cloud Computing, recently a lot of research has happened in the area of "Secure Query Processing over Encrypted Data". As a concrete use case, many encryption schemes have been proposed for securely processing k Nearest Neighbors (SkNN) over encrypted data in the outsourced setting. Recently Zhu et al[25]. proposed a SkNN solution which claimed to satisfy following four properties: (1)Data Privacy, (2)Key Confidentiality, (3)Query Privacy, and (4)Query Controllability. However, in this paper, we present an attack which breaks the Query Controllability claim of their scheme. Further, we propose a new SkNN solution which satisfies all the four existing properties along with an additional essential property of Query Check Verification. We analyze the security of our proposed scheme and present the detailed experimental results to showcase the efficiency in real world scenario.

...read moreread less