
Showing papers by "Sameep Mehta published in 2017"


Posted Content
TL;DR: This paper proposes a new method of crafting adversarial text samples by modifying the original samples; it works best for datasets that have sub-categories within each class of examples.
Abstract: Adversarial samples are strategically modified samples crafted with the purpose of fooling a classifier. An attacker introduces specially crafted adversarial samples to a deployed classifier, which misclassifies them. However, the samples are perceived to be drawn from entirely different classes, and thus the adversarial samples become hard to detect. Most prior work has focused on synthesizing adversarial samples in the image domain. In this paper, we propose a new method of crafting adversarial text samples by modifying the original samples. The original text samples are modified by deleting or replacing the important or salient words in the text, or by introducing new words into the text sample. Our algorithm works best for datasets that have sub-categories within each class of examples. While crafting adversarial samples, one of the key constraints is to generate meaningful sentences that can pass off as legitimate from a language (English) viewpoint. Experimental results on the IMDB movie review dataset for sentiment analysis and a Twitter dataset for gender detection show the efficiency of our proposed method.
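As a rough illustration of the deletion/replacement idea described above, the sketch below scores each word by the classifier-confidence drop when it is removed, then swaps out the most salient word. The `classify` function and `synonyms` dictionary are placeholders, not the paper's actual models:

```python
# Hedged sketch of saliency-guided adversarial text modification.
# `classify` maps a word list to {class: probability}; `synonyms`
# supplies candidate replacements. Both are hypothetical stand-ins.

def word_saliency(classify, words, target_class):
    """Score each word by the confidence drop when it is deleted."""
    base = classify(words)[target_class]
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        scores.append(base - classify(reduced)[target_class])
    return scores

def craft_adversarial(classify, words, target_class, synonyms):
    """Replace the most salient word with a synonym, or delete it."""
    scores = word_saliency(classify, words, target_class)
    i = max(range(len(words)), key=lambda k: scores[k])
    candidates = synonyms.get(words[i], [])
    if candidates:
        return words[:i] + [candidates[0]] + words[i + 1:]
    return words[:i] + words[i + 1:]
```

In the paper's setting, the replacement candidates would also be filtered so the modified sentence remains grammatical English.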

189 citations


Proceedings ArticleDOI
01 Nov 2017
TL;DR: This paper proposes Deep Architecture for fiNdIng alikE Layouts (DANIEL), a novel deep learning framework to retrieve similar floor plan layouts from repository and creation of a new complex dataset ROBIN, having three broad dataset categories with 510 real world floor plans.
Abstract: Automatically finding existing building layouts in a repository helps an architect ensure design reuse and timely completion of projects. In this paper, we propose Deep Architecture for fiNdIng alikE Layouts (DANIEL). Using DANIEL, an architect can search the existing repository of projects (floor plans) and give accurate recommendations to buyers. Given a floor plan image, DANIEL is also capable of recommending to property buyers the corresponding rank-ordered list of alike layouts. DANIEL is based on the deep learning paradigm and extracts both low- and high-level semantic features from a layout image. The key contributions in the proposed approach are: (i) a novel deep learning framework to retrieve similar floor plan layouts from a repository; (ii) an analysis of the effect of individual deep convolutional neural network layers on the floor plan retrieval task; and (iii) the creation of a new complex dataset ROBIN (Repository Of BuildIng plaNs), having three broad dataset categories with 510 real-world floor plans. We have evaluated DANIEL by performing extensive experiments on ROBIN and compared our results with eight different state-of-the-art methods to demonstrate DANIEL’s effectiveness in challenging scenarios.
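The retrieval step can be pictured as a nearest-neighbour search over per-layer CNN features. The sketch below assumes the features (a hypothetical `conv5` layer here) have already been extracted by a network such as DANIEL's, which is not reproduced; comparing different `layer` choices mirrors the paper's per-layer analysis:

```python
# Minimal retrieval sketch over precomputed per-layer features.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_layouts(query_feats, repository, layer="conv5"):
    """Rank repository floor plans by cosine similarity of one layer's features.

    repository: {plan_id: {layer_name: feature_vector}}
    """
    scored = [(pid, cosine(query_feats[layer], feats[layer]))
              for pid, feats in repository.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```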

41 citations


Posted Content
TL;DR: An end-to-end system that can understand policies written in natural language, alert users to policy violations during data usage, and log each activity performed using the data in an immutable storage so that policy compliance or violation can be proven later is proposed.
Abstract: In consequential real-world applications, machine learning (ML) based systems are expected to provide fair and non-discriminatory decisions on candidates from groups defined by protected attributes such as gender and race. These expectations are set via policies or regulations governing data usage and decision criteria (sometimes explicitly calling out decisions by automated systems). Often, the data creator, the feature engineer, the author of the algorithm and the user of the results are different entities, making the task of ensuring fairness in an end-to-end ML pipeline challenging. Manually understanding the policies and ensuring fairness in opaque ML systems is time-consuming and error-prone, thus necessitating an end-to-end system that can: 1) understand policies written in natural language, 2) alert users to policy violations during data usage, and 3) log each activity performed using the data in an immutable storage so that policy compliance or violation can be proven later. We propose such a system to ensure that data owners and users are always in compliance with fairness policies.
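The third capability, an immutable activity log, can be approximated with a hash chain in which altering any past entry invalidates everything after it. The paper does not specify its storage design; this is only one plausible construction:

```python
# Hedged sketch of a hash-chained audit log for data-usage activities.
import hashlib
import json

class AuditLog:
    def __init__(self):
        self.entries = []  # each entry: {"activity", "prev", "hash"}

    def append(self, activity):
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        payload = json.dumps({"activity": activity, "prev": prev}, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.entries.append({"activity": activity, "prev": prev, "hash": digest})

    def verify(self):
        """Recompute the chain; any tampered entry breaks verification."""
        prev = "0" * 64
        for e in self.entries:
            payload = json.dumps({"activity": e["activity"], "prev": prev},
                                 sort_keys=True)
            if e["prev"] != prev or \
               e["hash"] != hashlib.sha256(payload.encode()).hexdigest():
                return False
            prev = e["hash"]
        return True
```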

14 citations


Posted Content
TL;DR: A survey of the emerging area of adversarial machine learning is presented; understanding the security of machine learning algorithms and systems is becoming an important research topic among computer security and machine learning researchers and practitioners.
Abstract: Machine learning based systems are increasingly being used for sensitive tasks such as security surveillance, guiding autonomous vehicles, making investment decisions, and detecting and blocking network intrusions and malware. However, recent research has shown that machine learning models are vulnerable to attacks by adversaries at all phases of machine learning (e.g., training data collection, training, operation). All model classes of machine learning systems can be misled by carefully crafted inputs that make them classify inputs wrongly. Maliciously created input samples can affect the learning process of an ML system by slowing down the learning process, degrading the performance of the learned model, or causing the system to make errors only in the attacker's planned scenario. Because of these developments, understanding the security of machine learning algorithms and systems is emerging as an important research area among computer security and machine learning researchers and practitioners. We present a survey of this emerging area in machine learning.

14 citations


Book ChapterDOI
12 Dec 2017
TL;DR: A survey of the emerging area of adversarial machine learning is presented, covering research showing that machine learning models are vulnerable to attacks by adversaries at all phases of machine learning.
Abstract: Machine learning based systems are increasingly being used for sensitive tasks such as security surveillance, guiding autonomous vehicles, making investment decisions, and detecting and blocking network intrusions and malware. However, recent research has shown that machine learning models are vulnerable to attacks by adversaries at all phases of machine learning (e.g., training data collection, training, operation). All model classes of machine learning systems can be misled by carefully crafted inputs that make them classify inputs wrongly. Maliciously created input samples can affect the learning process of an ML system by slowing the learning process, degrading the performance of the learned model, or causing the system to make errors only in the attacker’s planned scenario. Because of these developments, understanding the security of machine learning algorithms and systems is emerging as an important research area among computer security and machine learning researchers and practitioners. We present a survey of this emerging area, named adversarial machine learning.

11 citations


Posted Content
TL;DR: This paper analyzes movie plots and posters for all movies released since 1970 to show the pervasiveness of gender bias and stereotype in movies, and shows that such bias does not apply to movie posters, where females get equal importance even though their characters have little or no impact on the movie plot.
Abstract: The presence of gender stereotypes in many aspects of society is a well-known phenomenon. In this paper, we focus on studying such stereotypes and bias in the Hindi movie industry (Bollywood). We analyze movie plots and posters for all movies released since 1970. The gender bias is detected by semantic modeling of plots at the inter-sentence and intra-sentence level. Different features like occupation, introduction of cast in text, and associated actions and descriptions are captured to show the pervasiveness of gender bias and stereotype in movies. We derive a semantic graph, compute the centrality of each character, and observe similar bias there. We also show that such bias does not apply to movie posters, where females get equal importance even though their characters have little or no impact on the movie plot. Furthermore, we explore the movie trailers to estimate on-screen time for males and females and also study the portrayal of emotions by gender in them. The silver lining is that our system was able to identify 30 movies over the last 3 years where such stereotypes were broken.
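A toy version of the character-centrality step might look like the degree-centrality sketch below, computed over a sentence co-mention graph; the paper's actual semantic graph and centrality measure are likely richer (the substring-based mention detection here is deliberately naive):

```python
# Toy sketch: degree centrality of characters in a co-mention graph.
from collections import defaultdict
from itertools import combinations

def character_centrality(sentences, characters):
    """Degree centrality of each character; two characters are linked
    if they are mentioned in the same sentence."""
    adj = defaultdict(set)
    for sent in sentences:
        present = [c for c in characters if c in sent]  # naive matching
        for a, b in combinations(present, 2):
            adj[a].add(b)
            adj[b].add(a)
    n = len(characters)
    return {c: len(adj[c]) / (n - 1) for c in characters} if n > 1 else {}
```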

6 citations


Posted Content
TL;DR: In this article, the authors present a cloud-based extraction monitor that can quantify the extraction status of models by observing the query and response streams of both individual and colluding adversarial users.
Abstract: Cloud vendors are increasingly offering machine learning services as part of their platform and services portfolios. These services enable the deployment of machine learning models on the cloud that are offered on a pay-per-query basis to application developers and end users. However, recent work has shown that the hosted models are susceptible to extraction attacks: adversaries may launch queries to steal the model and compromise future query payments or the privacy of the training data. In this work, we present a cloud-based extraction monitor that can quantify the extraction status of models by observing the query and response streams of both individual and colluding adversarial users. We present a novel technique that uses information gain to measure the model learning rate of users with an increasing number of queries. Additionally, we present an alternate technique that maintains intelligent query summaries to measure the learning rate relative to the coverage of the input feature space in the presence of collusion. Both approaches have low computational overhead and can easily be offered as services to model owners to warn them of possible extraction attacks from adversaries. We present performance results for these approaches for decision tree models deployed on the BigML MLaaS platform, using open source datasets and different adversarial attack strategies.
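The information-gain measurement can be illustrated with the standard entropy-reduction formula for splitting a label distribution; this is a generic sketch of the quantity, not the paper's exact estimator of a user's learning rate:

```python
# Generic information-gain computation over {label: count} distributions.
import math

def entropy(counts):
    """Shannon entropy (bits) of a {label: count} distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)

def information_gain(parent_counts, child_counts_list):
    """Entropy reduction when the parent distribution is split into children.

    A monitor could use quantities like this to estimate how much a user's
    queries have narrowed down the model's decision regions.
    """
    total = sum(parent_counts.values())
    remainder = sum(sum(c.values()) / total * entropy(c)
                    for c in child_counts_list)
    return entropy(parent_counts) - remainder
```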

4 citations


Journal ArticleDOI
TL;DR: This work uses a Linear Support Vector Machine along with a modified version of the Label Propagation Algorithm, which exploits the notion of neighborhood (in Euclidean space), for classification of financial text.
Abstract: In this work, we study the problem of annotating a large volume of financial text by learning from a small set of human-annotated training data. The training data is prepared by randomly selecting some text sentences from the large corpus of financial text. Conventionally, a bootstrapping algorithm is used to annotate a large volume of unlabeled data by learning from a small set of annotated data; however, that small set has to be carefully chosen as seed data. Our approach therefore departs from conventional bootstrapping in that we let users randomly select the seed data. We show that our proposed algorithm has an accuracy of 73.56% in classifying the financial texts into the different categories (“Accounting”, “Cost”, “Employee”, “Financing”, “Sales”, “Investments”, “Operations”, “Profit”, “Regulations” and “Irrelevant”) even when the training data is just 30% of the total data set. Additionally, the accuracy improves by an approximate average of 2% for each 10% increase in training data, and the accuracy of our system is 77.91% when the training data is about 50% of the total data set. As a dictionary of hand-chosen keywords prepared by domain experts is often used for financial text extraction, we assumed the existence of almost linearly separable hyperplanes between the different classes and therefore used a Linear Support Vector Machine along with a modified version of the Label Propagation Algorithm, which exploits the notion of neighborhood (in Euclidean space), for classification. We believe that our proposed techniques will be of help to Early Warning Systems used in banks, where large volumes of unstructured text need to be processed for better insights about a company.
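A minimal sketch of neighborhood-based label propagation is shown below: labels spread to points whose nearest labeled neighbour lies within a Euclidean radius, iterating until nothing changes. The paper's modified algorithm and its pairing with a linear SVM are more involved; this only illustrates the neighborhood idea on feature vectors:

```python
# Hedged sketch of Euclidean-neighborhood label propagation.
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def propagate_labels(labeled, unlabeled, radius):
    """Iteratively label points whose nearest labeled neighbour is
    within `radius`, until no more points qualify.

    labeled:   {point_tuple: label}
    unlabeled: list of point tuples
    """
    labeled = dict(labeled)
    pending = list(unlabeled)
    changed = True
    while changed and pending:
        changed = False
        for pt in list(pending):
            nearest = min(labeled, key=lambda l: euclid(pt, l))
            if euclid(pt, nearest) <= radius:
                labeled[pt] = labeled[nearest]
                pending.remove(pt)
                changed = True
    return labeled
```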

4 citations


Proceedings ArticleDOI
Himanshu Gupta, Sameep Mehta, Sandeep Hans, Bapi Chatterjee, Pranay Lohia, Rajmohan C
01 Jun 2017
TL;DR: It is argued that developing these solutions is important so that a comprehensive provenance-aware Hadoop as a Service (HaaS) can be provided on the cloud.
Abstract: Hadoop as a service (HaaS), also known as Hadoop in the cloud, is a big data analytics framework that stores and analyzes data in the cloud using Hadoop/Spark. In this paper, we discuss the importance of providing provenance capabilities in the context of a Hadoop as a service (HaaS) framework. We first review the state of the art in provenance tracking in the contexts of databases and workflow processing, of the cloud, and of big data analytics frameworks like Hadoop and Spark. We next identify a number of provenance capabilities that have been developed in the context of databases and workflow processing but for which corresponding solutions have not been developed for Hadoop or Spark. We argue that developing these solutions is important so that a comprehensive provenance-aware Hadoop as a Service (HaaS) can be provided on the cloud. The paper ends by identifying some research challenges in developing these provenance capabilities.

4 citations


Patent
24 Aug 2017
TL;DR: In this paper, a computer-implemented method includes classifying each of multiple temporally evolving data entities into one of multiple categories based on one or more parameters; partitioning the multiple temporally evolving data entities into multiple partitions based at least on the classification and the update frequency of each entity; implementing multiple checkpoints at a distinct temporal interval for each of the partitions; and creating a snapshot of the temporally evolving entities at a selected past point in time (i) based on said implementing and (ii) in response to a query pertaining to a historical state of one or more of the entities.
Abstract: Methods, systems, and computer program products for historical state snapshot construction over temporally evolving data are provided herein. A computer-implemented method includes classifying each of multiple temporally evolving data entities into one of multiple categories based on one or more parameters; partitioning the multiple temporally evolving data entities into multiple partitions based at least on (i) said classifying and (ii) the update frequency of each of the multiple temporally evolving data entities; implementing multiple checkpoints at a distinct temporal interval for each of the multiple partitions; and creating a snapshot of the multiple temporally evolving data entities at a selected past point of time (i) based on said implementing and (ii) in response to a query pertaining to a historical state of one or more of the multiple temporally evolving data entities.
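The checkpoint-plus-replay idea in the claim can be sketched as follows: take the latest checkpoint at or before time t for a partition, then replay later updates up to t. The data layout and names are illustrative, not taken from the patent:

```python
# Illustrative sketch of historical snapshot construction per partition.
def snapshot_at(checkpoints, updates, t):
    """Reconstruct entity state at time t.

    checkpoints: time-sorted [(time, {entity: value})]
    updates:     time-sorted [(time, entity, value)]
    """
    state = {}
    ckpt_time = -1
    # Latest checkpoint at or before t.
    for ct, cstate in checkpoints:
        if ct <= t:
            state, ckpt_time = dict(cstate), ct
    # Replay updates that fall after the checkpoint, up to t.
    for ut, entity, value in updates:
        if ckpt_time < ut <= t:
            state[entity] = value
    return state
```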

2 citations


Patent
21 Nov 2017
TL;DR: In this paper, the authors describe methods, systems and computer program products for association rule mining of an encrypted database using an encryption scheme which provides additive homomorphism, where the transaction data comprise a plurality of combinations of two or more elements of a set of elements.
Abstract: Methods, systems and computer program products for association rule mining of an encrypted database are provided herein. A computer-implemented method includes receiving, at a first cloud computing environment, encrypted transaction data that are encrypted using an encryption scheme which provides additive homomorphism, wherein the transaction data comprise a plurality of combinations of two or more elements of a set of elements, receiving, at the first cloud computing environment, encrypted query data that are encrypted using the encryption scheme, wherein the query data comprise at least one of an element and a combination of two or more elements of the set of elements which are the subject of a query seeking a determination of whether at least one of the element and the combination of two or more elements is frequent, and computing addition of the encrypted query data with the encrypted transaction data.
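Additive homomorphism of the kind the claim relies on is provided by, for example, the Paillier cryptosystem, in which multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The toy sketch below uses deliberately small primes and is an illustration of the property, not the patent's scheme:

```python
# Toy Paillier cryptosystem (additively homomorphic); g is fixed to n + 1.
import math
import random

def keygen(p, q):
    """Keypair from two distinct primes (far too small for real use)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because g = n + 1
    return (n,), (lam, mu, n)

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def add_encrypted(pub, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds plaintexts."""
    (n,) = pub
    return (c1 * c2) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    l = (pow(c, lam, n * n) - 1) // n
    return (l * mu) % n
```

The first cloud environment in the claim would compute `add_encrypted` over the encrypted query and transaction data without ever seeing the plaintexts.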

Patent
Hima P. Karanan, Manish Kesarwani, Salil Joshi, Mohit Jain, Sameep Mehta
12 Dec 2017
TL;DR: In this article, a computer-implemented root cause analysis using provenance data is presented, which comprises computing a plurality of provenance paths for at least one of the plurality of data elements in a curation flow.
Abstract: Methods, systems and computer program products for root cause analysis using provenance data are provided herein. A computer-implemented method comprises computing a plurality of provenance paths for at least one of a plurality of data elements in a curation flow and a plurality of groups of data elements in the curation flow, analyzing the computed provenance paths to determine one or more errors in the curation flow, and outputting the one or more errors in the curation flow to at least one user. The analyzing comprises at least one of identifying which of the computed provenance paths are partial provenance paths, and identifying one or more output records associated with the curation flow, wherein the one or more output records comprise incorrectly curated data, and identifying the computed provenance paths that respectively correspond to the one or more output records comprising the incorrectly curated data.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: A novel multistage framework to convert textual instructions into coherent visual descriptions (text instructions annotated with images) using a combination of deep learning and graph based approach is presented.
Abstract: Text is the easiest means to record information, but it need not always be the best means for understanding a concept. Psychological theories argue that when information is presented visually, it provides a better means to understand a concept. While techniques exist for generating text from a given image, the inverse problem, that is, automatically fetching coherent images to represent a given set of instructions (a sequence of text), is a hard one. In this paper, we present a novel multistage framework to convert textual instructions into coherent visual descriptions (text instructions annotated with images). The key components of the proposed approach are: (i) a novel framework that combines text and image analysis to generate visual descriptions; and (ii) ensuring coherency across visual descriptions using a combination of deep learning and a graph-based approach. The effectiveness of our proposed approach is shown through a user study on a dataset of instructions and corresponding images collected from the WikiHow website.

18 Apr 2017
TL;DR: The work presented in this report aims at developing a tamper-proof temporal provenance storage platform and query-based model that can track, store, and analyze data transformations.
Abstract: In the era of big data, where every individual is a target of intensive data collection, there is a need to create technological tools that empower individuals to track what happens to their data. Provenance has been studied extensively in both database and workflow management systems, so far with little focus on text-retrieval based workflows with user-defined operators. This kind of workflow provenance aims to capture a complete description of the evaluation (or enactment) of a workflow, which is crucial to the problem of personal data use. As an initial step toward solving this problem, the work presented in this report aims at developing our own tamper-proof temporal provenance storage platform and query-based model that can track, store, and analyze data transformations.

Posted Content
TL;DR: The Bollywood Movie corpus contains 4000 movies extracted from Wikipedia and 880 trailers extracted from YouTube, covering releases from 1970-2017; preliminary results on bias removal suggest that the data-set is quite useful for such tasks.
Abstract: In the past few years, several data-sets have been released for text and images. We present an approach to create a data-set for use in detecting and removing gender bias from text, and we include a set of challenges we faced while creating this corpus. In this work, we use movie data from Wikipedia plots and movie trailers from YouTube. Our Bollywood Movie corpus contains 4000 movies extracted from Wikipedia and 880 trailers extracted from YouTube, released from 1970-2017. The corpus contains csv files with the following data about each movie: Wikipedia title of the movie, cast, plot text, co-referenced plot text, soundtrack information, link to the movie poster, caption of the movie poster, number of males in the poster, and number of females in the poster. In addition, the following data is available for each cast member: cast name, cast gender, cast verbs, cast adjectives, cast relations, cast centrality, and cast mentions. We present some preliminary results on the task of bias removal, which suggest that the data-set is quite useful for performing such tasks.

Proceedings ArticleDOI
Nitin Gupta, Ankush Gupta, Vikas Joshi, L. Venkata Subramaniam, Sameep Mehta
01 Dec 2017
TL;DR: This work derives a similarity score for attributes like gender and complexion using an existing face recognition model; the learnt attribute-driven models perform on par with the existing baseline models on the attribute-driven ranking task.
Abstract: In this work, we propose to derive an attribute-specific similarity score for a pair of images using an existing parent deep model. As an example, given two facial images, we derive a similarity score for attributes like gender and complexion using an existing face recognition model. It is not always feasible to train a new model for each attribute, as training a deep neural network based model requires a large number of labelled samples to reliably learn the parameters. Hence, in the proposed framework, a similarity score for each attribute is obtained as a weighted combination of all the hidden layer features of the parent model. The weights are attribute specific and are estimated by minimizing the proposed triplet-based hinge loss criterion over a small number of labelled samples. Although generic, the proposed approach is developed in the context of a specific application: searching for social media profiles of suspects for law enforcement agencies. To measure the effectiveness of our proposed approach, we have also created a social media dataset, "LFW Social (LFW-S)", corresponding to the Labeled Faces in the Wild (LFW) dataset. The key motivation behind our approach is not to improve upon existing baseline methods but to reduce the overhead of generating a labeled dataset for learning a new attribute. It is worth noting, however, that the learnt attribute-driven models perform on par with the existing baseline models on the attribute-driven ranking task.
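The weighted per-layer similarity and the triplet hinge loss can be sketched as below. Layer choices, the margin value, and the training loop that fits the weights are omitted or assumed; the per-layer features are taken as already extracted from the parent model:

```python
# Sketch: attribute score as a weighted sum of per-layer cosine
# similarities, with the triplet hinge loss the weights would minimise.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def attribute_score(weights, feats_a, feats_b):
    """feats_*: list of per-layer feature vectors from the parent model."""
    return sum(w * cosine(fa, fb)
               for w, fa, fb in zip(weights, feats_a, feats_b))

def triplet_hinge(weights, anchor, positive, negative, margin=0.2):
    """Penalise triplets where the positive is not ahead of the
    negative by at least `margin` in attribute score."""
    s_pos = attribute_score(weights, anchor, positive)
    s_neg = attribute_score(weights, anchor, negative)
    return max(0.0, margin - s_pos + s_neg)
```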