
Showing papers by Sameep Mehta published in 2020


Proceedings ArticleDOI
23 Aug 2020
TL;DR: This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications; it surveys the important data-quality approaches discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems.
Abstract: It is well understood from the literature that the performance of a machine learning (ML) model is upper bounded by the quality of the data. While researchers and practitioners have focused on improving the quality of models (through, for example, neural architecture search and automated feature selection), there have been limited efforts towards improving data quality. One of the crucial requirements before consuming a dataset for any application is to understand the dataset at hand; failure to do so can result in inaccurate analytics and unreliable decisions. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps reduces the effort a data scientist spends iteratively debugging the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications. It surveys the important data-quality approaches discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we discuss the work IBM Research is doing in this space.
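For intuition, here is a minimal sketch of the kind of dataset-level quality metrics such a survey covers; the specific metrics, thresholds, and the pandas-based implementation are illustrative assumptions, not the tutorial's actual toolkit.

```python
# Illustrative data-quality checks of the kind the tutorial surveys.
# The metrics below are assumptions for demonstration, not the actual
# metrics proposed in the tutorial.
import pandas as pd

def quality_report(df: pd.DataFrame, label_col: str) -> dict:
    n = len(df)
    # Fraction of non-missing cells across the whole table.
    completeness = 1.0 - df.isna().sum().sum() / (n * len(df.columns))
    # Fraction of rows that are exact duplicates of an earlier row.
    duplicate_rate = df.duplicated().mean()
    # Gap between the most and least frequent label (0 = balanced).
    label_counts = df[label_col].value_counts(normalize=True)
    imbalance = label_counts.max() - label_counts.min()
    return {
        "completeness": round(completeness, 3),
        "duplicate_rate": round(float(duplicate_rate), 3),
        "label_imbalance": round(float(imbalance), 3),
    }

df = pd.DataFrame({
    "text": ["a", "b", "b", None],
    "label": ["pos", "neg", "neg", "neg"],
})
print(quality_report(df, label_col="label"))
```

Low completeness or high imbalance flags would then point the data scientist to corresponding transformation operations (imputation, deduplication, resampling) before model training.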

71 citations


Proceedings ArticleDOI
TL;DR: The system protocols are set up to incentivize all three entities (data owners, cloud vendors, and AI developers) to truthfully record their actions on the distributed ledger, so that the blockchain system provides verifiable evidence for detecting wrongdoing and resolving disputes.
Abstract: We present a blockchain based system that allows data owners, cloud vendors, and AI developers to collaboratively train machine learning models in a trustless AI marketplace. Data is a highly valued digital asset and central to deriving business insights. Our system enables data owners to retain ownership and privacy of their data, while still allowing AI developers to leverage the data for training. Similarly, AI developers can utilize compute resources from cloud vendors without losing ownership or privacy of their trained models. Our system protocols are set up to incentivize all three entities (data owners, cloud vendors, and AI developers) to truthfully record their actions on the distributed ledger, so that the blockchain system provides verifiable evidence for detecting wrongdoing and resolving disputes. Our system is implemented on Hyperledger Fabric and can provide a viable alternative to centralized AI systems that do not guarantee data or model privacy. We present experimental performance results that demonstrate the latency and throughput of its transactions under different network configurations where peers on the blockchain may be spread across different datacenters and geographies. Our results indicate that the proposed solution scales well to a large number of data and model owners and can train up to 70 models per second on a 12-peer non-optimized blockchain network and roughly 30 models per second on a 24-peer network.
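For intuition only, the sketch below models an append-only, hash-chained log of marketplace actions; the real system uses Hyperledger Fabric, so the ledger structure, method names, and verification logic here are simplified assumptions, not the paper's implementation.

```python
# Toy model of an append-only ledger of marketplace actions. Tampering
# with any recorded entry breaks the hash chain, which is the property
# the paper relies on for verifiable evidence and dispute resolution.
import hashlib, json, time

class Ledger:
    def __init__(self):
        self.entries = []

    def record(self, actor: str, action: str, payload: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {"actor": actor, "action": action,
                "payload": payload, "prev": prev_hash, "ts": time.time()}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append({**body, "hash": digest})
        return digest

    def verify(self) -> bool:
        # Recompute every hash; any altered entry invalidates the chain.
        prev = "0" * 64
        for e in self.entries:
            body = {k: e[k] for k in ("actor", "action", "payload", "prev", "ts")}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or digest != e["hash"]:
                return False
            prev = e["hash"]
        return True

ledger = Ledger()
ledger.record("data_owner", "grant_access", {"dataset": "d1", "to": "dev42"})
ledger.record("cloud_vendor", "train_model", {"dataset": "d1", "model": "m7"})
print(ledger.verify())  # True unless an entry was altered after the fact
```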

12 citations


Posted Content
Shazia Afzal, Rajmohan C, Manish Kesarwani, Sameep Mehta, Hima Patel
TL;DR: The concept of a Data Readiness Report is introduced as accompanying documentation to a dataset that allows data consumers to get detailed insights into the quality of the data and increases trust in and reusability of the data.
Abstract: Data exploration and quality analysis is an important yet tedious process in the AI pipeline. Current practices of data cleaning and data readiness assessment for machine learning tasks are mostly conducted in an arbitrary manner, which limits their reuse and results in loss of productivity. We introduce the concept of a Data Readiness Report as accompanying documentation to a dataset that allows data consumers to get detailed insights into the quality of input data. Data characteristics and challenges on various quality dimensions are identified and documented keeping in mind the principles of transparency and explainability. The Data Readiness Report also serves as a record of all data assessment operations, including applied transformations. This provides a detailed lineage for the purpose of data governance and management. In effect, the report captures and documents the actions taken by various personas in a data readiness and assessment workflow. Over time, this becomes a repository of best practices and can potentially drive a recommendation system for building automated data readiness workflows along the lines of AutoML [8]. We anticipate that, together with Datasheets [9], the Dataset Nutrition Label [11], FactSheets [1], and Model Cards [15], the Data Readiness Report makes significant progress towards Data and AI lifecycle documentation.
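As a rough illustration, such a report could be modelled as a structured record of quality dimensions plus a log of assessment operations; the schema and field names below are hypothetical, not the paper's proposed format.

```python
# Hypothetical sketch of a Data Readiness Report record: quality
# dimensions plus an operation log that doubles as data lineage.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Operation:
    persona: str   # e.g. "data steward", "data scientist"
    name: str      # assessment or transformation applied
    details: str

@dataclass
class DataReadinessReport:
    dataset: str
    quality_dimensions: dict = field(default_factory=dict)
    operations: List[Operation] = field(default_factory=list)

    def log(self, persona: str, name: str, details: str) -> None:
        # Every action is appended, preserving the workflow's lineage.
        self.operations.append(Operation(persona, name, details))

report = DataReadinessReport(dataset="support_tickets_v2")
report.quality_dimensions["completeness"] = 0.94
report.log("data scientist", "impute_missing", "median imputation on 'age'")
print(report)
```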

11 citations


Posted Content
Balaji Ganesan, Hima Patel, Sameep Mehta
TL;DR: In this concept paper, ideas from Graph Neural Networks and explainability are presented that could improve trust in contact tracing applications and encourage their adoption.
Abstract: Contact tracing has been used to identify people who were in close proximity to those infected with the SARS-CoV-2 coronavirus. A number of digital contact tracing applications have been introduced to facilitate or complement physical contact tracing. However, there are a number of privacy issues in the implementation of contact tracing applications, which make people reluctant to install them or to update their infection status on them. In this concept paper, we present ideas from Graph Neural Networks and explainability that could improve trust in these applications and encourage adoption.
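A toy sketch of the underlying idea: propagate exposure risk over a contact graph with rounds of neighbor aggregation, roughly one GNN layer per round. The recursion, decay factor, and contact data are illustrative assumptions, not the concept paper's model.

```python
# Toy exposure-risk propagation over a contact graph. Each round, a
# person's risk becomes a decayed average of their contacts' risk,
# analogous to one layer of GNN message passing.
from collections import defaultdict

# Undirected contact graph (hypothetical data): person -> contacts.
contacts = defaultdict(set)
for a, b in [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]:
    contacts[a].add(b)
    contacts[b].add(a)

risk = {p: 0.0 for p in contacts}
risk["alice"] = 1.0  # confirmed infection

for _ in range(2):  # two rounds = risk reaches two-hop contacts
    risk = {
        p: max(risk[p], 0.5 * sum(risk[q] for q in contacts[p]) / len(contacts[p]))
        for p in contacts
    }

print(sorted(risk.items(), key=lambda kv: -kv[1]))
```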

5 citations


Posted Content
TL;DR: The authors propose a neural network architecture for fairly transferring multiple style attributes in a given text, with which text can be obfuscated or generated to incorporate a desired degree of multiple soft styles such as female-quality, politeness, or formalness.
Abstract: To preserve anonymity and obfuscate their identity on online platforms, users may morph their text and portray themselves as a different gender or demographic. Similarly, a chatbot may need to customize its communication style to improve engagement with its audience. This manner of changing the style of written text has gained significant attention in recent years. Yet past research largely caters to the transfer of single style attributes. The disadvantage of focusing on a single style alone is that this often results in target text where other existing style attributes behave unpredictably or are unfairly dominated by the new style. To counteract this behavior, a style transfer mechanism is needed that can transfer or control multiple styles simultaneously and fairly. Through such an approach, one could obtain obfuscated or generated text incorporating a desired degree of multiple soft styles such as female-quality, politeness, or formalness. In this work, we demonstrate that the transfer of multiple styles cannot be achieved by sequentially performing multiple single-style transfers, because each single-style transfer step often reverses or dominates the style incorporated by a previous step. We then propose a neural network architecture for fairly transferring multiple style attributes in a given text. We test our architecture on the Yelp dataset to demonstrate superior performance compared to existing single-style transfers performed in sequence.
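One plausible reading of joint multi-style conditioning is sketched below: the decoder's initial state is derived from the content representation and a vector of per-style intensities in a single step, rather than by chaining single-style transfers. The module, names, and dimensions are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical sketch of joint multi-style conditioning: one projection
# consumes the content encoding together with all style intensities at
# once, so no single style can overwrite a previously applied one.
import torch
import torch.nn as nn

class MultiStyleDecoderInit(nn.Module):
    def __init__(self, content_dim: int, n_styles: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(content_dim + n_styles, hidden_dim)

    def forward(self, content: torch.Tensor, style_intensities: torch.Tensor):
        # Concatenate content with the full style vector, then project.
        return torch.tanh(self.proj(torch.cat([content, style_intensities], dim=-1)))

init = MultiStyleDecoderInit(content_dim=256, n_styles=3, hidden_dim=512)
content = torch.randn(4, 256)                 # encoder output, batch of 4
styles = torch.tensor([[0.9, 0.2, 0.7]] * 4)  # e.g. politeness, formality, gender
state = init(content, styles)
print(state.shape)  # torch.Size([4, 512])
```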

2 citations


Patent
06 Aug 2020
TL;DR: A cognitive terminology filter list, which associates one or more cognitive terminology types with each entity, is configured and used to detect cognitive terminologies in a news article and suggest their removal.
Abstract: Embodiments provide a computer implemented method in a data processing system comprising a processor and a memory comprising instructions, which are executed by the processor to cause the processor to implement the method of removing a cognitive terminology from a news article at a news portal, the method including: receiving, by the processor, a first news article from a user; configuring, by the processor, a cognitive terminology filter list to add one or more entities and one or more cognitive terminology types associated with each entity in the cognitive terminology filter list; dividing, by the processor, the first news article into a plurality of text segments; identifying, by the processor, one or more key entities and one or more inter-entity relationships of each text segment; detecting, by the processor, one or more cognitive terminologies in the first news article; and providing, by the processor, one or more suggestions to remove the one or more cognitive terminologies.
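A simplified sketch of the claimed pipeline: segment the article, match segments against a configurable filter list, and emit removal suggestions. The naive substring matching, example entities, and terminology types are hypothetical stand-ins for the patent's entity and inter-entity relationship extraction.

```python
# Illustrative pipeline: segment -> match against filter list -> suggest.
import re

# Hypothetical filter list: entity -> cognitive terminology types.
filter_list = {
    "Acme Corp": ["loaded-language"],
    "Candidate X": ["biased-framing"],
}

def segment(article: str) -> list:
    # Crude sentence-level segmentation on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]

def detect(article: str) -> list:
    findings = []
    for i, seg in enumerate(segment(article)):
        for entity, types in filter_list.items():
            if entity in seg:
                findings.append({
                    "segment": i,
                    "entity": entity,
                    "types": types,
                    "suggestion": f"Review mention of '{entity}' for {', '.join(types)}.",
                })
    return findings

article = "Acme Corp crushed its rivals. Candidate X dodged the question."
for f in detect(article):
    print(f)
```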

1 citation


Patent
26 Mar 2020
TL;DR: Candidate data analysis assets whose relatedness score for the particular input dataset exceeds a defined relatedness score threshold are selected and ranked by score.
Abstract: Asset recommendation for a particular input dataset is provided. Candidate data analysis assets having a corresponding relatedness score associated with the particular input dataset greater than a defined relatedness score threshold value are selected. Those candidate data analysis assets having a corresponding relatedness score greater than the defined relatedness score threshold value are ranked by score. Those candidate data analysis assets having a corresponding relatedness score greater than the defined relatedness score threshold value are listed by rank from highest to lowest. A justification for each candidate data analysis asset is inserted in the ranked list of candidate data analysis assets. The ranked list of candidate data analysis assets along with each respective justification is outputted on a display device.
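The core flow (threshold filter, descending rank, per-asset justification) can be sketched in a few lines; the candidate assets, scores, and justifications below are hypothetical.

```python
# Minimal sketch of the recommendation flow described in the patent:
# filter by relatedness threshold, rank descending, attach justification.
candidates = [
    {"asset": "churn_model", "score": 0.91, "why": "shares 8 columns with input"},
    {"asset": "sales_notebook", "score": 0.42, "why": "same business domain"},
    {"asset": "geo_join_flow", "score": 0.77, "why": "joins on customer_id"},
]
THRESHOLD = 0.5  # hypothetical relatedness score threshold

ranked = sorted(
    (c for c in candidates if c["score"] > THRESHOLD),
    key=lambda c: c["score"],
    reverse=True,
)
for rank, c in enumerate(ranked, start=1):
    print(f"{rank}. {c['asset']} (score={c['score']}): {c['why']}")
```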

1 citation


Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper brings a human into the loop, enabling a human teacher to give feedback to a key-tag extraction framework in the form of natural language; the authors focus on key-tag extraction because the quality of its output can easily be judged by non-experts.
Abstract: Machine learning experts use classification and tagging algorithms despite the black-box nature of these algorithms. These algorithms, primarily key-tag extraction from unstructured text documents, are meant to capture the key concepts in a document. With the increasing amount, size, and complexity of data, this problem is key in industrial setups. With possible use cases in IT support, conversational systems/chatbots, and financial domains, this problem is important, as shown in [1], [2]. In this paper, we bring a human into the loop and enable a human teacher to give feedback to a key-tag extraction framework in the form of natural language. We focus on the problem of key-tag extraction, in which the quality of the output can easily be judged by non-experts. Our system automatically reads natural language documents, extracts key concepts, and presents an interactive information exploration user interface for analysing these documents.
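A toy rendering of the human-in-the-loop idea: extract candidate key-tags by frequency, then apply a teacher's natural-language feedback through a simple command grammar. Both the extractor and the feedback parser are deliberate simplifications, not the paper's system.

```python
# Toy human-in-the-loop key-tag extraction: frequency-based candidates
# plus a tiny "add X" / "remove X" feedback grammar (hypothetical).
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "for"}

def extract_tags(doc: str, k: int = 3) -> list:
    words = [w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def apply_feedback(tags: list, feedback: str) -> list:
    m = re.match(r"(add|remove)\s+(\w+)", feedback.lower())
    if m:
        verb, tag = m.groups()
        if verb == "add" and tag not in tags:
            tags = tags + [tag]
        elif verb == "remove":
            tags = [t for t in tags if t != tag]
    return tags

doc = "The printer is offline and the printer queue is stuck for the VPN team."
tags = extract_tags(doc)
print(tags)                              # e.g. ['printer', 'offline', 'queue']
print(apply_feedback(tags, "add vpn"))   # teacher's natural-language correction
```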

1 citation


Proceedings ArticleDOI
01 Nov 2020
TL;DR: This work presents a decentralized trusted data and model platform for collaborative AI that leverages blockchain as an immutable metadata store of data and model resources and the operations performed on them, to support and enforce ownership, authenticity, integrity, lineage, and auditability properties.
Abstract: Data analytics and artificial intelligence are extensively used by enterprises today and they increasingly span organization boundaries. Such collaboration between organizations today happens in an ad hoc manner, with very little visibility and systemic control over who is accessing the data, how, and for what purpose. When sharing data and AI models with other organizations, the owners desire the ability to control access, have visibility into the entire data pipeline and lineage, and ensure integrity. In this work, we present a decentralized trusted data and model platform for collaborative AI that leverages blockchain as an immutable metadata store of data and model resources and the operations performed on them, to support and enforce ownership, authenticity, integrity, lineage, and auditability properties. Smart contracts enforce policies specified on data, including hierarchical and composite policies that are uniquely enabled by the use of blockchain. We demonstrate that our system is lightweight and can support over 1000 transactions per second with sub-second latency, significantly lower than the time taken to execute data pipelines.
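As a sketch of how smart contracts might evaluate the hierarchical and composite policies mentioned above, consider the recursive check below; the policy kinds and field names are assumptions for demonstration, not the platform's actual contract logic.

```python
# Illustrative recursive evaluation of composite/hierarchical policies.
def allows(policy: dict, request: dict) -> bool:
    kind = policy["kind"]
    if kind == "allow_org":
        return request["org"] == policy["org"]
    if kind == "purpose":
        return request["purpose"] in policy["purposes"]
    if kind == "all_of":   # composite: every sub-policy must pass
        return all(allows(p, request) for p in policy["policies"])
    if kind == "any_of":   # composite: at least one sub-policy passes
        return any(allows(p, request) for p in policy["policies"])
    return False           # unknown policy kinds are denied by default

# Hierarchical example: a catalog-level org rule composed with a
# dataset-level purpose rule.
policy = {"kind": "all_of", "policies": [
    {"kind": "any_of", "policies": [
        {"kind": "allow_org", "org": "acme-research"},
        {"kind": "allow_org", "org": "acme-labs"},
    ]},
    {"kind": "purpose", "purposes": ["model-training"]},
]}

print(allows(policy, {"org": "acme-labs", "purpose": "model-training"}))  # True
print(allows(policy, {"org": "evil-corp", "purpose": "model-training"}))  # False
```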

Proceedings ArticleDOI
10 Dec 2020
TL;DR: In this paper, the authors look at the problem of retrieving business-need-specific lineage information from provenance graphs and propose a framework for efficiently executing such business lineage queries, experimentally illustrating its effectiveness.
Abstract: In this paper, we look at the problem of retrieving business-need-specific lineage information from provenance graphs. A provenance graph models the events happening on various assets on a data platform. The output of a lineage query may contain a large number of nodes. However, a user, depending on her business role, may only want a small subset of these nodes and the lineage relationships among them. We formally define a class of business lineage queries wherein a business user specifies the lineage events relevant to her business need in terms of event labels. The lineage output then consists of these events of interest, the assets associated with these events, and the lineage relationships among these events and assets of interest. We propose a novel framework for efficiently executing such business lineage queries and experimentally illustrate its effectiveness.
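A minimal sketch of such a query: retain only events whose labels the user names, and induce lineage edges between retained nodes by reachability through the dropped ones. The graph, labels, and BFS-based projection below are hypothetical, not the paper's framework.

```python
# Toy business lineage query over a provenance DAG: project the graph
# onto nodes with the requested labels, preserving lineage relationships.
from collections import deque

# Provenance DAG (hypothetical): node -> downstream nodes, plus labels.
edges = {"read_csv": ["clean"], "clean": ["join"], "join": ["train"],
         "train": ["model_v1"], "model_v1": []}
labels = {"read_csv": "ingest", "clean": "prep", "join": "prep",
          "train": "train", "model_v1": "asset"}

def business_lineage(keep_labels: set) -> dict:
    keep = {n for n, l in labels.items() if l in keep_labels}
    result = {n: set() for n in keep}
    for src in keep:
        # BFS through dropped nodes until the next kept node is reached.
        queue, seen = deque(edges[src]), set()
        while queue:
            n = queue.popleft()
            if n in seen:
                continue
            seen.add(n)
            if n in keep:
                result[src].add(n)       # induced lineage edge
            else:
                queue.extend(edges[n])   # pass through a dropped node
    return result

# A user interested only in ingest, training, and asset events:
print(business_lineage({"ingest", "train", "asset"}))
# e.g. {'read_csv': {'train'}, 'train': {'model_v1'}, 'model_v1': set()}
```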