
Showing papers by Sameep Mehta published in 2020


Proceedings ArticleDOI
23 Aug 2020
TL;DR: This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications; it surveys the important data-quality approaches discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems.
Abstract: It is well understood from the literature that the performance of a machine learning (ML) model is upper bounded by the quality of the data. While researchers and practitioners have focused on improving the quality of models (through, for example, neural architecture search and automated feature selection), there have been limited efforts towards improving data quality. One of the crucial requirements before consuming a dataset for any application is to understand the dataset at hand; failure to do so can result in inaccurate analytics and unreliable decisions. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps reduces the effort a data scientist spends iteratively debugging the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications. It surveys the important data-quality approaches discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we discuss the work IBM Research is doing in this space.
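For intuition, here is a minimal sketch of the kind of dataset-level quality metrics such a survey covers; the specific metrics, thresholds, and the pandas-based implementation are illustrative assumptions, not the tutorial's actual toolkit.

```python
# Illustrative data-quality checks of the kind the tutorial surveys.
# The metrics below are assumptions for demonstration, not the actual
# metrics proposed in the tutorial.
import pandas as pd

def quality_report(df: pd.DataFrame, label_col: str) -> dict:
    n = len(df)
    # Fraction of non-missing cells across the whole table.
    completeness = 1.0 - df.isna().sum().sum() / (n * len(df.columns))
    # Fraction of rows that are exact duplicates of an earlier row.
    duplicate_rate = df.duplicated().mean()
    # Gap between the most and least frequent label (0 = balanced).
    label_counts = df[label_col].value_counts(normalize=True)
    imbalance = label_counts.max() - label_counts.min()
    return {
        "completeness": round(completeness, 3),
        "duplicate_rate": round(float(duplicate_rate), 3),
        "label_imbalance": round(float(imbalance), 3),
    }

df = pd.DataFrame({
    "text": ["a", "b", "b", None],
    "label": ["pos", "neg", "neg", "neg"],
})
print(quality_report(df, label_col="label"))
```

Low completeness or high imbalance flags would then point the data scientist to corresponding transformation operations (imputation, deduplication, resampling) before model training.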

71 citations


Proceedings ArticleDOI
TL;DR: The system protocols are set up to incentivize all three entities (data owners, cloud vendors, and AI developers) to truthfully record their actions on the distributed ledger, so that the blockchain system provides verifiable evidence for detecting wrongdoing and resolving disputes.
Abstract: We present a blockchain based system that allows data owners, cloud vendors, and AI developers to collaboratively train machine learning models in a trustless AI marketplace. Data is a highly valued digital asset and central to deriving business insights. Our system enables data owners to retain ownership and privacy of their data, while still allowing AI developers to leverage the data for training. Similarly, AI developers can utilize compute resources from cloud vendors without losing ownership or privacy of their trained models. Our system protocols are set up to incentivize all three entities (data owners, cloud vendors, and AI developers) to truthfully record their actions on the distributed ledger, so that the blockchain system provides verifiable evidence for detecting wrongdoing and resolving disputes. Our system is implemented on Hyperledger Fabric and can provide a viable alternative to centralized AI systems that do not guarantee data or model privacy. We present experimental performance results that demonstrate the latency and throughput of its transactions under different network configurations where peers on the blockchain may be spread across different datacenters and geographies. Our results indicate that the proposed solution scales well to a large number of data and model owners and can train up to 70 models per second on a 12-peer non-optimized blockchain network and roughly 30 models per second on a 24-peer network.
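For intuition only, the sketch below models an append-only, hash-chained log of marketplace actions; the real system uses Hyperledger Fabric, so the ledger structure, method names, and verification logic here are simplified assumptions, not the paper's implementation.

```python
# Toy model of an append-only ledger of marketplace actions. Tampering
# with any recorded entry breaks the hash chain, which is the property
# the paper relies on for verifiable evidence and dispute resolution.
import hashlib, json, time

class Ledger:
    def __init__(self):
        self.entries = []

    def record(self, actor: str, action: str, payload: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {"actor": actor, "action": action,
                "payload": payload, "prev": prev_hash, "ts": time.time()}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append({**body, "hash": digest})
        return digest

    def verify(self) -> bool:
        # Recompute every hash; any altered entry invalidates the chain.
        prev = "0" * 64
        for e in self.entries:
            body = {k: e[k] for k in ("actor", "action", "payload", "prev", "ts")}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or digest != e["hash"]:
                return False
            prev = e["hash"]
        return True

ledger = Ledger()
ledger.record("data_owner", "grant_access", {"dataset": "d1", "to": "dev42"})
ledger.record("cloud_vendor", "train_model", {"dataset": "d1", "model": "m7"})
print(ledger.verify())  # True unless an entry was altered after the fact
```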

12 citations


Posted Content
Shazia Afzal, Rajmohan C, Manish Kesarwani, Sameep Mehta, Hima Patel
TL;DR: The concept of a Data Readiness Report is introduced as accompanying documentation to a dataset that allows data consumers to get detailed insights into the quality of the data and increases trust in and reusability of the data.
Abstract: Data exploration and quality analysis is an important yet tedious process in the AI pipeline. Current practices of data cleaning and data readiness assessment for machine learning tasks are mostly conducted in an arbitrary manner, which limits their reuse and results in loss of productivity. We introduce the concept of a Data Readiness Report as accompanying documentation to a dataset that allows data consumers to get detailed insights into the quality of input data. Data characteristics and challenges on various quality dimensions are identified and documented keeping in mind the principles of transparency and explainability. The Data Readiness Report also serves as a record of all data assessment operations, including applied transformations. This provides a detailed lineage for the purpose of data governance and management. In effect, the report captures and documents the actions taken by various personas in a data readiness and assessment workflow. Over time, this becomes a repository of best practices and can potentially drive a recommendation system for building automated data readiness workflows along the lines of AutoML [8]. We anticipate that, together with Datasheets [9], the Dataset Nutrition Label [11], FactSheets [1], and Model Cards [15], the Data Readiness Report makes significant progress towards Data and AI lifecycle documentation.
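As a rough illustration, such a report could be modelled as a structured record of quality dimensions plus a log of assessment operations; the schema and field names below are hypothetical, not the paper's proposed format.

```python
# Hypothetical sketch of a Data Readiness Report record: quality
# dimensions plus an operation log that doubles as data lineage.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Operation:
    persona: str   # e.g. "data steward", "data scientist"
    name: str      # assessment or transformation applied
    details: str

@dataclass
class DataReadinessReport:
    dataset: str
    quality_dimensions: dict = field(default_factory=dict)
    operations: List[Operation] = field(default_factory=list)

    def log(self, persona: str, name: str, details: str) -> None:
        # Every action is appended, preserving the workflow's lineage.
        self.operations.append(Operation(persona, name, details))

report = DataReadinessReport(dataset="support_tickets_v2")
report.quality_dimensions["completeness"] = 0.94
report.log("data scientist", "impute_missing", "median imputation on 'age'")
print(report)
```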

11 citations


Posted Content
Balaji Ganesan, Hima Patel, Sameep Mehta
TL;DR: In this concept paper, ideas from Graph Neural Networks and explainability are presented that could improve trust in contact tracing applications and encourage their adoption.
Abstract: Contact tracing has been used to identify people who were in close proximity to those infected with the SARS-CoV-2 coronavirus. A number of digital contact tracing applications have been introduced to facilitate or complement physical contact tracing. However, there are a number of privacy issues in the implementation of contact tracing applications, which make people reluctant to install them or to update their infection status on them. In this concept paper, we present ideas from Graph Neural Networks and explainability that could improve trust in these applications and encourage adoption.
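A toy sketch of the underlying idea: propagate exposure risk over a contact graph with rounds of neighbor aggregation, roughly one GNN layer per round. The recursion, decay factor, and contact data are illustrative assumptions, not the concept paper's model.

```python
# Toy exposure-risk propagation over a contact graph. Each round, a
# person's risk becomes a decayed average of their contacts' risk,
# analogous to one layer of GNN message passing.
from collections import defaultdict

# Undirected contact graph (hypothetical data): person -> contacts.
contacts = defaultdict(set)
for a, b in [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]:
    contacts[a].add(b)
    contacts[b].add(a)

risk = {p: 0.0 for p in contacts}
risk["alice"] = 1.0  # confirmed infection

for _ in range(2):  # two rounds = risk reaches two-hop contacts
    risk = {
        p: max(risk[p], 0.5 * sum(risk[q] for q in contacts[p]) / len(contacts[p]))
        for p in contacts
    }

print(sorted(risk.items(), key=lambda kv: -kv[1]))
```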

5 citations


Posted Content
TL;DR: The authors propose a neural network architecture for fairly transferring multiple style attributes in a given text, with which text can be obfuscated or generated to incorporate a desired degree of multiple soft styles such as female-quality, politeness, or formalness.
Abstract: To preserve anonymity and obfuscate their identity on online platforms, users may morph their text and portray themselves as a different gender or demographic. Similarly, a chatbot may need to customize its communication style to improve engagement with its audience. This manner of changing the style of written text has gained significant attention in recent years. Yet past research largely caters to the transfer of single style attributes. The disadvantage of focusing on a single style alone is that this often results in target text where other existing style attributes behave unpredictably or are unfairly dominated by the new style. To counteract this behavior, a style transfer mechanism is needed that can transfer or control multiple styles simultaneously and fairly. Through such an approach, one could obtain obfuscated or generated text incorporating a desired degree of multiple soft styles such as female-quality, politeness, or formalness. In this work, we demonstrate that the transfer of multiple styles cannot be achieved by sequentially performing multiple single-style transfers, because each single-style transfer step often reverses or dominates the style incorporated by a previous step. We then propose a neural network architecture for fairly transferring multiple style attributes in a given text. We test our architecture on the Yelp dataset to demonstrate superior performance compared to existing single-style transfers performed in sequence.
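One plausible reading of joint multi-style conditioning is sketched below: the decoder's initial state is derived from the content representation and a vector of per-style intensities in a single step, rather than by chaining single-style transfers. The module, names, and dimensions are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical sketch of joint multi-style conditioning: one projection
# consumes the content encoding together with all style intensities at
# once, so no single style can overwrite a previously applied one.
import torch
import torch.nn as nn

class MultiStyleDecoderInit(nn.Module):
    def __init__(self, content_dim: int, n_styles: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(content_dim + n_styles, hidden_dim)

    def forward(self, content: torch.Tensor, style_intensities: torch.Tensor):
        # Concatenate content with the full style vector, then project.
        return torch.tanh(self.proj(torch.cat([content, style_intensities], dim=-1)))

init = MultiStyleDecoderInit(content_dim=256, n_styles=3, hidden_dim=512)
content = torch.randn(4, 256)                 # encoder output, batch of 4
styles = torch.tensor([[0.9, 0.2, 0.7]] * 4)  # e.g. politeness, formality, gender
state = init(content, styles)
print(state.shape)  # torch.Size([4, 512])
```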

2 citations


Patent
06 Aug 2020
TL;DR: A cognitive terminology filter list, which associates one or more cognitive terminology types with each entity, is configured and used to detect cognitive terminologies in a news article and suggest their removal.
Abstract: Embodiments provide a computer implemented method in a data processing system comprising a processor and a memory comprising instructions, which are executed by the processor to cause the processor to implement the method of removing a cognitive terminology from a news article at a news portal, the method including: receiving, by the processor, a first news article from a user; configuring, by the processor, a cognitive terminology filter list to add one or more entities and one or more cognitive terminology types associated with each entity in the cognitive terminology filter list; dividing, by the processor, the first news article into a plurality of text segments; identifying, by the processor, one or more key entities and one or more inter-entity relationships of each text segment; detecting, by the processor, one or more cognitive terminologies in the first news article; and providing, by the processor, one or more suggestions to remove the one or more cognitive terminologies.
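A simplified sketch of the claimed pipeline: segment the article, match segments against a configurable filter list, and emit removal suggestions. The naive substring matching, example entities, and terminology types are hypothetical stand-ins for the patent's entity and inter-entity relationship extraction.

```python
# Illustrative pipeline: segment -> match against filter list -> suggest.
import re

# Hypothetical filter list: entity -> cognitive terminology types.
filter_list = {
    "Acme Corp": ["loaded-language"],
    "Candidate X": ["biased-framing"],
}

def segment(article: str) -> list:
    # Crude sentence-level segmentation on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]

def detect(article: str) -> list:
    findings = []
    for i, seg in enumerate(segment(article)):
        for entity, types in filter_list.items():
            if entity in seg:
                findings.append({
                    "segment": i,
                    "entity": entity,
                    "types": types,
                    "suggestion": f"Review mention of '{entity}' for {', '.join(types)}.",
                })
    return findings

article = "Acme Corp crushed its rivals. Candidate X dodged the question."
for f in detect(article):
    print(f)
```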

1 citation


Patent
26 Mar 2020
TL;DR: Candidate data analysis assets whose relatedness score for the particular input dataset exceeds a defined relatedness score threshold are selected and ranked by score.
Abstract: Asset recommendation for a particular input dataset is provided. Candidate data analysis assets having a corresponding relatedness score associated with the particular input dataset greater than a defined relatedness score threshold value are selected. Those candidate data analysis assets having a corresponding relatedness score greater than the defined relatedness score threshold value are ranked by score. Those candidate data analysis assets having a corresponding relatedness score greater than the defined relatedness score threshold value are listed by rank from highest to lowest. A justification for each candidate data analysis asset is inserted in the ranked list of candidate data analysis assets. The ranked list of candidate data analysis assets along with each respective justification is outputted on a display device.
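The core flow (threshold filter, descending rank, per-asset justification) can be sketched in a few lines; the candidate assets, scores, and justifications below are hypothetical.

```python
# Minimal sketch of the recommendation flow described in the patent:
# filter by relatedness threshold, rank descending, attach justification.
candidates = [
    {"asset": "churn_model", "score": 0.91, "why": "shares 8 columns with input"},
    {"asset": "sales_notebook", "score": 0.42, "why": "same business domain"},
    {"asset": "geo_join_flow", "score": 0.77, "why": "joins on customer_id"},
]
THRESHOLD = 0.5  # hypothetical relatedness score threshold

ranked = sorted(
    (c for c in candidates if c["score"] > THRESHOLD),
    key=lambda c: c["score"],
    reverse=True,
)
for rank, c in enumerate(ranked, start=1):
    print(f"{rank}. {c['asset']} (score={c['score']}): {c['why']}")
```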

1 citation


Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper brings a human into the loop, enabling a human teacher to give feedback to a key-tag extraction framework in the form of natural language; the authors focus on key-tag extraction because the quality of its output can easily be judged by non-experts.
Abstract: Machine learning experts use classification and tagging algorithms despite the black-box nature of these algorithms. These algorithms, primarily key-tag extraction from unstructured text documents, are meant to capture the key concepts in a document. With the increasing amount, size, and complexity of data, this problem is key in industrial setups. With possible use cases in IT support, conversational systems/chatbots, and financial domains, this problem is important, as shown in [1], [2]. In this paper, we bring a human into the loop and enable a human teacher to give feedback to a key-tag extraction framework in the form of natural language. We focus on the problem of key-tag extraction, in which the quality of the output can easily be judged by non-experts. Our system automatically reads natural language documents, extracts key concepts, and presents an interactive information exploration user interface for analysing these documents.
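A toy rendering of the human-in-the-loop idea: extract candidate key-tags by frequency, then apply a teacher's natural-language feedback through a simple command grammar. Both the extractor and the feedback parser are deliberate simplifications, not the paper's system.

```python
# Toy human-in-the-loop key-tag extraction: frequency-based candidates
# plus a tiny "add X" / "remove X" feedback grammar (hypothetical).
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "for"}

def extract_tags(doc: str, k: int = 3) -> list:
    words = [w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def apply_feedback(tags: list, feedback: str) -> list:
    m = re.match(r"(add|remove)\s+(\w+)", feedback.lower())
    if m:
        verb, tag = m.groups()
        if verb == "add" and tag not in tags:
            tags = tags + [tag]
        elif verb == "remove":
            tags = [t for t in tags if t != tag]
    return tags

doc = "The printer is offline and the printer queue is stuck for the VPN team."
tags = extract_tags(doc)
print(tags)                              # e.g. ['printer', 'offline', 'queue']
print(apply_feedback(tags, "add vpn"))   # teacher's natural-language correction
```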

1 citation


Proceedings ArticleDOI
01 Nov 2020
TL;DR: This work presents a decentralized trusted data and model platform for collaborative AI that leverages blockchain as an immutable metadata store of data and model resources and the operations performed on them, to support and enforce ownership, authenticity, integrity, lineage, and auditability properties.
Abstract: Data analytics and artificial intelligence are extensively used by enterprises today and they increasingly span organization boundaries. Such collaboration between organizations today happens in an ad hoc manner, with very little visibility and systemic control over who is accessing the data, how, and for what purpose. When sharing data and AI models with other organizations, the owners desire the ability to control access, have visibility into the entire data pipeline and lineage, and ensure integrity. In this work, we present a decentralized trusted data and model platform for collaborative AI that leverages blockchain as an immutable metadata store of data and model resources and the operations performed on them, to support and enforce ownership, authenticity, integrity, lineage, and auditability properties. Smart contracts enforce policies specified on data, including hierarchical and composite policies that are uniquely enabled by the use of blockchain. We demonstrate that our system is lightweight and can support over 1000 transactions per second with sub-second latency, significantly lower than the time taken to execute data pipelines.
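As a sketch of how smart contracts might evaluate the hierarchical and composite policies mentioned above, consider the recursive check below; the policy kinds and field names are assumptions for demonstration, not the platform's actual contract logic.

```python
# Illustrative recursive evaluation of composite/hierarchical policies.
def allows(policy: dict, request: dict) -> bool:
    kind = policy["kind"]
    if kind == "allow_org":
        return request["org"] == policy["org"]
    if kind == "purpose":
        return request["purpose"] in policy["purposes"]
    if kind == "all_of":   # composite: every sub-policy must pass
        return all(allows(p, request) for p in policy["policies"])
    if kind == "any_of":   # composite: at least one sub-policy passes
        return any(allows(p, request) for p in policy["policies"])
    return False           # unknown policy kinds are denied by default

# Hierarchical example: a catalog-level org rule composed with a
# dataset-level purpose rule.
policy = {"kind": "all_of", "policies": [
    {"kind": "any_of", "policies": [
        {"kind": "allow_org", "org": "acme-research"},
        {"kind": "allow_org", "org": "acme-labs"},
    ]},
    {"kind": "purpose", "purposes": ["model-training"]},
]}

print(allows(policy, {"org": "acme-labs", "purpose": "model-training"}))  # True
print(allows(policy, {"org": "evil-corp", "purpose": "model-training"}))  # False
```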

Proceedings ArticleDOI
10 Dec 2020
TL;DR: In this paper, the authors look at the problem of retrieving business-need-specific lineage information from provenance graphs and propose a framework for efficiently executing such business lineage queries, experimentally illustrating its effectiveness.
Abstract: In this paper, we look at the problem of retrieving business-need-specific lineage information from provenance graphs. A provenance graph models the events happening on various assets on a data platform. The output of a lineage query may contain a large number of nodes. However, a user, depending on her business role, may only want a small subset of these nodes and the lineage relationships among them. We formally define a class of business lineage queries wherein a business user specifies the lineage events relevant to her business need in terms of event labels. The lineage output then consists of these events of interest, the assets associated with these events, and the lineage relationships among these events and assets of interest. We propose a novel framework for efficiently executing such business lineage queries and experimentally illustrate its effectiveness.
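A minimal sketch of such a query: retain only events whose labels the user names, and induce lineage edges between retained nodes by reachability through the dropped ones. The graph, labels, and BFS-based projection below are hypothetical, not the paper's framework.

```python
# Toy business lineage query over a provenance DAG: project the graph
# onto nodes with the requested labels, preserving lineage relationships.
from collections import deque

# Provenance DAG (hypothetical): node -> downstream nodes, plus labels.
edges = {"read_csv": ["clean"], "clean": ["join"], "join": ["train"],
         "train": ["model_v1"], "model_v1": []}
labels = {"read_csv": "ingest", "clean": "prep", "join": "prep",
          "train": "train", "model_v1": "asset"}

def business_lineage(keep_labels: set) -> dict:
    keep = {n for n, l in labels.items() if l in keep_labels}
    result = {n: set() for n in keep}
    for src in keep:
        # BFS through dropped nodes until the next kept node is reached.
        queue, seen = deque(edges[src]), set()
        while queue:
            n = queue.popleft()
            if n in seen:
                continue
            seen.add(n)
            if n in keep:
                result[src].add(n)       # induced lineage edge
            else:
                queue.extend(edges[n])   # pass through a dropped node
    return result

# A user interested only in ingest, training, and asset events:
print(business_lineage({"ingest", "train", "asset"}))
# e.g. {'read_csv': {'train'}, 'train': {'model_v1'}, 'model_v1': set()}
```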