Author

T. S. Jayram

Other affiliations: University of Michigan, Microsoft
Bio: T. S. Jayram is an academic researcher from IBM. The author has contributed to research in topics: Communication complexity & Upper and lower bounds. The author has an h-index of 35 and has co-authored 62 publications receiving 5,031 citations. Previous affiliations of T. S. Jayram include the University of Michigan and Microsoft.


Papers
Journal ArticleDOI
16 Nov 2002
TL;DR: This work presents a new method for proving strong lower bounds in communication complexity based on the notion of the conditional information complexity of a function, and shows that it also admits a direct sum theorem.
Abstract: We present a new method for proving strong lower bounds in communication complexity. This method is based on the notion of the conditional information complexity of a function which is the minimum amount of information about the inputs that has to be revealed by a communication protocol for the function. While conditional information complexity is a lower bound on the communication complexity, we show that it also admits a direct sum theorem. Direct sum decomposition reduces our task to that of proving (conditional) information complexity lower bounds for simple problems (such as the AND of two bits). For the latter, we develop novel techniques based on Hellinger distance and its generalizations.
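
In a hedged sketch of the framework (notation reconstructed from the abstract, not quoted from the paper): with inputs (X, Y) and an auxiliary random variable D, the conditional information cost of a protocol \Pi is I(X, Y : \Pi(X, Y) \mid D), and the conditional information complexity \mathrm{CIC}(f \mid D) is its minimum over protocols that correctly compute f. This quantity lower-bounds communication complexity, and the direct sum theorem takes the shape

    \mathrm{CIC}(f_1, \dots, f_n \mid D^n) \;\ge\; \sum_{i=1}^{n} \mathrm{CIC}(f_i \mid D),

so it suffices to bound single-coordinate primitives such as the AND of two bits, for which the Hellinger distance h(P, Q) = \sqrt{1 - \sum_{\omega} \sqrt{P(\omega)\,Q(\omega)}} and its generalizations supply the tool.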

724 citations

Journal ArticleDOI
TL;DR: A measure on graphs, the minrank, is identified, which exactly characterizes the minimum length of linear and certain types of nonlinear INDEX codes; for natural classes of side information graphs, including directed acyclic graphs, perfect graphs, odd holes, and odd anti-holes, minrank is the optimal length of arbitrary INDEX codes.
Abstract: Motivated by a problem of transmitting supplemental data over broadcast channels (Birk and Kol, INFOCOM 1998), we study the following coding problem: a sender communicates with n receivers R1, ..., Rn. He holds an input x ∈ {0,1}^n and wishes to broadcast a single message so that each receiver Ri can recover the bit xi. Each Ri has prior side information about x, induced by a directed graph G on n nodes; Ri knows the bits of x in the positions {j | (i,j) is an edge of G}. G is known to the sender and to the receivers. We call encoding schemes that achieve this goal INDEX codes for {0,1}^n with side information graph G. In this paper we identify a measure on graphs, the minrank, which exactly characterizes the minimum length of linear and certain types of nonlinear INDEX codes. We show that for natural classes of side information graphs, including directed acyclic graphs, perfect graphs, odd holes, and odd anti-holes, minrank is the optimal length of arbitrary INDEX codes. For arbitrary INDEX codes and arbitrary graphs, we obtain a lower bound in terms of the size of the maximum acyclic induced subgraph. This bound holds even for randomized codes, but has been shown not to be tight.
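
To make the minrank measure concrete, here is a minimal brute-force sketch (a hypothetical helper, exponential in the number of edges and only usable on tiny graphs): minrk_2(G) is the minimum GF(2) rank over matrices with ones on the diagonal and zeros at every off-diagonal position that is not an edge of G.

    from itertools import product

    def gf2_rank(rows):
        # Gaussian elimination over GF(2); each row is an int bitmask.
        rank = 0
        for i in range(len(rows)):
            pivot = rows[i]
            if pivot == 0:
                continue
            rank += 1
            low = pivot & -pivot  # lowest set bit serves as the pivot column
            for j in range(i + 1, len(rows)):
                if rows[j] & low:
                    rows[j] ^= pivot
        return rank

    def minrank_gf2(n, edges):
        # Minimum GF(2) rank over all matrices with ones on the diagonal
        # and zeros at off-diagonal positions that are not edges of G.
        free = [(i, j) for i in range(n) for j in range(n)
                if i != j and (i, j) in edges]
        best = n  # the identity matrix always fits the pattern
        for bits in product([0, 1], repeat=len(free)):
            rows = [1 << i for i in range(n)]  # diagonal ones
            for (i, j), b in zip(free, bits):
                if b:
                    rows[i] |= 1 << j
            best = min(best, gf2_rank(rows))
        return best

    # Example: the 5-cycle (an odd hole) with symmetric side information.
    edges = {(i, (i + 1) % 5) for i in range(5)} | {((i + 1) % 5, i) for i in range(5)}
    print(minrank_gf2(5, edges))  # 3: two bits saved over the naive 5-bit broadcast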

632 citations

Book ChapterDOI
13 Sep 2002
TL;DR: Three algorithms are presented that count the number of distinct elements in a data stream to within a factor of 1 ± ε, improving upon known algorithms and offering a spectrum of time/space tradeoffs.
Abstract: We present three algorithms to count the number of distinct elements in a data stream to within a factor of 1 ± ε. Our algorithms improve upon known algorithms for this problem, and offer a spectrum of time/space tradeoffs.
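
The paper's three algorithms are not reproduced here; as an illustrative stand-in for small-space distinct counting, a k-minimum-values (KMV) estimator keeps only the k smallest hash values and reads the count off their spread (a hypothetical helper, assuming the hash behaves like a uniform 64-bit value):

    import hashlib
    import heapq

    def kmv_distinct(stream, k=256):
        # Track the k smallest hash values; if the k-th smallest, normalized
        # to (0, 1), is v, estimate the distinct count as (k - 1) / v.
        heap = []     # max-heap via negation, holding the k smallest hashes
        seen = set()  # hashes currently in the heap
        for item in stream:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            if h in seen:
                continue
            if len(heap) < k:
                heapq.heappush(heap, -h)
                seen.add(h)
            elif h < -heap[0]:
                seen.discard(-heapq.heappushpop(heap, -h))
                seen.add(h)
        if len(heap) < k:
            return len(heap)        # fewer than k distinct items: exact count
        v = -heap[0] / 2**64        # k-th smallest hash as a fraction
        return int((k - 1) / v)

    print(kmv_distinct(i % 10_000 for i in range(1_000_000)))  # close to 10,000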

561 citations

Proceedings ArticleDOI
21 Oct 2006
TL;DR: A measure on graphs, the minrank, is identified and conjectured to exactly characterize the minimum length of INDEX codes; the conjecture is resolved for certain natural classes of graphs, and the minrank bound is shown to be tight for linear codes and certain classes of non-linear codes.
Abstract: Motivated by a problem of transmitting data over broadcast channels (Birk and Kol, INFOCOM 1998), we study the following coding problem: a sender communicates with n receivers R_1, ..., R_n. He holds an input x \in {0, 1}^n and wishes to broadcast a single message so that each receiver R_i can recover the bit x_i. Each R_i has prior side information about x, induced by a directed graph G on n nodes; R_i knows the bits of x in the positions {j | (i, j) is an edge of G}. We call encoding schemes that achieve this goal INDEX codes for {0, 1}^n with side information graph G. In this paper we identify a measure on graphs, the minrank, which we conjecture to exactly characterize the minimum length of INDEX codes. We resolve the conjecture for certain natural classes of graphs. For arbitrary graphs, we show that the minrank bound is tight for both linear codes and certain classes of non-linear codes. For the general problem, we obtain a (weaker) lower bound: the length of an INDEX code for any graph G is at least the size of the maximum acyclic induced subgraph of G.
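
Read together with the journal version above, the results can be summarized (in hedged shorthand, with \mathrm{len}(G) denoting the minimum length of an INDEX code for side information graph G) as the sandwich

    \mathrm{MAIS}(G) \;\le\; \mathrm{len}(G) \;\le\; \mathrm{minrk}_2(G),

where \mathrm{MAIS}(G) is the size of the maximum acyclic induced subgraph. The upper bound is achieved by a linear code, and for directed acyclic graphs, perfect graphs, odd holes, and odd anti-holes the two ends coincide; the journal version adds that the lower bound holds even for randomized codes but is not tight in general.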

383 citations

Journal ArticleDOI
TL;DR: An analysis of a closed-loop system using an integral control law with Lotus Notes as the target, using root-locus analysis from control theory, is able to predict the occurrence (or absence) of controller-induced oscillations in the system's response.
Abstract: A widely used approach to achieving service level objectives for a software system (e.g., an email server) is to add a controller that manipulates the target system's tuning parameters. We describe a methodology for designing such controllers for software systems that builds on classical control theory. The classical approach proceeds in two steps: system identification and controller design. In system identification, we construct mathematical models of the target system. Traditionally, this has been based on a first-principles approach, using detailed knowledge of the target system. Such models can be complex and difficult to build, validate, use, and maintain. In our methodology, a statistical (ARMA) model is fit to historical measurements of the target being controlled. These models are easier to obtain and use, and allow us to apply control-theoretic design techniques to a larger class of systems. When applied to a Lotus Notes groupware server, we obtain model fits with R^2 no lower than 75% and as high as 98%. In controller design, an analysis of the models leads to a controller that will achieve the service level objectives. We report on an analysis of a closed-loop system using an integral control law with Lotus Notes as the target. The objective is to maintain a reference queue length. Using root-locus analysis from control theory, we are able to predict the occurrence (or absence) of controller-induced oscillations in the system's response. Such oscillations are undesirable since they increase variability, thereby resulting in a failure to meet the service level objective. We implement this controller for a real Lotus Notes system, and observe a remarkable correspondence between the behavior of the real system and the predictions of the analysis. This indicates that the control-theoretic analysis is sufficient to select controller parameters that meet the desired goals, and the need for simulations is reduced.
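
A toy version of that root-locus prediction can be sketched as follows (a hypothetical first-order plant with made-up coefficients a, b and gains ki, not the paper's fitted Lotus Notes model): the closed loop of y(k+1) = a·y(k) + b·u(k) under the integral law u(k+1) = u(k) + ki·(r − y(k)) has characteristic polynomial z^2 − (1 + a)z + (a + b·ki), and complex roots signal controller-induced oscillation.

    import numpy as np

    def closed_loop_poles(a, b, ki):
        # Poles of: y(k+1) = a*y(k) + b*u(k);  u(k+1) = u(k) + ki*(r - y(k)).
        # Characteristic polynomial: z^2 - (1 + a) z + (a + b*ki).
        return np.roots([1.0, -(1.0 + a), a + b * ki])

    def simulate(a, b, ki, r=10.0, steps=60):
        y, u, trace = 0.0, 0.0, []
        for _ in range(steps):
            y, u = a * y + b * u, u + ki * (r - y)  # RHS uses the old y and u
            trace.append(y)
        return trace

    a, b = 0.8, 0.5           # hypothetical identified plant, not the paper's fit
    for ki in (0.01, 0.3):    # low gain: smooth response; higher gain: oscillation
        poles = closed_loop_poles(a, b, ki)
        kind = "oscillatory" if np.iscomplex(poles).any() else "smooth"
        print(f"ki={ki}: {kind}, poles={poles}")

With these numbers the discriminant (1 + a)^2 − 4(a + b·ki) changes sign at ki = (1 − a)^2 / (4b) = 0.02, so ki = 0.01 converges smoothly while ki = 0.3 rings before settling, which is the kind of threshold the paper reads off the root locus.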

270 citations


Cited by
Irina Rish
01 Jan 2001
TL;DR: This work analyzes the impact of the distribution entropy on the classification error, showing that low-entropy feature distributions yield good performance of naive Bayes, and demonstrates that naive Bayes works well for certain nearly functional feature dependencies.
Abstract: The naive Bayes classifier greatly simplifies learning by assuming that features are independent given the class. Although independence is generally a poor assumption, in practice naive Bayes often competes well with more sophisticated classifiers. Our broad goal is to understand the data characteristics which affect the performance of naive Bayes. Our approach uses Monte Carlo simulations that allow a systematic study of classification accuracy for several classes of randomly generated problems. We analyze the impact of the distribution entropy on the classification error, showing that low-entropy feature distributions yield good performance of naive Bayes. We also demonstrate that naive Bayes works well for certain nearly functional feature dependencies, thus reaching its best performance in two opposite cases: completely independent features (as expected) and functionally dependent features (which is surprising). Another surprising result is that the accuracy of naive Bayes is not directly correlated with the degree of feature dependence measured as the class-conditional mutual information between the features. Instead, a better predictor of naive Bayes accuracy is the amount of information about the class that is lost because of the independence assumption.
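
To make the surprising functional-dependence case concrete, here is a minimal Bernoulli naive Bayes sketch (an illustration, not Rish's Monte Carlo setup): one informative feature duplicated five times violates independence as strongly as possible, yet the decision rule, and hence the accuracy, is unchanged.

    import numpy as np

    rng = np.random.default_rng(0)

    def fit_bernoulli_nb(X, y):
        # Per-class feature probabilities with Laplace smoothing, plus priors.
        params = {}
        for c in (0, 1):
            Xc = X[y == c]
            params[c] = ((Xc.sum(axis=0) + 1) / (len(Xc) + 2), len(Xc) / len(X))
        return params

    def predict(params, X):
        scores = []
        for c in (0, 1):
            p, prior = params[c]
            scores.append(np.log(prior) + X @ np.log(p) + (1 - X) @ np.log(1 - p))
        return (scores[1] > scores[0]).astype(int)

    # One informative feature, duplicated 5 times: fully dependent features.
    n = 4000
    y = rng.integers(0, 2, n)
    base = (rng.random(n) < np.where(y == 1, 0.8, 0.2)).astype(int)
    X = np.tile(base[:, None], (1, 5))  # functional dependency among features
    params = fit_bernoulli_nb(X[:2000], y[:2000])
    acc = (predict(params, X[2000:]) == y[2000:]).mean()
    print(f"accuracy with fully dependent features: {acc:.2f}")  # about 0.80

Duplicating the feature multiplies the log-odds by five but never flips their sign, so the classifier's decisions match those of the single informative feature, illustrating why dependence per se need not hurt naive Bayes accuracy.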

2,046 citations

Journal ArticleDOI
TL;DR: In this paper, the authors introduce a sublinear-space data structure called the count-min sketch for summarizing data streams. It allows fundamental queries such as point, range, and inner product queries to be answered approximately and very quickly, and it can be applied to several important data stream problems such as finding quantiles and frequent items.
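
A compact sketch of the data structure (the standard construction with width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉; an illustration rather than the authors' code, with Python's salted built-in hash standing in for pairwise-independent hash functions):

    import math
    import random

    class CountMinSketch:
        # estimate(x) >= true count of x, and estimate(x) <= true + eps * total
        # with probability at least 1 - delta (standard analysis).
        def __init__(self, eps=0.001, delta=0.01, seed=0):
            self.w = math.ceil(math.e / eps)           # width: ceil(e / eps)
            self.d = math.ceil(math.log(1.0 / delta))  # depth: ceil(ln(1/delta))
            self.table = [[0] * self.w for _ in range(self.d)]
            rnd = random.Random(seed)
            self.salts = [rnd.getrandbits(64) for _ in range(self.d)]

        def update(self, x, count=1):
            for row, salt in enumerate(self.salts):
                self.table[row][hash((salt, x)) % self.w] += count

        def estimate(self, x):
            # Collisions only inflate counters, so the minimum row is tightest.
            return min(self.table[row][hash((salt, x)) % self.w]
                       for row, salt in enumerate(self.salts))

    cms = CountMinSketch(eps=0.01, delta=0.01)
    for word in ["a", "b", "a", "c", "a"]:
        cms.update(word)
    print(cms.estimate("a"))  # 3; overestimates possible, never underestimates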

1,939 citations

Journal ArticleDOI
TL;DR: Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections.
Abstract: Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license ( https://github.com/marbl/mash ).
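
The MinHash core that Mash builds on fits in a few lines (an illustrative variant with salted hashes; the k-mer length and sketch size below are hypothetical choices, not Mash's defaults):

    import hashlib

    def kmers(seq, k=16):
        # The k-mer set of a sequence.
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def minhash_signature(items, num_hashes=128):
        # The minimum under each salted hash approximates the minimum
        # of an independent random permutation of the item universe.
        sig = []
        for salt in range(num_hashes):
            sig.append(min(
                int.from_bytes(hashlib.sha1(f"{salt}:{it}".encode()).digest()[:8], "big")
                for it in items))
        return sig

    def jaccard_estimate(sig_a, sig_b):
        # P[min_a == min_b] equals the Jaccard similarity |A ∩ B| / |A ∪ B|.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

Mash's contributions sit on top of this: it converts the Jaccard estimate into a pairwise mutation distance and attaches a P value significance test; those layers are specific to the paper and omitted from the sketch.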

1,886 citations

Journal ArticleDOI
TL;DR: This paper proposes a novel coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared with previously known schemes, and argues that the performance of the proposed scheme is within a constant factor of the information-theoretic optimum for all values of the problem parameters.
Abstract: Caching is a technique to reduce peak traffic rates by prefetching popular content into memories at the end users. Conventionally, these memories are used to deliver requested content in part from a locally cached copy rather than through the network. The gain offered by this approach, which we term local caching gain, depends on the local cache size (i.e., the memory available at each individual user). In this paper, we introduce and exploit a second, global, caching gain not utilized by conventional caching schemes. This gain depends on the aggregate global cache size (i.e., the cumulative memory available at all users), even though there is no cooperation among the users. To evaluate and isolate these two gains, we introduce an information-theoretic formulation of the caching problem focusing on its basic structure. For this setting, we propose a novel coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared with previously known schemes. In particular, the improvement can be on the order of the number of users in the network. In addition, we argue that the performance of the proposed scheme is within a constant factor of the information-theoretic optimum for all values of the problem parameters.
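
In hedged shorthand (symbols as commonly used for this setup: K users, N files, a cache of M files per user, rate measured in units of file size), the two gains appear as the two factors of the peak rate achieved by the scheme:

    R(M) \;=\; \underbrace{K\Bigl(1 - \frac{M}{N}\Bigr)}_{\text{local caching gain}} \cdot \underbrace{\frac{1}{1 + KM/N}}_{\text{global caching gain}}

The second factor scales with the aggregate cache size KM even without user cooperation, which is the multiplicative, up-to-order-K improvement over conventional uncoded caching that the abstract describes.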

1,857 citations

Journal ArticleDOI
TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data Fusion.
Abstract: The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation. This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely complete, concise, and consistent data, and highlights the challenges of data fusion, namely uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing whether and how data fusion is performed in each.
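
Although the article develops its techniques in relational algebra and SQL, the basic fuse step, one output record per real-world object with per-attribute conflict resolution, can be sketched as follows (hypothetical attribute names, data, and resolution strategies, for illustration only):

    from collections import Counter

    def fuse(records, strategies):
        # records: dicts describing the same real-world object.
        # strategies: attribute -> function resolving conflicting values.
        fused = {}
        for attr, resolve in strategies.items():
            values = [r[attr] for r in records if r.get(attr) is not None]
            fused[attr] = resolve(values) if values else None
        return fused

    sources = [  # made-up example records from three hypothetical sources
        {"name": "IBM Corp.", "employees": 282000, "founded": 1911},
        {"name": "IBM", "employees": 288000, "founded": 1911},
        {"name": "IBM", "employees": None, "founded": 1911},
    ]
    print(fuse(sources, {
        "name": lambda vs: Counter(vs).most_common(1)[0][0],  # majority vote
        "employees": lambda vs: max(vs),                      # prefer largest
        "founded": lambda vs: vs[0],                          # values agree
    }))

The choice of per-attribute strategy (voting, recency, source trust) is exactly where the uncertain and conflicting values the article highlights must be handled.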

1,797 citations