
Showing papers by "AT&T Labs" published in 2014


Journal ArticleDOI
TL;DR: An experiment- and simulation-based evaluation of full duplex (FD) as a potential mode in practical IEEE 802.11 networks, concluding that there are potentially significant benefits to be gained from including an FD mode in future WiFi standards.
Abstract: In this paper, we present an experiment- and simulation-based study to evaluate the use of full duplex (FD) as a potential mode in practical IEEE 802.11 networks. To enable the study, we designed a 20-MHz multiantenna orthogonal frequency-division-multiplexing (OFDM) FD physical layer and an FD media access control (MAC) protocol, which is backward compatible with current 802.11. Our extensive over-the-air experiments, simulations, and analysis demonstrate the following two results. First, the use of multiple antennas at the physical layer leads to a higher ergodic throughput than its hardware-equivalent multiantenna half-duplex (HD) counterparts for SNRs above the median SNR encountered in practical WiFi deployments. Second, the proposed MAC translates the physical layer rate gain into near doubling of throughput for multinode single-AP networks. The two results allow us to conclude that there are potentially significant benefits gained from including an FD mode in future WiFi standards.

552 citations


Proceedings ArticleDOI
18 Jun 2014
TL;DR: PrivBayes, a differentially private method for releasing high-dimensional data, circumvents the curse of dimensionality and introduces a novel approach that uses a surrogate function for mutual information to build its Bayesian network model more accurately.
Abstract: Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless. To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
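The noise-injection step at the heart of this approach can be illustrated with a toy sketch (not the paper's implementation): add Laplace noise, calibrated to the sensitivity of a counting query, to each cell of a low-dimensional marginal, then clip and renormalize into a distribution. The `noisy_marginal` helper and the example counts below are hypothetical.

```python
import numpy as np

def noisy_marginal(counts, epsilon, rng):
    # Laplace scale = sensitivity / epsilon; adding or removing one record
    # changes one cell count by at most 1, so the L1 sensitivity is 1.
    counts = np.asarray(counts, dtype=float)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)      # negative counts are meaningless
    total = noisy.sum()
    if total == 0:
        return np.full(counts.shape, 1.0 / counts.size)
    return noisy / total                   # normalized noisy distribution

rng = np.random.default_rng(0)
# hypothetical 2x3 marginal over two attributes, flattened to 6 cells
p = noisy_marginal([120, 80, 45, 30, 15, 10], epsilon=1.0, rng=rng)
```

Because noise is added to a handful of small marginals rather than to the full high-dimensional table, its magnitude stays comparable to the signal.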

433 citations


Proceedings Article
20 Aug 2014
TL;DR: This work presents a technique based on Principal Component Analysis (PCA) that models the behavior of normal users accurately and identifies significant deviations from it as anomalous; applied to detect click-spam in Facebook ads, it finds that a surprisingly large fraction of clicks are from anomalous users.
Abstract: Users increasingly rely on crowdsourced information, such as reviews on Yelp and Amazon, and liked posts and ads on Facebook. This has led to a market for blackhat promotion techniques via fake (e.g., Sybil) and compromised accounts, and collusion networks. Existing approaches to detect such behavior rely mostly on supervised (or semi-supervised) learning over known (or hypothesized) attacks. They are unable to detect attacks missed by the operator while labeling, or when the attacker changes strategy. We propose using unsupervised anomaly detection techniques over user behavior to distinguish potentially bad behavior from normal behavior. We present a technique based on Principal Component Analysis (PCA) that models the behavior of normal users accurately and identifies significant deviations from it as anomalous. We experimentally validate that normal user behavior (e.g., categories of Facebook pages liked by a user, rate of like activity, etc.) is contained within a low-dimensional subspace amenable to the PCA technique. We demonstrate the practicality and effectiveness of our approach using extensive ground-truth data from Facebook: we successfully detect diverse attacker strategies--fake, compromised, and colluding Facebook identities--with no a priori labeling while maintaining low false-positive rates. Finally, we apply our approach to detect click-spam in Facebook ads and find that a surprisingly large fraction of clicks are from anomalous users.
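The PCA idea can be sketched generically, under the paper's assumption that normal behavior lies in a low-dimensional subspace: project each behavior vector onto the top principal components and flag rows with a large residual norm. This is a residual-based sketch of ours, not the authors' code; the synthetic data is hypothetical.

```python
import numpy as np

def pca_residual_scores(X, k):
    # Fit the "normal" subspace on the whole population (assumed mostly
    # normal) and score each row by its distance from that subspace.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt = PCs
    proj = Xc @ Vt[:k].T @ Vt[:k]                       # projection onto top-k PCs
    return np.linalg.norm(Xc - proj, axis=1)            # residual norms

rng = np.random.default_rng(1)
normal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))  # rank-2 "normal" behavior
outlier = rng.normal(size=(1, 10)) * 5                         # off-subspace user
scores = pca_residual_scores(np.vstack([normal, outlier]), k=2)
```

The last row, which does not lie in the 2-dimensional "normal" subspace, receives the largest residual score.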

237 citations


Journal ArticleDOI
TL;DR: This paper primarily focuses on the unsupervised scenario where the labeled source domain training data is accompanied by unlabeled target domain test data, and presents a two-stage data-driven approach by generating intermediate data representations that could provide relevant information on the domain shift.
Abstract: With unconstrained data acquisition scenarios widely prevalent, the ability to handle changes in data distribution across training and testing data sets becomes important. One way to approach this problem is through domain adaptation, and in this paper we primarily focus on the unsupervised scenario where the labeled source domain training data is accompanied by unlabeled target domain test data. We present a two-stage data-driven approach by generating intermediate data representations that could provide relevant information on the domain shift. Starting with a linear representation of domains in the form of generative subspaces of same dimensions for the source and target domains, we first utilize the underlying geometry of the space of these subspaces, the Grassmann manifold, to obtain a `shortest' geodesic path between the two domains. We then sample points along the geodesic to obtain intermediate cross-domain data representations, using which a discriminative classifier is learnt to estimate the labels of the target data. We subsequently incorporate non-linear representation of domains by considering a Reproducing Kernel Hilbert Space representation, and a low-dimensional manifold representation using Laplacian Eigenmaps, and also examine other domain adaptation settings such as (i) semi-supervised adaptation where the target domain is partially labeled, and (ii) multi-domain adaptation where there could be more than one domain in source and/or target data sets. Finally, we supplement our adaptation technique with (i) fine-grained reference domains that are created by blending samples from source and target data sets to provide some evidence on the actual domain shift, and (ii) a multi-class boosting analysis to obtain robustness to the choice of algorithm parameters. We evaluate our approach for object recognition problems and report competitive results on two widely used Office and Bing adaptation data sets.

205 citations


Proceedings ArticleDOI
Barna Saha1, Divesh Srivastava1
19 May 2014
TL;DR: This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth ‘V’ of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three ‘V’s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data “speak for itself” in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading-off accuracy vs efficiency, and identifies a range of open problems for the community.

203 citations


Proceedings ArticleDOI
16 Jun 2014
TL;DR: This paper presents the first large-scale study characterizing the impact of cellular network performance on mobile video user engagement from the perspective of a network operator, and quantifies the effect that 31 different network factors have on user behavior in mobile video.
Abstract: Mobile network operators have a significant interest in the performance of streaming video on their networks because network dynamics directly influence the Quality of Experience (QoE). However, unlike video service providers, network operators are not privy to the client- or server-side logs typically used to measure key video performance metrics, such as user engagement. To address this limitation, this paper presents the first large-scale study characterizing the impact of cellular network performance on mobile video user engagement from the perspective of a network operator. Our study on a month-long anonymized data set from a major cellular network makes two main contributions. First, we quantify the effect that 31 different network factors have on user behavior in mobile video. Our results provide network operators direct guidance on how to improve user engagement --- for example, improving mean signal-to-interference ratio by 1 dB reduces the likelihood of video abandonment by 2%. Second, we model the complex relationships between these factors and video abandonment, enabling operators to monitor mobile video user engagement in real-time. Our model can predict whether a user completely downloads a video with more than 87% accuracy by observing only the initial 10 seconds of video streaming sessions. Moreover, our model achieves significantly better accuracy than prior models that require client- or server-side logs, yet we only use standard radio network statistics and/or TCP/IP headers available to network operators.

139 citations


Proceedings ArticleDOI
18 Jun 2014
TL;DR: In this paper, the authors present novel techniques for modeling correlations between sources, which can be much broader than copying, and apply them to truth finding, since a naive voting strategy that trusts data provided by the majority or at least a certain number of sources may not work well in the presence of such correlations.
Abstract: Many applications rely on Web data and extraction systems to accomplish knowledge-driven tasks. Web information is not curated, so many sources provide inaccurate or conflicting information. Moreover, extraction systems introduce additional noise to the data. We wish to automatically distinguish correct data and erroneous data for creating a cleaner set of integrated data. Previous work has shown that a naive voting strategy that trusts data provided by the majority or at least a certain number of sources may not work well in the presence of copying between the sources. However, correlation between sources can be much broader than copying: sources may provide data from complementary domains (negative correlation), extractors may focus on different types of information (negative correlation), and extractors may apply common rules in extraction (positive correlation, without copying). In this paper we present novel techniques for modeling correlations between sources and applying them in truth finding. We provide a comprehensive evaluation of our approach on three real-world datasets with different characteristics, as well as on synthetic data, showing that our algorithms outperform the existing state-of-the-art techniques.
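The naive baseline the paper argues against can be made concrete with a small sketch; the `(source, item, value)` triples below are hypothetical. A plain majority vote treats every source as independent, which is exactly the assumption that copying and shared extraction rules violate.

```python
from collections import Counter, defaultdict

def majority_vote(claims):
    # Naive baseline: for each data item, trust the value asserted by the
    # most sources, ignoring any correlation between those sources.
    by_item = defaultdict(list)
    for source, item, value in claims:
        by_item[item].append(value)
    return {item: Counter(values).most_common(1)[0][0]
            for item, values in by_item.items()}

claims = [  # hypothetical (source, item, value) triples
    ("s1", "capital_of_fr", "Paris"),
    ("s2", "capital_of_fr", "Paris"),
    ("s3", "capital_of_fr", "Lyon"),
    ("s1", "capital_of_de", "Berlin"),
    ("s3", "capital_of_de", "Bonn"),
]
truth = majority_vote(claims)
```

If "s2" merely copies "s1", the two votes for "Paris" carry no more evidence than one; the paper's contribution is to model such correlations explicitly instead of counting raw votes.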

131 citations


Journal ArticleDOI
Chao Tian1
TL;DR: A computer-aided proof approach based on primal and dual relations is developed, extending Yeung's linear programming method, which had previously been used only on information theoretic problems with a few random variables due to the exponential growth of the number of variables in the corresponding LP problem.
Abstract: Exact-repair regenerating codes are considered for the case (n,k,d)=(4,3,3), for which a complete characterization of the rate region is provided. This characterization answers in the affirmative the open question whether there exists a non-vanishing gap between the optimal bandwidth-storage tradeoff of the functional-repair regenerating codes (i.e., the cut-set bound) and that of the exact-repair regenerating codes. To obtain an explicit information theoretic converse, a computer-aided proof (CAP) approach based on primal and dual relation is developed. This CAP approach extends Yeung's linear programming (LP) method, which was previously only used on information theoretic problems with a few random variables due to the exponential growth of the number of variables in the corresponding LP problem. The symmetry in the exact-repair regenerating code problem allows an effective reduction of the number of variables, and together with several other problem-specific reductions, the LP problem is reduced to a manageable scale. For the achievability, only one non-trivial corner point of the rate region needs to be addressed in this case, for which an explicit binary code construction is given.

130 citations


Proceedings ArticleDOI
Vaneet Aggarwal1, Emir Halepovic1, Jeffrey Pang1, Shobha Venkataraman1, He Yan1 
26 Feb 2014
TL;DR: This paper uses machine learning to obtain a function that relates passive network measurements to an app's QoE, and shows with anonymized data that Prometheus can measure the QoE of real video-on-demand and VoIP apps with over 80% accuracy, which is close to or exceeds the accuracy of approaches suggested by domain experts.
Abstract: Cellular network operators are now expected to maintain a good Quality of Experience (QoE) for many services beyond circuit-switched voice and messaging. However, new smart-phone "app" services, such as Over The Top (OTT) video delivery, are not under an operator's control. Furthermore, complex interactions between network protocol layers make it challenging for operators to understand how network-level parameters (e.g., inactivity timers, handover thresholds, middle boxes) will influence a specific app's QoE. This paper takes a first step to address these challenges by presenting a novel approach to estimate app QoE using passive network measurements. Our approach uses machine learning to obtain a function that relates passive measurements to an app's QoE. In contrast to previous approaches, our approach does not require any control over app services or domain knowledge about how an app's network traffic relates to QoE. We implemented our approach in Prometheus, a prototype system in a large U.S. cellular operator. We show with anonymous data that Prometheus can measure the QoE of real video-on-demand and VoIP apps with over 80% accuracy, which is close to or exceeds the accuracy of approaches suggested by domain experts.

126 citations


Journal ArticleDOI
01 May 2014
TL;DR: This paper presents an end-to-end framework that can incrementally and efficiently update linkage results when data updates arrive, and allows merging records in the updates with existing clusters, but also allows leveraging new evidence from the updates to fix previous linkage errors.
Abstract: Record linkage clusters records such that each cluster corresponds to a single distinct real-world entity. It is a crucial step in data cleaning and data integration. In the big data era, the velocity of data updates is often high, quickly making previous linkage results obsolete. This paper presents an end-to-end framework that can incrementally and efficiently update linkage results when data updates arrive. Our algorithms not only allow merging records in the updates with existing clusters, but also allow leveraging new evidence from the updates to fix previous linkage errors. Experimental results on three real and synthetic data sets show that our algorithms can significantly reduce linkage time without sacrificing linkage quality.

119 citations


Journal ArticleDOI
TL;DR: In this paper, a two-span, 67-km space-division-multiplexed (SDM) wavelength-division-multiplexed (WDM) system is presented, incorporating the first reconfigurable optical add-drop multiplexer (ROADM) supporting spatial superchannels and the first cladding-pumped multicore erbium-doped fiber amplifier directly spliced to multicore transmission fiber.
Abstract: We report a two-span, 67-km space-division-multiplexed (SDM) wavelength-division-multiplexed (WDM) system incorporating the first reconfigurable optical add-drop multiplexer (ROADM) supporting spatial superchannels and the first cladding-pumped multicore erbium-doped fiber amplifier directly spliced to multicore transmission fiber. The ROADM subsystem utilizes two conventional 1 × 20 wavelength selective switches (WSS) each configured to implement a 7 × (1 × 2) WSS. ROADM performance tests indicate that the subchannel insertion losses, attenuation accuracies, and passband widths are well matched to each other and show no significant penalty, compared to the conventional operating mode for the WSS. For 6 × 40 × 128-Gb/s SDM-WDM polarization-multiplexed quadrature phase-shift-keyed (PM-QPSK) transmission on 50 GHz spacing, optical signal-to-noise ratio penalties are less than 1.6 dB in Add, Drop, and Express paths. In addition, we demonstrate the feasibility of utilizing joint signal processing of subchannels in this two-span, ROADM system.

Proceedings ArticleDOI
07 Sep 2014
TL;DR: A machine-learning-based mechanism to accurately infer web QoE metrics from network traces is devised, and a large-scale study characterizing the impact of network characteristics on web QoE is presented, using a month-long anonymized dataset collected from a major cellular network provider.
Abstract: Recent studies have shown that web browsing is one of the most prominent cellular applications. It is therefore important for cellular network operators to understand how radio network characteristics (such as signal strength, handovers, load, etc.) influence users' web browsing Quality-of-Experience (web QoE). Understanding the relationship between web QoE and network characteristics is a pre-requisite for cellular network operators to detect when and where degraded network conditions actually impact web QoE. Unfortunately, cellular network operators do not have access to detailed server-side or client-side logs to directly measure web QoE metrics, such as abandonment rate and session length. In this paper, we first devise a machine-learning-based mechanism to infer web QoE metrics from network traces accurately. We then present a large-scale study characterizing the impact of network characteristics on web QoE using a month-long anonymized dataset collected from a major cellular network provider. Our results show that improving signal-to-noise ratio, decreasing load and reducing handovers can improve user experience. We find that web QoE is very sensitive to inter-radio-access-technology (IRAT) handovers. We further find that higher radio data link rate does not necessarily lead to better web QoE. Since many network characteristics are interrelated, we also use machine learning to accurately model the influence of radio network characteristics on user experience metrics. This model can be used by cellular network operators to prioritize the improvement of network factors that most influence web QoE.

Journal ArticleDOI
TL;DR: This paper develops heuristics that identify 23,914 new AS links not visible in the publicly-available BGP data, analyzes properties of the Internet graph that includes these new links, and characterizes why they are missing.
Abstract: An accurate Internet topology graph is important in many areas of networking, from understanding ISP business relationships to diagnosing network anomalies. Most Internet mapping efforts have derived the network structure, at the level of interconnected autonomous systems (ASes), from a rather limited set of vantage points. In this paper, we argue that a promising approach to revealing the hidden areas of the Internet topology is through active measurement from an observation platform that scales with the growing Internet. By leveraging measurements performed by an extension to a popular P2P system, we show that this approach indeed exposes significant new topological information. Our study is based on traceroute measurements from more than 992,000 IPs in over 3,700 ASes distributed across the Internet hierarchy, many in regions of the Internet not covered by publicly available path information. To address this issue we develop heuristics that identify 23,914 new AS links not visible in the publicly-available BGP data: 12.86 percent more customer-provider links and 40.99 percent more peering links than previously reported. We validate our heuristics using data from a tier-1 ISP, and show that they successfully filter out all false links introduced by public IP-to-AS mapping. We analyze properties of the Internet graph that includes these new links and characterize why they are missing. Finally, we have made the identified set of links and their inferred relationships publicly available.

Proceedings ArticleDOI
02 Jun 2014
TL;DR: This paper takes a first comprehensive examination of the resource usage of mobile web browsing by focusing on two important types of resources: bandwidth and energy, using a novel traffic collection and analysis tool.
Abstract: Multiple entities in the smartphone ecosystem employ various methods to provide better web browsing experience. In this paper, we take a first comprehensive examination of the resource usage of mobile web browsing by focusing on two important types of resources: bandwidth and energy. Using a novel traffic collection and analysis tool, we examine a wide spectrum of important factors including protocol overhead, TCP connection management, web page content, traffic timing dynamics, caching efficiency, and compression usage, for the most popular 500 websites. Our findings suggest that all of the above factors at different layers can affect resource utilization for web browsing, as they often interact poorly with the underlying cellular networks. Based on our findings, we developed novel recommendations and detailed best practice suggestions for mobile web content, browser, network protocol, and smartphone OS design, to make mobile web browsing more resource efficient.

Journal ArticleDOI
01 Sep 2014
TL;DR: This paper solves the following data summarization problem: given a multi-dimensional data set augmented with a binary attribute, how can one construct an interpretable and informative summary of the factors affecting the binary attribute in terms of the combinations of values of the dimension attributes?
Abstract: In this paper, we solve the following data summarization problem: given a multi-dimensional data set augmented with a binary attribute, how can we construct an interpretable and informative summary of the factors affecting the binary attribute in terms of the combinations of values of the dimension attributes? We refer to such summaries as explanation tables. We show the hardness of constructing optimally-informative explanation tables from data, and we propose effective and efficient heuristics. The proposed heuristics are based on sampling and include optimizations related to computing the information content of a summary from a sample of the data. Using real data sets, we demonstrate the advantages of explanation tables compared to related approaches that can be adapted to solve our problem, and we show significant performance benefits of our optimizations.
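One greedy step of building such a summary can be illustrated with a toy sketch. The paper's actual heuristics use sampling and an information-content criterion; the simpler deviation-from-global-mean score and the example rows below are simplifications of ours.

```python
from collections import defaultdict

def best_single_pattern(rows, outcome_key):
    # Among single attribute=value patterns, pick the one whose mean outcome
    # deviates most from the global mean -- a crude stand-in for the
    # informativeness criterion behind explanation tables.
    global_mean = sum(r[outcome_key] for r in rows) / len(rows)
    stats = defaultdict(lambda: [0, 0])    # (attr, value) -> [count, positives]
    for r in rows:
        for a, v in r.items():
            if a == outcome_key:
                continue
            stats[(a, v)][0] += 1
            stats[(a, v)][1] += r[outcome_key]

    def deviation(item):
        (_, _), (count, pos) = item
        return abs(pos / count - global_mean)

    (attr, value), (count, pos) = max(stats.items(), key=deviation)
    return attr, value, pos / count

rows = [  # hypothetical rows with a binary outcome attribute "goal"
    {"day": "sat", "weather": "rain", "goal": 0},
    {"day": "sat", "weather": "sun",  "goal": 1},
    {"day": "sun", "weather": "sun",  "goal": 1},
    {"day": "mon", "weather": "rain", "goal": 0},
    {"day": "mon", "weather": "sun",  "goal": 0},
]
attr, value, rate = best_single_pattern(rows, "goal")
```

An explanation table is built by repeating such steps, each time adding the pattern that contributes the most new information about the binary attribute.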

Proceedings ArticleDOI
18 Jun 2014
TL;DR: A set of time-dependent metrics, including coverage, freshness and accuracy, are defined to characterize the quality of integrated data and it is shown how statistical models for the evolution of sources can be used to estimate these metrics.
Abstract: Data integration is a challenging task due to the large numbers of autonomous data sources. This necessitates the development of techniques to reason about the benefits and costs of acquiring and integrating data. Recently the problem of source selection (i.e., identifying the subset of sources that maximizes the profit from integration) was introduced as a preprocessing step before the actual integration. The problem was studied for static sources and used the accuracy of data fusion to quantify the integration profit. In this paper, we study the problem of source selection considering dynamic data sources whose content changes over time. We define a set of time-dependent metrics, including coverage, freshness and accuracy, to characterize the quality of integrated data. We show how statistical models for the evolution of sources can be used to estimate these metrics. While source selection is NP-complete, we show that for a large class of practical cases near-optimal solutions can be found; we propose an algorithmic framework with theoretical guarantees for our problem and show its effectiveness with an extensive experimental evaluation on both real-world and synthetic data.

Proceedings ArticleDOI
22 Aug 2014
TL;DR: It is found that (1) load increase in a cell causes dramatic bandwidth reduction on UEs and significantly degrades TCP performance, (2) seamless handover causes significant TCP losses while lossless handover increases TCP segments' delay.
Abstract: The popularity of smartphones and smartphone applications means that data is the dominant traffic type in current mobile networks. In this paper we present our work on a systematic investigation into facets of the LTE/EPC architecture that impact the performance of TCP as the predominant transport layer protocol used by applications on mobile networks. We found that (1) load increase in a cell causes dramatic bandwidth reduction on UEs and significantly degrades TCP performance, (2) seamless handover causes significant TCP losses while lossless handover increases TCP segments' delay.

Journal ArticleDOI
TL;DR: It is shown that BubbleSets is the best alternative for tasks involving group membership assessment; that visually encoding group information over basic node-link diagrams incurs an accuracy penalty of about 25 percent in solving network tasks; and that GMap's use of prominent group labels improves memorability.
Abstract: We present the results of evaluating four techniques for displaying group or cluster information overlaid on node-link diagrams: node coloring, GMap, BubbleSets, and LineSets. The contributions of the paper are threefold. First, we present quantitative results and statistical analyses of data from an online study in which approximately 800 subjects performed 10 types of group and network tasks in the four evaluated visualizations. Specifically, we show that BubbleSets is the best alternative for tasks involving group membership assessment; that visually encoding group information over basic node-link diagrams incurs an accuracy penalty of about 25 percent in solving network tasks; and that GMap's use of prominent group labels improves memorability. We also show that GMap's visual metaphor can be slightly altered to outperform BubbleSets in group membership assessment. Second, we discuss visual characteristics that can explain the observed quantitative differences in the four visualizations and suggest design recommendations. This discussion is supported by a small-scale eye-tracking study and previous results from the visualization literature. Third, we present an easily extensible user study methodology.

Proceedings ArticleDOI
10 Jun 2014
TL;DR: This paper formally defines an optimization problem based on a practical link data rate model, whose objective is to minimize power consumption while meeting user data rate requirements, and presents an effective algorithm to solve it in polynomial time.
Abstract: Device-to-Device (D2D) communication has emerged as a promising technique for improving capacity and reducing power consumption in wireless networks. Most existing works on D2D communications either targeted CDMA-based single-channel networks or aimed to maximize network throughput. In this paper, we, however, aim at enabling green D2D communications in OFDMA-based wireless networks. We formally define an optimization problem based on a practical link data rate model, whose objective is to minimize power consumption while meeting user data rate requirements. We then present an effective algorithm to solve it in polynomial time, which jointly determines mode selection, channel allocation and power assignment. It has been shown by extensive simulation results that the proposed algorithm can achieve over 57% power savings, compared to several baseline methods.

Proceedings ArticleDOI
02 Dec 2014
TL;DR: PARCEL splits functionality between the mobile device and the proxy based on their strengths, and in a manner distinct from both traditional browsers and existing cloud-heavy approaches, and results show PARCEL continues to perform well under client interactions, owing to its judicious functionality split.
Abstract: Today's web page download process is ill suited to cellular networks, resulting in high page load times and radio energy usage. While there have been notable prior attempts at tackling the challenge with assistance from proxies (cloud), achieving a responsive and energy-efficient browsing experience remains an elusive goal. In this paper, we make a fresh attempt at addressing the challenge by proposing PARCEL. PARCEL splits functionality between the mobile device and the proxy based on their strengths, and in a manner distinct from both traditional browsers and existing cloud-heavy approaches. We conduct extensive evaluations over an operational LTE network using a prototype implementation of PARCEL. Our results show that PARCEL reduces page load times by 49.6%, and radio energy consumption by 65% compared to traditional mobile web browsers. Further, our results show PARCEL continues to perform well under client interactions, owing to its judicious functionality split.

Proceedings ArticleDOI
16 May 2014
TL;DR: Extensions to this paradigm that open up the event to internal employees and preserve the open-ended nature of the hackathon itself are described.
Abstract: Hackathons have become an increasingly popular approach for organizations to both test their new products and services as well as to generate new ideas. Most events either focus on attracting external developers or requesting employees of the organization to focus on a specific problem. In this paper we describe extensions to this paradigm that open up the event to internal employees and preserve the open-ended nature of the hackathon itself. In this paper we describe our initial motivation and objectives for conducting an internal hackathon, our experience in pioneering an internal hackathon at AT&T including specific things we did to make the internal hackathon successful. We conclude with the benefits (both expected and unexpected) we achieved from the internal hackathon approach, and recommendations for continuing the use of this valuable tool within AT&T.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: A novel algorithm which uses compact hash bits to greatly improve the efficiency of non-linear kernel SVM in very large scale visual classification problems and proposes a novel hashing scheme for arbitrary non- linear kernels via random subspace projection in reproducing kernel Hilbert space.
Abstract: This paper presents a novel algorithm which uses compact hash bits to greatly improve the efficiency of non-linear kernel SVM in very large scale visual classification problems. Our key idea is to represent each sample with compact hash bits, over which an inner product is defined to serve as the surrogate of the original nonlinear kernels. Then the problem of solving the nonlinear SVM can be transformed into solving a linear SVM over the hash bits. The proposed Hash-SVM enjoys dramatic storage cost reduction owing to the compact binary representation, as well as a (sub-)linear training complexity via linear SVM. As a critical component of Hash-SVM, we propose a novel hashing scheme for arbitrary non-linear kernels via random subspace projection in reproducing kernel Hilbert space. Our comprehensive analysis reveals a well behaved theoretic bound of the deviation between the proposed hashing-based kernel approximation and the original kernel function. We also derive requirements on the hash bits for achieving a satisfactory accuracy level. Several experiments on large-scale visual classification benchmarks are conducted, including one with over 1 million images. The results show that Hash-SVM greatly reduces the computational complexity (more than ten times faster in many cases) while keeping comparable accuracies.
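The core trick in the abstract above — replacing kernel evaluations with an inner product over compact hash bits — can be illustrated with the classical sign-random-projection hash for the angular kernel. This is a simpler linear-space analogue of the paper's RKHS subspace projection, not the Hash-SVM scheme itself; the vectors, seed, and bit count below are illustrative:

```python
import math
import random

def sign_hash(x, projections):
    """Hash a vector to bits: the sign of each random projection."""
    return [1 if sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 else 0
            for p in projections]

def approx_angle(bits_a, bits_b):
    """Estimate the angle between two vectors from Hamming disagreement:
    P[bit differs] = angle / pi for Gaussian sign projections."""
    disagree = sum(a != b for a, b in zip(bits_a, bits_b))
    return math.pi * disagree / len(bits_a)

random.seed(0)
dim, n_bits = 8, 20000
projections = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

x = [1.0] * dim
y = [1.0] * 4 + [-1.0] * 4          # orthogonal to x, so true angle = pi/2

theta_hat = approx_angle(sign_hash(x, projections), sign_hash(y, projections))
print(theta_hat)  # close to pi/2
```

More bits tighten the approximation, which mirrors the paper's trade-off between hash length and accuracy.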

Journal ArticleDOI
04 Sep 2014
TL;DR: An insightful upper bound is provided on the average service delay of erasure-coded storage with arbitrary service time distribution and multiple heterogeneous files, superseding known delay bounds that only work for homogeneous files.
Abstract: Modern distributed storage systems offer large capacity to satisfy the exponentially increasing need of storage space. They often use erasure codes to protect against disk and node failures to increase reliability, while trying to meet the latency requirements of the applications and clients. This paper provides an insightful upper bound on the average service delay of such erasure-coded storage with arbitrary service time distribution and consisting of multiple heterogeneous files. Not only does the result supersede known delay bounds that only work for homogeneous files, it also enables a novel problem of joint latency and storage cost minimization over three dimensions: selecting the erasure code, placement of encoded chunks, and optimizing scheduling policy. The problem is efficiently solved via the computation of a sequence of convex approximations with provable convergence. We further prototype our solution in an open-source, cloud storage deployment over three geographically distributed data centers. Experimental results validate our theoretical delay analysis and show significant latency reduction, providing valuable insights into the proposed latency-cost tradeoff in erasure-coded storage.
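For intuition on why the erasure-code choice enters the latency, the simplest related calculation — the classical order-statistic bound, not the paper's heterogeneous-file bound — considers downloading any k of n chunks from servers with i.i.d. Exp(mu) service times. The k-th order statistic of n exponentials gives E[T] = (H_n - H_{n-k}) / mu, where H_m is the m-th harmonic number:

```python
def harmonic(m):
    """m-th harmonic number H_m = 1 + 1/2 + ... + 1/m (H_0 = 0)."""
    return sum(1.0 / i for i in range(1, m + 1))

def expected_k_of_n_delay(n, k, mu):
    """Expected time until k of n i.i.d. Exp(mu) downloads finish:
    the k-th order statistic of n exponential service times."""
    return (harmonic(n) - harmonic(n - k)) / mu

# A (7,4) MDS code waits for only the 4 fastest of 7 chunks, while
# reading the same data unencoded from 4 servers waits for all 4.
print(expected_k_of_n_delay(7, 4, 1.0))
print(expected_k_of_n_delay(4, 4, 1.0))
```

The extra parity chunks let the reader ignore stragglers, which is the latency side of the latency-cost trade-off the paper optimizes.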

Journal ArticleDOI
TL;DR: It is shown that the separation approach is optimal in two general scenarios and is approximately optimal in a third scenario, which generalizes the second scenario by allowing each source to be reconstructed at multiple destinations with different distortions.
Abstract: We consider the source-channel separation architecture for lossy source coding in communication networks. It is shown that the separation approach is optimal in two general scenarios and is approximately optimal in a third scenario. The two scenarios for which separation is optimal complement each other: the first is when the memoryless sources at source nodes are arbitrarily correlated, each of which is to be reconstructed at possibly multiple destinations within certain distortions, but the channels in this network are synchronized, orthogonal, and memoryless point-to-point channels; the second is when the memoryless sources are mutually independent, each of which is to be reconstructed only at one destination within a certain distortion, but the channels are general, including multi-user channels, such as multiple access, broadcast, interference, and relay channels, possibly with feedback. The third scenario, for which we demonstrate approximate optimality of source-channel separation, generalizes the second scenario by allowing each source to be reconstructed at multiple destinations with different distortions. For this case, the loss from optimality using the separation approach can be upper-bounded when a difference distortion measure is taken, and in the special case of quadratic distortion measure, this leads to universal constant bounds.

Proceedings ArticleDOI
30 Jun 2014
TL;DR: This work reduces the data placement problem to the well-studied problem of Graph Partitioning, which is NP-Hard but for which efficient approximation algorithms exist, and produces nearly-optimal solutions in seconds.
Abstract: With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, data-intensive scientific workflows and join-intensive queries cannot always be evaluated efficiently using MapReduce-style processing without extensive data migrations, which cause network congestion and reduced query throughput. In this paper, we study the problem of computing data placement strategies that minimize the data communication costs incurred by such workloads in a distributed setting. Our main contribution is a reduction of the data placement problem to the well-studied problem of Graph Partitioning, which is NP-Hard but for which efficient approximation algorithms exist. The novelty and significance of this result lie in representing the communication cost exactly and using standard graphs instead of hypergraphs, which were used in prior work on data placement that optimized for different objectives. We study several practical extensions of the problem: with load balancing, with replication, and with complex workflows consisting of multiple steps that may be computed on different servers. We provide integer linear programs (IPs) that may be used with any IP solver to find an optimal data placement. For the no-replication case, we use publicly available graph partitioning libraries (e.g., METIS) to efficiently compute nearly-optimal solutions. For the versions with replication, we introduce two heuristics that utilize the Graph Partitioning solution of the no-replication case. Using a workload based on TPC-DS, it may take an IP solver weeks to compute an optimal data placement, whereas our reduction produces nearly-optimal solutions in seconds.
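The reduction's objective can be seen on a toy instance: vertices are data objects, edge weights count how often two objects are accessed together, and a balanced partition minimizing the weight of cut edges minimizes cross-server traffic. A brute-force bisection — standing in for METIS on this tiny invented workload — looks like:

```python
from itertools import combinations

def min_cut_bisection(objects, co_access):
    """Exhaustively find the balanced 2-way split of `objects` that
    minimizes the total weight of co-access edges crossing the split."""
    best = (float("inf"), None)
    for group in combinations(objects, len(objects) // 2):
        a = set(group)
        cut = sum(w for (u, v), w in co_access.items()
                  if (u in a) != (v in a))
        if cut < best[0]:
            best = (cut, (sorted(a), sorted(set(objects) - a)))
    return best

# Tables t1/t2 and t3/t4 are frequently joined together.
objects = ["t1", "t2", "t3", "t4"]
co_access = {("t1", "t2"): 10, ("t3", "t4"): 8, ("t1", "t3"): 1}
cut, (server_a, server_b) = min_cut_bisection(objects, co_access)
print(cut, server_a, server_b)  # cut weight 1: {t1,t2} vs {t3,t4}
```

Real instances are far too large for enumeration, which is why the paper hands the partitioning step to approximation libraries such as METIS.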

Journal ArticleDOI
Xiang Zhou1, Lynn E. Nelson1
TL;DR: In this article, a systematic review of DSP-enabled technologies for high spectral efficiency (SE) 400 Gb/s-class and beyond optical networks is presented, and a training-assisted two-stage carrier phase recovery algorithm is proposed to address the detrimental cyclic phase slipping problem with minimal training overhead.
Abstract: This paper presents a systematic review of several digital signal processing (DSP)-enabled technologies recently proposed and demonstrated for high spectral efficiency (SE) 400 Gb/s–class and beyond optical networks. These include 1) a newly proposed SE-adaptable optical modulation technology—time-domain hybrid quadrature amplitude modulation (QAM), 2) two advanced transmitter side digital spectral shaping technologies—Nyquist signaling (for spectrally-efficient multiplexing) and digital pre-equalization (for improving tolerance toward channel narrowing effects), and 3) a newly proposed training-assisted two-stage carrier phase recovery algorithm that is designed to address the detrimental cyclic phase slipping problem with minimal training overhead. Additionally, this paper presents a novel DSP-based method for mitigation of equalizer-enhanced phase noise impairments. It is shown that performance degradation caused by the interaction between the long-memory chromatic dispersion compensating filter/equalizer and local oscillator laser phase noise can be effectively mitigated by replacing the commonly used fast single-tap phase-rotation-based equalizer (for typical carrier phase recovery) with a fast multi-tap linear equalizer. Finally, brief reviews of two high-SE 400 Gb/s-class WDM transmission experiments employing these advanced DSP algorithms are presented.

Patent
Elysia C. Tan1, Han Nguyen2
16 Oct 2014
TL;DR: In this article, an initial meta-file is transmitted in response to a request for the content, which identifies a division of the content stream into blocks, and available sources for delivery of the blocks.
Abstract: A method, apparatus and computer-readable storage medium distribute a non-live content stream in a network. An initial meta-file is transmitted in response to a request for the content, which identifies a division of the content stream into blocks, and available sources for delivery of the blocks. The initial meta-file can identify a first multicast and a second multicast server, assigning a first and second portion of the blocks for delivery using the first and second multicast source server, respectively. The first and second portions are transmitted using the first and second multicast source servers, respectively. The first and second portions correspond to distinct non-overlapping portions of the non-live content stream. The initial meta-file can also identify a unicast source server, assigning a third portion of the blocks for delivery using the unicast source server, the third portion being transmitted by the unicast source server.
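The assignment the meta-file describes can be modeled as a simple block-to-source map. The field names and the even three-way split below are invented for illustration; the patent does not prescribe a format:

```python
def build_meta_file(n_blocks, multicast_a, multicast_b, unicast):
    """Split non-live content blocks across two multicast source servers,
    with a trailing portion assigned to a unicast source server.
    The three portions are distinct and non-overlapping, per the abstract."""
    third = n_blocks // 3
    return {
        "sources": {
            multicast_a: list(range(0, third)),
            multicast_b: list(range(third, 2 * third)),
            unicast: list(range(2 * third, n_blocks)),
        },
    }

meta = build_meta_file(9, "mcast-1", "mcast-2", "ucast-1")
print(meta["sources"]["ucast-1"])  # [6, 7, 8]
```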

Journal ArticleDOI
TL;DR: This work considers power allocation for an access-controlled transmitter with energy harvesting capability based on causal observations of the channel fading state and proposes power allocation algorithms for both the finite- and infinite-horizon cases whose computational complexity is significantly lower than that of the standard discrete MDP method but with improved performance.
Abstract: We consider power allocation for an access-controlled transmitter with energy harvesting capability based on causal observations of the channel fading state. We assume that the system operates in a time-slotted fashion and the channel gain in each slot is a random variable which is independent across slots. Further, we assume that the transmitter is solely powered by a renewable energy source and the energy harvesting process can practically be predicted. With the additional access control for the transmitter and the maximum power constraint, we formulate the stochastic optimization problem of maximizing the achievable rate as a Markov decision process (MDP) with continuous state. To efficiently solve the problem, we define an approximate value function based on a piecewise linear fit in terms of the battery state. We show that with the approximate value function, the update in each iteration consists of a group of convex problems with a continuous parameter. Moreover, we derive the optimal solution to these convex problems in closed-form. Further, we propose power allocation algorithms for both the finite- and infinite-horizon cases, whose computational complexity is significantly lower than that of the standard discrete MDP method but with improved performance. Extension to the case of a general payoff function and imperfect energy prediction is also considered. Finally, simulation results demonstrate that the proposed algorithms closely approach the optimal performance.

Proceedings ArticleDOI
08 Jul 2014
TL;DR: In this article, the authors develop the Predictive Finite-horizon PF Scheduling ((PF)2S) Framework that exploits mobility and show that a user's channel state is highly reproducible and leverage that to develop a data rate prediction mechanism.
Abstract: Proportional Fair (PF) scheduling algorithms are the de facto standard in cellular networks. They exploit the users' channel state diversity (induced by fast-fading) and are optimal for stationary channel state distributions and an infinite time-horizon. However, mobile users experience a nonstationary channel, due to slow-fading (on the order of seconds), and are associated with base stations for short periods. Hence, we develop the Predictive Finite-horizon PF Scheduling ((PF)2S) Framework that exploits mobility. We present extensive channel measurement results from a 3G network and characterize mobility-induced channel state trends. We show that a user's channel state is highly reproducible and leverage that to develop a data rate prediction mechanism. We then present a few channel allocation estimation algorithms that exploit the prediction mechanism. Our trace-based simulations consider instances of the (PF)2S Framework composed of combinations of prediction and channel allocation estimation algorithms. They indicate that the framework can increase the throughput by 15%-55% compared to traditional PF schedulers, while improving fairness.
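As background for the abstract above, the classical PF rule — the baseline that the (PF)2S framework extends — schedules, in each slot, the user whose instantaneous rate is largest relative to its own smoothed average rate, then updates that average. The rates and the smoothing window below are illustrative:

```python
def pf_schedule(rates_per_slot, tc=10.0):
    """Classical proportional-fair scheduler.
    rates_per_slot: list of per-slot lists of instantaneous rates, one per user.
    Returns the index of the scheduled user for each slot."""
    n = len(rates_per_slot[0])
    avg = [1e-6] * n            # tiny positive averages avoid division by zero
    schedule = []
    for rates in rates_per_slot:
        # Serve the user with the best rate relative to its own average.
        user = max(range(n), key=lambda i: rates[i] / avg[i])
        schedule.append(user)
        # Exponentially-weighted moving average over a window of ~tc slots.
        for i in range(n):
            served = rates[i] if i == user else 0.0
            avg[i] = (1 - 1 / tc) * avg[i] + (1 / tc) * served
    return schedule

# User 0 always has the higher rate, yet PF still serves user 1 regularly:
sched = pf_schedule([[2.0, 1.0]] * 20)
print(sched)
```

Because the rule divides by each user's own average, a persistently unserved user's ratio grows until it wins a slot, which is the fairness property the abstract's finite-horizon variant preserves.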

Posted Content
TL;DR: In this paper, the authors present a data structure representing a dynamic set S of w-bit integers on a w-bit word RAM, supporting insert, delete, predecessor, rank, and select in O(log n / log w) time.
Abstract: We present a data structure representing a dynamic set S of w-bit integers on a w-bit word RAM. With |S| = n, w > log n, and space O(n), we support the following standard operations in O(log n / log w) time:
- insert(x) sets S = S + {x}.
- delete(x) sets S = S - {x}.
- predecessor(x) returns max{y in S | y <= x}.
- rank(x) returns #{y in S | y < x}.
- select(i) returns y in S with rank(y) = i, if any.
Our O(log n / log w) bound is optimal for dynamic rank and select, matching a lower bound of Fredman and Saks [STOC'89]. When the word length is large, our time bound is also optimal for dynamic predecessor, matching a static lower bound of Beame and Fich [STOC'99] whenever log n / log w = O(log w / log log w). Technically, the most interesting aspect of our data structure is that it supports all the above operations in constant time for sets of size n = w^{O(1)}. This resolves a main open problem of Ajtai, Komlos, and Fredman [FOCS'83]. Ajtai et al. presented such a data structure in Yao's abstract cell-probe model with w-bit cells/words, but pointed out that the functions used could not be implemented. As a partial solution to the problem, Fredman and Willard [STOC'90] introduced a fusion node that could handle queries in constant time, but used polynomial time on the updates. We call our small set data structure a dynamic fusion node as it does both queries and updates in constant time.
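The five operations above have simple reference semantics. A pure-Python model over a sorted list — capturing only what each operation returns, not the O(log n / log w) word-RAM structure — is:

```python
from bisect import bisect_left, bisect_right, insort

class DynamicSet:
    """Reference semantics for the abstract's operations, on a sorted list.
    Operations here cost O(n) or O(log n); the paper's bounds do not apply."""
    def __init__(self):
        self.items = []                       # sorted, distinct integers

    def member(self, x):
        i = bisect_left(self.items, x)
        return i < len(self.items) and self.items[i] == x

    def insert(self, x):
        if not self.member(x):                # S = S + {x}
            insort(self.items, x)

    def delete(self, x):                      # S = S - {x}
        i = bisect_left(self.items, x)
        if i < len(self.items) and self.items[i] == x:
            del self.items[i]

    def predecessor(self, x):
        """max{y in S | y <= x}, or None if no such y."""
        i = bisect_right(self.items, x)
        return self.items[i - 1] if i > 0 else None

    def rank(self, x):
        """#{y in S | y < x}."""
        return bisect_left(self.items, x)

    def select(self, i):
        """The y in S with rank(y) = i, if any."""
        return self.items[i] if 0 <= i < len(self.items) else None

s = DynamicSet()
for v in (5, 1, 9):
    s.insert(v)
print(s.predecessor(7), s.rank(9), s.select(0))  # 5 2 1
```

Note how predecessor uses bisect_right (largest y <= x) while rank uses bisect_left (strictly smaller elements), matching the asymmetric definitions in the abstract.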