Showing papers by "Yahoo!" published in 2015

Open Access

Book Chapter•DOI•

The Bitcoin Backbone Protocol: Analysis and Applications

Juan A. Garay¹, Aggelos Kiayias², Nikos Leonardos³•Institutions (3)

Yahoo!¹, National and Kapodistrian University of Athens², Paris Diderot University³

26 Apr 2015

TL;DR: In this paper, the authors extract and analyze the core of the Bitcoin protocol and prove two fundamental properties which they call common prefix and chain quality in the static setting where the number of players remains fixed.

Abstract: Bitcoin is the first and most popular decentralized cryptocurrency to date. In this work, we extract and analyze the core of the Bitcoin protocol, which we term the Bitcoin backbone, and prove two of its fundamental properties which we call common prefix and chain quality in the static setting where the number of players remains fixed. Our proofs hinge on appropriate and novel assumptions on the “hashing power” of the adversary relative to network synchronicity; we show our results to be tight under high synchronization.

1,128 citations

Proceedings Article•DOI•

Image retrieval using scene graphs

Justin Johnson¹, Ranjay Krishna¹, Michael Stark², Li-Jia Li³, David A. Shamma³, Michael S. Bernstein¹, Li Fei-Fei¹•Institutions (3)

Stanford University¹, Max Planck Society², Yahoo!³

07 Jun 2015

TL;DR: This work designs a conditional random field model that reasons about possible groundings of scene graphs to test images. The method outperforms retrieval methods that use only objects or low-level image features, and the full model improves object localization compared to baseline methods.

Abstract: This paper develops a novel framework for semantic image retrieval based on the notion of a scene graph. Our scene graphs represent objects (“man”, “boat”), attributes of objects (“boat is white”) and relationships between objects (“man standing on boat”). We use these scene graphs as queries to retrieve semantically related images. To this end, we design a conditional random field model that reasons about possible groundings of scene graphs to test images. The likelihoods of these groundings are used as ranking scores for retrieval. We introduce a novel dataset of 5,000 human-generated scene graphs grounded to images and use this dataset to evaluate our method for image retrieval. In particular, we evaluate retrieval using full scene graphs and small scene subgraphs, and show that our method outperforms retrieval methods that use only objects or low-level image features. In addition, we show that our full model can be used to improve object localization compared to baseline methods.
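As a rough illustration of the scene-graph structure described in the abstract, the following Python sketch stores objects, attributes, and relationships as plain data. The class and field names (`SceneGraph`, `objects`, `attributes`, `relationships`, `describe`) are hypothetical and are not taken from the paper. The conditional random field that grounds a graph to an image is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Hypothetical container for a scene graph (names are illustrative)."""
    objects: List[str] = field(default_factory=list)
    # (object index, attribute) pairs, e.g. (1, "white") for "boat is white"
    attributes: List[Tuple[int, str]] = field(default_factory=list)
    # (subject index, predicate, object index) triples
    relationships: List[Tuple[int, str, int]] = field(default_factory=list)

    def describe(self) -> List[str]:
        """Render attributes first, then relationships, as short phrases."""
        phrases = [f"{self.objects[i]} is {a}" for i, a in self.attributes]
        phrases += [
            f"{self.objects[s]} {p} {self.objects[o]}"
            for s, p, o in self.relationships
        ]
        return phrases


# The example from the abstract: a man standing on a white boat.
graph = SceneGraph(
    objects=["man", "boat"],
    attributes=[(1, "white")],
    relationships=[(0, "standing on", 1)],
)
```

Referring to objects by index rather than by name lets one graph hold two objects of the same class, for example two boats with different attributes.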

1,006 citations

Proceedings Article•DOI•

Hate Speech Detection with Comment Embeddings

Nemanja Djuric¹, Jing Zhou¹, Robin D. Morris¹, Mihajlo Grbovic¹, Vladan Radosavljevic¹, Narayan Bhamidipati¹•Institutions (1)

Yahoo!¹

18 May 2015

TL;DR: This work proposes to learn distributed low-dimensional representations of comments using recently proposed neural language models, that can then be fed as inputs to a classification algorithm, resulting in highly efficient and effective hate speech detectors.

Abstract: We address the problem of hate speech detection in online user comments. Hate speech, defined as an "abusive speech targeting specific group characteristics, such as ethnicity, religion, or gender", is an important problem plaguing websites that allow users to leave feedback, having a negative impact on their online business and overall user experience. We propose to learn distributed low-dimensional representations of comments using recently proposed neural language models, that can then be fed as inputs to a classification algorithm. Our approach addresses issues of high-dimensionality and sparsity that impact the current state-of-the-art, resulting in highly efficient and effective hate speech detectors.

630 citations

Proceedings Article•DOI•

Deep learning of binary hash codes for fast image retrieval

Kevin Lin¹, Huei-Fang Yang¹, Jen-Hao Hsiao², Chu-Song Chen¹•Institutions (2)

Academia Sinica¹, Yahoo!²

07 Jun 2015

TL;DR: This work proposes an effective deep learning framework to generate binary hash codes for fast image retrieval by employing a hidden layer for representing the latent concepts that dominate the class labels in convolutional neural networks.

Abstract: Approximate nearest neighbor search is an efficient strategy for large-scale image retrieval. Encouraged by the recent advances in convolutional neural networks (CNNs), we propose an effective deep learning framework to generate binary hash codes for fast image retrieval. Our idea is that when the data labels are available, binary codes can be learned by employing a hidden layer for representing the latent concepts that dominate the class labels. The utilization of the CNN also allows for learning image representations. Unlike other supervised methods that require pair-wised inputs for binary code learning, our method learns hash codes and image representations in a point-wised manner, making it suitable for large-scale datasets. Experimental results show that our method outperforms several state-of-the-art hashing algorithms on the CIFAR-10 and MNIST datasets. We further demonstrate its scalability and efficacy on a large-scale dataset of 1 million clothing images.
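To make the retrieval side of binary hashing concrete, here is a minimal Python sketch of ranking images by Hamming distance between binary codes. It assumes the codes come from thresholding real-valued hidden-layer activations at 0.5, which is an assumption of this sketch and is not stated in the abstract. The function names and the toy activations are hypothetical, and no network training is shown.

```python
def binarize(activations, threshold=0.5):
    """Turn real-valued activations into a tuple of 0/1 bits."""
    return tuple(1 if a >= threshold else 0 for a in activations)


def hamming(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(code_a, code_b))


def rank_by_hamming(query_code, database):
    """Return database keys sorted by Hamming distance to the query."""
    return sorted(database, key=lambda k: hamming(query_code, database[k]))


# Toy activations (made-up numbers, not results from the paper).
database = {
    "img_a": binarize([0.9, 0.1, 0.8, 0.2]),
    "img_b": binarize([0.1, 0.9, 0.2, 0.7]),
    "img_c": binarize([0.8, 0.2, 0.4, 0.1]),
}
query = binarize([0.95, 0.05, 0.9, 0.3])
ranking = rank_by_hamming(query, database)
```

Comparing short bit strings is cheap, which is why binary codes suit approximate nearest neighbor search over large collections.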

605 citations

Proceedings Article•DOI•

Multi-view Face Detection Using Deep Convolutional Neural Networks

Sachin Farfade¹, Mohammad Saberian¹, Li-Jia Li¹•Institutions (1)

Yahoo!¹

22 Jun 2015

TL;DR: This paper proposes Deep Dense Face Detector (DDFD), a method that does not require pose/landmark annotation and is able to detect faces in a wide range of orientations using a single model based on deep convolutional neural networks.

Abstract: In this paper we consider the problem of multi-view face detection. While there has been significant research on this problem, current state-of-the-art approaches for this task require annotation of facial landmarks, e.g. TSM [25], or annotation of face poses [28, 22]. They also require training dozens of models to fully capture faces in all orientations, e.g. 22 models in HeadHunter method [22]. In this paper we propose Deep Dense Face Detector (DDFD), a method that does not require pose/landmark annotation and is able to detect faces in a wide range of orientations using a single model based on deep convolutional neural networks. The proposed method has minimal complexity; unlike other recent deep learning object detection methods [9], it does not require additional components such as segmentation, bounding-box regression, or SVM classifiers. Furthermore, we analyzed scores of the proposed face detector for faces in different orientations and found that 1) the proposed method is able to detect faces from different angles and can handle occlusion to some extent, 2) there seems to be a correlation between distribution of positive examples in the training set and scores of the proposed face detector. The latter suggests that the proposed method's performance can be further improved by using better sampling strategies and more sophisticated data augmentation techniques. Evaluations on popular face detection benchmark datasets show that our single-model face detector algorithm has similar or better performance compared to the previous methods, which are more complex and require annotations of either different poses or facial landmarks.

552 citations

Proceedings Article•DOI•

TVSum: Summarizing web videos using titles

Yale Song¹, Jordi Vallmitjana¹, Amanda Stent¹, Alejandro Jaimes¹•Institutions (1)

Yahoo!¹

07 Jun 2015

TL;DR: A novel co-archetypal analysis technique is developed that learns canonical visual concepts shared between video and images, but not in either alone, by finding a joint-factorial representation of two data sets.

Abstract: Video summarization is a challenging problem in part because knowing which part of a video is important requires prior knowledge about its main topic. We present TVSum, an unsupervised video summarization framework that uses title-based image search results to find visually important shots. We observe that a video title is often carefully chosen to be maximally descriptive of its main topic, and hence images related to the title can serve as a proxy for important visual concepts of the main topic. However, because titles are free-formed, unconstrained, and often written ambiguously, images searched using the title can contain noise (images irrelevant to video content) and variance (images of different topics). To deal with this challenge, we developed a novel co-archetypal analysis technique that learns canonical visual concepts shared between video and images, but not in either alone, by finding a joint-factorial representation of two data sets. We introduce a new benchmark dataset, TVSum50, that contains 50 videos and their shot-level importance scores annotated via crowdsourcing. Experimental results on two datasets, SumMe and TVSum50, suggest our approach produces superior quality summaries compared to several recently proposed approaches.

528 citations

Proceedings Article•DOI•

Near optimal placement of virtual network functions

Rami Cohen, Liane Lewin-Eytan¹, Joseph (Seffi) Naor², Danny Raz³•Institutions (3)

Yahoo!¹, Technion – Israel Institute of Technology², Bell Labs³

24 Aug 2015

TL;DR: This paper performs a thorough study of the NFV location problem, shows that it introduces a new type of optimization problem, and provides near-optimal approximation algorithms guaranteeing a placement with theoretically proven performance.

Abstract: Network Function Virtualization (NFV) is a new networking paradigm where network functions are executed on commodity servers located in small cloud nodes distributed across the network, and where software defined mechanisms are used to control the network flows. This paradigm is a major turning point in the evolution of networking, as it introduces high expectations for enhanced economical network services, as well as major technical challenges. In this paper, we address one of the main technical challenges in this domain: the actual placement of the virtual functions within the physical network. This placement has a critical impact on the performance of the network, as well as on its reliability and operation cost. We perform a thorough study of the NFV location problem, show that it introduces a new type of optimization problems, and provide near optimal approximation algorithms guaranteeing a placement with theoretically proven performance. The performance of the solution is evaluated with respect to two measures: the distance cost between the clients and the virtual functions by which they are served, as well as the setup costs of these functions. We provide bi-criteria solutions reaching constant approximation factors with respect to the overall performance, and adhering to the capacity constraints of the networking infrastructure by a constant factor as well. Finally, using extensive simulations, we show that the proposed algorithms perform well in many realistic scenarios.

509 citations

Proceedings Article•DOI•

A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion

Alessandro Sordoni¹, Yoshua Bengio¹, Hossein Vahabi², Christina Lioma³, Jakob Grue Simonsen³, Jian-Yun Nie¹•Institutions (3)

Université de Montréal¹, Yahoo!², University of Copenhagen³

17 Oct 2015

TL;DR: This work presents a novel hierarchical recurrent encoder-decoder architecture that makes it possible to account for sequences of previous queries of arbitrary lengths and is sensitive to the order of queries in the context while avoiding data sparsity.

Abstract: Users may strive to formulate an adequate textual query for their information need. Search engines assist the users by presenting query suggestions. To preserve the original search intent, suggestions should be context-aware and account for the previous queries issued by the user. Achieving context awareness is challenging due to data sparsity. We present a novel hierarchical recurrent encoder-decoder architecture that makes it possible to account for sequences of previous queries of arbitrary lengths. As a result, our suggestions are sensitive to the order of queries in the context while avoiding data sparsity. Additionally, our model can suggest for rare, or long-tail, queries. The produced suggestions are synthetic and are sampled one word at a time, using computationally cheap decoding techniques. This is in contrast to current synthetic suggestion models relying upon machine learning pipelines and hand-engineered feature sets. Results show that our model outperforms existing context-aware approaches in a next query prediction setting. In addition to query suggestion, our architecture is general enough to be used in a variety of other applications.

437 citations

Journal Article•DOI•

YFCC100M: The New Data in Multimedia Research

Bart Thomee¹, David A. Shamma¹, Gerald Friedland², Benjamin Elizalde², Karl Ni³, Douglas N. Poland³, Damian Borth², Li-Jia Li¹•Institutions (3)

Yahoo!¹, International Computer Science Institute², Lawrence Livermore National Laboratory³

05 Mar 2015-arXiv: Multimedia

TL;DR: The Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) is a collection of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all of which carry a Creative Commons license.

Abstract: We present the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M), the largest public multimedia collection that has ever been released. The dataset contains a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all of which carry a Creative Commons license. Each media object in the dataset is represented by several pieces of metadata, e.g. Flickr identifier, owner name, camera, title, tags, geo, media source. The collection provides a comprehensive snapshot of how photos and videos were taken, described, and shared over the years, from the inception of Flickr in 2004 until early 2014. In this article we explain the rationale behind its creation, as well as the implications the dataset has for science, research, engineering, and development. We further present several new challenges in multimedia research that can now be expanded upon with our dataset.

401 citations

Proceedings Article•DOI•

Generic and Scalable Framework for Automated Time-series Anomaly Detection

Nikolay Laptev¹, Saeed Amizadeh¹, Ian Flint¹•Institutions (1)

Yahoo!¹

10 Aug 2015

TL;DR: This paper introduces a generic and scalable framework for automated anomaly detection on large-scale time-series data. The open-sourcing of the data represents a first-of-its-kind effort to establish a standard benchmark for anomaly detection.

Abstract: This paper introduces a generic and scalable framework for automated anomaly detection on large scale time-series data. Early detection of anomalies plays a key role in maintaining consistency of person's data and protects corporations against malicious attackers. Current state of the art anomaly detection approaches suffer from scalability, use-case restrictions, difficulty of use and a large number of false positives. Our system at Yahoo, EGADS, uses a collection of anomaly detection and forecasting models with an anomaly filtering layer for accurate and scalable anomaly detection on time-series. We compare our approach against other anomaly detection systems on real and synthetic data with varying time-series characteristics. We found that our framework allows for 50-60% improvement in precision and recall for a variety of use-cases. Both the data and the framework are being open-sourced. The open-sourcing of the data, in particular, represents the first of its kind effort to establish the standard benchmark for anomaly detection.

390 citations

Proceedings Article•DOI•

Profiling a warehouse-scale computer

Svilen Kanev¹, Juan Pablo Darago², Kim Hazelwood³, Parthasarathy Ranganathan⁴, Tipp Moseley⁴, Gu-Yeon Wei¹, David Brooks¹•Institutions (4)

Harvard University¹, University of Buenos Aires², Yahoo!³, Google⁴

13 Jun 2015

TL;DR: A detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three year period, and comprising thousands of different applications finds that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss.

Abstract: With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three year period, and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.

Posted Content•

Learning Complexity-Aware Cascades for Deep Pedestrian Detection

Zhaowei Cai¹, Mohammad Saberian², Nuno Vasconcelos¹•Institutions (2)

University of California, San Diego¹, Yahoo!²

19 Jul 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a new cascade design procedure is introduced, by formulating cascade learning as the Lagrangian optimization of a risk that accounts for both accuracy and complexity, and a boosting algorithm, denoted as complexity aware cascade training, is derived to solve this optimization.

Abstract: The design of complexity-aware cascaded detectors, combining features of very different complexities, is considered. A new cascade design procedure is introduced, by formulating cascade learning as the Lagrangian optimization of a risk that accounts for both accuracy and complexity. A boosting algorithm, denoted as complexity aware cascade training (CompACT), is then derived to solve this optimization. CompACT cascades are shown to seek an optimal trade-off between accuracy and complexity by pushing features of higher complexity to the later cascade stages, where only a few difficult candidate patches remain to be classified. This enables the use of features of vastly different complexities in a single detector. As a result, the feature pool can be expanded to features previously impractical for cascade design, such as the responses of a deep convolutional neural network (CNN). This is demonstrated through the design of a pedestrian detector with a pool of features whose complexities span orders of magnitude. The resulting cascade generalizes the combination of a CNN with an object proposal mechanism: rather than a pre-processing stage, CompACT cascades seamlessly integrate CNNs in their stages. This enables state-of-the-art performance on the Caltech and KITTI datasets, at fairly fast speeds.

Journal Article•DOI•

Memristor-Based Multilayer Neural Networks With Online Gradient Descent Training

Daniel Soudry¹, Dotan Di Castro², Asaf Gal³, Avinoam Kolodny³, Shahar Kvatinsky⁴•Institutions (4)

Columbia University¹, Yahoo!², Technion – Israel Institute of Technology³, Stanford University⁴

14 Jan 2015-IEEE Transactions on Neural Networks

TL;DR: The proposed memristor-based circuit can compactly implement hardware MNNs trainable by scalable algorithms based on online gradient descent (e.g., backpropagation). Its utility and robustness are demonstrated on standard supervised learning tasks.

Abstract: Learning in multilayer neural networks (MNNs) relies on continuous updating of large matrices of synaptic weights by local rules. Such locality can be exploited for massive parallelism when implementing MNNs in hardware. However, these update rules require a multiply and accumulate operation for each synaptic weight, which is challenging to implement compactly using CMOS. In this paper, a method for performing these update operations simultaneously (incremental outer products) using memristor-based arrays is proposed. The method is based on the fact that, approximately, given a voltage pulse, the conductivity of a memristor will increment proportionally to the pulse duration multiplied by the pulse magnitude if the increment is sufficiently small. The proposed method uses a synaptic circuit composed of a small number of components per synapse: one memristor and two CMOS transistors. This circuit is expected to consume between 2% and 8% of the area and static power of previous CMOS-only hardware alternatives. Such a circuit can compactly implement hardware MNNs trainable by scalable algorithms based on online gradient descent (e.g., backpropagation). The utility and robustness of the proposed memristor-based circuit are demonstrated on standard supervised learning tasks.
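The "incremental outer product" mentioned in the abstract can be written down in software as follows. This is a plain-Python sketch of the standard per-sample weight update used in online gradient descent, not a model of the memristor circuit. The function name, the learning rate `lr`, and the sign convention (the error signal `delta` is assumed to be defined so that the update is added) are assumptions of this sketch.

```python
def outer_product_update(weights, x, delta, lr):
    """Return weights + lr * outer(delta, x) for one training sample.

    weights: list of rows, one row per output unit
    x:       input vector (one entry per column)
    delta:   error signal (one entry per row); sign convention assumed
    """
    return [
        [w + lr * d * xi for w, xi in zip(row, x)]
        for row, d in zip(weights, delta)
    ]


weights = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]
delta = [0.5, -1.0]
updated = outer_product_update(weights, x, delta, lr=0.1)
```

Each weight changes by a product of two numbers, one tied to its row and one to its column. That matches the abstract's observation that a memristor's conductivity increment is approximately proportional to pulse duration multiplied by pulse magnitude, so one factor could, for instance, set the duration and the other the magnitude.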

Proceedings Article•DOI•

Video co-summarization: Video summarization by visual co-occurrence

Wen-Sheng Chu¹, Yale Song², Alejandro Jaimes²•Institutions (2)

Carnegie Mellon University¹, Yahoo!²

07 Jun 2015

TL;DR: The results suggest that summaries generated by visual co-occurrence tend to match more closely with human generated summaries, when compared to several popular unsupervised techniques.

Abstract: We present video co-summarization, a novel perspective to video summarization that exploits visual co-occurrence across multiple videos. Motivated by the observation that important visual concepts tend to appear repeatedly across videos of the same topic, we propose to summarize a video by finding shots that co-occur most frequently across videos collected using a topic keyword. The main technical challenge is dealing with the sparsity of co-occurring patterns, out of hundreds to possibly thousands of irrelevant shots in videos being considered. To deal with this challenge, we developed a Maximal Biclique Finding (MBF) algorithm that is optimized to find sparsely co-occurring patterns, discarding less co-occurring patterns even if they are dominant in one video. Our algorithm is parallelizable with closed-form updates, thus can easily scale up to handle a large number of videos simultaneously. We demonstrate the effectiveness of our approach on motion capture and self-compiled YouTube datasets. Our results suggest that summaries generated by visual co-occurrence tend to match more closely with human generated summaries, when compared to several popular unsupervised techniques.

Proceedings Article•DOI•

Learning Complexity-Aware Cascades for Deep Pedestrian Detection

Zhaowei Cai¹, Mohammad Saberian², Nuno Vasconcelos¹•Institutions (2)

University of California, San Diego¹, Yahoo!²

07 Dec 2015

TL;DR: CompACT cascades are shown to seek an optimal trade-off between accuracy and complexity by pushing features of higher complexity to the later cascade stages, where only a few difficult candidate patches remain to be classified.

Posted Content•

Cascading Bandits: Learning to Rank in the Cascade Model

Branislav Kveton¹, Csaba Szepesvári², Zheng Wen³, Azin Ashkan•Institutions (3)

Adobe Systems¹, University of Alberta², Yahoo!³

10 Feb 2015-arXiv: Learning

TL;DR: This paper proposes cascading bandits, a learning variant of the cascade model where the objective is to identify the $K$ most attractive items, and formulates the problem as a stochastic combinatorial partial monitoring problem.

Abstract: A search engine usually outputs a list of $K$ web pages. The user examines this list, from the first web page to the last, and chooses the first attractive page. This model of user behavior is known as the cascade model. In this paper, we propose cascading bandits, a learning variant of the cascade model where the objective is to identify $K$ most attractive items. We formulate our problem as a stochastic combinatorial partial monitoring problem. We propose two algorithms for solving it, CascadeUCB1 and CascadeKL-UCB. We also prove gap-dependent upper bounds on the regret of these algorithms and derive a lower bound on the regret in cascading bandits. The lower bound matches the upper bound of CascadeKL-UCB up to a logarithmic factor. We experiment with our algorithms on several problems. The algorithms perform surprisingly well even when our modeling assumptions are violated.
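The cascade feedback described in the abstract can be sketched in Python as below. This is a simplified illustration in the spirit of a UCB-style algorithm, not a faithful reproduction of CascadeUCB1: the confidence-radius constant 1.5, the use of `log(t)`, and all function names are assumptions of this sketch. The key point it shows is the partial feedback: only items up to and including the clicked position are observed.

```python
import math
import random


def ucb(mean, pulls, t):
    """UCB1-style index; unobserved items get +inf so they are tried first."""
    if pulls == 0:
        return math.inf
    return mean + math.sqrt(1.5 * math.log(t) / pulls)


def simulate_click(ranked, attraction, rng):
    """Cascade model: position of the first attractive item, or None."""
    for pos, item in enumerate(ranked):
        if rng.random() < attraction[item]:
            return pos
    return None


def update(means, pulls, ranked, click_pos):
    """Update only examined items: those up to and including the click."""
    last = len(ranked) - 1 if click_pos is None else click_pos
    for pos in range(last + 1):
        item = ranked[pos]
        reward = 1.0 if pos == click_pos else 0.0
        pulls[item] += 1
        means[item] += (reward - means[item]) / pulls[item]


def run(attraction, k, steps, seed=0):
    """Recommend the k items with the highest index at every step."""
    rng = random.Random(seed)
    n = len(attraction)
    means, pulls = [0.0] * n, [0] * n
    for t in range(1, steps + 1):
        scores = [ucb(means[i], pulls[i], t) for i in range(n)]
        ranked = sorted(range(n), key=lambda i: -scores[i])[:k]
        update(means, pulls, ranked, simulate_click(ranked, attraction, rng))
    return means, pulls
```

Calling `run([0.9, 0.1, 0.1], k=2, steps=50)` returns the estimated attraction probabilities and the observation counts per item. The attraction values here are made up for illustration.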

Journal Article•DOI•

Optimal Demand Response Using Device-Based Reinforcement Learning

Zheng Wen¹, Daniel O'Neill², Hamid Reza Maei³•Institutions (3)

Yahoo!¹, Stanford University², Samsung³

16 Feb 2015-IEEE Transactions on Smart Grid

TL;DR: This paper formulates a fully automated EMS's rescheduling problem as a reinforcement learning (RL) problem, and argues that this RL problem can be approximately solved by decomposing it over device clusters.

Abstract: Demand response (DR) for residential and small commercial buildings is estimated to account for as much as 65% of the total energy savings potential of DR, and previous work shows that a fully automated energy management system (EMS) is a necessary prerequisite to DR in these areas. In this paper, we propose a novel EMS formulation for DR problems in these sectors. Specifically, we formulate a fully automated EMS’s rescheduling problem as a reinforcement learning (RL) problem, and argue that this RL problem can be approximately solved by decomposing it over device clusters. Compared with existing formulations, our new formulation does not require explicitly modeling the user’s dissatisfaction on job rescheduling, enables the EMS to self-initiate jobs, allows the user to initiate more flexible requests, and has a computational complexity linear in the number of device clusters. We also demonstrate the simulation results of applying Q-learning, one of the most popular and classical RL algorithms, to a representative example.
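For readers unfamiliar with Q-learning, the following Python sketch shows a single tabular update step. It is the textbook rule, not the paper's EMS formulation: the state and action names (`job_pending`, `run_now`, `defer`), the reward value, and the step size `alpha` and discount `gamma` are all hypothetical.

```python
def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning step.

    Moves Q(s, a) toward reward + gamma * max over a' of Q(s', a').
    Missing table entries are treated as 0.0.
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


# Hypothetical device-level example: defer a pending job and get reward 1.0.
actions = ["run_now", "defer"]
q = {}
value = q_update(q, "job_pending", "defer", reward=1.0,
                 next_state="job_pending", actions=actions)
```

The table needs one entry per state-action pair, so keeping a separate small table per device cluster is one way a decomposition could keep the overall cost linear in the number of clusters, as the abstract claims for its formulation.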

Proceedings Article•DOI•

E-commerce in Your Inbox: Product Recommendations at Scale

Mihajlo Grbovic¹, Vladan Radosavljevic¹, Nemanja Djuric¹, Narayan Bhamidipati¹, Jaikit Savla¹, Varun Bhagwan¹, Doug Sharp¹•Institutions (1)

Yahoo!¹

10 Aug 2015

TL;DR: This paper describes a system that leverages user purchase history determined from e-mail receipts to deliver highly personalized product ads to Yahoo Mail users. Its recommendation algorithm was evaluated against baselines that included showing popular products and products predicted based on co-occurrence.

Abstract: In recent years online advertising has become increasingly ubiquitous and effective. Advertisements shown to visitors fund sites and apps that publish digital content, manage social networks, and operate e-mail services. Given such large variety of internet resources, determining an appropriate type of advertising for a given platform has become critical to financial success. Native advertisements, namely ads that are similar in look and feel to content, have had great success in news and social feeds. However, to date there has not been a winning formula for ads in e-mail clients. In this paper we describe a system that leverages user purchase history determined from e-mail receipts to deliver highly personalized product ads to Yahoo Mail users. We propose to use a novel neural language-based algorithm specifically tailored for delivering effective product recommendations, which was evaluated against baselines that included showing popular products and products predicted based on co-occurrence. We conducted rigorous offline testing using a large-scale product purchase data set, covering purchases of more than 29 million users from 172 e-commerce websites. Ads in the form of product recommendations were successfully tested on online traffic, where we observed a steady 9% lift in click-through rates over other ad formats in mail, as well as comparable lift in conversion rates. Following successful tests, the system was launched into production during the holiday season of 2014.

Proceedings Article•DOI•

Multi-label Cross-Modal Retrieval

Viresh Ranjan¹, Nikhil Rasiwasia², C. V. Jawahar¹•Institutions (2)

International Institute of Information Technology, Hyderabad¹, Yahoo!²

07 Dec 2015

TL;DR: Multi-label Canonical Correlation Analysis (ml-CCA), an extension of CCA, is introduced for learning shared subspaces taking into account high level semantic information in the form of multi-label annotations, which results in a discriminative subspace which is better suited for cross-modal retrieval tasks.

Abstract: In this work, we address the problem of cross-modal retrieval in presence of multi-label annotations. In particular, we introduce multi-label Canonical Correlation Analysis (ml-CCA), an extension of CCA, for learning shared subspaces taking into account high level semantic information in the form of multi-label annotations. Unlike CCA, ml-CCA does not rely on explicit pairing between modalities, instead it uses the multi-label information to establish correspondences. This results in a discriminative subspace which is better suited for cross-modal retrieval tasks. We also present Fast ml-CCA, a computationally efficient version of ml-CCA, which is able to handle large scale datasets. We show the efficacy of our approach by conducting extensive cross-modal retrieval experiments on three standard benchmark datasets. The results show that the proposed approach achieves state of the art retrieval performance on the three datasets.

Proceedings Article•DOI•

Fast and Space-Efficient Entity Linking for Queries

Roi Blanco¹, Giuseppe Ottaviano², Edgar Meij¹•Institutions (2)

Yahoo!¹, Istituto di Scienza e Tecnologie dell'Informazione²

02 Feb 2015

TL;DR: This paper proposes a probabilistic model that leverages user-generated information on the web to link queries to entities in a knowledge base and significantly outperforms several state-of-the-art baselines while being able to process queries in sub-millisecond times---at least two orders of magnitude faster than existing systems.

Abstract: Entity linking deals with identifying entities from a knowledge base in a given piece of text and has become a fundamental building block for web search engines, enabling numerous downstream improvements from better document ranking to enhanced search results pages. A key problem in the context of web search queries is that this process needs to run under severe time constraints as it has to be performed before any actual retrieval takes place, typically within milliseconds. In this paper we propose a probabilistic model that leverages user-generated information on the web to link queries to entities in a knowledge base. There are three key ingredients that make the algorithm fast and space-efficient. First, the linking process ignores any dependencies between the different entity candidates, which allows for an $O(k^2)$ implementation in the number of query terms. Second, we leverage hashing and compression techniques to reduce the memory footprint. Finally, to equip the algorithm with contextual knowledge without sacrificing speed, we factor the distance between distributional semantics of the query words and entities into the model. We show that our solution significantly outperforms several state-of-the-art baselines by more than 14% while being able to process queries in sub-millisecond times, at least two orders of magnitude faster than existing systems.
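
To illustrate why scoring candidates independently keeps the work quadratic in the number of query terms, here is a minimal sketch. It is not the paper's model: the alias table `TOY_ALIASES`, its probabilities, and the function names are hypothetical, and the sketch only enumerates the k(k+1)/2 contiguous segments of a query and scores each one on its own.

```python
from typing import Dict, List, Tuple


def contiguous_segments(terms: List[str]) -> List[Tuple[int, int]]:
    """Return all (start, end) spans of the query: k*(k+1)/2 spans for k terms."""
    k = len(terms)
    return [(i, j) for i in range(k) for j in range(i + 1, k + 1)]


def link_query(
    query: str, alias_scores: Dict[str, Dict[str, float]]
) -> List[Tuple[str, str, float]]:
    """Score every segment independently; return (segment, entity, score), best first."""
    terms = query.lower().split()
    hits = []
    for i, j in contiguous_segments(terms):
        segment = " ".join(terms[i:j])
        candidates = alias_scores.get(segment)
        if candidates:
            # No dependency between segments: each one picks its own best entity.
            entity, score = max(candidates.items(), key=lambda kv: kv[1])
            hits.append((segment, entity, score))
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Hypothetical alias table: segment -> {entity: score}. All values are made up.
TOY_ALIASES = {
    "new york": {"New_York_City": 0.8, "New_York_(state)": 0.2},
    "york": {"York": 0.9},
    "pizza": {"Pizza": 1.0},
}
```

Calling `link_query("new york pizza", TOY_ALIASES)` scores six segments and returns the three that appear in the toy table. A real system would replace the dictionary lookup with the hashed and compressed structures the abstract mentions.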

Proceedings Article•DOI•

Ground Truth for Grammaticality Correction Metrics

Courtney Napoles¹, Keisuke Sakaguchi¹, Matt Post¹, Joel Tetreault²•Institutions (2)

Johns Hopkins University¹, Yahoo!²

01 Jul 2015

TL;DR: The first human evaluation of GEC system outputs is conducted, and it is shown that the rankings produced by metrics such as MaxMatch and I-measure do not correlate well with this ground truth.

Abstract: How do we know which grammatical error correction (GEC) system is best? A number of metrics have been proposed over the years, each motivated by weaknesses of previous metrics; however, the metrics themselves have not been compared to an empirical gold standard grounded in human judgments. We conducted the first human evaluation of GEC system outputs, and show that the rankings produced by metrics such as MaxMatch and I-measure do not correlate well with this ground truth. As a step towards better metrics, we also propose GLEU, a simple variant of BLEU, modified to account for both the source and the reference, and show that it hews much more closely to human judgments.
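
One common way to check whether a metric's system ranking agrees with a human ranking is a rank correlation. The sketch below computes Spearman's rho for two tie-free rankings. The abstract does not say which statistic the authors used, so the choice of Spearman's rho and the function name are assumptions made for illustration.

```python
from typing import Sequence


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman rank correlation for two rankings of the same systems (no ties)."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must cover the same number of systems")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two systems to correlate")
    # Sum of squared rank differences, plugged into the closed form for tie-free ranks.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

A value near 1 means the metric orders systems the way the human judges do; a value near 0 or below means it does not.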

Journal Article•DOI•

Randomized Dimensionality Reduction for $k$-Means Clustering

Christos Boutsidis¹, Anastasios Zouzias², Michael W. Mahoney³, Petros Drineas⁴•Institutions (4)

Yahoo!¹, University of Toronto², University of California, Berkeley³, Rensselaer Polytechnic Institute⁴

01 Feb 2015-IEEE Transactions on Information Theory

TL;DR: In this article, the authors presented the first provably accurate feature selection method for $k$-means clustering and, in addition, two feature extraction methods for clustering.

Abstract: We study the topic of dimensionality reduction for $k$-means clustering. Dimensionality reduction encompasses the union of two approaches: 1) feature selection and 2) feature extraction. A feature selection-based algorithm for $k$-means clustering selects a small subset of the input features and then applies $k$-means clustering on the selected features. A feature extraction-based algorithm for $k$-means clustering constructs a small set of new artificial features and then applies $k$-means clustering on the constructed features. Despite the significance of $k$-means clustering as well as the wealth of heuristic methods addressing it, provably accurate feature selection methods for $k$-means clustering are not known. On the other hand, two provably accurate feature extraction methods for $k$-means clustering are known in the literature; one is based on random projections and the other is based on the singular value decomposition (SVD). This paper makes further progress toward a better understanding of dimensionality reduction for $k$-means clustering. Namely, we present the first provably accurate feature selection method for $k$-means clustering and, in addition, we present two feature extraction methods. The first feature extraction method is based on random projections and it improves upon the existing results in terms of time complexity and number of features needed to be extracted. The second feature extraction method is based on fast approximate SVD factorizations and it also improves upon the existing results in terms of time complexity. The proposed algorithms are randomized and provide constant-factor approximation guarantees with respect to the optimal $k$-means objective value.
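
As a rough illustration of feature extraction by random projection, the sketch below maps each point to a lower-dimensional space with a random sign matrix. The abstract does not give the paper's construction or its bound on the number of extracted features, so the matrix entries of ±1/√r, the function name, and the parameter `target_dim` are assumptions. The projected points would then be handed to any $k$-means implementation.

```python
import math
import random
from typing import List


def random_sign_projection(
    points: List[List[float]], target_dim: int, seed: int = 0
) -> List[List[float]]:
    """Project d-dimensional points onto target_dim random sign directions."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    # d x target_dim matrix with independent entries +scale or -scale.
    projection = [
        [scale * rng.choice((-1.0, 1.0)) for _ in range(target_dim)]
        for _ in range(d)
    ]
    return [
        [sum(p[i] * projection[i][j] for i in range(d)) for j in range(target_dim)]
        for p in points
    ]
```

With this scaling, squared Euclidean norms are preserved in expectation, which is why clustering the projected points can approximate clustering the originals.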

Proceedings Article•DOI•

R-Storm: Resource-Aware Scheduling in Storm

Boyang Peng¹, Mohammad Hosseini¹, Zhihao Hong¹, Reza Farivar², Roy H. Campbell¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, Yahoo!²

24 Nov 2015

TL;DR: R-Storm implements resource-aware scheduling within Storm. It can satisfy both soft and hard resource constraints as well as minimize network distance between components that communicate with each other, achieving 30-47% higher throughput and 69-350% better CPU utilization than default Storm on the micro-benchmarks.

Abstract: The era of big data has led to the emergence of new systems for real-time distributed stream processing; Apache Storm, for example, is one of the most popular stream processing systems in industry today. However, Storm, like many other stream processing systems, lacks an intelligent scheduling mechanism. The default round-robin scheduling currently deployed in Storm disregards resource demands and availability, and can therefore be inefficient at times. We present R-Storm (Resource-Aware Storm), a system that implements resource-aware scheduling within Storm. R-Storm is designed to increase overall throughput by maximizing resource utilization while minimizing network latency. When scheduling tasks, R-Storm can satisfy both soft and hard resource constraints as well as minimize network distance between components that communicate with each other. We evaluate R-Storm on a set of micro-benchmark Storm applications as well as Storm applications used in production at Yahoo! Inc. From our experimental results we conclude that R-Storm achieves 30-47% higher throughput and 69-350% better CPU utilization than default Storm for the micro-benchmarks. For the Yahoo! Storm applications, R-Storm outperforms default Storm by around 50% based on overall throughput. We also demonstrate that R-Storm performs much better when scheduling multiple Storm applications than default Storm.
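
The difference between hard and soft constraints can be sketched with a small greedy placement routine. This is not R-Storm's algorithm or Storm's scheduler API. It assumes, for illustration only, that memory is the hard constraint, that CPU is the soft one, and that each node carries a fixed hypothetical `net_distance`; all class and function names are invented here.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    mem_mb: float        # assumed hard constraint: never exceeded
    cpu: float           # assumed soft constraint: may be overcommitted at a cost
    net_distance: float  # hypothetical fixed distance to the rest of the topology


@dataclass
class Task:
    name: str
    mem_mb: float
    cpu: float


def pick_node(task: Task, nodes: List[Node]) -> Optional[Node]:
    """Filter by the hard constraint, then minimise soft shortfall and distance."""
    feasible = [n for n in nodes if n.mem_mb >= task.mem_mb]
    if not feasible:
        return None

    def cost(n: Node) -> float:
        cpu_shortfall = max(0.0, task.cpu - n.cpu)
        return math.hypot(cpu_shortfall, n.net_distance)

    return min(feasible, key=cost)


def schedule(tasks: List[Task], nodes: List[Node]) -> Dict[str, str]:
    """Place tasks one by one, updating the remaining capacity of each chosen node."""
    placement: Dict[str, str] = {}
    for task in tasks:
        node = pick_node(task, nodes)
        if node is None:
            raise RuntimeError(f"no node satisfies the memory demand of {task.name}")
        node.mem_mb -= task.mem_mb
        node.cpu = max(0.0, node.cpu - task.cpu)
        placement[task.name] = node.name
    return placement
```

A round-robin scheduler would ignore both the feasibility filter and the cost function, which is the inefficiency the abstract attributes to default Storm.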

Proceedings Article•DOI•

Large-Scale Unusual Time Series Detection

Rob J. Hyndman¹, Earo Wang¹, Nikolay Laptev²•Institutions (2)

Monash University¹, Yahoo!²

14 Nov 2015

TL;DR: This work computes a vector of features on each time series, measuring characteristics of the series, and uses various bivariate outlier detection methods applied to the first two principal components to identify servers that are behaving unusually.

Abstract: It is becoming increasingly common for organizations to collect very large amounts of data over time, and to need to detect unusual or anomalous time series. For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually. We compute a vector of features on each time series, measuring characteristics of the series. The features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and use various bivariate outlier detection methods applied to the first two principal components. This enables the most unusual series, based on their feature vectors, to be identified. The bivariate outlier detection methods used are based on highest density regions and α-hulls.
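
The first step, turning each series into a feature vector, can be sketched as follows. Only lag correlation is named in the abstract among the features used here. The choice of lags 1 and 24 (a crude daily-pattern proxy for hourly data), the inclusion of the mean, and the function names are assumptions. Principal components and outlier detection would be applied to the resulting vectors afterwards.

```python
from statistics import fmean
from typing import List, Sequence


def lag_autocorrelation(series: Sequence[float], lag: int = 1) -> float:
    """Sample autocorrelation at the given lag; 0.0 for constant or too-short series."""
    n = len(series)
    mean = fmean(series)
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0 or lag >= n:
        return 0.0
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


def feature_vector(series: Sequence[float]) -> List[float]:
    """One row of the feature matrix: mean, lag-1 and lag-24 autocorrelation."""
    return [
        fmean(series),
        lag_autocorrelation(series, 1),
        lag_autocorrelation(series, 24),
    ]
```

Each server's hourly measurements become one such row, so thousands of long series reduce to a small matrix that is cheap to decompose.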

Journal Article•DOI•

Anti-Inflammatory Activities of Licorice Extract and Its Active Compounds, Glycyrrhizic Acid, Liquiritin and Liquiritigenin, in BV2 Cells and Mice Liver

Ji Yeon Yu¹, Jae Yeo Ha¹, Kyungmi Kim, Young Suk Jung², Jae Chul Jung³, Seikwan Oh¹•Institutions (3)

Ewha Womans University¹, Pusan National University², Yahoo!³

20 Jul 2015-Molecules

TL;DR: Licorice extract inhibited the expression levels of pro-inflammatory cytokines in the livers of t-BHP-treated mice models, suggesting that mechanistic-based evidence substantiating the traditional claims of licorice extract and its three bioactive components can be applied for the treatment of inflammation-related disorders, such as oxidative liver damage and inflammation diseases.

Abstract: This study provides the scientific basis for the anti-inflammatory effects of licorice extract in a t-BHP (tert-butyl hydrogen peroxide)-induced liver damage model and the effects of its ingredients, glycyrrhizic acid (GA), liquiritin (LQ) and liquiritigenin (LG), in a lipopolysaccharide (LPS)-stimulated microglial cell model. The GA, LQ and LG inhibited the LPS-stimulated elevation of pro-inflammatory mediators, such as inducible nitric oxide synthase (iNOS), cyclooxygenase-2 (COX-2), tumor necrosis factor (TNF)-alpha, interleukin (IL)-1beta and interleukin (IL)-6 in BV2 (mouse brain microglia) cells. Furthermore, licorice extract inhibited the expression levels of pro-inflammatory cytokines (TNF-α, IL-1β and IL-6) in the livers of t-BHP-treated mice models. This result suggested that mechanistic-based evidence substantiating the traditional claims of licorice extract and its three bioactive components can be applied for the treatment of inflammation-related disorders, such as oxidative liver damage and inflammation diseases.

Proceedings Article•DOI•

Predicting The Next App That You Are Going To Use

Ricardo Baeza-Yates¹, Di Jiang², Fabrizio Silvestri¹, Beverly L. Harrison¹•Institutions (2)

Yahoo!¹, Hong Kong University of Science and Technology²

02 Feb 2015

TL;DR: This paper models the prediction of the next app as a classification problem and proposes an effective personalized method to solve it that takes full advantage of human-engineered features and automatically derived features.

Abstract: Given the large number of installed apps and the limited screen size of mobile devices, it is often tedious for users to search for the app they want to use. Although some mobile OSs provide categorization schemes that enhance the visibility of useful apps among those installed, the emerging category of homescreen apps aims to take one step further by automatically organizing the installed apps in a more intelligent and personalized way. In this paper, we study how to improve the usage experience of homescreen apps through a prediction mechanism that shows users which app they are going to use in the immediate future. The prediction technique is based on a set of features representing the real-time spatiotemporal contexts sensed by the homescreen app. We model the prediction of the next app as a classification problem and propose an effective personalized method to solve it that takes full advantage of human-engineered features and automatically derived features. Furthermore, we study how to solve the two naturally associated cold-start problems: app cold-start and user cold-start. We conduct large-scale experiments on log data obtained from Yahoo Aviate, showing that our approach can accurately predict the next app that a person is going to use.

Journal Article•DOI•

Metabolic Analysis of Various Date Palm Fruit (Phoenix dactylifera L.) Cultivars from Saudi Arabia to Assess Their Nutritional Quality

Ismail Hamad¹, Hamada AbdElgawad², Soad K. Al Jaouni³, Gaurav Zinta², Han Asard², Sherif T. S. Hassan, Momtaz M. Hegab, Nashwa Hagagy⁴, Samy Selim⁴•Institutions (4)

Yahoo!¹, University of Antwerp², King Abdulaziz University³, Suez Canal University⁴

27 Jul 2015-Molecules

TL;DR: The results showed that the date extracts from different cultivars have different free radical scavenging and anti-lipid peroxidation activities, and that different cultivars have different chemical composition.

Abstract: Date palm is an important crop, especially in the hot-arid regions of the world. Date palm fruits have high nutritional and therapeutic value and possess significant antibacterial and antifungal properties. In this study, we performed bioactivity analyses and metabolic profiling of date fruits of 12 cultivars from Saudi Arabia to assess their nutritional value. Our results showed that the date extracts from different cultivars have different free radical scavenging and anti-lipid peroxidation activities. Moreover, the cultivars showed significant differences in their chemical composition, e.g., the phenolic content (10.4–22.1 mg/100 g DW), amino acids (37–108 μmol·g−1 FW) and minerals (237–969 mg/100 g DW). Principal component analysis (PCA) showed a clear separation of the cultivars into four different groups. The first group consisted of the Sokary and Nabtit Ali cultivars; the second group of Khlas Al Kharj, Khla Al Qassim, Mabroom and Khlas Al Ahsa; the third group of Khals Elshiokh, Nabot Saif and Khodry; and the fourth group of Ajwa Al Madinah, Saffawy and Rashodia. Hierarchical cluster analysis (HCA) revealed clustering of date cultivars into two groups. The first cluster consisted of the Sokary, Rashodia and Nabtit Ali cultivars, and the second cluster contained all the other tested cultivars. These results indicate that date fruits have high nutritive value, and different cultivars have different chemical composition.

Journal Article•

SAMOA: scalable advanced massive online analysis

Gianmarco De Francisci Morales¹, Albert Bifet¹•Institutions (1)

Yahoo!¹

01 Jan 2015-Journal of Machine Learning Research

TL;DR: SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams that provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression.

Abstract: SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza. SAMOA is written in Java, is open source, and is available at http://samoa-project.net under the Apache Software License version 2.0.

Journal Article•DOI•

Mentha spicata Essential Oil: Chemical Composition, Antioxidant and Antibacterial Activities against Planktonic and Biofilm Cultures of Vibrio spp. Strains

Mejdi Snoussi¹, Emira Noumi², Najla Trabelsi¹, Guido Flamini³, Adele Papetti⁴, Vincenzo De Feo⁵•Institutions (5)

Yahoo!¹, University of Monastir², University of Pisa³, University of Pavia⁴, University of Salerno⁵

07 Aug 2015-Molecules

TL;DR: The ability of the oil, belonging to the carvone chemotype, to inhibit or reduce Vibrio spp. biofilm warrants further investigation into the use of natural products in antibiofilm adhesion.

Abstract: Chemical composition, antioxidant and anti-Vibrio spp. activities of the essential oil isolated from the aerial parts of Mentha spicata L. (spearmint) are investigated in the present study. The effect of the essential oil on Vibrio spp. biofilm inhibition and eradication was tested using the XTT assay. A total of 63 chemical constituents were identified in spearmint oil using GC/MS, constituting 99.9% of the total identified compounds. The main components were carvone (40.8% ± 1.23%) and limonene (20.8% ± 1.12%). The antimicrobial activity against 30 Vibrio spp. strains (16 species) was evaluated by disc diffusion and microdilution assays. All microorganisms were strongly affected, indicating an appreciable antimicrobial potential of the oil. Moreover, the investigated oil exhibited high antioxidant potency, as assessed by four different tests in comparison with BHT. The ability of the oil, belonging to the carvone chemotype, to inhibit or reduce Vibrio spp. biofilm warrants further investigation to explore the use of natural products in antibiofilm adhesion and reinforce the possibility of its use in the pharmaceutical or food industry as a natural antibiotic and seafood preservative against Vibrio contamination.

Proceedings Article•DOI•

Transitive Transfer Learning

Ben Tan¹, Yangqiu Song², Erheng Zhong³, Qiang Yang¹•Institutions (3)

Hong Kong University of Science and Technology¹, University of Illinois at Urbana–Champaign², Yahoo!³

10 Aug 2015

TL;DR: TTL is aimed at breaking the large domain distances and transferring knowledge even when the source and target domains share few factors directly; a learning framework that mimics the human learning process is proposed.

Abstract: Transfer learning, which leverages knowledge from source domains to enhance learning ability in a target domain, has been proven effective in various applications. One major limitation of transfer learning is that the source and target domains should be directly related. If there is little overlap between the two domains, performing knowledge transfer between these domains will not be effective. Inspired by human transitive inference and learning ability, whereby two seemingly unrelated concepts can be connected by a string of intermediate bridges using auxiliary concepts, in this paper we study a novel learning problem: Transitive Transfer Learning (abbreviated to TTL). TTL is aimed at breaking the large domain distances and transferring knowledge even when the source and target domains share few factors directly. For example, when the source and target domains are documents and images respectively, TTL could use some annotated images as the intermediate domain to bridge them. To solve the TTL problem, we propose a learning framework to mimic the human learning process. The framework is composed of an intermediate domain selection component and a knowledge transfer component. Extensive empirical evidence shows that the framework yields state-of-the-art classification accuracies on several classification data sets.
