
Showing papers by "Yahoo!" published in 2012


Journal ArticleDOI
Winter Mason1, Siddharth Suri1
TL;DR: It is shown that when taken as a whole Mechanical Turk can be a useful tool for many researchers, and how the behavior of workers compares with that of experts and laboratory subjects is discussed.
Abstract: Amazon’s Mechanical Turk is an online labor market where requesters post jobs and workers choose which jobs to do for pay. The central purpose of this article is to demonstrate how to use this Web site for conducting behavioral research and to lower the barrier to entry for researchers who could benefit from this platform. We describe general techniques that apply to a variety of types of research and experiments across disciplines. We begin by discussing some of the advantages of doing experiments on Mechanical Turk, such as easy access to a large, stable, and diverse subject pool, the low cost of doing experiments, and faster iteration between developing theory and executing experiments. While other methods of conducting behavioral research may be comparable to or even better than Mechanical Turk on one or more of the axes outlined above, we will show that when taken as a whole Mechanical Turk can be a useful tool for many researchers. We will discuss how the behavior of workers compares with that of experts and laboratory subjects. Then we will illustrate the mechanics of putting a task on Mechanical Turk, including recruiting subjects, executing the task, and reviewing the work that was submitted. We also provide solutions to common problems that a researcher might face when executing their research on this platform, including techniques for conducting synchronous experiments, methods for ensuring high-quality work, how to keep data private, and how to maintain code security.
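The mechanics sketched in the abstract (posting a task, recruiting workers, reviewing submissions) can be illustrated with a short script. Below is a minimal sketch, assuming the boto3 SDK and the MTurk requester sandbox endpoint; the title, reward, question XML, and other parameters are hypothetical placeholders rather than values from the article.

```python
# Minimal sketch: posting a task (HIT) to Mechanical Turk and retrieving work.
# Assumes boto3 is installed and AWS credentials are configured; all task
# parameters below are hypothetical placeholders.
import boto3

SANDBOX = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"
mturk = boto3.client("mturk", region_name="us-east-1", endpoint_url=SANDBOX)

question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>q1</QuestionIdentifier>
    <QuestionContent><Text>Which of the two images looks more trustworthy?</Text></QuestionContent>
    <AnswerSpecification><FreeTextAnswer/></AnswerSpecification>
  </Question>
</QuestionForm>"""

hit = mturk.create_hit(
    Title="Short judgment task (research study)",
    Description="Answer one question about an image pair.",
    Keywords="survey, research, image",
    Reward="0.10",                      # USD per assignment
    MaxAssignments=50,                  # number of distinct workers
    LifetimeInSeconds=3 * 24 * 3600,    # how long the HIT stays visible
    AssignmentDurationInSeconds=600,    # time allotted per worker
    Question=question_xml,
)
hit_id = hit["HIT"]["HITId"]

# Later: review the submitted work before approving payment.
submitted = mturk.list_assignments_for_hit(
    HITId=hit_id, AssignmentStatuses=["Submitted"]
)
for a in submitted["Assignments"]:
    print(a["WorkerId"], a["Answer"][:80])
```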

2,521 citations


Proceedings Article
25 Apr 2012
TL;DR: The goal is to automatically find an important class of failures, regardless of the protocols running, for both operational and experimental networks, with a general and protocol-agnostic framework, called Header Space Analysis (HSA).
Abstract: Today's networks typically carry or deploy dozens of protocols and mechanisms simultaneously such as MPLS, NAT, ACLs and route redistribution. Even when individual protocols function correctly, failures can arise from the complex interactions of their aggregate, requiring network administrators to be masters of detail. Our goal is to automatically find an important class of failures, regardless of the protocols running, for both operational and experimental networks. To this end we developed a general and protocol-agnostic framework, called Header Space Analysis (HSA). Our formalism allows us to statically check network specifications and configurations to identify an important class of failures such as Reachability Failures, Forwarding Loops and Traffic Isolation and Leakage problems. In HSA, protocol header fields are not first class entities; instead we look at the entire packet header as a concatenation of bits without any associated meaning. Each packet is a point in the {0,1}^L space, where L is the maximum length of a packet header, and networking boxes transform packets from one point in the space to another point or set of points (multicast). We created a library of tools, called Hassel, to implement our framework, and used it to analyze a variety of networks and protocols. Hassel was used to analyze the Stanford University backbone network, and found all the forwarding loops in less than 10 minutes, and verified reachability constraints between two subnets in 13 seconds. It also found a large and complex loop in an experimental loose source routing protocol in 4 minutes.
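To make the formalism concrete, here is a toy sketch (not the authors' Hassel library) in which headers are wildcard strings over {0,1,x} of length L and a networking box is a match/rewrite transfer function; the header length and the two rules are invented for illustration.

```python
# Toy sketch of Header Space Analysis: headers are wildcard strings over
# {0,1,x} of length L, and a box is a transfer function that matches a
# wildcard pattern and rewrites some bits. Rules here are made up.
L = 8  # toy header length; HSA uses the full packet header length

def intersect(a: str, b: str):
    """Bitwise intersection of two wildcard headers, or None if empty."""
    out = []
    for x, y in zip(a, b):
        if x == "x":
            out.append(y)
        elif y == "x" or x == y:
            out.append(x)
        else:
            return None  # contradictory bits -> empty header space
    return "".join(out)

def make_rule(match: str, rewrite: str):
    """Return a transfer function: the part of h matching `match` has its bits
    overwritten wherever `rewrite` is not 'x'; non-matching headers are dropped."""
    def transfer(h: str):
        m = intersect(h, match)
        if m is None:
            return None
        return "".join(r if r != "x" else c for c, r in zip(m, rewrite))
    return transfer

# Two toy boxes in sequence (e.g., an ACL-like filter then a NAT-like rewrite).
box1 = make_rule(match="1xxxxxxx", rewrite="xxxxxxxx")   # pass only headers starting with 1
box2 = make_rule(match="xxxxxxxx", rewrite="x01xxxxx")   # rewrite bits 2-3

def reachable(h: str, path):
    """Push a header set through a path of boxes; None means it is dropped."""
    for box in path:
        h = box(h)
        if h is None:
            return None
    return h

print(reachable("xxxxxxxx", [box1, box2]))  # -> '101xxxxx': headers surviving both boxes
```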

756 citations


Journal ArticleDOI
01 Mar 2012
TL;DR: In this article, the authors show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization of k-means, and demonstrate that the resulting algorithm, k-means||, outperforms k-means++ in both sequential and parallel settings.
Abstract: Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of the k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
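A compact NumPy sketch of the oversampling idea may help: run a few rounds, each sampling points with probability proportional to their squared distance from the current centers, then recluster the oversampled candidate set down to k. The oversampling factor, round count, and the simplified final reclustering step are illustrative choices, not the authors' implementation.

```python
# Sketch of k-means|| initialization: oversample candidate centers in a few
# parallelizable rounds, then recluster the (small) candidate set down to k.
import numpy as np

def sq_dists_to_centers(X, centers):
    """Squared distance from each point to its nearest current center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1)

def kmeans_parallel_init(X, k, ell=None, rounds=5, rng=None):
    rng = np.random.default_rng(rng)
    ell = ell or 2 * k                      # oversampling factor (illustrative)
    C = X[rng.integers(len(X))][None, :]    # start from one random point
    for _ in range(rounds):                 # ~O(log n) rounds in theory
        d2 = sq_dists_to_centers(X, C)
        p = np.minimum(1.0, ell * d2 / d2.sum())
        C = np.vstack([C, X[rng.random(len(X)) < p]])
    # Weight each candidate by how many points it "owns"; the paper then
    # reclusters the weighted candidates with k-means++. Here we simply pick
    # k candidates with probability proportional to weight.
    d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    owners = d.argmin(axis=1)
    w = np.bincount(owners, minlength=len(C)).astype(float)
    idx = rng.choice(len(C), size=k, replace=False, p=w / w.sum())
    return C[idx]

X = np.random.default_rng(0).normal(size=(1000, 2))
centers = kmeans_parallel_init(X, k=5)
print(centers.shape)  # (5, 2)
```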

537 citations


Journal ArticleDOI
TL;DR: This review focuses comprehensively on the nutrients and high-value bioactives profile as well as medicinal and functional aspects of different parts of olives and its byproducts.
Abstract: The Olive tree (Olea europaea L.), a native of the Mediterranean basin and parts of Asia, is now widely cultivated in many other parts of the world for production of olive oil and table olives. Olive is a rich source of valuable nutrients and bioactives of medicinal and therapeutic interest. Olive fruit contains an appreciable concentration, 1–3% of fresh pulp weight, of hydrophilic (phenolic acids, phenolic alcohols, flavonoids and secoiridoids) and lipophilic (cresols) phenolic compounds that are known to possess multiple biological activities such as antioxidant, anticarcinogenic, antiinflammatory, antimicrobial, antihypertensive, antidyslipidemic, cardiotonic, laxative, and antiplatelet. Other important compounds present in olive fruit are pectin, organic acids, and pigments. Virgin olive oil (VOO), extracted mechanically from the fruit, is also very popular for its nutritive and health-promoting potential, especially against cardiovascular disorders, due to the presence of high levels of monounsaturates and other valuable minor components such as phenolics, phytosterols, tocopherols, carotenoids, chlorophyll and squalene. The cultivar, area of production, harvest time, and the processing techniques employed are some of the factors shown to influence the composition of olive fruit and olive oil. This review focuses comprehensively on the nutrients and high-value bioactives profile as well as medicinal and functional aspects of different parts of olives and its byproducts. Various factors affecting the composition of this food commodity of medicinal value are also discussed.

463 citations


Posted Content
TL;DR: It is proved that the proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
Abstract: Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of the k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.

438 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: An algorithm is presented that models diversity in tweets based on topical diversity, geographical diversity, and an interest distribution of the user; it exploits sparse factorial coding of the attributes, allowing it to deal with a large and diverse set of covariates efficiently.
Abstract: Micro-blogging services have become indispensable communication tools for online users for disseminating breaking news, eyewitness accounts, individual expression, and protest groups. Recently, Twitter, along with other online social networking services such as Foursquare, Gowalla, Facebook and Yelp, have started supporting location services in their messages, either explicitly, by letting users choose their places, or implicitly, by enabling geo-tagging, which is to associate messages with latitudes and longitudes. This functionality allows researchers to address an exciting set of questions: (1) how information is created and shared across geographical locations, (2) how spatial and linguistic characteristics of people vary across regions, and (3) how to model human mobility. Although many attempts have been made at tackling these problems, previous methods are either too complicated to implement or so oversimplified that they cannot yield reasonable performance. It is a challenging task to discover topics and identify users' interests from these geo-tagged messages due to the sheer amount of data and the diversity of language variations used on these location sharing services. In this paper we focus on Twitter and present an algorithm that models diversity in tweets based on topical diversity, geographical diversity, and an interest distribution of the user. Furthermore, we take the Markovian nature of a user's location into account. Our model exploits sparse factorial coding of the attributes, thus allowing us to deal with a large and diverse set of covariates efficiently. Our approach is vital for applications such as user profiling, content recommendation and topic tracking. We show high accuracy in location estimation based on our model. Moreover, the algorithm identifies interesting topics based on location and language.

407 citations


Proceedings ArticleDOI
04 Jun 2012
TL;DR: This work describes the diffusion patterns arising from seven online domains, ranging from communications platforms to networked games to microblogging services, each involving distinct types of content and modes of sharing, and finds strikingly similar patterns across all domains.
Abstract: Models of networked diffusion that are motivated by analogy with the spread of infectious disease have been applied to a wide range of social and economic adoption processes, including those related to new products, ideas, norms and behaviors. However, it is unknown how accurately these models account for the empirical structure of diffusion over networks. Here we describe the diffusion patterns arising from seven online domains, ranging from communications platforms to networked games to microblogging services, each involving distinct types of content and modes of sharing. We find strikingly similar patterns across all domains. In particular, the vast majority of cascades are small, and are described by a handful of simple tree structures that terminate within one degree of an initial adopting "seed." In addition we find that structures other than these account for only a tiny fraction of total adoptions; that is, adoptions resulting from chains of referrals are extremely rare. Finally, even for the largest cascades that we observe, we find that the bulk of adoptions often takes place within one degree of a few dominant individuals. Together, these observations suggest new directions for modeling of online adoption processes.

400 citations


Journal ArticleDOI
TL;DR: A meta-analysis of individual patient data suggests no differences in efficacy between cisplatin and carboplatin in the first-line treatment of SCLC, but there are differences in the toxicity profile.
Abstract: Purpose Since treatment efficacy of cisplatin- or carboplatin-based chemotherapy in the first-line treatment of small-cell lung cancer (SCLC) remains contentious, a meta-analysis of individual patient data was performed to compare the two treatments. Patients and Methods A systematic review identified randomized trials comparing cisplatin with carboplatin in the first-line treatment of SCLC. Individual patient data were obtained from coordinating centers of all eligible trials. The primary end point was overall survival (OS). All statistical analyses were stratified by trial. Secondary end points were progression-free survival (PFS), objective response rate (ORR), and treatment toxicity. OS and PFS curves were compared by using the log-rank test. ORR was compared by using the Mantel-Haenszel test. Results Four eligible trials with 663 patients (328 assigned to cisplatin and 335 to carboplatin) were included in the analysis. Median OS was 9.6 months for cisplatin and 9.4 months for carboplatin (hazard rati...
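For readers unfamiliar with the survival-analysis machinery named above, the following is a minimal sketch of a log-rank comparison of two overall-survival curves using the lifelines library; the survival times below are synthetic, not trial data.

```python
# Minimal sketch of comparing two survival curves with the log-rank test,
# as used for the OS/PFS endpoints above. Data below are synthetic.
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(1)
# Synthetic overall-survival times in months and event indicators.
os_cisplatin   = rng.exponential(scale=9.6, size=328)
os_carboplatin = rng.exponential(scale=9.4, size=335)
observed_cis   = rng.random(328) < 0.8     # 1 = death observed, 0 = censored
observed_car   = rng.random(335) < 0.8

result = logrank_test(
    os_cisplatin, os_carboplatin,
    event_observed_A=observed_cis,
    event_observed_B=observed_car,
)
print(result.test_statistic, result.p_value)
```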

379 citations


Journal ArticleDOI
TL;DR: Because hypersensitivity pneumonitis may mimic several interstitial lung diseases, making diagnosis extremely difficult, its diagnosis requires a high index of suspicion and it should be considered in any patient presenting with clinical evidence of interstitial lung disease.
Abstract: Hypersensitivity pneumonitis (HP) is a complex syndrome resulting from repeated exposure to a variety of organic particles. HP may present as acute, subacute, or chronic clinical forms but with frequent overlap of these various forms. An intriguing question is why only few of the exposed individuals develop the disease. According to a two-hit model, antigen exposure associated with genetic or environmental promoting factors provokes an immunopathological response. This response is mediated by immune complexes in the acute form and by Th1 and likely Th17 T cells in subacute/chronic cases. Pathologically, HP is characterized by a bronchiolocentric granulomatous lymphocytic alveolitis, which evolves to fibrosis in chronic advanced cases. On high-resolution computed tomography scan, ground-glass and poorly defined nodules, with patchy areas of air trapping, are seen in acute/subacute cases, whereas reticular opacities, volume loss, and traction bronchiectasis superimposed on subacute changes are observed in chronic cases. Importantly, subacute and chronic HP may mimic several interstitial lung diseases, including nonspecific interstitial pneumonia and usual interstitial pneumonia, making diagnosis extremely difficult. Thus, the diagnosis of HP requires a high index of suspicion and should be considered in any patient presenting with clinical evidence of interstitial lung disease. The definitive diagnosis requires exposure to known antigen, and the assemblage of clinical, radiologic, laboratory, and pathologic findings. Early diagnosis and avoidance of further exposure are keys in management of the disease. Corticosteroids are generally used, although their long-term efficacy has not been proved in prospective clinical trials. Lung transplantation should be recommended in cases of progressive end-stage illness.

372 citations


Journal Article
TL;DR: This work introduces a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion, and shows that a number of existing feature selectors are special cases of this framework.
Abstract: We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice.
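A small sketch of the dependence-maximization idea, under stated assumptions: a biased empirical HSIC with RBF kernels and a greedy forward-selection loop (the paper's framework also covers backward elimination); kernel bandwidths and the toy data are illustrative.

```python
# Sketch of HSIC-based greedy feature selection: repeatedly add the feature
# whose inclusion maximizes the empirical Hilbert-Schmidt Independence
# Criterion between the selected features and the labels.
import numpy as np

def rbf_kernel(Z, sigma=1.0):
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(K, L):
    """Biased empirical HSIC estimate: trace(K H L H) / (n-1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def greedy_hsic_selection(X, y, num_features):
    n, d = X.shape
    Ly = rbf_kernel(y.reshape(-1, 1))       # label kernel
    selected = []
    for _ in range(num_features):
        best_j, best_score = None, -np.inf
        for j in range(d):
            if j in selected:
                continue
            K = rbf_kernel(X[:, selected + [j]])
            score = hsic(K, Ly)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Toy data: only the first two features carry signal about the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
print(greedy_hsic_selection(X, y, num_features=3))
```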

360 citations


01 Mar 2012
TL;DR: This paper introduces a replay methodology for contextual bandit algorithm evaluation that is completely data-driven and very easy to adapt to different applications and can provide provably unbiased evaluations.
Abstract: Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult and modeling bias is usually unavoidably introduced. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Different from simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show the accuracy and effectiveness of our offline evaluation method.
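The replay idea is simple enough to sketch in a few lines, assuming logged events of the form (context, displayed arm chosen uniformly at random, observed reward): keep only the events where the evaluated policy would have chosen the logged arm and average their rewards. Names and data below are hypothetical.

```python
# Sketch of the replay (offline) evaluator for a contextual bandit policy.
# Assumes the logging policy displayed arms uniformly at random, which is what
# makes the replay estimate unbiased.
import numpy as np

def replay_evaluate(policy, logged_events):
    """logged_events: iterable of (context, displayed_arm, reward).
    Returns the average reward over events the policy 'agrees' with."""
    total, matched = 0.0, 0
    for context, displayed_arm, reward in logged_events:
        if policy(context) == displayed_arm:   # only matching events are usable
            total += reward
            matched += 1
    return total / matched if matched else float("nan")

# Toy log: 3 arms, contexts are 2-d vectors, rewards are clicks (0/1).
rng = np.random.default_rng(0)
log = []
for _ in range(10_000):
    x = rng.normal(size=2)
    arm = rng.integers(3)                       # uniformly random logging policy
    click = rng.random() < (0.1 + 0.05 * arm * (x[0] > 0))
    log.append((x, arm, int(click)))

greedy = lambda x: 2 if x[0] > 0 else 0         # a hypothetical policy to evaluate
print(replay_evaluate(greedy, log))
```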

Patent
07 Jan 2012
TL;DR: In this paper, a system and methods are described for implementing searches using contextual information associated with a Web page (or other document) that a user is viewing when a query is entered.
Abstract: Systems and methods, including user interfaces, are provided for implementing searches using contextual information associated with a Web page (or other document) that a user is viewing when a query is entered. The page includes a contextual search interface that has an associated context vector representing content of the page. When the user submits a search query via the contextual search interface, the query and the context vector are both provided to the query processor and used in responding to the query.
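A minimal sketch of the general idea (not the patented system): blend the query's term vector with the context vector of the page the user is viewing before scoring documents; the blending weight and the tiny corpus are made up.

```python
# Sketch: blend a search query with the context vector of the page the user is
# viewing, then rank documents by cosine similarity. Weight alpha is made up.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "reviews of digital cameras and lenses",
    "recipes for italian pasta dishes",
    "camera settings for low light photography",
]
vec = TfidfVectorizer()
D = vec.fit_transform(docs).toarray()

def contextual_search(query, page_text, alpha=0.6):
    q = vec.transform([query]).toarray()[0]
    c = vec.transform([page_text]).toarray()[0]   # context vector of current page
    blended = alpha * q + (1 - alpha) * c
    scores = D @ blended / (np.linalg.norm(D, axis=1) * np.linalg.norm(blended) + 1e-12)
    return scores.argsort()[::-1]

# A query like "settings" issued from a camera-review page ranks photography docs higher.
print(contextual_search("settings", "reviews of digital cameras and lenses"))
```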

Proceedings ArticleDOI
20 May 2012
TL;DR: In this article, the authors present bLSM, a Log Structured Merge (LSM) tree with the advantages of B-Trees and log structured approaches: it has near-optimal read and scan performance, and its new "spring and gear" merge scheduler bounds write latency without impacting throughput or allowing merges to block writes for extended periods of time.
Abstract: Data management workloads are increasingly write-intensive and subject to strict latency SLAs. This presents a dilemma: Update in place systems have unmatched latency but poor write throughput. In contrast, existing log structured techniques improve write throughput but sacrifice read performance and exhibit unacceptable latency spikes. We begin by presenting a new performance metric: read fanout, and argue that, with read and write amplification, it better characterizes real-world indexes than approaches such as asymptotic analysis and price/performance. We then present bLSM, a Log Structured Merge (LSM) tree with the advantages of B-Trees and log structured approaches: (1) Unlike existing log structured trees, bLSM has near-optimal read and scan performance, and (2) its new "spring and gear" merge scheduler bounds write latency without impacting throughput or allowing merges to block writes for extended periods of time. It does this by ensuring merges at each level of the tree make steady progress without resorting to techniques that degrade read performance. We use Bloom filters to improve index performance, and find a number of subtleties arise. First, we ensure reads can stop after finding one version of a record. Otherwise, frequently written items would incur multiple B-Tree lookups. Second, many applications check for existing values at insert. Avoiding the seek performed by the check is crucial.
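Since the abstract leans on Bloom filters to avoid unnecessary B-Tree lookups, here is a minimal Bloom filter sketch of the kind an LSM-tree read path consults (not bLSM's implementation); the bit-array size and hash count are illustrative.

```python
# Minimal Bloom filter sketch, of the kind an LSM-tree uses so that a read can
# skip levels that definitely do not contain the key. Parameters are toy values.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(self.m // 8)

    def _positions(self, key: bytes):
        for i in range(self.k):
            h = hashlib.blake2b(key, salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"user:42")
print(bf.might_contain(b"user:42"))   # True
print(bf.might_contain(b"user:43"))   # almost always False (no disk seek needed)
```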

Proceedings ArticleDOI
20 May 2012
TL;DR: A novel approach to automatically partitioning databases for enterprise-class OLTP systems is presented that significantly extends the state of the art by minimizing the number of distributed transactions while concurrently mitigating the effects of temporal skew in both the data distribution and accesses.
Abstract: The advent of affordable, shared-nothing computing systems portends a new class of parallel database management systems (DBMS) for on-line transaction processing (OLTP) applications that scale without sacrificing ACID guarantees [7, 9]. The performance of these DBMSs is predicated on the existence of an optimal database design that is tailored for the unique characteristics of OLTP workloads. Deriving such designs for modern DBMSs is difficult, especially for enterprise-class OLTP systems, since they impose extra challenges: the use of stored procedures, the need for load balancing in the presence of time-varying skew, complex schemas, and deployments with a larger number of partitions. To this purpose, we present a novel approach to automatically partitioning databases for enterprise-class OLTP systems that significantly extends the state of the art by: (1) minimizing the number of distributed transactions, while concurrently mitigating the effects of temporal skew in both the data distribution and accesses, (2) extending the design space to include replicated secondary indexes, (3) organically handling stored procedure routing, and (4) scaling of schema complexity, data size, and number of partitions. This effort builds on two key technical contributions: an analytical cost model that can be used to quickly estimate the relative coordination cost and skew for a given workload and a candidate database design, and an informed exploration of the huge solution space based on large neighborhood search. To evaluate our methods, we integrated our database design tool with a high-performance parallel, main memory DBMS and compared our methods against both popular heuristics and a state-of-the-art research prototype [17]. Using a diverse set of benchmarks, we show that our approach improves throughput by up to a factor of 16x over these other approaches.

Proceedings Article
26 Jun 2012
TL;DR: The expected sample complexity bound for LUCB is novel even for single-arm selection, and a lower bound on the worst case sample complexity of PAC algorithms for Explore-m is given.
Abstract: We consider the problem of selecting, from among the arms of a stochastic n-armed bandit, a subset of size m of those arms with the highest expected rewards, based on efficiently sampling the arms. This "subset selection" problem finds application in a variety of areas. In the authors' previous work (Kalyanakrishnan & Stone, 2010), this problem is framed under a PAC setting (denoted "Explore-m"), and corresponding sampling algorithms are analyzed. Whereas the formal analysis therein is restricted to the worst case sample complexity of algorithms, in this paper, we design and analyze an algorithm ("LUCB") with improved expected sample complexity. Interestingly LUCB bears a close resemblance to the well-known UCB algorithm for regret minimization. The expected sample complexity bound we show for LUCB is novel even for single-arm selection (Explore-1). We also give a lower bound on the worst case sample complexity of PAC algorithms for Explore-m.
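A short sketch of an LUCB-style sampling rule under stated assumptions (Bernoulli arms, a simplified Hoeffding-type confidence radius, m = 1 for clarity): at each round sample the empirical leader and its strongest challenger, and stop when their confidence intervals separate by epsilon.

```python
# Sketch of LUCB-style (epsilon, delta)-PAC best-arm selection for Bernoulli
# arms (m = 1). At each round, pull the empirical leader and the
# highest-upper-bound challenger; stop when the leader's lower bound clears
# the challenger's upper bound by epsilon. The confidence radius is simplified.
import numpy as np

def lucb_best_arm(pull, n_arms, epsilon=0.05, delta=0.05, max_rounds=100_000):
    counts = np.ones(n_arms)                      # one initial pull per arm
    sums = np.array([pull(a) for a in range(n_arms)], dtype=float)
    for t in range(1, max_rounds):
        means = sums / counts
        beta = np.sqrt(np.log(4 * n_arms * (t + 1) ** 2 / delta) / (2 * counts))
        leader = int(means.argmax())
        ucb = means + beta
        ucb[leader] = -np.inf
        challenger = int(ucb.argmax())
        if means[leader] - beta[leader] >= means[challenger] + beta[challenger] - epsilon:
            return leader
        for a in (leader, challenger):            # sample the two critical arms
            sums[a] += pull(a)
            counts[a] += 1
    return int((sums / counts).argmax())

true_means = [0.3, 0.5, 0.55, 0.7]
rng = np.random.default_rng(0)
pull = lambda a: float(rng.random() < true_means[a])
print(lucb_best_arm(pull, n_arms=4))              # almost surely 3
```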

Proceedings ArticleDOI
16 Apr 2012
TL;DR: This paper proposes efficient algorithms that address all requirements of online team formation: these algorithms form teams that always satisfy the required skills, provide approximation guarantees with respect to team communication overhead, and are online-competitive with respect to load balancing.
Abstract: We study the problem of online team formation. We consider a setting in which people possess different skills and compatibility among potential team members is modeled by a social network. A sequence of tasks arrives in an online fashion, and each task requires a specific set of skills. The goal is to form a new team upon arrival of each task, so that (i) each team possesses all skills required by the task, (ii) each team has small communication overhead, and (iii) the workload of performing the tasks is balanced among people in the fairest possible way. We propose efficient algorithms that address all these requirements: our algorithms form teams that always satisfy the required skills, provide approximation guarantees with respect to team communication overhead, and they are online-competitive with respect to load balancing. Experiments performed on collaboration networks among film actors and scientists confirm that our algorithms are successful at balancing these conflicting requirements. This is the first paper that simultaneously addresses all these aspects. Previous work has either focused on minimizing coordination for a single task or balancing the workload neglecting coordination costs.
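A toy sketch of the skill-cover step only: for each arriving task, greedily assign each required skill to the currently least-loaded person who has it. This illustrates the online setting; it omits the paper's communication-cost and competitive-ratio machinery, and the names and skills are made up.

```python
# Toy sketch of online team formation: for each arriving task, cover each
# required skill by assigning the currently least-loaded person who has it.
# This illustrates the setting only; the paper's guarantees need more.
from collections import defaultdict

people = {
    "ana":  {"ml", "python"},
    "bo":   {"python", "devops"},
    "cai":  {"ml", "stats"},
    "dee":  {"devops", "stats"},
}
load = defaultdict(int)   # number of tasks each person is already on

def form_team(required_skills):
    team = set()
    for skill in required_skills:
        candidates = [p for p, s in people.items() if skill in s]
        if not candidates:
            raise ValueError(f"no one has skill {skill!r}")
        pick = min(candidates, key=lambda p: (load[p], p))   # least-loaded first
        team.add(pick)
    for p in team:
        load[p] += 1
    return team

for task in [{"ml", "devops"}, {"python", "stats"}, {"ml", "stats"}]:
    print(sorted(task), "->", sorted(form_team(task)))
```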

Patent
31 Oct 2012
TL;DR: In this article, an approach and methods are described for providing a user augmented reality (UAR) service for a camera-enabled mobile device, so that a user of such a device can use it to obtain meta data regarding one or more images/video captured with the device.
Abstract: Apparatus and methods are described for providing a user augmented reality (UAR) service for a camera-enabled mobile device, so that a user of such mobile device can use the mobile device to obtain meta data regarding one or more images/video that are captured with such device. The meta data is interactive and allows the user to obtain additional information or specific types of information, such as information that will aid the user in making a decision regarding the identified objects or selectable action options that can be used to initiate actions with respect to the identified objects.

Proceedings ArticleDOI
08 Feb 2012
TL;DR: The problem of correlating micro-blogging activity with stock-market events, defined as changes in the price and traded volume of stocks, is studied, and it is shown that even relatively small correlations between price and micro-blogging features can be exploited to drive a stock trading strategy that outperforms other baseline strategies.
Abstract: We study the problem of correlating micro-blogging activity with stock-market events, defined as changes in the price and traded volume of stocks. Specifically, we collect messages related to a number of companies, and we search for correlations between stock-market events for those companies and features extracted from the micro-blogging messages. The features we extract can be categorized in two groups. Features in the first group measure the overall activity in the micro-blogging platform, such as number of posts, number of re-posts, and so on. Features in the second group measure properties of an induced interaction graph, for instance, the number of connected components, statistics on the degree distribution, and other graph-based properties. We present detailed experimental results measuring the correlation of the stock market events with these features, using Twitter as a data source. Our results show that the most correlated features are the number of connected components and the number of nodes of the interaction graph. The correlation is stronger with the traded volume than with the price of the stock. However, by using a simulator we show that even relatively small correlations between price and micro-blogging features can be exploited to drive a stock trading strategy that outperforms other baseline strategies.
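A minimal pandas sketch of the correlation analysis described here, assuming you already have a daily feature series (e.g., the number of connected components of the interaction graph) and daily traded volume; the series below are synthetic stand-ins.

```python
# Sketch: correlate daily micro-blogging features with daily traded volume.
# The series below are synthetic stand-ins for the features in the paper
# (e.g., number of connected components / nodes of the interaction graph).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2011-01-03", periods=120, freq="B")
connected_components = pd.Series(rng.poisson(50, len(days)), index=days, dtype=float)
traded_volume = 1e6 + 2e4 * connected_components + rng.normal(0, 5e5, len(days))
df = pd.DataFrame({
    "connected_components": connected_components,
    "traded_volume": traded_volume,
})

# Pearson correlation between each micro-blogging feature and traded volume.
print(df.corr()["traded_volume"])
```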

Proceedings ArticleDOI
08 Feb 2012
TL;DR: A scalable parallel framework is presented for efficient inference in latent variable models over streaming web-scale data, introducing a novel delta-based aggregation system with a bandwidth-efficient communication protocol, schedule-aware out-of-core storage, and approximate forward sampling to rapidly incorporate new data.
Abstract: Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance, making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, latent-variable techniques are often prohibitively expensive to apply to large-scale, streaming datasets. In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state which includes the data-points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate state-of-the-art performance of our framework by easily tackling datasets two orders of magnitude larger than those addressed by the current state-of-the-art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework.

Journal ArticleDOI
TL;DR: A green synthesis route for the production of silver nanoparticles using methanol extract from Solanum xanthocarpum berry (SXE) is reported in the present investigation, and AgNps under study were found to be equally efficient against the antibiotic-resistant and antibiotic-susceptible strains of H. pylori.
Abstract: A green synthesis route for the production of silver nanoparticles using methanol extract from Solanum xanthocarpum berry (SXE) is reported in the present investigation. Silver nanoparticles (AgNps), having a surface plasmon resonance (SPR) band centered at 406 nm, were synthesized by reacting SXE (as capping as well as reducing agent) with AgNO3 during a 25 min process at 45 °C. The synthesized AgNps were characterized using UV–Visible spectrophotometry, powdered X-ray diffraction, and transmission electron microscopy (TEM). The results showed that the time of reaction, temperature and volume ratio of SXE to AgNO3 could accelerate the reduction rate of Ag+ and affect the AgNps size and shape. The nanoparticles were found to be about 10 nm in size, mono-dispersed in nature, and spherical in shape. In vitro anti-Helicobacter pylori activity of synthesized AgNps was tested against 34 clinical isolates and two reference strains of Helicobacter pylori by the agar dilution method and compared with AgNO3 and four standard drugs, namely amoxicillin (AMX), clarithromycin (CLA), metronidazole (MNZ) and tetracycline (TET), being used in anti-H. pylori therapy. Typical AgNps sample (S1) effectively inhibited the growth of H. pylori, indicating a stronger anti-H. pylori activity than that of AgNO3 or MNZ, being almost equally potent to TET and less potent than AMX and CLA. AgNps under study were found to be equally efficient against the antibiotic-resistant and antibiotic-susceptible strains of H. pylori. Besides, in the H. pylori urease inhibitory assay, S1 also exhibited a significant inhibition. Lineweaver-Burk plots revealed that the mechanism of inhibition was noncompetitive.

Patent
Sergiy Bilobrov1
13 Jan 2012
TL;DR: In this paper, an audio fingerprint is extracted from an audio sample by computing an energy spectrum for the audio sample, resampling the energy spectrum logarithmically in the time dimension, transforming the resampled energy spectrum to produce a series of feature vectors, and computing the fingerprint using differential coding of the feature vectors.
Abstract: An audio fingerprint is extracted from an audio sample, where the fingerprint contains information that is characteristic of the content in the sample. The fingerprint may be generated by computing an energy spectrum for the audio sample, resampling the energy spectrum logarithmically in the time dimension, transforming the resampled energy spectrum to produce a series of feature vectors, and computing the fingerprint using differential coding of the feature vectors. The generated fingerprint can be compared to a set of reference fingerprints in a database to identify the original audio content.
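A rough NumPy sketch of the pipeline the claim describes (energy spectrum, logarithmic resampling in time, banded feature vectors, differential coding); window sizes, band counts and the log spacing are guesses for illustration, not the patented parameters.

```python
# Rough sketch of an audio fingerprint: short-time energy spectrum, frames
# resampled at logarithmically spaced times, band energies, then differential
# coding of successive feature vectors into bits. All parameters are guesses.
import numpy as np

def fingerprint(signal, frame=1024, hop=256, bands=16, out_frames=64):
    # 1) Short-time energy spectrum.
    idx = np.arange(0, len(signal) - frame, hop)
    window = np.hanning(frame)
    spec = np.abs(np.fft.rfft(
        np.stack([signal[i:i + frame] * window for i in idx]), axis=1)) ** 2
    # 2) Resample the spectrogram logarithmically along the time axis.
    log_t = np.unique((np.logspace(0, np.log10(len(idx)), out_frames) - 1).astype(int))
    spec = spec[np.clip(log_t, 0, len(idx) - 1)]
    # 3) Collapse frequencies into a few bands -> one feature vector per frame.
    edges = np.linspace(0, spec.shape[1], bands + 1).astype(int)
    feats = np.stack([spec[:, a:b].sum(axis=1)
                      for a, b in zip(edges[:-1], edges[1:])], axis=1)
    # 4) Differential coding: sign of the change between adjacent frames/bands.
    diff = np.diff(feats, axis=0)
    return (diff > 0).astype(np.uint8)            # bit matrix ~ the fingerprint

sr = 8000
t = np.arange(sr * 2) / sr
audio = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)
print(fingerprint(audio).shape)
```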

Journal ArticleDOI
03 Dec 2012
TL;DR: This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of multi-view models and topic models, including latent Dirichlet allocation (LDA).
Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. The increased representational power comes at the cost of a more challenging unsupervised learning problem for estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of multi-view models and topic models, including latent Dirichlet allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method is based on an efficiently computable orthogonal tensor decomposition of low-order moments.

Proceedings ArticleDOI
12 Aug 2012
TL;DR: A large-scale data mining approach to learning word-word relatedness is proposed, where known pairs of related words impose constraints on the learning process; the method learns for each word a low-dimensional representation, which strives to maximize the likelihood of a word given the contexts in which it appears.
Abstract: Prior work on computing semantic relatedness of words focused on representing their meaning in isolation, effectively disregarding inter-word affinities. We propose a large-scale data mining approach to learning word-word relatedness, where known pairs of related words impose constraints on the learning process. We learn for each word a low-dimensional representation, which strives to maximize the likelihood of a word given the contexts in which it appears. Our method, called CLEAR, is shown to significantly outperform previously published approaches. The proposed method is based on first principles, and is generic enough to exploit diverse types of text corpora, while having the flexibility to impose constraints on the derived word similarities. We also make publicly available a new labeled dataset for evaluating word relatedness algorithms, which we believe to be the largest such dataset to date.

Proceedings ArticleDOI
16 Apr 2012
TL;DR: The authors showed that conversational behavior can reveal power relationships in two very different settings: discussions among Wikipedians and arguments before the U. S. Supreme Court, and proposed an analysis framework based on linguistic coordination that can be used to shed light on power relationships.
Abstract: Understanding social interaction within groups is key to analyzing online communities. Most current work focuses on structural properties: who talks to whom, and how such interactions form larger network structures. The interactions themselves, however, generally take place in the form of natural language --- either spoken or written --- and one could reasonably suppose that signals manifested in language might also provide information about roles, status, and other aspects of the group's dynamics. To date, however, finding domain-independent language-based signals has been a challenge. Here, we show that in group discussions, power differentials between participants are subtly revealed by how much one individual immediately echoes the linguistic style of the person they are responding to. Starting from this observation, we propose an analysis framework based on linguistic coordination that can be used to shed light on power relationships and that works consistently across multiple types of power --- including a more "static" form of power based on status differences, and a more "situational" form of power in which one individual experiences a type of dependence on another. Using this framework, we study how conversational behavior can reveal power relationships in two very different settings: discussions among Wikipedians and arguments before the U. S. Supreme Court.
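The coordination measure itself can be sketched compactly: for a marker category m (e.g., prepositions), coordination toward a speaker is the probability that a reply contains m given that the utterance it answers contained m, minus the baseline rate of m in replies. The marker lists and toy exchanges below are illustrative, not the paper's.

```python
# Sketch of a linguistic-coordination measure on (utterance, reply) pairs:
# coord(m) = P(reply contains m | triggering utterance contains m)
#            - P(reply contains m).
# Marker word lists and the toy exchanges are illustrative only.
MARKERS = {
    "preposition": {"of", "in", "on", "with", "at", "by"},
    "pronoun": {"i", "you", "we", "they", "he", "she"},
}

def uses(text, marker_words):
    return bool(marker_words & set(text.lower().split()))

def coordination(exchanges, marker_words):
    """exchanges: list of (utterance_by_a, reply_by_b) pairs."""
    replies_given_trigger = [uses(r, marker_words) for a, r in exchanges
                             if uses(a, marker_words)]
    baseline = [uses(r, marker_words) for a, r in exchanges]
    if not replies_given_trigger:
        return float("nan")
    return (sum(replies_given_trigger) / len(replies_given_trigger)
            - sum(baseline) / len(baseline))

exchanges = [
    ("we should decide on the policy", "i agree with the proposal"),
    ("please review the draft", "looks fine"),
    ("they argued in the meeting", "we discussed it in detail"),
]
for name, words in MARKERS.items():
    print(name, round(coordination(exchanges, words), 3))
```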

Journal ArticleDOI
TL;DR: This review of anatomy revisits the left atrium, inside as well as outside, for a better understanding of the atrial component parts and the spatial relationships of specific structures.
Abstract: Recent decades have seen rapid developments in arrhythmia treatment, especially the use of catheter ablation. Although the substrates of atrial fibrillation, its initiation and maintenance, remain to be fully elucidated, catheter ablation in the left atrium has become a therapeutic option for patients with this arrhythmia. With ablation techniques, various isolation lines and focal targets are deployed; the majority of these are anatomic approaches. It has been over a decade since we published our first article on the anatomy of the left atrium relevant to interventional electrophysiologists.1 Our aim then, as now, was to increase awareness of anatomic structures inside the left atrium. In this review of anatomy, we revisit the left atrium, inside as well as outside, for a better understanding of the atrial component parts and the spatial relationships of specific structures. Location and atrial walls: Viewed from the frontal aspect of the chest, the left atrium is the most posteriorly situated of the cardiac chambers. Owing to the obliquity of the plane of the atrial septum and the different levels of the orifices of the mitral and tricuspid valves, the left atrial chamber is more posteriorly and superiorly situated relative to the right atrial chamber. The pulmonary veins enter the posterior part of the left atrium with the left veins located more superior than the right veins. The transverse pericardial sinus lies anterior to the left atrium, and in front of the sinus is the root of the aorta. The tracheal bifurcation, the esophagus, and descending thoracic aorta are immediately behind the pericardium overlying the posterior wall of the left atrium. Further behind is the vertebral column. Following the direction of blood flow, the atrial chamber begins at the pulmonary veno-atrial junctions and terminates at the fibro-fatty tissue plane that marks the atrioventricular junction at the mitral orifice. The walls …

Book ChapterDOI
16 Jul 2012
TL;DR: This paper provides initial insights into engagement patterns, allowing for a better understanding of the important characteristics of how users repeatedly interact with a service or group of services.
Abstract: Our research goal is to provide a better understanding of how users engage with online services, and how to measure this engagement. We should not speak of one main approach to measure user engagement --- e.g. through one fixed set of metrics --- because engagement depends on the online services at hand. Instead, we should be talking of models of user engagement. As a first step, we analysed a number of online services, and show that it is possible to derive effectively simple models of user engagement, for example, accounting for user types and temporal aspects. This paper provides initial insights into engagement patterns, allowing for a better understanding of the important characteristics of how users repeatedly interact with a service or group of services.

Journal ArticleDOI
19 Jul 2012-PLOS ONE
TL;DR: It is shown that daily trading volumes of stocks traded in NASDAQ-100 are correlated with daily volumes of queries related to the same stocks, and query volumes anticipate in many cases peaks of trading by one day or more.
Abstract: We live in a computerized and networked society where many of our actions leave a digital trace and affect other people's actions. This has led to the emergence of a new data-driven research field: mathematical methods of computer science, statistical physics and sociometry provide insights on a wide range of disciplines ranging from social science to human mobility. A recent important discovery is that search engine traffic (i.e., the number of requests submitted by users to search engines on the www) can be used to track and, in some cases, to anticipate the dynamics of social phenomena. Successful examples include unemployment levels, car and home sales, and epidemics spreading. Few recent works applied this approach to stock prices and market sentiment. However, it remains unclear if trends in financial markets can be anticipated by the collective wisdom of on-line users on the web. Here we show that daily trading volumes of stocks traded in NASDAQ-100 are correlated with daily volumes of queries related to the same stocks. In particular, query volumes anticipate in many cases peaks of trading by one day or more. Our analysis is carried out on a unique dataset of queries, submitted to an important web search engine, which enables us to also investigate user behavior. We show that the query volume dynamics emerges from the collective but seemingly uncoordinated activity of many users. These findings contribute to the debate on the identification of early warnings of financial systemic risk, based on the activity of users of the www.
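The "anticipate by one day or more" observation corresponds to a simple lead-lag analysis: correlate query volume at day t with trading volume at day t+k for several lags k. A minimal pandas sketch with synthetic series follows.

```python
# Sketch of the lead-lag analysis: correlate query volume at day t with traded
# volume at day t+k for several lags k. The series here are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
days = pd.date_range("2011-01-03", periods=250, freq="B")
query_volume = pd.Series(rng.poisson(200, len(days)).astype(float), index=days)
# Synthetic trading volume that reacts to query volume with a one-day delay.
trading_volume = (1e6 + 3e3 * query_volume.shift(1).fillna(query_volume.mean())
                  + rng.normal(0, 1e5, len(days)))

for lag in range(0, 4):
    corr = query_volume.corr(trading_volume.shift(-lag))
    print(f"corr(queries[t], volume[t+{lag}]) = {corr:.3f}")
```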

Journal ArticleDOI
01 Jan 2012
TL;DR: In this paper, the authors present new algorithms for finding the densest subgraph in the streaming model, which make O(log_{1+ε} n) passes over the input and find a subgraph whose density is guaranteed to be within a factor 2(1 + ε) of the optimum.
Abstract: The problem of finding locally dense components of a graph is an important primitive in data analysis, with wide-ranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we present new algorithms for finding the densest subgraph in the streaming model. For any ε > 0, our algorithms make O(log_{1+ε} n) passes over the input and find a subgraph whose density is guaranteed to be within a factor 2(1 + ε) of the optimum. Our algorithms are also easily parallelizable and we illustrate this by realizing them in the MapReduce model. In addition we perform extensive experimental evaluation on massive real-world graphs showing the performance and scalability of our algorithms in practice.
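The pass-based idea can be sketched as follows: in each pass, remove every node whose degree is at most 2(1+ε) times the current density |E|/|V|, and remember the densest intermediate subgraph. The sketch below runs on an in-memory adjacency structure rather than a true stream.

```python
# Sketch of the 2(1+eps)-approximate densest-subgraph algorithm: repeatedly
# strip all nodes of degree <= 2(1+eps) * current density, keeping the densest
# intermediate subgraph seen. Run here on an in-memory adjacency dict.
from collections import defaultdict

def densest_subgraph(edges, eps=0.1):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    best_nodes, best_density = set(nodes), 0.0
    while nodes:
        m = sum(len(adj[u] & nodes) for u in nodes) / 2
        density = m / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        threshold = 2 * (1 + eps) * density
        to_remove = {u for u in nodes if len(adj[u] & nodes) <= threshold}
        if not to_remove:          # cannot happen in theory, but be safe
            break
        nodes -= to_remove
    return best_nodes, best_density

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (0, 3), (1, 3)]
print(densest_subgraph(edges))     # the dense core {0, 1, 2, 3} with density 1.5
```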

Journal ArticleDOI
TL;DR: This paper provides a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature, and analyzes the agreement ofinterleaving with manual relevance judgments and observational implicit feedback measures.
Abstract: Interleaving is an increasingly popular technique for evaluating information retrieval systems based on implicit user feedback. While a number of isolated studies have analyzed how this technique agrees with conventional offline evaluation approaches and other online techniques, a complete picture of its efficiency and effectiveness is still lacking. In this paper we extend and combine the body of empirical evidence regarding interleaving, and provide a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature. In particular, we analyze the agreement of interleaving with manual relevance judgments and observational implicit feedback measures, estimate the statistical efficiency of interleaving, and explore the relative performance of different interleaving variants. We also show how to learn improved credit-assignment functions for clicks that further increase the sensitivity of interleaving.
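To make "interleaving" concrete, here is a minimal sketch of the team-draft variant (one common instantiation, not necessarily the exact variants compared in the paper): the two rankings alternately draft their top remaining result, and clicks are credited to the team that contributed the clicked slot.

```python
# Sketch of team-draft interleaving: build a combined ranking by letting the
# two rankers alternately pick their best remaining document, remember which
# "team" contributed each slot, and credit clicks to that team.
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    combined, teams = [], []
    count_a = count_b = 0
    remaining = lambda r: [d for d in r if d not in combined]
    while remaining(ranking_a) or remaining(ranking_b):
        prefer_a = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        if prefer_a and remaining(ranking_a):
            team, doc = "A", remaining(ranking_a)[0]
        elif remaining(ranking_b):
            team, doc = "B", remaining(ranking_b)[0]
        else:
            team, doc = "A", remaining(ranking_a)[0]
        combined.append(doc)
        teams.append(team)
        count_a += team == "A"
        count_b += team == "B"
    return combined, teams

def credit_clicks(teams, clicked_positions):
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins

ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d3", "d1", "d5", "d2"]
combined, teams = team_draft_interleave(ranking_a, ranking_b)
print(combined, teams, credit_clicks(teams, clicked_positions=[0, 2]))
```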

Journal ArticleDOI
TL;DR: This work systematically addresses two challenges in effectively integrating interactions over multiple dimensions to discover hidden community structures shared by heterogeneous interactions, and presents and analyzes four possible integration strategies to extend community detection from single-dimensional to multi-dimensional networks.
Abstract: The pervasiveness of Web 2.0 and social networking sites has enabled people to interact with each other easily through various social media. For instance, popular sites like Del.icio.us, Flickr, and YouTube allow users to comment on shared content (bookmarks, photos, videos), and users can tag their favorite content. Users can also connect with one another, and subscribe to or become a fan or a follower of others. These diverse activities result in a multi-dimensional network among actors, forming group structures with group members sharing similar interests or affiliations. This work systematically addresses two challenges. First, it is challenging to effectively integrate interactions over multiple dimensions to discover hidden community structures shared by heterogeneous interactions. We show that representative community detection methods for single-dimensional networks can be presented in a unified view. Based on this unified view, we present and analyze four possible integration strategies to extend community detection from single-dimensional to multi-dimensional networks. In particular, we propose a novel integration scheme based on structural features. Another challenge is the evaluation of different methods without ground truth information about community membership. We employ a novel cross-dimension network validation (CDNV) procedure to compare the performance of different methods. We use synthetic data to deepen our understanding, and real-world data to compare integration strategies as well as baseline methods at a large scale. We study further the computational time of different methods, normalization effect during integration, sensitivity to related parameters, and alternative community detection methods for integration.