
Showing papers by "Yahoo!" published in 2014


Posted Content
TL;DR: This work extracts and analyzes the core of the Bitcoin protocol, which is termed the Bitcoin backbone, and proves two of its fundamental properties, called common prefix and chain quality, in the static setting where the number of players remains fixed.
Abstract: Bitcoin is the first and most popular decentralized cryptocurrency to date. In this work, we extract and analyze the core of the Bitcoin protocol, which we term the Bitcoin backbone, and prove two of its fundamental properties which we call common prefix and chain quality in the static setting where the number of players remains fixed. Our proofs hinge on appropriate and novel assumptions on the “hashing power” of the adversary relative to network synchronicity; we show our results to be tight under high synchronization. Next, we propose and analyze applications that can be built “on top” of the backbone protocol, specifically focusing on Byzantine agreement (BA) and on the notion of a public transaction ledger. Regarding BA, we observe that Nakamoto’s suggestion falls short of solving it, and present a simple alternative which works assuming that the adversary’s hashing power is bounded by 1/3. The public transaction ledger captures the essence of Bitcoin’s operation as a cryptocurrency, in the sense that it guarantees the liveness and persistence of committed transactions. Based on this notion we describe and analyze the Bitcoin system as well as a more elaborate BA protocol, proving them secure assuming high network synchronicity and that the adversary’s hashing power is strictly less than 1/2, while the adversarial bound needed for security decreases as the network desynchronizes.
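For reference, the two backbone properties can be stated compactly; the following is a hedged paraphrase (the notation C^{⌈k} for a chain with its last k blocks pruned follows the paper, but the exact quantifiers differ slightly in the published version):

```latex
% Common prefix (parameter k): for chains C_1, C_2 adopted by two honest
% players at rounds r_1 \le r_2, pruning the last k blocks of C_1 yields
% a prefix of C_2:
C_1^{\lceil k} \preceq C_2 .
% Chain quality (parameters \mu, \ell): any \ell consecutive blocks of an
% honest player's chain contain at least \mu \cdot \ell blocks contributed
% by honest players.
```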

746 citations


Proceedings ArticleDOI
19 Jun 2014
TL;DR: The big data benchmark suite BigDataBench not only covers broad application scenarios but also includes diverse and representative data sets; the 19 big data workloads included in BigDataBench are comprehensively characterized with varying data inputs.
Abstract: As architecture, systems, and data management communities pay greater attention to innovative big data systems and architecture, the pressure of benchmarking and evaluating these systems rises. However, the complexity, diversity, frequently changed workloads, and rapid evolution of big data systems raise great challenges in big data benchmarking. Considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads, which is the prerequisite for evaluating big data systems and architecture. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above.

529 citations


Proceedings Article
21 Jun 2014
TL;DR: In this paper, a new algorithm for the contextual bandit learning problem is presented, where the learner repeatedly takes one of K actions in response to the observed context, and observes the reward only for that action.
Abstract: We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of K actions in response to the observed context, and observes the reward only for that action. Our method assumes access to an oracle for solving fully supervised cost-sensitive classification problems and achieves the statistically optimal regret guarantee with only O(√KT) oracle calls across all T rounds. By doing so, we obtain the most practical contextual bandit learning algorithm amongst approaches that work for general policy classes. We conduct a proof-of-concept experiment which demonstrates the excellent computational and statistical performance of (an online variant of) our algorithm relative to several strong baselines.
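The paper's algorithm achieves its guarantee with only O(√KT) oracle calls; the sketch below is not that algorithm but a minimal epsilon-greedy baseline illustrating the same cost-sensitive-oracle interface it assumes. The environment, oracle implementation, and constants are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, d = 5, 3000, 10
theta = rng.normal(size=(K, d))              # hidden reward model (toy world)

def oracle(contexts, costs):
    """Stand-in cost-sensitive classification oracle: one least-squares
    cost regressor per action; returns the least-predicted-cost policy."""
    W = np.array([np.linalg.lstsq(contexts, costs[:, a], rcond=None)[0]
                  for a in range(K)])
    return lambda x: int(np.argmin(W @ x))

X, C = [], []
policy, eps = (lambda x: 0), 0.1
for t in range(T):
    x = rng.normal(size=d)
    greedy = policy(x)
    a = int(rng.integers(K)) if rng.random() < eps else greedy
    r = theta[a] @ x + rng.normal(scale=0.1)  # reward seen for chosen action only
    p = eps / K + (1 - eps) * (a == greedy)   # propensity of the chosen action
    c = np.zeros(K)
    c[a] = -r / p                             # inverse-propensity-weighted cost
    X.append(x); C.append(c)
    if (t + 1) % 300 == 0:                    # periodic oracle call to refit policy
        policy = oracle(np.array(X), np.array(C))
```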

393 citations


Journal ArticleDOI
TL;DR: A mathematical formulation equating the design of cross-modal retrieval systems to that of isomorphic feature spaces for different content modalities is proposed, finding that both hypotheses hold, in a complementary form, although evidence in favor of the abstraction hypothesis is stronger than that for correlation.
Abstract: The problem of cross-modal retrieval from multimedia repositories is considered. This problem addresses the design of retrieval systems that support queries across content modalities, for example, using an image to search for texts. A mathematical formulation is proposed, equating the design of cross-modal retrieval systems to that of isomorphic feature spaces for different content modalities. Two hypotheses are then investigated regarding the fundamental attributes of these spaces. The first is that low-level cross-modal correlations should be accounted for. The second is that the space should enable semantic abstraction. Three new solutions to the cross-modal retrieval problem are then derived from these hypotheses: correlation matching (CM), an unsupervised method which models cross-modal correlations, semantic matching (SM), a supervised technique that relies on semantic representation, and semantic correlation matching (SCM), which combines both. An extensive evaluation of retrieval performance is conducted to test the validity of the hypotheses. All approaches are shown successful for text retrieval in response to image queries and vice versa. It is concluded that both hypotheses hold, in a complementary form, although evidence in favor of the abstraction hypothesis is stronger than that for correlation.
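A hedged sketch of the three matchings on toy paired data, using off-the-shelf scikit-learn components; the dimensions and data are made up, and the paper's experiments use real image and text features:

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_img, d_txt, n_classes = 300, 50, 40, 5
y = rng.integers(n_classes, size=n)
img = rng.normal(size=(n, d_img)) + y[:, None]      # toy paired image features
txt = rng.normal(size=(n, d_txt)) + y[:, None]      # toy paired text features

# CM: correlation matching -- shared subspace maximizing cross-modal correlation.
cca = CCA(n_components=10).fit(img, txt)
img_cm, txt_cm = cca.transform(img, txt)

# SM: semantic matching -- represent each item by its class-posterior vector.
img_sm = LogisticRegression(max_iter=1000).fit(img, y).predict_proba(img)
txt_sm = LogisticRegression(max_iter=1000).fit(txt, y).predict_proba(txt)

# SCM: semantic classifiers trained on the correlation-matched coordinates.
img_scm = LogisticRegression(max_iter=1000).fit(img_cm, y).predict_proba(img_cm)
txt_scm = LogisticRegression(max_iter=1000).fit(txt_cm, y).predict_proba(txt_cm)

# Retrieval: rank texts for an image query by distance in the shared space.
query = img_scm[0]
scores = np.linalg.norm(txt_scm - query, axis=1)
print("top-5 texts for image 0:", np.argsort(scores)[:5])
```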

371 citations


Proceedings ArticleDOI
26 Apr 2014
TL;DR: The first results on how photos with human faces relate to engagement on large-scale image-sharing communities are presented, finding that the number of faces and their age and gender do not have an effect.
Abstract: Photos are becoming a prominent means of communication online. Despite photos' pervasive presence in social media and the online world, we know little about how people interact and engage with their content. Understanding how photo content might signify engagement can impact both science and design, influencing production and distribution. One common type of photo content shared on social media is photos of people. From studies of offline behavior, we know that human faces are powerful channels of non-verbal communication. In this paper, we study this behavioral phenomenon online. We ask how the presence of a face and its age and gender might impact social engagement with the photo. We use a corpus of 1 million Instagram images and organize our study around two social engagement feedback factors, likes and comments. Our results show that photos with faces are 38% more likely to receive likes and 32% more likely to receive comments, even after controlling for social network reach and activity. We find, however, that the number of faces and their age and gender do not have an effect. This work presents the first results on how photos with human faces relate to engagement in large-scale image-sharing communities. In addition to contributing to the research around online user behavior, our findings offer a new line of future work using visual analysis.
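The reported effects hold after controlling for reach and activity; below is a minimal sketch of such a control regression on simulated counts, with hypothetical column names (the paper's actual model specification may differ):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "has_face": rng.integers(0, 2, n),
    "log_followers": rng.normal(5, 1, n),   # social-network reach control
    "log_posts": rng.normal(4, 1, n),       # activity control
})
# Simulated like counts with a positive face effect (illustrative only).
mu = np.exp(0.3 * df.has_face + 0.8 * df.log_followers + 0.1 * df.log_posts - 2)
df["likes"] = rng.poisson(np.asarray(mu))

# Count regression of likes on face presence, controlling for reach/activity.
model = smf.poisson("likes ~ has_face + log_followers + log_posts", data=df).fit()
print(model.params)  # exp(coef of has_face) ~ multiplicative effect on likes
```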

350 citations


Journal ArticleDOI
TL;DR: This paper presents a theoretically supported framework for active learning from drifting data streams and develops three active learning strategies for streaming data that explicitly handle concept drift, based on uncertainty, dynamic allocation of labeling efforts over time, and randomization of the search space.
Abstract: In learning to classify streaming data, obtaining true labels may require major effort and may incur excessive cost. Active learning focuses on carefully selecting as few labeled instances as possible for learning an accurate predictive model. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and models need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. Changes occurring further from the boundary may be missed, and models may fail to adapt. This paper presents a theoretically supported framework for active learning from drifting data streams and develops three active learning strategies for streaming data that explicitly handle concept drift. They are based on uncertainty, dynamic allocation of labeling efforts over time, and randomization of the search space. We empirically demonstrate that these strategies react well to changes that can occur anywhere in the instance space and unexpectedly.
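A minimal sketch of the randomized variable-uncertainty idea described above, assuming a simple online linear learner; the budget mechanics and constants are illustrative, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

class OnlineLinear:
    """Tiny perceptron-style stub standing in for any adaptive learner."""
    def __init__(self, d): self.w = np.zeros(d)
    def margin(self, x): return float(self.w @ x)
    def update(self, x, y):                      # y in {-1, +1}
        if y * self.margin(x) <= 0: self.w += y * x

def variable_uncertainty(stream, model, budget=0.1, s=0.01, delta=1.0):
    """Hedged sketch of randomized variable-uncertainty querying: ask for a
    label when confidence falls below an adaptive, randomized threshold,
    so queries can also land away from the decision boundary."""
    theta, spent, seen = 1.0, 0, 0
    for x, y in stream:
        seen += 1
        threshold = theta * abs(rng.normal(1.0, delta))  # randomization step
        if spent / seen < budget and abs(model.margin(x)) < threshold:
            model.update(x, y)        # query the true label and adapt
            spent += 1
            theta *= 1 - s            # querying enough: tighten threshold
        else:
            theta *= 1 + s            # too confident too often: widen threshold

d = 5
stream = ((rng.normal(size=d), rng.choice([-1, 1])) for _ in range(1000))
variable_uncertainty(stream, OnlineLinear(d))
```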

323 citations


Posted Content
TL;DR: In this paper, the authors use data from a crowd-sourcing platform that shows a user two street scenes in London, and the user votes on which one looks more beautiful, quiet, and happy.
Abstract: When providing directions to a place, web and mobile mapping services are all able to suggest the shortest route. The goal of this work is to automatically suggest routes that are not only short but also emotionally pleasant. To quantify the extent to which urban locations are pleasant, we use data from a crowd-sourcing platform that shows two street scenes in London (out of hundreds), and a user votes on which one looks more beautiful, quiet, and happy. We consider votes from more than 3.3K individuals and translate them into quantitative measures of location perceptions. We arrange those locations into a graph upon which we learn pleasant routes. Based on a quantitative validation, we find that, compared to the shortest routes, the recommended ones add just a few extra walking minutes and are indeed perceived to be more beautiful, quiet, and happy. To test the generality of our approach, we consider Flickr metadata of more than 3.7M pictures in London and 1.3M in Boston, compute proxies for the crowdsourced beauty dimension (the one for which we have collected the most votes), and evaluate those proxies with 30 participants in London and 54 in Boston. These participants have not only rated our recommendations but have also carefully motivated their choices, providing insights for future work.
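A minimal sketch of the routing idea, assuming pleasantness scores are already attached to street segments; the graph, the scores, and the trade-off weight alpha are all illustrative:

```python
import networkx as nx

# Toy street graph: each edge has a length (meters) and a crowdsourced
# pleasantness score in [0, 1] (illustrative values, not the paper's data).
G = nx.Graph()
edges = [("A", "B", 400, 0.9), ("B", "D", 500, 0.8),
         ("A", "C", 300, 0.2), ("C", "D", 350, 0.3)]
alpha = 1.0  # modeling choice: how strongly unpleasantness inflates distance
for u, v, length, pleasant in edges:
    G.add_edge(u, v, length=length,
               cost=length * (1 + alpha * (1 - pleasant)))

shortest = nx.shortest_path(G, "A", "D", weight="length")
pleasant_route = nx.shortest_path(G, "A", "D", weight="cost")
print(shortest, pleasant_route)  # the pleasant route takes a slightly longer way
```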

300 citations


Journal ArticleDOI
TL;DR: Obesity remains strongly associated with diabetes, hypercholesterolemia, and hypertension in the KSA, although the epidemic’s characteristics differ between men and women.
Abstract: Introduction: Data on obesity from the Kingdom of Saudi Arabia (KSA) are nonexistent, making it impossible to determine whether the efforts of the Saudi Ministry of Health are having an effect on obesity trends. To determine obesity prevalence and associated factors in the KSA, we conducted a national survey on chronic diseases and their risk factors.

288 citations


Proceedings ArticleDOI
01 Sep 2014
TL;DR: This work uses data from a crowd-sourcing platform to quantify the extent to which urban locations are pleasant, and finds that the recommended routes add just a few extra walking minutes and are indeed perceived to be more beautiful, quiet, and happy.
Abstract: When providing directions to a place, web and mobile mapping services are all able to suggest the shortest route. The goal of this work is to automatically suggest routes that are not only short but also emotionally pleasant. To quantify the extent to which urban locations are pleasant, we use data from a crowd-sourcing platform that shows two street scenes in London (out of hundreds), and a user votes on which one looks more beautiful, quiet, and happy. We consider votes from more than 3.3K individuals and translate them into quantitative measures of location perceptions. We arrange those locations into a graph upon which we learn pleasant routes. Based on a quantitative validation, we find that, compared to the shortest routes, the recommended ones add just a few extra walking minutes and are indeed perceived to be more beautiful, quiet, and happy. To test the generality of our approach, we consider Flickr metadata of more than 3.7M pictures in London and 1.3M in Boston, compute proxies for the crowdsourced beauty dimension (the one for which we have collected the most votes), and evaluate those proxies with 30 participants in London and 54 in Boston. These participants have not only rated our recommendations but have also carefully motivated their choices, providing insights for future work.

265 citations


Journal ArticleDOI
TL;DR: It is shown that with particular choices of kernel functions, nonredundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion, and that the globally optimal solution can be efficiently computed.
Abstract: The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this letter, we consider a feature-wise kernelized Lasso for capturing nonlinear input-output dependency. We first show that with particular choices of kernel functions, nonredundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments for classification and regression with thousands of features.
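A hedged sketch of the feature-wise kernelized Lasso (HSIC Lasso) idea: one centered Gram matrix per feature, regressed against the centered output Gram matrix with a non-negative Lasso. The kernel width and regularization constant are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)  # 2 true features

def centered_gaussian_gram(v, sigma=1.0):
    K = np.exp(-(v[:, None] - v[None, :]) ** 2 / (2 * sigma ** 2))
    H = np.eye(len(v)) - 1.0 / len(v)           # centering matrix H = I - 11^T/n
    return H @ K @ H

L = centered_gaussian_gram(y).ravel()           # vectorized output Gram matrix
Phi = np.column_stack([centered_gaussian_gram(X[:, k]).ravel()
                       for k in range(d)])      # one column per input feature

# Non-negative Lasso over Gram matrices: nonzero coefficients mark features
# with strong (possibly nonlinear) statistical dependence on y.
coef = Lasso(alpha=1e-3, positive=True, max_iter=50000).fit(Phi, L).coef_
print("selected features:", np.flatnonzero(coef))
```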

240 citations


Journal Article
TL;DR: In the gradient oracle model, this paper gives a deterministic algorithm for stochastic strongly-convex optimization that returns an O(1/T)-approximate solution after T iterations, improving on the O(log(T)/T) rate implied by online algorithms with O(log(T)) regret.
Abstract: We give novel algorithms for stochastic strongly-convex optimization in the gradient oracle model which return a O(1/T)-approximate solution after T iterations. The first algorithm is deterministic, and achieves this rate via gradient updates and historical averaging. The second algorithm is randomized, and is based on pure gradient steps with a random step size. This rate of convergence is optimal in the gradient oracle model. This improves upon the previously known best rate of O(log(T)/T), which was obtained by applying an online strongly-convex optimization algorithm with regret O(log(T)) to the batch setting. We complement this result by proving that any algorithm has expected regret of Ω(log(T)) in the online stochastic strongly-convex optimization setting. This shows that any online-to-batch conversion is inherently suboptimal for stochastic strongly-convex optimization. This is the first formal evidence that online convex optimization is strictly more difficult than batch stochastic convex optimization.
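A sketch of the historical-averaging idea on a toy strongly convex objective: plain SGD with step size 1/(λt) plus averaging of the last half of the iterates (suffix averaging). This is a standard scheme in the spirit of the paper's first algorithm, not its exact statement:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 10, 10000, 1.0
w_star = rng.normal(size=d)

def stoch_grad(w):
    """Noisy gradient of the lam-strongly-convex f(w) = lam/2 * ||w - w*||^2."""
    return lam * (w - w_star) + rng.normal(scale=1.0, size=d)

w = np.zeros(d)
suffix = np.zeros(d)
for t in range(1, T + 1):
    w -= stoch_grad(w) / (lam * t)        # step size 1/(lam * t)
    if t > T // 2:                        # average only the last T/2 iterates
        suffix += w
suffix /= T - T // 2

# Averaging all iterates gives O(log T / T); suffix averaging recovers
# the optimal O(1/T) rate for strongly convex objectives.
print("squared error:", np.linalg.norm(suffix - w_star) ** 2)
```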

Proceedings ArticleDOI
Xing Yi1, Liangjie Hong1, Erheng Zhong1, Nathan Nan Liu1, Suju Rajan1
06 Oct 2014
TL;DR: A novel method to compute accurate dwell time based on client-side and server-side logging is described, and normalization of dwell time across different devices and contexts is demonstrated.
Abstract: Many internet companies, such as Yahoo, Facebook, Google and Twitter, rely on content recommendation systems to deliver the most relevant content items to individual users through personalization. Delivering such personalized user experiences is believed to increase the long term engagement of users. While there has been a lot of progress in designing effective personalized recommender systems, by exploiting user interests and historical interaction data through implicit (item click) or explicit (item rating) feedback, directly optimizing for users' satisfaction with the system remains challenging. In this paper, we explore the idea of using item-level dwell time as a proxy to quantify how likely a content item is relevant to a particular user. We describe a novel method to compute accurate dwell time based on client-side and server-side logging and demonstrate how to normalize dwell time across different devices and contexts. In addition, we describe our experiments in incorporating dwell time into state-of-the-art learning to rank techniques and collaborative filtering models that obtain competitive performances in both offline and online settings.
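A minimal sketch of per-device dwell-time normalization; the log schema and field names are hypothetical, and z-scoring stands in for whatever normalization the paper actually derives:

```python
import pandas as pd

# Hypothetical joined log: one row per (user, item) click, with the
# client-reported time the item stayed in focus (seconds).
logs = pd.DataFrame({
    "user":   [1, 1, 2, 2, 3, 3],
    "item":   ["a", "b", "a", "c", "b", "c"],
    "device": ["phone", "phone", "desktop", "desktop", "tablet", "tablet"],
    "dwell_sec": [12.0, 95.0, 30.0, 240.0, 60.0, 20.0],
})

# Normalize dwell within each device type so that, e.g., slower reading on
# a phone is not mistaken for higher engagement than on desktop.
g = logs.groupby("device")["dwell_sec"]
logs["dwell_norm"] = (logs.dwell_sec - g.transform("mean")) / (g.transform("std") + 1e-9)
print(logs)
```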

Journal ArticleDOI
TL;DR: Patients who underwent ACL reconstruction had fewer subsequent meniscal injuries, less need for further surgery, and significantly greater improvement in activity level as measured with the Tegner score.
Abstract: Results: Comparison of operative and nonoperative cohorts revealed no significant differences in age, sex, body mass index, or rate of initial meniscal injury (p > 0.05 for all). Operative cohorts had significantly less need for further surgery (12.4% compared with 24.9% for nonoperative, p = 0.0176), less need for subsequent meniscal surgery (13.9% compared with 29.4%, p = 0.0017), and less decline in the Tegner score (−1.9 compared with −3.1, p = 0.0215). A difference in pivot-shift test results was observed (25.5% pivot-positive compared with 46.6% for nonoperative) but did not reach significance (p = 0.09). No significant differences were seen in outcome scores (Lysholm, International Knee Documentation Committee [IKDC], or final Tegner scores) or the rate of radiographically evident degenerative joint disease (p > 0.05 for all). Conclusions: At a mean of 13.9 ± 3.1 years after injury, the patients who underwent ACL reconstruction had fewer subsequent meniscal injuries, less need for further surgery, and significantly greater improvement in activity level as measured with the Tegner score. There were no significant differences in the Lysholm score, IKDC score, or development of radiographically evident osteoarthritis. Level of Evidence: Therapeutic Level III. See Instructions for Authors for a complete description of levels of evidence.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: An extensive evaluation of the method against previous state-of-the-art approaches is conducted on the challenging PASCAL VOC 2007 and Object Discovery datasets, along with a large-scale study of co-localization on ImageNet involving ground-truth annotations for 3,624 classes and approximately 1 million images.
Abstract: In this paper, we tackle the problem of co-localization in real-world images. Co-localization is the problem of simultaneously localizing (with bounding boxes) objects of the same class across a set of distinct images. Although similar problems such as co-segmentation and weakly supervised localization have been previously studied, we focus on being able to perform co-localization in real-world settings, which are typically characterized by large amounts of intra-class variation, inter-class diversity, and annotation noise. To address these issues, we present a joint image-box formulation for solving the co-localization problem, and show how it can be relaxed to a convex quadratic program which can be efficiently solved. We perform an extensive evaluation of our method compared to previous state-of-the-art approaches on the challenging PASCAL VOC 2007 and Object Discovery datasets. In addition, we also present a large-scale study of co-localization on ImageNet, involving ground-truth annotations for 3,624 classes and approximately 1 million images.
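A hedged sketch of the relaxed box-selection idea with cvxpy: one relaxed indicator per candidate box, a simplex constraint per image, and a quadratic objective rewarding boxes that look alike across images. The similarity construction and prior are toy stand-ins, not the paper's exact formulation:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n_images, boxes_per_image = 4, 6
n = n_images * boxes_per_image

F = rng.normal(size=(n, 8))                  # toy per-box appearance features
d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
S = np.exp(-d2)                              # nonnegative box similarities
L = np.diag(S.sum(1)) - S                    # graph Laplacian (PSD): low value
prior = rng.random(n)                        # when similar boxes are co-selected

z = cp.Variable(n)                           # relaxed box-selection indicators
objective = cp.Minimize(cp.quad_form(z, cp.psd_wrap(L)) - prior @ z)
constraints = [z >= 0] + [
    cp.sum(z[i * boxes_per_image:(i + 1) * boxes_per_image]) == 1
    for i in range(n_images)                 # one (soft) box choice per image
]
cp.Problem(objective, constraints).solve()
picks = z.value.reshape(n_images, boxes_per_image).argmax(1)
print("selected box per image:", picks)
```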

Journal ArticleDOI
TL;DR: In this article, the authors consider the very large differences in adaptive capacity among the world's urban centres and discuss how risk levels may change for a range of climatic drivers of impacts in the cities.
Abstract: This paper considers the very large differences in adaptive capacity among the world’s urban centres. It then discusses how risk levels may change for a range of climatic drivers of impacts in the ...

Journal ArticleDOI
TL;DR: Several gender-specific risk factors were identified, including avoiding fat and cholesterol, physical inactivity, current tobacco use, and childhood physical abuse, which can be utilized in health promotion programmes.
Abstract: Obesity among young people increases lifetime cardiovascular risk. This study assesses the prevalence of overweight/obesity and its associated factors among a random sample of university students from 22 universities in 22 low-income, middle-income and emerging-economy countries. This cross-sectional survey comprised a self-administered questionnaire and collected anthropometric measurements. The study population was 6773 (43.2%) males and 8913 (56.8%) females, aged 16 to 30 years (mean 20.8 years, SD = 2.6). Body mass index (BMI) was used for weight status. Among men, the prevalence of underweight was 10.8%, normal weight 64.4%, overweight 18.9% and obesity 5.8%, while among women, the prevalence of underweight was 17.6%, normal weight 62.1%, overweight 14.1% and obesity 5.2%. Overall, 22% were overweight or obese (24.7% of men and 19.3% of women). In multivariate regression, among men, younger age, coming from a higher-income country, consciously avoiding fat and cholesterol, physical inactivity, current tobacco use and childhood physical abuse were associated with overweight or obesity; among women, the associated factors were older age, coming from a higher-income country, frequent organized religious activity, avoiding fat and cholesterol, posttraumatic stress symptoms and childhood physical abuse. Several gender-specific risk factors identified can be utilized in health promotion programmes.

Proceedings ArticleDOI
15 Feb 2014
TL;DR: A crowdsourcing project that aims to investigate, at scale, which visual aspects of London neighborhoods make them appear beautiful, quiet, and/or happy, and collects votes from over 3.3K individuals and translates them into quantitative measures of urban perception.
Abstract: In the 1960s, Lynch's 'The Image of the City' explored what impression US city neighborhoods left on its inhabitants. The scale of urban perception studies until recently was considerably constrained by the limited number of study participants. We here present a crowdsourcing project that aims to investigate, at scale, which visual aspects of London neighborhoods make them appear beautiful, quiet, and/or happy. We collect votes from over 3.3K individuals and translate them into quantitative measures of urban perception. In so doing, we quantify each neighborhood's aesthetic capital. By then using state-of-the-art image processing techniques, we determine visual cues that may cause a street to be perceived as being beautiful, quiet, or happy. We identify effects of color, texture and visual words. For example, the amount of greenery is the most positively associated visual cue with each of three qualities; by contrast, broad streets, fortress-like buildings, and council houses tend to be associated with the opposite qualities (ugly, noisy, and unhappy).
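A minimal sketch of turning pairwise votes into per-location perception scores; the win-fraction estimator here is a simple stand-in for however the paper aggregates votes:

```python
from collections import defaultdict

# Each vote: (winner_location, loser_location) for one pairwise comparison,
# e.g., "which of these two scenes looks more beautiful?"
votes = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]

wins, games = defaultdict(int), defaultdict(int)
for w, l in votes:
    wins[w] += 1
    games[w] += 1
    games[l] += 1

# Naive perception score: fraction of comparisons won. This estimator
# ignores opponent strength; it is a stand-in for the paper's measures.
scores = {loc: wins[loc] / games[loc] for loc in games}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```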

Proceedings ArticleDOI
Martin Saveski1, Amin Mantrach1
06 Oct 2014
TL;DR: This work proposes to learn Local Collective Embeddings: a matrix factorization that exploits items' properties and past user preferences while enforcing the manifold structure exhibited by the collective embeddings, and presents a learning algorithm based on multiplicative update rules that are efficient and easy to implement.
Abstract: Recommender systems suggest to users items that they might like (e.g., news articles, songs, movies) and, in doing so, they help users deal with information overload and enjoy a personalized experience. One of the main problems of these systems is the item cold-start, i.e., when a new item is introduced in the system and no past information is available, then no effective recommendations can be produced. The item cold-start is a very common problem in practice: modern online platforms have hundreds of new items published every day. To address this problem, we propose to learn Local Collective Embeddings: a matrix factorization that exploits items' properties and past user preferences while enforcing the manifold structure exhibited by the collective embeddings. We present a learning algorithm based on multiplicative update rules that are efficient and easy to implement. The experimental results on two item cold-start use cases, news recommendation and email recipient recommendation, demonstrate the effectiveness of this approach and show that it significantly outperforms six state-of-the-art methods for item cold-start.
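A hedged sketch in the spirit of Local Collective Embeddings: joint non-negative factorization of item-content and user-preference matrices with a shared factor, plus a graph (Laplacian) term, trained with standard multiplicative updates. Dimensions, the graph, and lam are illustrative, and the update rules follow generic graph-regularized NMF rather than the paper's exact derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d_content, n_users, k, lam = 50, 30, 40, 5, 0.5

Xs = rng.random((n_items, d_content))        # item content features
Xu = rng.random((n_items, n_users))          # past user preferences
A = (rng.random((n_items, n_items)) > 0.9).astype(float)
A = np.maximum(A, A.T)                       # toy item-similarity graph
D = np.diag(A.sum(1))                        # Laplacian L = D - A regularizer

W = rng.random((n_items, k))                 # shared item embedding
Hs, Hu = rng.random((k, d_content)), rng.random((k, n_users))
eps = 1e-9

for _ in range(200):                         # multiplicative update rules
    Hs *= (W.T @ Xs) / (W.T @ W @ Hs + eps)
    Hu *= (W.T @ Xu) / (W.T @ W @ Hu + eps)
    W *= (Xs @ Hs.T + Xu @ Hu.T + lam * (A @ W)) / \
         (W @ (Hs @ Hs.T) + W @ (Hu @ Hu.T) + lam * (D @ W) + eps)

# Cold-start: a brand-new item with only content features gets an embedding
# via Hs, from which user affinities follow via Hu (least-squares sketch).
new_item = rng.random(d_content)
w_new = np.linalg.lstsq(Hs.T, new_item, rcond=None)[0]
print("predicted affinities:", (w_new @ Hu)[:5])
```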

Proceedings ArticleDOI
24 Aug 2014
TL;DR: WTFW ("Who to Follow and Why"), a stochastic topic model for link prediction over directed and nodes-attributed graphs, is proposed, which not only predicts links, but for each predicted link it decides whether it is a "topical" or a "social" link, and depending on this decision it produces a different type of explanation.
Abstract: User recommender systems are a key component in any on-line social networking platform: they help the users grow their network faster, thus driving engagement and loyalty. In this paper we study link prediction with explanations for user recommendation in social networks. For this problem we propose WTFW ("Who to Follow and Why"), a stochastic topic model for link prediction over directed and nodes-attributed graphs. Our model not only predicts links, but for each predicted link it decides whether it is a "topical" or a "social" link, and depending on this decision it produces a different type of explanation. A topical link is recommended between a user interested in a topic and a user authoritative in that topic: the explanation in this case is a set of binary features describing the topic responsible for the link creation. A social link is recommended between users who share a large social neighborhood: in this case the explanation is the set of neighbors which are most likely to be responsible for the link creation. Our experimental assessment on real-world data confirms the accuracy of WTFW in the link prediction and the quality of the associated explanations.

Journal ArticleDOI
TL;DR: There is conflicting evidence to support postoperative rehabilitation protocols using early motion over immobilization following rotator cuff repair, and the use of platelet-rich plasma (PRP).
Abstract: Several studies have noted that increasing age is a significant factor for diminished rotator cuff healing, while biomechanical studies have suggested the reason for this may be an inferior healing environment in older patients. Larger tears and fatty infiltration or atrophy negatively affect rotator cuff healing. Arthroscopic rotator cuff repair, double-row repairs, performing a concomitant acromioplasty, and the use of platelet-rich plasma (PRP) do not demonstrate an improvement in structural healing over mini-open rotator cuff repairs, single-row repairs, not performing an acromioplasty, or not using PRP. There is conflicting evidence to support postoperative rehabilitation protocols using early motion over immobilization following rotator cuff repair.

Journal ArticleDOI
23 Oct 2014-Viruses
TL;DR: Causation criteria for viruses and cancer will be described, as well as the viral agents that comply with these criteria in human tumors, their epidemiological and biological characteristics, the molecular mechanisms by which they induce cellular transformation, and their associated cancers.
Abstract: The first human tumor virus was discovered in the middle of the last century by Anthony Epstein, Bert Achong and Yvonne Barr in African pediatric patients with Burkitt's lymphoma. To date, seven viruses -EBV, KSHV, high-risk HPV, MCPV, HBV, HCV and HTLV1- have been consistently linked to different types of human cancer, and infections are estimated to account for up to 20% of all cancer cases worldwide. Viral oncogenic mechanisms generally include: generation of genomic instability, increase in the rate of cell proliferation, resistance to apoptosis, alterations in DNA repair mechanisms and cell polarity changes, which often coexist with evasion mechanisms of the antiviral immune response. Viral agents also indirectly contribute to the development of cancer mainly through immunosuppression or chronic inflammation, but also through chronic antigenic stimulation. There is also evidence that viruses can modulate the malignant properties of an established tumor. In the present work, causation criteria for viruses and cancer will be described, as well as the viral agents that comply with these criteria in human tumors, their epidemiological and biological characteristics, the molecular mechanisms by which they induce cellular transformation and their associated cancers.

Book
01 Dec 2014
TL;DR: This book advocates for the development of "good" measures and good measurement practices that will advance the study of user engagement and improve the understanding of this construct, which has become so vital in the authors' wired world.
Abstract: User engagement refers to the quality of the user experience that emphasizes the positive aspects of interacting with an online application and, in particular, the desire to use that application longer and repeatedly. User engagement is a key concept in the design of online applications (whether for desktop, tablet or mobile), motivated by the observation that successful applications are not just used, but are engaged with. Users invest time, attention, and emotion in their use of technology, and seek to satisfy pragmatic and hedonic needs. Measurement is critical for evaluating whether online applications are able to successfully engage users, and may inform the design of and use of applications. User engagement is a multifaceted, complex phenomenon; this gives rise to a number of potential measurement approaches. Common ways to evaluate user engagement include using self-report measures, e.g., questionnaires; observational methods, e.g., facial expression analysis, speech analysis; neuro-physiological signal processing methods, e.g., respiratory and cardiovascular accelerations and decelerations, muscle spasms; and web analytics, e.g., number of site visits, click depth. These methods represent various trade-offs in terms of the setting (laboratory versus "in the wild"), object of measurement (user behaviour, affect or cognition) and scale of data collected. For instance, small-scale user studies are deep and rich, but limited in terms of generalizability, whereas large-scale web analytic studies are powerful but negate users' motivation and context. The focus of this book is how user engagement is currently being measured and various considerations for its measurement. Our goal is to leave readers with an appreciation of the various ways in which to measure user engagement, and their associated strengths and weaknesses. We emphasize the multifaceted nature of user engagement and the unique contextual constraints that come to bear upon attempts to measure engagement in different settings, and across different user groups and web domains. At the same time, this book advocates for the development of "good" measures and good measurement practices that will advance the study of user engagement and improve our understanding of this construct, which has become so vital in our wired world. Table of Contents: Preface / Acknowledgments / Introduction and Scope / Approaches Based on Self-Report Methods / Approaches Based on Physiological Measurements / Approaches Based on Web Analytics / Beyond Desktop, Single Site, and Single Task / Enhancing the Rigor of User Engagement Methods and Measures / Conclusions and Future Research Directions / Bibliography / Authors' Biographies / Index

Journal ArticleDOI
TL;DR: It is shown experimentally that annotator expertise can indeed vary in real tasks and that the presented approaches provide clear advantages over previously introduced multi-annotator methods, which only consider input-independent annotator characteristics, and over alternative approaches that do not model multiple annotators.
Abstract: Learning from multiple annotators or knowledge sources has become an important problem in machine learning and data mining. This is in part due to the ease with which data can now be shared/collected among entities sharing a common goal, task, or data source; and additionally the need to aggregate and make inferences about the collected information. This paper focuses on the development of probabilistic approaches for statistical learning in this setting. It specially considers the case when annotators may be unreliable, but also when their expertise vary depending on the data they observe. That is, annotators may have better knowledge about different parts of the input space and therefore be inconsistently accurate across the task domain. The models developed address both the supervised and the semi-supervised settings and produce classification and annotator models that allow us to provide estimates of the true labels and annotator expertise when no ground-truth is available. In addition, we provide an analysis of the proposed models, tasks, and related practical problems under various scenarios. In particular, we address how to evaluate annotators and how to consider cases where some ground-truth may be available. We show experimentally that annotator expertise can indeed vary in real tasks and that the presented approaches provide clear advantages over previously introduced multi-annotator methods, which only consider input-independent annotator characteristics, and over alternative approaches that do not model multiple annotators.
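For contrast, here is a sketch of the classical input-independent baseline (a Dawid-Skene-style EM for binary labels) that the paper argues is insufficient when expertise varies across the input space; the data are simulated and the uniform class prior is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 5                               # items, annotators
truth = rng.integers(0, 2, n)
acc_true = rng.uniform(0.55, 0.95, m)       # hidden per-annotator accuracy
labels = np.where(rng.random((n, m)) < acc_true, truth[:, None], 1 - truth[:, None])

# EM with input-independent annotator accuracies: exactly the modeling
# assumption the paper relaxes by letting expertise depend on the input.
q = labels.mean(1)                          # initial P(y_i = 1): soft majority vote
for _ in range(50):
    # M-step: accuracy = expected agreement with the current soft labels.
    acc = (labels * q[:, None] + (1 - labels) * (1 - q[:, None])).mean(0)
    acc = acc.clip(1e-3, 1 - 1e-3)
    # E-step: posterior of y_i = 1 given annotations and accuracies
    # (uniform class prior assumed).
    log1 = np.where(labels == 1, np.log(acc), np.log(1 - acc)).sum(1)
    log0 = np.where(labels == 0, np.log(acc), np.log(1 - acc)).sum(1)
    q = 1 / (1 + np.exp(log0 - log1))

print("estimated accuracies:", acc.round(2))
print("agreement with truth:", ((q > 0.5) == truth).mean())
```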

Proceedings Article
02 Apr 2014
TL;DR: A kernel extension, kernel cluster canonical correlation analysis (cluster-KCCA), is presented that extends cluster-CCA to account for non-linear relationships; cluster-(K)CCA is shown to be computationally efficient, the complexity being similar to standard (K)CCA.
Abstract: In this paper we present cluster canonical correlation analysis (cluster-CCA) for joint dimensionality reduction of two sets of data points. Unlike the standard pairwise correspondence between the data points, in our problem each set is partitioned into multiple clusters or classes, where the class labels define correspondences between the sets. Cluster-CCA is able to learn discriminant low-dimensional representations that maximize the correlation between the two sets while segregating the different classes in the learned space. Furthermore, we present a kernel extension, kernel cluster canonical correlation analysis (cluster-KCCA), that extends cluster-CCA to account for non-linear relationships. Cluster-(K)CCA is shown to be computationally efficient, the complexity being similar to standard (K)CCA. By means of experimental evaluation on benchmark datasets, cluster-(K)CCA is shown to achieve state-of-the-art performance for cross-modal retrieval tasks.
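A minimal sketch of the basic cluster-CCA construction: pair every point with every same-class point of the other set, then run standard CCA. The paper also gives an efficient formulation that avoids this explicit enumeration; the data here are toy:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_classes, per_class, dx, dy = 3, 20, 15, 12
yx = np.repeat(np.arange(n_classes), per_class)   # class labels, set X
yy = np.repeat(np.arange(n_classes), per_class)   # class labels, set Y
X = rng.normal(size=(len(yx), dx)) + yx[:, None]
Y = rng.normal(size=(len(yy), dy)) + yy[:, None]

# Correspondences are defined by class labels: pair every X point with
# every Y point of the same class, then run standard CCA on the pairs.
pairs_x, pairs_y = [], []
for c in range(n_classes):
    for i in np.flatnonzero(yx == c):
        for j in np.flatnonzero(yy == c):
            pairs_x.append(X[i]); pairs_y.append(Y[j])

cca = CCA(n_components=2).fit(np.array(pairs_x), np.array(pairs_y))
Xc, Yc = cca.transform(X, Y)   # shared space usable for cross-modal retrieval
```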

Proceedings ArticleDOI
TL;DR: It is found that superposters display above-average engagement across Coursera, enrolling in more courses and obtaining better grades than the average forum participant; additionally, students who are superposters in one course are significantly more likely to be superposters in other courses they take.
Abstract: Discussion forums, employed by MOOC providers as the primary mode of interaction among instructors and students, have emerged as one of the important components of online courses. We empirically study contribution behavior in these online collaborative learning forums using data from 44 MOOCs hosted on Coursera, focusing primarily on the highest-volume contributors, "superposters", in a forum. We explore who these superposters are and study their engagement patterns across the MOOC platform, with a focus on the following question: to what extent is superposting a positive phenomenon for the forum? Specifically, while superposters clearly contribute heavily to the forum in terms of quantity, how do these contributions rate in terms of quality, and does this prolific posting behavior negatively impact contribution from the large remainder of students in the class? We analyze these questions across the courses in our dataset, and find that superposters display above-average engagement across Coursera, enrolling in more courses and obtaining better grades than the average forum participant; additionally, students who are superposters in one course are significantly more likely to be superposters in other courses they take. In terms of utility, our analysis indicates that while being neither the fastest nor the most upvoted, superposters' responses are speedier and receive more upvotes than the average forum user's posts; a manual assessment of quality on a subset of this content supports the conclusion that a large fraction of superposter contributions indeed constitute useful content. Finally, we find that superposters' prolific contribution behavior does not "drown out the silent majority": high superposter activity correlates positively and significantly with higher overall activity and forum health, as measured by total contribution volume, higher average perceived utility in terms of received votes, and a smaller fraction of orphaned threads.

Proceedings ArticleDOI
24 Aug 2014
TL;DR: It is shown that core decomposition of uncertain graphs can be carried out efficiently as well, and the definitions and methods are evaluated on a number of real-world datasets and applications, such as influence maximization and task-driven team formation.
Abstract: Core decomposition has proven to be a useful primitive for a wide range of graph analyses. One of its most appealing features is that, unlike other notions of dense subgraphs, it can be computed linearly in the size of the input graph. In this paper we provide an analogous tool for uncertain graphs, i.e., graphs whose edges are assigned a probability of existence. The fact that core decomposition can be computed efficiently in deterministic graphs does not guarantee efficiency in uncertain graphs, where even the simplest graph operations may become computationally intensive. Here we show that core decomposition of uncertain graphs can be carried out efficiently as well. We extensively evaluate our definitions and methods on a number of real-world datasets and applications, such as influence maximization and task-driven team formation.
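For reference, here is the deterministic peeling primitive that the paper generalizes to uncertain graphs (where a probabilistic notion of degree replaces the plain one). This sketch re-sorts for clarity; bucket queues give the linear-time bound:

```python
def core_decomposition(adj):
    """Classic peeling: repeatedly remove a minimum-degree vertex; the core
    number of v is its degree at removal time, made monotone via max."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    order = list(adj)
    core, removed, k = {}, set(), 0
    while order:
        order.sort(key=lambda v: deg[v])   # sketch; bucket queues -> O(n + m)
        v = order.pop(0)
        k = max(k, deg[v])
        core[v] = k
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1                # peel v away from its neighbors
    return core

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(core_decomposition(adj))   # {4: 1, 1: 2, 2: 2, 3: 2}
```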

Journal ArticleDOI
21 Oct 2014-Cancers
TL;DR: The main features of drug resistance are highlighted, including mechanisms promoted by cancer cells, cancer stem cells, and stromal cells, as well as the acellular components surrounding the tumor cells, known as peritumoral desmoplasia, which affect intra-tumoral drug delivery.
Abstract: Pancreatic ductal adenocarcinoma (PDAC) occurs mainly in people older than 50 years of age. Although great strides have been taken in treating PDAC over the past decades its incidence nearly equals its mortality rate and it was quoted as the 4th leading cause of cancer deaths in the U.S. in 2012. This review aims to focus on research models and scientific developments that help to explain the extraordinary resistance of PDAC towards current therapeutic regimens. Furthermore, it highlights the main features of drug resistance including mechanisms promoted by cancer cells or cancer stem cells (CSCs), as well as stromal cells, and the acellular components surrounding the tumor cells—known as peritumoral desmoplasia—that affects intra-tumoral drug delivery. Finally, therapeutic concepts and avenues for future research are suggested, based on the topics discussed.

Journal ArticleDOI
TL;DR: This paper analyzes the novel concept of object bank, a high-level image representation encoding object appearance and spatial location information in images, and demonstrates that object bank is a high-level representation from which semantic information of unknown images can easily be discovered.
Abstract: It is a remarkable fact that images are related to objects constituting them. In this paper, we propose to represent images by using objects appearing in them. We introduce the novel concept of object bank (OB), a high-level image representation encoding object appearance and spatial location information in images. OB represents an image based on its response to a large number of pre-trained object detectors, or 'object filters', blind to the testing dataset and visual recognition task. Our OB representation demonstrates promising potential in high level image recognition tasks. It significantly outperforms traditional low level image representations in image classification on various benchmark image datasets by using simple, off-the-shelf classification algorithms such as linear SVM and logistic regression. In this paper, we analyze OB in detail, explaining our design choice of OB for achieving its best potential on different types of datasets. We demonstrate that object bank is a high level representation, from which we can easily discover semantic information of unknown images. We provide guidelines for effectively applying OB to high level image recognition tasks where it could be easily compressed for efficient computation in practice and is very robust to various classifiers.
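A hedged sketch of the representation: per-image vectors of detector responses max-pooled over a spatial grid, fed to an off-the-shelf linear classifier. The random filters below are placeholders for the paper's hundreds of pre-trained object detectors:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_detectors, H, W, grid = 8, 32, 32, 2
filters = rng.normal(size=(n_detectors, 5, 5))   # stand-ins for real detectors

def object_bank_vector(img):
    feats = []
    for f in filters:
        # Naive valid convolution as a toy detector response map.
        resp = np.array([[np.sum(img[i:i+5, j:j+5] * f)
                          for j in range(W - 4)] for i in range(H - 4)])
        # Max-pool each response over a spatial grid to keep location info.
        for rows in np.array_split(resp, grid, axis=0):
            for block in np.array_split(rows, grid, axis=1):
                feats.append(block.max())
    return np.array(feats)                        # n_detectors * grid^2 dims

imgs = rng.normal(size=(40, H, W))                # toy images
y = rng.integers(0, 2, 40)                        # toy scene labels
Xob = np.array([object_bank_vector(im) for im in imgs])
clf = LinearSVC(dual=False).fit(Xob, y)           # simple off-the-shelf classifier
```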

Journal ArticleDOI
Natosha Cramer1
TL;DR: Judging patients for poor life choices is neither right nor professional.
Abstract: Judging patients for poor life choices is neither right nor professional.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work proposes a method to learn optimal seeds for object saliency, and the propagation of the resulting saliency seeds, using a diffusion process, is shown to outperform the state of the art on a number of salient object detection datasets.
Abstract: In diffusion-based saliency detection, an image is partitioned into superpixels and mapped to a graph, with superpixels as nodes and edge strengths proportional to superpixel similarity. Saliency information is then propagated over the graph using a diffusion process, whose equilibrium state yields the object saliency map. The optimal solution is the product of a propagation matrix and a saliency seed vector that contains a prior saliency assessment. This is obtained from either a bottom-up saliency detector or some heuristics. In this work, we propose a method to learn optimal seeds for object saliency. Two types of features are computed per superpixel: the bottom-up saliency of the superpixel region and a set of mid-level vision features informative of how likely the superpixel is to belong to an object. The combination of features that best discriminates between object and background saliency is then learned, using a large-margin formulation of the discriminant saliency principle. The propagation of the resulting saliency seeds, using a diffusion process, is finally shown to outperform the state of the art on a number of salient object detection datasets.
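A minimal numpy sketch of the diffusion step: an affinity graph over superpixels and the equilibrium saliency (D - alpha*W)^{-1} s. The seed vector here is a random stand-in, whereas the paper's contribution is learning it with a large-margin objective:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                        # superpixels
feats = rng.random((n, 3))                    # e.g., mean color per superpixel

# Edge strengths proportional to superpixel similarity.
d2 = ((feats[:, None] - feats[None, :]) ** 2).sum(-1)
W = np.exp(-d2 / 0.1)
np.fill_diagonal(W, 0)
D = np.diag(W.sum(1))

# Seed vector: prior saliency assessment. The paper *learns* optimal seeds;
# this random prior is a placeholder for a bottom-up detector's output.
s = rng.random(n)

alpha = 0.99                                  # D - alpha*W is diagonally
saliency = np.linalg.solve(D - alpha * W, s)  # dominant, hence invertible
saliency = (saliency - saliency.min()) / (np.ptp(saliency) + 1e-9)
print("most salient superpixels:", np.argsort(-saliency)[:5])
```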