
Showing papers by "Yahoo! published in 2008"


Book
08 Jul 2008
TL;DR: This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems and focuses on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis.
Abstract: An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object. This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. We include material on summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.

7,452 citations


Proceedings ArticleDOI
15 Dec 2008
TL;DR: This work identifies unique properties of implicit feedback datasets and proposes treating the data as indication of positive and negative preference associated with vastly varying confidence levels, which leads to a factor model which is especially tailored for implicit feedback recommenders.
Abstract: A common task of recommender systems is to improve customer experience through personalized recommendations based on prior implicit feedback. These systems passively track different sorts of user behavior, such as purchase history, watching habits and browsing activity, in order to model user preferences. Unlike the much more extensively researched explicit feedback, we do not have any direct input from the users regarding their preferences. In particular, we lack substantial evidence on which products consumers dislike. In this work we identify unique properties of implicit feedback datasets. We propose treating the data as indication of positive and negative preference associated with vastly varying confidence levels. This leads to a factor model which is especially tailored for implicit feedback recommenders. We also suggest a scalable optimization procedure, which scales linearly with the data size. The algorithm is used successfully within a recommender system for television shows. It compares favorably with well tuned implementations of other known methods. In addition, we offer a novel way to give explanations to recommendations given by this factor model.

3,149 citations
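
To make the factor model above concrete, here is a minimal sketch of confidence-weighted alternating least squares in the spirit of the abstract: raw implicit counts are turned into binary preferences weighted by a confidence that grows with observed activity. Parameter names and values (alpha, lam, factors) are illustrative assumptions, and the paper's scalable procedure avoids the dense per-user and per-item solves written out here.

```python
import numpy as np

def implicit_als(R, factors=20, alpha=40.0, lam=0.1, iters=10, seed=0):
    """Confidence-weighted ALS sketch. R: (users x items) array of raw implicit counts."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = (R > 0).astype(float)        # binary preference p_ui
    C = 1.0 + alpha * R              # confidence c_ui grows with observed activity
    X = 0.01 * rng.standard_normal((m, factors))   # user factors
    Y = 0.01 * rng.standard_normal((n, factors))   # item factors
    I = lam * np.eye(factors)
    for _ in range(iters):
        for u in range(m):           # regularized weighted least squares per user
            A = Y.T @ (C[u][:, None] * Y) + I
            b = Y.T @ (C[u] * P[u])
            X[u] = np.linalg.solve(A, b)
        for i in range(n):           # symmetric update per item
            A = X.T @ (C[:, i][:, None] * X) + I
            b = X.T @ (C[:, i] * P[:, i])
            Y[i] = np.linalg.solve(A, b)
    return X, Y                      # score an item for a user with X[u] @ Y[i]
```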


Proceedings ArticleDOI
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
09 Jun 2008
TL;DR: A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce, which is an open-source, Apache-incubator project, and available for general use.
Abstract: There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.

2,058 citations


Journal ArticleDOI
TL;DR: An alternating minimization algorithm is proposed for recovering images from blurry and noisy observations with total variation (TV) regularization, derived from a new half-quadratic model applicable to both the anisotropic and the isotropic forms of TV discretization.
Abstract: We propose, analyze, and test an alternating minimization algorithm for recovering images from blurry and noisy observations with total variation (TV) regularization. This algorithm arises from a new half-quadratic model applicable to not only the anisotropic but also the isotropic forms of TV discretizations. The per-iteration computational complexity of the algorithm is three fast Fourier transforms. We establish strong convergence properties for the algorithm including finite convergence for some variables and relatively fast exponential (or $q$-linear in optimization terminology) convergence for the others. Furthermore, we propose a continuation scheme to accelerate the practical convergence of the algorithm. Extensive numerical results show that our algorithm performs favorably in comparison to several state-of-the-art algorithms. In particular, it runs orders of magnitude faster than the lagged diffusivity algorithm for TV-based deblurring. Some extensions of our algorithm are also discussed.

1,883 citations
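
As a concrete illustration of the alternation described above, the following is a denoising-only sketch (the blur operator is taken to be the identity and periodic boundaries are assumed): a closed-form shrinkage step for the auxiliary variables, followed by a single FFT-diagonalized linear solve for the image. The paper's full method additionally handles blur kernels and accelerates convergence with a continuation scheme on the penalty parameter; the parameter values below are illustrative.

```python
import numpy as np

def tv_denoise_fft(f, mu=8.0, beta=32.0, iters=50, eps=1e-12):
    """Half-quadratic splitting sketch for isotropic TV denoising:
    min_u sum_i ||D_i u||_2 + mu/2 ||u - f||^2, periodic boundaries."""
    m, n = f.shape
    # forward-difference operators written as circular convolutions
    d1 = np.zeros((m, n)); d1[0, 0] = -1.0; d1[-1, 0] = 1.0   # vertical
    d2 = np.zeros((m, n)); d2[0, 0] = -1.0; d2[0, -1] = 1.0   # horizontal
    D1, D2 = np.fft.fft2(d1), np.fft.fft2(d2)
    denom = mu + beta * (np.abs(D1) ** 2 + np.abs(D2) ** 2)
    Ff = np.fft.fft2(f)
    u = f.copy()
    for _ in range(iters):
        # w-subproblem: isotropic (2D) soft shrinkage, solved in closed form
        g1 = np.roll(u, -1, axis=0) - u
        g2 = np.roll(u, -1, axis=1) - u
        norm = np.sqrt(g1 ** 2 + g2 ** 2)
        shrink = np.maximum(norm - 1.0 / beta, 0.0) / np.maximum(norm, eps)
        w1, w2 = shrink * g1, shrink * g2
        # u-subproblem: one linear system, diagonalized by the 2D FFT
        numer = mu * Ff + beta * (np.conj(D1) * np.fft.fft2(w1)
                                  + np.conj(D2) * np.fft.fft2(w2))
        u = np.real(np.fft.ifft2(numer / denom))
    return u
```

Each iteration here costs three FFTs (two forward, one inverse), consistent with the per-iteration complexity quoted in the abstract.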


Proceedings ArticleDOI
11 Feb 2008
TL;DR: This paper introduces a general classification framework for combining the evidence from different sources of information that can be tuned automatically for a given social media type and quality definition, and shows that its system is able to separate high-quality items from the rest with an accuracy close to that of humans.
Abstract: The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions -- social media sites -- becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information that can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.

1,300 citations
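
The classification framework described above combines several evidence sources into one learner. The sketch below is illustrative only: the field names and the gradient-boosted learner are assumptions standing in for the paper's much richer feature set (content, usage statistics, and community feedback) and its tuning procedure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def build_features(items):
    """items: dicts describing a Q&A answer; every field here is hypothetical."""
    return np.array([[len(it["text"].split()),               # content: length
                      it["text"].count("!"),                 # content: punctuation
                      it["clicks"],                          # usage statistics
                      it["thumbs_up"] - it["thumbs_down"],   # community rating
                      it["answerer_best_answers"]]           # contributor history
                     for it in items])

def quality_accuracy(items, labels):
    X, y = build_features(items), np.array(labels)           # y: 1 = high quality
    clf = GradientBoostingClassifier()
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
```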


Journal ArticleDOI
01 Aug 2008
TL;DR: PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees and utilizes automated load-balancing and failover to reduce operational complexity.
Abstract: We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.

1,142 citations


Proceedings ArticleDOI
23 Jun 2008
TL;DR: It is shown that one can build histogram intersection kernel SVMs (IKSVMs) with runtime complexity of the classifier logarithmic in the number of support vectors as opposed to linear for the standard approach.
Abstract: Straightforward classification using kernelized SVMs requires evaluating the kernel for a test vector and each of the support vectors. For a class of kernels we show that one can do this much more efficiently. In particular we show that one can build histogram intersection kernel SVMs (IKSVMs) with runtime complexity of the classifier logarithmic in the number of support vectors as opposed to linear for the standard approach. We further show that by precomputing auxiliary tables we can construct an approximate classifier with constant runtime and space requirements, independent of the number of support vectors, with negligible loss in classification accuracy on various tasks. This approximation also applies to 1 - χ² and other kernels of similar form. We also introduce novel features based on multi-level histograms of oriented edge energy and present experiments on various detection datasets. On the INRIA pedestrian dataset an approximate IKSVM classifier based on these features has the current best performance, with a miss rate 13% lower at 10⁻⁶ False Positives Per Window than the linear SVM detector of Dalal & Triggs. On the Daimler Chrysler pedestrian dataset IKSVM gives comparable accuracy to the best results (based on quadratic SVM), while being 15× faster. In these experiments our approximate IKSVM is up to 2000× faster than a standard implementation and requires 200× less memory. Finally we show that a 50× speedup is possible using approximate IKSVM based on spatial pyramid features on the Caltech 101 dataset with negligible loss of accuracy.

1,074 citations
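
The log-time evaluation mentioned in the abstract rests on the fact that the intersection kernel decomposes over feature dimensions, so per-dimension sorted support values plus prefix sums reduce each dimension's contribution to a binary search. The sketch below implements that idea for a given set of support vectors and signed coefficients; the constant-time table-lookup approximation and the feature pipeline from the paper are not included, and the names are illustrative.

```python
import numpy as np

class FastIKSVM:
    """Log-time decision function for a histogram intersection kernel SVM:
    h(x) = b + sum_i [ sum_{s_ji <= x_i} a_j s_ji + x_i * sum_{s_ji > x_i} a_j ]."""
    def __init__(self, support_vectors, alphas, bias=0.0):
        S = np.asarray(support_vectors, dtype=float)          # (m, d)
        a = np.asarray(alphas, dtype=float)                   # signed y_j * alpha_j
        order = np.argsort(S, axis=0)                         # sort per dimension
        self.vals = np.take_along_axis(S, order, axis=0)
        a_sorted = a[order]
        self.prefix_as = np.cumsum(a_sorted * self.vals, axis=0)  # running sum of a_j * s_ji
        self.prefix_a = np.cumsum(a_sorted, axis=0)               # running sum of a_j
        self.total_a = a.sum()
        self.bias = bias

    def decision(self, x):
        score = self.bias
        for i, xi in enumerate(x):
            k = np.searchsorted(self.vals[:, i], xi, side="right")   # O(log m)
            below = self.prefix_as[k - 1, i] if k > 0 else 0.0
            rest = self.total_a - (self.prefix_a[k - 1, i] if k > 0 else 0.0)
            score += below + xi * rest
        return score
```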


Proceedings ArticleDOI
21 Apr 2008
TL;DR: This paper analyzes a representative snapshot of Flickr and presents and evaluates tag recommendation strategies to support the user in the photo annotation task by recommending a set of tags that can be added to the photo.
Abstract: Online photo services such as Flickr and Zooomr allow users to share their photos with family, friends, and the online community at large. An important facet of these services is that users manually annotate their photos using so-called tags, which describe the contents of the photo or provide additional contextual and semantic information. In this paper we investigate how we can assist users in the tagging phase. The contribution of our research is twofold. We analyse a representative snapshot of Flickr and present the results by means of a tag characterisation focussing on how users tag photos and what information is contained in the tagging. Based on this analysis, we present and evaluate tag recommendation strategies to support the user in the photo annotation task by recommending a set of tags that can be added to the photo. The results of the empirical evaluation show that we can effectively recommend relevant tags for a variety of photos with different levels of exhaustiveness of original tagging.

1,048 citations
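
A minimal sketch of the co-occurrence idea behind such tag recommendation: build tag co-occurrence statistics from already-annotated photos, then, for a partially tagged photo, aggregate and rank the tags that co-occur with the user's tags. The normalization and aggregation below stand in for the several candidate strategies the paper actually evaluates.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(tagged_photos):
    """tagged_photos: iterable of tag sets; count how often tag pairs co-occur."""
    co, freq = defaultdict(Counter), Counter()
    for tags in tagged_photos:
        tags = set(tags)
        freq.update(tags)
        for a, b in combinations(tags, 2):
            co[a][b] += 1
            co[b][a] += 1
    return co, freq

def recommend_tags(user_tags, co, freq, top_k=5):
    """Score candidates by normalized co-occurrence with the user's tags and
    aggregate over all user tags (a simple vote/sum strategy)."""
    scores = Counter()
    for t in user_tags:
        for cand, n in co.get(t, {}).items():
            if cand not in user_tags:
                scores[cand] += n / freq[t]
    return [tag for tag, _ in scores.most_common(top_k)]
```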


Proceedings ArticleDOI
05 Jul 2008
TL;DR: A novel dual coordinate descent method for linear SVM with L1- and L2-loss functions that reaches an ε-accurate solution in O(log(1/ε)) iterations is presented.
Abstract: In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an ε-accurate solution in O(log(1/ε)) iterations. Experiments indicate that our method is much faster than state-of-the-art solvers such as Pegasos, TRON, SVMperf, and a recent primal coordinate descent implementation.

1,014 citations
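
The per-coordinate update at the heart of the method has a simple closed form once w = sum_i alpha_i y_i x_i is maintained incrementally. Below is a dense-NumPy sketch of that loop for the hinge (L1) and squared-hinge (L2) losses; the published algorithm additionally uses sparse data structures, shrinking, and a more careful stopping rule.

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, loss="l2", epochs=10, seed=0):
    """Dual coordinate descent sketch for a linear SVM.
    X: (n, d) features, y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    D_ii = 0.0 if loss == "l1" else 1.0 / (2.0 * C)   # diagonal term for squared hinge
    U = C if loss == "l1" else np.inf                 # upper bound on each alpha_i
    Qbar = np.einsum("ij,ij->i", X, X) + D_ii         # Qbar_ii = x_i . x_i + D_ii
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            G = y[i] * (w @ X[i]) - 1.0 + D_ii * alpha[i]      # partial gradient
            new_alpha = min(max(alpha[i] - G / Qbar[i], 0.0), U)
            w += (new_alpha - alpha[i]) * y[i] * X[i]          # keep w = sum alpha_i y_i x_i
            alpha[i] = new_alpha
    return w                                                   # predict with sign(w @ x)
```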


Proceedings ArticleDOI
21 Apr 2008
TL;DR: It is found that a generative model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community structure similar to that observed in nearly every network dataset examined.
Abstract: A large body of work has been devoted to identifying community structure in networks. A community is often thought of as a set of nodes that has more connections between its members than to the remainder of the network. In this paper, we characterize as a function of size the statistical and structural properties of such sets of nodes. We define the network community profile plot, which characterizes the "best" possible community - according to the conductance measure - over a wide range of size scales, and we study over 70 large sparse real-world networks taken from a wide range of application domains. Our results suggest a significantly more refined picture of community structure in large real-world networks than has been appreciated previously. Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like." This behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, this behavior is exactly the opposite of what one would expect based on experience with and intuition from expander graphs, from graphs that are well-embeddable in a low-dimensional structure, and from small social networks that have served as testbeds of community detection algorithms. We have found, however, that a generative model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community structure similar to our observations.

999 citations
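
The network community profile plot scores candidate node sets by conductance and keeps the best score at each size. The sketch below computes conductance and assembles such a profile from caller-supplied candidate sets; the paper instead generates candidates with approximation algorithms (e.g., local spectral methods), which are not reproduced here.

```python
import networkx as nx

def conductance(G, S):
    """phi(S) = (# edges leaving S) / min(vol(S), vol(rest)) for an undirected graph G."""
    S = set(S)
    cut = sum(1 for u, v in G.edges() if (u in S) != (v in S))
    vol_S = sum(d for _, d in G.degree(S))
    vol_rest = 2 * G.number_of_edges() - vol_S
    return cut / max(min(vol_S, vol_rest), 1)

def ncp_from_candidates(G, candidate_sets):
    """Best (lowest) conductance observed at each community size."""
    best = {}
    for S in candidate_sets:
        k, phi = len(S), conductance(G, S)
        if phi < best.get(k, float("inf")):
            best[k] = phi
    return dict(sorted(best.items()))
```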


Patent
11 Feb 2008
TL;DR: In this paper, methods and apparatus are described for detecting social relationships across multiple networks and/or communication channels, which can then be used in a wide variety of ways to support and enhance a broad range of user services.
Abstract: Methods and apparatus are described for detecting social relationships across multiple networks and/or communication channels. These social relationships may then be utilized in a wide variety of ways to support and enhance a broad range of user services.

Journal ArticleDOI
TL;DR: In low-income countries, infectious diseases still account for a large proportion of deaths, highlighting health inequities largely caused by economic differences, and vaccination can cut health-care costs and reduce these inequities.
Abstract: In low-income countries, infectious diseases still account for a large proportion of deaths, highlighting health inequities largely caused by economic differences. Vaccination can cut health-care costs and reduce these inequities. Disease control, elimination or eradication can save billions of US dollars for communities and countries. Vaccines have lowered the incidence of hepatocellular carcinoma and will control cervical cancer. Travellers can be protected against "exotic" diseases by appropriate vaccination. Vaccines are considered indispensable against bioterrorism. They can combat resistance to antibiotics in some pathogens. Noncommunicable diseases, such as ischaemic heart disease, could also be reduced by influenza vaccination. Immunization programmes have improved the primary care infrastructure in developing countries, lowered mortality in childhood and empowered women to better plan their families, with consequent health, social and economic benefits. Vaccination helps economic growth everywhere, because of lower morbidity and mortality. The annual return on investment in vaccination has been calculated to be between 12% and 18%. Vaccination leads to increased life expectancy. Long healthy lives are now recognized as a prerequisite for wealth, and wealth promotes health. Vaccines are thus efficient tools to reduce disparities in wealth and inequities in health.

Proceedings ArticleDOI
24 Aug 2008
TL;DR: A complete model of network evolution is presented, in which nodes arrive at a prespecified rate and select their lifetimes; the combination of the gap distribution with the node lifetime leads to a power law out-degree distribution that accurately reflects the true network in all four cases.
Abstract: We present a detailed study of network evolution by analyzing four large online social networks with full temporal information about node and edge arrivals. For the first time at such a large scale, we study individual node arrival and edge creation processes that collectively lead to macroscopic properties of networks. Using a methodology based on the maximum-likelihood principle, we investigate a wide variety of network formation strategies, and show that edge locality plays a critical role in evolution of networks. Our findings supplement earlier network models based on the inherently non-local preferential attachment. Based on our observations, we develop a complete model of network evolution, where nodes arrive at a prespecified rate and select their lifetimes. Each node then independently initiates edges according to a "gap" process, selecting a destination for each edge according to a simple triangle-closing model free of any parameters. We show analytically that the combination of the gap distribution with the node lifetime leads to a power law out-degree distribution that accurately reflects the true network in all four cases. Finally, we give model parameter settings that allow automatic evolution and generation of realistic synthetic networks of arbitrary scale.
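
To make the generative model concrete, here is a heavily simplified sketch: nodes arrive, draw a lifetime, and keep adding edges separated by sampled gaps, with each later edge closing a triangle through a random neighbor of a random neighbor (the parameter-free step mentioned above). Exponential lifetimes and gaps and the uniform choice of the first edge are simplifying assumptions rather than the paper's fitted distributions, and node lifetimes are processed sequentially rather than interleaved in time.

```python
import random
import networkx as nx

def evolve_network(n_nodes, mean_lifetime=10.0, mean_gap=2.0, seed=0):
    """Illustrative node-arrival / gap / triangle-closing growth process."""
    rng = random.Random(seed)
    G = nx.Graph()
    G.add_node(0)
    for u in range(1, n_nodes):
        G.add_node(u)
        G.add_edge(u, rng.randrange(u))                 # first edge: uniform (assumption)
        budget = rng.expovariate(1.0 / mean_lifetime)   # node lifetime
        while True:
            budget -= rng.expovariate(1.0 / mean_gap)   # gap until the next edge
            if budget <= 0:
                break
            v = rng.choice(list(G.neighbors(u)))        # random neighbor ...
            candidates = [w for w in G.neighbors(v)     # ... then a neighbor of it
                          if w != u and not G.has_edge(u, w)]
            if candidates:
                G.add_edge(u, rng.choice(candidates))   # closes a triangle u-v-w
    return G
```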

Journal ArticleDOI
01 Sep 2008 - ReCALL
TL;DR: A review of publications reporting mobile-assisted language learning (MALL) was undertaken to discover how far mobile devices are being used to support social contact and collaborative learning and in the possibilities for both synchronous and asynchronous interaction in the context of online and distance learning.
Abstract: Mobile learning is undergoing rapid evolution. While early generations of mobile learning tended to propose activities that were carefully crafted by educators and technologists, learners are increasingly motivated by their personal learning needs, including those arising from greater mobility and frequent travel. At the same time, it is often argued that mobile devices are particularly suited to supporting social contacts and collaborative learning - claims that have obvious relevance for language learning. A review of publications reporting mobile-assisted language learning (MALL) was undertaken to discover how far mobile devices are being used to support social contact and collaborative learning. In particular, we were interested in speaking and listening practice and in the possibilities for both synchronous and asynchronous interaction in the context of online and distance learning. We reflect on how mobile language learning has developed to date and suggest directions for the future.

Proceedings ArticleDOI
24 Aug 2008
TL;DR: Two simple tests are proposed that can identify influence as a source of social correlation when the time series of user actions is available and are applied to real tagging data on Flickr, exhibiting that while there is significant social correlation in tagging behavior on this system, this correlation cannot be attributed to social influence.
Abstract: In many online social systems, social ties between users play an important role in dictating their behavior. One of the ways this can happen is through social influence, the phenomenon that the actions of a user can induce his/her friends to behave in a similar way. In systems where social influence exists, ideas, modes of behavior, or new technologies can diffuse through the network like an epidemic. Therefore, identifying and understanding social influence is of tremendous interest from both analysis and design points of view. This is a difficult task in general, since there are factors such as homophily or unobserved confounding variables that can induce statistical correlation between the actions of friends in a social network. Distinguishing influence from these is essentially the problem of distinguishing correlation from causality, a notoriously hard statistical problem. In this paper we study this problem systematically. We define fairly general models that replicate the aforementioned sources of social correlation. We then propose two simple tests that can identify influence as a source of social correlation when the time series of user actions is available. We give a theoretical justification of one of the tests by proving that with high probability it succeeds in ruling out influence in a rather general model of social correlation. We also simulate our tests on a number of examples designed by randomly generating actions of nodes on a real social network (from Flickr) according to one of several models. Simulation results confirm that our test performs well on these data. Finally, we apply them to real tagging data on Flickr, exhibiting that while there is significant social correlation in tagging behavior on this system, this correlation cannot be attributed to social influence.
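
One of the paper's two tests, the shuffle test, can be sketched as follows: estimate a social-correlation coefficient with a logistic model of activation on the (log) number of already-active friends, then re-estimate it after permuting activation times; under influence the coefficient should drop noticeably, while mere correlation leaves it roughly unchanged. The logistic-regression fit below is a stand-in for the paper's maximum-likelihood estimate, and the data layout is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def correlation_coefficient(activation_time, friends, horizon):
    """Fit P(activate at t) ~ sigma(alpha * ln(a + 1) + beta), a = active friends.
    activation_time: dict user -> activation step (or None if never active)."""
    X, y = [], []
    for u, t_u in activation_time.items():
        for t in range(horizon):
            if t_u is not None and t_u < t:
                break                                   # user already active
            a = sum(1 for v in friends.get(u, ())
                    if activation_time.get(v) is not None and activation_time[v] < t)
            X.append([np.log1p(a)])
            y.append(1 if t_u == t else 0)
    model = LogisticRegression().fit(np.array(X), np.array(y))
    return model.coef_[0, 0]                            # the coefficient alpha

def shuffle_test(activation_time, friends, horizon, seed=0):
    """Compare alpha on the real data with alpha after shuffling activation times."""
    rng = np.random.default_rng(seed)
    alpha_real = correlation_coefficient(activation_time, friends, horizon)
    users = [u for u, t in activation_time.items() if t is not None]
    times = rng.permutation([activation_time[u] for u in users])
    shuffled = dict(activation_time)
    shuffled.update({u: int(t) for u, t in zip(users, times)})
    alpha_shuffled = correlation_coefficient(shuffled, friends, horizon)
    return alpha_real, alpha_shuffled
```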

Journal ArticleDOI
TL;DR: PsPD has a clinical impact on chemotherapy-treated GBM, as it may express the glioma killing effects of treatment and is significantly correlated with MGMT status.
Abstract: Purpose Standard therapy for glioblastoma (GBM) is temozolomide (TMZ) administration, initially concurrent with radiotherapy (RT), and subsequently as maintenance therapy. The radiologic images obtained in this setting can be difficult to interpret since they may show radiation-induced pseudoprogression (psPD) rather than disease progression. Methods Patients with histologically confirmed GBM underwent radiotherapy plus continuous daily temozolomide (75 mg/m²/d), followed by 12 maintenance temozolomide cycles (150 to 200 mg/m² for 5 days every 28 days) if magnetic resonance imaging (MRI) showed no enhancement suggesting a tumor; otherwise, chemotherapy was delivered until complete response or unequivocal progression. The first MRI scan was performed 1 month after completing combined chemoradiotherapy. Results In 103 patients (mean age, 52 years [range 20 to 73 years]), total resection, subtotal resection, and biopsy were obtained in 51, 51, and 1 cases, respectively. MGMT promoter was methylated in 36 patients (35%) and unmethylated in 67 patients (65%). Lesion enlargement, evidenced at the first MRI scan in 50 of 103 patients, was subsequently classified as psPD in 32 patients and early disease progression in 18 patients. PsPD was recorded in 21 (91%) of 23 methylated MGMT promoter and 11 (41%) of 27 unmethylated MGMT promoter (P = .0002) patients. MGMT status (P = .001) and psPD detection (P = .045) significantly influenced survival. Conclusion PsPD has a clinical impact on chemotherapy-treated GBM, as it may express the glioma killing effects of treatment and is significantly correlated with MGMT status. Improvement in the early recognition of psPD patterns and knowledge of mechanisms underlying this phenomenon are crucial to eliminating biases in evaluating the results of clinical trials and guaranteeing effective treatment. J Clin Oncol 26:2192-2197. © 2008 by American Society of Clinical Oncology

Journal ArticleDOI
14 Mar 2008
TL;DR: In this paper, the authors outline the problems of content-based music information retrieval and explore the state-of-the-art methods using audio cues (e.g., query by humming, audio fingerprinting, content-based music retrieval) and other cues such as music notation and symbolic representation.
Abstract: The steep rise in music downloading over CD sales has created a major shift in the music industry away from physical media formats and towards online products and services. Music is one of the most popular types of online information and there are now hundreds of music streaming and download services operating on the World-Wide Web. Some of the music collections available are approaching the scale of ten million tracks and this has posed a major challenge for searching, retrieving, and organizing music content. Research efforts in music information retrieval have involved experts from music perception, cognition, musicology, engineering, and computer science engaged in truly interdisciplinary activity that has resulted in many proposed algorithmic and methodological solutions to music search using content-based methods. This paper outlines the problems of content-based music information retrieval and explores the state-of-the-art methods using audio cues (e.g., query by humming, audio fingerprinting, content-based music retrieval) and other cues (e.g., music notation and symbolic representation), and identifies some of the major challenges for the coming years.

Journal ArticleDOI
TL;DR: ActionAid conducted participatory vulnerability analysis in five African cities to explore local people's perceptions of why floods occur, how they adjust to them, who is responsible for reducing the flood risk and what action the community itself can take as discussed by the authors.
Abstract: Many of the urban poor in Africa face growing problems of severe flooding. Increased storm frequency and intensity related to climate change are exacerbated by such local factors as the growing occupation of floodplains, increased runoff from hard surfaces, inadequate waste management and silted-up drainage. One can distinguish four types of flooding in urban areas: localized flooding due to inadequate drainage; flooding from small streams within the built-up area; flooding from major rivers; and coastal flooding. ActionAid undertook participatory vulnerability analysis in five African cities, to explore local people's perceptions of why floods occur, how they adjust to them, who is responsible for reducing the flood risk and what action the community itself can take. While local people adapt to floods, recognition of local, national and international governments' and organizations' responsibility to act to alleviate flooding and its causes, especially the consequences of climate change, is urgently needed.

Journal ArticleDOI
TL;DR: This work counted daily unique queries originating in the United States that contained influenza-related search terms from the Yahoo! search engine from March 2004 through May 2008, and estimated linear models, using searches with 1-10-week lead times as explanatory variables to predict the percentage of cultures positive for influenza and deaths attributable to pneumonia and influenza in the US.
Abstract: The Internet is an important source of health information. Thus, the frequency of Internet searches may provide information regarding infectious disease activity. As an example, we examined the relationship between searches for influenza and actual influenza occurrence. Using search queries from the Yahoo! search engine ( http://search.yahoo.com ) from March 2004 through May 2008, we counted daily unique queries originating in the United States that contained influenza-related search terms. Counts were divided by the total number of searches, and the resulting daily fraction of searches was averaged over the week. We estimated linear models, using searches with 1-10-week lead times as explanatory variables to predict the percentage of cultures positive for influenza and deaths attributable to pneumonia and influenza in the United States. With use of the frequency of searches, our models predicted an increase in cultures positive for influenza 1-3 weeks in advance of when they occurred (P < .001), and similar models predicted an increase in mortality attributable to pneumonia and influenza up to 5 weeks in advance (P < .001). Search-term surveillance may provide an additional tool for disease surveillance.
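
A minimal sketch of the lead-time analysis described above: regress the outcome series on the search fraction observed a fixed number of weeks earlier, once per lead time. The single-predictor linear fit is a simplification of the paper's models, and the inputs are assumed to be aligned weekly series.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def lagged_models(search_fraction, pct_positive, max_lead=10):
    """Fit one linear model per lead time (1-10 weeks) and report slope and R^2."""
    x = np.asarray(search_fraction, dtype=float)   # weekly fraction of flu-related searches
    y = np.asarray(pct_positive, dtype=float)      # weekly % of cultures positive for influenza
    results = {}
    for lead in range(1, max_lead + 1):
        X = x[:-lead].reshape(-1, 1)               # searches `lead` weeks earlier
        Y = y[lead:]                               # outcome in the later week
        model = LinearRegression().fit(X, Y)
        results[lead] = (model.coef_[0], model.score(X, Y))
    return results
```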

Journal ArticleDOI
TL;DR: PGT coronary CT angiography offers improved image quality and substantially reduced effective radiation dose compared with traditional RGH coronary CT angiography.
Abstract: Purpose: To retrospectively compare image quality, radiation dose, and blood vessel assessability for coronary artery computed tomographic (CT) angiograms obtained with a prospectively gated transverse (PGT) CT technique and a retrospectively gated helical (RGH) CT technique. Materials and Methods: This HIPAA-compliant study received a waiver for approval from the institutional review board, including one for informed consent. Coronary CT angiograms obtained with 64–detector row CT were retrospectively evaluated in 203 clinical patients. A routine RGH technique was evaluated in 82 consecutive patients (44 males, 38 females; mean age, 55.6 years). The PGT technique was then evaluated in 121 additional patients (71 males, 50 females; mean age, 56.7 years). All images were evaluated for image quality, estimated radiation dose, and coronary artery segment assessability. Differences in image quality score were evaluated by using a proportional odds logistic regression model, with main effects for three readers...

Proceedings ArticleDOI
16 Aug 2008
TL;DR: This shared task not only unifies the shared tasks of the previous four years under a unique dependency-based formalism, but also extends them significantly: this year's syntactic dependencies include more information such as named-entity boundaries; the semantic dependencies model roles of both verbal and nominal predicates.
Abstract: The Conference on Computational Natural Language Learning is accompanied every year by a shared task whose purpose is to promote natural language processing applications and evaluate them in a standard setting. In 2008 the shared task was dedicated to the joint parsing of syntactic and semantic dependencies. This shared task not only unifies the shared tasks of the previous four years under a unique dependency-based formalism, but also extends them significantly: this year's syntactic dependencies include more information such as named-entity boundaries; the semantic dependencies model roles of both verbal and nominal predicates. In this paper, we define the shared task and describe how the data sets were created. Furthermore, we report and analyze the results and describe the approaches of the participating systems.

Journal ArticleDOI
Weihua Deng
TL;DR: The finite element method is developed for the numerical resolution of the space and time fractional Fokker-Planck equation, which is an effective tool for describing a process with both traps and flights.
Abstract: We develop the finite element method for the numerical resolution of the space and time fractional Fokker-Planck equation, which is an effective tool for describing a process with both traps and flights; the time fractional derivative of the equation is used to characterize the traps, and the flights are depicted by the space fractional derivative. The stability and error estimates are rigorously established, and we prove that the convergent order is $O(k^{2-\alpha}+h^\mu)$, where $k$ is the time step size and $h$ the space step size. Numerical computations are presented which demonstrate the effectiveness of the method and confirm the theoretical claims.

Journal ArticleDOI
TL;DR: A theoretical analysis of Model-based Interval Estimation and a new variation called MBIE-EB are presented, proving their efficiency even under worst-case conditions.
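
For context, MBIE-EB augments the Bellman backup of the empirical model with an exploration bonus proportional to 1/sqrt(n(s, a)). The sketch below solves those optimistic equations by value iteration over a tabular model; the bonus coefficient and the clamping of unvisited counts are illustrative simplifications.

```python
import numpy as np

def mbie_eb_values(counts, reward_sums, beta=0.5, gamma=0.95, sweeps=200):
    """counts[s, a, s']: observed transitions; reward_sums[s, a]: summed rewards."""
    n_sa = counts.sum(axis=2)                             # visits to each (s, a)
    T_hat = counts / np.maximum(n_sa, 1)[:, :, None]      # empirical transition model
    R_hat = reward_sums / np.maximum(n_sa, 1)             # empirical mean reward
    bonus = beta / np.sqrt(np.maximum(n_sa, 1))           # optimism for rarely tried pairs
    Q = np.zeros_like(R_hat, dtype=float)
    for _ in range(sweeps):                               # value iteration
        V = Q.max(axis=1)
        Q = R_hat + bonus + gamma * T_hat @ V             # optimistic Bellman backup
    return Q
```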

Proceedings ArticleDOI
20 Jul 2008
TL;DR: It is confirmed that a user almost always sees the document directly after a clicked document, and it is explained why documents situated just after a very relevant document are clicked more often.
Abstract: Search engine click logs provide an invaluable source of relevance information but this information is biased because we ignore which documents from the result list the users have actually seen before and after they clicked. Otherwise, we could estimate document relevance by simple counting. In this paper, we propose a set of assumptions on user browsing behavior that allows the estimation of the probability that a document is seen, thereby providing an unbiased estimate of document relevance. To train, test and compare our model to the best alternatives described in the literature, we gather a large set of real data and proceed to an extensive cross-validation experiment. Our solution very significantly outperforms all previous models. As a side effect, we gain insight into the browsing behavior of users and we can compare it to the conclusions of an eye-tracking experiment by Joachims et al. [12]. In particular, our findings confirm that a user almost always sees the document directly after a clicked document. They also explain why documents situated just after a very relevant document are clicked more often.

Journal ArticleDOI
TL;DR: By using spectral graph analysis, SRDA casts discriminant analysis into a regression framework that facilitates both efficient computation and the use of regularization techniques, and there is no eigenvector computation involved, which saves a great deal of both time and memory.
Abstract: Linear Discriminant Analysis (LDA) has been a popular method for extracting features that preserves class separability. The projection functions of LDA are commonly obtained by maximizing the between-class covariance and simultaneously minimizing the within-class covariance. It has been widely used in many fields of information processing, such as machine learning, data mining, information retrieval, and pattern recognition. However, the computation of LDA involves dense matrix eigendecomposition, which can be computationally expensive in both time and memory. Specifically, LDA has O(mnt + t³) time complexity and requires O(mn + mt + nt) memory, where m is the number of samples, n is the number of features, and t = min(m,n). When both m and n are large, it is infeasible to apply LDA. In this paper, we propose a novel algorithm for discriminant analysis, called Spectral Regression Discriminant Analysis (SRDA). By using spectral graph analysis, SRDA casts discriminant analysis into a regression framework that facilitates both efficient computation and the use of regularization techniques. Specifically, SRDA only needs to solve a set of regularized least squares problems, and there is no eigenvector computation involved, which saves a great deal of both time and memory. Our theoretical analysis shows that SRDA can be computed with O(mn) time and O(ms) memory, where s (s ≤ n) is the average number of nonzero features in each sample. Extensive experimental results on four real-world data sets demonstrate the effectiveness and efficiency of our algorithm.
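
A compact sketch of the regression view described above: build c - 1 response vectors from the class indicators (orthogonalized against the all-ones vector), then obtain each projection by solving a regularized least-squares problem, with no eigendecomposition. The dense normal-equations solve below is for clarity only; the paper's implementation targets large sparse data (e.g., with iterative solvers), and the orthogonalization details are simplified.

```python
import numpy as np

def srda_fit(X, labels, alpha=1.0):
    """Return a (d x (c-1)) projection matrix learned by regularized regression."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xc = X - X.mean(axis=0)                        # center the data
    classes = np.unique(labels)
    m = X.shape[0]
    # all-ones vector followed by one indicator column per class
    E = np.column_stack([np.ones(m)] +
                        [(labels == c).astype(float) for c in classes])
    Q, _ = np.linalg.qr(E)                         # Gram-Schmidt via QR
    Y = Q[:, 1:len(classes)]                       # c-1 responses, orthogonal to ones
    # ridge regression for each response: (Xc^T Xc + alpha I) w = Xc^T y
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, Xc.T @ Y)

def srda_transform(X, W):
    return np.asarray(X, dtype=float) @ W          # embed into the discriminant space
```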

Proceedings ArticleDOI
26 Oct 2008
TL;DR: This is the first work to identify, measure and automatically segment sequences of user queries into their hierarchical structure, and paves the way for evaluating search engines in terms of user task completion.
Abstract: Most analysis of web search relevance and performance takes a single query as the unit of search engine interaction. When studies attempt to group queries together by task or session, a timeout is typically used to identify the boundary. However, users query search engines in order to accomplish tasks at a variety of granularities, issuing multiple queries as they attempt to accomplish tasks. In this work we study real sessions manually labeled into hierarchical tasks, and show that timeouts, whatever their length, are of limited utility in identifying task boundaries, achieving a maximum precision of only 70%. We report on properties of this search task hierarchy, as seen in a random sample of user interactions from a major web search engine's log, annotated by human editors, learning that 17% of tasks are interleaved, and 20% are hierarchically organized. No previous work has analyzed or addressed automatic identification of interleaved and hierarchically organized search tasks. We propose and evaluate a method for the automated segmentation of users' query streams into hierarchical units. Our classifiers can improve on timeout segmentation, as well as other previously published approaches, bringing the accuracy up to 92% for identifying fine-grained task boundaries, and 89-97% for identifying pairs of queries from the same task when tasks are interleaved hierarchically. This is the first work to identify, measure and automatically segment sequences of user queries into their hierarchical structure. The ability to perform this kind of segmentation paves the way for evaluating search engines in terms of user task completion.

Proceedings ArticleDOI
21 Apr 2008
TL;DR: This work uses a combination of context- and content-based tools to generate representative sets of images for location-driven features and landmarks, a common search task.
Abstract: Can we leverage the community-contributed collections of rich media on the web to automatically generate representative and diverse views of the world's landmarks? We use a combination of context- and content-based tools to generate representative sets of images for location-driven features and landmarks, a common search task. To do that, we use location and other metadata, as well as tags associated with images, and the images' visual features. We present an approach to extracting tags that represent landmarks. We show how to use unsupervised methods to extract representative views and images for each landmark. This approach can potentially scale to provide better search and representation for landmarks, worldwide. We evaluate the system in the context of image search using a real-life dataset of 110,000 images from the San Francisco area.

Proceedings ArticleDOI
21 Apr 2008
TL;DR: This paper proposes FacetNet, a novel framework for analyzing communities and their evolutions through a robust unified process, where communities not only generate evolutions, they also are regularized by the temporal smoothness of evolutions.
Abstract: We discover communities from social network data, and analyze the community evolution. These communities are inherent characteristics of human interaction in online social networks, as well as paper citation networks. Also, communities may evolve over time, due to changes to individuals' roles and social status in the network as well as changes to individuals' research interests. We present an innovative algorithm that deviates from the traditional two-step approach to analyze community evolutions. In the traditional approach, communities are first detected for each time slice, and then compared to determine correspondences. We argue that this approach is inappropriate in applications with noisy data. In this paper, we propose FacetNet for analyzing communities and their evolutions through a robust unified process. In this novel framework, communities not only generate evolutions, they also are regularized by the temporal smoothness of evolutions. As a result, this framework will discover communities that jointly maximize the fit to the observed data and the temporal evolution. Our approach relies on formulating the problem in terms of non-negative matrix factorization, where communities and their evolutions are factorized in a unified way. Then we develop an iterative algorithm, with proven low time complexity, which is guaranteed to converge to an optimal solution. We perform extensive experimental studies, on both synthetic datasets and real datasets, to demonstrate that our method discovers meaningful communities and provides additional insights not directly obtainable from traditional methods.

Journal ArticleDOI
TL;DR: Subspace sampling, as discussed by the authors, is a sampling method that underpins the first polynomial-time algorithms for low-rank matrix approximations expressed in terms of actual columns and rows of the data matrix that come with relative-error guarantees; previously, in some cases, it was not even known whether such decompositions exist.
Abstract: Many data analysis applications deal with large matrices and involve approximating the matrix using a small number of “components.” Typically, these components are linear combinations of the rows and columns of the matrix, and are thus difficult to interpret in terms of the original features of the input data. In this paper, we propose and study matrix approximations that are explicitly expressed in terms of a small number of columns and/or rows of the data matrix, and thereby more amenable to interpretation in terms of the original data. Our main algorithmic results are two randomized algorithms which take as input an $m\times n$ matrix $A$ and a rank parameter $k$. In our first algorithm, $C$ is chosen, and we let $A'=CC^+A$, where $C^+$ is the Moore-Penrose generalized inverse of $C$. In our second algorithm $C$, $U$, $R$ are chosen, and we let $A'=CUR$. ($C$ and $R$ are matrices that consist of actual columns and rows, respectively, of $A$, and $U$ is a generalized inverse of their intersection.) For each algorithm, we show that with probability at least $1-\delta$, $\|A-A'\|_F\leq(1+\epsilon)\,\|A-A_k\|_F$, where $A_k$ is the “best” rank-$k$ approximation provided by truncating the SVD of $A$, and where $\|X\|_F$ is the Frobenius norm of the matrix $X$. The number of columns of $C$ and rows of $R$ is a low-degree polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$. Both the Numerical Linear Algebra community and the Theoretical Computer Science community have studied variants of these matrix decompositions over the last ten years. However, our two algorithms are the first polynomial time algorithms for such low-rank matrix approximations that come with relative-error guarantees; previously, in some cases, it was not even known whether such matrix decompositions exist. Both of our algorithms are simple and they take time of the order needed to approximately compute the top $k$ singular vectors of $A$. The technical crux of our analysis is a novel, intuitive sampling method we introduce in this paper called “subspace sampling.” In subspace sampling, the sampling probabilities depend on the Euclidean norms of the rows of the top singular vectors. This allows us to obtain provable relative-error guarantees by deconvoluting “subspace” information and “size-of-$A$” information in the input matrix. This technique is likely to be useful for other matrix approximation and data analysis problems.
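
The following is a sketch of the subspace-sampling idea: columns and rows are drawn with probabilities proportional to the squared row norms of the top-k singular vectors (their "leverage"), and a middle matrix links them. For simplicity it samples without replacement, omits the rescaling used in the published analysis, and takes U = pinv(C) A pinv(R), the Frobenius-optimal choice for fixed C and R, rather than the paper's generalized inverse of the intersection.

```python
import numpy as np

def cur_decomposition(A, k, c, r, seed=0):
    """Leverage-score CUR sketch: returns C (m x c), U (c x r), R (r x n)."""
    rng = np.random.default_rng(seed)
    Uk, _, Vtk = np.linalg.svd(A, full_matrices=False)
    Uk, Vtk = Uk[:, :k], Vtk[:k]
    col_lev = (Vtk ** 2).sum(axis=0)               # column leverage scores
    row_lev = (Uk ** 2).sum(axis=1)                # row leverage scores
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_lev / col_lev.sum())
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_lev / row_lev.sum())
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # best U for the chosen C and R
    return C, U, R
```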

Proceedings ArticleDOI
Xin Li, Lei Guo, Yihong Eric Zhao
21 Apr 2008
TL;DR: An Internet Social Interest Discovery system, ISID, is developed to discover common user interests and to cluster users and their saved URLs by interest topic; the evaluation shows that ISID can effectively cluster similar documents by interest topic and discover user communities with common interests whether or not they have any online connections.
Abstract: The success and popularity of social network systems, such as del.icio.us, Facebook, MySpace, and YouTube, have generated many interesting and challenging problems to the research community. Among others, discovering social interests shared by groups of users is very important because it helps to connect people with common interests and encourages people to contribute and share more contents. The main challenge to solving this problem comes from the difficulty of detecting and representing the interest of the users. The existing approaches are all based on the online connections of users and so are unable to identify the common interest of users who have no online connections. In this paper, we propose a novel social interest discovery approach based on user-generated tags. Our approach is motivated by the key observation that in a social network, human users tend to use descriptive tags to annotate the contents that they are interested in. Our analysis on a large amount of real-world traces reveals that in general, user-generated tags are consistent with the web content they are attached to, while more concise and closer to the understanding and judgments of human users about the content. Thus, patterns of frequent co-occurrences of user tags can be used to characterize and capture topics of user interests. We have developed an Internet Social Interest Discovery system, ISID, to discover the common user interests and cluster users and their saved URLs by different interest topics. Our evaluation shows that ISID can effectively cluster similar documents by interest topics and discover user communities with common interests whether or not they have any online connections.