Showing papers by "Yahoo!" published in 2006

Journal Article•DOI•

Lou Jost¹•Institutions (1)

01 May 2006-Oikos

TL;DR: The standard similarity measure based on untransformed indices is shown to give misleading results, but transforming the indices or entropies to effective numbers of species produces a stable, easily interpreted, sensitive general similarity measure.

Abstract: Entropies such as the Shannon–Wiener and Gini–Simpson indices are not themselves diversities. Conversion of these to effective number of species is the key to a unified and intuitive interpretation of diversity. Effective numbers of species derived from standard diversity indices share a common set of intuitive mathematical properties and behave as one would expect of a diversity, while raw indices do not. Contrary to Keylock, the lack of concavity of effective numbers of species is irrelevant as long as they are used as transformations of concave alpha, beta, and gamma entropies. The practical importance of this transformation is demonstrated by applying it to a popular community similarity measure based on raw diversity indices or entropies. The standard similarity measure based on untransformed indices is shown to give misleading results, but transforming the indices or entropies to effective numbers of species produces a stable, easily interpreted, sensitive general similarity measure. General overlap measures derived from this transformed similarity measure yield the Jaccard index, Sorensen index, Horn index of overlap, and the Morisita–Horn index as special cases.

3,677 citations
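
Note on the abstract above: the conversion it describes can be illustrated with a short sketch. It assumes the standard conversions to effective numbers of species, exp(H) for the Shannon entropy H and 1/(1 - D) for the Gini–Simpson index D, which the abstract itself does not spell out. The function names and the two toy communities are hypothetical.

```python
import math


def shannon_entropy(p):
    """Shannon entropy H = -sum(p_i * ln p_i), using the natural logarithm."""
    return -sum(x * math.log(x) for x in p if x > 0)


def gini_simpson(p):
    """Gini-Simpson index D = 1 - sum(p_i ** 2)."""
    return 1.0 - sum(x * x for x in p)


def effective_number_from_shannon(h):
    """Effective number of species implied by a Shannon entropy."""
    return math.exp(h)


def effective_number_from_gini_simpson(d):
    """Effective number of species implied by a Gini-Simpson index."""
    return 1.0 / (1.0 - d)


# Two hypothetical communities of equally common species.
even_8 = [1 / 8] * 8
even_16 = [1 / 16] * 16

if __name__ == "__main__":
    for name, community in (("8 species", even_8), ("16 species", even_16)):
        h = shannon_entropy(community)
        d = gini_simpson(community)
        print(name, round(h, 3), round(d, 4),
              round(effective_number_from_shannon(h), 3),
              round(effective_number_from_gini_simpson(d), 3))
```

Doubling the number of equally common species doubles both effective numbers (8 to 16), whereas the raw Shannon entropy only rises from about 2.08 to 2.77 and the raw Gini–Simpson index from 0.875 to 0.9375.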

Book Chapter•DOI•

A Survey of Clustering Data Mining Techniques

Pavel Berkhin¹•Institutions (1)

Yahoo!¹

01 Jan 2006

TL;DR: This survey concentrates on clustering algorithms from a data mining perspective as a data modeling technique that provides for concise summaries of the data.

Abstract: Clustering is the division of data into groups of similar objects. In clustering, some details are disregarded in exchange for data simplification. Clustering can be viewed as a data modeling technique that provides for concise summaries of the data. Clustering is therefore related to many disciplines and plays an important role in a broad range of applications. The applications of clustering usually deal with large datasets and data with many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.

3,047 citations

Journal Article•DOI•

The amphibian tree of life

Darrel R. Frost¹, Taran Grant¹,², Julián Faivovich¹,², Raoul H. Bain¹, Alexander Haas³, Célio F. B. Haddad⁴, Rafael O. de Sá⁵, Alan Channing⁶, Mark Wilkinson⁷, Stephen C. Donnellan, Christopher J. Raxworthy¹, Jonathan A. Campbell⁸, Boris L. Blotto⁹, Paul E. Moler¹⁰, Robert C. Drewes¹¹, Ronald A. Nussbaum¹², John D. Lynch¹³, David M. Green¹⁴, Ward C. Wheeler¹•Institutions (14)

American Museum of Natural History¹, Columbia University², University of Hamburg³, Sao Paulo State University⁴, University of Richmond⁵, University of the Western Cape⁶, Natural History Museum⁷, University of Texas at Arlington⁸, Yahoo!⁹, Florida Fish and Wildlife Conservation Commission¹⁰, California Academy of Sciences¹¹, University of Michigan¹², National University of Colombia¹³, McGill University¹⁴

01 Jan 2006-Bulletin of the American Museum of Natural History

TL;DR: A new taxonomy of living amphibians is proposed to correct the deficiencies of the old one, based on the largest phylogenetic analysis of living Amphibia so far accomplished, and many subsidiary taxa are demonstrated to be nonmonophyletic.

Abstract: The evidentiary basis of the currently accepted classification of living amphibians is discussed and shown not to warrant the degree of authority conferred on it by use and tradition. A new taxonomy of living amphibians is proposed to correct the deficiencies of the old one. This new taxonomy is based on the largest phylogenetic analysis of living Amphibia so far accomplished. We combined the comparative anatomical character evidence of Haas (2003) with DNA sequences from the mitochondrial transcription unit H1 (12S and 16S ribosomal RNA and tRNAValine genes, ≈ 2,400 bp of mitochondrial sequences) and the nuclear genes histone H3, rhodopsin, tyrosinase, and seven in absentia, and the large ribosomal subunit 28S (≈ 2,300 bp of nuclear sequences; ca. 1.8 million base pairs; x̄ = 3.7 kb/terminal). The dataset includes 532 terminals sampled from 522 species representative of the global diversity of amphibians as well as seven of the closest living relatives of amphibians for outgroup comparisons. The...

1,994 citations

Proceedings Article•DOI•

Structure and evolution of online social networks

Ravi Kumar¹, Jasmine Novak¹, Andrew Tomkins¹•Institutions (1)

Yahoo!¹

20 Aug 2006

TL;DR: A simple model of network growth is presented, characterizing users as either passive members of the network; inviters who encourage offline friends and acquaintances to migrate online; and linkers who fully participate in the social evolution of the network.

Abstract: In this paper, we consider the evolution of structure within large online social networks. We present a series of measurements of two such networks, together comprising in excess of five million people and ten million friendship links, annotated with metadata capturing the time of every event in the life of the network. Our measurements expose a surprising segmentation of these networks into three regions: singletons who do not participate in the network; isolated communities which overwhelmingly display star structure; and a giant component anchored by a well-connected core region which persists even in the absence of stars. We present a simple model of network growth which captures these aspects of component structure. The model follows our experimental results, characterizing users as either passive members of the network; inviters who encourage offline friends and acquaintances to migrate online; and linkers who fully participate in the social evolution of the network.

1,151 citations

Proceedings Article•DOI•

Local Graph Partitioning using PageRank Vectors

R. Andersen¹, Fan Chung¹, Kevin J. Lang²•Institutions (2)

University of California, Berkeley¹, Yahoo!²

21 Oct 2006

TL;DR: An improved algorithm for computing approximate PageRank vectors is presented, which allows a cut with conductance at most $\phi$ and approximately optimal balance to be found in time $O(m \log^4 m / \phi^2)$.

Abstract: A local graph partitioning algorithm finds a cut near a specified starting vertex, with a running time that depends largely on the size of the small side of the cut, rather than the size of the input graph. In this paper, we present a local partitioning algorithm using a variation of PageRank with a specified starting distribution. We derive a mixing result for PageRank vectors similar to that for random walks, and show that the ordering of the vertices produced by a PageRank vector reveals a cut with small conductance. In particular, we show that for any set $C$ with conductance $\Phi$ and volume $k$, a PageRank vector with a certain starting distribution can be used to produce a set with conductance $O(\sqrt{\Phi \log k})$. We present an improved algorithm for computing approximate PageRank vectors, which allows us to find such a set in time proportional to its size. In particular, we can find a cut with conductance at most $\phi$, whose small side has volume at least $2^b$, in time $O(2^b \log^2 m / \phi^2)$, where $m$ is the number of edges in the graph. By combining small sets found by this local partitioning algorithm, we obtain a cut with conductance $\phi$ and approximately optimal balance in time $O(m \log^4 m / \phi^2)$.

1,022 citations
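
Note on the abstract above: the two ingredients it names, an approximate PageRank vector with a chosen starting vertex and a cut read off from the resulting vertex ordering, can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a push-style approximation on a lazy random walk and a sweep over vertices ordered by PageRank value divided by degree. The parameter names `alpha` and `eps`, the function names, and the toy graph are all assumptions.

```python
from collections import deque


def approximate_pagerank(adj, seed, alpha=0.15, eps=1e-4):
    """Push-style approximation of a PageRank vector started at `seed`.

    `adj` maps each vertex to a list of neighbours (undirected graph,
    no isolated vertices). Pushing stops once every residual r[u] is
    below eps * degree(u).
    """
    p = {}            # approximate PageRank values
    r = {seed: 1.0}   # residual mass still to be distributed
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        ru = r.get(u, 0.0)
        if ru < eps * deg_u:
            continue  # stale queue entry, nothing to push
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2          # lazy walk keeps half at u
        share = (1 - alpha) * ru / (2 * deg_u)
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(adj[v]):
                queue.append(v)
        if r[u] >= eps * deg_u:
            queue.append(u)
    return p


def sweep_cut(adj, p):
    """Return the prefix of the degree-normalised ordering with the
    smallest conductance, together with that conductance."""
    total_volume = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(p, key=lambda u: p[u] / len(adj[u]), reverse=True)
    in_set, volume, cut = set(), 0, 0
    best_set, best_conductance = set(), float("inf")
    for u in order:
        inside = sum(1 for v in adj[u] if v in in_set)
        in_set.add(u)
        volume += len(adj[u])
        cut += len(adj[u]) - 2 * inside  # new boundary edges minus absorbed ones
        denom = min(volume, total_volume - volume)
        if denom == 0:
            break  # the prefix now covers the whole graph
        conductance = cut / denom
        if conductance < best_conductance:
            best_set, best_conductance = set(in_set), conductance
    return best_set, best_conductance


def two_cliques_with_bridge():
    """Toy graph: two 4-cliques (0-3 and 4-7) joined by the edge 3-4."""
    adj = {v: [] for v in range(8)}
    for block in (range(0, 4), range(4, 8)):
        for u in block:
            for v in block:
                if u != v:
                    adj[u].append(v)
    adj[3].append(4)
    adj[4].append(3)
    return adj


if __name__ == "__main__":
    graph = two_cliques_with_bridge()
    vector = approximate_pagerank(graph, seed=0)
    print(sweep_cut(graph, vector))
```

On the toy graph, starting from vertex 0, the sweep recovers the clique containing the seed: a single bridge edge against a volume of 13 gives conductance 1/13.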

Proceedings Article•DOI•

HT06, tagging paper, taxonomy, Flickr, academic article, to read

Cameron Marlow¹, Mor Naaman¹, danah boyd², Marc Davis²•Institutions (2)

Yahoo!¹, University of California, Berkeley²

22 Aug 2006

TL;DR: A model of tagging systems, specifically in the context of web-based systems, is offered to help illustrate the possible benefits of these tools and a simple taxonomy of incentives and contribution models is provided to inform potential evaluative frameworks.

Abstract: In recent years, tagging systems have become increasingly popular. These systems enable users to add keywords (i.e., "tags") to Internet resources (e.g., web pages, images, videos) without relying on a controlled vocabulary. Tagging systems have the potential to improve search, spam detection, reputation systems, and personal organization while introducing new modalities of social communication and opportunities for data mining. This potential is largely due to the social structure that underlies many of the current systems. Despite the rapid expansion of applications that support tagging of resources, tagging systems are still not well studied or understood. In this paper, we provide a short description of the academic related work to date. We offer a model of tagging systems, specifically in the context of web-based systems, to help us illustrate the possible benefits of these tools. Since many such systems already exist, we provide a taxonomy of tagging systems to help inform their analysis and design, and thus enable researchers to frame and compare evidence for the sustainability of such systems. We also provide a simple taxonomy of incentives and contribution models to inform potential evaluative frameworks. While this work does not present comprehensive empirical results, we present a preliminary study of the photo-sharing and tagging system Flickr to demonstrate our model and explore some of the issues in one sample system. This analysis helps us outline and motivate possible future directions of research in tagging systems.

993 citations

Journal Article•DOI•

Orthogonal Laplacianfaces for Face Recognition

Deng Cai¹, Xiaofei He², Jiawei Han¹, Hong-Jiang Zhang³•Institutions (3)

University of Illinois at Urbana–Champaign¹, Yahoo!², Microsoft³

01 Nov 2006-IEEE Transactions on Image Processing

TL;DR: An appearance-based face recognition method, called orthogonal Laplacianface, based on the locality preserving projection (LPP) algorithm, which aims at finding a linear approximation to the eigenfunctions of the Laplace Beltrami operator on the face manifold.

Abstract: Following the intuition that the naturally occurring face data may be generated by sampling a probability distribution that has support on or near a submanifold of ambient space, we propose an appearance-based face recognition method, called orthogonal Laplacianface. Our algorithm is based on the locality preserving projection (LPP) algorithm, which aims at finding a linear approximation to the eigenfunctions of the Laplace-Beltrami operator on the face manifold. However, LPP is nonorthogonal, and this makes it difficult to reconstruct the data. The orthogonal locality preserving projection (OLPP) method produces orthogonal basis functions and can have more locality preserving power than LPP. Since the locality preserving power is potentially related to the discriminating power, the OLPP is expected to have more discriminating power than LPP. Experimental results on three face databases demonstrate the effectiveness of our proposed algorithm.

783 citations

Proceedings Article•DOI•

Generating query substitutions

Rosie Jones¹, Benjamin Rey¹, Omid Madani¹, Wiley Greiner•Institutions (1)

Yahoo!¹

23 May 2006

TL;DR: A model for selecting between candidates is built, by using a number of features relating the query-candidate pair, and by fitting the model to human judgments of relevance of query suggestions, which improves the quality of the candidates generated.

Abstract: We introduce the notion of query substitution, that is, generating a new query to replace a user's original search query. Our technique uses modifications based on typical substitutions web searchers make to their queries. In this way the new query is strongly related to the original query, containing terms closely related to all of the original terms. This contrasts with query expansion through pseudo-relevance feedback, which is costly and can lead to query drift. This also contrasts with query relaxation through boolean or TFIDF retrieval, which reduces the specificity of the query. We define a scale for evaluating query substitution, and show that our method performs well at generating new queries related to the original queries. We build a model for selecting between candidates, by using a number of features relating the query-candidate pair, and by fitting the model to human judgments of relevance of query suggestions. This further improves the quality of the candidates generated. Experiments show that our techniques significantly increase coverage and effectiveness in the setting of sponsored search.

707 citations

Proceedings Article•DOI•

Evolutionary clustering

Deepayan Chakrabarti¹, Ravi Kumar¹, Andrew Tomkins¹•Institutions (1)

Yahoo!¹

20 Aug 2006

TL;DR: This work presents a generic framework for clustering data over time, and discusses evolutionary versions of two widely-used clustering algorithms within this framework: k-means and agglomerative hierarchical clustering.

Abstract: We consider the problem of clustering data over time. An evolutionary clustering should simultaneously optimize two potentially conflicting criteria: first, the clustering at any point in time should remain faithful to the current data as much as possible; and second, the clustering should not shift dramatically from one timestep to the next. We present a generic framework for this problem, and discuss evolutionary versions of two widely-used clustering algorithms within this framework: k-means and agglomerative hierarchical clustering. We extensively evaluate these algorithms on real data sets and show that our algorithms can simultaneously attain both high accuracy in capturing today's data, and high fidelity in reflecting yesterday's clustering.

686 citations

Journal Article•DOI•

Effect of iodine intake on thyroid diseases in China.

Weiping Teng¹, Zhongyan Shan, Xiao-chun Teng, Haixia Guan, Yushu Li, Di Teng, Ying Jin, Xiaohui Yu, Chenling Fan, Wei Chong, Fan Yang, Hong Dai, Yang Yu, Jia Li, Yanyan Chen, Dong Zhao, Xiao-guang Shi, Fengnan Hu, Jinyuan Mao, Xiaolan Gu, Rong Yang, Ya-jie Tong, Wei-bo Wang, Tian-shu Gao, Chenyang Li•Institutions (1)

Yahoo!¹

29 Jun 2006-The New England Journal of Medicine

TL;DR: More than adequate or excessive iodine intake may lead to hypothyroidism and autoimmune thyroiditis in cohorts from three regions with different levels of iodine intake.

Abstract: Background: Iodine is an essential component of thyroid hormones; either low or high intake may lead to thyroid disease. We observed an increase in the prevalence of overt hypothyroidism, subclinical hypothyroidism, and autoimmune thyroiditis with increasing iodine intake in China in cohorts from three regions with different levels of iodine intake: mildly deficient (median urinary iodine excretion, 84 μg per liter), more than adequate (median, 243 μg per liter), and excessive (median, 651 μg per liter). Participants enrolled in a baseline study in 1999, and during the five-year follow-up through 2004, we examined the effect of regional differences in iodine intake on the incidence of thyroid disease. Methods: Of the 3761 unselected subjects who were enrolled at baseline, 3018 (80.2 percent) participated in this follow-up study. Levels of thyroid hormones and thyroid autoantibodies in serum, and iodine in urine, were measured and B-mode ultrasonography of the thyroid was performed at baseline and follow-up. Results: Among subjects with mildly deficient iodine intake, those with more than adequate intake, and those with excessive intake, the cumulative incidence of overt hypothyroidism was 0.2 percent, 0.5 percent, and 0.3 percent, respectively; that of subclinical hypothyroidism, 0.2 percent, 2.6 percent, and 2.9 percent, respectively; and that of autoimmune thyroiditis, 0.2 percent, 1.0 percent, and 1.3 percent, respectively. Among subjects with euthyroidism and antithyroid antibodies at baseline, the five-year incidence of elevated serum thyrotropin levels was greater among those with more than adequate or excessive iodine intake than among those with mildly deficient iodine intake. A baseline serum thyrotropin level of 1.0 to 1.9 mIU per liter was associated with the lowest subsequent incidence of abnormal thyroid function. Conclusions: More than adequate or excessive iodine intake may lead to hypothyroidism and autoimmune thyroiditis.

626 citations

Proceedings Article•DOI•

Spectral clustering for multi-type relational data

Bo Long¹, Zhongfei Zhang¹, Xiaoyun Wu², Philip S. Yu³•Institutions (3)

Binghamton University¹, Yahoo!², IBM³

25 Jun 2006

TL;DR: A general model, the collective factorization on related matrices, is proposed for multi-type relational data clustering and a novel algorithm is derived, the spectral relational clustering, to cluster multi-type interrelated data objects simultaneously.

Abstract: Clustering on multi-type relational data has attracted more and more attention in recent years due to its high impact on various important applications, such as Web mining, e-commerce and bioinformatics. However, the research on general multi-type relational data clustering is still limited and preliminary. The contribution of the paper is three-fold. First, we propose a general model, the collective factorization on related matrices, for multi-type relational data clustering. The model is applicable to relational data with various structures. Second, under this model, we derive a novel algorithm, the spectral relational clustering, to cluster multi-type interrelated data objects simultaneously. The algorithm iteratively embeds each type of data objects into low dimensional spaces and benefits from the interactions among the hidden structures of different types of data objects. Extensive experiments demonstrate the promise and effectiveness of the proposed algorithm. Third, we show that the existing spectral clustering algorithms can be considered as the special cases of the proposed model and algorithm. This demonstrates the good theoretic generality of the proposed model and algorithm.

Patent•

Method and system for using smart tags and a recommendation engine using smart tags

Edward Stanley Ott¹, Nathanael Joe Hayashi¹, Matthew Fukuda¹•Institutions (1)

Yahoo!¹

19 Jun 2006

TL;DR: In this paper, a system and method for recommending tags and/or content items in response to requests received from remote computing devices is presented, where the tag density is defined as the number of times a tag has been associated with a content item by any user of a plurality of users who are members of a community.

Abstract: The present invention relates to a system and method for recommending tags and/or content items in response to requests received from remote computing devices. In one aspect, a content item recommendation system comprises a database configured to store an identifier of a first content item, a first tag and information from which a tag density associated with the first tag and with the first content item may be derived. The tag density may be a measure of times a tag has been associated with a content item by any user of a plurality of users who are members of a community. The system also comprises a recommendation engine configured to receive search results containing the first tag from a search engine and to correlate the first tag with information stored in the database. The recommendation engine may be further configured to determine a recommended tag, based on a recommendation threshold and a tag density, the tag density associated with both the recommended tag and the first content item.
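
Note on the abstract above: one way to read "tag density" and the recommendation threshold is sketched below. This is an illustrative interpretation, not the patented system. It assumes the density is a plain count of how often community members attached a tag to a content item, and that recommended tags are the other tags on the same item whose count meets the threshold. The function names and sample events are hypothetical.

```python
from collections import Counter


def tag_densities(events):
    """Count how many times each tag was attached to each content item.

    `events` is an iterable of (user, item, tag) tuples, one per tagging
    action by any community member.
    """
    return Counter((item, tag) for _user, item, tag in events)


def recommend_tags(densities, item, first_tag, threshold):
    """Return other tags on `item` whose density is at least `threshold`,
    ordered by density (highest first), then alphabetically."""
    candidates = [
        (tag, count)
        for (tagged_item, tag), count in densities.items()
        if tagged_item == item and tag != first_tag and count >= threshold
    ]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [tag for tag, _count in candidates]


# Hypothetical tagging events from a small community.
events = [
    ("ann", "photo1", "sunset"),
    ("bob", "photo1", "sunset"),
    ("bob", "photo1", "beach"),
    ("cat", "photo1", "beach"),
    ("dan", "photo1", "beach"),
    ("ann", "photo1", "blurry"),
    ("ann", "photo2", "city"),
]

if __name__ == "__main__":
    densities = tag_densities(events)
    print(recommend_tags(densities, "photo1", "sunset", threshold=2))
```

With a threshold of 2, a search that matched "photo1" through the tag "sunset" would be offered "beach" (density 3) but not "blurry" (density 1).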

Patent•

Apparatus and method for content annotation and conditional annotation retrieval in a search context

Ramesh Sarukkai¹•Institutions (1)

Yahoo!¹

26 Jun 2006

TL;DR: In this article, a trust network is defined for each user, and annotations by any member of the user's trust network are made visible to the user during search and/or browsing of the corpus if the querying user and trust network members use similar queries to identify documents in the corpus.

Abstract: Computer systems and methods incorporate user annotations (metadata) regarding various pages or sites, including annotations by a querying user and by members of a trust network defined for the querying user into search and browsing of a corpus such as the World Wide Web. A trust network is defined for each user, and annotations by any member of the querying user's trust network are made visible to the querying user during search and/or browsing of the corpus if the querying user and trust network members use similar queries to identify documents in the corpus. Users can also limit searches to content annotated by members of their trust networks or by members of a community selected by the user.

Journal Article•DOI•

The validity of the hospital anxiety and depression scale and the geriatric depression scale in Parkinson's disease.

Federica Mondolo¹, Marjan Jahanshahi, Alessia Granà, Emanuele Biasutti, Emanuela Cacciatori, Paolo Di Benedetto•Institutions (1)

Yahoo!¹

01 Jan 2006-Behavioural Neurology

TL;DR: The results indicate the validity of using the HADS and the GDS to screen for depressive symptoms and to diagnose depressive illness in PD.

Abstract: We assessed the concurrent validity of the Hospital Anxiety and Depression Scale (HADS) and the Geriatric Depression Scale (GDS) against the Hamilton Rating Scale for Depression (Ham-D) in patients with Parkinson's disease (PD). Forty-six non-demented PD patients were assessed by a neurologist on the Ham-D. Patients also completed four mood rating scales: the HADS, the GDS, the VAS and the Face Scale. For the HADS and the GDS, Receiver Operating Characteristics (ROC) curves were obtained and the positive and negative predictive values (PPV, NPV) were calculated for different cut-off scores. Maximum discrimination between depressed and non-depressed PD patients was reached at a cut-off score of 10/11 for both the HADS and the GDS. At the same cut-off score of 10/11 for both the HADS and the GDS, the high sensitivity and NPV make these scales appropriate screening instruments for depression in PD. A high specificity and PPV, which is necessary for a diagnostic test, was reached at a cut-off score of 12/13 for the GDS and at a cut-off score of 11/12 for the HADS. The results indicate the validity of using the HADS and the GDS to screen for depressive symptoms and to diagnose depressive illness in PD.
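
Note on the abstract above: the quantities it reports for each cut-off score (sensitivity, specificity, PPV, NPV) follow from a 2x2 table of scale result against reference diagnosis. The sketch below shows the arithmetic only. The scores and diagnoses are invented for illustration and are not the study's data, and it assumes that a cut-off written "10/11" means scores of 11 or more count as positive.

```python
def screening_metrics(scores, depressed, cutoff):
    """Compute sensitivity, specificity, PPV and NPV at one cut-off.

    A score >= cutoff counts as a positive screen; `depressed` holds the
    reference diagnosis (True/False) for each patient.
    """
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, depressed):
        positive = score >= cutoff
        if positive and truth:
            tp += 1
        elif positive and not truth:
            fp += 1
        elif not positive and truth:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined when the denominator is empty (e.g. no positive screens).
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical scale scores and reference diagnoses for ten patients.
scores = [4, 7, 9, 10, 11, 11, 12, 13, 15, 18]
depressed = [False, False, False, True, False, True, False, True, True, True]

if __name__ == "__main__":
    for cutoff in (11, 13):  # i.e. cut-offs written 10/11 and 12/13
        print(cutoff, screening_metrics(scores, depressed, cutoff))
```

On this toy data the lower cut-off gives the higher sensitivity (0.8 against 0.6) and the higher cut-off gives perfect specificity and PPV, which mirrors the screening versus diagnostic trade-off the abstract describes.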

Journal Article•DOI•

Building Support Vector Machines with Reduced Classifier Complexity

S. Sathiya Keerthi¹, Olivier Chapelle², Dennis DeCoste•Institutions (2)

Yahoo!¹, Max Planck Society²

01 Dec 2006-Journal of Machine Learning Research

TL;DR: A primal method that decouples the idea of basis functions from the concept of support vectors and greedily finds a set of kernel basis functions of a specified maximum size to approximate the SVM primal cost function well.

Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size ($d_{\max}$) to approximate the SVM primal cost function well; (3) it is efficient and roughly scales as $O(n d_{\max}^2)$ where $n$ is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors.

Patent•

Multimedia sharing in social networks for mobile devices

Nathanael Joe Hayashi¹, Edward Stanley Ott, Audrey Y. Tsang¹, Matthew Fukuda¹, Daniel James Wascovich¹, Michael Quoc¹•Institutions (1)

Yahoo!¹

28 Apr 2006

TL;DR: A mobile device, system, and method are directed towards sharing multimedia information on a mobile device based at least in part on vitality information and other social networking information; multimedia information captured on the mobile device may be manually and/or automatically annotated and shared with members of the social network.

Abstract: A mobile device, system, and method are directed towards sharing multimedia information on a mobile device based at least in part on vitality information and other social networking information. Multimedia information may be received and/or synchronized on the mobile device based on a relationship between vitality information of members of a social network. The relationship may comprise a common membership in a group, a common multimedia usage behavior, a geographical proximity of members of the social network, a degree of separation of members of the social network, a common search behavior, or the like. Multimedia information captured on the mobile device may be manually and/or automatically annotated and shared with members of the social network. The multimedia information may be displayed in an integrated live view in conjunction with other social networking information.

Patent•

Searching and route mapping based on a social network, location, and time

Paul Robert Birnie¹, Murray Blake Fortescue¹•Institutions (1)

Yahoo!¹

01 Nov 2006

TL;DR: In this paper, a GPS coordinate and a search criterion are received from a client device associated with a member of a social network, and a route is determined between the start and end location and through the location of interest.

Abstract: A device, system, and method are directed towards providing location information from a social network. A GPS coordinate and a search criterion are received from a client device associated with a member of a social network. The social network is searched for another member associated with a location name based on the GPS coordinate and the search criterion. The location name may be a sponsored advertisement. The location name is provided to the client device. A communication may be enabled between the member and the other member. Moreover, a start and end location may also be received. The GPS coordinate and/or search criterion may be associated with either the start or end location. The searched location name is used to determine a location of interest. A route is determined between the start and end location and through the location of interest. The route is provided to the client device.

Proceedings Article•DOI•

Visualizing tags over time

Micah Dubinko¹, Ravi Kumar¹, Joseph Magnani¹, Jasmine Novak¹, Prabhakar Raghavan¹, Andrew Tomkins¹•Institutions (1)

Yahoo!¹

23 May 2006

TL;DR: This work combines a novel solution to an interval covering problem with extensions to previous work on score aggregation in order to create an efficient backend system capable of producing visualizations at arbitrary scales on this large dataset in real time.

Abstract: We consider the problem of visualizing the evolution of tags within the Flickr (flickr.com) online image sharing community. Any user of the Flickr service may append a tag to any photo in the system. Over the past year, users have on average added over a million tags each week. Understanding the evolution of these tags over time is therefore a challenging task. We present a new approach based on a characterization of the most interesting tags associated with a sliding interval of time. An animation provided via Flash in a web browser allows the user to observe and interact with the interesting tags as they evolve over time. New algorithms and data structures are required to support the efficient generation of this visualization. We combine a novel solution to an interval covering problem with extensions to previous work on score aggregation in order to create an efficient backend system capable of producing visualizations at arbitrary scales on this large dataset in real time.

Patent•

Media object metadata association and ranking

Daniel Stewart Butterfield¹, Eric Costello¹, Caterina Fake¹, Callum James Henderson-Begg¹, Serguei Mourachov¹, Joshua Schachter¹•Institutions (1)

Yahoo!¹

20 Apr 2006

TL;DR: In this paper, metadata may be in the form of tags, comments, annotations or favorites, and the media objects may be searched according to metadata, and ranked in a variety of ways.

Abstract: Metadata may be associated with media objects by providing media objects for display, and accepting input concerning the media objects, where the input may include at least two different types of metadata. For example, metadata may be in the form of tags, comments, annotations or favorites. The media objects may be searched according to metadata, and ranked in a variety of ways.

Proceedings Article•DOI•

Generating summaries and visualization for large collections of geo-referenced photographs

Alexandar Jaffe¹, Mor Naaman¹, Tamir Tassa², Marc Davis¹•Institutions (2)

Yahoo!¹, Open University of Israel²

26 Oct 2006

TL;DR: A framework for automatically selecting a summary set of photos from a large collection of geo-referenced photographs, based on spatial patterns in photo sets, as well as textual-topical patterns and user (photographer) identity cues, which can be expanded to support social, temporal, and other factors.

Abstract: We describe a framework for automatically selecting a summary set of photos from a large collection of geo-referenced photographs. Such large collections are inherently difficult to browse, and become excessively so as they grow in size, making summaries an important tool in rendering these collections accessible. Our summary algorithm is based on spatial patterns in photo sets, as well as textual-topical patterns and user (photographer) identity cues. The algorithm can be expanded to support social, temporal, and other factors. The summary can thus be biased by the content of the query, the user making the query, and the context in which the query is made. A modified version of our summarization algorithm serves as a basis for a new map-based visualization of large collections of geo-referenced photos, called Tag Maps. Tag Maps visualize the data by placing highly representative textual tags on relevant map locations in the viewed region, effectively providing a sense of the important concepts embodied in the collection. An initial evaluation of our implementation on a set of geo-referenced photos shows that our algorithm and visualization perform well, producing summaries and views that are highly rated by users.

Patent•

Interestingness ranking of media objects

Daniel Stewart Butterfield¹, Caterina Fake¹, Callum James Henderson-Begg¹, Serguei Mourachov¹•Institutions (1)

Yahoo!¹

08 Feb 2006

TL;DR: A new class of metrics known as "interestingness" is proposed for ranking media objects, based at least in part on the quantity of user-entered metadata concerning the media object.

Abstract: Media objects, such as images or soundtracks, may be ranked according to a new class of metrics known as “interestingness.” These rankings may be based at least in part on the quantity of user-entered metadata concerning the media object, the number of users who have assigned metadata to the media object, access patterns related to the media object, and/or a lapse of time related to the media object.

Patent•

Prefetching content based on a mobile user profile

Jason Morse¹, Jonathan Pantera Grubb¹•Institutions (1)

Yahoo!¹

07 Jun 2006

TL;DR: A system and method are directed towards prefetching content for a mobile terminal based on characteristics of, and tracked usage of, the mobile terminal to request content through an online portal service, which provides access to content in multiple subject areas.

Abstract: A system and method are directed towards prefetching content for a mobile terminal based on characteristics of, and tracked usage of, the mobile terminal to request content through an online portal service, which provides access to content in multiple subject areas. A mobile user profile is created from the characteristics and patterns of the tracked usage. The tracked usage information includes the time, location, and frequency at which the content was requested. Based on the mobile user profile information, content similar to previously requested content is prefetched and cached in anticipation of the mobile terminal making a similar request. Prefetching may also occur based on a trigger event such as the mobile terminal returning to a location from which certain content was previously requested. Prefetching may further be based on a related general user profile that indicates usage of an alternate electronic device to access content through the portal.

Book Chapter•DOI•

Subset ranking using regression

David Cossock¹, Tong Zhang¹•Institutions (1)

Yahoo!¹

22 Jun 2006

TL;DR: In this article, the authors consider the problem of subset ranking, motivated by its important application in web search and present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors.

Abstract: We study the subset ranking problem, motivated by its important application in web-search. In this context, we consider the standard DCG criterion (discounted cumulated gain) that measures the quality of items near the top of the rank-list. Similar to error minimization for binary classification, the DCG criterion leads to a non-convex optimization problem that can be NP-hard. Therefore a computationally more tractable approach is needed. We present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors. These bounds justify the use of convex learning formulations for solving the subset ranking problem. The resulting estimation methods are not conventional, in that we focus on the estimation quality in the top-portion of the rank-list. We further investigate the generalization ability of these formulations. Under appropriate conditions, the consistency of the estimation schemes with respect to the DCG metric can be derived.
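
For readers unfamiliar with the DCG criterion mentioned in this abstract, the following is a minimal Python sketch of how it rewards relevant items near the top of a ranked list. It assumes the commonly used form with gain equal to the relevance grade and a 1/log2(position + 1) discount; the chapter itself may use different gain and discount factors. The function names dcg and rank_by_score are hypothetical and chosen only for illustration.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulated gain of a ranked list of relevance grades.

    Assumption: gain = grade, discount = 1 / log2(position + 1),
    with positions starting at 1. Only the top k items count if k is given.
    """
    top = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(top, start=1))


def rank_by_score(scores, relevances):
    """Order the relevance grades by descending predicted score.

    This mirrors the regression approach only loosely: a model predicts a
    score per item, and the items are ranked by that score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [relevances[i] for i in order]


# Placing the most relevant item first yields a higher DCG than placing it last.
good = dcg([3, 2, 0])
bad = dcg([0, 2, 3])
```

Because the discount shrinks with position, errors in the predicted scores of top-ranked items cost more DCG than errors further down, which is consistent with the abstract's focus on estimation quality in the top portion of the rank-list.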

Proceedings Article•DOI•

Large scale semi-supervised linear SVMs

Vikas Sindhwani¹, S. Sathiya Keerthi²•Institutions (2)

University of Chicago¹, Yahoo!²

06 Aug 2006

TL;DR: An implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets, and a variant of TSVM that involves multiple switching of labels.

Abstract: Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially-labeled sparse datasets with a possibly very large number of examples and features. At their core, our algorithms employ recently developed modified finite Newton techniques. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a Deterministic Annealing (DA) approach. This algorithm alleviates the problem of local minima in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks which confirms the value of our methods in large scale semi-supervised settings.

Patent•

System and Method for Creating and Providing a User Interface for Displaying Advertiser Defined Groups of Advertisement Campaign Information

Robert J. Collins¹, Paul Joseph Apodaca¹, Adam J. Wand, Claude Jones¹•Institutions (1)

Yahoo!¹

28 Apr 2006

TL;DR: A system and method for creating and providing a user interface for optimizing advertiser defined groups of advertisement campaign information is disclosed, in which campaign information is modified based on forecasting information to optimize the performance of one or more ad groups.

Abstract: A system and method for creating and providing a user interface for optimizing advertiser defined groups of advertisement campaign information is disclosed. Generally, advertisement campaign information is organized into one or more ad groups. An ad group typically includes advertisements and parameters for advertisements that are to be handled by an advertisement campaign management system in a similar manner. Forecasting information is obtained relating to at least a portion of one of the one or more ad groups. At least a portion of the advertisement campaign information is then modified based at least in part on the forecasting information to optimize performance of at least one of the one or more ad groups.

Proceedings Article•DOI•

Communities from seed sets

Reid Andersen¹, Kevin J. Lang²•Institutions (2)

University of California, San Diego¹, Yahoo!²

23 May 2006

TL;DR: This work shows how to adapt recent results from theoretical computer science to expand a seed set into a community with small conductance and a strong relationship to the seed, while examining only a small neighborhood of the entire graph.

Abstract: Expanding a seed set into a larger community is a common procedure in link-based analysis. We show how to adapt recent results from theoretical computer science to expand a seed set into a community with small conductance and a strong relationship to the seed, while examining only a small neighborhood of the entire graph. We extend existing results to give theoretical guarantees that apply to a variety of seed sets from specified communities. We also describe simple and flexible heuristics for applying these methods in practice, and present early experiments showing that these methods compare favorably with existing approaches.
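
To make "small conductance" concrete, here is a minimal Python sketch that computes the conductance of a candidate community in an undirected graph. It assumes the standard definition, the number of edges leaving the set divided by the smaller of the two volumes (sums of degrees) on either side of the cut. It does not implement the paper's seed-expansion procedure, and the function name conductance and the example graph are hypothetical.

```python
def conductance(adj, community):
    """Conductance of `community` in an undirected graph.

    `adj` maps each node to a list of its neighbors (each edge appears in
    both endpoints' lists). Assumed definition:
        cut(S, V \\ S) / min(vol(S), vol(V \\ S)),
    where vol is the sum of node degrees.
    """
    s = set(community)
    cut = sum(1 for u in s for v in adj[u] if v not in s)
    vol_s = sum(len(adj[u]) for u in s)
    vol_rest = sum(len(adj[u]) for u in adj if u not in s)
    denom = min(vol_s, vol_rest)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has zero volume")
    return cut / denom


# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
graph = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}

tight = conductance(graph, {0, 1, 2})  # one triangle: a single edge leaves the set
loose = conductance(graph, {0})        # a lone node: every incident edge leaves the set
```

In this toy graph the triangle has a much lower conductance than the single node, which matches the intuition that a good community has few edges leaving it relative to its total degree.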

Journal Article•DOI•