
Showing papers by "Yahoo!" published in 2007


Patent•
Farzad Nazem1, Ashvinkumar P. Patel•
22 Jan 2007
TL;DR: In this paper, a custom page server is provided with user preferences organized into templates stored in compact data structures, and the live data used to fill the templates is stored locally on the page server that is handling user requests for custom pages.
Abstract: A custom page server is provided with user preferences organized into templates stored in compact data structures and the live data used to fill the templates stored locally on the page server which is handling user requests for custom pages. One process is executed on the page server for every request. The process is provided a user template for the user making the request, where the user template is either generated from user preferences or retrieved from a cache of recently used user templates. Each user process is provided access to a large region of shared memory which contains all of the live data needed to fill any user template. Typically, the pages served are news pages, giving the user a custom selection of stock quotes, news headlines, sports scores, weather, and the like. With the live data stored in a local, shared memory, any custom page can be built within the page server, eliminating the need to make requests from other servers for portions of the live data. While the shared memory might include RAM (random access memory) and disk storage, in many computer systems, it is faster to store all the live data in RAM.
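
The architecture described is essentially a per-user template cache plus a shared in-memory store of live data. The following minimal Python sketch illustrates that flow under assumed names (LIVE_DATA, TEMPLATE_CACHE, build_page are hypothetical, not taken from the patent):

```python
# Minimal sketch (not the patented implementation): a per-request process
# fills a cached user template from a shared in-memory store of live data.

LIVE_DATA = {            # shared memory region with all live data
    "quote:YHOO": "27.41",
    "weather:94089": "Sunny, 72F",
    "headline:tech": "Example headline",
}

TEMPLATE_CACHE = {}      # recently used user templates, keyed by user id

def get_template(user_id, preferences):
    """Return a cached template or generate one from user preferences."""
    if user_id not in TEMPLATE_CACHE:
        # a template is just an ordered list of live-data keys to render
        TEMPLATE_CACHE[user_id] = [f"{kind}:{arg}" for kind, arg in preferences]
    return TEMPLATE_CACHE[user_id]

def build_page(user_id, preferences):
    """Build the custom page entirely from local shared memory."""
    slots = get_template(user_id, preferences)
    return "\n".join(f"{key} = {LIVE_DATA.get(key, 'n/a')}" for key in slots)

print(build_page("u1", [("quote", "YHOO"), ("weather", "94089")]))
```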

919 citations


Proceedings Article•DOI•
29 Apr 2007
TL;DR: The incentives for annotation in Flickr, a popular web-based photo-sharing system, and ZoneTag, a cameraphone photo capture and annotation tool that uploads images to Flickr are investigated to offer a taxonomy of motivations for annotation along two dimensions (sociality and function).
Abstract: Why do people tag? Users have mostly avoided annotating media such as photos -- both in desktop and mobile environments -- despite the many potential uses for annotations, including recall and retrieval. We investigate the incentives for annotation in Flickr, a popular web-based photo-sharing system, and ZoneTag, a cameraphone photo capture and annotation tool that uploads images to Flickr. In Flickr, annotation (as textual tags) serves both personal and social purposes, increasing incentives for tagging and resulting in a relatively high number of annotations. ZoneTag, in turn, makes it easier to tag cameraphone photos that are uploaded to Flickr by allowing annotation and suggesting relevant tags immediately after capture. A qualitative study of ZoneTag/Flickr users exposed various tagging patterns and emerging motivations for photo annotation. We offer a taxonomy of motivations for annotation in this system along two dimensions (sociality and function), and explore the various factors that people consider when tagging their photos. Our findings suggest implications for the design of digital photo organization and sharing applications, as well as other applications that incorporate user-based annotation.

912 citations


Proceedings Article•DOI•
11 Jun 2007
TL;DR: A Merge phase is added to Map-Reduce that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules, and it is demonstrated that this new model can express relational algebra operators as well as implement several join algorithms.
Abstract: Map-Reduce is a programming model that enables easy development of scalable parallel applications to process a vast amount of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning. However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins. We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
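
As an illustration of the Merge idea, the toy Python sketch below runs two independent map/reduce passes and then joins their key-partitioned outputs in a merge step. It is a single-process simplification for intuition, not the paper's distributed implementation:

```python
# Toy sketch of the Map-Reduce-Merge pattern: two datasets are mapped and
# reduced independently, then a Merge step joins the reduced outputs by key.
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

employees = [("alice", "sales"), ("bob", "sales"), ("carol", "eng")]
bonuses   = [("sales", 100), ("eng", 250), ("sales", 50)]

# per-department headcount and per-department total bonus
emp_out   = map_reduce(employees, lambda r: [(r[1], 1)],    lambda k, vs: sum(vs))
bonus_out = map_reduce(bonuses,   lambda r: [(r[0], r[1])], lambda k, vs: sum(vs))

def merge(left, right, merge_fn):
    """Merge two reduced outputs that share a key space (an equi-join)."""
    return {k: merge_fn(left[k], right[k]) for k in left.keys() & right.keys()}

# average bonus per employee in each department
print(merge(emp_out, bonus_out, lambda count, total: total / count))
```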

821 citations


Proceedings Article•DOI•
26 Dec 2007
TL;DR: This paper proposes a novel method, called Semi-supervised Discriminant Analysis (SDA), which makes use of both labeled and unlabeled samples to learn a discriminant function which is as smooth as possible on the data manifold.
Abstract: Linear Discriminant Analysis (LDA) has been a popular method for extracting features which preserve class separability. The projection vectors are commonly obtained by maximizing the between-class covariance and simultaneously minimizing the within-class covariance. In practice, when there are not sufficient training samples, the covariance matrix of each class may not be accurately estimated. In this paper, we propose a novel method, called Semi-supervised Discriminant Analysis (SDA), which makes use of both labeled and unlabeled samples. The labeled data points are used to maximize the separability between different classes and the unlabeled data points are used to estimate the intrinsic geometric structure of the data. Specifically, we aim to learn a discriminant function which is as smooth as possible on the data manifold. Experimental results on single training image face recognition and relevance feedback image retrieval demonstrate the effectiveness of our algorithm.
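
A hedged sketch of the core computation, reconstructed from the abstract rather than from the authors' code: the between-class scatter of the labeled points is maximized while a graph Laplacian built over labeled plus unlabeled points penalizes projections that are not smooth on the data manifold, yielding a generalized eigenproblem.

```python
# Sketch of the SDA idea (assumed formulation, not the authors' implementation).
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def sda(X_lab, y_lab, X_unlab, alpha=1.0, k=5, dim=1):
    X_all = np.vstack([X_lab, X_unlab])                # d-dimensional rows
    W = kneighbors_graph(X_all, k, mode='connectivity', include_self=False)
    W = 0.5 * (W + W.T).toarray()                      # symmetrize the kNN graph
    L = np.diag(W.sum(axis=1)) - W                     # graph Laplacian

    mean = X_lab.mean(axis=0)
    Sb = np.zeros((X_lab.shape[1],) * 2)               # between-class scatter
    St = np.cov(X_lab.T, bias=True) * len(X_lab)       # total scatter
    for c in np.unique(y_lab):
        Xc = X_lab[y_lab == c]
        d = Xc.mean(axis=0) - mean
        Sb += len(Xc) * np.outer(d, d)

    reg = St + alpha * X_all.T @ L @ X_all             # manifold smoothness term
    vals, vecs = eigh(Sb, reg + 1e-6 * np.eye(len(reg)))
    return vecs[:, np.argsort(vals)[::-1][:dim]]       # top projection vectors

X_lab = np.random.rand(20, 10)
y_lab = np.array([0] * 10 + [1] * 10)
X_unlab = np.random.rand(80, 10)
P = sda(X_lab, y_lab, X_unlab, dim=2)                  # project with X @ P
```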

730 citations


Journal Article•DOI•
TL;DR: The authors performed an exploratory analysis of the determinants of prices in online auctions for collectible United States one-cent coins at the eBay Web site and found that negative feedback ratings have a much greater effect than positive feedback ratings do.
Abstract: This paper presents an exploratory analysis of the determinants of prices in online auctions for collectible United States one-cent coins at the eBay Web site. Starting with an initial data set of 20,000 auctions, we perform regression analysis on a restricted sample of 461 coins for which we obtained estimates of book value. We have three major findings. First, a seller’s feedback ratings, reported by other eBay users, have a measurable effect on her auction prices. Negative feedback ratings have a much greater effect than positive feedback ratings do. Second, minimum bids and reserve prices have positive effects on the final auction price. In particular, minimum bids appear only to have a significant effect when they are binding on a single bidder, as predicted by theory. Third, when a seller chooses to have her auction last for a longer period of days, this significantly increases the auction price on average.

704 citations


Journal Article•DOI•
TL;DR: In this article, the physical demands of modern basketball were assessed by investigating 38 elite under-19-year-old basketball players during competition, and the mean (SD) heart rate during total time was 171 (4) beats/min, with a significant difference (p<0.01) between guards and centres.
Abstract: The physical demands of modern basketball were assessed by investigating 38 elite under-19-year-old basketball players during competition. Computerised time-motion analyses were performed on 18 players of various positions. Heart rate was recorded continuously for all subjects. Blood was sampled before the start of each match, at half time and at full time to determine lactate concentration. Players spent 8.8% (1%), 5.3% (0.8%) and 2.1% (0.3%) of live time in high "specific movements", sprinting and jumping, respectively. Centres spent significantly lower live time competing in high-intensity activities than guards (14.7% (1%) v 17.1% (1.2%); p<0.01) and forwards (16.6% (0.8%); p<0.05). The mean (SD) heart rate during total time was 171 (4) beats/min, with a significant difference (p<0.01) between guards and centres. Mean (SD) plasma lactate concentration was 5.49 (1.24) mmol/l, with concentrations at half time (6.05 (1.27) mmol/l) being significantly (p<0.001) higher than those at full time (4.94 (1.46) mmol/l). The changes to the rules of basketball have slightly increased the cardiac efforts involved during competition. The game intensity may differ according to the playing position, being greatest in guards.

528 citations


Proceedings Article•DOI•
Tye Rattenbury1, Nathaniel Good1, Mor Naaman1•
23 Jul 2007
TL;DR: An approach is presented for extracting the semantics of tags (unstructured text labels assigned to resources on the Web) based on each tag's usage patterns, and the proposed Scale-structure Identification method is shown to outperform existing techniques.
Abstract: We describe an approach for extracting semantics of tags, unstructured text-labels assigned to resources on the Web, based on each tag's usage patterns. In particular, we focus on the problem of extracting place and event semantics for tags that are assigned to photos on Flickr, a popular photo sharing website that supports time and location (latitude/longitude) metadata. We analyze two methods inspired by well-known burst-analysis techniques and one novel method: Scale-structure Identification. We evaluate the methods on a subset of Flickr data, and show that our Scale-structure Identification method outperforms the existing techniques. The approach and methods described in this work can be used in other domains such as geo-annotated web pages, where text terms can be extracted and associated with usage patterns.

527 citations


Journal Article•DOI•
01 Feb 2007-Spine
TL;DR: In this paper, a meta-analysis of the published literature was conducted specifically looking at accuracy and the postoperative methods used for the assessment of pedicle screw placement in the human spine.
Abstract: Study design A meta-analysis of the published literature was conducted specifically looking at accuracy and the postoperative methods used for the assessment of pedicle screw placement in the human spine. Objectives This study specifically aimed to identify postoperative methods used for pedicle screw placement assessment, including the most common method, and to report cumulative pedicle screw placement study statistics from synthesis of the published literature. Summary of background data Safety concerns have driven specific interests in the accuracy and precision of pedicle screw placement. A large variation in reported accuracy may exist partly due to the lack of a standardized evaluation method and/or the lack of consensus to what, or in which range, is pedicle screw placement accuracy considered satisfactory. Methods A MEDLINE search was executed covering the span from 1966 until 2006, and references from identified papers were reviewed. An extensive database was constructed for synthesis of the identified studies. Subgroups and descriptive statistics were determined based on the type of population, in vivo or cadaveric, and separated based on whether the assistance of navigation was employed. Results In total, we report on 130 studies resulting in 37,337 total pedicle screws implanted, of which 34,107 (91.3%) were identified as accurately placed for the combined in vivo and cadaveric populations. The most common assessment method identified pedicle screw violations simply as either present or absent. Overall, the median placement accuracy for the in vivo assisted navigation subgroup (95.2%) was higher than that of the subgroup without the use of navigation (90.3%). Conclusions Navigation does indeed provide a higher accuracy in the placement of pedicle screws for most of the subgroups presented. However, an exception is found at the thoracic levels for both the in vivo and cadaveric populations, where no advantage in the use of navigation was found.

518 citations


Proceedings Article•
06 Jan 2007
TL;DR: A novel linear algorithm for discriminant analysis, called Locality Sensitive Discriminant Analysis (LSDA), which finds a projection which maximizes the margin between data points from different classes at each local area by discovering the local manifold structure.
Abstract: Linear Discriminant Analysis (LDA) is a popular data-analytic tool for studying the class relationship between data points. A major disadvantage of LDA is that it fails to discover the local geometrical structure of the data manifold. In this paper, we introduce a novel linear algorithm for discriminant analysis, called Locality Sensitive Discriminant Analysis (LSDA). When there are not sufficient training samples, local structure is generally more important than global structure for discriminant analysis. By discovering the local manifold structure, LSDA finds a projection which maximizes the margin between data points from different classes at each local area. Specifically, the data points are mapped into a subspace in which the nearby points with the same label are close to each other while the nearby points with different labels are far apart. Experiments carried out on several standard face databases show a clear improvement over the results of LDA-based recognition.

500 citations


Proceedings Article•
03 Dec 2007
TL;DR: The Epoch-Greedy algorithm is presented, an algorithm for contextual multi-armed bandits (also known as bandits with side information) that is controlled by a sample complexity bound for a hypothesis class.
Abstract: We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T^{2/3} S^{1/3}) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning.
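
A rough sketch of the epoch structure on a toy problem. Hedged: the exploitation length below is a simple stand-in for the paper's sample-complexity-driven schedule, and get_context, reward_fn and learn_fn are illustrative assumptions.

```python
# Simplified Epoch-Greedy loop: one exploration step per epoch, then a
# growing number of greedy (exploitation) steps using the learned policy.
import random

def epoch_greedy(get_context, actions, reward_fn, learn_fn, n_epochs=50):
    data, t = [], 0
    policy = lambda x: actions[0]              # arbitrary initial policy
    for epoch in range(1, n_epochs + 1):
        x = get_context(t); a = random.choice(actions)     # uniform exploration
        data.append((x, a, reward_fn(x, a))); t += 1
        policy = learn_fn(data)                # learn from exploration data only
        for _ in range(epoch):                 # stand-in for the s(epoch) schedule
            x = get_context(t); reward_fn(x, policy(x)); t += 1
    return policy

# toy contextual bandit: context is a bit, the matching action pays off
actions = [0, 1]
get_context = lambda t: t % 2
reward_fn = lambda x, a: 1.0 if a == x else 0.0
def learn_fn(data):                            # empirically best action per context
    best = {}
    for x in set(d[0] for d in data):
        totals = {a: sum(r for (c, b, r) in data if c == x and b == a) for a in actions}
        best[x] = max(totals, key=totals.get)
    return lambda x: best.get(x, actions[0])

print(epoch_greedy(get_context, actions, reward_fn, learn_fn)(1))   # should learn action 1
```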

429 citations


Journal Article•DOI•
TL;DR: This paper presents a substantially generalized co-clustering framework wherein any Bregman divergence can be used in the objective function, and various conditional expectation based constraints can be considered based on the statistics that need to be preserved.
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation---every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a variety of problem domains and also describe novel co-clustering applications such as missing value prediction and compression of categorical data matrices.
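
To make the partitional formulation concrete, here is an illustrative Python sketch of co-clustering under the squared Euclidean special case, where the approximation is simply the co-cluster (block) means; it is a simplification for intuition, not the paper's general Bregman/minimum Bregman information meta-algorithm.

```python
# Alternating-minimization co-clustering with squared Euclidean loss.
import numpy as np

def cocluster(Z, k, l, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    rows = rng.integers(k, size=m)             # row-cluster assignments
    cols = rng.integers(l, size=n)             # column-cluster assignments
    for _ in range(n_iter):
        # block means form the approximation; each block is a co-cluster
        M = np.array([[Z[rows == r][:, cols == c].mean() if
                       np.any(rows == r) and np.any(cols == c) else 0.0
                       for c in range(l)] for r in range(k)])
        # reassign each row to the row cluster minimizing squared error
        rows = np.array([np.argmin([((Z[i] - M[r, cols]) ** 2).sum()
                                    for r in range(k)]) for i in range(m)])
        # reassign each column likewise
        cols = np.array([np.argmin([((Z[:, j] - M[rows, c]) ** 2).sum()
                                    for c in range(l)]) for j in range(n)])
    return rows, cols

rows, cols = cocluster(np.random.rand(30, 20), k=3, l=2)
```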

Proceedings Article•DOI•
Lyndon Kennedy1, Mor Naaman1, Shane Ahern1, Rahul Nair1, Tye Rattenbury1 •
29 Sep 2007
TL;DR: A location-tag-vision-based approach to retrieving images of geography-related landmarks and features from the Flickr dataset is demonstrated, suggesting that community-contributed media and annotation can enhance and improve access to multimedia resources - and the understanding of the world.
Abstract: The advent of media-sharing sites like Flickr and YouTube has drastically increased the volume of community-contributed multimedia resources available on the web These collections have a previously unimagined depth and breadth, and have generated new opportunities - and new challenges - to multimedia research How do we analyze, understand and extract patterns from these new collections? How can we use these unstructured, unrestricted community contributions of media (and annotation) to generate "knowledge" As a test case, we study Flickr - a popular photo sharing website Flickr supports photo, time and location metadata, as well as a light-weight annotation model We extract information from this dataset using two different approaches First, we employ a location-driven approach to generate aggregate knowledge in the form of "representative tags" for arbitrary areas in the world Second, we use a tag-driven approach to automatically extract place and event semantics for Flickr tags, based on each tag's metadata patterns With the patterns we extract from tags and metadata, vision algorithms can be employed with greater precision In particular, we demonstrate a location-tag-vision-based approach to retrieving images of geography-related landmarks and features from the Flickr dataset The results suggest that community-contributed media and annotation can enhance and improve our access to multimedia resources - and our understanding of the world

Proceedings Article•DOI•
15 Apr 2007
TL;DR: This work develops techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints in a single query, and experimentally evaluates the performance of the CFD-based methods for inconsistency detection.
Abstract: We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantically related values. For CFDs we provide an inference system analogous to Armstrong's axioms for FDs, as well as consistency analysis. Since CFDs allow data bindings, a large number of individual constraints may hold on a table, complicating detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints in a single query. We experimentally evaluate the performance of our CFD-based methods for inconsistency detection. This not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.
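
The paper's violation detection is done in SQL; purely for illustration, here is a hedged in-memory Python sketch of what checking a single, hypothetical CFD looks like: when country = 'UK', the zip code must determine the city.

```python
# Checking one conditional functional dependency over in-memory tuples.
from collections import defaultdict

rows = [
    {"country": "UK", "zip": "EH4 1DT", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4 1DT", "city": "London"},    # violation
    {"country": "US", "zip": "07974",   "city": "Murray Hill"},
]

def cfd_violations(rows, condition, lhs, rhs):
    """Return rows that violate the CFD (condition, lhs -> rhs)."""
    seen = defaultdict(set)
    for r in rows:
        if all(r[a] == v for a, v in condition.items()):
            seen[r[lhs]].add(r[rhs])
    bad_keys = {k for k, vals in seen.items() if len(vals) > 1}
    return [r for r in rows
            if all(r[a] == v for a, v in condition.items()) and r[lhs] in bad_keys]

print(cfd_violations(rows, condition={"country": "UK"}, lhs="zip", rhs="city"))
```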

Journal Article•DOI•
TL;DR: Operative principles that maximize an impingement-free range of motion include correct combined acetabular and femoral anteversion and an optimal head-neck ratio.
Abstract: Impingement is a cause of poor outcomes of prosthetic hip arthroplasty; it can lead to instability, accelerated wear, and unexplained pain. Impingement is influenced by prosthetic design, component position, biomechanical factors, and patient variables. Evidence linking impingement to dislocation and accelerated wear comes from implant retrieval studies. Operative principles that maximize an impingement-free range of motion include correct combined acetabular and femoral anteversion and an optimal head-neck ratio. Operative techniques for preventing impingement include medialization of the cup to avoid component impingement and restoration of hip offset and length to avoid osseous impingement.

Proceedings Article•DOI•
12 Aug 2007
TL;DR: A novel way to represent queries in a vector space based on a graph derived from the query-click bipartite graph is proposed, showing that it is less sparse than previous results suggested, and that almost all the measures of these graphs follow power laws.
Abstract: In this paper we study a large query log of more than twenty million queries with the goal of extracting the semantic relations that are implicitly captured in the actions of users submitting queries and clicking answers. Previous query log analyses were mostly done with just the queries and not the actions that followed after them. We first propose a novel way to represent queries in a vector space based on a graph derived from the query-click bipartite graph. We then analyze the graph produced by our query log, showing that it is less sparse than previous results suggested, and that almost all the measures of these graphs follow power laws, shedding some light on searching user behavior as well as on the distribution of topics that people want in the Web. The representation we introduce allows us to infer interesting semantic relationships between queries. Second, we provide an experimental analysis on the quality of these relations, showing that most of them are relevant. Finally, we sketch an application that detects multitopical URLs.
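
As a toy illustration of the representation idea, the sketch below treats each query as a vector over the URLs clicked for it, taken from the query-click bipartite graph, and compares queries by cosine similarity. This is a simplification of the paper's graph-based construction, with hypothetical queries and URL ids.

```python
# Query vectors over clicked URLs, compared with cosine similarity.
import math
from collections import Counter, defaultdict

clicks = [("cheap flights", "u1"), ("cheap flights", "u2"),
          ("airline tickets", "u1"), ("airline tickets", "u3"),
          ("python tutorial", "u4")]

vectors = defaultdict(Counter)                 # query -> {url: click count}
for query, url in clicks:
    vectors[query][url] += 1

def cosine(a, b):
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(vectors["cheap flights"], vectors["airline tickets"]))   # share u1
print(cosine(vectors["cheap flights"], vectors["python tutorial"]))   # disjoint
```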

Proceedings Article•DOI•
26 Dec 2007
TL;DR: This paper proposes a novel dimensionality reduction framework, called spectral regression (SR), for efficient regularized subspace learning, which casts the problem of learning the projective functions into a regression framework, which avoids eigen-decomposition of dense matrices.
Abstract: Subspace learning based face recognition methods have attracted considerable interest in recent years, including principal component analysis (PCA), linear discriminant analysis (LDA), locality preserving projection (LPP), neighborhood preserving embedding (NPE) and marginal Fisher analysis (MFA). However, a disadvantage of all these approaches is that their computations involve eigen-decomposition of dense matrices, which is expensive in both time and memory. In this paper, we propose a novel dimensionality reduction framework, called spectral regression (SR), for efficient regularized subspace learning. SR casts the problem of learning the projective functions into a regression framework, which avoids eigen-decomposition of dense matrices. Also, with the regression based framework, different kinds of regularizers can be naturally incorporated into our algorithm, which makes it more flexible. Computational analysis shows that SR has only linear-time complexity, which is a huge speed-up compared to the cubic-time complexity of the ordinary approaches. Experimental results on face recognition demonstrate the effectiveness and efficiency of our method.
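
A minimal, hedged sketch of the two-step recipe in the supervised setting: obtain target responses (here, centered class indicators as a simple stand-in for the graph-derived embedding) and then fit each projective function by regularized least squares, so no dense eigen-decomposition is needed.

```python
# Spectral-regression-style subspace learning (simplified, assumed responses).
import numpy as np

def spectral_regression(X, y, alpha=0.1):
    classes = np.unique(y)
    Y = np.zeros((len(X), len(classes)))       # class-indicator responses
    for j, c in enumerate(classes):
        Y[y == c, j] = 1.0
    Y -= Y.mean(axis=0)                        # center the responses
    # ridge regression: (X^T X + alpha I) W = X^T Y, one column per function
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
    return W                                   # projection matrix, d x (#classes)

X = np.random.rand(100, 20)
y = np.random.randint(0, 3, size=100)
Z = X @ spectral_regression(X, y)              # embedded data
```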

Proceedings Article•DOI•
17 Jun 2007
TL;DR: This paper introduces a regularized subspace learning model that uses a Laplacian penalty to constrain the coefficients to be spatially smooth; existing subspace learning algorithms fit into this model and produce a spatially smooth subspace which is better for image representation than their original versions.
Abstract: Subspace learning based face recognition methods have attracted considerable interest in recent years, including principal component analysis (PCA), linear discriminant analysis (LDA), locality preserving projection (LPP), neighborhood preserving embedding (NPE), marginal Fisher analysis (MFA) and local discriminant embedding (LDE). These methods consider an n1×n2 image as a vector in R^(n1×n2), and the pixels of each image are considered as independent. However, an image represented in the plane is intrinsically a matrix, and pixels spatially close to each other may be correlated. Even though we have n1×n2 pixels per image, this spatial correlation suggests that the real number of degrees of freedom is far less. In this paper, we introduce a regularized subspace learning model using a Laplacian penalty to constrain the coefficients to be spatially smooth. All these existing subspace learning algorithms can fit into this model and produce a spatially smooth subspace which is better for image representation than their original version. Recognition, clustering and retrieval can then be performed in the image subspace. Experimental results on face recognition demonstrate the effectiveness of our method.

Proceedings Article•DOI•
23 Jul 2007
TL;DR: A spam detection system is presented that combines link-based and content-based features and uses the topology of the Web graph by exploiting the link dependencies among Web pages, based on the observation that linked hosts tend to belong to the same class.
Abstract: Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.
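
Method (ii) above, propagating predicted labels to neighboring hosts, can be pictured as smoothing the base classifier's spam scores over the host graph so that a host linked mostly to spam hosts drifts toward the spam label. The sketch below uses a toy graph and toy scores, not the paper's dataset or exact update rule.

```python
# Smooth base classifier scores over the host graph, then threshold.
host_graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
base_score = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}   # P(spam) from classifier

def propagate(scores, graph, damping=0.5, n_iter=10):
    scores = dict(scores)
    for _ in range(n_iter):
        scores = {
            h: (1 - damping) * base_score[h] +
               damping * sum(scores[n] for n in nbrs) / len(nbrs)
            for h, nbrs in graph.items()
        }
    return scores

smoothed = propagate(base_score, host_graph)
labels = {h: ("spam" if s > 0.5 else "non-spam") for h, s in smoothed.items()}
print(labels)
```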

Posted Content•
TL;DR: These two algorithms are the first polynomial time algorithms for such low-rank matrix approximations that come with relative-error guarantees; previously, in some cases, it was not even known whether such matrix decompositions exist.
Abstract: Many data analysis applications deal with large matrices and involve approximating the matrix using a small number of ``components.'' Typically, these components are linear combinations of the rows and columns of the matrix, and are thus difficult to interpret in terms of the original features of the input data. In this paper, we propose and study matrix approximations that are explicitly expressed in terms of a small number of columns and/or rows of the data matrix, and thereby more amenable to interpretation in terms of the original data. Our main algorithmic results are two randomized algorithms which take as input an $m \times n$ matrix $A$ and a rank parameter $k$. In our first algorithm, $C$ is chosen, and we let $A'=CC^+A$, where $C^+$ is the Moore-Penrose generalized inverse of $C$. In our second algorithm $C$, $U$, $R$ are chosen, and we let $A'=CUR$. ($C$ and $R$ are matrices that consist of actual columns and rows, respectively, of $A$, and $U$ is a generalized inverse of their intersection.) For each algorithm, we show that with probability at least $1-\delta$: $$ ||A-A'||_F \leq (1+\epsilon) ||A-A_k||_F, $$ where $A_k$ is the ``best'' rank-$k$ approximation provided by truncating the singular value decomposition (SVD) of $A$. The number of columns of $C$ and rows of $R$ is a low-degree polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$. Our two algorithms are the first polynomial time algorithms for such low-rank matrix approximations that come with relative-error guarantees; previously, in some cases, it was not even known whether such matrix decompositions exist. Both of our algorithms are simple, they take time of the order needed to approximately compute the top $k$ singular vectors of $A$, and they use a novel, intuitive sampling method called ``subspace sampling.''
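
A small numerical illustration of the first decomposition, A' = CC^+A, using numpy. Note that the column sampling here is uniform purely for brevity; the paper's relative-error guarantee relies on its subspace sampling probabilities, which this sketch does not implement.

```python
# Compare the column-based approximation CC^+A with the best rank-k SVD.
import numpy as np

rng = np.random.default_rng(0)
A = (rng.standard_normal((100, 8)) @ rng.standard_normal((8, 60))
     + 0.1 * rng.standard_normal((100, 60)))   # roughly rank-8 matrix plus noise
k, c = 8, 20

cols = rng.choice(A.shape[1], size=c, replace=False)
C = A[:, cols]
A_cx = C @ np.linalg.pinv(C) @ A               # project A onto the span of C's columns

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]              # best rank-k approximation

print(np.linalg.norm(A - A_cx, "fro"), np.linalg.norm(A - A_k, "fro"))
```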

Proceedings Article•DOI•
23 Jul 2007
TL;DR: A system for contextual ad matching based on a combination of semantic and syntactic features is proposed, which will help improve the user experience and reduce the number of irrelevant ads.
Abstract: Contextual advertising or Context Match (CM) refers to the placement of commercial textual advertisements within the content of a generic web page, while Sponsored Search (SS) advertising consists in placing ads on result pages from a web search engine, with ads driven by the originating query. In CM there is usually an intermediary commercial ad-network entity in charge of optimizing the ad selection with the twin goal of increasing revenue (shared between the publisher and the ad-network) and improving the user experience. With these goals in mind it is preferable to have ads relevant to the page content, rather than generic ads. The SS market developed quicker than the CM market, and most textual ads are still characterized by "bid phrases" representing those queries where the advertisers would like to have their ad displayed. Hence, the first technologies for CM have relied on previous solutions for SS, by simply extracting one or more phrases from the given page content, and displaying ads corresponding to searches on these phrases, in a purely syntactic approach. However, due to the vagaries of phrase extraction, and the lack of context, this approach leads to many irrelevant ads. To overcome this problem, we propose a system for contextual ad matching based on a combination of semantic and syntactic features.

Book Chapter•DOI•
13 Jun 2007
TL;DR: The effectiveness of the framework for margin based active learning of linear separators both in the realizable case and in a specific noisy setting related to the Tsybakov small noise condition is analyzed.
Abstract: We present a framework for margin based active learning of linear separators. We instantiate it for a few important cases, some of which have been previously considered in the literature. We analyze the effectiveness of our framework both in the realizable case and in a specific noisy setting related to the Tsybakov small noise condition.
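
A hedged sketch of the margin-based querying loop on synthetic, realizable data, using scikit-learn's LinearSVC as a stand-in learner; the paper's contribution is the analysis of the querying strategy, not this particular solver.

```python
# Query the label of the unlabeled point closest to the current hyperplane.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # a realizable linear concept

labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = set(range(len(X))) - set(labeled)
for _ in range(20):                                # 20 label queries
    clf = LinearSVC(C=1.0).fit(X[labeled], y[labeled])
    margins = np.abs(clf.decision_function(X[sorted(pool)]))
    pick = sorted(pool)[int(np.argmin(margins))]   # most uncertain point
    labeled.append(pick)
    pool.remove(pick)

print("accuracy:", clf.score(X, y))
```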

Proceedings Article•DOI•
23 Jul 2007
TL;DR: It is demonstrated that changes to search engine results can hinder re-finding, and a way to automatically detect repeat searches and predict repeat clicks is provided.
Abstract: People often repeat Web searches, both to find new information on topics they have previously explored and to re-find information they have seen in the past. The query associated with a repeat search may differ from the initial query but can nonetheless lead to clicks on the same results. This paper explores repeat search behavior through the analysis of a one-year Web query log of 114 anonymous users and a separate controlled survey of an additional 119 volunteers. Our study demonstrates that as many as 40% of all queries are re-finding queries. Re-finding appears to be an important behavior for search engines to explicitly support, and we explore how this can be done. We demonstrate that changes to search engine results can hinder re-finding, and provide a way to automatically detect repeat searches and predict repeat clicks.

Proceedings Article•DOI•
Shane Ahern1, Dean Eckles2, Nathaniel Good1, Simon P. King1, Mor Naaman1, Rahul Nair1 •
29 Apr 2007
TL;DR: In this paper, the authors use context-aware cameraphone devices to examine privacy decisions in mobile and online photo sharing and identify relationships between location of photo capture and photo privacy settings.
Abstract: As sharing personal media online becomes easier and more widespread, new privacy concerns emerge - especially when the persistent nature of the media and associated context reveals details about the physical and social context in which the media items were created. In a first-of-its-kind study, we use context-aware cameraphone devices to examine privacy decisions in mobile and online photo sharing. Through data analysis on a corpus of privacy decisions and associated context data from a real-world system, we identify relationships between location of photo capture and photo privacy settings. Our data analysis leads to further questions which we investigate through a set of interviews with 15 users. The interviews reveal common themes in privacy considerations: security, social disclosure, identity and convenience. Finally, we highlight several implications and opportunities for design of media sharing applications, including using past privacy patterns to prevent oversights and errors.

Proceedings Article•DOI•
18 Jun 2007
TL;DR: This work analyzes the tags associated with the geo-referenced Flickr images to generate aggregate knowledge in the form of "representative tags" for arbitrary areas in the world, and uses these tags to create a visualization tool, World Explorer, that can help expose the content of the data, using a map interface to display the derived tags and the original photo items.
Abstract: The availability of map interfaces and location-aware devices makes a growing amount of unstructured, geo-referenced information available on the Web. This type of information can be valuable not only for browsing, finding and making sense of individual items, but also in aggregate form to help understand data trends and features. In particular, over twenty million geo-referenced photos are now available on Flickr, a photo-sharing website - the first major collection of its kind. These photos are often associated with user-entered unstructured text labels (i.e., tags). We show how we analyze the tags associated with the geo-referenced Flickr images to generate aggregate knowledge in the form of "representative tags" for arbitrary areas in the world. We use these tags to create a visualization tool, World Explorer, that can help expose the content of the data, using a map interface to display the derived tags and the original photo items. We perform a qualitative evaluation of World Explorer that outlines the visualization's benefits in browsing this type of content. We provide insights regarding the aggregate versus individual-item requirements in browsing digital geo-referenced material.

Journal Article•DOI•
TL;DR: A novel algorithm is presented that is effectively used for the analysis of admixed populations without having to trace the origin of individuals, and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
Abstract: Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on genomewide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using less than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
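
Conceptually, the selection step can be sketched as scoring each SNP by its squared loadings on the top principal components of the genotype matrix and keeping the highest-scoring markers; the score below is a simplified stand-in for the paper's PCA-correlated criterion, shown on toy genotypes.

```python
# Rank SNPs by their loadings on the top principal components.
import numpy as np

def pca_correlated_snps(G, n_pcs=2, n_snps=50):
    """G: individuals x SNPs genotype matrix (0/1/2 minor-allele counts)."""
    Gc = G - G.mean(axis=0)                       # center each SNP
    U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
    # leverage-style score: squared loadings on the top n_pcs components
    scores = (Vt[:n_pcs] ** 2).sum(axis=0)
    return np.argsort(scores)[::-1][:n_snps]      # indices of selected SNPs

G = np.random.randint(0, 3, size=(200, 1000))     # toy genotypes
print(pca_correlated_snps(G, n_pcs=2, n_snps=10))
```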

Journal Article•DOI•
TL;DR: Two new support vector approaches for ordinal regression are proposed, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales, and guarantee that the thresholds are properly ordered at the optimal solution.
Abstract: In this letter, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The sequential minimal optimization algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on some benchmark and real-world data sets, including applications of ordinal regression to information retrieval, verify the usefulness of these approaches.

Proceedings Article•DOI•
20 Jun 2007
TL;DR: In this paper, a trust region Newton method is applied to maximize the log-likelihood of the logistic regression model, which achieves fast convergence in the end, using only approximate Newton steps in the beginning.
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi-Newton approach for logistic regression. We also compare it with linear SVM implementations.
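
To make the setup concrete, here is a hedged sketch that fits L2-regularized logistic regression with SciPy's trust-region Newton-CG solver via a Hessian-vector product. It mirrors the general approach (trust region plus approximate Newton steps) but is not the authors' implementation.

```python
# Minimize the negative log-likelihood plus L2 penalty with trust-ncg.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_logreg_trust_region(X, y, lam=1.0):
    """X: n x d features; y: labels in {-1, +1}; lam: regularization weight."""
    def f(w):
        z = y * (X @ w)
        return np.logaddexp(0, -z).sum() + 0.5 * lam * w @ w
    def grad(w):
        z = y * (X @ w)
        return -X.T @ (y * expit(-z)) + lam * w
    def hessp(w, v):                       # Hessian-vector product
        z = y * (X @ w)
        d = expit(z) * expit(-z)           # sigma(z) * (1 - sigma(z))
        return X.T @ (d * (X @ v)) + lam * v
    w0 = np.zeros(X.shape[1])
    return minimize(f, w0, jac=grad, hessp=hessp, method="trust-ncg").x

X = np.random.randn(200, 5)
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * np.random.randn(200))
w = fit_logreg_trust_region(X, y)
```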

Patent•
18 Apr 2007
TL;DR: In this article, a context oriented user interface interprets inputs from a mobile user based on vitality information, such as a location of the mobile device, a time of day, an event, information from the mobile user's calendar, past behavior of the user, weather, social networking data, aggregate behaviors, or even information about proximity of a social contact.
Abstract: A system, apparatus, and method are directed to managing contextual based mobile searches. A context-oriented user interface interprets inputs from a mobile user based on vitality information. In one embodiment, the input may be interpreted as a request to perform a context-based search over a network using at least some of the vitality information. Vitality information may include a location of the mobile device, a time of day, an event, information from the mobile user's calendar, past behavior of the mobile user, weather, social networking data, aggregate behaviors, or even information about proximity of a social contact. By employing vitality information to perform a mobile search, better search results and a richer user experience may be provided that includes a sense of community and a sense of presence (e.g., a sense of 'here-ness'). In one embodiment, the mobile user may provide comments to others regarding the search results.

Proceedings Article•
19 Jul 2007
TL;DR: Experimental results are presented showing that incorporating an explicit model of the missing data mechanism can lead to significant improvements in prediction performance on the random sample of ratings.
Abstract: Rating prediction is an important application, and a popular research topic in collaborative filtering. However, both the validity of learning algorithms, and the validity of standard testing procedures rest on the assumption that missing ratings are missing at random (MAR). In this paper we present the results of a user study in which we collect a random sample of ratings from current users of an online radio service. An analysis of the rating data collected in the study shows that the sample of random ratings has markedly different properties than ratings of user-selected songs. When asked to report on their own rating behaviour, a large number of users indicate they believe their opinion of a song does affect whether they choose to rate that song, a violation of the MAR condition. Finally, we present experimental results showing that incorporating an explicit model of the missing data mechanism can lead to significant improvements in prediction performance on the random sample of ratings.