Author

Xin Jin

Bio: Xin Jin is an academic researcher from Microsoft. The author has contributed to research in topics: Classifier (UML) & Hierarchical classifier. The author has an h-index of 2 and has co-authored 2 publications receiving 132 citations.

Papers
Patent
Ying Li1, Teresa Mah1, Jie Tong1, Xin Jin1, Saleel Sathe1, Jingyi Xu1 
14 May 2007
TL;DR: A multi-class classifier is developed; one or more webpages with webpage content are received and analyzed with the multi-class classifier, and, in various embodiments, a sensitivity level associated with the webpage content is predicted.
Abstract: Computer-readable media, systems, and methods for sensitive webpage content detection are described. In embodiments, a multi-class classifier is developed and one or more webpages with webpage content are received. In various embodiments, the one or more webpages are analyzed with the multi-class classifier and, in various embodiments, a sensitivity level is predicted that is associated with the webpage content of the one or more webpages. In various other embodiments, the multi-class classifier includes one or more sensitivity categories.

77 citations

Proceedings ArticleDOI
Xin Jin1, Ying Li1, Teresa Mah1, Jie Tong1
12 Aug 2007
TL;DR: This paper takes a webpage classification approach to solve the problem of how to detect whether a publisher webpage contains sensitive content and is appropriate for showing advertisement(s) on it, and designs a unique sensitive content taxonomy.
Abstract: Online advertising has been a popular topic in recent years. In this paper, we address one of the important problems in online advertising, i.e., how to detect whether a publisher webpage contains sensitive content and is appropriate for showing advertisement(s) on it. We take a webpage classification approach to solve this problem. First we design a unique sensitive content taxonomy. Then we adopt an iterative training data collection and classifier building approach to build a hierarchical classifier which can classify webpages into one of the nodes in the sensitive content taxonomy. The experimental results show that, using this approach, we are able to build a unique sensitive content classifier with decent accuracy while requiring only a limited amount of human labeling effort.

56 citations


Cited by
Book
08 Jul 2008
TL;DR: This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems and focuses on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis.
Abstract: An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object. This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. We include material on summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.

7,452 citations

Journal ArticleDOI
TL;DR: A new method for sentiment analysis in Facebook is presented that starts from messages written by users to extract information about the users' sentiment polarity (positive, neutral or negative), as transmitted in the messages they write, to model the users' usual sentiment polarity, and to detect significant emotional changes.

508 citations

Proceedings ArticleDOI
21 Aug 2011
TL;DR: Proposes an intuitive and simple probabilistic model to directly quantify the attribution of different advertising channels, and a bagged logistic regression model that is shown to achieve classification accuracy comparable to a usual logistic regression but a much more stable estimate of individual advertising channel contributions.
Abstract: In digital advertising, attribution is the problem of assigning credit to one or more advertisements for driving the user to desirable actions such as making a purchase. Rather than giving all the credit to the last ad a user sees, multi-touch attribution allows more than one ad to get credit based on their corresponding contributions. Multi-touch attribution is one of the most important problems in digital advertising, especially when multiple media channels, such as search, display, social, mobile and video, are involved. Due to the lack of a statistical framework and a viable modeling approach, true data-driven methodology does not exist today in the industry. While predictive modeling has been thoroughly researched in recent years in the digital advertising domain, the attribution problem focuses more on accurate and stable interpretation of the influence of each user interaction on the final user decision rather than just user classification. Traditional classification models fail to achieve those goals. In this paper, we first propose a bivariate metric: one component measures the variability of the estimate, and the other measures the accuracy of classifying the positive and negative users. We then develop a bagged logistic regression model, which we show achieves classification accuracy comparable to a usual logistic regression, but a much more stable estimate of individual advertising channel contributions. We also propose an intuitive and simple probabilistic model to directly quantify the attribution of different advertising channels. We then apply both the bagged logistic model and the probabilistic model to a real-world data set from a multi-channel advertising campaign for a well-known consumer software and services brand. The two models produce consistent general conclusions and thus offer useful cross-validation.
The results of our attribution models also yield several important insights that have been validated by the advertising team. We have implemented the probabilistic model in the production advertising platform of the first author's company, and plan to implement the bagged logistic regression in the next product release. We believe the availability of such a data-driven multi-touch attribution metric and models is a breakthrough in the digital advertising industry.
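The bagging idea summarized in this abstract can be illustrated with a minimal sketch. This is generic bootstrap bagging over a plain gradient-descent logistic regression, not necessarily the authors' exact procedure. The function names, the three-channel toy data, and all hyperparameters are assumptions made for illustration. The spread of each coefficient across bootstrap fits serves here as a rough proxy for the "variability of the estimate" the abstract mentions.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=500):
    """Fit logistic regression by gradient descent; returns weights, bias first."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))          # predicted conversion probability
        w -= lr * Xb.T @ (p - y) / len(y)          # gradient of the mean log-loss
    return w

def bagged_logistic(X, y, n_bags=25, seed=0):
    """Fit one model per bootstrap resample; return mean and std of the weights."""
    rng = np.random.default_rng(seed)
    n = len(y)
    weights = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)           # sample rows with replacement
        weights.append(fit_logistic(X[idx], y[idx]))
    weights = np.array(weights)
    return weights.mean(axis=0), weights.std(axis=0)

def make_toy_data(n=400, seed=1):
    """Hypothetical data: 0/1 exposure flags for three channels, binary conversion."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, 3)).astype(float)
    logits = -1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]  # third channel has no true effect
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
    return X, y

X, y = make_toy_data()
mean_w, std_w = bagged_logistic(X, y)
print("mean coefficients (bias, ch0, ch1, ch2):", np.round(mean_w, 2))
print("std across bags:", np.round(std_w, 2))
```

Averaging the coefficients over resamples trades a little bias for lower variance, which is the usual motivation for bagging when the goal is a stable per-channel estimate rather than raw classification accuracy.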

132 citations

Proceedings ArticleDOI
26 Oct 2008
TL;DR: Empirical evaluation based on over 9,000 query-ad pairwise judgments confirms that using augmented queries produces highly relevant ads.
Abstract: The business of Web search, a $10 billion industry, relies heavily on sponsored search, whereby a few carefully-selected paid advertisements are displayed alongside algorithmic search results. A key technical challenge in sponsored search is to select ads that are relevant for the user's query. Identifying relevant ads is challenging because queries are usually very short, and because users, consciously or not, choose terms intended to lead to optimal Web search results and not to optimal ads. Furthermore, the ads themselves are short and usually formulated to capture the reader's attention rather than to facilitate query matching. Traditionally, matching of ads to queries employed standard information retrieval techniques using the bag of words approach. Here we propose to go beyond the bag of words, and augment both queries and ads with additional knowledge-rich features. We use Web search results initially returned for the query to create a pool of relevant documents. Classifying these documents with respect to an external taxonomy and identifying salient named entities give rise to two new feature types. Empirical evaluation based on over 9,000 query-ad pairwise judgments confirms that using augmented queries produces highly relevant ads. Our methodology also relaxes the requirement for each ad to explicitly specify the exhaustive list of queries ("bid phrases") that can trigger it.

119 citations

Patent
21 Aug 2008
TL;DR: The invention covers real-time, fast generation of signatures for high volumes of multimedia content segments, based on relevant audio and visual signals, and scalable matching of those signatures against a high-volume database of content-segment signatures.
Abstract: Content-based clustering, recognition, classification and search of high volumes of multimedia data in real-time. The invention is dedicated to real-time fast generation of signatures to high-volume of multimedia content-segments, based on relevant audio and visual signals, and to scalable matching of signatures of high-volume database of content-segments' signatures. The invention can be implemented in any applications which involve large-scale content-based clustering, recognition and classification of multimedia data, such as, content-tracking, video filtering, multimedia taxonomy generation, video fingerprinting, speech-to-text, audio classification, object recognition, video search and any other application requiring content-based signatures generation and matching for large content volumes such as, web and other large-scale databases.

80 citations