Showing papers by "Andrei Z. Broder" published in 2007


Proceedings Article•DOI•
23 Jul 2007
TL;DR: A system for contextual ad matching based on a combination of semantic and syntactic features is proposed, which will help improve the user experience and reduce the number of irrelevant ads.
Abstract: Contextual advertising or Context Match (CM) refers to the placement of commercial textual advertisements within the content of a generic web page, while Sponsored Search (SS) advertising consists in placing ads on result pages from a web search engine, with ads driven by the originating query. In CM there is usually an intermediary commercial ad-network entity in charge of optimizing the ad selection with the twin goal of increasing revenue (shared between the publisher and the ad-network) and improving the user experience. With these goals in mind it is preferable to have ads relevant to the page content, rather than generic ads. The SS market developed quicker than the CM market, and most textual ads are still characterized by "bid phrases" representing those queries where the advertisers would like to have their ad displayed. Hence, the first technologies for CM have relied on previous solutions for SS, by simply extracting one or more phrases from the given page content, and displaying ads corresponding to searches on these phrases, in a purely syntactic approach. However, due to the vagaries of phrase extraction, and the lack of context, this approach leads to many irrelevant ads. To overcome this problem, we propose a system for contextual ad matching based on a combination of semantic and syntactic features.

356 citations


Book Chapter•DOI•
13 Jun 2007
TL;DR: The effectiveness of the framework for margin based active learning of linear separators both in the realizable case and in a specific noisy setting related to the Tsybakov small noise condition is analyzed.
Abstract: We present a framework for margin based active learning of linear separators. We instantiate it for a few important cases, some of which have been previously considered in the literature. We analyze the effectiveness of our framework both in the realizable case and in a specific noisy setting related to the Tsybakov small noise condition.

351 citations


Proceedings Article•DOI•
Andrei Z. Broder1, Marcus Fontoura1, Evgeniy Gabrilovich1, Amruta Joshi1, Vanja Josifovski1, Tong Zhang1 •
23 Jul 2007
TL;DR: This work proposes a methodology for building a practical robust query classification system that can identify thousands of query classes with reasonable accuracy, while dealing in real-time with the query volume of a commercial web search engine.
Abstract: We propose a methodology for building a practical robust query classification system that can identify thousands of query classes with reasonable accuracy, while dealing in real-time with the query volume of a commercial web search engine. We use a blind feedback technique: given a query, we determine its topic by classifying the web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregation account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.

207 citations
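
The blind feedback idea in the abstract above (determine a query's topic by classifying the search results it retrieves) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the simple majority vote used for aggregation, and the keyword-based `toy_classifier` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Iterable, List


def classify_query(results: Iterable[str],
                   classify_doc: Callable[[str], str],
                   top_k: int = 1) -> List[str]:
    """Label a query with the classes most often assigned to its search results.

    `results` holds the text of the pages retrieved for the query, and
    `classify_doc` is any document classifier. Majority voting is an
    assumed aggregation rule, chosen only to keep the sketch short.
    """
    votes = Counter(classify_doc(text) for text in results)
    return [label for label, _ in votes.most_common(top_k)]


def toy_classifier(text: str) -> str:
    # Hypothetical keyword rule standing in for a trained page classifier.
    return "travel" if "flight" in text.lower() else "other"


if __name__ == "__main__":
    retrieved = ["Cheap flight deals", "Flight tracker", "Local weather"]
    print(classify_query(retrieved, toy_classifier))
```

A rare query has little direct training data, which is why the sketch never inspects the query string itself and relies only on the retrieved pages.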


Book Chapter•DOI•
01 Nov 2007
TL;DR: This study has made a significant impact on research in physics, computer science and mathematics and given birth to new branches of research in different areas of mathematics, most notably graph theory and probability.
Abstract: For barely a decade now the Web graph (the network formed by Web pages and their hyperlinks) has been the focus of scientific study. In that short a time, this study has made a significant impact on research in physics, computer science and mathematics. It has focussed the attention of the scientific community on all the different kinds of networks that have arisen through technology and human activity; some speak of a "new science of networks". It has brought the computational and deductive power of computer science to the study of the complex social networks formed by inter-human relationships. And, it has given birth to new branches of research in different areas of mathematics, most notably graph theory and probability.

149 citations


Proceedings Article•DOI•
Aris Anagnostopoulos1, Andrei Z. Broder1, Evgeniy Gabrilovich1, Vanja Josifovski1, Lance Riedel1 •
06 Nov 2007
TL;DR: Empirical evaluation proves that matching ads on the basis of a carefully selected 5% fraction of the page text sacrifices only 1%-3% in ad relevance, and is competitive with matching based on the entire page content.
Abstract: Contextual Advertising is a type of Web advertising, which, given the URL of a Web page, aims to embed into the page (typically via JavaScript) the most relevant textual ads available. For static pages that are displayed repeatedly, the matching of ads can be based on prior analysis of their entire content; however, ads need to be matched also to new or dynamically created pages that cannot be processed ahead of time. Analyzing the entire body of such pages on-the-fly entails prohibitive communication and latency costs. To solve the three-horned dilemma of either low-relevance or high-latency or high-load, we propose to use text summarization techniques paired with external knowledge (exogenous to the page) to craft short page summaries in real time. Empirical evaluation proves that matching ads on the basis of such summaries does not sacrifice relevance, and is competitive with matching based on the entire page content. Specifically, we found that analyzing a carefully selected 5% fraction of the page text sacrifices only 1%-3% in ad relevance. Furthermore, our summaries are fully compatible with the standard JavaScript mechanisms used for ad placement: they can be produced at ad-display time by simple additions to the usual script, and they only add 500-600 bytes to the usual request.

84 citations


Patent•
20 Jul 2007
TL;DR: In this article, a system and method to facilitate real-time matching of content to advertising information in a network is described, where a request for advertising information is received over a network, the advertising information to be displayed for a user entity in association with content information within a web page requested by the user entity, the request containing the content information, web page identifier, and additional data associated with the web page.
Abstract: A system and method to facilitate real-time matching of content to advertising information in a network are described. A request for advertising information is received over a network, the advertising information to be displayed for a user entity in association with content information within a web page requested by the user entity, the request containing the content information, a web page identifier, and additional data associated with the web page. The content information is further analyzed in real-time to construct a page summary of the web page. The web page identifier and the additional data are further analyzed in real-time to extract at least one keyword relevant to the content information. Finally, the advertising information is determined in real-time based on the page summary and the extracted keywords.

63 citations


Proceedings Article•DOI•
12 Aug 2007
TL;DR: On a real-world dataset consisting of 1/2 billion impressions, it is demonstrated that even with 95% negative events in the training set, the method can effectively discriminate extremely rare events in terms of their click propensity.
Abstract: We consider the problem of estimating occurrence rates of rare events for extremely sparse data, using pre-existing hierarchies to perform inference at multiple resolutions. In particular, we focus on the problem of estimating click rates for (webpage, advertisement) pairs (called impressions) where both the pages and the ads are classified into hierarchies that capture broad contextual information at different levels of granularity. Typically the click rates are low and the coverage of the hierarchies is sparse. To overcome these difficulties we devise a sampling method whereby we analyze a specially chosen sample of pages in the training set, and then estimate click rates using a two-stage model. The first stage imputes the number of (webpage, ad) pairs at all resolutions of the hierarchy to adjust for the sampling bias. The second stage estimates click rates at all resolutions after incorporating correlations among sibling nodes through a tree-structured Markov model. Both models are scalable and suited to large scale data mining applications. On a real-world dataset consisting of 1/2 billion impressions, we demonstrate that even with 95% negative (non-clicked) events in the training set, our method can effectively discriminate extremely rare events in terms of their click propensity.

56 citations


Patent•
13 Dec 2007
TL;DR: In this article, the staleness of a web page is assessed by examining internal date references within the web page and the link status of the hyperlinks of a hyperlinked web page.
Abstract: Systems and methods are herein disclosed for assessing the staleness of a web page. In particular, in one method of the present invention, the staleness of a web page is assessed by examining internal date references within the web page. In another method of the present invention, the staleness of a web page is assessed by examining the meta-data associated with the web page. In a further method of the present invention, the staleness of a hyperlinked web page is determined by examining the link status of the hyperlinks. If the web page has a relatively large number of dead links, it is assessed as being a stale web page. In a still further method of the present invention, the link status of web pages in the neighborhood of the web page being assessed is likewise examined.

29 citations
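
The dead-link method described in the abstract above can be sketched as a simple threshold test. This is only an illustration: the function name, the input format (a mapping from each outgoing URL to whether it still resolves), and the 0.3 threshold are assumptions, since the abstract says only "a relatively large number of dead links".

```python
from typing import Mapping


def is_stale(link_alive: Mapping[str, bool],
             max_dead_fraction: float = 0.3) -> bool:
    """Flag a page as stale when too many of its outgoing links are dead.

    `link_alive` maps each hyperlink on the page to True (resolves) or
    False (dead). The default threshold is an arbitrary illustrative value.
    """
    if not link_alive:
        # A page without hyperlinks gives no link-based evidence either way.
        return False
    dead = sum(1 for alive in link_alive.values() if not alive)
    return dead / len(link_alive) > max_dead_fraction


if __name__ == "__main__":
    links = {
        "http://example.com/a": True,
        "http://example.com/b": False,
        "http://example.com/c": False,
    }
    print(is_stale(links))
```

Checking link status would require network requests in practice; the sketch takes the results as input so that the decision rule stays separate from the crawling.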


Patent•
20 Feb 2007
TL;DR: In this paper, a system for dynamically creating customized advertisements is introduced, where behavior and any demographic information known about web viewers are used to select an advertising template that will be used to create an advertisement.
Abstract: The World Wide Web portion of the Internet is largely supported by advertising. To deliver the most effective advertising, a system for dynamically creating customized advertisements is introduced. The behavior and any demographic information known about web viewers are used to select an advertising template that will be used to create an advertisement. The advertisement template comprises an incomplete advertisement with certain missing information along with identifiers for functions that may be used to complete the advertisement. In one embodiment, the functions may specify how the advertiser associated with the advertisement template may be contacted with the demographic information known about the user in order to fill in the missing portions of the advertisement template. For example, the advertisement may concern flights to Hawaii and the advertiser may fill in the price of a flight to Hawaii based upon being provided with the user's location. The complete advertisement may then be displayed to the user.

28 citations


Patent•
Andrei Z. Broder1, Boris Klots1•
13 Dec 2007
TL;DR: In this paper, an adjustable interject policy is proposed to determine if an interjection should occur after a web viewer has selected an advertisement and before the viewer is directed to the advertiser's designated web site, so the number of web viewers that are subjected to the intermediate web page is reduced.
Abstract: Non-human entities such as automated web crawlers or malicious click-fraud programs can skew the tracking of clicks on web site advertisements. It is desirable to filter out page views caused by automated entities. A web site interjects an intermediate web page after a web viewer selects an advertising link but before the viewer is sent to the advertiser's designated web site. The page allows for a response from the viewer. The system analyzes the viewer's response along with other information using an adjustable testing policy to make a determination as to whether the viewer is a human or not. An adjustable interject policy may determine if an interjection should occur after a web viewer has selected an advertisement and before the viewer is directed to the advertiser's designated web site, so the number of web viewers that are subjected to the intermediate web page is reduced.

27 citations


Patent•
20 Feb 2007
TL;DR: In this paper, a method of automatically creating an Internet web site may be performed by first crawling through the Internet web sites to identify products and services offered by the Internet Web sites.
Abstract: Advertising is used to generate awareness of commercial Internet web sites. To greatly simplify the marketing of a commercial Internet web site, the automatic creation of an advertising campaign would be desirable. A method of automatically creating an Internet web site may be performed by first crawling through the Internet web site to identify products and services offered by the Internet web site. Information about the identified products and services is stored. The system then creates advertisements for the identified products and services. The advertisements may include images, text, a link to the web page where the product or service was found, and keywords associated with the product or service. The automatically created advertisements may then be placed into an advertisement pool for use with advertising supported web sites. The automatic Internet advertisement campaign creation system of the present invention may be used to create free trial advertisement campaigns for potential advertising clients.

Patent•
Sihem Amer Yahia1, Andrei Z. Broder1•
27 Dec 2007
TL;DR: In this article, a reference framework is provided by creating context according to previous activity, bias, or background information of a given reviewer for annotating and ranking user reviews on social review systems with inferred analytics.
Abstract: The present invention is directed towards methods and computer readable media for annotating and ranking user reviews on social review systems with inferred analytics. A reference framework is provided by creating context according to previous activity, bias, or background information of a given reviewer. The method of the present invention comprises receiving a first query identifying a given content item, generating a collection of content items based on one or more identical objective attributes associated with the given content item, identifying one or more subjective attributes associated with a given item in the collection of items, and providing a reference framework to interpret the subjective attributes associated with each item in the collection.

Patent•
20 Jul 2007
TL;DR: In this paper, an advertisement request mechanism for selecting advertisements to serve to a client requesting a primary webpage is presented, in which the advertisement server uses the referrer of the primary webpage, and optionally the content of the primary webpage, to select one or more advertisements.
Abstract: Methods for selecting advertisements to serve to a client requesting a primary webpage are provided. The client displays a referring webpage having a hyperlink to the primary webpage. Upon selection of the hyperlink, the client sends a request to a content server storing the primary webpage, the request including a referrer of the primary webpage comprising a URL address of the referring webpage. The content server sends the primary webpage to the client which includes the referrer and an advertisement request mechanism configured to make an advertisement request to an advertisement server and attach the referrer to the advertisement request. The advertisement server uses the referrer to select one or more advertisements to serve to the client. The referrer may comprise one or more search query terms submitted by the client. The advertisement server may also use the content of the primary webpage to select the one or more advertisements.

Patent•
20 Jul 2007
TL;DR: In this article, a system and method to facilitate the importation of data taxonomies within a network is described, where advertisers access a data storage module within a network-based entity to retrieve content information from one or more content taxonomies stored within the data storage module.
Abstract: A system and method to facilitate importation of data taxonomies within a network are described. Advertiser entities access a data storage module within a network-based entity to retrieve content information from one or more content taxonomies stored within the data storage module. Subsequently, the advertiser entities select advertisements targeted to specific users based on the retrieved content information and further transmit the advertisements to the network-based entity. Furthermore, publisher entities and/or advertiser entities transmit data, such as, for example, associated taxonomy information, to the network-based entity. The entity receives the respective taxonomy information and parses the taxonomy information to extract node information and associated categories related to the received information. Finally, the entity integrates the node information and associated categories into one or more taxonomies stored within the data storage module. Alternatively, the entity maps the node information and associated categories into corresponding nodes within one or more taxonomies stored within the data storage module, and further stores the mapping information into a mapping database within the data storage module.

Patent•
Deepak Agarwal1, Dejan Diklic1, Deepayan Chakrabarti1, Andrei Z. Broder1, Vanja Josifovski1 •
05 Apr 2007
TL;DR: In this article, the authors proposed a system and method for determining an event occurrence rate, where each content item is associated with at least one region in a hierarchical data structure and a scale factor is applied to the first impression volume to generate a second impression volume.
Abstract: Described are a system and method for determining an event occurrence rate. A sample set of content items may be obtained. Each of the content items may be associated with at least one region in a hierarchical data structure. A first impression volume may be determined for the at least one region as a function of a number of impressions registered for the content items associated with the at least one region. A scale factor may be applied to the first impression volume to generate a second impression volume. The scale factor may be selected so that the second impression volume is within a predefined range of a third impression volume. A click-through-rate (CTR) may be estimated as a function of the second impression volume and a number of clicks on the content item.
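
The scaling step in the abstract above can be made concrete with a small sketch. It rests on several assumptions not stated in the source: the second impression volume is taken to be the plain product of the first volume and the scale factor, the "predefined range" is modeled as a relative tolerance around the third (reference) volume, and the CTR is taken to be clicks divided by the scaled volume. All names are hypothetical.

```python
def scaled_ctr(clicks: int,
               sampled_impressions: float,
               scale: float,
               reference_impressions: float,
               tolerance: float = 0.05) -> float:
    """Estimate a click-through rate from a sampled impression volume.

    sampled_impressions   -- first impression volume (from the sample)
    scale                 -- factor correcting for the sampling rate
    reference_impressions -- third impression volume used as a sanity check
    tolerance             -- assumed relative width of the allowed range
    """
    if reference_impressions <= 0:
        raise ValueError("reference_impressions must be positive")
    # Second impression volume: the sample volume after scaling.
    adjusted = sampled_impressions * scale
    if abs(adjusted - reference_impressions) > tolerance * reference_impressions:
        raise ValueError("scaled volume falls outside the allowed range")
    return clicks / adjusted


if __name__ == "__main__":
    # 1,000 sampled impressions scaled by 10 to match 10,000 observed ones.
    print(scaled_ctr(clicks=30, sampled_impressions=1000,
                     scale=10.0, reference_impressions=10000))
```

Dividing clicks by the unscaled sample volume would overstate the rate whenever the sample under-represents non-clicked impressions, which is the bias the scale factor is meant to correct.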

Patent•
20 Jul 2007
TL;DR: In this article, a system and method to facilitate classification and storage of events in a network is described, where an event and associated content information are received from an entity over a network.
Abstract: A system and method to facilitate classification and storage of events in a network are described. An event and associated content information are received from an entity over a network. The content information is further analyzed to determine one or more themes representing subject matter related to the content information. The event is further classified according to the themes into one or more corresponding categories. Finally, the event is stored into one or more corresponding databases of a data storage module according to the one or more corresponding categories.

Patent•
20 Feb 2007
TL;DR: In this article, a system and method to facilitate classification of search queries and selection of associated advertising information over a network is described, where a search query is processed to retrieve a predetermined number of query results and then classified to select one or more categories associated with the query results.
Abstract: A system and method to facilitate classification of search queries and selection of associated advertising information over a network are described. A search query received from a user over a network is processed to retrieve a predetermined number of query results. The predetermined number of query results is further classified to select one or more categories associated with the query results. Finally, advertising information is selected based on the one or more selected categories for further display to the user in connection with the query results.

Patent•
20 Jul 2007
TL;DR: In this article, a system and method to facilitate mapping and storage of data within one or more data taxonomies is described, where content information is received over a network and analyzed to determine at least one theme representing subject matter related to the content information.
Abstract: A system and method to facilitate mapping and storage of data within one or more data taxonomies are described. Content information is received over a network. The content information is further analyzed to determine at least one theme representing subject matter related to the content information. Finally, the content information is stored within respective predetermined categories organized within at least one taxonomy, the predetermined categories being associated with the at least one theme.

Patent•
19 Nov 2007
TL;DR: In this article, a forecasting module identifies a set of candidate webpages on which a digital ad may be displayed and estimates a click through rate associated with the digital ad and a webpage of the set of candidates.
Abstract: Systems and methods for estimating an amount of traffic associated with a digital ad are disclosed. Generally, a forecasting module identifies a set of candidate webpages on which a digital ad may be displayed and estimates a click through rate associated with the digital ad and a webpage of the set of candidate webpages. The forecasting module determines a ranking score associated with the digital ad based on the determined click through rate and a bid price associated with the digital ad. The forecasting module then examines historical data, such as search logs, to determine an estimate of traffic associated with the digital ad with respect to the webpage in response to determining the ranking score of the digital ad exceeds a ranking score associated with another digital ad that was previously displayed on the webpage.

Book•
01 Nov 2007
TL;DR: This work presents a Scalable Multilevel Algorithm for Graph Clustering and Community Structure Detection and a Phrase Recommendation Algorithm Based on Query Stream Mining in Web Search Engines.
Abstract: Modelling and Mining of Networked Information Spaces.- Workshop on Algorithms and Models for the Web Graph.- Expansion and Lack Thereof in Randomly Perturbed Graphs.- Web Structure in 2005.- Local/Global Phenomena in Geometrically Generated Graphs.- Approximating PageRank from In-Degree.- Probabilistic Relation between In-Degree and PageRank.- Communities in Large Networks: Identification and Ranking.- Combating Spamdexing: Incorporating Heuristics in Link-Based Ranking.- Traps and Pitfalls of Topic-Biased PageRank.- A Scalable Multilevel Algorithm for Graph Clustering and Community Structure Detection.- A Phrase Recommendation Algorithm Based on Query Stream Mining in Web Search Engines.- Characterization of Graphs Using Degree Cores.- Web Structure Mining by Isolated Stars.- Representing and Quantifying Rank - Change for the Web Graph.

Book Chapter•DOI•
Andrei Z. Broder1•
02 Apr 2007
TL;DR: This talk identifies two trends, both representing "short-circuits" of the classic IR model: the trend towards context driven Information Supply (IS), that is, the goal of Web IR will widen to include the supply of relevant information from multiple sources without requiring the user to make an explicit query.
Abstract: The classic IR model assumes a human engaged in activity that generates an "information need". This need is verbalized and then expressed as a query to a search engine over a defined corpus. In the past decade, Web search engines have evolved from a first generation based on classic IR algorithms scaled to web size and thus supporting only informational queries, to a second generation supporting navigational queries using web specific information (primarily link analysis), to a third generation enabling transactional and other "semantic" queries based on a variety of technologies aimed to directly satisfy the unexpressed "user intent", thus moving further and further away from the classic model. What is coming next? In this talk, we identify two trends, both representing "short-circuits" of the model: The first is the trend towards context driven Information Supply (IS), that is, the goal of Web IR will widen to include the supply of relevant information from multiple sources without requiring the user to make an explicit query. The information supply concept greatly precedes information retrieval; what is new in the web framework is the ability to supply relevant information specific to a given activity and a given user, while the activity is being performed. Thus the entire verbalization and query-formation phase is eliminated. The second trend is "social search" driven by the fact that the Web has evolved to being simultaneously a huge repository of knowledge and a vast social environment. As such, it is often more effective to ask the members of a given web milieu rather than construct elaborate queries. This short-circuits only the query formulation, but allows information finding activities such as opinion elicitation and discovery of social norms, that are not expressible at all as queries against a fixed corpus.

Patent•
17 Oct 2007
TL;DR: In this article, a system is provided for assessing the effectiveness of online advertising by recording the context in which each advertisement is provided and tracking whether each advertisement resulted in a consumer response.
Abstract: A system is provided for assessing an effectiveness of online advertising by recording the context in which each advertisement is provided and tracking whether each advertisement resulted in a consumer response. When a request for an online advertisement is received, a unique code, which can be utilized to redeem a coupon, is generated and provided with an advertisement. Contextual information associated with providing the online advertisement is recorded for the unique code. Contextual information can include, for example, information about the provided advertisement, information about how the advertisement will be presented, information about the potential viewer for the advertisement, and the like. If the unique code is later utilized to redeem the coupon, the redemption is recorded for the unique code so that an online advertiser can assess the effectiveness of their online advertisements in relation to various contexts in which their advertisements are provided.

Book Chapter•DOI•
01 Nov 2007
TL;DR: In recent years, the emergence of the Web and the dramatic increase in computing, storage and networking capacity has given rise to the concept of networked information spaces, and the prime example is the World Wide Web itself.
Abstract: In recent years, the emergence of the Web and the dramatic increase in computing, storage and networking capacity has given rise to the concept of networked information spaces. The prime example of a networked information space is the World Wide Web itself. The Web, in its pure form, is a set of hypertext documents, with links in one document pointing to another document.

Patent•
Deepak Agarwal1, Dejan Diklic1, Deepayan Chakrabarti1, Andrei Z. Broder1, Vanja Josifovski1 •
05 Apr 2007
TL;DR: In this article, the authors present a system and method for determining an event occurrence rate, where each content item may be associated with at least one region in a hierarchical data structure and a scale factor may be applied to the first impression volume to generate a second impression volume.
Abstract: Described are a system and method for determining an event occurrence rate. A sample set of content items may be obtained. Each of the content items may be associated with at least one region in a hierarchical data structure. A first impression volume may be determined for the at least one region as a function of a number of impressions registered for the content items associated with the at least one region. A scale factor may be applied to the first impression volume to generate a second impression volume. The scale factor may be selected so that the second impression volume is within a predefined range of a third impression volume. A click-through-rate (CTR) may be estimated as a function of the second impression volume and a number of clicks on the content item.