Author

Marcus Fontoura

Other affiliations: Princeton University, Google, IBM ...
Bio: Marcus Fontoura is an academic researcher from Microsoft. The author has contributed to research on topics including Set (abstract data type) and Inverted index. The author has an h-index of 33 and has co-authored 122 publications receiving 3606 citations. Previous affiliations of Marcus Fontoura include Princeton University and Google.


Papers
Proceedings Article
14 Oct 2017
TL;DR: This paper characterizes Microsoft Azure's VM workload, including distributions of the VMs' lifetime, deployment size, and resource consumption, and introduces Resource Central, a system that collects VM telemetry, learns these behaviors offline, and provides predictions online to various resource managers via a general client-side library.
Abstract: Cloud research to date has lacked data on the characteristics of the production virtual machine (VM) workloads of large cloud providers. A thorough understanding of these characteristics can inform the providers' resource management systems, e.g. VM scheduler, power manager, server health manager. In this paper, we first introduce an extensive characterization of Microsoft Azure's VM workload, including distributions of the VMs' lifetime, deployment size, and resource consumption. We then show that certain VM behaviors are fairly consistent over multiple lifetimes, i.e. history is an accurate predictor of future behavior. Based on this observation, we next introduce Resource Central (RC), a system that collects VM telemetry, learns these behaviors offline, and provides predictions online to various resource managers via a general client-side library. As an example of RC's online use, we modify Azure's VM scheduler to leverage predictions in oversubscribing servers (with oversubscribable VM types), while retaining high VM performance. Using real VM traces, we then show that the prediction-informed schedules increase utilization and prevent physical resource exhaustion. We conclude that providers can exploit their workloads' characteristics and machine learning to improve resource management substantially.

479 citations

Proceedings Article
23 Jul 2007
TL;DR: A system for contextual ad matching based on a combination of semantic and syntactic features is proposed, which will help improve the user experience and reduce the number of irrelevant ads.
Abstract: Contextual advertising or Context Match (CM) refers to the placement of commercial textual advertisements within the content of a generic web page, while Sponsored Search (SS) advertising consists in placing ads on result pages from a web search engine, with ads driven by the originating query. In CM there is usually an intermediary commercial ad-network entity in charge of optimizing the ad selection with the twin goal of increasing revenue (shared between the publisher and the ad-network) and improving the user experience. With these goals in mind it is preferable to have ads relevant to the page content, rather than generic ads. The SS market developed quicker than the CM market, and most textual ads are still characterized by "bid phrases" representing those queries where the advertisers would like to have their ad displayed. Hence, the first technologies for CM have relied on previous solutions for SS, by simply extracting one or more phrases from the given page content, and displaying ads corresponding to searches on these phrases, in a purely syntactic approach. However, due to the vagaries of phrase extraction, and the lack of context, this approach leads to many irrelevant ads. To overcome this problem, we propose a system for contextual ad matching based on a combination of semantic and syntactic features.

356 citations

Proceedings Article
23 Jul 2007
TL;DR: This work proposes a methodology for building a practical robust query classification system that can identify thousands of query classes with reasonable accuracy, while dealing in real-time with the query volume of a commercial web search engine.
Abstract: We propose a methodology for building a practical robust query classification system that can identify thousands of query classes with reasonable accuracy, while dealing in real-time with the query volume of a commercial web search engine. We use a blind feedback technique: given a query, we determine its topic by classifying the web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregation account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.

207 citations
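
The blind feedback idea in the abstract above (classify the search results a query retrieves, then infer the query's topic from them) can be sketched as a simple majority vote. This is a minimal illustration, not the authors' system: the function names, the toy index, the keyword-based page classifier, and the majority-vote aggregation are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical toy "search engine": maps a query to result snippets.
TOY_INDEX = {
    "jaguar xk": [
        "jaguar xk coupe engine review",
        "used jaguar xk car dealers",
        "jaguar habitat rainforest cat",
    ],
}


def toy_search(query: str) -> List[str]:
    """Return the result snippets for a query (empty list if unknown)."""
    return TOY_INDEX.get(query, [])


def toy_classify_page(text: str) -> str:
    """Hypothetical page classifier based on keyword overlap."""
    words = set(text.split())
    if words & {"car", "engine", "coupe", "dealers"}:
        return "autos"
    if words & {"habitat", "rainforest", "cat"}:
        return "animals"
    return "other"


def classify_query(
    query: str,
    search: Callable[[str], List[str]],
    classify_page: Callable[[str], str],
    top_k: int = 10,
) -> str:
    """Classify a query by voting over the classes of its top-k results."""
    results = search(query)[:top_k]
    if not results:
        return "unknown"
    votes = Counter(classify_page(page) for page in results)
    # Assumption: a plain majority vote stands in for whatever
    # aggregation a production system would use.
    return votes.most_common(1)[0][0]
```

Because the query itself is never inspected, a short or rare query can still be labeled as long as it retrieves classifiable results, which is the property the abstract emphasizes.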

Journal Article
01 Apr 2005
TL;DR: The TurboXPath path processor is proposed, which accepts a language equivalent to a subset of the for-let-where constructs of XQuery over a single document, and can be extended to provide full XQuery support or used to augment federated database engines for efficient handling of queries over XML data streams produced by external sources.
Abstract: Efficient querying of XML streams will be one of the fundamental features of next-generation information systems. In this paper we propose the TurboXPath path processor, which accepts a language equivalent to a subset of the for-let-where constructs of XQuery over a single document. TurboXPath can be extended to provide full XQuery support or used to augment federated database engines for efficient handling of queries over XML data streams produced by external sources. Internally, TurboXPath uses a tree-shaped path expression with multiple outputs to drive the execution. The result of a query execution is a sequence of tuples of XML fragments matching the output nodes. Based on a streamed execution model, TurboXPath scales up to large documents and has limited memory consumption for increased concurrency. Experimental evaluation of a prototype demonstrates performance gains compared to other state-of-the-art path processors.

172 citations

Patent
Marcus Fontoura, Vanja Josifovski
14 Apr 2003
TL;DR: A system and method for querying a stream of XML data in a single pass using standard XQuery expressions, comprising an expression parser that receives a query and generates a parse tree; a SAX events API that receives the stream of XML data and generates a stream of SAX events; an evaluator that receives the parse tree and the stream of SAX events and buffers the fragments that meet an evaluation criterion; and a tuple constructor that joins fragments to form the tuple results.
Abstract: A system and method for querying a stream of XML data in a single pass using standard XQuery expressions. The system comprises: an expression parser that receives a query and generates a parse tree; a SAX events API that receives the stream of XML data and generates a stream of SAX events; an evaluator that receives the parse tree and stream of SAX events and buffers fragments from the stream of SAX events that meet an evaluation criteria; and a tuple constructor that joins fragments to form a set of tuple results that satisfies the query for the stream of XML data.

133 citations
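
The single-pass pipeline described in the patent abstract above (SAX events in, buffered fragments, joined tuples out) can be illustrated with Python's standard `xml.sax` module. This is only a sketch of the general shape, not the patented method: the class name `PathTupleHandler` is hypothetical, and the expression-parser stage is replaced by fixed path strings rather than parsed XQuery.

```python
import xml.sax
from typing import Dict, List, Tuple


class PathTupleHandler(xml.sax.ContentHandler):
    """Buffer text at the output paths and emit one tuple per context element."""

    def __init__(self, context_path: str, output_paths: List[str]) -> None:
        super().__init__()
        self.context_path = context_path      # element that delimits one tuple
        self.output_paths = output_paths      # paths whose text is buffered
        self.stack: List[str] = []            # names of currently open elements
        self.buffers: Dict[str, List[str]] = {}
        self.tuples: List[Tuple[str, ...]] = []

    def _path(self) -> str:
        return "/" + "/".join(self.stack)

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self._path() == self.context_path:
            # Start fresh buffers for the next tuple.
            self.buffers = {p: [] for p in self.output_paths}

    def characters(self, content):
        path = self._path()
        if path in self.buffers:
            self.buffers[path].append(content)

    def endElement(self, name):
        if self._path() == self.context_path:
            # "Tuple constructor": join the buffered fragments into one tuple.
            self.tuples.append(
                tuple("".join(self.buffers[p]).strip() for p in self.output_paths)
            )
            self.buffers = {}
        self.stack.pop()


def query_stream(xml_bytes: bytes) -> List[Tuple[str, ...]]:
    """Run the handler over a document in one pass and return the tuples."""
    handler = PathTupleHandler(
        "/catalog/book", ["/catalog/book/title", "/catalog/book/price"]
    )
    xml.sax.parseString(xml_bytes, handler)
    return handler.tuples
```

Only the fragments for the element currently being processed are held in memory, which is what keeps a streamed evaluation's footprint small on large documents.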


Cited by
01 Apr 1997
TL;DR: A comprehensive introduction to applied cryptography written with an engineer or computer scientist in mind, emphasizing the knowledge needed to create practical systems that support integrity, confidentiality, or authenticity.
Abstract: The objective of this paper is to give a comprehensive introduction to applied cryptography with an engineer or computer scientist in mind. The emphasis is on the knowledge needed to create practical systems that support integrity, confidentiality, or authenticity. Topics covered include an introduction to the concepts in cryptography, attacks against cryptographic systems, key use and handling, random bit generation, encryption modes, and message authentication codes. Recommendations on algorithms and further reading are given at the end of the paper. This paper should enable the reader to build, understand, and evaluate system descriptions and designs based on the cryptographic components described in the paper.

2,188 citations

Book
16 Feb 2009
TL;DR: This text provides the background and tools needed to evaluate, compare, and modify search engines; its numerous programming exercises make extensive use of Galago, a Java-based open source search engine.
Abstract: KEY BENEFIT: Written by a leader in the field of information retrieval, this text provides the background and tools needed to evaluate, compare and modify search engines. KEY TOPICS: Coverage of the underlying IR and mathematical models reinforces key concepts. Numerous programming exercises make extensive use of Galago, a Java-based open source search engine. MARKET: A valuable tool for search engine and information retrieval professionals.

1,050 citations

Book
27 Jun 2011
TL;DR: A comprehensive overview of research in automatic summarization that also discusses the challenges that remain open, in particular the need for language generation and the deeper semantic understanding of language that future advances in the field would require.
Abstract: It has now been 50 years since the publication of Luhn’s seminal paper on automatic summarization. During these years the practical need for automatic summarization has become increasingly urgent and numerous papers have been published on the topic. As a result, it has become harder to find a single reference that gives an overview of past efforts or a complete view of summarization tasks and necessary system components. This article attempts to fill this void by providing a comprehensive overview of research in summarization, including the more traditional efforts in sentence extraction as well as the most novel recent approaches for determining important content, for domain and genre specific summarization and for evaluation of summarization. We also discuss the challenges that remain open, in particular the need for language generation and deeper semantic understanding of language that would be necessary for future advances in the field. We would like to thank the anonymous reviewers, our students and Noemie Elhadad, Hongyan Jing, Julia Hirschberg, Annie Louis, Smaranda Muresan and Dragomir Radev for their helpful feedback. This paper was supported in part by the U.S. National Science Foundation (NSF) under IIS-05-34871 and CAREER 09-53445. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. Full text available at: http://dx.doi.org/10.1561/1500000015

697 citations