
Showing papers on "Web page" published in 2009




Journal ArticleDOI
TL;DR: DNAPlotter is an interactive Java application for generating circular and linear representations of genomes that filters features of interest to display on separate user-definable tracks.
Abstract: Summary: DNAPlotter is an interactive Java application for generating circular and linear representations of genomes. Making use of the Artemis libraries to provide a user-friendly method of loading in sequence files (EMBL, GenBank, GFF) as well as data from relational databases, it filters features of interest to display on separate user-definable tracks. It can be used to produce publication quality images for papers or web pages. Availability: DNAPlotter is freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/circular/ Contact: artemis@sanger.ac.uk

761 citations


Patent
04 Mar 2009
TL;DR: In this paper, a client component installed in a client device monitors usage of the client device in accordance with a monitoring profile, which typically specifies which features of which application programs are to be disabled on the client device, and generates corresponding usage data.
Abstract: Systems and methods for monitoring usage of an electronic device are disclosed herein. A client component installed in a client device is operative to monitor usage of the client device in accordance with a monitoring profile, and to generate corresponding usage data. The monitoring profile typically includes information specifying which features of which application programs are to be disabled on the client device. A server component, installed on a server device in communication with the client device, provides the monitoring profile to the client device and receives the usage data from the client device.

522 citations


Journal ArticleDOI
TL;DR: As work in Web page classification is reviewed, the importance of these Web-specific features and algorithms is noted, state-of-the-art practices are described, and the underlying assumptions behind the use of information from neighboring pages are tracked.
Abstract: Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
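The survey's point about Web-specific signals can be made concrete with a tiny sketch: a text classifier whose input concatenates a page's own text with anchor text from pages linking to it. This is an illustrative toy (scikit-learn assumed; the data, field names, and labels are invented), not a method from the paper.

```python
# A minimal sketch of Web page classification that augments page text with
# anchor text from neighboring pages, one of the Web-specific signals the
# survey highlights. Field names and data are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    {"text": "latest football scores and league tables",
     "anchor_text": "sports news results", "label": "sports"},
    {"text": "stock markets fell sharply on inflation fears",
     "anchor_text": "business finance markets", "label": "business"},
]

# Concatenate on-page text with anchor text pointing at the page, so the
# classifier sees both the page's own words and its neighbors' descriptions.
X = [p["text"] + " " + p["anchor_text"] for p in pages]
y = [p["label"] for p in pages]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X, y)
print(clf.predict(["quarterly earnings beat market expectations"]))
```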

502 citations


Proceedings ArticleDOI
28 Jun 2009
TL;DR: This work gives formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities, and investigates practical solutions based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters.
Abstract: To take the first step beyond keyword-based search toward entity-based search, suitable token spans ("spots") on documents must be identified as references to real-world entities from an entity catalog. Several systems have been proposed to link spots on Web pages to entities in Wikipedia. They are largely based on local compatibility between the text around the spot and textual metadata associated with the entity. Two recent systems exploit inter-label dependencies, but in limited ways. We propose a general collective disambiguation approach. Our premise is that coherent documents refer to entities from one or a few related topics or domains. We give formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities. Optimizing the overall entity assignment is NP-hard. We investigate practical solutions based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters. In experiments involving over a hundred manually-annotated Web pages and tens of thousands of spots, our approaches significantly outperform recently-proposed algorithms.
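The trade-off described here can be illustrated with a toy hill-climbing sketch over spot-to-entity assignments; the candidate entities, compatibility scores, and relatedness values below are invented, and the scoring is only a simplified stand-in for the paper's formulations.

```python
# Toy sketch of collective entity linking: each spot gets a candidate entity,
# and we hill-climb on a score that trades off local spot-to-entity
# compatibility against global coherence between the chosen entities.
# The compatibility and relatedness numbers are invented for illustration.
import itertools

candidates = {                      # spot -> candidate entities
    "jaguar": ["Jaguar_Cars", "Jaguar_(animal)"],
    "xj6":    ["Jaguar_XJ6", "XJ6_(song)"],
}
local = {                           # (spot, entity) -> textual compatibility
    ("jaguar", "Jaguar_Cars"): 0.4, ("jaguar", "Jaguar_(animal)"): 0.6,
    ("xj6", "Jaguar_XJ6"): 0.7,     ("xj6", "XJ6_(song)"): 0.3,
}
related = {                         # unordered entity pair -> coherence
    frozenset({"Jaguar_Cars", "Jaguar_XJ6"}): 0.9,
}

def score(assignment, lam=0.5):
    loc = sum(local[(s, e)] for s, e in assignment.items())
    glob = sum(related.get(frozenset({a, b}), 0.0)
               for a, b in itertools.combinations(assignment.values(), 2))
    return (1 - lam) * loc + lam * glob

# Start from the locally best entity per spot, then hill-climb one spot at a time.
assignment = {s: max(es, key=lambda e: local[(s, e)]) for s, es in candidates.items()}
improved = True
while improved:
    improved = False
    for spot, entities in candidates.items():
        best = max(entities, key=lambda e: score({**assignment, spot: e}))
        if best != assignment[spot]:
            assignment[spot], improved = best, True
print(assignment)   # expect both spots to resolve to the car-related entities
```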

476 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper leverages the vast amount of multimedia data on the Web, the availability of an Internet image search engine, and advances in object recognition and clustering techniques, to address issues of modeling and recognizing landmarks at world-scale.
Abstract: Modeling and recognizing landmarks at world-scale is a useful yet challenging task. There exists no readily available list of worldwide landmarks. Obtaining reliable visual models for each landmark can also pose problems, and efficiency is another challenge for such a large scale system. This paper leverages the vast amount of multimedia data on the Web, the availability of an Internet image search engine, and advances in object recognition and clustering techniques, to address these issues. First, a comprehensive list of landmarks is mined from two sources: (1) ~20 million GPS-tagged photos and (2) online tour guide Web pages. Candidate images for each landmark are then obtained from photo sharing Websites or by querying an image search engine. Second, landmark visual models are built by pruning candidate images using efficient image matching and unsupervised clustering techniques. Finally, the landmarks and their visual models are validated by checking authorship of their member images. The resulting landmark recognition engine incorporates 5312 landmarks from 1259 cities in 144 countries. The experiments demonstrate that the engine can deliver satisfactory recognition performance with high efficiency.
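As a rough illustration of the clustering step mentioned above, the sketch below groups GPS-tagged photos into candidate landmark clusters with DBSCAN (scikit-learn assumed); the coordinates are made up, and the paper's actual pipeline, which also prunes candidates by image matching, is far more involved.

```python
# Illustrative sketch of one step described above: grouping GPS-tagged photos
# into candidate landmark clusters. DBSCAN is used here for convenience; the
# paper's own clustering and image-matching pipeline is more involved.
import numpy as np
from sklearn.cluster import DBSCAN

# (latitude, longitude) of photos; coordinates are made up.
coords = np.array([
    [48.8583, 2.2945], [48.8584, 2.2946], [48.8582, 2.2944],   # Eiffel Tower
    [41.8902, 12.4922], [41.8903, 12.4923],                     # Colosseum
])

# eps is in degrees here (~100 m); a production system would use a proper
# geodesic metric such as haversine.
labels = DBSCAN(eps=0.001, min_samples=2).fit_predict(coords)
for cluster in set(labels) - {-1}:
    members = coords[labels == cluster]
    print(f"candidate landmark {cluster}: centroid {members.mean(axis=0)}")
```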

355 citations


Journal ArticleDOI
TL;DR: BioMart Central Portal offers a one-stop shop solution to access a wide array of biological databases such as Ensembl, Uniprot, Reactome, HGNC, Wormbase and PRIDE, enabling cross-querying of these data sources in a user-friendly and unified way.
Abstract: BioMart Central Portal (www.biomart.org) offers a one-stop shop solution to access a wide array of biological databases. These include major biomolecular sequence, pathway and annotation databases such as Ensembl, Uniprot, Reactome, HGNC, Wormbase and PRIDE; for a complete list, visit http://www.biomart.org/biomart/martview. Moreover, the web server features seamless data federation, making cross-querying of these data sources possible in a user-friendly and unified way. The web server not only provides access through a web interface (MartView), it also supports programmatic access through a Perl API as well as RESTful and SOAP oriented web services. The website is free and open to all users and there is no login requirement.

352 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: Triplify is implemented as a light-weight software component, which can be easily integrated into and deployed by the numerous, widely installed Web applications and is usable to publish very large datasets, such as 160GB of geo data from the OpenStreetMap project.
Abstract: In this paper we present Triplify - a simplistic but effective approach to publish Linked Data from relational databases. Triplify is based on mapping HTTP-URI requests onto relational database queries. Triplify transforms the resulting relations into RDF statements and publishes the data on the Web in various RDF serializations, in particular as Linked Data. The rationale for developing Triplify is that the largest part of information on the Web is already stored in structured form, often as data contained in relational databases, but usually published by Web applications only as HTML mixing structure, layout and content. In order to reveal the pure structured information behind the current Web, we have implemented Triplify as a light-weight software component, which can be easily integrated into and deployed by the numerous, widely installed Web applications. Our approach includes a method for publishing update logs to enable incremental crawling of linked data sources. Triplify is complemented by a library of configurations for common relational schemata and a REST-enabled data source registry. Triplify configurations containing mappings are provided for many popular Web applications, including osCommerce, WordPress, Drupal, Gallery, and phpBB. We will show that despite its light-weight architecture Triplify is usable to publish very large datasets, such as 160GB of geo data from the OpenStreetMap project.
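The core mapping idea can be sketched in a few lines: a URI path fragment selects a configured SQL query, and each result row is serialized as RDF triples. This is only a schematic imitation of Triplify's approach; the table, vocabulary URIs, and configuration keys are invented.

```python
# Minimal sketch of the Triplify idea: map an HTTP-style request path onto a
# relational query and serialize the result rows as RDF triples (N-Triples).
# Table, column, and vocabulary names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES (1, 'Mug', 7.5), (2, 'T-Shirt', 12.0)")

# Analogue of a Triplify configuration: a URI path fragment mapped to SQL.
config = {"products": "SELECT id, name, price FROM products"}
BASE = "http://example.org/shop/"

def triplify(path):
    triples = []
    for row in conn.execute(config[path]):
        subject = f"<{BASE}{path}/{row[0]}>"
        triples.append(f'{subject} <http://example.org/vocab/name> "{row[1]}" .')
        triples.append(f'{subject} <http://example.org/vocab/price> "{row[2]}" .')
    return "\n".join(triples)

print(triplify("products"))   # N-Triples for a request like /triplify/products
```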

321 citations


Patent
30 Oct 2009
TL;DR: In this paper, the authors proposed a method, apparatus, and program product for converting HTML components of a web page into voice prompts for a user by selectively determining at least one HTML component from a plurality of HTML components, transforming it into parameterized data based upon a voice attribute file, and transmitting the parameterized data to the mobile system.
Abstract: Embodiments of the invention address the deficiencies of the prior art by providing a method, apparatus, and program product for converting components of a web page to voice prompts for a user. In some embodiments, the method comprises selectively determining at least one HTML component from a plurality of HTML components of a web page to transform into a voice prompt for a mobile system based upon a voice attribute file associated with the web page. The method further comprises transforming the at least one HTML component into parameterized data suitable for use by the mobile system based upon at least a portion of the voice attribute file associated with the at least one HTML component and transmitting the parameterized data to the mobile system.

297 citations


Journal ArticleDOI
TL;DR: The paper concludes by stating that the Web has succeeded as a single global information space that has dramatically changed the way people use information, disrupted business models, and led to profound societal change.
Abstract: The paper discusses the semantic Web and Linked Data. The classic World Wide Web is built upon the idea of setting hyperlinks between Web documents. These hyperlinks are the basis for navigating and crawling the Web. Technologically, the core idea of Linked Data is to use HTTP URLs not only to identify Web documents, but also to identify arbitrary real world entities. Data about these entities is represented using the Resource Description Framework (RDF). Whenever a Web client resolves one of these URLs, the corresponding Web server provides an RDF/XML or RDFa description of the identified entity. These descriptions can contain links to entities described by other data sources. The Web of Linked Data can be seen as an additional layer that is tightly interwoven with the classic document Web. The author mentions the application of Linked Data in media, publications, life sciences, geographic data, user-generated content, and cross-domain data sources. The paper concludes by stating that the Web has succeeded as a single global information space that has dramatically changed the way we use information, disrupted business models, and led to profound societal change.
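A minimal sketch of the mechanics described here, assuming the third-party requests and rdflib packages and using the DBpedia URI for Berlin as an example resource (network access required; the endpoint's behavior may change over time):

```python
# Sketch of the core Linked Data mechanics described above: dereference an
# HTTP URI with content negotiation and parse the returned RDF description.
import requests
from rdflib import Graph

uri = "http://dbpedia.org/resource/Berlin"
resp = requests.get(uri, headers={"Accept": "application/rdf+xml"})

g = Graph()
g.parse(data=resp.text, format="xml")

# Print a few (predicate, object) pairs describing the entity, including
# links out to entities described by other data sources.
for i, (s, p, o) in enumerate(g):
    print(p, o)
    if i >= 4:
        break
```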

293 citations


Patent
17 Feb 2009
TL;DR: In this article, the authors present a system and method for remotely securing, accessing, and managing a mobile device or group of mobile devices, which enables a remote access web page to be generated by a server and displayed on a client computer, where the server receives requested actions from the client computer and interacts with the mobile device to perform the actions.
Abstract: The present invention provides a system and method for remotely securing, accessing, and managing a mobile device or group of mobile devices. The invention enables a remote access web page to be generated by a server and displayed on a client computer. The server receives requested actions from the client computer and interacts with the mobile device to perform the actions. In the case of a lost or stolen device, the invention enables a user to take actions leading to the recovery or destruction of the device and data stored on it. The invention enables multiple types of remote access, including: locking the device, backing up data from the device, restoring data to the device, locating the device, playing a sound on the device, and wiping data from the device. The invention may be used to provide both self-help and administrator-assisted security for a device or group of devices.

Proceedings ArticleDOI
20 Apr 2009
TL;DR: This work performs an extensive study of compression techniques for document IDs and presents new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances.
Abstract: Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
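The gap-plus-variable-byte building block that such studies start from can be sketched as follows; this is a generic textbook varint coder, not the optimized codes proposed in the paper, but it shows why clustering similar documents (and hence shrinking docID gaps) helps compression.

```python
# Sketch of a standard building block: after reordering, document IDs in a
# posting list are gap-encoded and compressed with a byte-aligned
# variable-byte (varint) code. Smaller gaps mean shorter codes.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)        # high bit marks the last byte of a value
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers

doc_ids = [5, 8, 9, 12, 40]                       # sorted posting list
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = vbyte_encode(gaps)
decoded_gaps = vbyte_decode(encoded)
restored = [sum(decoded_gaps[:i + 1]) for i in range(len(decoded_gaps))]
assert restored == doc_ids
print(len(encoded), "bytes for", len(doc_ids), "doc IDs")
```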

01 Jan 2009
TL;DR: Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input, is introduced.
Abstract: The World Wide Web contains a significant amount of information expressed using natural language. While unstructured text is often difficult for machines to understand, the field of Information Extraction (IE) offers a way to map textual content into a structured knowledge base. The ability to amass vast quantities of information from Web pages has the potential to increase the power with which a modern search engine can answer complex queries. IE has traditionally focused on acquiring knowledge about particular relationships within a small collection of domain-specific text. Typically, a target relation is provided to the system as input along with extraction patterns or examples that have been specified by hand. Shifting to a new relation requires a person to create new patterns or examples. This manual labor scales linearly with the number of relations of interest. The task of extracting information from the Web presents several challenges for existing IE systems. The Web is large and heterogeneous; the number of potentially interesting relations is massive and their identity often unknown. To enable large-scale knowledge acquisition from the Web, this thesis presents Open Information Extraction, a novel extraction paradigm that automatically discovers thousands of relations from unstructured text and readily scales to the size and diversity of the Web.
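In the spirit of relation-independent extraction, here is a deliberately tiny toy that pulls (argument, relation, argument) tuples from arbitrary sentences using a dependency parse; it assumes spaCy and its small English model and is not the thesis's actual system, which is pattern-learning based and Web-scale.

```python
# A toy, relation-independent extractor in the spirit of Open IE: pull
# (argument1, relation, argument2) tuples without any per-relation patterns.
# Illustration only; requires spaCy and its small English model.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_tuples(text):
    tuples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
            objects = [w for w in token.rights if w.dep_ in ("dobj", "attr")]
            for s in subjects:
                for o in objects:
                    tuples.append((s.text, token.lemma_, o.text))
    return tuples

print(extract_tuples("Edison invented the phonograph. Oracle acquired Sun in 2010."))
# e.g. [('Edison', 'invent', 'phonograph'), ('Oracle', 'acquire', 'Sun')]
```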

Proceedings ArticleDOI
04 Apr 2009
TL;DR: An eye-tracking study is presented in which 20 users viewed 361 Web pages while engaged in information foraging and page recognition tasks, and the concept of fixation impact is introduced, a new method for mapping gaze data to visual scenes that is motivated by findings in vision research.
Abstract: An understanding of how people allocate their visual attention when viewing Web pages is very important for Web authors, interface designers, advertisers and others. Such knowledge opens the door to a variety of innovations, ranging from improved Web page design to the creation of compact, yet recognizable, visual representations of long pages. We present an eye-tracking study in which 20 users viewed 361 Web pages while engaged in information foraging and page recognition tasks. From this data, we describe general location-based characteristics of visual attention for Web pages dependent on different tasks and demographics, and generate a model for predicting the visual attention that individual page elements may receive. Finally, we introduce the concept of fixation impact, a new method for mapping gaze data to visual scenes that is motivated by findings in vision research.

Proceedings ArticleDOI
20 Apr 2009
TL;DR: Evaluations of the method show that it outperforms existing methods producing key terms with higher precision and recall, and appears to be substantially more effective on noisy and multi-theme documents than existing methods.
Abstract: We present a novel method for key term extraction from text documents. In our method, a document is modeled as a graph of semantic relationships between terms of that document. We exploit the following remarkable feature of the graph: the terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs or communities, while non-important terms fall into weakly interconnected communities, or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms. We introduce a criterion function to select groups that contain key terms, discarding groups with unimportant terms. To weight terms and determine semantic relatedness between them we exploit information extracted from Wikipedia. Using such an approach gives us the following two advantages. First, it allows effectively processing multi-theme documents. Second, it is good at filtering out noise information in the document, such as, for example, navigational bars or headers in web pages. Evaluations of the method show that it outperforms existing methods, producing key terms with higher precision and recall. Additional experiments on web pages prove that our method appears to be substantially more effective on noisy and multi-theme documents than existing methods.
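A rough sketch of the pipeline: build a term graph, detect communities, and keep terms from the densest community. It assumes networkx; plain co-occurrence counts stand in for the paper's Wikipedia-based relatedness weights, and the sentences are invented.

```python
# Sketch of the graph-based idea described above: build a term co-occurrence
# graph, detect communities, and keep terms from the densest community as
# key-term candidates. Co-occurrence counts stand in for Wikipedia-based
# semantic relatedness here.
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

sentences = [
    ["web", "page", "classification", "anchor", "text"],
    ["web", "page", "crawler", "anchor", "text"],
    ["recipe", "pasta"],
]

G = nx.Graph()
for terms in sentences:
    for a, b in itertools.combinations(set(terms), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

communities = greedy_modularity_communities(G, weight="weight")

# Rank communities by internal edge density and keep the strongest as key terms.
def density(c):
    sub = G.subgraph(c)
    return sub.size(weight="weight") / max(len(c), 1)

best = max(communities, key=density)
print(sorted(best))
```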

Book
27 Apr 2009
TL;DR: This book argues that it can be useful for social scientists to measure aspects of the web and explains how this can be achieved on both a small and large scale.
Abstract: Webometrics is concerned with measuring aspects of the web: web sites, web pages, parts of web pages, words in web pages, hyperlinks, web search engine results. The importance of the web itself as a communication medium and for hosting an increasingly wide array of documents, from journal articles to holiday brochures, needs no introduction. Given this huge and easily accessible source of information, there are limitless possibilities for measuring or counting on a huge scale (e.g., the number of web sites, the number of web pages, the number of blogs) or on a smaller scale (e.g., the number of web sites in Ireland, the number of web pages in the CNN web site, the number of blogs mentioning Barack Obama before the 2008 presidential campaign). This book argues that it can be useful for social scientists to measure aspects of the web and explains how this can be achieved on both a small and large scale. The book is intended for social scientists with research topics that are wholly or partly online (e.g., social networks, news, political communication) and social scientists with offline research topics with an online reflection, even if this is not a core component (e.g., diaspora communities, consumer culture, linguistic change). The book is also intended for library and information science students in the belief that the knowledge and techniques described will be useful for them to guide and aid other social scientists in their research. In addition, the techniques and issues are all directly relevant to library and information science research problems. Table of Contents: Introduction / Web Impact Assessment / Link Analysis / Blog Searching / Automatic Search Engine Searches: LexiURL Searcher / Web Crawling: SocSciBot / Search Engines and Data Reliability / Tracking User Actions Online / Advanced Techniques / Summary and Future Directions

Journal ArticleDOI
01 Aug 2009
TL;DR: Octopus is a system that combines search, extraction, data cleaning and integration, enabling users to create new data sets from those found on the Web by offering a set of best-effort operators that automate the most labor-intensive tasks.
Abstract: The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this relational web raises several challenges that are not addressed by current data integration systems or mash-up tools. First, the structured data is usually not published cleanly and must be extracted (say, from an HTML list) before it can be used. Second, due to the vastness of the corpus, a user can never know all of the potentially-relevant databases ahead of time (much less write a wrapper or mapping for each one); the source databases must be discovered during the integration process. Third, some of the important information regarding the data is only present in its enclosing web page and needs to be extracted appropriately. This paper describes Octopus, a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web. The key idea underlying Octopus is to offer the user a set of best-effort operators that automate the most labor-intensive tasks. For example, the Search operator takes a search-style keyword query and returns a set of relevance-ranked and similarity-clustered structured data sources on the Web; the Context operator helps the user specify the semantics of the sources by inferring attribute values that may not appear in the source itself, and the Extend operator helps the user find related sources that can be joined to add new attributes to a table. Octopus executes some of these operators automatically, but always allows the user to provide feedback and correct errors. We describe the algorithms underlying each of these operators and experiments that demonstrate their efficacy.
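The flavor of a best-effort operator can be shown with a toy version of Extend: scan candidate tables for one that joins to the base table on a given column and contributes a new attribute. The tables and column names are invented, and the real operator's ranking, cleaning, and schema matching are omitted.

```python
# Toy sketch in the spirit of Octopus's best-effort EXTEND operator: given a
# base table, look through candidate Web-extracted tables for one that joins
# on an existing column and contributes a new attribute.
base = [{"city": "Paris", "country": "France"},
        {"city": "Rome", "country": "Italy"}]

candidate_sources = [
    [{"city": "Paris", "population": "2.1M"}, {"city": "Rome", "population": "2.8M"}],
    [{"team": "PSG", "league": "Ligue 1"}],
]

def extend(base, sources, join_col):
    for table in sources:
        if not all(join_col in row for row in table):
            continue                      # source cannot be joined on this column
        lookup = {row[join_col]: row for row in table}
        if not any(r[join_col] in lookup for r in base):
            continue                      # no overlap with the base table
        new_cols = [c for c in table[0] if c != join_col]
        return [{**r, **{c: lookup.get(r[join_col], {}).get(c) for c in new_cols}}
                for r in base]
    return base

print(extend(base, candidate_sources, "city"))
```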

Patent
11 Sep 2009
TL;DR: In this paper, a user can select sharable content on a web page, drag the content to a target, and drop it directly on the target, thereby sharing the content with the target.
Abstract: Techniques are described that enable content from a web page to be shared directly with one or more targets, which may be an application, a buddy from a buddy list (e.g., in a chat application), and the like. An embodiment of the present invention can identify contents on a web page that are to be made sharable and make the identified contents sharable. The content that is made sharable can then be shared with a share target using, for example, drag and drop operations. For example, a user may select sharable content on a web page, drag the content to a target, and drop it directly on the target, thereby sharing the content with the target.

Proceedings ArticleDOI
09 Feb 2009
TL;DR: It is demonstrated how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages.
Abstract: Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
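The simpler of the two models, the extended vector space, can be sketched directly (scikit-learn assumed; the pages and tags are invented, and the joint LDA model is not shown):

```python
# Sketch of the extended vector space described above: social tags are
# appended to page text before K-means clustering. Data and tag weighting
# are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

pages = [
    {"text": "tutorial on python decorators", "tags": ["programming", "python"]},
    {"text": "guide to javascript closures",  "tags": ["programming", "webdev"]},
    {"text": "classic french onion soup",     "tags": ["cooking", "recipe"]},
    {"text": "how to bake sourdough bread",   "tags": ["cooking", "baking"]},
]

# Append tags to the page text so both signals land in one vector space.
docs = [p["text"] + " " + " ".join(p["tags"]) for p in pages]
X = TfidfVectorizer().fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # programming pages and cooking pages should separate
```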

Journal Article
TL;DR: With information systems (IS) classrooms quickly filling with the Google/Facebook generation accustomed to being connected to information sources and social networks all the time and in many forms, how can these technologies be used to transform, supplement, or even supplant current pedagogical practices?
Abstract: 1. INTRODUCTION
Whether it is a social networking site like Facebook, a video stream delivered via YouTube, or collaborative discussion and document sharing via Google Apps, more people are using Web 2.0 and virtual world technologies in the classroom to communicate, express ideas, and form relationships centered around topical interests. Virtual Worlds immerse participants even deeper in technological realms rife with interaction. Instead of simply building information, people create entire communities comprised of self-built worlds and avatars centered around common interests, learning, or socialization in order to promote information exchange. With information systems (IS) classrooms quickly filling with the Google/Facebook generation accustomed to being connected to information sources and social networks all the time and in many forms, how can we best use these technologies to transform, supplement, or even supplant current pedagogical practices? Will holding office hours in a chat room make a difference? What about creating collaborative Web content with Wikis? How about demonstrations of complex concepts in a Virtual World so students can experiment endlessly? In this JISE special issue, we will explore these questions and more.

2. TYPES OF WEB 2.0 TECHNOLOGIES
Web 2.0 technologies encompass a variety of different meanings that include an increased emphasis on user generated content, data and content sharing, collaborative effort, new ways of interacting with Web-based applications, and the use of the Web as a social platform for generating, repositioning and consuming content. The beginnings of the shared content nature of Web 2.0 appeared in 1980 in Tim Berners-Lee's prototype Web software. However, the content sharing aspects of the Web were lost in the original rollout, and did not reappear until Ward Cunningham wrote the first wiki in 1994-1995. Blogs, another early part of the Web 2.0 phenomenon, were sufficiently developed to gain the name weblogs in 1997 (Franklin & van Harmelen, 2007). The first use of the term Web 2.0 was in 2004 (Graham, 2005; O'Reilly, 2005a; O'Reilly, 2005b). "Web 2.0" refers to a perceived second generation of Web development and design that facilitates communications and secures information sharing, interoperability, and collaboration on the World Wide Web. Web 2.0 concepts have led to the development and evolution of Web-based communities, hosted services, and applications; such as social-networking sites, video-sharing sites, wikis, blogs, and folksonomies" (Web 2.0, 2009). The emphasis on user participation--also known as the "Read/Write" Web--characterizes most people's definitions of Web 2.0. There are many types of Web 2.0 technologies and new offerings appear almost daily. The following are some basic categories in which we can classify most Web 2.0 offerings.

2.1 Wikis
A "wiki" is a collection of Web pages designed to enable anyone with access to contribute or modify content, using a simplified markup language, and is often used to create collaborative Websites (Wiki, 2009). One of the best known wikis is Wikipedia. Wikis can be used in education to facilitate knowledge systems powered by students (Raman, Ryan, & Olfman, 2005).

2.2 Blogs
A blog (weblog) is a type of Website, usually maintained by an individual with regular commentary entries, event descriptions, or other material such as graphics or video. One example of the use of blogs in education is the use of question blogging, a type of blog that answers questions. Moreover, these questions and discussions can be a collaborative endeavor among instructors and students. Wagner (2003) addressed using blogs in education by publishing learning logs.

2.3 Podcasts
A podcast is a digital media file, usually digital audio or video that is freely available for download from the Internet using software that can handle RSS feeds (Podcast, 2009). …

Journal ArticleDOI
TL;DR: On the basis of trends of Google queries, these authors put their results into practice by creating a Web page dedicated to influenza surveillance, but did not develop the same approach for other diseases.
Abstract: To the Editor: The idea that populations provide data on their influenza status through information-seeking behavior on the Web has been explored in the United States in recent years (1,2). Two reports showed that queries to the Internet search engines Yahoo and Google could be informative for influenza surveillance (2,3). Ginsberg et al. scanned the Google database and found that the sum of the results of 45 queries that most correlated with influenza incidences provided the best predictor of influenza trends (3). On the basis of trends of Google queries, these authors put their results into practice by creating a Web page dedicated to influenza surveillance. However, they did not develop the same approach for other diseases. To date, no studies have been published about the relationship of search engine query data with other diseases or in languages other than English.
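The query-correlation idea the letter refers to can be illustrated with a toy calculation: rank queries by how well their volumes correlate with reported incidence and sum the best-correlated ones into a surveillance signal. All numbers below are invented.

```python
# Toy version of the correlation-based approach described above: rank search
# queries by how well their weekly volumes track reported influenza incidence
# and sum the top-correlated ones into a surveillance signal. Real systems
# use years of query logs and sentinel-network data.
import numpy as np

incidence = np.array([2.0, 3.5, 6.0, 9.0, 7.0, 4.0])     # cases per 1,000 (invented)
query_volumes = {
    "flu symptoms": np.array([10, 18, 30, 45, 36, 20]),
    "fever in children": np.array([8, 12, 22, 33, 27, 15]),
    "cheap flights": np.array([50, 48, 52, 49, 51, 50]),
}

correlations = {q: np.corrcoef(v, incidence)[0, 1] for q, v in query_volumes.items()}
top_queries = sorted(correlations, key=correlations.get, reverse=True)[:2]
signal = sum(query_volumes[q] for q in top_queries)
print(top_queries, signal)
```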

Journal ArticleDOI
TL;DR: An approach for extracting semantics for tags, unstructured text-labels assigned to resources on the Web, based on each tag's usage patterns, is described, along with two novel methods, TagMaps and scale-structure identification.
Abstract: We describe an approach for extracting semantics for tags, unstructured text-labels assigned to resources on the Web, based on each tag's usage patterns. In particular, we focus on the problem of extracting place semantics for tags that are assigned to photos on Flickr, a popular photo-sharing Web site that supports location (latitude/longitude) metadata for photos. We propose the adaptation of two baseline methods, inspired by well-known burst-analysis techniques, for the task; we also describe two novel methods, TagMaps and scale-structure identification. We evaluate the methods on a subset of Flickr data. We show that our scale-structure identification method outperforms existing techniques and that a hybrid approach generates further improvements (achieving 85% precision at 81% recall). The approach and methods described in this work can be used in other domains such as geo-annotated Web pages, where text terms can be extracted and associated with usage patterns.

Journal ArticleDOI
TL;DR: The base of Web 3.0 applications resides in the resource description framework (RDF) for providing a means to link data from multiple Web sites or databases, and with the SPARQL query language, applications can use native graph-based RDF stores and extract RDF data from traditional databases.
Abstract: While Web 3.0 technologies are difficult to define precisely, the outline of emerging applications has become clear over the past year. We can thus essentially view Web 3.0 as semantic Web technologies integrated into, or powering, large-scale Web applications. The base of Web 3.0 applications resides in the resource description framework (RDF) for providing a means to link data from multiple Web sites or databases. With the SPARQL query language, a SQL-like standard for querying RDF data, applications can use native graph-based RDF stores and extract RDF data from traditional databases.
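A small example of the RDF-plus-SPARQL combination, assuming the third-party rdflib package; the vocabulary and data are made up.

```python
# Sketch of the RDF + SPARQL combination described above: load a few RDF
# triples and run a SQL-like SPARQL query over them.
from rdflib import Graph

turtle = """
@prefix ex: <http://example.org/> .
ex:alice ex:worksFor ex:acme ; ex:name "Alice" .
ex:bob   ex:worksFor ex:acme ; ex:name "Bob" .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?name WHERE {
        ?person ex:worksFor ex:acme ;
                ex:name ?name .
    }
""")
for row in results:
    print(row.name)
```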

Proceedings Article
10 Aug 2009
TL;DR: Gazelle is introduced, a secure web browser constructed as a multi-principal OS that exclusively manages resource protection and sharing across web site principals and exposes intricate design issues that no previous work has identified.
Abstract: Original web browsers were applications designed to view static web content. As web sites evolved into dynamic web applications that compose content from multiple web sites, browsers have become multi-principal operating environments with resources shared among mutually distrusting web site principals. Nevertheless, no existing browsers, including new architectures like IE 8, Google Chrome, and OP, have a multi-principal operating system construction that gives a browser-based OS the exclusive control to manage the protection of all system resources among web site principals. In this paper, we introduce Gazelle, a secure web browser constructed as a multi-principal OS. Gazelle's browser kernel is an operating system that exclusively manages resource protection and sharing across web site principals. This construction exposes intricate design issues that no previous work has identified, such as cross-protection-domain display and events protection. We elaborate on these issues and provide comprehensive solutions. Our prototype implementation and evaluation experience indicates that it is realistic to turn an existing browser into a multi-principal OS that yields significantly stronger security and robustness with acceptable performance.

Proceedings ArticleDOI
28 Dec 2009
TL;DR: Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext, is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
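The best-known piece of the hypertext structure exploited here is PageRank; a plain power-iteration sketch over an invented four-page link graph looks like this (damping factor 0.85; dangling pages are not handled):

```python
# Sketch of PageRank, the link-analysis technique this paper introduced for
# exploiting hypertext structure: simple power iteration over a tiny,
# made-up link graph.
links = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = d * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```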

Proceedings ArticleDOI
19 Jul 2009
TL;DR: This paper incorporates context information into the problem of query classification by using conditional random field (CRF) models and shows that it can improve the F1 score by 52% as compared to other state-of-the-art baselines.
Abstract: Understanding users' search intent expressed through their search queries is crucial to Web search and online advertisement. Web query classification (QC) has been widely studied for this purpose. Most previous QC algorithms classify individual queries without considering their context information. However, as exemplified by the well-known example on query "jaguar", many Web queries are short and ambiguous, whose real meanings are uncertain without the context information. In this paper, we incorporate context information into the problem of query classification by using conditional random field (CRF) models. In our approach, we use neighboring queries and their corresponding clicked URLs (Web pages) in search sessions as the context information. We perform extensive experiments on real world search logs and validate the effectiveness and efficiency of our approach. We show that we can improve the F1 score by 52% as compared to other state-of-the-art baselines.
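Framing the task as sequence labeling over search sessions can be sketched with a toy CRF; the sklearn-crfsuite package is an assumption (the paper does not prescribe it), the sessions and labels are invented, and clicked-URL features are omitted.

```python
# Illustrative sketch of context-aware query classification as sequence
# labeling: each search session is a sequence of queries, and features from
# neighboring queries inform each label. Uses the third-party
# sklearn-crfsuite package; data and features are toy examples.
import sklearn_crfsuite

sessions = [
    [("jaguar price", "Autos"), ("jaguar xj6 review", "Autos")],
    [("jaguar habitat", "Animals"), ("big cats diet", "Animals")],
]

def features(session, i):
    feats = {"query": session[i][0]}
    if i > 0:
        feats["prev_query"] = session[i - 1][0]      # context from the session
    if i < len(session) - 1:
        feats["next_query"] = session[i + 1][0]
    return feats

X = [[features(s, i) for i in range(len(s))] for s in sessions]
y = [[label for _, label in s] for s in sessions]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```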

Journal ArticleDOI
TL;DR: An advanced architecture for a personalization system to facilitate Web mining is proposed and the meaning of several recommendations are described, starting from the rules discovered by the Web mining algorithms.
Abstract: Nowadays, the application of Web mining techniques in e-learning and Web-based adaptive educational systems is increasing exponentially. In this paper, we propose an advanced architecture for a personalization system to facilitate Web mining. A specific Web mining tool is developed and a recommender engine is integrated into the AHA! system in order to help the instructor to carry out the whole Web mining process. Our objective is to be able to recommend to a student the most appropriate links/Web pages within the AHA! system to visit next. Several experiments are carried out with real data provided by Eindhoven University of Technology students in order to test both the architecture proposed and the algorithms used. Finally, we have also described the meaning of several recommendations, starting from the rules discovered by the Web mining algorithms.

Book
27 Mar 2009
TL;DR: This book is the first to extend detailed coverage of analysis beyond bit vectors, and equips readers with a combination of mutually supportive theory and practice, presenting mathematical foundations and including study of data flow analysis implementation through use of the GNU Compiler Collection.
Abstract: This work provides an in-depth treatment of the data flow analysis technique. Apart from including interprocedural data flow analysis, this book is the first to extend detailed coverage of analysis beyond bit vectors. Supplemented by numerous examples, it equips readers with a combination of mutually supportive theory and practice, presenting mathematical foundations and including study of data flow analysis implementation through use of the GNU Compiler Collection (GCC). Readers can experiment with the analyses described in the book by accessing the authors' web page, where they will find the source code of gdfa (generic data flow analyzer).
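In the book's spirit, here is a minimal iterative data flow analysis: backward liveness over a tiny invented control flow graph, with Python sets standing in for bit vectors.

```python
# A minimal iterative data flow analysis: backward liveness over a small
# control flow graph, using Python sets in place of bit vectors.
# Each block: (uses, defs, successors); the program is invented.
cfg = {
    "B1": ({"a"}, {"b"}, ["B2"]),          # b = a + 1
    "B2": ({"b"}, {"c"}, ["B3", "B4"]),    # c = b * 2; branch
    "B3": ({"c"}, set(), []),              # print(c)
    "B4": ({"a"}, set(), []),              # print(a)
}

live_in = {b: set() for b in cfg}
live_out = {b: set() for b in cfg}

changed = True
while changed:                              # iterate to a fixed point
    changed = False
    for block, (use, defs, succs) in cfg.items():
        out = set().union(*(live_in[s] for s in succs)) if succs else set()
        inn = use | (out - defs)
        if out != live_out[block] or inn != live_in[block]:
            live_out[block], live_in[block] = out, inn
            changed = True

for block in cfg:
    print(block, "IN:", sorted(live_in[block]), "OUT:", sorted(live_out[block]))
```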

Proceedings ArticleDOI
09 Feb 2009
TL;DR: Algorithms, analyses, and models for characterizing changes in Web content are described, focusing on both time (by using hourly and sub-hourly crawls) and structure (by looking at page-, DOM-, and term-level changes).
Abstract: The Web is a dynamic, ever changing collection of information. This paper explores changes in Web content by analyzing a crawl of 55,000 Web pages, selected to represent different user visitation patterns. Although change over long intervals has been explored on random (and potentially unvisited) samples of Web pages, little is known about the nature of finer grained changes to pages that are actively consumed by users, such as those in our sample. We describe algorithms, analyses, and models for characterizing changes in Web content, focusing on both time (by using hourly and sub-hourly crawls) and structure (by looking at page-, DOM-, and term-level changes). Change rates are higher in our behavior-based sample than found in previous work on randomly sampled pages, with a large portion of pages changing more than hourly. Detailed content and structure analyses identify stable and dynamic content within each page. The understanding of Web change we develop in this paper has implications for tools designed to help people interact with dynamic Web content, such as search engines, advertising, and Web browsers.
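One simple way to quantify such changes between successive crawls is shingle-based similarity; the sketch below is generic (the paper's own DOM- and term-level measures are richer), and the snapshots are invented.

```python
# Sketch of one way to quantify page change between successive crawls:
# compare snapshots with word shingles and Jaccard similarity, at whatever
# crawl interval (hourly or finer) is available.
def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

crawls = [
    "breaking news markets rally as tech stocks surge on earnings",
    "breaking news markets rally as oil prices slide on supply fears",
    "breaking news markets mixed as investors await central bank decision",
]

for earlier, later in zip(crawls, crawls[1:]):
    print(f"change between crawls: {1 - similarity(earlier, later):.2f}")
```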

Proceedings ArticleDOI
01 Apr 2009
TL;DR: A multi-process browser architecture is presented that isolates web program instances from each other, improving fault tolerance, resource management, and performance.
Abstract: Many of today's web sites contain substantial amounts of client-side code, and consequently, they act more like programs than simple documents. This creates robustness and performance challenges for web browsers. To give users a robust and responsive platform, the browser must identify program boundaries and provide isolation between them. We provide three contributions in this paper. First, we present abstractions of web programs and program instances, and we show that these abstractions clarify how browser components interact and how appropriate program boundaries can be identified. Second, we identify backwards compatibility tradeoffs that constrain how web content can be divided into programs without disrupting existing web sites. Third, we present a multi-process browser architecture that isolates these web program instances from each other, improving fault tolerance, resource management, and performance. We discuss how this architecture is implemented in Google Chrome, and we provide a quantitative performance evaluation examining its benefits and costs.