Journal ArticleDOI

The anatomy of a large-scale hypertextual Web search engine

01 Apr 1998 - Vol. 30, pp. 107-117
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext, and looks at the problem of how to deal effectively with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
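The abstract emphasizes combining traditional full-text matching with the extra signal available in hyperlink structure. As a purely illustrative sketch (not the ranking function described in the paper), the snippet below blends a query-dependent text score with a query-independent link-based score; the names, weights, and toy data are invented.

```python
# Illustrative sketch only -- NOT the ranking function described in the paper.
# It shows the general idea of blending a text-match score with a
# link-based importance score (e.g. PageRank); names and weights are invented.

def combined_score(text_score: float, link_score: float, alpha: float = 0.5) -> float:
    """Blend a query-dependent text score with a query-independent link score."""
    return alpha * text_score + (1.0 - alpha) * link_score

# Example: two pages that match a query equally well on text alone are
# separated by their link-based importance.
pages = {
    "page_a": {"text_score": 0.8, "link_score": 0.2},
    "page_b": {"text_score": 0.8, "link_score": 0.7},
}
ranked = sorted(pages, key=lambda p: combined_score(**pages[p]), reverse=True)
print(ranked)  # ['page_b', 'page_a']
```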


Citations
Journal ArticleDOI
06 May 2021
TL;DR: A hybrid multicriteria model for the evaluation of critical IT systems in which the elements of risk analysis and assessment serve as evaluation criteria; the main advantage of the new model is its use of generic criteria for risk assessment.
Abstract: One of the important objectives and concerns today is to find efficient means to manage the information security risks to which organizations are exposed. Due to a lack of necessary data and time and resource constraints, very often it is impossible to gather and process all of the required information about an IT system in order to properly assess it within an acceptable timeframe. That puts the organization into a state of increased security risk. One of the means to solve such complex problems is the use of multicriteria decision-making methods that have a strong mathematical foundation. This paper presents a hybrid multicriteria model for the evaluation of critical IT systems where the elements for risk analysis and assessment are used as evaluation criteria. The iterative steps of the design science research (DSR) methodology for development of a new multicriteria model for the objectives of evaluation, ranking, and selection of critical information systems are delineated. The main advantage of the new model is its use of generic criteria for risk assessment instead of redefining inherent criteria and calculating related weights for each individual IT system. That is why more efficient evaluation, ranking, and decision-making between several possible IT solutions can be expected. The proposed model was validated in a case study of online banking transaction systems and could be used as a generic model for the evaluation of critical IT systems.

3 citations


Cites background from "The anatomy of a large-scale hypert..."

  • ...Various studies have tested different damping factors, but in general, according to the authors of the Google PageRank algorithm [60], this factor is around 0....


Book ChapterDOI
31 Aug 2021
TL;DR: In this article, the authors collected publicly available forum data from the "Chimp&See" project hosted on the Zooniverse platform by crawling its Talk pages and analyzed the collected data using Social Network Analysis (SNA) and Epistemic Network Analysis (ENA) techniques.
Abstract: Citizen Science (CS) projects provide a space for collaboration among scientists and the general public as a basis for making joint scientific discoveries. Analysis of existing datasets from CS projects can broaden our understanding of how different stakeholder groups interact and contribute to the joint achievements. To this end, we have collected publicly available forum data from the “Chimp&See” project hosted on the Zooniverse platform via crawling its Talk pages. The collected data were then analysed using Social Network Analysis (SNA) and Epistemic Network Analysis (ENA) techniques. The results obtained shed light on the participation and collaboration patterns of different stakeholder groups within discussion forums of the “Chimp&See” project.

3 citations

Journal ArticleDOI
TL;DR: In this article, the authors propose an integrated customer value model with three dimensions (purchase value, interactive value, and marketing diffusion value) and 13 indicators, based on complex network theory and the RFM model, which accounts for the value created by connection and interaction among customers.
Abstract: The advent of the mobile Internet era brings both opportunities and challenges to understanding customer value in the field of customer relationship management. Traditional customer relationship management theory and practice focus on the transaction value created by individual customers and do not sufficiently consider the huge potential commercial value behind the interaction and connection among people through online social services in the mobile Internet era. Towards this end, this study first analyses the new characteristics of customer behaviour in the mobile Internet era. Second, the study proposes an integrated customer value model with three dimensions (purchase value, interactive value, and marketing diffusion value) and 13 indicators, based on complex network theory and the RFM model, considering the value created by connection and interaction among customers. Finally, the study discusses eight types of customer clusters and the corresponding differentiated customer relationship management strategies in the era of the mobile Internet based on the integrated customer value model. This study enriches the theory of customer value in the field of customer relationship management and also helps companies better manage innovation and change in customer relationship management in the mobile Internet era.

3 citations

Journal ArticleDOI
TL;DR: The findings highlight the interplay of policy development and FoPL research, the presence of a few self-reinforcing and well-established co-citation networks based on validated evidence in the literature, and the presence of alternative emerging theories that offer different and valid perspectives overlooked by mainstream co-citation research networks.
Abstract: The last decades have been marked by the introduction of front-of-pack labels (FoPL) as an institutional corrective action against obesity and nutrition-related illnesses. However, FoPL-related policy-making initiatives issued by the European Union evolved over time and led to a diversity of labels with different effects on consumers’ decisions. As a result, the extant literature adapted to the regulative scenario over the years and investigated the effects of the labels, creating consensus on some topics while remaining fragmented on others. Similarly, policy-makers adapted some regulations to the evidence supported by the research. With the aim of systematizing the overall structure and evolution of the literature on FoPL, investigating the presence of a consensus on specific topics through a co-citation analysis, and examining the evolution of the consensus and co-citation networks over the years and potential research gaps, we report the results of bibliometric and co-citation analyses and a systematic literature review involving 170 papers and a selection of 49 articles published in recent months, for a total of 219 articles, analysed according to three timespans (Period 1 (1989–2011), Period 2 (2012–2016), and Period 3 (2017–2022)). Our findings highlight the interplay of policy development and FoPL research, the presence of a few self-reinforcing and well-established co-citation networks based on validated evidence in the literature, and the presence of alternative emerging theories that offer different and valid perspectives overlooked by mainstream co-citation research networks.

3 citations

Journal ArticleDOI
TL;DR: In this article, the authors propose a taxonomy of BD, HPC, and ML system benchmarks based on individual dimensions such as accuracy metrics and common dimensions such as workload type, and aim to enable the use of this taxonomy in identifying suitable benchmarks for such systems.
Abstract: In recent years, there has been a convergence of Big Data (BD), High Performance Computing (HPC), and Machine Learning (ML) systems. This convergence is due to the increasing complexity of long data analysis pipelines running on separate software stacks. With the increasing complexity of data analytics pipelines comes a need to evaluate the underlying systems in order to make informed decisions about technology selection and the sizing and scoping of hardware. While there are many benchmarks for each of these domains, there is no convergence of these efforts. As a first step, it is necessary to understand how the individual benchmark domains relate. In this work, we analyze some of the most expressive and recent benchmarks of BD, HPC, and ML systems. We propose a taxonomy of those systems based on individual dimensions such as accuracy metrics and common dimensions such as workload type. Moreover, we aim to enable readers to use our taxonomy to identify suitable benchmarks for their BD, HPC, and ML systems. Finally, we identify challenges and research directions related to the future of converged BD, HPC, and ML system benchmarking.

3 citations

References
Proceedings Article
11 Nov 1999
TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
Abstract: The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.

14,400 citations
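The PageRank abstract above mentions the idealized random-surfer model and efficient computation over large numbers of pages, and the citing excerpt earlier on this page refers to the damping factor. Below is a minimal power-iteration sketch, assuming the commonly cited damping factor d = 0.85; the toy graph, tolerance, and dangling-page handling are illustrative, not the production implementation.

```python
# Minimal PageRank power-iteration sketch (illustrative, not the production
# implementation described in the paper). Uses the commonly cited damping
# factor d = 0.85; the toy graph and tolerance are invented for the example.

def pagerank(links: dict[str, list[str]], d: float = 0.85,
             tol: float = 1e-8, max_iter: int = 100) -> dict[str, float]:
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                share = d * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new_rank[q] += d * rank[p] / n
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            rank = new_rank
            break
        rank = new_rank
    return rank

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_web))
```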

Book
11 May 1999
TL;DR: A guide to the MG system and to the NZDL is provided, alongside chapters on text compression, indexing, querying, index construction, and image compression.
Abstract: Table of contents: Preface; 1. Overview; 2. Text Compression; 3. Indexing; 4. Querying; 5. Index Construction; 6. Image Compression; 7. Textual Images; 8. Mixed Text and Images; 9. Implementation; 10. The Information Explosion; A. Guide to the MG System; B. Guide to the NZDL; References; Index.

2,068 citations
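The table of contents above spans text compression, indexing, querying, and index construction. As a small illustration of the indexing and querying chapters only (not the MG system itself, which also compresses the index and the text), here is a minimal in-memory inverted index with a conjunctive query; the documents and query are invented.

```python
# Minimal inverted-index sketch to illustrate the indexing/querying chapters
# listed above; it is not the MG system itself (which also compresses its
# index and text). Documents and query are invented for the example.
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query_and(index: dict[str, set[int]], terms: list[str]) -> set[int]:
    """Conjunctive (AND) query: documents containing every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "managing gigabytes of text", 2: "text compression and indexing",
        3: "image compression"}
index = build_index(docs)
print(query_and(index, ["text", "compression"]))  # {2}
```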

Proceedings ArticleDOI
Jon Kleinberg1
01 Jan 1998
TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure; the formulation has connections to the eigenvectors of certain matrices associated with the link graph.
Abstract: The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.

1,440 citations


Additional excerpts

  • ...There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]....

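Kleinberg's abstract above ties hub and authority scores to eigenvectors of matrices derived from the link graph. The sketch below shows the mutually reinforcing iterative update in its simplest form; the toy graph and iteration count are invented, and the full method also assembles a query-focused subgraph first, which is omitted here.

```python
# Minimal hubs-and-authorities (HITS-style) iteration sketch, illustrating the
# mutual reinforcement described in the abstract above; the toy graph and the
# iteration count are invented, and this is not Kleinberg's full procedure
# (which first assembles a query-focused subgraph).
import math

def hits(links: dict[str, list[str]], iterations: int = 50):
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs) for p in pages}
        # Hub score: sum of authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

toy_graph = {"a": ["b", "c"], "b": ["c"], "d": ["c", "b"]}
hub, auth = hits(toy_graph)
print(max(auth, key=auth.get))  # 'c' emerges as the strongest authority
```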

Journal ArticleDOI
01 Apr 1998
TL;DR: In this paper, the authors study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first, and show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
Abstract: In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.

980 citations
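The crawler-ordering abstract above asks in what order a crawler should visit known URLs so that important pages are fetched first. As an illustration of that idea only (not the paper's experimental setup), the sketch below keeps the crawl frontier in a priority queue keyed by an importance estimate; the in-link-count scores and URLs are invented.

```python
# Minimal "importance-ordered" crawl frontier sketch, illustrating the idea of
# visiting URLs in order of an importance estimate rather than FIFO. The
# importance estimate here (count of known in-links) and the seed data are
# illustrative only.
import heapq

class Frontier:
    def __init__(self):
        self._heap: list[tuple[float, str]] = []
        self._seen: set[str] = set()

    def add(self, url: str, importance: float) -> None:
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so negate to pop the most important URL first.
            heapq.heappush(self._heap, (-importance, url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

    def __bool__(self) -> bool:
        return bool(self._heap)

# Toy usage: seed with URLs scored by how many known pages link to them.
in_link_counts = {"http://a.example": 12, "http://b.example": 3, "http://c.example": 40}
frontier = Frontier()
for url, count in in_link_counts.items():
    frontier.add(url, importance=count)
while frontier:
    print(frontier.pop())  # c, a, b -- most "important" first
```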

Journal ArticleDOI
01 Apr 1998
TL;DR: An evaluation of ARC suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic.
Abstract: We describe the design, prototyping and evaluation of ARC, a system for automatically compiling a list of authoritative Web resources on any (sufficiently broad) topic. The goal of ARC is to compile resource lists similar to those provided by Yahoo! or Infoseek. The fundamental difference is that these services construct lists either manually or through a combination of human and automated effort, while ARC operates fully automatically. We describe the evaluation of ARC, Yahoo!, and Infoseek resource lists by a panel of human users. This evaluation suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. We also provide examples of ARC resource lists for the reader to examine.

810 citations