Journal Article

A survey of Web crawlers for information retrieval

TLDR
This study follows the guidelines of systematic literature review and applies them to the field of Web crawling, calling for increased awareness of the Web crawler across various fields and identifying how techniques from other domains can be used for crawling the Web.
Abstract
The performance of any search engine relies heavily on its Web crawler. Web crawlers are programs that fetch webpages from the Web by following hyperlinks. These webpages are indexed by a search engine and can be retrieved by a user query. In the area of Web crawling, we still lack an exhaustive study that covers all crawling techniques. This study follows the guidelines of systematic literature review and applies them to the field of Web crawling. We applied the standard systematic literature review procedure to 248 studies drawn from a total of 1488 articles published in 12 leading journals and other premier conferences and workshops. The existing literature on Web crawlers is classified into key subareas, and each subarea is further divided according to the techniques used. We analyzed the distribution of the articles using multiple criteria and drew conclusions. Various studies that use open source Web crawlers are also reported. We highlight future areas of research, call for increased awareness of the Web crawler across various fields, and identify how techniques from other domains can be used for crawling the Web. Limitations and recommendations for future work are also discussed. WIREs Data Mining Knowl Discov 2017, 7:e1218. doi: 10.1002/widm.1218


Citations
Journal Article

A survey on big data-driven digital phenotyping of mental health

TL;DR: The vision of digital phenotyping of mental health (DPMH) is outlined by fusing the enriched data from ubiquitous sensors, social media and healthcare systems, and a broad overview of DPMH from sensing and computing perspectives is presented.
Journal Article

An Automated Word Embedding with Parameter Tuned Model for Web Crawling

TL;DR: In this article, an automated word embedding with parameter tuned deep learning (AWE-PTDL) model is proposed for focused web crawling. The model involves several processes: preprocessing, an incremental skip-gram model with negative sampling (ISGNS), bidirectional long short-term memory-based classification, and bird swarm optimization-based hyperparameter tuning.
Journal Article

A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages

TL;DR: This study proposes a novel approach, UzunExt, which extracts content quickly using string methods and additional information without creating a DOM tree, and which can easily be adapted to other DOM-based studies/parsers to improve their time efficiency.
Proceedings Article

The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing

TL;DR: In this paper, the authors quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage.
Journal Article

Data Analytics in Industry 4.0: A Survey.

TL;DR: In this article, a systematic literature review on the interaction between Industry 4.0 and data analytics is conducted to understand the existing research focus and trends, and to better understand current research efforts, hot topics, and trending topics at this critical intersection.
References
Proceedings Article

Performing systematic literature reviews in software engineering

TL;DR: This tutorial is designed to provide an introduction to the role, form and processes involved in performing Systematic Literature Reviews, and to help its audience gain the knowledge needed to conduct systematic reviews of their own.
Journal Article

Systematic literature reviews in software engineering - A systematic literature review

TL;DR: The series of cost estimation SLRs demonstrates the potential value of EBSE for synthesising evidence and making it available to practitioners, and European researchers appear to be the leading exponents of systematic literature reviews.
Journal Article

Lessons from applying the systematic literature review process within the software engineering domain

TL;DR: In this article, the authors report experiences with applying one such approach, the practice of systematic literature review, to the published studies relevant to topics within the software engineering domain, and some lessons about the applicability of this practice to software engineering are extracted.
Journal Article

Focused crawling: a new approach to topic-specific Web resource discovery

TL;DR: Describes a new hypertext resource discovery system called a Focused Crawler, which is robust against large perturbations in the starting set of URLs and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.
Journal Article

Efficient crawling through URL ordering

TL;DR: In this paper, the authors study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first, and they show that a good ordering scheme can obtain important pages significantly faster than one without.