Journal Article
A survey of Web crawlers for information retrieval
TL;DR: This study follows the guidelines of systematic literature review and applies them to the field of Web crawling, calls for increased awareness of the Web crawler in various fields, and identifies how techniques from other domains can be used for crawling the Web.
Abstract:
The performance of any search engine relies heavily on its Web crawler. Web crawlers are programs that retrieve webpages from the Web by following hyperlinks. These webpages are indexed by a search engine and can then be retrieved by a user query. In the area of Web crawling, we still lack an exhaustive study that covers all crawling techniques. This study follows the guidelines of systematic literature review and applies them to the field of Web crawling. We used the standard procedure for carrying out a systematic literature review on 248 studies selected from a total of 1488 articles published in 12 leading journals and other premier conferences and workshops. Existing literature about the Web crawler is classified into key subareas, and each subarea is further divided according to the techniques being used. We analyzed the distribution of the articles using multiple criteria and drew conclusions. Various studies that use open source Web crawlers are also reported. We have highlighted future areas of research. We call for increased awareness of the Web crawler in various fields and identify how techniques from other domains can be used for crawling the Web. Limitations and recommendations for future work are also discussed. WIREs Data Mining Knowl Discov 2017, 7:e1218. doi: 10.1002/widm.1218
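The abstract's description of a crawler (a program that gets webpages by following hyperlinks) can be illustrated with a minimal sketch. It is not taken from the survey: the `PAGES` dictionary, the `example.test` URLs, and the `crawl` and `LinkExtractor` names are all hypothetical, and an in-memory dictionary stands in for HTTP fetching so the example runs without a network. A breadth-first visiting order is assumed purely for simplicity.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "Web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch=PAGES.get, max_pages=100):
    """Breadth-first crawl from seed; returns URLs in the order they were fetched."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid re-visiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:       # unknown page: skip it
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited
```

Calling `crawl("http://example.test/")` walks the four hypothetical pages once each. A production crawler would additionally need politeness delays, robots.txt handling, and URL normalization, none of which are shown here.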
Citations
Journal Article
A survey on big data-driven digital phenotyping of mental health
TL;DR: The vision of digital phenotyping of mental health (DPMH) is outlined by fusing the enriched data from ubiquitous sensors, social media and healthcare systems, and a broad overview of DPMH from sensing and computing perspectives is presented.
Journal Article
An Automated Word Embedding with Parameter Tuned Model for Web Crawling
Sindhu. Neelakandan., Abhishek Arun, Raghu Ram Bhukya, Bhalchandra M. Hardas, T. Ch. Anil Kumar, M. Ashok
TL;DR: In this article, an automated word embedding with parameter tuned deep learning (AWE-PTDL) model is proposed for focused web crawling, which involves processes such as preprocessing, an incremental skip-gram model with negative sampling (ISGNS), bidirectional long short-term memory-based classification, and bird swarm optimization-based hyperparameter tuning.
Journal Article
A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages
TL;DR: This study proposes a novel approach, namely UzunExt, which extracts content quickly using the string methods and additional information without creating a DOM Tree, which can easily be adapted to other DOM-based studies/parsers in this task to enhance their time efficiencies.
Proceedings Article
The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing
David Zeber, Sarah Bird, Camila Oliveira, Walter Rudametkin, Ilana Segall, Fredrik Wollsén, Martin Lopatka
TL;DR: In this paper, the authors quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage.
Journal Article
Data Analytics in Industry 4.0: A Survey.
Lian Duan, Li Da Xu
TL;DR: In this article, a systematic literature review on the interaction between Industry 4.0 and data analytics is conducted to understand the existing research focus and trends, and to better understand current research efforts, hot topics, and trending topics at this critical intersection.
References
Proceedings Article
Performing systematic literature reviews in software engineering
David Budgen, Pearl Brereton
TL;DR: This tutorial is designed to provide an introduction to the role, form and processes involved in performing Systematic Literature Reviews, and to help attendees gain the knowledge needed to conduct systematic reviews of their own.
Journal Article
Systematic literature reviews in software engineering - A systematic literature review
Barbara Kitchenham, O. Pearl Brereton, David Budgen, Mark Turner, John W. Bailey, Stephen Linkman
TL;DR: The series of cost estimation SLRs demonstrates the potential value of EBSE for synthesising evidence and making it available to practitioners; European researchers appear to be the leading exponents of systematic literature reviews.
Journal Article
Lessons from applying the systematic literature review process within the software engineering domain
TL;DR: In this article, the authors report experiences with applying one such approach, the practice of systematic literature review, to the published studies relevant to topics within the software engineering domain, and some lessons about the applicability of this practice to software engineering are extracted.
Journal Article
Focused crawling: a new approach to topic-specific Web resource discovery
TL;DR: This paper describes a new hypertext resource discovery system called a Focused Crawler, which is robust against large perturbations in the starting set of URLs and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.
Journal Article
Efficient crawling through URL ordering
TL;DR: In this paper, the authors study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first, and they show that a good ordering scheme can obtain important pages significantly faster than one without.
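The URL-ordering idea summarized in the last reference above (visit "important" pages first) can be illustrated with a small sketch. This is not the cited paper's implementation: it assumes, purely for illustration, that importance is estimated by the number of in-links seen so far from already crawled pages, and the `LINKS` graph and `crawl_by_backlinks` function are hypothetical names.

```python
# Hypothetical link graph: page -> pages it links to (stands in for fetching and parsing).
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "hub"],
    "b": ["hub"],
    "c": ["hub"],
    "hub": [],
}


def crawl_by_backlinks(seed, links=LINKS):
    """Visit pages in order of in-links seen so far (ties: earliest discovered)."""
    backlinks = {seed: 0}    # URL -> in-links observed from crawled pages
    discovered = {seed: 0}   # URL -> discovery sequence number, for tie-breaking
    frontier = {seed}        # discovered but not yet visited
    order = []
    while frontier:
        # Pick the frontier URL with the most known backlinks.
        url = max(frontier, key=lambda u: (backlinks[u], -discovered[u]))
        frontier.remove(url)
        order.append(url)
        for target in links.get(url, []):
            backlinks[target] = backlinks.get(target, 0) + 1
            if target not in discovered:
                discovered[target] = len(discovered)
                frontier.add(target)
    return order
```

On this toy graph a plain breadth-first crawl would reach `hub` last, whereas the backlink-ordered crawl visits it before `c` because two crawled pages already point to it. The linear scan with `max` keeps the sketch short; a larger frontier would call for a priority queue.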