This work describes a prototype system that specializes in crawling entity-oriented deep-web sites and proposes techniques tailored to important subproblems, including query generation, empty-page filtering, and URL deduplication, in the specific context of entity-oriented deep-web sites.
Abstract:
Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty-page filtering and URL deduplication in the specific context of entity-oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.
TL;DR: The experimental results show the agility and accuracy of the proposed crawler framework, SmartCrawler, which efficiently retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers.
TL;DR: It is concluded that crawler evaluation is an immature research area due to the lack of a standard set of performance measures, a benchmark, or a publicly available dataset for evaluating crawlers, and that future work in this area should focus on devising crawlers that can deal with ever-evolving Web technologies and on improving crawling efficiency and scalability.
TL;DR: This work introduces and elaborates on all aspects of CLEAR+, an extended version of the Credible Live Evaluation Method for Archive Readiness (CLEAR) method, and uses a systematic approach, called Website Archivability Facets, to evaluate WA from multiple different perspectives.
TL;DR: This paper proposes the techniques required to construct a POI database, including focused crawling, information extraction, and information retrieval, and demonstrates that the proposed geographical information retrieval model outperforms Wikimapia and a commercial app called ‘What’s the Number?’
TL;DR: These tools are described in detail, and the proposed methods are a further step toward a transparent and comprehensive solution for observing darknet markets, tailored for data scientists, social scientists, criminologists and others interested in analysing trends from darknet markets.
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
TL;DR: Google as discussed by the authors is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.
TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
TL;DR: This book discusses data mining through the lens of cluster analysis, examining the relationships between data, clusters, and algorithms, and some of the techniques used to solve clustering problems.
TL;DR: MQL provides an easy-to-use object-oriented interface to the tuple data in Freebase and is designed to facilitate the creation of collaborative, Web-based data-oriented applications.
Q1. What contributions have the authors mentioned in the paper "Crawling deep web entity pages" ?
In this work, the authors describe a prototype system they have built that specializes in crawling entity-oriented deep-web sites. The authors propose techniques tailored to tackle important subproblems including query generation, empty-page filtering and URL deduplication in the specific context of entity-oriented deep-web sites.
Q2. What are the future works in "Crawling deep web entity pages" ?
While these techniques are shown to be useful, their experience points to a few areas that warrant future studies. Given the ubiquity of entity-oriented deep-web sites and the variety of uses that entity-oriented content can enable, the authors believe entity-oriented crawl is a useful research effort, and they hope their initial efforts in this area can serve as a springboard for future research.
Q3. What is the purpose of the URL template generation component?
The URL template generation component then crawls the homepages of these sites, extracts and parses the web forms found on those homepages, and produces URL templates.
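As a rough illustration of that step, the sketch below parses a homepage search form and turns it into a parameterized URL template. The form markup, the {query} placeholder, and the example site are assumptions made for illustration, not the paper's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class FormParser(HTMLParser):
    """Collects <form> actions and the names of their free-text inputs."""
    def __init__(self):
        super().__init__()
        self.forms = []        # list of (action, [input names]) pairs
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = (attrs.get("action", ""), [])
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            if attrs.get("type", "text") in ("text", "search") and "name" in attrs:
                self._current[1].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None

def url_templates(homepage_url, html):
    """Turn each search form on a homepage into a URL template with a {query} slot."""
    parser = FormParser()
    parser.feed(html)
    return [f"{urljoin(homepage_url, action)}?{fields[0]}={{query}}"
            for action, fields in parser.forms if fields]

# Hypothetical shopping-site homepage with a single search box.
page = '<form action="/search"><input type="text" name="q"></form>'
print(url_templates("http://example-shop.com/", page))
# -> ['http://example-shop.com/search?q={query}']
```

A crawler would then instantiate {query} with candidate entities to fetch result pages.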
Q4. How can the authors use this approach to extract relevant entities?
The authors demonstrate that classical techniques for information retrieval and entity extraction can be used to robustly derive relevant entities for each site, so that crawling bandwidth can be utilized efficiently and effectively (Section 5).
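The answer does not spell out the scoring, but a toy TF-IDF-style ranking conveys the general idea; the function, example texts, and entity candidates below are illustrative assumptions rather than the paper's actual model.

```python
import math
from collections import Counter

def rank_entities(site_text, candidate_entities, background_docs):
    """Rank candidate entities for one site with a simple TF-IDF-style score.

    site_text          -- text sampled from the target site (e.g., its homepage)
    candidate_entities -- entity strings drawn from an external source
    background_docs    -- texts from other sites, used to estimate IDF
    """
    site_tokens = Counter(site_text.lower().split())
    n_docs = len(background_docs) + 1

    def idf(term):
        df = sum(1 for doc in background_docs if term in doc.lower().split()) + 1
        return math.log(n_docs / df)

    scores = {}
    for entity in candidate_entities:
        terms = entity.lower().split()
        scores[entity] = sum(site_tokens[term] * idf(term) for term in terms) / len(terms)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical example: ranking product entities for a camera shop.
print(rank_entities(
    "digital camera lenses and camera accessories on sale",
    ["canon camera", "hiking boots", "camera lens"],
    ["outdoor gear boots and tents", "laptops and tablets"],
))
```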
Q5. What is the problem with the prefix “where to stay in”?
When all the prefixes/suffixes in the query logs are aggregated, the correct prefix “where to stay in” occurs much more frequently and should clearly stand out as an entity-irrelevant pattern.
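A toy sketch of this aggregation idea follows; the word-level prefix definition, the frequency threshold, and the example query log are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def frequent_prefixes(queries, max_len=4, min_count=3):
    """Aggregate word-level prefixes across a query log; prefixes that repeat
    across many queries are treated as entity-irrelevant patterns."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        for k in range(1, min(max_len, len(words)) + 1):
            counts[" ".join(words[:k])] += 1
    return {prefix for prefix, count in counts.items() if count >= min_count}

def strip_prefix(query, patterns):
    """Remove the longest known entity-irrelevant prefix, keeping the entity."""
    words = query.lower().split()
    for k in range(len(words) - 1, 0, -1):
        if " ".join(words[:k]) in patterns:
            return " ".join(words[k:])
    return query

# Toy query log: "where to stay in" repeats and stands out as a pattern.
log = ["where to stay in paris", "where to stay in tokyo",
       "where to stay in rome", "cheap flights to rome"]
patterns = frequent_prefixes(log)
print(strip_prefix("where to stay in berlin", patterns))  # -> "berlin"
```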
Q6. What is the definition of deep-web crawl?
While crawling the deep web can be immensely useful for a variety of tasks, including web indexing [15] and data integration [14], crawling deep-web content is known to be hard.
Q7. What is the definition of deep web crawl?
Deep-web crawl refers to the problem of surfacing rich information behind the web search interfaces of diverse sites across the Web.
Q8. What is the reason why the authors filter second-level URLs?
A significant portion of these second-level URLs are in fact entirely irrelevant, with no deep-web content (many URLs are static and navigational, for example browsing URLs, login URLs, etc.), and need to be filtered out.
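One plausible way to realize such a filter is with simple URL heuristics, sketched below; the keyword lists and query-parameter names are assumptions for illustration and stand in for the more principled filtering the paper describes.

```python
from urllib.parse import urlparse, parse_qs

# Illustrative heuristics: navigational pages are recognized by common path
# keywords, while deep-web result pages usually carry a free-text query parameter.
NAVIGATIONAL_HINTS = ("login", "signin", "register", "help", "about", "browse", "cart")
QUERY_PARAM_NAMES = ("q", "query", "search", "keyword", "kw")

def is_candidate_deep_web_url(url):
    """Return True if a second-level URL looks like it leads to deep-web content."""
    parsed = urlparse(url)
    path = parsed.path.lower()
    if any(hint in path for hint in NAVIGATIONAL_HINTS):
        return False  # static or navigational page (login, browse, etc.)
    params = parse_qs(parsed.query)
    return any(name in params for name in QUERY_PARAM_NAMES)

urls = ["http://example-shop.com/login",
        "http://example-shop.com/search?q=camera",
        "http://example-shop.com/about"]
print([u for u in urls if is_candidate_deep_web_url(u)])
# -> ['http://example-shop.com/search?q=camera']
```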