Proceedings ArticleDOI

Analyzing the impact of deep web on real-time business search

TL;DR: This work proposes a novel approach to distributing a web search engine: the engine itself, together with its index, is spread over multiple nodes so that each node handles a subset of the entity and sensor pages.
Abstract: Real-time business search largely involves the Internet of Things (IoT), which has grown into a huge set of objects with a large number of intercommunication links and services. This scalability issue can be addressed by the Social Internet of Things (SIoT), in which an object looks for social partners that share a similar set of rules, thereby positively influencing the performance of the service. In this work, we analyze the impact of the deep web on real-time search. We also propose a novel approach based on an analysis of the shortcomings of existing techniques. The core idea is to distribute the web search engine itself, along with its index, over multiple nodes so that each node handles a subset of the entity and sensor pages. Deep websites need to be included to obtain accurate results. This work offers better accuracy and a significant speed-up for multi-query execution in a distributed environment.
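As an illustration of the distribution idea described above, the following sketch (with hypothetical names such as IndexNode and SearchCluster; a simplification under our own assumptions, not the authors' implementation) partitions pages across nodes by hash and fans a query out to every shard:

```python
# Minimal sketch of distributing a search index over multiple nodes: each node
# indexes a shard of the entity and sensor pages, and a query is executed on
# every shard and merged. Names and the hash-based routing are illustrative.
from collections import defaultdict

class IndexNode:
    """One node holding an inverted index over its shard of pages."""
    def __init__(self):
        self.index = defaultdict(set)   # term -> set of page ids

    def add_page(self, page_id, text):
        for term in text.lower().split():
            self.index[term].add(page_id)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return set()
        result = self.index[terms[0]].copy()
        for term in terms[1:]:
            result &= self.index[term]   # conjunctive (AND) query
        return result

class SearchCluster:
    """Routes pages to shards by hash and fans a query out to all shards."""
    def __init__(self, num_nodes):
        self.nodes = [IndexNode() for _ in range(num_nodes)]

    def add_page(self, page_id, text):
        self.nodes[hash(page_id) % len(self.nodes)].add_page(page_id, text)

    def search(self, query):
        hits = set()
        for node in self.nodes:          # in practice these calls run in parallel
            hits |= node.search(query)
        return hits

cluster = SearchCluster(num_nodes=4)
cluster.add_page("sensor-17", "temperature sensor warehouse berlin")
cluster.add_page("entity-02", "warehouse logistics company berlin")
print(cluster.search("warehouse berlin"))
```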
Citations
Journal ArticleDOI
TL;DR: A priority assigner and scheduler method for organizing Uniform Resource Locators (URLs) is proposed that helps the crawler track users' interests and prioritize downloading documents that are relevant to the user's choices as well as to current trends.
Abstract: An efficient search engine needs to be designed in such a way that it is able to provide relevant and accurate information in accordance with user needs and interests. The quality of downloaded records can be guaranteed only when web pages of high pertinence are downloaded by the crawlers in accordance with the current topics or user trends. Earlier focused crawlers were used to download topic-specific pages, but these crawlers were not able to adapt to the changing interests of users. Therefore, there is a need to design crawlers that can naturally track current trends and download site pages that meet the user's present needs. In this paper, a priority assigner and scheduler method for organizing Uniform Resource Locators (URLs) is proposed that helps the crawler track users' interests and prioritize downloading documents that are relevant to the user's choices as well as to current trends. The experimental results confirm that the proposed priority-assigner and URL-scheduler-based crawling outperforms conventional crawling strategies based on change-history or site-map methods, both in the quality of downloaded web pages and in reducing network traffic over the Internet.
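To make the priority-assigner idea concrete, here is a minimal sketch of a priority-queue-based URL scheduler (the weighting of relevance versus trend score and the function names are illustrative assumptions, not the paper's actual formulas):

```python
# Sketch of a priority assigner + URL scheduler for a focused crawler: each URL
# gets a score combining topical relevance and a current-trend signal, and the
# scheduler always downloads the highest-priority URL next. Weights are assumed.
import heapq

def assign_priority(url, relevance, trend_score, w_rel=0.7, w_trend=0.3):
    # Higher combined score = more urgent; heapq is a min-heap, so negate it.
    return -(w_rel * relevance + w_trend * trend_score)

class URLScheduler:
    def __init__(self):
        self._frontier = []   # heap of (priority, url)
        self._seen = set()

    def push(self, url, relevance, trend_score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._frontier,
                           (assign_priority(url, relevance, trend_score), url))

    def next_url(self):
        return heapq.heappop(self._frontier)[1] if self._frontier else None

sched = URLScheduler()
sched.push("http://example.com/old-topic", relevance=0.9, trend_score=0.1)
sched.push("http://example.com/trending", relevance=0.6, trend_score=0.9)
print(sched.next_url())   # the trending page wins under these example weights
```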
References
Proceedings ArticleDOI
08 May 2007
TL;DR: A new framework is proposed whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning.
Abstract: In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
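The adaptive-focus idea can be illustrated with a small online-learning sketch (illustrative only; the per-token weighting below is our own assumption, not the authors' actual learning framework), in which anchor-text features of links that led to hidden-Web entry points are rewarded:

```python
# Sketch of "learn patterns of promising links": the crawler keeps per-feature
# weights (here, anchor-text tokens) and updates them online whenever a followed
# link does or does not lead to a hidden-Web entry point (a searchable form).
from collections import defaultdict

class LinkScorer:
    def __init__(self, lr=0.1):
        self.weights = defaultdict(float)
        self.lr = lr

    def score(self, anchor_text):
        return sum(self.weights[t] for t in anchor_text.lower().split())

    def update(self, anchor_text, found_form):
        # Reward tokens of links that led to searchable forms, penalize others.
        target = 1.0 if found_form else -1.0
        for t in anchor_text.lower().split():
            self.weights[t] += self.lr * target

scorer = LinkScorer()
scorer.update("advanced search", found_form=True)
scorer.update("privacy policy", found_form=False)
print(scorer.score("search our catalog") > scorer.score("privacy policy"))  # True
```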

190 citations


"Analyzing the impact of deep web on..." refers background in this paper

  • ...The crawler mentioned in [7] traverses through all the web pages that are precise to a specific area only....


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A general method for analyzing nondeterministic programs that use reducers is provided, and it is shown that for a graph G=(V,E) with diameter D and bounded out-degree, the data-race-free version of the PBFS algorithm attains near-perfect linear speedup if P << (V+E)/(D lg^3(V/D)).
Abstract: We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices -- a condition met by many real-world graphs -- PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time "reducer" -- a "hyperobject" feature of Cilk++ -- the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS is also nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G=(V,E) with diameter D and bounded out-degree, this data-race-free version of the PBFS algorithm runs in time O((V+E)/P + D lg^3(V/D)) on P processors, which means that it attains near-perfect linear speedup if P << (V+E)/(D lg^3(V/D)).
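A level-synchronous BFS that replaces the FIFO queue with an unordered "bag" per level can be sketched as follows (a serial Python illustration of the structure only; the real PBFS relies on Cilk++ reducer hyperobjects and parallel bag splitting):

```python
# Level-synchronous BFS in the spirit of PBFS: instead of a FIFO queue, each
# level's frontier is an unordered "bag", and the next frontier is built by
# (conceptually) processing the current bag in parallel.
def pbfs_levels(graph, source):
    """graph: dict mapping vertex -> iterable of neighbours."""
    dist = {source: 0}
    frontier = {source}              # the "bag" for the current level
    level = 0
    while frontier:
        next_bag = set()
        for u in frontier:           # in PBFS this loop is a parallel reduction
            for v in graph.get(u, ()):
                if v not in dist:
                    dist[v] = level + 1
                    next_bag.add(v)
        frontier = next_bag
        level += 1
    return dist

g = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(pbfs_levels(g, 0))   # {0: 0, 1: 1, 2: 1, 3: 2}
```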

174 citations


"Analyzing the impact of deep web on..." refers methods in this paper

  • ...Old-fashioned web crawler traverses through all the web pages using BFS technique [6] which is not efficient....


Journal ArticleDOI
TL;DR: This work proposes five heuristics that are based on local network properties and are expected to have an impact on the overall network structure, and discovers that minimizing local clustering in the network achieves the best results in terms of average path length.
Abstract: The Internet of Things (IoT) is expected to be overpopulated by a very large number of objects, with intensive interactions, heterogeneous communications, and millions of services. Consequently, scalability issues will arise from the search for the right object that can provide the desired service. A new paradigm known as the Social Internet of Things (SIoT) has been introduced and proposes the integration of social networking concepts into the Internet of Things. The underlying idea is that every object can look for the desired service using its friendships, in a distributed manner, with only local information. In the SIoT it is very important to set appropriate rules in the objects to select the right friends, as these impact the performance of services developed on top of this social network. In this work, we address this issue by analyzing possible strategies for the benefit of overall network navigability. We first propose five heuristics, which are based on local network properties and are expected to have an impact on the overall network structure. We then perform extensive experiments, which are intended to analyze the performance in terms of the giant component, average degree of connections, local clustering, and average path length. Unexpectedly, we discovered that minimizing the local clustering in the network achieved the best results in terms of average path length. We have conducted further analysis to understand the potential causes, which have been found to be linked to the number of hubs in the network.
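The clustering-minimization heuristic highlighted above can be illustrated with a small sketch (simplified and illustrative; the function names and the symmetric-link handling are our own assumptions, not the paper's exact rules) that picks, among candidate friends, the one whose addition keeps the object's local clustering coefficient lowest:

```python
# Illustrative friend-selection heuristic: tentatively add each candidate,
# compute the object's local clustering coefficient, keep the candidate that
# minimizes it (i.e., the one least connected to the existing friends).
from itertools import combinations

def local_clustering(node, adj):
    friends = adj.get(node, set())
    k = len(friends)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(friends, 2)
                if b in adj.get(a, set()) or a in adj.get(b, set()))
    return 2.0 * links / (k * (k - 1))

def pick_friend(node, candidates, adj):
    best, best_cc = None, float("inf")
    for c in candidates:
        adj.setdefault(node, set()).add(c)     # tentative friendship
        adj.setdefault(c, set()).add(node)
        cc = local_clustering(node, adj)
        adj[node].discard(c)                   # undo the tentative link
        adj[c].discard(node)
        if cc < best_cc:
            best, best_cc = c, cc
    return best

adj = {
    "objA": {"f1", "f2"},
    "f1": {"objA", "f2"},
    "f2": {"objA", "f1"},
    "c1": {"f1"},          # candidate already tied to an existing friend
    "c2": set(),           # candidate with no ties to objA's friends
}
print(pick_friend("objA", ["c1", "c2"], adj))   # "c2" keeps clustering lowest
```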

166 citations


"Analyzing the impact of deep web on..." refers background in this paper

  • ...In [15], friendship selection process in social Internet of Things is described....


Journal ArticleDOI
TL;DR: Snoogle is presented, a search engine for a wireless network of objects that uses information retrieval techniques to index information and process user queries, and Bloom filters to reduce communication overhead.
Abstract: Embedding small devices into everyday objects like toasters and coffee mugs creates a wireless network of objects. These embedded devices can contain a description of the underlying objects, or other user defined information. In this paper, we present Snoogle, a search engine for such a network. A user can query Snoogle to find a particular mobile object, or a list of objects that fit the description. Snoogle uses information retrieval techniques to index information and process user queries, and Bloom filters to reduce communication overhead. Security and privacy protections are also engineered into Snoogle to protect sensitive information. We have implemented a prototype of Snoogle using off-the-shelf sensor motes, and conducted extensive experiments to evaluate the system performance.
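A Bloom filter of the kind Snoogle uses to reduce communication overhead can be sketched as follows (the bit-array size, hash count, and hashing scheme are illustrative choices, not Snoogle's actual parameters):

```python
# Minimal Bloom filter sketch: a node can send this compact summary of its
# object's keywords instead of the full keyword list, at the cost of a small
# false-positive probability.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=256, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0                      # bit array stored as a Python int

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= (1 << pos)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for keyword in ("coffee", "mug", "kitchen"):
    bf.add(keyword)
print(bf.might_contain("mug"), bf.might_contain("toaster"))   # True False (w.h.p.)
```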

127 citations


"Analyzing the impact of deep web on..." refers background in this paper

  • ...The idea behind Snoogle [4] is that every sensor node is having description of the object....


Proceedings ArticleDOI
23 Jul 2007
TL;DR: It is found that HITS outperforms PageRank, but is about as effective as web-page in-degree, and that link-based features perform better for general queries, whereas BM25F performs better for specific queries.
Abstract: This paper describes a large-scale evaluation of the effectiveness of HITS in comparison with other link-based ranking algorithms, when used in combination with a state-of-the-art text retrieval algorithm exploiting anchor text. We quantified their effectiveness using three common performance measures: the mean reciprocal rank, the mean average precision, and the normalized discounted cumulative gain measurements. The evaluation is based on two large data sets: a breadth-first search crawl of 463 million web pages containing 17.6 billion hyperlinks and referencing 2.9 billion distinct URLs; and a set of 28,043 queries sampled from a query log, each query having on average 2,383 results, about 17 of which were labeled by judges. We found that HITS outperforms PageRank, but is about as effective as web-page in-degree. The same holds true when any of the link-based features are combined with the text retrieval algorithm. Finally, we studied the relationship between query specificity and the effectiveness of selected features, and found that link-based features perform better for general queries, whereas BM25F performs better for specific queries.
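For reference, the HITS hub/authority computation evaluated in this study can be sketched as a simple power iteration (the normalization and fixed iteration count below are illustrative choices):

```python
# HITS sketch: pages linked to by good hubs get high authority scores, and
# pages linking to good authorities get high hub scores, iterated to a fixpoint.
import math

def hits(graph, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of pages linking to the node.
        auth = {n: sum(hub[u] for u in graph if n in graph[u]) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub: sum of authority scores of pages the node links to.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
hub, auth = hits(g)
print(max(auth, key=auth.get))   # "c" has the most in-links and tops authority
```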

89 citations


"Analyzing the impact of deep web on..." refers background in this paper

  • ...The crawler traverses through many domains in chorus for different ontologies [11]....
