Proceedings ArticleDOI

Analyzing the impact of deep web on real-time business search

TL;DR: This work proposes a novel approach to distributing a web search engine: the engine itself, together with its index, is spread over multiple nodes so that each node handles a subset of the entity and sensor pages.
Abstract: Real-time business search largely involves the Internet of Things (IoT), which has grown into a huge set of objects with a large number of intercommunication links and services. This scalability issue can be addressed by the Social Internet of Things (SIoT), in which an object looks for social partners that share a similar set of rules, thereby positively influencing the performance of the service. In this work, we analyze the impact of the deep web on real-time search. We also propose a novel approach based on an analysis of the shortcomings of existing techniques. The core idea is to distribute the web search engine itself, along with its index, over multiple nodes so that each node handles a subset of the entity and sensor pages. Deep websites need to be included to obtain accurate results. This work offers better accuracy and a significant speed-up for multi-query execution in a distributed environment.
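As an illustration of the distribution idea described above, the following sketch (with hypothetical names such as IndexNode and SearchCluster; a simplification under our own assumptions, not the authors' implementation) partitions pages across nodes by hash and fans a query out to every shard:

```python
# Minimal sketch of distributing a search index over multiple nodes: each node
# indexes a shard of the entity and sensor pages, and a query is executed on
# every shard and merged. Names and the hash-based routing are illustrative.
from collections import defaultdict

class IndexNode:
    """One node holding an inverted index over its shard of pages."""
    def __init__(self):
        self.index = defaultdict(set)   # term -> set of page ids

    def add_page(self, page_id, text):
        for term in text.lower().split():
            self.index[term].add(page_id)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return set()
        result = self.index[terms[0]].copy()
        for term in terms[1:]:
            result &= self.index[term]   # conjunctive (AND) query
        return result

class SearchCluster:
    """Routes pages to shards by hash and fans a query out to all shards."""
    def __init__(self, num_nodes):
        self.nodes = [IndexNode() for _ in range(num_nodes)]

    def add_page(self, page_id, text):
        self.nodes[hash(page_id) % len(self.nodes)].add_page(page_id, text)

    def search(self, query):
        hits = set()
        for node in self.nodes:          # in practice these calls run in parallel
            hits |= node.search(query)
        return hits

cluster = SearchCluster(num_nodes=4)
cluster.add_page("sensor-17", "temperature sensor warehouse berlin")
cluster.add_page("entity-02", "warehouse logistics company berlin")
print(cluster.search("warehouse berlin"))
```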
Citations
Journal ArticleDOI
TL;DR: A priority assigner and scheduler method for organizing Uniform Resource Locators (URLs) is proposed that helps the crawler track users' interests and prioritize downloading documents that are relevant to the user's choices as well as to current trends.
Abstract: An efficient search engine needs to be designed in such a way that it is able to provide relevant and accurate information in accordance with user needs and interests. The quality of downloaded records can be guaranteed only when web pages of high pertinence are downloaded by the crawlers in accordance with the current topics or user trends. Earlier focused crawlers were used to download topic-specific pages, but these crawlers were not able to adapt to the changing interests of users. Therefore, there is a need to design crawlers that can naturally track current trends and download site pages that meet the user's present needs. In this paper, a priority assigner and scheduler method for organizing Uniform Resource Locators (URLs) is proposed that helps the crawler track users' interests and prioritize downloading documents that are relevant to the user's choices as well as to current trends. The experimental results confirm that the proposed priority-assigner and URL-scheduler-based crawling outperforms conventional crawling strategies based on change-history or site-map methods, both in the quality of downloaded web pages and in reducing network traffic over the Internet.
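To make the priority-assigner idea concrete, here is a minimal sketch of a priority-queue-based URL scheduler (the weighting of relevance versus trend score and the function names are illustrative assumptions, not the paper's actual formulas):

```python
# Sketch of a priority assigner + URL scheduler for a focused crawler: each URL
# gets a score combining topical relevance and a current-trend signal, and the
# scheduler always downloads the highest-priority URL next. Weights are assumed.
import heapq

def assign_priority(url, relevance, trend_score, w_rel=0.7, w_trend=0.3):
    # Higher combined score = more urgent; heapq is a min-heap, so negate it.
    return -(w_rel * relevance + w_trend * trend_score)

class URLScheduler:
    def __init__(self):
        self._frontier = []   # heap of (priority, url)
        self._seen = set()

    def push(self, url, relevance, trend_score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._frontier,
                           (assign_priority(url, relevance, trend_score), url))

    def next_url(self):
        return heapq.heappop(self._frontier)[1] if self._frontier else None

sched = URLScheduler()
sched.push("http://example.com/old-topic", relevance=0.9, trend_score=0.1)
sched.push("http://example.com/trending", relevance=0.6, trend_score=0.9)
print(sched.next_url())   # the trending page wins under these example weights
```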
References
Proceedings ArticleDOI
08 May 2007
TL;DR: A new framework is proposed whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning.
Abstract: In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
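The adaptive-focus idea can be illustrated with a small online-learning sketch (illustrative only; the per-token weighting below is our own assumption, not the authors' actual learning framework), in which anchor-text features of links that led to hidden-Web entry points are rewarded:

```python
# Sketch of "learn patterns of promising links": the crawler keeps per-feature
# weights (here, anchor-text tokens) and updates them online whenever a followed
# link does or does not lead to a hidden-Web entry point (a searchable form).
from collections import defaultdict

class LinkScorer:
    def __init__(self, lr=0.1):
        self.weights = defaultdict(float)
        self.lr = lr

    def score(self, anchor_text):
        return sum(self.weights[t] for t in anchor_text.lower().split())

    def update(self, anchor_text, found_form):
        # Reward tokens of links that led to searchable forms, penalize others.
        target = 1.0 if found_form else -1.0
        for t in anchor_text.lower().split():
            self.weights[t] += self.lr * target

scorer = LinkScorer()
scorer.update("advanced search", found_form=True)
scorer.update("privacy policy", found_form=False)
print(scorer.score("search our catalog") > scorer.score("privacy policy"))  # True
```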

190 citations


"Analyzing the impact of deep web on..." refers background in this paper

  • ...The crawler mentioned in [7] traverses through all the web pages that are precise to a specific area only....


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A general method for analyzing nondeterministic programs that use reducers is provided, and it is shown that for a graph G=(V,E) with diameter D and bounded out-degree, the data-race-free version of the PBFS algorithm attains near-perfect linear speedup if P << (V+E)/(D lg^3(V/D)).
Abstract: We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices -- a condition met by many real-world graphs -- PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time "reducer" -- a "hyperobject" feature of Cilk++ -- the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS is also nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G=(V,E) with diameter D and bounded out-degree, this data-race-free version of the PBFS algorithm runs in time O((V+E)/P + D lg^3(V/D)) on P processors, which means that it attains near-perfect linear speedup if P << (V+E)/(D lg^3(V/D)).
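A level-synchronous BFS that replaces the FIFO queue with an unordered "bag" per level can be sketched as follows (a serial Python illustration of the structure only; the real PBFS relies on Cilk++ reducer hyperobjects and parallel bag splitting):

```python
# Level-synchronous BFS in the spirit of PBFS: instead of a FIFO queue, each
# level's frontier is an unordered "bag", and the next frontier is built by
# (conceptually) processing the current bag in parallel.
def pbfs_levels(graph, source):
    """graph: dict mapping vertex -> iterable of neighbours."""
    dist = {source: 0}
    frontier = {source}              # the "bag" for the current level
    level = 0
    while frontier:
        next_bag = set()
        for u in frontier:           # in PBFS this loop is a parallel reduction
            for v in graph.get(u, ()):
                if v not in dist:
                    dist[v] = level + 1
                    next_bag.add(v)
        frontier = next_bag
        level += 1
    return dist

g = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(pbfs_levels(g, 0))   # {0: 0, 1: 1, 2: 1, 3: 2}
```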

174 citations


"Analyzing the impact of deep web on..." refers methods in this paper

  • ...Old-fashioned web crawler traverses through all the web pages using BFS technique [6] which is not efficient....


Journal ArticleDOI
TL;DR: This work proposes five heuristics that are based on local network properties and are expected to have an impact on the overall network structure, and discovers that minimizing local clustering in the network achieves the best results in terms of average path length.
Abstract: The Internet of Things (IoT) is expected to be overpopulated by a very large number of objects, with intensive interactions, heterogeneous communications, and millions of services. Consequently, scalability issues will arise from the search for the right object that can provide the desired service. A new paradigm known as the Social Internet of Things (SIoT) has been introduced and proposes the integration of social networking concepts into the Internet of Things. The underlying idea is that every object can look for the desired service using its friendships, in a distributed manner, with only local information. In the SIoT it is very important to set appropriate rules in the objects to select the right friends, as these impact the performance of services developed on top of this social network. In this work, we address this issue by analyzing possible strategies for the benefit of overall network navigability. We first propose five heuristics, which are based on local network properties and are expected to have an impact on the overall network structure. We then perform extensive experiments, which are intended to analyze the performance in terms of the giant component, average degree of connections, local clustering, and average path length. Unexpectedly, we discovered that minimizing the local clustering in the network achieved the best results in terms of average path length. We have conducted further analysis to understand the potential causes, which have been found to be linked to the number of hubs in the network.
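The clustering-minimization heuristic highlighted above can be illustrated with a small sketch (simplified and illustrative; the function names and the symmetric-link handling are our own assumptions, not the paper's exact rules) that picks, among candidate friends, the one whose addition keeps the object's local clustering coefficient lowest:

```python
# Illustrative friend-selection heuristic: tentatively add each candidate,
# compute the object's local clustering coefficient, keep the candidate that
# minimizes it (i.e., the one least connected to the existing friends).
from itertools import combinations

def local_clustering(node, adj):
    friends = adj.get(node, set())
    k = len(friends)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(friends, 2)
                if b in adj.get(a, set()) or a in adj.get(b, set()))
    return 2.0 * links / (k * (k - 1))

def pick_friend(node, candidates, adj):
    best, best_cc = None, float("inf")
    for c in candidates:
        adj.setdefault(node, set()).add(c)     # tentative friendship
        adj.setdefault(c, set()).add(node)
        cc = local_clustering(node, adj)
        adj[node].discard(c)                   # undo the tentative link
        adj[c].discard(node)
        if cc < best_cc:
            best, best_cc = c, cc
    return best

adj = {
    "objA": {"f1", "f2"},
    "f1": {"objA", "f2"},
    "f2": {"objA", "f1"},
    "c1": {"f1"},          # candidate already tied to an existing friend
    "c2": set(),           # candidate with no ties to objA's friends
}
print(pick_friend("objA", ["c1", "c2"], adj))   # "c2" keeps clustering lowest
```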

166 citations


"Analyzing the impact of deep web on..." refers background in this paper

  • ...In [15], friendship selection process in social Internet of Things is described....


Journal ArticleDOI
TL;DR: Snoogle is presented, a search engine for a wireless network of objects that uses information retrieval techniques to index information and process user queries, and Bloom filters to reduce communication overhead.
Abstract: Embedding small devices into everyday objects like toasters and coffee mugs creates a wireless network of objects. These embedded devices can contain a description of the underlying objects, or other user defined information. In this paper, we present Snoogle, a search engine for such a network. A user can query Snoogle to find a particular mobile object, or a list of objects that fit the description. Snoogle uses information retrieval techniques to index information and process user queries, and Bloom filters to reduce communication overhead. Security and privacy protections are also engineered into Snoogle to protect sensitive information. We have implemented a prototype of Snoogle using off-the-shelf sensor motes, and conducted extensive experiments to evaluate the system performance.
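A Bloom filter of the kind Snoogle uses to reduce communication overhead can be sketched as follows (the bit-array size, hash count, and hashing scheme are illustrative choices, not Snoogle's actual parameters):

```python
# Minimal Bloom filter sketch: a node can send this compact summary of its
# object's keywords instead of the full keyword list, at the cost of a small
# false-positive probability.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=256, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0                      # bit array stored as a Python int

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= (1 << pos)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for keyword in ("coffee", "mug", "kitchen"):
    bf.add(keyword)
print(bf.might_contain("mug"), bf.might_contain("toaster"))   # True False (w.h.p.)
```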

127 citations


"Analyzing the impact of deep web on..." refers background in this paper

  • ...The idea behind Snoogle [4] is that every sensor node is having description of the object....


Proceedings ArticleDOI
23 Jul 2007
TL;DR: It is found that HITS outperforms PageRank, but is about as effective as web-page in-degree, and that link-based features perform better for general queries, whereas BM25F performs better for specific queries.
Abstract: This paper describes a large-scale evaluation of the effectiveness of HITS in comparison with other link-based ranking algorithms, when used in combination with a state-of-the-art text retrieval algorithm exploiting anchor text. We quantified their effectiveness using three common performance measures: the mean reciprocal rank, the mean average precision, and the normalized discounted cumulative gain measurements. The evaluation is based on two large data sets: a breadth-first search crawl of 463 million web pages containing 17.6 billion hyperlinks and referencing 2.9 billion distinct URLs; and a set of 28,043 queries sampled from a query log, each query having on average 2,383 results, about 17 of which were labeled by judges. We found that HITS outperforms PageRank, but is about as effective as web-page in-degree. The same holds true when any of the link-based features are combined with the text retrieval algorithm. Finally, we studied the relationship between query specificity and the effectiveness of selected features, and found that link-based features perform better for general queries, whereas BM25F performs better for specific queries.
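For reference, the HITS hub/authority computation evaluated in this study can be sketched as a simple power iteration (the normalization and fixed iteration count below are illustrative choices):

```python
# HITS sketch: pages linked to by good hubs get high authority scores, and
# pages linking to good authorities get high hub scores, iterated to a fixpoint.
import math

def hits(graph, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of pages linking to the node.
        auth = {n: sum(hub[u] for u in graph if n in graph[u]) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub: sum of authority scores of pages the node links to.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
hub, auth = hits(g)
print(max(auth, key=auth.get))   # "c" has the most in-links and tops authority
```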

89 citations


"Analyzing the impact of deep web on..." refers background in this paper

  • ...The crawler traverses through many domains in chorus for different ontologies [11]....
