Author

Alexandros Batzios

Bio: Alexandros Batzios is an academic researcher from Aristotle University of Thessaloniki. The author has contributed to research topics including Semantic Web and Data Web, has an h-index of 3, and has co-authored 6 publications receiving 60 citations.

Papers
Journal ArticleDOI
TL;DR: BioCrawler applies the principles of BioTope's intelligent agents to the semantic web: it learns which sites are rich in semantic content and which sites link to them, and adjusts its crawling habits accordingly, ultimately behaving much like state-of-the-art search engine crawlers.
Abstract: Web crawling has become an important aspect of web search, as the WWW keeps getting bigger and search engines strive to index the most important and up-to-date content. Many experimental approaches exist, but few actually try to model the current behaviour of search engines, which is to crawl and refresh the sites they deem important much more frequently than others. BioCrawler mirrors this behaviour on the semantic web by applying the learning strategies adopted in previous work on ecosystem simulation, called BioTope. BioCrawler employs the principles of BioTope's intelligent agents on the semantic web: it learns which sites are rich in semantic content and which sites link to them, and adjusts its crawling habits accordingly. In the end, it learns to behave much like state-of-the-art search engine crawlers do. However, BioCrawler reaches that behaviour solely by exploiting on-page factors, rather than off-page factors such as the currently dominant link popularity.
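The abstract does not spell out BioCrawler's scheduling mechanics, but the core idea of revisiting semantically rich sites more often can be sketched with a revisit-priority queue. A minimal Python sketch follows, assuming an illustrative moving-average richness score and revisit interval; none of these names or parameters come from the paper:

```python
import heapq

class AdaptiveCrawler:
    """Toy frequency-adaptive crawler in the spirit of BioCrawler:
    sites observed to be rich in semantic content (an on-page factor)
    are revisited more often. All details here are illustrative."""

    def __init__(self, seeds, base_interval=3600.0):
        self.base_interval = base_interval
        self.richness = {url: 0.5 for url in seeds}   # learned semantic-content score in [0, 1]
        self.queue = [(0.0, url) for url in seeds]    # (next_visit_time, url)
        heapq.heapify(self.queue)

    def crawl_step(self, now, fetch, score_semantics):
        due, url = heapq.heappop(self.queue)
        if due > now:                                  # nothing due yet
            heapq.heappush(self.queue, (due, url))
            return
        page = fetch(url)                              # caller-supplied fetcher
        score = score_semantics(page)                  # e.g. fraction of RDF/OWL markup
        # exponential moving average of observed semantic richness
        self.richness[url] = 0.8 * self.richness[url] + 0.2 * score
        # richer sites get shorter revisit intervals
        interval = self.base_interval * (1.0 - 0.9 * self.richness[url])
        heapq.heappush(self.queue, (now + interval, url))
```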

30 citations

Journal ArticleDOI
TL;DR: This paper presents WebOWL, an experiment in using the latest technologies to develop a Semantic Web search engine. Implemented with Jade, Jena and the db4o object database engine, it has successfully stored over one million OWL classes, individuals and properties.
Abstract: This paper presents WebOWL, an experiment in using the latest technologies to develop a Semantic Web search engine. WebOWL consists of a community of intelligent agents, acting as crawlers, that are able to discover and learn the locations of Semantic Web neighborhoods on the Web, a semantic database to store data from different ontologies, a query mechanism that supports semantic queries in OWL, and a ranking algorithm that determines the order of the returned results based on the semantic relationships of classes and individuals. The system has been implemented using Jade, Jena and the db4o object database engine and has successfully stored over one million OWL classes, individuals and properties.
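The system itself is Java-based (Jade agents, Jena for ontology handling, db4o for storage). As a rough Python analogue of the store-and-query step, the sketch below uses rdflib; the ontology URL and the SPARQL query are placeholders, and SPARQL stands in for whatever query mechanism WebOWL exposes:

```python
from rdflib import Graph
from rdflib.namespace import OWL, RDF

g = Graph()
g.parse("http://example.org/ontology.owl")  # hypothetical OWL document

# Count the OWL classes discovered, analogous to indexing them.
classes = set(g.subjects(RDF.type, OWL.Class))
print(f"indexed {len(classes)} OWL classes")

# A semantic query over the stored triples: individuals and their classes.
results = g.query(
    """
    SELECT ?individual ?cls WHERE {
        ?individual a ?cls .
        ?cls a owl:Class .
    }
    """,
    initNs={"owl": OWL},
)
for individual, cls in results:
    print(individual, "is a", cls)
```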

20 citations

Proceedings ArticleDOI
18 Dec 2006
TL;DR: This paper introduces BioSpider, an agent-based simulation framework for developing and testing autonomous, intelligent, semantically-focused Web spiders; it assumes a direct analogy between the problem at hand and a multi-variate ecosystem in which each member is self-maintaining.
Abstract: Although search engines traditionally use spiders for traversing and indexing the web, there has not yet been any methodological attempt to model, deploy and test learning spiders. The flourishing of the Semantic Web provides understandable information that may improve the accuracy of search engines. In this paper, we introduce BioSpider, an agent-based simulation framework for developing and testing autonomous, intelligent, semantically-focused web spiders. BioSpider assumes a direct analogy of the problem at hand with a multi-variate ecosystem, where each member is self-maintaining. The population of the ecosystem comprises cooperative spiders incorporating communication, mobility and learning skills, striving to improve efficiency. Genetic algorithms and classifier rules have been employed for spider adaptation and learning. A set of experiments has been performed in order to qualitatively test the efficacy and applicability of the proposed approach.
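The abstract names genetic algorithms and classifier rules as the learning machinery without giving the encoding. The sketch below is a generic GA under assumed choices (a three-weight behaviour genome, truncation selection, one-point crossover); the fitness function is a stand-in for a spider's measured crawl efficiency:

```python
import random

def make_genome():
    # hypothetical behaviour weights, e.g. (politeness, semantic focus, link-follow bias)
    return [random.random() for _ in range(3)]

def fitness(genome):
    # stand-in for the measured efficiency of a spider using these weights
    target = (0.2, 0.7, 0.1)
    return -sum((w - t) ** 2 for w, t in zip(genome, target))

def evolve(pop_size=20, generations=50, mutation_rate=0.1):
    population = [make_genome() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]        # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(a))          # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation_rate:        # point mutation
                child[random.randrange(len(child))] = random.random()
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
```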

7 citations


Cited by
Journal ArticleDOI
TL;DR: This article proposes an automated word embedding with parameter-tuned deep learning (AWE-PTDL) model for focused web crawling, involving pre-processing, an incremental skip-gram model with negative sampling (ISGNS), bidirectional long short-term memory-based classification, and bird swarm optimization-based hyperparameter tuning.
Abstract: In recent years, web crawling has gained significant attention due to the drastic advancements in the World Wide Web. Web search engines face the issue of retrieving massive quantities of web documents. One such crawler is the focused crawler, which selectively gathers web pages from the Internet; its efficiency, however, can easily be affected by the environment of the web pages. In this view, this paper presents an Automated Word Embedding with Parameter Tuned Deep Learning (AWE-PTDL) model for focused web crawling. The proposed model involves different processes, namely pre-processing, Incremental Skip-gram Model with Negative Sampling (ISGNS) based word embedding, bidirectional long short-term memory-based classification, and bird swarm optimization-based hyperparameter tuning. SGNS training needs to go over the complete training data to pre-compute the noise distribution before performing Stochastic Gradient Descent (SGD), and the ISGNS technique is derived for the word embedding process. Besides, the cosine similarity is computed from the word embedding matrix to generate a feature vector which is fed as input into the Bidirectional Long Short-Term Memory (BiLSTM) for the prediction of website relevance. Finally, the Bird Swarm Optimization-Bidirectional Long Short-Term Memory (BSO-BiLSTM) based classification model is used to classify the webpages, and the BSO algorithm is employed to determine the hyperparameters of the BiLSTM model so that the overall crawling performance can be considerably enhanced. For validating the enhanced outcome of the presented model, a comprehensive set of simulations is carried out and the results are examined in terms of different measures. The AWE-PTDL technique attained a higher harvest rate of 85% when compared with the other techniques, and the experimental results highlight the enhanced web crawling performance of the proposed model over recent state-of-the-art web crawlers.
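The pipeline combines SGNS word embeddings with a BiLSTM relevance classifier. The sketch below approximates it with gensim's standard skip-gram with negative sampling (in place of the paper's incremental ISGNS variant) and a plain Keras BiLSTM; BSO hyperparameter tuning is omitted, and the corpus, labels and hyperparameters are illustrative:

```python
import numpy as np
from gensim.models import Word2Vec
import tensorflow as tf

pages = [["semantic", "web", "crawler"], ["cooking", "recipes", "pasta"]]
labels = np.array([1, 0])  # 1 = on-topic for the focused crawl

# Skip-gram with negative sampling (sg=1, negative=5).
w2v = Word2Vec(pages, vector_size=32, sg=1, negative=5, min_count=1)

# Turn each page into a fixed-length sequence of word vectors.
max_len = 8
def embed(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv][:max_len]
    vecs += [np.zeros(32)] * (max_len - len(vecs))  # zero-pad short pages
    return np.stack(vecs)

X = np.stack([embed(p) for p in pages])

# BiLSTM relevance classifier (hyperparameters would be BSO-tuned in the paper).
model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16), input_shape=(max_len, 32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, labels, epochs=5, verbose=0)
```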

63 citations

Journal ArticleDOI
TL;DR: This study follows the guidelines of systematic literature review and applies them to the field of Web crawling, calling for increased awareness of the Web crawler across various fields and identifying how techniques from other domains can be used for crawling the Web.
Abstract: The performance of any search engine relies heavily on its Web crawler. Web crawlers are the programs that get webpages from the Web by following hyperlinks. These webpages are indexed by a search engine and can be retrieved by a user query. In the area of Web crawling, we still lack an exhaustive study that covers all crawling techniques. This study follows the guidelines of systematic literature review and applies them to the field of Web crawling. We used the standard procedure of carrying out a systematic literature review on 248 studies from a total of 1488 articles published in 12 leading journals and other premier conferences and workshops. Existing literature about the Web crawler is classified into different key subareas, and each subarea is further divided according to the techniques being used. We analyzed the distribution of the articles using multiple criteria and drew conclusions, also reporting the studies that use open-source Web crawlers. We highlight future areas of research, call for increased awareness of the Web crawler in various fields, and identify how techniques from other domains can be used for crawling the Web. Limitations and recommendations for future work are also discussed. WIREs Data Mining Knowl Discov 2017, 7:e1218. doi: 10.1002/widm.1218

50 citations

Journal ArticleDOI
TL;DR: The presented adaptive search engine allows efficient community creation and updating of social media indexes, instilling and propagating deep knowledge into social media concerning the advanced search and usage of media resources.
Abstract: Effective sharing of diverse social media is often inhibited by limitations in their search and discovery mechanisms, which are particularly restrictive for media that do not lend themselves to automatic processing or indexing. Here, we present the structure and mechanism of an adaptive search engine which is designed to overcome such limitations. The basic framework of the adaptive search engine is to capture human judgment in the course of normal usage from user queries in order to develop semantic indexes which link search terms to media object semantics. This approach is particularly effective for the retrieval of multimedia objects, such as images, sounds, and videos, where a direct analysis of the object features does not allow them to be linked to search terms, for example, nontextual/icon-based search, deep semantic search, or when search terms are unknown at the time the media repository is built. An adaptive search architecture is presented to enable the index to evolve with respect to user feedback, while a randomized query-processing technique guarantees avoiding local minima and allows the meaningful indexing of new media objects and new terms. The presented adaptive search engine thus allows the community to efficiently create and update social media indexes, instilling and propagating deep knowledge into social media concerning the advanced search and usage of media resources. Experiments with various relevance distribution settings have shown efficient convergence of such indexes, which enable intelligent search and sharing of social media resources that are otherwise hard to discover.
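The core loop the abstract describes (term-to-object weights adapted from user feedback, with randomized result selection to escape local minima) can be sketched as follows; the class and parameters are illustrative, not the paper's design:

```python
import random
from collections import defaultdict

class AdaptiveIndex:
    """Toy feedback-driven semantic index: term-object weights evolve
    with user judgments; occasional random shuffling keeps new objects
    discoverable. All details are illustrative."""

    def __init__(self, epsilon=0.1, lr=0.2):
        self.weights = defaultdict(lambda: defaultdict(float))  # term -> {object: weight}
        self.epsilon = epsilon  # exploration rate (randomized query processing)
        self.lr = lr            # feedback learning rate

    def search(self, term, catalogue):
        ranked = sorted(catalogue, key=lambda o: self.weights[term][o], reverse=True)
        if random.random() < self.epsilon:  # occasionally explore
            random.shuffle(ranked)
        return ranked

    def feedback(self, term, obj, relevant):
        # move the term-object weight toward the user's judgment
        target = 1.0 if relevant else 0.0
        w = self.weights[term][obj]
        self.weights[term][obj] = w + self.lr * (target - w)

index = AdaptiveIndex()
hits = index.search("sunset", ["img1.jpg", "img2.jpg", "clip3.mp4"])
index.feedback("sunset", hits[0], relevant=True)  # user clicked the top result
```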

40 citations

Journal ArticleDOI
TL;DR: This paper designs a decentralized learning automata-based focused Web crawler that can effectively adapt its configuration to Web dynamics; simulation results show its superiority over several existing methods in terms of precision, recall, and running time.
Abstract: Recent years have witnessed the birth and explosive growth of the Web. The exponential growth of the Web has made it into a huge source of information wherein finding a document without an efficient search engine is unimaginable. Web crawling has become an important aspect of Web search, on which the performance of search engines is strongly dependent. Focused Web crawlers try to focus the crawling process on topic-relevant Web documents. Topic-oriented crawlers are widely used in domain-specific Web search portals and personalized search tools. This paper designs a decentralized learning automata-based focused Web crawler. Taking advantage of learning automata, the proposed crawler learns the most relevant URLs and the promising paths leading to the target on-topic documents. It can effectively adapt its configuration to the Web dynamics. This crawler is expected to have a higher precision rate because it constructs a small Web graph of only on-topic documents. Based on the Martingale theorem, the convergence of the proposed algorithm is proved. To show the performance of the proposed crawler, extensive simulation experiments are conducted. The obtained results show the superiority of the proposed crawler over several existing methods in terms of precision, recall, and running time. A t-test is used to verify the statistical significance of the proposed crawler's precision results.
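The abstract does not give the exact reinforcement scheme; a common choice for learning automata is the linear reward-inaction (L_R-I) update, sketched here as one plausible way the crawler could reinforce URLs that led to on-topic documents. The step size and the example are illustrative:

```python
def reward_update(probs, chosen, a=0.1):
    """Linear reward-inaction: reinforce the chosen action on success.
    On failure, L_R-I simply leaves the probabilities unchanged
    (the 'inaction' half is not calling this function at all)."""
    return [
        p + a * (1.0 - p) if i == chosen else p * (1.0 - a)
        for i, p in enumerate(probs)
    ]

# Three candidate URLs; the second one yielded an on-topic document.
probs = [1 / 3, 1 / 3, 1 / 3]
probs = reward_update(probs, chosen=1)
print(probs, sum(probs))  # mass shifts toward URL 1; probabilities still sum to 1
```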

39 citations

Book ChapterDOI
09 Jul 2009
TL;DR: This chapter surveys semantic focused crawlers, summarizing their features and drawing the overall state of the art of the field by means of a multi-dimensional comparison.
Abstract: Nowadays, research on focused crawlers is moving toward the semantic web, along with the appearance of increasing numbers of semantic web documents and the rapid development of ontology mark-up languages. Semantic focused crawlers are a series of focused crawlers enhanced by various semantic web technologies. In this paper, we survey this research field: we identify eleven semantic focused crawlers from the existing literature and classify them into three categories: ontology-based focused crawlers, metadata abstraction focused crawlers, and other semantic focused crawlers. By means of a multi-dimensional comparison, we summarize the features of these crawlers and draw the overall state of the art of this field.

33 citations