Showing papers on "Web search engine published in 2014"

PDF

Open Access

Proceedings Article•DOI•

Impact of response latency on user behavior in web search

[...]

Ioannis Arapakis¹, Xiao Bai¹, B. Barla Cambazoglu¹•Institutions (1)

03 Jul 2014

TL;DR: A controlled user study tries to understand how users perceive the response latency of a search system and how sensitive they are to increasing delays in response and demonstrates that latency has an impact on the click behavior of users to some extent.

...read moreread less

Abstract: Traditionally, the efficiency and effectiveness of search systems have both been of great interest to the information retrieval community. However, an in-depth analysis on the interplay between the response latency of web search systems and users' search experience has been missing so far. In order to fill this gap, we conduct two separate studies aiming to reveal how response latency affects the user behavior in web search. First, we conduct a controlled user study trying to understand how users perceive the response latency of a search system and how sensitive they are to increasing delays in response. This study reveals that, when artificial delays are introduced into the response, the users of a fast search system are more likely to notice these delays than the users of a slow search system. The introduced delays become noticeable by the users once they exceed a certain threshold value. Second, we perform an analysis using a large-scale query log obtained from Yahoo web search to observe the potential impact of increasing response latency on the click behavior of users. This analysis demonstrates that latency has an impact on the click behavior of users to some extent. In particular, given two content-wise identical search result pages, we show that the users are more likely to perform clicks on the result page that is served with lower latency.

...read moreread less

119 citations

Journal Article•DOI•

The PageRank Problem, Multiagent Consensus, and Web Aggregation: A Systems and Control Viewpoint

[...]

Hideaki Ishii¹, Roberto Tempo²•Institutions (2)

Tokyo Institute of Technology¹, Polytechnic University of Turin²

14 May 2014-IEEE Control Systems Magazine

TL;DR: The PageRank algorithm as mentioned in this paper assigns a numerical value to each element of a set of hyperlinked documents within the World Wide Web with the purpose of measuring the relative importance of each page.

...read moreread less

Abstract: PageRank is an algorithm introduced in 1998 and used by the Google Internet search engine. It assigns a numerical value to each element of a set of hyperlinked documents (that is, Web pages) within the World Wide Web with the purpose of measuring the relative importance of each page [1]. The key idea in the algorithm is to give a higher PageRank value to Web pages that are visited often by Web surfers. Google describes PageRank as: "PageRank reflects our view of the importance of Web pages by considering more than 500 million variables and 2 billion terms. Pages that are considered important receive a higher PageRank and are more likely to appear at the top of the search results."

...read moreread less

97 citations

Journal Article•DOI•

Mobile App Classification with Enriched Contextual Information

[...]

Hengshu Zhu¹, Enhong Chen¹, Hui Xiong², Huanhuan Cao³, Jilei Tian³ - Show less +1 more•Institutions (3)

University of Science and Technology of China¹, Rutgers University², Nokia³

01 Jul 2014-IEEE Transactions on Mobile Computing

TL;DR: This paper proposes an approach for first enriching the contextual information of mobile Apps by exploiting the additional Web knowledge from the Web search engine, and combines all the enriched contextual information into the Maximum Entropy model for training a mobile App classifier.

...read moreread less

Abstract: The study of the use of mobile Apps plays an important role in understanding the user preferences, and thus provides the opportunities for intelligent personalized context-based services. A key step for the mobile App usage analysis is to classify Apps into some predefined categories. However, it is a nontrivial task to effectively classify mobile Apps due to the limited contextual information available for the analysis. For instance, there is limited contextual information about mobile Apps in their names. However, this contextual information is usually incomplete and ambiguous. To this end, in this paper, we propose an approach for first enriching the contextual information of mobile Apps by exploiting the additional Web knowledge from the Web search engine. Then, inspired by the observation that different types of mobile Apps may be relevant to different real-world contexts , we also extract some contextual features for mobile Apps from the context-rich device logs of mobile users. Finally, we combine all the enriched contextual information into the Maximum Entropy model for training a mobile App classifier. To validate the proposed method, we conduct extensive experiments on 443 mobile users’ device logs to show both the effectiveness and efficiency of the proposed approach. The experimental results clearly show that our approach outperforms two state-of-the-art benchmark methods with a significant margin.

...read moreread less

81 citations

Journal Article•DOI•

Study of Web Crawler and its Different Types

[...]

Trupti V. Udapure, Ravindra D. Kale, Rajesh C. Dharmik

01 Jan 2014-IOSR Journal of Computer Engineering

TL;DR: This Paper briefly reviews the concepts of web crawler, its architecture and its various types, which are an essential method for collecting data on and keeping in touch with the rapidly increasing Internet.

...read moreread less

Abstract: Due to the current size of the Web and its dynamic nature, building an efficient search mechanism is very important. A vast number of web pages are continually being added every day, and information is constantly changing. Search engines are used to extract valuable Information from the internet. Web crawlers are the principal part of search engine, is a computer program or software that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. It is an essential method for collecting data on, and keeping in touch with the rapidly increasing Internet. This Paper briefly reviews the concepts of web crawler, its architecture and its various types. Keyword: Crawling techniques, Web Crawler, Search engine, WWW

...read moreread less

47 citations

Journal Article•DOI•

Web search volume as a predictor of academic fame: An exploration of Google trends

[...]

Liwen Vaughan¹, Esteban Romero-Frías²•Institutions (2)

University of Western Ontario¹, University of Granada²

01 Apr 2014

TL;DR: A significant correlation is found between the search volume of a university name and the university's academic reputation or fame in Google Trends, and the effect of university sizes on the correlations is examined to gain a deeper understanding of the nature of the relationships.

...read moreread less

Abstract: Searches conducted on web search engines reflect the interests of users and society. Google Trends, which provides information about the queries searched by users of the Google web search engine, is a rich data source from which a wealth of information can be mined. We investigated the possibility of using web search volume data from Google Trends to predict academic fame. As queries are language-dependent, we studied universities from two countries with different languages, the United States and Spain. We found a significant correlation between the search volume of a university name and the university's academic reputation or fame. We also examined the effect of some Google Trends features, namely, limiting the search to a specific country or topic category on the search volume data. Finally, we examined the effect of university sizes on the correlations found to gain a deeper understanding of the nature of the relationships.

...read moreread less

42 citations

Book Chapter•DOI•

Exposing Inconsistent Web Search Results with Bobble

[...]

Xinyu Xing¹, Wei Meng¹, Dan Doozan¹, Nick Feamster¹, Wenke Lee¹, Alex C. Snoeren² - Show less +2 more•Institutions (2)

Georgia Institute of Technology¹, University of California, San Diego²

10 Mar 2014

TL;DR: Bobble is presented, a Web browser extension that contemporaneously executes a user's Google search query from a variety of different world-wide vantage points under a range of different conditions, alerting the user to the extent of inconsistency present in the set of search results returned to them by Google.

...read moreread less

Abstract: Given their critical role as gateways to Web content, the search results a Web search engine provides to its users have an out-sized impact on the way each user views the Web. Previous studies have shown that popular Web search engines like Google employ sophisticated personalization engines that can occasionally provide dramatically inconsistent views of the Web to different users. Unfortunately, even if users are aware of this potential, it is not straightforward for them to determine the extent to which a particular set of search results differs from those returned to other users, nor the factors that contribute to this personalization. We present the design and implementation of Bobble, a Web browser extension that contemporaneously executes a user's Google search query from a variety of different world-wide vantage points under a range of different conditions, alerting the user to the extent of inconsistency present in the set of search results returned to them by Google. Using more than 75,000 real search queries issued by over 170 users during a nine-month period, we explore the frequency and nature of inconsistencies that arise in Google search queries. In contrast to previously published results, we find that 98% of all Google search results display some inconsistency, with a user's geographic location being the dominant factor influencing the nature of the inconsistency.

...read moreread less

42 citations

Journal Article•DOI•

Content Bias in Online Health Search

[...]

Ryen W. White¹, Ahmed Hassan¹•Institutions (1)

Microsoft¹

06 Nov 2014-ACM Transactions on The Web

TL;DR: This research broadens previous work on biases in search to examine the role of search systems in contributing to biases, and focuses on questions about medical interventions and employs reliable ground truth data from authoritative medical sources to assess bias.

...read moreread less

Abstract: Search engines help people answer consequential questions. Biases in retrieved and indexed content (e.g., skew toward erroneous outcomes that represent deviations from reality), coupled with searchers' biases in how they examine and interpret search results, can lead people to incorrect answers. In this article, we seek to better understand biases in search and retrieval, and in particular those affecting the accuracy of content in search results, including the search engine index, features used for ranking, and the formulation of search queries. Focusing on the important domain of online health search, this research broadens previous work on biases in search to examine the role of search systems in contributing to biases. To assess bias, we focus on questions about medical interventions and employ reliable ground truth data from authoritative medical sources. In the course of our study, we utilize large-scale log analysis using data from a popular Web search engine, deep probes of result lists on that search engine, and crowdsourced human judgments of search result captions and landing pages. Our findings reveal bias in results, amplifying searchers' existing biases that appear evident in their search activity. We also highlight significant bias in indexed content and show that specific ranking signals and specific query terms support bias. Both of these can degrade result accuracy and increase skewness in search results. Our analysis has implications for bias mitigation strategies in online search systems, and we offer recommendations for search providers based on our findings.

...read moreread less

42 citations

Proceedings Article•DOI•

A novel approach to personalize web search through user profiling and query reformulation

[...]

Kamlesh Makvana¹, Pinal Shah¹, Parth Shah¹•Institutions (1)

Charotar University of Science and Technology¹

13 Nov 2014

TL;DR: A novel approach is proposed that personalize web search result through query reformulation and user profiling that identifies relevant search term for particular user from previous search history by analysing web log file maintained in the server.

...read moreread less

Abstract: With a inundating of information in WWW (World Wide Web) users are often failed to retrieve search result in context of their interest through existing search engines. So the personalization of web search result has to be carryout that process user's query and re-rank retrieved results based on their interest. User have diverse background on same query, it is very difficult for some informative query to identify user's current intention. In this paper, a novel approach is proposed that personalize web search result through query reformulation and user profiling. First, a framework is proposed that identify relevant search term for particular user from previous search history by analysing web log file maintained in the server. These terms are appended to user's ambiguous query. Second, the proposed approach proceeds the user's search result and re-rank the retrieved result by identifying interest value of user on retrieved links. Proposed new approach also identify user interest on retrieved links by combing the user interest value generated from VSM (Vector Space Model) and actual rank of that link. Third, the framework also suggest some keywords that help to incorporate user's current interest. Finally, experimental result shows the effectiveness of proposed search engine with commercial search engine with different criteria.

...read moreread less

30 citations

Proceedings Article•DOI•

Improving search personalisation with dynamic group formation

[...]

Thanh Vu¹, Dawei Song¹, Alistair Willis¹, Son N. Tran², Jingfei Li³ - Show less +1 more•Institutions (3)

Open University¹, City University London², Tianjin University³

03 Jul 2014

TL;DR: This paper proposes a personalisation framework in which a user profile is enriched using information from other users dynamically grouped with respect to an input query, and demonstrates that the framework improves the performance of the web search engine and also achieves better performance than the static grouping method.

...read moreread less

Abstract: Recent research has shown that the performance of search engines can be improved by enriching a user's personal profile with information about other users with shared interests. In the existing approaches, groups of similar users are often statically determined, e.g., based on the common documents that users clicked. However, these static grouping methods are query-independent and neglect the fact that users in a group may have different interests with respect to different topics. In this paper, we argue that common interest groups should be dynamically constructed in response to the user's input query. We propose a personalisation framework in which a user profile is enriched using information from other users dynamically grouped with respect to an input query. The experimental results on query logs from a major commercial web search engine demonstrate that our framework improves the performance of the web search engine and also achieves better performance than the static grouping method.

...read moreread less

29 citations

Deep Learning Powered In-Session Contextual Ranking using Clickthrough Data

[...]

Xiujun Li, Chenlei Guo, Wei Chu, Ye-Yi Wang, Jude W. Shavlik - Show less +1 more

08 Dec 2014

TL;DR: This work demonstrates how to generate the semantic features from in-session contextual information with deep learning models, and incorporate these semantic features into the current ranking model to re-rank the results.

...read moreread less

Abstract: User interactions with search engines provide many cues that can be leveraged to improve the relevance of search results through personalization. The context information (history of queries, clicked documents, etc.) provides strong signals about users’ search intent, which can be used to personalize the search experience and improve a web search engine. We demonstrate how to generate the semantic features from in-session contextual information with deep learning models, and incorporate these semantic features into the current ranking model to re-rank the results. We evaluate our approach using a large, real-world search log data from a major commercial web search engine, and the experimental results show our approach can significantly improve the performance of the search engine. Furthermore, we also find that the domain-specific, click-based features can effectively decrease the unsatisfied clicks for the current ranking model to improve the search experience.

...read moreread less

28 citations

Posted Content•

Penerapan teknik web scraping pada mesin pencari artikel ilmiah

[...]

Ahmad Josi, Leon Andretti Abdillah, Suryayusra

18 Oct 2014-arXiv: Information Retrieval

TL;DR: The aim is for information collected after the program makers learn navigation techniques on the website information will be taken to a web application mimicked the scraping that the authors will create.

...read moreread less

Abstract: Search engines are a combination of hardware and computer software supplied by a particular company through the website which has been determined. Search engines collect information from the web through bots or web crawlers that crawls the web periodically. The process of retrieval of information from existing websites is called "web scraping." Web scraping is a technique of extracting information from websites. Web scraping is closely related to Web indexing, as for how to develop a web scraping technique that is by first studying the program makers HTML document from the website will be taken to the information in the HTML tag flanking the aim is for information collected after the program makers learn navigation techniques on the website information will be taken to a web application mimicked the scraping that we will create. It should also be noted that the implementation of this writing only scraping involves a free search engine such as: portal garuda, Indonesian scientific journal databases (ISJD), google scholar.

...read moreread less

Book•

Design of A Priority Based Frequency Regulated Incremental Crawler

[...]

Niraj Singhal, Ashutosh, Sharma, A. K. Sharma

02 Jul 2014

TL;DR: To regulate the revisiting frequency a novel mechanism and a novel architecture for incremental crawler is being proposed.

...read moreread less

Abstract: The World Wide Web is a huge source of hyperlinked information contained in hypertext documents. Search engines use web crawlers to collect these documents from web for the purpose of storage and indexing. However, many of these documents contain dynamic information which gets changed on daily, weekly, monthly or yearly basis and hence we need to refresh the search engine side storage so that latest information is made available to the user. An incremental crawler visits the web repeatedly after a specific interval for updating its collection. In this paper to regulate the revisiting frequency a novel mechanism and a novel architecture for incremental crawler is being proposed.

...read moreread less

Journal Article•DOI•

Moved but not gone: an evaluation of real-time methods for discovering replacement web pages

[...]

Martin Klein¹, Michael L. Nelson²•Institutions (2)

Los Alamos National Laboratory¹, Old Dominion University²

01 Apr 2014-International Journal on Digital Libraries

TL;DR: Analysis of four content- and link-based methods to rediscover missing Web pages indicates that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.

...read moreread less

Abstract: Inaccessible Web pages and 404 "Page Not Found" responses are a common Web phenomenon and a detriment to the user's browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation as well as in the Information Retrieval realm. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best performing methods but also the sequence in which they should be applied, based on their performance, complexity required to generate them, and evolution over time. Our least complex single method results in a rediscovery rate of almost $$70\,\%$$ 70 % of Web pages of our sample dataset based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, our results show an increase of the success rate of up to $$77\,\%$$ 77 % . The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and "just" need to be rediscovered.

...read moreread less

Proceedings Article•DOI•

Investigating users' query formulations for cognitive search intents

[...]

Makoto P. Kato¹, Takehiro Yamamoto¹, Hiroaki Ohshima¹, Katsumi Tanaka¹•Institutions (1)

Kyoto University¹

03 Jul 2014

TL;DR: The findings suggest users over-adapt to current Web search engines, and create opportunities to estimate CSIs with non-verbal user input.

...read moreread less

Abstract: This study investigated query formulations by users with {\it Cognitive Search Intents} (CSIs), which are users' needs for the cognitive characteristics of documents to be retrieved, {\em e.g. comprehensibility, subjectivity, and concreteness. Our four main contributions are summarized as follows (i) we proposed an example-based method of specifying search intents to observe query formulations by users without biasing them by presenting a verbalized task description;(ii) we conducted a questionnaire-based user study and found that about half our subjects did not input any keywords representing CSIs, even though they were conscious of CSIs;(iii) our user study also revealed that over 50\% of subjects occasionally had experiences with searches with CSIs while our evaluations demonstrated that the performance of a current Web search engine was much lower when we not only considered users' topical search intents but also CSIs; and (iv) we demonstrated that a machine-learning-based query expansion could improve the performances for some types of CSIs.Our findings suggest users over-adapt to current Web search engines,and create opportunities to estimate CSIs with non-verbal user input.

...read moreread less

Proceedings Article•DOI•

Implicitly Learning a User Interest Profile for Personalization of Web Search Using Collaborative Filtering

[...]

Ashish Nanda¹, Rohit Omanwar¹, Bharat M. Deshpande¹•Institutions (1)

Birla Institute of Technology and Science¹

11 Aug 2014

TL;DR: A robust user modeling technique is proposed that implicitly creates a Dynamic Category Interest Tree (DCIT), using a general ontology of the web and a set of web pages collected over time that give an insight into a user's interests.

...read moreread less

Abstract: The increasing abundance of content on the web has made information filtering even more important in helping users find information related to their interests. Personalization of web search is one such effort, that aims at improving the efficiency with which a user finds results relevant to his query. This is done by keeping track of a user's individual interests, and taking it into account while returning search results. We propose a robust user modeling technique that implicitly creates a Dynamic Category Interest Tree (DCIT), using a general ontology of the web and a set of web pages collected over time that give an insight into a user's interests. The DCIT is designed to use a fuzzy classification technique to keep track of what topics a user is interested in, his amount of interest in a topic, as well as reflect his changing interests overtime. The DCIT consists of a general ontology of the web, where each node represents a topic and consists of keywords that are usually used to describe that topic or category. Additional keywords that the user frequently associates with a topic, such as names of important people, organizations, or a specialized terminology, etc. Are also incorporated into the relevant topic. We use the Apriori Algorithm to extract these associated words from the user's web history in order to more accurately define the user's categories of interest. The DCIT is initially created by a content based approach using only the browsing history of the user, and is later further enhanced through collaborative filtering using the k-nearest neighbour-based algorithm. We propose a technique to re-rank the results from a search engine according to their relevance to a user, based on his implicitly learned DCIT. According to experimental results, our DCIT based ranking often outperforms search engines such as Google when it comes to retrieving web pages that are more relevant to a user's interest.

...read moreread less

Journal Article•DOI•

Web search query volume as a measure of pharmaceutical utilization and changes in prescribing patterns.

[...]

Jacob E. Simmering¹, Linnea A. Polgreen¹, Philip M. Polgreen¹•Institutions (1)

University of Iowa¹

01 Nov 2014-Research in Social & Administrative Pharmacy

TL;DR: Search volume provides a first order approximation to pharmaceutical utilization in the community and can be used to detect changes in prescribing behaviors, such as the publication of new information.

...read moreread less

Abstract: Background Monitoring prescription drug utilization is important for both drug safety and drug marketing purposes. However, access to utilization data is often expensive, limited and not timely. Objectives To demonstrate and validate the use of web search engine queries as a method for timely monitoring of drug utilization and changes in prescribing behaviors. Methods Drug utilization time series were obtained from the Medical Expenditure Panel Survey and normalized search volume was obtained from Google Trends. Correlation between the series was estimated using a cross-correlation function. Changes in the search volume following knowledge events were detected using a cumulative sums changepoint method. Results Search volume tracks closely with the utilization rates of several seasonal prescription drugs. Additionally, search volume exhibits changes following known major knowledge events, such as the publication of new information. Conclusions Search volume provides a first order approximation to pharmaceutical utilization in the community and can be used to detect changes in prescribing behavior.

...read moreread less

Proceedings Article•DOI•

Improving the efficiency of multi-site web search engines

[...]

Guillem Francès¹, Xiao Bai², B. Barla Cambazoglu², Ricardo Baeza-Yates²•Institutions (2)

Pompeu Fabra University¹, Yahoo!²

24 Feb 2014

TL;DR: This paper proposes a document replication technique that improves the query locality of the state-of-the-art approaches with various replication budget distribution strategies, and devise a machine learning approach to decide the query forwarding patterns.

...read moreread less

Abstract: A multi-site web search engine is composed of a number of search sites geographically distributed around the world. Each search site is typically responsible for crawling and indexing the web pages that are in its geographical neighborhood. A query is selectively processed on a subset of search sites that are predicted to return the best-matching results. The scalability and efficiency of multi-site web search engines have attracted a lot of research attention in recent years. In particular, research has focused on replicating important web pages across sites, forwarding queries to relevant sites, and caching results of previous queries. Yet, these problems have only been studied in isolation, but no prior work has properly investigated the interplay between them. In this paper, we take this challenge up and conduct what we believe is the first comprehensive analysis of a full stack of techniques for efficient multi-site web search. Specifically, we propose a document replication technique that improves the query locality of the state-of-the-art approaches with various replication budget distribution strategies. We devise a machine learning approach to decide the query forwarding patterns, achieving a significantly lower false positive ratio than a state-of-the-art thresholding approach with little negative impact on search result quality. We propose three result caching strategies that reduce the number of forwarded queries and analyze the trade-off they introduce in terms of storage and network overheads. Finally, we show that the combination of the best-of-the-class techniques yields very promising search efficiency, rendering multi-site, geographically distributed web search engines an attractive alternative to centralized web search engines.

...read moreread less

Journal Article•DOI•

Exploiting location information for Web search

[...]

Jie Zhao¹, Peiquan Jin², Qingqing Zhang², Run Wen¹•Institutions (2)

Anhui University¹, University of Science and Technology of China²

01 Jan 2014-Computers in Human Behavior

TL;DR: The proposed framework consists of an offline stage to extract focused locations for crawled Web pages, as well as an online ranking stage to perform location-aware ranking for search results.

...read moreread less

Journal Article•DOI•

Analysis and Detection of Bogus Behavior in Web Crawler Measurement

[...]

Quan Bai¹, Gang Xiong¹, Yong Zhao¹, Longtao He¹•Institutions (1)

Chinese Academy of Sciences¹

01 Jan 2014-Procedia Computer Science

TL;DR: A model to detect real and “bogus” web crawlers, with accuracy rate of about 95%, is proposed, which can improve the performance of a site and enhance the quality of service of the network.

...read moreread less

Proceedings Article•DOI•

A novel approach for content extraction from web pages

[...]

Aanshi Bhardwaj¹, Veenu Mangat¹•Institutions (1)

Panjab University, Chandigarh¹

06 Mar 2014

TL;DR: Various approaches for extracting informative content from web pages and a new approach for content extraction from webpages using word to leaf ratio and density of links are discussed.

...read moreread less

Abstract: The rapid development of the internet and web publishing techniques create numerous information sources published as HTML pages on World Wide Web. However, there is lot of redundant and irrelevant information also on web pages. Navigation panels, Table of content (TOC), advertisements, copyright statements, service catalogs, privacy policies etc. on web pages are considered as relevant and irrelevant content. Such information makes various web mining tasks such as web page crawling, web page classification, link based ranking, topic distillation complex. This paper discusses various approaches for extracting informative content from web pages and a new approach for content extraction from web pages using word to leaf ratio and density of links.

...read moreread less

Proceedings Article•DOI•

Scalability and efficiency challenges in large-scale web search engines

[...]

Ricardo Baeza-Yates¹, B. Barla Cambazoglu¹•Institutions (1)

Yahoo!¹

07 Apr 2014

TL;DR: This tutorial provides an in-depth architectural overview of a web search engine, mainly focusing on the web crawling, indexing, and query processing components, and provides recommendations to researchers who are new to the field.

...read moreread less

Abstract: The main goals of a web search engine are quality, efficiency, and scalability. In this tutorial, we focus on the last two goals, providing a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. In particular, the tutorial provides an in-depth architectural overview of a web search engine, mainly focusing on the web crawling, indexing, and query processing components. The scalability and efficiency issues encountered in these components are presented at four different granularities: at the level of a single computer, a cluster of computers, a single data center, and a multi-center search engine. The tutorial also points at open research problems and provides recommendations to researchers who are new to the field.

...read moreread less

Original Research Web search query volume as a measure of pharmaceutical utilization and changes in prescribing patterns

[...]

Jacob E. Simmering, Linnea A. Polgreen, Philip M. Polgreen

01 Jan 2014

TL;DR: In this article, the authors demonstrate and validate the use of web search engine queries as a method for timely monitoring of drug utilization and changes in prescribing behaviors, using a cumulative sums changepoint method.

...read moreread less

Abstract: Background: Monitoring prescription drug utilization is important for both drug safety and drug marketing purposes. However, access to utilization data is often expensive, limited and not timely. Objectives: To demonstrate and validate the use of web search engine queries as a method for timely monitoring of drug utilization and changes in prescribing behaviors. Methods: Drug utilization time series were obtained from the Medical Expenditure Panel Survey and normalized search volume was obtained from Google Trends. Correlation between the series was estimated using a cross-correlation function. Changes in the search volume following knowledge events were detected using a cumulative sums changepoint method. Results: Search volume tracks closely with the utilization rates of several seasonal prescription drugs. Additionally, search volume exhibits changes following known major knowledge events, such as the publication of new information. Conclusions: Search volume provides a first order approximation to pharmaceutical utilization in the community and can be used to detect changes in prescribing behavior.

...read moreread less

Proceedings Article•DOI•

Detect phishing by checking content consistency

[...]

Yi-Shin Chen¹, Yi-Hsuan Yu¹, Huei-Sin Liu¹, Pang-Chieh Wang²•Institutions (2)

National Tsing Hua University¹, Industrial Technology Research Institute²

01 Aug 2014

TL;DR: This paper proposes a technique for identifying suspicious web pages, based on the literal and conceptual consistency between the URL and web contents, which can achieve 98% accuracy and is effective in detecting various forms of phishing attack.

...read moreread less

Abstract: Phishing is a form of cybercrime used to lure a victim to reveal his/her sensitive personal information to fraudulent web pages. To protect users from phishing attacks, many anti-phishing techniques have been proposed to block suspicious web pages, which are identified against registered black-lists, or checked by search engines. However, such approaches usually have difficulty in keeping up with the rapidly emerging phishing web pages. To lessen this problem, this paper proposes a technique for identifying suspicious web pages, based on the literal and conceptual consistency between the URL and web contents. By using the search logs only as reference data, our approach can achieve 98% accuracy, showing that it is effective in detecting various forms of phishing attack.

...read moreread less

Proceedings Article•DOI•

On correlation of absence time and search effectiveness

[...]

Sunandan Chakraborty¹, Filip Radlinski², Milad Shokouhi², Paul Baecke²•Institutions (2)

New York University¹, Microsoft²

03 Jul 2014

TL;DR: The effectiveness of absence time in evaluating new features in a web search engine, such as new ranking algorithm or a new user interface, is investigated and it is suggested that users are likely to return to the search engine sooner when their previous session has more queries and more clicks.

...read moreread less

Abstract: Online search evaluation metrics are typically derived based on implicit feedback from the users. For instance, computing the number of page clicks, number of queries, or dwell time on a search result. In a recent paper, Dupret and Lalmas introduced a new metric called absence time, which uses the time interval between successive sessions of users to measure their satisfaction with the system. They evaluated this metric on a version of Yahoo! Answers. In this paper, we investigate the effectiveness of absence time in evaluating new features in a web search engine, such as new ranking algorithm or a new user interface. We measured the variation of absence time to the effects of 21 experiments performed on a search engine. Our findings show that the outcomes of absence time agreed with the judgement of human experts performing a thorough analysis of a wide range of online and offline metrics in 14 out of these 21 cases. We also investigated the relationship between absence time and a set of commonly-used covariates (features) such as the number of queries and clicks in the session. Our results suggest that users are likely to return to the search engine sooner when their previous session has more queries and more clicks.

...read moreread less

Proceedings Article•DOI•

Finding pages on the unarchived web

[...]

Hugo C. Huurdeman¹, Anat Ben-David¹, Jaap Kamps¹, Thaer Samar², Arjen P. de Vries² - Show less +1 more•Institutions (2)

University of Amsterdam¹, Centrum Wiskunde & Informatica²

08 Sep 2014

TL;DR: This paper proposes an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiments with this approach on the DutchWeb archive.

...read moreread less

Abstract: Web archives preserve the fast changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies---most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the Dutch Web archive.Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average.

...read moreread less

Journal Article•DOI•

B-hist

[...]

Michele Catasta¹, Alberto Tonon², Gianluca Demartini², Jean-Eudes Ranvier¹, Karl Aberer¹, Philippe Cudré-Mauroux² - Show less +2 more•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, University of Fribourg²

01 Aug 2014-Journal of Web Semantics

TL;DR: A novel approach to answer queries over web browsing logs that takes into account entities appearing in the web pages, user activities, as well as temporal information and an empirical evaluation of the entity-based approach used to cluster web pages are presented.

...read moreread less

Proceedings Article•DOI•

Log-based personalization: the 4th web search click data (WSCD) workshop

[...]

Pavel Serdyukov¹, Georges Dupret², Nick Craswell³•Institutions (3)

Yandex¹, Yahoo!², Microsoft³

24 Feb 2014

TL;DR: WSCD 2014 is a forum for new research relating to Web search usage logs and for discussing desirable properties of publicly released search log datasets.

...read moreread less

Abstract: WSCD 2014 is the fourth workshop on Web Search Click Data, following WSCD 2009, WSCD 2011 and WSCD 2012. It is a forum for new research relating to Web search usage logs and for discussing desirable properties of publicly released search log datasets. Research relating to search logs has been hampered by the limited availability of click datasets. This series of workshops comes with new datasets based on logged user search behaviour and accompanying data mining challenges. This year the challenge and the workshop are focused on the tasks of personalization using logs.

...read moreread less

Proceedings Article•DOI•

Question answering system: A heuristic approach

[...]

Varsha Bhoir¹, M. A. Potey¹•Institutions (1)

College of Engineering, Pune¹

15 May 2014

TL;DR: The proposed solution of Question Answering system works for a specific domain of tourism, which is a global and routine activity for leisure, to reduce the painstaking search through the long list of documents.

...read moreread less

Abstract: The exponential growth in digital information led to the need of increasingly sophisticated search tools like the web search engines. Search engines return ranked list of documents and are less effective when users need precise answers to natural language questions. Question Answering systems involve this critical capability required for the next generation web search engines, to reduce the painstaking search through the long list of documents. The proposed solution of Question Answering system works for a specific domain of tourism which is a global and routine activity for leisure. The users have to struggle to navigate through these overloaded sites for a short piece of information of their interest. The crawler developed in the system gathers web page information which is processed using Natural Language Processing and Procedure programming for a specific keyword. The system returns precise short string answers or list to natural language questions related to tourism domain like distance, person, date, list of hotels, list of forts, etc.

...read moreread less

Proceedings Article•DOI•

Learning to Rank Answer Candidates for Automatic Resolution of Crossword Puzzles

[...]

Gianni Barlacchi¹, Massimo Nicosia², Alessandro Moschitti²•Institutions (2)

University of Trento¹, Qatar Foundation²

01 Jun 2014

TL;DR: It is shown that learning to rank models based on relational syntactic structures defined between the clues and the answer can improve both modules above and improve the resolution accuracy of crossword puzzles.

...read moreread less

Abstract: In this paper, we study the impact of relational and syntactic representations for an interesting and challenging task: the automatic resolution of crossword puzzles. Automatic solvers are typically based on two answer retrieval modules: (i) a web search engine, e.g., Google, Bing, etc. and (ii) a database (DB) system for accessing previously resolved crossword puzzles. We show that learning to rank models based on relational syntactic structures defined between the clues and the answer can improve both modules above. In particular, our approach accesses the DB using a search engine and reranks its output by modeling paraphrasing. This improves on the MRR of previous system up to 53% in ranking answer candidates and greatly impacts on the resolution accuracy of crossword puzzles up to 15%.

...read moreread less

Proceedings Article•DOI•

Information Evolution in Wikipedia

[...]

Andrea Ceroni, Mihai Georgescu, Ujwal Gadiraju, Kaweh Djafari Naini, Marco Fisichella - Show less +1 more

27 Aug 2014

TL;DR: It is hypothesize that by employing the temporal aspect as the primary means for capturing the evolution of entities, it is possible to provide entity-based accessibility to Web archives and reflect the usefulness of leveraging temporal information in order to study the Evolution of entities.

...read moreread less

Abstract: The Web of data is constantly evolving based on the dynamics of its content. Current Web search engine technologies consider static collections and do not factor in explicitly or implicitly available temporal information, that can be leveraged to gain insights into the dynamics of the data. In this paper, we hypothesize that by employing the temporal aspect as the primary means for capturing the evolution of entities, it is possible to provide entity-based accessibility to Web archives. We empirically show that the edit activity on Wikipedia can be exploited to provide evidence of the evolution of Wikipedia pages over time, both in terms of their content and in terms of their temporally defined relationships, classified in literature as events. Finally, we present results from our extensive analysis of a dataset consisting of 31,998 Wikipedia pages describing politicians, and observations from in-depth case studies. Our findings reflect the usefulness of leveraging temporal information in order to study the evolution of entities and breed promising grounds for further research.

...read moreread less

Collapse