Journal Article

Balancing efficiency and effectiveness for fusion-based search engines in the 'big data' environment

01 Jun 2016-Information Research: An International Electronic Journal (Thomas D. Wilson. 9 Broomfield Road, Broomhill, Sheffield, S10 2SE, UK. Web site: http://informationr.net/ir)-Vol. 21, Iss: 2


Balancing efficiency and effectiveness for fusion-based search engines in the 'big data' environment
http://www.informationr.net/ir/21-2/paper710.html[6/16/2016 6:03:27 PM]
VOL. 21 NO. 2, JUNE 2016
Balancing efficiency and effectiveness for fusion-based search
engines in the 'big data' environment
Jieyu Li, Chunlan Huang, Xiuhong Wang and Shengli Wu
Abstract
Introduction. In the big data age, we have to deal with a
tremendous amount of information, which can be collected from
various types of sources. For information search systems such as
Web search engines or online digital libraries, the collection of
documents becomes larger and larger. For some queries, an
information search system needs to retrieve a large number of
documents. On the other hand, very often people are only willing to
visit no more than a few top-ranked documents. Therefore, how to
develop an information search system with desirable efficiency and
effectiveness is a research problem.
Method. In this paper, we focus on the data fusion approach to
information search, in which each component search model
contributes a result and all the results are combined by a fusion
algorithm. Through empirical study, we are able to find a feasible
combination method that balances effectiveness and efficiency in
the context of data fusion.
Analysis. It is a multi-objective optimisation problem that aims to balance
effectiveness and efficiency. To support this, we need to understand
how these two factors affect each other and to what extent.
Results. Using some groups of historical runs from TREC to carry
out the experiment, we find that using much less information (e.g.,
less than 10% of the documents in the experiment), good efficiency is
achievable with marginal loss on effectiveness.
Conclusions. We consider that the findings from our experiment
are informative and can be used as a guideline for providing
a more efficient search service in the big data environment.
Introduction
In the big data age, we have to deal with a tremendous amount
of information, which can be collected from various types of
sources. For many information search systems, the corpora
they use become larger and larger. A typical example is the
Web. According to worldwidewebsize, the indexed Web
contains at least 4.64 billion pages as of 12 April 2015. Some
collections used in many information retrieval and Web search
evaluation events such as TREC (the Text REtrieval
Conference, held annually by the National Institute of Standards
and Technology of the USA) are also very large. For example,
ClueWeb09 has over one billion documents and ClueWeb12
has over 800 million documents.
The big data environment brings some challenges to
information search. For some queries, an information search
engine needs to retrieve a large number of documents. With
such a large number of documents, it is even more difficult for an
information search engine to locate useful and relevant
information so as to satisfy users' information needs. To
improve the effectiveness of the results by locating more relevant
documents and ranking them at top positions in
the result lists, increasingly complex and expensive
search techniques have been explored and used. For example,
when ranking Web documents for the given information need,
Web search engines not only consider the relevance of the
documents to the information need, they also take the
authority of the Web sites that hold the Web pages into
consideration.
Usually, the authority of Web sites is estimated by link
analysis, which requires data about links (Nunes, Ribeiro and
Gabriel, 2013) between large numbers of Web pages. PageRank
(Brin and Page, 1998) and HITS (Kleinberg, 1999) are two
well-known algorithms for such a purpose. There are also
many other methods such as entity detection, user log analysis
for personalized preference, user feedback and query
expansion, phrase recognition and structural analysis,
knowledge-based approach for word correction and
suggestion, and so on. Many of these data-intensive tasks are
commonly used in Web search engines. On the other hand,
efficiency, which concerns the time needed for the search
system to do a search task, becomes a big problem because of
the huge number of documents and related data involved and
complex and expensive techniques used (Baeza-Yates and
Cambazoglu, 2014; Francès, Bai, Cambazoglu and Baeza-Yates,
2014). This has to be considered when the response time of a
search system reaches the level that might go beyond users'
tolerance.
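Link analysis such as PageRank is one example of the expensive, data-intensive computations mentioned above. As a rough illustration only, the following is a minimal power-iteration sketch on a hypothetical three-page graph; the function and graph are illustrative assumptions, not how production search engines, or the systems studied here, compute authority.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                    # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# Ranks sum to 1; "C", with two in-links, ends up with the highest score.
```

Even this toy version makes the cost visible: every iteration touches every link, which is why link analysis over billions of pages is expensive.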
In this paper, we are going to address this within a particular
type of information search system that is implemented through
the fusion approach. This approach is popular and often
referred to as rank aggregation or learning to rank. Previous
research shows that fusion is an attractive option for
information search (Wu, 2012; Liu, 2011).
Data fusion works as follows: instead of using one information
search model, we use multiple search models working together
for any search task. For a given query, all of the component
search models search the same collection of documents and
each of them contributes a ranked list of documents as a
composite result. Then a selected data fusion algorithm is
applied to all the results involved and the combined result is
generated accordingly as the final result for the user. Usually
we take all the documents from all component results for
fusion because it is believed that the more documents we use,
the more information we obtain and, therefore, the better
the fused result will be. Effectiveness is the only concern in
almost all the cases. Figure 1 shows its structure with three
search models.
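The combination step described above can be sketched with CombSum (Fox and Shaw, 1994), one of the fusion methods discussed later: each component result's scores are normalised into a common range and then summed per document. The two runs below are hypothetical; this is a minimal illustration, not the authors' implementation.

```python
def min_max_normalise(result):
    """result: dict mapping doc id -> raw score; rescale scores into [0, 1]."""
    lo, hi = min(result.values()), max(result.values())
    if hi == lo:
        return {doc: 1.0 for doc in result}
    return {doc: (s - lo) / (hi - lo) for doc, s in result.items()}

def comb_sum(results):
    """results: list of dicts (doc id -> score), one per component search model."""
    fused = {}
    for result in results:
        for doc, score in min_max_normalise(result).items():
            fused[doc] = fused.get(doc, 0.0) + score
    # Rank documents by fused score, highest first
    return sorted(fused, key=fused.get, reverse=True)

run_a = {"d1": 2.0, "d2": 1.5, "d3": 0.5}
run_b = {"d2": 9.0, "d3": 7.0, "d4": 1.0}
fused = comb_sum([run_a, run_b])  # d2 ranks first: it scores well in both runs
```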
Figure 1: Structure of a fusion-based information
search system
In Figure 1, the first step is that the user issues a query to the
search engine interface (Step 1), which forwards it to multiple
search models (Step 2). Each of those search models works
with the index files and the same document collection (Step 3)
to obtain the result. After that, the models forward the results
to the fusion algorithm (Step 4) and the fusion algorithm
combines those results and transfers the fused result to the
interface (Step 5). Finally, the result is presented to the user
(Step 6). In Figure 1, all three search engines work with the
same document collection and related index files. It is possible
that each search model may work with a different document
collection. This case is usually referred to as federated search
(Shokouhi and Si, 2011
). In this paper, we focus on the
scenario of data fusion with all search models working on the
same document collection.
Data fusion has been extensively used in many information
search tasks such as expert search (Balog, Azzopardi and de
Rijke, 2009), blog opinion search (Orimaye, Alhashmi and
Siew, 2015), and search for diversified results (Santos,
Macdonald and Ounis, 2012), among others. In many
information search tasks, a traditional search model like BM25
is not enough. A lot of other techniques have also been applied.
Let us take Web search as an example. Many aspects including
link analysis, personalized ranking of documents, diversified
ranking of documents to let results cover more sub-topics, and
so on, have been provided by many Web search engines (Web
search engine, 2015). Traditional search models, such as BM25
or other alternatives, are just one part of the whole system. To
combine those results from different ranking components, data
fusion is required. In our experiments that use three groups of
data from TREC, some of the information retrieval systems
involved (such as uogTrA42, ICTNET11ADR3, srchvrs11b,
uogTRA45Vm, ICTNET11ADR2, ivoryL2Rb, srchvrs12c09,
uogTRA44xi) use BM25 as a component. Experimental results
show that data fusion can improve performance over those
component systems.
In the big data environment, we need to reconsider the whole
process carefully. Compared with some other solutions, the
data fusion approach is more complex because it runs multiple
search models concurrently and an extra fusion layer
is also necessary. It is more expensive in the sense
that more resources and more time are required. Both
effectiveness and efficiency become equally important aspects.
This can be looked at from the sides of both user and system.
First let us look at it from the user side. In some applications
such as Web search, very often users want to find some, but not
all, relevant documents. If the user finds some relevant pages,
then it is very likely that s/he will stop looking for further
relevant information (Cossock and Zhang, 2006). In this
paper, we try to find solutions for such a situation. From the
system side, we can deal with this in different ways. First of all,
if a user has no interest in reading a lot of documents, then the
search system does not need to retrieve too many of them in
the first place. Secondly, if we consider this issue in the
framework of data fusion, then we can deal with it in several
ways. Among others, four of them are as follows:
If there are a lot of candidates for component search
models, then we may just choose a subset of them for
submitting results to the fusion algorithm.
We may let each search model retrieve only a limited
number of documents rather than all the documents that
have a certain estimated probability of being relevant to
the query.
For the fused result, we may generate a limited number
of documents as the result.
Some data fusion methods need training data to
determine weights for each search model or some
parameters for the fusion algorithm. How much data we
use for the training purpose can be investigated.
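The second and third options above can be sketched together: each component model contributes only its top-k documents, and the fused list is cut at depth m. Reciprocal-rank scoring is used here purely for simplicity; the function name, scoring and inputs are illustrative assumptions, not the method evaluated in the paper.

```python
def fuse_truncated(ranked_lists, k=10, m=10):
    """ranked_lists: lists of document ids, best first, one per search model."""
    scores = {}
    for ranked in ranked_lists:
        for pos, doc in enumerate(ranked[:k]):   # keep only top-k per model
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (pos + 1)
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[:m]                           # cut the fused result at depth m

runs = [["d1", "d2", "d3", "d4"], ["d2", "d4", "d1", "d5"]]
top = fuse_truncated(runs, k=3, m=3)
```

Smaller k and m mean less data moved and merged, which is exactly the efficiency/effectiveness trade-off the paper investigates.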
For each of the aforementioned issues, we have several options for
how many documents or models, or how much data, to use. Generally
speaking, if we use less information in those search models,
then we may achieve higher efficiency, but with possible loss of
effectiveness. Thus we need a balanced decision that addresses
both at the same time. To our knowledge, these issues
have not been addressed before. In this paper, we investigate
the last two of the aforementioned issues.
The rest of this paper is organized as follows: in the next
section we discuss some related work. This is followed by a
presentation of the data fusion method used in this study and
also some necessary background information. Related
experimental settings and results are then reported and our
conclusions are presented.
Previous work
In information retrieval and Web search, many data fusion
methods such as CombSum (Fox and Shaw, 1994), CombMNZ
and its variants (Fox and Shaw, 1994; He and Wu, 2008), linear
combination (Vogt and Cottrell, 1999; Wu, 2012), Borda count
(Aslam and Montague, 2001), Condorcet fusion (Montague
and Aslam, 2002), cluster-based fusion (Kozorovitsky and
Kurland, 2011), the fusion-based implicit diversification method
(Liang, Ren and de Rijke, 2014), and others have been
proposed. Ng and Kantor (2000) use a few variables to predict
the effectiveness of data fusion (CombSum). However, almost
all of them concern effectiveness of the fused result, while
efficiency of the method has not been considered.
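As a concrete illustration of one of these rank-based methods, the following is a minimal Borda count sketch in the spirit of Aslam and Montague (2001): each component ranking awards a document n - rank points (n being the number of candidate documents), and points are summed. The input rankings are hypothetical, and details such as the handling of unranked documents vary across formulations.

```python
def borda_fuse(ranked_lists):
    """ranked_lists: lists of document ids, best first, one per search model."""
    candidates = {doc for ranked in ranked_lists for doc in ranked}
    n = len(candidates)
    points = {doc: 0.0 for doc in candidates}
    for ranked in ranked_lists:
        for pos, doc in enumerate(ranked):
            points[doc] += n - pos          # the top document earns n points
        # Documents a model did not rank share the leftover points equally
        absent = candidates - set(ranked)
        if absent:
            leftover = sum(n - p for p in range(len(ranked), n))
            for doc in absent:
                points[doc] += leftover / len(absent)
    return sorted(points, key=points.get, reverse=True)

runs = [["d1", "d2", "d3"], ["d2", "d3", "d1"]]
fused = borda_fuse(runs)  # d2 wins: ranked first in one run, second in the other
```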
In some cases, fewer than all available documents are used for
fusion, but the major reason for doing this is some purpose
other than improving efficiency. Two examples of this are by
Lee (1997) and Spoerri (2007).
Lee (1997) carried out some experiments to compare the

Citations
Proceedings ArticleDOI
27 Jun 2018
TL;DR: The goal of this half day, intermediate-level, tutorial is to provide a methodological view of the theoretical foundations of fusion approaches, the numerous fusion methods that have been devised and a variety of applications for which fusion techniques have been applied.
Abstract: Fusion is an important and central concept in Information Retrieval. The goal of fusion methods is to merge different sources of information so as to address a retrieval task. For example, in the adhoc retrieval setting, fusion methods have been applied to merge multiple document lists retrieved for a query. The lists could be retrieved using different query representations, document representations, ranking functions and corpora. The goal of this half day, intermediate-level, tutorial is to provide a methodological view of the theoretical foundations of fusion approaches, the numerous fusion methods that have been devised and a variety of applications for which fusion techniques have been applied.

23 citations

Journal ArticleDOI
TL;DR: This paper presents the resolution of conflict at the instance level in two stages, reference reconciliation and data fusion, and defines the conflicts classification, the strategies for dealing with conflicts, and the implementation of conflict management strategies.
Abstract: With the progress of new technologies of information and communication, more and more producers of data exist. On the other hand, the web forms a huge support of all these kinds of data. Unfortunately, existing data is not proper due to the existence of the same information in different sources, as well as erroneous and incomplete data. The aim of data integration systems is to offer to a user a unique interface to query a number of sources. A key challenge of such systems is to deal with conflicting information from the same source or from different sources. We present, in this paper, the resolution of conflict at the instance level into two stages: references reconciliation and data fusion. The reference reconciliation methods seek to decide if two data descriptions are references to the same entity in reality. We define the principles of reconciliation method then we distinguish the methods of reference reconciliation, first on how to use the descriptions of references, then the way to acquire knowledge. We finish this section by discussing some current data reconciliation issues that are the subject of current research. Data fusion in turn, has the objective to merge duplicates into a single representation while resolving conflicts between the data. We define first the conflicts classification, the strategies for dealing with conflicts and the implementing conflict management strategies. We present then, the relational operators and data fusion techniques. Likewise, we finish this section by discussing some current data fusion issues that are the subject of current research.

8 citations

References
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations


Proceedings Article
Ron Kohavi
20 Aug 1995
TL;DR: The results indicate that for real-world datasets similar to the authors', the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
Abstract: We review accuracy estimation methods and compare the two most common methods, cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment -- over half a million runs of C4.5 and a Naive-Bayes algorithm -- to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.

11,185 citations



Journal ArticleDOI
Jon Kleinberg
TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure, and has connections to the eigenvectors of certain matrices associated with the link graph.
Abstract: The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.

8,328 citations

Book
Tie-Yan Liu
27 Jun 2009
TL;DR: Three major approaches to learning to rank are introduced, i.e., the pointwise, pairwise, and listwise approaches, the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures are analyzed, and the performance of these approaches on the LETOR benchmark datasets is evaluated.
Abstract: This tutorial is concerned with a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches to solve real ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances on statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and show several future research directions.

2,515 citations