
Showing papers in "Information Processing and Management in 2005"


Journal ArticleDOI
TL;DR: In this paper, the authors examined the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences.
Abstract: The field of digital libraries (DLs) coalesced in 1994: the first digital library conferences were held that year, awareness of the World Wide Web was accelerating, and the National Science Foundation awarded $24 million (US) for the Digital Library Initiative (DLI). In this paper we examine the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences. We base our analysis on a common binary undirected network model to represent the co-authorship network, and from it we extract several established network measures. We also introduce a weighted directional network model to represent the co-authorship network, for which we define AuthorRank as an indicator of the impact of an individual author in the network. The results are validated against conference program committee members in the same period. The results show clear advantages of PageRank and AuthorRank over degree, closeness and betweenness centrality metrics. We also investigate the amount and nature of international participation in the Joint Conference on Digital Libraries (JCDL).

828 citations
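
The AuthorRank idea can be sketched in a few lines: build a weighted, directed co-authorship graph and run a PageRank-style computation over it. The sketch below uses networkx and approximates the paper's exclusivity-based edge weights with simple co-authorship counts; the author lists are hypothetical.

```python
# Sketch: ranking authors in a co-authorship network (hypothetical data).
# AuthorRank in the paper uses exclusivity-weighted edges; here we
# approximate edge weights by plain co-authorship counts.
from itertools import combinations
from collections import Counter
import networkx as nx

papers = [  # hypothetical author lists
    ["Liu", "Bollen", "Nelson"],
    ["Bollen", "Nelson"],
    ["Liu", "Van de Sompel"],
]

weights = Counter()
for authors in papers:
    for a, b in combinations(authors, 2):
        weights[(a, b)] += 1
        weights[(b, a)] += 1  # a directed edge in each direction

G = nx.DiGraph()
for (a, b), w in weights.items():
    G.add_edge(a, b, weight=w)

author_rank = nx.pagerank(G, alpha=0.85, weight="weight")
for author, score in sorted(author_rank.items(), key=lambda x: -x[1]):
    print(f"{author}: {score:.3f}")
```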


Journal ArticleDOI
TL;DR: Empirical results are presented showing that the patent task performance process involves highly collaborative aspects throughout the stages of the information seeking and retrieval process, and a refined IR framework involving collaborative aspects is proposed.
Abstract: In this article we investigate the expressions of collaborative activities within information seeking and retrieval processes (IS&R). Generally, information seeking and retrieval is regarded as an individual and isolated process in IR research. We assume that an IS&R situation is not merely an individual effort, but inherently involves various collaborative activities. We present empirical results from a real-life and information-intensive setting within the patent domain, showing that the patent task performance process involves highly collaborative aspects throughout the stages of the information seeking and retrieval process. Furthermore, we show that these activities may be categorised and related to different stages in an information seeking and retrieval process. Therefore, the assumption that information retrieval performance is purely individual needs to be reconsidered. Finally, we also propose a refined IR framework involving collaborative aspects.

298 citations


Journal ArticleDOI
TL;DR: One approach is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries, while the other uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map.
Abstract: This paper proposes two approaches to text summarization: a modified corpus-based approach (MCBA) and an LSA-based T.R.M. approach (LSA + T.R.M.). The first is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significance of different sentence positions, and (2) the score function is trained by a genetic algorithm (GA) to obtain a suitable combination of feature weights. The second approach uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA + T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two novel approaches were measured at several compression rates on a data corpus composed of 100 political articles. When the compression rate was 30%, average f-measures of 49% for MCBA, 52% for MCBA + GA, and 44% and 40% for LSA + T.R.M. at the single-document and corpus levels, respectively, were achieved.

264 citations
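
The MCBA score function is a weighted combination of sentence features. A minimal sketch, with illustrative feature values and uniform weights standing in for the GA-trained ones:

```python
# Sketch of an MCBA-style sentence score: a weighted combination of
# sentence features. Feature values and weights here are illustrative;
# in the paper the weights are learned with a genetic algorithm.
def score_sentence(features, weights):
    """features and weights: dicts keyed by feature name."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for one sentence (normalised to [0, 1]).
features = {
    "position": 0.9,          # early sentences tend to matter more
    "positive_keyword": 0.6,
    "negative_keyword": -0.2,
    "centrality": 0.5,        # similarity to the rest of the document
    "title_resemblance": 0.7,
}
weights = {name: 1.0 for name in features}  # a GA would tune these

print(score_sentence(features, weights))
```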


Journal ArticleDOI
TL;DR: This paper examines why journal articles that have been posted without charge on the internet are more heavily cited than those that have not been.
Abstract: It has been shown (Lawrence, S. (2001). Online or invisible? Nature, 411, 521) that journal articles which have been posted without charge on the internet are more heavily cited than those which have not been. Using data from the NASA Astrophysics Data System (ads.harvard.edu) and from the ArXiv e-print archive at Cornell University (arXiv.org) we examine the causes of this effect.

237 citations


Journal ArticleDOI
TL;DR: Although social network metrics and ISI IF rankings deviate moderately for citation-based journal networks, they differ considerably for journal networks derived from download data, which raises questions regarding the validity of the ISI IF as the sole assessment of journal impact.
Abstract: We generated networks of journal relationships from citation and download data, and determined journal impact rankings from these networks using a set of social network centrality metrics. The resulting journal impact rankings were compared to the ISI IF. Results indicate that, although social network metrics and ISI IF rankings deviate moderately for citation-based journal networks, they differ considerably for journal networks derived from download data. We believe the results represent a unique aspect of general journal impact that is not captured by the ISI IF. These results furthermore raise questions regarding the validity of the ISI IF as the sole assessment of journal impact, and suggest the possibility of devising impact metrics based on usage information in general.

211 citations
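
The comparison can be sketched as follows: compute centrality metrics over a weighted journal citation network and correlate each resulting ranking with an IF-style ranking. Journal names, edge weights and impact factors below are invented for illustration; the paper's metrics and data differ in scale.

```python
# Sketch: journal impact rankings from a citation network via social
# network centrality metrics, compared against ISI IF-style rankings.
import networkx as nx
from scipy.stats import spearmanr

G = nx.DiGraph()
G.add_weighted_edges_from([       # hypothetical citation counts
    ("J1", "J2", 120), ("J2", "J1", 80),
    ("J3", "J1", 40), ("J3", "J2", 60), ("J1", "J3", 10),
])

journals = list(G)
rankings = {
    "pagerank": nx.pagerank(G, weight="weight"),
    "in_degree": dict(G.in_degree(weight="weight")),
    "betweenness": nx.betweenness_centrality(G),
}
isi_if = {"J1": 2.1, "J2": 3.4, "J3": 0.9}  # hypothetical impact factors

for name, scores in rankings.items():
    rho, _ = spearmanr([scores[j] for j in journals],
                       [isi_if[j] for j in journals])
    print(f"{name} vs IF: rho = {rho:.2f}")
```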


Journal ArticleDOI
TL;DR: This longitudinal benchmark study shows that European Web searching is evolving in certain directions, and European search topics are broadening, with a notable percentage decline in sexual and pornographic searching.
Abstract: The Web has become a worldwide source of information and a mainstream business tool. It is changing the way people conduct the daily business of their lives. As these changes are occurring, we need to understand what Web searching trends are emerging within the various global regions. What are the regional differences and trends in Web searching, if any? What is the effectiveness of Web search engines as providers of information? As part of a body of research studying these questions, we have analyzed two data sets collected from queries submitted, mainly by European users, to AlltheWeb.com on 6 February 2001 and 28 May 2002. AlltheWeb.com is a major and highly rated European search engine. Each data set contains approximately a million queries submitted by over 200,000 users and spans a 24-hour period. This longitudinal benchmark study shows that European Web searching is evolving in certain directions. Query length declined somewhat, and queries remained extremely simple. European search topics are broadening, with a notable percentage decline in sexual and pornographic searching. The majority of Web searchers view fewer than five Web documents per query, spending only seconds on each document. Approximately 50% of the Web documents viewed by these European users were topically relevant. We discuss the implications for Web information systems and information content providers.

196 citations


Journal ArticleDOI
TL;DR: A new approach to creating a patent classification system to replace the IPC or UPC system for conducting patent analysis and management is proposed, based on the co-citation analysis of bibliometrics, to assist patent managers in understanding the basic patents for a specific industry.
Abstract: The paper proposes a new approach to creating a patent classification system to replace the IPC or UPC system for conducting patent analysis and management. The new approach is based on the co-citation analysis of bibliometrics. The traditional approach to the management of patents, which is based on either the IPC or UPC, is too general to meet the needs of specific industries. In addition, some patents are placed in incorrect categories, making it difficult for enterprises to carry out R&D planning, technology positioning, patent strategy-making and technology forecasting. Therefore, it is essential to develop a patent classification system that is adapted to the characteristics of a specific industry. The analysis in this approach is divided into three phases. Phase I selects appropriate databases in which to conduct patent searches, according to the subject and objective of the study, and then selects basic patents. Phase II uses the co-citation frequency of the basic patent pairs to assess their similarity. Phase III uses factor analysis to establish a classification system and assess the efficiency of the proposed approach. The main contribution of this approach is to develop a patent classification system based on patent similarities that assists patent managers in understanding the basic patents for a specific industry, the relationships among categories of technologies, and the evolution of a technology category.

189 citations
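
Phase II reduces to counting how often pairs of basic patents are cited together. A minimal sketch with hypothetical citing patents; the resulting co-citation counts would feed the factor analysis of Phase III.

```python
# Sketch of Phase II: co-citation counts between basic patents as a
# similarity measure. citing_patents maps each citing patent to the
# basic patents it cites (hypothetical data).
from itertools import combinations
from collections import Counter

citing_patents = {
    "US111": {"P1", "P2"},
    "US222": {"P1", "P2", "P3"},
    "US333": {"P2", "P3"},
}

cocitation = Counter()
for cited in citing_patents.values():
    for a, b in combinations(sorted(cited), 2):
        cocitation[(a, b)] += 1

# Pairs co-cited more often are considered more similar.
for pair, count in cocitation.most_common():
    print(pair, count)
```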


Journal ArticleDOI
TL;DR: Full-text analysis and traditional bibliometric methods are serially combined to improve the efficiency of the two individual methods, and the results confirm the main finding of the pilot study, that such a hybrid methodology can be applied to both research evaluation and information retrieval.
Abstract: In the present study, the results of an earlier pilot study by Glenisson, Glanzel and Persson are extended on the basis of larger sets of papers. Full-text analysis and traditional bibliometric methods are serially combined to improve the efficiency of the two individual methods. The text-mining methodology already introduced in the pilot study is applied to the complete publication year 2003 of the journal Scientometrics. Altogether 85 documents that can be considered research articles or notes were selected for this exercise. The outcomes confirm the main results of the pilot study, namely that such a hybrid methodology can be applied to both research evaluation and information retrieval. Nevertheless, the Scientometrics documents published in 2003 cover a much broader and more heterogeneous spectrum of bibliometrics and related research than those analysed in the pilot study. A modified subject classification based on the scheme used in an earlier study by Schoepflin and Glanzel was applied for validation purposes.

150 citations


Journal ArticleDOI
TL;DR: Key results include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cognates and development of a query-specific measure for translation fanout that helps to explain the utility of structured query methods.
Abstract: Cross-language information retrieval (CLIR) systems allow users to find documents written in different languages from that of their query. Simple knowledge structures such as bilingual term lists have proven to be a remarkably useful basis for bridging that language gap. A broad array of dictionary-based techniques have demonstrated utility, but comparison across techniques has been difficult because evaluation results often span only a limited range of conditions. This article identifies the key issues in dictionary-based CLIR, develops unified frameworks for term selection and term translation that help to explain the relationships among existing techniques, and illustrates the effect of those techniques using four contrasting languages for systematic experiments with a uniform query translation architecture. Key results include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cognates and development of a query-specific measure for translation fanout that helps to explain the utility of structured query methods.

123 citations
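
A minimal sketch of the dictionary-based, structured-query idea: each source-language term maps to its set of target-language translations, and the retrieval engine treats each set as synonyms. The #syn rendering below mimics InQuery-style operators in the spirit of Pirkola's structured queries; the bilingual term list is made up. The per-term number of alternatives is the "fanout" the article measures.

```python
# Sketch of dictionary-based query translation with a structured-query
# flavour: alternative translations of one source term form a synonym
# set. The bilingual term list is hypothetical.
bilingual = {
    "bank": ["banque", "rive"],
    "loan": ["prêt", "emprunt"],
}

def translate_query(terms, term_list):
    """Return one synonym set per source term; unknown terms pass through."""
    return [term_list.get(t, [t]) for t in terms]

structured = translate_query(["bank", "loan"], bilingual)
# Rendered, e.g., for an InQuery-like engine:
print(" ".join("#syn(" + " ".join(alts) + ")" for alts in structured))
# -> #syn(banque rive) #syn(prêt emprunt)
```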


Journal ArticleDOI
TL;DR: This paper introduces new sets of features specific to web documents, extracted from the URL and HTML tags within the pages, and concludes which set of features is appropriate for automatic genre classification of web documents.
Abstract: With the increase of information on the Web, it is difficult to quickly find desired information among the documents retrieved by a search engine. One way to address this problem is to classify web documents according to various criteria. Most document classification has focused on the subject or topic of a document. A genre or style is another view of a document, different from its subject or topic, and the genre is also a criterion by which to classify documents. In this paper, we suggest multiple sets of features for classifying the genres of web documents. The basic set of features, proposed in previous studies, is acquired from the textual properties of documents, such as the number of sentences, the frequency of particular words, etc. However, web documents differ from textual documents in that they contain URLs and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from the URL and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features and to discuss their characteristics. Finally, we conclude which set of features is appropriate for the automatic genre classification of web documents.

122 citations
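
A sketch of what URL- and HTML-derived genre features might look like; the concrete feature set below is illustrative, not the paper's exact list.

```python
# Sketch: extracting web-specific genre features from a page's URL and
# HTML tags, using only the standard library.
import re
from urllib.parse import urlparse

def url_features(url):
    parts = urlparse(url)
    path = parts.path.strip("/")
    return {
        "path_depth": path.count("/") + 1 if path else 0,
        "has_tilde": "~" in parts.path,   # often a personal home page
        "has_query": bool(parts.query),
    }

def html_features(html):
    # Count opening tags only (closing tags start with "</").
    tags = [t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)]
    n = max(len(tags), 1)
    return {
        "link_ratio": tags.count("a") / n,
        "image_ratio": tags.count("img") / n,
        "form_count": tags.count("form"),
    }

print(url_features("http://example.edu/~jane/papers/index.html?q=1"))
print(html_features("<html><body><a href='x'>x</a><img src='y'></body></html>"))
```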


Journal ArticleDOI
TL;DR: This paper envisages a Digital Library not only as an information resource where users may submit queries to satisfy their daily information needs, but also as a collaborative working and meeting space for people sharing common interests.
Abstract: The Web, and consequently the information contained in it, is growing rapidly. Every day a huge amount of newly created information is electronically published in Digital Libraries, whose aim is to satisfy users' information needs. In this paper, we envisage a Digital Library not only as an information resource where users may submit queries to satisfy their daily information needs, but also as a collaborative working and meeting space for people sharing common interests. Indeed, we present a personalized collaborative Digital Library environment, where users may organize the information space according to their own subjective view, build communities, become aware of each other, exchange information and knowledge with other users, and get recommendations based on the preference patterns of other users.

Journal ArticleDOI
TL;DR: In this study the rankings of IR systems based on binary and graded relevance in TREC 7 and 8 data are compared and the results show the different character of the measures.
Abstract: In this study the rankings of IR systems based on binary and graded relevance in TREC 7 and 8 data are compared. The relevance of a sample of TREC results is reassessed using a relevance scale with four levels: non-relevant, marginally relevant, fairly relevant, and highly relevant. Twenty-one topics and 90 systems from TREC 7 and 20 topics and 121 systems from TREC 8 form the data. Binary precision, cumulated gain, discounted cumulated gain and normalised discounted cumulated gain are the measures compared. Different weighting schemes for relevance levels are tested with the cumulated gain measures. Kendall's rank correlations are computed to determine to what extent the rankings produced by different measures are similar. Weighting schemes from binary up to schemes emphasising highly relevant documents form a continuum, where the measures correlate strongly at the binary end and less at the heavily weighted end. The results show the different character of the measures.
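
The measures compared can be reproduced compactly. A sketch of cumulated gain, discounted cumulated gain and normalised DCG in the Järvelin-Kekäläinen style, with one illustrative weighting scheme for the four relevance levels:

```python
# Sketch of the graded-relevance measures compared in the study:
# (discounted) cumulated gain and normalised DCG, with a weighting
# scheme mapping relevance levels to gain values.
import math

# Relevance levels: 0 non-, 1 marginally, 2 fairly, 3 highly relevant.
weights = {0: 0, 1: 1, 2: 5, 3: 10}  # one of several possible schemes

def dcg(ranked_levels, b=2):
    """Cumulated gain with a log_b discount from rank b onward."""
    total, out = 0.0, []
    for i, level in enumerate(ranked_levels, start=1):
        gain = weights[level]
        total += gain if i < b else gain / math.log(i, b)
        out.append(total)
    return out

run = [3, 0, 2, 1, 0]               # relevance levels of a ranked result list
ideal = sorted(run, reverse=True)   # best possible ordering
ndcg = [d / i for d, i in zip(dcg(run), dcg(ideal))]
print(ndcg)
```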

Journal ArticleDOI
TL;DR: Webpage visibility in search engine results lists can be improved by increasing the frequency of keywords in the title, in the full-text, and in both the title and full-text.
Abstract: Content characteristics of a webpage include factors such as keyword position in a webpage, keyword duplication, layout, and their combination. These factors may impact webpage visibility in a search engine. Four hypotheses are presented relating to the impact of selected content characteristics on webpage visibility in search engine results lists. Webpage visibility can be improved by increasing the frequency of keywords in the title, in the full-text and in both the title and full-text.

Journal ArticleDOI
TL;DR: A classification of link types in academic environments on the Web provides an insight into the diverse uses of hypertext links on the Internet, and has implications for browsing and ranking in IR systems by differentiating between different types of links.
Abstract: The Web is an enormous set of documents connected through hypertext links created by authors of Web pages. These links have been studied quantitatively, but little has been done so far in order to understand why these links are created. As a first step towards a better understanding, we propose a classification of link types in academic environments on the Web. The classification is multi-faceted and involves different aspects of the source and the target page, the link area and the relationship between the source and the target. Such classification provides an insight into the diverse uses of hypertext links on the Web, and has implications for browsing and ranking in IR systems by differentiating between different types of links. As a case study we classified a sample of links between sites of Israeli academic institutions.

Journal ArticleDOI
TL;DR: This paper reviews state-of-the-art techniques and methods for enhancing effectiveness of cross-language information retrieval (CLIR) and focuses on matching strategies and translation techniques.
Abstract: This paper reviews state-of-the-art techniques and methods for enhancing the effectiveness of cross-language information retrieval (CLIR). The following research issues are covered: (1) matching strategies and translation techniques, (2) methods for solving the problem of translation ambiguity, (3) formal models for CLIR, such as the application of the language model, (4) the pivot language approach, (5) methods for searching multilingual document collections, (6) techniques for combining multiple language resources, etc.

Journal ArticleDOI
TL;DR: The paper concludes that the two-step procedure for indexing is insufficient to explain the indexing process and suggests that the domain-centered approach offers a guide for indexers that can help them manage the complexity of indexing.
Abstract: The paper discusses the notion of steps in indexing, reveals that the document-centered approach to indexing is prevalent, and argues that the document-centered approach is problematic because it blocks out context-dependent factors in the indexing process. A domain-centered approach to indexing is presented as an alternative, and the paper discusses how this approach includes a broader range of analyses and how it requires a new set of actions: analysis of the domain, the users and the indexers. The paper concludes that the two-step procedure for indexing is insufficient to explain the indexing process and suggests that the domain-centered approach offers a guide for indexers that can help them manage the complexity of indexing.

Journal Article
TL;DR: An investigation into the effects of summary length as a function of screen size, where query-biased summaries are used to present retrieval results, exploring whether there is an optimal summary size for three types of device, given their different screen sizes.

Journal ArticleDOI
TL;DR: Findings suggest that metadata is a good mechanism for improving webpage visibility, that the metadata subject field plays a more important role than any other metadata field, and that keywords extracted from the webpage itself, particularly from the title or full-text, are most effective.
Abstract: This paper discusses the impact of metadata implementation in a webpage on its visibility in a search engine results list. Influential internal and external factors of metadata implementation were identified, and how these factors affect webpage visibility in a search engine results list was examined in an experimental study. Findings suggest that metadata is a good mechanism for improving webpage visibility, that the metadata subject field plays a more important role than any other metadata field, and that keywords extracted from the webpage itself, particularly from the title or full-text, are most effective. To maximize the effects, these keywords should come from both the title and the full-text.

Journal ArticleDOI
TL;DR: The topic identification algorithm's performance becomes doubtful in various cases; these cases are explored, and the reasons underlying the inconsistent performance of automatic topic identification are investigated with statistical analysis and experimental design techniques.
Abstract: The analysis of contextual information in search engine query logs enhances the understanding of Web users' search patterns. Obtaining contextual information from Web search engine logs is a difficult task, since users submit only a few queries and search across multiple topics. Identification of topic changes within a search session is an important branch of search engine user behavior analysis. The purpose of this study is to investigate the properties of a specific topic identification methodology in detail and to test its validity. The topic identification algorithm's performance becomes doubtful in various cases. These cases are explored, and the reasons underlying the inconsistent performance of automatic topic identification are investigated with statistical analysis and experimental design techniques.
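
The specific methodology under study is not reproduced here; as a point of reference, a minimal sketch of a common term-overlap heuristic for flagging topic shifts between consecutive queries (session data hypothetical):

```python
# Sketch of a simple topic-change heuristic over a search session:
# flag a topic shift when consecutive queries share no terms. This is
# a simplification of the pattern-based methods examined in such work.
def topic_shifts(queries):
    shifts = []
    for prev, cur in zip(queries, queries[1:]):
        shared = set(prev.lower().split()) & set(cur.lower().split())
        shifts.append(len(shared) == 0)
    return shifts

session = ["jaguar speed", "jaguar top speed", "apple pie recipe"]
print(topic_shifts(session))  # [False, True]
```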

Journal ArticleDOI
TL;DR: A multiple regression model was developed which shows that faculty quality and the language of the university are important predictors for links to a university Web site, and shows that English universities are advantaged.
Abstract: Hyperlink patterns between Canadian university Web sites were analyzed by a mathematical modeling approach. A multiple regression model was developed which shows that faculty quality and the language of the university are important predictors for links to a university Web site. Higher faculty quality means more links. French universities received lower numbers of links to their Web sites than comparable English universities. Analysis of interlinking between pairs of universities also showed that English universities are advantaged. Universities are more likely to link to each other when the geographical distance between them is less than 3000 km, possibly reflecting the east vs. west divide that exists in Canadian society.

Journal ArticleDOI
TL;DR: This study measures how similar the rankings of search engines are on the overlapping results of identical queries retrieved from several search engines, and indicates that the large public search engines on the Web employ considerably different ranking algorithms.
Abstract: The Web has become an information source for professional data gathering. Because of the vast amounts of information on almost all topics, one cannot systematically go over the whole set of results, and therefore must rely on the ordering of the results by the search engine. It is well known that search engines on the Web have low overlap in terms of coverage. In this study we measure how similar the rankings of search engines are on the overlapping results. We compare rankings of results for identical queries retrieved from several search engines. The method is based only on the set of URLs that appear in the answer sets of the engines being compared. For comparing the similarity of rankings of two search engines, the Spearman correlation coefficient is computed. When comparing more than two sets, Kendall's W is used. These are well-known measures, and the statistical significance of the results can be computed. The methods are demonstrated on a set of 15 queries that were submitted to four large Web search engines. The findings indicate that the large public search engines on the Web employ considerably different ranking algorithms.
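
A sketch of the two comparisons on hypothetical result lists: Spearman's correlation over the ranks of overlapping URLs for two engines, and Kendall's W (coefficient of concordance) for two or more.

```python
# Sketch: comparing engine rankings on their overlapping URLs.
from scipy.stats import spearmanr, rankdata

engine_a = ["u1", "u2", "u3", "u4", "u5"]
engine_b = ["u3", "u1", "u5", "u2", "u6"]

overlap = [u for u in engine_a if u in engine_b]
rho, p = spearmanr([engine_a.index(u) for u in overlap],
                   [engine_b.index(u) for u in overlap])
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")

def kendalls_w(rank_matrix):
    """rank_matrix: m raters x n items, each row a ranking (1..n)."""
    m, n = len(rank_matrix), len(rank_matrix[0])
    totals = [sum(col) for col in zip(*rank_matrix)]
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

rows = [rankdata([e.index(u) for u in overlap]) for e in (engine_a, engine_b)]
print(f"Kendall's W = {kendalls_w(rows):.2f}")
```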

Journal ArticleDOI
TL;DR: A real-time measure of bias in Web search engines captures the degree to which the distribution of URLs, retrieved in response to a query, deviates from an ideal or fair distribution for that query.
Abstract: This paper examines a real-time measure of bias in Web search engines. The measure captures the degree to which the distribution of URLs, retrieved in response to a query, deviates from an ideal or fair distribution for that query. This ideal is approximated by the distribution produced by a collection of search engines. Differences between bias and classical retrieval measures are highlighted by examining the possibilities for bias in four extreme cases of recall and precision. The results of experiments examining the influence on bias measurement of subject domains, search engines, and search terms are presented. Three general conclusions are drawn: (1) the performance of search engines can be distinguished with the aid of the bias measure; (2) bias values depend on the subject matter under consideration; (3) choice of search terms does not account for much of the variance in bias values. These conclusions underscore the need to develop "bias profiles" for search engines.
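
The bias idea can be sketched as a distance between an engine's URL distribution for a query and the pooled ("ideal") distribution produced by a collection of engines; the L1 distance used below is illustrative, and the paper's exact formulation may differ.

```python
# Sketch of the bias measure: deviation of one engine's URL
# distribution from a pooled distribution over several engines.
from collections import Counter

def distribution(urls):
    counts = Counter(urls)
    total = sum(counts.values())
    return {u: c / total for u, c in counts.items()}

def bias(engine_urls, pooled_urls):
    p, q = distribution(engine_urls), distribution(pooled_urls)
    support = set(p) | set(q)
    # Total variation (halved L1) distance between the distributions.
    return sum(abs(p.get(u, 0) - q.get(u, 0)) for u in support) / 2

pool = ["u1", "u2", "u3", "u1", "u4", "u2", "u1"]  # pooled engine results
print(bias(["u1", "u1", "u5"], pool))
```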

Journal ArticleDOI
TL;DR: This work proposes to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which the truncated SVD is applied, and shows that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.
Abstract: The text retrieval method using latent semantic indexing (LSI) technique with truncated singular value decomposition (SVD) has been intensively studied in recent years. The SVD reduces the noise contained in the original representation of the term-document matrix and improves the information retrieval accuracy. Recent studies indicate that SVD is mostly useful for small homogeneous data collections. For large inhomogeneous datasets, the performance of the SVD based text retrieval technique may deteriorate. We propose to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which we apply the truncated SVD. Our experimental results show that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.
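
A minimal sketch of the clustered-SVD strategy using scikit-learn: partition the collection with k-means, then fit a truncated SVD within each cluster rather than one SVD over the whole inhomogeneous set. Corpus and parameter choices are illustrative.

```python
# Sketch: cluster the documents, then apply truncated SVD (LSI)
# within each cluster instead of over the whole collection.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stock market trading", "market prices and stocks",
    "protein folding structure", "gene expression in proteins",
]
X = TfidfVectorizer().fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for k in range(2):
    idx = [i for i, l in enumerate(labels) if l == k]
    svd = TruncatedSVD(n_components=1).fit(X[idx])
    print(f"cluster {k}: docs {idx}, "
          f"explained variance {svd.explained_variance_ratio_.sum():.2f}")
```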

Journal ArticleDOI
Leo Egghe
TL;DR: This editorial introductory paper first discusses the reasons for the clear growth of the field of informetrics, citing among other evidence the exponential growth of JASIS, and then discusses the content of the papers published in this special issue of Information Processing and Management.
Abstract: This editorial introductory paper first discusses the reasons for the clear growth of the field of informetrics (bibliometrics, scientometrics, webometrics, ...). This growth has led some journals to increase their number of volumes or the number of issues per volume. The journal Information Processing and Management decided to devote two special issues (the one here and another to come in 2006) to the broad topic "Informetrics", where the scope of these special issues is to attract good papers dealing with gathering important data sets and/or presenting original models and explanations. We then briefly discuss the content of the papers that are published in this special issue. They deal with models, mapping of science (cocitation, coword analysis), web sites and search engines, collaboration in digital libraries, and the newest topic in informetrics: use of and access to articles in digital libraries.

I. THE GROWTH OF THE FIELD OF INFORMETRICS

In this introductory paper, we will use the term "informetrics" as the broad term comprising all -metrics studies related to information science, including bibliometrics (bibliographies, libraries, ...), scientometrics (science policy, citation analysis, research evaluation, ...) and webometrics (metrics of the web, the Internet or other social networks such as citation or collaboration networks). The term informetrics was introduced by Blackert and Siegel (1979) and by Nacke (1979) but gained popularity through, e.g., the organization of the international informetrics conferences from 1987 onwards (see Egghe and Rousseau (1988, 1990)). However, the field of informetrics (though not the name) had already started in the first half of the twentieth century, e.g. with the works of Lotka, Bradford and Zipf (see Lotka (1926), Bradford (1934), Zipf (1949); for the law of Zipf, see also Condon (1928) or even Estoup (1916)). The term bibliometrics was coined in Pritchard (1969) and the term scientometrics was coined in Nalimov and Mul'čenko (1969), in Russian: naukometrija. For more on the history of these and other terms see White and McCain (1989), Ikpaahindi (1985), Lawani (1981), Tague-Sutcliffe (1994), Brookes (1990), Wilson (1999), Egghe and Rousseau (1990) and Egghe (2005). That the field of informetrics grew in the twentieth century is evident, but this growth has become clearer and clearer over the last decades. Lipetz (1999) describes an exponential growth of JASIS, now called JASIST (Journal of the American Society for Information Science and Technology, which turned 50 in 1999), in terms of number of papers, number of authors, and even average number of references per paper. Lipetz (1999) also shows that the average number of authors per paper is increasing. Authors are also responsible for a multidisciplinary growth of the field of informetrics (see Summers, Oppenheim, Meadows, McKnight and Kinnell (1999)), which also indicates the influence of informetrics on other scientific disciplines. Multidisciplinarity is evident if one looks at the "new" topics which informetrics is covering: the metrics of the web, the Internet, intranets and other social networks such as citation or collaboration networks. In general one can say that the creation of the "information society" is responsible for the growth of the field of informetrics.
So we can say that the field of informetrics nowadays comprises the fast-growing field of webometrics (see Hood and Wilson (2001)). (Netometrics, as introduced in Bossy (1995), would be a better term, since it also covers non-web activities, but it does not seem to have become popular; see Hood and Wilson (2001).) Cybermetrics also exists (it is even the name of an electronic journal under the editorial direction of I. Aguillo), but it is not clear whether it will some day overtake the term webometrics. Schubert (2002) describes 50 volumes of the journal Scientometrics and likewise concludes that the number of authors is increasing and that they collaborate more and more, in the sense that the average number of authors per paper increases (the same conclusions as in Lipetz (1999)). Schubert also remarks that there is no evidence that the degree of "hardness" of the field of informetrics is increasing, a point to keep in mind for the future evolution of this field. He and Spink (2002) describe foreign authorship in JASIST and JDOC (Journal of Documentation) and show that its share in these journals is becoming larger and larger, indicating an increasing internationalization of the field of informetrics. The latter is also illustrated in Bar-Ilan (2000), which observes that the articles in the Proceedings of the international informetrics conferences are increasingly cited. The extension of information science to networks and the information society in general has the consequence that more and more data are gathered in an automatic way. This implies that data can be gathered much faster than before, but also that accuracy is dropping. There are several reasons for this. First of all, one gets data from a documentary system (e.g. an OPAC, a secondary or primary electronic database, or a digital library) but, since there is in general no clear definition of the topics due to a lack of standards (see Glänzel (1996), Rousseau (2002)), one is not completely sure of what one gets. In addition, an electronic system may suffer from system breakdown, in which case one is obliged to make inexact interpolations. Data on electronic services and activities gathered through the web (and many data are) are also of a different nature than data gathered directly from a computer system. An example is connect time versus number of connections. When entering directly or via telephone lines into a computer system (e.g. an OPAC or the DIALOG system) one is able to report on the connect time. When using a documentary system via the web one cannot report on connect time anymore, but only on the number of connections (cf. the well-known DIALOG units). Networks such as the web typically have connections between the sites, and in this context one talks about hyperlinks (in-links when a site receives a hyperlink from another site; out-links when a site gives a hyperlink to another site). Their informetric distributions have been studied even in journals such as Nature and Science (see e.g. Albert, Jeong and Barabási (1999), Barabási and Albert (1999) and Huberman, Pirolli, Pitkow and Lukose (1998)) but also in physics journals (see e.g. Barabási, Jeong, Néda, Ravasz, Schubert and Vicsek (2002) and Adamic, Lukose, Puniyani and Huberman (2001)), again showing the interdisciplinary character of present-day informetrics.
Hyperlinks are usually compared with the better-known citations, but they are very different in nature: hyperlinks cannot be used for aging or author collaboration studies, since they are not dated and are usually anonymous. Hyperlinks can be used for determining "authoritative" web sites or documents (see CLEVER (1999)), which in turn can be used in information retrieval (IR). Also in IR, quantitative methods, e.g. for the evaluation of searches and systems, have drastically changed because of the way search engines deliver search results: they give the retrieved documents in decreasing order of expected relevance, which creates the need for evaluation measures on ordered sets instead of the classical ones (e.g. recall, precision, Jaccard, Cosine, Dice, ...) on ordinary sets (cf. Egghe and Michel (2002, 2003)). It is very important to mention that the fact that most articles nowadays appear in electronic journals and/or repositories opens the new possibility of measuring the use of articles not only by citations or web citations but also by their number of downloads. Downloads can be considered electronic versions of reading or photocopying a paper article. The latter indicators were never studied, due to the great difficulty of manual data gathering. Hence the study of downloads and their relation with (web) citations is intriguing; see Antelman (2004), Brody and Harnad (2004), Harnad and Brody (2004a,b) and Perneger (2004). It is clear from the above that the extension of informetrics to electronic, e.g. web, activities gives a boost to the challenge of data gathering and data management and hence to the growth of the field. The resulting need for more publication outlets is also clearly seen if one looks at the two important informetrics journals, JASIST and Scientometrics. JASIST decided in 1998 to increase its publication flow from 12 to 14 issues a year. Scientometrics is publishing, from 2005 onwards, 12 issues instead of 9 issues per year. In this connection I want to give a piece of personal advice, which is shared by the informetrics colleagues I contacted recently. The increase in publication outlets also increases the need for refereeing. It is my personal feeling that one should expand the list of possible referees in informetrics to younger informetricians: my refereeing workload doubled in 2004, a phenomenon that is recognized by colleague informetricians. Apart from JASIST and Scientometrics, the present journal Information Processing and Management (IPM) is the only journal that regularly publishes papers devoted to informetrics studies, although, in general, IPM is more focused on the subfield of informetrics dealing with quantitative aspects of IR. Elsevier, the publisher of IPM, is interested in whether a more pronounced general informetrics component is possible in IPM. Hereby we want to stress that the principal goal is to give an outlet to high-quality papers in informetrics. High-quality papers are papers that present good mathematical (probabilistic) models and explanations of informetric regularities (in the broad sense) and/or papers in which interesting and important data gathering is presented. The former request (good models and explanations) can be understood in the framework of increasing the degree of "hardness" of the field mentioned above.

Journal ArticleDOI
TL;DR: A new method of document re-ranking is proposed that improves document scores using inter-document relationships, expressed as distances, which can be obtained from the text, hyperlinks or other information.
Abstract: Lately there has been intensive research into the possibilities of using additional information about documents (such as hyperlinks) to improve retrieval effectiveness. This is called data fusion, and it is based on the intuitive principle that different document and query representations, or different methods, lead to a better estimation of the documents' relevance scores. In this paper we propose a new method of document re-ranking that enables us to improve document scores using inter-document relationships. These relationships are expressed as distances and can be obtained from the text, hyperlinks or other information. The method formalizes the intuition that strongly related documents should not be assigned very different weights.
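
The re-ranking intuition, that strongly related documents should not receive very different weights, can be sketched as smoothing each document's score toward those of its near neighbours. The mixing weight, the inverse-distance weighting and the data below are illustrative, not the paper's exact formulation.

```python
# Sketch of distance-based re-ranking: pull each document's retrieval
# score toward the scores of strongly related (nearby) documents.
def rerank(scores, distances, alpha=0.3):
    """scores: doc -> initial score; distances: (doc, doc) -> distance."""
    new = {}
    for d, s in scores.items():
        neighbours = [(other, dist) for (a, other), dist in distances.items()
                      if a == d]
        if not neighbours:
            new[d] = s
            continue
        # Closer documents get more influence (weight = 1 / (1 + distance)).
        wsum = sum(1 / (1 + dist) for _, dist in neighbours)
        smoothed = sum(scores[o] / (1 + dist) for o, dist in neighbours) / wsum
        new[d] = (1 - alpha) * s + alpha * smoothed
    return new

scores = {"d1": 0.9, "d2": 0.2, "d3": 0.5}
distances = {("d1", "d2"): 0.1, ("d2", "d1"): 0.1,
             ("d2", "d3"): 2.0, ("d3", "d2"): 2.0}
print(rerank(scores, distances))
```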

Journal ArticleDOI
TL;DR: Results of an empirical study analyzing when during the search process users seek automated searching assistance from the system, and when they implement the assistance, indicate that users are willing to accept automated assistance during the search process, especially after viewing results and locating relevant documents.
Abstract: Searchers seldom make use of the advanced searching features that could improve the quality of the search process because they do not know these features exist, do not understand how to use them, or do not believe they are effective or efficient. Information retrieval systems offering automated assistance could greatly improve search effectiveness by suggesting or implementing assistance automatically. A critical issue in designing such systems is determining when the system should intervene in the search process. In this paper, we report the results of an empirical study analyzing when during the search process users seek automated searching assistance from the system and when they implement the assistance. We designed a fully functional automated assistance application and conducted a study with 30 subjects interacting with the system. The study used a 2 GB TREC document collection and TREC topics. Approximately 50% of the subjects sought assistance, and over 80% of those implemented that assistance. Results from the evaluation indicate that users are willing to accept automated assistance during the search process, especially after viewing results and locating relevant documents. We discuss implications for interactive information retrieval system design and directions for future research.

Journal ArticleDOI
TL;DR: It is found that insufficient attention has been given to the Web as a resource for multilingual research, and to languages which are spoken by hundreds of millions of people in the world but have been mainly neglected by the CLIR research community.
Abstract: This introductory paper covers not only the research content of the articles in this special issue of IP&M but attempts to characterize the state-of-the-art in the Cross-Language Information Retrieval (CLIR) domain. We present our view of some major directions for CLIR research in the future. In particular, we find that insufficient attention has been given to the Web as a resource for multilingual research, and to languages which are spoken by hundreds of millions of people in the world but have been mainly neglected by the CLIR research community. In addition, we find that most CLIR evaluation has focussed narrowly on the news genre to the exclusion of other important genres such as scientific and technical literature. The paper concludes by describing an ambitious 5-year research plan proposed by James Mayfield and Paul McNamee.

Journal ArticleDOI
TL;DR: Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting; analysis suggests BM25 cannot be improved using structure weighting.
Abstract: Existing ranking schemes assume all term occurrences in a given document are of equal influence. Intuitively, terms occurring in some places should have a greater influence than those elsewhere: an occurrence in an abstract may be more important than an occurrence in the body text. Although this observation is not new, there remains the issue of finding good weights for each structure. Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting. Weights are then selected for the TREC WSJ collection using a genetic algorithm, and the learned weights are tested on an evaluation set of queries. Structure-weighted vector space inner product and structure-weighted probabilistic retrieval show about a 5% improvement in mean average precision over their unstructured counterparts. Structure-weighted BM25 shows nearly no improvement. Analysis suggests BM25 cannot be improved using structure weighting.
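
A sketch of structure weighting applied to BM25: an occurrence's contribution depends on where it occurs, so term frequency becomes a weighted sum over structures (in the spirit of BM25F). The structure weights below are invented; in the paper they are selected by a genetic algorithm.

```python
# Sketch: BM25 term scoring where term frequency is a structure-weighted
# sum over fields (title, abstract, body). Weights are illustrative.
structure_weights = {"title": 3.0, "abstract": 2.0, "body": 1.0}

def weighted_tf(term, doc):
    """doc: structure name -> list of terms."""
    return sum(w * doc.get(s, []).count(term)
               for s, w in structure_weights.items())

def bm25_term(term, doc, doc_len, avg_len, idf, k1=1.2, b=0.75):
    tf = weighted_tf(term, doc)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

doc = {"title": ["structure", "weighting"],
       "body": ["ranking", "with", "structure", "weights"]}
doc_len = sum(len(v) for v in doc.values())
print(bm25_term("structure", doc, doc_len, avg_len=8.0, idf=1.5))
```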

Journal ArticleDOI
TL;DR: The fall-off in use with time must have been a function of the way that libraries arranged their material (in reverse chronological order); a lack of time and patience will inevitably result in readers aborting their searches after a few years, and those few years will be the most recent ones.
Abstract: The publication age or date of documents used (or not used) has long fascinated researchers and practitioners alike. Much of this fascination can be attributed to the weeding opportunities the data is thought to provide for libraries in their never-ending battle to find the space to accommodate their expanding collections. In general, journal article age studies have shown an initial increase in use/citation, then a gradual or sharp decline, depending on the discipline concerned. This characteristic has been termed obsolescence or decay and was largely measured, in the absence of accurate journal usage/borrowing data, by citations. In the sciences the decay rate was shown to be the greatest. This was largely put down to the rapid obsolescence of much scientific content: new research findings, methods or ensuing events rendered the material obsolescent. Of course, when reviewing the data we need to be reminded of the fact that citation studies reveal "use" by authors, whereas library loans or downloads represent actual use by readers, and it is readers that libraries and digital libraries principally target. Clearly the fall-off in use with time must also have been a function of the way that libraries arranged their material (in reverse chronological order); a lack of time and patience will inevitably result in readers aborting their searches after a few years, and those few years will be the most recent ones. Similarly, it must also have been a function of the difficulties of searching hard-copy back volumes/issues in libraries over time.

Journal ArticleDOI
TL;DR: This paper evaluates and compares the retrieval effectiveness of various search models, based on either automatic text-word indexing or on manually assigned controlled descriptors, from a relatively large collection of bibliographic material written in French.
Abstract: This paper evaluates and compares the retrieval effectiveness of various search models, based on either automatic text-word indexing or on manually assigned controlled descriptors. Retrieval is from a relatively large collection of bibliographic material written in French. Moreover, for this French collection we evaluate the improvements that result from combining automatic and manual indexing. First, across various contexts, this study reveals that the combined indexing strategy always obtains the best retrieval performance. Second, when users wish to conduct exhaustive searches with minimal effort, we demonstrate that manually assigned terms are essential. Third, the evaluations presented in this paper reveal the comparative retrieval performances that result from manual and automatic indexing in a variety of circumstances.