
Showing papers in "Information Processing and Management in 2005"


Journal ArticleDOI
TL;DR: In this paper, the authors examined the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences.
Abstract: The field of digital libraries (DLs) coalesced in 1994: the first digital library conferences were held that year, awareness of the World Wide Web was accelerating, and the National Science Foundation awarded $24 million (US) for the Digital Library Initiative (DLI). In this paper we examine the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences. We base our analysis on a common binary undirected network model to represent the co-authorship network, and from it we extract several established network measures. We also introduce a weighted directional network model to represent the co-authorship network, for which we define AuthorRank as an indicator of the impact of an individual author in the network. The results are validated against conference program committee members in the same period. The results show clear advantages of PageRank and AuthorRank over degree, closeness and betweenness centrality metrics. We also investigate the amount and nature of international participation in the Joint Conference on Digital Libraries (JCDL).

828 citations
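
The AuthorRank idea can be sketched in a few lines: build a weighted, directed co-authorship graph and run a PageRank-style computation over it. The sketch below uses networkx and approximates the paper's exclusivity-based edge weights with simple co-authorship counts; the author lists are hypothetical.

```python
# Sketch: ranking authors in a co-authorship network (hypothetical data).
# AuthorRank in the paper uses exclusivity-weighted edges; here we
# approximate edge weights by plain co-authorship counts.
from itertools import combinations
from collections import Counter
import networkx as nx

papers = [  # hypothetical author lists
    ["Liu", "Bollen", "Nelson"],
    ["Bollen", "Nelson"],
    ["Liu", "Van de Sompel"],
]

weights = Counter()
for authors in papers:
    for a, b in combinations(authors, 2):
        weights[(a, b)] += 1
        weights[(b, a)] += 1  # a directed edge in each direction

G = nx.DiGraph()
for (a, b), w in weights.items():
    G.add_edge(a, b, weight=w)

author_rank = nx.pagerank(G, alpha=0.85, weight="weight")
for author, score in sorted(author_rank.items(), key=lambda x: -x[1]):
    print(f"{author}: {score:.3f}")
```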


Journal ArticleDOI
TL;DR: Empirical results are presented showing that the patent task performance process involves highly collaborative aspects throughout the stages of the information seeking and retrieval process, and a refined IR framework involving collaborative aspects is proposed.
Abstract: In this article we investigate the expressions of collaborative activities within information seeking and retrieval processes (IS&R). Generally, information seeking and retrieval is regarded as an individual and isolated process in IR research. We assume that an IS&R situation is not merely an individual effort, but inherently involves various collaborative activities. We present empirical results from a real-life and information-intensive setting within the patent domain, showing that the patent task performance process involves highly collaborative aspects throughout the stages of the information seeking and retrieval process. Furthermore, we show that these activities may be categorised and related to different stages in an information seeking and retrieval process. Therefore, the assumption that information retrieval performance is purely individual needs to be reconsidered. Finally, we also propose a refined IR framework involving collaborative aspects.

298 citations


Journal ArticleDOI
TL;DR: One approach is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries, while the other uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map.
Abstract: This paper proposes two approaches to text summarization: a modified corpus-based approach (MCBA) and an LSA-based T.R.M. approach (LSA + T.R.M.). The first is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significance of different sentence positions, and (2) the score function is trained by a genetic algorithm (GA) to obtain a suitable combination of feature weights. The second approach uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA + T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two novel approaches were measured at several compression rates on a data corpus composed of 100 political articles. When the compression rate was 30%, average f-measures of 49% for MCBA, 52% for MCBA + GA, and 44% and 40% for LSA + T.R.M. at the single-document and corpus levels, respectively, were achieved.

264 citations
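
The MCBA score function is a weighted combination of sentence features. A minimal sketch, with illustrative feature values and uniform weights standing in for the GA-trained ones:

```python
# Sketch of an MCBA-style sentence score: a weighted combination of
# sentence features. Feature values and weights here are illustrative;
# in the paper the weights are learned with a genetic algorithm.
def score_sentence(features, weights):
    """features and weights: dicts keyed by feature name."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for one sentence (normalised to [0, 1]).
features = {
    "position": 0.9,          # early sentences tend to matter more
    "positive_keyword": 0.6,
    "negative_keyword": -0.2,
    "centrality": 0.5,        # similarity to the rest of the document
    "title_resemblance": 0.7,
}
weights = {name: 1.0 for name in features}  # a GA would tune these

print(score_sentence(features, weights))
```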


Journal ArticleDOI
TL;DR: This paper examines why journal articles that have been posted without charge on the internet are more heavily cited than those that have not been.
Abstract: It has been shown (Lawrence, S. (2001). Online or invisible? Nature, 411, 521) that journal articles which have been posted without charge on the internet are more heavily cited than those which have not been. Using data from the NASA Astrophysics Data System (ads.harvard.edu) and from the ArXiv e-print archive at Cornell University (arXiv.org) we examine the causes of this effect.

237 citations


Journal ArticleDOI
TL;DR: Although social network metrics and ISI IF rankings deviate moderately for citation-based journal networks, they differ considerably for journal networks derived from download data, which raises questions regarding the validity of the ISI IF as the sole assessment of journal impact.
Abstract: We generated networks of journal relationships from citation and download data, and determined journal impact rankings from these networks using a set of social network centrality metrics. The resulting journal impact rankings were compared to the ISI IF. Results indicate that, although social network metrics and ISI IF rankings deviate moderately for citation-based journal networks, they differ considerably for journal networks derived from download data. We believe the results represent a unique aspect of general journal impact that is not captured by the ISI IF. These results furthermore raise questions regarding the validity of the ISI IF as the sole assessment of journal impact, and suggest the possibility of devising impact metrics based on usage information in general.

211 citations
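
The comparison can be sketched as follows: compute centrality metrics over a weighted journal citation network and correlate each resulting ranking with an IF-style ranking. Journal names, edge weights and impact factors below are invented for illustration; the paper's metrics and data differ in scale.

```python
# Sketch: journal impact rankings from a citation network via social
# network centrality metrics, compared against ISI IF-style rankings.
import networkx as nx
from scipy.stats import spearmanr

G = nx.DiGraph()
G.add_weighted_edges_from([       # hypothetical citation counts
    ("J1", "J2", 120), ("J2", "J1", 80),
    ("J3", "J1", 40), ("J3", "J2", 60), ("J1", "J3", 10),
])

journals = list(G)
rankings = {
    "pagerank": nx.pagerank(G, weight="weight"),
    "in_degree": dict(G.in_degree(weight="weight")),
    "betweenness": nx.betweenness_centrality(G),
}
isi_if = {"J1": 2.1, "J2": 3.4, "J3": 0.9}  # hypothetical impact factors

for name, scores in rankings.items():
    rho, _ = spearmanr([scores[j] for j in journals],
                       [isi_if[j] for j in journals])
    print(f"{name} vs IF: rho = {rho:.2f}")
```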


Journal ArticleDOI
TL;DR: This longitudinal benchmark study shows that European Web searching is evolving in certain directions, and European search topics are broadening, with a notable percentage decline in sexual and pornographic searching.
Abstract: The Web has become a worldwide source of information and a mainstream business tool. It is changing the way people conduct the daily business of their lives. As these changes are occurring, we need to understand what Web searching trends are emerging within the various global regions. What are the regional differences and trends in Web searching, if any? What is the effectiveness of Web search engines as providers of information? As part of a body of research studying these questions, we have analyzed two data sets collected from queries submitted, mainly by European users, to AlltheWeb.com on 6 February 2001 and 28 May 2002. AlltheWeb.com is a major and highly rated European search engine. Each data set contains approximately a million queries submitted by over 200,000 users and spans a 24-hour period. This longitudinal benchmark study shows that European Web searching is evolving in certain directions. Query length declined somewhat, and queries remained extremely simple. European search topics are broadening, with a notable percentage decline in sexual and pornographic searching. The majority of Web searchers view fewer than five Web documents per query, spending only seconds on each document. Approximately 50% of the Web documents viewed by these European users were topically relevant. We discuss the implications for Web information systems and information content providers.

196 citations


Journal ArticleDOI
TL;DR: A new approach to creating a patent classification system to replace the IPC or UPC system for conducting patent analysis and management is proposed, based on the co-citation analysis of bibliometrics, to assist patent managers in understanding the basic patents for a specific industry.
Abstract: The paper proposes a new approach to creating a patent classification system to replace the IPC or UPC system for conducting patent analysis and management. The new approach is based on the co-citation analysis of bibliometrics. The traditional approach to the management of patents, which is based on either the IPC or UPC, is too general to meet the needs of specific industries. In addition, some patents are placed in incorrect categories, making it difficult for enterprises to carry out R&D planning, technology positioning, patent strategy-making and technology forecasting. Therefore, it is essential to develop a patent classification system that is adapted to the characteristics of a specific industry. The analysis in this approach is divided into three phases. Phase I selects appropriate databases in which to conduct patent searches, according to the subject and objective of the study, and then selects basic patents. Phase II uses the co-citation frequency of the basic patent pairs to assess their similarity. Phase III uses factor analysis to establish a classification system and assess the efficiency of the proposed approach. The main contribution of this approach is to develop a patent classification system based on patent similarities that assists patent managers in understanding the basic patents for a specific industry, the relationships among categories of technologies, and the evolution of a technology category.

189 citations
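
Phase II reduces to counting how often pairs of basic patents are cited together. A minimal sketch with hypothetical citing patents; the resulting co-citation counts would feed the factor analysis of Phase III.

```python
# Sketch of Phase II: co-citation counts between basic patents as a
# similarity measure. citing_patents maps each citing patent to the
# basic patents it cites (hypothetical data).
from itertools import combinations
from collections import Counter

citing_patents = {
    "US111": {"P1", "P2"},
    "US222": {"P1", "P2", "P3"},
    "US333": {"P2", "P3"},
}

cocitation = Counter()
for cited in citing_patents.values():
    for a, b in combinations(sorted(cited), 2):
        cocitation[(a, b)] += 1

# Pairs co-cited more often are considered more similar.
for pair, count in cocitation.most_common():
    print(pair, count)
```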


Journal ArticleDOI
TL;DR: Full-text analysis and traditional bibliometric methods are serially combined to improve the efficiency of the two individual methods, and the results confirm the main finding of the pilot study, that such a hybrid methodology can be applied to both research evaluation and information retrieval.
Abstract: In the present study, the results of an earlier pilot study by Glenisson, Glanzel and Persson are extended on the basis of larger sets of papers. Full-text analysis and traditional bibliometric methods are serially combined to improve the efficiency of the two individual methods. The text-mining methodology already introduced in the pilot study is applied to the complete publication year 2003 of the journal Scientometrics. Altogether 85 documents that can be considered research articles or notes were selected for this exercise. The outcomes confirm the main results of the pilot study, namely that such a hybrid methodology can be applied to both research evaluation and information retrieval. Nevertheless, the Scientometrics documents published in 2003 cover a much broader and more heterogeneous spectrum of bibliometrics and related research than those analysed in the pilot study. A modified subject classification based on the scheme used in an earlier study by Schoepflin and Glanzel was applied for validation purposes.

150 citations


Journal ArticleDOI
TL;DR: Key results include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cognates and development of a query-specific measure for translation fanout that helps to explain the utility of structured query methods.
Abstract: Cross-language information retrieval (CLIR) systems allow users to find documents written in different languages from that of their query. Simple knowledge structures such as bilingual term lists have proven to be a remarkably useful basis for bridging that language gap. A broad array of dictionary-based techniques have demonstrated utility, but comparison across techniques has been difficult because evaluation results often span only a limited range of conditions. This article identifies the key issues in dictionary-based CLIR, develops unified frameworks for term selection and term translation that help to explain the relationships among existing techniques, and illustrates the effect of those techniques using four contrasting languages for systematic experiments with a uniform query translation architecture. Key results include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cognates and development of a query-specific measure for translation fanout that helps to explain the utility of structured query methods.

123 citations
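
A minimal sketch of the dictionary-based, structured-query idea: each source-language term maps to its set of target-language translations, and the retrieval engine treats each set as synonyms. The #syn rendering below mimics InQuery-style operators in the spirit of Pirkola's structured queries; the bilingual term list is made up. The per-term number of alternatives is the "fanout" the article measures.

```python
# Sketch of dictionary-based query translation with a structured-query
# flavour: alternative translations of one source term form a synonym
# set. The bilingual term list is hypothetical.
bilingual = {
    "bank": ["banque", "rive"],
    "loan": ["prêt", "emprunt"],
}

def translate_query(terms, term_list):
    """Return one synonym set per source term; unknown terms pass through."""
    return [term_list.get(t, [t]) for t in terms]

structured = translate_query(["bank", "loan"], bilingual)
# Rendered, e.g., for an InQuery-like engine:
print(" ".join("#syn(" + " ".join(alts) + ")" for alts in structured))
# -> #syn(banque rive) #syn(prêt emprunt)
```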


Journal ArticleDOI
TL;DR: This paper introduces new sets of features specific to web documents, extracted from the URL and HTML tags within the pages, and concludes which set of features is appropriate for automatic genre classification of web documents.
Abstract: With the increase of information on the Web, it is difficult to quickly find desired information among the documents retrieved by a search engine. One way to address this problem is to classify web documents according to various criteria. Most document classification has focused on the subject or topic of a document. A genre or style is another view of a document, different from its subject or topic, and the genre is also a criterion by which to classify documents. In this paper, we suggest multiple sets of features for classifying the genres of web documents. The basic set of features, proposed in previous studies, is acquired from the textual properties of documents, such as the number of sentences, the frequency of particular words, etc. However, web documents differ from textual documents in that they contain URLs and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from the URL and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features and to discuss their characteristics. Finally, we conclude which set of features is appropriate for the automatic genre classification of web documents.

122 citations
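
A sketch of what URL- and HTML-derived genre features might look like; the concrete feature set below is illustrative, not the paper's exact list.

```python
# Sketch: extracting web-specific genre features from a page's URL and
# HTML tags, using only the standard library.
import re
from urllib.parse import urlparse

def url_features(url):
    parts = urlparse(url)
    path = parts.path.strip("/")
    return {
        "path_depth": path.count("/") + 1 if path else 0,
        "has_tilde": "~" in parts.path,   # often a personal home page
        "has_query": bool(parts.query),
    }

def html_features(html):
    # Count opening tags only (closing tags start with "</").
    tags = [t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)]
    n = max(len(tags), 1)
    return {
        "link_ratio": tags.count("a") / n,
        "image_ratio": tags.count("img") / n,
        "form_count": tags.count("form"),
    }

print(url_features("http://example.edu/~jane/papers/index.html?q=1"))
print(html_features("<html><body><a href='x'>x</a><img src='y'></body></html>"))
```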


Journal ArticleDOI
TL;DR: This paper envisages a Digital Library not only as an information resource where users may submit queries to satisfy their daily information needs, but also as a collaborative working and meeting space for people sharing common interests.
Abstract: The Web, and consequently the information contained in it, is growing rapidly. Every day a huge amount of newly created information is electronically published in Digital Libraries, whose aim is to satisfy users' information needs. In this paper, we envisage a Digital Library not only as an information resource where users may submit queries to satisfy their daily information needs, but also as a collaborative working and meeting space for people sharing common interests. Indeed, we present a personalized collaborative Digital Library environment, where users may organize the information space according to their own subjective view, build communities, become aware of each other, exchange information and knowledge with other users, and get recommendations based on the preference patterns of other users.

Journal ArticleDOI
TL;DR: In this study the rankings of IR systems based on binary and graded relevance in TREC 7 and 8 data are compared and the results show the different character of the measures.
Abstract: In this study the rankings of IR systems based on binary and graded relevance in TREC 7 and 8 data are compared. The relevance of a sample of TREC results is reassessed using a relevance scale with four levels: non-relevant, marginally relevant, fairly relevant, and highly relevant. Twenty-one topics and 90 systems from TREC 7 and 20 topics and 121 systems from TREC 8 form the data. Binary precision, cumulated gain, discounted cumulated gain and normalised discounted cumulated gain are the measures compared. Different weighting schemes for relevance levels are tested with the cumulated gain measures. Kendall's rank correlations are computed to determine to what extent the rankings produced by different measures are similar. Weighting schemes from binary up to schemes emphasising highly relevant documents form a continuum, where the measures correlate strongly at the binary end and less at the heavily weighted end. The results show the different character of the measures.
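
The measures compared can be reproduced compactly. A sketch of cumulated gain, discounted cumulated gain and normalised DCG in the Järvelin-Kekäläinen style, with one illustrative weighting scheme for the four relevance levels:

```python
# Sketch of the graded-relevance measures compared in the study:
# (discounted) cumulated gain and normalised DCG, with a weighting
# scheme mapping relevance levels to gain values.
import math

# Relevance levels: 0 non-, 1 marginally, 2 fairly, 3 highly relevant.
weights = {0: 0, 1: 1, 2: 5, 3: 10}  # one of several possible schemes

def dcg(ranked_levels, b=2):
    """Cumulated gain with a log_b discount from rank b onward."""
    total, out = 0.0, []
    for i, level in enumerate(ranked_levels, start=1):
        gain = weights[level]
        total += gain if i < b else gain / math.log(i, b)
        out.append(total)
    return out

run = [3, 0, 2, 1, 0]               # relevance levels of a ranked result list
ideal = sorted(run, reverse=True)   # best possible ordering
ndcg = [d / i for d, i in zip(dcg(run), dcg(ideal))]
print(ndcg)
```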

Journal ArticleDOI
TL;DR: Webpage visibility in search engine results lists can be improved by increasing the frequency of keywords in the title, in the full-text, and in both the title and full-text.
Abstract: Content characteristics of a webpage include factors such as keyword position in a webpage, keyword duplication, layout, and their combination. These factors may impact webpage visibility in a search engine. Four hypotheses are presented relating to the impact of selected content characteristics on webpage visibility in search engine results lists. Webpage visibility can be improved by increasing the frequency of keywords in the title, in the full-text and in both the title and full-text.

Journal ArticleDOI
TL;DR: A classification of link types in academic environments on the Web provides an insight into the diverse uses of hypertext links on the Internet, and has implications for browsing and ranking in IR systems by differentiating between different types of links.
Abstract: The Web is an enormous set of documents connected through hypertext links created by authors of Web pages. These links have been studied quantitatively, but little has been done so far in order to understand why these links are created. As a first step towards a better understanding, we propose a classification of link types in academic environments on the Web. The classification is multi-faceted and involves different aspects of the source and the target page, the link area and the relationship between the source and the target. Such classification provides an insight into the diverse uses of hypertext links on the Web, and has implications for browsing and ranking in IR systems by differentiating between different types of links. As a case study we classified a sample of links between sites of Israeli academic institutions.

Journal ArticleDOI
TL;DR: This paper reviews state-of-the-art techniques and methods for enhancing effectiveness of cross-language information retrieval (CLIR) and focuses on matching strategies and translation techniques.
Abstract: This paper reviews state-of-the-art techniques and methods for enhancing the effectiveness of cross-language information retrieval (CLIR). The following research issues are covered: (1) matching strategies and translation techniques, (2) methods for solving the problem of translation ambiguity, (3) formal models for CLIR, such as the application of the language model, (4) the pivot language approach, (5) methods for searching multilingual document collections, (6) techniques for combining multiple language resources, etc.

Journal ArticleDOI
TL;DR: The paper concludes that the two-step procedure for indexing is insufficient to explain the indexing process and suggests that the domain-centered approach offers a guide for indexers that can help them manage the complexity of indexing.
Abstract: The paper discusses the notion of steps in indexing, reveals that the document-centered approach to indexing is prevalent, and argues that the document-centered approach is problematic because it blocks out context-dependent factors in the indexing process. A domain-centered approach to indexing is presented as an alternative, and the paper discusses how this approach includes a broader range of analyses and how it requires a new set of actions: analysis of the domain, the users and the indexers. The paper concludes that the two-step procedure for indexing is insufficient to explain the indexing process and suggests that the domain-centered approach offers a guide for indexers that can help them manage the complexity of indexing.

Journal Article
TL;DR: An investigation into the effects of summary length as a function of screen size, where query-biased summaries are used to present retrieval results, exploring whether there is an optimal summary size for three types of device, given their different screen sizes.

Journal ArticleDOI
TL;DR: Findings suggest that metadata is a good mechanism for improving webpage visibility, that the metadata subject field plays a more important role than any other metadata field, and that keywords extracted from the webpage itself, particularly from the title or full-text, are most effective.
Abstract: This paper discusses the impact of metadata implementation in a webpage on its visibility in a search engine results list. Influential internal and external factors of metadata implementation were identified, and how these factors affect webpage visibility in a search engine results list was examined in an experimental study. Findings suggest that metadata is a good mechanism for improving webpage visibility, that the metadata subject field plays a more important role than any other metadata field, and that keywords extracted from the webpage itself, particularly from the title or full-text, are most effective. To maximize the effects, these keywords should come from both the title and the full-text.

Journal ArticleDOI
TL;DR: The topic identification algorithm's performance becomes doubtful in various cases; these cases are explored, and the reasons underlying the inconsistent performance of automatic topic identification are investigated with statistical analysis and experimental design techniques.
Abstract: The analysis of contextual information in search engine query logs enhances the understanding of Web users' search patterns. Obtaining contextual information from Web search engine logs is a difficult task, since users submit only a few queries and search across multiple topics. Identification of topic changes within a search session is an important branch of search engine user behavior analysis. The purpose of this study is to investigate the properties of a specific topic identification methodology in detail and to test its validity. The topic identification algorithm's performance becomes doubtful in various cases. These cases are explored, and the reasons underlying the inconsistent performance of automatic topic identification are investigated with statistical analysis and experimental design techniques.
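
The specific methodology under study is not reproduced here; as a point of reference, a minimal sketch of a common term-overlap heuristic for flagging topic shifts between consecutive queries (session data hypothetical):

```python
# Sketch of a simple topic-change heuristic over a search session:
# flag a topic shift when consecutive queries share no terms. This is
# a simplification of the pattern-based methods examined in such work.
def topic_shifts(queries):
    shifts = []
    for prev, cur in zip(queries, queries[1:]):
        shared = set(prev.lower().split()) & set(cur.lower().split())
        shifts.append(len(shared) == 0)
    return shifts

session = ["jaguar speed", "jaguar top speed", "apple pie recipe"]
print(topic_shifts(session))  # [False, True]
```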

Journal ArticleDOI
TL;DR: A multiple regression model was developed which shows that faculty quality and the language of the university are important predictors for links to a university Web site, and shows that English universities are advantaged.
Abstract: Hyperlink patterns between Canadian university Web sites were analyzed by a mathematical modeling approach. A multiple regression model was developed which shows that faculty quality and the language of the university are important predictors for links to a university Web site. Higher faculty quality means more links. French universities received lower numbers of links to their Web sites than comparable English universities. Analysis of interlinking between pairs of universities also showed that English universities are advantaged. Universities are more likely to link to each other when the geographical distance between them is less than 3000 km, possibly reflecting the east vs. west divide that exists in Canadian society.

Journal ArticleDOI
TL;DR: This study measures how similar the rankings of search engines are on the overlapping results of identical queries retrieved from several search engines, and indicates that the large public search engines on the Web employ considerably different ranking algorithms.
Abstract: The Web has become an information source for professional data gathering. Because of the vast amounts of information on almost all topics, one cannot systematically go over the whole set of results, and therefore must rely on the ordering of the results by the search engine. It is well known that search engines on the Web have low overlap in terms of coverage. In this study we measure how similar the rankings of search engines are on the overlapping results. We compare rankings of results for identical queries retrieved from several search engines. The method is based only on the set of URLs that appear in the answer sets of the engines being compared. For comparing the similarity of rankings of two search engines, the Spearman correlation coefficient is computed. When comparing more than two sets, Kendall's W is used. These are well-known measures, and the statistical significance of the results can be computed. The methods are demonstrated on a set of 15 queries that were submitted to four large Web search engines. The findings indicate that the large public search engines on the Web employ considerably different ranking algorithms.
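
A sketch of the two comparisons on hypothetical result lists: Spearman's correlation over the ranks of overlapping URLs for two engines, and Kendall's W (coefficient of concordance) for two or more.

```python
# Sketch: comparing engine rankings on their overlapping URLs.
from scipy.stats import spearmanr, rankdata

engine_a = ["u1", "u2", "u3", "u4", "u5"]
engine_b = ["u3", "u1", "u5", "u2", "u6"]

overlap = [u for u in engine_a if u in engine_b]
rho, p = spearmanr([engine_a.index(u) for u in overlap],
                   [engine_b.index(u) for u in overlap])
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")

def kendalls_w(rank_matrix):
    """rank_matrix: m raters x n items, each row a ranking (1..n)."""
    m, n = len(rank_matrix), len(rank_matrix[0])
    totals = [sum(col) for col in zip(*rank_matrix)]
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

rows = [rankdata([e.index(u) for u in overlap]) for e in (engine_a, engine_b)]
print(f"Kendall's W = {kendalls_w(rows):.2f}")
```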

Journal ArticleDOI
TL;DR: A real-time measure of bias in Web search engines captures the degree to which the distribution of URLs, retrieved in response to a query, deviates from an ideal or fair distribution for that query.
Abstract: This paper examines a real-time measure of bias in Web search engines. The measure captures the degree to which the distribution of URLs, retrieved in response to a query, deviates from an ideal or fair distribution for that query. This ideal is approximated by the distribution produced by a collection of search engines. Differences between bias and classical retrieval measures are highlighted by examining the possibilities for bias in four extreme cases of recall and precision. The results of experiments examining the influence on bias measurement of subject domains, search engines, and search terms are presented. Three general conclusions are drawn: (1) the performance of search engines can be distinguished with the aid of the bias measure; (2) bias values depend on the subject matter under consideration; (3) choice of search terms does not account for much of the variance in bias values. These conclusions underscore the need to develop "bias profiles" for search engines.
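
The bias idea can be sketched as a distance between an engine's URL distribution for a query and the pooled ("ideal") distribution produced by a collection of engines; the L1 distance used below is illustrative, and the paper's exact formulation may differ.

```python
# Sketch of the bias measure: deviation of one engine's URL
# distribution from a pooled distribution over several engines.
from collections import Counter

def distribution(urls):
    counts = Counter(urls)
    total = sum(counts.values())
    return {u: c / total for u, c in counts.items()}

def bias(engine_urls, pooled_urls):
    p, q = distribution(engine_urls), distribution(pooled_urls)
    support = set(p) | set(q)
    # Total variation (halved L1) distance between the distributions.
    return sum(abs(p.get(u, 0) - q.get(u, 0)) for u in support) / 2

pool = ["u1", "u2", "u3", "u1", "u4", "u2", "u1"]  # pooled engine results
print(bias(["u1", "u1", "u5"], pool))
```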

Journal ArticleDOI
TL;DR: This work proposes to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which the truncated SVD is applied, and shows that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.
Abstract: The text retrieval method using latent semantic indexing (LSI) technique with truncated singular value decomposition (SVD) has been intensively studied in recent years. The SVD reduces the noise contained in the original representation of the term-document matrix and improves the information retrieval accuracy. Recent studies indicate that SVD is mostly useful for small homogeneous data collections. For large inhomogeneous datasets, the performance of the SVD based text retrieval technique may deteriorate. We propose to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which we apply the truncated SVD. Our experimental results show that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.
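
A minimal sketch of the clustered-SVD strategy using scikit-learn: partition the collection with k-means, then fit a truncated SVD within each cluster rather than one SVD over the whole inhomogeneous set. Corpus and parameter choices are illustrative.

```python
# Sketch: cluster the documents, then apply truncated SVD (LSI)
# within each cluster instead of over the whole collection.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stock market trading", "market prices and stocks",
    "protein folding structure", "gene expression in proteins",
]
X = TfidfVectorizer().fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for k in range(2):
    idx = [i for i, l in enumerate(labels) if l == k]
    svd = TruncatedSVD(n_components=1).fit(X[idx])
    print(f"cluster {k}: docs {idx}, "
          f"explained variance {svd.explained_variance_ratio_.sum():.2f}")
```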

Journal ArticleDOI
Leo Egghe
TL;DR: This editorial introductory paper first discusses the reasons for the clear growth of the field of informetrics, citing among other evidence the exponential growth of JASIS, and then discusses the content of the papers published in this special issue of Information Processing and Management.
Abstract: This editorial introductory paper first discusses the reasons for the clear growth of the field of informetrics (bibliometrics, scientometrics, webometrics, ...). This growth has led some journals to increase their number of volumes or the number of issues per volume. The journal Information Processing and Management decided to devote two special issues (the one here and another to come in 2006) to the broad topic "Informetrics", where the scope of these special issues is to attract good papers dealing with gathering important data sets and/or presenting original models and explanations. We then briefly discuss the content of the papers that are published in this special issue. They deal with models, mapping of science (cocitation, coword analysis), web sites and search engines, collaboration in digital libraries, and the newest topic in informetrics: use of and access to articles in digital libraries.

I. THE GROWTH OF THE FIELD OF INFORMETRICS

In this introductory paper, we will use the term "informetrics" as the broad term comprising all -metrics studies related to information science, including bibliometrics (bibliographies, libraries, ...), scientometrics (science policy, citation analysis, research evaluation, ...) and webometrics (metrics of the web, the Internet or other social networks such as citation or collaboration networks). The term informetrics was introduced by Blackert and Siegel (1979) and by Nacke (1979) but gained popularity through, e.g., the organization of the international informetrics conferences from 1987 onwards (see Egghe and Rousseau (1988, 1990)). However, the field of informetrics (though not the name) had already started in the first half of the twentieth century, e.g. with the works of Lotka, Bradford and Zipf (see Lotka (1926), Bradford (1934), Zipf (1949); for the law of Zipf, see also Condon (1928) or even Estoup (1916)). The term bibliometrics was coined in Pritchard (1969) and the term scientometrics was coined in Nalimov and Mul'čenko (1969), in Russian: naukometrija. For more on the history of these and other terms see White and McCain (1989), Ikpaahindi (1985), Lawani (1981), Tague-Sutcliffe (1994), Brookes (1990), Wilson (1999), Egghe and Rousseau (1990) and Egghe (2005). That the field of informetrics grew in the twentieth century is evident, but this growth has become clearer and clearer over the last decades. Lipetz (1999) describes an exponential growth of JASIS, now called JASIST (Journal of the American Society for Information Science and Technology, which turned 50 in 1999), in terms of number of papers, number of authors, and even average number of references per paper. Lipetz (1999) also shows that the average number of authors per paper is increasing. Authors are also responsible for a multidisciplinary growth of the field of informetrics (see Summers, Oppenheim, Meadows, McKnight and Kinnell (1999)), which also indicates the influence of informetrics on other scientific disciplines. Multidisciplinarity is evident if one looks at the "new" topics which informetrics is covering: the metrics of the web, the Internet, intranets and other social networks such as citation or collaboration networks. In general one can say that the creation of the "information society" is responsible for the growth of the field of informetrics.
So we can say that the field of informetrics nowadays comprises the fast-growing field of webometrics (see Hood and Wilson (2001)). (Netometrics, as introduced in Bossy (1995), would be a better term, since it also covers non-web activities, but it does not seem to have become popular; see Hood and Wilson (2001).) Cybermetrics also exists (it is even the name of an electronic journal under the editorial direction of I. Aguillo), but it is not clear whether it will some day overtake the term webometrics. Schubert (2002) describes 50 volumes of the journal Scientometrics and likewise concludes that the number of authors is increasing and that they collaborate more and more, in the sense that the average number of authors per paper increases (the same conclusions as in Lipetz (1999)). Schubert also remarks that there is no evidence that the degree of "hardness" of the field of informetrics is increasing, a point to keep in mind for the future evolution of this field. He and Spink (2002) describe foreign authorship in JASIST and JDOC (Journal of Documentation) and show that its share in these journals is becoming larger and larger, indicating an increasing internationalization of the field of informetrics. The latter is also illustrated in Bar-Ilan (2000), which observes that the articles in the Proceedings of the international informetrics conferences are increasingly cited. The extension of information science to networks and the information society in general has the consequence that more and more data are gathered in an automatic way. This implies that data can be gathered much faster than before, but also that accuracy is dropping. There are several reasons for this. First of all, one gets data from a documentary system (e.g. an OPAC, a secondary or primary electronic database, or a digital library) but, since there is in general no clear definition of the topics due to a lack of standards (see Glänzel (1996), Rousseau (2002)), one is not completely sure of what one gets. In addition, an electronic system may suffer from system breakdown, in which case one is obliged to make inexact interpolations. Data on electronic services and activities gathered through the web (and many data are) are also of a different nature than data gathered directly from a computer system. An example is connect time versus number of connections. When entering directly or via telephone lines into a computer system (e.g. an OPAC or the DIALOG system) one is able to report on the connect time. When using a documentary system via the web one cannot report on connect time anymore, but only on the number of connections (cf. the well-known DIALOG units). Networks such as the web typically have connections between the sites, and in this context one talks about hyperlinks (in-links when a site receives a hyperlink from another site; out-links when a site gives a hyperlink to another site). Their informetric distributions have been studied even in journals such as Nature and Science (see e.g. Albert, Jeong and Barabási (1999), Barabási and Albert (1999) and Huberman, Pirolli, Pitkow and Lukose (1998)) but also in physics journals (see e.g. Barabási, Jeong, Néda, Ravasz, Schubert and Vicsek (2002) and Adamic, Lukose, Puniyani and Huberman (2001)), again showing the interdisciplinary character of present-day informetrics.
Hyperlinks are usually compared with the better-known citations, but they are very different in nature: hyperlinks cannot be used for aging or author collaboration studies, since they are not dated and are usually anonymous. Hyperlinks can be used for determining "authoritative" web sites or documents (see CLEVER (1999)), which in turn can be used in information retrieval (IR). Also in IR, quantitative methods, e.g. for the evaluation of searches and systems, have drastically changed because of the way search engines deliver search results: they give the retrieved documents in decreasing order of expected relevance, which creates the need for evaluation measures on ordered sets instead of the classical ones (e.g. recall, precision, Jaccard, Cosine, Dice, ...) on ordinary sets (cf. Egghe and Michel (2002, 2003)). It is very important to mention that the fact that most articles nowadays appear in electronic journals and/or repositories opens the new possibility of measuring the use of articles not only by citations or web citations but also by their number of downloads. Downloads can be considered electronic versions of reading or photocopying a paper article. The latter indicators were never studied, due to the great difficulty of manual data gathering. Hence the study of downloads and their relation with (web) citations is intriguing; see Antelman (2004), Brody and Harnad (2004), Harnad and Brody (2004a,b) and Perneger (2004). It is clear from the above that the extension of informetrics to electronic, e.g. web, activities gives a boost to the challenge of data gathering and data management and hence to the growth of the field. The resulting need for more publication outlets is also clearly seen if one looks at the two important informetrics journals, JASIST and Scientometrics. JASIST decided in 1998 to increase its publication flow from 12 to 14 issues a year. Scientometrics is publishing, from 2005 onwards, 12 issues instead of 9 issues per year. In this connection I want to give a piece of personal advice, which is shared by the informetrics colleagues I contacted recently. The increase in publication outlets also increases the need for refereeing. It is my personal feeling that one should expand the list of possible referees in informetrics to younger informetricians: my refereeing workload doubled in 2004, a phenomenon that is recognized by colleague informetricians. Apart from JASIST and Scientometrics, the present journal Information Processing and Management (IPM) is the only journal that regularly publishes papers devoted to informetrics studies, although, in general, IPM is more focused on the subfield of informetrics dealing with quantitative aspects of IR. Elsevier, the publisher of IPM, is interested in whether a more pronounced general informetrics component is possible in IPM. Hereby we want to stress that the principal goal is to give an outlet to high-quality papers in informetrics. High-quality papers are papers that present good mathematical (probabilistic) models and explanations of informetric regularities (in the broad sense) and/or papers in which interesting and important data gathering is presented. The former request (good models and explanations) can be understood in the framework of increasing the degree of "hardness" of the field mentioned above.

Journal ArticleDOI
TL;DR: A new method of document re-ranking is proposed that improves document scores using inter-document relationships, expressed as distances, which can be obtained from the text, hyperlinks or other information.
Abstract: Lately there has been intensive research into the possibilities of using additional information about documents (such as hyperlinks) to improve retrieval effectiveness. This is called data fusion, and it is based on the intuitive principle that different document and query representations, or different methods, lead to a better estimation of the documents' relevance scores. In this paper we propose a new method of document re-ranking that enables us to improve document scores using inter-document relationships. These relationships are expressed as distances and can be obtained from the text, hyperlinks or other information. The method formalizes the intuition that strongly related documents should not be assigned very different weights.
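
The re-ranking intuition, that strongly related documents should not receive very different weights, can be sketched as smoothing each document's score toward those of its near neighbours. The mixing weight, the inverse-distance weighting and the data below are illustrative, not the paper's exact formulation.

```python
# Sketch of distance-based re-ranking: pull each document's retrieval
# score toward the scores of strongly related (nearby) documents.
def rerank(scores, distances, alpha=0.3):
    """scores: doc -> initial score; distances: (doc, doc) -> distance."""
    new = {}
    for d, s in scores.items():
        neighbours = [(other, dist) for (a, other), dist in distances.items()
                      if a == d]
        if not neighbours:
            new[d] = s
            continue
        # Closer documents get more influence (weight = 1 / (1 + distance)).
        wsum = sum(1 / (1 + dist) for _, dist in neighbours)
        smoothed = sum(scores[o] / (1 + dist) for o, dist in neighbours) / wsum
        new[d] = (1 - alpha) * s + alpha * smoothed
    return new

scores = {"d1": 0.9, "d2": 0.2, "d3": 0.5}
distances = {("d1", "d2"): 0.1, ("d2", "d1"): 0.1,
             ("d2", "d3"): 2.0, ("d3", "d2"): 2.0}
print(rerank(scores, distances))
```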

Journal ArticleDOI
TL;DR: Results of an empirical study analyzing when during the search process users seek automated searching assistance from the system, and when they implement the assistance, indicate that users are willing to accept automated assistance during the search process, especially after viewing results and locating relevant documents.
Abstract: Searchers seldom make use of the advanced searching features that could improve the quality of the search process because they do not know these features exist, do not understand how to use them, or do not believe they are effective or efficient. Information retrieval systems offering automated assistance could greatly improve search effectiveness by suggesting or implementing assistance automatically. A critical issue in designing such systems is determining when the system should intervene in the search process. In this paper, we report the results of an empirical study analyzing when during the search process users seek automated searching assistance from the system and when they implement the assistance. We designed a fully functional automated assistance application and conducted a study with 30 subjects interacting with the system. The study used a 2 GB TREC document collection and TREC topics. Approximately 50% of the subjects sought assistance, and over 80% of those implemented that assistance. Results from the evaluation indicate that users are willing to accept automated assistance during the search process, especially after viewing results and locating relevant documents. We discuss implications for interactive information retrieval system design and directions for future research.

Journal ArticleDOI
TL;DR: It is found that insufficient attention has been given to the Web as a resource for multilingual research, and to languages which are spoken by hundreds of millions of people in the world but have been mainly neglected by the CLIR research community.
Abstract: This introductory paper covers not only the research content of the articles in this special issue of IP&M but attempts to characterize the state-of-the-art in the Cross-Language Information Retrieval (CLIR) domain. We present our view of some major directions for CLIR research in the future. In particular, we find that insufficient attention has been given to the Web as a resource for multilingual research, and to languages which are spoken by hundreds of millions of people in the world but have been mainly neglected by the CLIR research community. In addition, we find that most CLIR evaluation has focussed narrowly on the news genre to the exclusion of other important genres such as scientific and technical literature. The paper concludes by describing an ambitious 5-year research plan proposed by James Mayfield and Paul McNamee.

Journal ArticleDOI
TL;DR: Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting; analysis suggests BM25 cannot be improved using structure weighting.
Abstract: Existing ranking schemes assume all term occurrences in a given document are of equal influence. Intuitively, terms occurring in some places should have a greater influence than those elsewhere: an occurrence in an abstract may be more important than an occurrence in the body text. Although this observation is not new, there remains the issue of finding good weights for each structure. Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting. Weights are then selected for the TREC WSJ collection using a genetic algorithm, and the learned weights are tested on an evaluation set of queries. Structure-weighted vector space inner product and structure-weighted probabilistic retrieval show about a 5% improvement in mean average precision over their unstructured counterparts. Structure-weighted BM25 shows nearly no improvement. Analysis suggests BM25 cannot be improved using structure weighting.
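
A sketch of structure weighting applied to BM25: an occurrence's contribution depends on where it occurs, so term frequency becomes a weighted sum over structures (in the spirit of BM25F). The structure weights below are invented; in the paper they are selected by a genetic algorithm.

```python
# Sketch: BM25 term scoring where term frequency is a structure-weighted
# sum over fields (title, abstract, body). Weights are illustrative.
structure_weights = {"title": 3.0, "abstract": 2.0, "body": 1.0}

def weighted_tf(term, doc):
    """doc: structure name -> list of terms."""
    return sum(w * doc.get(s, []).count(term)
               for s, w in structure_weights.items())

def bm25_term(term, doc, doc_len, avg_len, idf, k1=1.2, b=0.75):
    tf = weighted_tf(term, doc)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

doc = {"title": ["structure", "weighting"],
       "body": ["ranking", "with", "structure", "weights"]}
doc_len = sum(len(v) for v in doc.values())
print(bm25_term("structure", doc, doc_len, avg_len=8.0, idf=1.5))
```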

Journal ArticleDOI
TL;DR: The fall-off in use with time must have been a function of the way that libraries arranged their material (in reverse chronological order); a lack of time and patience will inevitably result in readers aborting their searches after a few years, and those few years will be the most recent ones.
Abstract: The publication age or date of documents used (or not used) has long fascinated researchers and practitioners alike. Much of this fascination can be attributed to the weeding opportunities the data is thought to provide for libraries in their never-ending battle to find the space to accommodate their expanding collections. In general, journal article age studies have shown an initial increase in use/citation, then a gradual or sharp decline, depending on the discipline concerned. This characteristic has been termed obsolescence or decay and was largely measured, in the absence of accurate journal usage/borrowing data, by citations. In the sciences the decay rate was shown to be the greatest. This was largely put down to the rapid obsolescence of much scientific content: new research findings, methods or ensuing events rendered the material obsolescent. Of course, when reviewing the data we need to be reminded of the fact that citation studies reveal "use" by authors, whereas library loans or downloads represent actual use by readers, and it is readers that libraries and digital libraries principally target. Clearly the fall-off in use with time must also have been a function of the way that libraries arranged their material (in reverse chronological order); a lack of time and patience will inevitably result in readers aborting their searches after a few years, and those few years will be the most recent ones. Similarly, it must also have been a function of the difficulties of searching hard-copy back volumes/issues in libraries over time.

Journal ArticleDOI
TL;DR: This paper evaluates and compares the retrieval effectiveness of various search models, based on either automatic text-word indexing or on manually assigned controlled descriptors, from a relatively large collection of bibliographic material written in French.
Abstract: This paper evaluates and compares the retrieval effectiveness of various search models, based on either automatic text-word indexing or on manually assigned controlled descriptors. Retrieval is from a relatively large collection of bibliographic material written in French. Moreover, for this French collection we evaluate the improvements that result from combining automatic and manual indexing. First, across various contexts, this study reveals that the combined indexing strategy always obtains the best retrieval performance. Second, when users wish to conduct exhaustive searches with minimal effort, we demonstrate that manually assigned terms are essential. Third, the evaluations presented in this paper reveal the comparative retrieval performances that result from manual and automatic indexing in a variety of circumstances.