Author

Gianmaria Silvello

Bio: Gianmaria Silvello is an academic researcher from the University of Padua. The author has contributed to research in topics: Digital library & Computer science. The author has an h-index of 17 and has co-authored 135 publications receiving 1022 citations.


Papers
Journal ArticleDOI
01 Jan 2018
TL;DR: As discussed by the authors, the current panorama of data citation is many-faceted and an overall view that brings together the diverse aspects of this topic is still missing; this paper therefore aims to describe the lay of the land for data citation, from both the theoretical (the why and what) and the practical (the how) angle.
Abstract: Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming "data-intensive," where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated data sets. Yet, given a data set, there is no quantitative, consistent, and established way of knowing how it has been used over time, who contributed to its curation, what results have been yielded, or what value it has. The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality as traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining the principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation. The current panorama is many-faceted and an overall view that brings together diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, from both the theoretical (the why and what) and the practical (the how) angle.

83 citations

Proceedings ArticleDOI
15 Sep 2009
TL;DR: It is shown how NESTOR can be effectively exploited to enhance the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for better access to and exchange of hierarchical resources on the Web.
Abstract: The paper addresses the problem of representing, managing, and exchanging hierarchically structured data in the context of Digital Library (DL) systems in order to enhance access to and exchange of DL resources on the Web. We propose the NEsted SeTs for Object hieRarchies (NESTOR) framework, which relies on two set data models, the “Nested Set Model (NS-M)” and the “Inverse Nested Set Model (INS-M)”, to enable the representation of hierarchical data structures by means of a proper organization of nested sets. In particular, we show how NESTOR can be effectively exploited to enhance the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for better access to and exchange of hierarchical resources on the Web.

49 citations
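For illustration, here is a minimal Python sketch of the two set models described in the NESTOR entry above, under the assumption that the Nested Set Model maps each node to the set of itself plus its descendants and the Inverse Nested Set Model maps each node to the set of itself plus its ancestors; the toy hierarchy and all names are invented for the example, not taken from the paper.

# Toy sketch of the two NESTOR set data models (illustrative, not the authors' code).
children = {
    "fonds": ["series-A", "series-B"],
    "series-A": ["item-1", "item-2"],
    "series-B": ["item-3"],
    "item-1": [], "item-2": [], "item-3": [],
}

def nested_sets(root):
    """NS-M (assumed): node -> the set of the node itself plus all its descendants."""
    out = {}
    def visit(n):
        s = {n}
        for c in children[n]:
            s |= visit(c)
        out[n] = s
        return s
    visit(root)
    return out

def inverse_nested_sets(root):
    """INS-M (assumed): node -> the set of the node itself plus all its ancestors."""
    out = {}
    def visit(n, ancestors):
        out[n] = ancestors | {n}
        for c in children[n]:
            visit(c, out[n])
    visit(root, set())
    return out

ns = nested_sets("fonds")
ins = inverse_nested_sets("fonds")
# In NS-M a descendant's set is contained in its ancestor's set;
# in INS-M the containment relation is reversed.
assert ns["item-1"] <= ns["series-A"] <= ns["fonds"]
assert ins["fonds"] <= ins["series-A"] <= ins["item-1"]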

Proceedings ArticleDOI
07 Jul 2016
TL;DR: A methodology based on the General Linear Mixed Model (GLMM) is proposed to develop statistical models able to isolate system variance and component effects, as well as their interaction, by relying on a Grid of Points (GoP) containing all the combinations of the analysed components.
Abstract: Topic variance has a greater effect on performance than system variance, but it cannot be controlled by system developers, who can only try to cope with it. On the other hand, system variance is important on its own, since it is what system developers may affect directly by changing system components, and it determines the differences among systems. In this paper, we face the problem of studying system variance in order to better understand how much system components contribute to overall performance. To this end, we propose a methodology based on the General Linear Mixed Model (GLMM) to develop statistical models able to isolate system variance and component effects, as well as their interaction, by relying on a Grid of Points (GoP) containing all the combinations of the analysed components. We apply the proposed methodology to the analysis of TREC Ad-hoc data in order to show how it works and discuss some interesting outcomes of this new kind of analysis. Finally, we extend the analysis to different evaluation measures, showing how they impact the sources of variance.

48 citations
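As a concrete illustration of the Grid of Points mentioned above, the GoP is simply the full factorial combination of the component choices under study; the component names and levels below are hypothetical examples, not the exact grid used in the paper.

# Hypothetical sketch of building a Grid of Points (GoP): every combination of
# the analysed component choices defines one system configuration, to be run
# over the same topics and scored.
from itertools import product

stop_lists = ["none", "indri", "smart"]        # example component levels
stemmers = ["none", "porter", "krovetz"]
ir_models = ["bm25", "lm_dirichlet", "tfidf"]

grid_of_points = [
    {"stop_list": sl, "stemmer": st, "model": m}
    for sl, st, m in product(stop_lists, stemmers, ir_models)
]
print(len(grid_of_points))  # 3 * 3 * 3 = 27 system configurations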

Journal ArticleDOI
TL;DR: A methodology based on the General Linear Mixed Model (GLMM) and analysis of variance (ANOVA) is proposed to develop statistical models able to isolate system variance and component effects as well as their interaction, by relying on a grid of points containing all the combinations of the analyzed components.
Abstract: Information retrieval (IR) systems are the prominent means for searching and accessing huge amounts of unstructured information on the web and elsewhere. They are complex systems, composed of many different components interacting with one another, and evaluation is crucial to both tune and improve them. Nevertheless, in the current evaluation methodology, there is still no way to determine how much each component contributes to the overall performance and how the components interact. This hampers the possibility of a deep understanding of IR system behavior and, in turn, prevents us from determining in advance which components are best suited to work together for a specific search task. In this paper, we move the evaluation methodology one step forward by overcoming these barriers and beginning to devise an “anatomy” of IR systems and their internals. In particular, we propose a methodology based on the General Linear Mixed Model (GLMM) and analysis of variance (ANOVA) to develop statistical models able to isolate system variance and component effects as well as their interaction, by relying on a grid of points (GoP) containing all the combinations of the analyzed components. We apply the proposed methodology to the analysis of two relevant search tasks—news search and web search—by using standard TREC collections. We analyze the basic set of components typically part of an IR system, namely stop lists, stemmers and n-grams, and IR models. In this way, we derive insights about English text retrieval.

38 citations
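The kind of variance decomposition described in the entry above can be sketched with an ordinary ANOVA over the GoP scores; this is a simplified stand-in for the paper's GLMM, and the input file, column names, and factor choices are assumptions made up for the example.

# Simplified sketch of decomposing effectiveness scores from a Grid of Points
# into topic and component effects with ANOVA (a stand-in for the full GLMM).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical input: one row per (topic, configuration) with its score, e.g.
#   topic, stop_list, stemmer, model, ap
df = pd.read_csv("gop_scores.csv")

ols_model = smf.ols(
    "ap ~ C(topic) + C(stop_list) + C(stemmer) + C(model) + C(stemmer):C(model)",
    data=df,
).fit()
table = anova_lm(ols_model, typ=2)
# The sum-of-squares column shows how much variance each factor
# (topic, stop list, stemmer, IR model, interaction) accounts for.
print(table)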

Journal ArticleDOI
TL;DR: VIRTUE (Visual Information Retrieval Tool for Upfront Evaluation) is an innovative visualization environment that eases the experimental evaluation process, makes it more effective, and reduces user effort.
Abstract: Objective: Information Retrieval (IR) is strongly rooted in experimentation, where new and better ways to measure and interpret the behavior of a system are key to scientific advancement. This paper presents an innovative visualization environment, the Visual Information Retrieval Tool for Upfront Evaluation (VIRTUE), which eases the experimental evaluation process and makes it more effective. Methods: VIRTUE supports and improves performance analysis and failure analysis. Performance analysis: VIRTUE offers interactive visualizations based on well-known IR metrics, allowing us to explore system performance and to easily grasp the main problems of the system. Failure analysis: VIRTUE provides visual features and interactions allowing researchers and developers to easily spot critical regions of a ranking and grasp the possible causes of a failure. Results: VIRTUE was validated through a user study involving IR experts. The study reports on (a) the scientific relevance and innovation and (b) the comprehensibility and efficacy of the visualizations. Conclusion: VIRTUE eases the interaction with experimental results, supports users in the evaluation process, and reduces user effort. Practice: VIRTUE will be used by IR analysts to analyze and understand experimental results. Implications: VIRTUE improves the state of the art in evaluation practice and integrates the visualization and IR research fields in an innovative way.

33 citations


Cited by
Book
01 Jan 1975
TL;DR: The major change in the second edition of this book is the addition of a new chapter on probabilistic retrieval, which I think is one of the most interesting and active areas of research in information retrieval.
Abstract: The major change in the second edition of this book is the addition of a new chapter on probabilistic retrieval. This chapter has been included because I think this is one of the most interesting and active areas of research in information retrieval. There are still many problems to be solved so I hope that this particular chapter will be of some help to those who want to advance the state of knowledge in this area. All the other chapters have been updated by including some of the more recent work on the topics covered. In preparing this new edition I have benefited from discussions with Bruce Croft. The material of this book is aimed at advanced undergraduate information (or computer) science students, postgraduate library science students, and research workers in the field of IR. Some of the chapters, particularly Chapter 6, make simple use of a little advanced mathematics. However, the necessary mathematical tools can be easily mastered from numerous mathematical texts that now exist and, in any case, references have been given where the mathematics occur. I had to face the problem of balancing clarity of exposition with density of references. I was tempted to give large numbers of references but was afraid they would have destroyed the continuity of the text. I have tried to steer a middle course and not compete with the Annual Review of Information Science and Technology. Normally one is encouraged to cite only works that have been published in some readily accessible form, such as a book or periodical. Unfortunately, much of the interesting work in IR is contained in technical reports and Ph.D. theses. For example, most of the work done on the SMART system at Cornell is available only in reports. Luckily many of these are now available through the National Technical Information Service (U.S.) and University Microfilms (U.K.). I have not avoided using these sources although if the same material is accessible more readily in some other form I have given it preference. I should like to acknowledge my considerable debt to many people and institutions that have helped me. Let me say first that they are responsible for many of the ideas in this book but that only I wish to be held responsible. My greatest debt is to Karen Sparck Jones who taught me to research information retrieval as an experimental science. Nick Jardine and Robin …

822 citations

01 Jan 2013
TL;DR: Four rationales for sharing data are examined, drawing examples from the sciences, social sciences, and humanities: to reproduce or to verify research, to make results of publicly funded research available to the public, to enable others to ask new questions of extant data, and to advance the state of research and innovation.
Abstract: We must all accept that science is data and that data are science, and thus provide for, and justify the need for the support of, much-improved data curation. (Hanson, Sugden, & Alberts) Researchers are producing an unprecedented deluge of data by using new methods and instrumentation. Others may wish to mine these data for new discoveries and innovations. However, research data are not readily available as sharing is common in only a few fields such as astronomy and genomics. Data sharing practices in other fields vary widely. Moreover, research data take many forms, are handled in many ways, using many approaches, and often are difficult to interpret once removed from their initial context. Data sharing is thus a conundrum. Four rationales for sharing data are examined, drawing examples from the sciences, social sciences, and humanities: (1) to reproduce or to verify research, (2) to make results of publicly funded research available to the public, (3) to enable others to ask new questions of extant data, and (4) to advance the state of research and innovation. These rationales differ by the arguments for sharing, by beneficiaries, and by the motivations and incentives of the many stakeholders involved. The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers will inform data policy and practice. © 2012 Wiley Periodicals, Inc.

634 citations

Posted Content
TL;DR: This tutorial provides an overview of text ranking with neural network architectures known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) is the best-known example, and covers a wide range of techniques.
Abstract: The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.

315 citations
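The multi-stage reranking setup described in the survey above can be sketched with a cross-encoder; the sentence-transformers library, the checkpoint name, and the toy query and candidates below are illustrative choices, not anything prescribed by the survey.

# Sketch of transformer-based reranking of first-stage retrieval candidates.
from sentence_transformers import CrossEncoder

query = "effects of data citation on research evaluation"
candidates = [  # e.g. the top documents returned by a BM25 first stage
    "Data citation enables credit attribution for dataset curators.",
    "BM25 is a bag-of-words ranking function based on term statistics.",
    "Citations are the primary means of assessing research quality.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # a public checkpoint
scores = reranker.predict([(query, doc) for doc in candidates])

# Re-order the candidates by the cross-encoder's relevance scores.
for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")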

Proceedings ArticleDOI
Susan Brewer
01 Sep 1959
TL;DR: The letter and/or sound combinations that make up a human language are limited by the human's ability to pronounce these sounds. Therefore, the standard library search, which as a rule looks for all possible combinations of letters to find a word, is wasteful.
Abstract: The letter and/or sound combinations that make up a human language are limited by the human's ability to pronounce these sounds. Therefore, the standard library search, which as a rule looks for all possible combinations of letters to find a word, is wasteful. Certain letters simply cannot be followed by certain other letters, and a search for them is senseless. Following this same line of reasoning, letters very frequently occur in the combinations that are germane to the particular language. The growing amount of alphanumeric information presently being stored on magnetic tape presents increasingly difficult problems in both the number of tape reels used and the time necessary to search this mass of information in order to extract pertinent literature. At the present time most of this literature on tape utilizes the standard IBM 6-bit code to express alphanumeric symbols. It is entirely feasible to record standard English literature on tape, be it professional abstracts or novels, using only approximately two-thirds of the binary bits utilized to represent the same piece of written material in the conventional code. This can be accomplished by setting up, in a 9-bit code, the 400-odd letter combinations occurring most frequently. A 9-bit representation allows the programmer to set up as many as 512 symbols, thus leaving sufficient leeway to assign symbols to the most frequently used words, mathematical symbols, and professional expressions that are expected to be encountered in the literature to be recorded. In addition, these relatively short 9-bit symbols can be assigned to all key words that it may be necessary to look for later, thereby accelerating any future library search.

298 citations
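The compression scheme outlined in the abstract above, a 9-bit code whose 512 symbols cover frequent letter combinations and key words, amounts to a fixed dictionary code; the toy dictionary and greedy encoder below are invented purely for illustration.

# Toy illustration of the 9-bit dictionary-coding idea: frequent letter
# combinations get their own symbol, so common text needs fewer bits than a
# fixed bits-per-character alphabet. The dictionary entries are made up.
frequent_units = ["the ", "tion", "ing ", "and ", "re", "er"]
single_chars = [chr(c) for c in range(32, 127)]        # fallback: one symbol per character
codebook = {u: i for i, u in enumerate(frequent_units + single_chars)}
assert len(codebook) <= 512                            # everything fits in 9 bits

def encode(text):
    """Greedy longest-match encoding into 9-bit symbol indices."""
    units = sorted(codebook, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for u in units:
            if text.startswith(u, i):
                out.append(codebook[u])
                i += len(u)
                break
    return out

msg = "the rationing of the "
symbols = encode(msg)
print(len(symbols) * 9, "bits vs", len(msg) * 6, "bits in a 6-bit-per-character code")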

01 Jan 2015
TL;DR: As discussed by the authors, Borgman argues that data have no value or meaning in isolation; they exist within a knowledge infrastructure, an ecology of people, practices, technologies, institutions, material objects, and relationships.
Abstract: "Big Data" is on the covers of Science, Nature, the Economist, and Wired magazines, on the front pages of the Wall Street Journal and the New York Times. But despite the media hyperbole, as Christine Borgman points out in this examination of data and scholarly research, having the right data is usually better than having more data; little data can be just as valuable as big data. In many cases, there are no data -- because relevant data don't exist, cannot be found, or are not available. Moreover, data sharing is difficult, incentives to do so are minimal, and data practices vary widely across disciplines. Borgman, an often-cited authority on scholarly communication, argues that data have no value or meaning in isolation; they exist within a knowledge infrastructure -- an ecology of people, practices, technologies, institutions, material objects, and relationships. After laying out the premises of her investigation -- six "provocations" meant to inspire discussion about the uses of data in scholarship -- Borgman offers case studies of data practices in the sciences, the social sciences, and the humanities, and then considers the implications of her findings for scholarly practice and research policy. To manage and exploit data over the long term, Borgman argues, requires massive investment in knowledge infrastructures; at stake is the future of scholarship.

271 citations