Showing papers in "arXiv: Digital Libraries in 2015"


Posted Content
Ludo Waltman
TL;DR: An in-depth review of the literature on citation impact indicators with recommendations for future research on normalization for field differences and counting methods for dealing with co-authored publications.
Abstract: Citation impact indicators nowadays play an important role in research evaluation, and consequently these indicators have received a lot of attention in the bibliometric and scientometric literature. This paper provides an in-depth review of the literature on citation impact indicators. First, an overview is given of the literature on bibliographic databases that can be used to calculate citation impact indicators (Web of Science, Scopus, and Google Scholar). Next, selected topics in the literature on citation impact indicators are reviewed in detail. The first topic is the selection of publications and citations to be included in the calculation of citation impact indicators. The second topic is the normalization of citation impact indicators, in particular normalization for field differences. Counting methods for dealing with co-authored publications are the third topic, and citation impact indicators for journals are the last topic. The paper concludes by offering some recommendations for future research.

469 citations


Journal Article
TL;DR: In this paper, an analysis of the presence and possibilities of altmetrics for bibliometric and performance analysis is carried out using the web based tool Impact Story, which collects metrics for 20,000 random publications from the Web of Science.
Abstract: In this paper an analysis of the presence and possibilities of altmetrics for bibliometric and performance analysis is carried out. Using the web based tool Impact Story, we have collected metrics for 20,000 random publications from the Web of Science. We studied the presence and frequency of altmetrics in the set of publications, across fields, document types and also through the years. The main result of the study is that less than 50% of the publications have some kind of altmetrics. The source that provides most metrics is Mendeley, with metrics on readerships for around 37% of all the publications studied. Other sources only provide marginal information. Possibilities and limitations of these indicators are discussed and future research lines are outlined. We also assessed the accuracy of the data retrieved through Impact Story by focusing on the analysis of the accuracy of data from Mendeley; in a follow up study, the accuracy and validity of other data sources not included here will be assessed.

210 citations


Posted Content
TL;DR: A framework is presented that describes acts leading to (online) events on which the metrics are based, and select citation and social theories are used to interpret the phenomena being measured.
Abstract: More than 30 years after Cronin's seminal paper on "the need for a theory of citing" (Cronin, 1981), the metrics community is once again in need of a new theory, this time one for so-called "altmetrics". Altmetrics, short for alternative (to citation) metrics -- and as such a misnomer -- refers to a new group of metrics based (largely) on social media events relating to scholarly communication. As current definitions of altmetrics are shaped and limited by active platforms, technical possibilities, and business models of aggregators such as Altmetric.com, ImpactStory, PLOS, and Plum Analytics, and as such constantly changing, this work refrains from defining an umbrella term for these very heterogeneous new metrics. Instead a framework is presented that describes acts leading to (online) events on which the metrics are based. These activities occur in the context of social media, such as discussing on Twitter or saving to Mendeley, as well as downloading and citing. The framework groups various types of acts into three categories -- accessing, appraising, and applying -- and provides examples of actions that lead to visibility and traceability online. To improve the understanding of the acts, which result in online events from which metrics are collected, select citation and social theories are used to interpret the phenomena being measured. Citation theories are used because the new metrics based on these events are supposed to replace or complement citations as indicators of impact. Social theories, on the other hand, are discussed because there is an inherent social aspect to the measurements.

121 citations


Posted Content
TL;DR: The authors bring empirical data to assess whether inter-disciplinary research (IDR) is indeed beneficial, and whether costs accompany its potential benefits.
Abstract: Inter-disciplinary research (IDR) is being promoted by federal agencies and universities nationwide because it presumably spurs transformative, innovative science. In this paper we bring empirical data to assess whether IDR is indeed beneficial, and whether costs accompany potential benefits.

81 citations


Journal Article
TL;DR: The findings provide evidence that a major consequence of open access policies is to significantly amplify the diffusion of science, through an intermediary like Wikipedia, to a broad audience.
Abstract: With the rise of Wikipedia as a first-stop source for scientific knowledge, it is important to compare its representation of that knowledge to that of the academic literature. Here we identify the 250 most heavily used journals in each of 26 research fields (4,721 journals, 19.4M articles in total) indexed by the Scopus database, and test whether topic, academic status, and accessibility make articles from these journals more or less likely to be referenced on Wikipedia. We find that a journal's academic status (impact factor) and accessibility (open access policy) both strongly increase the probability of it being referenced on Wikipedia. Controlling for field and impact factor, the odds that an open access journal is referenced on the English Wikipedia are 47% higher compared to paywall journals. One of the implications of this study is that a major consequence of open access policies is to significantly amplify the diffusion of science, through an intermediary like Wikipedia, to a broad audience.

73 citations
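
The "47% higher odds" reported in the abstract above is a statement about odds, not probabilities. The following is a minimal sketch of how an odds ratio of 1.47 translates into a probability for a given baseline. The baseline probabilities are hypothetical and are not taken from the paper.

```python
def probability_after_odds_ratio(baseline_probability: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability and return the new probability."""
    if not 0.0 < baseline_probability < 1.0:
        raise ValueError("baseline_probability must be strictly between 0 and 1")
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# "47% higher odds" corresponds to an odds ratio of 1.47.
OPEN_ACCESS_ODDS_RATIO = 1.47

# Hypothetical baseline probabilities for a paywalled journal (not from the paper).
for baseline in (0.10, 0.50, 0.90):
    boosted = probability_after_odds_ratio(baseline, OPEN_ACCESS_ODDS_RATIO)
    print(f"baseline {baseline:.2f} -> open access {boosted:.3f}")
```

The gain in probability depends on the baseline: the same odds ratio moves a mid-range probability more than one that is already close to 0 or 1.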


Journal Article
TL;DR: The results show that initial‐based disambiguation can misrepresent statistical properties of coauthorship networks: It deflates the number of unique authors, number of components, average shortest paths, clustering coefficient, and assortativity, while it inflates average productivity, density, average coauthor number per author, and largest component size.
Abstract: Scholars have often relied on name initials to resolve name ambiguities in large-scale coauthorship network research. This approach bears the risk of incorrectly merging or splitting author identities. The use of initial-based disambiguation has been justified by the assumption that such errors would not affect research findings too much. This paper tests this assumption by analyzing coauthorship networks from five academic fields - biology, computer science, nanoscience, neuroscience, and physics - and an interdisciplinary journal, PNAS. Name instances in datasets of this study were disambiguated based on heuristics gained from previous algorithmic disambiguation solutions. We use disambiguated data as a proxy of ground-truth to test the performance of three types of initial-based disambiguation. Our results show that initial-based disambiguation can misrepresent statistical properties of coauthorship networks: it deflates the number of unique authors, number of components, average shortest paths, clustering coefficient, and assortativity, while it inflates average productivity, density, average coauthor number per author, and largest component size. Also, on average, more than half of top 10 productive or collaborative authors drop off the lists. Asian names were found to account for the majority of misidentification by initial-based disambiguation due to their common surname and given name initials.

59 citations
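
The merging effect described in the abstract above can be illustrated with a toy example. This is a sketch only: the paper list is invented, full name strings stand in for ground-truth identities, and `first_initial_key` is just one of several possible initial-based schemes, not necessarily one of the three tested in the paper.

```python
from itertools import combinations


def first_initial_key(full_name: str) -> str:
    """Reduce 'Given [Middle] Surname' to 'surname, g' (first-initial method)."""
    parts = full_name.lower().split()
    return f"{parts[-1]}, {parts[0][0]}"


def unique_authors(papers, key=lambda name: name):
    """Set of author identities under the given naming key."""
    return {key(author) for authors in papers for author in authors}


def coauthor_edges(papers, key=lambda name: name):
    """Set of undirected coauthorship edges under the given naming key."""
    edges = set()
    for authors in papers:
        identities = sorted({key(author) for author in authors})
        edges.update(combinations(identities, 2))
    return edges


def density(n_nodes: int, n_edges: int) -> float:
    """Share of possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Hypothetical author lists; each full name is treated as a distinct person.
papers = [
    ["Wei Zhang", "Anna Schmidt"],
    ["Wen Zhang", "Carlos Diaz"],
    ["Wei Zhang", "Min Lee"],
    ["Mei Lee", "Carlos Diaz"],
]

for label, key in (("full names", lambda name: name), ("first initial", first_initial_key)):
    nodes = unique_authors(papers, key)
    edges = coauthor_edges(papers, key)
    print(label, len(nodes), "authors, density", round(density(len(nodes), len(edges)), 3))
```

In this toy data, "Wei Zhang" and "Wen Zhang" collapse into one node, as do "Min Lee" and "Mei Lee", so the author count falls while density rises, which matches the direction of the biases reported in the abstract.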


Posted Content
TL;DR: In this paper, the authors compare topic modeling with co-word mapping in terms of co-occurrences and co-absences using network techniques and show that topic models can reveal similarities other than semantic ones (e.g., linguistic ones).
Abstract: Induced by "big data," "topic modeling" has become an attractive alternative to mapping co-words in terms of co-occurrences and co-absences using network techniques. Does topic modeling provide an alternative for co-word mapping in research practices using moderately sized document collections? We return to the word/document matrix using first a single text with a strong argument ("The Leiden Manifesto") and then upscale to a sample of moderate size (n = 687) to study the pros and cons of the two approaches in terms of the resulting possibilities for making semantic maps that can serve an argument. The results from co-word mapping (using two different routines) versus topic modeling are significantly uncorrelated. Whereas components in the co-word maps can easily be designated, the topic models provide sets of words that are very differently organized. In these samples, the topic models seem to reveal similarities other than semantic ones (e.g., linguistic ones). In other words, topic modeling does not replace co-word mapping in small and medium-sized sets; but the paper leaves open the possibility that topic modeling would work well for the semantic mapping of large sets.

56 citations


Posted Content
TL;DR: The role that the DCI could play in encouraging the consistent, standardized citation of research data is emphasized—a role that would enhance their value as a means of following the research process from data collection to publication.
Abstract: We present an analysis of data citation practices based on the Data Citation Index (DCI) from Thomson Reuters. This database, launched in 2012, aims to link data sets and data studies with citations received from the other citation indexes. The DCI harvests citations to research data from papers indexed in the Web of Science. It relies on the information provided by the data repository, as data citation practices are inconsistent or nonexistent in many cases. The findings of this study show that data citation practices are far from common in most research fields. Some differences have been reported on the way researchers cite data: while in the areas of Science and Engineering and Technology data sets were the most cited, in Social Sciences and Arts and Humanities data studies play a greater role. A total of 88.1 percent of the records have received no citation, but some repositories show very low uncitedness rates. Although data citation practices are rare in most fields, they have expanded in disciplines such as crystallography and genomics. We conclude by emphasizing the role that the DCI could play in encouraging the consistent, standardized citation of research data; a role that would enhance their value as a means of following the research process from data collection to publication.

53 citations


Journal Article
TL;DR: The experiment protocol adopted in Area 13 was substantially modified with respect to all the other research fields, to the point that results for economics and statistics have to be considered as fatally flawed.
Abstract: During the Italian research assessment exercise, the national agency ANVUR performed an experiment to assess agreement between grades attributed to journal articles by informed peer review (IR) and by bibliometrics. A sample of articles was evaluated by using both methods and agreement was analyzed by weighted Cohen's kappas. ANVUR presented results as indicating an overall 'good' or 'more than adequate' agreement. This paper re-examines the experiment results according to the available statistical guidelines for interpreting kappa values, by showing that the degree of agreement, always in the range 0.09-0.42 has to be interpreted, for all research fields, as unacceptable, poor or, in a few cases, as, at most, fair. The only notable exception, confirmed also by a statistical meta-analysis, was a moderate agreement for economics and statistics (Area 13) and its sub-fields. We show that the experiment protocol adopted in Area 13 was substantially modified with respect to all the other research fields, to the point that results for economics and statistics have to be considered as fatally flawed. The evidence of a poor agreement supports the conclusion that IR and bibliometrics do not produce similar results, and that the adoption of both methods in the Italian research assessment possibly introduced systematic and unknown biases in its final results. The conclusion reached by ANVUR must be reversed: the available evidence does not justify at all the joint use of IR and bibliometrics within the same research assessment exercise.

50 citations
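
For readers unfamiliar with the agreement statistic discussed in the abstract above, here is a minimal sketch of a weighted Cohen's kappa. It assumes linear disagreement weights and a four-point grading scale; the abstract does not state which weighting scheme ANVUR used, and the example grades are invented.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories):
    """Cohen's weighted kappa with linear disagreement weights.

    Ratings are integer category indices in range(n_categories).
    """
    if n_categories < 2:
        raise ValueError("need at least two categories")
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(ratings_a)
    k = n_categories
    # Joint distribution of the two raters' grades.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n
    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear weights (an assumption)
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row_totals[i] * col_totals[j]
    if expected_disagreement == 0.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical grades (0 = best, 3 = worst) for eight articles.
peer_review = [0, 1, 2, 3, 1, 2, 0, 3]
bibliometrics = [0, 2, 2, 1, 1, 3, 1, 3]
print(round(weighted_kappa(peer_review, bibliometrics, 4), 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why values in the 0.09-0.42 range are read in the abstract as weak.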


Journal Article
TL;DR: In this article, the authors show that journal rankings based on JIF variants tend to be more stable over time if the geometric mean is used rather than the standard mean in JIF calculations.
Abstract: Journal impact factors (JIFs) are widely used and promoted but have important limitations. In particular, JIFs can be unduly influenced by individual highly cited articles and hence are inherently unstable. A logical way to reduce the impact of individual high citation counts is to use the geometric mean rather than the standard mean in JIF calculations. Based upon journal rankings 2004-2014 in 50 sub-categories within 5 broad categories, this study shows that journal rankings based on JIF variants tend to be more stable over time if the geometric mean is used rather than the standard mean. The same is true for JIF variants using Mendeley reader counts instead of citation counts. Thus, although the difference is not large, the geometric mean is recommended instead of the arithmetic mean for future JIF calculations. In addition, Mendeley readership-based JIF variants are as stable as those using Scopus citations, confirming the value of Mendeley readership as an academic impact indicator.

48 citations
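
The stabilising effect of the geometric mean described in the abstract above can be sketched as follows. The citation counts are invented, and the `+1` offset used to handle uncited articles is an assumption, since the abstract does not say how zeros are treated.

```python
import math


def arithmetic_mean(counts):
    """Standard mean, as used in the conventional JIF."""
    return sum(counts) / len(counts)


def geometric_mean_offset(counts):
    """exp(mean(ln(1 + c))) - 1; the +1 offset keeps zero counts defined."""
    return math.exp(sum(math.log1p(c) for c in counts) / len(counts)) - 1.0


# Hypothetical citation counts for one journal's articles.
typical_year = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6]
# Same journal, but one article becomes very highly cited.
outlier_year = typical_year[:-1] + [600]

for name, mean in (("arithmetic", arithmetic_mean), ("geometric", geometric_mean_offset)):
    print(name, round(mean(typical_year), 2), "->", round(mean(outlier_year), 2))
```

A single highly cited article multiplies the arithmetic mean many times over but shifts the geometric mean far less, which is the mechanism behind the more stable rankings reported above.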


Book Chapter
TL;DR: A new approach is proposed that uses a deep neural network to learn features automatically for resolving author name ambiguity, together with a general system architecture for author name disambiguation on any dataset.
Abstract: Author name ambiguity decreases the quality and reliability of information retrieved from digital libraries. Existing methods have tried to solve this problem by predefining a feature set based on expert knowledge for a specific dataset. In this paper, we propose a new approach which uses a deep neural network to learn features automatically from data. Additionally, we propose a general system architecture for author name disambiguation on any dataset. In this research, we evaluate the proposed method on a dataset containing Vietnamese author names. The results show that this method significantly outperforms other methods that use a predefined feature set. The proposed method achieves 99.31% accuracy. The prediction error rate decreases from 1.83% to 0.69%, i.e., by 1.14 percentage points, or 62.3% in relative terms, compared with other methods that use a predefined feature set (Table 3).

Posted Content
TL;DR: Two new methods to identify national differences in average citation impact are introduced, one based on linear modelling for normalised data and the other using the geometric mean, which has the advantage of distinguishing between national contributions to internationally collaborative articles.
Abstract: Governments sometimes need to analyse sets of research papers within a field in order to monitor progress, assess the effect of recent policy changes, or identify areas of excellence. They may compare the average citation impacts of the papers by dividing them by the world average for the field and year. Since citation data is highly skewed, however, simple averages may be too imprecise to robustly identify differences within, rather than across, fields. In response, this article introduces two new methods to identify national differences in average citation impact, one based on linear modelling for normalised data and the other using the geometric mean. Results from a sample of 26 Scopus fields between 2009-2015 show that geometric means are the most precise and so are recommended for smaller sample sizes, such as for individual fields. The regression method has the advantage of distinguishing between national contributions to internationally collaborative articles, but has substantially wider confidence intervals than the geometric mean, undermining its value for any except the largest sample sizes.

Posted Content
TL;DR: In this paper, a geolocation-based method was proposed for simultaneous disambiguation and cross-linking of the inventor and assignee names for a significant fraction of patents in these three major patent collections.
Abstract: Patent data represent a significant source of information on innovation and the evolution of technology through networks of citations, co-invention and co-assignment of new patents. A major obstacle to extracting useful information from this data is the problem of name disambiguation: linking alternate spellings of individuals or institutions to a single identifier to uniquely determine the parties involved in the creation of a technology. In this paper, we describe a new algorithm that uses high-resolution geolocation to disambiguate both inventors and assignees on more than 3.6 million patents found in the European Patent Office (EPO), under the Patent Cooperation Treaty (PCT), and in the US Patent and Trademark Office (USPTO). We show that our algorithm has both high precision and recall in comparison to a manual disambiguation of EPO assignee names in Boston and Paris, and show it performs well for a benchmark of USPTO inventor names that can be linked to a high-resolution address (but poorly for inventors that never provided a high quality address). The most significant benefit of this work is the high quality assignee disambiguation with worldwide coverage coupled with an inventor disambiguation that is competitive with other state-of-the-art approaches. To our knowledge this is the broadest and most accurate simultaneous disambiguation and cross-linking of the inventor and assignee names for a significant fraction of patents in these three major patent collections.

Posted Content
Henk F. Moed, Gali Halevi
TL;DR: A statistical analysis of full text downloads of articles in Elsevier's ScienceDirect covering all disciplines reveals large differences in download frequencies, their skewness, and their correlation with Scopus-based citation counts, between disciplines, journals, and document types.
Abstract: A statistical analysis of full text downloads of articles in Elsevier's ScienceDirect covering all disciplines reveals large differences in download frequencies, their skewness, and their correlation with Scopus-based citation counts, between disciplines, journals, and document types. Download counts tend to be two orders of magnitude higher than citations and less skewed in their distribution. A mathematical model based on the sum of two exponentials does not adequately capture monthly download counts. The degree of correlation at the article level within a journal is similar to that at the journal level in the discipline covered by that journal, suggesting that the differences between journals are to a large extent discipline specific. Despite the fact that in all study journals download and citation counts per article positively correlate, little overlap may exist between the set of articles appearing in the top of the citation distribution and that with the most frequently downloaded ones. Usage and citation leaks, bulk downloading, differences between reader and author populations in a subject field, the type of document or its content, differences in obsolescence patterns between downloads and citations, different functions of reading and citing in the research process, all provide possible explanations of differences between download and citation distributions.

Journal Article
TL;DR: This article compares the precision of three approaches for comparing the citation impact of nations, departments or other groups of researchers within individual fields: arithmetic means, geometric means, and percentage in the top X%.
Abstract: When comparing the citation impact of nations, departments or other groups of researchers within individual fields, three approaches have been proposed: arithmetic means, geometric means, and percentage in the top X%. This article compares the precision of these statistics using 97 trillion experimentally simulated citation counts from 6875 sets of different parameters (although all having the same scale parameter) based upon the discretised lognormal distribution with limits from 1000 repetitions for each parameter set. The results show that the geometric mean is the most precise, closely followed by the percentage of a country's articles in the top 50% most cited articles for a field, year and document type. Thus the geometric mean citation count is recommended for future citation-based comparisons between nations. The percentage of a country's articles in the top 1% most cited is a particularly imprecise indicator and is not recommended for international comparisons based on individual fields. Moreover, whereas standard confidence interval formulae for the geometric mean appear to be accurate, confidence interval formulae are less accurate and consistent for percentile indicators. These recommendations assume that the scale parameters of the samples are the same but the choice of indicator is complex and partly conceptual if they are not.

Posted Content
TL;DR: A new dynamic growth model reveals how citation networks evolve over time, pointing the way toward reformulated scientometrics.
Abstract: A common consensus in the literature is that the citation profile of published articles in general follows a universal pattern - an initial growth in the number of citations within the first two to three years after publication followed by a steady peak of one to two years and then a final decline over the rest of the lifetime of the article. This observation has long been the underlying heuristic in determining major bibliometric factors such as the quality of a publication, the growth of scientific communities, impact factor of publication venues etc. In this paper, we gather and analyze a massive dataset of scientific papers from the computer science domain and notice that the citation count of the articles over the years follows a remarkably diverse set of patterns - a profile with an initial peak (PeakInit), with distinct multiple peaks (PeakMul), with a peak late in time (PeakLate), that is monotonically decreasing (MonDec), that is monotonically increasing (MonIncr) and that can not be categorized into any of the above (Oth). We conduct a thorough experiment to investigate several important characteristics of these categories such as how individual categories attract citations, how the categorization is influenced by the year and the venue of publication of papers, how each category is affected by self-citations, the stability of the categories over time, and how much each of these categories contribute to the core of the network. Further, we show that the traditional preferential attachment models fail to explain these citation profiles. Therefore, we propose a novel dynamic growth model that takes both the preferential attachment and the aging factor into account in order to replicate the real-world behavior of various citation profiles. We believe that this paper opens the scope for a serious re-investigation of the existing bibliometric indices for scientific research.

Posted Content
TL;DR: ExpertSeer, a generic framework for expert recommendation based on the contents of a digital library, is described, which outperforms Microsoft Academic Search and ArnetMiner in terms of Precision-at-k (P@k) for k=3, 5, 10.
Abstract: We describe ExpertSeer, a generic framework for expert recommendation based on the contents of a digital library. Given a query term q, ExpertSeer recommends experts of q by retrieving authors who published relevant papers determined by related keyphrases and the quality of papers. The system is based on a simple yet effective keyphrase extractor and the Bayes' rule for expert recommendation. ExpertSeer is domain independent and can be applied to different disciplines and applications since the system is automated and not tailored to a specific discipline. Digital library providers can employ the system to enrich their services and organizations can discover experts of interest within an organization. To demonstrate the power of ExpertSeer, we apply the framework to build two expert recommender systems. The first, CSSeer, utilizes the CiteSeerX digital library to recommend experts primarily in computer science. The second, ChemSeer, uses publicly available documents from the Royal Society of Chemistry (RSC) to recommend experts in chemistry. Using one thousand computer science terms as benchmark queries, we compared the top-n experts (n=3, 5, 10) returned by CSSeer to two other expert recommenders -- Microsoft Academic Search and ArnetMiner -- and a simulator that imitates the ranking function of Google Scholar. Although CSSeer, Microsoft Academic Search, and ArnetMiner mostly return prestigious researchers who published several papers related to the query term, it was found that different expert recommenders return moderately different recommendations. To further study their performance, we obtained a widely used benchmark dataset as the ground truth for comparison. The results show that our system outperforms Microsoft Academic Search and ArnetMiner in terms of Precision-at-k (P@k) for k=3, 5, 10. We also conducted several case studies to validate the usefulness of our system.
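
The Precision-at-k measure named in the abstract above can be sketched in a few lines. The ranked list and the ground-truth set below are hypothetical placeholders, not output from CSSeer or from the benchmark dataset.

```python
def precision_at_k(ranked_experts, relevant_experts, k):
    """Fraction of the top-k recommended experts that appear in the ground truth."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_experts[:k]
    hits = sum(1 for expert in top_k if expert in relevant_experts)
    return hits / k


# Hypothetical recommender output for one query term, best match first.
ranked = ["expert_a", "expert_b", "expert_c", "expert_d", "expert_e"]
# Hypothetical ground-truth experts for the same query.
ground_truth = {"expert_a", "expert_c", "expert_e"}

for k in (3, 5):
    print(f"P@{k} = {precision_at_k(ranked, ground_truth, k):.2f}")
```

In an evaluation like the one described, this value would be averaged over all benchmark queries for each recommender and each k.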

Journal Article
TL;DR: This editorial provides an overview of terminology and definitions of altmetrics and summarizes current research on social media use in academia, social media metrics, and data reliability and validity; it notes that the theoretical foundation, empirical validity, and extent of use of the platforms underlying these metrics lack thorough treatment in the literature.
Abstract: Social media metrics - commonly coined as "altmetrics" - have been heralded as great democratizers of science, providing broader and timelier indicators of impact than citations. These metrics come from a range of sources, including Twitter, blogs, social reference managers, post-publication peer review, and other social media platforms. Social media metrics have begun to be used as indicators of scientific impact, yet the theoretical foundation, empirical validity, and extent of use of platforms underlying these metrics lack thorough treatment in the literature. This editorial provides an overview of terminology and definitions of altmetrics and summarizes current research regarding social media use in academia, social media metrics as well as data reliability and validity. The papers of the special issue are introduced.

Posted Content
TL;DR: In this article, it was shown that Ochiai similarity of the co-occurrence matrix is equal to cosine similarity in the underlying occurrence matrix, and that the similarity is then normalized twice, and therefore over-estimated.
Abstract: We prove that Ochiai similarity of the co-occurrence matrix is equal to cosine similarity in the underlying occurrence matrix. Neither the cosine nor the Pearson correlation should be used for the normalization of co-occurrence matrices because the similarity is then normalized twice, and therefore over-estimated; the Ochiai coefficient can be used instead. Results are shown using a small matrix (5 cases, 4 variables) for didactic reasons, and also Ahlgren et al.'s (2003) co-occurrence matrix of 24 authors in library and information sciences. The over-estimation is shown numerically and will be illustrated using multidimensional scaling and cluster dendrograms. If the occurrence matrix is not available (such as in internet research or author co-citation analysis) using Ochiai for the normalization is preferable to using the cosine.
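
The identity stated in the abstract above can be checked numerically. The sketch below uses an invented binary occurrence matrix of the same shape (5 cases, 4 variables), not the matrix from the paper, and it keeps the diagonal of the co-occurrence matrix when applying the cosine a second time, which is one of several possible conventions.

```python
import math
from itertools import combinations


def column(matrix, j):
    return [row[j] for row in matrix]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def cooccurrence(occurrence):
    """Variables-by-variables co-occurrence matrix (transpose(A) times A)."""
    n_vars = len(occurrence[0])
    return [[dot(column(occurrence, i), column(occurrence, j)) for j in range(n_vars)]
            for i in range(n_vars)]


def ochiai(cooc, i, j):
    """Ochiai coefficient: co-occurrences over the geometric mean of the occurrences."""
    return cooc[i][j] / math.sqrt(cooc[i][i] * cooc[j][j])


# Hypothetical binary occurrence matrix: 5 cases (rows) by 4 variables (columns).
occurrence = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
cooc = cooccurrence(occurrence)

for i, j in combinations(range(4), 2):
    on_occurrence = cosine(column(occurrence, i), column(occurrence, j))
    on_cooccurrence = cosine(column(cooc, i), column(cooc, j))
    print(i, j, round(ochiai(cooc, i, j), 3), round(on_occurrence, 3), round(on_cooccurrence, 3))
```

The first two printed similarities agree for every pair, while the cosine computed on the co-occurrence matrix itself is a different, second normalisation of data that were already aggregated.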

Journal Article
TL;DR: In this paper, the hip-index is proposed to identify the subset of references in a bibliography that have a central academic influence on the citing paper, based on the number of times a reference is mentioned in the body of a citing paper.
Abstract: The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation. By asking authors to identify the key references in their own work, we created a data set in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this data set using only four features. The best features, among those we evaluated, were those based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.
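
The idea of an h-index that weights citations by in-text mentions can be sketched as below. This is an illustration under assumptions: it uses raw mention counts as weights, the data are invented, and the exact weighting used for the hip-index in the paper may differ.

```python
def h_index(scores):
    """Largest h such that at least h papers have a score of at least h."""
    h = 0
    for rank, score in enumerate(sorted(scores, reverse=True), start=1):
        if score >= rank:
            h = rank
        else:
            break
    return h


def mention_weighted_h_index(mentions_per_paper):
    """h-index over citation counts weighted by how often each citing paper mentions the work.

    mentions_per_paper holds one list per paper by the researcher; each list has
    one entry per citing paper, giving the number of in-text mentions.
    """
    return h_index([sum(mentions) for mentions in mentions_per_paper])


# Hypothetical researcher with five papers.
mentions_per_paper = [
    [1, 1, 1],  # cited by three papers, mentioned once in each
    [3, 2],     # cited by two papers, mentioned several times in each
    [1],
    [4],
    [1, 1],
]

plain_h = h_index([len(mentions) for mentions in mentions_per_paper])
weighted_h = mention_weighted_h_index(mentions_per_paper)
print("conventional h-index:", plain_h)
print("mention-weighted h-index:", weighted_h)
```

Papers that are cited by few sources but discussed repeatedly within them gain weight under this scheme, which is the intuition behind counting mentions rather than references.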

Posted Content
TL;DR: The PASTEUR4OA project analyses what makes an Open Access (OA) policy effective and suggests that it would be useful for current and future OA policies to adopt the seven positive conditions so as to accelerate and maximise the growth of OA.
Abstract: The PASTEUR4OA project analyses what makes an Open Access (OA) policy effective. The total number of institutional or funder OA policies worldwide is now 663 (March 2015), over half of them mandatory. ROARMAP, the policy registry, has been rebuilt to record more policy detail and provide more extensive search functionality. Deposit rates were measured for articles in institutions' repositories and compared to the total number of WoS-indexed articles published from those institutions. Average deposit rate was over four times as high for institutions with a mandatory policy. Six positive correlations were found between deposit rates and (1) Must-Deposit; (2) Cannot-Waive-Deposit; (3) Deposit-Linked-to-Research-Evaluation; (4) Cannot-Waive-Rights-Retention; (5) Must-Make-Deposit-OA (after allowable embargo) and (6) Can-Waive-OA. For deposit latency, there is a positive correlation between earlier deposit and (7) Must-Deposit-Immediately as well as with (4) Cannot-Waive-Rights-Retention and with mandate age. There are not yet enough OA policies to test whether still further policy conditions would contribute to mandate effectiveness but the present findings already suggest that it would be useful for current and future OA policies to adopt the seven positive conditions so as to accelerate and maximise the growth of OA.

Posted Content
TL;DR: This paper analysed the structure of items archived in figshare, their usage, and their reception in two altmetrics sources (PlumX and ImpactStory), and found that Twitter was the social media service where research data gained most attention.
Abstract: This is the second paper in a series of bibliometric studies of research data. In this paper, we present an analysis of figshare, one of the largest multidisciplinary repositories for research materials to date. We analysed the structure of items archived in figshare, their usage, and their reception in two altmetrics sources (PlumX and ImpactStory). We found that figshare acts (1) as a personal repository for as yet unpublished materials, (2) as a platform for newly published research materials, and (3) as an archive for PLOS. Depending on the function, we found different bibliometric characteristics. Items archived from PLOS tend to come from the natural sciences and are often neither viewed nor downloaded. Self-archived items, however, come from a variety of disciplines and exhibit some patterns of higher usage. In the altmetrics analysis, we found that Twitter was the social media service where research data gained most attention; generally, research data published in 2014 were most popular across social media services. PlumX detects considerably more items in social media and also finds higher altmetric scores than ImpactStory.

Posted Content
TL;DR: The date on which the fixed journal article (Version of Record) is first made available on the publisher's website is proposed as a consistent definition of the online date, leading to the conclusion that more transparency and standardization are needed in the reporting of publication dates.
Abstract: With the acceleration of scholarly communication in the digital era, the publication year is no longer a sufficient level of time aggregation for bibliometric and social media indicators. Papers are increasingly cited before they have been officially published in a journal issue and mentioned on Twitter within days of online availability. In order to find a suitable proxy for the day of online publication allowing for the computation of more accurate benchmarks and fine-grained citation and social media event windows, various dates are compared for a set of 58,896 papers published by Nature Publishing Group, PLOS, Springer and Wiley-Blackwell in 2012. Dates include the online date provided by the publishers, the month of the journal issue, the Web of Science indexing date, the date of the first tweet mentioning the paper as well as the this http URL publication and first-seen dates. Comparing these dates, the analysis reveals that large differences exist between publishers, leading to the conclusion that more transparency and standardization are needed in the reporting of publication dates. The date on which the fixed journal article (Version of Record) is first made available on the publisher's website is proposed as a consistent definition of the online date.

Posted Content
TL;DR: The study explores the citedness of research data, its distribution over time and how it is related to the availability of a DOI in Thomson Reuters' DCI (Data Citation Index); it also investigates whether cited research data "impact" the (social) web, as reflected by altmetrics scores, and whether there is any relationship between the number of citations and the sum of altmetrics scores from various social media platforms.
Abstract: The study explores the citedness of research data, its distribution over time and how it is related to the availability of a DOI (Digital Object Identifier) in Thomson Reuters' DCI (Data Citation Index). We investigate whether cited research data "impact" the (social) web, as reflected by altmetrics scores, and whether there is any relationship between the number of citations and the sum of altmetrics scores from various social media platforms. Three tools are used to collect and compare altmetrics scores, i.e. PlumX, ImpactStory, and Altmetric.com. In terms of coverage, PlumX is the most helpful altmetrics tool. While research data remain mostly uncited (about 85%), there has been a growing trend in citing data sets published since 2007. Surprisingly, the percentage of cited research data with a DOI in DCI has decreased in recent years. Only nine repositories account for research data with DOIs and two or more citations. The number of cited research data with altmetrics scores is even lower (4 to 9%) but shows a higher coverage of research data from the last decade. However, no correlation between the number of citations and the total number of altmetrics scores is observable. Certain data types (i.e. survey, aggregate data, and sequence data) are more often cited and receive higher altmetrics scores.

Posted Content
TL;DR: In this age of big data and high social and professional mobility, ranking has become one of the central issues in social life and information technologies, allowing quick redirection of web traffic through small biased updates, and restriction of public access to undesired information.
Abstract: Currently the ranking of scientists is based on the $h$-index, which is widely perceived as an imprecise and simplistic though still useful metric. We find that the $h$-index actually favours modestly performing researchers and propose a simple criterion for proper ranking.

Journal Article
TL;DR: In this paper, a model based on quantile regression is proposed to predict the long-term citation impact of a publication by using the impact factor of the journal in which a publication appeared and the number of citations a publication has received one year after its appearance.
Abstract: A fundamental problem in citation analysis is the prediction of the long-term citation impact of recent publications. We propose a model to predict a probability distribution for the future number of citations of a publication. Two predictors are used: The impact factor of the journal in which a publication has appeared and the number of citations a publication has received one year after its appearance. The proposed model is based on quantile regression. We employ the model to predict the future number of citations of a large set of publications in the field of physics. Our analysis shows that both predictors (i.e., impact factor and early citations) contribute to the accurate prediction of long-term citation impact. We also analytically study the behavior of the quantile regression coefficients for high quantiles of the distribution of citations. This is done by linking the quantile regression approach to a quantile estimation technique from extreme value theory. Our work provides insight into the influence of the impact factor and early citations on the long-term citation impact of a publication, and it takes a step toward a methodology that can be used to assess research institutions based on their most recently published work.
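To make the quantile-regression idea concrete, the sketch below fits a linear quantile model with the two predictors named in the abstract (journal impact factor and early citations) by minimizing the pinball loss. This is only an illustration under stated assumptions: the plain subgradient-descent fitting routine, the function names, and the toy numbers are all hypothetical and are not taken from the paper, which does not describe its estimation procedure here.

```python
from typing import List, Sequence, Tuple


def pinball_loss(residual: float, tau: float) -> float:
    """Quantile ("pinball") loss for one residual y - prediction."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def predict(coef: Sequence[float], x: Sequence[float]) -> float:
    """Linear predictor: intercept plus coefficient-weighted features."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def mean_loss(coef: Sequence[float], X, y, tau: float) -> float:
    """Average pinball loss of a linear model over the data set."""
    losses = (pinball_loss(yi - predict(coef, xi), tau) for xi, yi in zip(X, y))
    return sum(losses) / len(y)


def fit_quantile(X, y, tau: float, lr: float = 0.01, steps: int = 5000) -> List[float]:
    """Subgradient descent on the mean pinball loss; returns the best coefficients seen."""
    n_feat = len(X[0])
    coef = [0.0] * (n_feat + 1)
    best, best_loss = list(coef), mean_loss(coef, X, y, tau)
    for _ in range(steps):
        grad = [0.0] * (n_feat + 1)
        for xi, yi in zip(X, y):
            # Derivative of the loss w.r.t. the prediction:
            # -tau when the point lies on or above the line, (1 - tau) below it.
            g = -tau if yi - predict(coef, xi) >= 0 else 1.0 - tau
            grad[0] += g
            for j, v in enumerate(xi):
                grad[j + 1] += g * v
        coef = [c - lr * g / len(y) for c, g in zip(coef, grad)]
        loss = mean_loss(coef, X, y, tau)
        if loss < best_loss:
            best, best_loss = list(coef), loss
    return best


# Hypothetical toy data: (impact factor, citations after one year) -> long-term citations.
X_toy: List[Tuple[float, float]] = [
    (1.0, 0), (1.5, 1), (2.0, 1), (2.5, 3), (3.0, 2), (3.5, 5), (4.0, 4), (5.0, 8),
]
y_toy: List[float] = [2, 5, 4, 12, 9, 20, 15, 40]

median_coef = fit_quantile(X_toy, y_toy, tau=0.5)  # conditional median
upper_coef = fit_quantile(X_toy, y_toy, tau=0.9)   # conditional 90th percentile
```

Fitting the model at several values of `tau` gives a set of conditional quantiles, which together approximate a predicted distribution rather than a single point estimate. A production analysis would more likely rely on an established solver than on this hand-rolled descent.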

Posted Content
TL;DR: This study investigates factors influencing the Twitter popularity of medical papers by examining differences between medical study types, and finds that meta-analyses, systematic reviews and clinical trials are tweeted substantially more frequently than other study types.
Abstract: Twitter has been identified as one of the most popular and promising altmetrics data sources, as it possibly reflects a broader use of research articles by the general public. Several factors, such as document age, scientific discipline, number of authors and document type, have been shown to affect the number of tweets received by scientific documents. The particular meaning of tweets mentioning scholarly papers is, however, not entirely understood, and their validity as impact indicators is debatable. This study contributes to the understanding of factors influencing the Twitter popularity of medical papers by investigating differences between medical study types. A total of 162,830 documents indexed in Embase and assigned to a medical study type were analysed for study-type-specific tweet frequency. Meta-analyses, systematic reviews and clinical trials were found to be tweeted substantially more frequently than other study types, while all basic research received less attention than the average. The findings correspond well with clinical evidence hierarchies. It is suggested that interest from laymen and patients may be a factor in the observed effects.

Posted Content
TL;DR: This work proposes a broad, multi-dimensional conception of altmetrics as traces of the computerization of the research process, and seeks to provide a theoretical foundation for altmetrics based on notions developed by Michael Nielsen in his monograph Reinventing Discovery: The New Era of Networked Science.
Abstract: I propose a broad, multi-dimensional conception of altmetrics, namely as traces of the computerization of the research process. Computerization should be conceived in its broadest sense, including all recent developments in ICT and software, taking place in society as a whole. I distinguish four aspects of the research process: the collection of research data and development of research methods; scientific information processing; communication and organization; and, last but not least, research assessment. I will argue that in each aspect, computerization plays a key role, and metrics are being developed to describe this process. I propose to label the total collection of such metrics as Altmetrics. I seek to provide a theoretical foundation of altmetrics, based on notions developed by Michael Nielsen in his monograph Reinventing Discovery: The New Era of Networked Science. Altmetrics can be conceived as tools for the practical realization of the ethos of science and scholarship in a computerized or digital age.

Posted Content
TL;DR: The trend to cite older papers is not fully explained by technology, but may be the result of a structural shift to fund incremental and applied research over fundamental science.
Abstract: Analyzing 13,455 journals listed in the Journal Citation Reports (Thomson Reuters) from 1997 through 2013, we report that the mean cited half-life of the scholarly literature is 6.5 years and growing at a rate of 0.13 years per annum. Focusing on a subset of journals (N=4,937) for which we have a continuous series of half-life observations, 209 of 229 (91%) subject categories experienced increasing cited half-lives. Contrary to the overall trend, engineering and chemistry journals experienced declining cited half-lives. Last, as journals attracted more citations, a larger proportion of them were directed toward older papers. The trend to cite older papers is not fully explained by technology (digital publishing, search and retrieval, etc.), but may be the result of a structural shift to fund incremental and applied research over fundamental science.
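A cited half-life can be read as the median age of the items a journal's citations point to in a given year. The sketch below computes it from a histogram of citations by age of the cited item. The linear interpolation within a year, the function name, and the toy counts are assumptions made for illustration and may differ from the exact procedure used in the Journal Citation Reports.

```python
from typing import Sequence


def cited_half_life(citations_by_age: Sequence[float]) -> float:
    """Age (in years) by which half of all citations are accounted for.

    citations_by_age[0] holds citations to items from the current year,
    citations_by_age[1] those to items from the previous year, and so on.
    """
    total = sum(citations_by_age)
    if total <= 0:
        raise ValueError("at least one citation is required")
    half = total / 2.0
    cumulative = 0.0
    for years, count in enumerate(citations_by_age, start=1):
        if count > 0 and cumulative + count >= half:
            # Assumed: citations are spread evenly within the year,
            # so interpolate linearly to find the fractional age.
            return (years - 1) + (half - cumulative) / count
        cumulative += count
    raise RuntimeError("unreachable: cumulative count never reached half")


# Hypothetical journal: 100 citations spread over five publication years.
toy_half_life = cited_half_life([10, 20, 30, 20, 20])
```

In the toy example the first two years account for 30 of 100 citations, so the halfway point falls two thirds of the way into the third year.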

Posted Content
TL;DR: In a follow-up to the highly-cited authors list published by Thomson Reuters in June 2014, the authors analyzed the top-1% most frequently cited papers published between 2002 and 2012 included in the Web of Science (WoS) subject category "Information Science & Library Science."
Abstract: As a follow-up to the highly-cited authors list published by Thomson Reuters in June 2014, we analyze the top-1% most frequently cited papers published between 2002 and 2012 included in the Web of Science (WoS) subject category "Information Science & Library Science." 798 authors contributed to 305 top-1% publications; these authors were employed at 275 institutions. The authors at Harvard University contributed the largest number of papers, when the addresses are whole-number counted. However, Leiden University leads the ranking, if fractional counting is used. Twenty-three of the 798 authors were also listed as most highly-cited authors by Thomson Reuters in June 2014 (this http URL). Twelve of these 23 authors were involved in publishing four or more of the 305 papers under study. Analysis of co-authorship relations among the 798 highly-cited scientists shows that co-authorships are based on common interests in a specific topic. Three topics were important between 2002 and 2012: (1) collection and exploitation of information in clinical practices, (2) the use of internet in public communication and commerce, and (3) scientometrics.
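The difference between whole-number and fractional counting of addresses, which changes the leading institution in the ranking above, can be sketched as follows. The data layout (one list of institutional addresses per paper), the function names, and the placeholder institution labels are hypothetical; the abstract does not specify how fractions were assigned, so equal shares per address are assumed.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def whole_counts(papers: List[List[str]]) -> Dict[str, int]:
    """Whole-number counting: an institution gets full credit for each paper it appears on."""
    counts: Counter = Counter()
    for addresses in papers:
        for institution in set(addresses):
            counts[institution] += 1
    return dict(counts)


def fractional_counts(papers: List[List[str]]) -> Dict[str, float]:
    """Fractional counting (assumed scheme): each address receives an equal share of one paper."""
    counts: Dict[str, float] = defaultdict(float)
    for addresses in papers:
        share = 1.0 / len(addresses)
        for institution in addresses:
            counts[institution] += share
    return dict(counts)


# Hypothetical papers, each listed as the institutions of its author addresses.
toy_papers = [
    ["Inst A", "Inst B", "Inst B", "Inst B"],
    ["Inst A", "Inst C", "Inst C", "Inst C"],
    ["Inst A", "Inst C"],
    ["Inst B"],
]

whole = whole_counts(toy_papers)
fractional = fractional_counts(toy_papers)
```

In this toy set "Inst A" appears on the most papers and leads under whole counting, whereas "Inst B" leads under fractional counting because it holds larger shares of fewer papers.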