
Showing papers by "J. Stephen Downie published in 2018"


Proceedings ArticleDOI
23 May 2018
TL;DR: This work distills from prior user studies three key objectives for worksets (extra-digital library manipulation, intra-item properties, and robust representations) and describes how HTRC's implementation of its RDF-compliant workset model helps to satisfy these objectives.
Abstract: Scholars using digital libraries and archives routinely create worksets -- aggregations of digital objects -- as a way to segregate resources of interest for in-depth scrutiny. To illustrate how worksets can enhance the scholarly utility of digital library content, we distill from prior user studies three key objectives for worksets (extra-digital library manipulation, intra-item properties, and robust representations), and discuss how they motivated the workset model being developed at the HathiTrust Research Center (HTRC). We describe how HTRC's implementation of its RDF-compliant workset model helps to satisfy these objectives.
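
As an illustration of the general idea (not HTRC's published model), here is a minimal Python sketch of a workset expressed as an RDF aggregation with rdflib; the wst: vocabulary, URIs, and volume IDs are invented for this example:

    # Sketch: representing a workset as an RDF aggregation with rdflib.
    # The wst: vocabulary below is a hypothetical stand-in for HTRC's model.
    from rdflib import Graph, Namespace, Literal, URIRef
    from rdflib.namespace import DC, RDF

    WST = Namespace("http://example.org/workset#")  # assumed namespace
    g = Graph()
    g.bind("wst", WST)
    g.bind("dc", DC)

    workset = URIRef("http://example.org/worksets/victorian-novels")
    g.add((workset, RDF.type, WST.Workset))
    g.add((workset, DC.creator, Literal("J. Scholar")))
    g.add((workset, DC.title, Literal("Victorian novels, 1860-1880")))

    # Each gathered item points back to the digital object it aggregates,
    # so the workset can be manipulated outside the originating library.
    for vol_id in ["mdp.39015012345678", "uc1.b000987654"]:
        item = URIRef(f"http://example.org/worksets/victorian-novels#{vol_id}")
        g.add((workset, WST.gathers, item))
        g.add((item, WST.sourceVolume, Literal(vol_id)))

    print(g.serialize(format="turtle"))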

3 citations


Proceedings ArticleDOI
23 May 2018
TL;DR: This work investigates a data-driven vector representation of word embeddings for the task of classifying song lyrics into their semantic topics and adopts the averaged word vectors from the lyrics and users' interpretations of them, which are short in general.
Abstract: In this work we investigate a data-driven vector representation of word embeddings for the task of classifying song lyrics into their semantic topics. Previous research on topic classification of song lyrics has used traditional frequency-based text representations. In contrast, empirically driven word embeddings have shown notable performance improvements on text classification tasks because of their ability to capture semantic relationships between words from big data. As averaging the word vectors of a short text is known to work reasonably well compared to other, more comprehensive models that exploit word order, we adopt the averaged word vectors from the lyrics and users' interpretations of them, which are short in general, as the feature for this classification task. This simple approach achieved a promising classification accuracy of 57%. From this, we envision the potential of data-driven approaches to creating features, such as sequences of word vectors and doc2vec models, to improve the performance of the system.
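
To make the averaged-word-vector feature concrete, here is a minimal Python sketch; the toy embedding dictionary stands in for a pretrained word2vec-style lookup, and scikit-learn's logistic regression substitutes for whatever classifier the authors used:

    # Sketch: classify lyrics by topic via averaged word embeddings.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy stand-in for a pretrained embedding lookup (assumed, not real data).
    rng = np.random.default_rng(0)
    vocab = ["love", "heart", "dance", "party", "night", "tears"]
    embeddings = {w: rng.normal(size=50) for w in vocab}

    def average_vector(text, dim=50):
        # Average vectors of in-vocabulary tokens; zero vector if none hit.
        vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    lyrics = ["love heart tears", "dance party night", "heart love", "party dance"]
    topics = ["love", "party", "love", "party"]

    X = np.vstack([average_vector(t) for t in lyrics])
    clf = LogisticRegression(max_iter=1000).fit(X, topics)
    print(clf.predict([average_vector("tears love night")]))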

3 citations


Proceedings ArticleDOI
23 May 2018
TL;DR: An extended example is presented of a web environment in use that allows users to search over 1 trillion tokens of text in the HathiTrust Part-of-Speech Extracted Features Dataset to help produce worksets for scholarly analysis.
Abstract: We report on the work undertaken developing a web environment that allows users to search over 1 trillion tokens of text -- down to the page-level -- of the HathiTrust Part-of-Speech Extracted Features Dataset to help produce worksets for scholarly analysis. We present an extended example of the web environment in use, along with details about its implementation.
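
By way of illustration only, here is a short Python sketch of page-level indexing over the dataset's files; the field names assume the Extracted Features 1.x JSON schema, and the file path and volume ID are hypothetical:

    # Sketch: a page-level inverted index over HathiTrust Extracted
    # Features files; field names assume the EF 1.x JSON schema.
    import bz2, json
    from collections import defaultdict

    index = defaultdict(list)  # token -> [(volume_id, page_seq), ...]

    def index_volume(path, volume_id):
        with bz2.open(path, "rt", encoding="utf-8") as f:
            ef = json.load(f)
        for page in ef["features"]["pages"]:
            for token in page["body"]["tokenPosCount"]:
                index[token.lower()].append((volume_id, page["seq"]))

    index_volume("mdp.39015012345678.json.bz2", "mdp.39015012345678")
    print(index.get("whale", [])[:10])  # pages on which "whale" occurs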

2 citations


Book ChapterDOI
25 Mar 2018
TL;DR: A large-scale dataset of similar artists recommended in four well-known online music streaming services, namely Spotify, Last.fm, the Echo Nest, and KKBOX, was collected; preliminary results reveal that similar artists in these services were related to the genre and popularity of the artists.
Abstract: In supporting music search, online music streaming services often suggest artists who are deemed as similar to those listened to or liked by users. However, there has been an ongoing debate on what constitutes artist similarity. Approaching this problem from an empirical perspective, this study collected a large-scale dataset of similar artists recommended in four well-known online music streaming services, namely Spotify, Last.fm, the Echo Nest, and KKBOX, on which an exploratory quantitative analysis was conducted. Preliminary results reveal that similar artists in these services were related to the genre and popularity of the artists. The findings shed light on how the concept of artist similarity is manifested in widely adopted real-world applications, which will in turn help enhance our understanding of music similarity and recommendation.
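
One simple way to quantify cross-service agreement of this kind, sketched here in Python with invented artist lists (not the study's data), is pairwise Jaccard overlap between the services' similar-artist sets:

    # Sketch: agreement between services' similar-artist lists via Jaccard.
    from itertools import combinations

    similar = {  # service -> artists recommended as similar to one seed artist
        "Spotify":  {"Artist A", "Artist B", "Artist C"},
        "Last.fm":  {"Artist B", "Artist C", "Artist D"},
        "EchoNest": {"Artist A", "Artist C", "Artist E"},
        "KKBOX":    {"Artist C", "Artist D", "Artist F"},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    for (s1, r1), (s2, r2) in combinations(similar.items(), 2):
        print(f"{s1} vs {s2}: {jaccard(r1, r2):.2f}")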

2 citations


Journal ArticleDOI
TL;DR: This paper provides insight into the mood metadata on Chinese music websites and uniquely contributes to existing knowledge of culturally diversified music access.
Abstract: Purpose: Music mood is an important metadata type on online music repositories and streaming music services worldwide. Many existing studies on mood metadata have focused on music websites and services in the Western world to the exclusion of those serving users in other cultures. The purpose of this paper is to bridge this gap by exploring mood labels on influential Chinese music websites. Design/methodology/approach: Mood labels and the associated song titles were collected from six Chinese music websites and analyzed in relation to mood models and findings in the literature. An online music listening test was conducted to solicit users' feedback on the mood labels on two popular Chinese music websites. Mood label selections on 30 songs from 64 Chinese listeners were collected and compared to those given by the two websites. Findings: Mood labels, although extensively employed on Chinese music websites, may be insufficient in meeting listeners' needs; more mood labels of high-arousal semantics are needed. Song language and users' familiarity with the songs influence their selection of the mood labels given by the websites. Practical implications: Suggestions are proposed for future development of mood metadata and mood-enabled user interfaces in the context of global online music access. Originality/value: This paper provides insight into the mood metadata on Chinese music websites and uniquely contributes to existing knowledge of culturally diversified music access.
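
A minimal Python sketch of the kind of agreement computation implied by the listening test; the songs, labels, and listener choices below are invented placeholders, not the paper's data:

    # Sketch: how often listeners' mood choices match a site's label.
    listener_choices = {  # song_id -> mood labels chosen by listeners
        "song01": ["sad", "sad", "calm"],
        "song02": ["happy", "excited", "happy"],
    }
    website_label = {"song01": "sad", "song02": "happy"}

    for song, choices in listener_choices.items():
        agree = sum(c == website_label[song] for c in choices) / len(choices)
        print(f"{song}: {agree:.0%} of listeners agree with the site label")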

2 citations


Proceedings ArticleDOI
28 Sep 2018
TL;DR: This paper highlights critical processes of data curation for digital libraries, including quality assessment of the ingested datasets, describes research questions enabled by JazzCats, raises musicological implications, and offers suggestions to overcome current limitations.
Abstract: Applying Linked Data techniques to musical metadata can facilitate new paths of musicological inquiry. JazzCats: Jazz Collection of Aggregated Triples is a prototype project interlinking four discrete jazz performance datasets and external sources as references. Tabular, relational, and graph legacy datasets have necessitated different RDF production and ingestion workflows to support scholarly study of performance traditions. This paper highlights critical processes of data curation for digital libraries, including quality assessment of the ingested datasets. In addition, we describe research questions enabled by JazzCats, raise musicological implications, and offer suggestions to overcome current limitations.
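
As a hedged illustration of quality assessment over an ingested RDF dataset (not the project's actual workflow), here is a short rdflib sketch that counts triples and flags untyped subjects in a hypothetical ingest dump:

    # Sketch: simple RDF quality checks with rdflib.
    from rdflib import Graph
    from rdflib.namespace import RDF

    g = Graph()
    g.parse("jazzcats_ingest.ttl", format="turtle")  # hypothetical dump file

    print(f"{len(g)} triples, {len(set(g.predicates()))} distinct predicates")

    # Subjects without an rdf:type often signal incomplete ingestion.
    untyped = {s for s in g.subjects() if (s, RDF.type, None) not in g}
    print(f"{len(untyped)} subjects without an rdf:type")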

1 citation


Proceedings ArticleDOI
09 Oct 2018
TL;DR: The jazz performance metadata prototype JazzCats: Jazz Collection of Aggregated Triples uses Linked Data to bridge four discrete jazz music datasets and is a new digital resource that can be used to support and enrich scholarship and research in musicology and performance studies.
Abstract: The jazz performance metadata prototype JazzCats: Jazz Collection of Aggregated Triples uses Linked Data to bridge four discrete jazz music datasets: Linked Jazz, with prosopographical and interpersonal information about musicians; the Weimar Jazz Database (WJazzD), containing musicological metadata; BodyS, a discography of the jazz standard "Body and Soul"; and J-DISC, a fourth independent but complementary and extensive discographic project. Through the use of custom-built ontological structures, the data, originally stored in various information structures, has been converted to RDF and merged in a single triplestore. The result is a new digital resource that can be used to support and enrich scholarship and research in musicology and performance studies.
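
To suggest how such a merged triplestore might be queried, here is a Python sketch using SPARQLWrapper; the endpoint URL and the performedIn property are assumptions for illustration, not JazzCats' published vocabulary:

    # Sketch: querying a merged triplestore with SPARQL via SPARQLWrapper.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://example.org/jazzcats/sparql")  # hypothetical
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?name (COUNT(?performance) AS ?n)
        WHERE {
            ?musician foaf:name ?name ;
                      <http://example.org/vocab/performedIn> ?performance .
        }
        GROUP BY ?name ORDER BY DESC(?n) LIMIT 10
    """)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["name"]["value"], row["n"]["value"])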

1 citation


01 Jan 2018
TL;DR: This project seeks to pilot a classification process using manually assigned ground truth on a subset of volumes from the HathiTrust, and suggests full-scale deployment of a statistical classifier on a large corpus of literature in order to assemble a disability corpus.
Abstract: As literary texts open to researchers for distant reading (the computational analysis of large corpora of text for literary scholarship), problems beyond typical data science roadblocks, such as data scale and the statistical significance of findings, have emerged. For scholars studying character and social representation in literature, the identification of characters within the given classes of study is crucial, painstaking, and often a manual process. However, for characters with disabilities, manual identification is prohibitively difficult to undertake at scale, and especially challenging given the coded textual markers that can be used to refer to disability. There currently exists no corpus of characters with disabilities in fiction, which is the first step toward at-scale computational study of this topic. This project seeks to pilot a classification process using manually assigned ground truth on a subset of volumes from the HathiTrust. Having successfully built and evaluated a Naïve Bayes classifier, we suggest full-scale deployment of a statistical classifier on a large corpus of literature in order to assemble a disability corpus. This project also covers preliminary exploratory textual analysis of characters with disabilities to yield potential research questions for further exploration.
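
A minimal scikit-learn sketch of the kind of bag-of-words Naive Bayes pipeline piloted here; the passages and labels are toy placeholders, not the project's ground truth:

    # Sketch: Naive Bayes over bag-of-words features with scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    passages = [
        "he walked with a crutch after the accident",
        "she sang all evening at the fair",
        "confined to a wheelchair, he watched from the window",
        "they danced until midnight",
    ]
    labels = [1, 0, 1, 0]  # 1 = passage marks a character with a disability

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(passages, labels)
    print(clf.predict(["the old sailor leaned on his wooden leg"]))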

1 citation


Proceedings ArticleDOI
23 May 2018
TL;DR: This work introduces the seeding algorithm and explores seeding strategies for identifying initial concepts in text volumes, such as books, that are stored in a digital library.
Abstract: Capisco identifies the concepts represented by words and phrases in a text, for which we use an automatically generated Concept-in-Context (CiC) network. Words and phrases rarely belong to a single concept; disambiguation in Capisco relies on the interplay between words that are in close vicinity in the text. The disambiguation starts with a seeding process that identifies the first concepts, which then form the context for further disambiguation steps. This paper introduces the seeding algorithm and explores seeding strategies for identifying these initial concepts in text volumes, such as books, that are stored in a digital library.
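
One plausible reading of the seeding step, sketched in Python over a toy Concept-in-Context network; the data structure and the unambiguous-term heuristic are assumptions for illustration, not the paper's algorithm:

    # Sketch: seed disambiguation with terms that map to a single concept.
    cic = {  # term -> set of candidate concepts (assumed structure)
        "java":        {"Java_(language)", "Java_(island)"},
        "python":      {"Python_(language)", "Python_(snake)"},
        "compiler":    {"Compiler"},
        "interpreter": {"Interpreter_(computing)"},
    }

    def seed(terms):
        # Unambiguous terms become seeds; they anchor later disambiguation
        # of ambiguous neighbors in the surrounding text.
        return {t: next(iter(cic[t])) for t in terms
                if t in cic and len(cic[t]) == 1}

    page_terms = ["java", "compiler", "interpreter"]
    print(seed(page_terms))
    # -> {'compiler': 'Compiler', 'interpreter': 'Interpreter_(computing)'}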

1 citation


Proceedings ArticleDOI
23 May 2018
TL;DR: A bespoke proxying technique called Meddle -- for ModifiED Digital Library Environment -- is detailed: a lightweight, agile technique that operates independently of the originating digital library and helps address identified pitfalls in a DL search interface.
Abstract: We document how surprisingly easy it is for user misconceptions to arise when using digital library search interfaces, and the significant unseen impact this can have on the user's interpretation of search results. Further, we detail a bespoke proxying technique we have devised called Meddle -- for ModifiED Digital Library Environment -- a lightweight, agile technique that operates independently of the originating digital library and helps address identified pitfalls in a DL search interface.
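
A minimal sketch of a Meddle-style rewriting proxy in Python with Flask and requests; the upstream URL and the rewrite rule are invented for illustration, and this is not the authors' implementation:

    # Sketch: a lightweight proxy that rewrites a DL's search pages in transit.
    import requests
    from flask import Flask, request, Response

    app = Flask(__name__)
    UPSTREAM = "https://dl.example.org"  # hypothetical digital library

    @app.route("/<path:path>")
    def proxy(path):
        upstream = requests.get(f"{UPSTREAM}/{path}", params=request.args)
        body = upstream.text
        if "text/html" in upstream.headers.get("Content-Type", ""):
            # Example rewrite: surface how the DL actually treats the query,
            # correcting a misconception the original interface invites.
            body = body.replace("Results for your search",
                                "Results for your search (quotes are ignored)")
        return Response(body, status=upstream.status_code,
                        content_type=upstream.headers.get("Content-Type"))

    if __name__ == "__main__":
        app.run(port=8080)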

Proceedings ArticleDOI
23 May 2018
TL;DR: Within the context of mass-scale digital libraries, this panel will explore methodologies and uses for -- as well as the results of -- conceiving of "data as collections" and "collections as data" through use cases involving data mining of the HathiTrust Digital Library.
Abstract: Within the context of mass-scale digital libraries, this panel will explore methodologies and uses for -- as well as the results of -- conceiving of "data as collections" and "collections as data." The panel will explore the implications of these concepts through use cases involving data mining of the HathiTrust Digital Library, particularly major projects developed at the HathiTrust Research Center. Featured will be the Workset Creation for Scholarly Analysis + Data Capsules (WCSA+DC) project, the Solr Extracted Features project, and the Image Analysis for Archival Discovery (Aida) project. Each of these projects focuses on various aspects of text, image and data mining and analysis of mass-scale digital library collections.