Characterization of public datasets for Recommender Systems



09 August 2022
POLITECNICO DI TORINO
Repository ISTITUZIONALE
Characterization of Public Datasets for Recommender Systems / Cano, Erion; Morisio, Maurizio. - ELECTRONIC. - (2015), pp. 249-257. (Paper presented at the conference Research and Technologies for Society and Industry Leveraging a better tomorrow (RTSI), 2015 IEEE 1st International Forum on, held in Torino (Italy), 16-18 Sep. 2015) [10.1109/RTSI.2015.7325106].
Availability: This version is available at: 11583/2636630 since: 2017-03-01T11:36:00Z
Publisher: IEEE
Published DOI: 10.1109/RTSI.2015.7325106
Terms of use: openAccess. This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository.

Characterization of Public Datasets for Recommender Systems
Erion Çano
Politecnico di Torino
Email: erion.cano@polito.it
Maurizio Morisio
Politecnico di Torino
Email: maurizio.morisio@polito.it
Abstract: As Recommender Systems become increasingly common and widespread, there is a growing need to evaluate characteristics such as their accuracy, diversity and scalability. One of the most fruitful ways to do this is to use public datasets with explicit user feedback about the items. In this paper we present and describe more than 20 available datasets covering different domains such as movies, books and music. Each dataset is described by a number of attributes such as size, domain, data format and type of access. Unfortunately, we did not find any information about the quality of the data they contain, which remains an open issue. We also refer to examples from the literature that use the datasets to evaluate recommendation algorithms or solutions. The overall aim of the paper is to offer a convenient resource for finding and selecting datasets in support of the empirical evaluation of recommendation algorithms and techniques.
Index Terms: Public Datasets, Recommender Systems, Recommendation Evaluations.
I. INTRODUCTION
Recommender Systems (RS) are now used to recommend almost any kind of product or service, such as movies, books, music, food or software. Evaluating a recommender system with respect to quality criteria such as accuracy, diversity or novelty is an important step. One way to do that is to run public user evaluation campaigns based on questionnaires, which provide rich feedback but are often heavy to process. It is also difficult to find a significant number of feedback providers. Another very common technique is based on using public datasets with real user feedback. In this paper we describe some of the most common public datasets available for research in RS. These datasets can be used to compare newly developed algorithms with existing ones in given settings. Such datasets store a representation of the implicit or explicit feedback of users about the candidate items, which allows the RS to produce a recommendation. The feedback comes in different forms. In many cases it consists of ratings or votes on the items. The user-rating matrix used in collaborative filtering is a well-known example. In this case the evaluation consists of comparing the predicted ratings with the real ones. For content-based or other RS types, the feedback can be item reviews or simple tags (keywords) that users provide for items.
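Comparing predicted ratings with the real ones is commonly done with error measures such as the mean absolute error (MAE) and the root mean squared error (RMSE). The following is a minimal sketch of both measures; the rating values are hypothetical and are not taken from any of the datasets described here.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and real ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error between predicted and real ratings."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical held-out ratings (1-5 scale) and a model's predictions for them.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(mae(predicted, actual))   # 0.625
print(rmse(predicted, actual))  # 0.75
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of prediction quality.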
Public datasets are usually made available by university research groups or similar institutions. In many cases these groups also publish their own results obtained using the datasets. The aim of this paper is to describe the main characteristics of the datasets and provide examples of their use, which can be helpful to the many researchers currently working in the field of RS. We present 26 datasets, 20 of which are public and active whereas 6 are retired or restricted. They come from different domains and have various characteristics. Eight cover movies, 5 cover books, articles or other learning materials, 4 cover music, 3 cover food or healthcare and 6 come from other domains. We also provide examples, extracted from the literature, of using the datasets in different contexts and for different purposes. More specific and technical details about how to use the datasets for recommendation evaluations can be found in [1]. The rest of this paper is structured as follows: in Section II we describe the active datasets grouped by the domains they pertain to. In Section III we describe the restricted or retired datasets. In Section IV we conclude with a quantitative discussion.
II. AVAILABLE DATASETS
The datasets were found as part of a systematic literature review on RS that we are conducting. This paper is in fact a byproduct of that systematic review, in which we address different research questions

Table 1: Complete list of Datasets and their attributes

| Name | Domain | Size | Collected | Status | Format |
|---|---|---|---|---|---|
| MovieLens | Movies | DS1: 1K users, 1700 movies, 100K ratings | Released 4/1998 | Active/Non-commercial | Csv |
| | | DS2: 6K users, 4K movies, 1M ratings | Released 2/2003 | | |
| | | DS3: 72K users, 10K movies, 10M ratings | Released 1/2009 | | |
| | | DS4: 138K users, 27K movies, 20M ratings | Released 4/2015 | | |
| | | DS5: 230K users, 27K movies, 21M ratings | Released 4/2015 | | |
| Yahoo Movie Ratings | Movies | | November 2003 | Request/Academic | |
| MovieTweetings | Movies | 12425 users, 8458 movies, 65115 ratings | Released 3/1/2013 | Active/Non-commercial | Txt |
| IMDB | Movies | | Since 1998 | Active/Restricted | |
| EachMovie | Movies | 72,916 users, 1628 movies, 2,811,983 ratings | | Retired in 2004 | |
| Netflix | Movies | 480k users, 17770 movies, 100M ratings | 1998 - 2005 | Inactive/Non-commercial | Txt |
| Cornell University | Movies | DS1: 1k positive, 1k negative reviews | Released June 2004 | Active/Non-commercial | Txt |
| | | DS2: 1770, 902, 1307, 1027 movie reviews | Released July 2005 | | |
| | | DS3: 5k subjective, 5k objective sentences | Released June 2004 | | |
| Rotten Tomatoes | Movies | 222352 reviews about 11855 movies | | Active/Non-commercial | Txt |
| Last.fm | Music | DS1: 1915086 lines, 992 users, 69420 artists | May 2007 | Active/Non-commercial | Txt |
| | | DS2: 17559530 lines, 359347 users, 107373 artists | | | |
| Yahoo Music Ratings | Music | 15,400 users, 1K songs, 300K ratings | 2002 - 2006 | Request/Academic | |
| Audioscrobbler | Music | | May 2005 | Merged with Last.fm | Txt |
| Million Songs | Music | 1M songs, 44745 artists, 7643 unique terms, 2321 unique tags | | Active/Non-commercial | HDF5 |
| Citation Papers | Research | DS1: 28 researchers x 597 papers | June 2010 | Active/Non-commercial | Txt |
| | | DS2: 50 researchers x 100531 papers | | | |
| Mendeley | Research | 1857912 articles, 200K users | Since 2009 | Request/Non-commercial | |
| | | 254681 tags to 27652 articles by 4099 users | | | |
| MACE | Learning | 1148 users, 150K resources, 47K tags | 2006 - 2009 | Active/Private | |
| Apostle-DS | Learning | 1500 user activities of 6 users for 3 months | 3 months | Request/Non-commercial | |
| BookCrossing | Books | 278,858 users, 271379 books, 1149780 ratings | Aug - Sep 2004 | Active/Non-commercial | Sql/Csv |
| Organic.Edunet | Food | 345 tags, 250 ratings, 325 reviews | Jan 2010 - Sep 2010 | Active/Non-commercial | Rdf/Txt |
| Chicago Entree | Food | | Sep 1996 - Apr 1999 | Active/Non-commercial | Txt |
| Mediacare | Health | Ratings of 15K nursing homes, 4K hospitals | | Active/Non-commercial | Mdb/Csv |
| Dating website | Dating | 135,359 users, 17,359,346 ratings | April 2006 | Active/Non-commercial | Txt |
| WikiLens | Various | | Feb. 2008 | Retired in 2009 | Txt |
| Epinions | Various | 131228 users, 317755 items, 1127673 reviews | June 2011 | Active/Non-commercial | Sql |
| Yahoo Front Page | News | 28041015 user visits | 2 - 16 Oct. 2011 | Request/Academic | |
| Tags2Con | URLs | 1397 users, 1569 tags, 1681 ratings, 739 URLs, 603 domains | Dec 2007 - Apr 2008 | Active/Non-commercial | Rdf |
| Jester | Jokes | DS1: 4.1M ratings (-10.00 to +10.00) of 100 jokes from 73421 users | Apr 1999 - May 2003 | Active/Non-commercial | Xls |
| | | DS2: 1.7M ratings (-10.00 to +10.00) of 150 jokes from 59132 users | Nov 2006 - May 2009 | | |
| | | DS3: Update of DS2 with 500K new jokes by 79681 users | Nov 2006 - Nov 2012 | | |
that concern the construction and evaluation of RS. We searched SpringerLink, ScienceDirect, IEEE Xplore and ACM using keywords such as "Evaluating Recommender Systems" and "Public Datasets for Recommender Systems". In Table 1 we summarize some basic characteristics of the 26 datasets we present. In the 'Size' column we give the number of users, the number of items managed and the number of ratings (in some cases reviews or tags). The 'Collected' column contains the period when the dataset was started or, in some cases, the release date. In the 'Status' column we show the current status of the dataset and its usage conditions/constraints. Most of the datasets are "Active" (in the sense that they are freely available and usually updated). Some of the datasets can be obtained upon request to the owner/maintainer (denoted as "Request"). Most of the datasets can be used for any non-commercial purpose. Some datasets (denoted as "Academic") can be used for research purposes only. The 'Format' column gives the data format.
A. Movies
MovieLens. This is one of the most used datasets for algorithm evaluation in the Recommender Systems research community. The dataset was collected and made available by GroupLens Research, a research lab in the Department of Computer Science at the University of Minnesota. There are four stable versions of the dataset, which vary in size and release date (Table 2). The fifth version is the most recent; because it changes over time, it is not appropriate for reporting research results [2]. The datasets are public and open for non-commercial use only. The many user ratings these datasets contain make them very suitable for evaluating different versions of Collaborative Filtering recommendation algorithms.
Table 2: Versions of MovieLens Dataset

| Version | Users | Movies | Ratings | Released | Format |
|---|---|---|---|---|---|
| 100k | 1k | 1.7k | 100k | 4/1998 | Txt |
| 1M | 6k | 4k | 1M | 2/2003 | Csv |
| 10M | 72k | 10k | 10M | 1/2009 | Csv |
| 20M | 138k | 27k | 20M | 4/2015 | Csv |
| Latest | 230k | 27k | 21M | 4/2015 | Csv |
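One quantity that matters when choosing among these versions for collaborative filtering is the density of the user-rating matrix, i.e. the fraction of user-movie pairs that actually carry a rating. The sketch below estimates it from the rounded figures in Table 2, so the results are approximations only; the `density` helper is an illustrative name, not part of any MovieLens tooling.

```python
# Rounded (users, movies, ratings) counts as reported in Table 2.
versions = {
    "100k": (1_000, 1_700, 100_000),
    "1M": (6_000, 4_000, 1_000_000),
    "10M": (72_000, 10_000, 10_000_000),
    "20M": (138_000, 27_000, 20_000_000),
}


def density(users, items, ratings):
    """Fraction of cells in the user-item matrix that hold a rating."""
    return ratings / (users * items)


for name, (users, movies, ratings) in versions.items():
    print(f"{name}: about {density(users, movies, ratings):.2%} of the matrix is filled")
```

Under these rounded counts the matrix becomes sparser as the versions grow, which is worth keeping in mind when comparing results obtained on different versions.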
Yahoo Movie Ratings. This dataset contains a small sample of the Yahoo Movies community's preferences for various movies. The movies are rated on a scale from A+ to F. The dataset also contains a large amount of descriptive information about many movies released prior to November 2003, including cast, crew, synopsis, genre, average ratings and awards. Researchers may use it to validate recommender systems based on different algorithms, including hybrid content-based and collaborative filtering. The dataset is available for download upon request from [3] and can be used for academic purposes only.
MovieTweetings. This is a very recent movie dataset which collects ratings from users' tweets. It is composed of 2 files, which contain the movies and the ratings respectively. Unlike MovieLens and other filtered datasets, which contain only users with a minimum number of ratings (e.g. 20), MovieTweetings has a number of ratings per user that varies from 1 to 305. The authors started querying the Twitter search API in March 2013. Since then the dataset has grown continuously based on the daily tweets of different users. Currently the dataset contains 12425 unique users, 8458 unique items and 65115 ratings. The many recent movie ratings it contains make it suitable for evaluating various versions of Collaborative Filtering algorithms. The dataset is open for public use and can be downloaded from [4]. More details about MovieTweetings can be found in the authors' publication [5].
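The minimum-ratings filter mentioned above can be applied to an unfiltered dataset when a comparison with filtered datasets is needed. The following is a minimal sketch that assumes ratings are available as (user, item, rating) tuples; the sample rows and the `filter_min_ratings` name are hypothetical.

```python
from collections import Counter


def filter_min_ratings(ratings, min_ratings=20):
    """Keep only the ratings of users who supplied at least min_ratings ratings."""
    counts = Counter(user for user, _item, _rating in ratings)
    return [row for row in ratings if counts[row[0]] >= min_ratings]


# Hypothetical (user, item, rating) rows; a low threshold keeps the demo small.
ratings = [
    ("u1", "m1", 8), ("u1", "m2", 6), ("u1", "m3", 9),
    ("u2", "m1", 7),
]

kept = filter_min_ratings(ratings, min_ratings=3)
print(kept)  # only the three rows of user u1 remain
```

Filtering in this way removes sparse users, which typically makes collaborative filtering easier but also makes results less representative of cold-start conditions.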
Movie Review Dataset. This dataset was created by researchers at Cornell University and includes subjective movie reviews. There are actually 3 datasets: the Sentiment polarity, Sentiment scale and Subjectivity datasets. The Sentiment polarity dataset is a collection of 1000 positive and 1000 negative movie reviews. The Sentiment scale dataset is a collection of four subjective review sets (with 1770, 902, 1307 and 1027 movie reviews respectively) which can be used to infer ratings. The Subjectivity dataset is a collection of 5000 subjective and 5000 objective movie review sentences. The first and third datasets were released in June 2004 and the second in July 2005. All the data are in Txt format. These datasets are very suitable for evaluating Machine Learning, Natural Language Processing or Text Processing algorithms and techniques used for ratings or recommendations. The authors use the Sentiment scale dataset in [6] to evaluate their algorithm that addresses the rating-inference problem. They also use the first and third datasets in [7] to train and evaluate the Naive Bayes and SVM text categorizers they use to determine the sentiment polarity of subjective user reviews. The datasets are publicly available for non-commercial use. More information about the datasets and the download links can be found at [8].
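To make the idea of a Naive Bayes text categorizer for sentiment polarity concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not the implementation used in [7]; the class name, the whitespace tokenization and the four toy reviews are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy reviews, invented for this example.
docs = [
    "a moving and brilliant film",
    "brilliant acting and a great story",
    "a dull and boring plot",
    "boring film with terrible acting",
]
labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayesSketch().fit(docs, labels)
print(model.predict("great and brilliant story"))  # pos
print(model.predict("terrible boring plot"))       # neg
```

On the real Sentiment polarity dataset one would train on a subset of the 2000 reviews and measure accuracy on the held-out remainder.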
Rotten Tomatoes. This is a highly viewed movie review website with a rich movie dataset. A subset of its dataset can be freely downloaded from [9]. It consists of tab-separated textual movie review phrases from the original Rotten Tomatoes dataset. The dataset contains 222352 reviews about 11855 movies. It comes with a train/test split, which makes it more suitable for benchmarking algorithms. The structure of the dataset makes it very suitable for sentiment analysis and machine learning research and benchmarking.
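Tab-separated review files of this kind can be read with Python's standard csv module. The sketch below parses an in-memory sample; the column names `phrase` and `sentiment` and the sample rows are hypothetical, and the actual files may use a different layout.

```python
import csv
import io

# Hypothetical sample with a header row; the real column layout may differ.
sample = "phrase\tsentiment\nA gripping story\t4\nFlat and lifeless\t1\n"


def read_reviews(handle):
    """Read tab-separated rows into (phrase, sentiment) pairs."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["phrase"], int(row["sentiment"])) for row in reader]


reviews = read_reviews(io.StringIO(sample))
print(reviews)
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`.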
The movie domain is clearly the most popular for recommender systems research. This is partly because there are many rich public datasets (apart from the 5 above, we describe 3 other restricted/retired movie datasets in the next section) with explicit user feedback (movie ratings). The regular structure and contents of these datasets make them very suitable for evaluating different versions of Collaborative Filtering or hybrid recommendation algorithms. MovieLens is probably the most used dataset for research purposes.
B. Music
Last.fm. This is a subset of the music dataset of the last.fm website, one of the most important in the music domain. There are actually two versions of the dataset: Last.fm-360k, which includes the top artists of 360k users, and Last.fm-1k, which includes the full listening history of 1k users. The datasets contain information about 69420 and 107373 artists respectively. They also contain information about the track preferences of the users and some attributes of every user. This dataset is suitable for evaluating content-based or hybrid algorithms. All the data is stored as tab-separated textual values. The dataset can be freely downloaded from [10] for non-commercial use only. Users of the dataset are also required to reference the last.fm website.
Yahoo Music Ratings. This dataset contains music ratings supplied by users while browsing Yahoo Music services, together with ratings for randomly selected songs collected during an online survey conducted by Yahoo Research [11]. The rating data includes 15.4k users and 1k songs. The dataset includes approximately 300k user-supplied ratings and exactly 54k ratings for randomly selected songs. The data were collected between 2002 and 2006. The rich user feedback (many explicit ratings) it contains makes it very suitable for evaluating different versions of collaborative filtering or hybrid recommendation algorithms. It can be downloaded upon request from [3] and used for academic purposes only.
The Million Song Dataset. The Million Song Dataset (MSD) is an attempt to help researchers in the field of Music Analysis and Recommendations by providing a large-scale dataset. The main purposes of the dataset are:
- encouraging research on algorithms that scale to commercial sizes
- providing a reference dataset for evaluating algorithms by using the audio features
- being a shortcut alternative to creating a large dataset with the Echo Nest's API
- helping new researchers get started in the MIR field, develop music recommendations and study music similarity.
A large dataset helps to reveal scaling problems in algorithms and to discover rare phenomena or patterns that may not be discoverable in small datasets. The dataset contains 1M songs/files, 44745 unique artists, 7643 unique terms (Echo Nest tags) and 2321 unique tags (more detail at [12]). The data is stored in HDF5 format. Its size and the many music features it contains (e.g. pitches, timbre and loudness) make MSD very suitable for evaluating content-based or hybrid recommendation algorithms. The dataset can be downloaded from [13] for non-commercial use (be aware that the download size is about 300 GB).
For a long time Music Information Retrieval (MIR) research has suffered from the lack of publicly available, large-scale open data for personalized music recommendations, mainly because of privacy and intellectual property concerns. Things seem to have changed with the partial release of last.fm data (March 2010) and the publication of MSD (March 2011). These datasets are a very good source for evaluating different kinds of content-based

References
- A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts
- Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales
- Improving recommendation lists through topic diversification
- Eigentaste: A Constant Time Collaborative Filtering Algorithm
- The million song dataset