What have the authors contributed in "Characterization of public datasets for recommender systems" ?

In this paper the authors present and describe more than 20 available datasets covering different domains such as movies, books, music etc. The overall aim of the paper is to offer a convenient resource for finding and selecting datasets as a support for the empirical evaluation of recommendation algorithms and techniques.

How many users have used tags in the Today Module?

The dataset includes annotations from users who have fewer than 1k tags and have used at least 10 different tags on 5 different websites.

What is the purpose of the dataset?

This dataset originates from the APOSDLE EU project, an adaptive work-integrated learning system that aims to improve knowledge worker productivity by supporting learning situations within everyday work tasks.

What is the purpose of the dataTEL challenge?

The Theme Team of the STELLAR Network of Excellence launched the first dataTEL Challenge [15], which is a call to research groups to submit datasets from Technology Enhanced Learning applications.

How many resources have been accessed by registered users?

They hold together about 47k tags, 12k classification terms and many other actions performed by the users such as viewing and downloading.

What is the version of the dataset that contains the jokes?

There are three versions of it: Dataset 1 contains more than 4.1M continuous ratings (-10.00 to +10.00) of 100 jokes from 73421 users, collected between April 1999 and May 2003.

What was the purpose of the dataTEL challenge?

It was actually used in [18] to provide data about library readership, librarystart and article tags and to experiment with user-based and item-based collaborative filtering algorithms for TEL.

(Open Access) Characterization of public datasets for Recommender Systems (2015) | Erion Çano

Q: What are the main purposes of the million song dataset?

The main purposes of the dataset are: encouraging research on algorithms that scale to commercial sizes; providing a reference dataset for evaluating algorithms by using the audio features; being a shortcut alternative for creating a large dataset with the Echo Nest's API; and helping new researchers get started in the MIR field, develop music recommendations and study music similarity.

Q: What search engines are used to find the datasets?

The authors searched in SpringerLink, ScienceDirect, IEEE Xplore and ACM using keywords like "Evaluating Recommender Systems", "Public Datasets for Recommender Systems" etc.

09 August 2022

POLITECNICO DI TORINO
Repository ISTITUZIONALE

Characterization of Public Datasets for Recommender Systems / Cano, Erion; Morisio, Maurizio. - Electronic. - (2015), pp. 249-257. (Paper presented at the conference Research and Technologies for Society and Industry Leveraging a better tomorrow (RTSI), 2015 IEEE 1st International Forum on, held in Torino (Italy), 16-18 Sep. 2015) [10.1109/RTSI.2015.7325106].

Availability: This version is available at: 11583/2636630 since: 2017-03-01T11:36:00Z

Publisher: IEEE

Published DOI: 10.1109/RTSI.2015.7325106

openAccess

This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository.

Characterization of Public Datasets for Recommender Systems

Erion Çano

Politecnico di Torino

Email: erion.cano@polito.it

Maurizio Morisio

Politecnico di Torino

Email: maurizio.morisio@polito.it

Abstract: As Recommender Systems are becoming very common and widespread, there is an increasing need to evaluate their characteristics such as accuracy, diversity, scalability etc. One of the most fruitful ways to do this is by using public datasets with explicit user feedback about the items. In this paper we present and describe more than 20 available datasets covering different domains such as movies, books, music etc. Each dataset is described over a number of attributes such as size, domain, format of the data and type of access. Unfortunately we did not find any information about the quality of the data contained, which remains an open issue. We also refer to examples from the literature about using the datasets to evaluate recommendation algorithms or solutions. The overall aim of the paper is to offer a convenient resource for finding and selecting datasets as a support for the empirical evaluation of recommendation algorithms and techniques.

Index Terms: Public Datasets, Recommender Systems, Recommendation Evaluations.

I. INTRODUCTION

Recommender Systems (RS) are now being used to recommend any kind of product or service such as movies, books, music, food, software etc. Evaluating a recommender system with respect to different quality criteria such as accuracy, diversity or novelty is a very important step. One way to do that is by running public user evaluation campaigns based on questionnaires, which provide rich feedback but are often heavy to process. It is also difficult to find a significant number of feedback providers. Another very common technique is based on using public datasets with real user feedback information. In this paper we describe some of the most common public datasets available for research in RS. These datasets can be used to compare newly developed algorithms with existing ones in given settings. In such datasets a representation of implicit or explicit feedback from the users regarding the candidate items is stored in order to allow the RS to produce a recommendation. The feedback takes different forms. In many cases it can be ratings or votes upon the items. The user-rating matrix used in collaborative filtering is a well-known example. In this case the evaluation consists in comparing the predicted ratings with the real ones. In the case of content-based or other RS types it can be item reviews or simple tags (keywords) that users provide for items.
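As a rough illustration of comparing predicted ratings with the real ones, the sketch below computes two widely used error measures, MAE and RMSE, over a few held-out ratings. The choice of measures and all rating values are assumptions made for illustration; the text above does not prescribe a specific measure.

```python
import math

def mae(actual, predicted):
    """Mean absolute error over the (user, item) pairs present in both dicts."""
    keys = actual.keys() & predicted.keys()
    return sum(abs(actual[k] - predicted[k]) for k in keys) / len(keys)

def rmse(actual, predicted):
    """Root mean squared error over the (user, item) pairs present in both dicts."""
    keys = actual.keys() & predicted.keys()
    return math.sqrt(sum((actual[k] - predicted[k]) ** 2 for k in keys) / len(keys))

# Hypothetical held-out ratings and the ratings some recommender predicted for them.
actual = {("u1", "m1"): 4.0, ("u1", "m2"): 2.0, ("u2", "m1"): 5.0}
predicted = {("u1", "m1"): 3.5, ("u1", "m2"): 3.0, ("u2", "m1"): 4.5}

print(f"MAE={mae(actual, predicted):.4f} RMSE={rmse(actual, predicted):.4f}")
```

Lower values mean the predicted ratings are closer to the real ones; RMSE penalizes large individual errors more heavily than MAE.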

The public datasets are usually made available by university research groups or similar institutions. In many cases they also publish their own results obtained using the datasets. The aim of this paper is to describe the main characteristics of the datasets and provide examples of using them, which can be helpful to the many researchers who are currently working in the field of RS. We present 26 datasets, 20 of which are public and active whereas 6 are retired or restricted. They come from different domains and have various characteristics. Eight are used for movies, 5 for books, articles or other learning materials, 4 for music, 3 for food or healthcare and 6 come from other domains. We also provide examples, extracted from the literature, of using the datasets in different contexts and for different purposes. More specific and technical details about how to use the datasets for recommendation evaluations can be found in [1]. The rest of this paper is structured as follows: in section II we describe the active datasets, grouped by the domains they pertain to. In section III we describe the restricted or retired datasets. In section IV we conclude with a quantitative discussion.

II. AVAILABLE DATASETS

The datasets were found as part of a systematic literature review on RS we are conducting. This paper is actually a byproduct of the systematic review, in which we address different research questions that have to do with the construction and evaluation of RS. We searched in SpringerLink, ScienceDirect, IEEE Xplore and ACM using keywords like "Evaluating Recommender Systems", "Public Datasets for Recommender Systems" etc. In Table 1 we summarize some basic characteristics of the 26 datasets we present. In the 'Size' column we give the number of users, number of items managed and number of ratings (in some cases reviews or tags). The 'Collected' column contains the period when the dataset was started or the release date in some cases. In the 'Status' column we show the current status of the dataset and the usage conditions/constraints. Most of the datasets are "Active" (in the sense that they are freely available and usually updated). Some of the datasets can be obtained upon request to the owner/maintainer (denoted as "Request"). Most of the datasets can be used for any non-commercial purpose. Some datasets (denoted as "Academic") can be used for research purposes only. In the 'Format' column the data format is given.

Table 1: Complete list of Datasets and their attributes

| Name | Domain | Size | Collected | Status | Format | URL |
|---|---|---|---|---|---|---|
| MovieLens | Movies | DS1: 1K users, 1700 movies, 100K ratings; DS2: 6K users, 4K movies, 1M ratings; DS3: 72K users, 10K movies, 10M ratings; DS4: 138K users, 27K movies, 20M ratings; DS5: 230K users, 27K movies, 21M ratings | Released 4/1998; Released 2/2003; Released 1/2009; Released 4/2015 | Active/Non-commercial | Csv | [2] |
| Yahoo Movie Ratings | Movies | | November 2003 | Request/Academic | | [3] |
| MovieTwittings | Movies | 12425 users, 8458 movies, 65115 ratings | Released 3/1/2013 | Active/Non-commercial | Txt | [4] |
| IMDB | Movies | | Since 1998 | Active/Restricted | | [37] |
| EachMovie | Movies | 72,916 users, 1628 movies, 2,811,983 ratings | | Retired in 2004 | | |
| Netflix | Movies | 480k users, 17770 movies, 100M ratings | 1998 - 2005 | Inactive/Non-commercial | Txt | [38] |
| Cornell University | Movies | DS1: 1k positive, 1k negative reviews; DS2: 1770, 902, 1307, 1027 movie reviews; DS3: 5k subjective, 5k objective sentences | Released June 2004; Released July 2005; Released June 2004 | Active/Non-commercial | Txt | [8] |
| Rotten Tomatoes | Movies | 222352 reviews about 11855 movies | | Active/Non-commercial | Txt | [9] |
| Last.fm | Music | DS1: 1915086 lines, 992 users, 69420 artists; DS2: 17559530 lines, 359347 users, 107373 artists | May 2007 | Active/Non-commercial | Txt | [10] |
| Yahoo Music Ratings | Music | 15,400 users, 1K songs, 300K ratings | 2002 - 2006 | Request/Academic | | [3] |
| Audioscrobbler | Music | | May 2005 | Merged with Last.fm | Txt | [40] |
| Million Songs | Music | 1M songs, 44745 artists, 7643 unique terms, 2321 unique tags | | Active/Non-commercial | HDF5 | [13] |
| Citation Papers | Research | DS1: 28 researchers x 597 papers; DS2: 50 researchers x 100531 papers | June 2010 | Active/Non-commercial | Txt | [21] |
| Mendeley | Research | 1857912 articles, 200K users; 254681 tags to 27652 articles by 4099 users | Since 2009 | Request/Non-commercial | | [16] |
| MACE | Learning | 1148 users, 150K resources, 47K tags | 2006-2009 | Active/Private | | |
| APOSDLE-DS | Learning | 1500 user activities of 6 users for 3 months | 3 months | Request/Non-commercial | | [17] |
| BookCrossing | Books | 278,858 users, 271379 books, 1149780 ratings | Aug - Sep 2004 | Active/Non-commercial | Sql/Csv | [22] |
| Organic.Edunet | Food | 345 tags, 250 ratings, 325 reviews | Jan 2010 - Sep 2010 | Active/Non-commercial | Rdf/Txt | [24] |
| Chicago Entree | Food | | Sep 1996 - Apr 1999 | Active/Non-commercial | Txt | [25] |
| Mediacare | Health | Ratings of 15K nursing homes, 4K Hospitals | | Active/Non-commercial | Mdb/Csv | [28] |
| Dating website | Dating | 135,359 users, 17,359,346 ratings | April 2006 | Active/Non-commercial | Txt | [29] |
| WikiLens | Various | | Feb. 2008 | Retired in 2009 | Txt | [41] |
| Epinions | Various | 131228 users, 317755 items, 1127673 reviews | June 2011 | Active/Non-commercial | Sql | [30] |
| Yahoo Front Page | News | 28041015 user visits | 2 - 16 Oct. 2011 | Request/Academic | | [32] |
| Tags2Con | URLs | 1397 users, 1569 tags, 1681 ratings, 739 URLs, 603 domains | Dec 2007 - Apr 2008 | Active/Non-commercial | Rdf | [33] |
| Jester | Jokes | DS1: 4.1M ratings -10.00 - +10.00 of 100 jokes from 73421 users; DS2: 1.7M ratings -10.00 - +10.00 of 150 jokes from 59132 users; DS3: Update of DS2 with 500K new jokes by 79681 users | Apr 1999 - May 2003; Nov 2006 - May 2009; Nov 2006 - Nov 2012 | Active/Non-commercial | Xls | [35] |
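When selecting datasets programmatically, the combined 'Status' values of Table 1 (for example "Active/Non-commercial") can be split into an availability part and a usage condition. The following is a minimal sketch under that reading; the `Status` type, the `parse_status` helper and the three sample rows are illustrative names and excerpts, not part of any published tool.

```python
from typing import NamedTuple, Optional

class Status(NamedTuple):
    availability: str      # e.g. "Active", "Request", "Retired in 2004"
    usage: Optional[str]   # e.g. "Non-commercial", "Academic"; None if not stated

def parse_status(cell: str) -> Status:
    """Split a 'Status' cell such as 'Active/Non-commercial' at the first slash."""
    availability, sep, usage = cell.partition("/")
    return Status(availability.strip(), usage.strip() if sep else None)

# A few 'Status' cells copied from Table 1.
rows = {
    "MovieLens": "Active/Non-commercial",
    "Yahoo Movie Ratings": "Request/Academic",
    "EachMovie": "Retired in 2004",
}
parsed = {name: parse_status(cell) for name, cell in rows.items()}

# Datasets that must be requested from the owner/maintainer.
on_request = [name for name, status in parsed.items() if status.availability == "Request"]
print(on_request)
```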

A. Movies

MovieLens. This is one of the most used datasets for algorithm evaluation in the Recommender Systems research community. The dataset was collected and made available by GroupLens Research, a research lab in the Department of Computer Science, University of Minnesota. There are four stable versions of the dataset which vary in size and release date (Table 2). The fifth version is the most recent. It changes over time and is not appropriate for reporting research results [2]. The datasets are public and open for non-commercial use only. The many user ratings these datasets contain make them very suitable for evaluating different versions of Collaborative Filtering recommendation algorithms.

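Table 2: Versions of MovieLens Dataset

| Version | Users | Movies | Ratings | Released | Format |
|---|---|---|---|---|---|
| 100k | 1k | 1.7k | 100k | 4/1998 | Txt |
| 1M | 6k | 4k | 1M | 2/2003 | Csv |
| 10M | 72k | 10k | 10M | 1/2009 | Csv |
| 20M | 138k | 27k | 20M | 4/2015 | Csv |
| Latest | 230k | 27k | 21M | 4/2015 | Csv |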
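To make the link between a ratings file and the user-rating matrix used in collaborative filtering concrete, here is a small sketch that reads comma-separated ratings into a nested dictionary and counts users, movies and ratings, the same quantities reported in Table 2. The column names `userId`, `movieId` and `rating` and the sample rows are assumptions for illustration; the actual file layout should be checked against the documentation shipped with the dataset [2].

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a ratings file; real files are far larger.
SAMPLE = """userId,movieId,rating
1,10,4.0
1,20,3.5
2,10,5.0
3,30,2.0
"""

def load_ratings(fileobj):
    """Build a sparse user-rating matrix as {user: {movie: rating}}."""
    matrix = defaultdict(dict)
    for row in csv.DictReader(fileobj):
        matrix[row["userId"]][row["movieId"]] = float(row["rating"])
    return dict(matrix)

matrix = load_ratings(io.StringIO(SAMPLE))
users = len(matrix)
movies = len({movie for rated in matrix.values() for movie in rated})
ratings = sum(len(rated) for rated in matrix.values())
print(users, movies, ratings)
```

For a real file, `io.StringIO(SAMPLE)` would be replaced by an opened file object, e.g. `open(path, newline="")`.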

Yahoo Movie Ratings. This dataset contains a small sample of the Yahoo Movies community's preferences for various movies. The movies are rated on a scale from A+ to F. The dataset also contains a large amount of descriptive information about many movies released prior to November 2003, including cast, crew, synopsis, genre, average ratings, awards, etc. It may be used by researchers to validate recommender systems based on different algorithms, including hybrid content-based and collaborative filtering. The dataset is available for download upon request from [3] and can be used for academic purposes only.

MovieTwittings. This is a very recent movie dataset which collects ratings from users' tweets. It is composed of 2 files which contain the movies and the ratings respectively. Unlike MovieLens and other filtered datasets which contain only users with a minimum number (e.g. 20) of ratings, MovieTwittings has a number of ratings which varies from 1 to 305 per user. The authors started querying the Twitter search API in March 2013. Since then the dataset has grown continuously based on the daily tweets of different users. Currently the dataset contains 12425 unique users, 8458 unique items and 65115 ratings. The many and recent movie ratings it contains make it suitable for the evaluation of various versions of Collaborative Filtering algorithms. The dataset is open for public use and can be downloaded from [4]. More details about MovieTwittings can be found in the authors' publication [5].
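The difference between a filtered and an unfiltered dataset can be sketched as a simple preprocessing step that drops users with fewer than a minimum number of ratings. The function name `filter_min_ratings`, the `(user, item, score)` tuple layout and the sample data are assumptions for illustration only.

```python
from collections import Counter

def filter_min_ratings(ratings, min_ratings=20):
    """Keep only the ratings of users who have at least `min_ratings` ratings."""
    counts = Counter(user for user, _item, _score in ratings)
    return [r for r in ratings if counts[r[0]] >= min_ratings]

# Hypothetical ratings: user "a" rated three movies, user "b" only one.
ratings = [("a", f"m{i}", 8) for i in range(3)] + [("b", "m0", 6)]

# A low threshold is used here so the tiny sample shows the effect.
kept = filter_min_ratings(ratings, min_ratings=2)
print(len(ratings), len(kept))
```

Applying such a filter removes sparse users, which is convenient for some algorithms but hides the cold-start users that an unfiltered dataset retains.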

Movie Review Dataset. This dataset was created by researchers at Cornell University and includes subjective movie reviews. There are actually 3 datasets: Sentiment polarity, Sentiment scale and Subjectivity dataset. The Sentiment polarity dataset is a collection of 1000 positive and 1000 negative movie reviews. The Sentiment scale dataset is a collection of four subjective review sets (with 1770, 902, 1307 and 1027 movie reviews respectively) which can be used to infer ratings. The Subjectivity dataset is a collection of 5000 subjective and 5000 objective movie review sentences. The first and the third datasets were released in June 2004 and the second in July 2005. All the data are in Txt format. These datasets are very suitable for evaluating Machine Learning, Natural Language Processing or Text Processing algorithms/techniques used for ratings or recommendations. The authors use the Sentiment scale dataset in [6] to evaluate their algorithm that addresses the rating-inference problem. They also use the first and third datasets in [7] to train and evaluate the Naive Bayes and SVM text categorizers they use to determine the sentiment polarity of the users' subjective reviews. The datasets are publicly available for non-commercial use. More information about the datasets and the download links can be found at [8].
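To give a feel for how a Naive Bayes text categorizer assigns sentiment polarity, the following toy sketch trains a multinomial Naive Bayes model with add-one smoothing on four invented review snippets. It is not the implementation used in [7]; the training sentences, function names and whitespace tokenization are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns (label counts, per-label word counts, vocabulary)."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability under the model."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # add-one (Laplace) smoothed log likelihood
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training snippets, standing in for labeled movie reviews.
docs = [
    ("a wonderful moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a dull boring film", "neg"),
    ("terrible acting and a boring story", "neg"),
]
model = train(docs)
print(classify("a great wonderful story", model))
print(classify("boring and terrible", model))
```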

Rotten Tomatoes. This is a highly viewed movie review website with a rich movie dataset. A subset of its dataset can be freely downloaded from [9]. It consists of tab-separated textual movie review phrases from the original Rotten Tomatoes dataset. The dataset contains 222352 reviews about 11855 movies. There is a train/test split of the dataset to make it more suitable for benchmarking algorithms. The structure of the dataset makes it very suitable for sentiment analysis and machine learning research and benchmarking.

The movie domain is clearly the most popular for recommender systems research. This is partly because there are many public and rich datasets (apart from the 5 above, we describe 3 other restricted/retired movie datasets in the next section) with explicit user feedback (movie ratings). The regular structure and contents of these datasets make them very suitable for evaluating different versions of Collaborative Filtering or hybrid recommendation algorithms. MovieLens is probably the most used dataset for research purposes.

B. Music

Last.fm. This is a subset of the last.fm website's music dataset, one of the most important in the domain of music. There are actually two versions of the dataset: Last.fm-360k which includes the top artists of 360k users and Last.fm-1k which includes the full listening history of 1k users. The datasets contain information about 69420 and 107373 artists respectively. They also contain information about track preferences of the users and some attributes of every user. This dataset is suitable for evaluating content-based or hybrid algorithms. All the data is stored as tab-separated text values. The dataset can be freely downloaded from [10] for non-commercial use only. It is also required to reference the last.fm website when using it.
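Since the data is tab-separated text, a listening history can be turned into per-user artist play counts with a few lines of code. The sketch below assumes a four-column layout (user, timestamp, artist, track) and uses invented rows; the real column order should be verified against the files obtained from [10].

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical listening-history lines in an assumed column order:
# user id, timestamp, artist name, track name.
SAMPLE = (
    "user_1\t2009-05-01T10:00:00Z\tArtist A\tTrack 1\n"
    "user_1\t2009-05-01T10:04:00Z\tArtist A\tTrack 2\n"
    "user_1\t2009-05-01T10:09:00Z\tArtist B\tTrack 3\n"
    "user_2\t2009-05-02T08:00:00Z\tArtist B\tTrack 3\n"
)

def artist_play_counts(fileobj):
    """Count how many times each user listened to each artist."""
    counts = defaultdict(Counter)
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for user, _timestamp, artist, _track in reader:
        counts[user][artist] += 1
    return counts

counts = artist_play_counts(io.StringIO(SAMPLE))
top_artist = counts["user_1"].most_common(1)[0]
print(top_artist)
```

Play counts of this kind are a form of implicit feedback and can stand in for ratings when no explicit ratings are available.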

Yahoo Music Ratings. This dataset contains music ratings supplied by users while browsing Yahoo Music services and ratings for randomly selected songs collected during an online survey conducted by Yahoo Research [11]. The rating data covers 15.4k users and 1k songs. The dataset includes approximately 300k user-supplied ratings and exactly 54k ratings for randomly selected songs. The data were collected between 2002 and 2006. The rich user feedback (many explicit ratings) it contains makes it very suitable for evaluating different versions of collaborative filtering or hybrid recommendation algorithms. It can be downloaded upon request from [3] and used for academic purposes only.

The Million Song Dataset. The Million Song Dataset (MSD) is an attempt to help researchers in the field of Music Analysis and Recommendations by providing a large-scale dataset. The main purposes of the dataset are:

- encouraging research on algorithms that scale to commercial sizes
- providing a reference dataset for evaluating algorithms by using the audio features
- being a shortcut alternative for creating a large dataset with the Echo Nest's API
- helping new researchers get started in the MIR field, develop music recommendations and study music similarity.

A large dataset helps to reveal problems with algorithm scaling and to discover rare phenomena or patterns that may not be discoverable in small datasets. The dataset contains 1M songs/files, 44745 unique artists, 7643 unique terms (Echo Nest tags) and 2321 unique tags (more detail at [12]). The data is stored in HDF5 format. Its size and the many music features (e.g. pitches, timbre, loudness) it contains make MSD very suitable for evaluating content-based or hybrid recommendation algorithms. The dataset can be downloaded from [13] for non-commercial use (be aware that the download size is about 300 GB).
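One simple way audio features can drive a content-based recommender is to rank songs by the cosine similarity of their feature vectors. The sketch below assumes each song has already been reduced to a short numeric vector; the three-dimensional vectors and the song names are invented, and reading the actual HDF5 files is deliberately left out.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(query, features, k=2):
    """Return the k songs whose feature vectors are closest to the query song."""
    scores = [
        (cosine(features[query], vec), song)
        for song, vec in features.items()
        if song != query
    ]
    return [song for _score, song in sorted(scores, reverse=True)[:k]]

# Hypothetical per-song summaries of audio features (values are made up).
features = {
    "song_a": [0.9, 0.1, 0.3],
    "song_b": [0.8, 0.2, 0.4],
    "song_c": [0.1, 0.9, 0.7],
}
print(most_similar("song_a", features, k=1))
```

On a collection of a million songs an exhaustive comparison like this becomes expensive, which is one of the scaling problems a large dataset helps to expose.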

For a long time Music Information Retrieval (MIR) research has suffered from the lack of publicly available and large-scale open data for personalized music recommendations, mainly because of privacy and intellectual property concerns. Things seem to have changed with the partial release of last.fm data (March 2010) and the publication of MSD (March 2011). These datasets are a very good source for evaluating different kinds of content-based
