EsPal: One-stop shopping for Spanish word properties

Abstract
This article introduces EsPal: a Web-accessible repository containing a comprehensive set of properties of Spanish words. EsPal is based on an extensible set of data sources, beginning with a 300 million token written database and a 460 million token subtitle database. Properties available include word frequency, orthographic structure and neighborhoods, phonological structure and neighborhoods, and subjective ratings such as imageability. Subword structure properties are also available in terms of bigrams and trigrams, biphones, and bisyllables. Lemma and part-of-speech information and their corresponding frequencies are also indexed. The website enables users either to upload a set of words to receive their properties or to receive a set of words matching constraints on the properties. The properties themselves are easily extensible and will be added over time as they become available. It is freely available from the following website: http://www.bcbl.eu/databases/espal/ .

DOI: http://dx.doi.org/10.3758/s13428-013-0326-1
Journal: Behavior Research Methods
Copyright: Psychonomic Society, Inc. 2013

Authors
Andrew Duchon (corresponding author), Basque Center on Cognition, Brain, and Language, Donostia, Spain; e-mail: a.duchon@bcbl.eu
Manuel Perea, Universitat de València, Valencia, Spain
Nuria Sebastián-Gallés, Universitat Pompeu Fabra, Barcelona, Spain
Antonia Martí, Universitat de Barcelona, Barcelona, Spain
Manuel Carreiras, Basque Center on Cognition, Brain, and Language, Donostia, Spain; IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
Researchers from a wide range of disciplines (e.g., neuroscience, artificial intelligence, psychology, linguistics, and education, among others) who work in the interdisciplinary area of language research (e.g., language acquisition, language processing, language learning, bilingualism, and computational linguistics) need quick and efficient access to information about specific properties of words. For example, word frequency is a dominant factor in accounting for visual word recognition speed as measured by lexical decision times (Forster & Chambers, 1973; Monsell, 1991) and eye fixation durations during reading (Rayner, 2009). Unsurprisingly, reading behavior as measured by, for example, lexical decision, naming, fixation times, and so on is affected by a wide range of other properties of words, including orthographic neighborhood (Carreiras, Perea, & Grainger, 1997; Grainger, 1990), syllable frequency (Carreiras, Alvarez, & de Vega, 1993; Carreiras & Perea, 2004; Perea & Carreiras, 1998), and imageability (James, 1975), to cite just a few examples. Similarly, with regard to other fields that employ linguistic stimuli, such as memory research, it has been shown that word frequency plays a role in short-term memory (Hulme et al., 1997) and syllable length in working memory (Gathercole & Baddeley, 1990).

Given the wide range of word properties that can affect language and cognitive processing, it is desirable to have a single, integrated, and updateable source of data. For Spanish, there are now a variety of databases available, but some are based on a relatively small number of tokens (Davis & Perea, 2005; Sebastián-Gallés, Martí, Carreiras, & Cuetos, 2000; Taulé, Martí, & Recasens, 2008), while others provide information about a limited number of variables (Alonso, Fernandez, & Díez, 2011; Cuetos-Vega, González-Nosti, Barbón-Gutiérrez, & Brysbaert, 2011; Davies, 2005; Marian, Bartolotti, Chabal, & Shook, in press). EsPal ("Español Palabras," meaning simply "Spanish words") is a Web-based repository available at http://www.bcbl.eu/databases/espal/ that has been designed to fill this gap, providing information on a comprehensive set of word properties from corpora with hundreds of millions of words.

The most similar effort is the Syllabarium (Duñabeitia, Cholin, Corral, Perea, & Carreiras, 2010), which is a Web-based tool accessing a database containing information on word frequencies and syllable frequencies by token and syllable position. Standalone software packages are also available for Spanish and other languages that provide subsets of the properties in EsPal (Davis, 2005; Davis & Perea, 2005; New, Pallier, Brysbaert, & Ferrand, 2004; Perea et al., 2006). However, given the size of the corpora (discussed below), some of the calculations for some of the properties take up to a week on a standard PC, so a precomputed set of properties is preferred. With EsPal, the back-end processing for the word and subword properties is conducted using a multistep program written in Java, which precomputes not only basic properties of word frequency and form, but also orthographic structure and neighborhoods, phonological structure and neighborhoods, lemma and part-of-speech properties, and subword structure properties related to letter bigrams and trigrams, bisyllables, and biphones. In addition, other data such as a word's subjective ratings (e.g., familiarity, imageability, etc.) can be easily attached to the data and made searchable.

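To make two of the precomputed properties concrete, the following is a minimal sketch written in Python rather than in the authors' Java. It assumes that a letter n-gram's token frequency is the summed frequency of the words containing it, and that an orthographic neighbor is a same-length word differing in exactly one letter; both are common definitions in the literature, not ones confirmed by the text above. The function names and the toy frequencies are hypothetical.

```python
from collections import Counter


def ngram_token_counts(word_freqs, n=2):
    """Add each word's token frequency to every letter n-gram it contains.

    An n-gram that occurs twice in a word is counted twice for that word.
    """
    counts = Counter()
    for word, freq in word_freqs.items():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += freq
    return counts


def substitution_neighbors(word, lexicon):
    """Return same-length words that differ from `word` in exactly one letter."""
    return sorted(
        other
        for other in lexicon
        if other != word
        and len(other) == len(word)
        and sum(a != b for a, b in zip(word, other)) == 1
    )


# Toy frequencies, invented for illustration only (not EsPal data).
toy = {"casa": 50, "cosa": 40, "caso": 30, "mesa": 20}
bigrams = ngram_token_counts(toy, n=2)
print(bigrams["ca"])                        # 80 (casa + caso)
print(substitution_neighbors("casa", toy))  # ['caso', 'cosa']
```

The neighbor search here compares a word against the whole lexicon, so finding neighbors for every word grows quadratically with lexicon size. That cost is one plausible reason why precomputing such properties once, as EsPal does, is preferable to computing them on demand.
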
The second important factor of EsPal is the capacity to apply the exact same processing to different corpora. A number of studies have shown that, across many languages, word frequencies derived from movie subtitle corpora provide a better account of various psycholinguistic effects (Brysbaert, New, & Keuleers, 2012; Cai & Brysbaert, 2010; Cuetos-Vega et al., 2011; Dimitropoulou, Duñabeitia, Avilés, Corral, & Carreiras, 2010; Keuleers, Brysbaert, & New, 2010; New, Brysbaert, Veronis, & Pallier, 2007). However, properties from written corpora have in the past been more common and may better predict some phenomena, so it is useful to have different sources of data available for researchers, depending on their goals. EsPal currently fulfills this goal by applying the same processing to both a corpus based on movie subtitles and one based on written text (fiction, nonfiction, and Web pages).

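The idea of running identical processing over different corpora can be sketched as a single function applied to each corpus in turn. The example below is in Python and assumes a per-million-tokens normalization, a common way to compare frequencies across corpora of different sizes; the text above does not say which frequency measure EsPal reports. The function name, corpus labels, and counts are hypothetical.

```python
def per_million(counts):
    """Convert raw token counts to occurrences per million tokens."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


# Toy counts, invented for illustration only (not EsPal data).
corpora = {
    "written": {"hola": 2, "casa": 6, "de": 12},
    "subtitles": {"hola": 10, "casa": 5, "de": 25},
}

# The same function is applied to every corpus, so the values are comparable.
normalized = {name: per_million(counts) for name, counts in corpora.items()}
for name, freqs in normalized.items():
    print(name, freqs["hola"])
```

Because every corpus passes through the same function, a difference in a word's normalized frequency reflects the corpora themselves and not a difference in how they were processed.
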
Finally, the Spanish-speaking community is diverse, and EsPal is constructed to be able to accommodate this diversity, at least in terms of phonological representation. Standard Castilian Spanish spoken in mainland Spain differs in a number of dimensions from the Spanish spoken in the Canary Islands and in Latin America (which itself is quite diverse). EsPal therefore also allows the user to choose which phonological representation is used, for example, to derive properties related to phonological neighborhoods.

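As an illustration of why the choice of phonological representation matters, consider one well-known dialect difference: the letter z, and c before e or i, are pronounced /θ/ in standard Castilian but /s/ in most Latin American and Canary Islands varieties. The Python sketch below is a deliberately crude toy that handles only c, z, and plain vowels and consonants; it ignores digraphs such as ch and qu, accents, and every other rule. The function name and the `seseo` flag are hypothetical, and nothing here is claimed to reflect how EsPal encodes its transcriptions.

```python
def to_phonemes(word, seseo=False):
    """Toy letter-to-phoneme mapping covering only the c/z contrast."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "z" or (ch == "c" and nxt in ("e", "i")):
            # /θ/ in Castilian; merged with /s/ when seseo is True.
            out.append("s" if seseo else "θ")
        elif ch == "c":
            out.append("k")
        else:
            out.append(ch)
    return "".join(out)


print(to_phonemes("caza"))              # kaθa
print(to_phonemes("casa"))              # kasa
print(to_phonemes("caza", seseo=True))  # kasa
```

Under the Castilian mapping, casa and caza are phonological neighbors that differ in one phoneme; under the merged mapping they become homophones. Any neighborhood count derived from the transcriptions therefore depends on which representation the user selects.
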
In the remainder of this article, we describe the collection and preprocessing of the written and subtitle databases currently available in EsPal; how we calculate orthographic and phonological properties, subword properties, lemma and part-of-speech properties; and the source of the subjective ratings data.

Written corpus collection and preprocessing

Written corpus collection

The EsPal Written Corpus is derived from a wide selection of texts collected from the Web or available in digital format. Table 1 provides a listing of percentages in terms of word tokens across the different sources and genres. We grouped them into nine subsets according to their content: academic, culture, law, philosophy, literature, news, politics, society, and the Spanish Wikipedia. All these texts had to meet the requirements of being freely available and not subject to copyright. Most documents were gathered from Web sites featuring a variety of linguistic styles, including formal, colloquial, and specialized language.

The academic texts are mainly Ph.D. theses selected from a wide range of scientific fields: anthropology, architecture, art, biology, law, economics, electronics, philology, philosophy, physics, history, humanities, engineering, mathematics, medicine, psychology, chemistry, telecommunications, and veterinary science. The set of culture texts is composed of news about cultural events from several newspapers and opinion blogs about films. Legal texts include mainly rulings by the High Court of Justice of several autonomous regions in Spain, as well as news from the judiciary field as it appeared in popular newspapers (El Mundo, El País, and El Periódico). The literary texts come from several Web pages containing works with expired copyrights (bdigital, biblioteca_ignoria, libroteca, logos, and scribd). These works include both texts originally written in Spanish and translations into Spanish. The news is from the EFE Agency from January, February, and March 2000. The politics set contains news texts referring to Spain's 2007 regional (autonomous community) elections, speeches by the Spanish President during 2008, and documents taken from political party Web sites. The society set is composed of Web texts about religion, abortion, and psychology. Finally, the Web data are from the whole Spanish Wikipedia, circa February 2009.

Table 1. Percentage of terms by source type in the EsPal written corpus

Source type      Percent of terms
Academics        1.8 %
Culture          0.2 %
Law              1.0 %
Philosophy       1.1 %
Literature       22.5 %
News             8.7 %
Politics         16.0 %
Society          4.7 %
Web/Wikipedia    43.9 %

References

Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English.
Eye movements and attention in reading, scene perception, and visual search.
Lexical access and naming time.
Basic processes in reading: Visual word recognition.
Visual word recognition of single-syllable words.