EsPal: One-stop shopping for Spanish word properties

doi:10.3758/S13428-013-0326-1


Journal: Behavior Research Methods

Copyright: Psychonomic Society, Inc. 2013

Authors

Andrew Duchon (corresponding author), Basque Center on Cognition, Brain, and Language, Donostia, Spain; e-mail: a.duchon@bcbl.eu

Manuel Perea, Universitat de València, Valencia, Spain

Nuria Sebastián-Gallés, Universitat Pompeu Fabra, Barcelona, Spain

Antonia Martí, Universitat de Barcelona, Barcelona, Spain

Manuel Carreiras, Basque Center on Cognition, Brain, and Language, Donostia, Spain; IKERBASQUE, Basque Foundation for Science, Bilbao, Spain

Abstract

This article introduces EsPal: a Web-accessible repository

containing a comprehensive set of properties of Spanish words.

EsPal is based on an extensible set of data sources, beginning with

a 300 million token written database and a 460 million token

subtitle database. Properties available include word frequency,

orthographic structure and neighborhoods, phonological structure

and neighborhoods, and subjective ratings such as imageability.

Subword structure properties are also available in terms of bigrams

and trigrams, biphones, and bisyllables. Lemma and part-of-speech

information and their corresponding frequencies are also indexed.

The Web site enables users either to upload a set of words to

receive their properties or to receive a set of words matching

constraints on the properties. The set of properties is easily

extensible, and new properties will be added over time as they become available.

EsPal is freely available at http://www.bcbl.eu/databases/espal/.


Researchers from a wide range of disciplines (e.g., neuroscience, artificial intelligence, psychology, linguistics, and education, among others) who work in the interdisciplinary area of language research (e.g., language acquisition, language processing, language learning, bilingualism, and computational linguistics) need quick and efficient access to information about specific properties of words. For example, word frequency is a dominant factor in accounting for visual word recognition speed as measured by lexical decision times (Forster & Chambers, 1973; Monsell, 1991) and eye fixation durations during reading (Rayner, 2009). Unsurprisingly, reading behavior as measured by, for example, lexical decision, naming, fixation times, and so on is affected by a wide range of other properties of words, including orthographic neighborhood (Carreiras, Perea, & Grainger, 1997; Grainger, 1990), syllable frequency (Carreiras, Alvarez, & de Vega, 1993; Carreiras & Perea, 2004; Perea & Carreiras, 1998), and imageability (James, 1975), to cite just a few examples. Similarly, with regard to other fields that employ linguistic stimuli, such as memory research, it has been shown that word frequency plays a role in short-term memory (Hulme et al., 1997) and syllable length in working memory (Gathercole & Baddeley, 1990).

Given the wide range of word properties that can affect language and cognitive processing, it is desirable to have a single, integrated, and updateable source of data. For Spanish, there are now a variety of databases available, but some are based on a relatively small number of tokens (Davis & Perea, 2005; Sebastián-Gallés, Martí, Carreiras, & Cuetos, 2000; Taulé, Martí, & Recasens, 2008), while others provide information about a limited number of variables (Alonso, Fernandez, & Díez, 2011; Cuetos-Vega, González-Nosti, Barbón-Gutiérrez, & Brysbaert, 2011; Davies, 2005; Marian, Bartolotti, Chabal, & Shook, in press). EsPal (Español Palabras, meaning simply “Spanish words”) is a Web-based repository available at http://www.bcbl.eu/databases/espal/ that has been designed to fill this gap, providing information on a comprehensive set of word properties from corpora with hundreds of millions of words.

The most similar effort is the Syllabarium (Duñabeitia, Cholin, Corral, Perea, & Carreiras, 2010), which is a Web-based tool accessing a database containing information on word frequencies and syllable frequencies by token and syllable position. Standalone software packages are also available for Spanish and other languages that provide subsets of the properties in EsPal (Davis, 2005; Davis & Perea, 2005; New, Pallier, Brysbaert, & Ferrand, 2004; Perea et al., 2006). However, given the size of the corpora (discussed below), some of the calculations for some of the properties take up to a week on a standard PC, so a precomputed set of properties is preferred. With EsPal, the back-end processing for the word and subword properties is conducted using a multistep program written in Java, which precomputes not only basic properties of word frequency and form, but also orthographic structure and neighborhoods, phonological structure and neighborhoods, lemma and part-of-speech properties, and subword structure properties related to letter bigrams and trigrams, bisyllables, and biphones. In addition, other data such as a word’s subjective ratings (e.g., familiarity, imageability, etc.) can be easily attached to the data and made searchable.
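To make the idea of a precomputed property concrete, the following is a minimal Python sketch, not the authors' Java program. It counts substitution neighbors and token-weighted letter bigrams from a toy frequency list. The neighbor definition used here (same length, exactly one letter different), the function names, and the toy counts are all assumptions for illustration; this section does not state the exact definitions EsPal uses.

```python
from collections import Counter

# Toy word -> token count table (invented values, for illustration only).
TOY_COUNTS = {"casa": 120, "cosa": 200, "caso": 90, "cama": 40, "mesa": 60}


def substitution_neighbors(word, lexicon):
    """Return lexicon words of the same length differing in exactly one letter."""
    return sorted(
        other
        for other in lexicon
        if len(other) == len(word)
        and sum(a != b for a, b in zip(word, other)) == 1
    )


def bigram_token_frequencies(counts):
    """Sum token counts over every adjacent letter pair in every word."""
    totals = Counter()
    for word, n in counts.items():
        for i in range(len(word) - 1):
            totals[word[i:i + 2]] += n
    return totals


neighbors = substitution_neighbors("casa", TOY_COUNTS)
bigrams = bigram_token_frequencies(TOY_COUNTS)
```

Comparing a word against every other word, as this sketch does, scales with the square of the lexicon size when repeated for all words, which gives some intuition for why computing such properties once and storing them can be preferable to computing them on demand.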

The second important factor of EsPal is the capacity to apply the exact same processing to different corpora. A number of studies have shown that, across many languages, word frequencies derived from movie subtitle corpora provide a better account for various psycholinguistic effects (Brysbaert, New, & Keuleers, 2012; Cai & Brysbaert, 2010; Cuetos-Vega et al., 2011; Dimitropoulou, Duñabeitia, Avilés, Corral, & Carreiras, 2010; Keuleers, Brysbaert, & New, 2010; New, Brysbaert, Veronis, & Pallier, 2007). However, properties from written corpora have in the past been more common and may better predict some phenomena, so it is useful to have different sources of data available for researchers, depending on their goals. EsPal currently fulfills this goal by applying the same processing to both a corpus based on movie subtitles and one based on written text (fiction, nonfiction, and Web pages).
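The principle of running one pipeline over several corpora can be sketched as follows. This is an illustrative Python example only: the tokenizer pattern, the per-million normalization, the function name, and the two tiny sample texts are assumptions, not details taken from EsPal.

```python
from collections import Counter
import re

# Assumed tokenizer: runs of lowercase Spanish letters (hypothetical choice).
TOKEN_PATTERN = re.compile(r"[a-záéíóúüñ]+")


def frequency_per_million(text):
    """Count word tokens and scale each count to occurrences per million tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


# Two invented mini "corpora" standing in for the written and subtitle sources.
corpora = {
    "written": "La casa es grande. La ciudad es antigua.",
    "subtitles": "¿Qué es eso? Es la casa. ¡Vamos a casa!",
}

# The same function is applied to every corpus, so the resulting
# properties are directly comparable across sources.
properties = {name: frequency_per_million(text) for name, text in corpora.items()}
```

Because both sources pass through the identical function, any difference in the value for a given word reflects the corpora themselves rather than a difference in processing.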

Finally, the Spanish-speaking community is diverse, and EsPal is constructed to be able to accommodate this diversity, at least in terms of phonological representation. Standard Castilian Spanish spoken on mainland Spain differs in a number of dimensions from the Spanish spoken in the Canary Islands and in Latin America (which itself is quite diverse). EsPal therefore also allows the user to choose which phonological representation is used, for example, to derive properties related to phonological neighborhoods.
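One well-known difference of this kind is that Castilian Spanish distinguishes /θ/ (written z, or c before e and i) from /s/, whereas most Latin American and Canarian varieties pronounce both as /s/. The Python sketch below shows how the choice of representation can change which words count as phonologically identical. It is a deliberately rough toy: the variety labels, the function name, and the mapping (which handles only the c/z/s contrast and ignores digraphs such as "ch" and accented vowels) are assumptions and do not describe EsPal's transcription rules.

```python
def to_phonemes(word, variety):
    """Very rough letter-to-sound mapping covering only the c/z/s contrast.

    variety: "castilian" keeps /θ/; any other label (e.g., "seseo") merges it with /s/.
    """
    theta = "θ" if variety == "castilian" else "s"
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "z":
            out.append(theta)
        elif ch == "c":
            # "c" before e/i patterns with "z"; elsewhere it is /k/.
            out.append(theta if nxt in ("e", "i") else "k")
        else:
            out.append(ch)
    return "".join(out)


castilian_pair = (to_phonemes("casa", "castilian"), to_phonemes("caza", "castilian"))
seseo_pair = (to_phonemes("casa", "seseo"), to_phonemes("caza", "seseo"))
```

Under the Castilian mapping, "casa" and "caza" come out as two different forms that differ in one phoneme; under the merged mapping they come out identical, so any neighborhood count derived from the transcriptions would differ between the two representations.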

In the remainder of this article, we describe the collection and preprocessing of the written and subtitle databases currently available in EsPal; how we calculate orthographic and phonological properties, subword properties, lemma and part-of-speech properties; and the source of the subjective ratings data.

Written corpus collection and preprocessing

Written corpus collection

The EsPal Written Corpus is derived from a wide selection of texts collected from the Web or available in digital format. Table 1 provides a listing of percentages in terms of word tokens across the different sources and genres. We grouped them into nine subsets according to their content: academic, culture, law, philosophy, literature, news, politics, society, and the Spanish Wikipedia. All these texts had to meet the requirements of being freely available and not subject to copyright. Most documents were gathered from Web sites featuring a variety of linguistic styles, including formal, colloquial, and specialized language.

The academic texts are mainly Ph.D. theses selected from a wide range of scientific fields: anthropology, architecture, art, biology, law, economics, electronics, philology, philosophy, physics, history, humanities, engineering, mathematics, medicine, psychology, chemistry, telecommunications, and veterinary science. The set of culture texts is composed of news about cultural events from several newspapers and opinion blogs about films. Legal texts include mainly rulings by the High Court of Justice of several autonomous regions in Spain, as well as judicial news as it appeared in popular newspapers (El Mundo, El País, and El Periódico). The literary texts come from several Web pages containing works with expired copyrights (bdigital, biblioteca_ignoria, libroteca, logos, and scribd). These works include both texts originally written in Spanish and translations into Spanish. The news is from the EFE Agency from January, February, and March 2000. The politics set contains news texts referring to Spain’s 2007 regional (autonomous-community) elections, speeches by the Spanish President during 2008, and documents taken from political party Web sites. The society set is composed of Web texts about religion, abortion, and psychology. Finally, the Web data are from the whole Spanish Wikipedia, circa February 2009.

Table 1 Percentage of terms by source type in the EsPal written corpus

Source type      Percent of terms
Academics         1.8 %
Culture           0.2 %
Law               1.0 %
Philosophy        1.1 %
Literature       22.5 %
News              8.7 %
Politics         16.0 %
Society           4.7 %
Web/Wikipedia    43.9 %
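As a rough guide to scale, the percentages in Table 1 can be converted into approximate token counts. The short Python sketch below does this arithmetic under the assumption that the written corpus contains a nominal 300 million tokens, the rounded figure given in the abstract; the resulting counts are therefore approximations, not values reported by the authors.

```python
# Nominal corpus size taken from the abstract (rounded; an assumption here).
NOMINAL_TOKENS = 300_000_000

# Percentages copied from Table 1.
PERCENT_BY_SOURCE = {
    "Academics": 1.8,
    "Culture": 0.2,
    "Law": 1.0,
    "Philosophy": 1.1,
    "Literature": 22.5,
    "News": 8.7,
    "Politics": 16.0,
    "Society": 4.7,
    "Web/Wikipedia": 43.9,
}

# Approximate token count per source, rounded to the nearest token.
approx_tokens = {
    source: round(NOMINAL_TOKENS * percent / 100)
    for source, percent in PERCENT_BY_SOURCE.items()
}

# The published percentages are rounded, so they need not sum to exactly 100.
total_percent = round(sum(PERCENT_BY_SOURCE.values()), 1)
```

On these assumptions, literature would account for roughly 67.5 million tokens and Wikipedia for roughly 131.7 million, and the listed percentages sum to 99.9 because of rounding in the table.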


References

Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English

Eye movements and attention in reading, scene perception, and visual search.

Lexical Access and Naming Time.

Basic processes in reading : visual word recognition

Visual word recognition of single-syllable words.
