
Journal ArticleDOI

Adaptive networks as a model for human speech development

01 Jan 1990-Computers in Human Behavior (Pergamon)-Vol. 6, Iss: 4, pp 291-313

TL;DR: The network structure used in NETtalk is reproduced to determine which characteristics of the network are responsible for which learning behavior, and how closely that maps human speech development.

Abstract: Unrestricted English text can be converted to speech through the use of a look-up table, or through a parallel feedforward network of deterministic processing units. Here, we reproduce the network structure used in NETtalk. Several experiments are carried out to determine which characteristics of the network are responsible for which learning behavior, and how closely that maps human speech development. The network is also trained with different levels of speech complexity and with a second language. The results are shown to be highly dependent on statistical characteristics of the input.

Topics: NETtalk (65%)

Summary (4 min read)

1. Introduction

  • Connectionist networks are arrays of simple highly interconnected computing elements.
  • These networks are particularly useful in building applications involving adaptive mappings.
  • The authors' ultimate goal is to understand to what extent feedforward networks can be used for the study of human speech development and behavior, and what changes, if any, would improve their prediction of this behavior.
  • The feedforward network is composed of three layers (a minimal sketch of such an architecture follows this list).
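
A minimal sketch of such a three-layer, NETtalk-style feedforward network is given below. The unit counts (a 7-letter window with 29 features per position, 80 hidden units, 26 output units) and the logistic activation are illustrative assumptions, not figures taken from the report.

```python
# Illustrative three-layer feedforward network (assumed sizes, not the paper's).
import numpy as np

rng = np.random.default_rng(0)

N_INPUT = 7 * 29    # assumed: 7-letter window, 29 one-hot features per position
N_HIDDEN = 80       # assumed hidden-layer size
N_OUTPUT = 26       # assumed number of phoneme-feature outputs

W1 = rng.normal(0.0, 0.1, (N_HIDDEN, N_INPUT))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_OUTPUT, N_HIDDEN))
b2 = np.zeros(N_OUTPUT)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    """Propagate one encoded letter window through the three layers."""
    hidden = sigmoid(W1 @ x + b1)
    output = sigmoid(W2 @ hidden + b2)
    return hidden, output

# Example: a forward pass on an all-zero input window.
hidden, output = forward(np.zeros(N_INPUT))
```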

2. Text-to-Speech Conversion

  • It may also provide insights into human development, particularly the relationships between input and output.
  • Finally, such simulations should elucidate the process of learning a second complex phonetic mapping, thus shedding light on first and second language learning, on synthetic language formation, and on speech recognition.
  • The following simulations represent first steps in this direction.

3. Representing Text and Phonemes

  • Two decision rules are used to define the output phoneme.
  • The first rule compares the output vector with the phoneme vectors one by one.
  • The phoneme vector making the smallest Euclidean angle with the output vector is considered to be the "best guess" phoneme (a minimal sketch of this rule follows this list).
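
A minimal sketch of the "best guess" rule, assuming a dictionary that maps each phoneme to its feature vector (the dictionary contents are placeholders, not the paper's coding):

```python
import numpy as np

def best_guess(output_vec, phoneme_dict):
    """Return the phoneme whose vector makes the smallest Euclidean angle
    with the network's output vector (i.e. the largest cosine similarity)."""
    o = output_vec / np.linalg.norm(output_vec)
    best_phoneme, best_cos = None, -np.inf
    for phoneme, vec in phoneme_dict.items():
        v = np.asarray(vec, dtype=float)
        cos = float(o @ (v / np.linalg.norm(v)))  # larger cosine = smaller angle
        if cos > best_cos:
            best_phoneme, best_cos = phoneme, cos
    return best_phoneme
```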

4. Preparation of Training Input

  • In Carterette and Jones, there is no suprasegmental information accompanying the phonemes.
  • The stresses are inserted according to the natural intonation when the words are spoken.

7. Performance

  • The simulations are performed on a Gould NP-I computer.
  • The throughput is 6 phonemes per second during learning and 20 letters per second when the neural network is reading text and producing phonemes without learning.

8. Summary of Statistics

  • The neural network trained with Spanish does not learn English very readily.
  • This is about the accuracy achieved by the neural network trained with only adult speech for 9 passes through the training input.

9. Observations

  • One solution to the above problem would be to assign special characters for "ch", "ll", and "rr" in Spanish.
  • There is a need to preprocess the Spanish text before training and before actual text-to-speech processing (a small sketch of such preprocessing follows this list).
  • The neural network could also be trained with more passes through the training corpus until the output converges to the desired phonemes.
  • The preprocessing is then incorporated into the network architecture.
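
A small sketch of the preprocessing idea above, mapping the Spanish digraphs onto single placeholder symbols before training; the placeholder characters are arbitrary choices for illustration:

```python
DIGRAPHS = {"ch": "C", "ll": "L", "rr": "R"}  # assumed placeholder symbols

def preprocess_spanish(text):
    """Replace each Spanish digraph with a single character so that one
    input letter maps to one output phoneme slot."""
    for digraph, symbol in DIGRAPHS.items():
        text = text.replace(digraph, symbol)
    return text

# preprocess_spanish("el perro chico llega") -> "el peRo Cico Lega"
```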

9.1. Orthographic Transparency

  • The authors transcribe the Spanish text using Spanish pronunciation rules.
  • The Spanish trained network is capable of pronouncing text more accurately than the English trained network.
  • The English trained network retains the errors of connected speech, and appears to generalize them.
  • Children are exposed to a corrupted, continuous speech signal.
  • This interesting effect could be an explanation for some of children's speech errors.

9.2. Learning Different Languages

  • Many of the errors that occur in the English input sets are reflected in output, even after performance is at its peak.
  • Furthermore, there is some apparent generalization of errors to words that do not contain the error in the input set and to orthographic characters that are accurately represented in the input.
  • The authors are currently conducting experiments to determine the specific nature and extent of the generalization.

9.3. Capturing the Characteristics of Speech

  • Initial consonants are more frequently accurate in the output.
  • In contrast, final consonants tend to be deleted, reflecting the characteristics of the input.
  • Of course, in the Spanish trained network, these connected speech characteristics are not present because they are not present in the input.
  • Thus, the statistical characteristics of the input are captured by the back propagation model.
  • Just as the errors in input are reflected in output, the frequency of character-phoneme mappings is reflected as well in the overall performance and in the relative accuracy of individual mappings (a small frequency-counting sketch follows this list).
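
One rough way to measure this statistical character, assuming a corpus in which each letter is aligned with its target phoneme (the corpus format is an assumption for illustration):

```python
from collections import Counter

def mapping_frequencies(aligned_pairs):
    """Count how often each (letter, phoneme) mapping occurs in the corpus."""
    return Counter(aligned_pairs)

# Mappings with high counts would, per the observation above, tend to be the
# ones the trained network reproduces most accurately.
```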

9.4. Training with a Second Language

  • This almost suggests that the patterns of one language cover another.
  • Care must be taken when training a back propagation network.
  • The authors call this a recency order bias, and it is a complex issue that warrants treatment elsewhere.
  • This phenomenon will pose severe difficulties when trying to add new knowledge to a pretrained network.

9.5. Contextual Dependency

  • The results of the experiments concerning different window sizes show that a window size of 5 letters is sufficient for the Spanish mapping.
  • In Spanish, the vowel values do not depend on other syllables in the word.
  • The statistics from the simulations with various window sizes all show that the larger the window, the higher the accuracy the neural network can achieve in a given number of passes through the training text (see the window-construction sketch after this list).
  • There are many ways of pronouncing one word (e.g. "read").
  • Vowels have different values in different words spelled similarly (e.g. "five" and "give").
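
A minimal sketch of how a letter window could be formed around each target letter; the window size and padding character are illustrative assumptions:

```python
def windows(text, size=7, pad=" "):
    """Yield (window, center_letter) pairs for every letter position."""
    half = size // 2
    padded = pad * half + text + pad * half
    for i in range(len(text)):
        yield padded[i:i + size], text[i]

# list(windows("five", size=5)) ->
# [('  fiv', 'f'), (' five', 'i'), ('five ', 'v'), ('ive  ', 'e')]
```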

9.6. Number of Hidden Units

  • When the number of hidden units is sufficient to allocate one pattern boundary per hidden unit, basically a look up table is obtained.
  • Kung et al. hypothesize that theoretically there must be a learning scheme capable of maximizing class separability with a number of hidden units approximately equal to log₂ M. With the above observation, the optimal point for the algorithm in this particular application occurs with a number of hidden units in the tens of thousands, which would be practically intractable.
  • Computationally, each hidden unit increases the number of operations by the number of input units plus one (a rough operation count is given below).
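
As a rough accounting (an illustration under the usual multiply-accumulate counting, not a figure from the report), with I input units, H hidden units, and O output units, one forward pass costs roughly

```latex
\text{ops} \;\approx\; \underbrace{H\,(I+1)}_{\text{input} \to \text{hidden}}
\;+\; \underbrace{O\,(H+1)}_{\text{hidden} \to \text{output}},
```

so each additional hidden unit contributes about I + 1 operations on the input side, plus O more for its outgoing connections.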

9.7. Cluster Analysis

  • In order to study the excitation pattern of the hidden-layer units, cluster analysis on their activation levels is performed.
  • Each training pattern is presented to the network and propagated forward to the hidden layer and the output layer.
  • The activation of each hidden-layer unit and the input-output pair are saved.
  • Once all the training patterns have been presented, the hidden-layer activation vectors whose central-window-letter-to-output-phoneme pairs are the same are grouped and averaged (a small sketch of this grouping step follows this list).
  • The number of groups or letter-to-phoneme correspondences obtained is more than the number of phonemes.
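
A small sketch of this grouping step, assuming a forward function (as in the earlier architecture sketch) that returns the hidden activation vector along with the output:

```python
from collections import defaultdict
import numpy as np

def averaged_hidden_activations(patterns, forward):
    """patterns: iterable of (input_vector, center_letter, target_phoneme).
    Returns one averaged hidden-activation vector per (letter, phoneme) pair."""
    groups = defaultdict(list)
    for x, letter, phoneme in patterns:
        hidden, _ = forward(x)
        groups[(letter, phoneme)].append(hidden)
    return {pair: np.mean(vecs, axis=0) for pair, vecs in groups.items()}
```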

9.7.1. Clustering Algorithm

  • With these averaged hidden-layer unit activation vectors, the authors can perform hierarchical cluster analysis.
  • The Lance and Williams general algorithm with complete linkage [3] is used.
  • The algorithm builds a table of pairwise distances and then repeatedly finds the smallest distance in the table.
  • The distance table is revised by deleting the distances between any other vector and any of these two vectors, and adding the distances between the newly formed cluster and the other vectors.
  • The iteration stops when the distance table contains only one entry, i.e., when a single cluster is left (a minimal sketch of this loop follows this list).
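
A minimal sketch of the agglomerative loop just described. It computes complete-linkage distances directly rather than through the Lance and Williams update formula, which yields the same merges; scipy.cluster.hierarchy.linkage(..., method="complete") would be the off-the-shelf equivalent.

```python
import numpy as np

def complete_linkage(vectors):
    """vectors: dict mapping a label (e.g. a letter-phoneme pair) to its
    averaged activation vector. Returns the merge sequence as
    (cluster_a, cluster_b, distance) tuples."""
    members = {(label,): [np.asarray(v, dtype=float)] for label, v in vectors.items()}

    def dist(a, b):
        # complete linkage: cluster distance = largest pairwise member distance
        return max(np.linalg.norm(x - y) for x in members[a] for y in members[b])

    merges = []
    while len(members) > 1:
        labels = list(members)
        pairs = [(dist(a, b), a, b)
                 for i, a in enumerate(labels) for b in labels[i + 1:]]
        d, a, b = min(pairs)            # smallest entry in the distance table
        merges.append((a, b, d))
        members[a + b] = members.pop(a) + members.pop(b)
    return merges
```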

9.7.2. English Training

  • The letter-"j"-to-phoneme correspondences form a cluster.
  • The letter-"m"- and letter-"n"-to-phoneme correspondences form another cluster.

9.7.3. Spanish Training

  • The Spanish input is an exact rule-based transcription, while the English input is a transcription of the production of some first grade children.
  • The English transcription contains a significant number of speech errors.

9.7.4. English Developmental Cluster Analysis

  • In the English-trained network, the authors perform cluster analysis at the end of the 5th, 10th, 15th, 20th, and the 25th passes.
  • The sentence boundary is represented by the pause and fullstop features.
  • At the 10th pass, the network is eliding fewer letters, showing its gradual regularization of letter-to-phoneme correspondences.
  • At the 5th pass, all the vowel letters are in one cluster, distinguished from the consonants, but are grouped with the word and sentence boundaries.

9.7.6. Spanish-English Training Cluster Analysis

  • The other significant change with the second language training is the increase in distance between the most distinct clusters, which grows from 2.5 at the beginning to over 3.5 at the 25th pass.
  • The authors also observe that the vowel-letter-to-phoneme correspondences merge with some consonant clusters to a small extent.
  • Specifically, the correspondences involving letters "u" and "i" leave the main cluster of vowels and join a cluster with the nasal letters "m" and "n", and stop letters "t" and "d".

9.7.7. English-Spanish Training Cluster Analysis

  • The second r of the double Spanish letter "rr" has an elide symbol associated with it in the training input.
  • This correspondence is learned in the Spanish-trained network but not in the English-trained network with Spanish as a second mapping.
  • The letter-"n"-to-nasal-phoneme-/G/ correspondence and the letter-"v"-to-phoneme-/v/ correspondence are retained.
  • Moreover, the English mapping helps retain the letter-"f"-to-phoneme-/f/ correspondence and the letter-"k"-to-phoneme-/k/ correspondence.
  • These two correspondences are not learned by the original Spanish-trained network.

10. Output Decision Algorithms

  • The "best guess" and the "perfect match" strategies are used as ways of making output phoneme decisions based on the output vector and the phoneme dictionary.
  • The best guess is a computationally more expensive strategy using the minimal Euclidean distance between the output and a phoneme vector.
  • It performs 20% to 30% better than the perfect match, which discards all output values that do not exactly represent a phoneme.
  • The authors derive the back propagation algorithm as a function to minimize the expectation of the square of the difference between the output and the target value.
  • In the derivation, a simple 50% thresholding scheme for the output emerges (a minimal sketch contrasting the two decision rules follows this list).
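
A minimal sketch contrasting the two rules: the perfect match thresholds each output at 50% and requires an exact dictionary hit, while the best guess (as sketched in Section 3) always returns the nearest phoneme. The binary feature-vector dictionary format is an assumption for illustration.

```python
def perfect_match(output_vec, phoneme_dict, threshold=0.5):
    """Threshold the output at 50% and return the phoneme whose binary code
    matches exactly; outputs matching no phoneme are discarded (None)."""
    code = tuple(int(v >= threshold) for v in output_vec)
    for phoneme, vec in phoneme_dict.items():
        if tuple(int(v) for v in vec) == code:
            return phoneme
    return None
```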

11. Problems, Solutions, and Potentials

  • For studies of speech development, several interesting aspects could be captured by the back propagation algorithm.
  • Other mechanisms are still to be accounted for, and will require additional subsystems.
  • Among the features that require attention are top-down processing of speech, grammatical constraints, etc.


Purdue University
Purdue e-Pubs
$. /1+$,1-%*$"1/(" * ,#-+.21$/
,&(,$$/(,&$"',(" *$.-/10
$. /1+$,1-%*$"1/(" * ,#-+.21$/
,&(,$$/(,&

Adaptive Networks as a Model for Human Speech Development
M. Fernando Tenorio
Purdue University
M. Daniel Tom
Purdue University
Richard G. Schwartz
Purdue University
-**-41'(0 ,# ##(1(-, *4-/)0 1 '8.0#-"0*(!.2/#2$$#2$"$1/
7(0#-"2+$,1' 0!$$,+ #$ 3 (* !*$1'/-2&'2/#2$$2!0 0$/3("$-%1'$2/#2$,(3$/0(15(!/ /($0*$ 0$"-,1 "1$.2!0.2/#2$$#2%-/
##(1(-, *(,%-/+ 1(-,
$,-/(-$/, ,#--+ ,($* ,#"'4 /16("' /## .1(3$$14-/)0 0 -#$*%-/2+ ,.$$"'$3$*-.+$,1
 Department of Electrical and Computer Engineering Technical Reports. .$/
'8.0#-"0*(!.2/#2$$#2$"$1/

Adaptive Networks as a Model
for Human Speech Development
M. F. Tenorio
M. D. Tom
R. G. Schwartz
TR-EE 89-54
August, 1989
School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907

Adaptive Networks as a Model for
Human Speech Development
M. Fernando Tenorio
M. Daniel Tom
School of Electrical Engineering
and
Richard G. Schwartz
Department of Audiology and Speech Sciences
Parallel Distributed Structures Laboratory
Purdue University
West Lafayette, IN 47907
August 1989
TR-EE-89-54

TABLE OF CONTENTS

Abstract
1. Introduction
2. Text-to-Speech Conversion
3. Representing Text and Phonemes
4. Preparing the Training Input
5. Training Procedures
6. Difference with NETtalk
7. Performance
8. Summary of Statistics
9. Observations
   9.1. Orthographic Transparency
   9.2. Learning Different Languages
   9.3. Capturing the Characteristics of Speech
   9.4. Training with a Second Language
   9.5. Contextual Dependency
   9.6. Number of Hidden Units
   9.7. Cluster Analysis
      9.7.1. Clustering Algorithm
      9.7.2. English Training
      9.7.3. Spanish Training
      9.7.4. English Developmental Cluster Analysis
      9.7.5. Spanish Developmental Cluster Analysis
      9.7.6. Spanish-English Training Cluster Analysis
      9.7.7. English-Spanish Training Cluster Analysis
10. Output Decision Algorithms
11. Problems, Solutions, and Potentials
References
Appendices

Adaptive Networks as a Model for Human Speech Development
M. Fernando Tenorio
M. Daniel Tom
School of Electrical Engineering
and
Richard G. Schwartz
Department of Audiology and Speech Sciences
Parallel Distributed Structures Laboratory
Purdue University
West Lafayette, IN 47907
Abstract
Unrestricted English text can be converted to speech through the use of a look-up
table, or through a parallel feedforward network of deterministic processing units. Here,
we reproduce the network structure used in NETtalk. Several experiments are carried
out to determine which characteristics of the network are responsible for which learning
behavior, and how closely that maps human speech development. The network is also
trained with different levels of speech complexity, and with a second language. The
results are shown to be highly dependent on statistical characteristics of the input.
1. Introduction
Connectionist networks are arrays of simple highly interconnected computing
elements. Two important properties emerge from this computing paradigm. First, these
networks serve as an alternative form of massively parallel knowledge representation.
Second, these networks can learn the input-output relationships by simple modification
of the connection strengths. Because of their potential and simplicity, feedforward
networks using the Generalized Delta Rule for learning have received the greatest
attention [2,4,5,8].
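
For reference, the Generalized Delta Rule referred to here performs gradient descent on the squared output error; in its standard textbook form (stated here for convenience, not quoted from this report):

```latex
E = \tfrac{1}{2}\sum_{j} (t_j - o_j)^2, \qquad
\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = \eta\, \delta_j\, o_i,
\qquad \delta_j = (t_j - o_j)\, f'(\mathrm{net}_j) \ \ \text{(output units)}.
```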

References

Journal ArticleDOI
TL;DR: A model of a system having a large number of simple equivalent components, based on aspects of neurobiology but readily adapted to integrated circuits, produces a content-addressable memory which correctly yields an entire memory from any subpart of sufficient size.
Abstract: Computational properties of use to biological organisms or to the construction of computers can emerge as collective properties of systems having a large number of simple equivalent components (or neurons). The physical meaning of content-addressable memory is described by an appropriate phase space flow of the state of a system. A model of such a system is given, based on aspects of neurobiology but readily adapted to integrated circuits. The collective properties of this model produce a content-addressable memory which correctly yields an entire memory from any subpart of sufficient size. The algorithm for the time evolution of the state of the system is based on asynchronous parallel processing. Additional emergent collective properties include some capacity for generalization, familiarity recognition, categorization, error correction, and time sequence retention. The collective properties are only weakly sensitive to details of the modeling or the failure of individual devices.

15,722 citations


Book
03 Jan 1986
Abstract: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion

13,245 citations


Journal Article
TL;DR: Hierarchical clustering techniques applied to NETtalk reveal that these different networks have similar internal representations of letter-to-sound correspondences within groups of processing units, which suggests that invariant internal representations may be found in assemblies of neurons intermediate in size between highly localized and completely distributed representations.
Abstract: This paper describes NETtalk, a class of massively-parallel network systems that learn to convert English text to speech. The memory representations for pronunciations are learned by practice and are shared among many processing units. The performance of NETtalk has some similarities with observed human performance. (i) The learning follows a power law. (ii) The more words the network learns, the better it is at generalizing and correctly pronouncing new words. (iii) The performance of the network degrades very slowly as connections in the network are damaged: no single link or processing unit is essential. (iv) Relearning after damage is much faster than learning during the original training. (v) Distributed or spaced practice is more effective for long-term retention than massed practice. Network models can be constructed that have the same performance and learning characteristics on a particular task, but differ completely at the levels of synaptic strengths and single-unit responses. However, hierarchical clustering techniques applied to NETtalk reveal that these different networks have similar internal representations of letter-to-sound correspondences within groups of processing units. This suggests that invariant internal representations may be found in assemblies of neurons intermediate in size between highly localized and completely distributed representations.

1,459 citations


Book
01 Jan 1988
Abstract: All chapters conclude with "Questions." 1. Normal Aspects of Articulation (Ray Kent). Introduction. Fundamentals of Articulatory Phonetics. Coarticulation: Interactions Among Sounds in Context. Aerodynamic Considerations in Speech Production. Acoustic Considerations of Speech. Sensory Information in Speech Production. Generative Phonology. Optimality Theory. Which Phonological Theory to Select? Summary of Levels of Organization of Speech. Concluding Note on Implications for Speech Acquisition. 2. Early Phonological Development (Marilyn May Vihman). Models of Phonological Development: The Child as an Active Learner. Infant Perception: Breaking into the Code. Infant Production: Interaction of Maturation and Experience. The Transition Period: From Babble to Speech. Individual Differences: Profile of Two One-Year-Old Girls. Systematization and Reorganization: From Word to Segment. Linguistic Perception Beyond the Transition Period: Representing Speech Sounds. 3. Later Phonological Development (Marilyn May Vihman). Establishing Group Norms: Large-Scale Studies. Phonological Processes: Systematicity in Production Errors. Profiling the Preschool Child: Individual Differences Revisited. Development of Perception Beyond Early Childhood: Understanding Running Speech. 4. Etiology/Factors Related to Phonologic Disorders (Nicholas Bankson, John Bernthal). Introduction. Structure and Function of the Speech and Hearing Mechanisms. Cognitive-Linguistic Factors. Psychosocial Factors. Conclusion. 5. Phonological Assessment Procedures (Nicholas Bankson, John Bernthal). Phonological Sampling. Introduction. Screening for Phonological Disorders. Comprehensive Phonological Assessment: Assessment Battery. Related Assessment Procedures. Determining the Need for Intervention. Intelligibility. Severity. Stimulability. Error Patterns. Developmental Appropriateness. Case Selection Guidelines and Summary. Target Behavior Selection. Stimulability. Frequency of Occurrence. Developmental Appropriateness. Contextual Analysis. Phonological Process Analysis. Target Behavior Selection Guidelines. Other Factors to Consider in Case Selection--Intervention Decisions. Dialectal Considerations. Social-Vocational Expectations. Computer Assisted Phonological Analysis. Case Study. Assessment: Phonological Samples Obtained. Assessment: Interpretation. 6. Remediation Considerations (Nicholas Bankson, John Bernthal). Basic Considerations. Introduction. Framework for Conducting Therapy. Making Progress in Therapy: Generalization. Generalization Guidelines. Dismissal from Instruction. Maintenance and Dismissal Guidelines. 7. Treatment Approaches (Nicholas Bankson, John Bernthal). Introduction. Treatment Continuum. Motor Learning Principles. Teaching Sounds: Establishment of Target Behaviors. Beyond Teaching Sounds: Treatment Approaches with a Motor Emphasis. Linguistic-Based Approaches to Intervention. Oral-Motor Activities as Part of Articulation Instruction. Intervention for Children with Developmental Verbal Dyspraxia (DVD). Case Study. Intervention Recommendations. First consideration--do I use a motor or phonological approach to intervention, or both? 8. Language and Dialectal Variations (Brian Goldstein, Aquiles Iglesias). Introduction. Dialect. Characteristics of American English Dialects. Summary. 9. Phonological Awareness: Description, Assessment, and Intervention (Laura M. Justice, C. Melanie Schuele). Introduction. What Is Phonological Awareness? Phonological Awareness as Literacy Development. 
The Development of Phonological Awareness. Phonological Awareness and Reading. Phonological Awareness and Disorders of Speech Production. Assessment. Intervention.

332 citations


Journal ArticleDOI
J. S. Denker
TL;DR: This article reviews the workings of a standard model, with particular emphasis on various schemes for learning and adaptation; such networks can be used as associative memories or as analog computers to solve optimization problems.
Abstract: Recent work has applied ideas from many fields including biology, physics and computer science, in order to understand how a highly interconnected network of simple processing elements can perform useful computation. Such networks can be used as associative memories, or as analog computers to solve optimization problems. This article reviews the workings of a standard model with particular emphasis on various schemes for learning and adaptation.

59 citations