Two Connectionist Models of Reading
Many languages have more regular letter-to-sound mappings than English. How does this affect young language learners?
One test, known as nonword reading, measures the ability of language learners to pronounce novel nonwords "by analogy" with the words they have previously learned. For example, some researchers have compared nonword reading among learners of different languages by creating nonwords from high-frequency number words. Some studies show English language learners at nearly half the performance of their German peers at 7 years, and still somewhat behind at 12 years of age.
It is important to understand why this occurs, and what can be done about it. Such mechanistic questions are well suited to computational models. In their 2004 Cognition paper, Hutzler et al review two connectionist models (by Plaut et al 1996 and Zorzi et al 1998) of this and related phenomena.
As reviewed by Hutzler et al, the Plaut model consists of three layers - an orthographic input layer (105 units), a hidden layer (100 units), and a phonological output layer (61 units) - and is trained with backprop on 3000 words for 300 epochs. The model successfully simulates skilled reading of novel nonwords, and shows the "frequency by consistency" interaction in English - in other words, it shows that words are read faster if the pronunciation is more consistent with spelling rules, but only for low frequency words. Hutzler et al implemented this model and trained one network on English word-to-sound mappings, and another network on German word-to-sound mappings.
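To make the architecture concrete, here is a minimal NumPy sketch of a three-layer network with the layer sizes quoted above, trained by backprop. Only the layer sizes and the 300 epochs come from the description; the random binary patterns, the 50-item "corpus", the squared-error loss, the learning rate and the weight initialization are all my own placeholder assumptions, not details of Plaut's model or of Hutzler et al's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes as quoted in the post; everything else below is assumed.
N_ORTH, N_HIDDEN, N_PHON = 105, 100, 61

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical stand-in for a word corpus: sparse random binary patterns.
n_words = 50
X = (rng.random((n_words, N_ORTH)) < 0.1).astype(float)   # "spellings"
T = (rng.random((n_words, N_PHON)) < 0.1).astype(float)   # "pronunciations"

W1 = rng.normal(0.0, 0.1, (N_ORTH, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_HIDDEN, N_PHON))
b2 = np.zeros(N_PHON)

def forward(inputs):
    hidden = sigmoid(inputs @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output

def mse(output, target):
    return float(np.mean((output - target) ** 2))

initial_error = mse(forward(X)[1], T)

lr = 0.2
for epoch in range(300):
    H, Y = forward(X)
    # Backprop for squared error with sigmoid units.
    delta_out = (Y - T) * Y * (1.0 - Y)
    delta_hid = (delta_out @ W2.T) * H * (1.0 - H)
    W2 -= lr * H.T @ delta_out / n_words
    b2 -= lr * delta_out.mean(axis=0)
    W1 -= lr * X.T @ delta_hid / n_words
    b1 -= lr * delta_hid.mean(axis=0)

final_error = mse(forward(X)[1], T)
print(initial_error, final_error)
```

The point is only the shape of the computation: error at the output is propagated back through the hidden layer, so the orthography-to-phonology mapping is always mediated by learned hidden representations.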
Both the German and the English networks were tested on 80 monosyllabic nonwords (e.g., fot, lank, plock). Pronunciations were considered correct if they were merely plausible - they did not have to correspond to dominant letter-to-sound correspondences in either language. Unfortunately, these models do not capture the correct qualitative pattern of results - instead of an initially large difference in nonword pronunciation that decreases over time, they show an initially small difference that increases over time. Hutzler et al suggest that this failure might be due to the multiple layers used here, which could delay the advantage of spelling-sound consistency (although I think it might also be a result of using backprop without also using more bottom-up Hebbian learning rules).
Hutzler et al also implemented Zorzi's two-layer associative model, which learns orthography-to-phonology mappings with a delta learning rule - in other words, a rule that changes connection weights based on the difference between produced output and target output. There are 208 orthographic input units and 44 phonological output units in each of 7 positions (yielding a total of 308 outputs), and these layers are fully interconnected; each output unit's activation is a standard sigmoidal function of the dot product of the input activations and its connection weights.
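The delta rule is short enough to write out. Below is a rough sketch using the unit counts quoted above (208 inputs, 44 x 7 = 308 outputs). The particular active input units, the target pattern, the learning rate, and the choice to leave out the sigmoid-derivative term are my own assumptions for illustration; I don't know exactly which variant Zorzi et al or Hutzler et al used.

```python
import numpy as np

N_ORTH = 208                          # orthographic input units
N_PHONEMES, N_POSITIONS = 44, 7
N_PHON = N_PHONEMES * N_POSITIONS     # 308 phonological output units

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rule_step(W, x, target, lr=0.05):
    """One delta-rule update: weight change = lr * input * (target - output)."""
    y = sigmoid(x @ W)
    W += lr * np.outer(x, target - y)   # updates W in place
    return y

# Hypothetical single training item: a few active input units
# mapped onto a few active output units.
x = np.zeros(N_ORTH)
x[[0, 17, 42, 99]] = 1.0
target = np.zeros(N_PHON)
target[[3, 50, 100]] = 1.0

W = np.zeros((N_ORTH, N_PHON))
errors = []
for _ in range(200):
    y = delta_rule_step(W, x, target)
    errors.append(float(np.abs(target - y).sum()))

print(errors[0], errors[-1])
```

Because there is no hidden layer, every weight connects a spelling unit directly to a sound unit, and the update for each weight depends only on that input's activity and that output's error.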
As Hutzler et al note, this network has no hidden layer and thus can learn only linearly separable mappings. As a result, it never learns to correctly pronounce all words - training is instead stopped when errors reach a global minimum. As before, one version of the network was trained on English and another on German. The German network shows a consistent advantage in nonword pronunciation over the English network, which remains (and perhaps even widens) throughout training. However, this still does not perfectly match the human results, which show a wider advantage at the beginning of training.
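The limitation of a network without a hidden layer is easy to demonstrate on a toy problem. The sketch below is my own illustration, unrelated to the reading corpora: a single sigmoid unit trained with the delta rule learns OR, which is linearly separable, but cannot produce XOR, which is not. The function name, learning rate and epoch count are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_single_layer(X, t, epochs=5000, lr=0.5, seed=0):
    """Train one sigmoid output unit (plus bias) with the delta rule.

    Returns the thresholded predictions for the training inputs.
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant bias input
    w = rng.normal(0.0, 0.01, Xb.shape[1])
    for _ in range(epochs):
        y = sigmoid(Xb @ w)
        w += lr * Xb.T @ (t - y)                # batch delta-rule update
    return (sigmoid(Xb @ w) > 0.5).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

or_pred = train_single_layer(X, np.array([0.0, 1.0, 1.0, 1.0]))
xor_pred = train_single_layer(X, np.array([0.0, 1.0, 1.0, 0.0]))

print("OR :", or_pred)
print("XOR:", xor_pred)
```

Words whose pronunciation depends on a conjunction of letters that no single weighted sum can pick out are in the same position as XOR here, which is why some error always remains.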
Hutzler et al then ask whether this discrepancy might be explained by differences in pedagogy, and address this question by "teaching" each network differently. To understand the logic here, consider that direct training of letter-to-sound correspondence is more difficult when each letter-sound relationship is more dependent on surrounding letters. This is the definition of inconsistent spelling-to-sound mapping, such as is frequently found in English. Thus, English pronunciation requires training on entire words. In contrast, German can be effectively taught on the basis of single phonemes or syllabic units, without appeal to entire words. The increased effectiveness of simpler training methods may provide an early learning advantage to German readers.
To test this hypothesis, the authors used training events that reflected the spelling-to-sound rules taught in common English and German phonics programs. The Zorzi two-layer model was pre-trained with these phonics programs and then retrained on the previous corpus, with nonword pronunciation tested throughout training. This time, the results fit the human data much better - the German-trained network shows an initial advantage of around 35%, which decreases to 10% by the end of training.
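The training schedule described here (pre-train on one set of items, retrain on another, probe after every epoch) can be sketched as a small loop. Everything in this example is hypothetical: `run_curriculum` is a name I made up, the patterns are random placeholders rather than real phonics rules or words, and the layer sizes and epoch counts are arbitrary. It shows the procedure only and makes no attempt to reproduce the reported numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_curriculum(W, phases, probe_in, probe_out, lr=0.05):
    """Train W in place through each phase in order, probing after every epoch.

    phases: list of (label, inputs, targets, n_epochs) tuples.
    Returns one (label, mean absolute probe error) pair per epoch.
    """
    history = []
    for label, inputs, targets, n_epochs in phases:
        for _ in range(n_epochs):
            for x, t in zip(inputs, targets):
                W += lr * np.outer(x, t - sigmoid(x @ W))   # delta rule
            probe_error = np.abs(probe_out - sigmoid(probe_in @ W)).mean()
            history.append((label, float(probe_error)))
    return history

# Placeholder data: random binary patterns standing in for phonics items,
# corpus words and nonword probes.
rng = np.random.default_rng(2)
n_in, n_out = 20, 12

def make_items(n):
    inputs = (rng.random((n, n_in)) < 0.2).astype(float)
    targets = (rng.random((n, n_out)) < 0.2).astype(float)
    return inputs, targets

phonics_in, phonics_out = make_items(8)
corpus_in, corpus_out = make_items(30)
probe_in, probe_out = make_items(10)

W = np.zeros((n_in, n_out))
history = run_curriculum(
    W,
    [("phonics", phonics_in, phonics_out, 5),
     ("corpus", corpus_in, corpus_out, 20)],
    probe_in,
    probe_out,
)
print(history[0], history[-1])
```

Plotting the probe error from `history` for an English-like and a German-like item set would give the kind of learning curves the comparison is based on.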
Could English learners benefit more from a different kind of phonics program? Probably, but it's hard to know what sort of training is best, given that Hutzler et al do not attempt to search the "training space" to find an ideal set of training material. In fact, this kind of analysis is rarely done, probably because input representations are sometimes somewhat arbitrary and are thought to reflect a "weak point" in many connectionist models.
But there are also reasons to think that such an analysis would be premature. For one, these models use only error-driven learning, which is generally good for learning specific tasks but is not particularly good for picking up statistical regularities in the environment. For that second type of learning, Hebbian rules are better suited. It is difficult to predict how these networks might differ if Hebbian learning rules were incorporated into the training paradigms.
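For readers who haven't seen the two kinds of rule side by side, here is a small sketch of the contrast. The function names, the tiny patterns and the use of linear outputs in the error-driven rule are my own simplifications, not anything from the models discussed above.

```python
import numpy as np

def hebbian_update(W, x, y, lr=0.1):
    """Hebbian: strengthen weights in proportion to joint input/output activity."""
    return W + lr * np.outer(x, y)

def delta_update(W, x, target, lr=0.1):
    """Error-driven: change weights in proportion to the output error.

    Linear outputs are assumed here for simplicity.
    """
    y = x @ W
    return W + lr * np.outer(x, target - y)

# Three toy input/output patterns; the first two are identical.
X = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
Y = np.array([[1, 0],
              [1, 0],
              [0, 1]], dtype=float)

# Hebbian learning simply accumulates co-occurrence statistics.
W_hebb = np.zeros((3, 2))
for x, y in zip(X, Y):
    W_hebb = hebbian_update(W_hebb, x, y)

# Error-driven learning keeps adjusting until the outputs match the targets.
W_delta = np.zeros((3, 2))
for _ in range(200):
    for x, t in zip(X, Y):
        W_delta = delta_update(W_delta, x, t)

print(W_hebb)
print(X @ W_delta)
```

The Hebbian weights end up proportional to how often each input and output were active together, regardless of whether the task is solved, whereas the delta rule stops changing once the error is gone.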