Machine and Human Learning of Word Meanings
How do we ever come to know the meanings of words? Consider the following (likely apocryphal) story:
Walking along one day on the newly-discovered coast of Australia, Captain Cook saw an extraordinary animal leaping through the bush. "What's that?" he asked one of the aborigines accompanying him.
"Uh - gangurru," he replied - or something like that. Captain Cook duly noted down the name of the peculiar beast as 'Kangaroo'. Some time later, Cook had the opportunity to compare notes with Captain King, and mentioned the kangaroo.
"No, no, Cook," said King, "the word for that animal is 'meenuah' - I've checked it carefully."
"So what does 'kangaroo' mean?"
"Well, I think," said King, "it probably means something like 'I don't know'..."
This story may have inspired Quine in proposing the "gavagai problem," which can be interpreted as suggesting that word learning is a nearly impossible task because any given word has in principle an infinite number of referents. To make this clearer, consider that gangurru (or meenuah, or "gavagai") could actually mean "marsupial," "good to eat," or "kangaroo jumping through a harvested field prior to 11:45 am." So how do we recover word meaning from any real-world linguistic experience, when even straightforward and direct instruction of word meaning is so ambiguous?
One might assume that powerful constraints exist on the kinds of conclusions we draw from linguistic experience. Yet a fascinating machine learning technique, known as Latent Semantic Analysis, is capable of acquiring word meanings in ways that eerily resemble human learning - all without ever having undergone direct instruction, and with no word learning biases built-in.
As presented in this paper (from which this post is conceptually derived), Landauer & Dumais describe how Latent Semantic Analysis (LSA) can acquire word meaning simply by reading encyclopedia text. Essentially, the LSA algorithms derive a rating of similarity between every word and every other word by cataloguing the "number of times that a particular word type, say 'model,' appears in a particular paragraph, say this one." These values undergo a log transform, and then division by the entropy of that word with respect to that paragraph (entropy is a measure of how representative a word is of the paragraph in which it is found, with high values being less representative; basically, dividing by this term allows LSA to scale-down the importance of "contextually ill-defined" words). Ultimately, this processing results in an enormous matrix of similarity relationships.
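To make the bookkeeping concrete, here is a minimal sketch of that weighting step, assuming a plain word-by-paragraph count matrix. The function name is mine, and I use the common "one minus normalized entropy" global weight, which has the same down-weighting effect on contextually ill-defined words that the paper describes:

```python
import numpy as np

def log_entropy_weight(counts):
    """Weight a word-by-paragraph count matrix as a precursor to LSA.

    counts: 2-D array, rows = word types, columns = paragraphs.
    Local step: log transform. Global step: scale each word down in
    proportion to its entropy over paragraphs, so that "contextually
    ill-defined" words contribute less.
    """
    counts = np.asarray(counts, dtype=float)
    local = np.log(counts + 1.0)  # damp raw frequencies

    # p[i, j] = fraction of word i's occurrences that fall in paragraph j
    row_totals = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, row_totals, out=np.zeros_like(counts),
                  where=row_totals > 0)
    n_paragraphs = counts.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = -plogp.sum(axis=1) / np.log(n_paragraphs)  # 0 = focused, 1 = spread out

    return local * (1.0 - entropy)[:, np.newaxis]
```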
That's "where the miracle occurs." Landauer & Dumais describe the use of "singular value decomposition" (SVD) to compress these similarity relationships into a more manageable data structure - one that represents the most "core" set of relationships that define when and how a certain word will be used. In slightly more technical language, SVD reduces the dimensionality of a data set by identifying the principle components or eigenvectors that compose it (the technique is similar to factor analysis, principle component analysis, and multidimensional scaling).
To use the example in the paper, Landauer & Dumais were able to compress semantic relationship data from over 30,000 encyclopedia articles into around 300 dimensions. In other words, the usage of each and every term out of the 4,000,000 words encountered in the encyclopedia is optimally describable if characterized along each of roughly 300 different continuums. One might claim that LSA has learned the set of "core semantic characteristics" that make up the meanings of English words. (Not only that, but this is just one of many possible dimensionality-reduction techniques, as the authors note!)
And the proof is in the pudding:
- after training, LSA scored 64.4% correct on a multiple-choice synonym test taken from the TOEFL (in contrast, humans average around 64.5% on this test, which is frequently used as a college entrance examination of English proficiency for non-native speakers - by this metric, LSA would be admitted to many major universities!); a sketch of this scoring procedure appears after this list;
- calculations of the rate of word learning by 7th graders suggest that they acquire about .15 words per 70-word text sample; analogous calculations show that LSA acquires roughly .15 words per text sample read;
- the comprehension by college students of several versions of a text sample about heart function is precisely replicated by LSA, when comprehension is measured as the degree of semantic overlap between subsequent sentences;
- humans initially show facilitated processing of all meanings of a previously presented word, but after 300 ms show priming only for context-appropriate meanings. LSA shows similar effects: similarity is higher between a homograph and words related to either of its meanings than between the homograph and unrelated words, and words related to the context-appropriate meaning of a homograph are rated as more similar than words related to the context-inappropriate meaning;
- Human reaction times in judgments of numerical magnitude suggest that the single digit numerals are represented along a "logarithmic mental number line;" LSA was able to replicate this effect in its ratings of similarity among the single digit numerals, which also conform to a logarithmic function
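For concreteness, here is a minimal sketch of how the multiple-choice synonym test in the first bullet can be scored with the word vectors from the SVD step; the variable names and the sample item are mine, not taken from the paper:

```python
import numpy as np

def answer_synonym_item(stem, alternatives, word_vectors, vocab):
    """Pick the alternative whose vector has the highest cosine similarity
    to the stem word's vector (word_vectors and vocab come from the LSA
    training sketched above; vocab maps word -> row index)."""
    def vec(word):
        return word_vectors[vocab[word]]

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    stem_vector = vec(stem)
    return max(alternatives, key=lambda alt: cosine(stem_vector, vec(alt)))

# e.g. answer_synonym_item("enormous", ["tiny", "huge", "frequent", "ancient"],
#                          word_vectors, vocab)
```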
In conclusion, LSA captures many aspects of word learning well. How SVD-like computations might be implemented in the brain, and whether the model's simplifying assumptions (such as perfect memory and an empirically determined optimal dimensionality) are truly necessary, remain open questions.
Nonetheless, latent semantic analysis seems like a promising and powerful approach for understanding human semantic learning at a cognitive level. Perhaps most importantly, it shows that Quine's "gavagai problem" might not be so intractable after all - a word's meaning can be understood as a function of the contexts in which that word appears, the contexts in which it does not, and likewise for every other word to which it is related. This recursive relationship of word to context provides a nearly infinite amount of linguistic data from which word meaning might be derived.
10 Comments:
This is very, very cool. Thanks for posting.
Hi - glad you like it! I was really impressed by this paper too. The wikipedia entry on latent semantic analysis is very interesting also.
Very interesting, and a helpful summary. It reminds me of Paul Churchland's semantic theory for (artificial) neural networks.
It brings up lots of interesting issues.
1. LSA won't tell you the meaning of 'gavagai', but only synonyms for the word. If you gave me the results of LSA in Portuguese, I wouldn't know the meaning of any given term, even though I'd know its synonyms. This suggests that performing LSA is not sufficient for a semantic theory.
2. What if you tried to effect a translation between two different languages by effectively translating and rotating the metrically arrayed lexicons (via LSA) into one another (in a way that minimizes distortion of the metric)? Would languages with different grammars (e.g., Subject-verb-object versus object-verb-subject) be less translatable into one another?
3. The approach assumes that words which occur close to one another are similar in meaning. Do traditional parts of speech segregate into metaclusters? What of connectives such as 'and'? I'd assume it would be equidistant from other syntactic categories such as nouns (unless 'and' is closer to 'dog' than to 'blender').
4. The fact that they apply it to words only reveals a subtle bias. Namely, words are the fundamental unit of meaning. What if full sentences are the bearers of meaning, and it is the metric structure of sentence space that fixes sentence meaning? Then words would tend to co-occur because substituting them into a sentence would not change the frequency of co-occurrence of the full sentences. Something like this view is espoused by philosopher . Note I think he is probably wrong, but it is an interesting topic.
5. The whole approach seems kind of strange as a semantic theory. Just because two words, on average, tend to co-occur in a corpus, does that really make them similar in meaning? It may be a useful way for us to bootstrap into the meanings of terms in a language (assuming we already have a core vocabulary and corresponding semantics), but that seems to be a theory of inference about the meaning of words. This is certainly interesting, but the assumption that only corpus statistics are used, and not perceptual and pre-existing semantic knowledge, seems like an anemic and probably empirically wrong view. Early on linguistically, kids work pretty hard to learn the referents of terms, and this training usually hooks into their perceptual systems (e.g., pointing to a picture of a pineapple and saying 'pineapple'). You typically don't teach kids the meaning of a word by using it in a bunch of different sentences.
6. Does the LSA algorithm lead to errors that are common in humans? The fact that the frequency of errors is similar to that of people for whom English is a second language might be less impressive if the errors are completely different from the ones we make. More nuanced tests would be cool.
My questions are offered in a friendly spirit: the empirical success of the theory offers food for thought. There is probably something to it as a theory of inferring the meanings of words from our preexisting linguistic understanding.
"word learning is an impossible task because any given word has in principle an infinite number of referents".
No. Quine was arguing against Meaning as a formal entity that was put forward by Frege and the like.
The blue above is not due to a problem in the html (I closed my bracket). The blogger software does that sometimes with long posts. :( It is just a link to one guy's page. If anyone knows how to stop blogger from doing this, please let me know!
Thanks anonymous for the note about other interpretations of Quine's gavagai problem. I have revised the post to reflect that my (and by extension Landauer & Dumais's) summary of the gavagai problem is just one possible interpretation.
Eric, thanks for the very thought provoking questions. I agree with you completely on #1 - tomorrow's post should cover that in more detail. As for #2, I have a feeling that attempting to autotranslate LSA-derived meaning would fail because every language has a different set of homonyms, which in LSA are bound to the words. I think that your point about differences in grammar would also cause problems, although it's hard to know since LSA does not include any information about grammar per se in its representation of similarity.
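That said, Eric's rotation idea could at least be tried mechanically. Here is a rough sketch of one way to do it - an orthogonal Procrustes alignment over a small set of anchor word pairs - though this is just my illustration, not anything Landauer & Dumais attempt:

```python
import numpy as np

def align_lsa_spaces(source_vectors, target_vectors):
    """Find the rotation R minimizing ||source_vectors @ R - target_vectors||
    over a set of anchor word pairs (rows correspond across languages).
    R can then be applied to the rest of the source-language lexicon."""
    U, _, Vt = np.linalg.svd(source_vectors.T @ target_vectors)
    return U @ Vt  # orthogonal Procrustes solution
```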
With regard to metaclusters, Landauer & Dumais do mention the possibility of increasing the number of contexts that are represented. However, they drop words like "he," "she," "it," and "and" from the analysis (and briefly mention that they are "meaningless"). I have problems with this, but it is a limitation of LSA they mention up front.
As for #4, debates about the fundamental unit of meaning are very difficult to resolve, which suggests to me that it is an improper question to ask. Yes, words (and paragraphs, i.e. contexts) are the fundamental unit of meaning in LSA, and the ideal use of SVD in semantics would decompose the relationship of every possible unit of meaning with every other unit of meaning. I feel pretty confident this is intractable with current hardware (excluding, of course, the human brain).
I also agree with number 5. In many ways, words that co-occur in a paragraph are dissimilar in meaning - this is the point of writing, to express a variety of thoughts. I hope that the next post about semantics will address some of these concerns.
Thanks!
just wanted to chime in with my two cents... this is an impressive blog. lots of neat info.
thx maximo; encouragement is always welcome ;)
I think I have to agree with Eric on something here. As he pointed out, LSA learns meaning based on synonyms and not based on the actual "item" itself. I think because of this lack of object permanence in the language context, LSA does not compute "he", "she", or "it".
In my own experience of writing, writers use this technique quite widely. For example, writers will put a bunch of synonyms together to describe a single word. However, the reader first has to already know what the object is, or at least be able to guess it. From this technique and way of writing, I have also been able to learn new words with the same meaning. Perhaps LSA ONLY uses this way to learn, but it has no permanent idea of what an object really is. Like, have you ever gotten into a situation where you know what the thing is related to but you don't really know what it actually IS in reality? I think LSA's language learning is based on this concept.
To further comment, I think perhaps ONE of the ways that humans learn language works the same way as LSA. Perhaps because humans have their own grounded sense of reality and a "linguistic object permanence" that is related to "he", "she" and "it". To explain, we can refer to something as "he" because we know "what" the object is. But LSA doesn't know "what" the object is; it only knows "it is related to" this group of things. I think perhaps it uses analogy to find things out, for example: if all of a is c, and all of c is b, then a must be b. I think perhaps the way LSA learns language is only part of how humans learn language. I think perhaps we also have to count in things like social learning and cognitive levels of learning (as in being able to deduce "a" out of a summary of object "a").
Because of this, I think figuring out the way that humans learn language will also reveal more about brain processes. Maybe if we add hearing or viewing aids to LSA and program it to accept information from them, LSA might be able to have a more complete way of learning language? Or it could be used to determine whether humans learn as much language from using our senses too, because we all know that humans do learn other things, like emotions and sounds, with the aid of our senses.
These are my thoughts. What are yours?
PS:i love this kind of stuff =p
Just thought I'd chip in a couple of points on this topic. I've used PCA (i.e. SVD analysis) in imaging, textual and now musical analysis. Certainly it can produce a dimensionality reduction, however, there are a few things to note.
The example - 300,000 to 300 - isn't really a marker of this. 'Basic English' has 3000 words or so (I believe); no matter how much text I have in Basic English (say, all of it, 30 million articles?), the dimensionality of the word vectors is 3000, and that is the dimensionality of the space (I wish I had the ref to hand, but basically take U^T.U (a 3000*3000 matrix) rather than U.U^T (3mill*3mill) - you'll get the same components with a little maths jiggling).
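(A toy illustration of that trick, with made-up dimensions rather than real corpus sizes: diagonalize whichever of the two Gram matrices is smaller and the nonzero eigenvalues come out the same.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3000))   # pretend: 100 documents x 3000-word vocabulary

small = X @ X.T                        # 100 x 100 - cheap to diagonalize
big = X.T @ X                          # 3000 x 3000 - expensive

vals_small = np.sort(np.linalg.eigvalsh(small))[::-1]
vals_big = np.sort(np.linalg.eigvalsh(big))[::-1]

# The top 100 eigenvalues agree; the remaining 2900 of the big matrix are ~0.
print(np.allclose(vals_small, vals_big[:100]))
```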
Note also an encyclopedia has normalized out (usually) a lot of potential variation - I'll come back to that.
When we say that the PCA vectors 'represent' something we have to be really careful. 'Representation' is a loaded - very loaded - word. Actually a PCA gives the best reconstructive fidelity with N orthogonal vectors, chosen in decreasing eigenvalue order of course. Whether this is a good representation of a pattern or not depends not least on what you plan to do with it - e.g. if you are actually looking at discrimination you may find using ALL the components and renormalizing with the Mahalanobis metric is the most rational thing to do.
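(And a minimal sketch of the Mahalanobis renormalization I mean - keep all the components but whiten by the data covariance; names and data here are illustrative.)

```python
import numpy as np

def mahalanobis(x, y, data):
    """Distance between x and y after renormalizing by the covariance of
    `data` (rows = samples) - equivalent to keeping all PCA components and
    rescaling each by its eigenvalue."""
    cov_inv = np.linalg.pinv(np.cov(data, rowvar=False))  # pinv tolerates rank deficiency
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ cov_inv @ d))
```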
Suppose I make a dictionary of the (say) 80000 non-Basic-English words and each entry is described in Basic English with no cross refs (it's just an example!) at say 200 words average per entry - so 16 million words. The first 3000 PCs will be the standard English and will reconstruct 15,920,000 words correctly, a pretty good percentage! But of course none of the definitions appear - you will need exactly one dimension per entry.
Kohonen's group took a whole bunch of patents and trained a SOM (basically a layered dimensionality reduction) and it looks pretty good intuitively. Of course most patents consist of large chunks of legalese common to all - take these away (very low reconstruction fidelity) and you are left with the important words - e.g. the patent topic. Patents are of course also of a fixed general form and so forth. The point is that to get a good discriminatory, searchable index the SOM throws away most of the data (by basically conforming to the curved manifold underlying it), whereas PCA - even after you have chucked out the average - will simply project big vectors in these dimensions.
Eric Thomson mentions minimal-disruption transformation of the metric - rotating the vector space won't change the metric at all of course, but the basic problem is we're not starting with a linear metric space, just trying to approximate the tangent space - so a real mapping between languages is (presumably) a diff mapping between two curved spaces, and rotating the basis at zero won't get there.
As I recall - and maybe you're doing things a bit differently - the vectors will produce sets of words not with the same meaning, but hopefully indicating the same sentence- and higher-level 'topic'. When I was looking at textual analysis (8 years ago though!) things were fine for controlled corpora, but when we looked at unstructured things like the internet we started finding that there is so much data that just about all word pairs can be correlated (proper names don't help), and that the uncontrolled manner of posting can seriously bias the correlations (search for 'atomic' - there are a lot of teenagers writing about kittens - and, like these blogs, re-including previous bits of text).
So - the point is - LSI is obviously a powerful method, but it needs to be used with care and with a purpose in mind - in particular it is important to avoid over-interpreting the components, especially in any global or universal manner, and to bear in mind the difference between discriminatory tasks and reconstructive tasks.
The main reason I posted here was because I saw the discussion move from LSI->representation->meaning. PCA is JUST a change of basis - 'representation' and 'meaning' are several lifetimes worth of philosophy!
Good Luck and Have Fun!