Watching A Language Evolve Among Robotic Agents
In yesterday's post, I described a solution to the "gavagai problem" (which holds, on one interpretation, that word learning is an intractable task because any word can in principle have an infinite number of referents, even when learned through direct instruction) proposed by Landauer & Dumais. They used a machine learning technique called latent semantic analysis (LSA) to extract the meanings of words by computing a similarity relationship between each word and every other word the program has ever encountered, as a function of the contexts in which those words appear.
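For readers who want to see the gist of that idea in code, here is a tiny Python sketch of LSA: build a word-by-context count matrix, reduce it with a truncated SVD, and compare words by the cosine similarity of their latent vectors. The toy corpus, the raw counts (real LSA applies a log-entropy weighting first), and the two latent dimensions are all simplifications of mine, not Landauer & Dumais's setup.

```python
import numpy as np

# Toy "contexts" standing in for the paragraphs of a large corpus.
contexts = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the senator proposed the bill",
    "the senator debated the bill",
]

vocab = sorted({w for c in contexts for w in c.split()})
word_index = {w: i for i, w in enumerate(vocab)}

# Word-by-context co-occurrence matrix: rows are words, columns are contexts.
counts = np.zeros((len(vocab), len(contexts)))
for j, context in enumerate(contexts):
    for w in context.split():
        counts[word_index[w], j] += 1

# Truncated SVD projects each word into a low-dimensional "semantic" space.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2  # number of latent dimensions (tiny here; LSA typically keeps hundreds)
word_vectors = U[:, :k] * S[:k]

def similarity(w1, w2):
    """Cosine similarity between two words' latent vectors."""
    a, b = word_vectors[word_index[w1]], word_vectors[word_index[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Words appearing in similar contexts end up close in the latent space, so
# similarity("dog", "cat") should exceed similarity("dog", "senator").
print(similarity("dog", "cat"), similarity("dog", "senator"))
```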
Despite these impressive results, this approach seems to miss a fundamental and intuitive aspect of language learning: words seem to be defined primarily in relation to objects in the real world, and only secondarily in relation to other words. In other words, LSA's word meanings are not "grounded" in real-world experience ... so to what extent can we think of it as truly understanding the meanings of words, human-competitive data notwithstanding?
A more intuitively fulfilling solution to the gavagai problem might explore the way in which speakers come to understand the real-world objects to which a given term refers, rather than developing a solely recursive understanding of word meaning. In their chapter in Linguistic evolution through language acquisition: formal and computational models, authors Steels and Kaplan describe experiments with two robotic agents whose job is to communicate about objects in their environment by developing their own language.
Steels & Kaplan implement this task as follows: two robots face an environment populated only by 2-D shapes of different colors on a whiteboard. Each robot consists of a camera, a catagorization system, and a verbalization system, which in combination are capable of segmenting a visual image into objects, with each object defined by its spatial position and RGB color values. The robots take turns playing the roles of speaker and hearer in what the authors call "the guessing game", in which a speaker first picks an object from the environment, and then communicates to the hearer a series of syllables that best categorize that object uniquely among all the other objects in the environment (where "best categorize" is defined in terms of weight, as discussed below). For example, the speaker may use the phrase wovota to indicate the object of interest is in the upper left corner. The hearer will then use its database of terms and their associated meanings to pick the object it believes the speaker is referring to. The speaker then verifies whether the picked object is actually the one to which it was referring; if the hearer was correct, both the speaker and hearer increase the "weight" of the relationship between that term (wovota) and the internal representations (upper left) while decreasing the weight of all competing associations. If the hearer picked the wrong referent, the relationship between wovota and the internal representation is decreased in weight. Such games are played thousands of times in a row, in which the agents playing speaker and hearer are rotated so that all agents in the environment have played the games with other agents. By the conclusion of training, the arrive at a set of consistent - or mostly consistent - terms for objects in their environment.
However, consider what would happen if the speaker had used the term wovota to indicate "very red," but this very red object was located in the upper left corner of the environment - in this case, the hearer would have been correct, but for the wrong reason: the term wovota was actually intended to reflect redness, but was interpreted as reflecting spatial position. In this case, future uses of the term wovota by the speaker would be necessary in order for the hearer to correctly reinterpret the phrase.
Yet another type of misunderstanding is also possible: if the speaker uses the term wovota to indicate the very red object in an environment containing both a red object and a blue object in the upper left corner, the hearer will not be able to identify the referent of wovota unambiguously. When the hearer fails to pick an object, the speaker itself indicates the referent of wovota. A new meaning for wovota is then stored (one involving redness, most likely), and it begins to compete with the previous interpretation of "upper left."
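A few lines of Python make that bookkeeping explicit. The weights here are hypothetical and the flat dictionary is just shorthand for the hearer's lexicon in the sketch above; nothing below is taken from the chapter itself.

```python
# Hypothetical hearer state after the failed game described above.
hearer_wovota = {"upper-left": 0.7}   # the hearer's existing interpretation

# The hearer could not pick a unique object, so the speaker indicates the one
# it meant; the hearer stores a new, competing association for "wovota".
hearer_wovota["very-red"] = 0.5

# In later games the hearer goes with whichever meaning currently has the
# larger weight; repeated successes and failures shift that balance over time.
preferred = max(hearer_wovota, key=hearer_wovota.get)
print(preferred)  # still "upper-left", until "very-red" is reinforced enough
```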
These types of miscommunications are prototypical examples of the "gavagai problem." And yet, the communicative performance of these robotic agents is in many ways very language-like and actually quite impressive (even more so, when you consider that they begin with no shared terminology at all!). For example:
- Synonyms tend to emerge during training, such that one agent prefers one term for something, while the other agent may prefer another. As the authors note, this situation arises when one agent invents a new term after having incorrectly interpreted a term that is already in existence.
- Likewise, homonyms also emerge during training, in which one word comes to have multiple meanings.
- Some meanings of words fall out of usage among the agent population, just as in human language.
- When new objects were added to the environment, thus making it more complex, success in the guessing game dipped sharply, but quickly rebounded as agents derived new words or differentiated the possible meanings of old words.
6 Comments:
its matt w. do you still have your old phone number? id like to give you a ring sometime!!! great blog, hope everything is going well for you and catherine in colorado. peace and love to you both!
Hey Matt! yep, I still have the same number. I trashed my old phone, though, so i don't have your number (it probably changed anyway). definitely give me a call - or drop me an email at christopher . chatham [AT] colorado . edu
Chris:
I come at this from the other end (via natural philosophers like Berkeley and Gibson), where the same argument was applied to vision. Berkeley called vision a language, in fact. How we form relationships in vision is as complicated as the aural equivalents. Both involve a transformation between perception and touch that make it darned hard to figure out where thought begins and the world ends.
Anyhow, have been enjoying your blog for a few weeks now and wanted to leave a post thanking you. I know you have little time to read ancient philosophers long pre-robotics, but the arguments in Berkeley might be inspiring. I'll try to dig up a quote or two in the next few days.
--Caroline
Hi Caroline - Thanks so much! I am actually struggling to find time even to blog, so I am afraid that reading philosophy is out of the question, at least for the next two months.
Quotes, however, would be VERY appreciated - it will give me something to look forward to reading when I finally have some free time again.
Why is research so hard? Ugh!
I lost research! =)
I've been going about this blog for sometime and it's cool. I just love the things and questions, theories that it poses.
I'm only doing my second semester in college, so i think i'm just a kid compared to you guys :S
but the stuff that i've learned from this blog is much much much more interesting compared to what i'm learning in college =)
I love cognitive science =)
I love this blog =)
I'll read up more on this super interesting stuff and hopefully add more ideas of exploration to your research Chris! =)
Hi Chris,
I've enjoyed this and this post's prequel. I just wanted to swing by and lay out the following brief defence of LSA against the charge that several have raised, namely, that it is disanalogous to humans insofar as it does not have access to the things that words are supposed to refer to. One way of putting the so-called failing of LSA is that it is merely associating representations with other representations. One way of putting the so-called disanalogy between LSA and the way humans acquire language and knowledge of word meanings is that humans, unlike LSA, not only associate representations with other representations, they also associate representations with the things that the representations are representations of. In terms of an example, humans can not only associate the representation "dog" with the representation "canine", they can also associate the representation "dog" with dogs themselves.
Setting things up in the above way allows us to see that maybe things aren't so disanalogous after all. There is a very real sense in which all we are ever able to do is associate representations with other representations. There is no such thing as accessing things themselves (that is, things unrepresented) or associating representations with things. And this latter point is, I take it, what Quine was trying to make after all: there is no determinate relationship between words and "meanings" or words and their referents.
Cheers,
Pete