The False Promise of View-invariance

Any complete theory of visual object recognition must explain how humans are able to reliably identify specific objects from a near-infinity of different orientations. For example, can you identify the image at the start of this article?

Many theorists, although perhaps most emphatically Jeff Hawkins, have claimed that this 'view invariance' is the hardest problem in computational vision. Accordingly, many explorations of possible object recognition mechanisms have posited the existence of view-invariant geometric primitives, such as Biederman's geons, somewhere in visual cortex. On this theory, objects are recognized by extracting geons from 2-D retinal images, and the relationships among those geons form the basis for basic-level object categorization.
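To make the idea concrete, here is a minimal toy sketch (my own illustration, not Biederman's actual model) of a structural description: an object is a set of part-and-relation triples, and recognition is set comparison. Because the description contains no viewpoint information at all, matching is view-invariant by construction. The "mug" and "pail" descriptions below are made up for the example.

```python
def structural_description(parts_and_relations):
    """Normalize a list of (geon, relation, geon) triples into a hashable set."""
    return frozenset(parts_and_relations)

# A hypothetical "mug": a curved cylinder (handle) attached to the side of a
# cylinder (body). Two different views yield the same structural description.
mug_view_1 = structural_description([("cylinder", "side-attached", "curved-cylinder")])
mug_view_2 = structural_description([("cylinder", "side-attached", "curved-cylinder")])

# A "pail" shares the same parts but a different relation (handle on top).
pail = structural_description([("cylinder", "top-attached", "curved-cylinder")])

def same_object(desc_a, desc_b):
    """Recognition reduces to comparing viewpoint-free descriptions."""
    return desc_a == desc_b

print(same_object(mug_view_1, mug_view_2))  # True: same parts, same relations
print(same_object(mug_view_1, pail))        # False: same parts, different relation
```

This is exactly where the theory's appeal comes from: the same description pops out no matter where the viewer stands, so no extra machinery is needed to handle rotation.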

The major strengths of view-invariant theories (relative to 'template-based' approaches) are their viewpoint invariance (well, obviously!), their resistance to visual noise, and sufficient combinatorial power to describe the object space of human visual experience. They also correspond with some experimental and anecdotal data (complementary-part priming and contour deletion) and bear a superficial complementarity to what is known about receptive field structure in inferotemporal cortex. Additionally, because geons are simple combinations of nonaccidental features, they seem to offer a tractable way of implementing basic three-dimensionality in object recognition.

But in a seminal paper from Cognition in 1995, Michael Tarr took a step back from the intuitive appeal of invariant object recognition, and forced theorists to face up to the facts: human object recognition is not invariant - we are clearly faster at recognizing objects from specific, characteristic views. Could it be that we actually store a unique representation for every possible view of every possible object?

It sounds implausible, but painstaking experiments show that this is a more parsimonious explanation of human object recognition than view-invariant theories. For example, the cases in which object recognition seems view-invariant can be explained by floor effects in reaction time measures: in these cases all the stimuli are discriminable on the basis of features arranged along a single dimension. As soon as those features are arranged with an additional degree of freedom, reaction times become consistent with view-dependence. Results from Shepard's classic mental rotation experiments clearly show that recognition is not completely invariant: reaction times increase linearly with the degree of rotation, as though we perform some 3-D mental rotation in order to match images with their stored representations.

Second, view-dependence also seems more compatible with the neurophysiological data. Single-cell recordings from monkey IT show view-dependent response patterns. Although some neurons have been identified that are entirely view-invariant (the 'Bill Clinton neuron', the 'Halle Berry neuron'), these appear to be nearly "everything-invariant": because they fire reliably for caricatures as well as photos, and for young as well as old faces, they appear to be more conceptual than visual.

Further, certain algorithms have been discovered that can recognize objects from novel perspectives by interpolating between (or extrapolating beyond) characteristic views of those objects. Just a few orthogonal images of each object (perhaps as few as five or six, though the number depends on expertise and geometric complexity), combined with some sophisticated processing of those images, appear sufficient to recognize an enormous array of perspectives on an enormous number of objects.
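Here is a minimal sketch of the view-interpolation idea, in the spirit of Poggio and Edelman's radial-basis-function scheme (the function names, angles, and the sigma parameter are all illustrative choices of mine, not the published model): each stored view contributes a Gaussian similarity to the incoming view, and these are pooled. A handful of stored views then covers the in-between angles that no single view could.

```python
import math

def rbf_response(stored_views, novel_view, sigma=30.0):
    """Pooled similarity of a novel viewing angle (degrees) to the stored views.

    Angles are treated linearly for simplicity; a real model would handle
    wraparound on the viewing sphere.
    """
    return sum(math.exp(-((novel_view - v) ** 2) / (2 * sigma ** 2))
               for v in stored_views)

# Five or six stored views can coarsely tile the azimuth of the viewing sphere.
stored = [0, 60, 120, 180, 240, 300]

familiar = rbf_response(stored, 30)   # a novel view midway between two stored views
unfamiliar = rbf_response([0], 150)   # only one stored view, and it is far away

print(familiar > 0.5)     # interpolation across stored views covers this angle
print(unfamiliar > 0.5)   # a lone stored view cannot generalize that far
```

The point of the sketch is the asymmetry at the end: generalization falls off smoothly with distance from the nearest stored view, which is exactly the view-dependent signature found in the behavioral data.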

Related Posts:
Active Maintenance and the Visual Refresh Rate
Kosslyn's Cognitive Architecture
Language Colors Vision


Blogger Chris Chatham said...

I am running a little short on time today - does anyone have a better image to demonstrate robust, near-view-invariant object recognition abilities? Post a comment with the image address if you find one!

1/26/2006 08:19:00 AM  
Anonymous Anonymous said...

There seems to be a contradiction between the variable time requirement for "mental rotation" of an image, and the claim that every possible perspective is represented. If every perspective were represented, why would the time vary?

1/27/2006 07:06:00 AM  
Blogger Chris Chatham said...

Hi - thanks for the question. I probably wasn't very clear, so my apologies if I go overboard this time with a really long answer:

According to the multiple views perspective, not every view must be stored, but instead just a few characteristic or diagnostic views. These are stored probabilistically based on experience, such that when a new view of a familiar object is encountered, it will likely be stored if it is sufficiently different from previously encountered views, or if it is encountered with enough regularity that it is more typical than some previous views.

Accordingly, in Tarr's experiments, he found that the time required to recognize novel views of familiar objects was consistent with the interpretation that subjects were mentally rotating the object only to the closest familiar view (familiar views were those views on which they had previously been tested). In other words, the amount of time to recognize a novel view of an object is proportional to the smallest rotational transformation required to match it to a previously trained view.
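That prediction can be sketched in a few lines (a toy model of my own; the base time and milliseconds-per-degree coefficients are made up, not Tarr's fitted values): recognition time grows linearly with the smallest rotation separating the test view from any previously trained view.

```python
def angular_distance(a, b):
    """Shortest rotation between two views on a 360-degree circle."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def predicted_rt(trained_views, test_view, base_ms=500.0, ms_per_degree=3.0):
    """Reaction time rises with the rotation to the NEAREST trained view."""
    nearest = min(angular_distance(test_view, v) for v in trained_views)
    return base_ms + ms_per_degree * nearest

trained = [0, 90]                  # views the subject practiced on
print(predicted_rt(trained, 0))    # 500.0  (a trained view: no rotation needed)
print(predicted_rt(trained, 45))   # 635.0  (45 degrees from either trained view)
print(predicted_rt(trained, 180))  # 770.0  (90 degrees from the nearest view)
```

A truly view-invariant representation would instead predict a flat line: the same reaction time regardless of how far the test view sits from the trained ones.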

This really puts the nail in the coffin of the view-invariant theory, which would claim that after sufficient experience with an object, subjects develop a truly view-invariant representation containing all the geometric features and dimensions of that object. On that account, no mental rotation would be necessary after sufficient experience with the object.

As you can see, multiple views theory and view-invariant theories are not dissociable at either extreme: as subjects are just encountering a new object for the first or second time, they could require time to recognize it either because they are trying to compare it to a known view, or because it takes time to develop their view-invariant representation. And conversely, once subjects have had extensive experience with an object from many angles, they will not require much time to recognize it, either because the closest known view is always very close (and hence requires only a very quick transformation), or because they have a view invariant representation that requires no transformations at all.

The only way to dissociate these theories is to slap subjects with a totally novel view of an object after their recognition performance has ceased to show variable time requirements (suggesting either that a view-invariant representation had been developed, or that they had simply represented enough views to make it appear as though they could immediately recognize the objects). When encountering an entirely novel view, response times were variable in a way consistent with continued mental rotation of the object, showing that view-invariant representations had not been acquired.

1/27/2006 08:11:00 AM  
Anonymous Anonymous said...

Aha! And in a dip into your archives I discover you do read and like philosophy of vision!

Gibson explains this seeming paradox best of anyone I have found. And no one can stand his arguments, so perhaps I am just as odd a thinker as he is.

Gibson's view is best understood through caricatures. Why is it so easy to perceive a cartoon of George Bush drawn as a chimp as Bush, and not as the chimp? Gibson argues that our brains are constructed not to recognize objects, but to recognize invariant relations. We break the world into units of what is useful, but can, with time or under duress, break things down further or perceive larger units. Thus, when I see an apple I see a very useful unit for throwing, eating, or giving to a teacher. How do I see it as an apple? I see the apple because when I move around the environment and interact with it, the visual, perceptual relationships that make up my perception of the apple remain invariant. So I can pick it out of the environment, in the same way I can pick out the table it is resting on. Plus, my handy sense of touch lets me differentiate the apple from the table (even if the table is painted patented apple green).

But when we get too close to something to perceive the relationships that make it useful (like when a child holds the apple in front of my eye), the relationships are no longer perceivable.

Long, I know. Enthusiastic, really.

9/07/2006 07:47:00 PM  
