Introduction to modeling motion perception

It is useful to consider how one should build a theory of motion perception in relation to David Marr's (1982) three levels of analysis: 1) the computational theory level, 2) the algorithmic level and 3) the implementation level. A complete theory of motion perception needs to address the questions that arise at all three levels. The main goal is to explain how the visual system can compute the speed of image motion from changes in the retinal image over time. Motion processing therefore requires an understanding of both temporal and spatial image processing.

The persistence of vision

Fireworks streaking into the sky appear to leave bright trails behind them. These trails result from the "persistence of vision", i.e. they are a consequence of the fact that neurones in the visual system summate information over time as well as over space. Thus we need to think of receptive fields as being spatio-temporal rather than simply spatial.

Temporal filters

We know from adaptation experiments, and other evidence, that there are many spatial frequency selective channels in the visual system, but how many temporal channels do we have? Hess and Snowden (1992) investigated temporal processing using a masking paradigm. A probe (a grating reversing contrast at a particular temporal frequency) was set to be just above detection threshold. They then measured the contrast of a mask (narrow band filtered spatial noise reversing in contrast at one of a range of temporal frequencies) at which the probe could just be detected. The rationale of the experiment is that, if the probe and mask are detected by separate temporal channels, the mask will not interfere with the detection of the probe. They found evidence for three temporal channels: a low-pass channel, a band-pass channel peaking at around 10 Hz and another band-pass channel peaking at around 18 Hz.

The Reichardt correlation model

Computational principle - correlation.

The Reichardt model (Reichardt, 1959), as shown above, has two spatially separated detectors. The output of one of the detectors is delayed and the two signals are then multiplied. The output is tuned to speed, so one would need many detectors tuned to different speeds to encode the true speed of the pattern. Details of the implementation of this stage are not usually specified. Since a single detector can respond to a static pattern, one takes the difference between pairs of detectors tuned to opposite directions of motion. There are problems with aliasing when the signal contains high spatial frequencies relative to the separation of the detectors. The response of the detector depends upon the phase and the contrast of the stimulus as well as its speed.
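The delay-and-multiply scheme and the opponent subtraction can be sketched in a few lines. This is a minimal illustration rather than a published implementation: the function names, the pure integer-frame delay and the sinusoidal test stimulus are all assumptions made here.

```python
import numpy as np

def drifting_grating(n_t, n_x, spatial_freq, temporal_freq):
    """Sinusoid indexed [t, x]; a positive temporal_freq drifts towards +x.
    Frequencies are in cycles per pixel and cycles per frame (assumed units)."""
    t = np.arange(n_t)[:, None]
    x = np.arange(n_x)[None, :]
    return np.sin(2 * np.pi * (spatial_freq * x - temporal_freq * t))

def reichardt_opponent(stimulus, x1, x2, delay):
    """Opponent correlator output for detectors at columns x1 < x2.
    delay is a whole number of frames (a simplification of a real delay filter)."""
    a = stimulus[:, x1]
    b = stimulus[:, x2]
    rightward = a[:-delay] * b[delay:]   # delayed left input times current right input
    leftward = b[:-delay] * a[delay:]    # mirror-image half-detector
    return float(np.mean(rightward - leftward))

right = reichardt_opponent(drifting_grating(400, 16, 1 / 16, 1 / 20), 4, 8, 5)
left = reichardt_opponent(drifting_grating(400, 16, 1 / 16, -1 / 20), 4, 8, 5)
static = reichardt_opponent(drifting_grating(400, 16, 1 / 16, 0.0), 4, 8, 5)
# Spatial period (16/3 pixels) shorter than twice the detector separation (4 pixels)
aliased = reichardt_opponent(drifting_grating(400, 16, 3 / 16, 1 / 20), 4, 8, 5)
```

Under these assumptions the opponent output is positive for rightward drift, negative for leftward drift and zero for the static grating, while the high spatial frequency grating drifting rightward yields a negative (aliased) output. Because the two inputs are multiplied, the output also scales with the square of stimulus contrast.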

The Energy model (Adelson and Bergen, 1985)

Computational principle - phase-invariant matched filters.

The basic idea behind the energy model is to build spatio-temporal filters which are oriented in space-time and therefore match the oriented space-time structure of moving spatial patterns. This is accomplished by adding together space-time separable filters. A separable filter is one in which the spatial profile remains the same shape over time but is scaled by the value of the temporal filter. For each direction two space-time filters are generated, one which is symmetric (bar-like) and one which is asymmetric (edge-like). The sum of the squares of the outputs of these filters is called the motion energy. The difference in the signal for the two directions is called the opponent energy. However, the response of this system will also depend upon contrast, and so the result must be divided through by the squared output of another filter which is tuned to static contrast. This gives a phase-independent measure which increases with speed but does not reliably give the correct speed value. The model can account for a number of motion phenomena.
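The construction of oriented filters from separable ones, and the resulting motion and opponent energy, can be sketched as follows. This is a hedged illustration only: Gabor profiles are assumed in both space and time (Adelson and Bergen used different temporal filters), and the filter size, frequencies and function names are hypothetical choices made here.

```python
import numpy as np

half = 20
x = np.arange(-half, half + 1)     # pixels
t = np.arange(-half, half + 1)     # frames
k = 2 * np.pi / 8                  # assumed spatial frequency (rad/pixel)
w = 2 * np.pi / 8                  # assumed temporal frequency (rad/frame)
sigma = 4.0                        # assumed Gaussian envelope width

def gauss(u):
    return np.exp(-u**2 / (2 * sigma**2))

# Separable building blocks: even and odd profiles in space and in time
s_even, s_odd = gauss(x) * np.cos(k * x), gauss(x) * np.sin(k * x)
t_even, t_odd = gauss(t) * np.cos(w * t), gauss(t) * np.sin(w * t)

# Sums and differences of separable products give oriented filters, indexed [t, x]
right_even = np.outer(t_even, s_even) + np.outer(t_odd, s_odd)   # envelope * cos(kx - wt)
right_odd = np.outer(t_even, s_odd) - np.outer(t_odd, s_even)    # envelope * sin(kx - wt)
left_even = np.outer(t_even, s_even) - np.outer(t_odd, s_odd)    # envelope * cos(kx + wt)
left_odd = np.outer(t_even, s_odd) + np.outer(t_odd, s_even)     # envelope * sin(kx + wt)

def motion_energy(stimulus):
    """Return (rightward energy, leftward energy, opponent energy)."""
    r = np.sum(stimulus * right_even)**2 + np.sum(stimulus * right_odd)**2
    l = np.sum(stimulus * left_even)**2 + np.sum(stimulus * left_odd)**2
    return r, l, r - l

def grating(direction, phase=0.0):
    """Drifting grating; direction = +1 moves towards +x, -1 towards -x."""
    X, T = np.meshgrid(x, t)       # both arrays have shape [t, x]
    return np.cos(k * X - direction * w * T + phase)
```

With these choices `motion_energy(grating(+1))` should give a positive opponent energy and `motion_energy(grating(-1))` a negative one, and the rightward energy should be essentially unchanged when the `phase` argument is varied. The contrast normalisation step described above is not included in this sketch.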

Fourier energy models

Oriented structure in space-time produces oriented structure in Fourier space, where the coordinate axes are now spatial and temporal frequency. In Fourier energy models (Adelson & Bergen, 1985; Watson & Ahumada, 1985; Heeger, 1987) the visual system is thought to construct spatio-temporally tuned filters, oriented in frequency space, which allow the measurement of power in the oriented Fourier transform of the stimulus. These oriented filters are produced by a linear combination of separable spatial and temporal filter profiles. The linear nature of combining the two separable profiles has the effect of producing filter kernels centred on the origin of frequency space. To extract a measure of the orientation, Fourier energy approaches are then required to combine the outputs of a population of filters with different orientations in frequency space. The filters to be combined must exhibit different frequency tuning curves because each filter covers an extended area of frequency space, so no single filter contains sufficient information to determine the orientation of the Fourier energy spectrum uniquely.
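The link between orientation in space-time and orientation in frequency space can be made explicit for a one-dimensional pattern translating rigidly at speed v. The following is a short worked sketch; the symbols f, F, k and ω are notation introduced here for illustration and are not taken from the cited papers.

```latex
I(x,t) = f(x - vt)
\quad\Longrightarrow\quad
\tilde{I}(k,\omega)
  = \iint f(x - vt)\, e^{-i(kx + \omega t)}\, dx\, dt
  = 2\pi\, F(k)\, \delta(\omega + vk)
```

Here F(k) is the spatial Fourier transform of f. All of the power lies on the line ω = -vk, which passes through the origin of frequency space and whose slope gives the speed.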

An example of a Fourier energy model

A typical approach is the Heeger motion model, which uses quadrature pairs of narrow-band spatiotemporal linear filters (Heeger, 1987). Quadrature refers to the mathematical property that the two filters form a pair in which a peak in the profile of one filter falls at a zero crossing of the other. Formally, the filters differ in their phase response profiles by exactly 90 degrees; for example, sine and cosine functions are in quadrature. Heeger uses sine and cosine phase Gabor functions, which are sine and cosine profiles multiplied by a Gaussian envelope.

It is a property of these quadrature pairs of filters that, when their outputs are squared and added, the motion energy signal produced is independent of the phase of the underlying stimulus. This energy measure is contained in a Gaussian neighbourhood of Fourier frequency space near the origin. To recover the velocity at a given point in space-time, a least squares fit of the filter response profiles is used to calculate the orientation that accounts for the greatest measure of motion energy in a plane through the frequency space origin.

One problem associated with this paradigm is that the contrast of the stimulus must be controlled for if the correct values of the oriented signal are to be extracted. There is a fundamental confound of stimulus contrast with the measurement of the orientation of the Fourier energy components. A high stimulus contrast gives rise to a large amplitude Fourier spectrum, and these high energy components in turn produce large filter responses. A high power component measured by a filter at a point displaced from its maximum tuning can therefore produce the same response as a low power component present at the position of maximum filter response. To disambiguate the two, an additional explicit contrast gain control must be implemented to normalise the energy signals.

These Fourier Energy Models (FEMs) have also been used to measure spatial orientation (Landy & Bergen, 1991) and stereo disparity (Ohzawa, DeAngelis & Freeman, 1990). However, the widely held belief in the validity of FEMs has had to be reassessed since the introduction of a class of visual stimuli that prove to be invisible to the Fourier energy approach (Chubb & Sperling, 1988).
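The contrast confound and the effect of a gain control can be illustrated in one dimension. This is a sketch under stated assumptions: the small filter bank, the Gabor parameters and the particular normalisation (dividing each energy by the summed energy of the bank) are illustrative choices made here, not a description of Heeger's implementation.

```python
import numpy as np

x = np.arange(-32, 33)                          # sample positions
SIGMA = 8.0                                     # assumed Gabor envelope width
BANK = np.array([1 / 16, 1 / 12, 1 / 8, 1 / 6]) # hypothetical preferred frequencies

def quadrature_energy(signal, freq):
    """Squared even response plus squared odd response of a Gabor pair."""
    envelope = np.exp(-x**2 / (2 * SIGMA**2))
    even = np.sum(signal * envelope * np.cos(2 * np.pi * freq * x))
    odd = np.sum(signal * envelope * np.sin(2 * np.pi * freq * x))
    return even**2 + odd**2

def bank_energies(signal):
    return np.array([quadrature_energy(signal, f) for f in BANK])

def normalised_energies(signal):
    """Divisive normalisation by the pooled energy (assumed form of gain control)."""
    e = bank_energies(signal)
    return e / e.sum()

def grating(contrast, freq, phase=0.0):
    return contrast * np.cos(2 * np.pi * freq * x + phase)

low = bank_energies(grating(0.2, 1 / 8))
high = bank_energies(grating(0.4, 1 / 8))
```

Doubling the contrast quadruples every raw energy, so the raw values alone cannot separate contrast from frequency content. After normalisation the pattern of responses across the bank is the same at both contrasts.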

First and second-order stimuli

In a variety of domains a distinction has been drawn between "first-order" stimuli, which can be measured by the standard FEMs, and "second-order" stimuli, which cannot. A first-order motion stimulus is defined by spatiotemporal variations in luminance giving rise to a Fourier spectrum oriented about the frequency space origin. Second-order stimuli, such as the motion of a contrast modulation over a texture, are characterised in the spatiotemporal Fourier domain by power spectra which are displaced away from the frequency space origin (Fleet & Langley, 1994). It is generally believed that second-order motion is invisible to standard FEMs, yet observers can readily perceive second-order motion (Chubb & Sperling, 1988; Johnston & Clifford, 1995b).
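The difference between the two classes of stimulus can be checked numerically. The sketch below compares a drifting luminance grating with a contrast envelope drifting over a static sinusoidal carrier; the carrier is a sinusoid rather than a noise texture for simplicity, and all sizes, frequencies and names are assumptions made for illustration.

```python
import numpy as np

n_t, n_x = 64, 64
t = np.arange(n_t)[:, None]
x = np.arange(n_x)[None, :]

env_fx, env_ft = 4, 8      # envelope frequency in cycles per window (hypothetical)
carrier_fx = 16            # static carrier frequency in cycles per window

phase = 2 * np.pi * (env_fx * x / n_x - env_ft * t / n_t)
first_order = np.cos(phase)                                  # drifting luminance grating
second_order = (1 + 0.8 * np.cos(phase)) * np.cos(2 * np.pi * carrier_fx * x / n_x)

def power_at(stimulus, ft_bin, fx_bin):
    """Fourier power of a [t, x] stimulus at the given temporal and spatial bins."""
    spectrum = np.fft.fft2(stimulus)
    return np.abs(spectrum[ft_bin % n_t, fx_bin % n_x])**2

# With this sign convention the drifting pattern has its power at (-env_ft, +env_fx)
p_first = power_at(first_order, -env_ft, env_fx)
p_second = power_at(second_order, -env_ft, env_fx)
p_sideband = power_at(second_order, -env_ft, carrier_fx + env_fx)
```

The first-order grating has its power at the frequency of the drifting pattern, on a line through the origin. The second-order stimulus has essentially none there; its motion-related power sits in sidebands around the carrier frequency, displaced from the origin.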

A non-Fourier channel

A second, non-Fourier motion channel, preceded by some form of rectification, has been proposed (Chubb & Sperling, 1988) to account for the perception of second-order motion. Rectification is a nonlinear operation that removes negative signal values. Two types are possible: full-wave rectification, in which negative values are made positive and retained, and half-wave rectification, in which negative values are discarded. This nonlinear stage brings the Fourier power spectrum of the motion back to the frequency space origin and in doing so removes the effect of the underlying texture, making the motion signal once more amenable to analysis by FEMs.

Evidence for Fourier and non-Fourier motion channels

The contrast reversing grating. The contrast reversing bar stimulus of Chubb and Sperling (1989) has been seen as evidence for the presence of a second, non-Fourier channel fronted by full-wave rectification. The apparent motion phenomenon they study is called reverse phi. If a picture is flashed twice in quick succession with a spatial displacement between the two frames, observers perceive motion in the direction of displacement; this is called phi motion. If, however, the contrast of the picture is reversed in sign between the frames, observers may perceive motion in the direction opposite to the displacement. The stimulus Chubb and Sperling examined combines both forward and reverse phi motions. The display comprises a spatial series of bars stepping forward one half of their spatial period, with the contrast reversing on successive time frames. The direction of motion in this apparent motion stimulus is ambiguous; however, subjects report a change in the direction of motion when the stimulus is observed at near and far viewing distances. Chubb and Sperling interpret this as strong evidence for two motion channels. In the near viewing condition the non-Fourier channel rectifies the stimulus, so all the blocks entering the subsequent FEM stage have the same contrast, giving rise to the perceived reversal of direction. At the far viewing distance the Fourier channel, with its much broader filters, dominates and the motion is seen in the direction of the stepping structure. It should be noted, however, that this reversal of direction may also be accounted for by a scale dependency in a single gradient motion model, as described by Johnston and Clifford (1995a).
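The effect of the rectification stage proposed for the non-Fourier channel can be illustrated in one dimension. In this sketch a contrast envelope modulates a sinusoidal carrier; the signal length, the frequencies and the helper name are assumptions chosen for illustration.

```python
import numpy as np

n = 256
x = np.arange(n)
envelope = 1 + 0.8 * np.cos(2 * np.pi * 4 * x / n)    # contrast modulation, 4 cycles
carrier = np.cos(2 * np.pi * 32 * x / n)              # underlying texture, 32 cycles
stimulus = envelope * carrier

def amplitude_at(signal, freq_bin):
    """Amplitude of the Fourier component at freq_bin (cycles per window)."""
    return 2 * np.abs(np.fft.rfft(signal)[freq_bin]) / n

full_wave = np.abs(stimulus)              # negative values made positive and retained
half_wave = np.maximum(stimulus, 0.0)     # negative values discarded

raw_amp = amplitude_at(stimulus, 4)
full_amp = amplitude_at(full_wave, 4)
half_amp = amplitude_at(half_wave, 4)
```

Before rectification there is essentially no Fourier component at the envelope frequency. Either form of rectification introduces one, with full-wave rectification giving the larger component in this example, so a subsequent Fourier-based stage could in principle respond to the modulation.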

Interleaved motion sequences

In interleaved motion sequences there is a frame-by-frame alternation of first and second-order motions. The inter-frame displacement is set at one half the spatial period, so that a consistent perception of motion requires the integration of information from both types of motion. Ledgeway and Smith (1995) have studied a stimulus interleaving a first-order motion, a grating with additive noise, and a second-order motion, a grating of the same spatial frequency modulating noise. The study reports that the stimuli did not elicit a consistent perception of motion. These data are taken as strong evidence for two motion systems carrying the separate motion information.

A third attentional channel

Lu and Sperling (1995) have suggested that there exists a third motion channel in addition to the luminance-defined Fourier and texture-defined non-Fourier channels. This channel, they suggest, operates through selective attention and salience maps. Using an interleaving paradigm similar to that of Ledgeway and Smith, two motion sequences are presented to the observer on alternate frames. The two sequences are again offset by an inter-frame displacement of one half the spatial period, so a coherent percept of direction of motion requires integration of information from the two sequences. The sequences are defined by stereoscopic depth and texture. The grating in depth has a natural salient feature in the near peaks; the texture comprises alternating coarse and fine grating features. When instructed to attend to the coarse textures, subjects can perceive a consistent direction of motion, either immediately or after considerable practice (up to four blocks of 100 trials). The motion, they conjecture, is brought about by the movement of feature salience from frame to frame, rather than of a feature itself. The salient feature location is entered in the subject's salience map, and standard Fourier motion analysis of this map structure allows the extraction of the direction of motion. By asking subjects to attend to a different texture feature, they can achieve a reversal of direction. The use of stereo depth, a binocular stimulus, precludes, they argue, the primarily monocular first and second-order mechanisms.

A single motion channel

Are multiple motion pathways required to account for the analysis of second-order motion, or can a single pathway operating on some other principle suffice? Multiple-channel models must address how the separate motion signals are combined into a coherent motion percept. One such model is that of Wilson, Ferrera and Yo (1992), in which the integration is performed using an inhibitory feedback neural network. A single-channel model which accounts for the data would be more parsimonious and would remove the need for explicit integration mechanisms. A single-pathway, gradient-based approach to motion perception has been put forward by Johnston, McOwan and Buxton (1992). This method places the emphasis on local measurements of the gradients of the space-time brightness surface rather than on the decomposition of the signal into its Fourier components.


The spatio-temporal gradient model (Harris, 1986; Johnston et al, 1992)

Computational principle - speed = ratio of the temporal derivative of image brightness to the spatial derivative of image brightness.

This is an approach that was initially developed in computer vision. It is based on computing the ratio of outputs of separable spatio-temporal filters. However the basic gradient model is flawed (as is the energy model) because the ratio can become infinite when the spatial derivative filter in the denominator becomes zero. This will occur at peaks and troughs in the image. To get round this one can examine how the first and higher spatial derivatives are changing with respect to space and time in addition to the brightness. These give a well conditioned measure of the speed at all points in the image.

In simple gradient models the speed at a point in the space-time image is calculated as the value of the temporal derivative divided by the spatial derivative. The motion constraint equation can be formed by using the Taylor series expansion of the space-time image I(x,t). Disregarding terms of greater than first order, and demanding the conservation of luminance, i.e. I(x+dx, t+dt) = I(x,t), we find

$$ v = \frac{dx}{dt} = -\frac{\partial I / \partial t}{\partial I / \partial x} \qquad (1) $$

This ratio, equation (1), for recovering velocity is ill-conditioned: the ratio may change by a large amount if the values of the derivatives change by only a small amount. The function does not behave well mathematically because the spatial derivative in the denominator may become zero at some points in the space-time image.

It has been found that simple cells in the visual cortex can be modelled as differential operators of increasing order (Young, 1993; Koenderink & van Doorn, 1987). Using this framework it has been shown that the ill-conditioned nature of the speed calculation may be resolved by forming a series of spatial and temporal derivatives of increasing order and combining them in a form where the denominator is always non-zero (Johnston et al, 1992a; Johnston and Clifford, 1995a). All the information required to condition the calculation is contained in measures of the brightness surface of the space-time image. The model is able to recover the speed of first-order patterns, detect the second-order modulation of band-limited noise, and give a unified account of a number of apparent motion illusions (Johnston & Clifford, 1995a) which have previously been taken as evidence of separate motion pathways.
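The ill-conditioning of the simple ratio, and the benefit of bringing in a higher derivative order, can be illustrated with a translating sinusoid. This is a deliberately simplified sketch: the two-term combination below is an assumption chosen for illustration and is not the published multi-order model, and the stimulus, the size of the perturbation and the function names are likewise hypothetical.

```python
import numpy as np

def sinusoid_derivatives(theta, k=1.0, v=2.0):
    """Analytic derivatives of I(x, t) = sin(k (x - v t)) at phase theta = k (x - v t)."""
    Ix = k * np.cos(theta)
    It = -v * k * np.cos(theta)
    Ixx = -k**2 * np.sin(theta)
    Ixt = v * k**2 * np.sin(theta)
    return Ix, It, Ixx, Ixt

def simple_ratio(Ix, It):
    """Speed as minus the temporal derivative over the spatial derivative."""
    return -It / Ix

def two_order_estimate(Ix, It, Ixx, Ixt):
    """Illustrative combination of first and second-order terms (assumed form).
    For a sinusoid, Ix and Ixx are never both zero, so the denominator stays positive."""
    return -(Ix * It + Ixx * Ixt) / (Ix**2 + Ixx**2)

theta_near_peak = np.pi / 2 - 1e-3       # just beside a luminance peak, where Ix is tiny
Ix, It, Ixx, Ixt = sinusoid_derivatives(theta_near_peak)
error = 1e-3                             # small error added to the spatial derivative

simple_clean = simple_ratio(Ix, It)
simple_noisy = simple_ratio(Ix + error, It)
robust_noisy = two_order_estimate(Ix + error, It, Ixx, Ixt)
```

With the true speed set to 2, the unperturbed ratio recovers it, but a perturbation of 0.001 in the spatial derivative roughly halves the simple estimate, whereas the two-term estimate barely changes.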

Evidence for a single gradient based motion channel

The three temporal filters are related by differentiation (Johnston and Clifford, 1995).
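One way to see how differentiation could relate a low-pass temporal filter to two band-pass filters is to differentiate a smooth impulse response numerically and compare the spectra. The sketch below assumes a Gaussian in log time as the low-pass kernel; the kernel shape, its parameters and the helper names are illustrative assumptions and are not values taken from the cited paper.

```python
import numpy as np

t = np.linspace(0.001, 1.0, 1000)     # time in seconds
dt = t[1] - t[0]

def log_gaussian(time, peak=0.1, width=0.4):
    """Hypothetical low-pass temporal impulse response: a Gaussian in log time."""
    return np.exp(-(np.log(time / peak) / width)**2)

k0 = log_gaussian(t)                  # zeroth-order (low-pass) kernel
k1 = np.gradient(k0, dt)              # first temporal derivative
k2 = np.gradient(k1, dt)              # second temporal derivative

def peak_frequency(kernel):
    """Frequency in Hz at which the kernel's amplitude spectrum is largest."""
    spectrum = np.abs(np.fft.rfft(kernel))
    freqs = np.fft.rfftfreq(kernel.size, d=dt)
    return freqs[np.argmax(spectrum)]

p0, p1, p2 = peak_frequency(k0), peak_frequency(k1), peak_frequency(k2)
```

The undifferentiated kernel is low-pass, with its spectral peak at zero frequency. Each differentiation weights the spectrum by frequency, removing the response at zero frequency and pushing the peak upwards, which yields band-pass filters. The actual peak frequencies depend on the assumed kernel parameters.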

The contrast modulated grating. In a detailed psychophysical analysis of the second-order motion produced by the passage of a contrast modulation over an underlying grating, it has been shown that subjects perceive a slowing in the low contrast regions (Johnston & Clifford, 1995b). This perceived slowing is a function of the spatial frequency of the underlying grating. Theories of motion perception which require a non-Fourier channel have difficulty in accounting for these findings. Rectification would theoretically be able to separate the modulation signal from the underlying carrier signal, providing a veridical speed estimate and, since the carrier becomes redundant, this approach would also predict no effect of the spatial frequency of the underlying grating. The observed results can, however, be modelled using the gradient approach of Johnston et al., and a closer examination of the stimulus brightness surface reveals the necessary space-time oriented structures to account for the model's predictions (Johnston & Clifford, 1995b).

Comparison of model components and neural structures. A detailed analysis of the component stages of this model (Johnston, McOwan & Buxton, 1992a; Johnston, McOwan & Buxton, 1992b) shows that it produces processing elements with the properties attributed to simple (phase dependent) and complex (phase independent) cortical cells. Directionally selective cells are the fundamental nonlinear components required in the process of extracting motion. Their nonlinear properties may be characterised by recording the cell's response to the presentation of pairs of bars with various spatial and temporal offsets. Analytic techniques allow the removal of the effect of any linear operations and produce the cell's nonlinear spatiotemporal interaction field. Emerson, Bergen & Adelson (1992) reported that the interaction field found for cells in cat cortex supports a Fourier energy model. Simulations of the two-bar interaction stimuli with the gradient model also give rise to spatiotemporally oriented interaction fields of the type found physiologically (Johnston, McOwan & Benton, 1995).