**Introduction to modeling motion perception**

It is useful to consider how one
should attempt to build a theory of motion perception in relation to David
Marr's (1982) three levels of analysis: 1) the computational theory level, 2)
the algorithmic level and 3) the implementation level. A complete theory of
motion perception needs to address the questions that arise at all
three levels. The main goal is to explain how the visual system can compute the
speed of image motion from the changes in the retinal image over time.
Motion processing therefore requires an understanding of temporal image
processing as well as spatial processing.

*The persistence of vision*

Fireworks streaking into the sky
appear to leave bright trails behind them. These trails result from the
"persistence of vision", i.e. they are a consequence of the fact that
neurones in the visual system summate information over time as well as space.
Thus we need to think of receptive fields as being spatio-temporal rather than
simply spatial.

*Temporal filters*

We know from adaptation experiments, and other evidence, that there are many spatial frequency selective channels in the visual system, but how many temporal channels do we have? Hess and Snowden (1992) investigated temporal processing using a masking paradigm. A probe (a grating reversing contrast at a particular temporal frequency) was set to be just above detection threshold. They then measured the contrast of a mask (narrow-band filtered spatial noise reversing in contrast at one of a range of temporal frequencies) at which the probe could just be detected. The rationale of the experiment is that, if the probe and mask are detected by separate temporal channels, the mask will not interfere with the detection of the probe. They found evidence for three temporal channels: a low-pass channel, a band-pass channel peaking at around 10 Hz and another peaking at around 18 Hz.

**The Reichardt correlation model**

*Computational principle -
correlation.*

The Reichardt model (Reichardt, 1959),
as shown above, has two spatially separate detectors. The output of one of the
detectors is delayed and then the two signals are multiplied. The output is
tuned to speed, and therefore one would need many detectors tuned to different
speeds to encode the true speed of the pattern. Details of the implementation
of this stage are not usually specified. Since a single detector can respond to
a static pattern, one takes the difference of pairs of detectors tuned to
opposite directions of motion. There are problems with aliasing when the
signal contains high spatial frequencies relative to the separation of the
detectors. The response of the detector depends upon the phase of the stimulus
and the contrast as well as the speed.
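The delay-and-multiply scheme can be sketched numerically. The following is a minimal illustration, not Reichardt's original formulation: the function name `reichardt_halves`, the sinusoidal stimulus, the pure sample delay and all parameter values are assumptions chosen for the example.

```python
import numpy as np


def reichardt_halves(velocity, contrast=1.0, spatial_freq=1.0, separation=0.1,
                     delay=0.05, phase=0.0, dt=0.001, duration=4.0):
    """Time-averaged outputs of the two mirror-image delay-and-multiply units.

    The stimulus is a sinusoidal grating drifting at `velocity`; positive
    velocity moves the pattern from detector 1 towards detector 2.
    All parameter values are illustrative assumptions.
    """
    t = np.arange(0.0, duration, dt)
    k = 2.0 * np.pi * spatial_freq

    def luminance_at(position):
        return contrast * np.sin(k * (position - velocity * t) + phase)

    s1 = luminance_at(0.0)           # signal at detector 1
    s2 = luminance_at(separation)    # signal at detector 2
    n = int(round(delay / dt))       # delay in samples (must be at least 1)

    # Pair each delayed sample with the undelayed sample n steps later.
    s1_delayed, s2_delayed = s1[:-n], s2[:-n]
    s1_now, s2_now = s1[n:], s2[n:]

    forward = np.mean(s1_delayed * s2_now)    # unit preferring 1 -> 2 motion
    backward = np.mean(s2_delayed * s1_now)   # unit preferring 2 -> 1 motion
    return forward, backward


def opponent(*args, **kwargs):
    """Difference of the two mirror-image units."""
    forward, backward = reichardt_halves(*args, **kwargs)
    return forward - backward


print("moving 1 -> 2:      ", opponent(+1.0))
print("moving 2 -> 1:      ", opponent(-1.0))
print("static, single unit:", reichardt_halves(0.0, phase=1.0)[0])
print("static, opponent:   ", opponent(0.0, phase=1.0))
print("doubled contrast:   ", opponent(+1.0, contrast=2.0))
print("wide separation:    ", opponent(+1.0, separation=0.6))
```

In this sketch a single unit gives a non-zero output to a static grating, while the opponent difference cancels it. Doubling the contrast quadruples the output, and a detector separation of more than half the spatial period reverses the sign of the response, which is the aliasing problem noted in the text.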

**The Energy model (Adelson and Bergen, 1985)**

*Computational principle -
phase-invariant matched filters*.

The basic idea behind the energy model is to build spatio-temporal filters which are oriented in space-time and therefore match the oriented space-time structure of moving spatial patterns. This is accomplished by adding together space-time separable filters. A separable filter is one in which the spatial profile remains the same shape over time but is scaled by the value of the temporal filter. For each direction two space-time filters are generated, one which is symmetric (bar-like) and one which is asymmetric (edge-like). The sum of the squared outputs of these two filters is called the motion energy. The difference in the motion energy for the two directions is called the opponent energy. However, the response of this system also depends upon contrast, and so the result must be divided through by the squared output of another filter which is tuned to static contrast. This gives a phase-independent measure which increases with speed but does not reliably give the correct speed value. The model can account for a number of motion phenomena.
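The construction of oriented filters from separable ones can be sketched as follows. This is a simplified illustration rather than the filters of Adelson and Bergen (1985): it assumes Gabor profiles in both space and time (so the temporal filters are not causal), illustrative tuning values, and it omits the final contrast normalisation step.

```python
import numpy as np

# Space (deg) and time (s) axes centred on the filter; values are illustrative.
x = np.linspace(-2.0, 2.0, 81)
t = np.linspace(-0.5, 0.5, 81)
X, T = np.meshgrid(x, t, indexing="ij")

K = 2.0 * np.pi * 1.0   # preferred spatial frequency: 1 cycle/deg (assumed)
W = 2.0 * np.pi * 4.0   # preferred temporal frequency: 4 Hz (assumed)

env_x = np.exp(-x**2 / (2.0 * 0.5**2))
env_t = np.exp(-t**2 / (2.0 * 0.12**2))
space_even, space_odd = env_x * np.cos(K * x), env_x * np.sin(K * x)
time_even, time_odd = env_t * np.cos(W * t), env_t * np.sin(W * t)


def separable(spatial, temporal):
    """Space-time separable filter: a spatial profile scaled by a temporal one."""
    return np.outer(spatial, temporal)


# Sums and differences of separable filters are oriented in space-time,
# e.g. cos(Kx)cos(Wt) + sin(Kx)sin(Wt) = cos(Kx - Wt).
right_even = separable(space_even, time_even) + separable(space_odd, time_odd)
right_odd = separable(space_odd, time_even) - separable(space_even, time_odd)
left_even = separable(space_even, time_even) - separable(space_odd, time_odd)
left_odd = separable(space_odd, time_even) + separable(space_even, time_odd)


def motion_energy(stimulus):
    """Return (rightward, leftward) energy at the centre of the stimulus."""
    right = np.sum(right_even * stimulus) ** 2 + np.sum(right_odd * stimulus) ** 2
    left = np.sum(left_even * stimulus) ** 2 + np.sum(left_odd * stimulus) ** 2
    return right, left


def drifting_grating(direction, phase=0.0, contrast=1.0):
    """Grating drifting rightward (direction=+1) or leftward (direction=-1)."""
    return contrast * np.cos(K * X - direction * W * T + phase)


for direction in (+1, -1):
    right, left = motion_energy(drifting_grating(direction))
    print("direction", direction, "opponent energy", right - left)
```

In this sketch the opponent energy changes sign with the direction of drift and is essentially unchanged when the `phase` of the grating is varied, but it scales with the square of `contrast`, which is why the text calls for division by a static contrast measure.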

*Fourier energy models*

Oriented structure in space-time
produces oriented structure in Fourier space, where the coordinate axes are now
spatial and temporal frequency. In Fourier energy models (Adelson & Bergen,
1985; Watson & Ahumada, 1985; Heeger, 1987) the visual system is thought to
construct spatio-temporally tuned filters oriented in frequency space which
allow the measurement of power in the oriented Fourier transform of the
stimulus. These oriented filters are produced by a linear combination of
separable spatial and temporal filter profiles. The linear nature of combining
the two separable profiles has the effect of producing filter kernels centred
on the origin of frequency space. To extract a measure of the orientation,
Fourier energy approaches must then combine the outputs of a
population of differently oriented frequency-space filters. The filters to be
combined must have different frequency tuning curves because each filter
covers an extended area of frequency space, so no single filter
contains sufficient information to determine uniquely the orientation of the
Fourier energy spectrum.

*An example of a Fourier energy model*

A typical approach is the Heeger
motion model, which uses quadrature pairs of narrow-band spatiotemporal linear
filters (Heeger, 1987). Quadrature refers to the mathematical property that the
two filters form a pair in which a maximum response in one filter falls at the
same point as a zero crossing in the other. Formally, the filters differ in their
phase response profiles by exactly 90 degrees; for example, sine and cosine
functions are in quadrature. Heeger uses sine and cosine phase Gabor functions,
which are sine and cosine profiles multiplied by a Gaussian envelope.
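A quadrature pair of Gabor functions can be checked numerically. The sketch below is a one-dimensional spatial illustration only, not Heeger's spatiotemporal filter bank; the envelope width, preferred frequency and the helper name `pair_response` are assumptions made for the example.

```python
import numpy as np

SIGMA = 1.0   # width of the Gaussian envelope (deg), assumed
F0 = 1.0      # preferred spatial frequency (cycles/deg), assumed

x = np.linspace(-4.0, 4.0, 801)
dx = x[1] - x[0]
envelope = np.exp(-x**2 / (2.0 * SIGMA**2))
gabor_cos = envelope * np.cos(2.0 * np.pi * F0 * x)   # even (cosine phase)
gabor_sin = envelope * np.sin(2.0 * np.pi * F0 * x)   # odd (sine phase)


def pair_response(freq, phase, contrast=1.0):
    """Outputs of the two filters, and their summed squares, for a grating."""
    grating = contrast * np.cos(2.0 * np.pi * freq * x + phase)
    even = np.sum(gabor_cos * grating) * dx
    odd = np.sum(gabor_sin * grating) * dx
    return even, odd, even**2 + odd**2


# Vary the stimulus phase: each filter output changes, the energy does not.
phases = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
even_out = np.array([pair_response(F0, p)[0] for p in phases])
energy_out = np.array([pair_response(F0, p)[2] for p in phases])
print("even filter output:", np.round(even_out, 3))
print("energy:            ", np.round(energy_out, 3))

# Energy also depends on contrast: find the contrast at which a grating away
# from the preferred frequency produces the same energy as one at the peak.
energy_on_peak = pair_response(F0, 0.0)[2]
energy_off_peak_unit = pair_response(F0 + 0.2, 0.0)[2]
matching_contrast = np.sqrt(energy_on_peak / energy_off_peak_unit)
energy_off_peak = pair_response(F0 + 0.2, 0.0, contrast=matching_contrast)[2]
print("matching contrast:", matching_contrast)
```

The second part of the sketch shows why a single energy value is ambiguous: a higher-contrast grating away from the preferred frequency can produce exactly the same energy as a lower-contrast grating at the preferred frequency.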

It is a property of these quadrature
pairs of filters that, when their outputs are squared and added, the motion
energy signal produced is independent of the phase of the underlying stimulus.
This energy measure is contained in a Gaussian neighbourhood of Fourier
frequency space near the origin. To recover the velocity at a given point in
space-time, a least squares fit of the filter response profiles is used to
calculate the orientation that accounts for the greatest measure of motion
energy in a plane through the frequency space origin. One problem associated
with this paradigm is that the contrast of the stimulus must be controlled for
if the correct values of the oriented signal are to be extracted. There is a
fundamental confound of stimulus contrast with measuring the Fourier energy
component orientation. A high stimulus contrast gives rise to a large amplitude
Fourier spectrum. These high energy components in turn produce large filter
responses. Therefore, to disambiguate between a high power component being
measured by a filter at a point displaced from its maximum tuning and a low
power component present at the position of maximum filter response, an
additional explicit contrast gain control must be implemented to normalise the
energy signals.

These Fourier Energy Models (FEMs) have also been used to measure spatial
orientation (Landy & Bergen, 1991) and stereo disparity (Ohzawa, DeAngelis
& Freeman, 1990). However, the widely held belief in the validity of FEMs
has had to be reassessed since the introduction of a class of visual stimuli
that prove to be invisible to the Fourier energy approach (Chubb &
Sperling, 1988).

*First and second-order stimuli*

In a variety of domains a
distinction has been drawn between "first-order" stimuli, which can
be measured by the standard FEMs, and "second-order" stimuli, which
cannot. A first-order motion stimulus is defined by spatiotemporal variations
in luminance giving rise to a Fourier spectrum oriented about the frequency
space origin. Second-order stimuli, such as the motion of a contrast modulation
over a texture, have the property of being characterised in the spatiotemporal
Fourier domain by power spectra which are displaced away from the frequency
space origin (Fleet & Langley, 1994). It is generally believed that
second-order motion is invisible to standard FEMs, but observers can readily
perceive second-order motions (Chubb & Sperling, 1988; Johnston &
Clifford, 1995b).
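The difference between the two stimulus classes can be seen in a one-dimensional spatial sketch. It assumes a binary noise carrier, a sinusoidal modulation and illustrative values throughout, and it also applies full-wave rectification, the nonlinearity taken up in the following section, to show its effect on the spectrum.

```python
import numpy as np

N = 4096      # number of samples in the window (assumed)
F_MOD = 8     # modulation frequency in cycles per window (assumed)
DEPTH = 0.8   # modulation depth (assumed)

rng = np.random.default_rng(0)
x = np.arange(N) / N
carrier = rng.choice([-1.0, 1.0], size=N)              # binary noise texture
modulation = DEPTH * np.cos(2.0 * np.pi * F_MOD * x)

first_order = modulation + carrier             # luminance grating plus noise
second_order = (1.0 + modulation) * carrier    # grating modulating noise contrast


def amplitude_at(signal, cycles):
    """Normalised Fourier amplitude at a frequency in cycles per window."""
    return np.abs(np.fft.rfft(signal)[cycles]) / len(signal)


amp_first = amplitude_at(first_order, F_MOD)
amp_second = amplitude_at(second_order, F_MOD)
amp_rectified = amplitude_at(np.abs(second_order), F_MOD)

print("first-order, at modulation frequency:       ", amp_first)
print("second-order, at modulation frequency:      ", amp_second)
print("rectified second-order, at same frequency:  ", amp_rectified)
```

With a binary carrier the rectified second-order signal is exactly the modulation envelope, so the amplitude at the modulation frequency is restored. For other carriers the recovery would only be approximate.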

*A non-Fourier channel*

A second, non-Fourier, motion
channel preceded by some form of rectification has been proposed (Chubb &
Sperling, 1988) to account for the perception of second-order motion.
Rectification is the mathematical process by which negative signal values are
removed. There are two types of rectification: full wave, where negative values
are made positive and retained, and half wave, where negative signal values are
discarded. This nonlinear rectification stage brings the Fourier power spectrum
of the motion back to the frequency space origin and in doing so removes the
effect of the underlying texture, making the motion signal once more amenable
to analysis by FEMs.

**Evidence for Fourier and non-Fourier motion channels**

*The contrast reversing grating*

The contrast reversing bar stimulus of Chubb & Sperling (1989)
has been seen as evidence for the presence of a second non-Fourier channel
fronted by a full wave rectification. The apparent motion phenomenon they study
is called reverse phi. If a picture is flashed twice in quick succession with a
spatial displacement between the two frames, observers perceive motion in the
direction of displacement; this is called phi motion. If, however, the contrast
of the picture is reversed in sign, then observers may perceive motion in the
reverse direction, the direction opposite to the displacement. The stimulus
Chubb and Sperling examined combines both forward and reverse phi motions. The
display comprises a spatial series of bars stepping forward one half of their
spatial period with the contrast reversing on successive time frames. The
direction of motion in this apparent motion stimulus is ambiguous; however,
subjects report a change in the direction of motion of this stimulus when
it is observed at near and far viewing distances. Chubb and Sperling have chosen to
interpret this as strong evidence for two motion channels. In the near viewing
condition the non-Fourier channel rectifies the stimulus, so all the blocks
entering the subsequent FEM stage have the same contrast, giving rise to the
perceived reversal of direction. At the far viewing distance the Fourier
channel, with its much broader filters, dominates and the motion is seen in
the direction of the stepping structure. However, it should be noted that
this reversal of direction may also be accounted for by a scale
dependency in a single gradient motion model, as described in Johnston &
Clifford (1995a).
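The basic phi and reverse phi effects described above can be reproduced with a simple two-frame correlator. The sketch below is an illustration under stated assumptions: it uses a random one-dimensional pattern rather than the stepping bar display of Chubb and Sperling, and the opponent correlator `opponent_correlation` is a hypothetical helper in the spirit of the Reichardt model, not a model taken from their paper.

```python
import numpy as np

rng = np.random.default_rng(1)
SHIFT = 3   # displacement between the two frames, in pixels (assumed)


def opponent_correlation(frame1, frame2, shift):
    """Positive for rightward motion by `shift` pixels, negative for leftward."""
    # Compare frame 1 at x with frame 2 at x + shift (rightward unit) ...
    rightward = np.mean(frame1 * np.roll(frame2, -shift))
    # ... and frame 1 at x + shift with frame 2 at x (leftward unit).
    leftward = np.mean(np.roll(frame1, -shift) * frame2)
    return rightward - leftward


frame1 = rng.standard_normal(512)
displaced = np.roll(frame1, SHIFT)   # same pattern moved SHIFT pixels rightward

phi = opponent_correlation(frame1, displaced, SHIFT)
reverse_phi = opponent_correlation(frame1, -displaced, SHIFT)   # contrast reversed

print("phi (same contrast):             ", phi)
print("reverse phi (contrast reversed): ", reverse_phi)
```

Because the correlator multiplies the two frames, reversing the sign of the second frame reverses the sign of the output, so the signalled direction is opposite to the physical displacement.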

*Interleaved motion sequences*

In interleaved motion sequences
there is a frame-by-frame alternation of first- and second-order motions. The
inter-frame displacement is set at one half the spatial period so that a
consistent perception of motion requires the integration of information from
both types of motion. Ledgeway and Smith (1994) have studied a stimulus
interleaving a first-order motion, a grating with additive noise, and a
second-order motion, a grating of the same spatial frequency modulating noise.
The study reports that the stimuli did not elicit a consistent perception of
motion. These data are taken as strong evidence for two motion systems carrying
the separate motion information.

*A third attentional channel*

Lu and Sperling (1995) have
suggested that there exists a third motion channel in addition to the
luminance-defined Fourier and texture-defined non-Fourier channels. This
channel, they suggest, operates through selective attention and saliency maps.
Using an interleaving paradigm similar to that of Ledgeway and Smith, two
motion sequences are presented to the observer on alternate frames. The two
sequences are again offset by an inter-frame displacement of one half the
spatial period, so a coherent percept of direction of motion requires
integration of information from the two sequences. The sequences are defined by
stereoscopic depth and texture. The grating in depth has a natural salient
feature in the near peaks; the texture comprises alternate coarse and fine
grating features. When instructed to attend to the coarse textures, subjects can
perceive a consistent direction of motion, either immediately or after
considerable practice of up to four blocks of 100 trials. The motion, they
conjecture, is brought about by the movement of feature salience from frame to
frame, rather than of a feature itself. The salient feature location is entered
in the subject's salience map, and standard Fourier motion analysis of this map
structure allows the extraction of the direction of motion. By asking subjects
to attend to a different texture feature, they can achieve a reversal of
direction. The use of stereo depth, a binocular stimulus, precludes, they
argue, the primarily monocular first- and second-order mechanisms.

*A single motion channel*

Are multiple motion pathways
required to account for the analysis of second-order motion, or can a single
pathway operating on some other principle suffice? Multiple channel models must
address how the separate motion signals are combined into a coherent motion
percept. One such model is that of Wilson, Ferrera and Yo (1992), where the
integration is performed using an inhibitory feedback neural network. A single
channel model which accounts for the data would be more parsimonious and would
remove the need for explicit integration mechanisms. A single pathway gradient
based approach to motion perception has been put forward by Johnston, McOwan
and Buxton (1992). This method places the emphasis on local measurements of the
gradients of the space-time brightness surface rather than the decomposition of
the signal into its Fourier components.

**The spatio-temporal gradient model (Harris, 1986; Johnston et al, 1992)**

*Computational principle - speed =
ratio of the temporal derivative of image brightness to the spatial derivative
of image brightness.*

This is an approach that was initially developed in computer vision. It is based on computing the ratio of outputs of separable spatio-temporal filters. However, the basic gradient model is flawed (as is the energy model) because the ratio can become infinite when the output of the spatial derivative filter in the denominator becomes zero. This will occur at peaks and troughs in the image. To get round this, one can examine how the first and higher spatial derivatives are changing with respect to space and time in addition to the brightness. These give a well-conditioned measure of the speed at all points in the image.

In simple gradient models the speed
at a point in the space-time image is calculated as the value of the temporal
derivative divided by the spatial derivative. The motion constraint equation
can be formed by using the Taylor series expansion of the space-time image
I(x,t). We may disregard terms of greater than 1st order, and by demanding the
conservation of luminance, i.e. I(x+dx, t+dt) = I(x,t), we find

$$ v = \frac{dx}{dt} = -\frac{\partial I / \partial t}{\partial I / \partial x} \qquad (1) $$

This ratio, equation (1), for recovering velocity is ill-conditioned: the ratio
may change by a large amount if the values of the derivatives are changed by
only a small amount. The function does not behave well mathematically, as the
spatial derivative in the denominator may become zero at some points in the
space-time image. It has been found that simple cells in the visual cortex can
be modeled as differential operators of increasing order (Young & Lesperance,
1993; Koenderink & van Doorn, 1987). Using this framework it has been shown
that the ill-conditioned nature of the speed calculation may be resolved by
forming a series of spatial and temporal derivatives of increasing order and
combining them in a form where the denominator is always non-zero (Johnston et
al, 1992a; Johnston and Clifford, 1995a). All the information required to
condition the calculation is contained in measures of the brightness surface of
the space-time image. The model is able to recover the speed of first-order
patterns, detect the second-order modulation of band limited noise, and give a
unified account of a number of apparent motion illusions (Johnston & Clifford,
1995a) which have previously been taken as evidence of separate motion
pathways.
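The ill-conditioning of the simple ratio, and the benefit of bringing in higher derivatives, can be illustrated with a translating sinusoid. This is a deliberately reduced sketch and not the multi-channel gradient model of Johnston et al.: it assumes analytic derivatives perturbed by a small amount of measurement noise, uses only first and second spatial derivatives, and sets the spatial frequency to 1 rad/deg so that derivatives of different order need no relative weighting.

```python
import numpy as np

V_TRUE = 2.0      # true speed (deg/s), assumed
K = 1.0           # spatial frequency (rad/deg), assumed
NOISE_SD = 0.01   # standard deviation of the measurement noise, assumed

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 1001)   # one spatial period, sampled at t = 0
phase = K * x


def noisy(values):
    """Add independent measurement noise to a derivative signal."""
    return values + rng.normal(0.0, NOISE_SD, size=values.shape)


# Derivatives of I(x, t) = sin(K (x - V t)) evaluated at t = 0.
I_x = noisy(K * np.cos(phase))
I_t = noisy(-V_TRUE * K * np.cos(phase))
I_xx = noisy(-K**2 * np.sin(phase))
I_xt = noisy(V_TRUE * K**2 * np.sin(phase))

# Simple gradient model: the ratio blows up where I_x is near zero,
# i.e. at the peaks and troughs of the grating.
v_simple = -I_t / I_x

# Combining first and second derivatives keeps the denominator away from
# zero: I_x**2 + I_xx**2 is close to 1 everywhere for this stimulus.
v_conditioned = -(I_x * I_t + I_xx * I_xt) / (I_x**2 + I_xx**2)

max_err_simple = np.max(np.abs(v_simple - V_TRUE))
max_err_conditioned = np.max(np.abs(v_conditioned - V_TRUE))
print("largest error, simple ratio:     ", max_err_simple)
print("largest error, conditioned ratio:", max_err_conditioned)
```

The simple ratio is accurate over most of the grating but fails near the peaks and troughs, where the first spatial derivative vanishes. The second derivative is largest in exactly those regions, which is why including it keeps the estimate bounded.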

**Evidence for a single gradient based motion channel**

The three temporal filters are related by differentiation
(Johnston and Clifford, 1995).

*The contrast modulated grating*

In a detailed psychophysical analysis of the second-order motion produced by the
passage of a contrast modulation over an underlying grating it has been shown
that subjects perceived a slowing in the low contrast regions (Johnston &
Clifford, 1995b). This perceived slowing is a function of the spatial frequency
of the underlying grating. Theories of motion perception which require a
non-Fourier channel have difficulty in accounting for these findings.
Rectification would theoretically be able to separate the modulation signal
from the underlying carrier signal, providing a veridical speed estimate and,
since the carrier becomes redundant, this approach would also predict no effect
of the spatial frequency of the underlying grating. The observed results can,
however, be modeled using the gradient approach of Johnston et al., and a closer
examination of the stimulus brightness surface reveals the necessary space-time
oriented structures to account for the model's predictions (Johnston &
Clifford, 1995b).

*Comparison of model components and neural structures*

A detailed analysis of the component stages of this model
(Johnston, McOwan & Buxton, 1992a; Johnston, McOwan & Buxton, 1992b)
shows that it produces processing elements with the properties attributed to
simple, phase dependent, and complex, phase independent, cortical cells.
Directionally selective cells are the fundamental nonlinear components required
in the process of extracting motion. Their nonlinear properties may be
characterised by recording the cell's response to the presentation of pairs of
bars with various spatial and temporal offsets. Analytic techniques allow the
removal of the effect of any linear operations and produce the cell's nonlinear
spatiotemporal interaction field. Emerson, Bergen & Adelson (1992) reported
that the interaction field found for cells in cat cortex supports a Fourier
energy model. Simulations of the two bar interaction stimuli with the gradient
model also give rise to spatiotemporal oriented interaction fields of the type
found physiologically (Johnston, McOwan & Benton, 1995).

**References**

- Adelson, E. H. & Bergen, J.R. (1985). Spatiotemporal energy
models for the perception of motion. J. Opt. Soc. Am. A. 2, 284-299.
- Adelson, E. H. & Bergen, J. R. (1991). The plenoptic function
and the elements of early vision. In Landy M. S. & Movshon J. A.
(Eds). Computational models of visual processing (pp 2-20), Cambridge
Mass. :MIT press.
- Chubb, C. & Sperling, G. (1988). Drift balanced random dot
stimuli; a general basis for studying non Fourier motion. J. Opt. Soc. Am.
A. 5 ,1986-2007.
- Emerson R., Bergen J. R. & Adelson E. H. (1992) Directionally
selective complex cells and the computation of motion energy in cat visual
cortex, Vision Research, 32, 203-218.
- Ferrera V. P. & Wilson H. R. (1991) Perceived speed of moving
two-dimensional patterns, Vision Research, 31, 877-893.
- Fleet, D. J. & Langley, K. (1994) Computational analysis of
non-Fourier motion. Vision Research, 34, 3057-3079.
- Heeger, D. J. (1987). Model for the extraction of image flow. J.
Opt. Soc. Am. A. 4, 1455-1471.
- Hess R. F. & Snowden R. J., (1992) Temporal properties of
human visual filters: Number, shapes and spatial covariance, Vision
Research. 32, 47-60
- Johnston, A., McOwan, P.W. & Buxton, H. (1992a). A
computational model of the analysis of some first-order and second-order
motion patterns by simple and complex cells. Proc. R. Soc. Lond. B. 250,
297-306.
- Johnston, A., P. W. McOwan & H. Buxton. (1992b) A biologically
plausible scheme for measuring image velocity. J. Physiol. 452, 288.
- Johnston A. & Clifford C. W. G. (1995a) A unified account of
three apparent motion illusions Vision Research, 35, 1109-1123.
- Johnston A. & Clifford C. W. G. (1995b) Perceived motion of
contrast modulated gratings: predictions of the Multi-channel Gradient
model and the role of full wave rectification, Vision Research, 35,
1771-1783.
- Johnston A., McOwan P. W. & Benton C. P. (1995) Nonlinear
interactions in direction selective complex cells are predicted by a
gradient motion model, Investigative Ophthalmology and Visual Science
(Supplement) 36, 277.
- Koenderink, J. J. & Van Doorn A. J. (1987) Representation of
local geometry in the visual system. Biological Cybernetics 55 367-375
- Landy M. S., & Bergen J. R., (1991) Texture segregation and
orientation gradient, Vision Research. 31 , 679-691.
- Ledgeway T. & Smith A. T. (1994) Evidence for separate
motion-detecting mechanisms for first- and second-order motion in human
vision, Vision Research. 34, 2727-2740.
- Lu Z. & Sperling G. (1995) Attention-generated apparent
motion, Nature 377, 237-239.
- Mather G. & West S.(1993), Evidence for second-order motion
detectors, Vision Research. 33, 1109-1112.
- Ohzawa, I., De Angelis, G. C. & Freeman, R. D. (1990)
Stereoscopic depth discrimination in the visual system; neurons ideally
suited as disparity detectors. Science 249, 1037 -1040.
- Reichardt W. (1959) Autocorrelation and the central nervous
system. In W. A. Rosenblith (Ed.) Sensory Communication, MIT Press,
Cambridge, 303-318.
- Watson, A. B. & Ahumada, A. J. (1985). Model of human
visual-motion sensing. J. Opt. Soc. Am. A. 2, 322-341.
- Wilson, H. R., V. P. Ferrera & Yo, C. (1992). A
psychophysically motivated model for two-dimensional motion perception.
Vis. Neurosci. 9, 79-97.
- Young, R.A. & Lesperance, R.M. (1993) A physiological model of
motion analysis for machine vision Technical Report General Motors
Research Laboratories. GMR-7878. 1-76.