The compact disc format, which records audio with 16-bit resolution at a sampling rate of 44.1 kHz, was engineered to reproduce audio with fidelity exceeding the limits of human perception. And it works. However, sound is inherently a spatial perception. We perceive the direction, distance, and size of sound sources, and reproducing the spatial properties of sound accurately remains a challenge. In this paper, I review the technologies for spatial sound reproduction and discuss future directions, focusing on the promise of individualized, binaural technology.
People hear with two ears, and the two audio signals received at the eardrums completely define the auditory experience. An amazing feature of the auditory system is that with only two ears sounds can be perceived from all directions, and the listener can even sense the distance and size of sound sources. The perceptual cues for sound localization include the amplitude of the sound at each ear, the arrival time at each ear, and the spectrum of the sound, that is, the relative amplitude of the sound at different frequencies.
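As a rough illustration of the arrival-time cue, the sketch below estimates the interaural time difference for a distant source using the Woodworth spherical-head approximation. The formula, the head radius, and the speed of sound are assumptions introduced here for illustration; none of them comes from the work reviewed in this paper.

```python
import math

HEAD_RADIUS_M = 0.0875      # assumed average head radius (illustrative value)
SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air at roughly 20 C


def interaural_time_difference_s(azimuth_deg):
    """Woodworth approximation for a rigid spherical head and a distant source.

    azimuth_deg is measured from straight ahead and limited to 0..90 degrees.
    """
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must be between 0 and 90 degrees")
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


for azimuth in (0, 30, 60, 90):
    itd_us = interaural_time_difference_s(azimuth) * 1e6
    print(f"azimuth {azimuth:2d} deg -> ITD of about {itd_us:.0f} microseconds")
```

Under these assumptions the time difference grows from zero straight ahead to well under one millisecond at the side, which gives a sense of the timing resolution the auditory system works with.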
The spectrum of a sound is modified by interactions between sound waves and the torso, head, and external ear (pinna). Furthermore, spectral modification depends on the location of the source in a complex way. The auditory system uses spectral modifications as cues to the location of sound, but because the complex shape of the pinna varies significantly among individuals, the cues for sound localization are idiosyncratic. Each individual’s auditory system is adapted to the idiosyncratic spectral cues produced by his or her head features.
Binaural audio refers specifically to the recording and reproduction of sound at the ears. Binaural recordings can be made by placing miniature microphones in the ear canals of a human subject. Exact reproduction of the recording is possible through properly equalized headphones. If the recording and playback are for the same subject and there are no head movements, the results are stunningly realistic.
Many virtual-reality audio applications attempt to position a sound arbitrarily around a listener wearing headphones. They rely on a stored database of head-related transfer functions (HRTFs), that is, mathematical descriptions of the transformation of sound by the torso, head, and external ear. HRTFs for the left and right ears of a subject specify how sound from a particular direction is transformed en route to the eardrums. A complete description of a subject’s head response requires hundreds of HRTF measurements from all directions surrounding the subject. Any sound source can be virtually located by filtering the sound with the HRTFs corresponding to the desired location and presenting the resulting binaural signal to the subject using properly equalized headphones. When this procedure is individualized by using the subject’s own HRTFs, the localization performance is equivalent to free-field listening (Wightman and Kistler, 1989a,b).
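The filtering step can be sketched as a pair of convolutions, one per ear, with the time-domain form of the HRTFs (head-related impulse responses, or HRIRs). The code below is a minimal sketch; the impulse responses are made-up toy values rather than measured data, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter a mono signal with the HRIR pair for one source direction."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy HRIRs for a source on the right (assumed values, not measurements):
# the right ear receives the sound earlier and louder than the left ear.
toy_hrir_right = [1.0, 0.3, 0.0, 0.0]
toy_hrir_left = [0.0, 0.0, 0.5, 0.2]

mono = [1.0, 0.0, -1.0]
left, right = render_binaural(mono, toy_hrir_left, toy_hrir_right)
print("left ear: ", left)
print("right ear:", right)
```

A practical system would select the HRIR pair from a measured database according to the desired direction and would typically use FFT-based convolution for efficiency; headphone equalization is omitted here.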
Figure 1 (see pdf version) shows the magnitude spectra for right ear HRTFs measured for three different human subjects with a sound source located on the horizontal plane at 60 degrees right azimuth. Note that the spectra are similar up to 6 kHz; the significant differences in HRTFs at higher frequencies are attributable to variations in pinna shape. Figure 2 (see pdf version) shows the magnitude spectra of HRTFs measured from a dummy head microphone for all locations on the horizontal plane. Note how the spectral features change as a function of source direction.
Most research has been focused on localization; subjects presented with an acoustic stimulus are asked to report the apparent direction. Their localization is then compared to free-field listening to assess the quality of reproduction. But this method does not account for many attributes of sound perception, including distance, timbre, and size. In an experimental paradigm developed by Hartmann and Wittenberg (1996), the virtual stimulus is reproduced using open-air headphones that allow free-field listening. Thus, real and virtual stimuli can be compared directly. In these experiments, subjects are presented with a stimulus and asked to decide if it is real or virtual. If a virtual stimulus cannot be distinguished from a real stimulus, then the reproduction error is within the limits of perception. When this experimental paradigm was used to study the externalization of virtual sound, the results demonstrated that individualized spectral cues are necessary for proper externalization.
The major limitation of binaural techniques is that all listeners are different. Binaural signals recorded for subject A may not sound correct to subject B. Nevertheless, by necessity, binaural systems are seldom individualized. Instead, a reference head, often a model that represents a typical listener or HRTFs known to perform adequately for a range of different listeners, is used to encode binaural signals for all listeners. This is called a “non-individualized” system (Wenzel et al., 1993).
The use of non-individualized HRTFs is limited by a lack of externalization (the sounds are localized in the head or very close to the head), incorrect perception of elevation angle, and front/back reversals. Externalization can be improved somewhat by adding dynamic head tracking and reverberation. Nevertheless, the lack of realistic externalization is often cited as a problem with these systems.
The great challenge in binaural technology is to devise a practical method by which binaural signals can be individualized to a specific listener. There are several possible approaches to meeting this challenge: acoustic measurement, statistical models, calibration procedures, simplified geometrical models, and accurate head models solved using computational acoustics.
With the proper equipment, measuring the HRTFs of a listener is a straightforward procedure, although not practical for commercial applications. Microphones are placed in the ears of the listener, either probe microphones placed somewhere in the ear canal or microphones that block the entrances to the ear canals. Measurement signals are produced from speakers surrounding the listener to measure the impulse response of each source direction to each ear. Because tens or hundreds of directions may be measured, the listener is positioned either in a rotating chair or in a fixed position surrounded by hundreds of speakers. The measurements are often made in an anechoic (echo-free) chamber.
Various statistical methods have been used to analyze databases of HRTF measurements in an effort to tease out some underlying structure in the data. One important study applied principal component analysis (PCA) to a database of HRTFs from 10 listeners at 256 directions (Kistler and Wightman, 1992). Using the log magnitude spectra of the HRTFs as input, the analysis indicated that 90 percent of the variance in the data could be accounted for using only five principal components. The study tested the localization performance using individualized HRTFs approximated by weighted sums of the five principal components. The results were nearly identical to those obtained with the listener’s own measured HRTFs. The study gathered only directional judgments, and externalization was not considered. The study showed that a five-parameter model is sufficient for synthesizing individualized HRTF spectra, at least in terms of directional localization and for a single direction. Unfortunately, the five parameters must be calculated for each source direction, which means individualized measurements are still necessary.
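The idea of approximating a log-magnitude spectrum by a weighted sum of a few principal components can be sketched on synthetic data. The example below is not the procedure or data of the cited study: the "spectra" are toy four-bin vectors built from two made-up shapes, so two components (rather than five) capture all of the variance, and the eigenvectors are found with simple power iteration.

```python
import math


def fit_pca(spectra, num_components, iterations=200):
    """Return (mean, components, variances, total_variance).

    Components are found by power iteration with deflation on the covariance.
    """
    n, d = len(spectra), len(spectra[0])
    mean = [sum(row[j] for row in spectra) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in spectra]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_variance = sum(cov[i][i] for i in range(d))
    components, variances = [], []
    for c in range(num_components):
        v = [1.0 / (1 + i + c) for i in range(d)]  # fixed, arbitrary start vector
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        components.append(v)
        variances.append(eigenvalue)
        for i in range(d):  # deflate: remove the component just found
            for j in range(d):
                cov[i][j] -= eigenvalue * v[i] * v[j]
    return mean, components, variances, total_variance


def approximate(spectrum, mean, components):
    """Project a spectrum onto the components and rebuild it from the weights."""
    centered = [s - m for s, m in zip(spectrum, mean)]
    weights = [sum(c * x for c, x in zip(comp, centered)) for comp in components]
    approx = list(mean)
    for w, comp in zip(weights, components):
        approx = [a + w * c for a, c in zip(approx, comp)]
    return weights, approx


# Synthetic log-magnitude spectra in dB (assumed toy values, four frequency bins).
SHAPE_A = [1.0, 1.0, 0.0, 0.0]
SHAPE_B = [0.0, 0.0, 1.0, -1.0]
BASELINE = [-3.0, -6.0, -9.0, -12.0]
TOY_WEIGHTS = [(2.0, 0.5), (-2.0, -0.5), (1.0, -1.0),
               (-1.0, 1.0), (0.5, 0.2), (-0.5, -0.2)]
toy_spectra = [[BASELINE[k] + a * SHAPE_A[k] + b * SHAPE_B[k] for k in range(4)]
               for a, b in TOY_WEIGHTS]

mean, components, variances, total_variance = fit_pca(toy_spectra, 2)
explained = sum(variances) / total_variance
weights, approx = approximate(toy_spectra[0], mean, components)
print(f"fraction of variance explained by 2 components: {explained:.3f}")
```

Each spectrum is then described by its short list of weights, which is the sense in which a handful of parameters per direction can stand in for a full measured spectrum.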
One can imagine a simple calibration procedure that would involve the listener adjusting knobs to match a parameterized HRTF model with the listener’s characteristics. The listener could be given a test stimulus and asked to adjust a knob until some attribute of his perception was maximized. After adjusting several knobs in this manner, the parameter values of the internal model would be optimized for the listener, and the model would be able to generate individualized HRTFs for that individual. Some progress has been made in this area. For example, it has been demonstrated that calibrating HRTFs according to overall head size improves localization performance (Middlebrooks et al., 2000). However, to date, detailed methods of modeling and calibrating the data have not been found.
Many researchers have developed geometrical models for the torso, head, and ears. The head and torso can be modeled using ellipsoids (Algazi and Duda, 2002), and the pinna can be modeled as a set of simple geometrical objects (Lopez-Poveda and Meddis, 1996). For simple geometries, the acoustic-wave equation can be solved to determine head response. For more complicated geometries, head response can be approximated using a multipath model, wherein each reflecting or diffracting object contributes an echo to the response (Brown and Duda, 1997). In theory, head models should be easy to fit to any particular listener by making anthropometric measurements of the listener and plugging these into the model. However, studies have shown that although simplified geometrical models are accurate at low frequencies, they become increasingly inaccurate at higher frequencies. Because of the importance of high-frequency localization cues for proper externalization, elevation perception, and front/back resolution of sound, simplified geometrical models are not suitable for creating individualized HRTFs.
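The multipath idea can be illustrated with a toy impulse response made of a direct path plus delayed, scaled echoes. This is only a sketch of the general principle and not the model of the cited work; the sample rate, the echo delay, and the echo gain are assumed values. It shows how a single echo produces a spectral notch, which is the kind of direction-dependent feature a pinna reflection can create.

```python
import cmath
import math

SAMPLE_RATE_HZ = 44100  # assumed sample rate


def multipath_impulse_response(echoes, length):
    """Direct path of unit gain at delay 0 plus (delay_in_samples, gain) echoes."""
    h = [0.0] * length
    h[0] = 1.0
    for delay, gain in echoes:
        h[delay] += gain
    return h


def magnitude_at(h, freq_hz):
    """Magnitude of the frequency response of h at a single frequency."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * n / SAMPLE_RATE_HZ)
              for n, x in enumerate(h))
    return abs(acc)


# One echo, three samples late, at half the amplitude of the direct sound.
toy_response = multipath_impulse_response([(3, 0.5)], length=8)

# The echo is exactly out of phase with the direct path at this frequency.
notch_hz = SAMPLE_RATE_HZ / (2 * 3)

print(f"gain at 0 Hz: {magnitude_at(toy_response, 0.0):.2f}")
print(f"gain at {notch_hz:.0f} Hz: {magnitude_at(toy_response, notch_hz):.2f}")
```

Changing the delay moves the notch, so a model in which echo delays depend on source direction yields spectra that vary with direction.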
A more promising approach has been to use a three-dimensional laser scan to produce an accurate geometrical representation of a head as a basis for computational acoustic simulation using finite-element modeling (FEM) or boundary-element modeling (BEM) (Kahana et al., 1998, 1999). With this method, HRTFs can be determined computationally with the same accuracy as acoustical measurements, even at high frequencies. Using a 15,000-element model of the head and ear, Kahana demonstrated computation of HRTFs that match acoustical measurements very precisely up to 15 kHz.
There are, however, a number of practical difficulties with this method. First, scanning the head is complicated by the presence of hair, obscured areas behind the ears, and the obscured internal features of the ear. Second, replicating the interior features of the ear requires making molds and then scanning them separately. Third, after the various scans are spliced together, the number of elements in the model must be pruned to computationally tractable quantities without compromising spatial resolution. Finally, solution of the acoustical equations requires significant computation. For all of these reasons, this approach currently requires more effort and expense than acoustical measurement of HRTFs.
The technique does suggest an alternative approach to determining individualized HRTFs. A deformable head model could be fashioned from finite elements and parameterized with a set of anthropometric measurements. After making head measurements of a particular subject and plugging these into the model, the model head would “morph” into a close approximation of the subject’s head. At that point, the computational acoustics procedure could be used to determine individualized HRTFs for the subject. Ideally, the subject’s measurements could be determined from images using computer vision techniques. The goal would be a system that could automatically determine individualized HRTFs based on a few digital images of the subject’s head and ears. The challenges will be to develop a head model that can morph to fit any head, to obtain a sufficiently accurate ear shape, and to develop ways to estimate the parameters from images of the subject.
Binaural audio can be delivered to a listener over conventional stereo loudspeakers, but each loudspeaker (unlike headphones) creates significant “cross-talk” to the opposite ear. The cross-talk can be cancelled by preprocessing the speaker signals with the inverse of the 2 x 2 matrix of transfer functions from the speakers to the ears; systems that perform this preprocessing are called cross-talk cancellers. Cross-talk cancellers use a model of the head to anticipate what cross-talk will occur, then add an out-of-phase cancellation signal to the opposite channel. Thus, the cross-talk is acoustically cancelled at the listener’s ears. If the head responses of the listener are known, and if the listener’s head remains fixed, an individualized cross-talk cancellation system can be designed that works extremely well.
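The matrix inversion can be sketched at a single frequency. In the code below, the speaker-to-ear transfer functions come from a deliberately crude, symmetric toy head model (unit gain to the near ear, an attenuated and delayed path to the far ear); the gain and delay values are assumptions, not measured head responses.

```python
import cmath
import math


def toy_speaker_to_ear_matrix(freq_hz, cross_gain=0.7, cross_delay_s=0.0002):
    """Rows are ears (left, right); columns are speakers (left, right)."""
    cross = cross_gain * cmath.exp(-2j * math.pi * freq_hz * cross_delay_s)
    return [[1.0 + 0j, cross], [cross, 1.0 + 0j]]


def invert_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("transfer matrix is singular at this frequency")
    return [[d / det, -b / det], [-c / det, a / det]]


def apply_2x2(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


freq_hz = 1000.0
binaural = [1.0 + 0j, 0.0 + 0j]  # desired signal: left ear only

head = toy_speaker_to_ear_matrix(freq_hz)

# Without cancellation, the left speaker leaks into the right ear.
ears_plain = apply_2x2(head, binaural)

# With cancellation, the speaker feeds are the inverse matrix applied to the
# binaural signal, so the acoustic paths undo the preprocessing.
speaker_feeds = apply_2x2(invert_2x2(head), binaural)
ears_cancelled = apply_2x2(head, speaker_feeds)

print("right-ear leakage without cancellation:", round(abs(ears_plain[1]), 3))
print("right-ear leakage with cancellation:   ", round(abs(ears_cancelled[1]), 3))
```

A full canceller repeats this inversion across all frequencies and turns the result into filters; the sketch also makes visible why the result depends on the head model matching the actual listener and position.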
Non-individualized cross-talk cancellers are effective only up to 6 kHz and then only when the listener’s position is known (Gardner, 1998). However, despite poor high-frequency performance, cross-talk–cancelled audio is capable of producing stunning, well-externalized virtual sounds to the sides of the listener using frontally placed loudspeakers. The sounds are well externalized as a result of the listener’s pinna cues. The sounds are shifted to the side as a result of the dominance of low-frequency time-delay cues in lateral localization; the cross-talk cancellation works effectively at low frequencies to provide this cue.
The first audio reproduction systems were monophonic, reproducing a single audio signal through one transducer. Stereophonic audio systems, recording and reproducing two independent channels of audio, sound much more realistic. With two loudspeakers, it is possible to position a sound source at either speaker or to position sounds between the speakers by sending a proportion of the sound to each speaker. Stereo has a great advantage over mono because it reproduces a set of locations between the speakers. Also with stereo, uncorrelated signals can be sent to the two ears, which is necessary to achieve a sense of space.
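Positioning a sound between two speakers by sending a proportion of it to each can be sketched with a pan law. The constant-power (sine/cosine) law used below is one common choice and is an assumption of this example; the text above does not specify a particular law.

```python
import math


def constant_power_pan(position):
    """Return (left_gain, right_gain) for position 0.0 (left) to 1.0 (right)."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    angle = position * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


def pan_mono_to_stereo(mono, position):
    """Scale one mono signal into left and right speaker feeds."""
    left_gain, right_gain = constant_power_pan(position)
    return [left_gain * s for s in mono], [right_gain * s for s in mono]


for position in (0.0, 0.25, 0.5, 1.0):
    left_gain, right_gain = constant_power_pan(position)
    print(f"position {position:.2f}: left {left_gain:.3f}, right {right_gain:.3f}")
```

With this law the sum of the squared gains stays at one for every position, so the total power is the same wherever the source is placed between the speakers.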
Multichannel audio systems, such as the current 5.1 surround systems, have continued the trend of adding channels around the listener to improve spatial reproduction. 5.1 systems have left, center, and right frontal speakers, left and right surround speakers positioned to the sides of the listener, and a subwoofer to reproduce low frequencies. Because 5.1 systems were designed for cinema sound, the focus is on accurate frontal reproduction so that movie dialogue is spatially aligned with images of the actors speaking. The surround speakers are used for off-screen sounds or uncorrelated ambient effects. The trend in multichannel audio is to add more speaker channels to improve the accuracy of on-screen sounds and provide additional locations for off-screen sounds. As increasing numbers of speakers are added at the perimeter of the listening space, it becomes possible to reconstruct arbitrary sound fields within the space, a technology called wave-field synthesis.
Ultrasonics can be used to produce highly directional audible sound beams. This technology is based on physical properties of air, particularly that air becomes a nonlinear medium at high sound pressures. Hence, it is possible to transmit two high-intensity ultrasonic tones, say at 100 kHz and 101 kHz, and produce an audible 1 kHz tone as a result of the intermodulation between the two ultrasonic tones. However, the demodulated signal will be significantly distorted, so the audio must be preprocessed to reduce the distortion after demodulation (Pompei, 1999). Although this technology is impressive, it cannot reproduce low-frequency sounds effectively, and it has lower fidelity than standard loudspeakers.
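The intermodulation effect can be illustrated numerically. The sketch below passes the 100 kHz and 101 kHz tones from the example above through a simple square-law nonlinearity, which is a stand-in assumed here for illustration and not a physical model of sound propagation in air, and then measures the level at the 1 kHz difference frequency. The sample rate and duration are also assumed values.

```python
import cmath
import math

SAMPLE_RATE_HZ = 1_000_000  # assumed; high enough to represent the ultrasonic tones
NUM_SAMPLES = 2000          # 2 ms, a whole number of cycles for every tone involved


def two_tone(f1_hz, f2_hz):
    """Sum of two unit-amplitude cosines."""
    return [math.cos(2 * math.pi * f1_hz * n / SAMPLE_RATE_HZ)
            + math.cos(2 * math.pi * f2_hz * n / SAMPLE_RATE_HZ)
            for n in range(NUM_SAMPLES)]


def amplitude_at(signal, freq_hz):
    """Amplitude of the sinusoidal component of signal at freq_hz."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * n / SAMPLE_RATE_HZ)
              for n, x in enumerate(signal))
    return 2.0 * abs(acc) / len(signal)


ultrasonic = two_tone(100_000.0, 101_000.0)
after_nonlinearity = [x * x for x in ultrasonic]  # toy square-law nonlinearity

linear_level = amplitude_at(ultrasonic, 1000.0)
demodulated_level = amplitude_at(after_nonlinearity, 1000.0)
print(f"1 kHz level before nonlinearity: {linear_level:.6f}")
print(f"1 kHz level after nonlinearity:  {demodulated_level:.6f}")
```

The original two-tone signal contains nothing at 1 kHz; the squared signal does, because the product of the two cosines contains a term at their difference frequency. The same squaring also generates components near 200 kHz, which are inaudible.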
Binaural audio has the potential to reproduce sound that is indistinguishable from sounds in the real world. However, the playback must be individualized to each listener’s head response. This is currently possible by making acoustical measurements or by making geometrical scans and applying computational acoustic modeling. A practical means of individualizing head responses has yet to be developed.
Algazi, V.R., and R.O. Duda. 2002. Approximating the head-related transfer function using simple geometric models of the head and torso. Journal of the Acoustical Society of America 112(5): 2053–2064.
Brown, C.P., and R.O. Duda. 1997. An efficient HRTF model for 3-D sound. Pp. 298–301 in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New York: IEEE.
Gardner, W.G. 1998. 3-D Audio Using Loudspeakers. Boston, Mass.: Kluwer Academic Publishers.
Hartmann, W.M., and A. Wittenberg. 1996. On the externalization of sound images. Journal of the Acoustical Society of America 99(6): 3678–3688.
Kahana, Y., P.A. Nelson, and M. Petyt. 1998. Boundary element simulation of HRTFs and sound fields produced by virtual acoustic imaging. Proceedings of the Audio Engineering Society’s 105th Convention. Preprint 4817, unpaginated.
Kahana, Y., P.A. Nelson, M. Petyt, and S. Choi. 1999. Numerical modeling of the transfer functions of a dummy-head and of the external ear. Pp. 330–334 in Proceedings of the Audio Engineering Society’s 16th International Conference. New York: Audio Engineering Society.
Kistler, D.J., and F.L. Wightman. 1992. A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. Journal of the Acoustical Society of America 91(3): 1637–1647.
Lopez-Poveda, E.A., and R. Meddis. 1996. A physical model of sound diffraction and reflections in the human concha. Journal of the Acoustical Society of America 100(5): 3248–3259.
Middlebrooks, J.C., E.A. Macpherson, and Z.A. Onsan. 2000. Psychophysical customization of directional transfer functions for virtual sound localization. Journal of the Acoustical Society of America 108(6): 3088–3091.
Pompei, F.J. 1999. The use of airborne ultrasonics for generating audible sound beams. Journal of the Audio Engineering Society 47(9): 726–731.
Wenzel, E.M., M. Arruda, D.J. Kistler, and F.L. Wightman. 1993. Localization using nonindividualized head-related transfer functions. Journal of the Acoustical Society of America 94(1): 111–123.
Wightman, F.L., and D.J. Kistler. 1989a. Headphone simulation of free-field listening I: stimulus synthesis. Journal of the Acoustical Society of America 85(2): 858–867.
Wightman, F.L., and D.J. Kistler. 1989b. Headphone simulation of free-field listening II: psychophysical validation. Journal of the Acoustical Society of America 85(2): 868–878.