Audio Interpolation
Richard J. Radke and Scott Rickard. Submitted to AES22, International Conference on Virtual, Synthetic and Entertainment Audio, June 2002, Espoo, Finland
Summary

In this paper, our goal is to understand when there is enough information contained in the audio signals received at two microphones to produce the audio that would have actually been heard had a third microphone been present in the environment. We show that for anechoic environments, when the virtual microphone is located along the line connecting the two real microphones, the audio can be synthesized with no knowledge besides the distance between the two microphones.

We call our algorithm "audio interpolation" as an analogy to the term "view interpolation" from computer vision. View interpolation techniques use two or more real views of a scene to synthesize other, physically consistent views of the scene from the perspective of a virtual camera. By combining audio interpolation with view interpolation, we can obtain "virtual video" that contains both images and sound.

Jourjine et al. presented a novel method (the "DUET" algorithm) for blindly separating any number of sources using only two mixtures. The main assumption of the algorithm is that the sources are W-disjoint orthogonal, i.e. the supports of the windowed Fourier transforms of each pair of source signals are disjoint. This assumption has been shown to be true in an approximate sense for mixtures of multiple voices speaking simultaneously. The mixing parameters of the sources are estimated by clustering ratios of the time-frequency representations of the mixtures. The estimates of the mixing parameters can then be used to partition the time-frequency representation of one mixture to recover the original sources. The technique is valid even when the number of sources is larger than the number of mixtures.

Our audio interpolation algorithm decouples into two parts. First, a method based on the DUET algorithm is used to blindly associate a physical location with each time-frequency point of the mixtures based on a model of anechoic mixing. Second, the time-frequency representations of the mixtures are altered to synthesize the virtual audio signals as they would have been heard at a third microphone placed along the line connecting the two microphones.

We expect this research to have immediate applications in video conferencing and virtual video.

Experimental Results

Our first experiment, illustrated in Figure 1, consists of two microphones,
five sources, and a line describing the virtual microphone path. In this
example, each source consisted of a pure tone played at a different frequency.
The interpolated audio was obtained using our algorithm from a virtual
microphone that slides from -5m to 5m over a period of 4 seconds. As expected,
as the virtual microphone moves in front of each source position, that
source becomes the loudest source present in the mixture. This effect
is illustrated in Figure 2 by plotting the normalized power of each source
as a function of the position of the virtual microphone. Figure 3 shows
that the computation is accurate by comparing the estimated power for
source 3 to the power computed analytically. Figure 4 is a Flash animation
that includes the original sources and mixtures, as well as the interpolated
audio.
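
The qualitative behaviour described above can be checked with a small analytic model. The Python sketch below is only an illustration under stated assumptions: all five sources radiate equal power, amplitude decays as 1/distance (so power decays as 1/distance squared), and the virtual microphone travels along the line y = 0. The page does not state these details, and the function name `normalized_power` is ours.

```python
import numpy as np

# Source positions taken from the Figure 1 caption (metres).
SOURCE_X = np.array([-4.0, -2.5, -1.0, 1.0, 3.0])
SOURCE_Y = 0.5


def normalized_power(mic_x, mic_y=0.0):
    """Share of the total received power contributed by each source.

    ASSUMPTIONS (not from the paper): equal-power sources, anechoic
    1/distance amplitude decay, microphone path on the line y = mic_y.
    Returns an array of shape (number of positions, number of sources).
    """
    mic_x = np.atleast_1d(np.asarray(mic_x, dtype=float))
    # Squared distance from every microphone position to every source.
    d2 = (mic_x[:, None] - SOURCE_X[None, :]) ** 2 + (SOURCE_Y - mic_y) ** 2
    power = 1.0 / d2
    # Normalize so the shares at each position sum to one.
    return power / power.sum(axis=1, keepdims=True)


# Virtual microphone path from -5 m to 5 m, as in the first experiment.
path = np.linspace(-5.0, 5.0, 401)
shares = normalized_power(path)
loudest = shares.argmax(axis=1)  # index of the dominant source per position
```

Under these assumptions, `loudest` switches from source to source as the microphone passes in front of each one, which is the pattern the text attributes to Figure 2.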

Figure 1. Experimental setup, sources at x = -4, -2.5, -1, 1, 3 m, y = 0.5 m.

Figure 2. Normalized power of each source vs. virtual microphone position.

Figure 3. Theoretical (dotted line) vs. estimated power for source 3.

Figure 4. Flash animation of experiment 1.

Our second experiment had the same configuration as in Figure 1 with
microphone separation 1 cm. Only the sources at positions s1 and s5 were
active. The source at s1 was a female voice recording; the source at s5
was a male voice recording. As before, a virtual microphone was moved
from -5m to 5m, this time over a period of 8 seconds. Figure 5 shows the
result of applying the audio interpolation equations as in the first experiment
(i.e. without explicit demixing of the sources) to generate the virtual
audio signal. The green line in Figure 5 shows the theoretical signal-to-interference
ratio (SIR) of the two sources at the virtual microphone. A positive SIR
corresponds to dominance of the source s1, while a negative SIR corresponds
to dominance of the source s5. The blue line in Figure 5 is the instantaneous
SIR of the interpolated mixture. The resulting relative strength of the
voices in the virtual audio does not match the theoretical prediction
as it did in the first experiment. Analysis revealed that violations
of the W-disjoint orthogonality assumption in the
speech case caused errors in the distance estimates that prevented the
method from functioning as desired. In order to combat this, a preprocessing
step was added to explicitly estimate the number of sources in the environment
and assign a single source to each time-frequency bin. Using these modified
estimates, Figure 6 was generated. While the two curves still do not match
exactly due to deviations from modeling assumptions, the interpolated
audio curve has the correct character and the audio itself sounds realistic.
A Flash animation of the second experiment is included in Figure 7.
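
To make the SIR comparison concrete, the Python sketch below computes a theoretical SIR curve for sources s1 and s5 along the virtual microphone path, plus a frame-wise SIR estimate. It is a sketch under assumptions, not the authors' procedure: both voices are taken to have equal power, amplitude is taken to decay as 1/distance, the microphone path is taken to lie on y = 0, and the frame-wise estimate assumes the s1 and s5 contributions to the virtual signal are available separately. The function names and the frame length of 1024 samples are hypothetical.

```python
import numpy as np

# Positions of the two active sources from Figure 1 (metres).
S1 = (-4.0, 0.5)
S5 = (3.0, 0.5)


def theoretical_sir_db(mic_x, mic_y=0.0):
    """SIR in dB of s1 over s5 at a virtual microphone position.

    ASSUMPTIONS (not from the paper): equal-power sources and anechoic
    1/distance amplitude decay, so power scales as 1/distance**2.
    Positive values mean s1 dominates; negative values mean s5 dominates.
    """
    mic_x = np.asarray(mic_x, dtype=float)
    d1_sq = (mic_x - S1[0]) ** 2 + (mic_y - S1[1]) ** 2
    d5_sq = (mic_x - S5[0]) ** 2 + (mic_y - S5[1]) ** 2
    # Power ratio P1 / P5 = (1 / d1^2) / (1 / d5^2) = d5^2 / d1^2.
    return 10.0 * np.log10(d5_sq / d1_sq)


def framewise_sir_db(target, interference, frame=1024, eps=1e-12):
    """Frame-by-frame SIR in dB from separately rendered contributions.

    `target` and `interference` are hypothetical 1-D arrays holding the
    s1 and s5 parts of the virtual signal; `eps` avoids log of zero.
    """
    n = (min(len(target), len(interference)) // frame) * frame
    p_t = (np.asarray(target[:n], dtype=float).reshape(-1, frame) ** 2).mean(axis=1)
    p_i = (np.asarray(interference[:n], dtype=float).reshape(-1, frame) ** 2).mean(axis=1)
    return 10.0 * np.log10((p_t + eps) / (p_i + eps))


path = np.linspace(-5.0, 5.0, 401)
sir_curve = theoretical_sir_db(path)
```

With these assumptions the curve is positive near s1, negative near s5, and crosses 0 dB at x = -0.5 m, the point equidistant from the two sources.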

Figure 5. Theoretical (green) vs. estimated (blue) SIR.

Figure 6. Theoretical (green) vs. estimated (blue) SIR.

Figure 7. Flash animation of experiment 2.