Audio Interpolation
Richard J. Radke and Scott Rickard.
Submitted to AES22, International Conference on Virtual, Synthetic and Entertainment Audio,
June 2002, Espoo, Finland

Summary

In this paper, our goal is to understand when there is enough information contained in the audio signals received at two microphones to produce the audio that would have actually been heard had a third microphone been present in the environment. We show that for anechoic environments, when the virtual microphone is located along the line connecting the two real microphones, the audio can be synthesized with no knowledge besides the distance between the two microphones.

We call our algorithm "audio interpolation" as an analogy to the term "view interpolation" from computer vision. View interpolation techniques use two or more real views of a scene to synthesize other, physically consistent views of the scene from the perspective of a virtual camera. By combining audio interpolation with view interpolation, we can obtain "virtual video" that contains both images and sound.

Jourjine et al. presented a novel method (the "DUET" algorithm) for blindly separating any number of sources using only two mixtures. The main assumption of the algorithm is that the sources are W-disjoint orthogonal, i.e. the supports of the windowed Fourier transforms of each pair of source signals are disjoint. This assumption has been shown to be true in an approximate sense for mixtures of multiple voices speaking simultaneously. The mixing parameters of the sources are estimated by clustering ratios of the time-frequency representations of the mixtures. The estimates of the mixing parameters can then be used to partition the time-frequency representation of one mixture to recover the original sources. The technique is valid even when the number of sources is larger than the number of mixtures.
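As a minimal sketch of the ratio idea described above, assume an anechoic model in which the second mixture contains each source scaled by a relative attenuation `a` and shifted by a relative delay `d`. At a time-frequency point where only one source is active, the ratio of the two mixtures is then `a * exp(-i * omega * d)`, so its magnitude and phase give the mixing parameters. The example below is illustrative only: it uses two pure tones at distinct DFT bins (so the disjointness assumption holds exactly), a full-length DFT in place of a windowed Fourier transform, and hypothetical parameter values. It skips the clustering step because the active bins are known in advance.

```python
import cmath
import math

def dft_bin(x, k):
    """Single DFT coefficient X[k] of the real sequence x."""
    n = len(x)
    return sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))

def estimate_mixing(x1, x2, k):
    """Estimate (attenuation, delay in samples) from the ratio X2[k] / X1[k].

    Valid only if a single source is active in bin k and the phase
    omega * delay stays inside (-pi, pi), so that it does not wrap.
    """
    n = len(x1)
    ratio = dft_bin(x2, k) / dft_bin(x1, k)
    omega = 2 * math.pi * k / n  # radians per sample
    return abs(ratio), -cmath.phase(ratio) / omega

N = 256
# Hypothetical sources: tones in different DFT bins, each with its own
# relative attenuation "a" and relative delay "delay" (in samples).
sources = [
    {"bin": 5, "a": 0.8, "delay": 2},
    {"bin": 12, "a": 1.25, "delay": -3},
]

# Mixture 1 is the plain sum; mixture 2 attenuates and delays each source.
x1 = [sum(math.cos(2 * math.pi * s["bin"] * t / N) for s in sources)
      for t in range(N)]
x2 = [sum(s["a"] * math.cos(2 * math.pi * s["bin"] * (t - s["delay"]) / N)
          for s in sources)
      for t in range(N)]

estimates = [estimate_mixing(x1, x2, s["bin"]) for s in sources]
```

With real speech the sources overlap in some time-frequency points, which is why the estimates are clustered rather than read off individual bins.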

Our audio interpolation algorithm decouples into two parts. First, a method based on the DUET algorithm is used to blindly associate a physical location with each time-frequency point of the mixtures based on a model of anechoic mixing. Second, the time-frequency representations of the mixtures are altered to synthesize the virtual audio signals as they would have been heard at a third microphone placed along the line connecting the two microphones. We expect this research to have immediate applications in video conferencing and virtual video.
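The two steps can be illustrated with a small geometric sketch. It is not the paper's set of equations, which are not reproduced in this summary. It assumes free-field spherical spreading (amplitude proportional to 1/r), a speed of sound of 343 m/s, and microphones at (-D/2, 0) and (+D/2, 0). Under those assumptions the relative attenuation `a = r1/r2` and relative delay `delta = (r2 - r1)/c` of a time-frequency point determine the source distances `r1` and `r2`. The microphone separation `D` then fixes the source position up to the sign of `y`, and the point can be rescaled and re-delayed for a virtual microphone at `(virt_x, 0)`. All function and parameter names are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value

def locate_source(a, delta, mic_sep, c=SPEED_OF_SOUND):
    """Return (x, |y|, r1) for one time-frequency point.

    a     : amplitude at mic 2 divided by amplitude at mic 1 (= r1 / r2)
    delta : arrival time at mic 2 minus arrival time at mic 1, in seconds
    Undefined when a == 1 (source equidistant from both microphones).
    """
    r2 = c * delta / (1.0 - a)
    r1 = a * r2
    # Subtracting the two circle equations gives x; y follows from r1.
    x = (r1 ** 2 - r2 ** 2) / (2.0 * mic_sep)
    y = math.sqrt(max(r1 ** 2 - (x + mic_sep / 2.0) ** 2, 0.0))
    return x, y, r1

def virtual_tf_value(X1, omega, a, delta, mic_sep, virt_x, c=SPEED_OF_SOUND):
    """Alter the mic-1 value X1 (angular frequency omega, in rad/s) so that it
    matches what a virtual microphone at (virt_x, 0) would observe."""
    x, y, r1 = locate_source(a, delta, mic_sep, c)
    rv = math.hypot(x - virt_x, y)   # source-to-virtual-mic distance
    gain = r1 / rv                   # 1/r amplitude law, relative to mic 1
    extra_delay = (rv - r1) / c      # extra travel time relative to mic 1
    return X1 * gain * cmath.exp(-1j * omega * extra_delay)
```

Placing the virtual microphone at either real microphone position reproduces that microphone's own time-frequency value, which is a convenient sanity check for a model of this kind.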

Experimental Results

Our first experiment, illustrated in Figure 1, used two microphones, five sources, and a line describing the virtual microphone path. Each source was a pure tone played at a different frequency. The interpolated audio was generated by our algorithm for a virtual microphone that slides from -5m to 5m over a period of 4 seconds. As expected, as the virtual microphone moves in front of each source position, that source becomes the loudest source present in the mixture. This effect is illustrated in Figure 2, which plots the normalized power of each source as a function of the position of the virtual microphone. Figure 3 shows that the computation is accurate by comparing the estimated power for source 3 to the power computed analytically. Figure 4 is a Flash animation that includes the original sources and mixtures, as well as the interpolated audio.
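The expected behavior can be checked with a short calculation using the source positions given in the Figure 1 caption (x = -4, -2.5, -1, 1, 3m, y = 0.5m). The sketch below rests on assumptions not stated in this summary: all sources emit equal power, received power falls off as 1/r^2, the virtual microphone travels along y = 0, and "normalized" means the powers are scaled to sum to one at each microphone position.

```python
# Source positions taken from the Figure 1 caption (meters).
SOURCE_X = [-4.0, -2.5, -1.0, 1.0, 3.0]
SOURCE_Y = 0.5

def normalized_powers(mic_x):
    """Relative power of each source at a virtual microphone at (mic_x, 0).

    Assumes equal-power sources and inverse-square spreading; the result
    is scaled so that the five powers sum to 1 (an assumed normalization).
    """
    raw = [1.0 / ((mic_x - sx) ** 2 + SOURCE_Y ** 2) for sx in SOURCE_X]
    total = sum(raw)
    return [p / total for p in raw]

def loudest_source(mic_x):
    """1-based index of the strongest source at the given microphone position."""
    powers = normalized_powers(mic_x)
    return powers.index(max(powers)) + 1

# Sweep the virtual microphone from -5 m to 5 m in 0.5 m steps.
sweep = [(x / 2.0, loudest_source(x / 2.0)) for x in range(-10, 11)]
```

Under this model the loudest source changes from source 1 to source 5 as the microphone moves from left to right, consistent with the behavior described above.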

Figure 1. Experimental setup, sources at x = -4, -2.5, -1, 1, 3m, y = 0.5m.


Figure 2. Normalized power of each source vs. position of virtual microphone.


Figure 3. Theoretical (dotted line) vs. estimated (solid line) power for source 3.

 

Figure 4. Flash animation of experiment 1.

Our second experiment had the same configuration as in Figure 1, with a microphone separation of 1 cm. Only the sources at positions s1 and s5 were active. The source at s1 was a female voice recording; the source at s5 was a male voice recording. As before, a virtual microphone was moved from -5m to 5m, this time over a period of 8 seconds. Figure 5 shows the result of applying the audio interpolation equations as in the first experiment (i.e. without explicit demixing of the sources) to generate the virtual audio signal. The green line in Figure 5 shows the theoretical signal-to-interference ratio (SIR) of the two sources at the virtual microphone. A positive SIR corresponds to dominance of the source s1, while a negative SIR corresponds to dominance of the source s5. The blue line in Figure 5 is the instantaneous SIR of the interpolated mixture. The resulting relative strength of the voices in the virtual audio does not match the theoretical prediction as it did in the first experiment. Analysis revealed that violations of the W-disjoint orthogonality assumption in the speech case caused errors in the distance estimates that prevented the method from functioning as desired. To combat this, a preprocessing step was added to explicitly estimate the number of sources in the environment and assign a single source to each time-frequency bin. Using these modified estimates, Figure 6 was generated. While the two curves still do not match exactly due to deviations from modeling assumptions, the interpolated audio curve has the correct character and the audio itself sounds realistic. A Flash animation of the second experiment is included in Figure 7.
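For reference, a theoretical SIR curve of the kind described above can be sketched as follows. This is an illustration under assumed conditions, not the computation used for the figures: the positions of s1 and s5 are taken from the Figure 1 caption, received power is assumed to fall off as 1/r^2, and both voices are assumed to have equal source power unless `p1` and `p5` are overridden.

```python
import math

# Positions of the two active sources, from the Figure 1 caption (meters).
S1 = (-4.0, 0.5)
S5 = (3.0, 0.5)

def theoretical_sir_db(mic_x, p1=1.0, p5=1.0):
    """SIR in dB of s1 relative to s5 at a virtual microphone at (mic_x, 0).

    Positive values mean s1 dominates, negative values mean s5 dominates.
    p1 and p5 are hypothetical source powers (equal by default).
    """
    r1_sq = (mic_x - S1[0]) ** 2 + S1[1] ** 2
    r5_sq = (mic_x - S5[0]) ** 2 + S5[1] ** 2
    return 10.0 * math.log10((p1 / r1_sq) / (p5 / r5_sq))

# Sample the curve along the virtual microphone path from -5 m to 5 m.
curve = [(x / 2.0, theoretical_sir_db(x / 2.0)) for x in range(-10, 11)]
```

With equal source powers the curve crosses 0 dB midway between the two sources, at x = -0.5m, and changes sign as the virtual microphone passes that point.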


Figure 5. Theoretical (green) vs. estimated (blue) SIR of virtual audio, without explicitly demixing the sources.


Figure 6. Theoretical (green) vs. estimated (blue) SIR of virtual audio, after explicitly demixing with DUET.

 

Figure 7. Flash animation of experiment 2.