{___________________________________________________________________________ FFIR.ASM Fast FFT-based FIR Filter This routine demostrates the use of FFTs to implement fast FIR filters. By using FFT-based FIRs (FFIR) one can realize appreciable gains in speed at the expense of memory, complexity and filter delay. An FFIR can be considered to be a fast convolution of an infinite duration waveform (the sample input stream) and a finite duration waveform of size Q (the tap coefficients). We can perform the convolution by breaking-up the input waveform into N-sized, overlapping chunks where N > Q and the size of the FFT we will perform. We are acquiring this chunks of data with the intention of performing a fast circular convolution on each chunk and then re-combining these chunks to form the convolved, FIR-filtered output data stream we desire. To do the fast circular convolution, we perform an FFT on the data chunk, multiply it by the Fourier transform of the tap coefficients with (N-Q) zeros appended, and perform an inverse FFT on the result. Note that the first Q-1 samples of each circular convolution are "spoiled" for our purposes because of the wrap-around edge effect of the convolution. Therefore, if we overlap these chunks by Q-1 samples and throw away the Q-1 spoiled samples from each chunk, we can construct the unbroken, FIR-filtered data stream we desire. This technique has been called the "overlap-save" method. For a more complete discussion of FFT convolutions and the overlap-save technique, please refer to "The Fast Fourier Transform" by E. Oran Brigham (Prentice-Hall:1974). This particular routine has been optimized for speed. Data for two FFTs is collected before processing so that we can perform two real FFTs simultaneously with one complex FFT (see "The Fast Fourier Transform"). This incurs extra memory and filter delay costs. These costs could be reduced at the expense of speed by gathering data for one FFT at a time and employing a real FFT algorithm. If the incoming sample rate is known, the buffering memory requirements may also be able to be reduced. This routine has been written to run on the 21020 EZ-lab board and works in the following way: 1) An interrupt routine periodically adds a sample to the circular input buffer *inaudio* of length N and reads a sample for output from the circular output buffer *outaudio*. The buffer *outaudio* is of length 2*SEG where SEG is N-Q+1, the number of good sample results per a FFT FIR-filter chunk. 2) At every SEG sample boundry, which is marked by *outaudio* being half-full or full, N samples from *inaudio* are copied to *redata* or *imdata* respectively in preparation for processing. This accomplishes the necessary Q-1 sample overlap (N-SEG = Q-1). Also the SEG processed samples are being copied from the buffer*imfft* or *refft* respectively to the *outaudio* buffer for timely output by the interrupt routine; the first Q-1 samples are ignored. Scaling and fix=>float and float=>fix conversions are also done during data movement. 3) After the data movement (2) at every 2*SEG boundry, which is marked by *outaudio* being full, the input samples in *redata* and *imdata* are processed to yield the FIR-filter output in *imfft* and *refft*. This consists of the following steps: a) Perform the FFT on *redata* and *imdata* to produce their transform in *refft* and *imfft*. For the purposes of fast convolution, it is not necessary to fully calculate RFFT(*redata*) and RFFT(*imdata*); The joint FFT(*redata* + j*imdata*) is sufficient. b) Multiply the complex results of (a) with the FFT of the taps+zeros which resides in *retrans* and *imtrans*. (Note that this is a complex multiply.) c) Do a inverse FFT on the results of (b). The inverse FFT is implemented by swapping the real and imaginary inputs, doing the forward FFT, swapping the real and imaginary again, and scaling the results by 1/N (which is done in the data movement). Specifically, we put the imaginary results of (b) into *redata* and the real results of (b) into *imdata* and perform FFT. Then when we move the data to the output buffer, we grab *imfft* first and then *refft* later. For this routine to work correctly, the program "twidrad4.c" must be used to generate the real and imaginary twiddle factor tables, "tc.dat" & "ts.dat", for the appropiate FFT length. Also the program "ffircoef.c" must be used to generate the real ("retrans.dat") and imaginary ("imtrans.dat") parts of the transfer function ( FFT(taps+zeros) ) from the desired tap coefficients. Author: Ronnin Yee, Analog Devices, SPD division, DSP group, (617) 461-3672 Revision: 11-JUN-91 Target Platform: 21020 EZ-Lab board Benchmarks: FFT-based FIR, not including interrupt service routine. FFT Length Cycles per Sample Comments ---------- ----------------- -------- 64 1243.5/(64-taps+1) better than direct FIR for taps > 19 256 5452.5/(256-taps+1) optimum at 9 <= taps < 42 1024 24877.5/(1024-taps+1) optimum at 42 <= taps < 160 4096 113230.5/(4096-taps+1) where taps = the number of taps in the FIR filter One should take the optimum ranges for each of the FFT sizes with a grain of salt. For instance, if one uses 256-point FFTs instead of 1K FFTs for a 60 tap filter, there is only a 7% performance loss (1.9 more cycles per a sample) but needing only one-fourth of the data memory requirement. One must balance the increase in performance with the memory requirements. The table below compares the performance of the FFT-based FIR with direct and radix-2 FIR filters. (numbers below in cycles per a sample) FIR taps FFIR (FFT length) radix-2 FIR direct FIR -------- ---------------- ----------- ---------- 10 22.1 (256) 23.0 18 20 23.0 (256) 30.5 28 50 25.5 (1k) 53.5 58 100 26.9 (1k) 90.5 108 200 29.1 (4k) 165.5 208 500 31.5 (4k) 390.5 508 1000 36.6 (4k) 765.5 1008 (.75N+15.5) (N+8) Memory Usage: pm code = 301 words, pm data = 3.75*N words, dm data = 6.75*N - 2*taps +3 Other Files: ezlab21k.ach 21020 EZ-lab board architecture file ffir.h contains preprocessor variable definitions _fftrad4.asm assembly code for the radix-4 FFT twidrad4.c generates tables for radix-4 FFT ffircoef.c converts FIR taps into coefficients for FFIR Assembler Preprocessor Variables: TAPS Number of taps in the FIR filter PNTS Number of points in the FFTs to be done. Must be a power of four, >= 64, > TAPS and <= 4096 (ez-lab memory constr.). LOGPNTS =log2(PNTS) OST =bitrev(32-bit PNTS/2) These variables MUST be altered for different tap and/or FFT sizes. If the FFT size changes, tables and coefficients must also be re-generated by the appropiate programs. ____________________________________________________________________________} #include "ffir.h" {_____________________External Variables and Routines_______________________} .EXTERN _fftrad4; {Routine that performs the radix-4 FFT} .EXTERN redata; { Real data input } .EXTERN imdata; { Imag data input } .EXTERN refft; { Real FFT output } .EXTERN imfft; { Imag FFT output } .SEGMENT/PM pm_sram; {FFT of FIR coefficients} .VAR retrans[PNTS] = "retrans.dat"; .VAR imtrans[PNTS] = "imtrans.dat"; .ENDSEG; .SEGMENT/DM dm_sram; {variables for the audio interface routine} .VAR inaudio[PNTS]; {audio input buffer} .VAR outaudio[2*SEG]; {audio output buffer} .VAR inbufptr; .ENDSEG; {These are the 2111 host port addresses} .SEGMENT/DM hip_regs; .PORT hdr0; {in left audio} .PORT hdr1; {in right audio} .PORT hdr2; {out left audio} .PORT hdr3; {out right audio} .PORT hdr4; {2111 status} .PORT hdr5; {2111 commands} .ENDSEG; {______________________reset service routine______________________} .SEGMENT/PM rst_svc; { program starts at the reset vector } PMWAIT=0x0021; {pgsz=0,pmwtstates=0,intrn.wtstates only} DMWAIT=0xa421; {pgsz=0,dmwtstates=0,intrn.wtstates only} {except DM bank2 has 2 waitstates} bit set mode2 0x10; {cache init workaround} read cache 0; bit clr mode2 0x10; jump ffir; nop; nop; .ENDSEG; .SEGMENT/PM tmzh_svc; rti; nop; nop; .ENDSEG; {______________________2111 hip service routine______________________} .SEGMENT/PM irq3_svc; jump hip_rnt (db); bit set mode1 SRRFL|SRD1L; {set alternate DAG 0-3, R(0-7) registers} bit clr mode1 BR0; {make sure we are not bit reversing } .ENDSEG; {__________________Interrupt button service routine__________________} .SEGMENT/PM irq2_svc; rti; nop; nop; .ENDSEG; .SEGMENT/PM pm_sram; ffir: ustat1 =0x00000000; {clears status of routines} imask =IRQ3I; {unmask irq2} mode2 =IRQ0E|IRQ1E|IRQ2E|IRQ3E|FLG1O|FLG2O|FLG3O; {edge-sen. interrupts, flag1-3 out} {perform init for hip_rnt} bit set mode1 SRD1L|SRRFL; {set alternate DAG 0-3, R(0-7) registers} nop; {set up buffers} b0=inaudio; b2=outaudio; l0=PNTS; l2=2*SEG; m0=1; bit clr mode1 SRD1L|SRRFL; {back to regular DAG 0-3, R(0-7) registers} nop; bit clr astat FLG1|FLG2|FLG3; espl: IRPTL=0x0; {clear out interrupts} bit set mode1 IRPTEN; {enable interrupts} wait_again: do wait until forever; bit tst ustat1 OUTBH; if tf jump buf_half (la); bit tst ustat1 OUTBF; wait: if tf jump buf_full (la); buf_half: {move incoming data in preparation for next FFT and move the last} { FFTs data to the output buffer. Fix->Flt and Flt->Fix are also} { performed.} bit clr ustat1 OUTBH; {move incoming data to *redata* FFT buffer} r0=dm(inbufptr); {get the inaudio buffer pointer} b0 = inaudio; i0 = r0; l0 = PNTS; i1 = redata; l1 = 0; m1 = 1; r15=-32; r0=dm(i0,m1); lcntr=PNTS, do (pc,2) until lce; f1=float r0 by r15, r0=dm(i0,m1); dm(i1,m1)=f1; {move outgoing data from *imfft* to *outaudio* buffer} b0 = imfft; i0 = imfft+TAPS-1; i1 = outaudio; l1 = 0; r15= -LOGPNTS+32; {-LOGPNTS for 1/N inv FFT scaling} f0 = dm(i0,m1); lcntr=SEG, do (pc,2) until lce; r1=fix f0 by r15, f0=dm(i0,m1); dm(i1,m1)=r1; jump wait_again; {go back to watching buffer flags} buf_full: {move incoming data, move outgoing data, do 1 complex FFT to } { accomplish 2 real FFTs, multiply by the transfer function, } { do complex inv FFT. } bit clr ustat1 OUTBF; bit set ASTAT FLG1; {turn-on usage meter} {move incoming data to *imdata* FFT buffer} r0=dm(inbufptr); {get the inaudio buffer pointer} b0 = inaudio; i0 = r0; l0 = PNTS; m1 = 1; i8 = imdata; l8 = 0; m8 = 1; r15= -32; r0=dm(i0,m1); f1=float r0 by r15, r0=dm(i0,m1); lcntr=PNTS-1, do (pc,1) until lce; f1=float r0 by r15, r0=dm(i0,m1), pm(i8,m8)=f1; pm(i8,m8)=f1; {move outgoing data from *refft* to *outaudio*+SEG buffer} b0 = refft; i0 = refft+TAPS-1; i1 = outaudio+SEG; r15= -LOGPNTS+32; {-LOGPNTS for 1/N inv FFT scaling} f0 = dm(i0,m1); lcntr=SEG, do (pc,2) until lce; r1=fix f0 by r15, f0=dm(i0,m1); dm(i1,m1)=r1; {do the FFT} call _fftrad4; {multiply by the transfer function and prepare for the inv FFT} {note that all l registers are set to zero by *_fftrad4* call} i0=refft; i1=imfft; i2=redata; m1=1; i8=retrans; i9=imtrans; i10=imdata; m8=1; f0=dm(i0,m1), f4=pm(i8,m8); f8=f0*f4, f1=dm(i1,m1), f5=pm(i9,m8); f12=f1*f5; f9=f0*f5, f11=f8-f12; f13=f1*f4, f0=dm(i0,m1), f4=pm(i8,m8); lcntr=PNTS-1, do multil until lce; f8=f0*f4, f15=f9+f13, f1=dm(i1,m1), f5=pm(i9,m8); f12=f1*f5, dm(i2,m1)=f15, pm(i10,m8)=f11; f9=f0*f5, f11=f8-f12; multil: f13=f1*f4, f0=dm(i0,m1), f4=pm(i8,m8); f15=f9+f13; dm(i2,m1)=f15, pm(i10,m8)=f11; {note switch of real and imag} {do the inv FFT, note that the inverse FFT is accomplished by } {switching the real and imaginary parts, doing the FFT, switching } {the real and imaginary parts again, and then scaling by 1/N. } call _fftrad4; bit clr ASTAT FLG1; {turn-off usage meter} jump wait_again; {we are finished, going back to waiting} {____________________________2111 hip service routine__________________________ This routine gets data from and puts data to the ADSP-2111. More specifically it: 1) grabs audio data from the HDR0 in the 2111 hip 2) puts proc. audio data into HDR2 in the 2111 hip Defined alternate DAG registers are: b0,i0: buffer and pointer to inaudio b2,i2: buffer and pointer to outaudio m0: 1 USTAT1 bits: 16: output buffer half filled flag 17: output buffer filled flag ___________________________________________________________________________} hip_rnt: r3=SEG; {some initializion} bit clr ustat1 0x0f0000; {clear buffer flags} {get audio data from hip} r0=dm(hdr0); dm(i0,m0)=r0; {save *inbufptr* for main routine} r1=i0; dm(inbufptr)=r1; {put audio data to the hip} r0=dm(i2,m0); dm(hdr2)=r0; {set flags} r0=i2; r5=b2; r0=r0-r5; {find offset} if ne jump (pc,2); bit set ustat1 OUTBF; {zero offset =>flag buffer filled} comp(r0,r3); if ne jump (pc,2); bit set ustat1 OUTBH; {flag half buffer offset} pop sts; push sts; rti; nop; nop; .ENDSEG;