{___________________________________________________________________________
FFIR.ASM	         Fast FFT-based FIR Filter

This routine demostrates the use of FFTs to implement fast FIR filters.  By
using FFT-based FIRs (FFIR) one can realize appreciable gains in speed at 
the expense of memory, complexity and filter delay.

An FFIR can be considered to be a fast convolution of an infinite duration
waveform (the sample input stream) and a finite duration waveform of size Q (the
tap coefficients).  We can perform the convolution by breaking-up the input
waveform into N-sized, overlapping chunks where N > Q and the size of the FFT
we will perform.  We are acquiring this chunks of data with the intention
of performing a fast circular convolution on each chunk and then re-combining
these chunks to form the convolved, FIR-filtered output data stream we desire.

To do the fast circular convolution, we perform an FFT on the data chunk,
multiply it by the Fourier transform of the tap coefficients with (N-Q) zeros
appended, and perform an inverse FFT on the result.  Note that the first
Q-1 samples of each circular convolution are "spoiled" for our purposes 
because of the wrap-around edge effect of the convolution.  Therefore, if we
overlap these chunks by Q-1 samples and throw away the Q-1 spoiled samples
from each chunk, we can construct the unbroken, FIR-filtered data stream we
desire.  This technique has been called the "overlap-save" method.

For a more complete discussion of FFT convolutions and the overlap-save 
technique, please refer to "The Fast Fourier Transform" by E. Oran Brigham
(Prentice-Hall:1974).

This particular routine has been optimized for speed.  Data for two FFTs is
collected before processing so that we can perform two real FFTs simultaneously
with one complex FFT (see "The Fast Fourier Transform").  This incurs extra
memory and filter delay costs.  These costs could be reduced at the expense of
speed by gathering data for one FFT at a time and employing a real FFT
algorithm.  If the incoming sample rate is known, the buffering memory 
requirements may also be able to be reduced.

This routine has been written to run on the 21020 EZ-lab board and works in the
following way:

1)  An interrupt routine periodically adds a sample to the circular input buffer
    *inaudio* of length N and reads a sample for output from the circular output
    buffer *outaudio*.  The buffer *outaudio* is of length 2*SEG where SEG is 
    N-Q+1, the number of good sample results per a FFT FIR-filter chunk.

2)  At every SEG sample boundry, which is marked by *outaudio* being half-full or
    full, N samples from *inaudio* are copied to *redata* or *imdata* respectively
    in preparation for processing.  This accomplishes the necessary Q-1 sample
    overlap (N-SEG = Q-1).  Also the SEG processed samples are being copied from
    the buffer*imfft* or *refft* respectively to the *outaudio* buffer for timely
    output by the interrupt routine;  the first Q-1 samples are ignored.  Scaling
    and fix=>float and float=>fix conversions are also done during data movement.

3)  After the data movement (2) at every 2*SEG boundry, which is marked by 
    *outaudio* being full, the input samples in *redata* and *imdata* are processed
    to yield the FIR-filter output in *imfft* and *refft*.  This consists of the
    following steps:

	a)  Perform the FFT on *redata* and *imdata* to produce their transform
	    in *refft* and *imfft*.  For the purposes of fast convolution, it 
	    is not necessary to fully calculate RFFT(*redata*) and 
	    RFFT(*imdata*);  The joint FFT(*redata* + j*imdata*) is sufficient.

	b)  Multiply the complex results of (a) with the FFT of the taps+zeros
	    which resides in *retrans* and *imtrans*.  (Note that this is a
	    complex multiply.)

	c)  Do a inverse FFT on the results of (b).  The inverse FFT is 
	    implemented by swapping the real and imaginary inputs, doing the 
	    forward FFT, swapping the real and imaginary again, and scaling
	    the results by 1/N (which is done in the data movement).  
	    Specifically, we put the imaginary results of (b) into *redata* and
	    the real results of (b) into *imdata* and perform FFT.  Then when we
	    move the data to the output buffer, we grab *imfft* first and then
	    *refft* later.

For this routine to work correctly, the program "twidrad4.c" must be used to
generate the real and imaginary twiddle factor tables, "tc.dat" & "ts.dat", for
the appropiate FFT length.  Also the program "ffircoef.c" must be used to
generate the real ("retrans.dat") and imaginary ("imtrans.dat") parts of the
transfer function ( FFT(taps+zeros) ) from the desired tap coefficients.

Author:	   Ronnin Yee, Analog Devices, SPD division, DSP group, (617) 461-3672
Revision:  11-JUN-91

Target Platform:
    21020 EZ-Lab board

Benchmarks:	FFT-based FIR, not including interrupt service routine.

    FFT Length	  Cycles per Sample	    Comments
    ----------	  -----------------	    --------
	64	    1243.5/(64-taps+1)	    better than direct FIR for taps > 19
       256	    5452.5/(256-taps+1)	    optimum at 9  <= taps < 42
      1024	   24877.5/(1024-taps+1)    optimum at 42 <= taps < 160 
      4096	  113230.5/(4096-taps+1)	

    where taps = the number of taps in the FIR filter

One should take the optimum ranges for each of the FFT sizes with a grain of
salt.  For instance, if one uses 256-point FFTs instead of 1K FFTs for a 60 tap
filter, there is only a 7% performance loss (1.9 more cycles per a sample) but
needing only one-fourth of the data memory requirement.  One must balance the 
increase in performance with the memory requirements.  The table below compares 
the performance of the FFT-based FIR with direct and radix-2 FIR filters.

			    (numbers below in cycles per a sample)

    FIR taps	    FFIR (FFT length)	radix-2 FIR    direct FIR
    --------	    ----------------	-----------    ----------
	10	      22.1  (256)	    23.0	    18
	20	      23.0  (256)	    30.5	    28
	50	      25.5  (1k)	    53.5	    58
       100	      26.9  (1k)	    90.5	   108
       200	      29.1  (4k)	   165.5	   208
       500	      31.5  (4k)	   390.5	   508
      1000	      36.6  (4k)	   765.5	  1008
					(.75N+15.5)	    (N+8)

Memory Usage:
    pm code = 301 words, pm data = 3.75*N words, dm data = 6.75*N - 2*taps +3


Other Files:
    ezlab21k.ach	21020 EZ-lab board architecture file
    ffir.h		contains preprocessor variable definitions
    _fftrad4.asm	assembly code for the radix-4 FFT
    twidrad4.c		generates tables for radix-4 FFT
    ffircoef.c		converts FIR taps into coefficients for FFIR


Assembler Preprocessor Variables:
    TAPS	Number of taps in the FIR filter
    PNTS	Number of points in the FFTs to be done.  Must be a power of
		    four, >= 64, > TAPS and <= 4096 (ez-lab memory constr.).
    LOGPNTS	=log2(PNTS)
    OST		=bitrev(32-bit PNTS/2)

These variables MUST be altered for different tap and/or FFT sizes.  If the
FFT size changes, tables and coefficients must also be re-generated by the 
appropiate programs.
____________________________________________________________________________}

#include "ffir.h"

{_____________________External Variables and Routines_______________________}
.EXTERN	    _fftrad4;       {Routine that performs the radix-4 FFT}
.EXTERN	    redata;	    {	Real data input	}
.EXTERN	    imdata;	    {	Imag data input	}
.EXTERN	    refft;	    {	Real FFT output	}
.EXTERN	    imfft;	    {	Imag FFT output	}

.SEGMENT/PM             pm_sram;
    {FFT of FIR coefficients}
.VAR	retrans[PNTS] = "retrans.dat";
.VAR	imtrans[PNTS] = "imtrans.dat";
.ENDSEG;

.SEGMENT/DM             dm_sram;
    {variables for the audio interface routine}
.VAR    inaudio[PNTS];                   {audio input buffer}
.VAR    outaudio[2*SEG];                {audio output buffer}
.VAR	inbufptr;
.ENDSEG;

  {These are the 2111 host port addresses}
.SEGMENT/DM             hip_regs;
.PORT   hdr0;			{in left audio}
.PORT   hdr1;			{in right audio}
.PORT   hdr2;			{out left audio}
.PORT   hdr3;			{out right audio}
.PORT   hdr4;			{2111 status}
.PORT   hdr5;			{2111 commands}
.ENDSEG;


{______________________reset service routine______________________}
.SEGMENT/PM             rst_svc; { program starts at the reset vector }
        PMWAIT=0x0021; {pgsz=0,pmwtstates=0,intrn.wtstates only}
        DMWAIT=0xa421; {pgsz=0,dmwtstates=0,intrn.wtstates only}
                        {except DM bank2 has 2 waitstates}

	bit set mode2 0x10;	{cache init workaround}
	read cache 0;
	bit clr mode2 0x10;

        jump ffir;
        nop;
        nop;

.ENDSEG;


.SEGMENT/PM             tmzh_svc;
	rti;
	nop;
	nop;
.ENDSEG;

{______________________2111 hip service routine______________________}
.SEGMENT/PM             irq3_svc;
        jump hip_rnt (db);
            bit set mode1 SRRFL|SRD1L; {set alternate DAG 0-3, R(0-7) registers}     
            bit clr mode1 BR0;    {make sure we are not bit reversing     }
.ENDSEG;

{__________________Interrupt button service routine__________________}
.SEGMENT/PM             irq2_svc;
	rti;
	nop;
	nop;
.ENDSEG;


.SEGMENT/PM             pm_sram;
ffir: 
        ustat1  =0x00000000;            {clears status of routines}
        imask   =IRQ3I;			{unmask irq2}
        mode2   =IRQ0E|IRQ1E|IRQ2E|IRQ3E|FLG1O|FLG2O|FLG3O;
					{edge-sen. interrupts, flag1-3 out}

    {perform init for hip_rnt}
        bit set mode1 SRD1L|SRRFL; {set alternate DAG 0-3, R(0-7) registers} 
        nop;        {set up buffers}
            b0=inaudio;  
            b2=outaudio; 
            l0=PNTS;
            l2=2*SEG;
            m0=1;
        bit clr mode1 SRD1L|SRRFL; {back to regular DAG 0-3, R(0-7) registers}       
        nop;

	bit clr astat FLG1|FLG2|FLG3;

espl:   IRPTL=0x0;                  {clear out interrupts}
        bit set mode1 IRPTEN;       {enable interrupts}   

wait_again:
	do wait until forever;
	    bit tst ustat1 OUTBH;
            if tf jump buf_half (la);
	    bit tst ustat1 OUTBF;
wait:	    if tf jump buf_full (la);

 
buf_half:   {move incoming data in preparation for next FFT and move the last}
	    { FFTs data to the output buffer.  Fix->Flt and Flt->Fix are also}
	    { performed.}
	bit clr ustat1 OUTBH;

    {move incoming data to *redata* FFT buffer}
	r0=dm(inbufptr);    {get the inaudio buffer pointer}
	b0 = inaudio;
	i0 = r0;
	l0 = PNTS;
	i1 = redata;
	l1 = 0;
	m1 = 1;
	r15=-32;

	r0=dm(i0,m1);
	lcntr=PNTS, do (pc,2) until lce;
	    f1=float r0 by r15,    r0=dm(i0,m1);
			    dm(i1,m1)=f1;

    {move outgoing data from *imfft* to *outaudio* buffer}
	b0 = imfft;
	i0 = imfft+TAPS-1;
	i1 = outaudio;
        l1 = 0;
	r15= -LOGPNTS+32;       {-LOGPNTS for 1/N inv FFT scaling}

	f0 = dm(i0,m1);
	lcntr=SEG, do (pc,2) until lce;
	    r1=fix f0 by r15,	f0=dm(i0,m1);
				dm(i1,m1)=r1;

	jump wait_again;	{go back to watching buffer flags}


buf_full:   {move incoming data, move outgoing data, do 1 complex FFT to  }
	    {  accomplish 2 real FFTs, multiply by the transfer function, }
	    {  do complex inv FFT. }
	bit clr ustat1 OUTBF;
	bit set ASTAT FLG1;	{turn-on usage meter}

    {move incoming data to *imdata* FFT buffer}
	r0=dm(inbufptr);    {get the inaudio buffer pointer}
	b0 = inaudio;
	i0 = r0;
	l0 = PNTS;
	m1 = 1;
	i8 = imdata;
	l8 = 0;
	m8 = 1;
	r15= -32;
			           r0=dm(i0,m1);
	    f1=float r0 by r15,    r0=dm(i0,m1);
	lcntr=PNTS-1, do (pc,1) until lce;
	    f1=float r0 by r15,    r0=dm(i0,m1),  pm(i8,m8)=f1;

	pm(i8,m8)=f1;

    {move outgoing data from *refft* to *outaudio*+SEG buffer}
	b0 = refft;
	i0 = refft+TAPS-1;
	i1 = outaudio+SEG;
	r15= -LOGPNTS+32;	 {-LOGPNTS for 1/N inv FFT scaling}

	f0 = dm(i0,m1);
	lcntr=SEG, do (pc,2) until lce;
	    r1=fix f0 by r15,	f0=dm(i0,m1);
				dm(i1,m1)=r1;

    {do the FFT}
	call _fftrad4;

    {multiply by the transfer function and prepare for the inv FFT}
    {note that all l registers are set to zero by *_fftrad4* call}
	i0=refft;
	i1=imfft;
	i2=redata;
	m1=1;

	i8=retrans;
	i9=imtrans;
	i10=imdata;
	m8=1;
					f0=dm(i0,m1),   f4=pm(i8,m8);
	    f8=f0*f4,			f1=dm(i1,m1),   f5=pm(i9,m8);
	    f12=f1*f5;	    
	    f9=f0*f5,	f11=f8-f12;
	    f13=f1*f4,			f0=dm(i0,m1),   f4=pm(i8,m8);
							
	lcntr=PNTS-1, do multil until lce;
	    f8=f0*f4,	f15=f9+f13,	f1=dm(i1,m1),   f5=pm(i9,m8);
	    f12=f1*f5,			dm(i2,m1)=f15,  pm(i10,m8)=f11;
	    f9=f0*f5,	f11=f8-f12;
multil:	    f13=f1*f4,			f0=dm(i0,m1),   f4=pm(i8,m8);

			f15=f9+f13;
					dm(i2,m1)=f15,  pm(i10,m8)=f11;
					{note switch of real and imag}

    {do the inv FFT, note that the inverse FFT is accomplished by	}
    {switching the real and imaginary parts, doing the FFT, switching	}
    {the real and imaginary parts again, and then scaling by 1/N.	}
	call _fftrad4;

	bit clr ASTAT FLG1; {turn-off usage meter}
	jump wait_again;	    {we are finished, going back to waiting}


{____________________________2111 hip service routine__________________________
This routine gets data from and puts data to the ADSP-2111.  More specifically
it:
        1) grabs audio data from the HDR0 in the 2111 hip
        2) puts proc. audio data into HDR2 in the 2111 hip

Defined alternate DAG registers are:
    b0,i0:      buffer and pointer to inaudio
    b2,i2:      buffer and pointer to outaudio
    m0:         1

USTAT1 bits:
    16:     output buffer half filled flag
    17:     output buffer filled flag

___________________________________________________________________________}
hip_rnt:
        r3=SEG;			    {some initializion}
        bit clr ustat1 0x0f0000;    {clear buffer flags}

    {get audio data from hip}
        r0=dm(hdr0);
        dm(i0,m0)=r0;

    {save *inbufptr* for main routine}
	r1=i0;
	dm(inbufptr)=r1;
        
    {put audio data to the hip}
        r0=dm(i2,m0);
        dm(hdr2)=r0;

    {set flags}
        r0=i2;
        r5=b2;
        r0=r0-r5;                   {find offset}
        if ne jump (pc,2);
            bit set ustat1 OUTBF;       {zero offset =>flag buffer filled}
        comp(r0,r3);
        if ne jump (pc,2);
            bit set ustat1 OUTBH;       {flag half buffer offset}

	pop sts;
	push sts;

        rti;
        nop;
	nop;

.ENDSEG;