
NASA Technical Reports Server (NTRS) 20080005023: Method and system for training dynamic nonlinear adaptive filters which have embedded memory



(12) United States Patent 

Rabinowitz 


US006351740B1 


(10) Patent No.: US 6,351,740 B1

(45) Date of Patent: Feb. 26, 2002


(54) METHOD AND SYSTEM FOR TRAINING 
DYNAMIC NONLINEAR ADAPTIVE 
FILTERS WHICH HAVE EMBEDDED 
MEMORY 

(75) Inventor: Matthew Rabinowitz, Palo Alto, CA 

(US) 

(73) Assignee: The Board of Trustees of the Leland 
Stanford Junior University, Palo Alto, 
CA (US) 

( * ) Notice: Subject to any disclaimer, the term of this 

patent is extended or adjusted under 35 
U.S.C. 154(b) by 0 days. 

(21) Appl. No.: 09/201,927 

(22) Filed: Dec. 1, 1998 

Related U.S. Application Data 

(60) Provisional application No. 60/067,490, filed on Dec. 1, 
1997. 


(51) Int. Cl.7 .......................... G06F 15/18

(52) U.S. Cl. .......................... 706/22; 706/25

(58) Field of Search .................. 706/22, 25

(56) References Cited 

U.S. PATENT DOCUMENTS 

4,843,583 A * 6/1989 White et al. .......... 708/322

5,175,678 A * 12/1992 Frerichs et al. ....... 700/47

5,272,656 A * 12/1993 Genereux .............. 706/22

5,376,962 A * 12/1994 Zortea ................ 706/22

5,542,054 A * 7/1996 Batten, Jr. ........... 706/22

5,548,192 A * 8/1996 Hanks ................. 318/560

5,617,513 A * 4/1997 Schnitta .............. 706/22

5,761,383 A * 6/1998 Engel et al. .......... 706/22

5,963,929 A * 10/1999 Lo .................... 706/22

6,064,997 A * 5/2000 Jagannathan et al. .... 706/22


OTHER PUBLICATIONS 

Ong et al, “A Decision Feedback Recurrent Neural Equalizer 
as an Infinite Impulse Response Filter”, IEEE Transactions 
on Signal Processing, Nov. 1997.* 


Rui J. P. de Figueiredo, "Optimal Neural Network Realizations of Nonlinear FIR and IIR Filters", IEEE International Symposium on Circuits and Systems, Jun. 1997.*

Yu et al, “Dynamic Learning Rate Optimization of the Back 
Propagation Algorithm”, IEEE Transactions on Neural Net- 
works, May 1995.* 

(List continued on next page.) 

Primary Examiner — George B. Davis 

(74) Attorney, Agent, or Firm — Lumen Intellectual 

Property Services, Inc. 

(57) ABSTRACT 

Described herein is a method and system for training nonlinear adaptive filters (or neural networks) which have embedded memory. Such memory can arise in a multi-layer finite impulse response (FIR) architecture, or an infinite impulse response (IIR) architecture. We focus on filter architectures with separate linear dynamic components and static nonlinear components. Such filters can be structured so as to restrict their degrees of computational freedom based on a priori knowledge about the dynamic operation to be emulated. The method is detailed for an FIR architecture which consists of linear FIR filters together with nonlinear generalized single layer subnets. For the IIR case, we extend the methodology to a general nonlinear architecture which uses feedback. For these dynamic architectures, we describe how one can apply optimization techniques which make updates closer to the Newton direction than those of a steepest descent method, such as backpropagation. We detail a novel adaptive modified Gauss-Newton optimization technique, which uses an adaptive learning rate to determine both the magnitude and direction of update steps. For a wide range of adaptive filtering applications, the new training algorithm converges faster and to a smaller value of cost than both steepest-descent methods such as backpropagation-through-time, and standard quasi-Newton methods. We apply the algorithm to modeling the inverse of a nonlinear dynamic tracking system 5, as well as a nonlinear amplifier 6.

25 Claims, 15 Drawing Sheets 

















OTHER PUBLICATIONS 

White et al., "The Learning Rate in Back-Propagation Systems: an Application of Newton's Method", IEEE IJCNN, May 1990.*

Nobakht et al., "Nonlinear Adaptive Filtering using Annealed Neural Networks", IEEE International Symposium on Circuits and Systems, May 1990.*

Pataki, B., "Neural Network based Dynamic Models", Third International Conference on Artificial Neural Networks, IEEE, 1993.*

Dimitri P. Bertsekas, "Incremental Least Squares Methods and the Extended Kalman Filter", IEEE Proceedings of the 33rd Conference on Decision and Control, Dec. 1994.*

Puskorius et al., "Multi-Stream Extended Kalman Filter Training for Static and Dynamic Neural Networks", IEEE International Conference on Systems, Man, and Cybernetics, Oct. 1997.*

Back et al., "Internal Representation of Data in Multilayer Perceptrons with IIR Synapses", Proceedings of the 1992 IEEE International Conference on Circuits and Systems, May 1992.*

Sorensen, O., "Neural Networks Performing System Identification for Control Applications", IEEE Third International Conference on Artificial Neural Networks, 1993.*

Back et al., "A Unifying View of Some Training Algorithms for Multilayer Perceptrons with FIR Filter Synapses", Proceedings of the 1994 IEEE Workshop on Neural Networks for Signal Processing, Sep. 1994.*

Sorensen, O., "Neural Networks for Non-Linear Control", Proceedings of the Third IEEE Conference on Control Applications, Aug. 1994.*

* cited by examiner 



U.S. Patent, Feb. 26, 2002, US 6,351,740 B1, Sheets 1 through 15

[Drawing sheets omitted. Recoverable labels: FIG. 1 (composed of FIGS. 1A and 1B; FIG. 1B shows "layer l"); FIG. 7 (reference numeral 24); FIG. 12; FIG. 14; FIG. 15 (axes labeled "magnitude (dB)"); FIG. 16 (reference numeral 4); other sheets show axes labeled "weight magnitude" and "estimation error".]


METHOD AND SYSTEM FOR TRAINING 
DYNAMIC NONLINEAR ADAPTIVE 
FILTERS WHICH HAVE EMBEDDED 
MEMORY 


I. CROSS-REFERENCE TO RELATED 
APPLICATIONS 

This application claims priority from U.S. Provisional 
Patent Application No. 60/067,490 filed Dec. 1, 1997, which 
is incorporated herein by reference. 

This invention was reduced to practice with support from
NASA under contract NAS8-36125. The U.S. Government
has certain rights in the invention.

II. FIELD OF THE INVENTION 

This invention relates to neural networks or adaptive 
nonlinear filters which contain linear dynamics, or memory, 
embedded within the filter. In particular, this invention 
describes a new system and method by which such filters can 
be efficiently trained to process temporal data. 

III. BACKGROUND OF THE INVENTION 

In problems concerning the emulation, control or post- 
processing of nonlinear dynamic systems, it is often the case 
that the exact system dynamics are difficult to model. A 
typical solution is to train the parameters of a nonlinear filter 
to perform the desired processing, based on a set of inputs 
and a set of desired outputs, termed the training signals. 
Since its discovery in 1985, backpropagation (BP) has
emerged as the standard technique for training multi-layer
adaptive filters to implement static functions, to operate on
tapped-delay line inputs, and in recursive filters where the
desired outputs of the filters are known [1, 2, 3, 4, 5]. The
principle of static BP was extended to networks with embedded
memory via backpropagation-through-time (BPTT), the
principle of which has been used to train network parameters
in feedback loops when components in the loop are modeled
[6] or unmodeled [7]. For the special case of finite impulse
response (FIR) filters, of the type discussed herein, the
BPTT algorithm has been further refined [8]. Like BP, BPTT
is a steepest-descent method, but it accounts for the outputs
of a layer in a filter continuing to propagate through a
network for an extended length of time. Consequently, the
algorithm updates network parameters according to the error
they produce over the time spanned by the training data. In
essence, BP and BPTT are steepest descent algorithms,
applied successively to each layer in a nonlinear filter. It has
been shown [9] that the steepest descent approach is locally
H∞ optimal in prediction applications where training inputs
vary at each weight update, or training epoch. However,
when the same training data is used for several epochs,
BPTT is suboptimal, and techniques which generate updates
closer to the Newton update direction (see section X) are
preferable. We will refer to such techniques, which generate
updates closer to the Newton update direction, as Newton-like
methods.

Since steepest-descent techniques such as BPTT often
behave poorly in terms of convergence rate and error
minimization, it is an object of this invention to create a
method by which Newton-like optimization techniques can
be applied to nonlinear adaptive filters containing embedded
memory for the purpose of processing temporal data. It is
further an object of this invention to create an optimization
technique which is better suited to training an FIR or IIR
network to process temporal data than classical Newton-like
[10] techniques. It is further an object of this invention to
create multi-layer adaptive filters which are tailor-made for
specific applications, and can be efficiently trained with the
novel Newton-like algorithm.

IV. BRIEF DESCRIPTION OF DRAWINGS 

FIG. 1 provides a block diagram according to which the 
components of the invention may be combined in a preferred 
embodiment. 

FIG. 2 displays the general architecture for a nonlinear 
multi-layered FIR network with restricted degrees of com- 
putational freedom using separate static nonlinear and linear 
dynamic components. 

FIG. 3 displays a generic nonlinear IIR filter architecture,
for which the next states and the output are assumed to be
some static function of the current states and the current
inputs.

FIG. 4 displays a simple two-weight polynomial-based
network without memory.

FIG. 5 displays the weight sequence over 100 epochs of 
training the network of FIG. 4 using BP (x) and the adaptive 
modified Gauss-Newton algorithm (o) respectively. Con- 
tours of constant cost are shown. 

FIG. 6 displays the weight sequence over 100 epochs of 
training the network of FIG. 4 using BP (top line) and the 
adaptive modified Gauss-Newton algorithm with update 
steps constrained to a magnitude of 0.1. 

FIG. 7 displays a block-diagram of a generic tracking 
system with static nonlinearities embedded within linear 
dynamics, and parasitic dynamics on the feedback path. 

FIG. 8 displays the FIR network employed to linearize the 
nonlinear tracking system of FIG. 7 and equ. (43) 

FIG. 9 displays a technique for acquiring training data by 
exciting the tracking system. 

FIG. 10 displays the RMS error of the network of FIG. 8 
over 100 epochs using the adaptive modified Gauss-Newton 
algorithm, and over 200 epochs using BPTT, Kalman 
Filtering, the Gauss-Newton technique, and BFGS with a 
line search. 

FIG. 11 displays the learning rate, μ_k, over 100 training
epochs for the network of FIG. 8 using the adaptive modified
Gauss-Newton algorithm.

FIG. 12 displays the trained weights 42 of the FIR filter 
at layer 2 of path 1 in FIG. 8. 

FIG. 13 displays on the left the error produced by an 
analytical inversion of the linearized tracking system of FIG. 
7. This illustrates the extent of the error due to nonlinear 
distortion. On the right is shown the error in the output of the 
trained network of FIG. 8. 

FIG. 14 displays the Wiener model filter used to emulate 
a nonlinear audio amplifier. 

FIG. 15 displays a method by which data is gathered for 
training the filter of FIG. 14. 

FIG. 16 displays the spectral response of the nonlinear 
amplifier to a dual tone signal. 

FIG. 17 displays the spectral response of the nonlinear 
amplifier to a prewarped dual tone signal. 

V. SUMMARY OF THE INVENTION 

This invention concerns a method for applying a Newton- 
like training algorithm to a dynamic nonlinear adaptive filter 
which has embedded memory. FIG. 1 provides a block 
diagram according to which the components of the invention 






may be combined in a preferred embodiment. The solid lines
indicate the flow of data between steps; the dashed lines
indicate the flow of control between steps. Initially, 47, the
dynamic nonlinear system to be emulated by the filter is
decomposed into linear dynamic elements and nonlinear
elements. Based upon this decomposition, a filter
architecture, h 48, consisting of adaptable static nonlinear
components and adaptable linear components, is tailor-made
49 for the desired dynamic operation 47. Next 51,
using an initial set of filter parameters w 53, and architecture
h 48, a training input sequence {u_n} 50 is propagated
through the filter to obtain an output sequence {y_n} 52. An
error signal {e_n} 55 is created by differencing {y_n} 52 and
the desired output sequence {y_n} 56. Using h 48, {e_n} 55,
and w 53, a novel technique 57 is used to construct a single
matrix ∇h_w 58 which relates each network parameter to the
error it produces over the full sequence {e_n} 55. Using ∇h_w
58, {e_n} 55, and a learning rate μ, a novel Newton-like
update algorithm 59 is used to determine a weight update
vector Δw 60. A temporary parameter set w_t 62 is created by
summing Δw 60 and w 61. Using w_t 62 and h 48, the training
input sequence {u_n} 50 is again propagated 63 through the
filter to obtain output sequence {y_t} 64. {e_t} 66 is created
by 65 differencing {y_t} 64 and {y_n} 56. Costs 67, 68 are
computed from the error sequences 55, 66, such as J = Σ_n e_n²
and J_t = Σ_n e_{t,n}². If J > J_t, μ 70 is increased 69, the new w is set
equal to w_t 69, and the process is commenced at block 51.
Otherwise, μ 71 is decreased 71 and the process commences
at 59.
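The accept/reject loop just described can be sketched as a short program. This is a minimal illustration under stated assumptions, not the patented implementation: the one-weight linear filter, the steepest-descent stand-in for update block 59, and the factor names E_A and E_R (playing the roles of e_a and e_r described later) are all hypothetical.

```python
# Sketch of the FIG. 1 control flow: propagate, update, trial, accept/reject.

def cost(e):
    # J = sum_n e_n^2, the squared-error cost over the whole sequence
    return sum(x * x for x in e)

def train(u, y_des, w, mu=0.1, epochs=100, E_A=2.0, E_R=0.5):
    """Adaptive-rate loop: grow mu after an update that lowers cost,
    shrink mu (and discard the update) after one that raises it."""
    h = lambda w_, u_: [w_ * un for un in u_]        # toy one-weight linear filter
    for _ in range(epochs):
        e = [yd - yn for yd, yn in zip(y_des, h(w, u))]      # error signal {e_n}
        J = cost(e)
        # steepest-descent surrogate for the Newton-like update block 59
        dw = mu * sum(en * un for en, un in zip(e, u))
        w_t = w + dw                                  # temporary parameter set
        e_t = [yd - yn for yd, yn in zip(y_des, h(w_t, u))]  # trial error {e_t}
        if J > cost(e_t):        # update helped: keep it, raise mu
            w, mu = w_t, mu * E_A
        else:                    # update hurt: discard it, lower mu
            mu *= E_R
    return w
```

Doubling μ after a successful epoch and halving it after a failed one is one arbitrary policy; any augmentation/reduction rule keyed to the comparison of the two costs serves the same role.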

In section VI is described the novel Newton-like algorithm
for updating the parameters of a network with embedded
memory. The technique is based on a modification of the
Gauss-Newton optimization method. The modification
involves an adaptive parameter which ensures that the
approximation of the cost's Hessian matrix is well-conditioned
for forming an inverse. In section VII is
described a technique for relating the output error of a
nonlinear filter with embedded FIR dynamics to the filter
parameters, such that a Newton-like technique can be
applied. This method involves creating a single matrix
which relates all the network parameters to all the error they
generate over the full time sequence. In section VIII is
described a technique for relating the output error of a
general nonlinear IIR filter to the filter parameters, such that
a Newton-like technique can be applied. Section IX discusses
the superior behavior of the adaptive modified Gauss-Newton
method over steepest-descent methods such as
BPTT. In section X, we describe how a dynamic nonlinear
adaptive filter may be tailor-made for a particular problem,
and trained by the new algorithm. In this section, the
performance of the algorithm is compared with that of other
optimization techniques, namely BPTT, Kalman Filtering, a
Gauss-Newton method, and the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) method, which are known to those skilled in
the art.

VI. The Adaptive Modified Gauss-Newton 
Algorithm 

In order to clearly explain the invention, this section
describes the new method, as well as the rationale behind it.
In a preferred embodiment, we assume that a training input
sequence {u_n} 50, n ∈ {1 . . . N}, is propagated 51 through
the adaptive network. We consider the network to be represented
by a known nonlinear filter operator, h 48, which is
determined solely by the network architecture. h 48 operates
on (w, u) 50, 53, where u is an input vector of length N and
w is a vector of all the adaptable parameters in the network.


The vector which is created from the outputs of the network
{y_n} 52 in response to the training sequence 50 is denoted
h(u, w).

At each training epoch, which we denote with subscript k,
we find an iterative update of the parameter vector, w_k. At
epoch k, one may describe the sequence of desired training
outputs as a vector:

y_k = h(w*, u_k) + v_k   (1)

where w* is the vector of ideal weights which training seeks
to discover, u_k is a vector of inputs at epoch k, and v_k is a
vector of unknown disturbances. If the data in u_k and y_k are
sampled from real-world continuous signals, then v_k can be
regarded as a vector of noise on the ideal system inputs or
sampled outputs. v_k can also be regarded as error resulting
from the architectural imperfections of the network, such as
emulating IIR linear dynamics with an FIR structure.
Alternatively, in control applications, v_k could result from
the physical impossibility of a control law bringing all states
instantaneously to 0 in a stable system. The subscript k on
y_k and u_k indicates that the inputs and desired output set can
be changed at each epoch; however, in many applications
they are held constant over several epochs. The invention
involves an updating strategy F, which iteratively determines
the current weight vector from the previous weight
vector and the input-output truth model, w_k = F(w_{k−1}, u_k, y_k).
To select amongst updating strategies, we define the following
cost function:

J(w_k, u_k) = (1/2)(h(w*, u_k) − h(w_k, u_k))^T (h(w*, u_k) − h(w_k, u_k))   (2)

where e_fk = h(w*, u_k) − h(w_k, u_k) is the vector of filtered errors, or errors in the
absence of the disturbance v_k. Then, we seek that F which
minimizes J(F(w_{k−1}, u_k, y_k), u_k).

Note that the inputs to F restrict our exact knowledge of
J's variation with respect to the weights to the first
derivative, ∇_w J. Additional measurements can be performed
on a network to estimate higher derivatives of J at
each epoch; however, the performance of such methods for
our class of problem does not justify their added computational
complexity.

We determine F by expanding the filter function h around
the current value of the weights vector. By invoking the mean
value theorem, we can describe the filter outputs at epoch k:

h(w_k, u_k) = h(w_{k−1}, u_k) + ∇_w h(w_{k−1}, u_k)^T (w_k − w_{k−1}) + (1/2)(w_k − w_{k−1})^T ∇_w² h(w̄_{k−1}, u_k)(w_k − w_{k−1})   (3)

where w̄_{k−1} is some weight vector for which each component
lies between the corresponding components of w_{k−1} and
w_k. Note that the notation in (3) is inexact: the term
∇_w² h(w̄_{k−1}, u_k) is not a matrix in this context, but a tensor.
Note also that the matrix ∇_w h(w_{k−1}, u_k), which contains the
derivatives of the outputs with respect to all parameters in
the network over the full time sequence, is difficult to
calculate for nonlinear FIR and IIR filters which contain
embedded memory. This will be discussed in the next
section. Suppose now, that at every epoch, we apply the old
weight vector to the current inputs and compare the filter
outputs with the current desired set of outputs. The measured
error is:

e_mk = y_k − h(w_{k−1}, u_k)   (4)

and in a preferred embodiment, the cost function we actually
measure is

J_mk = (1/2) e_mk^T e_mk.   (5)
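In code, the measured quantities of equations (4) and (5) are a one-line subtraction and a half sum of squares. A minimal sketch; the filter outputs and desired outputs below are made-up numbers for illustration:

```python
# Measured error and cost at one epoch, per equations (4) and (5):
#   e_mk = y_k - h(w_{k-1}, u_k),   J_mk = 1/2 e_mk^T e_mk.

def measured_error(y_k, h_out):
    return [y - o for y, o in zip(y_k, h_out)]

def measured_cost(e_mk):
    return 0.5 * sum(e * e for e in e_mk)

h_out = [1.0, 2.0, 3.0]            # h(w_{k-1}, u_k), the filter outputs
y_k   = [1.5, 1.0, 3.0]            # desired outputs at epoch k
e_mk  = measured_error(y_k, h_out)
J_mk  = measured_cost(e_mk)
```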



US 6,351,740 B1 


5 

The filtered error, e^, Which we seek to minimize, is not 
directly measurable. However, using equation (3) and (4) we 
can describe this filtered error in terms of the measured error 
as 

er e m rV>(^_ 1; Uff{w k -w k _^ -v yt - 1 / 2 (>v yt ->v yt _ 1 ) r V (V 2 /?(iv Jfe _ 1 , 

OK-Wjfc-i) (6) 

Consider the last term on the RHS involving the second 
derivative of h at w k _ 1 . This indicates how far h(w^, %) is 
from a first-order approximation at w yt _ 1 . We assume that 
this term->0 as the weights converge. If we further make the 
strong assumption that the disturbance term v k is negligible, 
then we drop these terms from the estimation of and 
approximate cost as 

J( w k, Uk) =1 6(e m k-V u k ) r (w'j fe -Wj fe _ 1 )) r (e / „j fe - V ^Ji(w k _ l9 

0 T {Wk~Wk-i)) ( 7 ) 

We can calculate the derivative of this cost with respect to 

w*. 

^ J(^k-i)=~^ Jt(^k-i, u k )(e mi r V w h(w k _i, (8) 

The invention introduces the novel criterion that the 
update step taken, Aw^^w^-w^^ be parallel with the 
gradient of the quadratic cost estimate at the new location, 
w*, to which we are stepping: 

Aw*-! = -ilk V w J (w k , u k ) ( 9 ) 


6 

is poorly conditioned. In the context of the invention, the 
latter phenomenon is particularly prevalent when the train- 
ing signal is narrowband, or the network architecture has 
excessive degrees of freedom. To ameliorate these effects, 
the Gauss-Newton algorithm can be modified by adding 
some positive definite matrix to (14) before taking the 
inverse, with the objective of improving the condition 
number, and the convergence characteristics. This invention 
describes an adaptive method for controlling the condition- 
ing of the inverse in (10) using the learning rate ju k . 

If /u k is too large, this can negatively impact the update 
direction for the reasons discussed above; if ju k is set too 
small, convergence is retarded. Therefore, we determine /u k 
adaptively by successive step size augmentation and reduc- 
tion. In a preferred embodiment // is augmented by some 
factor, e a , after each update of the weights which diminishes 
cost, and reduced by some factor, e r , after each update which 
enlarges cost. It will be noted by one skilled in the art that 
other methods for augmenting and reducing // based on the 
performance of each weight update can be used, without 
changing this essential idea in the invention. 

One skilled in the art will also recognize that for complex 
cost functions, AMGN can be trapped in non-global minima 
of cost. One method of ameliorating this problem is to 
multiply each weight update by some term proportional to 

ma x(J mk -J\ 0) 

\\V w h(w k _ l , u k )e mk || 2 


= ^i k V w h(w k - l , v k )(e mk -V w h(w k - l , u k y Aw*_i) 

where we have introduced ju k , as a constant of proportion- 
ality. In a linear filter, this criterion would encourage updates 
which move more directly towards the quadratic cost mini- 
mum than a steepest descent update. For the invention, since 
ju k is a learning rate which can be freely adjusted, we solve 
(9) for Aw*.,: 


. = (— +V W %_ 1 , ju*)V w A(w*_i, 
VHk 

V w h(w k _ l , fi k )e mk 


where J mk is the measured cost at epoch k, J* is some 
predetermined good enough value of cost, and || \\ 2 denotes 
the 2-norm of the vector. In this way, if the parameters 
approach a local minimum, but the weight updates 

35 will grow large so that the local minimum can be escaped. 
This and many other techniques for avoiding local minima 
may be used to augment AMGN without changing the 
essential idea. 

VII. Calculating the Term VJti(w k _ l7 %) for a Nonlinear 
40 FIR Filter 

This section addresses the calculation of V vv h(w^_ 1 , %) 
for a general nonlinear FIR network which can be decom- 
posed into static nonlinear and linear dynamic components, 


Equ. 10 is the Adaptive Modified Gauss-Newton 
algorithm, which will henceforth be referred to as AMGN. 
To place this aspect of the invention in context with respect 
to classical optimization, consider Newton’s iterative tech- 
nique applied to the new set of inputs at each epoch: 

w*=w A _i-/« A (V Mr 2 /(w Jfc _i, O) «*) (li) 

Since the Hessian of the cost V vv 2 J(w A: _ 1 ,%) is difficult to 
calculate, the Gauss-Newton technique uses an estimation of 
the Hessian of the cost, which for the cost of equ.(2) would 
be 


such as that displayed in FIG. 2. Each layer 7 of this network 
consists of a linear FIR filter 2 feeding into a nonlinear 
generalized single-layer subnetwork (GSLN) 3 which gen- 
erates the weighted sum of a set of static nonlinear functions. 
A GSLN refers to any weighted sum of nonlinear functions 
[f ± . . . f v \ where the weights [x 1 . . . x v ] are adjustable 
parameters. 

Since the initial states of the FIR filters 2 are unknown, we 
must wait until our known set of inputs 8 have propagated 
through the network before useful training data is generated 
at the output 9. Therefore, the output sequence 9 which we 
use to train the filter will be of length 


V 2 J(w*_!, u k ) = V 2 Mwjfc-i, u k )e fk + (12) 

V w h(w k ~ l , ; u k )V w h(w k - l , fi k f 

* V w h(w k _ l ,fi k )V w h(w k _ l ,fi k ) T (13) 

This approximation is sound as one converges to a mini- 
mum at which e^.43 0 as k->oo. Problems arise, however, 
when V vv 2 h(w Ar _ 1 , %)e^ is not negligible, or when the matrix 

V>K_ 1; u k )V Ji(w k _ 1} u k ) T (14) 



where M^ ' is the number of taps of the FIR filter at layer 
1 of path p, and the sup operation selects that path with the 
most cumulative delays. One may consider the multi-layer 
65 filter of FIG. 2 to be represented by a known nonlinear filter 
operator, h, which is determined solely by the network 
architecture, h operates on (w, u) where u is a vector of 



US 6,351,740 B1 


8 


7 

length N containing all the inputs 8 and w is a vector, of 
length a p,i = A p,i x it) p,i (19) 


YjW-' + v), 

p,l 


5 


■■■ a 


p,i 

NP’ 1 


\T 


(20) 


which contains every adaptable parameter in the network. 
The output sequence 9 is stored in a vector of length 


N - su p\ Yj M P,i 


+ 1 


10 


which we denote h(w, u) as described above. 15 

The matrix term V vv h(w yt _ 1 , u*), which contains the 
derivative of each element of the output sequence with 
respect to each parameter in the network, is easy to calculate 
in a one-layered network (where L=l). In this case, each row 
is simply the sequence of inputs that excite the correspond- 20 
ing weight in w k _ 1 . We will now describe how this term is 
found in a multi-layered network such as FIG. 2, with FIR 
filters 2 and nonlinear components 3 at hidden layers. All 
indication of the epoch k will be abandoned in this section 
for economy of notation. Keep in mind that the network 25 
parameters and output sequences are epoch-dependent, how- 
ever. 

Firstly, we need to detail notation for the propagation of 
data through the layers of the network. Consider a single 
layer 1 7 in the path p of the filter of FIG. 2. For the sake of 30 
clarity, we have boxed this layer in the figure. In isolation, 
the single layer implements a Wiener-type filter [11]. The 
layer has two sets of weights, those determining the output 
of the FIR filter 

35 

. . w M pf’ l Y (15) 

and those determining the weighting of each function of the 
volterra node 

x pJ =[xY l . . . x/’T- (16) 40 

Imagine the output of the GSLN of layer 1-1 is described 
by the matrix 


F p ’ l ~ l = [Ff'- 1 - F p n 1 YU] T 


(17) 45 


where W’ 1 1 is the length of the vector of outputs from layer 
1-1 of path p: 



where x is a matrix multiplication. This output vector is used 
in turn to construct the matrix, W 1 , of inputs which excite 
the weights of the GSLN 3 of layer 1 7: 


/iK’O h{4’ 1 ) "■ fv{a P i l ) 

/iKO fi(4 J ) 

hKii) fy Kii) 


The output 10 of the GSLN is then found by the weighted 
summation: 


F pJ = B pl x xF’ 1 


(22) 




F U 


(23) 


In a similar fashion, we can describe the gradient of the 
function implemented by the GSLN 3 for each set of inputs 
according to: 


F'P’ 1 =B' p ’ l xx pJ 


where 


B /pJ = 


/l'K’O /2KO 

/i'K’0 f{(4’ 1 ) 


- m i ) 

■■■ fv{ aP N l pj) 


(24) 

(25) 


(26) 


We can begin to describe the calculation of V w h(w^_ 1? %) 
by recognizing, as would one skilled in the art, that this term 
is trivial to find in a one-layer linear network. Consequently, 
we seek to condition the inputs to each network layer so that 
weights of hidden layers 1={1 . . . L-l} can be trained 
similarly to the weights of the output layer L. This is 
achieved by pre-multiplying the inputs to each layer by the 
differential gain which that layer’s output 10 sees due to 
subsequent layers in the network. To illustrate this idea, 
consider the function \f’ 1 to represent the reduced filter 
function for a single layer, i.e. 


We can create a matrix of the sequential activation states 
of the FIR filter 2 at layer 1 7, 


F mpJ 

F p,t - 1 
( MP’ l -l ) 

■ ■ ft 1 

F p,l - 1 

( mp 4 + 1 ) 

F p,l - 1 
F MP’ 1 


f p,i- 1 

F NP ’ l ~ 1 


F p,l - 1 

■■ F np1 


The output of the FIR filter 2 at this layer 7 can then be 
found according to: 


55 hP’ l (w k _f l , u k )=F p ’ 1 (27) 

To find V yv p,ih p,l (w k _ 1 p,! , u^), the gradient of this layers’ 
outputs 10 with respect to each of the weights in the layer, 
we pre-amplify the matrix of filter states, AF’\ by the 
60 gradient of the nonlinear function 3 which corresponds to 
each row of FIR filter states. In this way, given the weights 
vector for the whole layer, w p, =[(o p,/ x p,! ] T , and the set of 
inputs 8 at epoch k, we can determine 

65 u k )=[A pJ .F' p ’ 1 B p ’ n (28) 


generating the matrix 



9 


US 6,351,740 B1 


10 


F MP.‘ F ‘ 

ppj- 1 p'pj 

F (MP>‘- i) Fl ■' 

p,p,l—l P’WJ 

/.«) • 

•• /vK) 

(29) 

F (MPf 1) F 2 

T-,p,l—i rp>p,l 

F MPJ F 2 

T-,p,l—l T-’fpJ 

■ r 2 r 2 

/l(«2 J ) • 

- fv(a pJ ) 


P’PJ-I f’p,1 


p,i— l p’/pj 

NP’ 1 NP’ 1 

/.«.) • 

■■ 



Note that the operation in (28) multiplies each row of A^ j/ 
with the element in the corresponding row of F p ’ 1 . The 
concept applied to (28) can be extended to the entire filter of
FIG. 2 if we realize that the outputs of layer l, which excite
the FIR filters 2 at layer l+1, do not see a static differential
gain due to the subsequent layer, but rather a time-varying
gain as the outputs propagate through each tap in the filters
of layer l+1, as well as subsequent layers. This time-varying
gain can be taken into account by convolving the inputs to
each layer with the weights of the FIR filters 2 at subsequent
layers, and then multiplying the convolved inputs by the
gradient due to the nonlinear GSLN 3 at the subsequent
layer. We define the convolution operation as follows.
Given any set of weights for an FIR filter 2,

  ω = [ω_1 ω_2 . . . ω_β]    (30)

and a matrix of inputs

  F = [F_N F_{N−1} . . . F_{N−β+1}]    (31)

we have the convolution of F with ω,

  C_ω(F) = [ Σ_{i=1}^{α} F_{β+1−i} ω_{α−i+1}
             Σ_{i=1}^{α} F_{β+2−i} ω_{α−i+1}
                 .
                 .
                 .
             Σ_{i=1}^{α} F_{N−α+i} ω_{α−i+1} ]    (32)

where the dimensions of C_ω(F) are (N−β−α+2)×β. This
convolution operation is sequentially applied to each layer to
account for the total time-varying differential gain which the
outputs of a layer see due to all subsequent layers in the path.
Hence, we can describe the derivative of the filter operator
for the full path p with respect to the weight vector for
layer l, w^{p,l}, operating on the current set of weights and
inputs, as follows:

  ∇_{w^{p,l}} h_p(w_p, u) = C_{w^{p,L_p}}(C_{w^{p,L_p−1}}( . . . C_{w^{p,l+1}}(U^{p,l} F^{p,l}) . . . ) F^{p,L_p−1}) F^{p,L_p}    (33)

where L_p is the number of layers in path p (refer to FIG. 2).
Now, by combining matrices like (33) for the weights of
each layer, we can extend our calculation to find the gradient
with respect to the weights vector of an entire path p as
follows:

  ∇_{w_p} h_p(w_p, u) = [∇_{w^{p,1}} h_p(w_p, u)  ∇_{w^{p,2}} h_p(w_p, u)  . . .  ∇_{w^{p,L_p}} h_p(w_p, u)]    (34)

It is trivial to extend this to the multiple paths in the
network. We combine the matrices of weights for each path
to form a vector of weights for the entire network of FIG. 2,
with P total paths,

  w = [w_1 . . . w_P]    (35)

The derivative of the full network operator with respect to
w, acting upon the current weights, w, and the inputs, u, is
then the convolution of F with ω extended across paths:

  ∇_w h(w, u) = [∇_{w_P} h_P(w_P, u)  ∇_{w_{P−1}} h_{P−1}(w_{P−1}, u)  . . .  ∇_{w_1} h_1(w_1, u)]    (36)

With this matrix in hand, we are now able to implement
AMGN for the dynamic nonlinear network of FIG. 2. The
matrix ∇_w h(w, u) also enables one skilled in the art to apply
other Newton-like optimization methods, such as BFGS, 
Kalman Filtering and Gauss-Newton to an FIR network with 
embedded memory, as is illustrated in section X. Note that
the method described naturally corrects each weight for the
error it produces over the entire sequence of data. While the
technique has been elucidated for the case of an FIR filter,
a similar approach can be used to train IIR filters by
propagating data through the network for a finite interval,
long enough to capture the relevant system dynamics.
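For a single FIR filter, the matrix ∇_w h(w, u) reduces to the Toeplitz matrix of delayed inputs, so a single Gauss-Newton step solves the linear least-squares problem exactly. The sketch below illustrates this base case only; the names, sizes and data are illustrative, and the full network additionally chains convolutions through subsequent layers as described above.

```python
import numpy as np

# Sketch: for one FIR filter y_n = sum_{j=1}^{beta} w_j u_{n-j+1},
# the matrix Vwh is the Toeplitz matrix of delayed inputs, and a
# single Gauss-Newton step recovers the least-squares weights.

def fir_jacobian(u, beta):
    """N x beta matrix whose (n, j) entry is u_{n-j}, zero before excitation."""
    N = len(u)
    Vwh = np.zeros((N, beta))
    for j in range(beta):
        Vwh[j:, j] = u[: N - j]
    return Vwh

rng = np.random.default_rng(0)
u = rng.standard_normal(200)
w_true = np.array([0.5, -0.3, 0.2])
Vwh = fir_jacobian(u, 3)
d = Vwh @ w_true                       # desired output sequence

w = np.zeros(3)                        # initial weights
e = d - Vwh @ w                        # error over the entire sequence
w = w + np.linalg.solve(Vwh.T @ Vwh, Vwh.T @ e)   # Gauss-Newton update
```

Because the single-filter problem is linear in w, the update lands on the least-squares solution in one step; the multi-layer case is only locally quadratic, which is where the adaptive learning rate of AMGN matters.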

VIII. Calculating the Term ∇_w h(w_{k−1}, u) for a
General IIR Filter Architecture

In this section, the extension of the algorithm to nonlinear
IIR filters is discussed. The approach described here is very
general and applies to a vast range of filter architectures.
While the approach described in this section could also be
applied to an FIR filter as in FIG. 2, it is less computationally
efficient for that problem than the approach described in the
previous section. The generic architecture for the filter of
this discussion is displayed in FIG. 3.



US 6,351,740 B1

The system has M states 12, represented at time n by the
state vector a_n = [a_{1n} . . . a_{Mn}]^T. We assume that the subse-
quent value of each state in the filter or system is some
function 14 of the current states 12, the inputs 13 and the set
of parameters within the network:

  a_{n+1} = [a_{1,n+1} . . . a_{M,n+1}]^T = [f_1(a_n, u_n, w) . . . f_M(a_n, u_n, w)]^T    (37)

where w = [w_1 . . . w_V]^T is a vector of length V, containing all
adaptable parameters in the network, and once again {u_n},
n = 1 . . . N, 13 is the input sequence to the system. We assume
that each output 16 of the filter or system is generated by the
function 15 y_n = f_o(a_n, u_n, w). For the preferred embodiment,
we assume that the states of the system 12 are all 0 before
excitation with the input sequence 13 {u_n}. The sequence of
outputs is then stored in a vector of length N which we
denote h(w, u) as described above. The task, as in the
previous section, is to determine the matrix ∇_w h(w, u).
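As a numerical sanity check on ∇_w h(w, u), one can run the backward accumulation presented below in this section on a minimal one-state linear filter (M = 1, V = 2) and compare against central finite differences. The filter and all names below are hypothetical, chosen only to keep the sketch short.

```python
import numpy as np

# Hypothetical one-state filter: a_{i+1} = w0*a_i + w1*u_i, y_i = a_i,
# with the state zero before excitation. H accumulates Vwh by the
# backward recursion of this section; a finite-difference Jacobian
# serves as an independent check.

def outputs(w, u):
    N = len(u)
    s = np.zeros(N)                       # s[i] holds state a_i (s[0] = 0)
    for i in range(1, N):
        s[i] = w[0] * s[i - 1] + w[1] * u[i - 1]
    return s                              # y_i = a_i

def grad_h(w, u):
    N = len(u)
    s = outputs(w, u)
    # explicit derivative of the state map wrt w: dw a_i = [a_{i-1}, u_{i-1}]
    dwa = np.zeros((N, 2))
    dwa[1:, 0] = s[:-1]
    dwa[1:, 1] = u[:-1]
    H = np.zeros((N, 2))
    J = np.zeros(N)
    J[N - 1] = 1.0                        # J(N,:) = da_N y_N
    H[N - 1] = J[N - 1] * dwa[N - 1]
    for i in range(N - 2, -1, -1):
        J[i + 1:] *= w[0]                 # chain through da_i a_{i+1} = w0
        J[i] = 1.0                        # da_i y_i
        H[i:] += np.outer(J[i:], dwa[i])
    return H                              # Vwh = H

u = np.array([0.7, -1.2, 0.4, 2.0, -0.5])
w = np.array([0.9, 0.5])
H = grad_h(w, u)

eps = 1e-6                                # central finite differences
fd = np.column_stack([
    (outputs(w + eps * np.eye(2)[k], u) - outputs(w - eps * np.eye(2)[k], u))
    / (2 * eps) for k in range(2)
])
```

The recursion costs one backward sweep per training sequence rather than one forward sensitivity pass per parameter, which is what makes it attractive when V is large.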

To calculate the dependence of some output y_n on w, we
use a partial derivative expansion with respect to the state
vector:

  ∇_w y_n = Σ_{i=1}^{n} (∇_{a_i} y_n)(∇_w a_i)    (38)

The term ∇_w a_i in equ. (38) can be directly calculated:

  ∇_w a_i = [ ∂a_{1i}/∂w_1 . . . ∂a_{1i}/∂w_V        [ ∂f_1(a_{i−1}, u_{i−1}, w)/∂w_1 . . . ∂f_1(a_{i−1}, u_{i−1}, w)/∂w_V
                  .                             =         .
                  .                                       .
              ∂a_{Mi}/∂w_1 . . . ∂a_{Mi}/∂w_V ]      ∂f_M(a_{i−1}, u_{i−1}, w)/∂w_1 . . . ∂f_M(a_{i−1}, u_{i−1}, w)/∂w_V ]    (39)

In order to calculate the term ∇_{a_i} y_n in equ. (38), we again
apply a partial derivative expansion as follows:

  ∇_{a_i} y_n = (∇_{a_n} y_n)(∇_{a_{n−1}} a_n) . . . (∇_{a_i} a_{i+1})    (40)

Each of these terms can now be directly calculated:

  ∇_{a_i} a_{i+1} = [ ∂a_{1,i+1}/∂a_{1i} . . . ∂a_{1,i+1}/∂a_{Mi}
                          .
                          .
                      ∂a_{M,i+1}/∂a_{1i} . . . ∂a_{M,i+1}/∂a_{Mi} ]    (41)

  J_n = ∇_{a_n} y_n = [ ∂y_n/∂a_{1n} . . . ∂y_n/∂a_{Mn} ]
      = [ ∂f_o(a_n, u_n, w)/∂a_{1n} . . . ∂f_o(a_n, u_n, w)/∂a_{Mn} ]    (42)

Based upon these equations, we present below an algo-
rithm by which ∇_w h(w, u) can be calculated. We begin with
two all-zeros matrices, H = 0_{N×V} and J = 0_{N×M}. For notational
clarity, note that H(i,:) refers to all the columns at row i in
H, H(:, i) refers to all the rows in column i of H, and H(i:n,:)
refers to the matrix block defined by all the columns and the
rows from i to n.

1. J(N,:) = ∇_{a_N} y_N
2. H(N,:) = ∇_{a_N} y_N × ∇_w a_N
3. for i = N−1 : 1 step −1
4.   J(i+1:N,:) ← J(i+1:N,:) × ∇_{a_i} a_{i+1}
5.   J(i,:) ← ∇_{a_i} y_i
6.   H(i:N,:) ← H(i:N,:) + J(i:N,:) × ∇_w a_i
7. end
8. ∇_w h(w, u) = H

IX. Criteria for Selection of the AMGN Algorithm
Above BPTT

In this section we will briefly discuss the performance of
BP, and algorithm 10, applied to nonlinear adaptive prob-
lems.

It has been shown [12] that the steepest-descent method is
H∞-optimal for linear prediction problems, and that
BP (we use the term BP to include BPTT for dynamic
networks) is locally H∞-optimal for multi-layered nonlin-
ear prediction problems. In essence, this optimality means
that the ratio of the second norm of the disturbance in equ.
(1),

  Σ_{n=1}^{N} v_n^2,

to the second norm of the filtered error,

  Σ_{n=1}^{N} ẽ_n^2,

can be made arbitrarily close to 1.0 for any {v_n} as N→∞,
by selecting an initial set of weights close enough to the
ideal value. These results explain the superior robustness of
BP for nonlinear applications, and linear applications with
non-Gaussian disturbances. However, this optimality result
for BP applies to scenarios where a new set of inputs is
presented to a filter, thus changing cost as a function of the
weights, at each training epoch. In this case, the nonlinear
modeling error at each training session is treated by BP as
an unknown disturbance. This approach to training would
occur, for example, in nonlinear prediction applications in
which processing speed is limited. It is worth noting that in
these scenarios, where parameters are updated according to
the error of each newly predicted datum, the matrix (14) will
have rank 1. Consequently, the update according to AMGN
would be in the same direction as that of BP. In fact, in this
situation, Gauss-Newton learning is simply a normalized
least squares algorithm, which is H∞-optimal for a poste-
riori error for linear problems [9]. In many nonlinear appli-
cations however, multiple weight updates are performed
using the same set of inputs. In these cases, the performance
of BP and AMGN become clearly distinguishable.

We will clearly illustrate the different algorithms' char-
acteristics by transferring our considerations to a simple
two-dimensional parameter-space. Consider the simple
static network of FIG. (4). Assume that the desired values for
the two weights 17 are w_1 = 1, w_2 = 1, so that the desired
output 18 of the network is y_n = u_n + u_n^2. We select a vector of
inputs 19 [1 2 3]^T so that the desired vector of outputs is [2
6 12]^T. FIG. (5) displays the performance over 100 training
epochs of both BP (shown as x) and AMGN (shown as o).
For the purpose of comparison, the learning rate was
adjusted so that the initial steps taken by both algorithms
were of the same magnitude. The learning rate was then



adjusted according to the successive augmentation and
reduction rule. Notice that AMGN is very close after a few
updates. By contrast, BP finds itself stuck at w_1 = 0.84,
w_2 = 1.41. This non-stationary limit point occurs since the
cost improvement at each update is not substantial enough to
achieve convergence. This type of trap for steepest descent
algorithms is typical of dynamic networks involving poly-
nomial nonlinearities, or Volterra architectures. Note that if
we set a lower bound on the learning rate then BP will
eventually converge to the point (1,1), but it would take a
very large number of updates! This point is illustrated in
FIG. (6) where we have imposed a constant step size
restriction, 0.1, on both algorithms. Notice the directions of
the steps taken as the algorithms approach the global mini-
mum.
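The two-parameter example can be reproduced in a few lines. The sketch below uses a damped Gauss-Newton step as a stand-in for AMGN (the successive augmentation and reduction schedule is not reproduced) and a fixed-step gradient update for BP; the damping and step size are illustrative.

```python
import numpy as np

# Static network of FIG. 4: y_n = w1*u_n + w2*u_n^2 with ideal weights
# w1 = w2 = 1, inputs [1 2 3] and desired outputs [2 6 12]. A damped
# Gauss-Newton step (stand-in for AMGN) against fixed-step steepest
# descent (stand-in for BP).

u = np.array([1.0, 2.0, 3.0])
d = np.array([2.0, 6.0, 12.0])
Phi = np.column_stack([u, u ** 2])        # Vwh for this static network

w_gn = np.zeros(2)
for _ in range(5):
    e = d - Phi @ w_gn
    w_gn += np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(2), Phi.T @ e)

w_bp = np.zeros(2)
for _ in range(100):
    w_bp += 0.01 * Phi.T @ (d - Phi @ w_bp)   # step chosen for stability
```

After five damped Gauss-Newton updates w_gn sits at the global minimum (1, 1) to numerical precision, while one hundred fixed-step gradient updates still leave a visible error along the poorly conditioned direction, mirroring the behavior shown in FIG. (5) and FIG. (6).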


X. Examples of Taylor-made Adaptive Dynamic 
Nonlinear Filters 


In this section, we describe two sample applications of the
AMGN algorithm and illustrate how one may Taylor-make
a network to perform a specific nonlinear dynamic
operation, such network being efficiently trained with the
new algorithm. Consider the tracking system displayed in
FIG. 7. This system can be considered as some nonlinear
sensor 20, feeding electronics with nonlinear components 21
and dynamics 22, which output to some plant 23. The
dynamics on the feedback path 24 could represent parasitic
capacitance in the system. For this example, we choose the
dynamic transfer functions and nonlinearities according to:


  C_1(s) = (s + 300)/(s + 700)

  C_2(s) = 100/s

  C_3(s) = 1000/(s + 1000)

  f_1(u) = u + (1/2)u^2 + (1/3)u^3

  f_2(u) = arcsin(u)    (43)

Ideally, we would like the system outputs to exactly track
system inputs; however, with nonlinearities, dynamics and
feedback, the outputs are a severely distorted version of the
original system inputs. We seek an adaptive filter for the
purpose of identifying the original system inputs 25 from the
system outputs 26, in other words for inverting the system
of FIG. 7.

The most general technique in prior art for modeling
nonlinear systems with decaying memory, such as FIG. 7, is
via a Volterra series. Consider, for example, a nonlinear
system which one models with a memory of M taps, and
with a nonlinear expansion order of V. The output of the
system in response to an input sequence {u_n} can then be
expressed as the Volterra expansion:

  y_n = Σ_{k_{1,1}=1}^{M} h_{k_{1,1}} u_{n−k_{1,1}}
      + Σ_{k_{2,1}=1}^{M} Σ_{k_{2,2}=1}^{M} h_{k_{2,1}k_{2,2}} u_{n−k_{2,1}} u_{n−k_{2,2}} + . . .
      + Σ_{k_{V,1}=1}^{M} . . . Σ_{k_{V,V}=1}^{M} h_{k_{V,1} . . . k_{V,V}} u_{n−k_{V,1}} . . . u_{n−k_{V,V}}    (44)

The multi-dimensional sequences specified by h in the
expansion are termed Volterra kernels. Each component of a
kernel is a coefficient, or weight, for one term in the
expansion, and each of these weights must be trained in
order to model a system. A network architecture as in FIG.
2, which contains polynomial nonlinearities 3 and embedded
linear filters 2, has substantially fewer parameters than
would be required in a Volterra series expansion of a similar
system. More precisely, the number of Volterra series terms
necessary to emulate a single FIR filter of M taps feeding a
polynomial GSLN of order V is

  Σ_{v=1}^{V} (M + v − 1)! / (v! (M − 1)!)

To emulate the operation of a multi-layered filter as in
FIG. 2, the number of terms required in a Volterra series
would increase very rapidly! In general, networks are less
able to generalize from training data as the degrees of
freedom associated with the network parameter-space
increase.
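The comparison can be made concrete by counting. Assuming symmetric Volterra kernels, the number of distinct order-v coefficients with M taps is the binomial coefficient C(M+v−1, v), while the network needs only the M tap weights plus the V polynomial coefficients; the parameter count M + V below is an illustrative simplification that ignores any bias or scaling terms.

```python
from math import comb

# Distinct coefficients in a symmetric Volterra expansion of memory M
# and order V, versus the parameters of one M-tap FIR filter feeding a
# polynomial GSLN of order V.

def volterra_terms(M, V):
    # order-v symmetric kernel has C(M+v-1, v) distinct coefficients
    return sum(comb(M + v - 1, v) for v in range(1, V + 1))

def network_params(M, V):
    return M + V          # M tap weights plus V polynomial coefficients

print(volterra_terms(30, 3))   # 5455
print(network_params(30, 3))   # 33
```

Even for the modest case M = 30, V = 3, the Volterra expansion carries over a hundred times as many trainable coefficients as the structured filter.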

Therefore, if one knows roughly how the system one
seeks to emulate is decomposed into static nonlinear blocks
and dynamic linear components, one can Taylor-make a
filter of the form of FIG. 2 to perform the required process-
ing with a limited number of parameters. A polynomial-
based GSLN, where the functions are simply
ascending powers of the input, can model the truncated
Taylor series expansion of differentiable functions.
Consequently, with few parameters, it can emulate accu-
rately a wide range of static nonlinearities. Therefore, based
on an estimate of the time constants of the linear dynamics
to be emulated, and an estimate of the order of the nonlinear
functions to be emulated, we can Taylor-make a network for
inverting the system of FIG. 7 as shown in FIG. 8. One
skilled in the art will recognize that this architecture is based
on a block-by-block inversion of the tracking system.

The acquisition of the training data for this network is
described in FIG. 9. We obtain the desired network output
{y_n} 27 by sampling 36 the original inputs 28 to the tracking
system 29. Sampling 37 the output 30 of the tracking system
29 creates the training input sequence {u_n} 31 for the
network of FIG. 8. The objectives in creating the training
signal 28 are two-fold. First, the nonlinearities 20, 22 must be
excited over the full domain of operation. This is achieved
with a chirp signal 32 which starts at frequency 0 and ramps
to a frequency beyond the bandwidth of the tracking system
29, and which has an amplitude larger than those encoun-
tered during regular tracking. Secondly, in order to improve
the condition of the matrix (14), the signal must have
significant frequency content near the Nyquist rate, 1/(2T_s).
We achieve this by adding 34 zero-mean white Gaussian
noise 33 to the chirp signal 32, which we then input to a
zero-order hold 35 with a hold time three times the
sampling period, T_s. Notice also the factor-of-3 upsamplers on
the network paths 38 in FIG. 8. Used in conjunction with the
3-period zero-order hold 35 in preparation of the training
input 31, the upsampling enables an FIR filter 39 to estimate
accurately the Euler derivative [13] of a signal which has



high frequency content. In brief, every third point of {y_n}
has a well defined gradient; the rest of the calculated
gradients are discarded. The unit delay 40 after sampling 36
the desired output shown in FIG. (9) aligns the training data
27 for the non-causal Euler differentiation. This technique
can also be used to implement steepest-descent high-pass
filters and pure differentiators.
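The signal preparation of FIG. 9 can be sketched as follows. The sweep range, noise level and sequence length are illustrative; only the 1 kHz sampling and the 3-sample zero-order hold are taken from the text.

```python
import numpy as np

# Chirp (0 Hz ramping upward) plus zero-mean white Gaussian noise,
# passed through a zero-order hold whose period is three sampling
# intervals, so the held signal is flat over each 3-sample segment.

fs = 1000.0                                # 1 kHz sampling, T_s = 1 ms
N = 1000
t = np.arange(N) / fs
chirp = np.sin(2 * np.pi * (100.0 * t) * t)    # frequency ramps from 0
rng = np.random.default_rng(1)
x = chirp + 0.1 * rng.standard_normal(N)       # add white Gaussian noise

held = np.repeat(x[::3], 3)[:N]            # 3-sample zero-order hold
```

Holding each value for three samples is what lets the factor-of-3 upsampled FIR paths of FIG. 8 form a well defined Euler derivative at every third output point.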

Training sequences of 1000 data points, sampled at 1 kHz,
were used. FIG. (10) shows the RMS network error for 100
parameter updates using the AMGN algorithm, and 200
updates respectively using optimization algorithms known
in the art as the Kalman Filter, the Gauss-Newton technique,
the BFGS algorithm with a line search [10], and BPTT. The
Kalman filter is presented as a standard; the measurement
noise and initial covariance were empirically chosen to
maximally reduce cost after 200 epochs. For the Gauss-
Newton algorithm, reduced-order inverses were formed
when matrix (14) was ill-conditioned. The line search for
BFGS was conducted using the method of false position
[14]. Notice that all Newton-like techniques outperform
BPTT. The superior convergence rate and cost minimization
achieved with BFGS and AMGN are clearly evident. Note
that in contrast to BFGS, AMGN does not require a line
search, so each epoch involves substantially less computation
than is required for a BFGS update. FIG. (11) shows the
learning rate μ_k over 100 updates for AMGN. Note that the
learning rate increases geometrically until the poor condi-
tioning of matrix (14) significantly distorts the update direc-
tion. While it is difficult to establish that the weights
converge to a global minimum, one can show that the
tracking system dynamics have been captured accurately.
This is evidenced in FIG. (12), where the trained weight
vector 41 [w_1^{1,2} . . . w_30^{1,2}] emulates the impulse response of
a system with dominant pole at 300 rad/sec.

A sum-of-sinusoids test reference signal was injected into
the tracking system. When the system outputs are injected
into the trained network, the resultant output error is shown
in FIG. 13 (right), together with the error which arises from
an exact inversion of all the linear dynamics of the system
(left). This indicates the extent to which the nonlinear
distortion of the signal has been corrected.

The second example involves the identification and inver-
sion of a Wiener-type nonlinear system. The Wiener model
applies to an audio amplifier which exhibits crossover
distortion. The network architecture employed is displayed
in FIG. 14. The memoryless nonlinearity at the amplifier's
output is emulated with a parameterized function 42 which
can accurately emulate crossover distortion:

  f(y_n, x_1, x_2, x_3) = x_1 (1 − e^{−x_2 y_n})/(1 + e^{−x_2 y_n}) + x_3 y_n    (45)

The linear dynamics of the amplifier are estimated over
the audible band with an IIR digital filter 43. The network
training data is gathered according to the method of FIG. 15.
The amplifier 47 is excited with a chirp signal 49 ranging in
frequency from 0 to 19.2 kHz, summed 50 with zero-mean
Gaussian noise 48. The amplifier outputs 52 are sampled 51
to obtain the desired network output sequence 46. The
network input sequence 45 is obtained by sampling 54 the
input signal 53. The AMGN algorithm is then used to
identify the parameters of the filter in FIG. 14.

It is known in the art that a non-minimum-phase zero
cannot be precisely dynamically compensated, since it will
cause an unstable pole in the compensator. Consequently, if
the identification yields a non-minimum-phase zero for the
linear filter 43, the output sequence 46 is delayed 55 relative
to the input sequence 45 and the identification is repeated.
For input sequence 45 {u_n}, the filter is trained to estimate
outputs 46 {y_{n−1}}. Successive delays may be added until the
zeros of the identified linear filter 43 are minimum-phase.
Once the filter parameters are identified, the linear 43 and
nonlinear blocks 42 are analytically inverted using tech-
niques well known in the art. A lowpass digital filter with
cutoff at roughly 20 kHz is appended to the IIR filter inverse
to limit high-frequency gain, and a lookup table is used to
emulate the inverse of the memoryless nonlinearity. An
audio signal is pre-warped by the inverse nonlinearity and
then the inverse linear dynamics before being input to the
amplifier. The A-D and D-A conversions can be performed
with an AD1847 codec, and the pre-warping of the audio
signal can be performed with an ADSP-2181 microprocessor.
FIG. 16 shows the spectral response of the nonlinear ampli-
fier when excited with a dual-tone test signal. FIG. 17
displays the spectral response of the amplifier to a pre-
warped dual tone. Note that the amplitudes of nonlinear
distortion harmonics have been reduced in the pre-warped
signal by more than 20 dB.
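The lookup-table inversion of the memoryless nonlinearity can be prototyped with any monotonic crossover-like function. The function below is an illustrative stand-in, not the patent's exact equation (45), and the inverse is found by bisection.

```python
import numpy as np

# A saturating term with negative weight subtracted from a linear term
# gives reduced gain near zero, a crude model of crossover distortion.
# Parameters are illustrative; the derivative stays >= 0.2 here, so the
# function is strictly increasing and the bisection inverse is well
# defined.

def crossover(y, x1=-0.2, x2=4.0, x3=1.0):
    return x1 * np.tanh(x2 * y) + x3 * y

def crossover_inverse(target, lo=-10.0, hi=10.0, iters=60):
    """Bisection inverse of the strictly increasing nonlinearity."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if crossover(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In practice one would tabulate crossover_inverse on a grid and interpolate, which is exactly the lookup-table role described above for pre-warping the audio signal.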

These examples are intended only as illustrations of the 
use of the invention; they do not in any way suggest a limited 
scope for its application. 

REFERENCES 

[1] S. D. Stearns, Adaptive Signal Processing, Prentice Hall,
1985.

[2] D. K. Frerichs et al., “U.S. patent: Method and procedure 
for neural control of dynamic processes”, Technical 
Report U.S. Pat. No. 5,175,678, U.S. Patent and Trade- 
mark Office, December 1992. 

[3] A. Mathur, “U.S. patent: Method for process system 
identification using neural networks”, Technical Report 
U.S. Pat. No. 5,740,324, U.S. Patent and Trademark 
Office, April 1998. 

[4] D. C. Hyland, “U.S. patent: Multiprocessor system and 
method for identification and adaptive control of dynamic 
systems”, Technical Report U.S. Pat. No. 5,796,920, U.S. 
Patent and Trademark Office, August 1998. 

[5] S. A. White et al., “U.S. patent: Nonlinear adaptive 
filter”, Technical Report U.S. Pat. No. 4,843,583, U.S. 
Patent and Trademark Office, June 1989. 

[6] D. H. Nguyen, Applications of Neural Networks in 
Adaptive Control , PhD thesis, Stanford University, June 
1991. 

[7] D. M. Hanks, “U.S. patent: Adaptive feedback system for 
controlling head/arm position in a disk drive”, Technical 
Report U.S. Pat. No. 5,548,192, U.S. Patent and Trade- 
mark Office, August 1996. 

[8] A. Back et al., "A unifying view of some training
algorithms for multilayer perceptrons with FIR filter
synapses", Proceedings of the 1994 IEEE Workshop on
Neural Networks for Signal Processing, vol. 1, pp.
146-154, 1994.

[9] B. Hassibi, "H∞ optimality of the LMS algorithm", IEEE
Transactions on Signal Processing, vol. 44, no. 2, pp.
267-280, February 1996.

[10] D. P. Bertsekas, Nonlinear Programming, vol. 1, Ath- 
ena Scientific, 2nd edition, 1995. 

[11] N. Wiener, Nonlinear Problems in Random Theory,
Wiley, New York, 1958.

[12] B. Hassibi, "H∞ optimal training algorithms and their
relation to backpropagation", Proceedings of NIPS 94
(Neural Information Processing Systems: Natural and
Synthetic), pp. 191-198, November-December 1994.



[13] G. F. Franklin, Digital Control of Dynamic Systems,
Addison-Wesley, 2nd edition, 1990.

[14] D. G. Luenberger, Linear and Nonlinear Programming,
Addison-Wesley, 2nd edition, 1984.

What is claimed is:

1. A method for training a dynamic nonlinear adaptive
filter with embedded memory, the method comprising:

a) propagating a training sequence through the dynamic
nonlinear adaptive filter to obtain an output sequence,
where the filter is represented by a filter architecture h
and a set of filter parameters w, and where the filter
architecture h comprises adaptable static nonlinear
components and adaptable linear components;

b) constructing a matrix ∇h_w from filter architecture h,
from the set of filter parameters w, and from an error
sequence, where the matrix ∇h_w relates each parameter
of w to an error produced by the parameter over all
components of the error sequence, where the error
sequence measures a difference between the output
sequence and a desired output sequence; and

c) updating the filter parameters w based on a learning
rate, current (and possibly past) parameter vectors w,
current (and possibly past) matrices ∇h_w, and current
(and possibly past) error sequences, where an update
direction is closer to a Newton update direction than
that of a steepest descent method.

2. The method of claim 1 wherein updating the filter
parameters w is achieved with a modified Gauss-Newton
optimization technique which uses only the current param-
eter vector w, the current matrix ∇h_w, the current error
sequence, and where the update direction and an update
magnitude are determined by the learning rate.

3. The method of claim 1 wherein constructing the matrix
∇h_w comprises repeated convolutions with time-varying
gains at various layers of the filter.

4. The method of claim 1 wherein constructing the matrix
∇h_w comprises repeated matrix multiplications that account
for a dependence of filter output on all prior states within the
filter.

5. The method of claim 1 wherein updating the filter
parameters w comprises:

propagating the training sequence through the filter to
obtain a temporary output sequence, where a temporary
set of filter parameters w_t is used during the propagat-
ing;

evaluating a cost function J depending on the error
sequence;

evaluating a temporary cost function J_t
depending on a temporary error sequence, where the
temporary error sequence measures a difference
between the temporary output sequence and the desired
output sequence;

setting w = w_t and increasing the learning rate if J > J_t; and

decreasing the learning rate and repeating the parameter
update if J ≤ J_t.

6. The method of claim 1 wherein the filter architecture h
comprises adaptable static nonlinear components and adapt-
able linear components.

7. The method of claim 1 wherein (a) through (c) are
repeated with training signals selected to excite an entire
domain of operation of the nonlinear components, and to
excite an entire operational bandwidth of the linear compo-
nents.

8. The method of claim 1 further comprising delaying the
training sequence relative to the output sequence, whereby
the filter may be trained to implement a non-causal Euler
derivative in mapping from the outputs to the inputs.



9. The method of claim 1 further comprising delaying the
output sequence relative to the training sequence, whereby
the filter may be trained to implement a minimum-phase
model of a dynamic system.

10. A method for training a dynamic nonlinear adaptive
filter, the method comprising:

a) propagating a training sequence through the dynamic
nonlinear adaptive filter to obtain an output sequence,
where the filter is represented by a filter architecture h
and a set of filter parameters w;

b) constructing a matrix ∇h_w from filter architecture h,
from the set of filter parameters w, and from an error
sequence, where the error sequence measures a differ-
ence between the output sequence and a desired output
sequence;

c) determining an update vector Δw from matrix ∇h_w,
from the error sequence, and from a learning rate,
where both a magnitude and a direction of update
vector Δw depend on a value of the learning rate; and

d) updating the filter parameters w based on the update
vector Δw.

11. The method of claim 10 wherein the direction of the
update vector Δw is closer to a Newton update direction
than that of a steepest-descent method.

12. The method of claim 10 wherein the matrix ∇h_w
relates each of the parameters in w to an error produced by
that parameter over all components of the error sequence.

13. The method of claim 10 wherein the filter architecture
h comprises adaptable static nonlinear components and
adaptable linear components.

14. The method of claim 10 wherein updating the filter
parameters w comprises:

propagating the training sequence through the filter to
obtain a temporary output sequence, where a temporary
set of filter parameters w_t is used during the propagat-
ing;

evaluating a cost function J depending on the error
sequence;

evaluating a temporary cost function J_t depending on a
temporary error sequence, where the temporary error
sequence measures a difference between the temporary
output sequence and the desired output sequence;

setting w = w_t and increasing the learning rate if J > J_t; and

decreasing the learning rate and repeating the parameter
update if J ≤ J_t.

15. The method of claim 10 wherein (a) through (d) are
repeated with training sequences selected to excite an entire
domain of operation of the nonlinear components, and to
excite an entire operational bandwidth of the linear compo-
nents.

16. The method of claim 10 further comprising delaying
the training sequence relative to the output sequence,
whereby the filter may be trained to implement a non-causal
Euler derivative in mapping from the outputs to the inputs.

17. The method of claim 10 further comprising delaying
the output sequence relative to the training sequence,
whereby the filter may be trained to implement a minimum-
phase model of a dynamic system.

18. A dynamic nonlinear adaptive filter with embedded
memory comprising:

a) a plurality of adjustable linear filter blocks;

b) a plurality of adjustable nonlinear filter blocks; and

c) a filter training circuit;

wherein the linear and nonlinear filter blocks are inter-
connected to form a network;

wherein the linear and nonlinear filter blocks comprise
memory for storing filter parameters;

wherein the filter training circuit updates the filter
parameters using a method with parameter updates
closer to a Newton update direction than those of a
steepest descent method, and which comprises com-
paring a desired filter output sequence with an actual
filter output sequence resulting from a filter training
sequence propagated through the filter; and

wherein updating the filter parameters w is achieved
with a modified Gauss-Newton optimization tech-
nique for which the update direction and magnitude
are determined by a learning rate.

19. The filter of claim 18 wherein the filter training circuit
calculates a matrix ∇h_w relating each of the filter parameters
to an error produced by the parameter over all components
of an error sequence, where the error sequence measures a
difference between the output sequence and the desired
output sequence.

20. The filter of claim 18 wherein the conditioning of the
matrix inverse performed in calculating an update vector Δw
is determined by the learning rate.

21. The filter of claim 18 wherein the filter training circuit
selects training sequences to excite an entire domain of
operation of the nonlinear blocks, and to excite an entire
operational bandwidth of the linear blocks.

22. A nonlinear filter system comprising:

a) a nonlinear filter comprising a plurality of linear filter
blocks embedded in a plurality of nonlinear filter
blocks, wherein the filter comprises a memory for
storing filter parameters; and

b) a filter training circuit for updating the filter
parameters, wherein the circuit compares a desired
filter output sequence with an actual filter output
sequence resulting from a filter training sequence
propagated through the filter, and the circuit imple-
ments a filter parameter update with a direction closer
to a Newton update direction than that of a steepest
descent method;

wherein the training circuit selects a plurality of train-
ing sequences to excite an entire domain of operation
of the nonlinear blocks, and to excite an entire
operational bandwidth of the linear blocks.

23. The system of claim 22 wherein the training circuit
calculates a matrix ∇h_w relating each of the filter parameters
to an error produced by that parameter over all components
of an error sequence, where the error sequence measures a
difference between the output sequence and the desired
output sequence.

24. The system of claim 22 wherein updating the filter
parameters w is achieved with a modified Gauss-Newton
optimization technique for which an update direction and
magnitude are determined by a learning rate.

25. The system of claim 24 wherein a conditioning of the
matrix inverse performed in determining the update is
determined by the learning rate.