(12) United States Patent
Rabinowitz
US006351740B1
(10) Patent No.: US 6,351,740 B1
(45) Date of Patent: Feb. 26, 2002
(54) METHOD AND SYSTEM FOR TRAINING
DYNAMIC NONLINEAR ADAPTIVE
FILTERS WHICH HAVE EMBEDDED
MEMORY
(75) Inventor: Matthew Rabinowitz, Palo Alto, CA
(US)
(73) Assignee: The Board of Trustees of the Leland
Stanford Junior University, Palo Alto,
CA (US)
( * ) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 0 days.
(21) Appl. No.: 09/201,927
(22) Filed: Dec. 1, 1998
Related U.S. Application Data
(60) Provisional application No. 60/067,490, filed on Dec. 1,
1997.
(51) Int. Cl.7 .......................... G06F 15/18
(52) U.S. Cl. ............................ 706/22; 706/25
(58) Field of Search ..................... 706/22, 25
(56) References Cited
U.S. PATENT DOCUMENTS
4,843,583 A * 6/1989 White et al. ........ 708/322
5,175,678 A * 12/1992 Frerichs et al. ........ 700/47
5,272,656 A * 12/1993 Genereux ........ 706/22
5,376,962 A * 12/1994 Zortea ........ 706/22
5,542,054 A * 7/1996 Batten, Jr. ........ 706/22
5,548,192 A * 8/1996 Hanks ........ 318/560
5,617,513 A * 4/1997 Schnitta ........ 706/22
5,761,383 A * 6/1998 Engel et al. ........ 706/22
5,963,929 A * 10/1999 Lo ........ 706/22
6,064,997 A * 5/2000 Jagannathan et al. ........ 706/22
OTHER PUBLICATIONS
Ong et al., "A Decision Feedback Recurrent Neural Equalizer
as an Infinite Impulse Response Filter”, IEEE Transactions
on Signal Processing, Nov. 1997.*
Rui J. P. de Figueiredo, "Optimal Neural Network Realizations of Nonlinear FIR and IIR Filters", IEEE International Symposium on Circuits and Systems, Jun. 1997.*
Yu et al., "Dynamic Learning Rate Optimization of the Back
Propagation Algorithm”, IEEE Transactions on Neural Net-
works, May 1995.*
(List continued on next page.)
Primary Examiner — George B. Davis
(74) Attorney, Agent, or Firm — Lumen Intellectual
Property Services, Inc.
(57) ABSTRACT
Described herein is a method and system for training non-
linear adaptive filters (or neural networks) which have
embedded memory. Such memory can arise in a multi-layer
finite impulse response (FIR) architecture, or an infinite
impulse response (IIR) architecture. We focus on filter
architectures with separate linear dynamic components and
static nonlinear components. Such filters can be structured
so as to restrict their degrees of computational freedom
based on a priori knowledge about the dynamic operation to
be emulated. The method is detailed for an FIR architecture
which consists of linear FIR filters together with nonlinear
generalized single layer subnets. For the IIR case, we extend
the methodology to a general nonlinear architecture which
uses feedback. For these dynamic architectures, we describe
how one can apply optimization techniques which make
updates closer to the Newton direction than those of a
steepest descent method, such as backpropagation. We detail
a novel adaptive modified Gauss-Newton optimization
technique, which uses an adaptive learning rate to determine
both the magnitude and direction of update steps. For a wide
range of adaptive filtering applications, the new training
algorithm converges faster and to a smaller value of cost
than both steepest-descent methods such as
backpropagation-through-time, and standard quasi-Newton
methods. We apply the algorithm to modeling the inverse of
a nonlinear dynamic tracking system 5, as well as a nonlin-
ear amplifier 6.
25 Claims, 15 Drawing Sheets
OTHER PUBLICATIONS
White et al., "The Learning Rate in Back-Propagation Systems: an Application of Newton's Method", IEEE IJCNN, May 1990.*
Nobakht et al., "Nonlinear Adaptive Filtering using Annealed Neural Networks", IEEE International Symposium on Circuits and Systems, May 1990.*
Pataki, B., "Neural Network based Dynamic Models", Third International Conference on Artificial Neural Networks, IEEE, 1993.*
Dimitri P. Bertsekas, "Incremental Least Squares Methods and the Extended Kalman Filter", IEEE Proceedings of the 33rd Conference on Decision and Control, Dec. 1994.*
Puskorius et al., "Multi-Stream Extended Kalman Filter Training for Static and Dynamic Neural Networks", IEEE International Conference on Systems, Man, and Cybernetics, Oct. 1997.*
Back et al., "Internal Representation of Data in Multilayer Perceptrons with IIR Synapses", Proceedings of the 1992 IEEE International Conference on Circuits and Systems, May 1992.*
Sorensen, O., "Neural Networks Performing System Identification for Control Applications", IEEE Third International Conference on Artificial Neural Networks, 1993.*
Back et al., "A Unifying View of Some Training Algorithms for Multilayer Perceptrons with FIR Filter Synapses", Proceedings of the 1994 IEEE Workshop on Neural Networks for Signal Processing, Sep. 1994.*
Sorensen, O., "Neural Networks for Non-Linear Control", Proceedings of the Third IEEE Conference on Control Applications, Aug. 1994.*
* cited by examiner
METHOD AND SYSTEM FOR TRAINING
DYNAMIC NONLINEAR ADAPTIVE
FILTERS WHICH HAVE EMBEDDED
MEMORY
I. CROSS-REFERENCE TO RELATED
APPLICATIONS
This application claims priority from U.S. Provisional
Patent Application No. 60/067,490 filed Dec. 1, 1997, which
is incorporated herein by reference.
This invention was reduced to practice with support from
NASA under contract NAS8-36125. The U.S. Government
has certain rights in the invention.
II. FIELD OF THE INVENTION
This invention relates to neural networks or adaptive
nonlinear filters which contain linear dynamics, or memory,
embedded within the filter. In particular, this invention
describes a new system and method by which such filters can
be efficiently trained to process temporal data.
III. BACKGROUND OF THE INVENTION
In problems concerning the emulation, control or post-
processing of nonlinear dynamic systems, it is often the case
that the exact system dynamics are difficult to model. A
typical solution is to train the parameters of a nonlinear filter
to perform the desired processing, based on a set of inputs
and a set of desired outputs, termed the training signals.
Since its discovery in 1985, backpropagation (BP) has
emerged as the standard technique for training multi-layer
adaptive filters to implement static functions, to operate on
tapped-delay line inputs, and in recursive filters where the
desired outputs of the filters are known [1, 2, 3, 4, 5]. The
principle of static BP was extended to networks with embed-
ded memory via backpropagation-through-time (BPTT), the
principle of which has been used to train network parameters
in feedback loops when components in the loop are modeled
[6] or unmodeled [7]. For the special case of finite impulse
response (FIR) filters, of the type discussed in this paper, the
BPTT algorithm has been further refined [8]. Like BP, BPTT
is a steepest-descent method, but it accounts for the outputs
of a layer in a filter continuing to propagate through a
network for an extended length of time. Consequently, the
algorithm updates network parameters according to the error
they produce over the time spanned by the training data. In
essence, BP and BPTT are steepest descent algorithms,
applied successively to each layer in a nonlinear filter. It has
been shown [9] that the steepest descent approach is locally
H∞-optimal in prediction applications where the training inputs
vary at each weight update, or training epoch. However,
when the same training data is used for several epochs,
BPTT is suboptimal, and techniques which generate updates
closer to the Newton update direction (see section 10) are
preferable. We will refer to such techniques, which generate
updates closer to the Newton update direction, as Newton-
like methods.
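For readers unfamiliar with the distinction, the following short Python sketch (illustrative only; the Hessian, starting point and learning rate are arbitrary) contrasts a steepest-descent step with a Newton step on a simple quadratic cost:

    import numpy as np

    # Minimal illustration (not from the patent): on a quadratic cost
    # J(w) = 0.5*(w - w_star)^T H (w - w_star), steepest descent follows the
    # negative gradient, while a Newton step solves H * dw = -grad and reaches
    # the minimum in one update when H is exact.
    H = np.array([[4.0, 0.0], [0.0, 0.25]])      # ill-conditioned Hessian
    w_star = np.array([1.0, 1.0])                # minimizer
    w = np.array([3.0, -2.0])                    # current weights

    grad = H @ (w - w_star)                      # gradient of J at w
    step_sd = -0.1 * grad                        # steepest-descent step (learning rate 0.1)
    step_newton = -np.linalg.solve(H, grad)      # Newton step

    print(w + step_sd)      # moves mostly along the stiff axis
    print(w + step_newton)  # lands on w_star = [1, 1] in one step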
Since steepest-descent techniques such as BPTT often
behave poorly in terms of convergence rates and error
minimization, it is therefore an object of this invention to
create a method by which Newton-like optimization tech-
niques can be applied to nonlinear adaptive filters containing
embedded memory for the purpose of processing temporal
data. It is further an object of this invention to create an
optimization technique which is better suited to training a
FIR or IIR network to process temporal data than classical
Newton-like [10] techniques. It is further an object of this
invention to create multi-layer adaptive filters which are
tailor-made for specific applications, and can be efficiently
trained with the novel Newton-like algorithm.
IV. BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 provides a block diagram according to which the
components of the invention may be combined in a preferred
embodiment.
FIG. 2 displays the general architecture for a nonlinear
multi-layered FIR network with restricted degrees of com-
putational freedom using separate static nonlinear and linear
dynamic components.
FIG. 3 displays a generic nonlinear IIR filter architecture,
for which the next states and the output are assumed to be
some static function of the current states, and the current
inputs.
FIG. 4 displays a simple two-weight polynomial-based
network without memory.
FIG. 5 displays the weight sequence over 100 epochs of
training the network of FIG. 4 using BP (x) and the adaptive
modified Gauss-Newton algorithm (o) respectively. Con-
tours of constant cost are shown.
FIG. 6 displays the weight sequence over 100 epochs of
training the network of FIG. 4 using BP (top line) and the
adaptive modified Gauss-Newton algorithm with update
steps constrained to a magnitude of 0.1.
FIG. 7 displays a block-diagram of a generic tracking
system with static nonlinearities embedded within linear
dynamics, and parasitic dynamics on the feedback path.
FIG. 8 displays the FIR network employed to linearize the
nonlinear tracking system of FIG. 7 and equ. (43).
FIG. 9 displays a technique for acquiring training data by
exciting the tracking system.
FIG. 10 displays the RMS error of the network of FIG. 8
over 100 epochs using the adaptive modified Gauss-Newton
algorithm, and over 200 epochs using BPTT, Kalman
Filtering, the Gauss-Newton technique, and BFGS with a
line search.
FIG. 11 displays the learning rate, μ_k, over 100 training
epochs for the network of FIG. 8 using the adaptive modified
Gauss-Newton algorithm.
FIG. 12 displays the trained weights 42 of the FIR filter
at layer 2 of path 1 in FIG. 8.
FIG. 13 displays on the left the error produced by an
analytical inversion of the linearized tracking system of FIG.
7. This illustrates the extent of the error due to nonlinear
distortion. On the right is shown the error in the output of the
trained network of FIG. 8.
FIG. 14 displays the Wiener model filter used to emulate
a nonlinear audio amplifier.
FIG. 15 displays a method by which data is gathered for
training the filter of FIG. 14.
FIG. 16 displays the spectral response of the nonlinear
amplifier to a dual tone signal.
FIG. 17 displays the spectral response of the nonlinear
amplifier to a prewarped dual tone signal.
V. SUMMARY OF THE INVENTION
This invention concerns a method for applying a Newton-
like training algorithm to a dynamic nonlinear adaptive filter
which has embedded memory. FIG. 1 provides a block
diagram according to which the components of the invention
may be combined in a preferred embodiment. The solid lines
indicate the flow of data between steps; the dashed lines
indicate the flow of control between steps. Initially, 47, the
dynamic nonlinear system to be emulated by the filter is
decomposed to linear dynamic elements, and nonlinear
elements. Based upon this decomposition, a filter
architecture h 48, consisting of adaptable static nonlinear
components and adaptable linear components, is tailor-made
49 for the desired dynamic operation 47. Next 51, using an
initial set of filter parameters w 53, and architecture h 48, a
training input sequence {u_n} 50 is propagated through the
filter to obtain an output sequence {y_n} 52. An error signal
{e_n} 55 is created by subtracting from {y_n} 52 the desired
output sequence {y_n} 56. Using h 48, {e_n} 55, and w 53, a
novel technique 57 is used to construct a single matrix ∇h_w 58
which relates each network parameter to the error it produces
over the full sequence {e_n} 55. Using ∇h_w 58, {e_n} 55, and a
learning rate μ, a novel Newton-like update algorithm 59 is
used to determine a weight update vector Δw 60. A temporary
parameter set w_t 62 is created by summing Δw 60 and w 61.
Using w_t 62 and h 48, the training input sequence {u_n} 50 is
again propagated 63 through the filter to obtain output
sequence {y_t} 64. {e_t} 66 is created by 65 differencing {y_t}
64 and {y_n} 56. Costs 67, 68 are computed from the error
sequences 55, 66, such as J = Σ_n e_n^2 and J_t = Σ_n e_t^2. If
J > J_t, μ 70 is increased 69, the new w is set equal to w_t 69,
and the process is commenced at block 51. Otherwise, μ is
decreased 71 and the process commences at 59.
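The flow of FIG. 1 can be summarized in the following Python sketch. The filter h, its parameter count, and all numeric constants below are stand-ins chosen only to make the loop concrete; the Jacobian is formed numerically here, whereas sections VII and VIII derive it analytically:

    import numpy as np

    # Schematic sketch of the training loop of FIG. 1 (toy filter, hypothetical
    # constants).  The filter h(w, u) is a stand-in: a 3-tap FIR followed by a
    # cubic nonlinearity.
    def h(w, u):
        taps, gain = w[:3], w[3]
        lin = np.convolve(u, taps, mode="valid")
        return lin + gain * lin ** 3

    def jacobian(w, u, eps=1e-6):
        y0 = h(w, u)
        J = np.zeros((len(w), len(y0)))
        for i in range(len(w)):
            wp = w.copy(); wp[i] += eps
            J[i] = (h(wp, u) - y0) / eps
        return J                                   # rows: parameters, cols: outputs

    rng = np.random.default_rng(0)
    u = rng.standard_normal(200)
    w_true = np.array([0.5, -0.3, 0.1, 0.2])
    y_des = h(w_true, u)                           # desired output sequence

    w = np.zeros(4)
    mu, e_a, e_r = 1e-2, 2.0, 0.5                  # learning rate and its adjustment factors
    for epoch in range(100):
        e_m = y_des - h(w, u)                      # measured error sequence
        J_cost = 0.5 * e_m @ e_m
        G = jacobian(w, u)
        # AMGN step, equation (10): dw = (I/mu + G G^T)^-1 G e_m
        dw = np.linalg.solve(np.eye(len(w)) / mu + G @ G.T, G @ e_m)
        w_trial = w + dw
        e_t = y_des - h(w_trial, u)
        if 0.5 * e_t @ e_t < J_cost:               # trial step reduced cost
            w, mu = w_trial, mu * e_a              # accept and grow the learning rate
        else:
            mu *= e_r                              # reject and shrink the learning rate
    print(w)                                       # should approach w_true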
In section VI is described the novel Newton-like algo-
rithm for updating the parameters of a network with embed-
ded memory. The technique is based on a modification of the
Gauss-Newton optimization method. The modification
involves an adaptive parameter which ensures that the
approximation of the cost’s Hessian matrix is well-
conditioned for forming an inverse. In section VII is
described a technique for relating the output error of a
nonlinear filter with embedded FIR dynamics to the filter
parameters, such that a Newton-like technique can be
applied. This method involves creating a single matrix
which relates all the network parameters to all the error they
generate over the full time sequence. In section VIII is
described a technique for relating the output error of a
general nonlinear IIR filter to the filter parameters, such that
a Newton-like technique can be applied. Section IX dis-
cusses the superior behavior of the adaptive modified Gauss-
Newton method over steepest-descent methods such as
BPTT. In section X, we describe how a dynamic nonlinear
adaptive filter may be tailor-made for a particular problem,
and trained by the new algorithm. In this section, the
performance of the algorithm is compared with that of other
optimization techniques, namely BPTT, Kalman Filtering, a
Gauss-Newton method, and the Broyden-Fletcher-Goldfarb-
Shanno (BFGS) method which are known to those skilled in
the art.
VI. The Adaptive Modified Gauss-Newton
Algorithm
In order to clearly explain the invention, this section
describes the new method, as well as the rationale behind it.
In a preferred embodiment, we assume that a training input
sequence {u_n} 53, n ∈ {1 . . . N}, is propagated 51 through
the adaptive network. We consider the network to be rep-
resented by a known nonlinear filter operator, h 48, which is
determined solely by the network architecture. The operator h
operates on (w, u) 50, 53, where u is an input vector of length N and
w is a vector of all the adaptable parameters in the network.
The vector which is created from the outputs of the network
{y_n} 52 in response to the training sequence 50 is denoted
h(w, u).
At each training epoch, which we denote with subscript k,
we find an iterative update of the parameter vector, w_k. At
epoch k, one may describe the sequence of desired training
outputs as a vector:

    y_k = h(w*, u_k) + v_k   (1)

where w* is the vector of ideal weights which training seeks
to discover, u_k is a vector of inputs at epoch k and v_k is a
vector of unknown disturbances. If the data in u_k and y_k are
sampled from real-world continuous signals, then v_k can be
regarded as a vector of noise on the ideal system inputs or
sampled outputs. v_k can also be regarded as error resulting
from the architectural imperfections of the network, such as
emulating IIR linear dynamics with an FIR structure.
Alternatively, in a control application, v_k could result from
the physical impossibility of a control law bringing all states
instantaneously to 0 in a stable system. The subscript k on
y_k and u_k indicates that the inputs and desired output set can
be changed at each epoch; however, in many applications
they are held constant over several epochs. The invention
involves an updating strategy F, which iteratively deter-
mines the current weight vector from the previous weight
vector and the input-output truth model, w_k = F(w_{k-1}, u_k, y_k).
To select amongst updating strategies, we define the follow-
ing cost function:

    J(w_k, u_k) = 1/2 (h(w_k, u_k) - h(w*, u_k))^T (h(w_k, u_k) - h(w*, u_k))   (2)

where e_fk = h(w_k, u_k) - h(w*, u_k) is the vector of filtered errors, or errors in the
absence of disturbance v_k. Then, we seek that F which
minimizes J(F(w_{k-1}, u_k, y_k), u_k).

Note that the inputs to F restrict our exact knowledge of
J's variation with respect to the weights to the first
derivative, ∇_w J. Additional measurements can be per-
formed on a network to estimate higher derivatives of J at
each epoch; however, the performance of such methods for
our class of problem does not justify their added computa-
tional complexity.

We determine F by expanding the filter function h around
the current value of the weights vector. By invoking the mean
value theorem, we can describe the filter outputs at epoch k:

    h(w_k, u_k) = h(w_{k-1}, u_k) + ∇_w h(w_{k-1}, u_k)^T (w_k - w_{k-1})
                  + 1/2 (w_k - w_{k-1})^T ∇_w^2 h(w̄_{k-1}, u_k)(w_k - w_{k-1})   (3)

where w̄_{k-1} is some weight vector for which each compo-
nent lies between the corresponding components of w_{k-1} and
w_k. Note that the notation in (3) is inaccurate: the term
∇_w^2 h(w̄_{k-1}, u_k) is not a matrix in this context, but a tensor.
Note also that the matrix ∇_w h(w_{k-1}, u_k), which contains the
derivatives of the outputs with respect to all parameters in
the network over the full time sequence, is difficult to
calculate for nonlinear FIR and IIR filters which contain
embedded memory. This will be discussed in the next
section. Suppose now, that at every epoch, we apply the old
weight vector to the current inputs and compare the filter
outputs with the current desired set of outputs. The measured
error is:

    e_mk = y_k - h(w_{k-1}, u_k)   (4)

and in a preferred embodiment, the cost function we actually
measure is

    J_mk = 1/2 e_mk^T e_mk.   (5)
The filtered error, e_fk, which we seek to minimize, is not
directly measurable. However, using equations (3) and (4) we
can describe this filtered error in terms of the measured error
as

    e_fk = e_mk - ∇_w h(w_{k-1}, u_k)^T (w_k - w_{k-1}) - v_k
           - 1/2 (w_k - w_{k-1})^T ∇_w^2 h(w̄_{k-1}, u_k)(w_k - w_{k-1})   (6)

Consider the last term on the RHS involving the second
derivative of h at w̄_{k-1}. This indicates how far h(w_k, u_k) is
from a first-order approximation at w_{k-1}. We assume that
this term tends to 0 as the weights converge. If we further make the
strong assumption that the disturbance term v_k is negligible,
then we drop these terms from the estimation of e_fk and
approximate cost as

    J(w_k, u_k) = 1/2 (e_mk - ∇_w h(w_{k-1}, u_k)^T (w_k - w_{k-1}))^T (e_mk - ∇_w h(w_{k-1}, u_k)^T (w_k - w_{k-1}))   (7)

We can calculate the derivative of this cost with respect to
w_k:

    ∇_w J(w_k, u_k) = -∇_w h(w_{k-1}, u_k)(e_mk - ∇_w h(w_{k-1}, u_k)^T (w_k - w_{k-1}))   (8)

The invention introduces the novel criterion that the
update step taken, Δw_{k-1} = w_k - w_{k-1}, be parallel with the
gradient of the quadratic cost estimate at the new location,
w_k, to which we are stepping:

    Δw_{k-1} = -μ_k ∇_w J(w_k, u_k)
             = μ_k ∇_w h(w_{k-1}, u_k)(e_mk - ∇_w h(w_{k-1}, u_k)^T Δw_{k-1})   (9)

where we have introduced μ_k as a constant of proportion-
ality. In a linear filter, this criterion would encourage updates
which move more directly towards the quadratic cost mini-
mum than a steepest descent update. For the invention, since
μ_k is a learning rate which can be freely adjusted, we solve
(9) for Δw_{k-1}:

    Δw_{k-1} = (I/μ_k + ∇_w h(w_{k-1}, u_k) ∇_w h(w_{k-1}, u_k)^T)^{-1} ∇_w h(w_{k-1}, u_k) e_mk   (10)

Equ. (10) is the Adaptive Modified Gauss-Newton
algorithm, which will henceforth be referred to as AMGN.
To place this aspect of the invention in context with respect
to classical optimization, consider Newton's iterative tech-
nique applied to the new set of inputs at each epoch:

    w_k = w_{k-1} - μ_k (∇_w^2 J(w_{k-1}, u_k))^{-1} ∇_w J(w_{k-1}, u_k)   (11)

Since the Hessian of the cost, ∇_w^2 J(w_{k-1}, u_k), is difficult to
calculate, the Gauss-Newton technique uses an estimation of
the Hessian of the cost, which for the cost of equ. (2) would
be

    ∇_w^2 J(w_{k-1}, u_k) = ∇_w^2 h(w_{k-1}, u_k) e_fk + ∇_w h(w_{k-1}, u_k) ∇_w h(w_{k-1}, u_k)^T   (12)

    ≈ ∇_w h(w_{k-1}, u_k) ∇_w h(w_{k-1}, u_k)^T   (13)

This approximation is sound as one converges to a mini-
mum at which e_fk → 0 as k → ∞. Problems arise, however,
when ∇_w^2 h(w_{k-1}, u_k) e_fk is not negligible, or when the matrix

    ∇_w h(w_{k-1}, u_k) ∇_w h(w_{k-1}, u_k)^T   (14)

is poorly conditioned. In the context of the invention, the
latter phenomenon is particularly prevalent when the train-
ing signal is narrowband, or the network architecture has
excessive degrees of freedom. To ameliorate these effects,
the Gauss-Newton algorithm can be modified by adding
some positive definite matrix to (14) before taking the
inverse, with the objective of improving the condition
number, and the convergence characteristics. This invention
describes an adaptive method for controlling the condition-
ing of the inverse in (10) using the learning rate μ_k.

If μ_k is too large, this can negatively impact the update
direction for the reasons discussed above; if μ_k is set too
small, convergence is retarded. Therefore, we determine μ_k
adaptively by successive step size augmentation and reduc-
tion. In a preferred embodiment μ is augmented by some
factor, e_a, after each update of the weights which diminishes
cost, and reduced by some factor, e_r, after each update which
enlarges cost. It will be noted by one skilled in the art that
other methods for augmenting and reducing μ based on the
performance of each weight update can be used, without
changing this essential idea in the invention.

One skilled in the art will also recognize that for complex
cost functions, AMGN can be trapped in non-global minima
of cost. One method of ameliorating this problem is to
multiply each weight update by some term proportional to

    max(J_mk - J*, 0) / ||∇_w h(w_{k-1}, u_k) e_mk||_2

where J_mk is the measured cost at epoch k, J* is some
predetermined good enough value of cost, and ||·||_2 denotes
the 2-norm of the vector. In this way, if the parameters
approach a local minimum but J_mk > J*, the weight updates
will grow large so that the local minimum can be escaped.
This and many other techniques for avoiding local minima
may be used to augment AMGN without changing the
essential idea.
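The role of the I/μ_k term in equation (10) as a conditioner of the inverse can be seen in a small numerical sketch (the signal and matrix sizes are illustrative assumptions):

    import numpy as np

    # Illustrative sketch (hypothetical numbers): the term I/mu in equation (10)
    # controls the conditioning of the matrix that must be inverted.  With a
    # narrowband training signal the rows of grad_h become nearly dependent and
    # grad_h @ grad_h.T is almost singular; a finite mu keeps the inverse well behaved.
    rng = np.random.default_rng(1)
    t = np.arange(500)
    narrowband = np.sin(0.05 * t)                       # slowly varying training signal
    grad_h = np.vstack([np.roll(narrowband, k) for k in range(4)])  # nearly collinear rows

    gg = grad_h @ grad_h.T
    for mu in (1e-1, 1e1, 1e3):
        m = np.eye(4) / mu + gg
        print(mu, np.linalg.cond(m))                    # condition number grows with mu

As the text notes, this is why the learning rate is allowed to grow only while each update keeps reducing cost.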
VII. Calculating the Term ∇_w h(w_{k-1}, u_k) for a Nonlinear
FIR Filter

This section addresses the calculation of ∇_w h(w_{k-1}, u_k)
for a general nonlinear FIR network which can be decom-
posed into static nonlinear and linear dynamic components,
such as that displayed in FIG. 2. Each layer 7 of this network
consists of a linear FIR filter 2 feeding into a nonlinear
generalized single-layer subnetwork (GSLN) 3 which gen-
erates the weighted sum of a set of static nonlinear functions.
A GSLN refers to any weighted sum of nonlinear functions
[f_1 . . . f_V] where the weights [x_1 . . . x_V] are adjustable
parameters.

Since the initial states of the FIR filters 2 are unknown, we
must wait until our known set of inputs 8 have propagated
through the network before useful training data is generated
at the output 9. Therefore, the output sequence 9 which we
use to train the filter will be of length

    N - sup_p( Σ_l M^{p,l} ) + 1

where M^{p,l} is the number of taps of the FIR filter at layer
l of path p, and the sup operation selects that path with the
most cumulative delays. One may consider the multi-layer
filter of FIG. 2 to be represented by a known nonlinear filter
operator, h, which is determined solely by the network
architecture. h operates on (w, u), where u is a vector of
length N containing all the inputs 8, and w is a vector of
length

    Σ_p Σ_l (M^{p,l} + V)

which contains every adaptable parameter in the network.
The output sequence 9 is stored in a vector of length

    N - sup_p( Σ_l M^{p,l} ) + 1

which we denote h(w, u) as described above.

The matrix term ∇_w h(w_{k-1}, u_k), which contains the
derivative of each element of the output sequence with
respect to each parameter in the network, is easy to calculate
in a one-layered network (where L = 1). In this case, each row
is simply the sequence of inputs that excite the correspond-
ing weight in w_{k-1}. We will now describe how this term is
found in a multi-layered network such as FIG. 2, with FIR
filters 2 and nonlinear components 3 at hidden layers. All
indication of the epoch k will be abandoned in this section
for economy of notation. Keep in mind that the network
parameters and output sequences are epoch-dependent, how-
ever.

Firstly, we need to detail notation for the propagation of
data through the layers of the network. Consider a single
layer l 7 in the path p of the filter of FIG. 2. For the sake of
clarity, we have boxed this layer in the figure. In isolation,
the single layer implements a Wiener-type filter [11]. The
layer has two sets of weights, those determining the output
of the FIR filter

    ω^{p,l} = [ω_1^{p,l} . . . ω_{M^{p,l}}^{p,l}]^T   (15)

and those determining the weighting of each function of the
Volterra node

    x^{p,l} = [x_1^{p,l} . . . x_V^{p,l}]^T.   (16)

Imagine the output of the GSLN of layer l-1 is described
by the matrix

    F^{p,l-1} = [F_1^{p,l-1} . . . F_{N^{p,l-1}}^{p,l-1}]^T   (17)

where N^{p,l-1} is the length of the vector of outputs from layer
l-1 of path p. We can create a matrix of the sequential activation states
of the FIR filter 2 at layer l 7,

    A^{p,l} = [ F_{M^{p,l}}^{p,l-1}        F_{M^{p,l}-1}^{p,l-1}      . . .   F_1^{p,l-1}
                F_{M^{p,l}+1}^{p,l-1}      F_{M^{p,l}}^{p,l-1}        . . .   F_2^{p,l-1}
                . . .
                F_{N^{p,l-1}}^{p,l-1}      F_{N^{p,l-1}-1}^{p,l-1}    . . .   F_{N^{p,l-1}-M^{p,l}+1}^{p,l-1} ]   (18)

The output of the FIR filter 2 at this layer 7 can then be
found according to:

    a^{p,l} = A^{p,l} × ω^{p,l}   (19)

    = [a_1^{p,l} . . . a_{N^{p,l}}^{p,l}]^T   (20)

where × is a matrix multiplication. This output vector is used
in turn to construct the matrix, B^{p,l}, of inputs which excite
the weights of the GSLN 3 of layer l 7:

    B^{p,l} = [ f_1(a_1^{p,l})          f_2(a_1^{p,l})          . . .   f_V(a_1^{p,l})
                f_1(a_2^{p,l})          f_2(a_2^{p,l})          . . .   f_V(a_2^{p,l})
                . . .
                f_1(a_{N^{p,l}}^{p,l})  f_2(a_{N^{p,l}}^{p,l})  . . .   f_V(a_{N^{p,l}}^{p,l}) ]   (21)

The output 10 of the GSLN is then found by the weighted
summation:

    F^{p,l} = B^{p,l} × x^{p,l}   (22)

    = [F_1^{p,l} . . . F_{N^{p,l}}^{p,l}]^T   (23)

In a similar fashion, we can describe the gradient of the
function implemented by the GSLN 3 for each set of inputs
according to:

    F'^{p,l} = B'^{p,l} × x^{p,l}   (24)

where

    B'^{p,l} = [ f'_1(a_1^{p,l})          f'_2(a_1^{p,l})          . . .   f'_V(a_1^{p,l})
                 f'_1(a_2^{p,l})          f'_2(a_2^{p,l})          . . .   f'_V(a_2^{p,l})
                 . . .
                 f'_1(a_{N^{p,l}}^{p,l})  f'_2(a_{N^{p,l}}^{p,l})  . . .   f'_V(a_{N^{p,l}}^{p,l}) ]   (25)

and

    F'^{p,l} = [F'_1^{p,l} . . . F'_{N^{p,l}}^{p,l}]^T.   (26)
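A compact Python sketch of these per-layer quantities, assuming a polynomial GSLN f_v(a) = a^v and arbitrary toy weights, is given below; the variable names mirror equations (15) through (26):

    import numpy as np

    # Sketch of one Wiener-type layer of FIG. 2 (equations (15)-(26)), with an
    # assumed polynomial GSLN f_v(a) = a**v.  omega are the FIR taps, x the GSLN
    # weights, F_prev the previous layer's output.
    def layer_forward(F_prev, omega, x):
        M, V = len(omega), len(x)
        N = len(F_prev) - M + 1
        # A: rows of sequential FIR activation states, newest sample first (equation (18))
        A = np.array([F_prev[i : i + M][::-1] for i in range(N)])
        a = A @ omega                                                       # equations (19)-(20)
        B = np.array([[ai ** v for v in range(1, V + 1)] for ai in a])      # equation (21)
        Bp = np.array([[v * ai ** (v - 1) for v in range(1, V + 1)] for ai in a])  # equation (25)
        F = B @ x                                                           # equations (22)-(23)
        Fp = Bp @ x                                                         # equations (24), (26)
        return A, a, B, F, Fp

    F_prev = np.linspace(-1.0, 1.0, 10)                  # toy previous-layer outputs
    omega = np.array([0.5, 0.3, 0.2])                    # 3-tap FIR weights
    x = np.array([1.0, 0.0, 0.1])                        # cubic GSLN weights
    A, a, B, F, Fp = layer_forward(F_prev, omega, x)
    print(F.shape)                                       # (8,) = N_prev - M + 1 outputs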
We can begin to describe the calculation of ∇_w h(w_{k-1}, u_k)
by recognizing, as would one skilled in the art, that this term
is trivial to find in a one-layer linear network. Consequently,
we seek to condition the inputs to each network layer so that
weights of hidden layers l = {1 . . . L-1} can be trained
similarly to the weights of the output layer L. This is
achieved by pre-multiplying the inputs to each layer by the
differential gain which that layer's output 10 sees due to
subsequent layers in the network. To illustrate this idea,
consider the function h^{p,l} to represent the reduced filter
function for a single layer, i.e.

    h^{p,l}(w_{k-1}^{p,l}, u_k) = F^{p,l}   (27)

To find ∇_{w^{p,l}} h^{p,l}(w_{k-1}^{p,l}, u_k), the gradient of this layer's
outputs 10 with respect to each of the weights in the layer,
we pre-amplify the matrix of filter states, A^{p,l}, by the
gradient of the nonlinear function 3 which corresponds to
each row of FIR filter states. In this way, given the weights
vector for the whole layer, w^{p,l} = [ω^{p,l} x^{p,l}]^T, and the set of
inputs 8 at epoch k, we can determine

    ∇_{w^{p,l}} h^{p,l}(w_{k-1}^{p,l}, u_k) = [A^{p,l} · F'^{p,l}   B^{p,l}]^T   (28)

generating the matrix
    [ F_{M^{p,l}}^{p,l-1} F'_1^{p,l}      . . .   F_1^{p,l-1} F'_1^{p,l}      f_1(a_1^{p,l})          . . .   f_V(a_1^{p,l})
      F_{M^{p,l}+1}^{p,l-1} F'_2^{p,l}    . . .   F_2^{p,l-1} F'_2^{p,l}      f_1(a_2^{p,l})          . . .   f_V(a_2^{p,l})
      . . .
      F_{N^{p,l-1}}^{p,l-1} F'_{N^{p,l}}^{p,l}   . . .   F_{N^{p,l-1}-M^{p,l}+1}^{p,l-1} F'_{N^{p,l}}^{p,l}   f_1(a_{N^{p,l}}^{p,l})   . . .   f_V(a_{N^{p,l}}^{p,l}) ]   (29)

Note that the operation in (28) multiplies each row of A^{p,l}
with the element in the corresponding row of F'^{p,l}. The
concept applied to (28) can be extended to the entire filter of
FIG. 2 if we realize that the outputs of layer l, which excite
the FIR filter 2 at layer l+1, do not see a static differential
gain due to the subsequent layer, but rather a time-varying
gain as the outputs propagate through each tap in the filters
of layer l+1, as well as subsequent layers. This time-
varying gain can be taken into account by convolving the
inputs to each layer with the weights of the FIR filters 2 at
subsequent layers, and then multiplying the convolved
inputs by the gradient due to the nonlinear GSLN 3 at the
subsequent layer. We define the convolution operation as
follows:

Given any set of weights for an FIR filter 2

    ω = [ω_1 . . . ω_β]^T   (30)

and a matrix of inputs

    F = [ F_β       . . .   F_1
          F_{β+1}   . . .   F_2
          . . .
          F_N       . . .   F_{N-β+1} ]   (31)

we have the convolution of F with ω

    C_ω(F) = [ Σ_a F_{β+1-a} ω_a      . . .   Σ_a F_{β+1-a} ω_{a-α+1}
               Σ_a F_{β+2-a} ω_a      . . .   Σ_a F_{β+2-a} ω_{a-α+1}
               . . .
               Σ_a F_{N+1-a} ω_a      . . .   Σ_a F_{N+1-a} ω_{a-α+1} ]   (32)

where the dimensions of C_ω(F) are (N-β-α+2) × α. This
convolution operation is sequentially applied to each layer to
account for the total time-varying differential gain which the
outputs of a layer see due to all subsequent layers in the path.
Hence, we can describe the derivative of the filter operator
for the full path p 11 with respect to the weight vector for
layer l, w^{p,l}, operating on the current set of weights and
inputs, as follows:

    ∇_{w^{p,l}} h^p(w^p, u) = [C_{ω^{p,L^p}}(C_{ω^{p,L^p-1}}( . . . C_{ω^{p,l+1}}([A^{p,l} · F'^{p,l}   B^{p,l}]) · F'^{p,l+1} . . . ) · F'^{p,L^p-1}) · F'^{p,L^p}]^T   (33)

where L^p is the number of layers in path p; refer to FIG. 2.
Now, by combining matrices like (33) for the weights of
each layer, we can extend our calculation to find the gradient
with respect to the weights vector of an entire path 11 as
follows:

    ∇_{w^p} h^p(w^p, u) = [∇_{w^{p,1}} h^p(w^p, u)   ∇_{w^{p,2}} h^p(w^p, u)   . . .   ∇_{w^{p,L^p}} h^p(w^p, u)]   (34)

It is trivial to extend this to the multiple paths in the
network. We combine the matrices of weights for each path,
to form a vector of weights for the entire network of FIG. 2,
with P total paths,

    w = [w^1 . . . w^P]^T   (35)

The derivative of the full network operator with respect to
w, acting upon the current weights, w, and the inputs, u, is
then

    ∇_w h(w, u) = [∇_{w^P} h^P(w^P, u)   ∇_{w^{P-1}} h^{P-1}(w^{P-1}, u)   . . .   ∇_{w^1} h^1(w^1, u)]   (36)

With this matrix in hand, we are now able to implement
AMGN for the dynamic nonlinear network of FIG. 2. The
matrix ∇_w h(w, u) also enables one skilled in the art to apply
other Newton-like optimization methods, such as BFGS,
Kalman Filtering and Gauss-Newton, to an FIR network with
embedded memory, as is illustrated in section X. Note that the
method described naturally corrects each weight for the
error it produces over the entire sequence of data. While the
technique has been elucidated for the case of an FIR filter,
a similar approach can be used to train IIR filters by
propagating data through the network for a finite interval,
long enough to capture the relevant system dynamics.
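The following Python sketch applies this construction to an assumed two-layer, single-path network: the hidden layer's excitation matrix is convolved with the output layer's FIR taps and scaled by the output GSLN's differential gain, and one entry of the result is checked against a numerical derivative. All sizes and weights are illustrative assumptions.

    import numpy as np

    def fir(seq, taps):
        return np.convolve(seq, taps, mode="valid")     # a_n = sum_j taps_j * seq_{n+M-1-j}

    def gsln(a, x):            # polynomial GSLN: f_v(a) = a**v, v = 1..V
        V = len(x)
        B = np.stack([a ** v for v in range(1, V + 1)], axis=1)
        Bp = np.stack([v * a ** (v - 1) for v in range(1, V + 1)], axis=1)
        return B @ x, B, Bp @ x                         # output, basis matrix, derivative

    def forward(u, w1_fir, w1_gsln, w2_fir, w2_gsln):
        a1 = fir(u, w1_fir)
        F1, B1, F1p = gsln(a1, w1_gsln)
        a2 = fir(F1, w2_fir)
        y, _, F2p = gsln(a2, w2_gsln)
        return y, B1, F1p, F2p

    def windows(seq, M):       # rows [seq_{n+M-1}, ..., seq_n], as in equation (18)
        return np.array([seq[i : i + M][::-1] for i in range(len(seq) - M + 1)])

    rng = np.random.default_rng(0)
    u = rng.standard_normal(40)
    w1_fir, w1_gsln = np.array([0.6, 0.3, 0.1]), np.array([1.0, 0.2])
    w2_fir, w2_gsln = np.array([0.8, -0.4]), np.array([1.0, 0.05])

    y, B1, F1p, F2p = forward(u, w1_fir, w1_gsln, w2_fir, w2_gsln)

    # Gradient with respect to the hidden layer's weights [w1_fir, w1_gsln]:
    G1 = np.hstack([windows(u, len(w1_fir)) * F1p[:, None], B1])   # equation (28) for layer 1
    conv = np.stack([fir(G1[:, c], w2_fir) for c in range(G1.shape[1])], axis=1)
    grad_w1 = conv * F2p[:, None]                        # time-varying gain of the output layer

    # Check one entry numerically.
    eps, i = 1e-6, 1
    w1p = w1_fir.copy(); w1p[i] += eps
    y_eps, *_ = forward(u, w1p, w1_gsln, w2_fir, w2_gsln)
    print(grad_w1[5, i], (y_eps[5] - y[5]) / eps)        # the two values should agree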
VIII. Calculating the Term ∇_w h(w_{k-1}, u_k) for a
General IIR Filter Architecture

In this section, the extension of the algorithm to nonlinear
IIR filters is discussed. The approach described here is very
general and applies to a vast range of filter architectures.
While the approach described in this section could also be
applied to an FIR filter as in FIG. 2, it is less computationally
efficient for that problem than the approach described in the
previous section. The generic architecture for the filter of
this discussion is displayed in FIG. 3.
The system has M states 12, represented at time n by the
state vector a_n = [a_{1n} . . . a_{Mn}]^T. We assume that the subse-
quent value of each state in the filter or system is some
function 14 of the current states 12, the inputs 13 and the set
of parameters within the network:

    a_{n+1} = [a_{1,n+1} . . . a_{M,n+1}]^T = [f_1(a_n, u_n, w) . . . f_M(a_n, u_n, w)]^T   (37)

where w = [w_1 . . . w_V]^T is a vector of length V, containing all
adaptable parameters in the network, and once again {u_n},
n = 1 . . . N, 13 is the input sequence to the system. We assume
that each output 16 of the filter or system is generated by the
function 15 y_n = f_o(a_n, u_n, w). For the preferred embodiment,
we assume that the states of the system 12 are all 0 before
excitation with the input sequence 13 {u_n}. The sequence of
outputs is then stored in a vector of length N which we
denote h(w, u) as described above. The task, as in the
previous section, is to determine the matrix ∇_w h(w, u).

To calculate the dependence of some output y_n on w, we
use a partial derivative expansion with respect to the state
vector:

    ∇_w y_n = Σ_{i=1}^{n} (∇_{a_i} y_n)(∇_w a_i)   (38)

The term ∇_w a_i in equ. (38) can be directly calculated:

    ∇_w a_i = [ ∂a_{1i}/∂w_1   . . .   ∂a_{1i}/∂w_V
                . . .
                ∂a_{Mi}/∂w_1   . . .   ∂a_{Mi}/∂w_V ]

            = [ ∂f_1(a_{i-1}, u_{i-1}, w)/∂w_1   . . .   ∂f_1(a_{i-1}, u_{i-1}, w)/∂w_V
                . . .
                ∂f_M(a_{i-1}, u_{i-1}, w)/∂w_1   . . .   ∂f_M(a_{i-1}, u_{i-1}, w)/∂w_V ]   (39)

In order to calculate the term ∇_{a_i} y_n in equ. (38), we again
apply a partial derivative expansion as follows:

    ∇_{a_i} y_n = (∇_{a_n} y_n)(∇_{a_{n-1}} a_n) . . . (∇_{a_i} a_{i+1})   (40)

Each of these terms can now be directly calculated:

    ∇_{a_i} a_{i+1} = [ ∂a_{1,i+1}/∂a_{1i}   . . .   ∂a_{1,i+1}/∂a_{Mi}
                        . . .
                        ∂a_{M,i+1}/∂a_{1i}   . . .   ∂a_{M,i+1}/∂a_{Mi} ]   (41)

    ∇_{a_n} y_n = [ ∂y_n/∂a_{1n}   . . .   ∂y_n/∂a_{Mn} ]
                = [ ∂f_o(a_n, u_n, w)/∂a_{1n}   . . .   ∂f_o(a_n, u_n, w)/∂a_{Mn} ]   (42)

Based upon these equations, we present below an algo-
rithm by which ∇_w h(w, u) can be calculated. We begin with
two all-zeros matrices, H = 0_{N×V} and J = 0_{N×M}. For notational
clarity, note that H(i,:) refers to all the columns at row i in
H, H(:,i) refers to all the rows in column i of H, and H(i:n,:)
refers to the matrix block defined by all the columns and the
rows from i to n.

1. J(N,:) = ∇_{a_N} y_N
2. H(N,:) = ∇_{a_N} y_N × ∇_w a_N
3. for i = N-1 : 1, step -1
4.   J(i+1:N,:) ← J(i+1:N,:) × ∇_{a_i} a_{i+1}
5.   J(i,:) ← ∇_{a_i} y_i
6.   H(i:N,:) ← H(i:N,:) + J(i:N,:) × ∇_w a_i
7. end
8. ∇_w h(w, u) ← H
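As an illustration, the steps above can be transcribed into Python for an assumed two-state nonlinear IIR model; the model itself and all constants are hypothetical, and one entry of H is verified numerically:

    import numpy as np

    # Assumed illustrative model with M = 2 states and V = 4 parameters:
    #   a1[n+1] = w1*a1[n] + w2*u[n]
    #   a2[n+1] = w3*tanh(a1[n]) + w4*a2[n]
    #   y[n]    = a2[n]
    def simulate(w, u):
        N = len(u)
        a = np.zeros((N, 2))                            # a[0] = 0: states zero before excitation
        for n in range(N - 1):
            a[n + 1, 0] = w[0] * a[n, 0] + w[1] * u[n]
            a[n + 1, 1] = w[2] * np.tanh(a[n, 0]) + w[3] * a[n, 1]
        return a[:, 1].copy(), a                        # y_n = f_o(a_n) = second state

    def grad_h(w, u):
        y, a = simulate(w, u)
        N, M, V = len(u), 2, 4
        H = np.zeros((N, V)); J = np.zeros((N, M))
        dy_da = np.array([0.0, 1.0])                    # gradient of y_n with respect to a_n

        def dA_dw(i):                                   # direct partials of a_i w.r.t. w, eq. (39)
            if i == 0:
                return np.zeros((M, V))                 # initial state does not depend on w
            return np.array([[a[i - 1, 0], u[i - 1], 0.0, 0.0],
                             [0.0, 0.0, np.tanh(a[i - 1, 0]), a[i - 1, 1]]])

        def dA_da(i):                                   # partials of a_{i+1} w.r.t. a_i, eq. (41)
            return np.array([[w[0], 0.0],
                             [w[2] * (1.0 - np.tanh(a[i, 0]) ** 2), w[3]]])

        J[N - 1] = dy_da                                # step 1
        H[N - 1] = dy_da @ dA_dw(N - 1)                 # step 2
        for i in range(N - 2, -1, -1):                  # steps 3-7
            J[i + 1:] = J[i + 1:] @ dA_da(i)
            J[i] = dy_da
            H[i:] += J[i:] @ dA_dw(i)
        return y, H                                     # step 8: H(n, v) = dy_n / dw_v

    rng = np.random.default_rng(0)
    u = rng.standard_normal(30)
    w = np.array([0.7, 1.0, 0.5, 0.6])
    y, H = grad_h(w, u)

    eps, v = 1e-6, 2                                    # numerical check for weight w3
    wp = w.copy(); wp[v] += eps
    yp, _ = simulate(wp, u)
    print(H[10, v], (yp[10] - y[10]) / eps)             # the two values should agree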
IX. Criteria for Selection of the AMGN Algorithm
Above BPTT

In this section we will briefly discuss the performance of
BP, and algorithm (10), applied to nonlinear adaptive prob-
lems.

It has been shown [12] that the steepest-descent method is
H∞-optimal for linear prediction problems, and that
BP (we use the term BP to include BPTT for dynamic
networks) is locally H∞-optimal for multi-layered nonlin-
ear prediction problems. In essence, this optimality means
that the ratio of the second norm of the disturbance in equ.
(1),

    Σ_{n=1}^{N} v_n^2,

to the second norm of the filtered error,

    Σ_{n=1}^{N} e_{fn}^2,

can be made arbitrarily close to 1.0 for any {v_n} as N → ∞,
by selecting an initial set of weights close enough to the
ideal value. These results explain the superior robustness of
BP for nonlinear applications, and linear applications with
non-Gaussian disturbances. However, this optimality result
for BP applies to scenarios where a new set of inputs is
presented to a filter, thus changing cost as a function of the
weights, at each training epoch. In this case, the nonlinear
modeling error at each training session is treated by BP as
an unknown disturbance. This approach to training would
occur, for example, in nonlinear prediction applications in
which processing speed is limited. It is worth noting that in
these scenarios, where parameters are updated according to
the error of each newly predicted datum, the matrix (14) will
have rank 1. Consequently, the update according to AMGN
would be in the same direction as that of BP. In fact, in this
situation, Gauss-Newton learning is simply a normalized
least squares algorithm, which is H∞-optimal for a poste-
riori error for linear problems [9]. In many nonlinear appli-
cations however, multiple weight updates are performed
using the same set of inputs. In these cases, the performance
of BP and AMGN become clearly distinguishable.

We will clearly illustrate the different algorithms' char-
acteristics by transferring our considerations to a simple
two-dimensional parameter-space. Consider the simple
static network of FIG. 4. Assume that the desired values for
the two weights 17 are w_1 = 1, w_2 = 1, so that the desired
output 18 of the network is y_n = u_n + u_n^2. We select a vector of
inputs 19 [1 2 3]^T so that the desired vector of outputs is [2
6 12]^T. FIG. 5 displays the performance over 100 training
epochs of both BP (shown as x) and AMGN (shown as o).
For the purpose of comparison, the learning rate was
adjusted so that the initial steps taken by both algorithms
were of the same magnitude. The learning rate was then
adjusted according to the successive augmentation and
reduction rule. Notice that AMGN is very close to the minimum
after a few updates. By contrast, BP finds itself stuck at w_1 = 0.84,
w_2 = 1.41. This non-stationary limit point occurs since the
cost improvement at each update is not substantial enough to
achieve convergence. This type of trap for steepest descent
algorithms is typical of dynamic networks involving poly-
nomial nonlinearities, or Volterra architectures. Note that if
we set a lower bound on the learning rate then BP will
eventually converge to the point (1,1), but it would take a
very large number of updates! This point is illustrated in
FIG. 6 where we have imposed a constant step size
restriction, 0.1, on both algorithms. Notice the directions of
the steps taken as the algorithms approach the global mini-
mum.
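The contrast can be reproduced with a few lines of Python for the network of FIG. 4 (learning rates chosen only for illustration): a single AMGN step of equation (10) lands near (1,1), whereas the steepest-descent step initially moves w_1 away from its desired value:

    import numpy as np

    # Two-weight example of FIGS. 4-6: y_n = w1*u_n + w2*u_n**2, desired weights (1, 1).
    u = np.array([1.0, 2.0, 3.0])
    y_des = u + u ** 2                                  # [2, 6, 12]
    w = np.array([0.0, 2.0])                            # arbitrary starting weights

    G = np.vstack([u, u ** 2])                          # rows: d(output)/d(w1), d(output)/d(w2)
    e = y_des - (w[0] * u + w[1] * u ** 2)              # measured error

    mu_bp, mu = 0.01, 10.0
    dw_bp = mu_bp * G @ e                               # steepest-descent (BP) step
    dw_amgn = np.linalg.solve(np.eye(2) / mu + G @ G.T, G @ e)   # equation (10)
    print(w + dw_bp)                                    # w1 initially moves away from 1
    print(w + dw_amgn)                                  # close to the minimum (1, 1) in one step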
X. Examples of Tailor-made Adaptive Dynamic
Nonlinear Filters

In this section, we describe two sample applications of the
AMGN algorithm and illustrate how one may tailor-make
a network to perform a specific nonlinear dynamic
operation, such network being efficiently trained with the
new algorithm. Consider the tracking system displayed in
FIG. 7. This system can be considered as some nonlinear
sensor 20, feeding electronics with nonlinear components 21
and dynamics 22, which output to some plant 23. The
dynamics on the feedback path 24 could represent parasitic
capacitance in the system. For this example, we choose the
dynamic transfer functions and nonlinearities according to:
    C_1(s) = (s + 300)/(s + 700)

    C_2(s) = 100/s

    C_3(s) = 1000/(s + 1000)

    f_1(u) = u + (1/2)u^2 + (1/3)u^3

    f_2(u) = arcsin(u)   (43)
Ideally, we would like the system outputs to exactly track
system inputs; however, with nonlinearities, dynamics and
feedback, the outputs are a severely distorted version of the
original system inputs. We seek an adaptive filter for the
purpose of identifying the original system inputs 25 from the
system outputs 26, in other words for inverting the system
of FIG. 7.

The most general technique in prior art for modeling
nonlinear systems with decaying memory, such as FIG. 7, is
via a Volterra series. Consider, for example, a nonlinear
system which one models with a memory of M taps, and
with a nonlinear expansion order of V. The output of the
system in response to an input sequence {u_n} can then be
expressed as the Volterra expansion:

    y_n = Σ_{k_{1,1}=1}^{M} h_{k_{1,1}} u_{n-k_{1,1}}
          + Σ_{k_{2,1}=1}^{M} Σ_{k_{2,2}=1}^{M} h_{k_{2,1} k_{2,2}} u_{n-k_{2,1}} u_{n-k_{2,2}} + . . .
          + Σ_{k_{V,1}=1}^{M} . . . Σ_{k_{V,V}=1}^{M} h_{k_{V,1} . . . k_{V,V}} u_{n-k_{V,1}} . . . u_{n-k_{V,V}}   (44)
The multi-dimensional sequences specified by h in the
expansion are termed Volterra kernels. Each component of a
kernel is a coefficient, or weight, for one term in the
expansion, and each of these weights must be trained in
order to model a system. A network architecture as in FIG.
2, which contains polynomial nonlinearities 3 and embedded
linear filters 2, has substantially fewer parameters than
would be required in a Volterra series expansion of a similar
system. More precisely, the number of Volterra series terms
necessary to emulate a single FIR filter of M taps feeding a
polynomial GSLN of order V is

    Σ_{v=1}^{V} (M + v - 1)! / (v! (M - 1)!)

To emulate the operation of a multi-layered filter as in
FIG. 2, the number of terms required in a Volterra series
would increase very rapidly! In general, networks are less
able to generalize from training data as the degrees of
freedom associated with the network parameter-space
increase.
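For a concrete sense of scale, the following sketch counts the kernel coefficients under the assumption of symmetric Volterra kernels (one coefficient per distinct monomial) and compares this with the parameter count of a single FIR-plus-GSLN layer:

    from math import comb

    # Illustrative count (assuming symmetric kernels): a Volterra expansion with
    # memory M and order V has one coefficient for every distinct monomial of
    # degree 1..V in the M delayed inputs, whereas a single M-tap FIR filter
    # feeding a polynomial GSLN of order V has only M + V parameters.
    def volterra_terms(M, V):
        return sum(comb(M + v - 1, v) for v in range(1, V + 1))

    M, V = 30, 3
    print(volterra_terms(M, V))   # 5455 kernel coefficients
    print(M + V)                  # 33 parameters for the FIR + GSLN structure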
Therefore, if one knows roughly how the system one
seeks to emulate is decomposed to static nonlinear blocks
and dynamic linear components, one can tailor-make a
filter of the form of FIG. 2 to perform the required process-
ing with a limited number of parameters. A polynomial-
based GSLN, where the functions [f_1 . . . f_V] are simply
ascending powers of the input, can model the truncated
Taylor series expansion of differentiable functions.
Consequently, with few parameters, it can emulate accu-
rately a wide range of static nonlinearities. Therefore, based
on an estimate of the time constants of the linear dynamics
to be emulated, and an estimate of the order of the nonlinear
functions to be emulated, we can tailor-make a network for
inverting the system of FIG. 7 as shown in FIG. 8. One
skilled in the art will recognize that this architecture is based
on a block by block inversion of the tracking system.
The acquisition of the training data for this network is
described in FIG. 9. We obtain the desired network output
{y_n} 27 by sampling 36 the original inputs 28 to the tracking
system 29. Sampling 37 the output 30 of the tracking system
29 creates the training input sequence {u_n} 31 for the
network of FIG. 8. The objectives in creating the training
signal 28 are two-fold. First, the nonlinearities 20, 22 must be
excited over the full domain of operation. This is achieved
with a chirp signal 32 which starts at frequency 0 and ramps
to a frequency beyond the bandwidth of the tracking system
29, and which has an amplitude larger than those encoun-
tered during regular tracking. Secondly, in order to improve
the condition of the matrix (14), the signal must have
significant frequency content near to the Nyquist rate, 1/2T_s.
We achieve this by adding 34 zero-mean white Gaussian
noise 33 to the chirp signal 32, which we then input to a
zero-order hold 35 with a hold time three times that of the
sampling rate, T_s. Notice also the factor-of-3 upsamplers on
the network paths 38 in FIG. 8. Used in conjunction with a
3-period zero-order hold 35 in preparation of the training
input 31, the upsampling enables an FIR filter 39 to estimate
accurately the Euler derivative [13] of a signal which has
high frequency content. In brief, every third point of {y_n}
has a well defined gradient; the rest of the calculated
gradients are discarded. The unit delay 40 after sampling 36
the desired output shown in FIG. 9 aligns the training data
27 for the non-causal Euler differentiation. This technique
can also be used to implement steep high-pass
filters and pure differentiators.
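A Python sketch of this training-signal preparation is given below; the chirp range, amplitude, noise level and sampling rate are illustrative assumptions rather than the values used in the experiments:

    import numpy as np

    # Sketch of the training-signal preparation of FIG. 9 (all numeric values are
    # illustrative): a chirp that sweeps past the tracking bandwidth is summed with
    # zero-mean white Gaussian noise and passed through a zero-order hold lasting
    # three sample periods, so every third sample has a well-defined derivative
    # after the factor-of-3 upsampling in FIG. 8.
    fs = 1000.0                                  # sampling rate, Hz (T_s = 1 ms)
    t = np.arange(1000) / fs                     # 1000 training points
    f0, f1, amp = 0.0, 100.0, 2.0                # chirp from 0 Hz to beyond the tracking bandwidth
    chirp = amp * np.sin(2 * np.pi * (f0 + 0.5 * (f1 - f0) * t / t[-1]) * t)

    rng = np.random.default_rng(0)
    noisy = chirp + 0.05 * rng.standard_normal(len(t))   # broadband excitation near Nyquist

    hold = np.repeat(noisy[::3], 3)[: len(t)]    # zero-order hold over three sample periods
    print(hold[:9])                              # each value is held for three samples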
Training sequences of 1000 data points, sampled at 1 kHz,
were used. FIG. 10 shows the RMS network error for 100
parameter updates using the AMGN algorithm, and 200
updates respectively using optimization algorithms known
in the art as the Kalman Filter, the Gauss-Newton technique,
the BFGS algorithm with a line search [10], and BPTT. The
Kalman filter is presented as a standard; the measurement
noise and initial covariance were empirically chosen to
maximally reduce cost after 200 epochs. For the Gauss-
Newton algorithm, reduced order inverses were formed
when matrix (14) was ill-conditioned. The line search for
BFGS was conducted using the method of false position
[14]. Notice that all Newton-like techniques outperform
BPTT. The superior convergence rate and cost minimization
achieved with BFGS and AMGN are clearly evident. Note
that in contrast to BFGS, AMGN doesn't require a line
search, so each epoch involves substantially less computation
than is required for a BFGS update. FIG. 11 shows the
learning rate μ_k over 100 updates for AMGN. Note that the
learning rate increases geometrically until the poor condi-
tioning of matrix (14) significantly distorts the update direc-
tion. While it is difficult to establish that the weights
converge to a global minimum, one can show that the
tracking system dynamics have been captured accurately.
This is evidenced in FIG. 12, where the trained weight
vector 41 [w_1^{1,2} . . . w_30^{1,2}] emulates the impulse response of
a system with dominant pole at 300 rad/sec.

A sum of sinusoids test reference signal was injected into
the tracking system. When the system outputs are injected
into the trained network, the resultant output error is shown
in FIG. 13 (right), together with the error which arises from
an exact inversion of all the linear dynamics of the system
(left). This indicates the extent to which the nonlinear
distortion of the signal has been corrected.
The second example involves the identification and inver-
sion of a Wiener-type nonlinear system. The Wiener model
applies to an audio amplifier which exhibits crossover
distortion. The network architecture employed is displayed
in FIG. 14. The memoryless nonlinearity at the amplifier's
output is emulated with a parameterized function 42 which
can accurately emulate crossover distortion:

    f(y_n, x_1, x_2, x_3) = x_1 (1 - e^{-x_2 y_n}) / (1 + e^{-x_2 y_n}) + x_3 y_n   (45)
The linear dynamics of the amplifier are estimated over
the audible band with an IIR digital filter 43. The network
training data is gathered according to the method of FIG. 15.
The amplifier 47 is excited with a chirp signal 49 ranging in
frequency from 0 to 19.2 kHz, summed 50 with zero-mean
Gaussian noise 48. The amplifier outputs 52 are sampled 51
to obtain the desired network output sequence 46. The
network input sequence 45 is obtained by sampling 54 the
input signal 53. The AMGN algorithm is then used to
identify the parameters of the filter in FIG. 14.
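A minimal forward model of the FIG. 14 structure can be sketched as follows; the first-order IIR section and the crossover-shaped nonlinearity are assumed stand-ins, not the identified amplifier parameters:

    import numpy as np

    # Sketch of a Wiener model (linear dynamics followed by a static nonlinearity).
    def iir_first_order(u, b0, b1, a1):
        y = np.zeros_like(u)
        for n in range(len(u)):
            y[n] = b0 * u[n] + (b1 * u[n - 1] if n > 0 else 0.0) - (a1 * y[n - 1] if n > 0 else 0.0)
        return y

    def crossover(y, x1, x2, x3):
        # sigmoid-plus-linear shape: reduced slope near zero mimics crossover distortion
        return x1 * (1.0 - np.exp(-x2 * y)) / (1.0 + np.exp(-x2 * y)) + x3 * y

    t = np.arange(0, 0.01, 1.0 / 48000.0)             # 10 ms at 48 kHz
    u = 0.8 * np.sin(2 * np.pi * 1000.0 * t)          # 1 kHz test tone
    y_lin = iir_first_order(u, b0=0.6, b1=0.2, a1=-0.3)
    y_out = crossover(y_lin, x1=-0.2, x2=6.0, x3=1.1) # distorted amplifier output
    print(float(np.max(np.abs(y_out - y_lin))))       # size of the nonlinear distortion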
It is known in the art that a non-minimum phase zero
cannot be precisely dynamically compensated, since it will
cause an unstable pole in the compensator. Consequently, if
the identification yields a non-minimum-phase zero for the
linear filter 43, the output sequence 46 is delayed 55 relative
to the input sequence 45 and the identification is repeated.
For input sequence 45 {u_n}, the filter is trained to estimate
outputs 46 {y_{n-1}}. Successive delays may be added until the
zeros of the identified linear filter 43 are minimum-phase.
Once the filter parameters are identified, the linear 43 and
nonlinear blocks 42 are analytically inverted using tech-
niques well known in the art. A lowpass digital filter with
cutoff at roughly 20 kHz is appended to the IIR filter inverse
to limit high-frequency gain; and a lookup table is used to
emulate the inverse of the memoryless nonlinearity. An
audio signal is pre-warped by the inverse nonlinearity and
then the inverse linear dynamics before being input to the
amplifier. The A-D and D-A conversions can be performed
with an AD1847 codec, and the pre-warping of the audio
signal can be performed with an ADSP2181 microprocessor.
FIG. 16 shows the spectral response of the nonlinear ampli-
fier when excited with a dual tone test signal. FIG. 17
displays the spectral response of the amplifier to a pre-
warped dual tone. Note that the amplitudes of nonlinear
distortion harmonics have been reduced in the pre-warped
signal by more than 20 dB.

These examples are intended only as illustrations of the
use of the invention; they do not in any way suggest a limited
scope for its application.
REFERENCES

[1] S. D. Stearns, Adaptive Signal Processing, Prentice Hall, 1985.
[2] D. K. Frerichs et al., "U.S. patent: Method and procedure for neural control of dynamic processes", Technical Report U.S. Pat. No. 5,175,678, U.S. Patent and Trademark Office, December 1992.
[3] A. Mathur, "U.S. patent: Method for process system identification using neural networks", Technical Report U.S. Pat. No. 5,740,324, U.S. Patent and Trademark Office, April 1998.
[4] D. C. Hyland, "U.S. patent: Multiprocessor system and method for identification and adaptive control of dynamic systems", Technical Report U.S. Pat. No. 5,796,920, U.S. Patent and Trademark Office, August 1998.
[5] S. A. White et al., "U.S. patent: Nonlinear adaptive filter", Technical Report U.S. Pat. No. 4,843,583, U.S. Patent and Trademark Office, June 1989.
[6] D. H. Nguyen, Applications of Neural Networks in Adaptive Control, PhD thesis, Stanford University, June 1991.
[7] D. M. Hanks, "U.S. patent: Adaptive feedback system for controlling head/arm position in a disk drive", Technical Report U.S. Pat. No. 5,548,192, U.S. Patent and Trademark Office, August 1996.
[8] A. Back et al., "A unifying view of some training algorithms for multilayer perceptrons with FIR filter synapses", Proceedings of the 1994 IEEE Workshop on Neural Networks for Signal Processing, vol. 1, pp. 146-154, 1994.
[9] B. Hassibi, "H∞ optimality of the LMS algorithm", IEEE Transactions on Signal Processing, vol. 44, no. 2, pp. 267-280, February 1996.
[10] D. P. Bertsekas, Nonlinear Programming, vol. 1, Athena Scientific, 2nd edition, 1995.
[11] N. Wiener, Nonlinear Problems in Random Theory, Wiley, New York, 1958.
[12] B. Hassibi, "H∞ optimal training algorithms and their relation to backpropagation", Proceedings of NIPS94, Neural Information Processing Systems: Natural and Synthetic, pp. 191-198, November-December 1994.
[13] G. F. Franklin, Digital Control of Dynamic Systems, Addison-Wesley, 2nd edition, 1990.
[14] D. G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, 2nd edition, 1984.
What is claimed is:
1. A method for training a dynamic nonlinear adaptive
filter with embedded memory, the method comprising:
a) propagating a training sequence through the dynamic
nonlinear adaptive filter to obtain an output sequence,
where the filter is represented by a filter architecture h
and a set of filter parameters w, and where the filter
architecture h comprises adaptable static nonlinear
components and adaptable linear components;
b) constructing a matrix ∇h_w from filter architecture h,
from the set of filter parameters w, and from an error
sequence, where the matrix ∇h_w relates each parameter
of w to an error produced by the parameter over all
components of the error sequence, where the error
sequence measures a difference between the output
sequence and a desired output sequence; and
c) updating the filter parameters w based on a learning
rate, current (and possibly past) parameter vectors w,
current (and possibly past) matrices ∇h_w, and current
(and possibly past) error sequences, where an update
direction is closer to a Newton update direction than
that of a steepest descent method.
2. The method of claim 1 wherein updating the filter
parameter w is achieved with a modified Gauss-Newton
optimization technique which uses only the current param-
eter vector w, the current matrix ∇h_w, the current error
sequence, and where the update direction and an update
magnitude are determined by the learning rate.
3. The method of claim 1 wherein constructing the matrix
∇h_w comprises repeated convolutions with time-varying
gains at various layers of the filter.
4. The method of claim 1 wherein constructing the matrix
∇h_w comprises repeated matrix multiplications that account
for a dependence of filter output on all prior states within the
filter.
5. The method of claim 1 wherein updating the filter
parameters w comprises:
propagating the training sequence through the filter to
obtain a temporary output sequence, where a temporary
set of filter parameters w_t is used during the propagat-
ing;
evaluating a cost function J depending on the error
sequence; evaluating a temporary cost function J_t
depending on a temporary error sequence, where the
temporary error sequence measures a difference
between the temporary output sequence and the desired
output sequence;
setting w=w_t and increasing the learning rate if J>J_t;
decreasing the learning rate and repeating the parameter
update if J≦J_t.
6. The method of claim 1 wherein the filter architecture h
comprises adaptable static nonlinear components and adapt-
able linear components.
7. The method of claim 1 wherein (a) through (c) are
repeated with training signals selected to excite an entire
domain of operation of the nonlinear components, and to
excite an entire operational bandwidth of the linear compo-
nents.
8. The method of claim 1 further comprising delaying the
training sequence relative to the output sequence, whereby
the filter may be trained to implement a non-causal Euler
derivative in mapping from the outputs to the inputs.
9. The method of claim 1 further comprising delaying the
output sequence relative to the training sequence, whereby
the filter may be trained to implement a minimum-phase
model of a dynamic system.
10. A method for training a dynamic nonlinear adaptive
filter, the method comprising:
a) propagating a training sequence through the dynamic
nonlinear adaptive filter to obtain an output sequence,
where the filter is represented by a filter architecture h
and a set of filter parameters w;
b) constructing a matrix ∇h_w from filter architecture h,
from the set of filter parameters w, and from an error
sequence, where the error sequence measures a differ-
ence between the output sequence and a desired output
sequence; and
c) determining an update vector Δw from matrix ∇h_w,
from the error sequence, and from a learning rate,
where both a magnitude and a direction of update
vector Δw depend on a value of the learning rate; and
d) updating the filter parameters w based on the update
vector Δw.
11. The method of claim 10 wherein the direction of the
update vector Δw is closer to the Newton update direction
than that of a steepest-descent method.
12. The method of claim 10 wherein the matrix ∇h_w
relates each of the parameters in w to an error produced by
that parameter over all components of the error sequence.
13. The method of claim 10 wherein the filter architecture
h comprises adaptable static nonlinear components and
adaptable linear components.
14. The method of claim 10 wherein updating the filter
parameters w comprises:
propagating the training sequence through the filter to
obtain a temporary output sequence, where a temporary
set of filter parameters w_t is used during the propagat-
ing;
evaluating a cost function J depending on the error
sequence;
evaluating a temporary cost function J_t depending on a
temporary error sequence, where the temporary error
sequence measures a difference between the temporary
output sequence and the desired output sequence;
setting w=w_t and increasing the learning rate if J>J_t;
decreasing the learning rate and repeating the parameter
update if J≦J_t.
15. The method of claim 10 wherein (a) through (d) are
repeated with training sequences selected to excite an entire
domain of operation of the nonlinear components, and to
excite an entire operational bandwidth of the linear compo-
nents.
16. The method of claim 10 further comprising delaying
the training sequence relative to the output sequence,
whereby the filter may be trained to implement a non-causal
Euler derivative in mapping from the outputs to the inputs.
17. The method of claim 10 further comprising delaying
the output sequence relative to the training sequence,
whereby the filter may be trained to implement a minimum-
phase model of a dynamic system.
18. A dynamic nonlinear adaptive filter with embedded
memory comprising:
a) a plurality of adjustable linear filter blocks;
b) a plurality of adjustable nonlinear filter blocks;
c) a filter training circuit; and
wherein the linear and nonlinear filter blocks are inter-
connected to form a network;
wherein the linear and nonlinear filter blocks comprise
memory for storing filter parameters;
wherein the filter training circuit updates the filter
parameters using a method with parameter updates
closer to a Newton update direction than those of a
steepest descent method, and which comprises com-
paring a desired filter output sequence with an actual
filter output sequence resulting from a filter training
sequence propagated through the filter; and
wherein updating the filter parameter w is achieved
with a modified Gauss-Newton optimization tech-
nique for which the update direction and magnitude
are determined by a learning rate.
19. The filter of claim 18 wherein the filter training circuit
calculates a matrix ∇h_w relating each of the filter parameters
to an error produced by the parameter over all components
of an error sequence, where the error sequence measures a
difference between the output sequence and the desired
output sequence.
20. The filter of claim 18 wherein the conditioning of the
matrix inverse performed in calculating an update vector Δw
is determined by the learning rate.
21. The filter of claim 18 wherein the filter training circuit
selects training sequences to excite an entire domain of
operation of the nonlinear blocks, and to excite an entire
operational bandwidth of the linear blocks.
22. A nonlinear filter system comprising:
a) a nonlinear filter comprising a plurality of linear filter
blocks embedded in a plurality of nonlinear filter
blocks, wherein the filter comprises a memory for
storing filter parameters; and
b) a filter training circuit for updating the filter
parameters, wherein the circuit compares a desired
filter output sequence with an actual filter output
sequence resulting from a filter training sequence
propagated through the filter, and the circuit imple-
ments a filter parameter update with a direction closer
to a Newton update direction than that of a steepest
descent method;
wherein the training circuit selects a plurality of train-
ing sequences to excite an entire domain of operation
of the nonlinear blocks, and to excite an entire
operational bandwidth of the linear blocks.
23. The system of claim 22 wherein the training circuit
calculates a matrix ∇h_w relating each of the filter parameters
to an error produced by that parameter over all components
of an error sequence, where the error sequence measures a
difference between the output sequence and the desired
output sequence.
24. The system of claim 22 wherein updating the filter
parameter w is achieved with a modified Gauss-Newton
optimization technique for which an update direction and
magnitude are determined by a learning rate.
25. The system of claim 24 wherein a conditioning of the
matrix inverse performed in determining the update is
determined by the learning rate.