ANALYSIS AND SYNTHESIS OF AMPLITUDE
MODULATION COMPONENTS IN PATHOLOGICAL
VOICES
Brian Gabelman and Abeer Alwan
Department of Electrical Engineering, UCLA
Los Angeles 90095, USA
ABSTRACT
Previous work [1] addressed analysis and synthesis of the FM
and aspiration noise components of pathological voices. The
current study refines the previous analysis and adds amplitude
modulation to model non-periodic components. The cepstral
aspiration noise estimate is improved by removal of HFPV (high
frequency pitch variation) from the original voice prior to
estimating the noise. Amplitude and pitch pulse power tracking
are performed on a pulse-by-pulse basis to accurately estimate
the power time series, which is then segregated into low
frequency power variation and high frequency power variation
(shimmer). These AM effects are then also removed from the
original voice and cepstral noise analysis is applied to estimate
aspiration noise. Results show that incorporation of the AM
effects improves the fidelity of the synthesized voice, and that
AM has a minimal effect on the measured level of aspiration
noise.
1.0 INTRODUCTION
The objective of this study is to further improve the accuracy of
analysis of nonperiodic components in pathological voices, thus
improving the fidelity of synthesis of tokens used in subjective
analysis-by-synthesis perceptual studies. This study continues
the modeling of the non-periodic components of pathological
voices in [1]. Previously, all non-periodic components were
modeled with spectrally shaped aspiration noise and with high and
low frequency modulation (FM) of pitch. In this paper, the effects
of amplitude modulation are added to the analysis model and voice
synthesizer. Starting with established pitch
tracking algorithms, enhancements are added to track the
amplitude modulation (AM) features of the voice time series,
such as power level. The power level time series is high and
low pass filtered to yield shimmer and volume time series
respectively, which are then incorporated into the synthesizer,
yielding another increment in the naturalness of synthesized
pathological voices.
In a manner analogous to the previous FM analysis [1], the
original voice time series is AM de-modulated using the power
time series to yield a modified original voice that contains no
power variation. This constant power time series is then
re-analyzed (in various combinations with the FM demodulated
versions of [1]) with cepstral notch filtering [2] to estimate the
aspiration noise level. Results indicate that AM has a minimal
effect on the aspiration noise estimate, and thus accounting for it
will not significantly lower the computer-set level of aspiration
noise for the synthesizer, which has been found to be unreasonably
high in analysis-by-synthesis comparisons. However, removal of
HFPV prior to cepstral noise analysis does lower aspiration noise
estimates significantly.
2.0 ANALYSIS
All steps of data analysis are summarized in Fig. 1. In the
following sequential description of analysis, steps carried out in
[1] are briefly summarized, and modifications for the current
AM analysis are described in detail. Note that Figures 2 – 7
pertain to the same voice token.
2.1 Initial Steps
Thirty-one pathological voice samples were collected at the
UCLA Medical Center and analyzed as described in [1]. These
samples exhibit significant FM and/or AM. Using the source-filter
model, voices are analyzed using the LPC autocorrelation
and covariance methods, and then inverse filtered to obtain an
estimate of the source time series [4]. The raw estimated source
time series is fitted to a simplified LF model [3].
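As a rough illustration of the autocorrelation branch of this step, the sketch below estimates an all-pole filter with the Levinson-Durbin recursion and applies the inverse filter A(z) to obtain a residual. It is a generic textbook formulation, not the inverse filter program of [4]; the function names, the choice of model order, and the omission of windowing, pre-emphasis, and the covariance method are all assumptions made here for brevity.

```python
def autocorrelation(signal, max_lag):
    """Biased autocorrelation r[0..max_lag] of a finite signal."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return a[0..order] with a[0] = 1, where A(z) = sum_k a[k] z^-k.

    Assumes r[0] > 0 (a non-silent signal).
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / error
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + reflection * prev[m - k]
        a[m] = reflection
        error *= 1.0 - reflection * reflection
    return a


def inverse_filter(signal, a):
    """Residual e[n] = sum_k a[k] * signal[n - k] (samples before n = 0 taken as zero)."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]


# Hypothetical check: a decaying exponential is a one-pole signal.
voice = [0.9 ** n for n in range(200)]
coeffs = levinson_durbin(autocorrelation(voice, 1), 1)
residual = inverse_filter(voice, coeffs)
```

For this one-pole test signal the estimated coefficient is close to -0.9 and the residual collapses to a single impulse, which is the behavior expected of an inverse filter on a signal that matches the model.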
2.2 Pitch Tracking
Pitch tracking proceeds as described in [1] with time domain
feature (typically minima points) tracking and sub-sample
interpolation, allowing the generation of a quantization-free,
high-resolution pitch time series. In order to achieve best
results, the pitch tracking algorithm is applied to four variations
of the voice signal: the original, the inverse filtered source, and
their smoothed derivatives. The pitch track resulting from
successful tracking of one of the variations is selected to
represent the voice. Tracking-lock is achieved when the algorithm,
as determined visually, selects features in the original voice time
series that identify successive pitch pulses (rather than being
desynchronized by spurious features). Successful pitch tracking
exhibits no loss of tracking-lock over the entire voice sample and
clustered HFPV
measurements in the expected range of 0 to 1%. The pitch time
series is then high/low pass filtered into HFPV (high frequency
pitch variation) and tremor time series, representing these two
fairly distinct processes.
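The sub-sample interpolation mentioned above can be sketched with a three-point parabolic fit around each tracked minimum. The interpolator actually used in [1] is not stated here, so the parabolic fit, the function names, and the sample rate in the example are assumptions for illustration only.

```python
def refine_minimum(signal, k):
    """Sub-sample position of a local minimum near integer index k.

    Fits a parabola through samples k-1, k, k+1 and returns its vertex.
    """
    y0, y1, y2 = signal[k - 1], signal[k], signal[k + 1]
    curvature = y0 - 2.0 * y1 + y2
    if curvature == 0.0:
        return float(k)  # flat neighborhood: keep the integer position
    return k + 0.5 * (y0 - y2) / curvature


def pitch_series(feature_instants, sample_rate_hz):
    """Pitch in Hz for each pair of successive (fractional) feature instants."""
    return [sample_rate_hz / (b - a)
            for a, b in zip(feature_instants, feature_instants[1:])]


# Hypothetical example: a parabola whose true minimum lies at sample 3.3.
samples = [(n - 3.3) ** 2 for n in range(7)]
refined = refine_minimum(samples, 3)
pitches = pitch_series([0.0, 50.0, 100.5], 10000.0)
```

Because the instants are no longer restricted to integer sample positions, the resulting pitch values are not quantized to the sample period.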
2.3 Removal of HFPV
Figure 1. Overall voice analysis/synthesis steps.
(Block diagram. Block labels include: signal; LPC formant
analysis / inverse filtering; pitch (FM) tracker; jitter estimate;
power (AM) tracker; shimmer estimate; manual ops; pitch, tremor,
power, and volume time histories; formants; resample to remove FM
(constant pitch voice); demodulate to remove AM (constant power
voice); raw flow derivative; least squares LF fit; fitted LF
source pulse; cepstral noise analysis; source noise spectrum; NSR
estimate; synthesizer.)
Figure 2. Power tracking on absolute value waveform.
Pitch track features are used in locating pulse boundaries.
(Smoothed absolute value of the original voice versus sample
number, with pitch track features and envelope minima marked.)
Figure 3. Metrics of amplitude modulation for
a one second voice sample.
(Traces: maximum amplitude, sum of absolute values, energy, and
power versus sample number.)
Figure 4. Power time history resolved into low frequency
(volume) and high frequency (shimmer) components.
(Upper panel: original power and its low-pass track versus time
in seconds. Lower panel: SHIM% = 100*(POWER - LOWPASS POWER)/POWER,
delta power percent versus time in seconds.)
In an effort to achieve maximum reduction of nonperiodic
components prior to aspiration noise estimation, the tremor
removal approach of [1] is extended to include the elimination
of all pitch period variation, including jitter or HFPV. The
method used is the same as in [1]: re-sampling the original voice
based on the measured pitch frequency. However, here, each
pitch period of the original voice is individually resampled to
force all periods to be the same length. By contrast, in [1],
re-sampling was based on the low frequency tremor component, so
only the longer period FM variations were removed, and the
HFPV remained. This zero-FM version is passed through the
pitch tracking analysis again to verify success of FM demodulation.
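A minimal sketch of the per-period re-sampling idea follows, assuming linear interpolation and pulse start instants that may be fractional. The interpolation kernel used in the study is not specified, so linear interpolation, the function names, and the target length are illustrative assumptions.

```python
def sample_at(signal, t):
    """Linearly interpolate signal at fractional sample position t (t >= 0)."""
    i = min(int(t), len(signal) - 2)
    frac = t - i
    return (1.0 - frac) * signal[i] + frac * signal[i + 1]


def equalize_periods(signal, pulse_starts, target_len):
    """Resample every pitch period to exactly target_len samples.

    Each period runs from one pulse start to the next, so all pitch
    variation (tremor and HFPV) is removed from the output.
    """
    out = []
    for start, end in zip(pulse_starts, pulse_starts[1:]):
        step = (end - start) / target_len
        out.extend(sample_at(signal, start + n * step) for n in range(target_len))
    return out


# Hypothetical example: periods of 8 and 12 samples both become 10 samples.
ramp = [float(n) for n in range(30)]
flattened = equalize_periods(ramp, [0, 8, 20], 10)
```

Re-tracking the output with the same pitch tracker should then report an essentially constant period, which is the verification step described above.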
2.4 Amplitude/shimmer AM Analysis
Analysis of the original pathological voice continues after pitch
tracking and jitter analysis with amplitude tracking and
shimmer analysis. The results of pitch tracking are used as a
starting point for estimating the maximum amplitude, sum of
absolute value of samples, energy, and power of each pitch
pulse.
2.4.1 Identifying Pitch Pulse Boundaries
The first step of amplitude analysis is segregation of the pitch
pulses of the original time waveform. The starting and ending
instants of each pulse are estimated using results obtained
during the pitch tracking analysis. The pitch track feature
instants determined in Section 2.2 are assumed to lie
near the center of each pitch pulse; adjacent amplitude envelope
minima to either side define the pulse boundaries. The intent is
to place pulse boundaries so that the
maximum of voice power in each pulse occurs near the center of
the pulse and the minima of power occur near the boundaries.
This provides a natural separation of normal voice pulses, and a
reasonable approximation for pathological voices.
Analysis proceeds via generation of the envelope of absolute
value and power of the original voice. Using the pitch track
feature instants, the corresponding envelope minima following
each feature are tracked; the relation of feature instants and
minima is constrained to be one to one, so the resulting power
minima instants are phase-locked to the pitch tracking feature
instants. The envelope minima thus determined form a natural
boundary for pitch pulses. Fig. 2 displays the absolute value of
a short segment of a voice time series showing typical pitch
track features and pulse boundaries (envelope minima) selected by
the algorithms.
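The one-to-one pairing of pitch track features and envelope minima can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the envelope is taken as a moving average of the absolute value, and each boundary is taken as the lowest envelope point between one feature and the next. The smoothing width and function names are assumptions.

```python
def smoothed_abs(signal, width=5):
    """Moving average of |signal|; the window shrinks at the edges."""
    half = width // 2
    envelope = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        envelope.append(sum(abs(x) for x in window) / len(window))
    return envelope


def pulse_boundaries(envelope, feature_instants):
    """Return exactly one envelope minimum after each feature instant.

    The search for each minimum stops at the next feature, which keeps
    the boundaries locked one to one to the pitch track features.
    """
    limits = list(feature_instants[1:]) + [len(envelope)]
    bounds = []
    for start, stop in zip(feature_instants, limits):
        segment = envelope[start:stop]
        bounds.append(start + segment.index(min(segment)))
    return bounds


# Hypothetical envelope with features at samples 1, 5, and 10.
env = [3, 5, 3, 1, 2, 5, 4, 2, 0.5, 3, 6, 4, 1, 2]
bounds = pulse_boundaries(env, [1, 5, 10])
```

Restricting each search to the interval before the next feature is what prevents a spurious dip from producing two boundaries inside one pulse.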
Figure 5. A histogram of shimmer values displays
a Gaussian form.
(Number of occurrences versus delta power in percent; the panel
title reports a shimmer standard deviation of 4.823%.)
Figure 6. Power analysis of AM demodulated voice.
Constant power level verifies processing steps.
(Traces: maximum amplitude, sum of absolute values, energy, and
power versus sample number.)
Figure 7. Re-tracking of FM demodulated voice. Upper
(tremor) and lower (HFPV) vary less than 0.2 Hz.
(Upper panel: pitch trajectory and tremor, frequency in Hz versus
time in seconds. Lower panel: JIT% = 100*(PITCH - TREMOR)/TREMOR,
delta frequency percent versus time in seconds.)
2.4.2 Pulse Analysis
Having defined the pulses, several measures of pulse size are
calculated: maximum amplitude, sum of absolute value of pulse
samples, energy, and power. These are later used for AM and
shimmer analysis and for de-modulation of the original voice to
improve the accuracy of the cepstral measure of the ratio of
aspiration noise to periodic signal (NSR).
The maximum amplitude is the largest absolute value
of the original voice samples within the
pulse (between the pulse demarcations). The sum of the
absolute value is the addition of the absolute value of all the
samples within the pulse. The energy is the sum of the squares
of all samples in a pulse. The power is the energy divided by
the number of samples within the pulse. These measures
usually track each other. Fig. 3 displays these pulse metrics for
a 1-second voice sample.
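The four pulse measures defined above translate directly into code. The sketch below follows those definitions; the function name and the half-open [start, end) indexing convention are assumptions.

```python
def pulse_metrics(signal, start, end):
    """Size measures for the pulse occupying samples start .. end-1."""
    pulse = signal[start:end]
    energy = sum(x * x for x in pulse)
    return {
        "max_amplitude": max(abs(x) for x in pulse),  # greatest extent, plus or minus
        "sum_abs": sum(abs(x) for x in pulse),        # sum of absolute values
        "energy": energy,                             # sum of squares
        "power": energy / len(pulse),                 # energy per sample
    }


# Hypothetical three-sample pulse inside a longer signal.
metrics = pulse_metrics([0.0, 1.0, -2.0, 1.0, 0.0], 1, 4)
```

Power is the only one of the four that is normalized by pulse length, so it is the least sensitive to pulse-to-pulse changes in period.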
2.4.3 AM Modulation/shimmer analysis
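Figure 8. NSR for six combinations of AM and FM
demodulation. AM modulation appears to have minimal effect.
(NSR in dB versus case number, sorted by ascending NSR. Curves are
grouped as original, tremor FM demodulated, and all FM
demodulated, each with and without AM demodulation.)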
The original voice amplitude variations are now analyzed into a
low frequency AM time series track and a high frequency
shimmer measure. The power measure described in 2.4.2 is
selected as the basis of this analysis, since it should be most
closely related to the perceived signal level. A cutoff frequency
is selected (usually 10 Hz), and a low pass FIR filter is
constructed and applied to the power time series. The resulting
low passed signal defines the AM time series. The difference
between the original power time series and the AM time series
defines the shimmer time series; the standard deviation of the
shimmer time series estimates the amount of shimmer present.
Fig. 4 illustrates the high and low frequency power variations.
Fig. 5 illustrates the histogram of the high frequency power
variations.
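The split into a low frequency AM track and a shimmer series can be sketched with a windowed-sinc FIR low-pass filter. Several details here are assumptions rather than the authors' choices: the Hamming window, the tap count, the edge handling, and treating the per-pulse power series as uniformly sampled at the mean pitch rate. Shimmer is expressed in percent using the formula shown in the Figure 4 panel title, SHIM% = 100*(POWER - LOWPASS POWER)/POWER, which requires non-zero power values.

```python
import math
import statistics


def lowpass_taps(cutoff_hz, rate_hz, num_taps=31):
    """Hamming-windowed sinc low-pass taps, normalized to unit DC gain."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / rate_hz  # cutoff in cycles per sample
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]


def apply_centered(series, taps):
    """Zero-delay FIR filtering; edges are padded by repeating the end values."""
    half = len(taps) // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(t * padded[i + k] for k, t in enumerate(taps))
            for i in range(len(series))]


def volume_and_shimmer(power, rate_hz, cutoff_hz=10.0):
    """Return (low-pass AM track, shimmer in percent, shimmer standard deviation)."""
    volume = apply_centered(power, lowpass_taps(cutoff_hz, rate_hz))
    shimmer_pct = [100.0 * (p - v) / p for p, v in zip(power, volume)]
    return volume, shimmer_pct, statistics.pstdev(shimmer_pct)


# Hypothetical power series: steady level with pulse-to-pulse alternation.
power = [2.0 + (0.2 if i % 2 else -0.2) for i in range(60)]
volume, shimmer_pct, shimmer_sd = volume_and_shimmer(power, rate_hz=200.0)
```

The centered filter is used so that the AM track stays time-aligned with the power series before the two are subtracted.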
2.4.4 Original voice AM demodulation
The AM variations may be removed from the original voice by
dividing the original time series, sample by sample, by the square
root of the measured power time series. Audio presentation of
the AM demodulated waveform verifies that apparent changes
in perceived volume level have been removed. Fig. 6 illustrates
that pulse power (top trace in Fig. 6) shows no variation in the AM
demodulated original voice. Note in Fig. 6 that the remaining
traces, which display the other measures of AM (maximum level,
average level, and pulse energy), are also fairly constant, but
they are not exactly invariant.
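A minimal sketch of this demodulation step is given below, assuming the measured power is held constant across each pulse (the text does not say whether the power series is held or interpolated between pulses). The function name and boundary convention are illustrative.

```python
import math


def am_demodulate(signal, boundaries):
    """Scale every pulse to unit power.

    Each sample is divided by the square root of the power of the pulse
    it belongs to; samples outside the listed pulses are copied unchanged.
    """
    out = list(signal)
    for start, end in zip(boundaries, boundaries[1:]):
        pulse = signal[start:end]
        power = sum(x * x for x in pulse) / len(pulse)
        if power > 0.0:  # leave silent pulses untouched
            gain = 1.0 / math.sqrt(power)
            out[start:end] = [x * gain for x in pulse]
    return out


# Hypothetical two-pulse signal whose second pulse is three times larger.
voice = [1.0, -1.0, 1.0, -1.0, 3.0, -3.0, 3.0, -3.0]
flat = am_demodulate(voice, [0, 4, 8])
```

Only pulse power is equalized exactly by this scaling, which is consistent with the other measures (maximum level, average level, energy) remaining fairly constant without being exactly invariant.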
2.5 NSR Measurement
Starting with the AM demodulated original voice, pitch
variation is next removed by re-sampling, as described in [1].
The combination of both AM demodulation and re-sampling
yields a voice that is free of both AM and FM effects. Audio
presentation of the resulting time series sounds synthetic, as
expected. When the cepstral NSR noise measurement is applied
to this time series, the resulting measurement is free of AM and
FM effects, and aspiration noise should be a major component
of the remaining measured noise. Fig. 7 illustrates an example
of successful removal of pitch variation.
A comparison of the effects of AM and FM component removal
on aspiration noise estimation is made. Six versions of the
original voice are created:
1. Unmodified original voice.
2. Original voice with AM demodulation
3. Original voice with FM tremor demodulation
4. Original voice with FM tremor and AM demodulation
5. Original voice with complete FM demodulation
6. Original voice with complete FM and AM demodulation
The NSR value (here assumed to be entirely due to aspiration
noise) is calculated for each version of each of the 31
pathological voices.
The result is plotted in Fig. 8. Two effects clearly emerge:
1. The effect of AM demodulation on NSR analysis is very
small (usually less than 1 dB).
2. The added decrease in estimated NSR when jitter is removed
varies from less than 1 dB to 5 dB, with an average of about
2 dB.
3.0 RESYNTHESIS OF AM EFFECTS
The AM analysis results are applied in re-synthesis to improve
fidelity and provide a basis for planned experiments that
measure the perceptual significance of AM. The AM volume
track is applied to the synthesizer source pulse calculation to
cause the low frequency variations in the synthetic voice to
match the original. High frequency variations (shimmer) are
modeled using the measured shimmer value to generate
Gaussian random variations in synthesizer pulse amplitude to
match the original voice.
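One way to picture this step is as a per-pulse gain that combines the measured volume track with a Gaussian shimmer term. The sketch below is an assumption about how the pieces might fit together, not a description of the synthesizer: it treats the shimmer value as a standard deviation in percent of power, perturbs the target power of each pulse, and converts power to an amplitude gain with a square root. The function name and the seed parameter are hypothetical.

```python
import math
import random


def pulse_gains(volume_power, shimmer_sd_pct, seed=0):
    """Amplitude gain for each synthesized pulse.

    volume_power: low frequency power track, one value per pulse.
    shimmer_sd_pct: standard deviation of shimmer, in percent of power.
    """
    rng = random.Random(seed)  # seeded so a token can be regenerated exactly
    gains = []
    for level in volume_power:
        target = level * (1.0 + rng.gauss(0.0, shimmer_sd_pct / 100.0))
        gains.append(math.sqrt(max(target, 0.0)))  # power to amplitude
    return gains


# Hypothetical volume track with the shimmer level reported in Fig. 5.
gains = pulse_gains([4.0, 4.2, 4.1, 3.9], shimmer_sd_pct=4.823, seed=1)
```

Setting the shimmer value to zero reduces the gains to the square root of the volume track, which gives a convenient condition for perceptual comparisons with and without shimmer.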
4.0 SUMMARY
Continuing the development of [1], amplitude modulation features
have been added to the analysis/synthesis model and software
algorithms. In the same manner as FM features, AM effects
have been precisely tracked and verified in the original voice,
and then decomposed into high and low rate phenomena. As
with the FM component, the high frequency AM component
(shimmer) displays roughly Gaussian probability density, and is
so modeled in the synthesizer. FM component removal is
enhanced to include the option of removal of all pitch variation,
on a pulse-to-pulse time scale. In the same manner as with the
FM component, the AM component is removed from the
original voice to observe the effect on the cepstral noise
estimation; all combinations of tremor, HFPV, and AM removal
from the original voice are tested. Results indicate that removal
of the AM component has minimal effect on the estimate of
aspiration noise level, while removal of the HFPV results in an
average decrease of about 2 dB in NSR.
Work was supported in part by NIH/NIDCD grant DC01797.
We thank Drs. Bruce Gerratt and Jody Kreiman for their help.
5.0 REFERENCES
1. Gabelman, B., and Alwan, A. 2002. “Analysis by synthesis
of FM modulation and aspiration noise components in
pathological voices,” ICASSP, Orlando, FL, 5/2002, 449-452.
2. de Krom, G. 1993. “A Cepstrum-Based Technique for
Determining a Harmonics-to-Noise Ratio in Speech Signals,”
JSHR 36, 254-266.
3. Qi, Y., and Bi, N. 1994. “A simplified approximation of the
four-parameter LF model of voice source,” JASA 96, 1182-1185.
4. The inverse filter program developed by Norma Antonanzas
can be investigated online at the following web site:
www.surgery.medsch.ucla.edu/glottalaffairs/software_of_the_boga.htm