ANALYSIS AND SYNTHESIS OF AMPLITUDE MODULATION COMPONENTS IN PATHOLOGICAL VOICES

Brian Gabelman and Abeer Alwan
Department of Electrical Engineering, UCLA
Los Angeles 90095, USA

ABSTRACT

Previous work [1] addressed analysis and synthesis of the FM and aspiration noise components of pathological voices. The current study refines the previous analysis and adds amplitude modulation to model non-periodic components. The cepstral aspiration noise estimate is improved by removal of HFPV (high frequency pitch variation) from the original voice prior to estimating the noise. Amplitude and pitch pulse power tracking are performed on a pulse-by-pulse basis to accurately estimate the power time series, which is then segregated into low frequency power variation and high frequency power variation (shimmer). These AM effects are then also removed from the original voice and cepstral noise analysis is applied to estimate aspiration noise. Results show that incorporation of the AM effects improves the fidelity of the synthesized voice, and that AM has a minimal effect on the measured level of aspiration noise.

1.0 INTRODUCTION

The objective of this study is to further improve the accuracy of analysis of non-periodic components in pathological voices, thus improving the fidelity of synthesis of tokens used in subjective analysis-by-synthesis perceptual studies. This study continues the modeling of the non-periodic components of pathological voices in [1]. Previously, all non-periodic components were modeled with spectrally shaped aspiration noise, and high and low frequency FM modulation of pitch. The effects of amplitude modulation are added to the analysis model and voice synthesizer in this paper. Starting with established pitch tracking algorithms, enhancements are added to track the amplitude modulation (AM) features of the voice time series, such as power level.
The power level time series is high-pass and low-pass filtered to yield shimmer and volume time series respectively, which are then incorporated into the synthesizer, yielding another increment in the naturalness of synthesized pathological voices. In a manner analogous to the previous FM analysis [1], the original voice time series is AM demodulated using the power time series to yield a modified original voice that contains no power variation. This constant power time series is then reanalyzed (in various combinations with the FM demodulated versions of [1]) with cepstral notch filtering [2] to estimate the aspiration noise level. Results indicate that AM has a minimal effect on the aspiration noise estimate, and thus will not significantly lower the computer-set level of aspiration noise for the synthesizer, which has been found to be unreasonably high in analysis-by-synthesis comparisons. However, removal of HFPV prior to cepstral noise analysis does lower aspiration noise estimates significantly.

2.0 ANALYSIS

All steps of data analysis are summarized in Fig. 1. In the following sequential description of analysis, steps carried out in [1] are briefly summarized, and modifications for the current AM analysis are described in detail. Note that Figures 2-7 pertain to the same voice token.

2.1 Initial Steps

Thirty-one pathological voice samples were collected at the UCLA Medical Center and analyzed as described in [1]. These samples exhibit significant FM and/or AM. Using the source-filter model, voices are analyzed using the LPC autocorrelation and covariance methods, and then inverse filtered to obtain an estimate of the source time series [4]. The raw estimated source time series is fitted to a simplified LF model [3].

2.2 Pitch Tracking

Pitch tracking proceeds as described in [1] with time domain feature (typically minima points) tracking and sub-sample interpolation, allowing the generation of a quantization-free, high-resolution pitch time series.
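As an illustration of the sub-sample interpolation step, the following minimal sketch refines the position of a tracked minimum to a fractional sample index and converts successive feature instants into a pitch series. The paper does not state which interpolation rule was used; the three-point parabolic fit and the function names `refine_minimum` and `pitch_series` are assumptions made for this example.

```python
import numpy as np

def refine_minimum(x, i):
    """Return a fractional sample index for the local minimum near index i.

    Assumption: a parabola through x[i-1], x[i], x[i+1] locates the minimum.
    """
    y0, y1, y2 = float(x[i - 1]), float(x[i]), float(x[i + 1])
    denom = y0 - 2.0 * y1 + y2
    if denom == 0.0:              # flat neighbourhood: keep the integer index
        return float(i)
    return i + 0.5 * (y0 - y2) / denom

def pitch_series(x, feature_idx, fs):
    """Convert integer feature indices into a per-pulse pitch series in Hz."""
    t = np.array([refine_minimum(x, i) for i in feature_idx]) / fs
    return 1.0 / np.diff(t)       # one pitch value per pair of adjacent pulses
```

Because the refined instants are not restricted to the sampling grid, the resulting pitch values are not quantized to integer period lengths, which is the property the text relies on.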
In order to achieve best results, the pitch tracking algorithm is applied to four variations of the voice signal: the original, the inverse filtered source, and their smoothed derivatives. The pitch track resulting from successful tracking of one of the variations is selected to represent the voice. Tracking-lock is achieved when the algorithm selects features in the original voice time series that identify successive pitch pulses (rather than being desynchronized by spurious features); this is determined visually. Successful pitch tracking exhibits no loss of tracking-lock over the entire voice sample and clustered HFPV measurements in the expected range of 0 to 1%. The pitch time series is then high-pass and low-pass filtered into HFPV (high frequency pitch variation) and tremor time series, representing these two fairly distinct processes.

[Figure 1. Overall voice analysis/synthesis steps.]

[Figure 2. Power tracking on absolute value waveform. Pitch track features are used in locating pulse boundaries. Plot of the smoothed absolute value of the original voice with pitch track features and envelope minima marked; horizontal axis: sample number.]

[Figure 3. Metrics of amplitude modulation for a one second voice sample. Traces: maximum amplitude, sum of absolute values, energy, and power; horizontal axis: sample number.]

[Figure 4. Power time history resolved into low frequency (volume) and high frequency (shimmer) components. Upper panel: original power and its low-pass component versus time (sec); lower panel: SHIM% = 100*(POWER - LOWPASS POWER)/POWER versus time (sec).]

2.3 Removal of HFPV

In an effort to achieve maximum reduction of non-periodic components prior to aspiration noise estimation, the tremor removal approach of [1] is extended to include the elimination of all pitch period variation, including jitter or HFPV. The method used is the same as in [1]: re-sampling the original voice based on the measured pitch frequency. However, here, each pitch period of the original voice is individually re-sampled to force all periods to be the same length. By contrast, in [1], re-sampling was based on the low frequency tremor component, so only the longer-period FM variations were removed, and the HFPV remained. This zero-FM version is passed through the pitch tracking analysis again to verify success of FM demodulation.

2.4 Amplitude/shimmer AM Analysis

Analysis of the original pathological voice continues after pitch tracking and jitter analysis with amplitude tracking and shimmer analysis. The results of pitch tracking are used as a starting point for estimating the maximum amplitude, sum of absolute value of samples, energy, and power of each pitch pulse.

[Figure 5. A histogram of shimmer values displays a Gaussian form. Horizontal axis: delta power (%); vertical axis: number of occurrences; standard deviation = 4.823%.]

[Figure 7. Re-tracking of FM demodulated voice. Upper (tremor) and lower (HFPV) vary less than 0.2 Hz. Upper panel: original pitch trajectory and tremor, frequency (Hz) versus time (sec); lower panel: JIT% = 100*(PITCH - TREMOR)/TREMOR versus time (sec).]

2.4.1 Identifying Pitch Pulse Boundaries

The first step of amplitude analysis is segregation of the pitch pulses of the original time waveform. The starting and ending instants of each pulse are estimated using results obtained during the pitch tracking analysis. The pitch track feature instants determined in Section 2.2 are assumed to lie near the center of each pitch pulse; adjacent amplitude envelope minima to either side define the pulse boundaries. The intent is to separate pulses by placing pulse boundaries so that the maximum of voice power in each pulse occurs near the center of the pulse and the minimum of power occurs near the boundaries. This provides a natural separation of normal voice pulses, and a reasonable approximation for pathological voices.

Analysis proceeds via generation of the envelope of absolute value and power of the original voice. Using the pitch track feature instants, the corresponding envelope minima following each feature are tracked; the relation between feature instants and minima is constrained to be one to one, so the resulting power minima instants are phase-locked to the pitch tracking feature instants. The envelope minima thus determined form a natural boundary for pitch pulses. Fig. 2 displays the absolute value of a short segment of a voice time series showing typical pitch track features and pulse boundaries (envelope minima) selected by the algorithms.
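The boundary search described in Section 2.4.1 can be illustrated with a minimal sketch. Only the rule itself (one envelope minimum following each pitch track feature, in one-to-one correspondence) comes from the text; the moving-average smoother, its window length `win`, and the function names are assumptions made for this example and are not taken from the authors' software.

```python
import numpy as np

def smoothed_abs_envelope(x, win=9):
    """Moving-average envelope of |x| (smoother and window are assumptions)."""
    kernel = np.ones(win) / win
    return np.convolve(np.abs(x), kernel, mode="same")

def pulse_boundaries(env, feature_idx):
    """Pick one envelope minimum between each pair of successive features.

    Searching only between feature k and feature k+1 keeps the mapping
    one to one, so the boundaries stay locked to the pitch track.
    """
    bounds = []
    for a, b in zip(feature_idx[:-1], feature_idx[1:]):
        bounds.append(a + int(np.argmin(env[a:b])))
    return np.array(bounds)
```

Restricting each search to the interval between two adjacent features is one simple way to enforce the one-to-one constraint; a spurious dip elsewhere in the envelope cannot create an extra boundary.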
[Figure 6. Power analysis of AM demodulated voice. Constant power level verifies processing steps. Traces: power, energy, sum of absolute values, and maximum amplitude; horizontal axis: sample number.]

[Figure 8. NSR for six combinations of AM and FM demodulation. AM modulation appears to have minimal effect. Vertical axis: NSR (dB); horizontal axis: case number, sorted by ascending NSR. Curve groups: original and original + AM demod; tremor FM demod and tremor FM demod + AM demod; all FM demod and all FM demod + AM demod.]

2.4.2 Pulse Analysis

Having defined the pulses, several measures of pulse size are calculated: maximum amplitude, sum of absolute value of pulse samples, energy, and power. These are later used for AM and shimmer analysis and for demodulation of the original voice to improve the accuracy of the cepstral measure of the aspiration noise to periodic signal ratio (NSR). The maximum amplitude is the absolute value of the greatest extent (plus or minus) of the original voice samples within the pulse (between the pulse demarcations). The sum of the absolute value is the addition of the absolute value of all the samples within the pulse. The energy is the sum of the squares of all samples in a pulse. The power is the energy divided by the number of samples within the pulse. These measures usually track each other. Fig. 3 displays these pulse metrics for a 1-second voice sample.

2.4.3 AM Modulation/shimmer analysis

The original voice amplitude variations are now analyzed into a low frequency AM time series track and a high frequency shimmer measure. The power measure described in 2.4.2 is selected as the basis of this analysis, since it should be most closely related to the perceived signal level. A cutoff frequency is selected (usually 10 Hz), and a low-pass FIR filter is constructed and applied to the power time series. The resulting low-pass filtered signal defines the AM time series.
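As an illustration of the pulse measures of Section 2.4.2 and of the low-pass step just described, a minimal sketch follows. The four measures are implemented as defined in the text. The filter design is an assumption: the paper states only that a low-pass FIR filter with a cutoff of about 10 Hz is used, so the windowed-sinc design, the tap count `ntaps`, the edge padding, and the treatment of the per-pulse series as uniformly sampled at the mean pitch rate `rate_hz` are choices made for this example.

```python
import numpy as np

def pulse_metrics(x, bounds):
    """Per-pulse max amplitude, sum of |x|, energy, and power."""
    rows = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = np.asarray(x[a:b], dtype=float)
        energy = np.sum(seg ** 2)
        rows.append((np.max(np.abs(seg)), np.sum(np.abs(seg)),
                     energy, energy / len(seg)))
    return np.array(rows)   # columns: max_amp, sum_abs, energy, power

def lowpass_fir(series, rate_hz, cutoff_hz=10.0, ntaps=51):
    """Windowed-sinc low-pass of a per-pulse series (ntaps should be odd).

    Assumption: pulses are roughly evenly spaced at rate_hz pulses per second.
    """
    n = np.arange(ntaps) - (ntaps - 1) / 2.0
    fc = cutoff_hz / rate_hz                     # cutoff in cycles per pulse
    h = 2.0 * fc * np.sinc(2.0 * fc * n) * np.hamming(ntaps)
    h /= h.sum()                                 # unity gain at DC
    pad = ntaps // 2
    padded = np.pad(np.asarray(series, dtype=float), pad, mode="edge")
    return np.convolve(padded, h, mode="valid")  # same length as the input
```

For example, `lowpass_fir(pulse_metrics(x, bounds)[:, 3], rate_hz=266.0)` would return a smoothed version of the power column for a voice with a mean pitch near 266 Hz.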
The difference between the original power time series and the AM time series defines the shimmer time series; the standard deviation of the shimmer time series estimates the amount of shimmer present. Fig. 4 illustrates the high and low frequency power variations. Fig. 5 illustrates the histogram of the high frequency power variations.

2.4.4 Original voice AM demodulation

The AM variations may be removed from the original voice by dividing the original time series, sample by sample, by the square root of the measured power time series. Audio presentation of the AM demodulated waveform verifies that apparent changes in perceived volume level have been removed. Fig. 6 illustrates zero pulse power variation (top trace in Fig. 6) in the AM demodulated original voice. Note in Fig. 6 that the remaining traces, which display the other measures of AM (maximum level, average level, and pulse energy), are also fairly constant, but they are not exactly invariant.

2.5 NSR Measurement

Starting with the AM demodulated original voice, pitch variation is next removed by re-sampling, as described in [1]. The combination of both AM demodulation and re-sampling yields a voice that is free of both AM and FM effects. On audio presentation, the resulting time series sounds synthetic, as expected. When the cepstral NSR noise measurement is applied to this time series, the resulting measurement is free of AM and FM effects, and aspiration noise should be a major component of the remaining measured noise. Fig. 7 illustrates an example of successful removal of pitch variation.

A comparison of the effects of AM and FM component removal on aspiration noise estimation is made. Six versions of the original voice are created:

1. Unmodified original voice
2. Original voice with AM demodulation
3. Original voice with FM tremor demodulation
4. Original voice with FM tremor and AM demodulation
5. Original voice with complete FM demodulation
6. Original voice with complete FM and AM demodulation

The NSR value (here assumed to be entirely due to aspiration noise) is calculated for each version of the 31 pathological voices. The result is plotted in Fig. 8. Two effects clearly emerge:

1. The effect of AM demodulation on NSR analysis is very small (usually less than 1 dB).
2. The added decrease in estimated NSR when jitter is removed varies from less than 1 dB to 5 dB, with an average of about 2 dB.

3. RESYNTHESIS OF AM EFFECTS

The AM analysis results are applied in re-synthesis to improve fidelity and provide a basis for planned experiments that measure the perceptual significance of AM. The AM volume track is applied to the synthesizer source pulse calculation to cause the low frequency variations in the synthetic voice to match the original. High frequency variations (shimmer) are modeled using the measured shimmer value to generate Gaussian random variations in synthesizer pulse amplitude to match the original voice.

4. SUMMARY

Continuing the development of [1], AM modulation features have been added to the analysis/synthesis model and software algorithms. In the same manner as FM features, AM effects have been precisely tracked and verified in the original voice, and then decomposed into high and low rate phenomena. As with the FM component, the high frequency AM component (shimmer) displays a roughly Gaussian probability density, and is so modeled in the synthesizer. FM component removal is enhanced to include the option of removal of all pitch variation, on a pulse-to-pulse time scale. In the same manner as with the FM component, the AM component is removed from the original voice to observe the effect on the cepstral noise estimation; all combinations of tremor, HFPV, and AM removal from the original voice are tested. Results indicate that removal of the AM component has minimal effect on the estimate of aspiration noise level, while removal of the HFPV results in an average decrease in NSR of about 2 dB.
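The two demodulation steps summarized above, AM removal (Section 2.4.4) and pulse-by-pulse FM removal (Section 2.3), can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the measured power is held constant across each pulse, linear interpolation via `np.interp` stands in for whatever re-sampler was actually used, and the function names and the target length `period_len` are hypothetical.

```python
import numpy as np

def am_demodulate(x, bounds, power):
    """Divide each pulse by the square root of its measured power.

    Samples outside the first and last boundary are left unchanged.
    """
    y = np.array(x, dtype=float)
    for a, b, p in zip(bounds[:-1], bounds[1:], power):
        y[a:b] /= np.sqrt(p)
    return y

def fm_demodulate(x, bounds, period_len):
    """Re-sample every pulse to period_len samples so all periods are equal."""
    x = np.asarray(x, dtype=float)
    out = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        end = min(b + 1, len(x))          # include x[b] for interpolation
        src = np.arange(a, end)
        # period_len evenly spaced read positions covering [a, b)
        dst = a + np.arange(period_len) * (b - a) / period_len
        out.append(np.interp(dst, src, x[a:end]))
    return np.concatenate(out)
```

After `am_demodulate`, the mean square of every pulse equals one, which corresponds to the constant power trace reported for Fig. 6; after `fm_demodulate`, every pulse has the same length, which is the condition that re-tracking is meant to verify.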
Work was supported in part by NIH/NIDCD grant DC01797. We thank Drs. Bruce Gerratt and Jody Krieman for their help.

5. REFERENCES

1. Gabelman, Brian and Alwan, Abeer. “Analysis by synthesis of FM modulation and aspiration noise components in pathological voices,” ICASSP, Orlando, FL, 5/2002, 449-452.
2. Krom, Guus de, 1993. “A Cepstrum-Based Technique for Determining a Harmonics-to-Noise Ratio in Speech Signals,” JSHR 93, Vol. 36, 254-266.
3. Qi, Y., and Bi, N. 1994. “A simplified approximation of the four-parameter LF model of voice source,” JASA 96, 1182-1185.
4. The inverse filter program developed by Norma Antonanzas can be investigated online at the following web site: www.surgery.medsch.ucla.edu/glottalaffairs/software_of_the_boga.htm