Jitter and Shimmer Measurements for Speaker Recognition
Mireia Farrús, Javier Hernando, Pascual Ejarque
TALP Research Center, Department of Signal Theory and Communications
Universitat Politècnica de Catalunya, Barcelona, Spain
{mfarrus,javier,pascual}@gps.tsc.upc.edu
Abstract
Jitter and shimmer are measures of the cycle-to-cycle variations
of fundamental frequency and amplitude, respectively, which
have been widely used for the description of pathological voice
quality. Since they characterise some aspects of particular
voices, differences in the values of jitter and shimmer among
speakers can be expected a priori. In this paper,
several types of jitter and shimmer measurements are
analysed. Experiments performed with the Switchboard-I
conversational speech database show that jitter and shimmer
measurements improve speaker verification when used as
features complementary to spectral and prosodic parameters,
reducing the EER of the fused system from 7.7% to 6.8%.
Index Terms: speaker recognition, jitter, shimmer, prosody,
voice spectrum, fusion
1. Introduction
State-of-the-art speaker recognition systems tend to use only
short-term spectral features as voice information. Spectral
parameters take into account some aspects of the acoustic level
of the signal, like spectral magnitudes, formant frequencies, etc.,
and they are highly related to the physical traits of the speaker.
However, humans tend to use several linguistic levels like
lexicon, prosody or phonetics to recognise others by voice.
These levels of information are more related to learned habits or
style, and they are mainly manifested in the dialect, sociolect or
idiolect of the speaker.
Since these linguistic levels play an important role in the
human recognition process, considerable effort has been devoted to
adding this kind of information to automatic speaker recognition
systems. Doddington [1] showed that idiolectal information provided a good
recognition performance given a sufficient amount of data, and
more recent works [2-4] have demonstrated that prosody helps
to improve voice spectrum based recognition systems, supplying
complementary information not captured in the traditional
acoustic systems. Moreover, some of these parameters have the
advantage of being more robust to some common problems like
noise, transmission channel, speech level or distance between
the speaker and the microphone than spectral features.
There are probably many more characteristics that may
provide complementary information and could be of great
value for speaker recognition. This work focuses on the use of
jitter and shimmer for a speaker verification system. Jitter and
shimmer are acoustic characteristics of voice signals, and they
are quantified as the cycle-to-cycle variations of fundamental
frequency and waveform amplitude, respectively. Both features
have been widely used to detect voice pathologies (see, e.g. [5,
6]). They are commonly measured for long sustained vowels,
and values of jitter and shimmer above a certain threshold are
considered to be related to pathological voices, which are
usually perceived by humans as breathy, rough or hoarse voices.
In [7] it was reported that significant differences can occur in
jitter and shimmer measurements between different speaking
styles, especially in shimmer measurement. Nevertheless,
prosody is also highly dependent on the emotion of the speaker,
and prosodic features are useful in automatic recognition
systems even when no emotional state is distinguished.
The aim of this work is to improve a prosodic and voice
spectral verification system by introducing new features based
on jitter and shimmer measurements. The experiments have
been done over the Switchboard-I conversational speech
database. Fusion of different features has been performed at the
score level using z-score normalization and the matcher
weighting fusion method.
This paper is organised as follows. In the next section, an
overview of the features used in this work is presented,
including a description of jitter and shimmer measurements. The
experimental setup and verification experiments are shown in
section 3. Finally, conclusions of the experiments are given in
section 4.
2. Voice features
Cepstral coefficients are the usual way of representing the short-
time spectral envelope of a speech frame in current speaker
recognition systems. These parameters are the most prevalent
representations of the speech signal and contain a high degree of
speaker specificity. However, cepstral coefficients have some
disadvantages that are overcome by using Frequency Filtering
(FF) parameters. These parameters have been used in our
experiments since they give comparable or better results than
mel-cepstrum coefficients in most of the experiments that have
been done [8, 9].
Prosodic parameters are known as suprasegmental
parameters since the segments affected (syllables, words and
phrases) are larger than phonetic units. These features are
mainly manifested as sound duration, tone and intensity
variation. The prosodic recognition baseline system used in this
work consists of nine prosodic features already used in [2,
3]: three features related to word and segmental durations and
six features related to fundamental frequency, all of them
averaged over all words with voiced frames.
The novel component in this paper is the analysis of jitter
and shimmer features in order to test their usefulness in speaker
verification. These features have been extracted by using the
Praat voice analysis software [10]. Praat reports different kinds
of measurements for both jitter and shimmer features, which are
listed below.
2.1. Jitter measurements
• Jitter (absolute) is the cycle-to-cycle variation of
fundamental frequency, i.e. the average absolute
difference between consecutive periods, expressed as:

\[ \text{Jitter (absolute)} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left| T_i - T_{i+1} \right| \qquad (1) \]

where $T_i$ are the extracted F0 period lengths and $N$ is the
number of extracted F0 periods.
• Jitter (relative) is the average absolute difference between
consecutive periods, divided by the average period. It is
expressed as a percentage:

\[ \text{Jitter (relative)} = \frac{\frac{1}{N-1} \sum_{i=1}^{N-1} \left| T_i - T_{i+1} \right|}{\frac{1}{N} \sum_{i=1}^{N} T_i} \qquad (2) \]
• Jitter (rap) is defined as the Relative Average
Perturbation, the average absolute difference between a
period and the average of it and its two neighbours,
divided by the average period.
• Jitter (ppq5) is the five-point Period Perturbation
Quotient, computed as the average absolute difference
between a period and the average of it and its four closest
neighbours, divided by the average period.
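The four jitter definitions above can be illustrated with a minimal Python sketch. It assumes the period lengths have already been extracted into a plain list; the function names are illustrative, and Praat's own implementation applies further checks on the extracted periods that are omitted here. Values are returned as fractions, so multiplying by 100 gives the percentage form.

```python
from statistics import mean


def jitter_absolute(periods):
    """Mean absolute difference between consecutive periods (Eq. 1)."""
    return mean(abs(a - b) for a, b in zip(periods, periods[1:]))


def jitter_relative(periods):
    """Absolute jitter divided by the mean period (Eq. 2), as a fraction."""
    return jitter_absolute(periods) / mean(periods)


def perturbation_quotient(values, points):
    """Mean absolute deviation of each value from the average of the
    `points`-wide window centred on it, divided by the overall mean.
    Needs at least `points` values."""
    half = points // 2
    deviations = [
        abs(values[i] - mean(values[i - half:i + half + 1]))
        for i in range(half, len(values) - half)
    ]
    return mean(deviations) / mean(values)


def jitter_rap(periods):
    """Relative Average Perturbation: three-point window."""
    return perturbation_quotient(periods, 3)


def jitter_ppq5(periods):
    """Five-point Period Perturbation Quotient."""
    return perturbation_quotient(periods, 5)
```

For example, `jitter_absolute([0.010, 0.012, 0.010, 0.012])` averages three differences of 0.002 s each.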
2.2. Shimmer measurements
• Shimmer (dB) is expressed as the variability of the
peak-to-peak amplitude in decibels, i.e. the average absolute
difference between the base-10 logarithms of the
amplitudes of consecutive periods, multiplied by 20:

\[ \text{Shimmer (dB)} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left| 20 \log_{10} \left( \frac{A_{i+1}}{A_i} \right) \right| \qquad (3) \]

where $A_i$ are the extracted peak-to-peak amplitude data
and $N$ is the number of extracted fundamental frequency
periods.
• Shimmer (relative) is defined as the average absolute
difference between the amplitudes of consecutive periods,
divided by the average amplitude, expressed as a
percentage:

\[ \text{Shimmer (relative)} = \frac{\frac{1}{N-1} \sum_{i=1}^{N-1} \left| A_i - A_{i+1} \right|}{\frac{1}{N} \sum_{i=1}^{N} A_i} \qquad (4) \]
• Shimmer (apq3) is the three-point Amplitude Perturbation
Quotient, the average absolute difference between the
amplitude of a period and the average of the amplitudes of
its neighbours, divided by the average amplitude.
• Shimmer (apq5) is defined as the five-point Amplitude
Perturbation Quotient, the average absolute difference
between the amplitude of a period and the average of the
amplitudes of it and its four closest neighbours, divided
by the average amplitude.
• Shimmer (apq11) is expressed as the 11-point Amplitude
Perturbation Quotient, the average absolute difference
between the amplitude of a period and the average of the
amplitudes of it and its ten closest neighbours, divided by
the average amplitude.
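As a rough illustration of the shimmer definitions above, the following Python sketch computes them from a list of peak-to-peak amplitudes. The function names are illustrative only. One assumption is made explicit: for apq3 the averaging window is taken to include the period itself, in the same way as for apq5 and apq11. Values are fractions (dB for the first function), not percentages.

```python
import math
from statistics import mean


def shimmer_db(amps):
    """Mean absolute 20*log10 ratio of consecutive amplitudes (Eq. 3)."""
    return mean(abs(20 * math.log10(b / a)) for a, b in zip(amps, amps[1:]))


def shimmer_relative(amps):
    """Mean absolute consecutive difference over mean amplitude (Eq. 4)."""
    return mean(abs(a - b) for a, b in zip(amps, amps[1:])) / mean(amps)


def shimmer_apq(amps, points):
    """Amplitude Perturbation Quotient over a `points`-wide centred window
    (points = 3, 5 or 11). The window is assumed to include the period
    itself. Needs at least `points` amplitudes."""
    half = points // 2
    deviations = [
        abs(amps[i] - mean(amps[i - half:i + half + 1]))
        for i in range(half, len(amps) - half)
    ]
    return mean(deviations) / mean(amps)
```

A call such as `shimmer_apq(amps, 11)` then gives the apq11 value for a sequence of at least eleven amplitudes.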
3. Recognition experiments
3.1. Experimental setup
All the recognition experiments described in this paper have
been performed with the Switchboard-I database [11], a
collection of 2430 two-sided telephone conversations among
543 speakers from all areas of the United States.
In the prosody based recognition system, a nine-feature
vector (already used in [2]) was obtained for each conversation
side: three features related to word and segmental durations
(number of frames per word and length of word-internal voiced
and unvoiced segments) and six features related to fundamental
frequency (mean, maximum, minimum, range, pseudo-slope
and slope). Another feature vector was extracted for the
acoustic system based on the nine jitter and shimmer
measurements described in section 2.
Features were extracted using the Praat software for acoustic
analysis [10], performing an acoustic periodicity detection based
on a cross-correlation method, with a window length of 40/3 ms
and a shift of 10/3 ms. The mean and standard deviation over all
words were computed for each individual feature. The system
was tested using the k-Nearest Neighbour classifier (with k=3),
comparing the distance of the test feature vector to the k closest
vectors of the claimed speaker vs. the distance of the test vector
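to the k closest vectors of the cohort speakers. The symmetrised
Kullback-Leibler divergence, expressed as

\[ d_{KL} = \frac{1}{2} \left[ \frac{\sigma_1^2}{\sigma_2^2} + \frac{\sigma_2^2}{\sigma_1^2} + (\mu_1 - \mu_2)^2 \left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right) \right] - 1 \qquad (5) \]

where µ is the mean and σ the standard deviation of each of the
two distributions being compared, was used as a
distance measure.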
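The distance and the k-nearest-neighbour comparison can be sketched as follows. This is only an illustration under stated assumptions: each feature is summarised by a (mean, standard deviation) pair and treated as a univariate Gaussian, the divergence is the sum of the two directed Kullback-Leibler divergences, and the two k-NN distances are combined by a simple difference. The function names and the way the score is formed are hypothetical, not taken from the system described here.

```python
from statistics import mean


def sym_kl(mu1, sigma1, mu2, sigma2):
    """KL(p1||p2) + KL(p2||p1) for two univariate Gaussians."""
    v1, v2 = sigma1 ** 2, sigma2 ** 2
    return 0.5 * (v1 / v2 + v2 / v1
                  + (mu1 - mu2) ** 2 * (1 / v1 + 1 / v2)) - 1.0


def knn_distance(test, references, k=3):
    """Mean distance from `test` to its k closest reference vectors.
    `test` and each reference are (mean, std) pairs."""
    distances = sorted(sym_kl(*test, *ref) for ref in references)
    return mean(distances[:k])


def verification_score(test, claimed_refs, cohort_refs, k=3):
    """Assumed score: positive when the test vector lies closer to the
    claimed speaker's vectors than to the cohort's."""
    return knn_distance(test, cohort_refs, k) - knn_distance(test, claimed_refs, k)
```

A claim would then be accepted when `verification_score(...)` exceeds a threshold chosen on held-out data.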
The spectrum based recognition system was a 32-component
GMM-UBM system using short-term feature vectors consisting
of 20 Frequency Filtering parameters [8] with a frame size of 30
ms and a shift of 10 ms. The 20 corresponding delta and acceleration
coefficients were included, and the UBM was trained with 116
conversation sides.
All the systems used 8 conversation sides to train the
speaker models. Training was performed using splits 1-3 of the
Switchboard-I database. The three held-out splits provided the
cohort speakers in the prosodic and jitter-shimmer based systems.
The systems were tested with one conversation side according
to the NIST 2001 Extended Data task [12]. Fusion of
individual features was performed at the score level for splits 1-3,
using the matcher weighting method [13] preceded by z-score
normalization. Weights were trained from splits 4-6
using splits 1-3 as cohort speakers.
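The score-level fusion step can be illustrated with a short Python sketch. It assumes the common formulation of matcher weighting, in which each matcher's weight is inversely proportional to its EER on development data and the weights sum to one. The function names are illustrative, and the statistics used for z-score normalization are assumed to come from a separate set of development scores.

```python
from statistics import mean, pstdev


def z_norm(scores, dev_scores):
    """Z-score normalization using the mean and standard deviation
    of development scores."""
    mu, sd = mean(dev_scores), pstdev(dev_scores)
    return [(s - mu) / sd for s in scores]


def matcher_weights(eers):
    """Weights inversely proportional to each matcher's EER, summing to 1."""
    inverse = [1.0 / e for e in eers]
    total = sum(inverse)
    return [x / total for x in inverse]


def fuse(score_lists, weights):
    """Weighted sum of normalized scores, one fused score per trial.
    `score_lists` holds one list of trial scores per matcher."""
    return [
        sum(w * s for w, s in zip(weights, trial))
        for trial in zip(*score_lists)
    ]
```

With two matchers whose development EERs are, say, 10% and 20%, `matcher_weights([10.0, 20.0])` gives the better matcher twice the weight of the other.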
3.2. Verification results
First of all, the prosodic system used as baseline is presented.
Table 1 shows the EER obtained for each individual prosodic
feature and the resulting fusion of the prosodic set.
Table 1. EER for prosodic features (isolated and fused).

Feature                                         EER (%)
log(#frames/word)                               31.5
length of word-internal voiced segments         30.0
length of word-internal unvoiced segments       30.0
log(mean F0)                                    20.3
log(max F0)                                     20.9
log(min F0)                                     22.3
log(range F0)                                   26.6
pseudo-slope: (last F0 - first F0)/(#frames)    38.3
F0 slope                                        29.9
Fusion                                          15.8
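Since all results are reported as equal error rates, a minimal sketch of how an EER can be estimated from verification scores may be helpful. It assumes that higher scores indicate the target speaker and sweeps the decision threshold over the observed scores; the function name is illustrative, and evaluation toolkits typically interpolate along the DET curve instead.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER (in %): the operating point where the miss rate
    and the false alarm rate are closest to each other."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # A trial is accepted when its score is >= threshold.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return 100 * eer
```

With target scores `[0.9, 0.8, 0.7, 0.4]` and impostor scores `[0.6, 0.3, 0.2, 0.1]`, one of four trials is wrong on each side at a threshold of 0.6.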
The same experiments were performed for the jitter and
shimmer measurements described in section 2. Tables 2 and 3
show the EER results for jitter and shimmer features,
respectively. Both tables give the EER for the individual
measurements and for the fusion of the whole set of measurements.
Table 2. EER for jitter measurements.

Jitter measurement    EER (%)
Jitter (absolute)     26.9
Jitter (relative)     33.7
Jitter (rap)          34.2
Jitter (ppq5)         33.8
Fusion                29.2
Table 3. EER for shimmer measurements.

Shimmer measurement    EER (%)
Shimmer (dB)           26.9
Shimmer (relative)     28.9
Shimmer (apq3)         28.1
Shimmer (apq5)         32.9
Shimmer (apq11)        33.8
Fusion                 25.5
The results show that at least the absolute measurements of
jitter and shimmer are potentially useful in speaker recognition.
In the case of jitter, the relative measurements do not seem to
supply helpful information, since the fusion of all jitter
measurements (29.2%) does not outperform the result obtained with the
isolated absolute measurement (26.9%). To verify this
assumption, the absolute measurement of jitter was fused with
the best-performing relative measurement, Jitter (relative).
The combination of both measurements gave an EER of
29.3%, so this fusion does not improve on
the absolute jitter measurement either.
In the case of shimmer measurements, their fusion (25.5%)
slightly improves on the best isolated result, Shimmer (dB) (26.9%). Since
all relative measurements of the same feature are highly
correlated, we will only use the relative measurement of
shimmer giving the best EER, Shimmer (apq3). To check
that this measurement provides some complementary
information to Shimmer (dB), both measurements were
combined. The fusion gave an EER of 26.3%,
slightly improving on the isolated Shimmer (dB) measurement.
From now on, only three cycle-to-cycle variability
measurements will be used as new features: Jitter (absolute),
Shimmer (dB) and Shimmer (apq3), and we will refer to this set
of three measurements as the JitShim system. The EER of the
combination of these measurements equals 22.5%.
In order to see how jitter and shimmer are able to improve
the prosodic and the voice spectral based recognition systems,
the new features are added to both systems separately. First of
all, the nine prosodic features used in our baseline system are
combined with the three features of our novel JitShim system,
resulting in a new twelve-feature system. Secondly, the JitShim
system is added to our voice spectral baseline system. This
allows us to compare how complementary jitter and shimmer are to
prosodic and spectral features, respectively. Finally, the JitShim
system is combined with both baselines, in order to see how the
new features improve our speaker verification system. The
results of these experiments are shown in Table 4. The EERs
before the introduction of the JitShim system are given in the
middle column of the table, and the results after adding jitter and
shimmer features are shown in the right column.
Table 4. EER (%) for prosodic and spectral systems
before and after adding jitter and shimmer features.

Baseline system    without JitShim    with JitShim
Prosodic           15.8               13.1
Spectral           10.1               8.6
Fusion             7.7                6.8
The results and the DET curves plotted in Fig. 1 show that
both prosodic and spectral baselines are clearly improved when
jitter and shimmer features are added to the systems. The largest
relative improvement is achieved by adding JitShim to the
prosody based system (17%). Fusing JitShim with the
spectral system gives a smaller improvement (15%).
This suggests that the information provided by jitter and
shimmer is more complementary to the prosodic parameters than
to the spectral system.
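The relative improvements quoted above follow directly from the EER values in Table 4, as the following worked calculation shows. The last expression applies the same calculation to the fused prosodic and spectral system.

```latex
\[
\frac{15.8 - 13.1}{15.8} \approx 0.171 \;(17\%), \qquad
\frac{10.1 - 8.6}{10.1} \approx 0.149 \;(15\%), \qquad
\frac{7.7 - 6.8}{7.7} \approx 0.117 \;(12\%)
\]
```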
Our preliminary speaker verification system based on
prosodic and spectral parameters is also improved by adding the
JitShim system, as can be seen in the DET curves plotted in
Fig. 2, achieving the lowest EER of 6.8%. Jitter and
shimmer features therefore seem to be useful in speaker recognition and
should be taken into account in future experiments.
[Figure 1: DET plot of miss probability (in %) against false alarm probability (in %), with curves for the prosodic, prosodic + JitShim, spectral and spectral + JitShim systems.]
Figure 1. DET curves for prosodic and spectral systems
before and after adding jitter and shimmer features.
[Figure 2: DET plot of miss probability (in %) against false alarm probability (in %), with curves for the prosodic + spectral and prosodic + spectral + JitShim systems.]
Figure 2. DET plot showing the improvement of the
baseline system after adding jitter and shimmer.
4. Conclusions
In this work, a preliminary speaker verification system based on
prosodic and spectral parameters is improved by adding jitter
and shimmer features, which analyse the perturbation of
fundamental frequency and waveform amplitude, respectively.
In these experiments, the absolute measurements of both
features seem to be more discriminant than their relative
measurements. Furthermore, the results show that jitter and
shimmer can provide complementary information to both
spectral and prosodic systems, especially to the prosodic one.
5. Acknowledgements
The authors would like to thank Jan Anguita for his contribution
with the voice spectrum based system and Michael Wagner for
his valuable comments.
6. References
[1] G. Doddington, "Speaker recognition based on idiolectal
differences between speakers," presented at Eurospeech,
2001.
[2] M. Farrús, A. Garde, P. Ejarque, J. Luque, and J.
Hernando, "On the Fusion of Prosody, Voice Spectrum and
Face Features for Multimodal Person Verification,"
presented at ICSLP, Pittsburgh, 2006.
[3] B. Peskin, J. Navratil, J. Abramson, D. Jones, D. Klusacek,
D. A. Reynolds, and B. Xiang, "Using prosodic and
conversational features for high-performance speaker
recognition: Report from JHU WS'02," presented at
ICASSP, 2003.
[4] D. A. Reynolds, W. Andrews, J. Campbell, J. Navratil, B.
Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R.
Mihaescu, J. Godfrey, D. Jones, and B. Xiang, "The
SuperSID project: exploiting high-level information for
high-accuracy speaker recognition," presented at ICASSP,
2003.
[5] J. Kreiman and B. R. Gerratt, "Perception of aperiodicity in
pathological voice," Journal of the Acoustical Society of America, vol.
117, pp. 2201-2211, 2005.
[6] D. Michaelis, M. Fröhlich, H. W. Strube, E. Kruse, B.
Story, and I. R. Titze, "Some simulations concerning jitter
and shimmer measurement," presented at 3rd International
Workshop on Advances in Quantitative Laryngoscopy,
Aachen, Germany, 1998.
[7] R. E. Slyh, W. T. Nelson, and E. G. Hansen, "Analysis of
mrate, shimmer, jitter, and F0 contour features across stress
and speaking style in the SUSAS database," presented at
ICASSP, 1999.
[8] C. Nadeu, J. Hernando, and M. Gorricho, "On the
decorrelation of filter bank energies in speech recognition,"
presented at Eurospeech, 1995.
[9] A. Abad, C. Nadeu, J. Hernando, and J. Padrell, "Jacobian
Adaptation based on the Frequency-Filtered Spectral
Energies," presented at Eurospeech, Geneva, Switzerland,
2003.
[10] Praat software website: http://www.fon.hum.uva.nl/praat/.
[11] J. J. Godfrey, E. C. Holliman, and J. McDaniel,
"Switchboard: Telephone speech corpus for research and
development," presented at ICASSP, 1990.
[12] NIST 2001 Speaker Recognition Evaluation website:
http://www.nist.gov/speech/tests/spk/2001/index.htm.
[13] M. Indovina, U. Uludag, R. Snelick, A. Mink, and A. Jain,
"Multimodal Biometric Authentication Methods: A COTS
Approach," presented at MMUA, Workshop on
Multimodal User Authentication, Santa Barbara, CA, 2003.