The Opus Codec
To be presented at the 135th AES Convention
2013 October 17–20
New York, USA
This paper was accepted for publication at the 135th AES Convention. This version of the paper is from the authors and not from the AES.
Voice Coding with Opus
Koen Vos¹, Karsten Vandborg Sørensen¹, Søren Skak Jensen², and Jean-Marc Valin³
¹ Microsoft, Applications and Services Group, Audio DSP Team, Stockholm, Sweden
² GN Netcom A/S, Ballerup, Denmark
³ Mozilla Corporation, Mountain View, CA, USA
Correspondence should be addressed to Koen Vos (koenvos74@gmail.com)
ABSTRACT
In this paper, we describe the voice mode of the Opus speech and audio codec. As only the decoder is
standardized, the details in this paper will help anyone who wants to modify the encoder or gain a better
understanding of the codec. We go through the main components that constitute the voice part of the codec,
provide an overview, give insights, and discuss the design decisions made during the development. Tests have
shown that Opus quality is comparable to or better than several state-of-the-art voice codecs, while covering
a much broader application area than competing codecs.
1. INTRODUCTION
The Opus speech and audio codec [1] was standardized by the IETF as RFC 6716 in 2012 [2]. A companion paper [3] gives a high-level overview of the codec and explains its music mode. In this paper we discuss the voice part of Opus; when we refer to Opus we refer to Opus in the voice mode only, unless explicitly specified otherwise.
Opus is a highly flexible codec, and in the following
we outline the modes of operation. We only list what
is supported in voice mode.
• Supported sample rates are shown in Table 1.
• Target bitrates down to 6 kbps are supported. Recommended bitrates for different sample rates are shown in Table 2.
• The frame duration can be 10 or 20 ms, and for NB, MB, and WB there is also support for 40 and 60 ms, where 40 and 60 ms frames are concatenations of 20 ms frames with some of the coding of the concatenated frames being conditional.
• The complexity mode can be set from 0 to 10, with 10 being the most complex mode.
Opus has several control options specifically for voice
applications:
Sample Frequency   Name             Acronym
48 kHz             Fullband         FB
24 kHz             Super-wideband   SWB
16 kHz             Wideband         WB
12 kHz             Mediumband       MB
8 kHz              Narrowband       NB
Table 1: Supported sample frequencies.
Input Type   Recommended Bitrate Range
             Mono         Stereo
FB           28-40 kbps   48-72 kbps
SWB          20-28 kbps   36-48 kbps
WB           16-20 kbps   28-36 kbps
MB           12-16 kbps   20-28 kbps
NB           8-12 kbps    14-20 kbps
Table 2: Recommended bitrate ranges.
• Discontinuous Transmission (DTX). This reduces the packet rate when the input signal is classified as silent, letting the decoder's Packet-Loss Concealment (PLC) fill in comfort noise during the non-transmitted frames.
• Forward Error Correction (FEC). To aid packet-loss robustness, this adds a coarser description of a packet to the next packet. The decoder can use the coarser description if the earlier packet with the main description was lost, provided the jitter buffer latency is sufficient.
• Variable inter-frame dependency. This adjusts the dependency of the Long-Term Predictor (LTP) on previous packets by dynamically downscaling the LTP state at frame boundaries. More downscaling gives faster convergence to the ideal output after a lost packet, at the cost of lower coding efficiency.
The remainder of the paper is organized as follows: In Section 2 we introduce the coding models. Then, in Section 3, we go through the main functions in the encoder, and in Section 4 we briefly go through the decoder. We discuss listening results in Section 5 and finally provide conclusions in Section 6.
2. CODING MODELS
The Opus standard defines models based on the Modified Discrete Cosine Transform (MDCT) and on Linear-Predictive Coding (LPC). For voice signals, the LPC model is used for the lower part of the spectrum, with the MDCT coding taking over above 8 kHz. The LPC-based model is based on the SILK codec, see [4]. Only frequency bands between 8 and (up to) 20 kHz¹ are coded with MDCT. For details on the MDCT-based model, we refer to [3]. As is evident from Table 3, there are no frequency ranges for which both models are in use.
Sample      Frequency Range
Frequency   LPC        MDCT
48 kHz      0-8 kHz    8-20 kHz¹
24 kHz      0-8 kHz    8-12 kHz
16 kHz      0-8 kHz    -
12 kHz      0-6 kHz    -
8 kHz       0-4 kHz    -
Table 3: Model uses at different sample frequencies, for voice signals.
The advantage of using a hybrid of these two models is that for speech, linear prediction techniques, such as Code-Excited Linear Prediction (CELP), code low frequencies more efficiently than transform (e.g., MDCT) domain techniques, while for high speech frequencies this advantage diminishes and transform coding has better numerical and complexity characteristics. A codec that combines the two models can achieve better quality over a wider range of sample frequencies than either model alone.
3. ENCODER
The Opus encoder operates on frames of either 10 or 20 ms, which are divided into 5 ms subframes. The following paragraphs describe the main components of the encoder. We refer to Figure 1 for an overview of how the individual functions interact.
3.1. VAD
The Voice Activity Detector (VAD) generates a measure of speech activity by combining the signal-to-noise ratios (SNRs) from 4 separate frequency bands.
¹ Opus never codes audio above 20 kHz, as that is the upper limit of human hearing.
Fig. 1: Encoder block diagram.
In each band the background noise level is estimated
by smoothing the inverse energy over time frames.
Multiplying this smoothed inverse energy with the
subband energy gives the SNR.
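As a rough illustration, the noise tracking and SNR computation described above can be sketched as follows. This is plain NumPy; the function name, the smoothing constant, and the always-on update rule are simplifications for illustration, not the actual Opus code (which, for instance, must avoid adapting its noise estimate toward active speech):

```python
import numpy as np

def band_snrs(subband_energies, inv_noise_energy, alpha=0.1):
    """Hypothetical sketch of the per-band SNR estimate described above.

    subband_energies: energies of the current frame in each band.
    inv_noise_energy: running smoothed inverse energy per band (the
    background-noise estimate), updated in place by exponential smoothing.
    """
    subband_energies = np.asarray(subband_energies, dtype=float)
    # Smooth the inverse energy over time frames; the smoothed inverse
    # energy tracks the inverse of the background noise floor.
    inv_noise_energy += alpha * (1.0 / (subband_energies + 1e-12) - inv_noise_energy)
    # Multiplying the smoothed inverse energy with the subband energy
    # gives the per-band SNR.
    return subband_energies * inv_noise_energy
```

During speech, band energies rise quickly while the smoothed inverse energy reacts slowly, so the product (the SNR) goes up in the bands that carry speech.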
3.2. HP Filter
A high-pass (HP) filter with a variable cutoff frequency between 60 and 100 Hz removes low-frequency background and breathing noise. The cutoff frequency depends on the SNR in the lowest frequency band of the VAD, and on the smoothed pitch frequencies found in the pitch analysis, so that high-pitched voices will have a higher cutoff frequency.
3.3. Pitch Analysis
As shown in Figure 2, the pitch analysis begins by pre-whitening the input signal, with a filter of order between 6 and 16 depending on the complexity mode. The whitening makes the pitch analysis equally sensitive to all parts of the audio spectrum, thus reducing the influence of a strong individual harmonic. It also improves the accuracy of the correlation measure used later to classify the signal as voiced or unvoiced.
The whitened signal is then downsampled in two steps to 8 and 4 kHz, to reduce the complexity of computing correlations. A first analysis step finds peaks in the autocorrelation of the most downsampled signal to obtain a small number of coarse pitch lag candidates. These are input to a finer analysis step running at 8 kHz, searching only around the preliminary estimates. After applying a small bias towards shorter lags to avoid pitch doubling, the single candidate pitch lag with the highest correlation is found. The candidate's correlation value is compared to a threshold that depends on a weighted combination of:
• Signal type of the previous frame.
• Speech activity level.
• The slope of the SNR found in the VAD with
respect to frequency.
If the correlation is below the threshold, the signal is classified as unvoiced and the pitch analysis is aborted without returning a pitch lag estimate.
The final analysis step operates on the input sample frequency (8, 12, or 16 kHz), and searches for integer-sample pitch lags around the previous stage's estimate, limited to a range of 55.6 to 500 Hz. For each lag being evaluated, a set of pitch contours from a codebook is tested. These pitch contours define a deviation from the average pitch lag per 5 ms subframe, thus allowing the pitch to vary within a frame. Between 3 and 34 pitch contour vectors are available, depending on the sampling rate and frame size. The pitch lag and contour index resulting in the highest correlation value are encoded and transmitted to the decoder.
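The coarse-to-fine structure of the lag search can be illustrated with a toy version. This is an assumed sketch only: the real encoder uses proper decimation filters, the bias toward shorter lags, complexity-dependent settings, and the final contour search described above:

```python
import numpy as np

def coarse_to_fine_lag(x8, num_coarse=3, search=2):
    """Toy coarse-to-fine pitch lag search on a pre-whitened 8 kHz signal.

    The signal is decimated to 4 kHz, autocorrelation peaks give a few
    coarse lag candidates, and each candidate is refined by searching
    +/- `search` samples at the 8 kHz rate.
    """
    x4 = x8[::2]  # crude decimation to 4 kHz (no anti-alias filter here)
    # Autocorrelation at 4 kHz over a plausible pitch-lag range.
    lags4 = range(10, min(160, len(x4) // 2))
    corr4 = {l: float(np.dot(x4[l:], x4[:-l])) for l in lags4}
    coarse = sorted(corr4, key=corr4.get, reverse=True)[:num_coarse]
    # Refine each coarse candidate at 8 kHz.
    best_lag, best_c = None, -np.inf
    for lc in coarse:
        for l in range(2 * lc - search, 2 * lc + search + 1):
            if 0 < l < len(x8):
                c = float(np.dot(x8[l:], x8[:-l]))
                if c > best_c:
                    best_lag, best_c = l, c
    return best_lag
```

On a signal that repeats every 50 samples at 8 kHz, the 4 kHz stage proposes lag 25 (and its multiples), and the refinement stage recovers the 8 kHz lag of 50.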
Fig. 2: Block diagram of the pitch analysis.
3.3.1. Correlation Measure
Most correlation-based pitch estimators normalize the correlation with the geometric mean of the energies of the vectors being correlated:
C = \frac{x^T y}{\sqrt{(x^T x)(y^T y)}},   (1)
whereas Opus normalizes with the arithmetic mean:

C_{Opus} = \frac{x^T y}{\frac{1}{2}(x^T x + y^T y)}.   (2)
This correlation measures similarity not just in shape, but also in scale: two vectors with very different energies will have a lower correlation, similar to frequency-domain pitch estimators.
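The two normalizations are easy to compare numerically; the sketch below (function names are ours, not from the Opus sources) implements Eqs. (1) and (2) directly:

```python
import numpy as np

def corr_geometric(x, y):
    # Standard normalization with the geometric mean of the energies, Eq. (1).
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))

def corr_opus(x, y):
    # Opus normalization with the arithmetic mean of the energies, Eq. (2).
    return float(np.dot(x, y) / (0.5 * (np.dot(x, x) + np.dot(y, y))))
```

For y = 4x, corr_geometric still returns 1, whereas corr_opus returns 8/17 ≈ 0.47: the energy mismatch itself lowers the score.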
3.4. Prediction Analysis
As described in Section 3.3, the input signal is pre-whitened as part of the pitch analysis. The pre-whitened signal is passed to the prediction analysis in addition to the input signal. The signal at this point is classified as being either voiced or unvoiced. We describe these two cases in Sections 3.4.1 and 3.4.2.
3.4.1. Voiced Speech
The long-term prediction (LTP) of voiced signals is implemented with a fifth-order filter. The LTP coefficients are estimated from the pre-whitened input signal with the covariance method for every 5 ms subframe. The coefficients are quantized and used to filter the input signal (without pre-whitening) to find an LTP residual. This signal is input to the LPC analysis, where Burg's method [5] is used to find short-term prediction coefficients. Burg's method provides higher prediction gain than the autocorrelation method and, unlike the covariance method, it produces stable filter coefficients. The LPC order is N_LPC = 16 for FB, SWB, and WB, and N_LPC = 10 for MB and NB. A novel implementation of Burg's method reduces its complexity to near that of the autocorrelation method [6]. Also, the signal in each subframe is scaled by the inverse of the quantization step size in that subframe before applying Burg's method. This is done to find the coefficients that minimize the number of bits necessary to encode the residual signal of the frame, rather than minimizing the energy of the residual signal.
Computing LPC coefficients based on the LTP residual rather than on the input signal approximates a joint optimization of these two sets of coefficients [7]. This increases the prediction gain, thus reducing the bitrate. Moreover, because the LTP prediction is typically most effective at low frequencies, it reduces the dynamic range of the AR spectrum defined by the LPC coefficients. This helps with the numerical properties of the LPC analysis and filtering, and avoids the need for any pre-emphasis filtering found in other codecs.
3.4.2. Unvoiced Speech
For unvoiced signals, the pre-whitened signal is discarded and Burg's method is used directly on the input signal.
The LPC coefficients (for either voiced or unvoiced speech) are converted to Line Spectral Frequencies (LSFs), quantized, and used to re-calculate the LPC residual, taking into account the LSF quantization effects. Section 3.7 describes the LSF quantization.
3.5. Noise Shaping
Quantization noise shaping is used to exploit the properties of the human auditory system.
A typical state-of-the-art speech encoder determines the excitation signal by minimizing the perceptually-weighted reconstruction error. The decoder then uses a postfilter on the reconstructed signal to suppress spectral regions where the quantization noise is expected to be high relative to the signal. Opus combines these two functions in the encoder's quantizer by applying different weighting filters to the input and reconstructed signals in the noise shaping configuration of Figure 3. Integrating the two operations on the encoder side not only simplifies the decoder, it also lets the encoder use arbitrarily simple or sophisticated perceptual models to simultaneously and independently shape the quantization noise and boost/suppress spectral regions. Such different models can be used without spending bits on side information or changing the bitstream format. As an example of this, Opus uses warped noise shaping filters at higher complexity settings, as the frequency-dependent resolution of these filters better matches human hearing [8]. Separating the noise shaping from the linear prediction also lets us select prediction coefficients that minimize the bitrate without regard for perceptual considerations.
A diagram of the Noise Shaping Quantization (NSQ) is shown in Figure 3. Unlike typical noise shaping quantizers, where the noise shaping sits directly around the quantizer and feeds back to the input, in Opus the noise shaping compares the input and output speech signals and feeds the result to the input of the quantizer. This was first proposed in Figure 3 of [9]. More details of the NSQ module are described in Section 3.5.2.
3.5.1. Noise Shaping Analysis
The Noise Shaping Analysis (NSA) function finds gains and filter coefficients used by the NSQ to shape the signal spectrum with the following purposes:
• Spectral shaping of the quantization noise similarly to the speech spectrum to make it less audible.
• Suppressing the spectral valleys in between formant and harmonic peaks to make the signal less noisy and more predictable.
For each subframe, a quantization gain (or step size)
is chosen and sent to the decoder. This quantization
gain determines the tradeoff between quantization
noise and bitrate.
Furthermore, a compensation gain and a spectral tilt
are found to match the decoded speech level and tilt
to those of the input signal.
The filtering of the input signal is done using the filter

H(z) = G \cdot (1 - c_{tilt} \, z^{-1}) \cdot \frac{W_{ana}(z)}{W_{syn}(z)},   (3)
where G is the compensation gain, and c_tilt is the tilt coefficient in a first-order tilt adjustment filter.
The analysis filter for voiced speech is given by

W_{ana}(z) = \left(1 - \sum_{k=1}^{N_{LPC}} a_{ana}(k) \, z^{-k}\right)   (4)
\cdot \left(1 - z^{-L} \sum_{k=-2}^{2} b_{ana}(k) \, z^{-k}\right),   (5)
and similarly for the synthesis filter W_syn(z). N_LPC is the LPC order and L is the pitch lag in samples. For unvoiced speech, the last term (5) is omitted to disable harmonic noise shaping.
The short-term noise shaping coefficients a_ana(k) and a_syn(k) are calculated from the LPC coefficients of the input signal a(k) by applying different amounts of bandwidth expansion, i.e.,

a_{ana}(k) = a(k) \cdot g_{ana}^{k}, and   (6)
a_{syn}(k) = a(k) \cdot g_{syn}^{k}.   (7)
The bandwidth expansion moves the roots of the
LPC polynomial towards the origin, and thereby
flattens the spectral envelope described by a(k).
The bandwidth expansion factors are given by

g_{ana} = 0.95 - 0.01 \cdot C, and   (8)
g_{syn} = 0.95 + 0.01 \cdot C,   (9)
Fig. 3: Predictive Noise Shaping Quantizer.
where C ∈ [0, 1] is a coding quality control parameter. By applying more bandwidth expansion to the analysis part than to the synthesis part, we de-emphasize the spectral valleys.
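Equations (6)-(9) amount to scaling each coefficient by a power of g, which multiplies every root of the LPC polynomial by g and thus pulls the roots toward the origin for g < 1. A minimal sketch (our helper name, not Opus code):

```python
import numpy as np

def bandwidth_expand(a, g):
    """Apply bandwidth expansion a(k) -> a(k) * g**k, as in Eqs. (6)-(7).

    a: LPC coefficients a(1)..a(N) of the prediction filter
       A(z) = 1 - sum_k a(k) z^-k. Scaling a(k) by g**k multiplies every
       root of A(z) by g, flattening the spectral envelope for g < 1.
    """
    a = np.asarray(a, dtype=float)
    return a * g ** np.arange(1, len(a) + 1)

C = 0.8                  # coding quality control parameter in [0, 1]
g_ana = 0.95 - 0.01 * C  # Eq. (8)
g_syn = 0.95 + 0.01 * C  # Eq. (9)
```

Because g_ana < g_syn, the analysis filter has the flatter (more expanded) envelope, which is what de-emphasizes the spectral valleys.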
The harmonic noise shaping applied to voiced frames has three filter taps:

b_{ana} = F_{ana} \cdot [0.25, 0.5, 0.25], and   (10)
b_{syn} = F_{syn} \cdot [0.25, 0.5, 0.25],   (11)

where the multipliers F_{ana}, F_{syn} ∈ [0, 1] are calculated from:
• The coding quality control parameter. This makes the decoded signal more harmonic, and thus easier to encode, at low bitrates.
• Pitch correlation. Highly periodic input signals are given more harmonic noise shaping to avoid audible noise between harmonics.
• The estimated input SNR below 1 kHz. This filters out background noise for a noisy input signal by applying more harmonic emphasis.
Similar to the short-term shaping, having F_ana < F_syn emphasizes pitch harmonics and suppresses the signal in between the harmonics.
The tilt coefficient c_tilt is calculated as

c_{tilt} = 0.25 + 0.2625 \cdot V,   (12)

where V ∈ [0, 1] is a voice activity level which, in this context, is forced to 0 for unvoiced speech.
Finally, the compensation gain G is calculated as the ratio of the prediction gains of the short-term prediction filters a_ana and a_syn.
An example of short-term noise shaping of a speech spectrum is shown in Figure 4. The weighted input and quantization noise combine to produce an output with a spectral envelope similar to that of the input signal.
3.5.2. Noise Shaping Quantization
The NSQ module quantizes the residual signal and thereby generates the excitation signal.
Fig. 4: Example of how the noise shaping operates on a speech spectrum. The frame is classified as unvoiced for illustrative purposes, showing only short-term noise shaping.
A simplified block diagram of the NSQ is shown in Figure 5. In this figure, P(z) is the predictor containing both the LPC and LTP filters. F_ana(z) and F_syn(z) are the analysis and synthesis noise shaping filters; for voiced speech they each consist of both long-term and short-term filters. The quantized excitation indices are denoted i(n). The LTP coefficients, gains, and noise shaping coefficients are updated for every subframe, whereas the LPC coefficients are updated every frame.
Fig. 5: Noise Shaping Quantization block diagram.
Substituting the quantizer Q with the addition of a quantization noise signal q(n), the output of the NSQ is given by:

Y(z) = G \cdot \frac{1 - F_{ana}(z)}{1 - F_{syn}(z)} \cdot X(z) + \frac{1}{1 - F_{syn}(z)} \cdot Q(z).   (13)
The first part of the equation shapes the input signal, and the second part shapes the quantization noise.
3.5.3. Trellis Quantizer
The quantizer Q in the NSQ block diagram is a trellis quantizer, implemented as a uniform scalar quantizer with a variable offset. This offset depends on the output of a pseudorandom generator, implemented with linear congruential recursions on previous quantization decisions within the same frame [12]. Since the quantization error for each residual sample now depends on previous quantization decisions, both because of the trellis nature of the quantizer and through the shaping and prediction filters, improved R-D performance is achieved by implementing a Viterbi delayed-decision mechanism [13]. The number of different Viterbi states to track, N ∈ [2, 4], and the number of samples delay, D ∈ [16, 32], are functions of the complexity setting. At the lowest complexity levels each sample is simply coded independently.
3.6. Pulse Coding
The integer-valued excitation signal, which is the output of the NSQ, is entropy coded in blocks of 16 samples. First the signal is split into its absolute values, called pulses, and signs. Then the total sum of pulses per block is coded. Next we repeatedly split each block in two equal parts, each time encoding the allocation of pulses to each half, until sub-blocks either have length one or contain zero pulses. Finally the signs for non-zero samples are encoded separately. The range coding tables for the splits are optimized over a large training database.
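The recursive split can be sketched as follows. This sketch produces and consumes the split symbols only; the actual entropy coding of each symbol with trained range-coding tables is omitted, and the function names are ours:

```python
def split_encode(pulses, out):
    """Recursively emit, in preorder, the number of pulses allocated to
    the left half of each split, stopping at length-1 or all-zero blocks."""
    if len(pulses) == 1 or sum(pulses) == 0:
        return
    half = len(pulses) // 2
    out.append(sum(pulses[:half]))  # pulses allocated to the left half
    split_encode(pulses[:half], out)
    split_encode(pulses[half:], out)

def split_decode(n, total, symbols):
    """Invert split_encode given the block length, the total pulse count
    for the block, and the symbol list (consumed front-to-back)."""
    if n == 1:
        return [total]
    if total == 0:
        return [0] * n
    left = symbols.pop(0)
    half = n // 2
    return (split_decode(half, left, symbols)
            + split_decode(n - half, total - left, symbols))
```

Because the decoder always knows each block's length and total pulse count (the total is coded first), the allocation symbols fully determine the pulse positions.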
3.7. LSF Quantization
The LSF quantizer consists of a VQ stage with 32 codebook vectors followed by a scalar quantization stage with inter-LSF prediction. All quantization indices are entropy coded, and the entropy coding tables selected for the second stage depend on the quantization index from the first. Consequently, the LSF quantizer uses a variable bitrate, which lowers the average R-D error and reduces the impact of outliers.
3.7.1. Tree Search
As proposed in [14], the error signals from the N best quantization candidates from the first stage are all used as input for the next stage. After the second stage, the best combined path is chosen. By
varying the number N, we get a means of adjusting the trade-off between a low rate-distortion (R-D) error and a high computational complexity. The same principle is used in the NSQ; see Section 3.5.3.
3.7.2. Error Sensitivity
Whereas input vectors to the first stage are unweighted, the residual input to the second stage is scaled by the square roots of the Inverse Harmonic Mean Weights (IHMWs) proposed by Laroia et al. in [10]. The IHMWs are calculated from the coarsely-quantized reconstruction found in the first stage, so that encoder and decoder can use the exact same weights. The application of the weights partially normalizes the error sensitivity for the second-stage input vector, and it enables a uniform quantizer with fixed step size to be used without too much loss in quality.
3.7.3. Scalar Quantization
The second stage uses predictive delayed-decision scalar quantization. The predictor multiplies the previous quantized residual value by a prediction coefficient that depends on the vector index from the first stage codebook as well as the index of the current scalar in the residual vector. The predicted value is subtracted from the second stage input value before quantization and is added back afterwards. This creates a dependency of the current decision on the previous quantization decision, which again is exploited in a Viterbi-like delayed-decision algorithm to choose the sequence of quantization indices yielding the lowest R-D cost.
3.7.4. GMM Interpretation
The LSF quantizer has similarities with a Gaussian mixture model (GMM) based quantizer [15], where the first stage encodes the mean and the second stage uses the Cholesky decomposition of a tridiagonal approximation of the correlation matrix. What is different is the scaling of the residual vector by the IHMWs, and the fact that the quantized residuals are entropy coded with an entropy table that is trained rather than Gaussian.
3.8. Adaptive Inter-Frame Dependency
The presence of long-term prediction, or an adaptive codebook, is known to cause challenges when packet losses occur. The problem with LTP prediction is due to the impulse response of the filter, which can be much longer than the packet itself.
An often-used technique is to reduce the LTP coefficients, see e.g. [11], which effectively shortens the impulse response of the LTP filter.
We have solved the problem in a different way: in Opus the LTP filter state is downscaled at the beginning of a packet while the LTP coefficients are kept unchanged. Downscaling the LTP state reduces the LTP prediction gain only in the first pitch period of the packet, and therefore extra bits are only needed for encoding the higher residual energy during that first pitch period. Because of Jensen's inequality, it is better to spend the bits up front and be done with it. The scaling factor is quantized to one of three values and is thus transmitted with very few bits.
Compared to scaling the LTP coefficients, downscaling the LTP state gives a more efficient trade-off between the increased bit rate caused by lower LTP prediction gain and the encoder/decoder resynchronization speed, as illustrated in Figure 6.
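A toy synthesis loop illustrates the idea (hypothetical helper, not the Opus implementation): with the state scaled to zero, two decoders that disagree about the previous packet's output produce identical output from the first sample on, at the cost of losing the cross-packet LTP prediction gain in the first pitch period.

```python
import numpy as np

def synthesize_packet(excitation, ltp_coefs, lag, state, state_scale=1.0):
    """Toy 5-tap long-term synthesis with LTP state downscaling.

    At the start of a packet, the LTP state (the tail of the previous
    packet's output) is multiplied by state_scale <= 1, while the LTP
    coefficients are left unchanged. Requires lag > 2 and
    len(state) >= lag + 2.
    """
    buf = list(np.asarray(state, dtype=float) * state_scale)
    out = []
    for e in excitation:
        # 5-tap long-term prediction centered on the pitch lag.
        pred = sum(c * buf[-lag + k] for k, c in enumerate(ltp_coefs, -2))
        y = e + pred
        buf.append(y)
        out.append(y)
    return out
```

After one pitch period, the prediction reads samples generated within the current packet, so the influence of the (scaled) cross-packet state dies out quickly.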
3.9. Entropy Coding
The quantized parameters and the excitation signal are all entropy coded using range coding, see [17].
3.10. Stereo Prediction
In stereo mode, Opus uses predictive stereo encoding [16], where it first encodes a mid channel as the average of the left and right speech signals. Next it computes the side channel as the difference between left and right, and both mid and side channels are split into low- and high-frequency bands. Each side channel band is then predicted from the corresponding mid band using a scalar predictor. The prediction-residual bands are combined to form the side residual signal S, which is coded independently from the mid channel M. The full approach is illustrated in Figure 7. The decoder goes through these same steps in reverse order.
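The mid/side construction can be sketched as follows, where a single full-band predictor w stands in for the per-band scalar predictors described above (function names and the single-band simplification are ours):

```python
import numpy as np

def stereo_encode(left, right, w):
    """Encode left/right as a mid channel plus a predicted-side residual."""
    left, right = np.asarray(left, dtype=float), np.asarray(right, dtype=float)
    mid = 0.5 * (left + right)      # average of left and right
    side = left - right             # difference between left and right
    residual = side - w * mid       # predict side from mid, code the residual
    return mid, residual

def stereo_decode(mid, residual, w):
    """Invert stereo_encode: rebuild side, then left and right."""
    side = residual + w * mid
    left = mid + 0.5 * side
    right = mid - 0.5 * side
    return left, right
```

When the side channel is well predicted by the mid channel, the residual has low energy and is cheap to code; the decoder applies the same steps in reverse.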
4. DECODING
The predictive filtering consists of LTP and LPC. As shown in Figure 8, it is implemented in the decoder through the steps of parameter decoding, constructing the excitation, followed by long-term and short-term synthesis filtering. It has been a central design criterion to keep the decoder as simple as possible and to keep its computational complexity low.
Fig. 7: Stereo prediction block diagram.
Fig. 8: Decoder side linear prediction block diagram.
5. LISTENING RESULTS
Subjective listening tests by Google [18] and Nokia [19] show that Opus outperforms most existing
speech codecs at all but the lowest bitrates.
In [18], MUSHRA-type tests were used, and the following conclusions were made for WB and FB:
• Opus at 32 kbps is better than G.719 at 32 kbps.
• Opus at 20 kbps is better than Speex and
G.722.1 at 24 kbps.
• Opus at 11 kbps is better than Speex at 11 kbps.
In [19], it is stated that:
• Hybrid mode provides excellent voice quality at
bitrates from 20 to 40 kbit/s.
6. CONCLUSION
In this paper we have described the voice mode in Opus. The paper is intended to complement the paper about the music mode [3], for a complete description of the codec. The format of this paper makes it easier to approach than the more comprehensive RFC 6716 [2].
7. REFERENCES
[1] Opus Interactive Audio Codec, http://www.opus-codec.org/.
[2] J.-M. Valin, K. Vos, and T. B. Terriberry, "Definition of the Opus Audio Codec", RFC 6716, http://www.ietf.org/rfc/rfc6716.txt, Amsterdam, The Netherlands, September 2012.
[3] J.-M. Valin, G. Maxwell, T. B. Terriberry, and K. Vos, "High-Quality, Low-Delay Music Coding in the Opus Codec", accepted at the AES 135th Convention, 2013.
[4] K. Vos, S. Jensen, and K. Sørensen, "SILK Speech Codec", IETF Internet-Draft, http://tools.ietf.org/html/draft-vos-silk-02.
[5] J. Burg, "Maximum Entropy Spectral Analysis", Proceedings of the 37th Annual International SEG Meeting, Vol. 6, 1975.
[6] K. Vos, ”A Fast Implementation of Burg’s
Method”, www.arxiv.org, 2013.
[7] P. Kabal and R. P. Ramachandran, "Joint Solutions for Formant and Pitch Predictors in Speech Processing", Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (New York, NY), pp. 315-318, April 1988.
Fig. 6: Illustration of convergence speed after a packet loss, measured as the SNR of the zero-state LTP filter response. The traditional solution means standard LTP. Constrained is the method in [11], where the LTP prediction gain is constrained, which adds 1/4 bit per sample. Reduced ACB is the Opus method. The experiment is made with a pitch lag of 1/4 packet length, meaning that the Opus method can add 1 bit per sample in the first pitch period in order to balance the extra rate for constrained LTP. The unconstrained LTP prediction gain is set to 12 dB, and high-rate quantization theory is assumed (1 bit/sample ↔ 6 dB SNR). After 5 packets the Opus method outperforms the alternative methods by > 2 dB and the standard by 4 dB.
[8] H. W. Strube, "Linear Prediction on a Warped Frequency Scale", Journal of the Acoustical Society of America, vol. 68, no. 4, pp. 1071-1076, Oct. 1980.
[9] B. Atal and M. Schroeder, "Predictive Coding of Speech Signals and Subjective Error Criteria", IEEE Trans. on Acoustics, Speech, and Signal Processing, pp. 247-254, July 1979.
[10] R. Laroia, N. Phamdo, and N. Farvardin, "Robust and Efficient Quantization of Speech LSP Parameters Using Structured Vector Quantization", Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP-1991), pp. 641-644, October 1991.
[11] M. Chibani, P. Gournay, and R. Lefebvre, "Increasing the Robustness of CELP-Based Coders by Constrained Optimization", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, March 2005.
[12] J. B. Anderson, T. Eriksson, and M. Novak, "Trellis Source Codes Based on Linear Congruential Recursions", Proc. IEEE International Symposium on Information Theory, 2003.
[13] E. Ayanoglu and R. M. Gray, "The Design of Predictive Trellis Waveform Coders Using the Generalized Lloyd Algorithm", IEEE Trans. on Communications, Vol. 34, pp. 1073-1080, November 1986.
[14] J. B. Bodie, "Multi-path Tree-Encoding for Analog Data Sources", Commun. Res. Lab., Fac. Eng., McMaster Univ., Hamilton, Ont., Canada, CRL Int. Rep., Series CRL-20, 1974.
[15] P. Hedelin and J. Skoglund, "Vector Quantization Based on Gaussian Mixture Models", IEEE Trans. Speech and Audio Processing, vol. 8, no. 4, pp. 385-401, Jul. 2000.
[16] H. Krüger and P. Vary, "A New Approach for Low-Delay Joint-Stereo Coding", ITG-Fachtagung Sprachkommunikation, VDE Verlag GmbH, Oct. 2008.
[17] G. N. N. Martin, "Range Encoding: An Algorithm for Removing Redundancy from a Digitized Message", Video & Data Recording Conference, Southampton, UK, July 24-27, 1979.
[18] J. Skoglund, "Listening Tests of Opus at Google", IETF, 2011.
[19] A. Rämö and H. Toukomaa, "Voice Quality Characterization of IETF Opus Codec", Interspeech, 2011.