AVAL: Audio-Visual Active Locator
ECE-492/3 Senior Design Project
Spring 2014
Electrical and Computer Engineering Department
Volgenau School of Engineering
George Mason University
Fairfax, VA
Team members:
Rony Alaghbar, Kelly Byrnes and Jacob Cohen
Faculty Supervisor:
Dr. Kathleen E. Wage
Abstract:
Videoconferencing is a growing trend among businesses, as
these systems reduce the need for travel, in turn cutting
costs and carbon footprints while increasing productivity.
The industry has shown steady growth. However, the main
drawbacks are the large start-up cost of equipment and
installation, the need for professional IT setup and
maintenance, and an overall lack of system mobility. The
proposed design alleviates the cost
while enhancing functionality of existing systems. This
project aims to design a compact, platform-independent,
and affordable tabletop device for a terminal of a
videoconferencing system. The Audio-Visual Active Locator
(AVAL) takes input audio and uses a phased microphone
array to locate the speaker's position. This information is
used to position a camera toward the calculated location.
This functionality aims to enhance current conferencing
techniques, while embracing seamless integration into
existing systems.
1. Introduction
Videoconferencing is a widely used business practice for meetings and education. The General Services
Administration has cited that increased videoconferencing would reduce both travel expenses and pollutant emissions [2].
A single terminal of a room-based videoconferencing system requires audio and video input/output, processing,
and synchronization. Current systems typically utilize monitors as the hub for the videoconferencing terminal.
These monitors are either bulky portable systems or mounted within the room. The coupling of the terminal with
the large monitor increases the overall cost, and significantly hinders the mobility of the device. Not only do group
video systems require mounting or cumbersome relocation, but they also require personnel training and IT
support. Smaller, more transportable terminals are typically desktop-based and geared towards individual users. In
order to produce a product capable of mitigating these concerns, AVAL utilizes two four-element microphone
arrays centered on the table to steer a servo-controlled camera towards the current speaker. Source localization
signal processing algorithms are used to find the person speaking.
2. Approach
A microphone array was designed specifically for speech signals between 400 Hz and 1 kHz. The array design
assumes point-to-point propagation. Equation 1 gives the maximum length of the array for which point-to-point
propagation can be assumed, where R represents the distance from the sensor, D represents the array length, and λ
represents the wavelength:

R > 2D²/λ    (1)

This equation shows the minimum wavelength as the limiting factor, being about 34 cm for 1 kHz. Assuming the
distance to the array is near 120 cm, the maximum array length to assume planar propagation is 45 cm.
Experimentation resulted in an array length of 51 cm as the best fit for finding the angle of incidence.

The aperture of the array, 51 cm, can be viewed as a rectangular spatial filter (Figure 1). The Fourier transform of
the spatial filter, also seen in Figure 1, is plotted with radians along the x-axis and gain along the y-axis. The length
of the array is represented by L in the figures, while λ/L dictates the beam-width of the array in radians. With a
lower limit frequency of 400 Hz, corresponding to a wavelength of 85 cm, and an array length of 51 cm, the
resulting beam-width is a maximum of roughly π/2 radians. Discrete microphone points are placed along the entire
length of the aperture. This discretization effectively turns the array into a spatial and temporal sampling problem.

Figure 1: Spatial Rectangular Filter and Corresponding Fourier Transform

There are many types of beamforming algorithms used today, but they all amount to filtering the signal in space and
time. Beamforming a microphone array towards a given angle of incidence is analogous to filtering out every
signal coming from alternate angles. Delay and Sum beamforming (Figure 2) is susceptible to errors in time
sampling: to synchronize accurately between the microphones, a high sampling frequency is needed, along with
high resolution in the applied delays, to avoid destructive interference. This method also has no weightings
associated with each sensor, so the signal from the furthest sensor is treated as just as powerful as that from the
closest; weighting the sensors would mitigate the added noise picked up over the longer propagation path to the
furthest sensor. Filter and Sum beamforming adds the delays to each sensor using phase shifts associated with each
filter. Since all the computations are still done within the time domain, it remains susceptible to inaccuracies in the
phase delays.

Figure 2: Steering a Phased Array (www.labbookpages.co.uk) and Filter and Sum Beamforming (Eneman, Koen)

AVAL implements source localization using frequency-domain analysis and beamforming. Working in the frequency
domain allows for analysis of broadband signals and improves the accuracy of the calculation. Spatiotemporal
filtering (filtering in space and time) creates a Fourier transform of the beamformed output. The output Fourier
transform is determined by weighting the individual transforms, multiplying each by a complex exponential (a
phase shift), and finally summing these manipulated transforms. In order to localize on the speaker, AVAL
implements a "fixed angle" approach. This mitigates the need for dynamic filtering, which would require changing
the filters based on an input. AVAL's fixed angle method utilizes multiple pre-determined filters.
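The design numbers above can be checked directly. The short sketch below (assuming the usual c = 343 m/s speed of sound, which the text does not state explicitly) reproduces the 34 cm minimum wavelength, the 45 cm far-field aperture limit from Equation 1, and the roughly π/2 beam-width at the 400 Hz lower limit:

```python
import math

# Numeric check of the array-design limits, using Equation 1
# (far field: R > 2*D**2 / wavelength) and the beam-width
# relation (beam-width ~ wavelength / L, in radians).
c = 343.0                # assumed speed of sound, m/s

# Shortest design wavelength (1 kHz upper frequency) -> ~34 cm.
lam_min = c / 1000.0

# Largest aperture still satisfying the far-field assumption at
# R = 120 cm, from rearranging Equation 1: D < sqrt(R * lam / 2).
R = 1.20
D_max = math.sqrt(R * lam_min / 2.0)
print(f"lam_min = {lam_min*100:.0f} cm, far-field aperture limit = {D_max*100:.0f} cm")

# Worst-case beam-width at the 400 Hz lower limit for the
# experimentally chosen 51 cm aperture: lam/L, on the order of pi/2.
lam_max = c / 400.0
L = 0.51
beamwidth = lam_max / L
print(f"beam-width at 400 Hz: {beamwidth:.2f} rad (pi/2 = {math.pi/2:.2f})")
```

Running this recovers the paper's figures: about a 45 cm aperture limit at 120 cm range, and a beam-width near π/2 radians at 400 Hz.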
3. System Design
AVAL operates in a standard office environment, is easy to implement, and is affordable. The system
autonomously tracks who is speaking with a camera, using microphone arrays, and interfaces with teleconferencing
software. To ensure ease of use, including installation and operation, AVAL implements a table-top design that
takes video and speech audio signals and outputs a cohesive audio-video stream of the current speaker.

AVAL consists of three main components and a power system, as seen in Figure 3: the processing unit, including
the PC and Data Acquisition System (DAQ); the camera positioning unit, with a servo controller and webcam; and
the microphone array, consisting of eight electret condenser microphones. While the microphone array takes the
audio input and the camera positioning system takes the video, the processing unit acts as a mediator between the
two. It determines the location of the speaker from the microphone input and forwards the information to the
camera positioning system. The processing unit also consolidates and outputs the singular audio-video feed.

Figure 3: System Design

The processing unit introduces the main functionality of AVAL. It requires data acquisition from the microphones,
which is sent through an algorithm to produce a location for the speaker source. Then the location is sent to the
camera positioning system. These processes are implemented on a standard PC. A PC was chosen over a
microcontroller, as common microcontrollers are not powerful enough to handle the required processing with the
latency required to operate in real time. One camera is utilized to track the conversation in the room using the
location data from the PC. The microphones are connected directly to the pre-amplification circuit. The outputs
from the eight channels of the pre-amplification circuit are connected directly to the DAQ. All other components are
connected via USB to the PC.

In order to properly mount various subcomponents, 3D printing was employed to create hardware specifically
designed to work with AVAL's various subcomponents. The AVAL camera and servo mounts were designed in
OpenSCAD and 3D printed on a Makerbot Replicator 2X. The CAD renders can be seen in Figure 4.

Figure 4: 3D prints of camera mount and servo mount
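The mediator role described above can be sketched as a simple acquire-localize-steer loop. Everything below is illustrative scaffolding, not AVAL's actual interfaces; the class and function names are placeholders:

```python
# Illustrative sketch of the processing unit's mediator role:
# microphone samples in, speaker angle out to the camera positioner.
# All names here are hypothetical stand-ins, not AVAL's real code.

class FakeDAQ:
    """Stands in for the 8-channel DAQ; returns one block of samples."""
    def read_block(self):
        return [[0.0] * 256 for _ in range(8)]   # 8 channels x 256 samples

class FakeServoController:
    """Stands in for the servo controller driving the camera mount."""
    def __init__(self):
        self.angle = 90.0                        # start at broadside
    def set_angle(self, angle_deg):
        self.angle = angle_deg

def locate_speaker(block):
    """Placeholder for the localization algorithm; a real version
    would beamform the 8 channels and pick the strongest angle."""
    return 90.0                                  # default to broadside

def process_once(daq, servo):
    block = daq.read_block()                     # 1. acquire audio
    angle = locate_speaker(block)                # 2. localize the source
    servo.set_angle(angle)                       # 3. steer the camera
    return angle
```

A real implementation would replace the fake classes with drivers for the USB-connected DAQ and servo controller.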
4. Microphone Array
The most important module of the AVAL is the microphone array. Different geometries of microphone arrays, such
as linear, annular, and planar, have limits on the range of localization. In addition to designing the geometry of
the microphone array, the spacing between each microphone was considered to limit aliasing of the audio signals.
The array spacing and dimensions were designed by considering spatial sampling, angular resolution of the array,
and sound propagation to the array. Since sound signals are broadband signals, the array was designed for
frequencies from 400 Hz to 1 kHz. According to the Nyquist Sampling Theorem, in order to sample effectively, the
spacing between two sensors needs to be less than half the minimum wavelength of the signal. This means
that the microphones can be spaced a maximum of 17 cm apart, half of our 34 cm wavelength, when dealing
with a 1 kHz source. The angular resolution of the array is also determined by the length of the aperture. As the
length of the aperture increases, the beam-width decreases, as shown in Figure 5 below.

Figure 5: Length of Aperture vs Directivity

The last parameter considered was the limitation of the Far-Field Assumption. Since sound waves are originally
spherical, our array must be far enough away to assume that the waves are planar. Utilizing planar propagation allows
for the assumption of point-to-point propagation, simplifying calculations and modeling in array signal processing.

After experimenting with different spacings of four-microphone arrays, data was taken and processed through a
preliminary algorithm. Figure 6 shows the results of the spacing testing and its ability to output the correct angle.
Data was taken for three different angles over four different trials, and the RMS error of the outputted angle was
calculated. The graph shows that the least amount of error occurred at 17 cm, which determined the final
dimensions of the microphone arrays.

Figure 6: RMS Error of Spacing Testing

The audio signals obtained through the microphone array must now be converted into data the device can use. It
is imperative that these audio signals are sampled synchronously. If not, all calculations will be incorrect and
output audio may be distorted or mistimed with the video. Before amplification, all extraneous signals beyond
2 kHz (high-frequency noise) and below 60 Hz (DC components) are eliminated using high-pass and low-pass
passive filters. These anti-alias filters condition the signal before amplification. Before the audio signals are
processed, they must be amplified. Since the audio signals from the microphones are weak (microvolts), a
preamplifier connected to the microphones strengthens the signal (up to 10 V peak-to-peak) for further
processing [3]. Once the analog signal is ready for processing, it is converted to a set of digital data. For this
process a Data Acquisition System (DAQ) is used. Many DAQs use a “round robin” approach when sampling
multiple channels. However, “round robin” does not guarantee synchronous data acquisition, which is key to the
required beamforming processing. An 8-channel ADC samples the signal and quantifies the values to send to the
location calculator. The DAQ used to prepare the signal must support the number of channels required. Since the
maximum frequency of speech signals is around 3400 Hz, a minimum sampling rate of 6800 Hz is needed across
each channel of input. However, as conversational frequencies lie under 2 kHz, a smaller sampling frequency may
suffice. We are using analog, omnidirectional electret condenser microphones as inputs to our system.
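Both limits in this section are Nyquist criteria, one spatial and one temporal. A quick sketch (assuming c = 343 m/s for the speed of sound) reproduces the 17 cm spacing bound and the 6800 Hz per-channel sampling floor:

```python
# Spatial-Nyquist spacing check: sensors must sit no more than half
# the minimum wavelength apart to avoid spatial aliasing.
c = 343.0                      # assumed speed of sound, m/s
f_max_design = 1000.0          # 1 kHz upper design frequency
lam_min = c / f_max_design     # ~34 cm minimum wavelength
d_max = lam_min / 2.0          # -> ~17 cm maximum sensor spacing
print(f"max sensor spacing: {d_max*100:.0f} cm")

# Temporal Nyquist for speech: a 3400 Hz ceiling needs at least
# fs = 6800 Hz on every channel; content under 2 kHz needs only 4 kHz.
f_speech = 3400.0
fs_min = 2 * f_speech
print(f"minimum per-channel sampling rate: {fs_min:.0f} Hz")
```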
Once the sampled audio signals are stored, the signals are sent to the Location Calculator module of the device.
The location calculation is processed within the PC, as speed and signal processing power are vital for signal
tracking. The processor implements spatiotemporal filters for source localization. This decreases the need for
extremely high sampling rates required for time-domain processing. The spatiotemporal method is utilized for the
location calculation and also for our optional audio enhancement.
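The frequency-domain steering behind the spatiotemporal method applies each sensor's delay as a phase shift, exp(-j2πfτ), before the weighted sum. The toy below sketches this for one frequency bin of a hypothetical four-sensor line array at the 17 cm design spacing; the uniform weights and geometry are illustrative only, not AVAL's exact filters:

```python
import cmath
import math

# One-bin frequency-domain beamformer sketch: the per-sensor delay
# tau is applied as a phase rotation exp(-j*2*pi*f*tau), then the
# weighted, rotated spectra are summed.
c = 343.0                        # assumed speed of sound, m/s
d = 0.17                         # sensor spacing, m (the 17 cm design)
positions = [i * d for i in range(4)]

def steer_bin(spectra_bin, f, theta_deg, weights=None):
    """Combine one frequency bin across sensors, steered to theta_deg.
    spectra_bin: complex value of this bin at each sensor."""
    weights = weights or [1.0] * len(spectra_bin)
    theta = math.radians(theta_deg)
    out = 0j
    for x, s, w in zip(positions, spectra_bin, weights):
        tau = x * math.cos(theta) / c          # per-sensor delay, s
        out += w * s * cmath.exp(-2j * math.pi * f * tau)
    return out
```

When the steering angle matches the source angle, the four rotated bins add coherently (magnitude near 4 for unit inputs); at other angles the sum is smaller, which is what the location calculator exploits.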
One camera mounted on the device is able to rotate 360° in the azimuth direction. Motors with precise position
control, such as servos, are used to actuate the camera mount. The camera control hardware was designed
to be quiet, so as not to distort the microphone input. During operation, the camera points toward the current
speaker and holds that position until another participant speaks.
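As a rough illustration of the angle-to-servo step (the report does not describe the actual servo controller interface), standard hobby servos are commanded with a pulse width of roughly 1000-2000 µs across their travel; a hypothetical mapping might look like:

```python
# Hypothetical mapping from a computed speaker angle to a hobby-servo
# pulse width. The 1000-2000 us range over 180 degrees is the common
# convention for analog servos; AVAL's calibration may differ.
def angle_to_pulse_us(angle_deg, min_us=1000.0, max_us=2000.0, travel_deg=180.0):
    angle_deg = max(0.0, min(travel_deg, angle_deg))   # clamp to travel
    return min_us + (angle_deg / travel_deg) * (max_us - min_us)

print(angle_to_pulse_us(90.0))   # broadside maps to mid-travel, 1500.0 us
```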
5. Audio Processing Algorithms
The scan function or source localization function acts to find a persistent sound source at a new location (Figure 7).
First, the processing unit takes a short-time Fourier transform of the data coming in from the DAQ, in blocks of about 150 ms.
Then, a spatial transform is applied to the frequencies under 2 kHz. The result is a two-dimensional wavenumber
vs. frequency array. Peaks within the array above 95% of the maximum value are chosen, and the array is summed
along the discretized angles that the system samples (50, 70, 90, 110, and 130 degrees). The largest
sum across the oblique lines corresponding to each angle determines the position of the speaker.
Figure 7: Scan Algorithm
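The decision stage of the scan described above (threshold at 95% of the maximum, then sum the binary peak map along each candidate angle's line) can be sketched as follows; the `lines` index geometry is a placeholder for the wavenumber-frequency relationship the real system derives from the array:

```python
# Sketch of the scan's decision stage. `power` is the magnitude of
# the wavenumber-vs-frequency array; `lines` maps each candidate
# angle to the (wavenumber, frequency) index pairs on its oblique line.
ANGLES = [50, 70, 90, 110, 130]          # degrees, the fixed scan angles

def pick_angle(power, lines):
    """power: 2D list [wavenumber][frequency] of magnitudes.
    lines: dict angle -> list of (k, f) index pairs to sum along."""
    peak = max(max(row) for row in power)
    thresh = 0.95 * peak
    # Binary peak map: 1 where the bin clears 95% of the maximum.
    peaks = [[1 if v >= thresh else 0 for v in row] for row in power]
    # Sum the peak map along each angle's line; largest sum wins.
    sums = {a: sum(peaks[k][f] for k, f in lines[a]) for a in ANGLES}
    return max(sums, key=sums.get)
```

On a toy array whose strong bins lie along one angle's line, the function returns that angle; ties would need a broadside-default rule like the one described in the evaluation section.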
Figure 8 shows a spectrogram of a recorded speech signal. Most of the spectral content of a voice falls under
2 kHz, and the spectral content does not change drastically as words change with time. Additionally, there are three
major noise bands that are constant over time; these are due to the room acoustics and equipment running in the
room. Therefore, in our algorithm, only frequencies under 2 kHz were processed in further stages. Figure 8 also
shows a 2048-point temporal FFT of a 150 ms speech signal with the frequencies over 2 kHz cut off. The FFT consists
of three dominant peaks where most of the spectral content is contained. Every other frequency that arises in the
FFT is at least 10 dB below the three dominant peaks.
Figure 8: Spectrogram of speech signal (left), and FFT of speech signal (right)
Figure 9 shows the result of a spatial FFT taken across the temporal FFT data. This process results in a wavenumber
vs. frequency plot whose characteristics relate to the source angle. Also shown is the result of applying
a peak algorithm to that dataset: the final dataset consists of 1's at each peak and 0's otherwise. If the
data is taken while a person is speaking from a given angle, a series of peaks results along the corresponding linear relationship. The
output angle is determined by summing across the five discretized angles and selecting the largest sum as the
speaker’s angle.
Figure 9: Spatial FFT of Temporal FFT (left) and peak algorithm (right)
6. Experimental Evaluation and Validation
Individually, each component, including the arrays, DAQ, and the servo, was found to fall well within operational
requirements. The first experiment focused on location accuracy. Three speakers spoke at angles varying from 50°
to 130° and 230° to 310° with intervals of 10° on each side of the array. This data is represented in Figure 10. The
gold X’s signify AVAL’s predefined angles and the green X’s signify the midpoints between two predefined angles.
Examining the chart, it becomes clear that all predefined angles are precise within 2-3° of the theoretical
value. This falls well within the 15° tolerance defined in the requirements. Moving to the midpoints, it should be
noted that the camera is always pointed towards one of AVAL's predefined angles. An anomaly can be seen at 120°,
where the system defaults to broadside (90°). When the system cannot determine an accurate speaker location
(110° and 130° in this example), the ambient noise in the room wins out, which defaults the system to
broadside. Overall, the system test proved successful, as the speaker was always detected at an AVAL
predetermined angle.

Figure 10: Location Accuracy Experimentation Results

The second test consisted of measuring time to location. In each test the camera started at broadside of the
opposite array. Each speaker spoke at one of AVAL's predefined angles and the time to locate the speaker was
measured. Observing the data in Figure 11, it can be seen that AVAL has an average response time of roughly 2.25
seconds, well within the 4 seconds stated in the requirements. The algorithm response time is under a quarter
second; AVAL takes its time locating the speaker in order to ensure smooth, necessary transitions between
speakers. With these experiments completed, it can be said that the AVAL system functions well within its
predefined requirements.

Figure 11: Time to Location Experiment Results

References
[1] Davis, Andrew, and Ira Weinstein. "Videoconferencing by the Numbers." Wainhouse Research (2011). Web. 26 Sept. 2013.
[2] General Services Administration, United States of America. FY 2012 Congressional Justification. (2011).
[3] Lewis, Jerad. Understanding Microphone Sensitivity. Analog Devices, (2011).