Method and device for sound source localization
10856094 · 2020-12-01
Assignee
Inventors
CPC classification
H04S2400/03 (ELECTRICITY)
H04R5/04 (ELECTRICITY)
H04S2420/11 (ELECTRICITY)
H04S3/02 (ELECTRICITY)
International classification
H04S3/02 (ELECTRICITY)
Abstract
A method and an apparatus for locating a sound source are provided. The method includes: obtaining M channels of audio signals of a preset format by microphone arrays located in different planes (S100); preprocessing the M channels of audio signals of the preset format, and projecting them onto the same plane, so as to obtain N channels of audio signals, where M≥N (S200); performing a time-frequency transformation on each of the N channels of audio signals, so as to obtain frequency domain signals of the N channels of audio signals (S300); further calculating a covariance matrix of the frequency domain signals and performing a smoothing process (S400); performing an eigenvalue decomposition of the smoothed covariance matrix (S500); estimating the sound source direction according to an eigenvector corresponding to the maximum eigenvalue, so as to obtain a sound source orientation parameter (S600).
Claims
1. A method for locating a sound source comprising: step 1: obtaining M channels of audio signals of a preset format by using microphone arrays located on different planes, wherein M is a positive integer; step 2: preprocessing the M channels of audio signals of the preset format, and projecting the M channels of audio signals of the preset format onto a same plane to obtain N channels of audio signals, wherein N is a positive integer, and M≥N; step 3: performing a time-frequency transform on each of the N channels of audio signals to obtain frequency domain signals of the N channels of audio signals; step 4: calculating covariance matrices of the frequency domain signals, and performing a smoothing process on each of the covariance matrices to obtain smoothed covariance matrices; step 5: performing an eigenvalue decomposition on each of the smoothed covariance matrices to obtain N eigenvalues and corresponding eigenvectors; and step 6: estimating a direction of the sound source according to an eigenvector corresponding to a maximum eigenvalue of the N eigenvalues, to obtain sound source orientation parameters, wherein in the step 1, M=4, the preset format is an Ambisonic A format, and the four channels of audio signals (LFU, RFD, LBD, RBU) are located on different planes, wherein a specific process of the preprocessing in the step 2 is: converting the four channels of audio signals of the Ambisonic A format into three (N=3) channels of audio signals (L, R, S) in the same plane by a conversion matrix A:
2. The method for locating the sound source according to claim 1, wherein a process of the preprocessing in the step 2 is: converting the four channels of audio signals of the Ambisonic A format into four (N=4) channels of audio signals (F, R, B, L) in the same plane by the conversion matrix A:
3. The method for locating the sound source according to claim 2, wherein when the microphone arrays pick up the audio signals, if the sound source is in a middle position (θ=0), the conversion matrix
4. The method for locating the sound source according to claim 1, wherein a process of the preprocessing in the step 2 is: step 21: converting the four channels of audio signals of the Ambisonic A format into audio signals (W, X, Y, Z) of an Ambisonic B format by the conversion matrix A:
5. The method for locating the sound source according to claim 1, wherein the time-frequency transform in the step 3 is realized by a Discrete Fourier Transform (DFT), a Fast Fourier Transform (FFT) or a Modified Discrete Cosine Transform (MDCT).
6. The method for locating the sound source according to claim 1, wherein a specific process of estimating the direction of the sound source in the step 6 is: searching for, according to the eigenvector corresponding to the maximum eigenvalue of the N eigenvalues, an index value corresponding to a maximum inner product value by using an inner product of the eigenvector corresponding to the maximum eigenvalue of the N eigenvalues and a steering vector, wherein the index value corresponds to the direction of the sound source.
7. The method for locating the sound source according to claim 1, wherein: in the step 3, the frequency domain signals are divided into a plurality of sub-bands; in the step 4, the covariance matrices are calculated for the plurality of sub-bands and the smoothing process is performed; in the step 5, the eigenvalue decomposition is respectively performed on the covariance matrices of the plurality of sub-bands after the smoothing process to obtain N eigenvalues and corresponding eigenvectors of the covariance matrices of the plurality of sub-bands; and in the step 6, the direction of the sound source is estimated for each sub-band of the plurality of sub-bands according to the eigenvector corresponding to the maximum eigenvalue, and the sound source orientation parameters are obtained in combination with detection results of the direction of the sound source for the each sub-band.
8. A device for locating a sound source, comprising: an acquisition unit of an audio signal of a preset format, a signal preprocessing unit, a time-frequency transform unit, a frequency domain signal processing unit, and a sound source orientation estimation unit, wherein the acquisition unit of the audio signal of the preset format is configured to obtain M channels of audio signals of the preset format by using microphone arrays located on different planes, and send the M channels of audio signals of the preset format to the signal preprocessing unit, wherein M is a positive integer and M=4; the signal preprocessing unit is configured to preprocess the M channels of audio signals of the preset format and project the M channels of audio signals of the preset format onto a same plane to obtain N channels of audio signals, and send the N channels of audio signals to the time-frequency transform unit, wherein N is a positive integer, and M≥N; the time-frequency transform unit is configured to perform a time-frequency transform on each of the N channels of audio signals to obtain frequency domain signals of the N channels of audio signals; the frequency domain signal processing unit is configured to process the frequency domain signals, calculate covariance matrices of the frequency domain signals and perform a smoothing process, further perform an eigenvalue decomposition on the covariance matrices to obtain eigenvalues and eigenvectors, and send the eigenvalues and eigenvectors to the sound source orientation estimation unit; and the sound source orientation estimation unit is configured to estimate a direction of the sound source according to an eigenvector corresponding to a maximum eigenvalue of the eigenvalues, to obtain sound source orientation parameters, wherein the preset format is an Ambisonic A format, and four channels of audio signals (LFU, RFD, LBD, RBU) are located on the different planes, wherein the signal preprocessing unit is configured to convert 
the four channels of audio signals of the Ambisonic A format into three (N=3) channels of audio signals (L, R, S) in the same plane by a conversion matrix A:
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE EMBODIMENTS
(7) In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below in combination with the accompanying drawings of the embodiments of the present disclosure. It is apparent that the described embodiments are part of the embodiments of the present disclosure, instead of all of them. All the other embodiments obtained by those skilled in the art on the basis of the embodiments of the present disclosure without creative efforts will fall within the scope of protection of the present disclosure.
(8) With reference to
(9) Step S100: obtaining M channels of audio signals of a preset format by using microphone arrays located in different planes.
(10) In the embodiment, the M channels of audio signals of the preset format may be four channels of audio signals (LFU, RFD, LBD, RBU) of Ambisonic A format. See
(11) Step S200: preprocessing the M channels of audio signals of the preset format, and projecting them onto a same plane to obtain N channels of audio signals.
(12) In the embodiment, referring to
(13)
(14) where the conversion matrix
(15)
and the values of the elements a_11, a_12, . . . , a_34 of A are constants and are determined by different sound source scenes, e.g.
(16)
(17) By converting the audio signals of the Ambisonic A format into audio signals of the LRS format, errors caused by height information can be excluded and a more accurate detection result can be obtained.
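The A-format-to-planar projection described above can be sketched as a single matrix multiply. The patent leaves the element values a_11 through a_34 scene-dependent and does not reproduce them here, so the matrix below is a purely illustrative placeholder, not the patented coefficients:

```python
import numpy as np

# Hypothetical 3x4 conversion matrix A: the actual element values are
# scene-dependent and not given in this text; these are placeholders.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],   # L: combine the two left-side capsules
    [0.0, 1.0, 0.0, 1.0],   # R: combine the two right-side capsules
    [0.5, 0.5, 0.5, 0.5],   # S: an omni-like sum of all capsules
])

def a_format_to_lrs(lfu, rfd, lbd, rbu):
    """Project 4 A-format channels (on different planes) onto 3 planar channels."""
    x = np.stack([lfu, rfd, lbd, rbu])   # shape (4, num_samples)
    return A @ x                          # shape (3, num_samples)

# Example with four one-sample "signals"
l, r, s = a_format_to_lrs(np.array([1.0]), np.array([2.0]),
                          np.array([3.0]), np.array([4.0]))
```

The same structure covers the N=4 (F, R, B, L) variant by using a 4x4 matrix instead.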
(18) In an embodiment of the present disclosure, referring to
(19)
(20) where the conversion matrix
(21)
where θ is a height angle, and f(θ) is a function related to θ, e.g.
(22)
(23) When the microphone array picks up audio, if the sound source is in the middle position, the picked-up audio contains no height information (θ=0), and the conversion matrix
(24)
and the values of the elements a_11, a_12, . . . , a_44 of A are constants and are determined by different sound source scenes, e.g.
(25)
(26) By using the four-channel audio detection method, the resolution in the horizontal direction can be effectively improved.
(27) Step S300: performing time-frequency transform, channel by channel, on the N channels of audio signals to obtain frequency domain signals of the N channels of audio signals.
(28) In the embodiment, the time-frequency transform can be realized by Discrete Fourier Transform (DFT), Fast Fourier Transform (FFT) or Modified Discrete Cosine Transform (MDCT).
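As stated, any of DFT, FFT, or MDCT can realize the transform. A minimal per-channel FFT sketch in NumPy (the frame length of 512 is an arbitrary choice for illustration):

```python
import numpy as np

def to_frequency_domain(channels, frame_len=512):
    """FFT one frame of each channel; DFT, FFT, or MDCT would all serve here."""
    frames = np.asarray(channels)[:, :frame_len]   # (N, frame_len)
    return np.fft.rfft(frames, axis=1)             # (N, frame_len//2 + 1) bins

# A test tone with exactly 8 cycles per frame peaks at frequency bin 8
n = np.arange(512)
sig = np.sin(2 * np.pi * 8 * n / 512)
spec = to_frequency_domain([sig, sig, sig])
peak_bin = int(np.argmax(np.abs(spec[0])))
```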
(29) Step S400: calculating covariance matrices of the frequency domain signals, and performing smoothing process on the covariance matrices.
(30) In the embodiment, the calculation of the covariance matrix can be set in a specific frequency band, or the covariance matrix of each sub-band can be calculated separately after dividing the entire frequency band into sub-bands.
(31) The formula for calculating the covariance matrix for a particular frequency band is:
(32)
(33) where n represents the index of an audio frame in the audio signal; k represents the index of a frequency point of the frequency domain signal; X(n,k) represents the vector composed of the values of the k-th frequency point in the n-th frame, specifically X(n,k)=[X_1(n,k) X_2(n,k) . . . X_N(n,k)], where X_i, i=1, 2, . . . , N, is the frequency domain signal of the i-th channel; and k_l and k_u are respectively the start frequency point and the cut-off frequency point of the covariance matrix calculation.
(34) The smoothing process is:
cov_s(n,k) = α·cov_s(n-1,k) + (1-α)·cov(n,k)
(35) where α is a smoothing factor, which can be set to a fixed value, e.g., α=0.9, or can be selected adaptively according to the characteristics of the audio signal.
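The covariance and its recursive smoothing can be sketched as follows. Since the band-summation formula (32) is not reproduced in this text, a per-frequency-point outer-product covariance is assumed here for illustration:

```python
import numpy as np

def covariance(X_nk):
    """Instantaneous covariance of the N-channel frequency-point vector X(n,k)."""
    x = np.asarray(X_nk).reshape(-1, 1)    # (N, 1) column vector
    return x @ x.conj().T                  # (N, N) Hermitian outer product

def smooth(cov_prev, cov_now, alpha=0.9):
    """First-order recursion: cov_s(n,k) = alpha*cov_s(n-1,k) + (1-alpha)*cov(n,k)."""
    return alpha * cov_prev + (1.0 - alpha) * cov_now

# Two toy frames: energy first in channel 1, then in channel 2
X0 = np.array([1.0 + 0j, 0.0, 0.0])
X1 = np.array([0.0, 1.0 + 0j, 0.0])
cs = covariance(X0)
cs = smooth(cs, covariance(X1))            # blend frame 1 into the running matrix
```

The smoothed matrix retains most of the earlier frame's energy (0.9 on the first diagonal entry) while admitting a fraction (0.1) of the new frame, as the recursion prescribes.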
(36) Step S500: performing eigenvalue decomposition on the smoothed covariance matrices to obtain N eigenvalues and corresponding eigenvectors.
(37) Step S600: estimating the direction of the sound source according to the eigenvector corresponding to the maximum eigenvalue, to obtain sound source orientation parameters.
(38) In the embodiment, the estimation of the direction of the sound source according to the eigenvector corresponding to the maximum eigenvalue can be specifically performed as follows:
(39) searching for an index value corresponding to a maximum inner product value by using the inner product of the maximum eigenvector and a steering vector, where the index value corresponds to the direction of sound source.
(40) The steering vector is:
(41)
(42) where K is the order of the steering vector, and is typically determined by the locating accuracy.
(43) For three channels of audio signals, the value of p.sub.k, k=1,2, . . . , K is determined by the following formula:
(44)
(45) For four channels of audio signals, the value of p.sub.k, k=1,2, . . . , K is determined by the following formula:
(46)
(47) The inner product D of the maximum eigenvector V and the steering vector P is:
D=PV
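The maximum-inner-product search can be sketched as below. The exact steering-vector formulas for p_k (paragraphs (44) and (46)) are not reproduced in this text, so the grid of unit vectors used here is a hypothetical stand-in; only the argmax-over-inner-products structure follows the method:

```python
import numpy as np

# Hypothetical steering matrix P: K candidate directions (1-degree grid),
# each row a unit vector. The patented p_k values are not given here.
K, N = 360, 3
angles = np.deg2rad(np.arange(K))
P = np.stack([np.cos(angles), np.sin(angles), np.full(K, 1 / np.sqrt(2))], axis=1)
P /= np.linalg.norm(P, axis=1, keepdims=True)

def estimate_direction(v):
    """Index of the steering vector most aligned with the maximum eigenvector v."""
    D = np.abs(P @ v)          # inner product per candidate direction
    return int(np.argmax(D))   # index value corresponds to the source direction

# Sanity check: an eigenvector equal to the 45-degree steering vector
idx = estimate_direction(P[45])
```

K trades off angular resolution against search cost, matching the remark that the order of the steering vector is determined by the locating accuracy.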
(48) In the embodiment of the present disclosure, the frequency domain signals obtained in step S300 can also be divided into several sub-bands. In step S400, a covariance matrix is calculated for each of the sub-bands and smoothing process is performed. In step S500, eigenvalue decomposition is respectively performed on the covariance matrices of the several sub-bands after the smoothing process to obtain N eigenvalues and corresponding eigenvectors of the covariance matrix of each sub-band. In step S600, the direction of the sound source is estimated for each sub-band according to the eigenvector corresponding to the maximum eigenvalue, and sound source orientation parameters are obtained in combination with the detection results for each sub-band.
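The sub-band variant can be sketched as follows. The text does not specify how the per-band detection results are combined, so a simple majority vote is assumed here purely for illustration; `detect` stands in for the whole per-band covariance/eigendecomposition/search chain:

```python
import numpy as np

def per_band_doa(spec, band_edges, detect):
    """Run a DOA detector on each sub-band, then combine by majority vote
    (the combination rule is an assumption; the patent leaves it open)."""
    votes = [detect(spec[:, lo:hi]) for lo, hi in band_edges]
    vals, counts = np.unique(votes, return_counts=True)
    return int(vals[np.argmax(counts)]), votes

# Toy detector: pretend the three bands report these angle indices
fake = iter([90, 90, 87])
result, votes = per_band_doa(np.zeros((3, 30)),
                             [(0, 10), (10, 20), (20, 30)],
                             lambda band: next(fake))
```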
(49) In the embodiment of the present disclosure, the DOA detection can also be performed adaptively on the four channels of audio signals of Ambisonic A format according to the divergence parameter, as shown in
(50) Step S100: obtaining four channels of audio signals (LFU, RFD, LBD, RBU) of Ambisonic A format by using microphone arrays located in different planes.
(51) Step S200: preprocessing the four channels of audio signals of Ambisonic A format, projecting them onto a same plane to obtain four channels of audio signals (W, X, Y, Z) of the B format in the same plane, and determining whether the three (N=3) channels of audio (L, R, S) or the four (N=4) channels of audio will be used to estimate the direction of sound source, according to the four channels of audio signals of the B format.
(52) In the embodiment, the specific preprocessing steps are as follows:
(53) Step S201: converting the four channels of audio signals of the Ambisonic A format into audio signals (W, X, Y, Z) of the Ambisonic B format by a conversion matrix A:
(54)
(55) where the conversion matrix
(56)
and the values of the elements a_11, a_12, . . . , a_44 of A are constants and are determined by different sound source scenes, e.g.
(57)
(58) Step S202: estimating a divergence parameter based on an energy of a signal in the audio signals of the B format.
(59)
(60) where Pz and Pw are the powers of the Z signal and the W signal, respectively.
(61) Step S203: determining whether the divergence is greater than a set threshold, wherein the threshold is set by an empirical value according to different scenes.
(62) In the embodiment of the present disclosure, the range of the value of the threshold may be [0.3, 0.6].
(63) Step S204: if the divergence is greater than the set threshold, using the three (N=3) channels of audio signals (L, R, S) to estimate the direction of sound source; and
(64) if the divergence is not greater than the set threshold, using the four (N=4) channels of audio signals to estimate the direction of sound source.
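Steps S202-S204 can be sketched as below. The divergence formula (59) is not reproduced in this text; a plain power ratio Pz/Pw is assumed here for illustration, since the text only states that the parameter is estimated from the powers of the Z and W signals:

```python
import numpy as np

def choose_channels(w, z, threshold=0.45):
    """Pick 3- or 4-channel detection from a Z/W power ratio.
    The ratio Pz/Pw is an assumed stand-in for the patent's divergence."""
    p_w = np.mean(np.abs(w) ** 2)                 # power of the W signal
    p_z = np.mean(np.abs(z) ** 2)                 # power of the Z signal
    divergence = p_z / max(p_w, 1e-12)
    # Strong height (Z) energy -> fall back to the planar 3-channel (L, R, S) path
    return 3 if divergence > threshold else 4

n3 = choose_channels(w=np.ones(100), z=np.ones(100))        # divergence = 1.0
n4 = choose_channels(w=np.ones(100), z=0.1 * np.ones(100))  # divergence = 0.01
```

The threshold of 0.45 sits inside the empirical range [0.3, 0.6] mentioned in the text.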
(65) Step S300: performing time-frequency transform, channel by channel, on the N channels of audio signals to obtain frequency domain signals of the N channels of audio signals.
(66) In the embodiment, the time-frequency transform can be realized by Discrete Fourier Transform (DFT), Fast Fourier Transform (FFT) or Modified Discrete Cosine Transform (MDCT).
(67) Step S400: calculating covariance matrices of the frequency domain signals, and performing smoothing process on the covariance matrices.
(68) In the embodiment, the calculation of the covariance matrix can be set in a specific frequency band, or the covariance matrix of each sub-band can be calculated separately after dividing the full frequency band into sub-bands.
(69) Step S500: performing eigenvalue decomposition on the smoothed covariance matrices to obtain N eigenvalues and corresponding eigenvectors.
(70) Step S600: estimating the direction of the sound source according to the eigenvector corresponding to the maximum eigenvalue, to obtain sound source orientation parameters.
(71) In the embodiment, the estimation of the direction of the sound source according to the eigenvector corresponding to the maximum eigenvalue can be specifically performed as follows:
(72) searching for an index value corresponding to a maximum inner product value by using the inner product of the maximum eigenvector and a steering vector, where the index value corresponds to the direction of sound source.
(73) In the present embodiment, the divergence parameter can also be used as a confidence measure for the DOA result: when the divergence parameter is small, the DOA result has high confidence; when the divergence parameter is large, the DOA result has low confidence.
(74) In the embodiment, the DOA detection is adaptively performed on the input multiple channels of audio signals based on the divergence parameter obtained by estimating the energy of the Z signal, so the accuracy of localization can be improved at low complexity.
(75) With reference to
(76) The acquisition unit 100 of audio signal of a preset format is configured to obtain M channels of audio signals of a preset format by using microphone arrays located in different planes, and send the M channels of audio signals of the preset format to the signal preprocessing unit 200.
(77) The signal preprocessing unit 200 is configured to preprocess the received M channels of audio signals of the preset format and project them onto a same plane to obtain N channels of audio signals, and send the N channels of audio signals to the time-frequency transform unit 300.
(78) The time-frequency transform unit 300 is configured to perform time-frequency transform on the received N channels of audio signals, channel by channel, to obtain frequency domain signals of the N channels of audio signals, and send the frequency domain signals of the N channels of audio signals to the frequency domain signal processing unit 400.
(79) The frequency domain signal processing unit 400 is configured to process the frequency domain signals of the N channels of audio signals, calculate covariance matrices of the frequency domain signals and perform smoothing process, further perform eigenvalue decomposition on the covariance matrices, and send the obtained eigenvalues and eigenvectors to the sound source orientation estimation unit 500.
(80) The sound source orientation estimation unit 500 is configured to estimate the direction of sound source according to the eigenvector corresponding to the maximum eigenvalue of the eigenvalues, to obtain sound source orientation parameters.
(81) In the device disclosed in the embodiment, the Ambisonic audio signals located in different planes are projected onto the same plane for detection, which can effectively improve the accuracy of the DOA detection.
(82) The above description of various embodiments of the present disclosure is provided to those skilled in the art for the purpose of illustration. It is not intended to be exhaustive or to limit the present disclosure to the single disclosed embodiment. As described above, various alternatives and modifications to the present disclosure will be apparent to those skilled in the art. Thus, while a few alternative embodiments have been discussed in detail, other embodiments will be apparent to or can be readily obtained by those skilled in the art. The present disclosure is intended to cover all the alternatives, modifications, and variations of the present disclosure discussed above, as well as other embodiments that fall within the spirit and scope of the present disclosure.