Sound source localization method and sound source localization apparatus based coherence-to-diffuseness ratio mask

Abstract

Provided is a sound source localization method including steps of: (a) receiving a mixed signal of a target sound source signal and noise and echo signals through multiple microphones including at least two microphones; (b) generating, from the mixed signal, a binarized mask based on a diffuseness obtained by using a coherence-to-diffuseness ratio CDR, which carries information on the target sound source and the noise source; (c) pre-processing the mixed signal input to the multiple microphones by using the generated binarized mask; and (d) performing a predetermined algorithm such as the GCC-PHAT or the SRP-PHAT on the pre-processed signal to estimate a direction of the target sound source.

Claims

1. A sound source localization method implemented by execution of a processor of a sound source localization apparatus, comprising steps of: (a) receiving a mixed signal of a target sound source signal and a noise signal through multiple microphones including at least two microphones; (b) generating a mask based on a diffuseness reflecting information on a target sound source and a noise source by using the mixed signal; (c) pre-processing the mixed signal received to the multiple microphones by using the generated mask; and (d) estimating a direction for the target sound source by performing a predetermined algorithm on the pre-processed mixed signal.

2. The sound source localization method according to claim 1, wherein, in the step (b) of generating the mask, a coherence-to-diffuseness ratio CDR(l,f) for each frequency frame f and each time frame l is calculated, a diffuseness D(l,f) is calculated by using the coherence-to-diffuseness ratio CDR(l,f), and a binarized mask M is generated by setting a mask value according to the following Mathematical Formula by using the diffuseness D(l,f), $MASK = \left\{ \begin{matrix} 1 : D \le threshold \\ 0 : D > threshold \end{matrix} \right.$

3. The sound source localization method according to claim 2, wherein, in the step (c) of pre-processing the mixed signal, the mixed signal is binarized by using a binarized mask.

4. The sound source localization method according to claim 1, wherein the predetermined algorithm in the step (d) is a sound source localization method based on generalized cross correlation (GCC) value or a sound source localization method based on a steered response power SRP.

5. The sound source localization method according to claim 4, wherein the predetermined algorithm applies a phase transform (PHAT) scheme for applying a weighting factor $\psi(\omega)$ according to the following Mathematical Formula to signals of each frequency, $\psi_{kl}(\omega) = \frac{1}{|X_k(\omega) X_l^*(\omega)|}$ herein, k and l are the number of the microphone, $\omega = 2\pi f$, $X_k(\omega)$ is a Fourier transform value for a signal being input to k-th microphone, and $X_l^*(\omega)$ is a conjugate value of the Fourier transform value.

6. The sound source localization method according to claim 2, wherein the coherence-to-diffuseness ratio CDR(l,f) for each frequency frame f and each time frame l is estimated according to the following Mathematical Formula by using a coherence for the noise signal n, the target sound source signal s, and the mixed signal x of the noise signal and the target sound signal, $CDR(l,f) = \frac{\Gamma_n(f) - \Gamma_x(l,f)}{\Gamma_x(l,f) - \Gamma_s(f)}$ herein, $\Gamma_n(f)$ is the coherence for the noise signal n, $\Gamma_s(f)$ is the coherence for the target sound source signal s, and $\Gamma_x(l,f)$ is the coherence for the mixed signal x of the noise signal and the target sound source signal s.

7. The sound source localization method according to claim 2, wherein the diffuseness D(l,f) is calculated according to the following Mathematical Formula, $D(l,f) = \frac{1}{CDR(l,f) + 1}$, $0 \le D \le 1$.

8. A sound source localization apparatus having a processor and being operable to estimate a direction of a target sound source by using signals input from multiple microphones by execution of the processor, comprising: a mixed signal input module which is connected to the multiple microphones and receives mixed signals of a target sound source signal and a noise signal from the multiple microphones; a mask generation module which generates and outputs a binarized mask based on a diffuseness by using the mixed signal provided from the mixed signal input module; an input signal pre-processing module which receives the binarized mask from the mask generation module, pre-processes the mixed signal by applying the binarized mask to the mixed signal provided from the mixed signal input module, and outputs the pre-processed mixed signal; and a target direction estimation module which receives the pre-processed mixed signal from the input signal pre-processing module, estimates a direction of the target sound source by performing a predetermined localization algorithm on the pre-processed mixed signal, and outputs the estimated direction.

9. The sound source localization apparatus according to claim 8, wherein the mask generation module performs: calculating a coherence-to-diffuseness ratio CDR(l,f) for each frequency frame f and each time frame l of the mixed signal provided from the mixed signal input module; calculating a diffuseness D(l,f) by using the coherence-to-diffuseness ratio CDR(l,f); and generating a binarized mask M by setting a mask value according to the following Mathematical Formula by using the diffuseness D(l,f), $MASK = \left\{ \begin{matrix} 1 : D \le threshold \\ 0 : D > threshold \end{matrix} \right.$

10. The sound source localization apparatus according to claim 8, wherein the predetermined localization algorithm of the target direction estimation module is a sound source localization method based on a generalized cross correlation (GCC) value or a sound source localization method based on a steered response power SRP.

11. The sound source localization apparatus according to claim 9, wherein the coherence-to-diffuseness ratio CDR(l,f) for each frequency frame f and each time frame l is estimated according to the following Mathematical Formula by using a coherence for the noise signal n, the target sound source signal s, and the mixed signal x of the noise signal n and the target sound signal, $CDR(l,f) = \frac{\Gamma_n(f) - \Gamma_x(l,f)}{\Gamma_x(l,f) - \Gamma_s(f)}$ herein, $\Gamma_n(f)$ is the coherence for the noise signal n, $\Gamma_s(f)$ is the coherence for the target sound source signal s, and $\Gamma_x(l,f)$ is the coherence for the mixed signal x of the noise signal n and the target sound source signal s.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a graph illustrating time delays with which sound signals from a sound source in a spherical coordinate system arrive at multiple microphones according to angles;

(2) FIG. 2 is a diagram illustrating cross correlation values when delay compensation is performed;

(3) FIG. 3 is a diagram illustrating a steered response power algorithm using a delay-and-sum beamforming method in the related art;

(4) FIG. 4 is a diagram illustrating an exemplary configuration of a diffusive noise;

(5) FIG. 5 is an exemplary diagram illustrating an input path of an input signal in an echo environment;

(6) FIG. 6 is a block diagram illustrating a sound source localization apparatus implemented by applying the sound source localization method according to the embodiment of the present invention;

(7) FIGS. 7A, 7B, and 7C are graphs illustrating an input mixed signal, an estimated CDR, and a binarized mask, respectively, in the sound source localization method according to the embodiment of the present invention; and

(8) FIGS. 8A and 8B are graphs illustrating the performance of the sound source localization method according to the embodiment of the present invention in comparison with a method in the related art.

DETAILED DESCRIPTION

(9) A sound source localization method and apparatus according to the present invention generates a binarized mask based on a diffuseness reflecting information on a target sound source and a noise source in a direction and a noise environment, converts an input signal by using the generated binarized mask, and applies a GCC-PHAT or a SRP-PHAT based on a cross correlation method to the converted input signal to estimate the direction of the target sound source.

(10) Hereinafter, a sound source localization method and apparatus according to a preferred embodiment of the present invention will be described in detail. The sound source localization method according to the present invention can be implemented by a processor such as a CPU of the sound source localization apparatus.

(11) In addition, the present invention also includes a computer-readable nonvolatile recording medium that stores program commands including operations for executing the above-described sound source localization method, and the program commands recorded on the recording medium can be executed by the processor of the sound source localization apparatus.

(12) As in Mathematical Formula 25, the signal-to-noise ratio SNR, which is the power ratio of the target sound source signal to the noise, is expressed by the ratio of the auto-correlation value $\Phi_s(l,f)$ of the target sound source to the auto-correlation value $\Phi_n(l,f)$ of the noise source. For an environment where a noise and a late echo exist, the auto-correlation values are taken to be equal at the two microphones, so that $\Phi_s(l,f)$ and $\Phi_n(l,f)$ can be expressed as in Mathematical Formula 25.

(13) $\begin{matrix} SNR(l,f) = \frac{\Phi_s(l,f)}{\Phi_n(l,f)} \\ \Phi_{s_1 s_1}(l,f) = \Phi_{s_2 s_2}(l,f) = \Phi_s(l,f) \\ \Phi_{n_1 n_1}(l,f) = \Phi_{n_2 n_2}(l,f) = \Phi_n(l,f) & [Mathematical Formula 25] \end{matrix}$

(14) $\Phi$: correlation value

(15) $\Phi_{s_1 s_1}(l,f)$: auto-correlation value of target sound source

(16) $\Phi_{n_1 n_1}(l,f)$: auto-correlation value of noise source

(17) Where l denotes a time frame, and f denotes a frequency bin.

(18) Next, the diffuseness D is measured by Mathematical Formula 26. Herein, $P_{pw}$ denotes a plane wave phasor, and $P_{diff}$ denotes a diffusive noise phasor. The plane wave and the diffusive noise can be distinguished by coherence. Theoretically, the coherence of the plane wave corresponds to 1, and the coherence of the diffusive noise corresponds to 0. The diffuseness D is expressed by a value ranging from 0 to 1. The larger the value, the higher the diffuseness. The smaller the value, the lower the diffuseness.

(19) On the other hand, the coherence-to-diffuseness ratio CDR can be expressed as Mathematical Formula 27. The coherence-to-diffuseness ratio CDR equals the signal-to-noise ratio SNR for the case where it is assumed that the target sound source is a plane wave and the noise is a diffusive noise. In other words, the coherence-to-diffuseness ratio CDR can be regarded as the ratio of the power of a signal with a high coherence to the power of a signal with a low coherence.

(20) $\begin{matrix} D = \frac{E[|P_{diff}|^2]}{E[|P_{diff}|^2] + E[|P_{pw}|^2]} & [Mathematical Formula 26] \\ CDR = \frac{E[|P_{pw}|^2]}{E[|P_{diff}|^2]} & [Mathematical Formula 27] \end{matrix}$

(21) $E[|P_{pw}|^2]$: power of plane wave

(22) $E[|P_{diff}|^2]$: power of diffusive noise

(23) Therefore, since a sound signal is a signal with a high coherence and a diffusive noise, which is the target noise of the present invention, is a signal with a low coherence, the coherence-to-diffuseness ratio CDR can be used in the same manner as the signal-to-noise ratio SNR. As illustrated in Mathematical Formula 28, when input signals $x_1(t)$ and $x_2(t)$ arrive at the two microphones, the coherence is defined as the cross correlation value $\Phi_{x_1 x_2}$ of these signals normalized by their auto-correlation values. Thus, the coherence is independent of time when it is assumed that the signal arrives from a fixed direction, regardless of changes in the magnitude of the sound source over time. In comparison with the cross correlation values, the coherence therefore reflects only the spatial characteristics of the target sound source and the noise source and excludes their temporal characteristics. That is, the formula of the coherence can be defined according to the spatial characteristics of the target signal and of the noise signal. Consequently, when the coherence-to-diffuseness ratio CDR is expressed by the coherence for the mixed signal, the coherence for the target source, and the coherence for the noise source rather than by the cross correlation values, the time and frequency regions in which the signal is dominant over the noise can be estimated from the mixed signal.

(24) $\begin{matrix} \Gamma_{x_1 x_2}(f) = \frac{\Phi_{x_1 x_2}(l,f)}{\sqrt{\Phi_{x_1 x_1}(l,f) \, \Phi_{x_2 x_2}(l,f)}} & [Mathematical Formula 28] \end{matrix}$

(25) According to the definition of the coherence according to Mathematical Formula 28, the coherence for the target sound source s and the noise source n can be expressed as Mathematical Formula 29, respectively.

(26) $\begin{matrix} \Gamma_s(f) = \frac{\Phi_{s_1 s_2}(l,f)}{\Phi_s(l,f)}, \quad \Gamma_n(f) = \frac{\Phi_{n_1 n_2}(l,f)}{\Phi_n(l,f)} & [Mathematical Formula 29] \end{matrix}$

(27) Next, the coherence for the input signal x(t) can be expressed as Mathematical Formula 30, which can be rewritten as Mathematical Formula 31 and Mathematical Formula 32 in terms of the coherence-to-diffuseness ratio CDR, and the coherence-to-diffuseness ratio CDR can then be expressed by Mathematical Formula 33 by using the coherence for each signal.

(28) $\begin{matrix} \Gamma_x(l,f) = \frac{\Phi_{x_1 x_2}(l,f)}{\Phi_x(l,f)} = \frac{\Phi_{s_1 s_2}(l,f) + \Phi_{n_1 n_2}(l,f)}{\Phi_s(l,f) + \Phi_n(l,f)} = \frac{\frac{\Phi_s(l,f)}{\Phi_n(l,f)} \cdot \frac{\Phi_{s_1 s_2}(l,f)}{\Phi_s(l,f)} + \frac{\Phi_{n_1 n_2}(l,f)}{\Phi_n(l,f)}}{\frac{\Phi_s(l,f)}{\Phi_n(l,f)} + 1} & [Mathematical Formula 30] \\ \Gamma_x(l,f) = \frac{SNR(l,f) \, \Gamma_s(f) + \Gamma_n(f)}{SNR(l,f) + 1} & [Mathematical Formula 31] \\ \Gamma_x(l,f) = \Gamma_s(f) + \frac{1}{CDR(l,f) + 1} \left( \Gamma_n(f) - \Gamma_s(f) \right) & [Mathematical Formula 32] \\ CDR(l,f) = \frac{\Gamma_n(f) - \Gamma_x(l,f)}{\Gamma_x(l,f) - \Gamma_s(f)} & [Mathematical Formula 33] \end{matrix}$
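As a numerical illustration of Mathematical Formulas 31 and 33, the following minimal sketch builds the mixture coherence from a known power ratio and then recovers that ratio. The function names and the numeric values are hypothetical and chosen only for the example; the SNR is a linear power ratio, not a value in dB.

```python
import numpy as np

def mixture_coherence(snr, gamma_s, gamma_n):
    """Mathematical Formula 31: coherence of the mixed signal."""
    return (snr * gamma_s + gamma_n) / (snr + 1.0)

def cdr_from_coherence(gamma_x, gamma_s, gamma_n):
    """Mathematical Formula 33: CDR from the three coherences."""
    return (gamma_n - gamma_x) / (gamma_x - gamma_s)

# Hypothetical example values (assumptions, not taken from the text).
gamma_s = np.exp(1j * 0.7)   # unit-magnitude coherence of a directional source
gamma_n = 0.3                # real-valued coherence of the noise
snr = 4.0                    # linear power ratio of source to noise

gamma_x = mixture_coherence(snr, gamma_s, gamma_n)
cdr = cdr_from_coherence(gamma_x, gamma_s, gamma_n)
print(cdr)  # complex number whose real part is close to the SNR used above
```

When the noise is diffuse and the source is a plane wave, the recovered value plays the role of the SNR, which is the equivalence the text relies on.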

(29) In addition, the diffuseness D according to Mathematical Formula 26 can be expressed as Mathematical Formula 34 by using the coherence.

(30) $\begin{matrix} D(l,f) = \frac{1}{CDR(l,f) + 1}, \quad 0 \le D \le 1 & [Mathematical Formula 34] \end{matrix}$
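The step from Mathematical Formula 26 to Mathematical Formula 34 can be written out by dividing the numerator and the denominator by the diffuse power, under the assumption that this power is non-zero. The two limiting cases show why D stays between 0 and 1 when the CDR is non-negative.

```latex
\begin{aligned}
D &= \frac{E[|P_{diff}|^2]}{E[|P_{diff}|^2] + E[|P_{pw}|^2]}
   = \frac{1}{1 + \dfrac{E[|P_{pw}|^2]}{E[|P_{diff}|^2]}}
   = \frac{1}{CDR + 1}, \\
CDR &\to \infty \;\Rightarrow\; D \to 0, \qquad
CDR = 0 \;\Rightarrow\; D = 1 .
\end{aligned}
```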

(31) Hereinafter, a sound source localization apparatus implemented by applying the sound source localization method according to the embodiment of the present invention will be described in detail.

(32) FIG. 6 is a block diagram illustrating an entire sound source localization apparatus implemented by applying the sound source localization method according to the embodiment of the present invention. Referring to FIG. 6, the sound source localization apparatus 10 according to the present invention includes a mixed signal input module 100, a mask generation module 110, an input signal pre-processing module 120, and a target direction estimation module 130. Each module of the sound source localization apparatus is a module operated by a processor such as a CPU of the sound source localization apparatus.

(33) The sound source localization apparatus 10 according to the present invention having the above-described configuration is connected to the multiple microphones 20 configured with the M microphones and estimates the direction of the target sound source by using the signals input from multiple microphones.

(34) The mixed signal input module 100 is connected to the multiple microphones and receives mixed signals of the target sound source signals, the noise signals, and the echo signals from the multiple microphones.

(35) The mask generation module 110 generates and outputs a binarized mask M based on the diffuseness by using the mixed signal provided from the mixed signal input module. The operation of the mask generation module will be described later in detail.

(36) The input signal pre-processing module 120 receives the binarized mask from the mask generation module and pre-processes and outputs the mixed signal by applying the binarized mask to the mixed signal provided from the mixed signal input module.

(37) The target direction estimation module 130 receives the pre-processed mixed signal from the input signal pre-processing module and estimates and outputs the direction of the target sound source by using the GCC algorithm or the SRP algorithm for the mixed signal or using the GCC-PHAT algorithm or the SRP-PHAT algorithm applying a phase transform weighting function.

(38) Hereinafter, a method of generating the binarized mask by using the diffuseness, which is used in the sound source localization method according to the present invention, will be described in detail. The mask generation module 110 of the sound source localization apparatus 10 according to the present invention is implemented by applying the following binarized mask generation method.

(39) In the present invention, the binarized mask based on the diffuseness is used such that the direction can be estimated at the time and frequency in which the target sound source is dominant according to the diffuseness value.

(40) First, the definition of the coherence for the target sound source, the noise, and the echo will be described in detail.

(41) In a case where the target sound source signal arrives at the microphones from a long distance with the arrival direction $\theta$, the coherence for the target sound source can be expressed as Mathematical Formula 35.

(42) $\begin{matrix} \Gamma_s(f) = \frac{\Phi_{s_1 s_2}(l,f)}{\Phi_s(l,f)} = e^{j 2 \pi f \Delta t} = e^{j k d \sin(\theta)}, \quad \Delta t = \frac{d \sin(\theta)}{c}, \quad k = \frac{2 \pi f}{c} & [Mathematical Formula 35] \end{matrix}$

(43) d: distance between microphones, c: speed of sound, $\theta$: direction of sound source
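The far-field coherence of Mathematical Formula 35 can be evaluated directly, as in the sketch below. The function name, the default speed of sound of 343 m/s, the 8 cm microphone spacing, and the frequency grid are assumptions made for illustration only.

```python
import numpy as np

def plane_wave_coherence(freq_hz, mic_distance_m, theta_rad, c=343.0):
    """Mathematical Formula 35: exp(j*k*d*sin(theta)) with k = 2*pi*f/c.

    c = 343 m/s is an assumed speed of sound, not a value from the text.
    """
    k = 2.0 * np.pi * np.asarray(freq_hz, dtype=float) / c
    return np.exp(1j * k * mic_distance_m * np.sin(theta_rad))

# Hypothetical setup: 8 cm spacing, source 30 degrees off broadside.
freqs = np.linspace(0.0, 8000.0, 257)
gamma_s = plane_wave_coherence(freqs, mic_distance_m=0.08,
                               theta_rad=np.deg2rad(30.0))
```

The magnitude is 1 at every frequency and only the phase depends on the direction, which is why the later derivation can assume a unit-magnitude target coherence without knowing the direction.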

(44) Next, in a case of considering an environmental background noise, it is assumed that the noise is a superposition of uncorrelated noises, the number of which is usually taken to be infinite. This noise arrives from all directions when the microphone is located at the center of a circle. That is, the noise sources are evenly distributed in all directions around the microphone, which results in no correlation in the input signal. This noise is called a diffusive noise or an isotropic noise. In most experiments, such a diffusive noise is generated by arranging a large number of uncorrelated noise sources in all directions as illustrated in FIG. 4. FIG. 4 is a diagram illustrating an exemplary configuration of the diffusive noise.

(45) The coherence for the diffusive noise is defined as Mathematical Formula 36 for two microphones.

(46) $\begin{matrix} \Gamma_{diffuse}(f) = \frac{\Phi_{n_1 n_2}(l,f)}{\Phi_n(l,f)} = \frac{\sin(kd)}{kd} = \frac{\sin\left(2 \pi f \frac{d}{c}\right)}{2 \pi f \frac{d}{c}} & [Mathematical Formula 36] \end{matrix}$
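A minimal sketch of Mathematical Formula 36 follows. It relies on the fact that `np.sinc(x)` computes sin(pi x)/(pi x), so the argument passed is 2fd/c. The function name and the default speed of sound are assumptions for the example.

```python
import numpy as np

def diffuse_coherence(freq_hz, mic_distance_m, c=343.0):
    """Mathematical Formula 36: sin(2*pi*f*d/c) / (2*pi*f*d/c).

    np.sinc(x) is sin(pi*x)/(pi*x), hence the argument 2*f*d/c.
    c = 343 m/s is an assumed speed of sound.
    """
    f = np.asarray(freq_hz, dtype=float)
    return np.sinc(2.0 * f * mic_distance_m / c)

# Hypothetical 8 cm spacing: the first zero of the coherence is at f = c / (2 d).
d = 0.08
first_zero_hz = 343.0 / (2.0 * d)
gamma_n = diffuse_coherence(np.array([0.0, first_zero_hz]), d)
```

The coherence is real-valued, equals 1 at 0 Hz, and decays with frequency and spacing, so a diffuse field is only weakly coherent above the first zero.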

(47) On the other hand, in the case of the echo signals, it is generally assumed that the input is a convolution of the room impulse response (RIR) and the target source. Since the echo signals are reflected by obstacles in the recording environment as illustrated in FIG. 5 and reach the microphone with different time delays and attenuated amplitudes through various reflection paths, the echo signals have isotropic characteristics similar to those of the diffusive noise. Thus, a late echo signal can be treated the same as the diffusive noise. FIG. 5 is an exemplary diagram illustrating an input path of an input signal in an echo environment.

(48) First, the auto-correlation values and the cross correlation values between the two microphones required to obtain the coherence-to-diffuseness ratio CDR are recursively calculated as expressed by Mathematical Formula 37 to obtain an average value over time. In this case, $\lambda$ is a constant value between 0 and 1.
$\hat{\Phi}_{x_i x_j}(l,f) = \lambda \hat{\Phi}_{x_i x_j}(l-1,f) + (1-\lambda) X_i(l,f) X_j^*(l,f)$ [Mathematical Formula 37]
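The recursive averaging of Mathematical Formula 37 and the normalization of Mathematical Formula 28 can be sketched as follows. The function names, the smoothing constant of 0.7, the zero initial state, and the small regularization constant are assumptions for illustration; the text only requires the constant to lie between 0 and 1.

```python
import numpy as np

def recursive_psd(X_i, X_j, lam=0.7):
    """Mathematical Formula 37 for STFTs shaped (frames, bins).

    lam = 0.7 and the zero initial state are assumptions of this sketch.
    """
    phi = np.zeros(X_i.shape, dtype=complex)
    prev = np.zeros(X_i.shape[1], dtype=complex)
    for l in range(X_i.shape[0]):
        prev = lam * prev + (1.0 - lam) * X_i[l] * np.conj(X_j[l])
        phi[l] = prev
    return phi

def estimate_coherence(X1, X2, lam=0.7, eps=1e-12):
    """Mathematical Formula 28 applied to the smoothed estimates."""
    phi_12 = recursive_psd(X1, X2, lam)
    phi_11 = recursive_psd(X1, X1, lam).real
    phi_22 = recursive_psd(X2, X2, lam).real
    # eps guards against division by zero in silent bins (an added safeguard).
    return phi_12 / np.sqrt(phi_11 * phi_22 + eps)

# Hypothetical check: channel 2 is channel 1 with a fixed phase offset.
X1 = np.exp(1j * np.arange(40, dtype=float).reshape(10, 4))
X2 = X1 * np.exp(-1j * 0.5)
gamma_x = estimate_coherence(X1, X2)
```

For a purely coherent pair such as this one the estimate has magnitude close to 1 and carries only the inter-channel phase.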

(49) In addition, to determine the coherence-to-diffuseness ratio CDR expressed by Mathematical Formula 33, it is assumed, as expressed by Mathematical Formula 38, that the magnitude of the coherence of the target sound source is 1, and Mathematical Formula 39 is obtained from Mathematical Formula 38. By solving Mathematical Formula 39, the coherence-to-diffuseness ratio CDR for the case where the direction of the target sound source is unknown is obtained as Mathematical Formula 40. In Mathematical Formula 40, the maximum of 0 and the computed value is taken to prevent the coherence-to-diffuseness ratio CDR from taking a negative value.

(50) $\begin{matrix} |\Gamma_s(f)| = \left| \Gamma_x(l,f) - \left( \Gamma_n(f) - \Gamma_x(l,f) \right) CDR(l,f)^{-1} \right| \overset{!}{=} 1 & [Mathematical Formula 38] \\ \left( |\Gamma_x(l,f)|^2 - 1 \right) CDR(l,f)^2 - 2 Re\left\{ \Gamma_x(l,f) \left( \Gamma_n(f) - \Gamma_x(l,f) \right)^* \right\} CDR(l,f) + \left| \Gamma_n(f) - \Gamma_x(l,f) \right|^2 = 0 & [Mathematical Formula 39] \\ \widehat{CDR}_{noDOA}(l,f) = \max\left( 0, \frac{\Gamma_n(f) Re\{\hat{\Gamma}_x(l,f)\} - |\hat{\Gamma}_x(l,f)|^2 - \sqrt{\Gamma_n^2(f) Re\{\hat{\Gamma}_x(l,f)\}^2 - \Gamma_n^2(f) |\hat{\Gamma}_x(l,f)|^2 + \Gamma_n^2(f) - 2 \Gamma_n(f) Re\{\hat{\Gamma}_x(l,f)\} + |\hat{\Gamma}_x(l,f)|^2}}{|\hat{\Gamma}_x(l,f)|^2 - 1} \right) & [Mathematical Formula 40] \end{matrix}$

(51) Finally, the value of diffuseness D is expressed as Mathematical Formula 41 and has a value between 0 and 1, as mentioned above.

(52) $\begin{matrix} D(l,f) = \frac{1}{1 + \widehat{CDR}_{noDOA}(l,f)} & [Mathematical Formula 41] \end{matrix}$
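The direction-independent estimate and the resulting diffuseness can be sketched as below. The expression is the positive root of the quadratic in Mathematical Formula 39 for a real-valued noise coherence, written out term by term. The function names and the magnitude clipping that keeps the denominator away from zero are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def estimate_cdr_nodoa(gamma_x, gamma_n, max_mag=1.0 - 1e-6):
    """CDR without a known direction (Mathematical Formula 40).

    gamma_x: complex mixture coherence; gamma_n: real noise coherence.
    max_mag clips |gamma_x| below 1 (an added numerical safeguard).
    """
    gamma_x = np.asarray(gamma_x, dtype=complex)
    mag = np.abs(gamma_x)
    scale = np.where(mag > max_mag, max_mag / np.maximum(mag, 1e-12), 1.0)
    gamma_x = gamma_x * scale
    re = gamma_x.real
    p = np.abs(gamma_x) ** 2
    disc = (gamma_n**2 * re**2 - gamma_n**2 * p + gamma_n**2
            - 2.0 * gamma_n * re + p)
    cdr = (gamma_n * re - p - np.sqrt(np.maximum(disc, 0.0))) / (p - 1.0)
    return np.maximum(0.0, cdr)   # the max(0, .) of Mathematical Formula 40

def diffuseness(cdr):
    """Mathematical Formula 41: D = 1 / (1 + CDR)."""
    return 1.0 / (1.0 + cdr)

# Hypothetical check: synthesize gamma_x for CDR = 4 and recover it.
gamma_s, gamma_n, cdr_true = np.exp(1j * 0.7), 0.3, 4.0
gamma_x = (cdr_true * gamma_s + gamma_n) / (cdr_true + 1.0)
cdr_hat = estimate_cdr_nodoa(gamma_x, gamma_n)
d_hat = diffuseness(cdr_hat)
```

Because the denominator is negative whenever the mixture coherence has magnitude below 1, the root with the subtracted square root is the non-negative one, which is why that sign appears in the formula.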

(53) Generally, in noise removing and echo removing algorithms, when a noise source is removed by using a mask, the target sound source is distorted, which degrades the voice recognition rate. Therefore, even if the noise removing or echo removing performance is somewhat degraded, the masking is performed under the condition that no distortion occurs in the original signal. In the present invention, however, the coherence-to-diffuseness ratio CDR is used to provide information on the noise and the echo for robust sound source localization rather than to remove the noise or the echo. It can therefore be concluded that removing as much noise as possible, within such a range that the phase difference of the target sound source between the two microphones is preserved, leads to a clearer cross correlation value for the target sound source. Accordingly, the diffuseness D, which takes a continuous value, is binarized by setting a threshold value as in Mathematical Formula 42.

(54) $\begin{matrix} MASK = \left\{ \begin{matrix} 1 : D \le threshold \\ 0 : D > threshold \end{matrix} \right. & [Mathematical Formula 42] \end{matrix}$

(55) In this case, it is preferable that the threshold value is set to a value having the highest accuracy with respect to the target sound source estimation through experiments.
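The thresholding of Mathematical Formula 42 and its application to a time-frequency representation can be sketched as follows. The function names and the threshold of 0.5 are placeholders; the text states that the threshold should be chosen experimentally.

```python
import numpy as np

def binarized_mask(diffuseness, threshold=0.5):
    """Mathematical Formula 42: 1 where D <= threshold, 0 where D > threshold.

    threshold = 0.5 is a placeholder value, to be tuned experimentally.
    """
    return (np.asarray(diffuseness) <= threshold).astype(float)

def apply_mask(X, mask):
    """Zero the time-frequency bins of an STFT where the mask is 0."""
    return X * mask

# Hypothetical diffuseness map with 2 frames and 2 bins.
D = np.array([[0.1, 0.9],
              [0.5, 0.51]])
M = binarized_mask(D, threshold=0.5)
X_masked = apply_mask(np.full((2, 2), 2.0 + 1.0j), M)
```

Bins that survive the mask keep their original magnitude and phase, so the inter-microphone phase difference used for localization is left untouched.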

(56) As described above, sound source localization robust to the echo and the noise can be implemented by applying the binarized mask, generated from the diffuseness measured by using the coherence-to-diffuseness ratio CDR, to the signal input to the microphones and then performing the GCC-PHAT or the SRP-PHAT.
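A minimal sketch of GCC-PHAT on a masked two-channel STFT is given below. It is one possible arrangement under stated assumptions, not the definitive implementation: the function names, the sign convention (a positive result means channel 2 lags channel 1), summing the weighted cross-spectrum over frames, and the 343 m/s speed of sound are all choices made for the example.

```python
import numpy as np

def masked_gcc_phat(X1, X2, mask, n_fft, fs, max_delay_s):
    """Delay of channel 2 relative to channel 1, in seconds.

    X1, X2, mask: arrays shaped (frames, n_fft // 2 + 1), one-sided STFTs.
    """
    cross = X2 * np.conj(X1) * mask                 # masked cross-spectrum
    phat = cross / np.maximum(np.abs(cross), 1e-12)  # PHAT weighting
    cc = np.fft.irfft(phat.sum(axis=0), n=n_fft)     # cross correlation
    max_lag = max(1, int(round(max_delay_s * fs)))
    lags = np.arange(-max_lag, max_lag + 1)
    cc_win = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return lags[np.argmax(cc_win)] / fs

def tdoa_to_angle(tau_s, mic_distance_m, c=343.0):
    """Invert the delay relation of Mathematical Formula 35 (assumed c)."""
    return np.arcsin(np.clip(c * tau_s / mic_distance_m, -1.0, 1.0))

# Hypothetical check: channel 2 delayed by 3 samples, upper bins masked out.
n_fft, fs, delay = 64, 16000, 3
k = np.arange(n_fft // 2 + 1)
X1 = np.ones((1, n_fft // 2 + 1), dtype=complex)
X2 = np.exp(-2j * np.pi * k * delay / n_fft)[None, :]
mask = np.ones((1, n_fft // 2 + 1))
mask[:, 20:] = 0.0
tau = masked_gcc_phat(X1, X2, mask, n_fft, fs, max_delay_s=5 / fs)
```

Masked bins contribute zero to the correlation, so only the time-frequency regions judged to be dominated by the target source shape the peak.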

(57) FIGS. 7A, 7B, and 7C are graphs illustrating an input mixed signal, an estimated CDR, and a binarized mask, respectively, in the sound source localization method according to the embodiment of the present invention. Referring to FIG. 7, it can be seen that the time and frequency regions in which the sound signal exists are detected, with dominant values appearing in the regions where the target signal exists.

(58) FIG. 8 is a graph illustrating the performance of the sound source localization method according to the embodiment of the present invention in comparison with a method in the related art. FIG. 8A is a graph illustrating a frame erroneously detected as a result of the GCC-PHAT according to the method in the related art, and FIG. 8B is a graph illustrating a result of the GCC-PHAT using the masking technique according to the present invention. Referring to FIG. 8, although the direction of the target sound source is erroneously detected in the method in the related art, it can be seen that the direction can be correctly detected in the present invention.

(59) On the other hand, as described above, the SRP-PHAT is equivalent to extending the GCC-PHAT algorithm, which is applied to two microphones, to multiple microphones. Mathematically, the SRP-PHAT is the sum of the GCC-PHAT over all microphone pairs. Therefore, when only two microphones are used, the SRP-PHAT and the GCC-PHAT give the same result. Accordingly, in the localization method according to the present invention, the direction of the target sound source can be estimated by applying the binarized mask generated by using the diffuseness to the input signal and then using either the GCC-PHAT or the SRP-PHAT.
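The statement that the SRP-PHAT is the sum of the GCC-PHAT over all microphone pairs can be illustrated with the sketch below. It assumes a linear array, a far-field source, a shared mask for all channels, and the convention that microphone b lags microphone a by (p_b - p_a) sin(theta) / c; the function names and these conventions are assumptions for the example.

```python
import numpy as np
from itertools import combinations

def gcc_phat_spectrum(Xa, Xb, mask):
    """PHAT-weighted, masked cross-spectrum of one pair, summed over frames."""
    cross = Xb * np.conj(Xa) * mask
    return (cross / np.maximum(np.abs(cross), 1e-12)).sum(axis=0)

def srp_phat(X, mask, mic_pos_m, freqs_hz, thetas_rad, c=343.0):
    """X: (mics, frames, bins); mic_pos_m: positions along a line (assumed)."""
    power = np.zeros(len(thetas_rad))
    for a, b in combinations(range(X.shape[0]), 2):
        g = gcc_phat_spectrum(X[a], X[b], mask)
        # Candidate delay of mic b relative to mic a for every direction.
        tau = (mic_pos_m[b] - mic_pos_m[a]) * np.sin(thetas_rad) / c
        steer = np.exp(2j * np.pi * np.outer(tau, freqs_hz))
        power += np.real(steer @ g)       # GCC-PHAT of this pair at tau
    return thetas_rad[np.argmax(power)], power

# Hypothetical check: three microphones, source at 20 degrees.
c = 343.0
pos = np.array([0.0, 0.05, 0.10])
freqs = np.linspace(0.0, 8000.0, 129)
theta_true = np.deg2rad(20.0)
X = np.exp(-2j * np.pi * freqs[None, None, :] * pos[:, None, None]
           * np.sin(theta_true) / c)
grid = np.deg2rad(np.arange(-90, 91))
theta_hat, power = srp_phat(X, np.ones((1, 129)), pos, freqs, grid)
```

With two microphones the loop runs over a single pair, so the steered power reduces to the GCC-PHAT of that pair evaluated at the candidate delays.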

(60) While the present invention has been particularly illustrated and described with reference to exemplary embodiments thereof, it should be understood by the skilled in the art that the invention is not limited to the disclosed embodiments, but various modifications and applications not illustrated in the above description can be made without departing from the spirit of the invention. In addition, differences relating to the modifications and applications should be construed as being included within the scope of the invention as set forth in the appended claims.

CPC classification

G01S3/8006 (PHYSICS)
G10L2021/02166 (PHYSICS)
H04R3/005 (ELECTRICITY)
H04R5/04 (ELECTRICITY)
G10L2021/02082 (PHYSICS)
H04R5/027 (ELECTRICITY)
G01S3/8083 (PHYSICS)
G10L21/0232 (PHYSICS)
H04R3/04 (ELECTRICITY)
H04S7/303 (ELECTRICITY)
G10L21/0208 (PHYSICS)

International classification

G10L21/0232 (PHYSICS)
H04R3/04 (ELECTRICITY)
H04R3/00 (ELECTRICITY)
H04R5/027 (ELECTRICITY)
H04S7/00 (ELECTRICITY)
G01S3/808 (PHYSICS)
G01S3/80 (PHYSICS)
H04R5/04 (ELECTRICITY)
