SOUND SOURCE DIRECTION ESTIMATION DEVICE AND METHOD, AND PROGRAM
20200333423 · 2020-10-22
Inventors
- Kazuki OCHIAI (Kanagawa, JP)
- Shusuke Takahashi (Chiba, JP)
- Akira Takahashi (Saitama, JP)
- Kazuya Tateishi (Tokyo, JP)
CPC classification
- H04R2430/21 (ELECTRICITY)
- G01S3/8006 (PHYSICS)
- H04R2430/25 (ELECTRICITY)
Abstract
The present technology relates to a sound source direction estimation device and method, and a program that can reduce an operation amount for estimating a direction of a target sound source. A first estimation unit estimates a first horizontal angle that is a horizontal angle of a sound source direction from an input acoustic signal. A second estimation unit estimates a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle. The present technology can be applied, in a case where a voice is uttered from a surrounding sound source (for example, a person), to a device having a function of estimating the direction in which the voice is uttered.
Claims
1. A sound source direction estimation device comprising: a first estimation unit configured to estimate a first horizontal angle that is a horizontal angle of a sound source direction from an input acoustic signal; and a second estimation unit configured to estimate a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
2. The sound source direction estimation device according to claim 1, further comprising an input unit configured to input the acoustic signal from a microphone array including a plurality of microphones.
3. The sound source direction estimation device according to claim 2, wherein in the microphone array, the plurality of microphones is arranged three-dimensionally.
4. The sound source direction estimation device according to claim 3, wherein the first estimation unit performs an operation on a first spatial spectrum, and estimates the first horizontal angle on a basis of the first spatial spectrum.
5. The sound source direction estimation device according to claim 4, wherein the first estimation unit includes a first processing unit that performs an operation on the first spatial spectrum by a MUSIC method.
6. The sound source direction estimation device according to claim 5, wherein the second estimation unit includes a second processing unit that performs an operation on a second spatial spectrum by the MUSIC method.
7. The sound source direction estimation device according to claim 5, wherein the first estimation unit further includes a horizontal angle estimation unit configured to estimate the first horizontal angle on a basis of the first spatial spectrum on which the first processing unit performs an operation.
8. The sound source direction estimation device according to claim 6, wherein the second processing unit performs an operation on the second spatial spectrum by the MUSIC method in a range of the entire elevation angle in a predetermined range of the horizontal angle near the first horizontal angle.
9. The sound source direction estimation device according to claim 5, wherein the first processing unit includes a first correlation matrix calculation unit that calculates a correlation matrix of a target sound signal of respective frequencies for every time frame of the acoustic signal for performing an operation on the first spatial spectrum.
10. The sound source direction estimation device according to claim 9, wherein the first processing unit further includes a second correlation matrix calculation unit that calculates a correlation matrix of a noise signal of respective frequencies for every time frame of the acoustic signal for performing an operation on the first spatial spectrum.
11. The sound source direction estimation device according to claim 6, wherein the second estimation unit further includes a detection unit that detects the sound source direction from a peak of the second spatial spectrum.
12. The sound source direction estimation device according to claim 11, further comprising a presentation unit configured to present the sound source direction detected by the detection unit.
13. The sound source direction estimation device according to claim 12, wherein the presentation unit changes a presentation state according to the estimated elevation angle.
14. The sound source direction estimation device according to claim 5, wherein the first processing unit thins out the direction in which the first spatial spectrum is calculated, and performs an operation on the first spatial spectrum in the thinned out direction by interpolation.
15. The sound source direction estimation device according to claim 11, wherein the second estimation unit repeats processing of computing a range in which the second spatial spectrum is computed in a range limited in both the horizontal angle and the elevation angle, and detecting the peak of the computed second spatial spectrum until both the horizontal angle and the elevation angle no longer change.
16. The sound source direction estimation device according to claim 3, wherein the second estimation unit includes an SRP processing unit that processes a pair signal of one channel of the microphones arranged three-dimensionally and another one channel of the microphones.
17. The sound source direction estimation device according to claim 16, wherein the SRP processing unit calculates a cross-correlation of a plurality of the pair signals, and in the predetermined range near the first horizontal angle, the SRP processing unit estimates the second horizontal angle and the elevation angle from a peak of the cross-correlation.
18. The sound source direction estimation device according to claim 16, wherein the first estimation unit does not estimate the first horizontal angle, and the SRP processing unit estimates the second horizontal angle and the elevation angle from a peak of the cross-correlation.
19. A method of estimating a sound source direction of a sound source direction estimation device, the method comprising: a first step of estimating a first horizontal angle that is a horizontal angle of the sound source direction from an input acoustic signal; and a second step of estimating a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
20. A program for causing a computer to execute sound source direction estimation processing comprising: a first step of estimating a first horizontal angle that is a horizontal angle of the sound source direction from an input acoustic signal; and a second step of estimating a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
Description
BRIEF DESCRIPTION OF DRAWINGS
MODE FOR CARRYING OUT THE INVENTION
[0061] Embodiments for carrying out the present technology will be described below. Note that the description will be made in the following order.
[0062] 1. First embodiment
[0063] 2. Second embodiment
[0064] 3. Third embodiment
[0065] 4. Fourth embodiment
[0066] 5. Fifth embodiment
[0067] 6. Sixth embodiment
[0068] 7. Seventh embodiment
[0069] 8. Experimental results
[0070] 9. Computer
[0071] 10. Other
First Embodiment
[0073] First, the sound source direction estimation device 1 according to the first embodiment will be described.
[0075] The sound source direction estimation device 1 is installed in, for example, a smart speaker, a voice agent, a robot, and the like, and has a function of, in a case where a voice is uttered from a surrounding sound source (for example, a person), estimating a direction in which the voice is uttered. The estimated direction is used to present the sound source direction, for example, by causing the LED 13a in the corresponding direction to emit light. Hereinafter, an electric configuration of the sound source direction estimation device 1 will be described.
[0077] The sound source direction estimation device 100 includes an input unit 111, a first estimation unit 112, and a second estimation unit 113.
[0078] The input unit 111 corresponds to the microphone array 12 described above.
[0079] The microphones 12a may be arranged on a plane, or may be arranged three-dimensionally.
[0081] In this case, when the time at which a sound arriving from the direction (θ, φ) reaches the origin is 0 and the time at which the sound reaches the m-th microphone at the coordinates (X.sub.m, Y.sub.m, Z.sub.m) is t.sub.m, the time t.sub.m can be determined by the following equation (1). Note that in equation (1), c represents the speed of sound.
[0082] Therefore, an arrival time difference between the m-th microphone and the n-th microphone is expressed by the following equation (2).
[0083] Direction estimation is performed on the basis of the time difference t.sub.m,n expressed by equation (2). Therefore, if the sound source direction is estimated by detecting only the horizontal angle θ without detecting the elevation angle φ, an error will occur in a case where the elevation angle is not 0. Therefore, in the present technology, not only the horizontal angle θ but also the elevation angle φ is detected.
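As a concrete illustration of this arrival-time model, the sketch below implements a standard far-field version of equations (1) and (2) in Python. The sign convention, the 10 cm two-microphone layout, and the 45-degree example elevation are assumptions of this sketch, not values taken from the document.

```python
import math

def arrival_time(mic, theta, phi, c=343.0):
    """Far-field model: a plane wave from direction (theta, phi) reaches the
    array origin at time 0. The sign convention is an assumption of this sketch."""
    x, y, z = mic
    ux = math.cos(phi) * math.cos(theta)   # unit vector toward the source
    uy = math.cos(phi) * math.sin(theta)
    uz = math.sin(phi)
    return -(x * ux + y * uy + z * uz) / c

def tdoa(mic_m, mic_n, theta, phi, c=343.0):
    """Arrival time difference t_mn between microphones m and n (equation (2)-style)."""
    return arrival_time(mic_m, theta, phi, c) - arrival_time(mic_n, theta, phi, c)

m1, m2 = (0.05, 0.0, 0.0), (-0.05, 0.0, 0.0)   # two mics 10 cm apart on the x-axis
t_flat = tdoa(m1, m2, 0.0, 0.0)                # source in the horizontal plane
t_up = tdoa(m1, m2, 0.0, math.pi / 4)          # same azimuth, 45-degree elevation
```

Because `abs(t_up) < abs(t_flat)`, a horizontal-only estimator that ignores elevation misreads the shrunken time difference, which is exactly the error the text describes.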
[0084] An operation of the sound source direction estimation device 100, which estimates the sound source direction with the first estimation unit 112 and the second estimation unit 113, will be described below.
[0085] In step S11, the input unit 111 inputs an acoustic signal. That is, the plurality of microphones 12a constituting the microphone array 12 collects a sound from a sound source in a predetermined direction and outputs a corresponding acoustic signal.
[0086] In step S12, the first estimation unit 112 estimates a first horizontal angle while fixing the elevation angle. That is, the elevation angle φ is fixed at a predetermined angle (for example, 0 degrees). Then, a predetermined horizontal angle among the horizontal angles in the 360-degree direction in the horizontal plane is estimated as the first horizontal angle {circumflex over (θ)} representing the sound source direction.
[0087] In step S13, the second estimation unit 113 estimates a second horizontal angle and the elevation angle with respect to the first horizontal angle {circumflex over (θ)}. That is, with respect to the first horizontal angle {circumflex over (θ)} estimated in the processing of step S12, the horizontal angle and the elevation angle are estimated only in a predetermined range ({circumflex over (θ)}±s) near the first horizontal angle {circumflex over (θ)}. The first horizontal angle {circumflex over (θ)}, which is estimated in a state where the elevation angle is fixed at a predetermined value (that is, in a state where it is assumed that the sound source exists at an elevation angle different from the actual elevation angle), is not always accurate and contains an error. Therefore, in this step, together with the actual elevation angle of the sound source, the second horizontal angle θ.sup.out is estimated as a more accurate horizontal angle of the sound source.
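The saving from this coarse-to-fine search can be made concrete with a simple grid-point count. The 1-degree grid, the 0-to-90-degree elevation range, and the half-width s = 15 degrees below are illustrative assumptions, not values from the document.

```python
# Full 2D scan: every horizontal angle times every elevation angle (1-degree grid)
full_scan = 360 * 91                 # elevations 0 to 90 degrees inclusive

# Two-stage scan: a 1D horizontal scan at a fixed elevation, then a 2D scan
# only within +/- s degrees of the first horizontal-angle estimate
s = 15                               # half-width of the local range (assumed)
two_stage = 360 + (2 * s + 1) * 91

reduction = full_scan / two_stage    # roughly a tenfold reduction in this setup
```

Under these assumptions the two-stage search evaluates about one tenth as many grid points, which is the operation-amount reduction the embodiments aim at.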
Second Embodiment
[0090] Next, the second embodiment will be described.
[0091] The sound source direction estimation device 200 includes an acoustic signal input unit 211, a frequency conversion unit 212, a first MUSIC processing unit 213, a horizontal angle estimation unit 214, a second MUSIC processing unit 215, and a peak detection unit 216. In this embodiment, a multiple signal classification (MUSIC) method is used for estimation processing.
[0092] The acoustic signal input unit 211 and the frequency conversion unit 212 correspond to the input unit 111 of the first embodiment.
[0093] The acoustic signal input unit 211 corresponds to the microphone array 12 described in the first embodiment.
[0094] The frequency conversion unit 212 performs frequency conversion on the acoustic signal input from the acoustic signal input unit 211. On the basis of a frequency domain signal input from the frequency conversion unit 212, the first MUSIC processing unit 213 determines an eigenvalue and an eigenvector of a correlation matrix of the signal of respective frequencies. Moreover, the first MUSIC processing unit 213 performs an operation on a spatial spectrum at the entire horizontal angle in a state where the elevation angle with respect to the sound source direction viewed from the sound source direction estimation device 200 is fixed at a predetermined constant value.
[0095] The horizontal angle estimation unit 214 calculates a threshold from the spatial spectrum on which an operation is performed by the first MUSIC processing unit 213, detects the spatial spectrum having a peak value exceeding the threshold, and estimates and detects the direction corresponding to the spatial spectrum as the sound source direction (first horizontal angle {circumflex over (θ)}).
[0096] With respect to the first horizontal angle {circumflex over (θ)} estimated by the horizontal angle estimation unit 214, the second MUSIC processing unit 215 computes the spatial spectrum of the horizontal angle in a limited predetermined range near the first horizontal angle {circumflex over (θ)} and the entire elevation angle on the basis of the eigenvector of the correlation matrix of the signal of respective frequencies determined by the first MUSIC processing unit 213.
[0097] The peak detection unit 216 detects the peak value of the spatial spectrum for the horizontal angle and the elevation angle within the predetermined range computed by the second MUSIC processing unit 215, and estimates the direction corresponding to the peak value as the final sound source direction (θ.sup.out, φ.sup.out).
[0098] An operation of the sound source direction estimation device 200 will be described below.
[0099] In step S51, the acoustic signal input unit 211 inputs an acoustic signal. That is, for example, the plurality of microphones 12a constituting the microphone array 12 collects a sound from a sound source in a predetermined direction and outputs a corresponding acoustic signal.
[0100] In step S52, the frequency conversion unit 212 performs frequency conversion on the acoustic signal input from the acoustic signal input unit 211. That is, the acoustic signal is converted from a signal of the time domain to a signal of the frequency domain. For example, discrete Fourier transform (DFT) or short time Fourier transform (STFT) processing is performed for every frame. For example, a frame length can be 32 ms and a frame shift can be 10 ms.
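As a concrete illustration of this framing, the sketch below segments one second of a dummy single-microphone signal into 32 ms frames with a 10 ms shift and applies a windowed DFT to each frame. The 16 kHz sampling rate, the Hann window, and the random input are assumptions of this sketch, not values stated in the document.

```python
import numpy as np

fs = 16000                      # sampling rate (assumed; not specified in the text)
frame_len = int(0.032 * fs)     # 32 ms frame -> 512 samples at 16 kHz
hop = int(0.010 * fs)           # 10 ms shift -> 160 samples

x = np.random.randn(fs)         # one second of dummy input from one microphone
window = np.hanning(frame_len)  # taper each frame before the DFT
frames = [x[i:i + frame_len] * window
          for i in range(0, len(x) - frame_len + 1, hop)]
spec = np.stack([np.fft.rfft(f) for f in frames])   # shape: (frames, frequency bins)
```

Each row of `spec` is one time frame of the frequency-domain signal that the subsequent MUSIC processing consumes; `rfft` keeps only the `frame_len//2 + 1` non-redundant bins of the real input.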
[0101] In step S53, the first MUSIC processing unit 213 performs first MUSIC processing. Specifically, the frequency domain signal is input from the frequency conversion unit 212, and processing is performed by the MUSIC method for the entire horizontal angle with the elevation angle fixed at a certain value. An operation is performed on the eigenvalue and the eigenvector of the correlation matrix of the signal, and the spatial spectrum is calculated. Weighted averaging is performed on the spatial spectrum between frequencies.
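The first MUSIC processing described above can be sketched as follows: form a correlation matrix from frequency-domain snapshots, take its eigendecomposition, and scan a steering vector over all horizontal angles with the elevation fixed at 0. The circular 4-microphone geometry, the single narrowband frequency, and the simulated signal below are assumptions of this sketch, and it uses a plain (not generalized) eigenvalue decomposition for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
c, f = 343.0, 1000.0                      # speed of sound; narrowband frequency (assumed)
M, N = 4, 1                               # number of microphones, number of sources
mics = np.array([[0.05 * np.cos(a), 0.05 * np.sin(a), 0.0]
                 for a in 2 * np.pi * np.arange(M) / M])   # 5 cm circular array

def steer(theta, phi=0.0):
    """Steering vector for direction (theta, phi) under the far-field model."""
    u = np.array([np.cos(phi) * np.cos(theta), np.cos(phi) * np.sin(theta), np.sin(phi)])
    tau = -(mics @ u) / c                 # per-microphone arrival delays
    return np.exp(-2j * np.pi * f * tau)

theta_src = np.deg2rad(60.0)              # true source azimuth
T = 200                                   # number of snapshots
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
noise = 0.01 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
Z = np.outer(steer(theta_src), s) + noise

R = (Z @ Z.conj().T) / T                  # correlation matrix over the time frame
w, E = np.linalg.eigh(R)                  # eigenvalues in ascending order
En = E[:, :M - N]                         # noise subspace: the M-N smallest eigenvalues

grid = np.deg2rad(np.arange(0.0, 360.0, 1.0))
P = np.array([1.0 / np.real(a.conj() @ En @ En.conj().T @ a)
              for a in (steer(t) for t in grid)])   # MUSIC spatial spectrum
theta_hat = np.rad2deg(grid[np.argmax(P)])          # first horizontal angle estimate
```

The spectrum peaks sharply where the steering vector is orthogonal to the noise subspace, which happens at the true azimuth; this peak plays the role of the first horizontal angle passed to the second stage.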
[0102] In step S54, the horizontal angle estimation unit 214 performs horizontal angle estimation processing. Specifically, the threshold is calculated from the spatial spectrum determined by the first MUSIC processing unit 213, and the direction having the peak exceeding the threshold is set as the estimated horizontal angle (first horizontal angle {circumflex over (θ)}).
[0103] In step S55, the second MUSIC processing unit 215 performs second MUSIC processing. Specifically, the eigenvector determined by the first MUSIC processing unit 213 and the horizontal angle (first horizontal angle {circumflex over (θ)}) estimated by the horizontal angle estimation unit 214 are input. Then, the spatial spectrum is calculated by the MUSIC method for the horizontal angle in the range limited to {circumflex over (θ)}±s and the entire elevation angle. That is, the horizontal angle and the elevation angle are estimated in the limited range ({circumflex over (θ)}±s) near the primarily estimated first horizontal angle {circumflex over (θ)}. Weighted averaging is performed on the spatial spectrum between frequencies.
[0104] In step S56, the peak detection unit 216 detects the peak value. Specifically, the spatial spectrum having the maximum value (peak) is detected from among the spatial spectra subjected to weighted averaging output from the second MUSIC processing unit 215. Then, the horizontal angle (second horizontal angle θ.sup.out) and the elevation angle φ.sup.out corresponding to the spatial spectrum are output as the sound source direction (θ.sup.out, φ.sup.out).
[0105] In the second embodiment, since the operation by the MUSIC method is performed, the sound source direction can be accurately determined. Furthermore, in a similar manner to the first embodiment, the range in which the elevation angle is estimated is not the range of the entire horizontal angle of 360 degrees, but the limited range ({circumflex over (θ)}±s) near the primarily estimated first horizontal angle {circumflex over (θ)}. Therefore, the operation amount can be reduced. As a result, even a device whose operation resource is not high (operation capability is not high) can perform the operation in real time.
Third Embodiment
[0107] Next, the third embodiment will be described.
[0108] The sound source direction estimation device 300 of the third embodiment has a configuration in which a sound source direction presentation unit 311 is added to the sound source direction estimation device 200 of the second embodiment.
[0109] The first MUSIC processing unit 213 of the third embodiment includes a first correlation matrix calculation unit 411, a second correlation matrix calculation unit 417, an eigenvalue decomposition unit 412, a frequency weight computation unit 413, a transfer function storage unit 414, a first spatial spectrum computation unit 415, and a frequency information integration unit 416.
[0110] The first correlation matrix calculation unit 411 calculates a correlation matrix of a target signal of respective frequencies for every time frame. The second correlation matrix calculation unit 417 calculates a correlation matrix of a noise signal of respective frequencies for every time frame. The eigenvalue decomposition unit 412 performs an operation on an eigenvalue and an eigenvector of the correlation matrix. The frequency weight computation unit 413 computes a frequency weight representing the degree of contribution of a spatial spectrum for each frequency. In a case where a sound arrives from a certain direction, an imbalance is created in the distribution of the eigenvalues, and only as many eigenvalues as there are sound sources become large.
[0111] The transfer function storage unit 414 stores a transfer function vector in advance. The first spatial spectrum computation unit 415 uses the eigenvector and the transfer function vector relating to the horizontal angle θ to compute a spatial spectrum indicating the degree of sound arrival from the direction of the horizontal angle θ. The frequency information integration unit 416 integrates the first spatial spectrum on the basis of the frequency weight.
[0112] The horizontal angle estimation unit 214 includes a threshold updating unit 451 and a first peak detection unit 452. The threshold updating unit 451 calculates a threshold for determining whether or not to employ a peak of the spatial spectrum as a detection result. The first peak detection unit 452 detects the direction of the spatial spectrum having a peak exceeding the threshold.
[0113] The second MUSIC processing unit 215 includes a transfer function storage unit 481, a second spatial spectrum computation unit 482, and a frequency information integration unit 483. The transfer function storage unit 481 stores the transfer function vector in advance. The second spatial spectrum computation unit 482 computes the spatial spectrum indicating the degree of sound arrival from the direction of the predetermined horizontal angle and the elevation angle. The frequency information integration unit 483 computes the weighted average of the spatial spectrum for each frequency.
[0114] The sound source direction presentation unit 311 presents the estimated sound source direction to a user.
[0115] Next, an operation of the sound source direction estimation device 300 will be described.
[0116] In step S101, the acoustic signal input unit 211 inputs an acoustic signal collected by the microphone array 12. In step S102, the frequency conversion unit 212 performs frequency conversion on the acoustic signal input from the acoustic signal input unit 211. Processing in steps S101 and S102 is similar to processing in steps S51 and S52 of the second embodiment.
[0117] In step S103, the first MUSIC processing unit 213 performs first MUSIC processing. Details of the first MUSIC processing are described below.
[0118] In step S131, the first correlation matrix calculation unit 411 calculates a first correlation matrix. The first correlation matrix is a correlation matrix of a target sound signal of respective frequencies for every time frame, and is calculated on the basis of equation (3).
[0119] In step S132, the second correlation matrix calculation unit 417 calculates a second correlation matrix. The second correlation matrix is a correlation matrix of a noise signal of respective frequencies for every time frame, and is calculated on the basis of the following equation (4).
[0120] In equation (4), T.sub.K represents a frame length for calculating the correlation matrix, and τ is used such that a signal of a time frame common to R.sub.ω,t of equation (3) and K.sub.ω,t of equation (4) is not used. α.sub.ω,t is a weight and may be generally 1, but in a case where it is desired to change the weight depending on the type of sound source, it is possible to prevent all the weights from becoming zero as in equation (5).
[Equation 5]
K.sub.ω,t=(1-α.sub.ω,t)K.sub.ω,t-1+α.sub.ω,t z.sub.ω,t-τ z.sub.ω,t-τ.sup.H  (5)
[0121] According to equation (5), the second correlation matrix calculation unit 417 sequentially updates the weighted second spatial correlation matrix, which is subjected to generalized eigenvalue decomposition by the eigenvalue decomposition unit 412 in the subsequent stage, on the basis of the past weighted second spatial correlation matrix. Such an updating equation makes it possible to use a stationary noise component over a long time. Moreover, in a case where the weight is a continuous value from 0 to 1, the further in the past a component was observed, the more times it has been multiplied by the weight and the smaller its contribution becomes; thus, a larger weight is applied to a stationary noise component generated at a later time. Therefore, the second spatial correlation matrix can be calculated with the larger weight applied to the stationary noise component at the most recent time, which is considered to be close to the stationary noise component behind the target sound.
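A minimal sketch of this equation (5)-style recursive update is shown below; the snapshot values, the matrix size, and the weight α = 0.1 are illustrative assumptions, not values from the document.

```python
import numpy as np

def update_noise_corr(K_prev, z, alpha):
    """Equation (5)-style exponential update of the noise correlation matrix.
    The weight of older snapshots decays by (1 - alpha) at each step, so the
    most recent stationary noise dominates the estimate."""
    return (1.0 - alpha) * K_prev + alpha * np.outer(z, z.conj())

M = 3
z = np.array([1.0, 2.0, 3.0]) + 1j * np.array([0.5, 0.0, -0.5])  # fixed noise snapshot
K = np.zeros((M, M), dtype=complex)
for _ in range(200):                      # feed the same snapshot repeatedly
    K = update_noise_corr(K, z, alpha=0.1)
```

After many updates with a stationary input, `K` converges to the outer product `z z^H`, i.e., the true correlation matrix of that noise, while early snapshots have been down-weighted to a negligible contribution.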
[0122] In step S133, the eigenvalue decomposition unit 412 performs eigenvalue decomposition. That is, the eigenvalue decomposition unit 412 performs generalized eigenvalue decomposition on the basis of the weighted second spatial correlation matrix K.sub.ω,t supplied from the second correlation matrix calculation unit 417 and the first spatial correlation matrix R.sub.ω,t supplied from the first correlation matrix calculation unit 411. Then, the eigenvalue and the eigenvector are calculated from the following equation (6).
[Equation 6]
R.sub.ω,t e.sub.ω,t,i=λ.sub.ω,t,i K.sub.ω,t e.sub.ω,t,i  (6)
(i=1, . . . , M)
[0123] In equation (6), λ.sub.i represents the i-th largest eigenvalue determined by generalized eigenvalue decomposition, e.sub.i represents an eigenvector corresponding to λ.sub.i, and M represents the number of microphones 12a.
[0124] In a case where standard eigenvalue decomposition (SEVD) is used, K.sub.ω,t is set to the identity matrix as in equation (7).
[Equation 7]
K.sub.ω,t=I  (7)
[0125] In a case where generalized eigenvalue decomposition (GEVD) is used, equation (6) is transformed as expressed by equations (9) and (10) by using a matrix Φ.sub.ω,t satisfying the following equation (8). This reduces the problem to SEVD, and the eigenvalue and the eigenvector are determined from equations (9) and (10).
[Equation 8]
Φ.sub.ω,t.sup.H Φ.sub.ω,t=K.sub.ω,t  (8)
(Φ.sub.ω,t.sup.-H R.sub.ω,t Φ.sub.ω,t.sup.-1)f.sub.ω,t,i=λ.sub.ω,t,i f.sub.ω,t,i  (9)
f.sub.ω,t,i=Φ.sub.ω,t e.sub.ω,t,i  (10)
[0126] Φ.sub.ω,t.sup.-H in equation (9) is a whitening matrix. The part in the parentheses on the left side of equation (9) is obtained by whitening R.sub.ω,t by the stationary noise component, that is, obtained by removing the stationary noise component.
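The whitening route of equations (8) to (10) can be sketched with a Cholesky factorization, which yields one matrix Φ satisfying Φ^H Φ = K. The random Hermitian matrices below merely stand in for R.sub.ω,t and K.sub.ω,t; they are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + M * np.eye(M)        # stand-in target correlation matrix (Hermitian PD)
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
K = B @ B.conj().T + M * np.eye(M)        # stand-in noise correlation matrix (Hermitian PD)

# Equation (8): K = L L^H via Cholesky, so Phi = L^H satisfies Phi^H Phi = K
L = np.linalg.cholesky(K)
Phi = L.conj().T

# Equation (9): whiten R by the noise and solve a standard eigenvalue problem
W = np.linalg.inv(Phi.conj().T) @ R @ np.linalg.inv(Phi)
lam, F = np.linalg.eigh(W)                # SEVD of the whitened matrix

# Equation (10): map the eigenvectors back, f = Phi e  =>  e = Phi^{-1} f
E = np.linalg.inv(Phi) @ F

# Verify the generalized eigenvalue relation of equation (6): R e = lam K e
ok = all(np.allclose(R @ E[:, i], lam[i] * (K @ E[:, i]), atol=1e-6) for i in range(M))
```

The check at the end confirms that the SEVD of the whitened matrix reproduces exactly the generalized eigenpairs of equation (6), which is the point of the transformation.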
[0127] In step S134, the first spatial spectrum computation unit 415 computes the first spatial spectrum P.sup.n.sub.ω,θ,t on the basis of the following equations (11) and (12). That is, the first spatial spectrum computation unit 415 computes the spatial spectrum P.sup.n.sub.ω,θ,t representing the degree of sound arrival from the direction θ by using the eigenvectors e.sub.i corresponding to the M-N smallest eigenvalues and a steering vector a.sub.θ. The eigenvectors e.sub.i are supplied from the eigenvalue decomposition unit 412. The steering vector a.sub.θ, which is a transfer function regarding the direction θ, is a transfer function obtained in advance assuming that there is a sound source in the direction θ, and is stored in advance in the transfer function storage unit 414.
[0128] N represents the number of sound sources, and θ represents the horizontal direction for calculating the spatial spectrum while the elevation angle is fixed.
[0129] In step S135, the frequency weight computation unit 413 computes a frequency weight representing the degree of contribution of the spatial spectrum for each frequency. In a case where a sound is arriving from a certain direction, an imbalance is created in the distribution of the eigenvalues, and only as many eigenvalues as there are sound sources become large. For example, the frequency weight w.sub.ω,t is calculated by the following equation (13). λ.sub.i is the i-th largest eigenvalue obtained by generalized eigenvalue decomposition, and the eigenvalue in the numerator of equation (13) means the largest eigenvalue.
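Equation (13) itself is not reproduced in this text, so the sketch below assumes one common form of such a weight: the share of the largest eigenvalue in the eigenvalue sum, which grows when a directional sound dominates that frequency bin.

```python
import numpy as np

def frequency_weight(eigenvalues):
    """Assumed equation (13)-style weight: ratio of the largest eigenvalue to the
    eigenvalue sum. A directional sound concentrates energy in one eigenvalue,
    so the weight rises toward 1; diffuse noise spreads it, so the weight falls."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return lam[0] / lam.sum()

w_directional = frequency_weight([10.0, 0.1, 0.1, 0.1])  # one dominant eigenvalue
w_diffuse = frequency_weight([1.0, 1.0, 1.0, 1.0])       # flat eigenvalue spread
```

Frequencies whose snapshots look directional thus contribute more to the integrated spectrum than frequencies dominated by diffuse noise.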
[0130] In step S136, the frequency information integration unit 416 computes the weighted average {circumflex over (P)}.sup.n.sub.θ,t of the first spatial spectrum for each frequency by the following equations (14) and (15). The first spatial spectrum P.sup.n.sub.ω,θ,t is supplied from the first spatial spectrum computation unit 415, and the frequency weight w.sub.ω,t is supplied from the frequency weight computation unit 413.
[0131] Note that the second term in equation (15) is the minimum value of log P.sup.n.sub.θ,t when θ is changed over the entire range of the horizontal direction in which the spatial spectrum is calculated with the elevation angle fixed.
[0132] Although the harmonic mean is determined in the operation of equation (14), the arithmetic mean or the geometric mean may be determined. By the operation of equation (15), the minimum value is normalized to 0. The log base in this operation is arbitrary, but for example, Napier's constant can be used. The operation by equation (15) produces an effect of suppressing the peak irrelevant to the sound source to the threshold or less in the first peak detection unit 452 in the subsequent stage.
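A sketch of this integration step, assuming a weighted harmonic mean over frequencies for equation (14) and a minimum subtraction in the log domain for equation (15); the small two-frequency, three-direction spectrum is purely illustrative.

```python
import numpy as np

def integrate_spectra(P, w):
    """Weighted harmonic mean over frequency (equation (14)-style), then log
    and subtraction of the minimum over directions so the floor is 0
    (equation (15)-style). P: (freqs, angles) spectra, w: (freqs,) weights."""
    w = np.asarray(w) / np.sum(w)
    harm = 1.0 / np.sum(w[:, None] / np.asarray(P), axis=0)  # weighted harmonic mean
    logP = np.log(harm)                                      # log base e (Napier's constant)
    return logP - logP.min()                                 # minimum normalized to 0

P = np.array([[1.0, 4.0, 2.0],     # spatial spectrum at frequency 1 (illustrative)
              [2.0, 8.0, 2.0]])    # spatial spectrum at frequency 2 (illustrative)
w = [0.5, 0.5]                     # equal frequency weights (illustrative)
spec = integrate_spectra(P, w)
```

The normalization pins the smallest direction to 0 while preserving which direction peaks, which helps the subsequent thresholding suppress peaks unrelated to a sound source.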
[0133] As described above, the weighted average {circumflex over (P)}.sup.n.sub.θ,t of the first spatial spectrum is calculated by the first MUSIC processing by the first MUSIC processing unit 213.
[0134] Returning to the overall flow, in step S104, the horizontal angle estimation unit 214 performs horizontal angle estimation processing. Details of the horizontal angle estimation processing are described below.
[0135] In step S161, the threshold updating unit 451 calculates the threshold. That is, for the weighted average {circumflex over (P)}.sup.n.sub.θ,t of the first spatial spectrum output from the frequency information integration unit 416 of the first MUSIC processing unit 213, a threshold P.sup.th.sub.t for determining whether or not to perform peak detection is calculated by, for example, the following equations (16) and (17). α.sup.th, β.sup.th, and γ.sup.th are constants, and Θ represents the number of scanning directions.
[0136] This threshold P.sup.th.sub.t produces an effect of removing a peak that is small in value and does not correspond to a sound source in that direction, or removing a sound that continues to ring from a certain direction. The target voice is often a short command or utterance for manipulating a device, and is assumed not to last for a long time.
[0137] Next, in step S162, the first peak detection unit 452 detects a first peak. That is, out of the weighted average {circumflex over (P)}.sup.n.sub.θ,t of the first spatial spectrum output from the frequency information integration unit 416, those having a peak exceeding the threshold P.sup.th.sub.t output from the threshold updating unit 451 are detected. Then, the horizontal angle {circumflex over (θ)} corresponding to the weighted average {circumflex over (P)}.sup.n.sub.θ,t of the first spatial spectrum having the detected peak is output as the sound source direction (first horizontal angle) when the elevation angle is fixed.
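The peak-picking part of this step can be sketched as below. The fixed threshold value and the small spectrum are illustrative assumptions; in the text the threshold would come from the equations (16) and (17) update.

```python
import numpy as np

def detect_peaks(spec, threshold):
    """Return indices of local maxima of the spectrum that exceed the threshold.
    Neighbors wrap around because the horizontal angle is circular (0..360 deg)."""
    idx = []
    n = len(spec)
    for i in range(n):
        left, right = spec[(i - 1) % n], spec[(i + 1) % n]
        if spec[i] > threshold and spec[i] >= left and spec[i] >= right:
            idx.append(i)
    return idx

# Illustrative averaged spectrum over 8 scanning directions
spec = np.array([0.0, 1.0, 5.0, 1.0, 0.0, 3.0, 0.2, 0.1])
peaks = detect_peaks(spec, threshold=2.0)   # indices of directions kept as sources
```

Only the two bumps that rise above the threshold survive, so small spurious ripples in the spectrum never become candidate sound source directions.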
[0138] As described above, the first horizontal angle {circumflex over (θ)}, which is the sound source direction when the elevation angle is fixed, is estimated by the horizontal angle estimation processing by the horizontal angle estimation unit 214 in step S104.
[0139] Next to the horizontal angle estimation processing in step S104, the second MUSIC processing unit 215 performs second MUSIC processing in step S105. Details of the second MUSIC processing are described below.
[0140] In step S181, the second spatial spectrum computation unit 482 computes a second spatial spectrum. That is, the second spatial spectrum is computed by using the eigenvectors e.sub.i corresponding to the M-N smallest eigenvalues λ.sub.i out of the eigenvectors obtained by the eigenvalue decomposition unit 412, and the steering vector a.sub.θ,φ, which is the transfer function for the direction (θ, φ). The computation of the second spatial spectrum P.sup.n.sub.ω,θ,φ,t is performed, for example, by the following equation (18).
[0141] θ is, with respect to the estimated direction {circumflex over (θ)} of the sound source when the elevation angle is fixed, limited to a predetermined range ({circumflex over (θ)}±s) near the estimated direction {circumflex over (θ)}. That is, {circumflex over (θ)}-s<θ<{circumflex over (θ)}+s. That is, the range for estimating the elevation angle is not the range of the entire horizontal angle of 360 degrees, but the limited range near the primarily estimated first horizontal angle {circumflex over (θ)}. φ represents the direction of the elevation angle for calculating the spatial spectrum.
[0142] The second spatial spectrum is a spatial spectrum representing the degree of sound arrival from the direction (θ, φ). The steering vector a.sub.θ,φ for the direction (θ, φ) is stored in advance in the transfer function storage unit 481. The eigenvectors e.sub.i are supplied from the eigenvalue decomposition unit 412 of the first MUSIC processing unit 213.
[0143] In step S182, the frequency information integration unit 483 computes a weighted average {circumflex over (P)}.sup.n.sub.θ,φ,t of the second spatial spectrum for each frequency by the following equations (19) and (20). The second spatial spectrum P.sup.n.sub.ω,θ,φ,t is supplied from the second spatial spectrum computation unit 482. The frequency weight w.sub.ω,t is supplied from the frequency weight computation unit 413 of the first MUSIC processing unit 213.
[0144] By the above second MUSIC processing of the second MUSIC processing unit 215, the weighted average {circumflex over (P)}.sup.n.sub.θ,φ,t of the second spatial spectrum for each frequency is computed.
[0145] Returning to the overall flow, in step S106, the peak detection unit 216 detects the peak value of the weighted average of the second spatial spectrum, and the direction corresponding to the peak value is detected as the final sound source direction (θ.sup.out, φ.sup.out).
[0146] In step S107, the sound source direction presentation unit 311 presents the sound source direction. That is, the sound source direction detected in step S106 is presented. For example, out of the LEDs 13a constituting the display unit 13, the LED 13a in the direction corresponding to the estimated sound source direction is caused to emit light.
[0147] The three-dimensional sound source direction estimation makes it easy to estimate the accurate direction, but in a case where the elevation angle is large, the accuracy tends to be lower than in a case where the sound source exists on the same horizontal plane. Therefore, the display state can be changed depending on whether the estimated elevation angle is small or large.
[0148] For example, in a case where the estimated direction is presented with the LED, the presentation state can be changed, for example, by changing the way of illuminating the LED when the elevation angle is large or small. In a case where the estimated elevation angle is small (the height is the same as or close to the plane on which the microphone array 12 exists), the illumination width of the LED 13a can be reduced. In a case where the elevation angle is large, the illumination width can be increased. For example, in a case where the width is reduced, only one LED 13a can be turned on.
[0149] Moreover, the color of the LED 13a can be changed. For example, in a case where the elevation angle is small, the LED 13a may have white to blue base color, and in a case where the elevation angle is large, the LED 13a may have yellow to red base color.
[0150] In this way, by means of the lighting width or color, it is possible to notify the user that the direction of the sound source may be difficult to estimate.
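The presentation rule of paragraphs [0148] and [0149] can be sketched as a simple mapping from the estimated elevation to a display state. The threshold and the concrete width and color values are assumptions for illustration; the text does not specify them.

```python
def led_presentation(elevation_deg, threshold_deg=30.0):
    """Map an estimated elevation angle to an LED display state.

    threshold_deg is an assumed cut-off between "small" and "large"
    elevation; width counts LEDs to light, color follows [0149].
    """
    if elevation_deg <= threshold_deg:
        # small elevation: narrow illumination, white-to-blue base color
        return {"width": 1, "color": "blue"}
    # large elevation: wider illumination, yellow-to-red base color,
    # signalling that the estimate may be less accurate
    return {"width": 3, "color": "red"}
```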
[0151] Furthermore, in a case where there is a front surface or a part corresponding to the face of the housing 11, by rotating the face (housing 11) to be directed to the estimated direction of the sound source, it is possible to show that the voice from that direction is being received.
[0152] The third embodiment can also produce an effect similar to the effect of the second embodiment. That is, since the operation by the MUSIC method is performed, the sound source direction can be accurately determined. Furthermore, the range in which the horizontal angle and the elevation angle are estimated is not the entire horizontal range of 360 degrees, but the limited range near the first horizontal angle {circumflex over (θ)} estimated in the first stage ({circumflex over (θ)}±Δθ.sub.s). Therefore, the operation amount can be reduced. As a result, even a device whose operation resource is not high (whose operation capability is not high) can perform the operation in real time.
[0153] Moreover, in the third embodiment, since the sound source direction is presented, it is possible to inform the user of the estimated sound source direction.
Fourth Embodiment
[0154] (
[0155] Next, the fourth embodiment will be described. The block diagram of the fourth embodiment is similar to the block diagram shown in
[0156] In the fourth embodiment, an operation amount is further reduced by devising processing in a first spatial spectrum computation unit 415. An example thereof will be described with reference to
[0157] In the example of
[0158] In a case where the number of directions to be thinned out when computing the spatial spectrum is one, that is, in a case where the spatial spectra are computed at the horizontal angles θ, θ±2Δθ, θ±4Δθ, . . . , in
[Equation 15]
P.sub.θ+Δθ,t.sup.n=−(1/8)P.sub.θ−2Δθ,t.sup.n+(3/4)P.sub.θ,t.sup.n+(3/8)P.sub.θ+2Δθ,t.sup.n(21)
[0159] Similarly, in a case where the number of directions to be thinned out when computing the spatial spectrum is two, that is, in a case where the spatial spectra are computed at the horizontal angles θ, θ±3Δθ, θ±6Δθ, . . . , in
[Equation 16]
P.sub.θ+Δθ,t.sup.n=−(1/9)P.sub.θ−3Δθ,t.sup.n+(8/9)P.sub.θ,t.sup.n+(2/9)P.sub.θ+3Δθ,t.sup.n(22)
P.sub.θ+2Δθ,t.sup.n=−(1/9)P.sub.θ−3Δθ,t.sup.n+(5/9)P.sub.θ,t.sup.n+(5/9)P.sub.θ+3Δθ,t.sup.n(23)
[0160] Moreover, in a case where the number of directions to be thinned out when computing the spatial spectrum is three, that is, in a case where the spatial spectra are computed at the horizontal angles θ, θ±4Δθ, θ±8Δθ, . . . , in
[Equation 17]
P.sub.θ+Δθ,t.sup.n=−(3/32)P.sub.θ−4Δθ,t.sup.n+(15/16)P.sub.θ,t.sup.n+(5/32)P.sub.θ+4Δθ,t.sup.n(24)
P.sub.θ+2Δθ,t.sup.n=−(1/8)P.sub.θ−4Δθ,t.sup.n+(3/4)P.sub.θ,t.sup.n+(3/8)P.sub.θ+4Δθ,t.sup.n(25)
P.sub.θ+3Δθ,t.sup.n=−(3/32)P.sub.θ−4Δθ,t.sup.n+(7/16)P.sub.θ,t.sup.n+(21/32)P.sub.θ+4Δθ,t.sup.n(26)
[0161] The above-described processing is performed in the processing of computing the first spatial spectrum in step S134 of
[0162] By interpolating the spatial spectrum in this way, the vector operations and matrix products can be reduced, and the overall operation amount can be reduced.
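The interpolation coefficients of equations (21) to (26) can be read as three-point (quadratic) Lagrange interpolation through the computed grid points, evaluated at the skipped angles. Under that reading, a sketch that derives such weights (the function name and arguments are illustrative):

```python
import numpy as np

def lagrange_weights(step, k):
    """Quadratic Lagrange weights for interpolating the spectrum at
    theta + k*dtheta from values computed at theta - step*dtheta,
    theta, and theta + step*dtheta (step = thinning interval + 1)."""
    xs = np.array([-step, 0.0, step])  # node positions in units of dtheta
    w = []
    for i in range(3):
        others = np.delete(xs, i)
        # standard Lagrange basis polynomial evaluated at x = k
        w.append(np.prod((k - others) / (xs[i] - others)))
    return np.array(w)
```

The weights always sum to one, so a locally constant spectrum is reproduced exactly, and a locally quadratic peak shape is interpolated without error.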
Fifth Embodiment
[0163] (
[0164] Next, with reference to
[0165] The configuration of the sound source direction estimation device 500 of
[0166] The sound source direction estimation device 300 of
[0167] In step S201, the second spatial spectrum computation unit 482 sets a range for computing the second spatial spectrum. The range of the horizontal angle is a range of a predetermined horizontal angle near the first horizontal angle detected by the first peak detection unit 452. The range may be the same as the range for the sound source direction estimation device 300 ({circumflex over (θ)}±Δθ.sub.s in
[0168] In step S202, the second spatial spectrum computation unit 482 computes the second spatial spectrum. This processing is similar to the processing of step S181 in
[0169] In step S203, the frequency information integration unit 483 computes a weighted average of the second spatial spectrum for each frequency. This processing is similar to the processing of step S182 in
[0170] In step S204, the second peak detection unit 216 detects the second peak. This processing is similar to the processing of step S106 in
[0171] In step S205, the second peak detection unit 216 determines whether or not the direction has changed. That is, it is determined whether the horizontal angle detected this time is different from the horizontal angle detected last time. Furthermore, it is determined whether or not the elevation angle detected this time is different from the elevation angle detected last time. In a case where it is determined that at least one of the horizontal angle and the elevation angle is different from the last time, the process returns to step S201.
[0172] Again in step S201, the second spatial spectrum computation unit 482 sets a range for computing the second spatial spectrum. With respect to the horizontal angle and the elevation angle detected by the second peak detection unit 216, the range is a predetermined width range set in advance near the horizontal angle and the elevation angle.
[0173] In the newly set range, the second spatial spectrum is computed in step S202, the weighted average of the second spatial spectrum for each frequency is computed in step S203, and the second peak is detected again in step S204. Then, it is determined again in step S205 whether or not the direction has changed.
[0174] As described above, the processing of steps S201 to S205 is repeated until both the horizontal angle and the elevation angle no longer change. When both the horizontal angle and the elevation angle stop changing, the horizontal angle and the elevation angle are supplied to the sound source direction presentation unit 311 as the final sound source direction (θ.sup.out, φ.sup.out).
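The loop of steps S201 to S205 can be sketched as a local hill-climb on the spatial spectrum. Everything below is illustrative: the spectrum is precomputed as a dense array for simplicity, whereas the device computes it only inside each window, and the window half-widths and iteration cap are assumed values.

```python
import numpy as np

def refine_direction(spectrum, theta0, r_theta=5, r_phi=5, max_iter=20):
    """Iterative local peak search in the spirit of steps S201-S205.

    spectrum: 2-D array indexed [theta, phi] of spatial-spectrum values.
    theta0:   first horizontal angle from the first-stage estimate;
              the elevation index starts at 0 (elevation fixed).
    """
    n_theta, n_phi = spectrum.shape
    theta, phi = theta0, 0
    for _ in range(max_iter):
        # S201: set the search window around the current estimate
        t0, t1 = max(0, theta - r_theta), min(n_theta, theta + r_theta + 1)
        p0, p1 = max(0, phi - r_phi), min(n_phi, phi + r_phi + 1)
        window = spectrum[t0:t1, p0:p1]
        # S202-S204: find the peak inside the window
        dt, dp = np.unravel_index(np.argmax(window), window.shape)
        new_theta, new_phi = t0 + dt, p0 + dp
        if (new_theta, new_phi) == (theta, phi):
            break  # S205: direction no longer changes
        theta, phi = new_theta, new_phi  # re-centre and repeat
    return int(theta), int(phi)
```

On a spectrum with a single smooth peak, the window slides toward the peak in a few iterations, matching the P.sub.1 → P.sub.4 walk described below in the text.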
[0175] The processing of
[0176] An outline of the processing of
[0177] In
[0178] When the horizontal angle (first horizontal angle) is detected by the first peak detection unit 452 in a state where the elevation angle is fixed (fixed to 0 degrees in the example of
[0179] Next, with respect to the point P.sub.2, the range R.sub.2 of the width R.sub.θ in the horizontal angle direction and the width R.sub.φ in the elevation angle direction is set as the second range for computing the second spatial spectrum. Then, the second spatial spectrum is computed in the range R.sub.2, the maximum value of the peak is detected, and the point of the horizontal angle and the elevation angle corresponding to the peak is P.sub.3. The point P.sub.3 has the same horizontal angle as the point P.sub.2, but has a different elevation angle.
[0180] Therefore, furthermore, with respect to the point P.sub.3, the range R.sub.3 of the width R.sub.θ in the horizontal angle direction and the width R.sub.φ in the elevation angle direction is set as the third range for computing the second spatial spectrum. Then, the second spatial spectrum is computed in the range R.sub.3, the maximum value of the peak is detected, and the point of the horizontal angle and the elevation angle corresponding to the peak is P.sub.4.
[0181] Moreover, with respect to the point P.sub.4, the range R.sub.4 of the width R.sub.θ in the horizontal angle direction and the width R.sub.φ in the elevation angle direction is set as the fourth range for computing the second spatial spectrum. Then, the second spatial spectrum is computed in the range R.sub.4, and the maximum value of the peak is detected. However, the point of the horizontal angle and the elevation angle corresponding to the peak is again P.sub.4, and the horizontal angle and the elevation angle are the same as last time. Therefore, the horizontal angle and the elevation angle of the point P.sub.4 are set as the final sound source direction (θ.sup.out, φ.sup.out).
[0182] In this way, since the range in which the operation is performed on the spatial spectrum is limited in the fifth embodiment, the operation amount therefor can be further reduced.
Sixth Embodiment
[0183] (
[0184] Next, with reference to
[0185] Then, one pair 12p is formed by the three-dimensionally arranged microphone 12at of one channel and one of the other six microphones 12as (of one channel). Therefore, the number of pairs 12p is six. Direction estimation is performed for each pair 12p, and the results are integrated into the final sound source direction. Note that what actually constitutes a pair need not be the microphones 12a themselves; the outputs of the microphones 12a suffice.
[0186] In a sound source direction estimation device 600 of
[0187] The SRP-PHAT processing unit 611 includes a number of cross-correlation calculation units 621-1 to 621-6 corresponding to the pairs 12p. The cross-correlation calculation unit 621-1 to the cross-correlation calculation unit 621-6 of the SRP-PHAT processing unit 611 each calculate the cross-correlation of the corresponding pair 12p. The cross-correlation integration unit 612 integrates the cross-correlation of the six pairs 12p. The peak determination unit 613 determines the final sound source direction from the peak of the integrated cross-correlation.
[0188] Next, sound source estimation processing of the sound source direction estimation device 600 will be described with reference to
[0189] The processing in steps S301 to S304 is similar to the processing in steps S101 to S104 of
[0190] Then, in step S304, the horizontal angle estimation unit 214 detects, among the weighted averages P{circumflex over ( )}.sup.n.sub.θ,t of the first spatial spectrum output from the MUSIC processing unit 213, those having a peak exceeding a threshold P.sup.th.sub.θ,t. Then, the horizontal angle {circumflex over (θ)} corresponding to the detected weighted average P{circumflex over ( )}.sup.n.sub.θ,t of the first spatial spectrum having the peak is output as the sound source direction in a case where the elevation angle is fixed (first horizontal angle).
[0191] In step S305, the SRP-PHAT processing unit 611 performs SRP-PHAT processing. Specifically, the cross-correlation calculation unit 621-1 calculates the weighted cross-correlation R.sub.τ,t,m,n of the microphone 12at and the first microphone 12as that constitute the first pair 12p by the following equations (27) and (28). In these equations, m means the m-th microphone and n means the n-th microphone. In the example of
[0192] The calculation of equation (27) is as follows. That is, from an STFT (or fast Fourier transform (FFT)) signal z.sub.ω,t,m of the m-th microphone 12at and the complex conjugate z*.sub.ω,t,n of the STFT (or FFT) signal z.sub.ω,t,n of the n-th microphone 12as, the correlation ψ.sub.ω,t,m,n between them is calculated. Moreover, the correlation ψ.sub.ω,t,m,n obtained by equation (27) is weighted by a weight w.sub.ω,t,m,n as shown in equation (28), and an inverse short-time Fourier transform (ISTFT) is performed. Alternatively, an inverse fast Fourier transform (IFFT) is performed.
[0193] In a case where the following equation (29) is used as the weight w.sub.ω,t,m,n in equation (28), this results in the steered response power with the phase transform (SRP-PHAT). Alternatively, in a case where the following equation (30) is used as the weight w.sub.ω,t,m,n in equation (28), this results in the steered response power with the smoothed coherence transform (SRP-SCOT). By using SRP, the operation amount can be reduced.
[Equation 19]
w.sub.ω,t,m,n=1/|ψ.sub.ω,t,m,n|(29)
w.sub.ω,t,m,n=1/{square root over (ψ.sub.ω,t,m,mψ.sub.ω,t,n,n)}(30)
[0194] Similarly, the cross-correlation calculation unit 621-2 to the cross-correlation calculation unit 621-6 also calculate the weighted cross-correlation R.sub.τ,t,m,n of the microphone 12at and the microphone 12as of the corresponding pair 12p by the above-described equations (27) and (28). Thus, in the example of
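For one pair, the PHAT-weighted cross-correlation of equations (27) to (29) can be sketched with FFTs as follows. The function names, the padding length, and the small `eps` stabilizer are assumptions rather than the patent's exact formulation, and the conjugation order follows equation (27), so a delayed second channel produces a peak at a negative lag.

```python
import numpy as np

def gcc_phat(x, y, eps=1e-12):
    """PHAT-weighted cross-correlation of one microphone pair.

    x, y: time-domain signals of the two microphones.
    Returns the correlation over integer lags, fftshift-centred so that
    index len(r)//2 corresponds to zero lag.
    """
    n = len(x) + len(y)                    # zero-pad to avoid wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    psi = X * np.conj(Y)                   # equation (27): cross-spectrum
    # equation (28) with the PHAT weight of equation (29): keep phase only
    r = np.fft.irfft(psi / (np.abs(psi) + eps), n=n)
    return np.fft.fftshift(r)

def estimated_lag(x, y):
    """Lag (in samples) maximising the PHAT-weighted correlation;
    negative when y is a delayed copy of x."""
    r = gcc_phat(x, y)
    return int(np.argmax(r)) - len(r) // 2
```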
[0195] In step S306, the cross-correlation integration unit 612 integrates the cross-correlations. That is, R{circumflex over ( )}.sub.τ,t is computed by equation (31) from the weighted cross-correlations R.sub.τ,t,m,n of the six pairs 12p calculated by the cross-correlation calculation unit 621-1 to the cross-correlation calculation unit 621-6.
[0196] In step S307, the peak determination unit 613 determines the peak. That is, by equation (32), the set of horizontal angle and elevation angle that maximizes R{circumflex over ( )}.sub.τ,t computed by equation (31) is found, and this set is determined as the sound source direction (θ.sup.out, φ.sup.out).
[0197] It can also be understood that the processing in step S306 and step S307 described above is performed by the cross-correlation integration unit 612 and the peak determination unit 613 executing the following equation (33).
[0198] That is, the range of the operation by the peak determination unit 613 is limited, with respect to the first horizontal angle {circumflex over (θ)} supplied from the first peak detection unit 452, to a predetermined range near the first horizontal angle ({circumflex over (θ)}±Δθ.sub.s), that is, {circumflex over (θ)}−Δθ.sub.s<θ<{circumflex over (θ)}+Δθ.sub.s. Then, in this range, the final second horizontal angle θ.sup.out and the elevation angle φ.sup.out are computed. With this limitation, the operation amount can be reduced.
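The restricted peak search can be sketched as an argmax over a wrapped horizontal-angle window. Here `corr_map`, its indexing in whole degrees, and the window half-width are illustrative assumptions; in the patent the integrated correlation is evaluated at the lag τ(θ, φ) for each candidate direction.

```python
import numpy as np

def srp_peak_near(corr_map, theta_hat, half_width):
    """Peak search of equations (31)-(33), limited to a horizontal window.

    corr_map:  2-D array [theta, phi] of the pair-integrated correlation,
               each entry evaluated at the lag for that direction.
    theta_hat: first horizontal angle from the first-stage estimate.
    half_width: assumed half-width of the search window in degrees.
    """
    n_theta, _ = corr_map.shape
    # candidate horizontal angles, wrapping around 360 degrees
    thetas = np.arange(theta_hat - half_width, theta_hat + half_width + 1) % n_theta
    window = corr_map[thetas, :]
    i, j = np.unravel_index(np.argmax(window), window.shape)
    return int(thetas[i]), int(j)
```

A larger peak outside the window is ignored by construction, which is what limits the operation to the neighbourhood of the first horizontal angle.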
[0199] τ is a function of the horizontal angle θ and the elevation angle φ, and furthermore of m and n, as expressed in equation (2) described above. From R{circumflex over ( )}.sub.τ,t, which includes τ as an element, the sound source direction (θ.sup.out, φ.sup.out) can be calculated by equation (32) or equation (33) using the function argmax.
[0200] In the sixth embodiment, the range can be narrowed down to some extent, and the maximum value within the narrowed range is determined. Therefore, it is possible to estimate a plurality of directions at the same time.
Seventh Embodiment
[0201] (
[0202] Next, with reference to
[0203] The sound source direction estimation device 700 includes an acoustic signal input unit 211, a frequency conversion unit 212, an SRP-PHAT processing unit 611, a cross-correlation integration unit 612, a peak determination unit 613, and a sound source direction presentation unit 311. The SRP-PHAT processing unit 611 includes a cross-correlation calculation unit 621-1 to a cross-correlation calculation unit 621-6.
[0204] That is, the seventh embodiment of
[0205] Next, sound source direction estimation processing of the sound source direction estimation device 700 will be described with reference to the flowchart of
[0206] The processing in step S351 and step S352 is similar to the processing in step S301 and step S302 of
[0207] In step S352, the frequency conversion unit 212 performs frequency conversion on the acoustic signal input from the acoustic signal input unit 211. That is, the acoustic signal is converted from a signal in the time domain to a signal in the frequency domain. For example, DFT or STFT processing is performed for every frame. For example, the frame length can be 32 ms and the frame shift can be 10 ms. The cross-correlation calculation unit 621-1 to the cross-correlation calculation unit 621-6 each acquire a signal in the frequency domain of the corresponding pair of the six pairs 12p.
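A minimal frame-wise transform with the quoted 32 ms frame length and 10 ms shift might look as follows. The 16 kHz sampling rate and the Hann window are assumptions; the text does not specify them.

```python
import numpy as np

def stft(signal, fs=16000, frame_ms=32, shift_ms=10):
    """Frame-wise DFT of a single-channel signal.

    fs: assumed sampling rate. Returns an array of shape
    (n_frames, frame_len // 2 + 1) of complex spectra.
    """
    frame_len = int(fs * frame_ms / 1000)   # 512 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)       # 160 samples at 16 kHz
    window = np.hanning(frame_len)
    frames = [np.fft.rfft(window * signal[start:start + frame_len])
              for start in range(0, len(signal) - frame_len + 1, shift)]
    return np.array(frames)
```

A 1 kHz tone at 16 kHz falls exactly on bin 32 (1000 × 512 / 16000), which makes the transform easy to sanity-check.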
[0208] Next, in step S353, SRP-PHAT processing is performed by the SRP-PHAT processing unit 611. In the seventh embodiment of
[0209] The SRP-PHAT processing of step S353 and the processing of integrating the cross-correlation of step S354 are similar to the SRP-PHAT processing of step S305 and the processing of integrating the cross-correlation of step S306 in
[0210] That is, in step S353, in a similar manner to the SRP-PHAT processing of step S305 described above, the cross-correlation calculation unit 621-1 to the cross-correlation calculation unit 621-6 perform the calculation by the above-described equations (27) and (28). With this calculation, the weighted cross-correlation R.sub.t,t,m.n of the microphone 12at and the microphone 12as of the corresponding pair 12p is calculated.
[0211] In step S354, the cross-correlation integration unit 612 performs processing of integrating the cross-correlations. That is, R{circumflex over ( )}.sub.τ,t is computed by the above-described equation (31).
[0212] In step S355, the peak determination unit 613 determines the peak. That is, by equation (32), the set of horizontal angle and elevation angle that maximizes R{circumflex over ( )}.sub.τ,t computed by equation (31) is found, and this set is determined as the sound source direction (θ.sup.out, φ.sup.out).
[0213] It can also be understood that the processing of step S354 and step S355 described above is performed by the cross-correlation integration unit 612 and the peak determination unit 613 executing equation (34).
[0214] However, unlike the sixth embodiment of
[0215] In step S356, the sound source direction presentation unit 311 presents the sound source direction. That is, the sound source direction determined in the processing of step S355 is presented to the user. This processing is similar to the processing of step S308 in
[0216] In the sixth embodiment, since the range can be narrowed down to some extent and the maximum value is determined in the narrowed down range, a plurality of directions can be estimated at the same time. In the seventh embodiment, one direction is output in each frame.
[0217] <Experimental Result>
[0218] (
[0219] Next, as in the embodiment of
[0220]
[0221]
[0222] Next, the operation amount will be described.
[0223] The number of points at which the spatial spectrum is computed is 120 in a case where the elevation angle is fixed (only the horizontal angle is estimated), 840 in a case where the horizontal angle and the elevation angle are estimated over all directions, and 120+42N in a case where the horizontal angle is estimated first and then the horizontal angle and the elevation angle are estimated around each found direction. Furthermore, in a case where the horizontal angle is estimated at every other point and the skipped points are interpolated by the above equation (21), the number of points is 60+24N. In either of the latter two cases, the number of points at which the spatial spectrum is computed is far smaller than in the case where the horizontal angle and the elevation angle are estimated over all directions.
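The four counts quoted above can be written as formulas in N, the number of found directions; a small sketch (the function name and tuple ordering are illustrative):

```python
def gridpoint_counts(n_sources):
    """Spatial-spectrum evaluation counts quoted in paragraph [0223].

    Returns (horizontal only, full 2-D grid, two-stage, two-stage with
    horizontal thinning and interpolation) for N = n_sources.
    """
    n = n_sources
    return 120, 840, 120 + 42 * n, 60 + 24 * n
```

Even with several simultaneous sources, the two-stage counts stay well below the 840 points of the exhaustive grid, which is the reduction the text claims.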
[0224] In the above description, the estimated sound source direction is presented to the user, but the estimated sound source direction has other uses. For example, it can be used for automatic switching to a near mode. In a situation where the elevation angle relative to the microphone array 12 of a device is large, it is likely that the user has approached the device before speaking. The shorter the distance, the larger the elevation angle becomes even for a slight difference in height. There may also be cases where the elevation angle is large but the user is not actually close, such as an utterance from a different floor.
[0225] In a case where a sufficiently large elevation angle is determined by the sound source direction estimation, it can be determined that the user is close to the device, and the signal processing configuration can be switched. For example, a configuration may be used in which, after voice activity detection (VAD) (voice/non-voice determination) is performed, a voice is extracted by beam forming (BF), noise reduction (NR) is further performed, and voice recognition is performed. In a case where the user is close to the device, the signal-to-noise (SN) ratio of the voice will be good, and therefore the processing may be switched such that the input voice is recognized as it is without performing direction estimation.
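The near-mode switch described here reduces to a threshold test on the estimated elevation; a minimal sketch, where the threshold value is an assumption not given in the text:

```python
def use_near_mode(elevation_deg, threshold_deg=45.0):
    """Decide whether to bypass direction-dependent processing (BF)
    and recognize the input voice as is, per paragraph [0225].
    threshold_deg is an assumed cut-off for "sufficiently large"."""
    return elevation_deg >= threshold_deg
```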
[0226] <Computer>
[0227] (
[0228] The series of processing described above can be performed by hardware, or can be performed by software. In the latter case, for example, each device includes a personal computer as shown in
[0229] In
[0230] The CPU 921, the ROM 922, and the RAM 923 are connected to one another via a bus 924. An input-output interface 925 is also connected to the bus 924.
[0231] An input unit 926 including a keyboard, a mouse, or the like, an output unit 927 including a display such as a CRT or LCD, a speaker, and the like, a storage unit 928 including a hard disk or the like, and a communication unit 929 including a modem, a terminal adapter, or the like are connected to the input-output interface 925. The communication unit 929 performs communication processing via a network, such as, for example, the Internet.
[0232] A drive 930 is also connected to the input-output interface 925 as necessary, and a removable medium 931 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted as appropriate. A computer program read therefrom is installed in the storage unit 928 as necessary.
[0233] Note that in this specification, the steps describing the program recorded on the recording medium include not only processing executed in time series according to the described order, but also processing that is not necessarily executed in time series but is executed in parallel or individually.
[0234] Furthermore, embodiments of the present technology are not limited to the embodiments described above, and various modifications may be made without departing from the spirit of the present technology.
OTHER
[0235] The present technology can also have the following configurations.
[0236] (1)
[0237] A sound source direction estimation device including:
[0238] a first estimation unit configured to estimate a first horizontal angle that is a horizontal angle of a sound source direction from an input acoustic signal; and
[0239] a second estimation unit configured to estimate a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
[0240] (2)
[0241] The sound source direction estimation device according to (1) described above, further including
[0242] an input unit configured to input the acoustic signal from a microphone array including a plurality of microphones.
[0243] (3)
[0244] The sound source direction estimation device according to (1) or (2) described above, in which
[0245] in the microphone array, the plurality of microphones is arranged three-dimensionally.
[0246] (4)
[0247] The sound source direction estimation device according to any one of (1) to (3) described above, in which
[0248] the first estimation unit performs an operation on a first spatial spectrum, and estimates the first horizontal angle on the basis of the first spatial spectrum.
[0249] (5)
[0250] The sound source direction estimation device according to any one of (1) to (4) described above, in which
[0251] the first estimation unit includes a first processing unit that performs an operation on the first spatial spectrum by a MUSIC method.
[0252] (6)
[0253] The sound source direction estimation device according to any one of (1) to (5) described above, in which
[0254] the second estimation unit includes a second processing unit that performs an operation on a second spatial spectrum by the MUSIC method.
[0255] (7)
[0256] The sound source direction estimation device according to any one of (1) to (6) described above, in which
[0257] the first estimation unit further includes a horizontal angle estimation unit configured to estimate the first horizontal angle on the basis of the first spatial spectrum on which the first processing unit performs an operation.
[0258] (8)
[0259] The sound source direction estimation device according to any one of (1) to (7) described above, in which
[0260] the second processing unit performs an operation on the second spatial spectrum by the MUSIC method in a range of the entire elevation angle in a predetermined range of the horizontal angle near the first horizontal angle.
[0261] (9)
[0262] The sound source direction estimation device according to any one of (1) to (8) described above, in which
[0263] the first processing unit includes a first correlation matrix calculation unit that calculates a correlation matrix of a target sound signal of respective frequencies for every time frame of the acoustic signal for performing an operation on the first spatial spectrum.
[0264] (10)
[0265] The sound source direction estimation device according to any one of (1) to (9) described above, in which
[0266] the first processing unit further includes a second correlation matrix calculation unit that calculates a correlation matrix of a noise signal of respective frequencies for every time frame of the acoustic signal for performing an operation on the first spatial spectrum.
[0267] (11)
[0268] The sound source direction estimation device according to any one of (1) to (10) described above, in which
[0269] the second estimation unit further includes a detection unit that detects the sound source direction from a peak of the second spatial spectrum.
[0270] (12)
[0271] The sound source direction estimation device according to any one of (1) to (11) described above, further including
[0272] a presentation unit configured to present the sound source direction detected by the detection unit.
[0273] (13)
[0274] The sound source direction estimation device according to any one of (1) to (12) described above, in which
[0275] the presentation unit changes a presentation state according to the estimated elevation angle.
[0276] (14)
[0277] The sound source direction estimation device according to any one of (1) to (12) described above, in which
[0278] the first processing unit thins out the direction in which the first spatial spectrum is calculated, and performs an operation on the first spatial spectrum in the thinned out direction by interpolation.
[0279] (15)
[0280] The sound source direction estimation device according to any one of (1) to (14) described above, in which
the second estimation unit repeats processing of computing the second spatial spectrum in a range limited in both the horizontal angle and the elevation angle and detecting the peak of the computed second spatial spectrum, until both the horizontal angle and the elevation angle no longer change.
[0282] (16)
[0283] The sound source direction estimation device according to any one of (1) to (15) described above, in which
[0284] the second estimation unit includes an SRP processing unit that processes a pair signal of one channel of the microphones arranged three-dimensionally and another one channel of the microphones.
[0285] (17)
[0286] The sound source direction estimation device according to any one of (1) to (16) described above, in which the SRP processing unit calculates a cross-correlation of a plurality of the pair signals, and in the predetermined range near the first horizontal angle, the SRP processing unit estimates the second horizontal angle and the elevation angle from a peak of the cross-correlation.
[0287] (18)
[0288] The sound source direction estimation device according to any one of (1) to (17) described above, in which
[0289] the first estimation unit does not estimate the first horizontal angle, and
[0290] the SRP processing unit estimates the second horizontal angle and the elevation angle from a peak of the cross-correlation.
[0291] (19)
[0292] A method of estimating a sound source direction of a sound source direction estimation device, the method including:
[0293] a first step of estimating a first horizontal angle that is a horizontal angle of the sound source direction from an input acoustic signal; and
[0294] a second step of estimating a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
[0295] (20)
[0296] A program for causing a computer to execute sound source direction estimation processing including:
[0297] a first step of estimating a first horizontal angle that is a horizontal angle of the sound source direction from an input acoustic signal; and
[0298] a second step of estimating a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
REFERENCE SIGNS LIST
[0299] 1 Sound source direction estimation device [0300] 12 Microphone array [0301] 12a Microphone [0302] 13 Display unit [0303] 111 Acoustic signal input unit [0304] 112 First estimation unit [0305] 113 Second estimation unit [0306] 211 Acoustic signal input unit [0307] 212 Frequency conversion unit [0308] 213 First MUSIC processing unit [0309] 214 Horizontal angle estimation unit [0310] 215 Second MUSIC processing unit [0311] 216 Peak detection unit