Sound localization method, apparatus and device
12593173 · 2026-03-31
CPC classification
G01S3/8006
PHYSICS
G10L15/30
PHYSICS
International classification
G10L15/30
PHYSICS
H04M3/56
ELECTRICITY
Abstract
A conference speech presentation system, a sound localization method and apparatus, a conference system and a pickup device. The method includes the following steps: collecting (S101) a multi-channel voice signal through a directional microphone array; determining (S103) a steering vector including phase information and amplitude information according to array shape information and microphone pointing direction information; determining (S105) sound direction information according to the steering vector and the voice signal. By adopting this processing mode, both the phase information and the amplitude information are considered when determining the steering vector, which can effectively improve the accuracy of sound localization.
Claims
1. A conference speech presentation system, characterized by comprising: a terminal device, configured to: collect a multi-channel voice signal of a conference space through a directional microphone array; determine a steering vector comprising phase information and amplitude information according to array shape information and microphone pointing direction information; determine location information of a conference speaker according to the steering vector and the multi-channel voice signal; send the multi-channel voice signal and the location information to a server end; and present conference speech texts of different conference speakers sent back by the server end; and the server end, configured to: convert the multi-channel voice signal into a conference speech text through a voice recognition algorithm; and determine the conference speech texts of the different conference speakers according to the location information.
2. A sound localization method, characterized by comprising: collecting a multi-channel voice signal through a directional microphone array; determining a steering vector comprising phase information and amplitude information according to array shape information and microphone pointing direction information; and determining sound direction information according to the steering vector and the multi-channel voice signal.
3. The method according to claim 2, characterized by, the determining a steering vector comprising phase information and amplitude information according to array shape information and microphone pointing direction information comprises: determining a phase difference according to the array shape information; determining an amplitude response according to the microphone pointing direction information; and determining the steering vector according to the phase difference and the amplitude response.
4. The method according to claim 2, characterized by: the directional microphone array comprises a linear array; the array shape information comprises a distance between microphones; and a microphone pointing direction comprises a direction perpendicular to the linear array and pointing to one side.
5. The method according to claim 2, characterized by: the directional microphone array comprises a circular array; the array shape information comprises a radius of the circular array; and the microphone pointing direction is a direction of a microphone relative to a center of the circular array.
6. The method according to claim 2, characterized by, the determining sound direction information according to the steering vector and the multi-channel voice signal comprises: determining a spatial spectrum according to the steering vector and the multi-channel voice signal; and determining the sound direction information according to the spatial spectrum.
7. The method according to claim 6, characterized by, the determining the sound direction information according to the spatial spectrum comprises: taking a direction for which the energy response data ranks highest as a sound direction.
8. A sound localization apparatus, characterized by comprising: at least one processor and a memory; the memory stores a computer executable instruction; and the at least one processor executes the computer executable instruction stored in the memory to enable the at least one processor to: collect a multi-channel voice signal through a directional microphone array; determine a steering vector comprising phase information and amplitude information according to array shape information and microphone pointing direction information; and determine sound direction information according to the steering vector and the multi-channel voice signal.
9. A pickup device, characterized by comprising: a directional microphone array; a processor; and a memory, configured to store a computer executable instruction for implementing a sound localization method according to claim 2, wherein the pickup device, when powered on, executes the computer executable instruction through the processor.
10. A conference system, characterized by comprising: a sound localization apparatus according to claim 8 and a speaker tracking apparatus.
11. A non-transitory computer-readable storage medium, characterized by an instruction being stored in the non-transitory computer-readable storage medium, which, when run on a computer, causes the computer to perform the method according to claim 2.
12. The pickup device according to claim 9, wherein the memory is configured to store the computer executable instruction for further implementing the following operations: determining a phase difference according to the array shape information; determining an amplitude response according to the microphone pointing direction information; and determining the steering vector according to the phase difference and the amplitude response.
13. The pickup device according to claim 9, wherein: the directional microphone array comprises a linear array; the array shape information comprises a distance between microphones; and a microphone pointing direction comprises a direction perpendicular to the linear array and pointing to one side.
14. The pickup device according to claim 9, wherein: the directional microphone array comprises a circular array; the array shape information comprises a radius of the circular array; and the microphone pointing direction information indicates a direction of a microphone relative to a center of the circular array.
15. The pickup device according to claim 9, wherein the memory is configured to store the computer executable instruction for further implementing the following operations: determining a spatial spectrum according to the steering vector and the multi-channel voice signal; and determining the sound direction information according to the spatial spectrum.
16. The pickup device according to claim 15, wherein the memory is configured to store the computer executable instruction for further implementing the following operation: taking a direction for which the energy response data ranks highest as a sound direction.
17. The non-transitory computer-readable storage medium according to claim 11, wherein the instruction, when run on a computer, causes the computer to further implement the following operations: determining a phase difference according to the array shape information; determining an amplitude response according to the microphone pointing direction information; and determining the steering vector according to the phase difference and the amplitude response.
18. The non-transitory computer-readable storage medium according to claim 11, wherein the instruction, when run on a computer, causes the computer to further implement the following operations: determining a spatial spectrum according to the steering vector and the multi-channel voice signal; and determining the sound direction information according to the spatial spectrum.
19. The non-transitory computer-readable storage medium according to claim 18, wherein the instruction, when run on a computer, causes the computer to further implement the following operation: taking a direction for which the energy response data ranks highest as a sound direction.
20. The sound localization apparatus according to claim 8, wherein the at least one processor is further configured to: determine a phase difference according to the array shape information; determine an amplitude response according to the microphone pointing direction information; and determine the steering vector according to the phase difference and the amplitude response.
Description
BRIEF DESCRIPTION OF DRAWINGS
DESCRIPTION OF EMBODIMENTS
(5) In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many other ways different from those described here, and those skilled in the art can make similar generalizations without departing from the essence of the present application, so the present application is not limited by the specific embodiments disclosed below.
(6) In the present application, a conference speech presentation system, a sound localization method and apparatus, a conference system and a pickup device are provided. Various schemes are described in detail in the following embodiments.
The First Embodiment
(7) The embodiment of the present disclosure provides a sound localization method, which can be adopted for a pickup device, an audio and video conference terminal and the like, where the device includes a directional microphone array instead of an omnidirectional microphone array.
(8) Please refer to
(9) The directional microphone includes, but is not limited to, cardioid, hypercardioid, shotgun, and bidirectional types.
(10) The microphone array can be a circular array or a linear array, or an array with another geometric shape, such as a square array or a triangular array, or an array with an irregular geometric shape.

Step S103: determining a steering vector including phase information and amplitude information according to array shape information and microphone pointing direction information.
(11) The method provided by the embodiment of the present application adopts the same overall processing flow as the prior-art DOA (Direction of Arrival) method based on an omnidirectional microphone array, but improves the way the steering vector is determined; step S103 is this improved determination of the steering vector.
(12) In a specific implementation, a DOA localization method such as Steered-Response Power-Phase Transform (SRP-PHAT), MUSIC (Multiple Signal Classification) or MVDR (Minimum Variance Distortionless Response) can be adopted. Taking the SRP-PHAT localization method as an example, this method scans different angles (0-360 degrees), calculates the energy response of each angle according to the steering vector and the signal received by the microphone array, and then obtains a spatial spectrum. After obtaining the spatial spectrum, the angle with the highest energy response in the spatial spectrum can be selected as the sound localization result. The difference among these DOA methods lies in how the spatial spectrum is calculated from the steering vector and the multi-channel voice signal.
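As an illustration of this scanning procedure, the following is a minimal sketch of an SRP-PHAT-style spatial spectrum, assuming a precomputed table of steering vectors and a single STFT frame; the function names and array shapes are illustrative, not from the patent.

```python
import numpy as np

def srp_spectrum(X, steering):
    """Steered-response power spectrum of one STFT frame (SRP-PHAT style).

    X        : (M, F) complex STFT of the M-channel signal
    steering : (A, M, F) complex steering vectors for A candidate angles
    Returns an (A,) array of energy responses, one per scanned angle.
    """
    # PHAT weighting: keep only the phase of each time-frequency bin
    Xw = X / (np.abs(X) + 1e-12)
    # Beamform toward every candidate angle and sum the energy over frequency
    Y = np.einsum('amf,mf->af', np.conj(steering), Xw)
    return np.sum(np.abs(Y) ** 2, axis=1)

def localize(X, steering, angles_deg):
    """Take the scanned angle whose energy response is highest."""
    P = srp_spectrum(X, steering)
    return angles_deg[int(np.argmax(P))]
```

When the received frame matches one of the candidate steering vectors, the beamformed energy adds coherently at that angle and incoherently elsewhere, so the spectrum peaks at the true direction.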
(13) The array shape information is related to the geometric shape of the array. Taking a linear array as an example, the array shape information may include information such as a distance between microphones. Taking a circular array as an example, the array shape information may include information such as the radius of the circular array.
(14) The microphone pointing direction information is also related to the geometric shape of the array. Taking a linear array as an example, the pointing direction of each microphone is perpendicular to the array and points to one side. Taking a circular array as an example, the pointing direction of each microphone is the direction of that microphone relative to the center of the array.
(15) In the prior art, when an omnidirectional microphone array is adopted, the steering vector only represents the phase relationship of an incident signal at each array element in the microphone array. In the method provided by the present application, when the microphones in the array are directional microphones, the directivity of each microphone is also considered in the steering vector; that is, the amplitude response in the incident direction is also calculated. In other words, the steering vector described in the embodiment of the present application includes both phase information and amplitude information, so for signals in different directions, both phase information and amplitude information can be used for localization.
(16) In this embodiment, step S103 may include the following sub-steps: determining the phase difference according to the array shape information; determining the amplitude response according to the microphone pointing direction information; determining the steering vector according to the phase difference and the amplitude response.
(17) As shown in
(18) p(θ_m, θ) = a + (1 − a)·cos(θ − θ_m)

(19) In this formula, p(θ_m, θ) represents the amplitude response of the m-th directional microphone, θ represents an incident direction of a signal, θ_m represents a pointing direction of the m-th directional microphone, and a represents a first-order coefficient of the directional microphone.
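The first-order amplitude response can be sketched directly in code; the function name and the interpretation of a = 0.5 as a cardioid pattern are illustrative assumptions, not values from the patent.

```python
import numpy as np

def amp_response(theta, theta_m, a):
    """First-order directional amplitude response p(theta_m, theta).

    theta   : incident direction of the signal, in radians
    theta_m : pointing direction of the m-th microphone, in radians
    a       : first-order coefficient (a = 1 degenerates to an
              omnidirectional response; a = 0.5 gives a cardioid pattern)
    """
    return a + (1 - a) * np.cos(theta - theta_m)
```

A cardioid (a = 0.5) has unit response on axis and a null directly behind the microphone, which is the directivity the steering vector exploits.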
(20) Accordingly, the following formula can be adopted for the steering vector:
(21) v_m(θ) = p(θ_m, θ)·e^(−j2πf(m−1)d·cos(θ)/c), m = 1, …, M, where f represents a frequency of the signal, d represents the distance between adjacent microphones, c represents the speed of sound, and M represents the number of microphones.
(22) In the prior art, the following formula can be adopted to calculate the steering vector of a directional microphone, with only the phase term retained:

(23) v_m(θ) = e^(−j2πf(m−1)d·cos(θ)/c)
(24) It can be seen from this formula that the amplitude information is not considered when calculating the steering vector in the prior art, therefore the steering vector is not accurate enough.
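To make this contrast concrete, here is a minimal sketch of the two steering vectors for a uniform linear array, assuming every microphone points broadside (θ_m = π/2) and using illustrative spacing, frequency, and speed-of-sound values not taken from the patent.

```python
import numpy as np

C = 343.0  # assumed speed of sound, m/s

def steering_omni(theta, f, M=4, d=0.05):
    """Prior-art steering vector: phase information only."""
    m = np.arange(M)
    return np.exp(-2j * np.pi * f * m * d * np.cos(theta) / C)

def steering_directional(theta, f, M=4, d=0.05, a=0.5):
    """Steering vector carrying both phase and amplitude information.

    Every microphone is assumed to point broadside, i.e. theta_m = pi/2.
    """
    p = a + (1 - a) * np.cos(theta - np.pi / 2)  # amplitude response
    return p * steering_omni(theta, f, M, d)
```

At broadside (θ = π/2) the two vectors coincide; toward the array axis the directional vector is attenuated, which is exactly the amplitude information the prior-art vector discards.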
(25) In another example, the directional microphone array is a circular array, and the following formula can be adopted for the steering vector:
(26) v_m(θ) = p(θ_m, θ)·e^(j2πfR·cos(θ − θ_m)/c)

(27) In this formula, θ represents an incident direction of a signal, θ_m represents a pointing direction of the m-th directional microphone, and R represents a radius of the circular array.
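A minimal sketch of this circular-array steering vector, assuming each microphone points radially outward from the center and using illustrative radius, frequency, and coefficient values not taken from the patent:

```python
import numpy as np

C = 343.0  # assumed speed of sound, m/s

def circular_steering_vector(theta, f, M=4, R=0.05, a=0.5):
    """Steering vector of a uniform circular array of directional mics.

    theta : incident direction of the signal, in radians
    R     : radius of the circular array, in metres
    Each microphone is assumed to point radially outward, so its pointing
    direction theta_m equals its angular position on the circle.
    """
    theta_m = 2 * np.pi * np.arange(M) / M            # radial pointing directions
    phase = np.exp(2j * np.pi * f * R * np.cos(theta - theta_m) / C)
    amplitude = a + (1 - a) * np.cos(theta - theta_m)  # per-mic amplitude response
    return amplitude * phase
```

The element facing the source keeps full amplitude, while the element facing away sits in the cardioid null, so the magnitude profile alone already carries direction information.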
(28) Step S105: determining sound direction information according to the steering vector and the voice signal.
(29) After determining the steering vector including phase information and amplitude information, the DOA method can be adopted to determine sound direction information according to the steering vector and the voice signal.
(30) As shown in
(31) It can be seen from the above embodiment that the sound localization method provided by the embodiment of the present application collects a multi-channel voice signal through a directional microphone array; determines a steering vector including phase information and amplitude information according to array shape information and microphone pointing direction information; and determines sound direction information according to the steering vector and the voice signal. By adopting this processing mode, both the phase information and the amplitude information are considered when determining the steering vector, which can effectively improve the accuracy of sound localization.
The Second Embodiment
(32) In the above embodiment, a sound localization method is provided; correspondingly, the present application further provides a sound localization apparatus. The apparatus corresponds to the above method embodiment. Since the apparatus embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the description of the method embodiment. The apparatus embodiment described below is merely illustrative.
(33) The present application additionally provides a sound localization apparatus, which includes: a sound collecting unit, configured to collect a multi-channel voice signal through a directional microphone array; a steering vector determining unit, configured to determine a steering vector including phase information and amplitude information according to array shape information and microphone pointing direction information; a sound direction determining unit, configured to determine sound direction information according to the steering vector and the voice signal.
(34) In an implementation, the steering vector determining unit includes: a phase difference determining subunit, configured to determine a phase difference according to the array shape information; an amplitude response determining subunit, configured to determine an amplitude response according to the microphone pointing direction information; a steering vector determining subunit, configured to determine the steering vector according to the phase difference and the amplitude response.
(35) In an implementation, the array includes a linear array; the array shape information includes a distance between microphones; a microphone pointing direction includes a direction perpendicular to the array and pointing to one side.
(36) In an implementation, the array includes a circular array; the array shape information includes a radius of the circular array; a microphone pointing direction is a direction of a microphone relative to the center of the circular array.
(37) In an implementation, the sound direction determining unit includes: a spatial spectrum determining subunit, configured to determine a spatial spectrum according to the steering vector and the voice signal; a sound direction determining subunit, configured to determine the sound direction information according to the spatial spectrum.
(38) In an implementation, the sound direction determining subunit is specifically configured to take a direction for which the energy response data ranks highest as a sound direction.
The Third Embodiment
(39) Corresponding to the above sound localization method, the present disclosure further provides a conference system. The parts of this embodiment that are the same as those of the first embodiment are not repeated here; refer to the corresponding parts in the first embodiment. The conference system provided by the present application includes a sound localization apparatus and a speaker tracking apparatus.
(40) An audio and video conference system is a system device with which individuals or groups in two or more different places transmit sound, images and documents to each other through transmission lines and conference terminals and the like to implement instant and interactive communication, so as to realize a simultaneous conference.
(41) The sound localization apparatus corresponds to the first embodiment, so it will not be described in detail; refer to the corresponding part in the first embodiment. The speaker tracking apparatus is configured to determine activity track information of the speaker according to the sound direction information output by the sound localization apparatus. Since speaker tracking is a mature prior art, it will not be described here.
(42) It can be seen from the above embodiment that the conference system provided by the embodiment of the present application includes a sound localization apparatus and a speaker tracking apparatus. The sound localization apparatus is configured to collect a multi-channel voice signal through a directional microphone array; determine a steering vector including phase information and amplitude information according to array shape information and microphone pointing direction information; and determine sound direction information according to the steering vector and the voice signal. The speaker tracking apparatus is configured to determine activity track information of the speaker according to the sound direction information output by the sound localization apparatus. The system considers both phase information and amplitude information when determining the steering vector, so it can effectively improve the accuracy of sound localization and, in turn, the accuracy of speaker tracking.
The Fourth Embodiment
(43) Corresponding to the above sound localization method, the present application further provides a conference speech presentation system. The parts of this embodiment that are the same as those of the first embodiment are not repeated here; refer to the corresponding parts in the first embodiment. The conference speech presentation system provided by the present application includes a terminal device and a server end.
(44) Please refer to
(45) It can be seen from the above embodiment that in the conference speech presentation system provided by the embodiment of the present application, the terminal device collects a multi-channel voice signal of a conference space through a directional microphone array; determines a steering vector including phase information and amplitude information according to array shape information and microphone pointing direction information; determines location information of a conference speaker according to the steering vector and the voice signal; and sends the voice signal and the location information to a server end. The server end converts the voice signal into a conference speech text through a voice recognition algorithm, and determines conference speech texts of different conference speakers according to the location information. The terminal device presents the conference speech texts of different conference speakers. By adopting this processing mode, both the phase information and the amplitude information are considered when determining the steering vector, which can effectively improve the accuracy of localization for conference speakers, and then improve the accuracy of conference speech presentation.
(46) Although the present application has been disclosed in terms of the preferred embodiments, it is not intended to limit the present application to these embodiments. Any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of the present application, so the protection scope of the present application should be based on the scope defined in the claims of the present application.
(47) In a typical configuration, a computing device includes one or more processors (CPU), an input/output interface, a network interface, and a memory.
(48) The memory may include a volatile memory, a random access memory (RAM) and/or a non-volatile memory in a computer-readable medium, such as a read-only memory (ROM) or a flash memory. The memory is an example of a computer-readable medium.
(49) 1. The computer-readable medium includes permanent and non-permanent media and removable and non-removable media, with which information storage can be implemented by any method or technology. The information can be a computer-readable instruction, a data structure, a module of a program or other data. Examples of storage media for a computer include, but are not limited to, a phase change memory (PRAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), other types of random access memory (RAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory or other memory technologies, a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD) or other optical storage, a magnetic cassette, a magnetic tape/magnetic disk storage or other magnetic storage device, or any other non-transmission medium, which can be configured to store information that can be accessed by a computing device. According to the definition in this specification, a computer-readable medium does not include transitory media, such as a modulated data signal and a carrier wave.
(50) 2. It should be understood by those skilled in the art that embodiments of the present application can be provided as a method, a system or a computer program product. Therefore, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.