SIGNAL CLASSIFICATION OF MULTIPLE ENCODING MODES
Patent abstract:
SIGNAL CLASSIFICATION OF MULTIPLE ENCODING MODES
Improved audio classification is provided for encoding applications. An initial classification is performed, followed by a more refined classification, to produce speech classifications and music classifications with superior accuracy and less complexity than previously available. Audio is classified as speech or music on a frame-by-frame basis. If a frame is classified as music by the initial classification, that frame goes through a second, more refined classification to confirm that the frame is music and is not speech (for example, tonal and/or structured speech that may not have been classified as speech by the initial classification). Depending on the implementation, one or more parameters can be used in the more refined classification. Exemplary parameters include voicing, modified correlation, signal activity, and long-term pitch gain.
Publication number: BR112014017001B1
Application number: R112014017001-0
Filing date: 2012-12-21
Publication date: 2020-12-22
Inventors: Venkatraman Srinivasa Atti; Ethan Robert Duni
Applicant: Qualcomm Incorporated
IPC main class:
Patent description:
CROSS-REFERENCE TO RELATED APPLICATIONS [0001] This application claims priority under the benefit of 35 U.S.C. §119(e) to Provisional Patent Application No. 61/586,374, filed on January 13, 2012. This provisional patent application is hereby expressly incorporated herein in its entirety by reference. BACKGROUND [0002] The transmission of voice (also referred to as speech) and music by digital techniques has become widespread and is incorporated into a wide range of devices, including wireless communication devices, personal digital assistants (PDAs), laptop computers, desktop computers, mobile and/or satellite radio telephones, and the like. An exemplary field is that of wireless communications. The field of wireless communications has many applications including, for example, cordless telephones, paging, wireless local area networks, wireless telephony such as cellular and PCS telephone systems, mobile Internet Protocol (IP) telephony, and satellite communication systems. [0003] In telecommunication networks, information is transferred in encoded form between a transmitting communication device and a receiving communication device. The transmitting communication device encodes original information, such as voice signals and/or music signals, into encoded information and sends it to the receiving communication device. The receiving communication device decodes the received encoded information to recreate the original information. The encoding and decoding are performed using codecs. The encoding of voice signals and/or music signals is performed in a codec located in the transmitting communication device, and the decoding is performed in a codec located in the receiving communication device. [0004] In modern codecs, multiple encoding modes are included to handle different types of input sources, such as speech, music, and mixed content. For optimal performance, the optimal encoding mode for each frame of the input signal should be selected and used. 
Accurate classification is necessary to select the most efficient encoding schemes and obtain the lowest data rate. [0005] This classification can be performed in an open-loop manner to save complexity. In this case, the optimal mode classifier must consider the main characteristics of the various encoding modes. Some modes (such as speech encoding modes like algebraic code-excited linear prediction (ACELP)) contain an adaptive codebook (ACB) that exploits the correlation between past and current frames. Some other modes (such as modified discrete cosine transform (MDCT) modes for music/audio) may not contain such a feature. Thus, it is important to ensure that input frames that have a high correlation with the previous frame are classified into modes that have an ACB or that include other techniques for modeling the correlation between frames. [0006] Previous solutions used closed-loop mode decisions (for example, AMR-WB+, USAC) or various types of open-loop decisions (for example, AMR-WB+, EVRC-WB), but these solutions are either complex or their performance is prone to errors. SUMMARY [0007] Improved audio classification is provided for encoding applications. An initial classification is performed, followed by a more refined classification, to produce speech classifications and music classifications with superior accuracy and less complexity than previously available. [0008] Audio is classified as speech or music on a portion-by-portion basis (for example, frame by frame). If a frame is classified as music by the initial classification, that frame is subjected to a second, more refined classification to confirm that the frame is music and is not speech (for example, tonal and/or structured speech that may not have been classified as speech by the initial classification). [0009] Depending on the implementation, one or more parameters can be used in the more refined classification. 
Exemplary parameters include voicing, modified correlation, signal activity, and long-term pitch gain. These parameters are only examples, and are not intended to be limiting. [0010] This summary is provided to introduce a selection of concepts in a simplified form which are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. BRIEF DESCRIPTION OF THE DRAWINGS [0011] The foregoing summary, as well as the following detailed description of the illustrative embodiments, is best understood when read in conjunction with the accompanying drawings. For the purpose of illustrating the embodiments, exemplary constructions of the embodiments are shown in the drawings; however, the embodiments are not limited to the specific methods and instrumentalities disclosed. In the drawings: [0012] Figure 1A is a block diagram illustrating an exemplary system in which a source device transmits an encoded bit stream to a receiving device; [0013] Figure 1B is a block diagram of two devices that can be used as described here; [0014] Figure 2 is a block diagram of an implementation of a signal classification and encoding system of multiple encoding modes; [0015] Figure 3 is an operational flow of an implementation of a method for classifying audio; [0016] Figure 4 is a diagram of an exemplary mobile station; and [0017] Figure 5 shows an exemplary computing environment. DETAILED DESCRIPTION [0018] The disclosed embodiments present classification techniques for a variety of encoding modes in environments with various types of audio, such as speech and music. The types of audio frames can be identified reliably and accurately for encoding in the most efficient way. 
Although these examples and this description refer to audio frames here, more generally, portions of audio signals are contemplated and can be used according to the implementations described here. [0019] Figure 1A is a block diagram illustrating an exemplary system 10 in which a source device 12a transmits an encoded bit stream via a communication link 15 to a receiving device 14a. The bit stream can be represented as one or more packets. The source device 12a and the receiving device 14a can be digital devices. Specifically, the source device 12a can encode data in accordance with the 3GPP2 EVRC-B standard, or similar standards that make use of packetized encoded data for speech compression. One or both of the devices 12a, 14a of system 10 can implement selections of encoding modes (based on different encoding models) and encoding rates for audio compression (for example, speech and/or music), as described in more detail below, to improve the audio encoding process. An exemplary mobile station, which may comprise a source device or a receiving device, is described with reference to Figure 4. [0020] Communication link 15 may comprise a wireless link, a physical transmission line, optical fibers, a packet-based network such as a local area network, a wide area network, or a global network such as the Internet, a public switched telephone network (PSTN), or any other communication link capable of transferring data. The communication link 15 can be coupled to a storage medium. Thus, the communication link 15 represents any suitable communication medium, or possibly a collection of different networks and links, for transmitting the compressed speech data from the source device 12a to the receiving device 14a. [0021] The source device 12a may include one or more microphones 16 that capture sound. The continuous sound is sent to the digitizer 18. The digitizer 18 samples the sound at discrete intervals and quantizes (digitizes) the speech. 
The digitized speech can be stored in memory 20 and/or sent to an encoder 22, where the digitized speech samples can be encoded, typically over 20 ms frames. [0022] More specifically, the encoder divides the incoming speech signal into blocks of time, or analysis frames or portions. The duration of each time segment (or frame) is typically selected so that it is short enough that the spectral envelope of the signal can be expected to remain relatively stationary. For example, a typical frame length is 20 milliseconds (20 ms), which corresponds to 160 samples at a typical sampling rate of eight kilohertz (8 kHz), although any frame length or sampling rate deemed suitable for a specific application can be used. [0023] The encoding process carried out in the encoder 22 produces one or more packets for sending to the transmitter 24, which can be transmitted via the communication link 15 to the receiving device 14a. For example, the encoder analyzes the incoming frame to extract certain relevant parameters, and then quantizes the parameters into a binary representation, that is, into a set of bits or a binary data packet. The data packets are transmitted over the communication channel (i.e., a wired and/or wireless network connection) to a receiver and a decoder. The decoder processes the data packets, inverse quantizes them to produce the parameters, and re-synthesizes the audio frames using the inverse-quantized parameters. [0024] Encoder 22 may include, for example, distinct hardware, software or firmware, or one or more digital signal processors (DSPs) that run programmable software modules to control the encoding techniques, as described here. Memory and associated logic circuitry can be provided to support the DSP in controlling the encoding techniques. As will be described, encoder 22 can perform more robustly if encoding modes and rates can be changed before and/or during encoding, depending on whether a speech frame or a music frame has been determined to be encoded. 
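As a sketch of the frame segmentation described in paragraph [0022] (assuming a plain array of samples and non-overlapping frames, neither of which the text mandates), the 20 ms framing at an 8 kHz sampling rate can be expressed as:

```python
import numpy as np

FRAME_MS = 20
SAMPLE_RATE_HZ = 8000
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 160 samples per frame

def split_into_frames(samples):
    """Split a digitized signal into consecutive 20 ms analysis frames,
    discarding any trailing partial frame."""
    samples = np.asarray(samples)
    n_frames = len(samples) // FRAME_LEN
    return samples[:n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
```

For example, one second of audio at 8 kHz (8000 samples) yields 50 frames of 160 samples each.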
[0025] The receiving device 14a can take the form of any digital audio device capable of receiving and decoding audio data. For example, the receiving device 14a may include a receiver 26 for receiving packets from the transmitter 24, for example, via intermediate links, routers, other network equipment, and the like. The receiving device 14a may also include a decoder 28 to decode the one or more packets, and one or more speakers 30 to allow a user to hear the reconstructed audio after the decoding of the packets by the speech decoder 28. [0026] In some cases, a source device 12b and a receiving device 14b may each include a speech encoder/decoder (codec) 32, as shown in Figure 1B, for encoding and decoding digital audio data. Specifically, the source device 12b and the receiving device 14b can include transmitters and receivers as well as memory and speakers. Many of the encoding techniques considered here are described in the context of a digital audio device that includes an encoder to compress speech and/or music. [0027] It is understood, however, that the encoder can form part of a codec 32. In this case, the codec can be implemented within hardware, software, firmware, a DSP, a microprocessor, a general purpose processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), discrete hardware components, or various combinations thereof. In addition, it is understood by those skilled in the art that encoders can be implemented with a DSP, an ASIC, discrete gate logic, firmware, or any conventional programmable software module and a microprocessor. The software module could reside in RAM, flash memory, registers, or any other form of writable storage medium known in the art. Alternatively, any conventional processor, controller, or state machine could be substituted for the microprocessor. An exemplary computing device is described with reference to Figure 5. 
[0028] Figure 2 is a block diagram of an implementation of a signal classification and encoding system of multiple encoding modes 200. In one implementation, system 200 can be used with a device, such as a source device or a receiving device described with reference to Figures 1A and 1B. For example, system 200 may operate in conjunction with encoder 22 of source device 12a. The signal classification and encoding system of multiple encoding modes 200 comprises an initial classifier 210 (also referred to as a first classifier) and a refined classifier 220 (also referred to as a second classifier). System 200 also comprises a refined classifier selection switch 230 which can be set (for example, by a user) to enable or disable the refined classifier 220 and its associated more refined classification functionality. [0030] Various types of encoders are included within the system 200, such as speech encoders and a music encoder. In one implementation, a first encoding mode, referred to as "encoding mode 1" 240 (such as a code-excited linear prediction (CELP) type encoder, or a speech encoding mode, for example) can be provided and used responsive to classification by the initial classifier 210. A second encoding mode, referred to as "encoding mode 2" 260 (such as a hybrid transform/CELP encoder, or a second speech encoding mode, for example) can be provided and used responsive to classification by the refined classifier 220. [0031] A third encoding mode, referred to as "encoding mode 3" 250 (such as a transform encoder, or a music encoding mode, for example) can be provided and used responsive to classification by the initial classifier 210 and/or by the refined classifier 220. These types of encoding modes and encoders are well known, and additional descriptions are omitted for the sake of brevity. The exemplary encoding modes and encoders described for encoding modes 1, 2 and 3 are examples only and are not intended to be limiting. 
Any appropriate speech encoding mode(s) and/or encoder(s) and music encoding mode(s) and/or encoder(s) may be used. [0032] Figure 3 is an operational flow of an implementation of a method 300 for classifying audio. At 310, the initial classifier 210 receives an input audio frame (or another portion of an audio signal, in order to classify the audio signal portion as a speech-like audio signal or a music-like audio signal) and classifies it as speech or music at 320. The initial classifier 210 can be any classifier that classifies the frame or portion of audio as speech or music. [0033] In some implementations, the initial classifier 210 may comprise more than one classifier (shown at 320 as "classifier 1" and "classifier 2", although any number of classifiers can be used depending on the implementation). For example, the initial classifier may comprise a classifier that is directed entirely towards speech, and another, different classifier such as a classifier that is directed more towards music. These two classifiers can operate on the input frame sequentially or, in some cases, simultaneously (depending on the implementation) at 320, with their results being combined to form a result that is sent either to 330 or to 340. [0034] There is a small probability that speech will be detected as music by the initial classifier 210. As such, some speech frames may initially be classified as music. For example, speech in the presence of very low-level background music, or a singing voice, which are representative of speech, may not be classified as speech by the initial classifier 210. Instead, the initial classifier 210 may classify such signals as music. The presence of other background noise, such as vehicle horns in street noise or ringing in a typical office, for example, can also contribute to the misclassification of speech as music. 
[0035] If the frame is determined at 320 to be a speech frame by the initial classifier 210, then the frame is provided to encoding mode 1 240 (for example, a CELP type encoder) for encoding at 330. In some implementations, any known CELP type encoding can be used. [0036] If, on the other hand, the frame is determined at 320 to be a music frame by the initial classifier 210, then it is determined at 340 whether the more refined classification is enabled (for example, by the user having previously enabled the feature via a selection switch on the device with "on" and "off" positions corresponding to "enabled" and "not enabled", respectively). This more refined classification is a second classification stage that reinforces the decision of the first classification. In an implementation, the more refined classification for processing the audio data can be selectively enabled by a user. [0037] If the more refined classification is not enabled, as determined at 340, then the frame is provided to encoding mode 3 250 (for example, a transform encoder) for encoding as a music frame at 350. However, if the more refined classification is enabled, as determined at 340, then the frame is provided to the refined classifier 220 at 360 for an additional, more refined classification. The more refined classification is used to further distinguish a speech frame from a music frame. [0038] In one implementation, the more refined classification is used to confirm that the frame is similar to broadband noise, which is a characteristic of certain types of music, as opposed to the tonal and/or quasi-stationary characteristics of voiced speech. If the more refined classification at 360 results in the frame being identified as a music frame, then the frame is sent to encoding mode 3 for encoding as a music frame at 350. [0039] If the more refined classification at 360 results in the frame being identified as a speech frame, then the frame is sent to encoding mode 2 260 for encoding as a speech frame at 370. 
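The routing performed by method 300 can be sketched as follows (a minimal sketch only; the classifier callables and mode labels are hypothetical stand-ins for the components of Figure 2, and the numeric comments refer to the steps of Figure 3 as described above):

```python
def select_encoding_mode(frame, initial_classifier, refined_classifier,
                         refined_enabled):
    """Route an audio frame to an encoding mode, following method 300."""
    if initial_classifier(frame) == "speech":      # 320 -> 330
        return "mode 1"    # e.g., CELP-type speech encoder
    # The initial classification says music (320 -> 340).
    if not refined_enabled:                        # 340 -> 350
        return "mode 3"    # e.g., transform (music) encoder
    if refined_classifier(frame) == "speech":      # 360 -> 370
        return "mode 2"    # e.g., hybrid transform/CELP encoder
    return "mode 3"                                # 360 -> 350
```

Note that the refined classifier is consulted only for frames the initial classifier labeled as music, mirroring the asymmetry of the flow: a speech decision at 320 is final.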
As noted above, in an implementation, encoding mode 2 260 may be a hybrid transform/CELP encoder, which may be used for encoding structured and/or tonal speech frames. In an alternative implementation, encoding mode 2 260 at 370 may be a CELP type encoder, such as the encoding mode 1 used at 330. [0040] In an implementation, the more refined classification carried out at 360 (for example, by the refined classifier 220) can compare various characteristics or aspects of the frame with one or more thresholds to determine whether the frame is a speech frame or a music frame. [0041] In some implementations, the voicing of the frame can be compared with a first threshold THR1. If the frame's voicing is greater than THR1, then the frame is determined to be a speech frame. An exemplary value for THR1 is 0.99, although any value can be used depending on the implementation. Voicing ranges from 0 (corresponding to no correlation with a speech frame) to 1 (corresponding to a high correlation with a speech frame). [0042] In some implementations, the weighted signal correlation can be compared with a second threshold THR2. If the weighted signal correlation is greater than THR2, then the frame is determined to be a speech frame. An exemplary value for THR2 is 0.87, although any value can be used depending on the implementation. The signal correlation ranges from 0 (corresponding to random noise) to 1 (corresponding to highly structured sound). [0043] In some implementations, the long-term pitch gain can be compared with a third threshold THR3. If the long-term pitch gain is greater than THR3, then the frame is determined to be a speech frame. An exemplary value for THR3 is 0.5, although any value can be used depending on the implementation. The long-term pitch gain is the normalized cross-correlation between the past excitation and the current prediction residual. 
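The normalized cross-correlation just mentioned can be sketched as follows (an illustrative computation only; the function name is hypothetical, and the codec's actual windowing, lag search, and weighting are not specified in this text):

```python
import numpy as np

def long_term_pitch_gain(past_excitation, current_residual):
    """Normalized cross-correlation between the past excitation and the
    current prediction residual, clamped to the [0, 1] range described
    in the text."""
    x = np.asarray(past_excitation, dtype=float)
    y = np.asarray(current_residual, dtype=float)
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    if denom == 0.0:
        return 0.0   # no energy: the past frame carries no information
    return float(np.clip(np.dot(x, y) / denom, 0.0, 1.0))
```

A value near 1 indicates that the past frame's residual predicts the current frame well; a value near 0 indicates it does not.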
The long-term pitch gain ranges from 0 (indicating that the residual error of the past frame is not adequate for representing the current frame) to 1 (indicating that the residual error of the past frame can completely represent the current frame). [0044] In some implementations, the tonality of the current frame can be determined and compared with a fourth threshold THR4. The tonality of a signal can be measured using a spectral flatness measure or a spectral peak-to-average ratio measure. If the spectrum does not contain any prominent localized peaks, then the spectral flatness measure would tend to be close to 1. On the other hand, if the spectrum exhibits a strong slope with localized peaks, then the spectral flatness measure would be close to 0. If the tonality is greater than THR4, then the frame is determined to be a speech frame. An exemplary value for THR4 is 0.75, although any value can be used depending on the implementation. [0045] Additionally or alternatively, in some implementations, it can be determined whether there is any signal activity. If there is no signal activity (that is, the frame is silent), then it is determined that there is no useful signal to be encoded, and the frame can be encoded as a speech frame. [0046] In some implementations, if none of the conditions for determining at 360 that the frame is a speech frame are met, then it is determined that the frame is a music frame. [0047] The comparisons and thresholds described here are not intended to be limiting, since any one or more comparisons and/or thresholds can be used depending on the implementation. Additional and/or alternative comparisons and thresholds can also be used, depending on the implementation. [0048] Thus, in an implementation, if a frame is initially classified (by the initial classifier 210) as speech, it is passed to a CELP encoder. If the frame is classified as music, however, then it is checked whether the more refined classification is enabled or not. 
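The threshold tests of paragraphs [0041] through [0046] can be sketched as follows (a sketch under assumptions: the feature values are taken as precomputed inputs in [0, 1], the exemplary threshold values from the text are used, the criteria are combined with a simple OR, and tonality is illustrated as one minus spectral flatness, which the text allows but does not mandate):

```python
import numpy as np

THR1, THR2, THR3, THR4 = 0.99, 0.87, 0.5, 0.75   # exemplary values from the text

def spectral_flatness(magnitude_spectrum):
    """Geometric mean / arithmetic mean of the magnitude spectrum:
    near 1 for spectra without prominent localized peaks, near 0 for
    strongly peaked (tonal) spectra."""
    mag = np.abs(np.asarray(magnitude_spectrum, dtype=float)) + 1e-12
    return float(np.exp(np.mean(np.log(mag))) / np.mean(mag))

def is_speech_frame(voicing, weighted_correlation, pitch_gain, tonality,
                    has_signal_activity=True):
    """Refined classification: any single criterion exceeding its threshold
    marks the frame as speech; a silent frame is also encoded as speech."""
    if not has_signal_activity:
        return True
    return (voicing > THR1
            or weighted_correlation > THR2
            or pitch_gain > THR3
            or tonality > THR4)
```

Whether the criteria are combined by a simple OR, as here, or by some other "logical combination" is left open by the text and is implementation-dependent.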
The more refined classification can be enabled using an external user control. If the more refined classification is not enabled, then the frame that was initially classified as music is sent to a transform encoder for encoding. If the more refined classification is enabled, then a logical combination of certain criteria (for example, voicing, modified correlation, signal activity, long-term pitch gain, etc.) is used to select between a transform encoder and a hybrid transform/CELP encoder. The thresholds THR1, THR2, etc. are determined experimentally and depend on sample rates and signal types, for example. [0049] In an implementation, strongly tonal signals are not encoded in MDCT modes (which have no adaptive codebook) and are instead provided to linear predictive coding (LPC) modes that use an adaptive codebook. [0050] The components of the encoders and classifiers described here can be implemented as electronic hardware, as computer software, or as combinations of both. These components are described here in terms of their functionality. Whether the functionality is implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art will recognize the interchangeability of hardware and software under these circumstances, and how best to implement the described functionality for each specific application. [0051] As used here, the term "determining" (and its grammatical variants) is used in an extremely broad sense. The term "determining" encompasses a wide variety of actions, and therefore "determining" can include calculating, computing, processing, deriving, investigating, looking up (for example, looking up in a table, a database, or another data structure), ascertaining, and the like. In addition, "determining" may include receiving (for example, receiving information), accessing (for example, accessing data in a memory), and the like. 
In addition, "determining" may include resolving, selecting, choosing, establishing, and the like. [0052] The term "signal processing" (and its grammatical variants) can refer to the processing and interpretation of signals. Signals of interest can include sound, images, and many others. Processing of such signals may include storage and reconstruction, separation of information from noise, compression, and feature extraction. The term "digital signal processing" can refer to the study of signals in a digital representation and the methods for processing these signals. Digital signal processing is an element of many communication technologies such as mobile stations, non-mobile stations, and the Internet. The algorithms that are used for digital signal processing can be executed using specialized computers, which can make use of specialized microprocessors called digital signal processors (sometimes abbreviated as DSPs). [0053] Unless otherwise indicated, any disclosure of an operation of an apparatus having a specific characteristic is also expressly intended to disclose a method having an analogous characteristic (and vice versa), and any disclosure of an operation of an apparatus according to a specific configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa). [0054] Figure 4 shows a block diagram of a model of an exemplary mobile station 400 in a wireless communication system. Mobile station 400 can be a cell phone, a terminal, a handset, a PDA, a wireless modem, a cordless phone, etc. The wireless communication system can be a CDMA system, a GSM system, etc. [0055] The mobile station 400 is capable of providing bidirectional communication through a reception path and a transmission path. In the reception path, the signals transmitted by the base stations are received by an antenna 412 and provided to a receiver (RCVR) 414. 
The receiver 414 conditions and digitizes the received signal and supplies the samples to a digital section 420 for further processing. In the transmission path, a transmitter (TMTR) 416 receives the data to be transmitted from the digital section 420, processes and conditions the data, and generates a modulated signal, which is transmitted through the antenna 412 to the base stations. Receiver 414 and transmitter 416 can be part of a transceiver that can support CDMA, GSM, etc. [0056] The digital section 420 includes various processing, interface, and memory units such as, for example, a modem processor 422, a reduced instruction set computer/digital signal processor (RISC/DSP) 424, a controller/processor 426, an internal memory 428, a generalized audio encoder 432, a generalized audio decoder 434, a graphics/display processor 436, and an external bus interface (EBI) 438. Modem processor 422 can perform processing for data transmission and reception, for example, encoding, modulation, demodulation, and decoding. RISC/DSP 424 can perform general and specialized processing for wireless device 400. Controller/processor 426 can direct the operation of the various processing and interface units within digital section 420. Internal memory 428 can store data and/or instructions for the various units within digital section 420. [0057] The generalized audio encoder 432 can encode the input signals from an audio source 442, a microphone 443, etc. The generalized audio decoder 434 can decode the encoded audio data and can provide the output signals to a speaker/handset 444. The graphics/display processor 436 can perform processing for graphics, videos, images, and text, which can be presented on a display unit 446. EBI 438 can facilitate the transfer of data between digital section 420 and a main memory 448. [0058] The digital section 420 can be implemented with one or more processors, DSPs, microprocessors, RISCs, etc. 
The digital section 420 can also be fabricated on one or more application-specific integrated circuits (ASICs) and/or some other type of integrated circuits (ICs). [0059] Figure 5 shows an exemplary computing environment in which exemplary implementations and aspects can be implemented. The computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation regarding the scope of use or functionality. [0060] Computer-executable instructions, such as program modules, being executed by a computer can be used. Program modules generally include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. Distributed computing environments can be used, where tasks are performed by remote processing devices that are connected via a communications network or another means of data transmission. In a distributed computing environment, program modules and other data can be located on both local and remote computer storage media, including memory storage devices. [0061] With reference to Figure 5, an exemplary system for implementing the aspects described here includes a computing device, such as computing device 500. In its most basic configuration, computing device 500 typically includes at least one processing unit 502 and memory 504. Depending on the exact configuration and type of computing device, memory 504 can be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This most basic configuration is illustrated in Figure 5 by the dashed line 506. [0062] The computing device 500 may have additional features and/or functionality. For example, computing device 500 may include additional storage media (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. 
Such additional storage media are illustrated in Figure 5 by the removable storage medium 508 and the non-removable storage medium 510. [0063] The computing device 500 typically includes a variety of computer-readable media. The computer-readable media can be any available media that can be accessed by the device 500 and include volatile and non-volatile media, and removable and non-removable media. Computer storage media include volatile and non-volatile, and removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Memory 504, removable storage medium 508, and non-removable storage medium 510 are all examples of computer storage media. Computer storage media include, but are not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical storage media, magnetic cassettes, magnetic tape, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the computing device 500. Any such computer storage medium may be part of the computing device 500. [0064] The computing device 500 may contain communication connection(s) 512 that allow the device to communicate with other devices. Computing device 500 may also have input device(s) 514 (such as a keyboard, mouse, pen, voice input device, touch input device, etc.). Output device(s) 516, such as a display, speakers, printers, etc., can also be included. All of these devices are well known in the art and need not be discussed in detail here. 
[0065] In general, any device described here can represent various types of devices, such as a cordless or wired telephone, a cell phone, a laptop computer, a wireless multimedia device, a wireless PC card, a PDA, an external or internal modem, a device that communicates over a wireless or wired channel, etc. A device can have various names, such as access terminal (AT), access unit, subscriber unit, mobile station, mobile device, mobile unit, mobile phone, remote station, remote terminal, remote unit, user device, user equipment, handheld device, non-mobile station, non-mobile device, endpoint, etc. Any device described here can have a memory to store instructions and data, as well as hardware, software, firmware, or combinations thereof. [0066] The techniques described here can be implemented by various means. For example, these techniques can be implemented in hardware, firmware, software, or a combination thereof. Those skilled in the art would further appreciate that the various logic blocks, modules, circuits, and algorithm steps described in connection with the present disclosure can be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art can implement the described functionality in various ways for each specific application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure. 
[0067] For a hardware implementation, the processing units used to perform the techniques can be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), FPGAs, processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, a computer, or a combination thereof. [0068] Thus, the various illustrative logic blocks, modules, and circuits described in connection with the disclosure presented herein can be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor can be a microprocessor, but in the alternative, the processor can be any conventional processor, controller, microcontroller, or state machine. A processor can also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. [0069] For a firmware and/or software implementation, the techniques can be embodied as instructions on a computer-readable medium, such as RAM, ROM, non-volatile RAM, programmable ROM, EEPROM, flash memory, compact disc (CD), magnetic or optical data storage device, or the like. The instructions can be executable by one or more processors and can cause the processor(s) to perform certain aspects of the functionality described herein. [0070] If implemented in software, the functions can be stored on or transmitted over a computer-readable medium as one or more instructions or code. 
Computer-readable media include computer storage media and communication media, including any medium that facilitates the transfer of a computer program from one place to another. A storage medium can be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other solid-state, optical, or magnetic data storage medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media. [0071] A software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. 
Alternatively, the storage medium can be integral to the processor. The processor and the storage medium can reside in an ASIC. The ASIC can reside in a user terminal. Alternatively, the processor and the storage medium can reside as discrete components in a user terminal. [0072] The previous description of the disclosure is provided to enable those skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein can be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not to be limited to the examples described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. [0073] Although exemplary implementations may refer to the use of aspects of the presently disclosed subject matter in the context of one or more standalone computer systems, the subject matter is not so limited, but rather can be implemented in connection with any computing environment, such as a network or distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices could include PCs, network servers, and handheld devices, for example. [0074] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as exemplary forms of implementing the claims.
Claims:
Claims (15) [0001] 1. Method CHARACTERIZED by comprising: receiving a portion of an audio signal in a first classifier; classifying the portion of the audio signal in the first classifier as speech or as music; and processing the portion of the audio signal, wherein processing the portion of the audio signal comprises: if the portion is classified by the first classifier as speech, then encoding the portion using a first encoding mode; and if the portion is classified by the first classifier as music, then: providing the portion to a second classifier; classifying the portion in the second classifier as speech or as music; and encoding the portion of the audio signal, wherein encoding the portion of the audio signal comprises: if the portion is classified in the second classifier as speech, then encoding the portion using a second encoding mode; or if the portion is classified in the second classifier as music, then encoding the portion using a third encoding mode.
[0002] 2. Method according to claim 1, CHARACTERIZED in that the portion of the audio signal is a frame.
[0003] 3. Method according to claim 1, CHARACTERIZED in that the first encoding mode comprises a first speech encoder, the second encoding mode comprises a second speech encoder, and the third encoding mode comprises a music encoder.
[0004] 4. Method according to claim 3, CHARACTERIZED in that the first speech encoder is a code-excited linear prediction (CELP) type encoder, the second speech encoder is a hybrid transform/CELP encoder, and the music encoder is a transform encoder.
[0005] 5. Method according to claim 1, CHARACTERIZED by further comprising determining whether the second classifier is enabled before providing the portion to the second classifier and, if the second classifier is not enabled, encoding the portion with the third encoding mode.
[0006] 6. Method according to claim 1, CHARACTERIZED in that classifying the portion in the second classifier as speech or as music comprises comparing a plurality of characteristics of the portion with one or more thresholds to classify whether the portion has characteristics of music or characteristics of voiced speech.
[0007] 7. Method according to claim 6, CHARACTERIZED in that the characteristics of music comprise broadband-noise-like characteristics of music, and the characteristics of voiced speech comprise at least one of tonal characteristics of voiced speech or quasi-stationary characteristics of voiced speech.
[0008] 8. Method according to claim 1, CHARACTERIZED in that classifying the portion in the second classifier as speech or as music comprises at least one of comparing the voicing of the portion with a first threshold, comparing the modified correlation with a second threshold, or comparing the long-term pitch gain with a third threshold.
[0009] 9. Method according to claim 8, CHARACTERIZED in that the voicing ranges from 0, corresponding to no correlation with speech, to 1, corresponding to high correlation with speech; the modified correlation ranges from 0, corresponding to random noise, to 1, corresponding to highly structured sound; the long-term pitch gain is the normalized correlation between the past excitation and the current prediction residual; and the long-term pitch gain ranges from 0, indicating that the residual error of the past portion is not adequate to represent the current portion, to 1, indicating that using the residual error of the past portion can completely represent the current portion.
[0010] 10. Method according to claim 1, CHARACTERIZED in that classifying the portion in the second classifier as speech or as music comprises determining whether there is any signal activity in the portion and, if there is no signal activity, determining that there is no useful signal to encode and encoding the portion as speech.
[0011] 11. Apparatus CHARACTERIZED by comprising: means for receiving a portion of an audio signal in a first classifier; means for classifying the portion of the audio signal in the first classifier as speech or as music; means for encoding the portion using a first encoding mode if the portion is classified by the first classifier as speech, or for classifying the portion in a second classifier as speech or as music when the portion is classified by the first classifier as music; and means for encoding the portion using a second encoding mode when the portion is classified in the second classifier as speech, or for encoding the portion using a third encoding mode when the portion is classified in the second classifier as music.
[0012] 12. Apparatus according to claim 11, CHARACTERIZED by further comprising means for determining whether the second classifier is enabled before providing the portion to the second classifier and, if the second classifier is not enabled, encoding the portion with the third encoding mode.
[0013] 13. Apparatus according to claim 11, CHARACTERIZED in that the means for classifying the portion in the second classifier as speech or as music comprise means for comparing a plurality of characteristics of the portion with one or more thresholds to classify whether the portion has characteristics of music or characteristics of voiced speech.
[0014] 14. Memory CHARACTERIZED by comprising instructions for causing a processor to perform the method as defined in any one of claims 1 to 10.
[0015] 15. System CHARACTERIZED by comprising: a first classifier that receives a portion of an audio signal, classifies the portion of the audio signal as speech or as music, and processes the portion of the audio signal, wherein processing the portion of the audio signal comprises: if the portion is classified as speech, encoding the portion using a first encoding mode, or, if the portion is classified as music, providing the portion to a second classifier; and the second classifier, which, if the portion is classified by the first classifier as music, classifies the portion as speech or as music and encodes the portion of the audio signal, wherein encoding the portion of the audio signal comprises: if the portion is classified in the second classifier as speech, encoding the portion using a second encoding mode; or, if the portion is classified in the second classifier as music, encoding the portion using a third encoding mode.
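The two-stage mode selection recited in claims 1 to 10 can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the function names, threshold values, and the autocorrelation-based voicing measure are assumptions made for illustration only.

```python
import numpy as np


def voicing(frame, pitch_lag):
    """Normalized autocorrelation at the pitch lag: near 0 for noise-like
    portions, near 1 for strongly periodic (voiced) portions (cf. claim 9)."""
    x, y = frame[pitch_lag:], frame[:-pitch_lag]
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0


def second_stage_is_speech(v, modified_corr, pitch_gain,
                           v_thr=0.6, corr_thr=0.7, gain_thr=0.5):
    """Claim 8: treat the portion as tonal/structured speech if any of the
    three features exceeds its threshold. Threshold values are invented."""
    return v > v_thr or modified_corr > corr_thr or pitch_gain > gain_thr


def select_encoding_mode(first_stage_label, features):
    """Claims 1 and 4: CELP for first-stage speech; otherwise refine the
    music decision and pick a hybrid transform/CELP or a transform coder."""
    if first_stage_label == "speech":
        return "CELP"                      # first encoding mode
    if second_stage_is_speech(features["voicing"],
                              features["modified_corr"],
                              features["pitch_gain"]):
        return "HYBRID_TRANSFORM_CELP"     # second encoding mode
    return "TRANSFORM"                     # third encoding mode (music)
```

For example, a frame labeled music by the first stage but with voicing well above the first threshold would be routed to the hybrid transform/CELP mode rather than the transform music coder, which is the refinement the second classifier exists to perform.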
Similar technologies:
Publication number | Publication date | Patent title BR112014017001B1|2020-12-22|signal classification of multiple encoding modes KR102317296B1|2021-10-26|Voice profile management and speech signal generation RU2419167C2|2011-05-20|Systems, methods and device for restoring deleted frame CN106409313B|2021-04-20|Audio signal classification method and device KR101721303B1|2017-03-29|Voice activity detection in presence of background noise CA2658560C|2014-07-22|Systems and methods for modifying a window with a frame associated with an audio signal JP2017515147A|2017-06-08|Keyword model generation to detect user-defined keywords US9143571B2|2015-09-22|Method and apparatus for identifying mobile devices in similar sound environment CN101322182B|2011-11-23|Systems, methods, and apparatus for detection of tonal components RU2584461C2|2016-05-20|Noise-robust speech coding mode classification JP2014505898A|2014-03-06|Adaptive processing by multiple media processing nodes EP2956939B1|2017-11-01|Personalized bandwidth extension Ren et al.2016|AMR steganalysis based on second-order difference of pitch delay Ren et al.2019|An AMR adaptive steganographic scheme based on the pitch delay of unvoiced speech Gibson2015|Challenges in speech coding research KR100757858B1|2007-09-11|Optional encoding system and method for operating the system
Patent family:
Publication number | Publication date JP5964455B2|2016-08-03| SI2803068T1|2016-07-29| US9111531B2|2015-08-18| CN104040626A|2014-09-10| CN104040626B|2017-08-11| EP2803068A1|2014-11-19| DK2803068T3|2016-05-23| HUE027037T2|2016-08-29| WO2013106192A1|2013-07-18| JP2015507222A|2015-03-05| KR20170005514A|2017-01-13| EP2803068B1|2016-04-13| KR20140116487A|2014-10-02| US20130185063A1|2013-07-18| BR112014017001A8|2017-07-04| IN2014MN01588A|2015-05-08| ES2576232T3|2016-07-06| BR112014017001A2|2017-06-13|
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title EP1107231B1|1991-06-11|2005-04-27|QUALCOMM Incorporated|Variable rate vocoder| US5778335A|1996-02-26|1998-07-07|The Regents Of The University Of California|Method and apparatus for efficient multiband celp wideband speech and music coding and decoding| US6493665B1|1998-08-24|2002-12-10|Conexant Systems, Inc.|Speech classification and parameter weighting used in codebook search| US7072832B1|1998-08-24|2006-07-04|Mindspeed Technologies, Inc.|System for speech encoding having an adaptive encoding arrangement| US7272556B1|1998-09-23|2007-09-18|Lucent Technologies Inc.|Scalable and embedded codec for speech and audio signals| US6691084B2|1998-12-21|2004-02-10|Qualcomm Incorporated|Multiple mode variable rate speech coding| JP2000267699A|1999-03-19|2000-09-29|Nippon Telegr & Teleph Corp <Ntt>|Acoustic signal coding method and device therefor, program recording medium therefor, and acoustic signal decoding device| CN1242379C|1999-08-23|2006-02-15|松下电器产业株式会社|Voice encoder and voice encoding method| US6604070B1|1999-09-22|2003-08-05|Conexant Systems, Inc.|System of encoding and decoding speech signals| US6625226B1|1999-12-03|2003-09-23|Allen Gersho|Variable bit rate coder, and associated method, for a communication station operable in a communication system| US6697776B1|2000-07-31|2004-02-24|Mindspeed Technologies, Inc.|Dynamic signal detector system and method| US6694293B2|2001-02-13|2004-02-17|Mindspeed Technologies, Inc.|Speech coding system with a music classifier| US6785645B2|2001-11-29|2004-08-31|Microsoft Corporation|Real-time speech and music classifier| US6829579B2|2002-01-08|2004-12-07|Dilithium Networks, Inc.|Transcoding method and system between CELP-based speech codes| US7657427B2|2002-10-11|2010-02-02|Nokia Corporation|Methods and devices for source controlled variable bit-rate wideband speech coding| US7363218B2|2002-10-25|2008-04-22|Dilithium Networks Pty. 
Ltd.|Method and apparatus for fast CELP parameter mapping| FI118834B|2004-02-23|2008-03-31|Nokia Corp|Classification of audio signals| AT457512T|2004-05-17|2010-02-15|Nokia Corp|AUDIOCODING WITH DIFFERENT CODING FRAME LENGTHS| US8010350B2|2006-08-03|2011-08-30|Broadcom Corporation|Decimated bisectional pitch refinement| CN1920947B|2006-09-15|2011-05-11|清华大学|Voice/music detector for audio frequency coding with low bit ratio| CN101197130B|2006-12-07|2011-05-18|华为技术有限公司|Sound activity detecting method and detector thereof| KR100964402B1|2006-12-14|2010-06-17|삼성전자주식회사|Method and Apparatus for determining encoding mode of audio signal, and method and appartus for encoding/decoding audio signal using it| KR100883656B1|2006-12-28|2009-02-18|삼성전자주식회사|Method and apparatus for discriminating audio signal, and method and apparatus for encoding/decoding audio signal using it| CN101226744B|2007-01-19|2011-04-13|华为技术有限公司|Method and device for implementing voice decode in voice decoder| KR100925256B1|2007-05-03|2009-11-05|인하대학교 산학협력단|A method for discriminating speech and music on real-time| CN101393741A|2007-09-19|2009-03-25|中兴通讯股份有限公司|Audio signal classification apparatus and method used in wideband audio encoder and decoder| CN101399039B|2007-09-30|2011-05-11|华为技术有限公司|Method and device for determining non-noise audio signal classification| CN101221766B|2008-01-23|2011-01-05|清华大学|Method for switching audio encoder| AU2009220321B2|2008-03-03|2011-09-22|Intellectual Discovery Co., Ltd.|Method and apparatus for processing audio signal| CN101236742B|2008-03-03|2011-08-10|中兴通讯股份有限公司|Music/ non-music real-time detection method and device| US8768690B2|2008-06-20|2014-07-01|Qualcomm Incorporated|Coding scheme selection for low-bit-rate applications| EP2144230A1|2008-07-11|2010-01-13|Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.|Low bitrate audio encoding/decoding scheme having cascaded switches| PT2301011T|2008-07-11|2018-10-26|Fraunhofer Ges Forschung|Method and 
discriminator for classifying different segments of an audio signal comprising speech and music segments| KR101261677B1|2008-07-14|2013-05-06|광운대학교 산학협력단|Apparatus for encoding and decoding of integrated voice and music| CN101751920A|2008-12-19|2010-06-23|数维科技(北京)有限公司|Audio classification and implementation method based on reclassification| CN101814289A|2009-02-23|2010-08-25|数维科技(北京)有限公司|Digital audio multi-channel coding method and system of DRA with low bit rate| JP5519230B2|2009-09-30|2014-06-11|パナソニック株式会社|Audio encoder and sound signal processing system| CN102237085B|2010-04-26|2013-08-14|华为技术有限公司|Method and device for classifying audio signals| AU2012218778B2|2011-02-15|2016-10-20|Voiceage Evs Llc|Device and method for quantizing the gains of the adaptive and fixed contributions of the excitation in a celp codec|US9589570B2|2012-09-18|2017-03-07|Huawei Technologies Co., Ltd.|Audio classification based on perceptual quality for low or medium bit rates| RU2630889C2|2012-11-13|2017-09-13|Самсунг Электроникс Ко., Лтд.|Method and device for determining the coding mode, method and device for coding audio signals and a method and device for decoding audio signals| CN104347067B|2013-08-06|2017-04-12|华为技术有限公司|Audio signal classification method and device| CN104424956B|2013-08-30|2018-09-21|中兴通讯股份有限公司|Activate sound detection method and device| US10090004B2|2014-02-24|2018-10-02|Samsung Electronics Co., Ltd.|Signal classifying method and device, and audio encoding method and device using same| CN107424622B|2014-06-24|2020-12-25|华为技术有限公司|Audio encoding method and apparatus| CN104143335B|2014-07-28|2017-02-01|华为技术有限公司|audio coding method and related device| US9886963B2|2015-04-05|2018-02-06|Qualcomm Incorporated|Encoder selection| CN104867492B|2015-05-07|2019-09-03|科大讯飞股份有限公司|Intelligent interactive system and method| KR20170019257A|2015-08-11|2017-02-21|삼성전자주식회사|Adaptive processing of audio data| US10186276B2|2015-09-25|2019-01-22|Qualcomm Incorporated|Adaptive noise 
suppression for super wideband music| US10678828B2|2016-01-03|2020-06-09|Gracenote, Inc.|Model-based media classification service using sensed media noise characteristics| WO2017117234A1|2016-01-03|2017-07-06|Gracenote, Inc.|Responding to remote media classification queries using classifier models and context parameters| JP6996185B2|2017-09-15|2022-01-17|富士通株式会社|Utterance section detection device, utterance section detection method, and computer program for utterance section detection|
Legal status:
2018-12-04| B06F| Objections, documents and/or translations needed after an examination request according to art. 34 of the industrial property law| 2019-09-10| B06U| Preliminary requirement: requests with searches performed by other patent offices: suspension of the patent application procedure| 2020-10-06| B09A| Decision: intention to grant| 2020-12-22| B16A| Patent or certificate of addition of invention granted|Free format text: TERM OF VALIDITY: 20 (TWENTY) YEARS COUNTED FROM 12/21/2012, SUBJECT TO THE LEGAL CONDITIONS. |
Priority:
Application number | Filing date | Patent title US201261586374P| true| 2012-01-13|2012-01-13| US61/586,374|2012-01-13| US13/722,669|US9111531B2|2012-01-13|2012-12-20|Multiple coding mode signal classification| US13/722,669|2012-12-20| PCT/US2012/071217|WO2013106192A1|2012-12-21|Multiple coding mode signal classification|