METHOD AND APPARATUS FOR NOISE-ROBUST SPEECH CLASSIFICATION AND COMPUTER-READABLE MEMORY
Patent Abstract:
A noise-robust speech coding mode classification is described. In a noise-robust speech classification method, classification parameters are input to a speech classifier from external components. Internal classification parameters are generated in the speech classifier from at least one of the input parameters. A normalized autocorrelation coefficient function limit is determined. A parameter analyzer is selected according to a signal environment. A speech mode classification is determined based on a noise estimate from multiple frames of the input speech.
Publication number: BR112013030117B1
Application number: R112013030117-1
Filing date: 2012-04-12
Publication date: 2021-03-30
Inventors: Ethan Robert Duni; Vivek Rajendran
Applicant: Qualcomm Incorporated
Primary IPC class:
Patent Description:
Related Applications [0001] This application relates to and claims priority from U.S. Provisional Patent Application No. 61/489,629, filed May 24, 2011, for "Noise-Robust Speech Coding Mode Classification". Field of the Invention [0002] The present description generally relates to the field of speech processing. More particularly, the described configurations relate to noise-robust speech coding mode classification. Description of the Prior Art [0003] Voice transmission by digital techniques has become widespread, particularly in long-distance digital radio telephony applications. This, in turn, has created interest in determining the smallest amount of information that can be sent through a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted simply by sampling and digitizing, a data rate on the order of 64 kilobits per second (kbps) is necessary to achieve the quality of conventional analog telephony. However, through the use of speech analysis, followed by appropriate coding, transmission and resynthesis at the receiver, a significant reduction in the data rate can be achieved. The more accurately speech analysis can be performed, the more appropriately the data can be encoded, thus reducing the data rate. [0004] Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech encoders. A speech encoder divides the input speech signal into blocks of time, or analysis frames. Speech encoders typically comprise an encoder and a decoder, or a codec. The encoder analyzes the input speech frame to extract certain relevant parameters, and then quantizes the parameters into a binary representation, that is, into a set of bits or a packet of binary data. The data packets are transmitted through the communication channel to a receiver and a decoder. The decoder processes the data packets, dequantizes them to produce the parameters, and then resynthesizes the speech frames using the dequantized parameters. [0005] Modern speech encoders can use a multi-mode encoding approach to input speech. Multi-mode variable bit rate encoders use speech classification to accurately capture and encode a high percentage of speech segments using a minimum number of bits per frame. More accurate speech classification produces a lower average encoded bit rate and higher quality decoded speech. Previously, speech classification techniques considered a minimum number of parameters for isolated speech frames only, producing few and inaccurate speech mode classifications. Thus, there is a need for a high-performance speech classifier to correctly classify the numerous speech modes under varying environmental conditions in order to allow maximum performance of multi-mode variable bit rate encoding techniques.
Brief Description of the Figures [0006] Figure 1 - is a block diagram illustrating a system for wireless communication; [0007] Figure 2A - is a block diagram illustrating a classifier system that can use noise-robust speech coding mode classification; [0008] Figure 2B - is a block diagram illustrating another classifier system that can use noise-robust speech coding mode classification; [0009] Figure 3 - is a flowchart illustrating a noise-robust classification method; [0010] Figures 4A to 4C - illustrate configurations of the mode decision-making process for noise-robust speech classification; [0011] Figure 5 - is a flow chart illustrating a method of adjusting limits for speech classification; [0012] Figure 6 - is a block diagram illustrating a speech classifier for noise-robust speech classification; [0013] Figure 7 - is a timeline graph illustrating a configuration of a received speech signal with associated parameter values and speech mode classifications; and [0014] Figure 8 - illustrates certain components that can be included within a wireless device / electronic device. Detailed Description of the Invention [0015] The function of a speech encoder is to compress the digitized speech signal into a low bit rate signal by removing all natural redundancies inherent in speech. Digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of Ni bits and the data packet produced by the speech encoder has a number of No bits, the compression factor achieved by the speech encoder is Cr = Ni/No. The challenge is to retain the high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech encoder depends on (1) how well the speech model, or the combination of the analysis and synthesis process described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The purpose of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame. [0016] Speech encoders can be implemented as time domain encoders, which attempt to capture the time domain speech waveform by employing high time resolution processing to encode small speech segments (typically 5 millisecond subframes) at a time. For each subframe, a high-precision representative from a codebook space is found by means of various search algorithms. Alternatively, speech encoders can be implemented as frequency domain encoders, which attempt to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employ a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors according to the quantization techniques described in A. Gersho & R.M. Gray, Vector Quantization and Signal Compression (1992). [0017] A possible time domain speech encoder is the Code Excited Linear Predictive (CELP) encoder described in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is fully incorporated herein by reference.
In a CELP encoder, short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. The application of the short-term prediction filter to the input speech frame generates an LP residue signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of coding the time domain speech waveform into the separate tasks of coding the LP short-term filter coefficients and coding the LP residue. Time domain encoding can be performed at a fixed rate (that is, using the same number of bits, N0, for each frame) or at a variable rate (in which different bit rates are used for different types of frame content). Variable rate encoders try to use only the amount of bits necessary to encode the codec parameters to a level suitable for obtaining a target quality. A possible variable rate CELP encoder is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the configurations currently described and fully incorporated herein by reference. [0018] Time domain encoders such as the CELP encoder typically rely on a high number of bits, N0, per frame to preserve the accuracy of the time domain speech waveform. Such encoders typically deliver excellent voice quality as long as the number of bits, N0, per frame is relatively large (for example, 8 kbps or above). However, at low bit rates (4 kbps and below), time domain encoders fail to retain high quality and robust performance due to the limited number of bits available. At lower bit rates, the limited codebook space limits the waveform-matching capability of conventional time domain encoders, which are so successfully deployed in higher-rate commercial applications. [0019] Typically, CELP schemes employ a short-term prediction (STP) filter and a long-term prediction (LTP) filter. An Analysis-by-Synthesis (AbS) approach is employed in an encoder to find the LTP delays and gains, in addition to the best stochastic codebook gains and indices. Current state-of-the-art CELP encoders such as the Enhanced Variable Rate Codec (EVRC) can achieve good quality synthesized speech at a data rate of approximately 8 kilobits per second. [0020] Additionally, unvoiced speech does not exhibit periodicity. The bandwidth consumed encoding the LTP filter in conventional CELP schemes is not as efficiently used for unvoiced speech as it is for voiced speech, where the pitch is strong and LTP filtering is significant. Therefore, a more efficient coding scheme (that is, with a lower bit rate) is desirable for unvoiced speech. Accurate speech classification is necessary to select the most efficient coding schemes and to achieve the lowest data rate. [0021] For coding at lower bit rates, several methods of spectral, or frequency domain, speech coding have been developed, in which the speech signal is analyzed as a time-varying evolution of spectra. See, for example, R.J. McAulay & T.F. Quatieri, Sinusoidal Coding, in Speech Coding and Synthesis ch. 4 (W.B. Kleijn & K.K. Paliwal Eds., 1995). In spectral encoders, the goal is to model, or predict, the short-term speech spectrum of each speech input frame with a set of spectral parameters, rather than accurately reproducing the time-varying speech waveform. The spectral parameters are then encoded and a speech output frame is created with the decoded parameters.
The resulting synthesized speech does not match the original input speech waveform, but offers similar perceived quality. Examples of frequency domain encoders include multiband excitation coders (MBEs), sinusoidal transform coders (STCs), and harmonic coders (HCs). Such frequency domain encoders offer a high quality parametric model having a compact set of parameters that can be accurately quantized with the low number of bits available at low bit rates. [0022] Nevertheless, low bit rate encoding imposes the critical restriction of a limited encoding resolution, or a limited codebook space, which limits the effectiveness of a single encoding mechanism, rendering the encoder unable to represent various types of speech segments under various background conditions with equal accuracy. For example, conventional low bit rate frequency domain encoders do not transmit phase information for speech frames. Instead, the phase information is reconstructed using a randomly, artificially generated initial phase value and linear interpolation techniques. See, for example, H. Yang et al., Quadratic Phase Interpolation for Voiced Speech Synthesis in the MBE Model, in 29 Electronic Letters 856-57 (May 1993). Since the phase information is artificially generated, even if the amplitudes of the sine waves are perfectly preserved by the quantization-dequantization process, the speech produced by the frequency domain encoder will not be aligned with the original input speech (that is, the pulses will not be synchronized). It has therefore proven difficult to adopt any closed-loop performance measure, such as, for example, signal-to-noise ratio (SNR) or perceptual SNR, in frequency domain encoders. [0023] An effective technique for encoding speech efficiently at a low bit rate is multi-mode encoding. Multi-mode coding techniques have been employed to perform low-rate speech coding in conjunction with an open-loop mode decision process. One such multi-mode coding technique is described in Amitava Das et al., Multi-mode and Variable-Rate Coding of Speech, in Speech Coding and Synthesis ch. 7 (W.B. Kleijn & K.K. Paliwal Eds., 1995). Conventional multi-mode encoders apply different modes, or encoding and decoding algorithms, to different types of input speech frames. Each mode, or encoding and decoding process, is customized to represent a particular type of speech segment, such as, for example, voiced speech, unvoiced speech, or background noise (non-speech), in the most efficient way. The success of such multi-mode coding techniques is highly dependent on correct mode decisions, or speech classifications. An external open-loop mode decision mechanism examines the input speech frame and makes a decision as to which mode to apply to the frame. The open-loop mode decision is typically made by extracting a number of parameters from the input frame, evaluating the parameters for certain temporal and spectral characteristics, and basing a mode decision on that evaluation. The mode decision is thus made without knowing the exact condition of the output speech in advance, that is, how close the output speech will be to the input speech in terms of voice quality or other performance measures. A possible open-loop mode decision for a speech codec is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the present invention and is fully incorporated herein by reference.
[0024] Multi-mode encoding can be fixed rate, using the same number of N0 bits for each frame, or variable rate, in which different bit rates are used for different modes. The goal in variable rate coding is to use only the amount of bits necessary to encode the codec parameters at a level adequate to obtain the target quality. As a result, the same target voice quality as that of a fixed-rate, higher-rate encoder can be achieved at a significantly lower average rate using variable bit rate (VBR) techniques. A possible variable rate speech encoder is described in U.S. Patent No. 5,414,796. There is currently a surge of research interest and a strong commercial need to develop a high quality speech encoder operating at medium to low bit rates (that is, in the range of 2.4 to 4 kbps and below). Application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice streaming applications, voicemail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss situations. Several recent speech coding standardization efforts are another direct force driving the research and development of low-rate speech coding algorithms. A low-rate speech encoder creates more channels, or users, per allowable application bandwidth. A low-rate speech encoder coupled with a suitable additional channel encoding layer can fit the overall bit budget of the encoder specifications and deliver robust performance under channel error conditions. [0025] Multi-mode VBR speech coding is, therefore, an efficient mechanism to encode speech at a low bit rate. Conventional multi-mode schemes require the design of efficient coding schemes, or modes, for various speech segments (e.g., unvoiced, voiced, transition) in addition to a mode for background noise, or silence. The overall performance of the speech encoder depends on the robustness of the mode classification and on how well each mode performs. The average rate of the encoder depends on the bit rates of the different modes for unvoiced, voiced, and other speech segments. In order to achieve the target quality at a low average rate, it is necessary to correctly determine the speech mode under varying conditions. Typically, voiced and unvoiced speech segments are captured at high bit rates, and background noise and silent segments are represented with modes operating at a significantly lower rate. Multi-mode variable bit rate encoders require correct speech classification to accurately capture and encode a high percentage of speech segments using a minimum number of bits per frame. A more accurate speech classification produces a lower average encoded bit rate and higher quality decoded speech. [0026] In other words, in source-controlled variable rate encoding, the performance of the frame classifier determines the average bit rate based on the characteristics of the input speech (energy, voicing, spectral slope, pitch contour, etc.). Speech classifier performance can degrade when the input speech is corrupted by noise. This can cause undesirable effects on quality and bit rate. Accordingly, methods for detecting the presence of noise and properly adjusting the classification logic can be used to ensure robust operation in real-time use cases. Additionally, speech classification techniques previously considered a minimum number of parameters for isolated speech frames only, producing few and inaccurate speech mode classifications.
Thus, there is a need for a high-performance speech classifier to correctly classify numerous speech modes under varying environmental conditions in order to allow maximum performance from multi-mode variable bit rate encoding techniques. [0027] The described configurations provide a method and apparatus for improved speech classification in vocoder applications. Classification parameters can be analyzed to produce speech classifications with relatively high accuracy. A decision-making process is used to classify speech frame by frame. Parameters derived from the original input speech can be used by a state-based decision-making element to precisely classify various speech modes. Each speech frame can be classified by analyzing past and future frames, in addition to the current frame. The speech modes that can be classified by the described configurations comprise at least transient speech, transitions to active speech and at the ends of words, voiced speech, unvoiced speech, and silence. [0028] In order to ensure robustness in the classification logic, the present systems and methods can use a multi-frame measurement of the background noise estimate (which is typically provided by standard upstream speech coding components, such as a voice activity detector) and adjust the classification logic based on it. Alternatively, an SNR can be used by the classification logic if it includes information from more than one frame, for example, if it is averaged across multiple frames. In other words, any noise estimate that is relatively stable across multiple frames can be used by the classification logic. Adjusting the classification logic may include changing one or more limits used to classify speech. Specifically, the energy limit for classifying a frame as "unvoiced" can be increased (reflecting the higher level of "silent" frames), the voice limit for classifying a frame as "unvoiced" can be increased (reflecting the corruption of voicing information under noise), the voice limit for classifying a frame as "voiced" can be reduced (again, reflecting the corruption of voicing information), or some combination of these can be applied. In the case where no noise is present, no changes need be made to the classification logic. In a loud noise configuration (for example, an SNR of 20 dB, typically the lowest SNR tested in speech codec standardization), the unvoiced energy limit can be increased by 10 dB, the unvoiced voice limit can be increased by 0.06, and the voiced voice limit can be reduced by 0.2. In this configuration, intermediate noise cases can be handled by interpolating between the "clean" and "noisy" settings, based on the input noise measurement, or by using hard limit settings for some intermediate noise level. [0029] Figure 1 is a block diagram illustrating a system 100 for wireless communication. In system 100, a first encoder 110 receives digitized speech samples s(n) and encodes the samples s(n) for transmission on a transmission medium 112, or communication channel 112, to a first decoder 114. The decoder 114 decodes the encoded speech samples and synthesizes an output speech signal sSYNTH(n). For transmission in the opposite direction, a second encoder 116 encodes digitized speech samples s(n), which are transmitted on a communication channel 118. A second decoder 120 receives and decodes the encoded speech samples, generating a synthesized output speech signal sSYNTH(n).
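As a purely illustrative reading of the limit adjustment described in paragraph [0028] above, the following Python sketch adapts three classification limits to a multi-frame noise estimate. Only the deltas (+10 dB, +0.06, -0.2) and the -25 dB clean unvoiced energy limit come from this description; the function name adjust_classification_limits, the baseline voicing values, and the interpolation end points are assumptions, not values taken from the patent.

```python
# Hypothetical sketch of the limit adjustment of paragraph [0028]; names and
# baseline voicing values are illustrative assumptions, not patent values.

def adjust_classification_limits(noise_estimate_db, clean_limits,
                                 clean_noise_db=20.0, loud_noise_db=40.0):
    """Interpolate classification limits between 'clean' and 'noisy' settings.

    noise_estimate_db -- multi-frame background noise estimate (ns_est)
    clean_limits      -- dict of limits tuned for clean speech
    clean_noise_db    -- at or below this estimate, limits are left unchanged
    loud_noise_db     -- at or above this estimate, the full noisy deltas apply
    """
    # Deltas for the loud-noise configuration described above.
    deltas = {
        "unvoiced_energy_db": +10.0,   # makes "unvoiced" easier to reach
        "unvoiced_voicing":   +0.06,   # likewise for the unvoiced voicing limit
        "voiced_voicing":     -0.20,   # makes "voiced" harder to reach
    }
    if noise_estimate_db <= clean_noise_db:
        alpha = 0.0                    # clean case: no adjustment
    elif noise_estimate_db >= loud_noise_db:
        alpha = 1.0                    # loud-noise case: full adjustment
    else:                              # intermediate noise: interpolate
        alpha = (noise_estimate_db - clean_noise_db) / (loud_noise_db - clean_noise_db)
    return {key: clean_limits[key] + alpha * deltas[key] for key in deltas}


# Example with placeholder clean-speech limits (-25 dB is mentioned in the text;
# the two voicing baselines are made up for illustration).
clean = {"unvoiced_energy_db": -25.0, "unvoiced_voicing": 0.35, "voiced_voicing": 0.75}
print(adjust_classification_limits(noise_estimate_db=30.0, clean_limits=clean))
```

A hard-limit variant would simply return the clean or noisy set depending on whether the estimate exceeds a single intermediate threshold, as the same paragraph suggests.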
[0030] Speech samples s(n) represent speech signals that have been digitized and quantized according to any of several methods including, for example, pulse code modulation (PCM), companded μ-law, or A-law. In one configuration, the speech samples s(n) are organized into frames of input data, where each frame comprises a predetermined number of digitized speech samples s(n). In one configuration, a sampling rate of 8 kHz is employed, with each 20 ms frame comprising 160 samples. In the configurations described below, the data transmission rate can vary frame by frame from 8 kbps (full rate) to 4 kbps (half rate) to 2 kbps (quarter rate) to 1 kbps (eighth rate). Alternatively, other data rates can be used. As used here, the terms "full rate" or "high rate" generally refer to data rates that are greater than or equal to 8 kbps, and the terms "half rate" or "low rate" generally refer to data rates that are less than or equal to 4 kbps. Varying the data transmission rate is beneficial since lower bit rates can be selectively used for frames containing relatively less speech information. While specific rates are described here, any suitable sampling rate, frame size, and data transmission rate can be used with the present systems and methods. [0031] The first encoder 110 and the second decoder 120 together may comprise a first speech encoder, or speech codec. Similarly, the second encoder 116 and the first decoder 114 together comprise a second speech encoder. Speech encoders can be implemented with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and a microprocessor. The software module can reside in RAM, flash memory, registers, or any other form of writable storage medium. Alternatively, any conventional processor, controller, or state machine can be substituted for the microprocessor. Possible ASICs designed specifically for speech coding are described in U.S. Patent Nos. 5,727,123 and 5,784,532, assigned to the assignee of the present invention and fully incorporated herein by reference. [0032] As an example, without limitation, a speech encoder can reside in a wireless communication device. As used here, the term "wireless communication device" refers to an electronic device that can be used for voice and/or data communication through a wireless communication system. Examples of wireless communication devices include cell phones, personal digital assistants (PDAs), portable devices, wireless modems, laptop computers, personal computers, tablets, etc. A wireless communication device may alternatively be referred to as an access terminal, a mobile terminal, a mobile station, a remote station, a user terminal, a terminal, a subscriber unit, a subscriber station, a mobile device, a wireless device, user equipment (UE), or some other similar terminology. [0033] Figure 2A is a block diagram illustrating a classifier system 200a that can use noise-robust speech coding mode classification. The classifier system 200a of figure 2A can reside in the encoders illustrated in figure 1. In another configuration, the classifier system 200a can be independent, providing speech mode classification output 246a to devices such as the encoders illustrated in figure 1. [0034] In figure 2A, input speech 212a is provided to a noise suppressor 202. Input speech 212a can be generated by analog-to-digital conversion of a voice signal.
The noise suppressor 202 filters noise components out of the input speech 212a, producing a noise-suppressed output speech signal 214a. In one configuration, the speech classification apparatus of figure 2A can use an Enhanced Variable Rate CODEC (EVRC). As illustrated, this configuration may include a built-in noise suppressor 202 that determines a noise estimate 216a and SNR information 218. [0035] The noise estimate 216a and the output speech signal 214a can be input to a speech classifier 210a. The output speech signal 214a from noise suppressor 202 can also be input to a voice activity detector 204a, an LPC analyzer 206a, and an open loop pitch estimator 208a. The noise estimate 216a can also be fed to the voice activity detector 204a along with SNR information 218 from the noise suppressor 202. The noise estimate 216a can be used by the speech classifier 210a to determine the periodicity limits and to distinguish between clean and noisy speech. [0036] A possible way to classify speech is to use the SNR information 218. However, the speech classifier 210a of the present systems and methods can use the noise estimate 216a instead of the SNR information 218. Alternatively, the SNR information 218 can be used if it is relatively stable across multiple frames, for example, a metric that includes SNR information 218 for multiple frames. The noise estimate 216a can be a relatively long-term indicator of the noise included in the input speech. The noise estimate 216a is hereinafter referred to as ns_est. The output speech signal 214a is hereinafter referred to as t_in. If, in a configuration, noise suppressor 202 is not present or is turned off, the noise estimate 216a, ns_est, can be preset to a default value. [0037] An advantage of using a noise estimate 216a instead of SNR information 218 is that the noise estimate can be relatively stable from frame to frame. The noise estimate 216a is simply the estimate of the background noise level, which tends to be relatively constant over long periods of time. In one configuration the noise estimate 216a can be used to determine the SNR 218 for a particular frame. In contrast, the SNR 218 can be a frame-by-frame measurement that can include relatively large fluctuations depending on the instantaneous voice energy; for example, the SNR can swing by many dB between silent frames and active speech frames. Therefore, if the SNR information 218 is used for classification, it can be averaged across more than one input speech frame 212a. The relative stability of the noise estimate 216a can be useful in distinguishing high-noise situations from simple silent frames. Even with zero noise, SNR 218 can still be very low in frames where the speaker is not talking, and so mode decision logic using SNR information 218 could be triggered in those frames. The noise estimate 216a can be relatively constant unless ambient noise conditions change, thereby avoiding such problems. [0038] Voice activity detector 204a can send voice activity information 220a for the current speech frame to speech classifier 210a, that is, based on output speech 214a, noise estimate 216a, and SNR information 218. The voice activity information output 220a indicates whether the current speech is active or inactive. In one configuration, the voice activity information output 220a can be binary, that is, active or inactive. In another configuration, the voice activity information output 220a can have multiple values.
The voice activity information parameter 220a is referred to here as vad. [0039] The LPC analyzer 206a sends LPC reflection coefficients 222a for the current output speech to the speech classifier 210a. The LPC analyzer 206a can also send other parameters such as LPC coefficients (not shown). The LPC reflection coefficient parameter 222a is referred to here as refl. [0040] The open loop pitch estimator 208a sends a Normalized Autocorrelation Coefficient Function (NACF) value 224a, and NACF around pitch values 226a, to speech classifier 210a. The NACF parameter 224a is hereinafter referred to as nacf, and the NACF around pitch parameter 226a is hereinafter referred to as nacf_at_pitch. A more periodic speech signal produces a higher value of nacf_at_pitch 226a. A higher value of nacf_at_pitch 226a is more likely to be associated with a stationary, voiced type of output speech. Speech classifier 210a maintains an array of nacf_at_pitch values 226a, which can be computed on a subframe basis. In one configuration, two open loop pitch estimates are measured for each output speech frame 214a by measuring two subframes per frame. NACF around pitch (nacf_at_pitch) 226a can be computed from the open loop pitch estimate for each subframe. In one configuration, a five-dimensional array of nacf_at_pitch values 226a (that is, nacf_at_pitch[4]) contains values for two and a half frames of output speech 214a. The nacf_at_pitch array is updated for each output speech frame 214a. The use of an array for the nacf_at_pitch parameter 226a provides speech classifier 210a with the ability to use current and future signal information to make accurate and noise-robust speech mode decisions. [0041] In addition to the information input to speech classifier 210a from external components, speech classifier 210a internally generates derived parameters 282a from output speech 214a for use in the speech mode decision-making process. [0042] In one configuration, speech classifier 210a internally generates a zero crossing rate parameter 228a, hereinafter referred to as zcr. The zcr parameter 228a of the current output speech 214a is defined as the number of sign changes in the speech signal per speech frame. In voiced speech, the zcr value 228a is low, while unvoiced speech (or noise) has a higher zcr value 228a since the signal is very random. The zcr parameter 228a is used by speech classifier 210a to classify voiced and unvoiced speech. [0043] In one configuration, speech classifier 210a internally generates a current frame energy parameter 230a, hereinafter referred to as E. E 230a can be used by speech classifier 210a to identify transient speech by comparing the energy in the current frame with the energy in past and future frames. The parameter Eprev, the previous frame energy, is derived from E 230a. [0044] In one configuration, the speech classifier 210a internally generates a future frame energy parameter 232a, hereinafter referred to as Enext. Enext 232a can contain energy values for a part of the current frame and a part of the next output speech frame. In one configuration, Enext 232a represents the energy in the second half of the current frame and the energy in the first half of the next frame of the output speech. Enext 232a is used by speech classifier 210a to identify transitional speech. At the end of speech, the energy in the next frame 232a drops dramatically compared to the energy in the current frame 230a.
Speech classifier 210a can compare the energy of the current frame 230a and the energy of the next frame 232a to identify end-of-speech and beginning-of-speech conditions, or upward transient and downward transient speech modes. [0045] In one configuration, the speech classifier 210a internally generates a band energy ratio parameter 234a, defined as log2(EL/EH), where EL is the current frame energy in the low band from 0 to 2 kHz, and EH is the current frame energy in the high band from 2 to 4 kHz. The band energy ratio parameter 234a is hereinafter referred to as bER. The bER parameter 234a allows the speech classifier 210a to identify the voiced and unvoiced speech modes, since, in general, voiced speech concentrates energy in the low band, while unvoiced speech concentrates energy in the high band. [0046] In one configuration, speech classifier 210a internally generates a three-frame average voiced energy parameter 236a from output speech 214a, hereinafter referred to as vEav. In other configurations, vEav 236a can be averaged over a number of frames other than three. If the current speech mode is active voiced speech, vEav 236a computes a running average of the energy over the last three output speech frames. Averaging the energy over the last three output speech frames provides speech classifier 210a with more stable statistics on which to base speech mode decisions than single-frame energy calculations alone. vEav 236a is used by speech classifier 210a to classify end-of-speech, or the downward transient mode, since the current frame energy 230a, E, will drop dramatically compared to the average voiced energy 236a, vEav, when speech has stopped. vEav 236a is updated only if the current frame is voiced, and is reset to a fixed value for silent or unvoiced speech. In one configuration, the fixed reset value is 0.01. [0047] In one configuration, the speech classifier 210a internally generates an average voiced energy parameter for the three previous frames 238a, hereinafter referred to as vEprev. In other configurations, vEprev 238a can be averaged over a number of frames other than three. vEprev 238a is used by speech classifier 210a to identify transitional speech. At the beginning of speech, the energy of the current frame 230a rises dramatically compared to the average energy of the previous three voiced frames 238a. Speech classifier 210a can compare the energy of the current frame 230a and the energy of the previous three frames 238a to identify beginning-of-speech conditions, or upward transient and transient speech modes. Similarly, at the end of voiced speech, the energy of the current frame 230a drops dramatically. Thus, vEprev 238a can also be used to classify the transition at the end of speech. [0048] In one configuration, the speech classifier 210a internally generates a ratio parameter of current frame energy to average voiced energy of the three previous frames 240a, defined as 10*log10(E/vEprev). In other configurations, vEprev 238a can be averaged over a number of frames other than three. The ratio of current frame energy to the average voiced energy of the three previous frames 240a is hereinafter referred to as vER. vER 240a is used by speech classifier 210a to classify the beginning of voiced speech and the end of voiced speech, or the upward transient and downward transient modes, since vER 240a is large when speech starts again and small at the end of voiced speech.
The vER parameter 240a can be used in conjunction with the vEprev parameter 238a in classifying transient speech. [0049] In one configuration, the speech classifier 210a internally generates a ratio parameter of current frame energy to three-frame average voiced energy 242a, defined as MIN(20, 20*log10(E/vEav)). The ratio of current frame energy to three-frame average voiced energy 242a is hereinafter referred to as vER2. vER2 242a is used by speech classifier 210a to classify transient speech modes at the end of voiced speech. [0050] In one configuration, speech classifier 210a internally generates a maximum subframe energy index parameter 244a. Speech classifier 210a divides the current frame of output speech 214a into equal subframes, and computes the Root Mean Squared (RMS) energy value for each subframe. In one configuration, the current frame is divided into ten subframes. The maximum subframe energy index parameter is the index of the subframe that has the highest RMS energy value in the current frame, or in the second half of the current frame. The maximum subframe energy index parameter 244a is hereinafter referred to as maxsfe_idx. Dividing the current frame into subframes provides speech classifier 210a with information on peak energy locations, including the location of the largest peak energy, within a frame. Higher resolution is achieved by dividing a frame into more subframes. The maxsfe_idx parameter 244a is used in conjunction with other parameters by speech classifier 210a to classify transient speech modes, since the energies of the unvoiced or silent speech modes are generally stable, while energy rises or falls in a transient speech mode. [0051] The speech classifier 210a can use parameters input directly from the coding components and internally generated parameters to classify speech modes more accurately and robustly than previously possible. Speech classifier 210a can apply a decision-making process to the directly input and internally generated parameters to produce improved speech classification results. The decision-making process is described in detail below with reference to figures 4A to 4C and Tables 4 to 6. [0052] In one configuration, the speech modes output by speech classifier 210a comprise: Transient, Upward Transient, Downward Transient, Voiced, Unvoiced, and Silent modes. The Transient mode is voiced but less periodic speech, ideally encoded with full-rate CELP. The Upward Transient mode is the first voiced frame in active speech, ideally encoded with full-rate CELP. The Downward Transient mode is low-energy voiced speech, typically at the end of a word, ideally encoded with half-rate CELP. The Voiced mode is highly periodic voiced speech, mainly comprising vowels. Voiced mode speech can be encoded at full rate, half rate, quarter rate, or eighth rate. The data rate for Voiced mode speech coding is selected to meet Average Data Rate (ADR) requirements. The Unvoiced mode, mainly comprising consonants, is ideally encoded with quarter-rate NELP (Noise Excited Linear Prediction). The Silent mode is inactive speech, ideally encoded with eighth-rate CELP. [0053] The suitable parameters and speech modes are not limited to the specific parameters and speech modes of the described configurations. Additional parameters and speech modes can be used without departing from the scope of the described configurations.
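As a rough, hypothetical illustration of how the internally derived parameters of paragraphs [0042] to [0050] could be computed from one frame of output speech t_in, consider the Python sketch below. It assumes 20 ms, 160-sample frames at 8 kHz as in paragraph [0030]; the function name, the FFT-based band split used for bER, and the handling of edge cases are assumptions rather than details taken from the patent.

```python
import numpy as np

def derived_parameters(frame, next_half, vEprev, vEav, num_subframes=10):
    """Sketch of the internally generated parameters zcr, E, Enext, bER,
    vER, vER2 and maxsfe_idx for one 20 ms frame of output speech t_in.

    frame     -- current frame samples (e.g., 160 samples at 8 kHz)
    next_half -- first half of the next frame (used for Enext)
    vEprev    -- average voiced energy of the three previous frames
    vEav      -- average voiced energy of the last three voiced frames
    """
    frame = np.asarray(frame, dtype=np.float64)
    next_half = np.asarray(next_half, dtype=np.float64)

    # zcr: number of sign changes per frame (low for voiced, high for unvoiced/noise).
    zcr = int(np.count_nonzero(frame[:-1] * frame[1:] < 0))

    # E: current frame energy; Enext: second half of this frame plus first half of next.
    E = float(np.sum(frame ** 2))
    Enext = float(np.sum(frame[len(frame) // 2:] ** 2) + np.sum(next_half ** 2))

    # bER = log2(EL / EH): low-band (0-2 kHz) versus high-band (2-4 kHz) energy.
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    half = len(spectrum) // 2                      # bin nearest 2 kHz at fs = 8 kHz
    EL, EH = np.sum(spectrum[:half]) + 1e-12, np.sum(spectrum[half:]) + 1e-12
    bER = float(np.log2(EL / EH))

    # vER = 10*log10(E / vEprev); vER2 = MIN(20, 20*log10(E / vEav)).
    vER = float(10.0 * np.log10(E / (vEprev + 1e-12)))
    vER2 = float(min(20.0, 20.0 * np.log10(E / (vEav + 1e-12))))

    # maxsfe_idx: index of the subframe with the largest RMS energy.
    subframes = np.array_split(frame, num_subframes)
    maxsfe_idx = int(np.argmax([np.sqrt(np.mean(sf ** 2)) for sf in subframes]))

    return {"zcr": zcr, "E": E, "Enext": Enext, "bER": bER,
            "vER": vER, "vER2": vER2, "maxsfe_idx": maxsfe_idx}
```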
[0054] Figure 2B is a block diagram illustrating another classifier system 200b that can use noise-robust speech coding mode classification. The classifier system 200b of figure 2B can reside in the encoders illustrated in figure 1. In another configuration, the classifier system 200b can be independent, providing speech mode classification output to devices such as the encoders illustrated in figure 1. The classifier system 200b shown in figure 2B may include elements corresponding to the classifier system 200a shown in figure 2A. Specifically, the LPC analyzer 206b, the open loop pitch estimator 208b, and the speech classifier 210b illustrated in figure 2B can correspond to and include functionality similar to the LPC analyzer 206a, open loop pitch estimator 208a, and speech classifier 210a illustrated in figure 2A, respectively. Similarly, the speech classifier 210b inputs in figure 2B (voice activity information 220b, reflection coefficients 222b, NACF 224b, and NACF around pitch 226b) may correspond to the speech classifier 210a inputs (voice activity information 220a, reflection coefficients 222a, NACF 224a, and NACF around pitch 226a) in figure 2A, respectively. Similarly, the derived parameters 282b in figure 2B (zcr 228b, E 230b, Enext 232b, bER 234b, vEav 236b, vEprev 238b, vER 240b, vER2 242b and maxsfe_idx 244b) may correspond to the derived parameters 282a in figure 2A (zcr 228a, E 230a, Enext 232a, bER 234a, vEav 236a, vEprev 238a, vER 240a, vER2 242a and maxsfe_idx 244a), respectively. [0055] In figure 2B, there is no noise suppressor included. In one configuration, the speech classification device in figure 2B can use an Enhanced Voice Services (EVS) CODEC. The apparatus of figure 2B can receive the input speech frames 212b from a noise suppression component external to the speech codec. Alternatively, noise suppression may not be performed. Since there is no included noise suppressor 202, the noise estimate, ns_est, 216b can be determined by the voice activity detector 204b. While figures 2A and 2B describe two configurations in which the noise estimate is determined by a noise suppressor 202 and a voice activity detector 204b, respectively, the noise estimate 216a-b can be determined by any suitable module, for example, a generic noise estimator (not shown). [0056] Figure 3 is a flow chart illustrating a noise-robust speech classification method 300. In step 302, the classification parameters input from external components are processed for each noise-suppressed output speech frame. In one configuration (for example, classifier system 200a illustrated in figure 2A), the classification parameters input from external components comprise ns_est 216a and t_in 214a input from a noise suppressor component 202, the parameters nacf 224a and nacf_at_pitch 226a input from an open loop pitch estimator component 208a, vad 220a input from a voice activity detector component 204a, and refl 222a input from the LPC analysis component 206a. Alternatively, ns_est 216b can be input from a different module, for example, a voice activity detector 204b as illustrated in figure 2B. The t_in input 214a-b can be the output speech frames 214a of a noise suppressor 202 as in figure 2A or the input frames 212b as in figure 2B. The flow of control proceeds to step 304. [0057] In step 304, the additional internally generated (derived) parameters 282a-b are computed from the classification parameters input from external components.
In one configuration, zcr 228a-b, E 230a-b, Enext 232a-b, bER 234a-b, vEav 236a-b, vEprev 238a-b, vER 240a-b, vER2 242a-b and maxsfe_idx 244a-b are computed from t_in 214a-b. When the internally generated parameters have been computed for each output speech frame, the flow of control proceeds to step 306. [0058] In step 306, the NACF limits are determined, and a parameter analyzer is selected according to the environment of the speech signal. In one configuration, the NACF limit is determined by comparing the ns_est parameter 216a-b input in step 302 with a noise estimate limit value. The ns_est information 216a-b can provide adaptive control of a periodicity decision limit. In this way, different periodicity limits are applied to the classification process for speech signals with different levels of noise components. This can produce a relatively accurate speech classification decision, since the most appropriate NACF, or periodicity, limit for the noise level of the speech signal is selected for each frame of the output speech. Determining the most appropriate periodicity limit for a speech signal allows the selection of the best parameter analyzer for the speech signal. Alternatively, the SNR information 218 can be used to determine the NACF limit, if the SNR information 218 includes information about multiple frames and is relatively stable from frame to frame. [0059] Clean and noisy speech signals inherently differ in periodicity. When noise is present, speech corruption is present. When speech corruption is present, the measure of periodicity, or nacf 224a-b, is lower than that of clean speech. In this way, the NACF limit is lowered for a noisy signal environment relative to a clean signal environment, to compensate. The speech classification technique of the described systems and methods can adjust the periodicity (that is, NACF) limits for different environments, producing relatively accurate and robust decisions regardless of noise level. [0060] In one configuration, if the value of ns_est 216a-b is less than or equal to a noise estimate limit, the NACF limits for clean speech are applied. Possible NACF limits for clean speech can be defined by the following table: Table 1 [0061] However, depending on the value of ns_est 216a-b, several limits can be adjusted. For example, if the value of ns_est 216a-b is greater than a noise estimate limit, the NACF limits for noisy speech can be applied. The noise estimate limit can be any suitable value, for example, 20 dB, 25 dB, etc. In one configuration, the noise estimate limit is set to be above what is seen with clean speech and below what is seen with noisy speech. Possible NACF limits for noisy speech can be defined by the following table: Table 2 [0062] In the case where no noise is present (that is, ns_est 216a-b does not exceed the noise estimate limit), the voice limits may not be adjusted. However, the NACF voice limit for classifying a frame as "voiced" can be reduced (reflecting the corruption of voicing information) when there is loud noise in the input speech. In other words, the voice limit for classification of "voiced" speech can be reduced by 0.2, as observed in Table 2, when compared with Table 1. [0063] Alternatively, or in addition to modifying the NACF limits for classifying "voiced" frames, speech classifier 210a-b can adjust one or more limits for classifying "unvoiced" frames based on the value of ns_est 216a-b.
There can be two types of NACF limits for classifying "unvoiced" frames that are adjusted based on the value of ns_est 216a-b: a voice limit and an energy limit. Specifically, the NACF voice limit for classifying a frame as "unvoiced" can be increased (reflecting the corruption of voicing information under noise). For example, the NACF limit for "unvoiced" voice may increase by 0.06 in the presence of loud noise (that is, when ns_est 216a-b exceeds the noise estimate limit), thus making the classifier more permissive in classifying frames as "unvoiced". If the multi-frame SNR information 218 is used instead of ns_est 216a-b, then at a low SNR (indicating the presence of loud noise) the "unvoiced" voice limit can be increased by 0.06. Examples of adjusted voice NACF limits can be provided according to Table 3. Table 3 [0064] The energy limit for classifying a frame as "unvoiced" can also be increased (reflecting the higher level of "silent" frames) in the presence of high noise, that is, when ns_est 216a-b exceeds the noise estimate limit. For example, the unvoiced energy limit can be increased by 10 dB in loud noise frames, for example, the energy limit can be increased from -25 dB in the clean speech case to -15 dB in the noisy case. Increasing the voice limit and energy limit for classifying a frame as "unvoiced" can make it easier (that is, make it more permissive) to classify a frame as unvoiced as the noise estimate becomes higher (or the SNR becomes lower). The limits for intermediate noise frames (for example, when ns_est 216a-b does not exceed the noise estimate limit, but is above a minimum noise measurement) can be adjusted by interpolating between the "clean" settings (Table 1) and the "noisy" settings (Table 2 and/or Table 3), based on the input noise estimate. Alternatively, hard limit settings can be defined for some intermediate noise estimates. [0065] The "voiced" voice limit can be adjusted independently of the "unvoiced" voice and energy limits. For example, the "voiced" voice limit can be adjusted while neither the energy limit nor the "unvoiced" voice limit is adjusted. Alternatively, one or both of the "unvoiced" energy and voice limits can be adjusted while the "voiced" voice limit is not adjusted. Alternatively, the "voiced" voice limit can be adjusted along with just one of the "unvoiced" voice or energy limits. [0066] Noisy speech is the same as clean speech with added noise. With adaptive periodicity limit control, the noise-robust speech classification technique may be more likely to produce identical classification decisions for clean and noisy speech than was previously possible. When the NACF limits have been set for each frame, the flow of control proceeds to step 308. [0067] In step 308, a speech mode classification 246a-b is determined based, at least in part, on the noise estimate. A state machine or any other analysis method selected according to the signal environment is applied to the parameters. In one configuration, the parameters input from external components and the internally generated parameters are applied to a state-based decision-making process described in detail with respect to figures 4A to 4C and Tables 4 to 6. The decision-making process produces a speech mode classification. In one configuration, a speech mode classification 246a-b of Transient, Transient Ascending, Transient Descending, Voiced, Unvoiced, or Silent is produced. When a speech mode decision 246a-b has been produced, the flow of control proceeds to step 310.
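To connect steps 306 and 308, the following hedged Python sketch shows one way the NACF (periodicity) limits could be selected from the noise estimate and then used to pick the parameter analyzer (the state machine of Figure 4A, 4B, or 4C). The numeric limit values are placeholders, since Tables 1 to 3 are not reproduced here, and the function names are hypothetical; only the adjustment directions (-0.2 for voiced, +0.06 for unvoiced) follow the text above.

```python
# Illustrative sketch of steps 306-308; limit values are placeholders rather than
# the values of Tables 1-3, and only the adjustment directions follow the text.

NOISE_ESTIMATE_LIMIT_DB = 25.0        # example value mentioned in the description

def select_nacf_limits(ns_est_db):
    """Step 306: choose periodicity (NACF) limits according to the signal environment."""
    clean = {"VOICEDTH": 0.75, "LOWVOICEDTH": 0.50, "UNVOICEDTH": 0.35}  # placeholders
    if ns_est_db <= NOISE_ESTIMATE_LIMIT_DB:
        return clean                   # clean speech: use the Table 1 style limits
    noisy = dict(clean)
    noisy["VOICEDTH"] -= 0.2           # voiced speech looks less periodic in noise
    noisy["UNVOICEDTH"] += 0.06        # be more permissive about "unvoiced"
    return noisy

def select_parameter_analyzer(nacf_at_pitch, limits):
    """Step 308 entry point: pick the state machine of Figure 4A, 4B, or 4C."""
    nacf_mid = nacf_at_pitch[2]        # "third value" of the nacf_at_pitch array
    if nacf_mid > limits["VOICEDTH"]:
        return "figure_4A"             # high periodicity (Table 4)
    if nacf_mid < limits["UNVOICEDTH"]:
        return "figure_4B"             # low periodicity (Table 5)
    return "figure_4C"                 # moderate periodicity (Table 6)

# Example: a moderately periodic frame observed in a noisy environment.
limits = select_nacf_limits(ns_est_db=35.0)
print(select_parameter_analyzer([0.2, 0.4, 0.6, 0.55, 0.5], limits))
```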
[0068] In step 310, state variables and several parameters are updated to include the current frame. In one configuration, vEav 236a-b, vEprev 238a-b, and the voiced state of the current frame are updated. The current frame energy E 230a-b, nacf_at_pitch 226a-b, and the current frame speech mode 246a-b are updated for classifying the next frame. Steps 302 to 310 can be repeated for each speech frame. [0069] Figures 4A to 4C illustrate configurations of the mode decision-making process for noise-robust speech classification. The decision-making process selects a state machine for the classification of speech based on the periodicity of the speech frame. For each speech frame, the state machine most compatible with the periodicity, or noise component, of the speech frame is selected for the decision-making process by comparing the measurement of the speech frame periodicity, that is, the nacf_at_pitch value 226a-b, with the NACF limits set in step 306 of figure 3. The level of periodicity of the speech frame limits and controls the state transitions of the mode decision process, producing a more robust classification. [0070] Figure 4A illustrates a configuration of the state machine selected in one configuration when vad 220a-b is equal to 1 (there is active speech) and the third value of nacf_at_pitch 226a-b (that is, nacf_at_pitch[2], zero-indexed) is very high, or greater than VOICEDTH. VOICEDTH is defined in step 306 of figure 3. Table 4 illustrates the parameters evaluated by each state: Table 4 [0071] Table 4, according to one configuration, illustrates the parameters evaluated by each state, and the state transitions, when the third value of nacf_at_pitch 226a-b (that is, nacf_at_pitch[2]) is very high, or greater than VOICEDTH. The decision table illustrated in Table 4 is used by the state machine described in figure 4A. The speech mode classification 246a-b of the previous speech frame is illustrated in the leftmost column. When the parameters are evaluated as illustrated in the row associated with each previous mode, the speech mode classification transitions to the current mode identified in the upper row of the associated column. [0072] The initial state is Silent 450a. The current frame will always be classified as Silent 450a, regardless of the previous state, if vad = 0 (that is, there is no voice activity). [0073] When the previous state is Silent 450a, the current frame can be classified as Unvoiced 452a or Transient Ascending 460a. The current frame is classified as Unvoiced 452a if nacf_at_pitch[3] is very low, zcr 228a-b is high, bER 234a-b is low, and vER 240a-b is very low, or if a combination of these conditions is met. Otherwise the classification defaults to Transient Ascending 460a. [0074] When the previous state is Unvoiced 452a, the current frame can be classified as Unvoiced 452a or Transient Ascending 460a. The current frame remains classified as Unvoiced 452a if nacf 224a-b is very low, nacf_at_pitch[3] is very low, nacf_at_pitch[4] is very low, zcr 228a-b is high, bER 234a-b is low, vER 240a-b is very low, and E 230a-b is less than vEprev 238a-b, or if a combination of these conditions is met. Otherwise, the classification defaults to Transient Ascending 460a. [0075] When the previous state is Voiced 456a, the current frame can be classified as Unvoiced 452a, Transient 454a, Transient Descending 458a, or Voiced 456a. The current frame is classified as Unvoiced 452a if vER 240a-b is very low and E 230a-b is less than vEprev 238a-b.
The current frame is classified as Transient 454a if nacf_at_pitch[1] and nacf_at_pitch[3] are low, E 230a-b is greater than half of vEprev 238a-b, or a combination of these conditions is met. The current frame is classified as Transient Descending 458a if vER 240a-b is very low and nacf_at_pitch[3] has a moderate value. Otherwise, the current classification defaults to Voiced 456a. [0076] When the previous state is Transient 454a or Transient Ascending 460a, the current frame can be classified as Unvoiced 452a, Transient 454a, Transient Descending 458a, or Voiced 456a. The current frame is classified as Unvoiced 452a if vER 240a-b is very low and E 230a-b is less than vEprev 238a-b. The current frame is classified as Transient 454a if nacf_at_pitch[1] is low, nacf_at_pitch[3] has a moderate value, nacf_at_pitch[4] is low, and the previous state is not Transient 454a, or if a combination of these conditions is met. The current frame is classified as Transient Descending 458a if nacf_at_pitch[3] has a moderate value and E 230a-b is less than 0.05 times vEav 236a-b. Otherwise, the current classification defaults to Voiced 456a. [0077] When the previous frame is Transient Descending 458a, the current frame can be classified as Unvoiced 452a, Transient 454a, or Transient Descending 458a. The current frame will be classified as Unvoiced 452a if vER 240a-b is very low. The current frame will be classified as Transient 454a if E 230a-b is greater than vEprev 238a-b. Otherwise, the current classification remains Transient Descending 458a. [0078] Figure 4B illustrates a configuration of the state machine selected in one configuration when vad 220a-b is equal to 1 (there is active speech) and the third value of nacf_at_pitch 226a-b is very low, or less than UNVOICEDTH. UNVOICEDTH is defined in step 306 of figure 3. Table 5 illustrates the parameters evaluated by each state. Table 5 [0079] Table 5 illustrates, according to one configuration, the parameters evaluated by each state, and the state transitions, when the third value (that is, nacf_at_pitch[2]) is very low, or less than UNVOICEDTH. The decision table illustrated in Table 5 is used by the state machine described in figure 4B. The speech mode classification 246a-b of the previous speech frame is illustrated in the leftmost column. When the parameters are evaluated as illustrated in the row associated with each previous mode, the speech mode classification transitions to the current mode 246a-b identified in the upper row of the associated column. [0080] The initial state is Silent 450b. The current frame will always be classified as Silent 450b, regardless of the previous state, if vad = 0 (that is, there is no voice activity). [0081] When the previous state is Silent 450b, the current frame can be classified as Unvoiced 452b or Transient Ascending 460b. The current frame is classified as Transient Ascending 460b if nacf_at_pitch[2-4] shows an increasing trend, nacf_at_pitch[3-4] has a moderate value, zcr 228a-b is very low to moderate, bER 234a-b is high, and vER 240a-b has a moderate value, or if a combination of these conditions is met. Otherwise the classification defaults to Unvoiced 452b. [0082] When the previous state is Unvoiced 452b, the current frame can be classified as Unvoiced 452b or Transient Ascending 460b.
The current frame is classified as Transient Ascending 460b if nacf_at_pitch[2-4] shows an increasing trend, nacf_at_pitch[3-4] has a moderate to very high value, zcr 228a-b is very low or moderate, vER 240a-b is not low, bER 234a-b is high, refl 222a-b is low, nacf 224a-b has a moderate value, and E 230a-b is greater than vEprev 238a-b, or if a combination of these conditions is met. The combinations and limits for these conditions can vary depending on the noise level of the speech frame as reflected in the parameter ns_est 216a-b (or possibly the averaged multi-frame SNR information 218). Otherwise, the classification defaults to Unvoiced 452b. [0083] When the previous state is Voiced 456b, Transient Ascending 460b, or Transient 454b, the current frame can be classified as Unvoiced 452b, Transient 454b, or Transient Descending 458b. The current frame is classified as Unvoiced 452b if bER 234a-b is less than or equal to zero, vER 240a-b is very low, bER 234a-b is greater than zero, and E 230a-b is less than vEprev 238a-b, or if a combination of these conditions is met. The current frame is classified as Transient 454b if bER 234a-b is greater than zero, nacf_at_pitch[2-4] shows an increasing trend, zcr 228a-b is not high, vER 240a-b is not low, refl 222a-b is low, nacf_at_pitch[3] and nacf 224a-b are moderate, and bER 234a-b is less than or equal to zero, or if a certain combination of these conditions is met. The combinations and limits for these conditions can vary depending on the noise level of the speech frame as reflected in the parameter ns_est 216a-b. The current frame is classified as Transient Descending 458b if bER 234a-b is greater than zero, nacf_at_pitch[3] is moderate, E 230a-b is less than vEprev 238a-b, zcr 228a-b is not high, and vER2 242a-b is less than negative fifteen. [0084] When the previous frame is Transient Descending 458b, the current frame can be classified as Unvoiced 452b, Transient 454b, or Transient Descending 458b. The current frame will be classified as Transient 454b if nacf_at_pitch[2-4] shows an increasing trend, nacf_at_pitch[3-4] is moderately high, vER 240a-b is not low, and E 230a-b is greater than twice vEprev 238a-b, or if a combination of these conditions is met. The current frame will be classified as Transient Descending 458b if vER 240a-b is not low and zcr 228a-b is low. Otherwise, the current classification defaults to Unvoiced 452b. [0085] Figure 4C illustrates a configuration of the state machine selected in one configuration when vad 220a-b is equal to 1 (there is active speech) and the third value of nacf_at_pitch 226a-b (that is, nacf_at_pitch[3]) is moderate, that is, greater than UNVOICEDTH and less than VOICEDTH. UNVOICEDTH and VOICEDTH are defined in step 306 of figure 3. Table 6 illustrates the parameters evaluated by each state. Table 6 [0086] Table 6 illustrates, according to one configuration, the parameters evaluated by each state, and the state transitions, when the third value of nacf_at_pitch 226a-b (that is, nacf_at_pitch[3]) is moderate, that is, greater than UNVOICEDTH but less than VOICEDTH. The decision table illustrated in Table 6 is used by the state machine described in figure 4C. The speech mode classification of the previous speech frame is illustrated in the leftmost column. When the parameters are evaluated as illustrated in the row associated with each previous mode, the speech mode classification 246a-b transitions to the current mode 246a-b identified in the upper row of the associated column.
[0087] The initial state is Silent 450c. The current frame will always be classified as Silent 450c, regardless of the previous state, if vad = 0 (that is, if there is no voice activity).

[0088] When the previous state is Silent 450c, the current frame can be classified as Unvoiced 452c or Transient Ascending 460c. The current frame is classified as Transient Ascending 460c if nacf_at_pitch[2-4] shows an upward trend, nacf_at_pitch[3-4] is moderate to high, zcr 228a-b is not high, bER 234a-b is high, vER 240a-b has a moderate value, zcr 228a-b is very low and E 230a-b is greater than twice vEprev 238a-b, or if a certain combination of these conditions is met. Otherwise, the classification defaults to Unvoiced 452c.

[0089] When the previous state is Unvoiced 452c, the current frame can be classified as Unvoiced 452c or Transient Ascending 460c. The current frame is classified as Transient Ascending 460c if nacf_at_pitch[2-4] shows an upward trend, nacf_at_pitch[3-4] has a moderate to very high value, zcr 228a-b is not high, vER 240a-b is not low, bER 234a-b is high, refl 222a-b is low, E 230a-b is greater than vEprev 238a-b, zcr 228a-b is very low, nacf 224a-b is not low, maxsfe_idx 244a-b points to the last subframe and E 230a-b is greater than twice vEprev 238a-b, or if a combination of these conditions is met. The combinations and limits for these conditions can vary depending on the noise level of the speech frame as reflected in the parameter ns_est 216a-b (or possibly the multiple-frame SNR information 218). Otherwise, the classification defaults to Unvoiced 452c.

[0090] When the previous state is Voiced 456c, Transient Ascending 460c or Transient 454c, the current frame can be classified as Unvoiced 452c, Voiced 456c, Transient 454c or Transient Descending 458c. The current frame is classified as Unvoiced 452c if bER 234a-b is less than or equal to zero, vER 240a-b is very low, Enext 232a-b is less than E 230a-b, nacf_at_pitch[3-4] is very low, bER 234a-b is greater than zero, nacf_at_pitch[2-4] shows an upward trend, zcr 228a-b is not high, vER 240a-b is not low, refl 222a-b is low, nacf_at_pitch[3] and nacf 224a-b are not low, or if a combination of these conditions is met. The combinations and limits for these conditions can vary depending on the noise level of the speech frame as reflected in the parameter ns_est 216a-b (or possibly the multiple-frame SNR information 218). The current frame is classified as Transient Descending 458c if bER 234a-b is greater than zero, nacf_at_pitch[3] is not high, E 230a-b is less than vEprev 238a-b, zcr 228a-b is not high, vER 240a-b is less than -15 and vER2 242a-b is less than -15, or if a combination of these conditions is met. The current frame is classified as Voiced 456c if nacf_at_pitch[2] is greater than LOWVOICEDTH, bER 234a-b is greater than or equal to zero and vER 240a-b is not low, or if a combination of these conditions is met.

[0091] When the previous frame is Transient Descending 458c, the current frame can be classified as Unvoiced 452c, Transient 454c or Transient Descending 458c. The current frame will be classified as Transient 454c if bER 234a-b is greater than zero, nacf_at_pitch[2-4] shows an upward trend, nacf_at_pitch[3-4] is moderately high, vER 240a-b is not low and E 230a-b is greater than twice vEprev 238a-b, or if a certain combination of these conditions is met. The current frame will be classified as Transient Descending 458c if vER 240a-b is not low and zcr 228a-b is low. Otherwise, the current classification defaults to Unvoiced 452c.

[0092] Figure 5 is a flowchart illustrating a method 500 for setting limits for speech classification. The adjusted limits (for example, NACF limits, or periodicity limits) can then be used, for example, in the robust noise speech classification method 300 illustrated in figure 3. Method 500 can be performed by the speech classifiers 210a-b illustrated in figures 2A and 2B.

[0093] A noise estimate (for example, ns_est 216a-b) of the input speech can be received 502 at the speech classifier 210a-b. The noise estimate can be based on multiple frames of the input speech. Alternatively, an average of the multiple-frame SNR information 218 can be used instead of a noise estimate. Any suitable noise metric that is relatively stable across multiple frames can be used in method 500. The speech classifier 210a-b can determine 504 whether the noise estimate exceeds a noise estimate limit. Alternatively, the speech classifier 210a-b can determine whether the multiple-frame SNR information 218 has failed to exceed a multiple-frame SNR limit. If not, the speech classifier 210a-b may not adjust 506 any NACF limits for classifying speech as "voiced" or "unvoiced". However, if the noise estimate exceeds the noise estimate limit, the speech classifier 210a-b can also determine 508 whether to adjust the unvoiced NACF limits. If not, the unvoiced NACF limits may not be adjusted 510, that is, the limits for classifying a frame as "unvoiced" may not be adjusted. If so, the speech classifier 210a-b can increase 512 the unvoiced NACF limits, that is, increase a voice limit for classifying a current frame as unvoiced and increase an energy limit for classifying the current frame as unvoiced. Increasing the voice limit and the energy limit for classifying a frame as "unvoiced" makes it easier (that is, more permissive) to classify a frame as unvoiced as the noise estimate becomes higher (or the SNR becomes smaller). The speech classifier 210a-b can also determine 514 whether to adjust the voiced NACF limit (alternatively, spectral slope, transient detection or zero crossing rate limits can be adjusted). If not, the speech classifier 210a-b may not adjust 516 the voice limit for classifying a frame as "voiced", that is, the limits for classifying a frame as "voiced" may not be adjusted. If so, the speech classifier 210a-b can reduce 518 a voice limit for classifying a current frame as "voiced". Therefore, the NACF limits for classifying a speech frame as "voiced" or "unvoiced" can be adjusted independently of each other. For example, depending on how the classifier 610 is tuned in the clean (no noise) case, only one of the "voiced" or "unvoiced" limits may be adjusted, that is, it may be the case that the "unvoiced" classification is much more sensitive to noise. In addition, the penalty for the erroneous classification of a "voiced" frame may be greater than for the erroneous classification of an "unvoiced" frame (both in terms of quality and bit rate). A simplified sketch of this adjustment is given below.

[0094] Figure 6 is a block diagram illustrating a speech classifier 610 for robust noise speech classification. Speech classifier 610 can correspond to the speech classifiers 210a-b illustrated in figures 2A and 2B and can perform method 300 illustrated in figure 3 or method 500 illustrated in figure 5.
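Before the received parameters 670 are detailed, a minimal sketch of the limit adjustment of figure 5 (the same adjustment performed by the noise estimate comparator 678 of figure 6) is given below. The numeric limit and step values in the sketch are hypothetical placeholders, not values taken from this description; the "voiced" and "unvoiced" adjustments are shown together although, as noted above, they can be applied independently, and a multiple-frame SNR test against an SNR limit could be substituted for the noise estimate test.

```python
# Illustrative sketch of the figure 5 adjustment (hypothetical numbers): under
# heavy noise, relax the "unvoiced" limits and lower the "voiced" limit.

def adjust_nacf_limits(ns_est, limits, noise_estimate_limit=25.0,
                       unvoiced_step=0.06, voiced_step=0.05):
    """Return possibly adjusted NACF limits given a multi-frame noise estimate.

    ns_est -- noise estimate based on multiple frames of the input speech
    limits -- dict with "voiced", "unvoiced_voice" and "unvoiced_energy" entries
    """
    adjusted = dict(limits)
    if ns_est <= noise_estimate_limit:
        return adjusted                               # clean case: limits left unchanged
    adjusted["unvoiced_voice"] += unvoiced_step       # easier to classify a frame as unvoiced
    adjusted["unvoiced_energy"] += unvoiced_step      # likewise for the energy test
    adjusted["voiced"] -= voiced_step                 # easier to classify a frame as voiced
    return adjusted

# Example use with made-up starting limits:
limits = {"voiced": 0.75, "unvoiced_voice": 0.35, "unvoiced_energy": 0.40}
noisy_limits = adjust_nacf_limits(ns_est=30.0, limits=limits)
```

The direction of each adjustment follows the description above: raising the unvoiced limits makes the unvoiced classification more permissive as the noise estimate grows, while lowering the voiced limit makes it easier to classify genuinely periodic but noisy frames as voiced.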
[0095] Speech classifier 610 can include received parameters 670. These can include received speech frames (t_in) 672, SNR information 618, a noise estimate (ns_est) 616, voice activity information (vad) 620, reflection coefficients (refl) 622, NACF 624 and NACF around pitch (nacf_at_pitch) 626. These parameters 670 can be received from various modules such as those illustrated in figures 2A and 2B. For example, the received speech frames (t_in) 672 can be the output speech frames 214a from a noise suppressor 202 illustrated in figure 2A, or the input speech 212b itself as shown in figure 2B.

[0096] A parameter derivation module 674 can also determine a set of derived parameters 682. Specifically, the parameter derivation module 674 can determine a zero crossing rate (zcr) 628, a current frame energy (E) 630, an advance frame energy (Enext) 632, a band energy ratio (bER) 634, a three-frame average voiced energy (vEav) 636, a previous three-frame average voiced energy (vEprev) 638, a ratio of current frame energy to the previous three-frame average voiced energy (vER) 640, a ratio of current frame energy to the three-frame average voiced energy (vER2) 642 and a maximum subframe energy index (maxsfe_idx) 644.

[0097] A noise estimate comparator 678 can compare the received noise estimate (ns_est) 616 with a noise estimate limit 676. If the noise estimate (ns_est) 616 does not exceed the noise estimate limit 676, a set of NACF limits 684 may not be adjusted. However, if the noise estimate (ns_est) 616 exceeds the noise estimate limit 676 (indicating the presence of loud noise), one or more NACF limits 684 can be adjusted. Specifically, a voice limit for classifying "voiced" frames 686 can be reduced, a voice limit for classifying "unvoiced" frames 688 can be increased, and an energy limit for classifying "unvoiced" frames 690 can be increased, or some combination of these adjustments can be made. Alternatively, instead of comparing the noise estimate (ns_est) 616 with the noise estimate limit 676, the noise estimate comparator 678 can compare the SNR information 618 with a multiple-frame SNR limit 680 to determine whether to adjust the NACF limits 684. In this configuration, the NACF limits 684 can be adjusted if the SNR information 618 fails to exceed the multiple-frame SNR limit 680, that is, the NACF limits 684 can be adjusted when the SNR information 618 is below a minimum level, thus indicating the presence of loud noise. Any suitable noise metric that is relatively stable across multiple frames can be used by the noise estimate comparator 678. A classifier state machine 692 can then be selected and used to determine a speech mode classification 646 based at least in part on the derived parameters 682, as described above and illustrated in figures 4A to 4C and Tables 4 to 6.

[0099] Figure 7 is a timeline graph illustrating a configuration of a received speech signal 772 with the associated parameter values and speech mode classifications 746. Specifically, figure 7 illustrates a configuration of the present systems and methods in which the speech mode classification 746 is chosen based on various received parameters 670 and derived parameters 682. Each signal or parameter is illustrated in figure 7 as a function of time.

[0100] For example, the third NACF value around pitch (nacf_at_pitch[2]) 794, the fourth NACF value around pitch (nacf_at_pitch[3]) 795 and the fifth NACF value around pitch (nacf_at_pitch[4]) 796 are illustrated.
Additionally, the ratio of current energy to the average voiced energy of the three previous frames (vER) 740, the band energy ratio (bER) 734, the zero crossing rate (zcr) 728 and the reflection coefficients (refl) 722 are also illustrated. Based on the illustrated signals, the received speech 772 can be classified as Silent around time 0, Unvoiced around time 4, Transient around time 9, Voiced around time 10 and Transient Descending around time 25.

[0101] Figure 8 illustrates certain components that can be included within an electronic or wireless device 804. The electronic or wireless device 804 can be an access terminal, a mobile station, a user equipment (UE), a base station, an access point, a broadcast transmitter, a Node B, an evolved Node B, etc. The electronic or wireless device 804 includes a processor 803. The processor 803 can be a general purpose single-chip or multi-chip microprocessor (for example, an ARM), a special purpose microprocessor (for example, a digital signal processor (DSP)), a microcontroller, a programmable gate array, etc. The processor 803 can be referred to as a central processing unit (CPU). Although only a single processor 803 is illustrated in the electronic or wireless device 804 of figure 8, in an alternative configuration a combination of processors (for example, an ARM and a DSP) can be used.

[0102] The electronic or wireless device 804 also includes a memory 805. The memory 805 can be any electronic component capable of storing electronic information. The memory 805 can be embodied as random access memory (RAM), read-only memory (ROM), magnetic disk storage media, optical storage media, flash memory devices in RAM, built-in memory included with the processor, EPROM memory, EEPROM memory, registers and so on, including combinations of them.

[0103] Data 807a and instructions 809a can be stored in the memory 805. The instructions 809a can be executable by the processor 803 to implement the methods described here. Executing the instructions 809a may involve using the data 807a that is stored in the memory 805. When the processor 803 executes the instructions 809a, various portions of the instructions 809b can be loaded into the processor 803, and various portions of the data 807b can be loaded into the processor 803.

[0104] The electronic or wireless device 804 can also include a transmitter 811 and a receiver 813 to allow the transmission and reception of signals to and from the electronic or wireless device 804. The transmitter 811 and the receiver 813 can be collectively referred to as a transceiver 815. Multiple antennas 817a-b can be electrically coupled to the transceiver 815. The electronic or wireless device 804 may also include (not shown) multiple transmitters, multiple receivers, multiple transceivers and/or additional antennas.

[0105] The electronic or wireless device 804 may include a digital signal processor (DSP) 821. The electronic or wireless device 804 may also include a communications interface 823. The communications interface 823 may allow a user to interact with the electronic or wireless device 804.

[0106] The various components of the electronic or wireless device 804 can be coupled together by one or more buses, which may include a power bus, a control signal bus, a status signal bus, a data bus, etc. For the sake of clarity, the various buses are illustrated in figure 8 as a bus system 819.

[0107] The techniques described here can be used for various communication systems, including communication systems that are based on an orthogonal multiplexing scheme.
Examples of such communication systems include Orthogonal Frequency Division Multiple Access (OFDMA) systems, Single-Carrier Frequency Division Multiple Access (SC-FDMA) systems, and so on. An OFDMA system uses orthogonal frequency division multiplexing (OFDM), which is a modulation technique that divides the overall system bandwidth into multiple orthogonal subcarriers. These subcarriers can also be called bins, tones, etc. With OFDM, each subcarrier can be modulated independently with data. An SC-FDMA system can use interleaved FDMA (IFDMA) to transmit on subcarriers that are distributed across the system bandwidth, localized FDMA (LFDMA) to transmit on a block of adjacent subcarriers, or enhanced FDMA (EFDMA) to transmit on multiple blocks of adjacent subcarriers. In general, modulation symbols are sent in the frequency domain with OFDM and in the time domain with SC-FDMA.

[0108] The term "determining" encompasses a wide variety of actions and therefore "determining" may include calculating, computing, processing, deriving, investigating, looking up (for example, looking up in a table, a database or another data structure), ascertaining and the like. In addition, "determining" may include receiving (for example, receiving information), accessing (for example, accessing data in a memory) or the like. In addition, "determining" may include resolving, selecting, choosing, establishing and the like.

[0109] The phrase "based on" does not mean "based only on" unless expressly specified otherwise. In other words, the phrase "based on" describes both "based only on" and "based at least on".

[0110] The term "processor" should be interpreted broadly to encompass a general purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, and so on. Under some circumstances, a "processor" may refer to an application specific integrated circuit (ASIC), a programmable logic device (PLD), a field programmable gate array (FPGA), etc. The term "processor" can refer to a combination of processing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

[0111] The term "memory" should be interpreted broadly to encompass any electronic component capable of storing electronic information. The term memory can refer to various types of computer-readable media such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, registers, etc. Memory is considered to be in electronic communication with a processor if the processor can read information from and/or write information to the memory. Memory that is integral to a processor is in electronic communication with the processor.

[0112] The terms "instructions" and "code" should be interpreted broadly to include any type of computer-readable statement. For example, the terms "instructions" and "code" can refer to one or more programs, routines, subroutines, functions, procedures, etc. "Instructions" and "code" can comprise a single computer-readable statement or many computer-readable statements.

[0113] The functions described here can be implemented in software or firmware that is executed by hardware.
The functions can be stored as one or more instructions on a computer-readable medium. The terms "computer-readable medium" or "computer program product" refer to any tangible storage medium that can be accessed by a computer or a processor. By way of example, and not limitation, a computer-readable medium may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used here, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray® disc, where disks typically reproduce data magnetically, while discs reproduce data optically with lasers.

[0114] The methods described here comprise one or more steps or actions to achieve the described method. The method steps and/or actions can be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is necessary for the proper operation of the method being described, the order and/or use of specific steps and/or actions can be modified without departing from the scope of the claims.

[0115] Additionally, it should be appreciated that the modules and/or other means suitable for carrying out the methods and techniques described here, such as those illustrated by figures 3 and 5, can be downloaded and/or otherwise obtained by a device. For example, a device can be coupled to a server to facilitate the transfer of means for carrying out the methods described here. Alternatively, the various methods described here can be provided via a storage medium (for example, random access memory (RAM), read-only memory (ROM), a physical storage medium such as a compact disc (CD) or a floppy disk, etc.) so that a device can obtain the various methods upon coupling or providing the storage medium to the device.

[0116] It should be understood that the claims are not limited to the precise configuration and components illustrated above. Various modifications, changes and variations can be made in the arrangement, operation and details of the systems, methods and apparatus described here without departing from the scope of the claims.
Claims (15)

[0001] 1. Robust noise speech classification method, comprising: inputting classification parameters into a speech classifier from external components; generating, in the speech classifier, internal classification parameters from at least one of the input classification parameters; configuring a Normalized Autocorrelation Coefficient Function limit, the method characterized by the fact that the step of configuring a Normalized Autocorrelation Coefficient Function limit comprises: increasing a first voice limit for classifying a current frame as unvoiced when the signal-to-noise ratio (SNR) fails to exceed a first SNR limit, in which the first voice limit is not adjusted if the SNR is above the first SNR limit; and increasing an energy limit for classifying the current frame as unvoiced when the noise estimate exceeds a noise estimate limit, in which the energy limit is not adjusted if the noise estimate is below the noise estimate limit; and determining a speech mode classification based on the first voice limit and the energy limit.

[0002] 2. Method, according to claim 1, characterized by the fact that configuring the Normalized Autocorrelation Coefficient Function limit further comprises reducing a second voice limit for classifying a current frame as voiced when the SNR fails to exceed a second SNR limit, in which the second voice limit is not adjusted if the SNR is above the second SNR limit.

[0003] 3. Method, according to claim 1, characterized by the fact that the input classification parameters comprise one or more of the following: a suppressed noise speech signal; voice activity information; Linear Prediction reflection coefficients; Normalized Autocorrelation Coefficient Function information; or Normalized Autocorrelation Coefficient Function at pitch information.

[0004] 4. Method, according to claim 1, characterized by the fact that the internal parameters comprise one or more of the following: a zero crossing rate parameter; a current frame energy parameter; an advance frame energy parameter; a band energy ratio parameter; a three-frame average voiced energy parameter; a previous three-frame average voiced energy parameter; a parameter of the ratio of the previous three-frame average voiced energy to the current frame energy; a parameter of the ratio of the three-frame average voiced energy to the current frame energy; or a maximum subframe energy index parameter.

[0005] 5. Method, according to claim 1, characterized by the fact that configuring a Normalized Autocorrelation Coefficient Function limit further comprises comparing the noise estimate with a predetermined noise estimate limit.

[0006] 6. Method, according to claim 1, characterized by the fact that it further comprises selecting a parameter analyzer that applies the input classification parameters and the internal parameters to a state machine comprising a state for each speech classification mode.

[0007] 7. Method, according to claim 1, characterized by the fact that the speech mode classification comprises one or more of the following: a Transient mode; a Transient Ascending mode; a Transient Descending mode; a Voiced mode; an Unvoiced mode; or a Silent mode.

[0008] 8.
Method, according to claim 1, characterized by the fact that it further comprises updating at least one parameter, the at least one parameter comprising one or more of the following: a Normalized Autocorrelation Coefficient Function at pitch parameter; a three-frame average voiced energy parameter; an advance frame energy parameter; a previous three-frame average voiced energy parameter; or a voice activity detection parameter.

[0009] 9. Robust noise speech classification apparatus, comprising: mechanisms for inputting classification parameters into a speech classifier from external components; mechanisms for generating, in the speech classifier, internal classification parameters from at least one of the input classification parameters; mechanisms for configuring a Normalized Autocorrelation Coefficient Function limit, the apparatus characterized by the fact that the mechanisms for configuring a Normalized Autocorrelation Coefficient Function limit comprise: mechanisms for increasing a first voice limit for classifying a current frame as unvoiced when the signal-to-noise ratio (SNR) fails to exceed a first SNR limit, in which the first voice limit is not adjusted if the SNR is above the first SNR limit; mechanisms for increasing an energy limit for classifying the current frame as unvoiced when the noise estimate exceeds a noise estimate limit, in which the energy limit is not adjusted if the noise estimate is below the noise estimate limit; and mechanisms for determining a speech mode classification based on the first voice limit and the energy limit.

[0010] 10. Apparatus, according to claim 9, characterized by the fact that the mechanisms for configuring the Normalized Autocorrelation Coefficient Function limit further comprise mechanisms for reducing a second voice limit for classifying a current frame as voiced when the SNR fails to exceed a second SNR limit, in which the second voice limit is not adjusted if the SNR is above the second SNR limit.

[0011] 11. Apparatus, according to claim 9, characterized by the fact that the input classification parameters comprise one or more of the following: a suppressed noise speech signal; voice activity information; Linear Prediction reflection coefficients; Normalized Autocorrelation Coefficient Function information; or Normalized Autocorrelation Coefficient Function at pitch information.

[0012] 12. Apparatus, according to claim 9, characterized by the fact that the internal parameters comprise one or more of the following: a zero crossing rate parameter; a current frame energy parameter; an advance frame energy parameter; a band energy ratio parameter; a three-frame average voiced energy parameter; a previous three-frame average voiced energy parameter; a parameter of the ratio of the previous three-frame average voiced energy to the current frame energy; a parameter of the ratio of the three-frame average voiced energy to the current frame energy; or a maximum subframe energy index parameter.

[0013] 13. Apparatus, according to claim 9, characterized by the fact that the speech mode classification comprises one or more of the following: a Transient mode; a Transient Ascending mode; a Transient Descending mode; a Voiced mode; an Unvoiced mode; or a Silent mode.

[0014] 14.
Apparatus, according to claim 9, characterized by the fact that it further comprises mechanisms for updating at least one parameter, the at least one parameter comprising one or more of the following: a Normalized Autocorrelation Coefficient Function at pitch parameter; a three-frame average voiced energy parameter; an advance frame energy parameter; a previous three-frame average voiced energy parameter; or a voice activity detection parameter.

[0015] 15. Computer-readable memory for robust noise speech classification, characterized by the fact that it comprises instructions stored therein which, when executed by a computer, perform the steps of the method as defined in any one of claims 1 to 8.