WO2005083921A1 - Method for predicting the perceptual quality of audio signals - Google Patents

Info

Publication number
WO2005083921A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
determining
distortion
linear
quality
Prior art date
Application number
PCT/FI2004/050020
Other languages
French (fr)
Inventor
Brian C. J. Moore
Chin-Tuan Tan
Nick Zacharov
Ville-Veikko Mattila
Original Assignee
Nokia Corporation
Priority date
Filing date
Publication date
Application filed by Nokia Corporation
Priority to PCT/FI2004/050020
Publication of WO2005083921A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/20 Arrangements for detecting or preventing errors in the information received using signal quality detector

Definitions

  • This invention relates generally to signal processing, and particularly to processing in which the signal is subjected to linear or non-linear distortion.
  • Audio signals are often distorted by transducers such as loudspeakers and microphones, by transducer amplifiers and equalizers, by audio and speech coders, and by many transmission channels. Such transducers are also found in mobile communication devices. Because of the distortion, the received or reproduced signal is not identical to the original, and the tone quality can be degraded considerably. The distortion can be categorized as linear or non-linear.
  • Linear distortion involves changes in relative amplitudes and phases of frequency components present in a complex signal. Such changes are typically perceived as changes in timbre or in tone quality (coloration).
  • The perceptual effects of linear distortion are discussed in several studies of related art, e.g. "Toole, F.E.: Loudspeaker measurements and their relationship to listener preferences, Part 1 (J. Audio Eng. Soc. 34, pages 227 - 235) and Part 2 (J. Audio Eng. Soc. 34, pages 323 - 348), 1986" and "Gabrielsson, A. & al: Perceived sound quality of reproductions with different frequency responses and sound levels (J. Acoust. Soc. Am. 88, pages 1359 - 1366), 1990". These studies have not proposed a way to calculate the perceived quality of sounds that have been affected by said distortion.
  • Non-linear distortion introduces frequency components that were not present in the input signal. Effects of nonlinear distortion may be described as “harshness” or “roughness” or in terms of the perception of sounds that were not present in the original signal such as “crackles” or “clicks”.
  • The non-linear distortion of the signal may depend, e.g., on the level and/or the phase of the signal. In a speaker the distortion can depend, e.g., on the displacement of a coil.
  • The form of the non-linear distortion may be symmetric or asymmetric, and the distortion can appear as harmonic or inharmonic.
  • The level of non-linearity may vary from weak to strong or even to chaotic.
  • The audio processing methods of related art lack suitable methods for predicting the perceived quality of signals that have been subjected to linear and/or non-linear distortion, especially when the distortion is of the type produced by electro-acoustic transducers such as microphones and loudspeakers. For assessing the perceived quality of sounds transduced by microphones and loudspeakers, a representative panel of human listeners should be used to obtain subjective judgments. Such an arrangement is expensive and time-consuming.
  • the aim of the current invention is to provide an objective method for predicting the perceived quality (naturalness) of audio signals that have been affected by linear or non-linear distortion or by combination of those two.
  • the invention suggests one method, which can be applied to said distortions by means of three models.
  • Another aim of the current invention is to provide a method that can be utilized in broadcasting, telecommunications, sound reproduction, sound recording and sound coding. Naturally, the invention is also applicable in other applications.
  • Yet another aim of the current invention is to provide models for the linear and the non-linear and the combined distortion, which can be used in product audio quality assurance process, where said models can be used to evaluate the acoustic signal that is reproduced by e.g. an earpiece or a loudspeaker.
  • the linear model can be used to predict the quality with respect to spectral distortion and the non-linear model with respect to signal clipping.
  • the invention is characterized in that a type of a distortion is determined and depending on whether linear distortion, non-linear distortion, or both are detected, a model for quality prediction is selected and the quality of the distorted signal is determined.
  • the method according to the invention can give a direct estimate of the quality as perceived by human listeners, enabling e.g. a perceptually-valid comparison of different solutions for a product.
  • the method according to the invention is objective and the predictions appear to correlate highly with the mean ratings of human listeners' panels. Due to the invention, the evaluation of transducers and transmission channels can be implemented without a need to run expensive listening tests with large panels.
  • Figure 1 is a block model of an embodiment of a linear model according to the invention
  • Figure 2 is a block model of an embodiment of a non-linear model according to the invention.
  • Figure 3 is a block model of an embodiment of a combined model according to the invention.
  • the method for predicting the perceived quality of audio signals according to the invention is divided into three models according to the type of distortion.
  • the models can be applied e.g. to speech or music signals.
  • Signals can be either acoustic or electrical (analog or digital).
  • In the model for linear distortion, the frequency response of the device in question is taken as an input to evaluate the device's perceptual effect on the audio signal.
  • The model can be used to study any spectral distortion, regardless of whether the distortion is generated e.g. through a transducer or in an analog or digital filtering system for electric signals.
  • In the model for non-linear distortion, a time-domain speech or music signal can be taken as an input to evaluate the signal's perceptual quality.
  • The signal may be generated through e.g. a transducer or processed by any analog or digital signal processing system in the electric domain.
  • The model is relatively independent of linear distortion, so the input signals may also include linear distortion. However, this may not affect quality predictions for non-linear distortion.
  • both the frequency response and the waveform of the signal under study are taken as inputs.
  • the model is able to separate effects of linear and non-linear distortion on perceptual quality as well as to predict the combined effect on quality.
  • the linear model describes how the method according to the invention is used for predicting the perceptual quality of signals that have been affected by linear distortion.
  • By means of this model, a quantity inversely related to the subjective naturalness of a signal subjected to a specific spectral distortion is generated. This model is described with reference to figure 1.
  • A complex test signal PN, e.g. speech-shaped noise, pink noise, white noise or real speech, is used as an input.
  • The input level is adjusted ADJ so that the loudness level N of the output corresponds to a predetermined value, e.g. 86.4 phons.
  • Loudness level of time-invariant signals can be calculated e.g. by: Zwicker's model or Moore's model. These models are, however, often also applied to time-varying signals by computing the loudness for short signal frames. For time-varying signals also Glasberg and Moore's temporal loudness model can be used. Other models are also known, which can be used, so the invention is not limited only to one particular method.
  • An excitation pattern is calculated EP next according to some known method.
  • The calculation can follow e.g. the Moore model with use of a sharpening parameter SP.
  • The sharpness of the filters used to calculate the excitation pattern can be controlled by a parameter P, whose value at each centre frequency FC determines both the bandwidth and the slope of the filter; the higher the value of the parameter P, the sharper the filter is.
  • The filters are sharpened by multiplying the parameter value P at each centre frequency FC by the sharpening parameter SP.
  • The excitation pattern can be calculated as a function of ERBN number "i" at intervals of 0.5 ERBN, for the original undistorted signal and the spectrally distorted signal.
  • ERBN signifies the mean equivalent rectangular bandwidth of the auditory filter for young normally hearing listeners at moderate sound levels.
  • The calculated excitation level for any value of "i" is compared CMP to a noise floor value F.
  • When the excitation level falls below the value F, the excitation level is set equal to the noise floor value F.
  • A second-order difference DIFF2 is also determined for each value of "i" between the excitation level EO for the original signal and the excitation level ED for the spectrally distorted signal.
  • The value of ws can be e.g. ws = 0.5.
  • If the signal being judged is a speech signal, the values of W(i) are set to 0 for "i" < 4 and for "i" > 36.
  • The standard deviation SD of the weighted first-order differences WD1 is determined across all "i". The standard deviation SD of the weighted second-order differences WD2 is also determined across all "i". The standard deviation can be determined by subtracting the overall averages avg(WD1(i)) and avg(WD2(i)) from WD1(i) and WD2(i), respectively, for each value. Those differences are then squared. The squared values are summed, and the sums are divided by (n-1), where n is the number of values. The standard deviation is the square root of each of those ratios.
  • D = w SD{EO(i) - ED(i)} + (1-w) SD{EO(i+1) - ED(i+1) - EO(i) + ED(i)}
  • The final calculated quantity is referred to as the overall weighted excitation-pattern difference D.
  • Subjective naturalness is inversely related to excitation-pattern difference D and predictable from excitation-pattern difference D.
  • The first step is to apply a linear transform to each subjective rating score SLIN so that the range of the transformed scores ST is from 0 to 1:
  • MIN is the smallest obtained value of the subjective judgement SLIN and MAX is the largest.
  • Frequency regions are characterized in terms of (modelled) auditory filters.
  • the extent to which the output in one filter (frequency region) is caused by the input in that filter, can be measured by the correlation of the output of a simulated auditory filter in response to the original and distorted signals; the higher the correlation of the filter responses to the original and distorted signals, the lower is the distortion, and the higher is the perceptual quality.
  • This model comprises following steps:
  • the input signal is e.g. speech signal (for prediction of speech data) or music signal (for prediction of music data).
  • the model requires the waveform of the original input signal and the waveform of the signal after passing the non-linear system under test.
  • The original Ol and distorted Dl signals are time aligned TA.
  • The original Ol and distorted Dl signals are filtered to mimic the effects of transmission through the outer and middle ear. This can be done by using a finite impulse response (FIR) filter with 4097 coefficients.
  • Analog signals are at first converted into digital format AD and saved into memory MEM. Signal and corresponding coefficient CO are fetched from the memory and applied to the multiplier MA. The resulting products are summed for producing the digital filtered signal, which is then converted DA into analog format.
  • The filtered signals Ol, Dl are fed (separately) to an array FB of 40 gammatone filters, each 1-ERBN wide.
  • The cross-correlation CC between the response to the undistorted signal and the response to the distorted signal is determined. This is done in 30 ms non-overlapping frames. For each frame and each filter, the maximum value MAX of the cross-correlation is determined: this allows for any (small) frequency-dependent time delay in the non-linear system. For each 30 ms frame "i", the power at the output of each filter "j" is calculated and converted to a decibel measure Level(j) DB. The value of Level(j) is used to determine a weight that is applied to the short-term correlation for frame "i" and filter "j". The weights are scaled separately for each frame so that the sum of the weights is unity.
  • Empirical data show that the function relating subjective scores SNONLIN to Rnonlin is sigmoidal in shape. This function is described by four parameters (lower asymptote, upper asymptote, slope, and the Rnonlin value giving a medium rating).
  • The function is used to predict the score SNONLIN for a given type of distortion from the Rnonlin value for that distortion.
  • This model introduces the method for predicting the perceptual quality of signals subjected to combined linear and non-linear distortion. This model is described with reference to figure 3.
  • The linear model gives a predicted score SLIN (based on a test signal).
  • The non-linear model gives a predicted score SNONLIN (using speech or music as input, as appropriate).
  • The model includes a weighting parameter that characterizes the relative importance of the linear and non-linear parts of the distortion.
  • The most accurate predictions are obtained with this parameter set to a small value, such as 0.05, indicating that linear distortion has smaller perceptual effects than non-linear distortion.
  • the linear model is non-intrusive, which means that the model does not require any reference signal when predicting spectral quality.
  • the model could be used in e.g. mobile terminal for controlling adaptive equalization of an earpiece signal, when the acoustic response of the earpiece is known.
  • the linear model could be used to adapt the equalizer within certain constraints to reproduce an optimal acoustic response.
  • the non-linear model is intrusive and requires a reference signal. In this way, it could be used for the quality prediction of signals that are stored into terminal's memory.
  • the model could be applied to measuring the quality of audio samples stored into terminals (e.g., down-loaded music, MIDI, etc.)
  • an adaptive equalizer or dynamic range control is used to optimise the level versus quality of a music signal.
  • the user of a mobile terminal might wish to increase the loudness of the music sample and the non-linear model would be used to decide what increase can still be allowed without causing perceptual degradation of audio quality.
  • The model for non-linear distortion can also be used for controlling substantially any speech-processing algorithm, if the input signal is taken as a reference signal, even if the signal has already been corrupted in previous processing blocks.
  • the model could also produce accurate predictions in cases where the reference is not representing non-processed quality.
  • the method according to the invention can be implemented by a computer program that comprises computer readable instructions for receiving a specification of the frequency response and time waveforms of test signals.
  • the computer program can be written in any relevant programming language, e.g. in MATLAB or C / C++ and be saved in a memory of any suitable computer program product.
  • the method according to the invention can be also implemented by a stand-alone system arranged in a digital-signal-processing chip.
  • The frequency response is acquired from the device in which the system or the program is arranged, as are the time waveforms at the input and output of the device. According to one implementation the frequency response is determined from the two time waveforms.
  • Because of its relatively low computational complexity, the system and/or computer program can be arranged not only into devices such as audio quality testing devices but also into mobile terminals to control audio and speech processing algorithms. In that situation it could be used for tuning transducer equalizers and for controlling signal compression and gain control.

Abstract

A method for signal processing, wherein said signal is distorted and a subjective quality of said signal is predicted. A type of a distortion is determined and depending on whether linear distortion, non-linear distortion, or both are detected, a model for quality prediction is selected and the quality of the distorted signal is determined. The subjective quality of the signal subjected to that distortion is determined. The invention relates also to a system and a computer program product for implementing the prediction models.

Description

METHOD FOR PREDICTING THE PERCEPTUAL QUALITY OF AUDIO SIGNALS
Technical Field
This invention relates generally to signal processing, and particularly to processing in which the signal is subjected to linear or non-linear distortion.
Background Art
Audio signals are often distorted by transducers such as loudspeakers and microphones, by transducer amplifiers and equalizers, by audio and speech coders, and by many transmission channels. Such transducers are also found in mobile communication devices. Because of the distortion, the received or reproduced signal is not identical to the original, and the tone quality can be degraded considerably. The distortion can be categorized as linear or non-linear.
Linear distortion involves changes in relative amplitudes and phases of frequency components present in a complex signal. Such changes are typically perceived as changes in timbre or in tone quality (coloration). The perceptual effects of linear distortion are discussed in several studies of related art, e.g. "Toole, F.E.: Loudspeaker measurements and their relationship to listener preferences, Part 1 (J. Audio Eng. Soc. 34, pages 227 - 235) and Part 2 (J. Audio Eng. Soc. 34, pages 323 - 348), 1986" and "Gabrielsson, A. & al: Perceived sound quality of reproductions with different frequency responses and sound levels (J. Acoust. Soc. Am. 88, pages 1359 - 1366), 1990". These studies have not given proposals for calculating the perceived quality of sounds that have been affected by said distortion.
Non-linear distortion, on the other hand, introduces frequency components that were not present in the input signal. Effects of non-linear distortion may be described as "harshness" or "roughness", or in terms of the perception of sounds that were not present in the original signal, such as "crackles" or "clicks". The non-linear distortion of the signal may depend, e.g., on the level and/or the phase of the signal. In a speaker the distortion can depend, e.g., on the displacement of a coil. In addition, the form of the non-linear distortion may be symmetric or asymmetric, and the distortion can appear as harmonic or inharmonic. The level of non-linearity may vary from weak to strong or even to chaotic.
There are some studies concerning non-linear distortion, e.g. "Czerwinski, E. & al: Multitone testing of sound system components - some results and conclusions, Part 2: Modeling and application (J. Audio Eng. Soc. 49, pages 1011 - 1048), 2001" and "Geddes, E. R, and Lee, L.W.: Auditory perception of nonlinear distortion - Theory (Proceedings of the Audio Engineering Society; 115th Convention, NY, USA), 2003." These studies have focused on characterizing the effects of non-linear distortion, but they have not proposed a way to calculate the perceived quality of sounds that have been affected by said distortion. For example, the publication of Czerwinski acknowledges that attempts to find simple quantitative ratios between non-linear parameters and subjectively detected distortion have not been entirely successful. The method proposed by Geddes and Lee appears to be applicable only to a limited set of examples of non-linear distortion.
In related art, methods for assessing the perceived quality of audio signals have been developed for the purpose of evaluating bit-rate reduction systems used in digital coding, complex audio processing or transmission chains. Examples of such methods can be found in the ITU-R BS.1387-1 Recommendation (PEAQ) and the ITU-T P.862 Recommendation (PESQ). These methods are intended to be used only with electrical signals, and not with signals that have been passed through electro-acoustic transducers. Also, PESQ is intended for narrowband speech measurements (300 Hz - 3400 Hz). These methods do not provide a satisfactory way of separating the effects of linear and non-linear distortion. As can be seen, the audio processing methods of related art lack suitable methods for predicting the perceived quality of signals that have been subjected to linear and/or non-linear distortion, especially when the distortion is of the type produced by electro-acoustic transducers such as microphones and loudspeakers. For assessing the perceived quality of sounds transduced by microphones and loudspeakers, a representative panel of human listeners should be used to obtain subjective judgments. Such an arrangement is expensive and time-consuming.
In the applicant's view, the related art offers no methods for accurately predicting the perceived effects of linear distortion; of non-linear distortion produced by transducers such as microphones and loudspeakers or by transducer amplifiers and equalizers; or of a combination of those. Hence, there is a need for such a method. This invention addresses that need.
Disclosure of Invention
The aim of the current invention is to provide an objective method for predicting the perceived quality (naturalness) of audio signals that have been affected by linear distortion, by non-linear distortion, or by a combination of the two. The invention proposes one method, which can be applied to said distortions by means of three models.
Another aim of the current invention is to provide a method that can be utilized in broadcasting, telecommunications, sound reproduction, sound recording and sound coding. Naturally, the invention is also applicable in other applications.
Yet another aim of the current invention is to provide models for the linear, the non-linear and the combined distortion, which can be used in a product audio quality assurance process, where said models can be used to evaluate the acoustic signal that is reproduced by e.g. an earpiece or a loudspeaker. The linear model can be used to predict the quality with respect to spectral distortion and the non-linear model with respect to signal clipping.
To put it more precisely, the invention is characterized in that a type of distortion is determined and, depending on whether linear distortion, non-linear distortion, or both are detected, a model for quality prediction is selected and the quality of the distorted signal is determined.
As opposed to measurements of related art, the method according to the invention can give a direct estimate of the quality as perceived by human listeners, enabling e.g. a perceptually-valid comparison of different solutions for a product.
The method according to the invention is objective and the predictions appear to correlate highly with the mean ratings of human listeners' panels. Due to the invention, the evaluation of transducers and transmission channels can be implemented without a need to run expensive listening tests with large panels.
Brief Description of Drawings
Figure 1 is a block model of an embodiment of a linear model according to the invention,
Figure 2 is a block model of an embodiment of a non-linear model according to the invention, and
Figure 3 is a block model of an embodiment of a combined model according to the invention.
Best mode for carrying out the invention
The method for predicting the perceived quality of audio signals according to the invention is divided into three models according to the type of distortion. The models can be applied e.g. to speech or music signals. Signals can be either acoustic or electrical (analog or digital).
In the model for linear distortion, the frequency response of the device in question is taken as an input to evaluate the device's perceptual effect on the audio signal. In general, the model can be used to study any spectral distortion, regardless of whether the distortion is generated e.g. through a transducer or in an analog or digital filtering system for electric signals.
On the other hand, in the model for non-linear distortion a time-domain speech or music signal can be taken as an input to evaluate the signal's perceptual quality. Again, the signal may be generated through e.g. a transducer or processed by any analog or digital signal processing system in the electric domain. The model is relatively independent of linear distortion, so the input signals may also include linear distortion. However, this may not affect quality predictions for non-linear distortion.
In the combined model both the frequency response and the waveform of the signal under study are taken as inputs. The model is able to separate effects of linear and non-linear distortion on perceptual quality as well as to predict the combined effect on quality.
Linear model
The linear model describes how the method according to the invention is used for predicting the perceptual quality of signals that have been affected by linear distortion. By means of this model, a quantity inversely related to the subjective naturalness of a signal subjected to a specific spectral distortion is generated. This model is described with reference to figure 1.
In the linear model a complex test signal PN, e.g. speech-shaped noise, pink noise, white noise or real speech, to mention a few possibilities, is used as an input, and the input level is adjusted ADJ so that the loudness level N of the output corresponds to a predetermined value, e.g. 86.4 phons. The loudness level of time-invariant signals can be calculated e.g. by Zwicker's model or Moore's model. These models are, however, often also applied to time-varying signals by computing the loudness for short signal frames. For time-varying signals, Glasberg and Moore's temporal loudness model can also be used. Other known models can be used as well, so the invention is not limited to one particular method.
An excitation pattern is calculated EP next according to some known method. The calculation can follow e.g. the Moore model with use of a sharpening parameter SP. The sharpening parameter is advantageously SP = 1.5. Here too, other suitable models can be used.
The sharpness of the filters used to calculate the excitation pattern can be controlled by a parameter P, whose value at each centre frequency FC determines both the bandwidth and the slope of the filter; the higher the value of the parameter P, the sharper the filter is. In the model according to the method of the invention the filters are sharpened by multiplying the parameter value P at each centre frequency FC by the sharpening parameter SP.
The excitation pattern can be calculated as a function of ERBN number "i" at intervals of 0.5 ERBN, for the original undistorted signal and the spectrally distorted signal. ERBN signifies the mean equivalent rectangular bandwidth of the auditory filter for young normally hearing listeners at moderate sound levels.
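As an illustration of the ERBN-number scale, the following minimal Python sketch converts between frequency and ERBN number and builds a grid at 0.5 ERBN intervals. It assumes the commonly used Glasberg and Moore (1990) formulas, which the text does not state explicitly; the function names and the grid limits (0.5 to 40) are hypothetical choices made for the example.

```python
import math

def hz_to_erb_number(freq_hz):
    """ERBN number for a frequency in Hz (assumed Glasberg & Moore 1990 formula)."""
    return 21.4 * math.log10(4.37 * freq_hz / 1000.0 + 1.0)

def erb_number_to_hz(erb_number):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (erb_number / 21.4) - 1.0) * 1000.0 / 4.37

def erb_bandwidth_hz(freq_hz):
    """Equivalent rectangular bandwidth in Hz at a given centre frequency."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)

def erb_grid(lowest=0.5, highest=40.0, step=0.5):
    """ERBN numbers "i" at which the excitation pattern is sampled.

    The limits are hypothetical; only the 0.5 ERBN step comes from the text.
    """
    count = int(round((highest - lowest) / step)) + 1
    return [lowest + k * step for k in range(count)]

if __name__ == "__main__":
    for i in erb_grid()[::10]:
        print(i, round(erb_number_to_hz(i), 1))
```

On this scale 1 kHz corresponds to an ERBN number of roughly 15.6, so the 0.5 ERBN step samples the excitation pattern about twice per auditory filter bandwidth.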
The calculated excitation level for any value of "i" is compared CMP to a noise floor value F. When the excitation level falls below the value F the excitation level is set equal to the noise floor value F. The noise floor value can be e.g. F = 32.
For each value of "i", a first-order difference DIFF1 is determined between the excitation level EO for the original signal and the excitation level ED for the spectrally distorted signal:
D1(i) = EO(i) - ED(i).
A second-order difference DIFF2 is also determined for each value of "i" between the excitation level EO for the original signal and the excitation level ED for the spectrally distorted signal:
D2(i) = {EO(i+1) - ED(i+1)} - {EO(i) - ED(i)}.
These first- and second-order differences D1 and D2 at each "i" are then weighted W according to the weighting function defined below, with parameter ws:
W(i) = 1 for i < 17.5,
W(i) = 1 - ws(i - 17.5)/46 for i from 17.5 to 40.
The value of ws can be e.g. ws = 0.5.
If the signal being judged is a speech signal, the values of W(i) are set to 0 for "i" < 4 and for "i" > 36.
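The weighting function can be written as a short Python sketch. The function name and the `speech` flag are hypothetical; the break point 17.5, the divisor 46, the example value ws = 0.5 and the speech limits 4 and 36 are taken from the text above.

```python
def weight(i, ws=0.5, speech=False):
    """Weight W(i) applied to the excitation differences at ERBN number i."""
    # For speech, components below 4 and above 36 on the ERBN scale are ignored.
    if speech and (i < 4 or i > 36):
        return 0.0
    # Flat weighting in the lower part of the scale.
    if i < 17.5:
        return 1.0
    # Linearly decreasing weighting from 17.5 upwards (defined up to 40).
    return 1.0 - ws * (i - 17.5) / 46.0

if __name__ == "__main__":
    for i in (3.0, 10.0, 17.5, 30.0, 40.0):
        print(i, weight(i), weight(i, speech=True))
```

With ws = 0.5 the weight falls only to about 0.76 at i = 40, so high-frequency differences are de-emphasised rather than discarded, except for speech, where they are removed outright above 36.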
The standard deviation SD of the weighted first-order differences WD1 is determined across all "i". The standard deviation SD of the weighted second-order differences WD2 is also determined across all "i". The standard deviation can be determined by subtracting the overall averages avg(WD1(i)) and avg(WD2(i)) from WD1(i) and WD2(i), respectively, for each value. Those differences are then squared. The squared values are summed, and the sums are divided by (n-1), where n is the number of values. The standard deviation is the square root of each of those ratios.
A weighted sum of the standard deviations of the first-order (D1) and second-order (D2) differences, with weight parameter w, is formed in the following way:
D = w SD{EO(i) - ED(i)} + (1-w) SD{EO(i+1) - ED(i+1) - EO(i) + ED(i)}
The value of the weight parameter can be e.g. w = 0.4.
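The calculation of D from two excitation patterns can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name is hypothetical, the excitation levels and weights are assumed to be given as equal-length lists sampled on the same grid of "i" values, and "i+1" is assumed to mean the next grid point. `statistics.stdev` divides by (n-1), matching the description above.

```python
import statistics

def excitation_difference(eo, ed, weights, w=0.4, floor=32.0):
    """Overall weighted excitation-pattern difference D (illustrative sketch).

    eo, ed  : excitation levels of the original and distorted signals
    weights : W(i) for each grid point (assumed precomputed)
    w       : weight of the first-order term (example value 0.4)
    floor   : noise floor F (example value 32)
    """
    # Excitation levels below the noise floor are set equal to the floor.
    eo = [max(x, floor) for x in eo]
    ed = [max(x, floor) for x in ed]
    # First-order differences D1(i) = EO(i) - ED(i).
    d1 = [a - b for a, b in zip(eo, ed)]
    # Second-order differences D2(i) = D1(i+1) - D1(i); one value fewer than D1.
    d2 = [d1[k + 1] - d1[k] for k in range(len(d1) - 1)]
    # Apply the weighting function to both sets of differences.
    wd1 = [wt * d for wt, d in zip(weights, d1)]
    wd2 = [wt * d for wt, d in zip(weights, d2)]
    # Sample standard deviations (divisor n-1) combined with weight w.
    return w * statistics.stdev(wd1) + (1.0 - w) * statistics.stdev(wd2)

if __name__ == "__main__":
    original = [40.0, 40.0, 40.0, 40.0]
    tilted = [40.0, 41.0, 42.0, 43.0]
    print(excitation_difference(original, tilted, [1.0] * 4))
```

Because only standard deviations enter D, a uniform level offset between the two patterns gives D = 0, while a spectral tilt or ripple increases D. At least three grid points are needed so that both standard deviations are defined.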
The final calculated quantity is referred to as the overall weighted excitation-pattern difference D. Subjective naturalness is inversely related to the excitation-pattern difference D and can be predicted from it.
To make the relationship between the values of the excitation-pattern difference D and the subjective judgments SLIN more linear, the SLIN values are transformed. The first step is to apply a linear transform to each subjective rating score SLIN so that the range of the transformed scores ST is from 0 to 1:
ST = (SLIN - MIN)/(MAX - MIN),
where MIN is the smallest obtained value of the subjective judgement SLIN and MAX is the largest.
The second step is to apply the arcsine transform to the transformed scores ST, giving the transformed values T:
T = 2 arcsin(√ST)
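The two-step transform and its inverse can be sketched in Python as below. The function names are hypothetical, and clamping the argument of the inverse to the range 0 to π is an added assumption, made so that a predicted value slightly outside that range still maps onto a valid score.

```python
import math

def to_transformed(score, min_score, max_score):
    """Map a subjective score to T = 2*arcsin(sqrt(ST))."""
    # Step 1: linear rescaling to the range 0..1.
    st = (score - min_score) / (max_score - min_score)
    # Step 2: arcsine transform.
    return 2.0 * math.asin(math.sqrt(st))

def from_transformed(t, min_score, max_score):
    """Inverse mapping from a (predicted) T back to the rating scale."""
    # Assumption: clamp T to [0, pi] before inverting.
    t = min(max(t, 0.0), math.pi)
    st = math.sin(t / 2.0) ** 2
    return min_score + st * (max_score - min_score)

if __name__ == "__main__":
    t = to_transformed(5.5, 1.0, 10.0)
    print(t, from_transformed(t, 1.0, 10.0))
```

The smallest score maps to T = 0 and the largest to T = π; the arcsine stretches the ends of the rating scale, where ratings tend to compress.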
The resulting transformed values T are almost linearly related to the values of the excitation-pattern difference D. Hence, calculated values of the excitation-pattern difference D are used to predict the transformed values T. The inverse of the transformations described above is used to predict the values of the subjective judgment SLIN from the transformed values T.
Non-linear model
Reference is now made to figure 2, where an example of a model for predicting the perceptual quality of signals subjected to non-linear distortion is presented.
When non-linearities are present in a system, the output of the system in a given frequency region may be influenced by input components in a different region. Frequency regions are characterized in terms of (modelled) auditory filters.
The extent to which the output in one filter (frequency region) is caused by the input in that filter, can be measured by the correlation of the output of a simulated auditory filter in response to the original and distorted signals; the higher the correlation of the filter responses to the original and distorted signals, the lower is the distortion, and the higher is the perceptual quality.
This model comprises the following steps:
The input signal is e.g. a speech signal (for prediction of speech data) or a music signal (for prediction of music data). The model requires the waveform of the original input signal and the waveform of the signal after it has passed through the non-linear system under test.
First the original Ol and distorted Dl signals are time aligned TA. After this the original Ol and distorted Dl signals are filtered to mimic the effects of transmission through the outer and middle ear. This can be done by using a finite impulse response (FIR) filter with 4097 coefficients. Analog signals are first converted into digital format AD and saved into memory MEM. The signal samples and the corresponding coefficients CO are fetched from the memory and applied to the multiplier MA. The resulting products are summed to produce the digitally filtered signal, which is then converted DA into analog format.
The filtered signals Ol, Dl are fed (separately) to an array FB of 40 gammatone filters, each 1 ERBN wide. For each filter, the cross-correlation CC between the response to the undistorted signal and the response to the distorted signal is determined. This is done in 30 ms non-overlapping frames. For each frame and each filter, the maximum value MAX of the cross-correlation is determined; this allows for any (small) frequency-dependent time delay in the non-linear system.
For each 30 ms frame "i", the power at the output of each filter "j" is calculated and converted to a decibel measure Level(j) DB. The value of Level(j) is used to determine a weight that is applied to the short-term correlation for frame "i" and filter "j". The weights are scaled separately for each frame so that the sum of the weights is unity.
For each filter "j" the weighted cross-correlation is averaged AVE1 across the 30 ms frames. Further, the average correlation for each filter is averaged across filters AVE2. This gives a measure, Rnonlin, which decreases with increasing distortion.
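The framing, weighting and averaging steps can be sketched as follows. This is an illustration under several assumptions that are not taken from the description: the outputs of the gammatone filters are assumed to be available already as lists of samples (one list per filter), the power is taken from the response to the distorted signal, the cross-correlation is normalised by the frame energies, and the mapping from Level(j) to a weight (`level_weight`, with a hypothetical -60 dB floor) is a placeholder, since the actual mapping is not given here. All names are hypothetical.

```python
import math


def max_normalised_xcorr(x, y, max_lag):
    """Largest normalised cross-correlation of two equal-length frames
    over lags -max_lag..max_lag."""
    energy_x = sum(v * v for v in x)
    energy_y = sum(v * v for v in y)
    if energy_x == 0.0 or energy_y == 0.0:
        return 0.0  # a silent frame carries no correlation information
    norm = math.sqrt(energy_x * energy_y)
    n = len(x)
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for k in range(max(0, -lag), min(n, n - lag)):
            acc += x[k] * y[k + lag]
        best = max(best, acc / norm)
    return best


def level_weight(level_db, floor_db=-60.0):
    """Hypothetical weight: level in dB above an assumed floor."""
    return max(level_db - floor_db, 0.0)


def r_nonlin(orig_bands, dist_bands, frame_len, max_lag=8):
    """Average weighted correlation across frames (AVE1), then filters (AVE2)."""
    n_filters = len(orig_bands)
    n_frames = len(orig_bands[0]) // frame_len
    if n_frames == 0:
        raise ValueError("signal shorter than one frame")
    per_filter_sum = [0.0] * n_filters
    for i in range(n_frames):
        lo, hi = i * frame_len, (i + 1) * frame_len
        corr, weights = [], []
        for j in range(n_filters):
            xo, xd = orig_bands[j][lo:hi], dist_bands[j][lo:hi]
            corr.append(max_normalised_xcorr(xo, xd, max_lag))
            power = sum(v * v for v in xd) / frame_len
            level_db = 10.0 * math.log10(max(power, 1e-12))
            weights.append(level_weight(level_db))
        total = sum(weights)
        if total > 0.0:
            weights = [wt / total for wt in weights]  # unit sum per frame
        else:
            weights = [1.0 / n_filters] * n_filters
        for j in range(n_filters):
            per_filter_sum[j] += weights[j] * corr[j]
    per_filter_avg = [s / n_frames for s in per_filter_sum]  # AVE1
    return sum(per_filter_avg) / n_filters                   # AVE2
```

Because the weights sum to unity within each frame, a literal average across filters is bounded above by one divided by the number of filters in this sketch; it reaches that bound when the original and distorted responses are identical and falls as distortion increases.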
Empirical data show that the function relating subjective scores SNONLIN to Rnonlin is sigmoidal in shape. This function is described by four parameters (lower asymptote, upper asymptote, slope, and the Rnonlin value giving a medium rating).
The function is used to predict the score SNONLIN for a given type of distortion from the Rnonlin value for that distortion.
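A four-parameter sigmoid of this kind can be sketched as below. The logistic form is an assumption, since the description only states that the function is sigmoidal, and the function and parameter names are hypothetical; actual parameter values would come from calibration against listening tests.

```python
import math


def predict_s_nonlin(r, lower, upper, slope, r_mid):
    """Logistic mapping from Rnonlin to a predicted score.

    lower, upper: asymptotes of the rating scale
    slope:        steepness of the transition
    r_mid:        Rnonlin value that yields the mid-scale rating
    """
    return lower + (upper - lower) / (1.0 + math.exp(-slope * (r - r_mid)))
```

At r equal to r_mid the function returns the midpoint between the two asymptotes, and with a positive slope the predicted score rises monotonically with Rnonlin.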
Combined model
This model introduces the method for predicting the perceptual quality of signals subjected to combined linear and non-linear distortion. This model is described with reference to figure 3.
At first the linear and non-linear methods are run independently. The linear model gives a predicted score SLIN (based on a test signal). The non-linear model gives a predicted score SNONLIN (using speech or music as input, as appropriate). The predicted scores of those two models are combined to give a predicted overall score SOVERALL according to the rule:
SOVERALL = α·SLIN + (1 - α)·SNONLIN
where α is a parameter of the model that characterizes the relative importance of the linear and non-linear parts of the distortion. The most accurate predictions are obtained with α set to a small value, such as 0.05, indicating that linear distortion has smaller perceptual effects than non-linear distortion.
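The combination rule is a simple convex mixture of the two predicted scores, which can be sketched as follows. The function name is hypothetical; the default of 0.05 is the example value of α mentioned above.

```python
def overall_score(s_lin, s_nonlin, alpha=0.05):
    """Combine the linear and non-linear predictions with weight alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * s_lin + (1.0 - alpha) * s_nonlin
```

With α = 0.05, predicted scores of 8 (linear) and 4 (non-linear) combine to 0.05·8 + 0.95·4 = 4.2, so the overall score stays close to the non-linear prediction.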
It should be noted that, for all the models, the parameter values mentioned above have been selected on the basis of a series of calibration experiments (subjective listening tests) so as to reach as high a correlation as possible between objective quality ratings and subjective judgements. It will therefore be evident that those values can vary independently.
Electric signals
The previous description has concentrated on acoustic signals. It should be noted that the invention is also applicable to electric signals. As was seen, the linear model is non-intrusive, which means that it does not require any reference signal when predicting spectral quality. The model could therefore be used, for example, in a mobile terminal for controlling adaptive equalization of an earpiece signal when the acoustic response of the earpiece is known. The linear model could be used to adapt the equalizer within certain constraints to reproduce an optimal acoustic response.
In contrast, the non-linear model is intrusive and requires a reference signal. It could therefore be used for predicting the quality of signals that are stored in a terminal's memory, for example audio samples such as downloaded music or MIDI. One scenario can be presented as an example: an adaptive equalizer or dynamic range control is used to optimise the level versus quality of a music signal. Here the user of a mobile terminal might wish to increase the loudness of the music sample, and the non-linear model would be used to decide what increase can still be allowed without causing perceptual degradation of audio quality.
The model for non-linear distortion can also be used for controlling substantially any speech-processing algorithm if the input signal is taken as the reference signal, even if that signal has already been corrupted in previous processing blocks. The model could thus also produce accurate predictions in cases where the reference does not represent unprocessed quality.
System for predicting the perceptual quality of signals
The method according to the invention can be implemented by a computer program that comprises computer readable instructions for receiving a specification of the frequency response and time waveforms of test signals. The computer program can be written in any relevant programming language, e.g. in MATLAB or C / C++ and be saved in a memory of any suitable computer program product. The method according to the invention can be also implemented by a stand-alone system arranged in a digital-signal-processing chip.
The frequency response is acquired from the device in which the system or the program is arranged, as are the time waveforms at the input and output of the device. According to one implementation, the frequency response is determined from the two time waveforms.
The system and/or computer program can be arranged into devices such as audio quality testing devices and also, owing to its relatively low computational complexity, into mobile terminals to control audio and speech processing algorithms. In that situation it could be used for tuning transducer equalizers and for controlling signal compression and gain. The foregoing detailed description is provided for clearness of understanding only. It should be noted that there are different perceptual auditory models that can be used with the invention, and therefore there can be alternatives for some of the calculations described above. With this in mind, no unnecessary limitation should be read from the description into the claims herein.

Claims

1. A method for signal processing, wherein a signal is acquired, said signal being subjected to distortion, in which method the quality of said signal is predicted, characterized in that a type of a distortion is determined and depending on whether linear distortion, non-linear distortion, or both are detected, a model for quality prediction is selected and the quality of the distorted signal is determined.
2. The method according to claim 1 , characterized in that, when linear distortion is present in the signal, the prediction model comprises at least steps for - adjusting (ADJ) the output of an inputted test signal (PN) to correspond a predetermined value, - determining an excitation level with a sharpener parameter as a function of a mean equivalent rectangular bandwidth for the undistorted signal and the distorted signal (EP), - comparing (CMP) the determined excitation level to a noise floor value (F), and setting equal to said noise floor value if being below it, - determining a first-order difference (DIFF1) between the excitation level (EO) of the original signal and the excitation level (ED) of the distorted signal, - determining a second-order difference (DIFF2) between excitation level (EO) of the original signal and the excitation level (ED) of the distorted signal, - weighting (W) the determined first- (D1) and second- order (D2) differences, - determining a standard deviation (SD) of the weighted first- (WD1) and second-order (WD2) differences, - determining a sum (SUM) of the determined standard deviations.
3. The method according to claim 2, characterized in that, the quality is determined from subjective scores (SLIN) being inversely related to the determined sum (SUM).
4. The method of claim 2 or 3, characterized in that, a loudness level of the output is adjusted for corresponding the predetermined value.
5. The method according to any of the preceding claims 3 - 4, characterized in that subjective scores (SLIN) are transformed (ST) by applying first a linear transform to each subjective score (SLIN) and then applying an arcsin transform to the transformed subjective scores (ST), whereupon the resulting transformed values (T) are substantially linearly related to the determined sum.
6. The method according to claim 1, characterized in that, when non-linear distortion is present in the signal, the prediction model comprises at least steps for - acquiring a waveform of the original input signal (Ol) and a waveform of the distorted signal (Dl), - time aligning (TA) the original (Ol) and the distorted signals (Dl), - filtering (FIR) the original (Ol) and distorted signals (Dl), - feeding the filtered signals to a filter array (FB), - determining a cross-correlation (CC) between the response to the original signal (Ol) and the response to distorted signal (Dl) for each filter, - determining the maximum value of the cross-correlation for each filter and each frame (MAX), - defining a weight value by determining power at the output of each filter for each frame and converting the power to a decibel measure (DB), - determining an average of the weighted cross-correlation for each filter across frames (AVE1), - determining another average (Rnonlin) of the average cross-correlation for each filter across filters (AVE2), - predicting the quality by means of the other average (Rnonlin).
7. The method according to claim 6, characterized in that, said quality is determined from subjective scores (SNONLIN) being related to the other average (Rnonlin).
8. The method according to claims 1 - 7, characterized in that, when both linear and non-linear distortion is present in the signal the prediction model comprises at least steps for - determining the subjective scores for the signal of the linear (SLIN) and non-linear (SNONLIN) distortion, - determining a predicted overall score (SOVERALL) by means of a parameter (α) indicating a relative importance of the distortion.
9. The method according to one of the claims 1 - 8, characterized in that, said signal is an acoustic signal.
10. A method for signal processing, wherein a signal is acquired, said signal being subject to linear distortion, and in which method a quality of said signal is predicted characterized in that the prediction comprises at least steps for - adjusting (ADJ) the output of an inputted test signal (PN) to correspond a predetermined value, - determining an excitation level with a sharpener parameter as a function of a mean equivalent rectangular bandwidth for the undistorted signal and the distorted signal (EP), - comparing (CMP) the determined excitation level to a noise floor value (F), and setting equal to said noise floor value if being below it, - determining a first-order difference (DIFF1) between the excitation level (EO) of the original signal and the excitation level (ED) of the distorted signal, - determining a second-order difference (DIFF2) between excitation level (EO) of the original signal and the excitation level (ED) of the distorted signal, - weighting (W) the determined first- (D1) and second- order (D2) differences, - determining a standard deviation (SD) of the weighted first- (WD1) and second-order (WD2) differences, - determining a sum (SUM) of the determined standard deviations.
11. The method according to claim 10, characterized in that, the quality is determined from subjective scores (SLIN) being inversely related to the determined sum (SUM).
12. The method according to claim 10 or 11 , characterized in that, a loudness level of the output is adjusted for corresponding the predetermined value.
13. The method according to any of the preceding claims 10 - 12, characterized in that subjective scores (SLIN) are transformed (ST) by applying first a linear transform to each subjective score (SLIN) and then applying an arcsin transform to the transformed subjective scores (ST), whereupon the resulting transformed values (T) are substantially linearly related to the determined sum.
14. A method for signal processing, wherein a signal is acquired, said signal being subjected to non-linear distortion, and in which method a quality of said signal is predicted characterized in that the prediction comprises at least steps for - acquiring a waveform of the original input signal (Ol) and a waveform of the distorted signal (Dl), - time aligning (TA) the original (Ol) and the distorted signals (Dl), - filtering (FIR) the original (Ol) and distorted signals (Dl), - feeding the filtered signals to a filter array (FB), - determining a cross-correlation (CC) between the response to the original signal (Ol) and the response to distorted signal (Dl) for each filter, - determining the maximum value of the cross-correlation for each filter and each frame (MAX), - defining a weight value by determining power at the output of each filter for each frame and converting the power to a decibel measure (DB), - determining an average of the weighted cross-correlation for each filter across frames (AVE1), - determining another average (Rnonlin) of the average cross-correlation for each filter across filters (AVE2), - predicting the quality by means of the other average (Rnonlin).
15. The method according to claim 14, characterized in that, the quality is determined from subjective scores (SNONLIN) being related to the other average (Rnonlin).
16. A method for signal processing, wherein a signal is acquired, said signal being subject to both linear and non-linear distortion, in which method the quality of said signal is predicted, characterized in that the prediction comprises at least steps for - determining the subjective scores for the signal by means of linear (SLIN) and non-linear (SNONLIN) distortion, - determining a predicted overall score (SOVERALL) by means of a parameter (α) indicating a relative importance of the distortion.
17. The method according to claim 16, characterized in that, effects of linear and non-linear distortion on the quality are separated.
18. A signal processing system comprising means for acquiring a signal, said signal being subject to distortion, which system comprises at least means for predicting a quality of said signal characterized in that said system further comprises means for determining a type of the distortion and whether linear distortion, non-linear distortion or both are detected selecting a model for quality prediction for determining the quality of the distorted signal.
19. The signal processing system according to claim 18, characterized in that said system is configured into device for mobile communication.
20. The signal processing system according to claim 18, characterized in that said system is configured into device for audio quality testing.
21. The signal processing system according to claim 18 or 19 or 20, characterized in that said system is part of a speech recognition system.
22. The signal processing system according to one of the preceding claims 18 - 22, characterized in that said signal is either acoustic or electrical.
23. A computer program product for signal processing, comprising memory means for storing a computer program comprising first computer readable instructions for acquiring a signal, said signal being subject to distortion, and second instructions for predicting a quality of said signal characterized by that said computer program further comprises instructions for determining a type of the distortion, and whether linear distortion, non-linear distortion or both are detected, selecting the model for quality prediction for determining the quality of the distorted signal.
24. The computer program product according to claim 23, characterized in that, when linear distortion is present in the signal, the selected model comprises at least instructions for - adjusting (ADJ) the output of an inputted test signal (PN) to correspond a predetermined value, - determining an excitation level with a sharpener parameter as a function of a mean equivalent rectangular bandwidth for the undistorted signal and the distorted signal (EP), - comparing (CMP) the determined excitation level to a noise floor value (F), and setting equal to said noise floor value if being below it, - determining a first-order difference (DIFF1) between the excitation level (EO) of the original signal and the excitation level (ED) of the distorted signal, - determining a second-order difference (DIFF2) between excitation level (EO) of the original signal and the excitation level (ED) of the distorted signal, - weighting (W) the determined first- (D1) and second- order (D2) differences, - determining a standard deviation (SD) of the weighted first- (WD1) and second-order (WD2) differences, - determining a sum (SUM) of the determined standard deviations.
25. The computer program product according to claim 23 or 24, characterized in that when non-linear distortion is present in the signal, the selected model comprises at least instructions for - acquiring a waveform of the original input signal (Ol) and a waveform of the distorted signal (Dl), - time aligning (TA) the original (Ol) and the distorted signals (Dl), - filtering (FIR) the original (Ol) and distorted signals (Dl), - feeding the filtered signals to a filter array (FB), - determining a cross-correlation (CC) between the response to the original signal (Ol) and the response to distorted signal (Dl) for each filter, - determining the maximum value of the cross-correlation for each filter and each frame (MAX), - defining a weight value by determining power at the output of each filter for each frame and converting the power to a decibel measure (DB), - determining an average of the weighted cross-correlation for each filter across frames (AVE1), - determining another average (Rnonlin) of the average cross-correlation for each filter across filters (AVE2), - predicting the quality by means of the other average (Rnonlin).
26. The computer program product according to claim 25, characterized in that, the quality is determined from subjective scores (SNONLIN) being related to the other average (Rnonlin).
27. The computer program product according to any of the preceding claims 23 - 26, characterized in that, when linear and non-linear distortion is present in the signal, the selected model comprises at least instructions for - determining the subjective scores for the signal of the linear (SLIN) and non-linear (SNONLIN) distortion, - determining a predicted overall score (SOVERALL) by means of a parameter (α) indicating a relative importance of the distortion.
PCT/FI2004/050020 2004-02-27 2004-02-27 Method for predicting the perceptual quality of audio signals WO2005083921A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/FI2004/050020 WO2005083921A1 (en) 2004-02-27 2004-02-27 Method for predicting the perceptual quality of audio signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/FI2004/050020 WO2005083921A1 (en) 2004-02-27 2004-02-27 Method for predicting the perceptual quality of audio signals

Publications (1)

Publication Number Publication Date
WO2005083921A1 true WO2005083921A1 (en) 2005-09-09

Family

ID=34896363

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2004/050020 WO2005083921A1 (en) 2004-02-27 2004-02-27 Method for predicting the perceptual quality of audio signals

Country Status (1)

Country Link
WO (1) WO2005083921A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5886988A (en) * 1996-10-23 1999-03-23 Arraycomm, Inc. Channel assignment and call admission control for spatial division multiple access communication systems
EP0946015A1 (en) * 1998-03-27 1999-09-29 Ascom Infrasys AG Method and system for estimating transmission quality
EP1089429A2 (en) * 1999-09-29 2001-04-04 Toshiba Corporation Automatic gain control circuit and receiver having the same
US20040025669A1 (en) * 2002-08-09 2004-02-12 Sony Corporation/Sony Music Entertainment Audio quality based culling in a peer-to peer distribution model

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022225413A1 (en) 2021-04-23 2022-10-27 Harman International Industries, Incorporated Methods and system for determining a sound quality of an audio system
DE112021007572T5 (en) 2021-04-23 2024-02-15 Harman International Industries, Incorporated Method and system for determining sound quality of an audio system
WO2023018349A1 (en) 2021-08-13 2023-02-16 Harman International Industries, Incorporated Method for determining a frequency response of an audio system

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase