Bit-Allocation Scheme for Coding of Digital Audio Using Psychoacoustic Modeling (full report)
computer science technology Active In SP Posts: 740 Joined: Jan 2010 
24-01-2010, 06:05 PM
A New Bit-Allocation Scheme for Coding of Digital Audio Using Psychoacoustic Modeling

Abstract: In this paper we present a modified adaptive bit-allocation scheme for coding audio signals using psychoacoustic modeling. Statistical redundancy reduction (lossless coding) generally gives compression ratios of up to 2. We increase this ratio by removing perceptual irrelevancy (lossy coding). The MPEG-1 Layer 1 psychoacoustic model is used, and compression is improved by modifying the bit-allocation scheme so that, without a second-stage (lossless) compression, it gives results comparable to those of MP3 (which uses Huffman coding as its second stage). To increase the ratio further, high-frequency subbands that contribute little to the signal are removed. Subjective evaluation is the best method of assessing audio quality: the original and reconstructed signals are presented to listeners, who grade and/or compare them according to perceived quality. Without quality degradation, the achieved bit rate is 90 kbps, down from an original bit rate of 706 kbps.

I. INTRODUCTION

Audio compression is concerned with the efficient transmission of audio data at good quality. The audio data on a compact disc is sampled at 44.1 kHz, which requires an uncompressed data rate of about 706 kbps for mono at 16 bits/sample. At this data rate, audio signals require large amounts of memory and bandwidth. Digital audio broadcasting, communication systems and the Internet demand high-quality signals at low bit rates. Digital audio coding techniques can be classified into two groups: time-domain and frequency-domain processing. Time-domain techniques can be implemented with low complexity, but they require more than 10 bits/sample to maintain high quality. Most of the known techniques belong to the frequency domain.
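The data rates quoted above follow from simple arithmetic. The sketch below is only an illustration: the function name is hypothetical, and the figure of 2 bits/sample is the value the paper associates with its 90 kbps result.

```python
# Sketch: bit-rate arithmetic for mono CD-quality audio.
SAMPLE_RATE_HZ = 44_100   # CD sampling rate quoted in the text
BITS_PER_SAMPLE = 16      # uncompressed word length quoted in the text


def bitrate_kbps(sample_rate_hz: float, bits_per_sample: float) -> float:
    """Return the bit rate in kbps for a mono stream."""
    return sample_rate_hz * bits_per_sample / 1000.0


uncompressed_kbps = bitrate_kbps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)  # about 706 kbps
coded_kbps = bitrate_kbps(SAMPLE_RATE_HZ, 2)                       # about 90 kbps
compression_ratio = uncompressed_kbps / coded_kbps                 # 16 bits / 2 bits

print(f"{uncompressed_kbps:.1f} kbps -> {coded_kbps:.1f} kbps "
      f"(ratio {compression_ratio:.1f})")
```

At exactly 2 bits/sample the rate is 88.2 kbps, which the paper rounds to 90 kbps.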
Traditional speech coders, designed specifically for speech signals, achieve compression by using models of speech production based on the human vocal tract. However, these coders are not effective when the signal to be coded is not human speech but some other signal, such as music. Hence, in order to code a variety of audio signals, the characteristics of the final receiver, i.e., the human hearing system, are exploited. To achieve a high compression ratio, information that is perceptually irrelevant to the ear is discarded. Perceptual audio coders use models of auditory masking to identify the inaudible parts of audio signals. This is called psychoacoustic masking, in which the masked signal must lie below the masking threshold. The noise-masking phenomenon has been observed in a wide variety of psychoacoustic experiments. Masking occurs whenever the presence of a strong audio signal makes weaker signals in its spectral neighborhood imperceptible. With the present algorithm, high quality can be achieved at bit rates of 2 bits/sample (i.e., 90 kbps) or above. Subband coding is used to map the audio signal into the frequency domain. This mapping by itself yields some compression because it exposes how the signal energy is distributed over frequency.

II. OVERALL STRUCTURE OF THE PROCESS

1. The audio signal is decomposed into M = 32 equal subbands using a perfect-reconstruction cosine-modulated M-band analysis filter bank, and each subband is decimated by M.
2. Psychoacoustic modeling is performed in parallel to calculate the global threshold of hearing.
3. The bit allocation for the different subbands is calculated using this threshold.
4. The signals in the different subbands are quantized using bit-allocation I.
5. Bit-allocation II then evaluates the actual number of bits needed, and the signals are coded accordingly. The coded signals are sent to the decoder.
6. The decoder simply passes these signals, after M-fold expansion, through the synthesis filters to reconstruct the output signal.
Figure: the process at the encoder and at the decoder.

Section III describes the subband decomposition and filter banks. Section IV describes the psychoacoustic modeling of standard MPEG-1, and Section V gives details of the bit-allocation and quantization scheme.

III. SUBBAND DECOMPOSITION AND FILTER BANKS

The filter bank divides the time-domain input into frequency subbands and generates a time-indexed series of coefficients representing the frequency-localized signal power within each band. By providing explicit information about the distribution of signal power, and hence masking power, over the time-frequency plane, the filter bank plays an essential role in identifying perceptual irrelevancies when used in conjunction with a psychoacoustic model. The filter bank also facilitates perceptual noise shaping. In addition, by decomposing the signal into its constituent frequency components, the filter bank assists in the reduction of statistical redundancies.

Filter banks for audio coding: design considerations

The choice of an appropriate filter bank is essential to the success of a perceptual audio coder. The properties of the analysis filter bank should adequately match the input signal. Algorithm designers face an important and difficult tradeoff between time and frequency resolution. Failure to choose a suitable filter bank can result in perceptible artifacts in the output (e.g., pre-echoes) or in impractically low coding gain and attendant high bit rates. Most common audio content is highly nonstationary and contains tonal and atonal energy, as well as both steady-state and transient intervals. As a rule, signal models tend to remain constant for a long time and then change suddenly. Therefore the ideal analysis filter bank should have time-varying resolution in both the time and frequency domains.
Filter banks emulating the properties of the human auditory system, i.e., those containing nonuniform critical-bandwidth subbands, have proven highly effective in the coding of highly transient signals. For dense, harmonically structured signals, on the other hand, critical-band filter banks have been less successful because of their reduced coding gain relative to filter banks with a large number of subbands. The following filter-bank characteristics are highly desirable for audio coding:

- Signal-adaptive time-frequency tiling
- Good channel separation
- Low-resolution, critical-band mode, e.g., 32 subbands
- Efficient resolution switching
- Minimum blocking artifacts
- High-resolution mode, up to 4096 subbands
- Strong stopband attenuation
- Perfect reconstruction
- Critical sampling
- Availability of fast algorithms

Good channel separation and stopband attenuation are particularly desirable for signals containing very little irrelevancy. Maximum redundancy removal is essential to maintaining high quality at low bit rates for these signals. Blocking artifacts in time-varying filter banks can lead to audible distortion in the reconstruction. Although pseudo-QMF banks have been used quite successfully in perceptual audio coders, the overall system design must compensate for the inherent distortion induced by the lack of perfect reconstruction in order to avoid audible artifacts in the codec output. The compensation strategy may be a simple one (e.g., increased prototype filter length), but perfect reconstruction is preferable because it confines the sources of output distortion to the quantization stage. For the special case of L = 2M, filter banks with perfect reconstruction and low complexity can be achieved. The perfect-reconstruction properties of these banks were first demonstrated by Princen and Bradley, using time-domain arguments, in the development of the Time Domain Aliasing Cancellation (TDAC) filter bank.
Analysis filter impulse responses are given by

$$ h_k(n) = w(n)\,\sqrt{\frac{2}{M}}\,\cos\!\left[\frac{(2n+M+1)(2k+1)\pi}{4M}\right], \qquad k = 0, 1, \ldots, M-1, \quad n = 0, 1, \ldots, L-1 $$

and the synthesis filters, to satisfy the overall linear-phase constraint, are obtained by time reversal, i.e.,

$$ g_k(n) = h_k(2M-1-n) $$

where w(n) is an FIR prototype lowpass filter that must satisfy the following linear-phase and Nyquist constraints (PR constraints):

$$ w(2M-1-n) = w(n) $$
$$ w^2(n) + w^2(n+M) = 1 $$

In this algorithm a sine window is used, defined as

$$ w(n) = \sin\!\left[\frac{(2n+1)\pi}{4M}\right] $$

The signal is decomposed into 32 subbands using this filter bank, and each band is decimated by 32 (i.e., critically sampled).

IV. PSYCHOACOUSTIC MODELING

High-precision engineering models for high-fidelity audio do not exist. Therefore, audio coding algorithms must rely on generalized receiver models to optimize coding efficiency. In the case of audio, the receiver is ultimately the human ear, and sound perception is affected by its masking properties. The field of psychoacoustics has made significant progress toward characterizing human auditory perception, and in particular the time-frequency analysis capabilities of the inner ear. Although applying perceptual rules to signal coding is not a new idea, most current audio coders achieve compression by exploiting the fact that irrelevant signal information is not detectable even by a well-trained or sensitive listener. Irrelevant information is identified during signal analysis by incorporating several psychoacoustic principles into the coder, including the absolute threshold of hearing, critical-band frequency analysis, simultaneous masking, the spread of masking along the basilar membrane, and temporal masking. Combining these psychoacoustic notions with basic properties of signal quantization has led to the development of perceptual entropy, a quantitative estimate of the fundamental limit of transparent audio signal compression. This section reviews psychoacoustic fundamentals and gives the details of the psychoacoustic model.
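The window conditions of Section III can be checked numerically. The following is a minimal sketch, assuming M = 32 and L = 2M as stated in the text; the function names are illustrative and are not taken from the paper.

```python
import math
from typing import List

M = 32        # number of subbands, as in the text
L = 2 * M     # filter length for the special case L = 2M


def sine_window(n: int, m: int = M) -> float:
    """Sine prototype window w(n) = sin((2n + 1) * pi / (4M))."""
    return math.sin((2 * n + 1) * math.pi / (4 * m))


def analysis_filter(k: int, m: int = M) -> List[float]:
    """Impulse response h_k(n) of the k-th analysis filter, n = 0..2M-1."""
    scale = math.sqrt(2.0 / m)
    return [
        sine_window(n, m) * scale
        * math.cos((2 * n + m + 1) * (2 * k + 1) * math.pi / (4 * m))
        for n in range(2 * m)
    ]


def synthesis_filter(k: int, m: int = M) -> List[float]:
    """Synthesis filter g_k(n) = h_k(2M - 1 - n), i.e. a time reversal."""
    return analysis_filter(k, m)[::-1]


# Linear-phase constraint: w(2M - 1 - n) = w(n)
symmetry_error = max(
    abs(sine_window(2 * M - 1 - n) - sine_window(n)) for n in range(2 * M)
)
# Nyquist (power-complementary) constraint: w^2(n) + w^2(n + M) = 1
nyquist_error = max(
    abs(sine_window(n) ** 2 + sine_window(n + M) ** 2 - 1.0) for n in range(M)
)
print(symmetry_error, nyquist_error)
```

Both errors come out at the level of floating-point rounding, which is what the two PR constraints require of the sine window.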
Absolute Threshold of Hearing: The absolute threshold of hearing (ATH) characterizes the amount of energy needed in a pure tone for it to be detected in quiet. It is typically expressed in dB SPL. The frequency dependence of this threshold was quantified as early as 1940, when Fletcher reported test results for a range of listeners that were generated in a National Institutes of Health (NIH) study of typical American hearing acuity. The quiet threshold is well approximated by the nonlinear function

$$ T_q(f) = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\,\exp\!\left[-0.6\left(\frac{f}{1000} - 3.3\right)^{2}\right] + 10^{-3}\left(\frac{f}{1000}\right)^{4} \quad \text{(dB SPL)} $$

where f is in Hz. When applied to signal compression, Tq(f) could be interpreted naively as a maximum allowable energy level for coding distortions introduced in the frequency domain.

Critical Bands: Considered on its own, the absolute threshold is of limited value in the coding context. The detection threshold for quantization noise is a modified version of the absolute threshold, with its shape determined by the stimuli present at any given time. Since stimuli are in general time-varying, the detection threshold is also a time-varying function of the input signal. In order to estimate this threshold, we must first understand how the ear performs spectral analysis. A frequency-to-place transformation takes place in the cochlea (inner ear), along the basilar membrane. Distinct regions in the cochlea, each with a set of neural receptors, are tuned to different frequency bands. In fact, the cochlea can be viewed as a bank of highly overlapping bandpass filters. The magnitude responses are asymmetric and nonlinear (level-dependent). Moreover, the cochlear filter passbands are of nonuniform bandwidth, and the bandwidths increase with increasing frequency. The critical bandwidth is a function of frequency that quantifies the cochlear filter passbands.
The notion behind the critical bandwidth is that the loudness (perceived intensity) remains constant for a narrowband noise source presented at a constant SPL even as the noise bandwidth is increased up to the critical bandwidth. For any increase beyond the critical bandwidth, the loudness begins to increase. The critical bandwidth tends to remain constant (about 100 Hz) up to 500 Hz, and above that it increases to approximately 20% of the center frequency. Its approximate expression is

$$ BW_c(f) = 25 + 75\left[1 + 1.4\left(\frac{f}{1000}\right)^{2}\right]^{0.69} \quad \text{(Hz)} $$

Although the function BWc is continuous, it is useful when building practical systems to treat the ear as a discrete set of bandpass filters. A distance of one critical band is commonly referred to as one Bark in the literature. The function

$$ z(f) = 13\arctan(0.00076 f) + 3.5\arctan\!\left[\left(\frac{f}{7500}\right)^{2}\right] \quad \text{(Bark)} $$

is often used to convert from frequency in Hz to the Bark scale.

Simultaneous masking and the spread of masking: Masking refers to a process in which one sound is rendered inaudible because of the presence of another sound. Simultaneous masking is a frequency-domain phenomenon that can be observed whenever two or more stimuli are presented to the auditory system at the same time. Although arbitrary audio spectra may contain complex simultaneous masking scenarios, for the purpose of shaping coding distortions it is convenient to distinguish between only two types of simultaneous masking, namely tone-masking-noise and noise-masking-tone. Inter-band masking has also been observed, i.e., a masker centered within one critical band has some predictable effect on detection thresholds in other critical bands. This effect, also known as the spread of masking, is often modeled in coding applications by an approximately triangular spreading function that has slopes of +25 and -10 dB per Bark.
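The three closed-form approximations above (threshold in quiet, critical bandwidth, and Hz-to-Bark conversion) translate directly into code. This is a minimal sketch with illustrative function names; it only evaluates the formulas as quoted and makes no claim about how the authors implemented them.

```python
import math


def threshold_in_quiet_db(f_hz: float) -> float:
    """Absolute threshold of hearing Tq(f) in dB SPL, f in Hz (f > 0)."""
    f = f_hz / 1000.0
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


def critical_bandwidth_hz(f_hz: float) -> float:
    """Critical bandwidth BWc(f) in Hz around center frequency f in Hz."""
    f = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f * f) ** 0.69


def hz_to_bark(f_hz: float) -> float:
    """Bark-scale position z(f) for a frequency f in Hz."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


for f in (100.0, 1000.0, 3300.0, 10000.0):
    print(f"{f:8.0f} Hz  Tq = {threshold_in_quiet_db(f):6.2f} dB SPL  "
          f"BWc = {critical_bandwidth_hz(f):7.1f} Hz  z = {hz_to_bark(f):5.2f} Bark")
```

The printout shows the behaviour described in the text: the threshold dips to its minimum around 3 to 4 kHz, and the critical bandwidth stays near 100 Hz at low frequencies before widening.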
Psychoacoustic model: The model uses a 1024-point FFT for high-resolution spectral analysis (43.07 Hz per bin), then estimates, for each input frame, individual simultaneous masking thresholds due to the presence of tone-like and noise-like maskers in the original spectrum. A global masking threshold is then estimated for a subset of the original 512 frequency bins by additive combination of the tonal and atonal masking thresholds. The five steps leading to the computation of the global masking threshold are as follows.

Step 1: Spectral analysis and SPL normalization

First the incoming audio samples s(n) are normalized according to the FFT length N and the number of bits per sample b, using the relation

$$ x(n) = \frac{s(n)}{N\,2^{\,b-1}} $$

Normalization references the power spectrum to a 0 dB maximum. The normalized input x(n) is then segmented into frames (1024 samples) using a 1/16th-overlapped Hann window. A power spectral density estimate P(k) is then obtained using a 1024-point FFT, i.e.,

$$ P(k) = PN + 10\log_{10}\left|\sum_{n=0}^{N-1} w(n)\,x(n)\,e^{-j 2\pi k n / N}\right|^{2}, \qquad 0 \le k \le \frac{N}{2} $$

where the power normalization term, PN, is fixed at 96.3 dB and the Hann window, w(n), is defined as

$$ w(n) = \frac{1}{2}\left[1 - \cos\!\left(\frac{2\pi n}{N}\right)\right] $$

Step 2: Identification of tonal and noise maskers

Local maxima in the sample PSD that exceed neighboring components within a certain Bark distance by at least 7 dB are classified as tonal.
Specifically, the tonal set, S_T, is defined as

$$ S_T = \left\{ P(k) \;\middle|\; P(k) > P(k \pm 1),\; P(k) > P(k \pm \Delta_k) + 7\ \text{dB} \right\} $$

where

$$ \Delta_k \in \begin{cases} [2, 4], & 4 \le k < 126 \quad (0.17\text{-}5.5\ \text{kHz}) \\ [2, 6], & 126 \le k < 254 \quad (5.5\text{-}11\ \text{kHz}) \\ [2, 12], & 254 \le k < 512 \quad (11\text{-}22\ \text{kHz}) \end{cases} $$

Tonal maskers, P_TM(k), are computed from the spectral peaks listed in S_T by combining the energy of the peak and its two immediate neighbors:

$$ P_{TM}(k) = 10\log_{10}\sum_{j=-1}^{1} 10^{0.1 P(k+j)} \quad \text{(dB)} $$

A single noise masker for each critical band, P_NM(g), is then computed from the remaining spectral lines not within the ±Δk neighborhood of a tonal masker, using the sum

$$ P_{NM}(g) = 10\log_{10}\sum_{j} 10^{0.1 P(j)} \quad \text{(dB)} $$

where g is defined to be the geometric mean spectral line of the critical band, i.e.,

$$ g = \left(\prod_{j=l}^{u} j\right)^{1/(u-l+1)} $$

and l and u are the lower and upper spectral boundaries of the critical band, respectively. The idea behind the expression for P_NM(g) is that residual energy within a critical band not associated with a tonal masker must, by default, be associated with a noise masker.

Step 3: Decimation and reorganization of maskers

In this step, the number of maskers is reduced using two criteria. First, any tonal or noise maskers below the absolute threshold are discarded, i.e., only maskers that satisfy

$$ P_{TM,NM}(k) \ge T_q(k) $$

are retained, where Tq(k) is the SPL of the threshold in quiet at spectral line k. Next, a sliding 0.5-Bark-wide window is used to replace any pair of maskers occurring within a distance of 0.5 Bark by the stronger of the two: if P_TM,NM(i) > P_TM,NM(k), then P_TM,NM(k) = 0, where k ranges over the 0.5-Bark neighborhood of i. After the sliding-window procedure, masker frequency bins are reorganized according to the subsampling scheme.

Step 4: Calculation of individual masking thresholds

Having obtained a decimated set of tonal and noise maskers, individual tone and noise masking thresholds can be computed. Each individual threshold represents a masking contribution at frequency bin i due to the tone or noise masker located at bin j (reorganized during Step 3).
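The tonal-masker rule of Step 2 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the neighborhood ranges are those quoted in the text, the peak and its two neighbors are assumed to be combined in the power domain (as in the noise-masker sum), bins whose neighborhood would run past the end of the array are simply skipped, and all function and variable names are hypothetical.

```python
import math
from typing import Dict, List


def neighbourhood(k: int) -> range:
    """Offsets (Delta_k) examined around bin k, using the ranges in the text."""
    if 4 <= k < 126:
        return range(2, 5)     # Delta_k in [2, 4]
    if 126 <= k < 254:
        return range(2, 7)     # Delta_k in [2, 6]
    if 254 <= k < 512:
        return range(2, 13)    # Delta_k in [2, 12]
    return range(0)            # outside the analysed range


def tonal_maskers(psd_db: List[float]) -> Dict[int, float]:
    """Map bin index -> tonal masker level P_TM(k) in dB."""
    maskers: Dict[int, float] = {}
    last = len(psd_db) - 1
    for k in range(4, min(512, last)):
        offsets = neighbourhood(k)
        # Assumption: ignore bins whose neighbourhood leaves the array.
        if k + offsets[-1] > last:
            continue
        p = psd_db[k]
        # Must be a strict local maximum ...
        if not (p > psd_db[k - 1] and p > psd_db[k + 1]):
            continue
        # ... and exceed every neighbour at distance Delta_k by more than 7 dB.
        if all(p > psd_db[k - d] + 7.0 and p > psd_db[k + d] + 7.0
               for d in offsets):
            power = sum(10.0 ** (0.1 * psd_db[k + j]) for j in (-1, 0, 1))
            maskers[k] = 10.0 * math.log10(power)
    return maskers


# Synthetic PSD: flat 10 dB floor with one peak at bin 100.
example_psd = [10.0] * 513
example_psd[99] = example_psd[101] = 40.0
example_psd[100] = 60.0
found = tonal_maskers(example_psd)
print(found)
```

For the synthetic spectrum only bin 100 qualifies, and its masker level is slightly above 60 dB because the energy of the two adjacent bins is added to the peak.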
Tonal masking thresholds, T_TM(i,j), are given by

$$ T_{TM}(i,j) = P_{TM}(j) - 0.275\,z(j) + SF(i,j) - 6.025 \quad \text{(dB SPL)} $$

where P_TM(j) denotes the SPL of the tonal masker in frequency bin j, z(j) denotes the Bark frequency of bin j, and the spread of masking from masker bin j to maskee bin i, SF(i,j), is modeled by the expression

$$ SF(i,j) = \begin{cases} 17\Delta_z - 0.4\,P_{TM}(j) + 11, & -3 \le \Delta_z < -1 \\ \left(0.4\,P_{TM}(j) + 6\right)\Delta_z, & -1 \le \Delta_z < 0 \\ -17\Delta_z, & 0 \le \Delta_z < 1 \\ \left(0.15\,P_{TM}(j) - 17\right)\Delta_z - 0.15\,P_{TM}(j), & 1 \le \Delta_z \le 8 \end{cases} \quad \text{(dB SPL)} $$

i.e., as a piecewise linear function of the masker level, P_TM(j), and the Bark maskee-masker separation, Δz = z(i) - z(j). Individual noise-masking thresholds, T_NM(i,j), are given by

$$ T_{NM}(i,j) = P_{NM}(j) - 0.175\,z(j) + SF(i,j) - 2.025 \quad \text{(dB SPL)} $$

where P_NM(j) denotes the SPL of the noise masker in frequency bin j, z(j) denotes the Bark frequency of bin j, and SF(i,j) is obtained by replacing P_TM(j) with P_NM(j) everywhere in the expression for SF(i,j).

Step 5: Calculation of global masking thresholds

In this step, individual masking thresholds are combined to estimate a global masking threshold for each frequency bin. The model assumes that masking effects are additive. The global masking threshold, T_g(i), is therefore obtained by computing the sum

$$ T_g(i) = 10^{T_q(i)/10} + \sum_{l=1}^{L} 10^{T_{TM}(i,l)/10} + \sum_{m=1}^{M} 10^{T_{NM}(i,m)/10} $$

(linear units; taking $10\log_{10}$ of this sum gives the threshold in dB SPL), where Tq(i) is the absolute hearing threshold for frequency bin i, T_TM(i,l) and T_NM(i,m) are the individual masking thresholds from Step 4, and L and M are the numbers of tonal and noise maskers, respectively, identified during Step 3. In other words, the global threshold for each frequency bin represents a signal-dependent, additive modification of the absolute threshold due to the basilar spread of all tonal and noise maskers in the signal power spectrum.

Figure captions for the five steps:
Step 1: Obtain PSD, express in dB SPL. Absolute threshold is superimposed.
Step 2: Tonal maskers identified and denoted by 'O' symbol; noise maskers identified and denoted by 'X' symbol.
Steps 3, 4: Spreading functions are associated with each of the individual tonal maskers satisfying the rules outlined in the text. Spreading functions are also associated with each of the individual noise maskers that were extracted after the tonal maskers had been eliminated from consideration, as described in the text.
Step 5: A global masking threshold is obtained by combining the individual thresholds as described in the text. The maximum of the global threshold and the absolute threshold is taken at each point in frequency to be the final global threshold.

The figure shows that some portions of the input spectrum require SNRs of better than 20 dB to prevent audible distortion, while other spectral regions require less than 3 dB SNR. In fact, some high-frequency portions of the signal spectrum are masked and therefore perceptually irrelevant, so they can be quantized with no bits at all without introducing artifacts.

V. BIT ALLOCATION AND QUANTIZATION

The bit allocation for the different subbands is calculated using the global threshold obtained from the psychoacoustic model. First the minimum masking threshold for each subband is determined and is used to shape the distortion introduced in quantizing the subband samples. Since the noise induced by quantization is at most half the step size, the step size is derived from the minimum of the calculated global threshold within the subband (min_thresh). The number of bits required to quantize a given subband is then calculated using

$$ b = \log_2\!\left(\frac{R}{\text{min\_thresh}}\right) - 1 $$

where R = 65536 is the range of the input samples. The samples of each subband are quantized with this bit allocation. To increase the compression ratio obtained using psychoacoustic modeling, we used a modified bit-allocation scheme that is adaptive in nature.
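The first-stage allocation formula b = log2(R / min_thresh) - 1 can be sketched as below. Several details are assumptions made only for illustration and are not stated in the paper: the threshold is taken to be in linear amplitude units on the same scale as R, the result is rounded up to an integer and clamped to the range 0 to 16 bits, and the global threshold bins are split evenly among the 32 subbands. The function names are hypothetical.

```python
import math
from typing import List

R = 65536  # range of the 16-bit input samples, as given in the text


def bits_for_subband(min_thresh: float, max_bits: int = 16) -> int:
    """Bits for one subband from its minimum masking threshold.

    Assumption: min_thresh is a linear amplitude on the same scale as R;
    the result is rounded up and clamped to [0, max_bits].
    """
    if min_thresh <= 0.0:
        return max_bits                      # no masking: use full resolution
    b = math.log2(R / min_thresh) - 1.0      # formula from the text
    return max(0, min(max_bits, math.ceil(b)))


def allocate_bits(global_thresh: List[float], n_subbands: int = 32) -> List[int]:
    """Bit allocation per subband from a per-bin global threshold.

    Assumption: the bins are divided evenly among the subbands.
    """
    per_band = len(global_thresh) // n_subbands
    return [
        bits_for_subband(min(global_thresh[s * per_band:(s + 1) * per_band]))
        for s in range(n_subbands)
    ]


# Example: a threshold that rises with frequency needs fewer bits up high.
example_thresh = [64.0] * 256 + [4096.0] * 128 + [65536.0] * 128
print(allocate_bits(example_thresh))
```

A subband whose minimum threshold is as large as half the input range or more receives zero bits, which corresponds to the fully masked high-frequency regions mentioned in the text.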
In the modified scheme, after quantization has been performed according to the bit allocation described above, the maximum absolute value of the samples (i.e., the range of values) in each subband is determined, and from this we determine the actual number of bits needed to transmit the samples of that subband without loss of information. Since the decoder must know the number of bits transmitted, both the number of bits calculated from the above equation and the number of bits actually needed are transmitted to the decoder for each subband. In this way the compression ratio is approximately doubled, depending on the dynamic characteristics of the audio signal.

To increase the ratio still further, we used an adaptive high-frequency band elimination process. Since high frequencies contribute a smaller share of the signal than low frequencies, we do not transmit the samples of those subbands that have fewer than a specified number of nonzero samples. This is the scheme employed in this paper to raise the compression ratio to almost the level of methods involving two-stage compression. The encoded samples are then sent to the decoder. The decoder simply applies the synthesis filters to the encoded subband samples (after M-fold expansion) to reconstruct the output signal.

VI. RESULTS

We experimented with a set of different audio signals (sampled at 44.1 kHz) in Matlab (MathWorks, Inc.), and the results (average compression ratios) are summarized below.

Audio type       (A)     (B)     (C)     (D)
Pop music        1.74    6.38    7.59    8.99
Rock music       1.82    6.94    8.52    9.70
Male voice       1.78    6.66    8.01    9.16
Female voice     1.79    7.10    8.18    9.78
Gun fighting     1.84    7.35    8.40    10.34
Trumpet          1.73    6.83    9.55    10.61

Compression ratio with: (A) psychoacoustic model only (MPEG-1 Layer 1); (B) modified bit allocation; (C) adaptive high-frequency band elimination; (D) half of the subbands eliminated.

VII.
CONCLUSIONS

We showed how psychoacoustic modeling can be used to compress audio data by reducing the perceptual irrelevancies present in the signals. On its own, however, this does not give a good compression ratio. To obtain a higher compression ratio with this algorithm, we presented a novel adaptive bit-allocation method that allows it to compete with the MP3 algorithm (which uses advanced coding algorithms). MP3 adds a very efficient noise-shaping algorithm, which together with Huffman coding gives superior results. If the same type of filter banks had been used in this work, we could have obtained even better results. We have implemented the same coding blocks as in MP1/MP2 but with a change in the bit-allocation scheme. In comparison with them, our algorithm is very efficient.

VIII. REFERENCES

[1] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, vol. 88, pp. 451-513, Apr. 2000.
[2] H. Najafzadeh-Azghandi, "Perceptual Coding of Narrowband Audio Signals," PhD thesis, McGill University, Montreal, Canada, Apr. 2000.
[3] C. R. Cave, "Perceptual Modelling for Low-Rate Audio Coding," Master of Engineering thesis, McGill University, Montreal, Canada, June 2002.
[4] K. R. Rao and J. J. Hwang, Techniques and Standards for Image, Video, and Audio Coding, Prentice Hall PTR, 1996.
[5] P. P. Vaidyanathan, Multirate Systems and Filter Banks, PTR Prentice-Hall, 1993.
[6] S. K. Mitra, Digital Signal Processing, Tata McGraw-Hill Edition, 2001.



