In Theory Part 1: DSP Theory Introduced

PLEASE NOTE: This article has been archived. It first appeared on ProRec.com in November 1998, contributed by then Contributing Editor Jose-Maria Catena. We will not be making any updates to the article. Please visit the home page for our latest content. Thank you!

Introduction

Understanding basic DSP (Digital Signal Processing) theory is very helpful for getting the most out of digital audio recording and processing. This series of articles is aimed at musicians and sound engineers, so the issues will be covered mostly in a practical way. The necessary theory will be explained as background or to make the practical consequences clear, not as a deep or complete reference for DSP math.

The series will begin with the most basic issues, assuming that the reader doesn't know DSP theory at all, so that anybody can follow the explanations. Very often it's difficult to understand issue A without first understanding issue B, while B cannot be fully understood without first understanding A. So don't worry if you can't completely understand something: accept it as a working hypothesis, and it will become clearer as you follow the series.

If you have questions, you can use the DSP theory forum at ProRec; you will find me there. For this purpose it is preferable to e-mail, because everybody can benefit from the answers.

Basic audio signal definitions

First of all, we need to define some simple concepts about analog audio signals to understand later how they can be represented numerically.

Sound consists of variations in air pressure. The air pressure can be converted to an electrical signal by a microphone. We can represent the signal in a graph where the X axis is time and the Y axis is the amplitude (sound pressure) at each instant. This is called the time domain representation, and it is what we see on an oscilloscope or in sound editing software. The sound pressure (or level of the electrical signal) at a given time is called the instant amplitude.

A periodic signal is a signal in which a fixed pattern, or cycle, repeats over time. The period is the time length of a single cycle. The frequency is the inverse of the period, that is, the number of cycles per second, and its unit is the Hertz (Hz). In musical terms, frequency corresponds to pitch: lower frequencies are perceived as bass sounds and higher ones as treble sounds. Pitch perception is logarithmic; one octave up corresponds to doubling the frequency.
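As a quick numeric illustration of these relations (the 440 Hz tone is an arbitrary example, and Python is used here purely for the arithmetic):

    # Frequency is the inverse of the period; one octave up doubles the frequency.
    frequency_hz = 440.0
    period_s = 1.0 / frequency_hz          # ~0.00227 s per cycle
    one_octave_up = frequency_hz * 2.0     # 880.0 Hz
    one_octave_down = frequency_hz / 2.0   # 220.0 Hz
    print(period_s, one_octave_up, one_octave_down)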

Bandwidth is a range of frequencies. The frequencies that we humans can hear lie in the range from 20 Hz to 20 KHz. Not everybody can hear this entire range, and this bandwidth is usually reduced with age, exposure to high sound pressure levels, or disease. In practice, only young people can hear close to 20 KHz, while adults rarely hear beyond 17-18 KHz even in the best cases. Still, this 20 Hz to 20 KHz bandwidth is what is assumed as the maximum human hearing range, and is adopted by most standards.

Sound is not made only of pure periodic signals. But to simplify many mathematical derivations, periodic signals are usually used, and the methods can later be applied to complex, non-periodic signals. For this purpose, time is often expressed as an angle, in radians, and frequency as angular speed, in radians per second.

The simplest periodic signal is the sine function: s(t) = sin(t). Do you remember trigonometry? If you rotate a wheel at constant angular speed, the sine function gives you the Y position of a point on its circumference (the instant amplitude) at each time. Any other periodic signal, as complex as you want, can be expressed as a sum of sine signals, as the Fourier series shows:

S(t) = K1*sin(t) + K2*sin(t*2) + K3*sin(t*3) + … + Kn*sin(t*n)

The first sine component is the fundamental frequency (the lowest). The rest are called harmonics and are multiples of the fundamental frequency. K1 to Kn are the coefficients, or amplitudes, of each individual frequency; they represent the spectral content of the signal, which in musical terms we call timbre. For example, a square wave has non-zero amplitudes only at the odd harmonics, while the even coefficients are always 0; a sawtooth wave has non-zero amplitudes at all harmonics. Fourier analysis can also be applied to any kind of signal, even non-periodic ones, over any given time interval. In that case we are no longer describing harmonics, since there can be infinitely many unrelated sine components.
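As a sketch of this idea (using Python with NumPy, a modern convenience and my own choice, not something from the original article), the following snippet approximates a square wave by summing only odd harmonics with amplitudes proportional to 1/n, which are exactly the non-zero Fourier coefficients of an ideal square wave:

    import numpy as np

    fundamental = 100.0                  # fundamental frequency in Hz (arbitrary example)
    t = np.linspace(0.0, 0.02, 2000)     # 20 ms of (nearly continuous) time

    # Sum the first few odd harmonics; even harmonics are left out because
    # their coefficients are zero for a square wave.
    square_approx = np.zeros_like(t)
    for n in range(1, 20, 2):            # n = 1, 3, 5, ...
        square_approx += (4.0 / np.pi) / n * np.sin(2.0 * np.pi * n * fundamental * t)

    # The more odd harmonics are added, the closer the sum gets to a square wave.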

We can represent the Fourier coefficients in a graph where the X axis is frequency and the Y axis is amplitude, which is called the frequency domain representation, and shows us the spectral content of a signal. So, the Fourier transform converts time domain information to frequency domain information.

As you should begin to see, Fourier theory is very important for us. Computing the Fourier transform directly is very expensive, but fortunately there is an algorithm that computes the same result much faster: the Fast Fourier Transform, or FFT. It's what is usually implemented, for example, to analyze the spectral content of a signal.
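A minimal sketch of FFT-based spectrum analysis, assuming NumPy (again my choice of tool, not the article's): a test signal made of a 440 Hz fundamental plus a weaker third harmonic, whose two components show up as the two largest peaks in the spectrum.

    import numpy as np

    sample_rate = 44100
    t = np.arange(sample_rate) / sample_rate        # one second of time stamps

    # Test signal: 440 Hz plus a weaker 1320 Hz component (its 3rd harmonic).
    signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1320 * t)

    spectrum = np.abs(np.fft.rfft(signal))          # magnitude of each frequency bin
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

    peaks = freqs[np.argsort(spectrum)[-2:]]        # the two strongest bins
    print(sorted(peaks))                            # -> [440.0, 1320.0]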

Finally, we must know what a decibel is. We do not perceive sound level linearly, but following a logarithmic transfer function. The decibel is defined as dB = 20 * log10(V1/V2); it represents the logarithmic ratio between two values. Twice the level corresponds to +6 dB and half the level to -6 dB, while 20 dB corresponds to a x10 ratio. Power doubles every 3 dB, which means we must supply four times the power to double the signal level (+6 dB).
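A few of these relations, worked out numerically from the formula above (just a sketch):

    import math

    def db_ratio(v1, v2):
        """Decibel ratio between two levels (amplitudes/voltages)."""
        return 20.0 * math.log10(v1 / v2)

    print(db_ratio(2.0, 1.0))     # doubling the level -> ~ +6.02 dB
    print(db_ratio(0.5, 1.0))     # halving the level  -> ~ -6.02 dB
    print(db_ratio(10.0, 1.0))    # a x10 ratio        -> exactly +20 dB
    # In power terms, 10 * log10(2) ~ +3.01 dB: power doubles roughly every 3 dB.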

The decibel is ideal for describing ratios. For example, the dynamic range is the ratio between the maximum level available and the minimum (or noise floor), and the SNR (signal to noise ratio) is the ratio between the nominal signal level and the noise. Sometimes the decibel is used relative to an implicit reference value, for example to define levels. The audio signal level is usually referred to a nominal 0 dB, whose standard value is 1 milliWatt RMS into 600 ohms (approximately 0.775 volts RMS). RMS means root mean square, which is the way to express average effective power/level, and is related to perceived loudness. When measuring sound pressure level (SPL), the reference is the hearing threshold.
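A sketch of what RMS means and how a level relates to that 0.775 V reference (the 440 Hz, 1 V peak test tone is an arbitrary example):

    import numpy as np

    REFERENCE_V = 0.775                          # ~1 mW RMS into 600 ohms

    t = np.arange(44100) / 44100.0
    sine = 1.0 * np.sin(2 * np.pi * 440 * t)     # sine wave with 1.0 V peak amplitude

    rms = np.sqrt(np.mean(sine ** 2))            # root mean square: peak / sqrt(2) for a sine
    level_db = 20.0 * np.log10(rms / REFERENCE_V)

    print(rms)        # ~0.707 V
    print(level_db)   # ~ -0.8 dB relative to the 0.775 V reference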

Numeric representation of analog signals

To represent an analog signal digitally (that is, numerically), we must convert the instant amplitude at each time into a number that represents its value. Each value is called a sample.

A sample is a single value that represents the instant sound pressure at a given time.

Sampling period is the time between two consecutive samples.

Sampling frequency is the inverse of sampling period, that is, the number of samples per second (freq = 1 / period).

The sampling frequency determines the bandwidth that can be represented. The Nyquist theorem shows that the bandwidth that can be represented with a given sampling frequency is half of that sampling frequency. The sampling frequency divided by two is called the Nyquist frequency, and it corresponds to the maximum frequency of the analog signal that can be represented, that is, the bandwidth. If we try to digitize a signal whose frequency is higher than the Nyquist frequency, that signal is registered wrongly, folded back to the sampling frequency minus the original frequency (mirrored around the Nyquist frequency), which is known as aliasing. To avoid aliasing, the audio signal must be filtered to remove the portion of the bandwidth above the Nyquist frequency before converting it to digital form. We'll study anti-aliasing filtering further in the next chapter, dedicated to sample rate conversions, and in a later one about analog to digital converters.
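A sketch of aliasing in action, with arbitrary example frequencies: a 30 KHz tone sampled at 44.1 KHz (with no anti-aliasing filter) shows up as a 14.1 KHz tone, i.e. the sampling frequency minus the original frequency.

    import numpy as np

    fs = 44100                           # sampling frequency
    f_in = 30000.0                       # input tone, above the 22050 Hz Nyquist frequency

    n = np.arange(fs)                    # one second worth of sample indices
    samples = np.sin(2 * np.pi * f_in * n / fs)

    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)

    print(freqs[np.argmax(spectrum)])    # -> 14100.0 Hz, i.e. fs - f_in (the alias)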

This means that the sampling frequency must be higher than twice the bandwidth that we want to digitize. This is why the CD uses 44100 samples per second: it gives a usable bandwidth of 22050 Hz, which entirely covers the audible bandwidth. Here we are not dealing with the anti-aliasing filter, for which the sampling rate is very important; we'll study it in the chapter dedicated to analog to digital converters. So for now, strictly speaking about the information stored in digital form, using higher sampling rates for audio serves no purpose.

Then you may ask, for example, how a signal of about 20 KHz can be represented accurately if there are fewer than 3 samples per cycle. Well, it doesn't matter; Nyquist was not wrong. If you understood what the Fourier transform means, you can now see that any 20 KHz signal, once limited to a 22.05 KHz bandwidth, is always a sine wave. That is, you can't hear the differences between different 20 KHz waveforms (sine, square, saw, etc.) because you can only hear the fundamental (a sine wave), as the harmonics are beyond the audible range (the first one at 40 KHz in this example). Obviously, if you pass any 20 KHz waveform through an ideal low pass filter which cuts at 22.05 KHz (our ears cut off well before that), the output is always a perfect sine signal, even when the samples seem to represent something like a very irregular square wave. We'll study more on this in a future chapter on digital to analog converters. Therefore, a 44.1 KHz sampling rate gives us a fully accurate representation of a limited 22.05 KHz bandwidth, which exceeds our hearing bandwidth.

The sample format is the way sample values are represented digitally. Digital means that there are only zeros and ones. Each bit is a single binary digit that can take the value zero or one. By grouping bits, we can represent any value.

For example, with 4 bits we can code the values from 0000b (0 in our well known decimal form) to 1111b (15 in decimal). This format is called unsigned integer. To represent signed integers, we use two's complement form: 0000b to 0111b are 0 to 7 in decimal, and 1111b to 1000b are -1 to -8 in decimal. The number of values that can be represented with an integer format is two raised to the number of bits. For example, with 8 bits we can code 0 to 255 (2^8 = 256 values) or -128 to +127; with 16 bits, 0 to 65535 (2^16 = 65536 values) or -32768 to +32767, and so on. We can also interpret the numbers as scaled by any power-of-two factor, although the coding is the same; this is called a fixed point format (some of the bits form the integer portion and the rest the fractional portion). If you know simple arithmetic, you'll find that integer and fixed point are handled identically; the only difference is how you choose to interpret them.
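A small sketch of these ranges and of the two's complement interpretation, using 16 bits because that is the common audio sample width:

    BITS = 16

    unsigned_max = 2 ** BITS - 1         # 65535: largest unsigned 16 bit value
    signed_min = -(2 ** (BITS - 1))      # -32768
    signed_max = 2 ** (BITS - 1) - 1     # +32767

    def from_twos_complement(raw, bits=BITS):
        """Interpret an unsigned bit pattern as a signed two's complement value."""
        return raw - 2 ** bits if raw >= 2 ** (bits - 1) else raw

    print(unsigned_max, signed_min, signed_max)
    print(from_twos_complement(0xFFFF))  # bit pattern 1111 1111 1111 1111 -> -1
    print(from_twos_complement(0x8000))  # bit pattern 1000 0000 0000 0000 -> -32768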

Another, more advanced way to represent numbers is the floating point form. Here, one bit is the sign, a group of bits called the mantissa is coded as a fixed point number, and another group of bits is the exponent of two. This is interpreted and handled as sign * mantissa * 2 ^ exponent, similar to scientific notation but in base two. This way we can code much larger ranges than would be possible if we coded all the bits as fixed point.

There are several standard IEEE floating point formats, but to represent audio, the smallest one is more than enough: 32 bit floats (a sign bit, an 8 bit exponent, and a 23 bit mantissa that gives 24 bits of effective precision thanks to an implicit leading bit). The exponent is adjusted automatically to keep the maximum number of meaningful bits in the mantissa (this is called normalization). Applied to audio samples, the floating point format has some important advantages over fixed point formats: virtually unlimited dynamic range, lower quantization noise, and quantization noise relative to the sample value, not absolute as with fixed point. Furthermore, on the current Intel Pentium processors, multiplication with floats is significantly faster than with integers, and the sum of products is the core of virtually all DSP operations.

Note that converters can only work with fixed point data, although the format conversion is a trivial task for any processor with a floating point unit.
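A sketch of that conversion, assuming the common (but not universal) convention of mapping full-scale 16 bit samples to the ±1.0 float range; the exact scaling factor is my assumption, not something fixed by the article:

    import numpy as np

    def int16_to_float32(samples):
        """Map 16 bit fixed point samples into the +-1.0 float32 range."""
        return samples.astype(np.float32) / 32768.0

    def float32_to_int16(samples):
        """Map +-1.0 float samples back to 16 bit, clipping anything outside the range."""
        clipped = np.clip(samples, -1.0, 32767.0 / 32768.0)
        return np.round(clipped * 32768.0).astype(np.int16)

    raw = np.array([-32768, -1, 0, 1, 32767], dtype=np.int16)
    as_float = int16_to_float32(raw)
    print(as_float)                      # [-1.0, ~-3.05e-05, 0.0, ~3.05e-05, ~0.99997]
    print(float32_to_int16(as_float))    # round trip restores [-32768, -1, 0, 1, 32767]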

How does the sample format affect the audio signal? It defines the available dynamic range and the quantization noise. Using fixed point samples, the dynamic range (the ratio between the maximum value and the smallest step) is equivalent to the quantization noise, and it increases by about 6 dB for each additional bit:

dB = 20 * log10(2 ^ bits). The 0 dB reference is set at the maximum value, so values above 0 dB can't be represented and result in overload distortion. For 16 bits, the dynamic range is 96 dB (so the quantization noise is at -96 dB); for 20 bits, it's 120 dB; for 24 bits, 144 dB; and for 32 bits, 192 dB.
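The same figures, computed directly from the formula (a quick check rather than anything definitive):

    import math

    for bits in (16, 20, 24, 32):
        dynamic_range_db = 20.0 * math.log10(2 ** bits)
        print(bits, round(dynamic_range_db, 1))
    # -> 16: 96.3 dB, 20: 120.4 dB, 24: 144.5 dB, 32: 192.7 dB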

When using 32 bit floating point samples, 0 dB is defined as the value ±1.0. The dynamic range is virtually unlimited (from about -765 dB to +765 dB). The quantization noise varies with the sample value, staying at about -150 dB below the level of each sample.
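A sketch of that key difference, with arbitrary test values: in float32 the rounding error stays proportional to the sample value, while with 16 bit fixed point the step size is absolute, so quiet samples suffer a much larger relative error.

    import numpy as np

    for value in (0.9, 0.0009):                    # a loud sample and a very quiet one
        float_err = abs(float(np.float32(value)) - value)     # float32 rounding error
        int_err = abs(round(value * 32767) / 32767 - value)   # 16 bit rounding error

        # Relative errors: roughly constant for float32, but orders of
        # magnitude worse for the quiet sample with 16 bit fixed point.
        print(value, float_err / value, int_err / value)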

The "usable" human hearing dynamic range is about 120 dB (from the hearing threshold to the ear injury level). So, for music storage, 16 bits (96 dB) offers very good quality, and 20 bits (120 dB) matches our hearing capabilities. More than that may seem useless at first, but that isn't true.

The reason is that when we want to digitize sound, we usually can't know in advance the largest peak value that will occur unless we use a hard limiter/compressor (which is very often not desirable at recording time). So we can't adjust the level applied to the analog to digital converter (ADC) to take advantage of its full dynamic range. In fact, to avoid the risk of clipping at the ADC, we usually record about 20 dB below full scale. Obviously, if we subtract this necessary headroom from the available dynamic range, we have only about 76 dB of effective dynamic range with 16 bits, which results in a poor quality conversion. So it's useful to digitize at 20, or even better 24, bits to end up with full 16 bit quality or better.

There is another important reason for using more than 16 bits: between the recording and the final master, we apply a lot of processes to the audio signals (we equalize, mix, apply effects…), and each single operation of each process adds quantization noise. As this effect is cumulative, we want the quantization noise as low as possible. 24 bits offers -144 dB of quantization noise, which is significantly below the hearing threshold, that is, good enough. 32 bit floats are even better, because we keep at most -150 dB of quantization noise, and a much smaller mean quantization noise, regardless of the signal levels. 32 bit floats have another important advantage: productivity. When we apply any process to an audio signal using fixed point samples, the signal can be amplified or attenuated, possibly resulting in clipping or in too low a value (that is, increased quantization noise). So we must always check the data after each process, and if the resulting level is not good, undo, adjust the gain of the process, and try again. Obviously, this unproductive work is not necessary with 32 bit floats.
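A sketch of that last point, reusing the ±1.0 float convention assumed earlier: boosting a half-scale tone by 12 dB and then cutting it back is essentially lossless in float32, but permanently clips the peaks if the intermediate result is stored as 16 bit fixed point.

    import numpy as np

    gain = 10 ** (12 / 20)                          # +12 dB as a linear factor (~3.98)

    t = np.arange(44100) / 44100.0
    signal = 0.5 * np.sin(2 * np.pi * 440 * t)      # a half-scale test tone

    # Float path: boost then cut; intermediate values above 1.0 are no problem.
    float_out = (signal.astype(np.float32) * gain) / gain

    # 16 bit path: the intermediate result must be clipped into the legal range,
    # so cutting the gain afterwards cannot restore the original waveform.
    boosted = np.clip(np.round(signal * 32767 * gain), -32768, 32767).astype(np.int16)
    int_out = (boosted / 32767.0) / gain

    print(np.max(np.abs(float_out - signal)))   # tiny (~1e-7): essentially no damage
    print(np.max(np.abs(int_out - signal)))     # ~0.25: the clipped peaks are gone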

As a logical result, 16 bit samples are enough for consumer music distribution, and little improvement will result from upgrading to 20 or 24 bits. But during recording and processing, 20 to 24 bit ADCs and 24 bit fixed point or 32 bit float samples are desirable. The following chapter, on sample format and rate conversions, will cover this in depth, as these are must-know issues, and by then we should have the adequate background.