
Digital Audio Primer

So it’s 1995, and the Digital Audio Revolution™ is all anyone can talk about. CDs, wave files, MP3s, sampling – what does it all mean?! And now it’s all back in the spotlight, thanks to the WaveFile Ruby gem. Fortunately, your hip, tech-savvy friend Joel is here to help.

Sound Is Movement

Let’s start at the beginning. Sound occurs when something moves. This movement causes the air around the object to compress and decompress in a wave[1]. Your ears can detect these compression waves, and your brain interprets them as sound. Examples of things that can move and create sound waves: vocal cords, guitar strings, or the speakers on your Hi-Fi system.

You can plot sound waves on a graph. Here is what the opening shout of Help! by The Beatles looks like:

[Waveform plot, amplitude (-1.0 to 1.0) vs. time]

Basic Properties of Waves

Some waves are periodic, meaning they consist of a pattern that repeats. For example, a sine wave:

[Plot: a sine wave, amplitude (-1.0 to 1.0) vs. time]

You might recognize what a sine wave sounds like if you’ve ever taken a hearing test, or if you have a filthy mouth.

A repeated pattern like this is called an oscillation, or a cycle. The frequency of a periodic wave is how often this cycle repeats.

Frequency is expressed in hertz, or cycles per second. If a cycle repeats 50 times in one second, it has a frequency of 50Hz.

The period is the amount of time it takes for one cycle to complete. If a wave has a frequency of 50Hz, then it has a period of 1/50th of a second.

For a simple wave such as a sine wave, the higher the frequency, the higher the perceived pitch. People can generally hear frequencies between about 20Hz and 20,000Hz. (As you get older, you gradually lose the ability to hear higher frequencies).

Try moving the slider to change the frequency of a sine wave. Notice how as the frequency goes up, each cycle happens more quickly, and the period becomes shorter.

[Interactive plot: a sine wave with a frequency slider, amplitude (-1.0 to 1.0) vs. time]

The amplitude of a wave is the distance on a graph between its peak and 0. Amplitude determines how loud a sound is perceived: the higher the amplitude, the louder the sound. However, not all frequencies are created equal. Given the same amplitude, our ears/brains perceive some frequencies to be louder than others.

Try moving the slider to change the amplitude of a sine wave. Notice how as the amplitude goes up, the sound becomes louder, and the height of the wave increases.

[Interactive plot: a sine wave with an amplitude slider, amplitude (-1.0 to 1.0) vs. time]

All of the graphs above are normalized, so that the maximum possible amplitudes of sound from a source (such as a pair of speakers) are labeled 1.0 and -1.0. This convention is used in the audio world to simplify things. Of course, in the real world the exact loudness represented by an amplitude of 1.0 or -1.0 is relative.

Analog Signals vs. Digital Signals

So our humble sine wave is wafting through the air, and we want our computer to capture it and store it for later. To do this, we need to convert the sound wave from an analog form to a digital form. This process is called sampling, and is necessary because computers can only store data digitally.

An analog signal is continuous. Notice how the sine waves in the previous section are smooth. There are no gaps anywhere – if you were to zoom in, and keep zooming in forever, the line would always be smooth. If you have two points in an analog signal, there are an infinite number of points between them.

In contrast, a digital signal consists of a series of instantaneous “snapshots” of the amplitude of the signal over time. Each snapshot is called a sample[2].

What’s cool is that a digital signal is “just a list of numbers”, so it can be easily stored and used on a computer, unlike an analog signal. What’s even cooler is that if you take the samples fast enough, the collection of samples has enough information to allow perfectly re-creating the original analog signal, so you can convert back and forth between the two[3]. This means you can store your samples digitally on a computer or compact disc for later use, and then convert them into an analog form when you want to play them on your speakers.

A Sample of Sampling

“Sampling a signal” means to record instantaneous amplitudes (i.e. y-values) at a regular time interval, and put them in a list. Here’s an example of sampling a sine wave:

[Plot: a sine wave with evenly spaced sample points marked on it, amplitude (-1.0 to 1.0) vs. time]

Because the same amount of time elapses between each sample, the samples are all the same distance apart on the x-axis. (The x-axis denotes some arbitrary amount of time).
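
In real life the samples would come from measuring an analog signal, but we can simulate the process by evaluating a sine function at each sample time. Here's a minimal sketch in Ruby (the 440Hz frequency and 0.5 amplitude are just example values):

sample_rate = 44_100
frequency = 440.0
amplitude = 0.5

samples = (0...sample_rate).map do |n|
  time = n.to_f / sample_rate  # the moment in time of the nth sample
  amplitude * Math.sin(2 * Math::PI * frequency * time)
end

samples.length  # => 44100, i.e. 1 second of sound at this sample rate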

This list of samples can be saved into a file format such as *.wav or *.aiff (glossing over some technical details). When you play it back, your computer or stereo will convert the list of digital samples back into an analog signal so that it can be played on your speakers. Note that in real life, you would need many more samples to create a sound that lasts long enough to be heard.

End-to-End Example

Here’s an example of how this all can work in the real world. Suppose you want to record yourself playing guitar, using a microphone plugged into your computer. As you strum, the microphone will detect the analog sound wave and convert it into an equivalent analog electrical signal. This signal will be sent to your computer, which will sample it many times a second to convert it to digital form. You can then save this digital audio as a Wave file, MP3, etc. When you use a program to play it back, the samples will be converted back into an analog electrical signal and sent to your speakers, causing the speaker cones to move. The speaker movement will cause a sound wave to move through the air, which your ears will detect. You will frown at realizing you flubbed that chord.

Sample Rate

An important property of digital audio is the sample rate. This represents how many times a signal is sampled per second, and is expressed in hertz. A common sample rate is 44,100Hz, which means a sample is taken every 1/44,100th of a second. Or put differently, it means that 44,100 samples are used to create 1 second of sound.

A digital signal can only contain frequencies that are lower than half the sample rate. This threshold is called the Nyquist frequency. For example, if a digital signal has a sample rate of 10,000Hz, the Nyquist frequency is 5,000Hz. The signal can contain frequencies lower than 5,000Hz, but it’s not possible for it to contain frequencies 5,000Hz or higher.

Let’s say we want to sample a sound wave, and it only contains frequencies lower than 5,000Hz. Sampling at a rate of 10,000Hz or higher will allow us to capture all of the frequencies present, and we should be good to go.

But, what happens if the sound wave instead also contains frequencies 5,000Hz or higher? When sampled at 10,000Hz, are those higher frequencies just ignored? Nope! Instead, they get shifted down to a frequency below 5,000Hz. It’s as if frequencies above the Nyquist frequency are removed, and phantom lower frequencies are added in their place. For example, a 6,000Hz tone sampled at 10,000Hz comes out as a phantom 4,000Hz tone. This is called aliasing, and causes distortion.
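
You can see aliasing in action with a quick calculation (a hypothetical sketch; the frequencies are just example values). At a 10,000Hz sample rate, a 6,000Hz sine wave produces the exact same samples as a 4,000Hz sine wave with its sign flipped:

sample_rate = 10_000

10.times do |n|
  high_tone  = Math.sin(2 * Math::PI * 6_000 * n / sample_rate)
  alias_tone = Math.sin(2 * Math::PI * 4_000 * n / sample_rate)

  # Each pair prints with equal magnitude and opposite sign. In other
  # words, at this sample rate the 6,000Hz tone is indistinguishable
  # from an inverted 4,000Hz tone.
  puts format("%2d: %8.4f  %8.4f", n, high_tone, alias_tone)
end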

You can prevent aliasing by either increasing the sample rate so that it is more than twice the maximum frequency being sampled, or by filtering out frequencies at the Nyquist frequency and greater before sampling. This type of filtering is called anti-aliasing.

Since CD audio is sampled at 44,100Hz, CDs can accurately reproduce frequencies up to 22,050Hz. Humans can hear frequencies up to around 20,000Hz, so this means CDs can essentially capture the range of human hearing. (As long as frequencies 22,050Hz or higher are filtered out before sampling to prevent aliasing distortion). Historically, telephone signals used an 8,000Hz sample rate. This allowed capturing most, but not all, of the frequencies of human speech. This meant you could understand someone talking, but it sounded a little muffled.

Bits Per Sample / Sample Depth

While the sample rate determines how accurately we can capture frequencies with digital audio, the sample depth determines how accurately we can capture amplitudes. Or put differently, the sample rate determines resolution on the x-axis, and the sample depth determines resolution on the y-axis. The sample depth determines the difference between the quietest and the loudest sounds we can capture. This difference is called the dynamic range.

Digital audio commonly represents each sample as an 8-, 16-, or 24-bit integer. 8-bit numbers can encode 256 different amplitudes, 16-bit numbers can encode 65,536, and 24-bit numbers can encode 16,777,216. An analog signal is continuous and has an infinite number of possible amplitudes, but an 8/16/24-bit integer only allows a fixed number of possible values, so real world digital audio can’t 100% accurately capture the amplitude of a signal. The process of converting a sampled analog amplitude to the nearest of these possible integer values is called quantization. The number of bits used to encode each sample is called the bits per sample.
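
As a rough sketch of how this works (the quantize helper below is hypothetical, not part of any library), quantizing a floating point amplitude to an n-bit integer might look like this:

# Map a continuous amplitude in the -1.0..1.0 range to the nearest
# value an n-bit signed integer can hold.
def quantize(sample, bits_per_sample)
  max = (2**(bits_per_sample - 1)) - 1  # e.g. 32767 for 16 bits
  (sample * max).round.clamp(-max - 1, max)
end

quantize(0.3, 16)  # => 9830 (one of 65,536 possible values)
quantize(0.3, 8)   # => 38   (one of 256 possible values)

The fewer the bits, the more nearby amplitudes get lumped into the same bucket, which is exactly the loss of detail you can hear in the demo below.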

Try choosing different values for the bits per sample. Listen to what these different sample depths sound like, and notice how closely (or not) the digital samples match the original analog signal.

[Interactive plot: an analog signal and its quantized digital samples, with selectable bits per sample; amplitude (-1.0 to 1.0) vs. time]

Notice how the original analog signal changes in amplitude over time. The higher the bits per sample, the higher the dynamic range, and thus the more accurately the fade in/out is captured in the digital signal. At 16 bits per sample, you can hear the fade in/out clearly. At 8 bits per sample, if you listen closely with headphones the quieter part of the signal is a little distorted. At 4 bits per sample the sound is very distorted due to there not being many “buckets” available for the amplitude to be mapped to. At 1 bit per sample the fade in/out disappears completely! This is because with 1 bit there is only enough dynamic range to distinguish between “sound at max volume” and “no sound”.

Although digital audio commonly uses integer samples, it’s also possible to use floating point numbers (normally in the range of -1.0 to 1.0). Floating point numbers can provide an even larger number of amplitude value “buckets” than 24-bit integers, and thus capture a wider dynamic range.

Clipping Distortion

Besides frequency aliasing, there is another type of distortion to watch out for. Clipping distortion occurs when the amplitude of a signal is outside the range of minimum and maximum sample values. When this happens, any sample that would correspond to a value less than -1.0 gets clamped to -1.0, and any value larger than 1.0 gets clamped to 1.0. This has the effect of flattening the top or bottom of the sampled signal, causing distortion.

[Plot: a sine wave whose peaks extend beyond ±1.0, flattened ("clipped") at 1.0 and -1.0; amplitude (-3.0 to 3.0) vs. time]

Clipping distortion can happen when mixing multiple signals together, such as a recording of a guitar with a recording of vocals. Digital signals are combined by adding the corresponding samples at each moment in time into a single sample. A sample of 0.7 and a sample of 0.9 might be fine individually, but added together they would result in a sample of 1.6, which would get clamped down to 1.0.
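
A sketch of this in Ruby (the mix helper is hypothetical):

# Mix two signals by summing corresponding samples, clamping any
# sum that falls outside the representable -1.0..1.0 range.
def mix(signal_a, signal_b)
  signal_a.zip(signal_b).map { |a, b| (a + b).clamp(-1.0, 1.0) }
end

mix([0.7, -0.5], [0.9, -0.8])  # => [1.0, -1.0], both sums clipped

One common way to avoid the distortion is to scale the signals down before mixing (for example, multiply every sample by 0.5) so that the sums stay within range.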

Number of Channels

When audio consists of a single sound wave, it is said to have 1 channel, or to be monophonic. Audio that has 2 channels is called stereo, and consists of two separate sound waves that are played at the same time. One sound wave is sent to a left speaker, and the other to a right speaker. This allows for an immersive effect, most noticeable when listening with headphones. For example, one instrument can be played in your left ear and another in your right ear. Or, a drum roll can sweep from left to right and back again. Audio CDs are stereo.

Audio with 3 or more channels is less common, but allows for an even more immersive experience. Surround sound, which generally uses 6 channels, allows for sound to come from in front of and behind you, in addition to just left and right.

Creating Your Own Digital Audio

You probably want to write programs to create your own sounds. (Why wouldn’t you?) To do so, your program needs to generate a list of samples, and then tell your computer to play back those samples so you can hear them.

One way to solve the playback problem is to save your samples to a sound file format such as MP3 or Wave, and let another program handle the playback for you. The WaveFile Ruby gem makes it possible to create Wave files using Ruby. The code below shows an example.

require "wavefile"
include WaveFile

amplitude = 0.3

# Create a square wave cycle, which alternates between the same
# positive and negative amplitude. Since the cycle is 100 samples
# long, if repeated it will have a frequency of 441Hz when the
# sample rate is 44,100Hz.
square_wave_cycle = ([amplitude] * 50) + ([-amplitude] * 50)

# Since the sample format is `:float`, the samples in the buffer
# should be numbers between -1.0 and 1.0
buffer = Buffer.new(square_wave_cycle, Format.new(:mono, :float, 44100))

Writer.new("mysound.wav", Format.new(:mono, :pcm_16, 44100)) do |writer|
  # Since the buffer has 100 samples, writing it 441 times will result
  # in 44,100 samples being written. This will result in 1 second of
  # sound when the sample rate is 44,100Hz.
  441.times { writer.write(buffer) }
end
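
As a variation tying back to the channels discussion above, the same square wave could be written as a stereo file with sound in the left channel only, where each stereo sample frame is a two-element [left, right] array:

require "wavefile"
include WaveFile

amplitude = 0.3

# The same 100-sample square wave cycle, but with silence (0.0) in
# the right channel of every sample frame.
stereo_cycle = ([[amplitude, 0.0]] * 50) + ([[-amplitude, 0.0]] * 50)
buffer = Buffer.new(stereo_cycle, Format.new(:stereo, :float, 44100))

Writer.new("leftonly.wav", Format.new(:stereo, :pcm_16, 44100)) do |writer|
  441.times { writer.write(buffer) }
end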

To learn more about different ways of creating simple sounds using Ruby, check out this article about NanoSynth.

Compact Discs, Wave Files, and MP3s

Throughout this article we’ve talked about different ways of storing digital audio, such as CDs, Wave files, and MP3s. Let’s look at these a bit more in depth.

First, audio CDs. The surface of a CD contains millions of tiny pits. A CD player uses a laser to read the pattern of pits, and converts it into a series of 1s and 0s. CDs use 16 bits per sample, so every 16 0s or 1s represents one sample. CDs store stereo data, so separate samples are stored for the left and right speakers. Since CDs use a sample rate of 44,100Hz, every second of sound requires 88,200 samples. (44,100 samples for the sound in the left speaker, and 44,100 samples for the right speaker). When you play a CD, a stream of samples is read from the CD surface and sent to a digital-to-analog converter (DAC) inside the CD player. The resulting analog signal is sent to your speakers or headphones, which convert the signal to a sound wave via movement of the speaker cone.
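
It's worth doing the math on how much data that is (a quick back-of-the-envelope calculation):

bits_per_sample = 16
channels = 2
sample_rate = 44_100

bits_per_second = bits_per_sample * channels * sample_rate
bits_per_second      # => 1411200, about 1.4 million bits per second
bits_per_second / 8  # => 176400 bytes per second, roughly 10MB per minute

That's a lot of data, which is one big reason the MP3 format described below became popular.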

Wave (.wav) and AIFF (.aiff or .aif) files on your computer are conceptually similar. Like a CD, a Wave/AIFF file mostly consists of a long stream of raw samples. (They can use compression, but I’m not sure how common this is). The beginning includes a small header that indicates the bits per sample, sample rate, etc., but the rest is mostly just samples. When you play a Wave/AIFF file, your computer converts this list of samples to an analog signal and sends it to your speakers. The documentation for the WaveFile gem has more info on the Wave file format.

MP3 files store audio data in a different way that allows them to be much smaller. MP3 files take advantage of the fact that due to the way our ears/brains work, certain frequencies in a sound can be removed without noticeably changing how it sounds. When an MP3 file is created, raw digital samples are first converted into a set of sine waves with frequencies and amplitudes that would re-create those samples if added together. Next, some of the frequencies are removed. Finally, the frequencies/amplitudes (not samples) are stored in a compressed manner. When an MP3 file is played, these frequencies/amplitudes are used to re-create a list of samples that can be played back. The end result is a file which is much smaller than a Wave file, but with “good enough” sound quality. Handy if you are downloading songs from the Internet.

Conclusion

Well, that should cover the basics. You should now know the gist of how digital audio works, and be able to sound smart at parties.

Footnotes

  1. Sound waves can travel through any physical medium, not just air. For example water, or solid objects like a desk or wall.
  2. "Sample" is an overloaded term. The original technical meaning is an instantaneous point in a signal, like we use in this article, but in popular usage it's come to mean a short snippet of sound, often taken from a pre-existing song.
  3. In the real world, sampling never results in a digital signal that allows perfectly recreating the original analog signal, for a few reasons. We'll cover some later in this article.