
    Basic concepts behind Web Audio API

    Draft
    This page is not complete.

    This article explains some of the audio theory behind how the features of the Web Audio API work. It won't turn you into a master sound engineer, but it will give you enough background to understand why the Web Audio API works like it does, allowing you to make better-informed decisions while developing with it.

    Audio graphs

    The Web Audio API involves handling audio operations inside an audio context, and has been designed to allow modular routing. Basic audio operations are performed with audio nodes, which are linked together to form an audio routing graph. Several sources — with different types of channel layout — are supported even within a single context. This modular design provides the flexibility to create complex audio functions with dynamic effects.

    Audio nodes are linked via their inputs and outputs, forming a chain that starts with one or more sources, goes through one or more nodes, then ends up at a destination (although you don't have to provide a destination if you, say, just want to visualise some audio data.) A simple, typical workflow for web audio would look something like this:

    1. Create audio context
    2. Inside the context, create sources — such as <audio>, oscillator, stream
    3. Create effects nodes, such as reverb, biquad filter, panner, compressor
    4. Choose final destination of audio, for example your system speakers
    5. Connect the sources up to the effects, and the effects to the destination.
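    The five steps above can be sketched in code. This is a minimal, hypothetical example (the choice of an oscillator source and a biquad filter effect is just for illustration), guarded so it only runs where the Web Audio API is available:

```javascript
// Build a minimal source -> effect -> destination graph.
// Hypothetical sketch: the node choices are examples, not requirements.
function setUpGraph(context) {
  // 2. Inside the context, create a source
  var oscillator = context.createOscillator();
  oscillator.frequency.value = 440; // A4

  // 3. Create an effects node
  var filter = context.createBiquadFilter();
  filter.type = 'lowpass';

  // 4 & 5. Connect the source to the effect, and the effect to the destination
  oscillator.connect(filter);
  filter.connect(context.destination);

  return oscillator;
}

// 1. Create the audio context (only in environments that support it)
if (typeof AudioContext !== 'undefined') {
  var source = setUpGraph(new AudioContext());
  source.start();
}
```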

    A simple box diagram with an outer box labeled Audio context, and three inner boxes labeled Sources, Effects and Destination. The three inner boxes have arrow between them pointing from left to right, indicating the flow of audio information.

    Each input or output is composed of several channels, which represent a specific audio layout. Any discrete channel structure is supported, including mono, stereo, quad, 5.1, and so on.

    Show the ability of AudioNodes to connect via their inputs and outputs and the channels inside these inputs/outputs.

    Audio sources can come from a variety of places:

    • Generated directly by JavaScript by an audio node (such as an oscillator.)
    • Created from raw PCM data (the audio context has methods to decode supported audio formats.)
    • Taken from HTML media elements (such as <video> or <audio>.)
    • Taken directly from a WebRTC MediaStream (such as a webcam or microphone.)
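    Each of these source types maps to a factory method on the audio context. A hedged sketch (the media element and stream arguments are assumed to come from elsewhere, e.g. a page's &lt;audio&gt; element and getUserMedia):

```javascript
// Hypothetical sketch of the four ways to obtain a source node.
function createSources(context, mediaElement, micStream) {
  // Generated directly by an audio node
  var oscillator = context.createOscillator();

  // Created from raw PCM data (an empty one-second mono buffer here)
  var buffer = context.createBuffer(1, context.sampleRate, context.sampleRate);
  var bufferSource = context.createBufferSource();
  bufferSource.buffer = buffer;

  // Taken from an HTML media element (<audio> or <video>)
  var elementSource = context.createMediaElementSource(mediaElement);

  // Taken from a WebRTC MediaStream (e.g. a microphone)
  var streamSource = context.createMediaStreamSource(micStream);

  return [oscillator, bufferSource, elementSource, streamSource];
}
```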

    Audio data: what's in a sample

    TBD

    Cover things like why audio data needs to be in the range of -1.0 and 1.0, what the data means, etc.

    Audio buffers: frames, samples and channels

    An AudioBuffer takes as its parameters a number of channels (1 for mono, 2 for stereo, etc), a length, meaning the number of sample frames inside the buffer, and a sample rate, which is the number of sample frames played per second.

    A sample is a single float32 value that represents the value of the audio stream at a specific point in time, in a specific channel (left or right, in the case of stereo). A frame, or sample frame, is the set of all values for all channels that will play at a specific point in time: all the samples of all the channels that play at the same time (two for a stereo sound, six for 5.1, etc.)

    The sample rate is the number of those samples (or frames, since all samples of a frame play at the same time) that will play in one second, measured in Hz. The higher the sample rate, the better the sound quality.

    Let's look at a Mono and a Stereo audio buffer, each one second long, and playing at 44100Hz:

    • The Mono buffer will have 44100 samples, and 44100 frames. The length property will be 44100.
    • The Stereo buffer will have 88200 samples, but still 44100 frames. The length property will still be 44100, since it's equal to the number of frames.

    A diagram showing several frames in an audio buffer in a long line, each one containing two samples, as the buffer has two channels, it is stereo.

    When a buffer plays, you will hear the leftmost sample frame, then the one immediately to its right, and so on. In the case of stereo, you will hear both channels at the same time. Sample frames are very useful, because they are independent of the number of channels, and represent time in a way that is useful for doing precise audio manipulation.

    Note: To get a time in seconds from a frame count, simply divide the number of frames by the sample rate. To get a number of frames from a number of samples, simply divide by the channel count.
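    Those two conversions from the note above are simple enough to express directly:

```javascript
// Convert a frame count to a time in seconds.
function framesToSeconds(frames, sampleRate) {
  return frames / sampleRate;
}

// Convert a raw sample count to a frame count.
function samplesToFrames(samples, channelCount) {
  return samples / channelCount;
}

// One second of stereo audio at 44100Hz:
var frames = samplesToFrames(88200, 2);       // 44100 frames
var seconds = framesToSeconds(frames, 44100); // 1 second
```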

    Here are a couple of simple examples:

    var context = new AudioContext();
    var buffer = context.createBuffer(2, 22050, 44100);

    If you use this call, you will get a stereo buffer (two channels), that, when played back on an AudioContext running at 44100Hz (very common, most normal sound cards run at this rate), will last for 0.5 seconds: 22050 frames / 44100Hz = 0.5 seconds.

    var context = new AudioContext();
    var buffer = context.createBuffer(1, 22050, 22050);

    If you use this call, you will get a mono buffer (one channel), that, when played back on an AudioContext running at 44100Hz, will be automatically *resampled* to 44100Hz (and therefore yield 44100 frames), and last for 1.0 second: 44100 frames / 44100Hz = 1 second.
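    The duration arithmetic for both buffers can be checked directly (an AudioBuffer exposes the same result via its duration property):

```javascript
// Duration of a buffer, in seconds, from its frame count and its own sample rate.
function bufferDuration(lengthInFrames, sampleRate) {
  return lengthInFrames / sampleRate;
}

var stereoHalfSecond = bufferDuration(22050, 44100); // 0.5 seconds
var monoOneSecond = bufferDuration(22050, 22050);    // 1 second (resampling changes the frame count, not the duration)
```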

    Note: audio resampling is very similar to image resizing. Say you've got a 16 x 16 image, but you want it to fill a 32 x 32 area: you resize (resample) it. The result has less quality (it can be blurry or edgy, depending on the resizing algorithm) because upsizing cannot recreate detail that was never stored. Resampled audio is exactly the same: storing at a lower sample rate saves space, but in practice you will be unable to properly reproduce high-frequency content (treble sound).

    Planar versus interleaved buffers

    The Web Audio API uses a planar buffer format: the left and right channels are stored like this:

    LLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRR (for a buffer of 16 frames)

    This is very common in audio processing: it makes it easy to process each channel independently.
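    For example, with a planar layout, changing the volume of just the left channel is a single loop over one contiguous block of memory, since each channel is its own Float32Array (with the Web Audio API you would get such an array from AudioBuffer.getChannelData(0)). A sketch on a plain Float32Array:

```javascript
// Halve the amplitude of one channel's samples, in place.
function halveGain(channelData) {
  for (var i = 0; i < channelData.length; i++) {
    channelData[i] *= 0.5;
  }
  return channelData;
}

var left = new Float32Array([1.0, -1.0, 0.5, -0.5]);
halveGain(left); // left is now [0.5, -0.5, 0.25, -0.25]
```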

    The alternative is to use an interleaved buffer format:

    LRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLR (for a buffer of 16 frames)

    This format is very common for storing and playing back audio without much processing, for example a decoded MP3 stream.

    The Web Audio API exposes *only* planar buffers, because it's made for processing. It works with planar, but converts the audio to interleaved when it is sent to the sound card for playback. Conversely, when an MP3 is decoded, it starts off in interleaved format, but is converted to planar for processing.
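    The conversion between the two layouts is a straightforward reshuffle. A sketch for the stereo case:

```javascript
// Convert two planar stereo channels (LLLL..., RRRR...) into one interleaved
// buffer (LRLR...), as happens when audio is handed to the sound card.
function interleave(left, right) {
  var out = new Float32Array(left.length * 2);
  for (var i = 0; i < left.length; i++) {
    out[2 * i] = left[i];
    out[2 * i + 1] = right[i];
  }
  return out;
}

// Split an interleaved stereo buffer back into planar channels,
// as happens when a decoded MP3 enters the Web Audio API.
function deinterleave(interleaved) {
  var frames = interleaved.length / 2;
  var left = new Float32Array(frames);
  var right = new Float32Array(frames);
  for (var i = 0; i < frames; i++) {
    left[i] = interleaved[2 * i];
    right[i] = interleaved[2 * i + 1];
  }
  return [left, right];
}

var mixed = interleave(new Float32Array([1, 2]), new Float32Array([3, 4])); // [1, 3, 2, 4]
```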

    Audio channels

    Different audio buffers contain different numbers of channels, from the more basic mono (only one channel) and stereo (left and right channels) to more complex sets like quad and 5.1, which have different sound samples contained in each channel, leading to a richer sound experience. The channels are usually represented by standard abbreviations, detailed in the table below:

    Mono    0: M:   mono
    Stereo  0: L:   left
            1: R:   right
    Quad    0: L:   left
            1: R:   right
            2: SL:  surround left
            3: SR:  surround right
    5.1     0: L:   left
            1: R:   right
            2: C:   center
            3: LFE: subwoofer
            4: SL:  surround left
            5: SR:  surround right

    Up-mixing and down-mixing

    When the number of channels of an input doesn't match the number of channels of an output, up-mixing or down-mixing happens according to the following rules. This can be somewhat controlled by setting the AudioNode.channelInterpretation property to speakers or discrete.

    The following rules apply when channelInterpretation is set to speakers:

    1 (Mono) → 2 (Stereo): Up-mix from mono to stereo.
    The M input channel is used for both output channels (L and R).
    output.L = input.M
    output.R = input.M

    1 (Mono) → 4 (Quad): Up-mix from mono to quad.
    The M input channel is used for the non-surround output channels (L and R). The surround output channels (SL and SR) are silent.
    output.L  = input.M
    output.R  = input.M
    output.SL = 0
    output.SR = 0

    1 (Mono) → 6 (5.1): Up-mix from mono to 5.1.
    The M input channel is used for the center output channel (C). All the others (L, R, LFE, SL, and SR) are silent.
    output.L   = 0
    output.R   = 0
    output.C   = input.M
    output.LFE = 0
    output.SL  = 0
    output.SR  = 0

    2 (Stereo) → 1 (Mono): Down-mix from stereo to mono.
    Both input channels (L and R) are equally combined to produce the single output channel (M).
    output.M = 0.5 * (input.L + input.R)

    2 (Stereo) → 4 (Quad): Up-mix from stereo to quad.
    The L and R input channels are used for their respective non-surround output channels (L and R). The surround output channels (SL and SR) are silent.
    output.L  = input.L
    output.R  = input.R
    output.SL = 0
    output.SR = 0

    2 (Stereo) → 6 (5.1): Up-mix from stereo to 5.1.
    The L and R input channels are used for their respective non-surround output channels (L and R). The surround (SL and SR), center (C), and subwoofer (LFE) output channels are left silent.
    output.L   = input.L
    output.R   = input.R
    output.C   = 0
    output.LFE = 0
    output.SL  = 0
    output.SR  = 0

    4 (Quad) → 1 (Mono): Down-mix from quad to mono.
    All four input channels (L, R, SL, and SR) are equally combined to produce the single output channel (M).
    output.M = 0.25 * (input.L + input.R + input.SL + input.SR)

    4 (Quad) → 2 (Stereo): Down-mix from quad to stereo.
    Both left input channels (L and SL) are equally combined to produce the left output channel (L). Similarly, both right input channels (R and SR) are equally combined to produce the right output channel (R).
    output.L = 0.5 * (input.L + input.SL)
    output.R = 0.5 * (input.R + input.SR)

    4 (Quad) → 6 (5.1): Up-mix from quad to 5.1.
    The L, R, SL, and SR input channels are used for their respective output channels. The center (C) and subwoofer (LFE) channels are left silent.
    output.L   = input.L
    output.R   = input.R
    output.C   = 0
    output.LFE = 0
    output.SL  = input.SL
    output.SR  = input.SR

    6 (5.1) → 1 (Mono): Down-mix from 5.1 to mono.
    The left (L and SL), right (R and SR), and center channels are all mixed together. The surround channels are slightly attenuated, and the regular lateral channels are power-compensated to make them count as a single channel. The subwoofer (LFE) channel is lost.
    output.M = 0.7071 * (input.L + input.R) + input.C + 0.5 * (input.SL + input.SR)

    6 (5.1) → 2 (Stereo): Down-mix from 5.1 to stereo.
    The center channel (C) is summed with each lateral surround channel (SL or SR) and mixed into each lateral channel. As it is mixed down to two channels, it is mixed at a lower power: in each case it is multiplied by √2/2 (≈0.7071). The subwoofer (LFE) channel is lost.
    output.L = input.L + 0.7071 * (input.C + input.SL)
    output.R = input.R + 0.7071 * (input.C + input.SR)

    6 (5.1) → 4 (Quad): Down-mix from 5.1 to quad.
    The center channel (C) is mixed with the lateral non-surround channels (L and R). As it is mixed down into two channels, it is mixed at a lower power: in each case it is multiplied by √2/2. The surround channels are passed through unchanged. The subwoofer (LFE) channel is lost.
    output.L  = input.L + 0.7071 * input.C
    output.R  = input.R + 0.7071 * input.C
    output.SL = input.SL
    output.SR = input.SR

    Other, non-standard layouts are handled as if channelInterpretation were set to discrete.
    The specification explicitly allows the future definition of new speaker layouts. This fallback is therefore not future-proof, as the behavior of browsers for a specific number of channels may change.

    The following rules apply when channelInterpretation is set to discrete:

    any (x) → any (y), where x < y: Up-mix discrete channels.
    Fill each output channel with its input counterpart, that is, the input channel with the same index. Output channels with no corresponding input channel are left silent.

    any (x) → any (y), where x > y: Down-mix discrete channels.
    Fill each output channel with its input counterpart, that is, the input channel with the same index. Input channels with no corresponding output channel are dropped.
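    Two of the speakers down-mix rules above, written out as plain functions over per-channel Float32Arrays. This is a sketch of the arithmetic only; the API applies these rules internally, so you never need to implement them yourself:

```javascript
// Down-mix stereo to mono: output.M = 0.5 * (input.L + input.R)
function stereoToMono(left, right) {
  var mono = new Float32Array(left.length);
  for (var i = 0; i < left.length; i++) {
    mono[i] = 0.5 * (left[i] + right[i]);
  }
  return mono;
}

// Down-mix quad to mono: output.M = 0.25 * (input.L + input.R + input.SL + input.SR)
function quadToMono(left, right, surroundLeft, surroundRight) {
  var mono = new Float32Array(left.length);
  for (var i = 0; i < left.length; i++) {
    mono[i] = 0.25 * (left[i] + right[i] + surroundLeft[i] + surroundRight[i]);
  }
  return mono;
}

var m = stereoToMono(new Float32Array([1, 0]), new Float32Array([0, 1])); // [0.5, 0.5]
```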

    Visualizations

    TBD

    Cover how this all works, from a high level. What do the different values you can get an analyser to return mean? What ranges do they have, and why?

    Spatialisations

    TBD

    Cover how spatializations work, provide some diagrams, etc.

    Fan-in and Fan-out

    TBD — provide definition.

    Document Tags and Contributors

    Contributors to this page: anirudh_venkatesh, JesseTG, chrisdavidmills
    Last updated by: JesseTG