Graham King

Solvitas perambulum

Reading MediaRecorder’s webm/opus output

Summary
JavaScript now allows recording, manipulation, and playback of sound through various audio-related APIs, with MediaRecorder being one of the simplest and most practical for recording audio from the user's microphone. By leveraging the Media Capture and Streams API to access user media, developers can set up a MediaRecorder instance to capture audio, outputting data in the WebM format with Opus codec. WebM uses the EBML format, a structure consisting of elements defined by 'Id', 'Length', and 'Value'. These elements start with a specific EBML header, followed by container-specific elements like segments and clusters, encapsulating Matroska fields and ultimately the Opus-encoded audio data. To stream WebM effectively, the stream must start with an EBML header followed by SimpleBlocks, which contain the actual audio data, unlike the simpler MP3 format that lacks such complexities.

JavaScript can now record, manipulate and play sound thanks to a whole range of audio-related APIs. One of the simpler and more useful of these is MediaRecorder, which can record sound, typically from the user’s microphone.

const stream = await navigator.mediaDevices
    .getUserMedia({audio: true, video: false});  // Media Capture and Streams API
const mediaRecorder = new MediaRecorder(stream); // MediaStream Recording API
mediaRecorder.ondataavailable = (ev) => {        // ev is a BlobEvent
    // ev.data is a Blob. Send it to the server.
};
mediaRecorder.start(50);  // call ondataavailable every 50ms

MediaRecorder on Chrome produces output with mime type audio/webm;codecs=opus.

Let me guide you through the WebM / Matroska / EBML maze.

The WebM format is a container, which can contain many things. In our case it contains audio encoded with the Opus codec. We are only going to look at the container, not the encoded audio.

Here is an example webm/opus file from MediaRecorder. Let’s open it in a hex editor:

00000000   1A 45 DF A3  9F 42 86 81  01 42 F7 81  01 42 F2 81  .E...B...B...B..
00000010   04 42 F3 81  08 42 82 84  77 65 62 6D  42 87 81 04  .B...B..webmB...
00000020   42 85 81 02  18 53 80 67  01 FF FF FF  FF FF FF FF  B....S.g........
00000030   15 49 A9 66  99 2A D7 B1  83 0F 42 40  4D 80 86 43  .I.f.*....B@M..C
00000040   68 72 6F 6D  65 57 41 86  43 68 72 6F  6D 65 16 54  hromeWA.Chrome.T
00000050   AE 6B BF AE  BD D7 81 01  73 C5 87 7A  48 E6 29 D6  .k......s..zH.).
00000060   15 3D 83 81  02 86 86 41  5F 4F 50 55  53 63 A2 93  .=.....A_OPUSc..

The first four bytes (1A 45 DF A3) are both the magic number and the ID of the EBML header element. WebM is a subset of Matroska, and Matroska is built on EBML (Extensible Binary Meta Language). The next byte (9F) is the length of the EBML header (31 bytes; read on for why). This EBML header is not specific to WebM or Matroska; it is common to all file formats that use EBML. The EBML header is specified here.
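As a small illustration of the magic number, here is a sketch that checks whether a buffer starts with those four bytes. The function name startsWithEbmlMagic is made up for this example, and it assumes the bytes are already in an Array or Uint8Array.

```javascript
// The four-byte EBML header ID, which doubles as the file's magic number.
const EBML_MAGIC = [0x1A, 0x45, 0xDF, 0xA3];

// True if `bytes` (an Array or Uint8Array) starts with the EBML magic number.
// Hypothetical helper, not part of any library.
function startsWithEbmlMagic(bytes) {
    return bytes.length >= EBML_MAGIC.length &&
        EBML_MAGIC.every((value, i) => bytes[i] === value);
}

console.log(startsWithEbmlMagic([0x1A, 0x45, 0xDF, 0xA3, 0x9F])); // true
console.log(startsWithEbmlMagic([0x49, 0x44, 0x33]));             // false
```

In the browser, the bytes of a BlobEvent can be obtained with new Uint8Array(await ev.data.arrayBuffer()). Presumably only the first chunk of a recording starts with the magic number, since the header appears once at the start of the stream.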

The next four bytes are more typical of an EBML element, so let’s examine them:

  • 0x4286: EBML version element.
  • 0x81: The length of the data for this element.
  • 0x01: The value of this element, so we have EBML version 1.

EBML elements are always Id Length Value.

The EBML header IDs are defined here.

The length is variable-length encoded, a bit like UTF-8, and occupies between one and eight bytes. The count of initial consecutive zero bits, plus one, equals the length in bytes of, well, the length. You drop the initial 0 bits and the first 1 bit to get the value. 0x81 is 0b10000001, so there are zero initial 0 bits, meaning the length is one byte long, and the value is 1. The 0x9F value for the length of the EBML header we saw earlier is 0b10011111: still one byte, and the value is 0b0011111, which is 31 (the Python REPL is very helpful for these conversions).
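The decoding rule above can be sketched in a few lines. decodeVint is a name invented for this example. It uses plain Numbers, so it assumes values stay below 2^53 (an eight-byte length with all bits set would lose precision).

```javascript
// Decode an EBML variable-length size starting at bytes[offset].
// Returns how many bytes the size occupies (width) and its value.
// Hypothetical helper for illustration.
function decodeVint(bytes, offset = 0) {
    const first = bytes[offset];
    if (first === undefined || first === 0) {
        throw new RangeError("missing or invalid first size byte");
    }
    // Leading zero bits + 1 = width in bytes. Math.clz32 counts over 32 bits,
    // so subtract the 24 bits that sit above our single byte.
    const width = Math.clz32(first) - 24 + 1;
    if (offset + width > bytes.length) throw new RangeError("truncated size");

    // Drop the leading zero bits and the marker 1 bit from the first byte.
    let value = first & (0xFF >> width);
    for (let i = 1; i < width; i++) {
        // Multiply instead of shifting: << only works on 32-bit integers.
        value = value * 256 + bytes[offset + i];
    }
    return { width, value };
}

console.log(decodeVint([0x81]));       // { width: 1, value: 1 }
console.log(decodeVint([0x9F]));       // { width: 1, value: 31 }
console.log(decodeVint([0x41, 0x86])); // { width: 2, value: 390 }
```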

The next four bytes, 42 F7 81 01, are element ID 0x42F7 (EBML Read Version), length 1, value 1. mkvinfo (Fedora package mkvtoolnix) can do the work for us:

+ EBML head
|+ EBML version: 1
|+ EBML read version: 1
|+ Maximum EBML ID length: 4
|+ Maximum EBML size length: 8
|+ Document type: webm
|+ Document type version: 4
|+ Document type read version: 2
+ Segment: size unknown
|+ Segment information
| + Timestamp scale: 1000000
| + Multiplexing application: Chrome
| + Writing application: Chrome
|+ Tracks
| + Track
|  + Track number: 1 (track ID for mkvmerge & mkvextract: 0)
|  + Track UID: 34420100540273981
|  + Track type: audio
|  + Codec ID: A_OPUS
|  + Codec's private data: size 19
|  + Audio track
|   + Sampling frequency: 48000
|   + Channels: 1
|   + Bit depth: 16
|+ Cluster
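To see where the "EBML head" lines come from, here is a rough sketch that decodes the header's child elements from the first 36 bytes of the hex dump. The function names are made up for this example, the field labels are copied from the mkvinfo output, and error handling for truncated or malformed input is left out.

```javascript
// IDs of the EBML header's children, labelled as mkvinfo prints them.
const HEADER_FIELDS = new Map([
    [0x4286, "EBML version"],
    [0x42F7, "EBML read version"],
    [0x42F2, "Maximum EBML ID length"],
    [0x42F3, "Maximum EBML size length"],
    [0x4282, "Document type"],
    [0x4287, "Document type version"],
    [0x4285, "Document type read version"],
]);

// Width in bytes of an ID or size: leading zero bits of the first byte, plus one.
const vintWidth = (firstByte) => Math.clz32(firstByte) - 24 + 1;

// Big-endian unsigned integer from bytes[start .. start + len).
function readUint(bytes, start, len) {
    let n = 0;
    for (let i = 0; i < len; i++) n = n * 256 + bytes[start + i];
    return n;
}

// A size is the raw integer minus its marker bit, which sits at bit 7 * width.
const readSize = (bytes, start, width) =>
    readUint(bytes, start, width) - 2 ** (7 * width);

function parseEbmlHeader(bytes) {
    if (readUint(bytes, 0, 4) !== 0x1A45DFA3) throw new Error("not EBML");
    const sizeWidth = vintWidth(bytes[4]);
    let offset = 4 + sizeWidth;
    const end = offset + readSize(bytes, 4, sizeWidth);
    const fields = {};
    while (offset < end) {
        const idWidth = vintWidth(bytes[offset]);
        const id = readUint(bytes, offset, idWidth); // IDs keep their marker bit
        offset += idWidth;
        const lenWidth = vintWidth(bytes[offset]);
        const len = readSize(bytes, offset, lenWidth);
        offset += lenWidth;
        const name = HEADER_FIELDS.get(id) ?? "0x" + id.toString(16);
        fields[name] = id === 0x4282
            ? String.fromCharCode(...bytes.slice(offset, offset + len)) // ASCII
            : readUint(bytes, offset, len);
        offset += len;
    }
    return fields;
}

// The first 36 bytes of the hex dump above.
const headerSample = `
1A 45 DF A3 9F 42 86 81 01 42 F7 81 01 42 F2 81
04 42 F3 81 08 42 82 84 77 65 62 6D 42 87 81 04
42 85 81 02`.trim().split(/\s+/).map((h) => parseInt(h, 16));

console.log(parseEbmlHeader(headerSample)); // "Document type" comes out as "webm"
```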

If we jump over the EBML header (31 bytes), we go to position 36 (0x24) in the file. That is 4 bytes for the EBML ID (0x1A45DFA3), 1 for the length (0x9F) and 31 for the header’s value. There we find 18 53 80 67. We are done with the generic EBML fields and now come to the Matroska fields.

The Matroska IDs are defined here.

The id 18 53 80 67 is a Segment. This segment includes all our audio. It has length 01 FF FF FF FF FF FF FF which is the special value meaning “size unknown”. Chrome couldn’t know how long our recording was going to be when we started streaming it to our server.

Next comes 15 49 A9 66, the Segment Information element, with length 25 (0x99 is 0b10011001: one byte, value 25). Skipping those 25 bytes takes us to position 0x4E, where we find 16 54 AE 6B, the Tracks element, with length 63 (0xBF). Skipping Tracks takes us to position 0x92 and 1F 43 B6 75, a Cluster, which is also of unknown size. The Cluster contains a Timecode (E7) of length 1 and value 0, because we’re at the very start of the audio.
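The jumps described so far can be reproduced with a small walker. This is only a sketch: readHeader and listElements are invented names, it steps into any element of unknown size (here, the Segment) and skips everything else, and it stops when it runs off the end of the sample, which is the first 96 bytes of the hex dump.

```javascript
// Width in bytes of an ID or size: leading zero bits of the first byte, plus one.
const vintWidth = (firstByte) => Math.clz32(firstByte) - 24 + 1;

// Read an element's ID and size at `offset`. size is null for "size unknown".
function readHeader(bytes, offset) {
    const idWidth = vintWidth(bytes[offset]);
    let id = 0;
    for (let i = 0; i < idWidth; i++) id = id * 256 + bytes[offset + i];

    const sizeAt = offset + idWidth;
    const sizeWidth = vintWidth(bytes[sizeAt]);
    const mask = 0xFF >> sizeWidth;
    let size = bytes[sizeAt] & mask;
    let unknown = size === mask; // "size unknown" means every value bit is 1
    for (let i = 1; i < sizeWidth; i++) {
        size = size * 256 + bytes[sizeAt + i];
        unknown = unknown && bytes[sizeAt + i] === 0xFF;
    }
    return { id, size: unknown ? null : size, dataStart: sizeAt + sizeWidth };
}

// List the elements found by skipping over known-size elements and stepping
// into unknown-size ones.
function listElements(bytes) {
    const found = [];
    let offset = 0;
    while (offset < bytes.length) {
        const { id, size, dataStart } = readHeader(bytes, offset);
        found.push({ offset, id, size });
        offset = size === null ? dataStart : dataStart + size;
    }
    return found;
}

// The first 96 bytes of the hex dump above.
const sample = `
1A 45 DF A3 9F 42 86 81 01 42 F7 81 01 42 F2 81
04 42 F3 81 08 42 82 84 77 65 62 6D 42 87 81 04
42 85 81 02 18 53 80 67 01 FF FF FF FF FF FF FF
15 49 A9 66 99 2A D7 B1 83 0F 42 40 4D 80 86 43
68 72 6F 6D 65 57 41 86 43 68 72 6F 6D 65 16 54
AE 6B BF AE BD D7 81 01 73 C5 87 7A 48 E6 29 D6`
    .trim().split(/\s+/).map((h) => parseInt(h, 16));

for (const e of listElements(sample)) {
    console.log(e.offset.toString(16), e.id.toString(16), e.size);
}
// 0 1a45dfa3 31
// 24 18538067 null
// 30 1549a966 25
// 4e 1654ae6b 63
```

The Tracks element ends at 0x53 + 63 = 0x92, which is past the end of the sample, so the walk stops there.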

In the Cluster, after the Timecode, is a SimpleBlock (A3) with length 41 86. In binary, 0x41 is 0b01000001: the single leading 0 bit means the length is encoded in two bytes. Dropping that 0 bit and the marker 1 bit leaves 0x0186, which is 390 bytes.

The structure of a SimpleBlock is: first the track number as a variable-length integer (81, meaning track 1), then the timecode as two bytes (00 00), then one byte of flags (80, meaning the keyframe flag is set).
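Here is a sketch of reading those SimpleBlock fields. parseSimpleBlockHeader is a made-up name. The sketch treats the two-byte timecode as a signed big-endian integer relative to the Cluster's timecode, which is how the Matroska specification defines it, and it only interprets the keyframe bit of the flags.

```javascript
// Parse the fields at the start of a SimpleBlock's data (just after its
// ID and length). Hypothetical helper for illustration.
function parseSimpleBlockHeader(bytes, offset = 0) {
    // Track number: same variable-length encoding as element lengths.
    const width = Math.clz32(bytes[offset]) - 24 + 1;
    let track = bytes[offset] & (0xFF >> width);
    for (let i = 1; i < width; i++) track = track * 256 + bytes[offset + i];

    // Timecode: signed 16-bit big-endian, relative to the Cluster's timecode.
    const at = offset + width;
    const raw = (bytes[at] << 8) | bytes[at + 1];
    const timecode = raw >= 0x8000 ? raw - 0x10000 : raw;

    const flags = bytes[at + 2];
    return {
        track,
        timecode,
        keyframe: (flags & 0x80) !== 0, // top bit of the flags byte
        payloadStart: at + 3,           // where the encoded audio begins
    };
}

console.log(parseSimpleBlockHeader([0x81, 0x00, 0x00, 0x80]));
// { track: 1, timecode: 0, keyframe: true, payloadStart: 4 }
```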

Finally, at position 0xA8 we find our Opus-encoded audio data.

From then on we have a sequence of SimpleBlocks. The first block’s data starts at 0xA4, so it runs until 0xA4 + 390 = 0x22A, which is the start of the next SimpleBlock. Jumping there we find A3 41 83, a SimpleBlock of length 387. And so on.
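Hopping from one SimpleBlock to the next can be sketched as a generator. simpleBlocks is an invented name, and the sketch assumes `start` already points at a SimpleBlock ID (A3). It stops at the first byte that is not A3, which in a real file could be the start of another Cluster.

```javascript
// Yield the position and size of each consecutive SimpleBlock in `bytes`,
// starting at offset `start`. Hypothetical helper for illustration.
function* simpleBlocks(bytes, start) {
    let offset = start;
    while (offset < bytes.length && bytes[offset] === 0xA3) {
        const sizeAt = offset + 1;
        const width = Math.clz32(bytes[sizeAt]) - 24 + 1;
        // Stop on an invalid length byte or a length cut off by the buffer end.
        if (width > 8 || sizeAt + width > bytes.length) return;

        let size = bytes[sizeAt] & (0xFF >> width);
        for (let i = 1; i < width; i++) size = size * 256 + bytes[sizeAt + i];

        const dataStart = sizeAt + width;
        yield { offset, dataStart, size };
        offset = dataStart + size; // the next element starts right after the data
    }
}

// Two tiny blocks back to back: lengths 2 and 1.
const demo = [0xA3, 0x82, 0x11, 0x22, 0xA3, 0x81, 0x33];
console.log([...simpleBlocks(demo, 0)]);
// [ { offset: 0, dataStart: 2, size: 2 }, { offset: 4, dataStart: 6, size: 1 } ]
```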

This SimpleBlock structure means that to stream WebM you have to start your stream with an EBML header, and then start the data on a SimpleBlock boundary. Your server has to parse the stream as it comes in to find the SimpleBlock boundaries (MediaRecorder does not split on SimpleBlock). By comparison, MP3 is much simpler to stream: you can start the data wherever you want, there is no header, and each block is self-describing and easy to find.
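As a rough idea of what that server-side parsing might look like, here is a sketch of an incremental splitter. It buffers incoming chunks, steps into Segment and Cluster, skips other elements, and hands each complete SimpleBlock to a callback. All names are invented for this example. It assumes the first chunk starts on an element boundary and does no validation, so treat it as a starting point, not a finished parser.

```javascript
const SEGMENT = 0x18538067;
const CLUSTER = 0x1F43B675;
const SIMPLE_BLOCK = 0xA3;

// Width in bytes of an ID or size: leading zero bits of the first byte, plus one.
const vintWidth = (firstByte) => Math.clz32(firstByte) - 24 + 1;

// Read an element header at `offset`, or return null if more bytes are needed.
function tryReadHeader(buf, offset) {
    if (offset >= buf.length) return null;
    const sizeAt = offset + vintWidth(buf[offset]);
    if (sizeAt >= buf.length) return null;
    const sizeWidth = vintWidth(buf[sizeAt]);
    const dataStart = sizeAt + sizeWidth;
    if (dataStart > buf.length) return null;

    let id = 0;
    for (let i = offset; i < sizeAt; i++) id = id * 256 + buf[i];
    let size = buf[sizeAt] & (0xFF >> sizeWidth);
    for (let i = sizeAt + 1; i < dataStart; i++) size = size * 256 + buf[i];
    return { id, size, dataStart };
}

// Returns a push(chunk) function; chunk is a Uint8Array. onBlock receives
// each whole SimpleBlock (ID, length and data) as a Uint8Array.
function createBlockSplitter(onBlock) {
    let pending = new Uint8Array(0);
    return function push(chunk) {
        const buf = new Uint8Array(pending.length + chunk.length);
        buf.set(pending);
        buf.set(chunk, pending.length);

        let offset = 0;
        for (;;) {
            const header = tryReadHeader(buf, offset);
            if (header === null) break;            // header incomplete: wait
            if (header.id === SEGMENT || header.id === CLUSTER) {
                offset = header.dataStart;         // container: step inside
                continue;
            }
            const end = header.dataStart + header.size;
            if (end > buf.length) break;           // element incomplete: wait
            if (header.id === SIMPLE_BLOCK) onBlock(buf.subarray(offset, end));
            offset = end;
        }
        pending = buf.slice(offset);               // keep the unparsed tail
    };
}
```

Each call to push copies the buffered tail, which keeps the sketch short at the cost of some allocation. A listener joining late would still need the header bytes from the start of the stream before any of these blocks.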

References: