Reading MediaRecorder’s webm/opus output
Summary
JavaScript can now record, manipulate and play sound thanks to a whole range of audio-related APIs. One of the simpler and more useful parts of that is MediaRecorder, which can record sound, typically from the user’s microphone.
const stream = await navigator.mediaDevices
  .getUserMedia({audio: true, video: false}); // Media Capture and Streams API
const mediaRecorder = new MediaRecorder(stream); // MediaStream Recording API
mediaRecorder.ondataavailable = (ev) => { // ev is a BlobEvent
  // ev.data is a Blob. Send it to the server.
};
mediaRecorder.start(50); // call ondataavailable every 50ms
MediaRecorder on Chrome produces output with MIME type audio/webm;codecs=opus.
Let me guide you through the WebM / Matroska / EBML maze.
The WebM format is a container, which can contain many things. In our case it contains audio encoded with the Opus codec. We are only going to look at the container, not the encoded audio.
Here is an example webm/opus file from MediaRecorder. Let’s open it in a hex editor:
00000000 1A 45 DF A3 9F 42 86 81 01 42 F7 81 01 42 F2 81 .E...B...B...B..
00000010 04 42 F3 81 08 42 82 84 77 65 62 6D 42 87 81 04 .B...B..webmB...
00000020 42 85 81 02 18 53 80 67 01 FF FF FF FF FF FF FF B....S.g........
00000030 15 49 A9 66 99 2A D7 B1 83 0F 42 40 4D 80 86 43 .I.f.*....B@M..C
00000040 68 72 6F 6D 65 57 41 86 43 68 72 6F 6D 65 16 54 hromeWA.Chrome.T
00000050 AE 6B BF AE BD D7 81 01 73 C5 87 7A 48 E6 29 D6 .k......s..zH.).
00000060 15 3D 83 81 02 86 86 41 5F 4F 50 55 53 63 A2 93 .=.....A_OPUSc..
The first four bytes (1A 45 DF A3) are both the magic number and the ID of the EBML header element. WebM is a subset of Matroska, and Matroska is built on EBML (Extensible Binary Meta Language). The next byte (9F) is the length of the EBML header (31 bytes; read on for why). This EBML header is not specific to WebM or Matroska; it is common to all file formats that use EBML. The EBML header is specified here.
The next four bytes are more typical of an EBML element, so let’s examine them:
- 0x4286: EBML version element.
- 0x81: The length of the data for this element.
- 0x01: The value of this element, so we have EBML version 1.
EBML elements are always ID, Length, Value.
The EBML header IDs are defined here.
The length is variable-length encoded, a bit like UTF-8, to occupy between one and eight bytes. The count of initial consecutive zero-value bits plus one equals the length in bytes of, well, the length. You drop the initial 0 bits and the first 1 bit to get the value. 0x81 is 0b10000001, so there are zero initial 0 bits, meaning the length field is one byte, and the value is 1. The 0x9F value for the length of the EBML header we saw earlier is 0b10011111, still one byte, and the value is 0b0011111, which is 31 (the Python REPL is very helpful for these conversions).
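The decoding rule above can be sketched in a few lines of JavaScript. This is a minimal illustration, not code from any library; the function name readVintSize is made up for this example, and it ignores the fact that 8-byte values can exceed what a JavaScript number holds exactly.

```javascript
// Decode an EBML variable-length size starting at bytes[offset].
// Returns { length, value }, where length is the number of bytes consumed.
// Hypothetical helper for illustration only.
function readVintSize(bytes, offset = 0) {
  const first = bytes[offset];
  if (first === 0) {
    throw new Error("size fields longer than 8 bytes are not supported");
  }
  // Count the leading zero bits of the first byte: length = zeros + 1.
  let length = 1;
  let marker = 0x80;
  while ((first & marker) === 0) {
    marker >>= 1;
    length += 1;
  }
  // Drop the leading zeros and the marker bit, keep the remaining bits.
  let value = first & (marker - 1);
  for (let i = 1; i < length; i++) {
    // Multiply instead of shifting: bit shifts in JavaScript are 32-bit.
    value = value * 256 + bytes[offset + i];
  }
  return { length, value };
}

console.log(readVintSize([0x81]));       // { length: 1, value: 1 }
console.log(readVintSize([0x9f]));       // { length: 1, value: 31 }
console.log(readVintSize([0x41, 0x86])); // { length: 2, value: 390 }
```

The last call uses a two-byte length that shows up later in the file.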
The next four bytes 42 F7 81 01 are element ID 0x42F7 (EBML Read Version), length 1, value 1. mkvinfo (Fedora package mkvtoolnix) can do the work for us:
+ EBML head
|+ EBML version: 1
|+ EBML read version: 1
|+ Maximum EBML ID length: 4
|+ Maximum EBML size length: 8
|+ Document type: webm
|+ Document type version: 4
|+ Document type read version: 2
+ Segment: size unknown
|+ Segment information
| + Timestamp scale: 1000000
| + Multiplexing application: Chrome
| + Writing application: Chrome
|+ Tracks
| + Track
|  + Track number: 1 (track ID for mkvmerge & mkvextract: 0)
|  + Track UID: 34420100540273981
|  + Track type: audio
|  + Codec ID: A_OPUS
|  + Codec's private data: size 19
|  + Audio track
|   + Sampling frequency: 48000
|   + Channels: 1
|   + Bit depth: 16
|+ Cluster
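For comparison, here is a rough sketch that decodes the EBML head fields by hand from the first 36 bytes of the hex dump. The function and table names (decodeEbmlHeader, NAMES, readId, readSize) are invented for this example. It relies on element IDs following the same leading-zero length rule as sizes, except that the marker bit stays part of the ID.

```javascript
// The first 36 bytes of the example file, copied from the hex dump.
const header = Uint8Array.from([
  0x1a, 0x45, 0xdf, 0xa3, 0x9f, 0x42, 0x86, 0x81, 0x01, 0x42, 0xf7, 0x81,
  0x01, 0x42, 0xf2, 0x81, 0x04, 0x42, 0xf3, 0x81, 0x08, 0x42, 0x82, 0x84,
  0x77, 0x65, 0x62, 0x6d, 0x42, 0x87, 0x81, 0x04, 0x42, 0x85, 0x81, 0x02,
]);

// Labels for the two-byte IDs that appear inside the EBML header.
const NAMES = {
  0x4286: "EBML version",
  0x42f7: "EBML read version",
  0x42f2: "Maximum EBML ID length",
  0x42f3: "Maximum EBML size length",
  0x4282: "Document type",
  0x4287: "Document type version",
  0x4285: "Document type read version",
};

// Byte count of an ID or size field: leading zero bits of first byte + 1.
const vintLength = (first) => Math.clz32(first) - 24 + 1;

// Read an element ID: the marker bit is kept as part of the ID.
function readId(bytes, offset) {
  const length = vintLength(bytes[offset]);
  let id = 0;
  for (let i = 0; i < length; i++) id = id * 256 + bytes[offset + i];
  return { length, id };
}

// Read an element size: the marker bit is dropped.
function readSize(bytes, offset) {
  const length = vintLength(bytes[offset]);
  let value = bytes[offset] & (0xff >> length);
  for (let i = 1; i < length; i++) value = value * 256 + bytes[offset + i];
  return { length, value };
}

function decodeEbmlHeader(bytes) {
  const id = readId(bytes, 0);             // 0x1A45DFA3
  const size = readSize(bytes, id.length); // 31
  let pos = id.length + size.length;
  const end = pos + size.value;
  const fields = {};
  while (pos < end) {
    const childId = readId(bytes, pos);
    const childSize = readSize(bytes, pos + childId.length);
    const start = pos + childId.length + childSize.length;
    const data = bytes.subarray(start, start + childSize.value);
    const name = NAMES[childId.id] ?? "0x" + childId.id.toString(16);
    // Document type is an ASCII string; the other fields are unsigned ints.
    fields[name] = childId.id === 0x4282
      ? String.fromCharCode(...data)
      : data.reduce((acc, b) => acc * 256 + b, 0);
    pos = start + childSize.value;
  }
  return { fields, nextOffset: end };
}

const { fields, nextOffset } = decodeEbmlHeader(header);
console.log(fields["Document type"], nextOffset); // webm 36
```

The nextOffset value is where the first element after the EBML header begins.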
If we jump over the EBML header, we land at position 36 (0x24) in the file: 4 bytes for the EBML ID (0x1A45DFA3), 1 for the length (0x9F) and 31 for the header’s value. There we find 18 53 80 67. We are done with the generic EBML fields and now come to the Matroska fields.
The Matroska IDs are defined here.
The ID 18 53 80 67 is a Segment. This segment includes all our audio. It has length 01 FF FF FF FF FF FF FF, an eight-byte length with every value bit set to 1, which is the special value meaning “size unknown”. Chrome couldn’t know how long our recording was going to be when we started streaming it to our server.
Next comes 15 49 A9 66, the segment information, with length 25 (0x99, 0b10011001, one byte with value 25). Skipping those 25 bytes takes us to position 0x4E, which is 16 54 AE 6B, Tracks, with length 63 (0xBF). Skipping the Tracks element takes us to position 0x92, where 1F 43 B6 75 starts a Cluster, which is also size unknown. That contains a Timecode (E7), length 1, value 0, because we’re at the very start of the audio.
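The same walk can be written as code. The sketch below reads element headers from the bytes of the hex dump and reproduces the offsets 0x24, 0x4E and 0x92. The helper name readElementHeader and the use of null for “size unknown” are choices made for this example only.

```javascript
// The bytes shown in the hex dump (offsets 0x00 to 0x6F).
const hex = `
1A 45 DF A3 9F 42 86 81 01 42 F7 81 01 42 F2 81
04 42 F3 81 08 42 82 84 77 65 62 6D 42 87 81 04
42 85 81 02 18 53 80 67 01 FF FF FF FF FF FF FF
15 49 A9 66 99 2A D7 B1 83 0F 42 40 4D 80 86 43
68 72 6F 6D 65 57 41 86 43 68 72 6F 6D 65 16 54
AE 6B BF AE BD D7 81 01 73 C5 87 7A 48 E6 29 D6
15 3D 83 81 02 86 86 41 5F 4F 50 55 53 63 A2 93`;
const bytes = Uint8Array.from(hex.trim().split(/\s+/), (h) => parseInt(h, 16));

// Read the ID and size of the element at `offset`.
// size is null when every value bit is 1, i.e. "size unknown".
function readElementHeader(bytes, offset) {
  const idLength = Math.clz32(bytes[offset]) - 23;
  let id = 0;
  for (let i = 0; i < idLength; i++) id = id * 256 + bytes[offset + i];

  const sizeOffset = offset + idLength;
  const sizeLength = Math.clz32(bytes[sizeOffset]) - 23;
  const mask = 0xff >> sizeLength;
  let size = bytes[sizeOffset] & mask;
  let allOnes = size === mask;
  for (let i = 1; i < sizeLength; i++) {
    const b = bytes[sizeOffset + i];
    allOnes = allOnes && b === 0xff;
    size = size * 256 + b;
  }
  return { id, size: allOnes ? null : size, dataOffset: sizeOffset + sizeLength };
}

const ebml = readElementHeader(bytes, 0);
const segmentOffset = ebml.dataOffset + ebml.size;      // 0x24
const segment = readElementHeader(bytes, segmentOffset);
// The Segment has unknown size, so we step into it rather than over it.
const info = readElementHeader(bytes, segment.dataOffset);
const tracksOffset = info.dataOffset + info.size;       // 0x4E
const tracks = readElementHeader(bytes, tracksOffset);
const clusterOffset = tracks.dataOffset + tracks.size;  // 0x92

console.log(segment.size);               // null
console.log(clusterOffset.toString(16)); // 92
```

Because the Segment and Cluster have no usable size, a parser cannot skip them; it has to descend into them and read their children one by one.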
In the Cluster, after the Timecode, is a SimpleBlock (A3) of length 41 86. In binary 0x41 is 0b01000001: there is one leading 0 bit, meaning the length is encoded in two bytes. Dropping the marker bit from 0x4186 leaves 0x0186, giving us a value of 390 bytes.
The structure of a SimpleBlock is first the track number as a variable-length int (81, meaning track 1), then the timecode as two bytes (00 00), then one byte of flags (80, meaning the keyframe flag is set).
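As a rough sketch, the fixed part of a SimpleBlock body can be parsed like this. The function name parseSimpleBlockHeader is hypothetical, and the sketch only handles one-byte track numbers. The two timecode bytes are read as a signed big-endian 16-bit integer, which in Matroska is relative to the enclosing Cluster’s Timecode.

```javascript
// Parse the fixed part of a SimpleBlock body, i.e. the bytes that follow
// the A3 ID and the size field. Hypothetical helper for illustration.
function parseSimpleBlockHeader(body) {
  // The track number is a variable-length int. This sketch only handles
  // the one-byte form, which is enough for a single audio track.
  if ((body[0] & 0x80) === 0) {
    throw new Error("multi-byte track numbers are not handled in this sketch");
  }
  const trackNumber = body[0] & 0x7f;

  // Timecode: signed 16-bit, big-endian (the DataView default).
  const view = new DataView(body.buffer, body.byteOffset, body.byteLength);
  const timecode = view.getInt16(1);

  const flags = body[3];
  return {
    trackNumber,
    timecode,
    keyframe: (flags & 0x80) !== 0,
    payloadOffset: 4, // the codec data starts right after the flags byte
  };
}

// The four header bytes from the example file, followed by dummy payload.
const block = Uint8Array.from([0x81, 0x00, 0x00, 0x80, 0xde, 0xad]);
const parsed = parseSimpleBlockHeader(block);
console.log(parsed.trackNumber, parsed.timecode, parsed.keyframe); // 1 0 true
```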
Finally, at position 0xA8 we find our Opus-encoded audio data.
From then on we have a sequence of SimpleBlocks. The body of the first block starts at 0xA4, right after its length field, so it runs until 0xA4 + 390 = 0x22A. That is the start of the next SimpleBlock. Jumping there we find A3 41 83, a SimpleBlock of length 387. And so on.
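Hopping from block to block can be sketched as a small generator. The names simpleBlocks and readSize are invented for this example, and the test buffer is synthetic: it only places the two SimpleBlock headers at the offsets described above, with zero bytes standing in for everything else.

```javascript
// Decode an EBML size field at bytes[offset] (marker bit dropped).
function readSize(bytes, offset) {
  const length = Math.clz32(bytes[offset]) - 23;
  let value = bytes[offset] & (0xff >> length);
  for (let i = 1; i < length; i++) value = value * 256 + bytes[offset + i];
  return { length, value };
}

// Yield { offset, size, dataOffset } for each consecutive SimpleBlock,
// starting at `start`. Stops at the first element that is not a
// SimpleBlock (for example the next Cluster) or at an incomplete block.
function* simpleBlocks(bytes, start) {
  let pos = start;
  while (pos < bytes.length && bytes[pos] === 0xa3) {
    const size = readSize(bytes, pos + 1);
    const dataOffset = pos + 1 + size.length;
    if (dataOffset + size.value > bytes.length) break; // incomplete block
    yield { offset: pos, size: size.value, dataOffset };
    pos = dataOffset + size.value;
  }
}

// Synthetic stand-in for the example file: only the block headers are real.
const file = new Uint8Array(0xa1 + 3 + 390 + 3 + 387);
file.set([0xa3, 0x41, 0x86], 0xa1);  // first SimpleBlock, length 390
file.set([0xa3, 0x41, 0x83], 0x22a); // second SimpleBlock, length 387

for (const b of simpleBlocks(file, 0xa1)) {
  console.log(b.offset.toString(16), b.size);
}
// a1 390
// 22a 387
```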
This SimpleBlock structure means that to stream WebM you have to start your stream with an EBML header, and then start the data on a SimpleBlock boundary. Your server has to parse the stream as it comes in to find the SimpleBlock boundaries (MediaRecorder does not split its chunks on SimpleBlock boundaries). By comparison, MP3 is much simpler to stream: you can start the data wherever you want, there is no header, and each block is self-describing and easy to find.
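To give an idea of what that parsing involves, here is a minimal sketch of a splitter that accepts chunks cut at arbitrary positions and hands out whole SimpleBlock elements. It is deliberately incomplete: createBlockSplitter is a made-up name, and the sketch assumes the EBML header, Segment, Tracks and the Cluster’s ID and Timecode have already been consumed, so that the input is nothing but back-to-back SimpleBlocks. A real server would also have to handle the later Cluster elements.

```javascript
// Decode a size field, or return null if it has not fully arrived yet.
function readSizeOrNull(bytes, offset) {
  if (offset >= bytes.length) return null;
  const length = Math.clz32(bytes[offset]) - 23;
  if (offset + length > bytes.length) return null;
  let value = bytes[offset] & (0xff >> length);
  for (let i = 1; i < length; i++) value = value * 256 + bytes[offset + i];
  return { length, value };
}

// Returns a push(chunk) function. onBlock is called with each complete
// SimpleBlock element (ID, size and body) as soon as all of it has arrived.
function createBlockSplitter(onBlock) {
  let pending = new Uint8Array(0);
  return function push(chunk) {
    // Append the new chunk to whatever was left over from the last one.
    const joined = new Uint8Array(pending.length + chunk.length);
    joined.set(pending, 0);
    joined.set(chunk, pending.length);

    let pos = 0;
    while (pos < joined.length) {
      if (joined[pos] !== 0xa3) {
        throw new Error("expected a SimpleBlock, got 0x" + joined[pos].toString(16));
      }
      const size = readSizeOrNull(joined, pos + 1);
      if (size === null) break;        // size field not complete yet
      const end = pos + 1 + size.length + size.value;
      if (end > joined.length) break;  // body not complete yet
      onBlock(joined.subarray(pos, end));
      pos = end;
    }
    pending = joined.slice(pos); // keep the unfinished tail for next time
  };
}

// Two synthetic SimpleBlocks (lengths 390 and 387), fed in 100-byte chunks.
const stream = new Uint8Array(3 + 390 + 3 + 387);
stream.set([0xa3, 0x41, 0x86], 0);
stream.set([0xa3, 0x41, 0x83], 393);

const sizes = [];
const push = createBlockSplitter((block) => sizes.push(block.length));
for (let i = 0; i < stream.length; i += 100) {
  push(stream.subarray(i, i + 100));
}
console.log(sizes); // [ 393, 390 ]
```

Re-copying the pending bytes on every chunk keeps the sketch short; a production parser would more likely keep a list of chunks and avoid the repeated allocation.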
References: