How to Prepare a Video Stream for SSAI (Ad Stitching)

Jan Sunavec
4 min read · Feb 20, 2023

Let me demonstrate how to create a proper SSAI stream that you can utilize with services such as AWS MediaTailor.

If you only have an MP4 file, you can’t simply stitch ads into every user session, because you would have to re-encode the entire file the moment the user presses play. In such cases, I recommend using CSAI (client-side ad insertion) instead. Re-encoding video is slow and costly, particularly with more complex codecs like H.265, VP9, or AV1.

The most efficient solution is to split the MP4 file into video chunks. In this way, you can effortlessly stitch ads between chunk X and X+1, without the need to re-encode the whole video. You just need to modify the list of chunks. There are also a few common protocols, such as HLS or MPEG-DASH, that handle all technical aspects. In this example, we will focus on HLS.

Below is what a manifest file (.m3u8) of such a video stream looks like.
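
The following playlist is illustrative; the segment names (filename0.ts and so on) and durations are examples, and the exact output depends on your input and FFmpeg settings:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.000000,
filename0.ts
#EXTINF:10.000000,
filename1.ts
#EXTINF:8.340000,
filename2.ts
#EXT-X-ENDLIST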

To convert an MP4 file to an HLS manifest with TS segments, you can use the following command:

ffmpeg -i filename.mp4 -c:v libx264 -f hls filename.m3u8

The number of video chunks you get depends on the length of the input video. For example, if you have a 10-minute video and FFmpeg creates two chunks of equal duration, you only have two points (at 0 and 5 minutes) where you can stitch ads. You can’t insert an ad at the one-minute mark or at the five-second mark. It’s important to consider the granularity you need based on your technical and business requirements.

To achieve a granularity of 10 seconds, use the following FFmpeg command:

ffmpeg -i filename.mp4 -hls_time 10 -c:v libx264 -f hls filename.m3u8
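
To sanity-check the segment durations, you can inspect one of the generated segments with ffprobe (filename0.ts is just an example name from the playlist above):

ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 filename0.ts

Keep in mind that -hls_time is a target, not a hard limit: segments are cut at keyframe boundaries, so they can come out longer. Also, with FFmpeg’s default settings only the last few segments are kept in the playlist; add -hls_list_size 0 if you want the full list for VOD.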

Complex part

The last FFmpeg command is useful, but the stream it produces is still not ready for SSAI. Why is that? Video encoders work frame by frame: each frame is encoded one after another, and to compress the data efficiently, each frame is encoded as an I, P, or B frame.

I-frames

I-frames, also known as key frames or intra-frames, are self-contained: they carry all the data needed to decode the picture without referencing any other frame.

P-frames

P-frames, also known as predicted frames, reference frames that have been encoded before them. When there is little or no motion in the video (for example, a slide presentation), P-frames compress far better than I-frames.

B-frames

B-frames, known as bidirectional frames, reference both previously encoded frames and frames that come later in the stream, so they can predict from the past and the future.
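
If you want to see which frame types a chunk actually contains, ffprobe can list them (filename0.ts is again just an example segment name):

ffprobe -v error -select_streams v:0 -show_entries frame=pict_type -of csv=p=0 filename0.ts

The output is one letter per frame (I, P, or B), so you can see exactly where the I-frames sit in each chunk.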

SSAI, finally

Why are we talking about these technical details? Because we need to understand how these frames are laid out inside a video chunk. A chunk can contain one I-frame and several P-frames, or only P-frames. Let’s use the following example:
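
Sketched as text (the number of frames per chunk is illustrative), the layout looks like this:

Chunk 1: I P P P P P
Chunk 2: P P P P P P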

Across these two chunks there is only one I-frame, at the very start of chunk 1. Now let’s insert an ad. What will happen?
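
Sketched as text again (the length of the ad is illustrative), the frame sequence the player sees is now:

Chunk 1 (content): I P P P P P
Ad:                I P P P
Chunk 2 (content): P P P P P P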

As you can see, after the P-frames of the main content there is an I-frame from the ad, followed by the ad’s P-frames, and only then the next P-frame of the main content. Chunk 1 plays, then the ad plays. Chunk 2, however, will not render at all, because its first frame is a P-frame: the decoder tries to reconstruct it from previously decoded frames, and the only frames available at that point belong to the ad.

What should we do now? Re-encoding the rest of the frames sounds like a pain. The second, much cheaper option is to start every chunk with an I-frame, so that each chunk can be decoded on its own. This can be done with the following FFmpeg command:

ffmpeg -i filename.mp4 -hls_time 10 -hls_flags independent_segments -c:v libx264 -f hls filename.m3u8
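
In practice, segments can still come out longer than 10 seconds if the encoder’s keyframe interval is larger than the segment target. A common companion option is to also force a keyframe every 10 seconds, for example:

ffmpeg -i filename.mp4 -hls_time 10 -hls_flags independent_segments -force_key_frames "expr:gte(t,n_forced*10)" -c:v libx264 -f hls filename.m3u8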

After the ad is stitched, every content chunk still starts with its own I-frame, so chunk 2 decodes correctly and playback continues cleanly after the ad.
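
At the manifest level, a stitcher such as AWS MediaTailor splices the ad segments into the playlist and typically marks the breaks with EXT-X-DISCONTINUITY tags. A simplified sketch (the segment names, the ad file ad0.ts, and the durations are made up):

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.000000,
filename0.ts
#EXT-X-DISCONTINUITY
#EXTINF:10.000000,
ad0.ts
#EXT-X-DISCONTINUITY
#EXTINF:10.000000,
filename1.ts
#EXTINF:8.340000,
filename2.ts
#EXT-X-ENDLIST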

That’s it. Just add support for multi-bitrate, subtitles, and multiple audio tracks, and you’ll have a professional solution.
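
For the multi-bitrate part, the usual HLS approach is a master playlist that points to one media playlist per rendition. A minimal sketch (the bandwidths, resolutions, and paths are assumptions, not output of the commands above):

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/filename.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/filename.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/filename.m3u8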
