Amazon Freevee
Assessing how viewers' tolerance for streaming TV ad load changes by genre and format.

Big picture
The client: Amazon Freevee, Amazon's free, ad-supported streaming platform (now part of Prime Video)
My role: Lead Researcher @ Pilot.ly | Mixed-methods, product-focused ad experience study
The project:
Objective: Amazon Freevee wanted to move beyond a one-size-fits-all ad load and understand how many minutes of ads per hour could be shown to viewers before degrading the viewing experience, and how this changed depending on the genre and format of the content.
Research: I led a mixed-methods study where participants watched real TV shows and movies across 8 key genres (e.g., Reality, Comedy, Action), each with varying ad loads, while we captured behavioral data (completion, tune-out intent, skipping/scrubbing attempts) and then followed up with a survey to measure perceived enjoyment, ad intrusiveness, and willingness to keep watching under each condition.
Findings: We found a clear fatigue threshold at higher ad loads, plus strong differences in tolerance depending on genre and format: viewers would tolerate more ads while watching Reality and Comedy TV with only minor drops in satisfaction, while those watching movies—especially Action movies—were far more sensitive to increases beyond the existing 6 minute/hour baseline. Then, working with Freevee’s data team, we built a genre- and format-based decision grid to extrapolate these findings to Freevee’s broader content catalog, setting different default ad loads for different types of content instead of a single overarching ad load.
Impact: This project directly led to Freevee shifting to a dynamic ad serving model that increased ad load (and therefore revenue) in more tolerant categories, while protecting the viewer experience in more sensitive ones. The project was cited by stakeholders as a key driver in renewing Freevee’s six-figure contract with our team.
The problem
At the time, Freevee's ad load was standardized across content type and genre—viewers would see the same number of ads per hour whether they were watching an action movie or a reality TV show. Stakeholders were concerned that this one-size-fits-all approach was leading to poor user experience (and therefore churn) on one end, and missed revenue potential on the other.
So: how do we best balance the tension between user experience and revenue? How much ad load could we add across different content types before meaningfully harming viewer satisfaction and long-term engagement?
Hypothesis
Viewers would be more tolerant of higher ad loads in “lean-back” content (e.g., reality TV, comedy) than in “lean-in” content (e.g., action movies).
Beyond a certain point, incremental ad minutes would drive disproportionate drops in enjoyment, especially for movies, making additional ad time net-negative for long-term engagement.
Differences in tolerance by genre and format would be large enough to justify moving from a single, standardized ad load to a dynamic model.
Methodology
Assessing both user behavior and perception was important. To capture both, we decided on a mixed-methods approach:
Dial testing to measure real-time engagement and tune-out risk
A post-exposure survey to capture perceived ad intrusiveness, enjoyment, and intent to keep watching
Combining both let us see not only what viewers did, but how they felt about different ad loads across genres and formats.

Dial testing
While watching, viewers were asked to “thumbs up” or “thumbs down” in real time during parts of the show or movie, a methodology we use while testing content. At any point, they could also indicate they would “tune out” and stop watching. We were looking to measure two things:
Attention/engagement: Do viewers engage less during certain genres or formats? Does this change how many ads they say they are willing to tolerate in the survey?
Enjoyment: Does enjoyment of the content itself — not just enjoyment of the overall viewing experience — suffer when ad loads are increased? When excluding segments where ads are played, is the average sentiment (ratio of thumbs up/down) lower when ad loads are higher, and by how much?
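The content-enjoyment metric above boils down to a simple computation: drop the dial events that occurred during ad segments, then average the thumbs up/down votes within each ad-load condition. A minimal sketch of that idea, using invented column names and toy data (the real dial-testing pipeline is not shown here):

```python
# Hypothetical sketch of the content-only sentiment metric.
# All data below is illustrative, not from the actual study.
import pandas as pd

# One row per dial event: viewer ID, ad-load condition (min/hr),
# the vote (+1 thumbs up, -1 thumbs down), and whether an ad was on screen.
events = pd.DataFrame({
    "viewer":  [1, 1, 2, 2, 3, 3],
    "ad_load": [6, 6, 10, 10, 10, 10],
    "vote":    [+1, +1, +1, -1, -1, -1],
    "in_ad":   [False, True, False, False, True, False],
})

# Exclude segments where ads were playing, then average votes per condition:
# a value near +1 means mostly thumbs up, near -1 mostly thumbs down.
content_only = events[~events["in_ad"]]
sentiment = content_only.groupby("ad_load")["vote"].mean()

# In this toy data, sentiment.loc[6] is 1.0 and sentiment.loc[10] is -1/3,
# i.e. content-only sentiment is lower under the heavier ad load.
print(sentiment)
```

Comparing these per-condition averages (rather than raw averages that include ad segments) is what lets the metric isolate whether the content itself feels worse under heavier ad loads.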
Post-exposure survey
We also asked viewers to answer a series of questions about their experience, including their overall enjoyment, their perception of the ad load, how their ad experience compared to what they expected, and how that affected their stated enjoyment of the viewing experience overall.
We also asked questions to cross-reference against their behavior — e.g., asking how much attention they paid while watching, then comparing that to their observed behavior in the dial trace.

Study Design
8 titles tested across 8 genres and 3 content formats. These titles were chosen to be representative of high-priority catalog areas (well-represented in Freevee's library, and also popular among Freevee viewers) while also covering a broad spectrum of pacing and narrative styles.
4 ad load options per title — roughly n=100 per title/ad-load combination, for a total of n=3,103 respondents

Demographics: US media consumers who watch at least 5 hours of TV on streaming platforms per week. Equal split between men/women and ages 18-34/35-54; ethnicity quotas set to match the US census
Findings & Recommendations

As expected, we found that content enjoyment suffered as ad load levels increased, but the slope of that decline changed depending on the genre and content type.
There was a clear fatigue threshold at roughly 10 minutes of ads per hour: above this, negative sentiment and tune-out intent increased sharply, and reported viewer satisfaction dropped dramatically. However, 10 minutes was notably above the current Freevee average of 6 minutes per hour.
Our initial recommendation, based solely on the data gathered, was to implement a tiered ad load strategy: Reality, Comedy, and Courtroom TV could increase ad loads up to ~10 minutes/hour, where viewers remained relatively tolerant and monetization upside was largest. Movies, especially Action movies, should maintain the current ~6 minutes/hour to avoid disproportionate drops in viewer experience.
Scaling the insights:
However, because we only tested a subset of content types, we needed a way to generalize our findings to make recommendations that could be applied to the broader Freevee catalog. For example: If comedy TV can tolerate 10 minutes of ads but movies generally top out at 6, what should we do with comedy movies?
To answer this, we:
Defined a content grid across the two dimensions of genre (e.g., Comedy, Reality, Action, Thriller) and format/length (e.g., 30-min episode, 60-min episode, 90-min movie).
Mapped tested titles (e.g., 30-min comedy TV, 90-min action movie) to this grid and quantified, for each cell, the trade-off between viewer experience KPIs (enjoyment, tune-out intent, etc.) and monetization potential (additional ad minutes).
Partnered with the client’s data team to extrapolate from tested genre/format combinations to untested but similar titles in the broader catalog, using Freevee’s internal similarity scores and regression modeling to estimate ad load ranges where we didn't have direct data.
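The core logic of the decision grid can be sketched in a few lines. The numbers here echo the directional findings above (~10 min/hr for tolerant TV genres, ~6 min/hr for movies), and the fill rule for untested cells — take the stricter of the genre cap and the format cap — is a deliberately simplified stand-in for the similarity-score regression Freevee's data team used; the genre/format names and caps are illustrative:

```python
# Illustrative sketch of the genre x format decision grid.
# Caps (minutes of ads per hour) are directional values from the study,
# not the actual production configuration.
genre_cap = {"Comedy": 10, "Reality": 10, "Action": 6}
format_cap = {"30-min episode": 10, "60-min episode": 10, "90-min movie": 6}

def default_ad_load(genre: str, fmt: str) -> int:
    """Default ad minutes/hour for a grid cell: when a cell wasn't tested
    directly, fall back to the stricter (lower) of its two dimension caps."""
    return min(genre_cap[genre], format_cap[fmt])

# A comedy movie inherits the conservative movie cap, not the comedy TV cap:
print(default_ad_load("Comedy", "90-min movie"))   # 6
print(default_ad_load("Reality", "30-min episode"))  # 10
```

The "stricter dimension wins" rule is what produces the comedy-movie recommendation described next: sensitivity along one dimension (movies) dominates tolerance along the other (comedy).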
Based on these findings, we recommended, for example, that comedy movies follow movie-based ad loads rather than comedy TV ad loads: viewers’ heightened sensitivity to ads in movies outweighed their extra tolerance for ads in comedy.
The final result was a practical decision grid that Freevee could apply across its catalog to set default ad loads by genre and format, with room to refine further as new viewing data came in.
Impact
Our findings directly led to Freevee shifting from its standardized ad load model to a dynamic ad serving model based on genre and format, with the ultimate goal of balancing viewer experience and monetization. In practice, this meant increasing ad loads (and therefore revenue) in tolerant categories such as Reality and Comedy TV, while maintaining the viewing experience in categories like Action movies.
From our side as a vendor, this project was explicitly cited by Freevee stakeholders as the primary factor leading to the renewal of our six-figure contract with them. (Even after Amazon folded the Freevee platform and team into Prime Video about a year later, our partnership has persisted!)
Reflections
As a researcher, this study honed my product research skills at a time when most of my work was around content research. It pushed my ability to navigate user experience vs. revenue trade-offs, build scalable recommendation frameworks from a finite study, and clearly communicate those trade-offs to stakeholders with different priorities.
It also helped develop my ability to work across a wide range of partners: our internal research team, Freevee’s insights team, their data science group, and multiple sample vendors. At each stage, I was responsible for keeping everyone aligned.
In hindsight, this study attempted to tackle a lot of dimensions at once (genre, format, length). If I were to run it again, I would push to separate it into multiple experiments—e.g., testing format first and then genre, or vice versa. That said, given that we were operating as a vendor, we were constrained by the client's budget, timeline, and desired scope. I still think we worked together on a great set of representative titles that allowed us to deliver strong recommendations, especially when we worked with the client's data team to scale our findings.
Another limitation of working as a vendor is that my visibility into long-term impact was limited. I do feel very proud that our recommendations were implemented, but the question remains: Well? Did it work? Where were our assumptions proven correct, and where was there some fine-tuning that needed to happen? How much did we move the needle on their key metrics? I suppose I can extrapolate based on the data we have: they've continued to renew their contract with us. But hey, a girl can still wish she had the data, not just the extrapolation!